I'm currently learning Spark and developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() and what are cases where I clearly have to use one instead of the other?
 
    
4 Answers
mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).
In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see the note on partitioning below):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), map would "forget" that partitioner (the result reverts to the default partitioning) since the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
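For intuition, the equivalence can be sketched with plain Scala collections standing in for RDDs, so no cluster is needed (note this toy version cannot show the partitioner-preservation difference, which only exists on real RDDs):

```scala
// Toy stand-in for RDD.mapValues on a plain List of pairs (no Spark required).
// On a real PairRDD you would call rdd.mapValues(f) directly.
def mapValues[K, V, W](pairs: List[(K, V)])(f: V => W): List[(K, W)] =
  pairs.map { case (k, v) => (k, f(v)) }

val pairs = List(("a", 1), ("b", 2), ("a", 3))
val f: Int => Int = _ * 10

// Both routes produce the same (key, transformed-value) pairs:
val viaMap       = pairs.map { case (k, v) => (k, f(v)) }  // like rdd.map
val viaMapValues = mapValues(pairs)(f)                     // like rdd.mapValues
// both: List(("a", 10), ("b", 20), ("a", 30))
```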
 
    
- I wonder if they play a role in performance, since I am trying to optimize [this](http://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans), but from what you said, I guess it won't make any difference... – gsamaras Sep 23 '16 at 04:33
- @gsamaras it can have an impact on performance, as losing the partitioning information will force a shuffle down the road if you need to repartition again with the same key. – Madhava Carrillo Oct 04 '17 at 14:41
When we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in the value, not the key. In those cases, we can use mapValues() instead of map().
Example of mapValues
val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
val mapped = inputrdd.mapValues(mark => (mark, 1))
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced.collect
Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))
val average = reduced.map { x =>
                           val temp = x._2
                           val total = temp._1
                           val count = temp._2
                           (x._1, total / count)
                           }
average.collect()
res1: Array[(String, Int)] = Array((english,65), (maths,55))
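Note that the final averaging step itself only transforms the value and keeps the key, so it could equally be written with mapValues (on Spark: `reduced.mapValues { case (total, count) => total / count }`). A plain-Scala sketch of that step, using a Map as a stand-in for the reduced RDD:

```scala
// Plain-Scala version of the averaging step above (no Spark needed).
// A Scala Map plays the role of the reduced pair RDD here.
val reduced = Map("english" -> (65, 1), "maths" -> (110, 2))

// Keys pass through untouched; only the (total, count) value is transformed.
val average = reduced.map { case (k, (total, count)) => (k, total / count) }
// average: Map("english" -> 65, "maths" -> 55)
```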
 
    
map takes a function that transforms each element of a collection:
map(f: T => U): RDD[T] => RDD[U]
When T is a tuple, we may want to act only on the values, not the keys.
mapValues takes a function that maps the values in the input to the values in the output:
mapValues(f: V => W): RDD[(K, V)] => RDD[(K, W)]
Tip: when your data is partitioned by key, use mapValues so you can avoid a reshuffle.
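A plain-Scala illustration of those signatures, with Lists standing in for RDDs (no Spark needed):

```scala
// map:       f: T => U  turns a collection of T into a collection of U.
// mapValues: f: V => W  turns pairs (K, V) into pairs (K, W), keys untouched.
val pairs: List[(String, Int)] = List(("a", 1), ("b", 2))

// map may change the whole element type, here (String, Int) => String:
val asStrings: List[String] = pairs.map { case (k, v) => s"$k=$v" }

// mapValues-style, V => W (Int => Double), keys kept
// (Spark equivalent: pairsRdd.mapValues(_.toDouble)):
val asDoubles: List[(String, Double)] = pairs.map { case (k, v) => (k, v.toDouble) }
```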
 
    
val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))
(1) map():
val mapResult = inputrdd.map(b => (b, 1))
mapResult.collect
Result: Array(((india,250),1), ((england,260),1), ((england,180),1))
(2) mapValues():
val mapValuesResult = inputrdd.mapValues(b => (b, 1))
mapValuesResult.collect
Result: Array((india,(250,1)), (england,(260,1)), (england,(180,1)))
 
    
- Unless you are absolutely sure that you can fit all the data into memory, don't ever try to use `.collect`. – jtitusj Jun 05 '18 at 10:30