What is the best way to return the row with the maximum value for each unique key in a Spark RDD?
I'm using Python and I've tried the built-in max, mapping, reducing by key, and aggregates. Is there an efficient way to do this? Possibly a UDF?
My data is an RDD of (key, value) pairs:
[(v, 3),
 (v, 1),
 (v, 1),
 (w, 7),
 (w, 1),
 (x, 3),
 (y, 1),
 (y, 1),
 (y, 2),
 (y, 3)]
And I need to return:
[(v, 3),
 (w, 7),
 (x, 3),
 (y, 3)]
For ties, returning either the first value or an arbitrary one is fine.
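For reference, here is a minimal sketch of the kind of result I'm after, written with reduceByKey and Python's built-in max. It assumes the keys are strings ("v", "w", ...) and the values are plain integers. The helper names max_per_key and run_example are hypothetical, made up for this illustration.

```python
def max_per_key(pairs_rdd):
    """Keep the largest value for each key in an RDD of (key, value) pairs."""
    # reduceByKey merges values per key; max is associative and commutative,
    # so partial results can be combined within each partition before the shuffle.
    return pairs_rdd.reduceByKey(max)


def run_example():
    # Imported here so max_per_key itself has no import-time dependency on Spark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # Sample data from the question, assuming string keys.
    rdd = sc.parallelize([
        ("v", 3), ("v", 1), ("v", 1),
        ("w", 7), ("w", 1),
        ("x", 3),
        ("y", 1), ("y", 1), ("y", 2), ("y", 3),
    ])
    # collect() gives no ordering guarantee, so sort for a stable result.
    return sorted(max_per_key(rdd).collect())
```

Calling run_example() on a working Spark installation should return [('v', 3), ('w', 7), ('x', 3), ('y', 3)]. Since only the value is compared, ties yield the same number either way, which fits the tie rule above.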