I'm trying to take the top 25 items of a JavaPairRDD like this:
JavaPairRDD<String, Long> rdd = ...;
List<Tuple2<String, Long>> top25 = rdd.top(25, (t1, t2) -> {
    // Compare by value first (negated), then fall back to the key.
    if (!t1._2.equals(t2._2)) {
        return -1 * Long.compare(t1._2, t2._2);
    } else {
        return t1._1.compareTo(t2._1);
    }
});
This sorts first by value and, when the values are equal, by key. When I run it, I get the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
I think the problem is that the inline lambda function playing the role of Comparator is not serializable.
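To check that assumption outside Spark, here is a minimal sketch that tries to serialize the same comparison logic with plain Java serialization. The class name LambdaSerializationCheck and its helper methods are hypothetical names made up for this check, and Map.Entry<String, Long> stands in for Tuple2<String, Long> so that it runs without Spark on the classpath. It only shows how Java treats the two lambdas; it does not exercise rdd.top itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;
import java.util.Map;

class LambdaSerializationCheck {

    // Same ordering as in the question, written against Map.Entry
    // (an assumed stand-in for Tuple2, so no Spark dependency is needed).
    static int compareEntries(Map.Entry<String, Long> e1, Map.Entry<String, Long> e2) {
        if (!e1.getValue().equals(e2.getValue())) {
            return -1 * Long.compare(e1.getValue(), e2.getValue());
        }
        return e1.getKey().compareTo(e2.getKey());
    }

    // A plain lambda: its target type is Comparator, which does not extend Serializable.
    static Comparator<Map.Entry<String, Long>> plainComparator() {
        return (e1, e2) -> compareEntries(e1, e2);
    }

    // The same lambda with an intersection cast, so its target type also includes Serializable.
    static Comparator<Map.Entry<String, Long>> serializableComparator() {
        return (Comparator<Map.Entry<String, Long>> & Serializable)
                (e1, e2) -> compareEntries(e1, e2);
    }

    // Returns true if Java serialization accepts the object, false if it throws.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("plain lambda serializable: " + isSerializable(plainComparator()));
        System.out.println("cast lambda serializable:  " + isSerializable(serializableComparator()));
    }
}
```

I would expect the first check to fail and the second to succeed, which would be consistent with my assumption. Whether passing the cast version to rdd.top is enough to avoid the exception is something I have not verified.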
I've got two questions. First, if my assumption is correct, why does the Comparator need to be serializable? And second, how can I solve this problem?