Suppose I have the following class in Spark Scala:
import org.apache.spark.rdd.RDD

class SparkComputation(i: Int, j: Int) {
  def something(x: Int, y: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val something = this.something _
    data.map(something(_, j))
  }
}
I get a Task not serializable exception when I run the following code:
val s = new SparkComputation(2, 5)
val data = sc.parallelize(0 to 100)
val res = s.processRDD(data).collect
I'm assuming the exception occurs because Spark is trying to serialize the whole SparkComputation instance. To prevent that, I have stored the class members I use in the RDD operation in local variables (j and something). However, Spark still tries to serialize the SparkComputation object, apparently because of the method value. Is there any way to pass the class method to map without forcing Spark to serialize the whole SparkComputation class? I know the following code works without any problem (a small standalone check of my assumption follows the snippet):
def processRDD(data: RDD[Int]) = {
  val j = this.j
  val i = this.i
  data.map(x => (x + j) * i)
}
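To sanity-check that assumption outside Spark, here is a rough standalone sketch (CaptureCheck and Holder are made-up names, and plain Java serialization stands in for what Spark does to the closure): I expect the eta-expanded method value to fail to serialize because it still holds a reference to the instance, while a function that only captures a local Int should go through.

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CaptureCheck {
  // true if f survives plain Java serialization, which is roughly what Spark asks of a closure
  def serializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // stand-in for SparkComputation; like it, this class does not extend Serializable
  class Holder(i: Int) {
    def something(x: Int, y: Int) = (x + y) * i
  }

  def main(args: Array[String]): Unit = {
    val h = new Holder(2)
    val methodValue = h.something _                         // eta expansion: holds a reference to h
    val localI = 2
    val plainLambda = (x: Int, y: Int) => (x + y) * localI  // captures only an Int
    println(serializable(methodValue))   // I expect false: serializing it drags h along
    println(serializable(plainLambda))   // I expect true
  }
}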
So, the class members that store values are not causing the problem. The problem is with the function. I have also tried the following approach with no luck:
class SparkComputation(i: Int, j: Int) {
  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val i = this.i
    def something(x: Int, y: Int) = (x + y) * i
    data.map(something(_, j))
  }
}
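If I read the compiled output right, the nested def doesn't help because scalac lifts local methods onto the enclosing class, so the closure passed to map still needs a reference to this just to call it. Roughly (names invented, a sketch of my understanding, not real compiler output):

import org.apache.spark.rdd.RDD

// Rough picture of what I believe the previous snippet compiles to:
// the local def becomes a private method of the class, taking its captured
// locals as extra parameters, so the lambda handed to map keeps a reference
// to the SparkComputation instance in order to call it.
class SparkComputationLifted(i: Int, j: Int) {
  private def somethingLifted(x: Int, y: Int, i: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val i = this.i
    data.map(x => somethingLifted(x, j, i)) // still a call on this
  }
}

If that is what's happening, it would explain why this version fails the same way, but I'd still like to know whether there is a way to hand the method itself to map without pulling in the whole object.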