I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1 then making my decision based on that value given my probability parameter.
My problem is that Spark is generating the same random numbers within each iteration of my for loop mapping function. I am using the DataFrame API. My code follows this format:
val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)
for (m <- 1 to M) {
val newDF = sqlContext.createDataFrame(myDF
.map{row => RowFactory
.create(row.getString(0),
myClass.myMethod(row.getString(2), rand.nextDouble())
}, myDF.schema)
}
Here is the class:
class myClass extends Serializable {
val q = qProb
def myMethod(s: String, rand: Double) = {
if (rand <= q) // do something
else // do something else
}
}
I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random v10 does not extend Serializable) like below, but I'm still getting the same numbers within each for loop
val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
I've done some research, and it seems this has do to with Sparks deterministic nature.