Assuming the existence of an RDD of tuples similar to the following:
(key1, 1)
(key3, 9)
(key2, 3)
(key1, 4)
(key1, 5)
(key3, 2)
(key2, 7)
...
What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance, in particular.) As I understand it, my options amount to:
- Use the
colStatsfunction in MLLib: This approach has the advantage of easily-adaptable to use othermllib.statfunctions later, if other statistical computations are deemed necessary. However, it operates on an RDD ofVectorcontaining the data for each column, so as I understand it, this approach would require that the full set of values for each key be collected on a single node, which would seem non-ideal for large data sets. Does a SparkVectoralways imply that the data in theVectorbe resident locally, on a single node? - Perform a
groupByKey, thenstats: Likely shuffle-heavy, as a result of thegroupByKeyoperation. - Perform
aggregateByKey, initializing a newStatCounter, and usingStatCounter::mergeas the sequence and combiner functions: This is the approach recommended by this StackOverflow answer, and avoids thegroupByKeyfrom option 2. However, I haven't been able to find good documentation forStatCounterin PySpark.
I like Option 1 because it makes the code more extensible, in that it could easily accommodate more complicated calculations using other MLLib functions with similar contracts, but if the Vector inputs inherently require that the data sets be collected locally, then it limits the data sizes on which the code can effectively operate. Between the other two, Option 3 looks more efficient because it avoids the groupByKey, but I was hoping to confirm that that is the case.
Are there any other options I haven't considered? (I am currently using Python + PySpark, but I'm open to solutions in Java/Scala as well, if there is a language difference.)