I'm trying to do some data analysis that involves aggregations using the PySpark DataFrame API. My understanding is that the DataFrame groupBy() operation is equivalent to Spark's RDD groupByKey(). Is there a DataFrame API command that is equivalent to the RDD reduceByKey()? My concern is that groupBy() seems to collect all values for a key into memory rather than combining them incrementally, which is not great in terms of performance.
Thanks.
