I have a Spark DataFrame with multiple columns. I would like to group the rows by one column and then find the mode of a second column within each group. With a pandas DataFrame, I would do something like this:
import numpy as np
import pandas as pd
import scipy.stats

max_value, num_values = 5, 10  # example parameters, chosen arbitrarily
rand_values = np.random.randint(max_value,
                                size=num_values).reshape((num_values // 2, 2))
rand_values = pd.DataFrame(rand_values, columns=['x', 'y'])
rand_values['x'] = rand_values['x'] > max_value / 2
rand_values['x'] = rand_values['x'].astype('int32')
print(rand_values)
##    x  y
## 0  0  0
## 1  0  4
## 2  0  1
## 3  1  1
## 4  1  2
def mode(series):
    # 'series' receives the DataFrame of rows for one group;
    # return the most common value in its 'y' column.
    return scipy.stats.mode(series['y'])[0][0]
rand_values.groupby('x').apply(mode)
## x
## 0    4
## 1    1
## dtype: int64
Within PySpark, I am able to find the mode of a single column by doing:
from pyspark.sql import functions as F

df = sql_context.createDataFrame(rand_values)  # sql_context is an existing SQLContext
def mode_spark(df, column):
    # Group by the given column and count the number
    # of occurrences of each value
    counts = df.groupBy(column).count()
    # - Find the maximum value in the 'counts' column
    # - Join with the counts dataframe to select the row
    #   with the maximum count
    # - Select the first element of this dataframe and
    #   take the value in column
    mode = counts.join(
        counts.agg(F.max('count').alias('count')),
        on='count'
    ).limit(1).select(column)
    return mode.first()[column]
mode_spark(df, 'x')
## 1
mode_spark(df, 'y')
## 1
I'm at a loss as to how to apply that function to grouped data. The only workaround I can think of is looping over the distinct group keys in the driver, as sketched below, which seems very inefficient. If it's not possible to apply this logic directly to grouped data, is it possible to achieve the same effect by some other means?
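For reference, here is a minimal sketch of that workaround (the column names 'x' and 'y' are hard-coded to match the example above, and mode_spark is the function defined earlier). It launches one Spark job per group, so it clearly won't scale:
# Naive workaround: collect the distinct group keys to the driver
# and compute the mode of 'y' within each filtered subset.
group_modes = {}
for row in df.select('x').distinct().collect():
    key = row['x']
    subset = df.filter(F.col('x') == key)
    group_modes[key] = mode_spark(subset, 'y')
print(group_modes)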
Thank you in advance!