PySpark get_dummies equivalent

Question

I have a pyspark dataframe with the following schema:

Key1	Key2	Key3	Value
a	a	a	"value1"
a	a	a	"value2"
a	a	b	"value1"
b	b	a	"value2"

(In real life this dataframe is extremely large, not reasonable to convert to pandas DF)

My goal is to transform the dataframe to look like so:

Key1	Key2	Key3	value1	value2
a	a	a	1	1
a	a	b	1	0
b	b	a	0	1

I know this is possible in pandas using the get_dummies function and I have also seen that there is some sort of pyspark & pandas hybrid function that I am not sure I can use.

It is worth mentioning that column Value can receive (in this example) only the values "value1" and "value2" I have encountered this question that possibly solves my problem but I do not entirely understand it and was wondering if there was a simpler way to solve the problem.
Any help is greatly appreciated!

SMALL EDIT

After implementing the accepted solution, to turn this into a one-hot encoding and not just a sum of appearances, I converted each column to boolean type and then back to integer.

score 2 · Accepted Answer · answered Jul 27 '22 at 10:01

You can group by on the key columns and pivot the value column while counting all records.

data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val'). \
    agg(func.count('*')). \
    fillna(0). \
    show()

# +----+----+----+------+------+
# |key1|key2|key3|value1|value2|
# +----+----+----+------+------+
# |   b|   b|   a|     0|     1|
# |   a|   a|   a|     1|     1|
# |   a|   a|   b|     1|     0|
# +----+----+----+------+------+

score 1 · Answer 2 · answered Jul 27 '22 at 09:21

1

This can be achieved by group by twice.

df = df.groupby(*df.columns).agg(F.count('*').alias('cnt')) \
    .groupby('Key1', 'Key2', 'Key3').pivot('Value').agg(F.sum('cnt')).fillna(0)
df.show(truncate=False)

# +----+----+----+------+------+
# |Key1|Key2|Key3|value1|value2|
# +----+----+----+------+------+
# |a   |a   |b   |1     |0     |
# |b   |b   |a   |0     |1     |
# |a   |a   |a   |1     |1     |
# +----+----+----+------+------+

answered Jul 27 '22 at 09:21

过过招

3,722
2
4
11

you don't need to use group by twice. a single `groupBy().pivot().agg(func.count('*')).fillna(0)` should suffice. – samkart Jul 27 '22 at 09:55

PySpark get_dummies equivalent

SMALL EDIT

2 Answers2