I have the following DataFrame:
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local')
df_pd = pd.DataFrame([[11, 'abc', 1, 114],
                      [11, 'abc', 2, 104],
                      [11, 'def', 9, 113],
                      [12, 'abc', 1,  14],
                      [12, 'def', 3, 110],
                      [14, 'abc', 1, 194],
                      [14, 'abc', 2, 164],
                      [14, 'abc', 3, 104],],
                      columns=['id', 'str', 'num', 'val'])
sql_sc = SQLContext(sc)
df_spark = sql_sc.createDataFrame(df_pd)
df_spark.show()
Which prints:
+---+---+---+---+
| id|str|num|val|
+---+---+---+---+
| 11|abc|  1|114|
| 11|abc|  2|104|
| 11|def|  9|113|
| 12|abc|  1| 14|
| 12|def|  3|110|
| 14|abc|  1|194|
| 14|abc|  2|164|
| 14|abc|  3|104|
+---+---+---+---+
My goal is to transform it to this:
+---+-----+-----+-----+-----+-----+
| id|abc_1|abc_2|abc_3|def_3|def_9|
+---+-----+-----+-----+-----+-----+
| 11|  114|  104|  NaN|  NaN|  113|
| 12|   14|  NaN|  NaN|  110|  NaN|
| 14|  194|  164|  104|  NaN|  NaN|
+---+-----+-----+-----+-----+-----+
(One row per id; the column names are str+'_'+str(num); the resulting table is filled with the respective vals; all other entries are NaN.)
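For comparison, this is exactly what a plain-pandas pivot on a combined key would produce ('key' is just a helper column I add here, not part of my data):

# plain-pandas equivalent of the target table (NaN where no (id, key) pair exists)
tmp = df_pd.assign(key=df_pd['str'] + '_' + df_pd['num'].astype(str))
print(tmp.pivot(index='id', columns='key', values='val'))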
How would I achieve this? I started with
from pyspark.sql.functions import concat, col, lit
column = df_spark.select(concat(col("str"), lit("_"), col("num")))
which gives me the combined keys for the new column names.
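A sketch of how I would collect those into an actual Python list of names (the alias 'key' is just a name I picked):

# distinct combined keys, collected to the driver as a list of strings
names = [row['key'] for row in
         df_spark.select(concat(col('str'), lit('_'), col('num')).alias('key'))
                 .distinct()
                 .collect()]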
df_spark.select('id').distinct()
gives me the distinct ids.
But I don't know how to build the new DataFrame or fill it.
Edit: The difference from the possible duplicate is that I didn't know about the pivot functionality, whereas the other question asked where to find the function "pivot" in pyspark. I don't know whether that makes this a duplicate, but I didn't find the other question because I didn't know what to look for.
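For reference, a minimal sketch of the pivot-based approach I was missing, assuming Spark 1.6+ (where DataFrame.pivot is available); note Spark fills the empty cells with null rather than NaN:

from pyspark.sql.functions import concat, col, lit, first

result = (df_spark
          .withColumn('key', concat(col('str'), lit('_'), col('num')))  # 'abc_1', 'abc_2', ...
          .groupBy('id')
          .pivot('key')            # one output column per distinct key
          .agg(first('val'))       # each (id, key) pair is unique, so first() is safe
          .orderBy('id'))
result.show()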