Let's say I have a list L = [[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]].
Using PySpark, I want to remove the third element for each key (that is, keep only the first two values per key), so that the result looks like this:
[a,2]
[a,3]
[b,4]
[b,8]
I am new to PySpark and not sure how to approach this.
 
    
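You can try something like this.
First, group by the key column and aggregate the values into a list with collect_list. Then use a UDF to keep the first two values of each list, and explode that column back into one row per value. Note that collect_list does not guarantee the order of the collected values, so which two values count as the "first two" may vary unless the data is ordered explicitly.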
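from pyspark.sql.functions import collect_list, explode, udf
from pyspark.sql.types import ArrayType, IntegerType

# sc is the SparkContext, e.g. the one provided by the pyspark shell
df = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                     ('b', 4), ('b', 8), ('b', 9)]).toDF(['key', 'value'])

# UDF that keeps only the first two elements of an array column
first_two = udf(lambda x: x[0:2], ArrayType(IntegerType()))

df_list = (df.groupby('key')
             .agg(collect_list('value').alias('values'))  # one list of values per key
             .withColumn('values', first_two('values'))   # truncate each list to two items
             .withColumn('value', explode('values'))      # back to one row per value
             .drop('values'))
df_list.show()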
Result:
+---+-----+
|key|value|
+---+-----+
|  b|    4|
|  b|    8|
|  a|    2|
|  a|    3|
+---+-----+
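
To make the logic of the three steps easier to follow, here is a minimal plain-Python sketch of the same idea, without Spark. The function name first_n_per_key and the parameter n are made up for illustration and are not part of any PySpark API. Unlike collect_list, this version keeps the input order, so the output is deterministic.

```python
from collections import defaultdict

def first_n_per_key(rows, n=2):
    """Keep the first n values seen for each key, preserving input order."""
    grouped = defaultdict(list)
    # Step 1: gather values per key (the role of groupby + collect_list).
    for key, value in rows:
        grouped[key].append(value)
    # Step 2: slice each list to n items (the role of the UDF).
    # Step 3: flatten back to (key, value) rows (the role of explode).
    return [(key, value)
            for key, values in grouped.items()
            for value in values[:n]]

rows = [('a', 2), ('a', 3), ('a', 4), ('b', 4), ('b', 8), ('b', 9)]
result = first_n_per_key(rows)
print(result)  # [('a', 2), ('a', 3), ('b', 4), ('b', 8)]
```

If the "first two" should mean the two smallest values rather than the first two encountered, sorting each list before slicing would be one way to make that explicit.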
