I have an RDD in the form (Group, [word1, word2, ..., wordn]): each record contains a group and the words that belong to that group. Given the input below:
rdd = (g1, [w1, w2, w4]), (g2, [w3, w2]), (g3, [w4, w1]), (g3, [w1, w2, w3]), (g2, [w2])
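
For reference, a minimal sketch of how this sample input could be built as an RDD (assuming a local SparkSession; the names are illustrative):

    from pyspark.sql import SparkSession

    # Minimal sketch: build the sample input as an RDD of (group, [words]) pairs
    spark = SparkSession.builder.master("local[*]").appName("word-group-counts").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([
        ("g1", ["w1", "w2", "w4"]),
        ("g2", ["w3", "w2"]),
        ("g3", ["w4", "w1"]),
        ("g3", ["w1", "w2", "w3"]),
        ("g2", ["w2"]),
    ])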
I want to collect an output saying how many times each word occurs in each group. The output format would be:
Word  g1  g2  g3
w1     1   0   2
w2     1   2   1
w3     0   1   1
w4     1   0   1
Which PySpark functions can I use to achieve this output in the most efficient way?
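
As a starting point, here is a sketch of one possible approach (flatMap the RDD into (word, group) pairs, convert to a DataFrame, then groupBy/pivot); I am not sure whether this is the most efficient way, and it assumes the rdd built above:

    # Sketch: explode each (group, [words]) record into (word, group) pairs,
    # then pivot on the group column and count occurrences per word.
    pairs = rdd.flatMap(lambda gw: [(word, gw[0]) for word in gw[1]])

    df = pairs.toDF(["word", "group"])
    result = (df.groupBy("word")
                .pivot("group")
                .count()
                .na.fill(0)      # groups where a word never appears become 0
                .orderBy("word"))
    result.show()

Is this a reasonable direction, or is there a more efficient combination of PySpark functions for this?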
 
    