I have two RDDs in PySpark:
RDD1:
[(u'2013-01-31 00:00:00', u'a', u'Pab', u'abc', u'd'),(u'2013-01-31 00:00:00', u'a', u'ab', u'abc', u'g'),.....]
RDD2:
[(u'41',u'42.0'),(u'24',u'98.0'),....]
Both RDDs have the same number of rows. What I want to do is take all the columns from each row of RDD1 (converted from unicode to plain strings) and the 2nd column from each row of RDD2 (converted from a unicode string to a float), and form a new RDD out of them. So the new RDD will look like this:
RDD3:
[('2013-01-31 00:00:00', 'a', 'Pab', 'abc', 'd', 42.0), ('2013-01-31 00:00:00', 'a', 'ab', 'abc', 'g', 98.0), .....]
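
For the first step, I was imagining something like the following (just a sketch; I'm assuming zip() pairs the rows up by position, and I know it needs both RDDs to have the same number of partitions and the same number of elements per partition, which I'm not sure holds in general just because the row counts match):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd1 = sc.parallelize([
        (u'2013-01-31 00:00:00', u'a', u'Pab', u'abc', u'd'),
        (u'2013-01-31 00:00:00', u'a', u'ab', u'abc', u'g'),
    ])
    rdd2 = sc.parallelize([(u'41', u'42.0'), (u'24', u'98.0')])

    # zip pairs rows positionally; each pair is
    # ((date, col2, col3, col4, col5), (id, value))
    rdd3 = rdd1.zip(rdd2).map(
        lambda pair: tuple(str(c) for c in pair[0]) + (float(pair[1][1]),)
    )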
Once that is done, I want to aggregate the last value in each row of RDD3 (the float) by the date in the 1st column. That means for all rows where the date is 2013-01-31 00:00:00, the last numeric values should be added together.
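
For the aggregation step I was thinking reduceByKey might work, keying each row on the date column (again just a sketch, continuing from the rdd3 above):

    # key by the date (1st column), keep only the float (last column),
    # then sum the floats per date
    totals = rdd3.map(lambda row: (row[0], row[-1])) \
                 .reduceByKey(lambda a, b: a + b)

    # with the sample data this should give:
    # [('2013-01-31 00:00:00', 140.0)]
    print(totals.collect())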
Is this the right way to do this in PySpark, or is there a better approach?