My 100m in size, quantized data:
(1424411938', [3885, 7898])
(3333333333', [3885, 7898])
Desired result:
(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])
So what I want, is to transform the data so that I group 3885 (for example) with all the data[0] that have it). Here is what I did in python:
def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if(found == 0):
                result.append((c, []))
            for res in result:
                if c == res[0]:
                    res[1].append(point_id)
    return result
but when I mapPartitions()'ed data RDD with prepare(), it seem to do what I want only in the current partition, thus return a bigger result than the desired.
For example, if the 1st record in the start was in the 1st partition and the 2nd in the 2nd, then I would get as a result:
(3885, [3333333333])
(7898, [3333333333])
(3885, [1424411938])
(7898, [1424411938])
How to modify my prepare() to get the desired effect? Alternatively, how to process the result that prepare() produces, so that I can get the desired result?
As you may already have noticed from the code, I do not care about speed at all.
Here is a way to create the data:
data = []
from random import randint
for i in xrange(0, 10):
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)
 
     
    