I'm new to Spark and trying to migrate an existing Python application to PySpark.
One of the first functions (f below) has to run for every element in the dataset, but it also needs to read other elements of the same dataset.
The best simplification I could reduce it to is the following pseudo-code:
def idx_gen_a(x):
    return x - 5

def idx_gen_b(x):
    return x * 3

def f(i, x, dataset):
    # look up other elements of the dataset at derived indices
    elem1 = dataset.get(idx_gen_a(i))
    elem2 = dataset.get(idx_gen_b(i))
    ...
    return some_calculation(x, elem1, elem2, ...)
def main(dataset):
    result = []
    for i, x in enumerate(dataset):
        result.append(f(i, x, dataset))
    return result
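The closest I've come in PySpark so far is broadcasting the whole dataset and mapping over an indexed RDD. This is only a rough sketch of that attempt (it reuses idx_gen_a/idx_gen_b from above, the data here is a stand-in list small enough for driver memory, and some_calculation is just a placeholder for my real logic):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def some_calculation(x, elem1, elem2):
    # placeholder for the real computation
    return (x, elem1, elem2)

data = list(range(100))          # stand-in for the real dataset
lookup = sc.broadcast(data)      # read-only copy shipped to every executor

def f(i, x):
    # random access into the dataset goes through the broadcast copy
    ds = lookup.value
    a, b = idx_gen_a(i), idx_gen_b(i)
    elem1 = ds[a] if 0 <= a < len(ds) else None
    elem2 = ds[b] if 0 <= b < len(ds) else None
    return some_calculation(x, elem1, elem2)

# zipWithIndex yields (element, index) pairs
result = (sc.parallelize(data)
            .zipWithIndex()
            .map(lambda p: f(p[1], p[0]))
            .collect())

Broadcasting the entire dataset feels like it defeats the purpose of distributing it, though.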
Is there a Spark-ish way of doing this? foreachPartition and aggregate didn't seem to quite fit...
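I also experimented with keeping the lookups distributed by keying every element with the index it needs and joining the RDD back onto itself, roughly like this (again reusing the names above; note the inner joins silently drop elements whose generated index falls outside the dataset):

rdd = sc.parallelize(data)
indexed = rdd.zipWithIndex().map(lambda p: (p[1], p[0]))    # (i, x)

# key each row by the index it wants to read, then join it against the dataset itself
need_a = indexed.map(lambda p: (idx_gen_a(p[0]), p))        # (a_idx, (i, x))
with_a = need_a.join(indexed).map(
    lambda p: (p[1][0][0], (p[1][0][1], p[1][1])))          # (i, (x, elem1))

need_b = with_a.map(lambda p: (idx_gen_b(p[0]), p))         # (b_idx, (i, (x, elem1)))
with_b = need_b.join(indexed).map(
    lambda p: (p[1][0][0], (*p[1][0][1], p[1][1])))         # (i, (x, elem1, elem2))

result = with_b.map(lambda p: some_calculation(*p[1])).collect()

That approach gets harder to follow with every extra lookup, so I suspect I'm missing the intended pattern.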