I need to process a large collection of items. Every item is processed in the same way and is independent of the other items (map operations on an RDD).
Depending on the path taken through the program, different types of information are generated for the items in these map operations. Subsequent operations can then take advantage of information that is already present to execute more efficiently.
Here I have to make a design choice about how to keep the generated information associated with the items.
My current approach is to return tuples that contain both the original information passed into the map and the newly generated information. I keep adding information like this, so that in the end all of it is available in a single RDD.
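A minimal sketch of what I mean, using toy data (strings as items, their length and a flag as the "generated information") and assuming a SparkContext `sc` is in scope:

    import org.apache.spark.rdd.RDD

    val items: RDD[String] = sc.parallelize(Seq("a", "bb", "ccc"))

    // Each map returns a tuple that carries the original item forward
    // together with the newly generated information.
    val withLength: RDD[(String, Int)] = items.map(s => (s, s.length))

    // Subsequent maps keep widening the tuple, so everything ends up
    // in one RDD.
    val withFlag: RDD[(String, Int, Boolean)] =
      withLength.map { case (s, n) => (s, n, n > 1) }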
This works, but I find it would be nicer to keep the information in separate RDDs. As far as I know, there is no way to keep the information generated in a map as a separate RDD that stays associated with the corresponding input items (without using ids), and thus no way of combining two RDDs, or performing operations on two RDDs, while respecting that association.
Is there a mechanism in Spark that lets you store the information generated from your distributed items in a separate RDD while preserving the association with those items?
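To make that concrete, this is roughly what I would like to be able to write; `associateWith` is a hypothetical method I made up to illustrate the idea, not a real Spark API:

    // The generated information lives in its own RDD...
    val lengths: RDD[Int] = items.map(_.length)

    // ...and later the two RDDs are recombined item-by-item, without
    // ever having attached explicit ids:
    // val recombined: RDD[(String, Int)] = items.associateWith(lengths)  // hypothetical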