I'm trying to use Apache Beam to parallelize N trials of a simulation. The simulation runs on a matrix V sourced from a MATLAB .mat file. My first instinct (forgive me, I'm new to Apache Beam/Dataflow) was to extend FileBasedSource, but further reading convinced me that this isn't the right approach. Most explicitly, the Apache Beam documentation says, "You should create a new source if you'd like to use the advanced features that the Source API provides," and I don't need any of them; I just want to read a variable (or a few)! I eventually stumbled upon https://stackoverflow.com/a/45946643, which is how I now intend to get V: pass the file-like object returned by FileSystems.open to scipy.io.loadmat.
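For concreteness, this is roughly what I have in mind for loading V (an untested sketch; the GCS path and the 'V' key are placeholders for my actual file and variable name):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from scipy.io import loadmat

def load_v(mat_path):
    # FileSystems.open returns a file-like object, which loadmat accepts directly.
    with FileSystems.open(mat_path) as f:
        return loadmat(f)['V']  # placeholder: my matrix's variable name in the .mat file

V = load_v('gs://my-bucket/simulation.mat')  # placeholder path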
The next question is how to create a PCollection of N permutations of V. The obvious solution is something like beam.Create([permutation(V) for _ in range(N)]). However, I was a little thrown off by this comment in the documentation: "The latter is primarily useful for testing and debugging purposes." A slight improvement might be to perform the permutation in a DoFn instead, as sketched below.
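Here is roughly what I mean by the DoFn variant (an untested sketch; permutation stands in for however I actually shuffle V, and p, N, and V are as above):

import apache_beam as beam
import numpy as np

def permutation(v):
    # Stand-in for the real permutation of V; here, a random row permutation.
    return np.random.permutation(v)

class PermuteV(beam.DoFn):
    def __init__(self, v):
        self._v = v  # V is loaded once in the driver and shipped with the DoFn

    def process(self, _):
        # Ignore the index element; emit one permuted copy of V per trial.
        yield permutation(self._v)

trials = (p
          | 'trial_count' >> beam.Create(range(N))
          | 'permute_v' >> beam.ParDo(PermuteV(V)))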
I have one last idea. It sounds a bit stupid (and it may well be stupid), but hear me out on this one (humor my sleep-deprived self). The documentation points to an implementation of CountingSource (and, along with it, ReadFromCountingSource). Is there a benefit to using ReadFromCountingSource(N) over beam.Create(range(N))? If so, is the following (start of a) Pipeline reasonable:
(p | 'trial_count' >> ReadFromCountingSource(N)
   | 'permute_v' >> beam.Map(lambda _, v: permutation(v), V)  # one permuted V per trial index
   | ...)
Beam/Dataflow experts, what would you recommend?
Thank you!!