As a follow-up to my previous question, how do I map over an RDD locally, i.e., collect the data into a local stream without actually using collect (because the data is far too large).
Specifically, I want to write something like
from subprocess import Popen, PIPE
with open('out','w') as out:
    with open('err','w') as err:
        myproc = Popen([.....],stdin=PIPE,stdout=out,stderr=err)
myrdd.iterate_locally(lambda x: myproc.stdin.write(x+'\n'))
How do I implement this iterate_locally?
does NOT work:
collectreturn value is far too large:myrdd.collect().foreach(lambda x: myproc.stdin.write(x+'\n'))does NOT work:
foreachexecutes its argument in a distributed mode, NOT locallymyrdd.foreach(lambda x: myproc.stdin.write(x+'\n'))
Related: