
I have a problem with a large object (~400 MB pickled) that I need to use in a UDF.

The pickled object is already on every worker, but I don't know how to load it on each worker outside the UDF, so it currently gets reloaded for every row.

Broadcasting hasn't really helped either: the overhead of loading the object for every task crashes everything in my dev environment.
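For context, the pattern I'm trying to get to is roughly this: a module-level lazy load, so the object is unpickled at most once per executor Python process instead of once per row (this is a minimal sketch; the path and the `predict()` call are placeholders):

```python
import pickle

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

MODEL_PATH = "/data/big_model.pkl"  # placeholder; the pickle is already on every worker

# Module-level cache: lives in the executor's Python process, so the
# object is unpickled at most once per process instead of once per row.
_model = None

def get_model():
    global _model
    if _model is None:
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
    return _model

@udf(StringType())
def score(value):
    # After the first call on an executor, get_model() is a cheap cache hit.
    return str(get_model().predict(value))  # predict() is a placeholder method
```

As far as I understand, this relies on Python worker reuse (`spark.python.worker.reuse`, on by default), so the cached object survives across tasks on the same executor.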

asked by mvryan
    [How to run a function on all Spark workers before processing data in PySpark?](https://stackoverflow.com/q/37343437/8371915) and [How can I load data that can't be pickled in each Spark executor?](https://stackoverflow.com/q/35500196/8371915) should be useful for you. – Alper t. Turker Jun 23 '18 at 16:00
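A related fallback along the lines of those links is loading once per partition with `mapPartitions`, which at least amortizes the cost over a whole partition instead of paying it per row (again a sketch; the columns and method are hypothetical):

```python
import pickle

MODEL_PATH = "/data/big_model.pkl"  # same placeholder path as above

def score_partition(rows):
    # One load per partition (i.e. per task), amortized over all its rows.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    for row in rows:
        yield (row.id, model.predict(row.value))  # hypothetical columns/method

scored = df.rdd.mapPartitions(score_partition)
```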

0 Answers