What is the proper way to include external packages (jars) in a pyspark shell?
I am using pyspark from a jupyter notebook.
I would like to read from kafka using spark, via the spark-sql-kafka library, as explained here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying.
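Concretely, the read I am aiming for is the Structured Streaming Kafka source from that guide, something like this (assuming `spark` is the SparkSession available in the notebook; the broker address and topic name are placeholders):

```python
# Kafka source from the Structured Streaming + Kafka integration guide.
# "host1:9092" and "my_topic" stand in for my real broker and topic.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "my_topic")
      .load())

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```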
I am trying to import the library via the --packages option, set in the environment variable PYSPARK_SUBMIT_ARGS.
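Roughly, this is what the first notebook cell looks like. The package coordinate is my guess (picking the right Scala build and version is part of what I'm unsure about), and with `--packages` placed before `pyspark-shell` as shown, the spark context fails to start for me:

```python
import os

# Must be set before anything from pyspark is imported.
# The coordinate is a guess: spark-sql-kafka-0-10, Scala 2.12 build,
# with the version chosen to match the cluster's Spark version.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-from-jupyter").getOrCreate()
```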
But:
- I am not sure about the exact version and name of the package to use.
- I don't know whether I also need to include spark-streaming, or whether I have to specify some repository with `--repositories`.
- I don't know whether it's better to download the jar and specify local paths (do they have to be on the machine where jupyter is running, or on the machine where yarn is running? I'm using `--master yarn` and `--deploy-mode client`) or to rely on `--packages`.
- I don't know whether options specified after `pyspark-shell` in `PYSPARK_SUBMIT_ARGS` are dropped or not (if I try to specify `--packages` before `pyspark-shell`, I can't instantiate the spark context at all).
- How can I check whether some package was correctly downloaded and is available to be used? (Some checks I have in mind are sketched after this list.)
- I don't know what route such downloaded jars (or jars in general) take. How many times are they replicated? Do they pass through the driver? Do these things change if I'm using a cluster manager like YARN? Do they change if I'm running everything from a pyspark shell in a jupyter notebook?
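For the "how to check" point above, this is roughly what I have in mind, but I don't know whether these checks are reliable (the Ivy path below is the default cache location, which I'm assuming applies here):

```python
import glob
import os

# 1. The coordinates passed via --packages should show up in the Spark config.
print(spark.sparkContext.getConf().get("spark.jars.packages", ""))

# 2. --packages resolves through Ivy on the driver machine; by default the
#    downloaded jars end up under ~/.ivy2/jars.
print(glob.glob(os.path.expanduser("~/.ivy2/jars/*spark-sql-kafka*.jar")))

# 3. If the package is actually on the classpath, this should not fail with
#    "Failed to find data source: kafka".
(spark.readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "host1:9092")
 .option("subscribe", "my_topic")
 .load())
```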
Resources I read so far include the official docs and guides, examples, related issues and questions, and the package repositories.