Setup
- Spark spark-2.4.5-bin-without-hadoop configured with hadoop-2.8.5.
- Spark is running in standalone mode.
- The PySpark application runs in a separate container and connects to the Spark master in another container.
What needs to happen?
- The PySpark application should be able to connect to an AWS S3 bucket using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider.
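In practice, this means that a read like the following should succeed once the session is configured correctly (the bucket and key below are placeholders, not the real paths):
from pyspark.sql import SparkSession
builder = SparkSession.builder
# ... s3a configuration goes here (see the attempted solutions below) ...
spark = builder.getOrCreate()
# Placeholder S3 path; the real bucket/prefix are not shown here.
df = spark.read.csv("s3a://example-bucket/some-prefix/data.csv")
df.show()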
Attempted solution #1
- Master, workers, and the PySpark application all have the following jars at /:
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
aws-java-sdk-1.10.6.jar
jackson-core-2.2.3.jar
aws-java-sdk-s3-1.10.6.jar
jackson-databind-2.2.3.jar
jackson-annotations-2.2.3.jar
joda-time-2.9.4.jar
- The PySpark application is configured as follows:
import os
from pyspark.sql import SparkSession
builder = SparkSession.builder
builder = builder.config("spark.executor.userClassPathFirst", "true")
builder = builder.config("spark.driver.userClassPathFirst", "true")
class_path = "/hadoop-aws-2.8.5.jar:/hadoop-common-2.8.5.jar:/aws-java-sdk-1.10.6.jar:/jackson-core-2.2.3.jar:/aws-java-sdk-s3-1.10.6.jar:/jackson-databind-2.2.3.jar:/jackson-annotations-2.2.3.jar:/joda-time-2.9.4.jar"
builder = builder.config("spark.driver.extraClassPath", class_path)
builder = builder.config("spark.executor.extraClassPath", class_path)
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
Output
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.util.KerberosUtil.hasKerberosTicket(Ljavax/security/auth/Subject;)Z
....
- This was happening because hadoop-auth was missing. I downloaded hadoop-auth and got another error, which led me to conclude that I would need to provide all of the dependent jars.
- Instead of going through Maven manually, I imported hadoop-aws into IntelliJ IDEA to see whether it would pull in the other dependencies, and it gave me this list (I don't know if there is an mvn command that would do the same thing):
accessors-smart-1.2.jar
activation-1.1.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
asm-3.1.jar
asm-5.0.4.jar
avro-1.7.4.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2.jar
commons-codec-1.4.jar
commons-collections-3.2.2.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-logging-1.1.3.jar
commons-math3-3.1.1.jar
commons-net-3.1.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
gson-2.2.4.jar
guava-11.0.2.jar
hadoop-annotations-2.8.5.jar
hadoop-auth-2.8.5.jar
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
htrace-core4-4.0.1-incubating.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
jackson-annotations-2.2.3.jar
jackson-core-2.2.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.2.3.jar
jackson-jaxrs-1.8.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-xc-1.8.3.jar
java-xmlbuilder-0.4.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jcip-annotations-1.0-1.jar
jersey-core-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jets3t-0.9.0.jar
jettison-1.1.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
joda-time-2.9.4.jar
jsch-0.1.54.jar
json-smart-2.3.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
log4j-1.2.17.jar
netty-3.7.0.Final.jar
nimbus-jose-jwt-4.41.1.jar
paranamer-2.3.jar
protobuf-java-2.5.0.jar
servlet-api-2.5.jar
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
snappy-java-1.0.4.1.jar
stax-api-1.0-2.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.6.jar
- I suspect these need to be included in the PySpark application as well. Downloading each .jar separately would take too long and is probably not the way to go.
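If I did end up collecting all of those jars on the containers, I would at least build the classpath string programmatically instead of hand-typing it. A rough sketch, assuming a hypothetical /opt/hadoop-jars directory holding hadoop-aws and its transitive jars:
import glob
import os
from pyspark.sql import SparkSession
# Hypothetical directory where all the jars listed above were downloaded.
jar_dir = "/opt/hadoop-jars"
class_path = ":".join(sorted(glob.glob(os.path.join(jar_dir, "*.jar"))))
builder = SparkSession.builder
builder = builder.config("spark.driver.extraClassPath", class_path)
builder = builder.config("spark.executor.extraClassPath", class_path)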
Attempted solution #2
import os
from pyspark.sql import SparkSession
builder = SparkSession.builder
builder = builder.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.5,org.apache.hadoop:hadoop-common:2.8.5,com.amazonaws:aws-java-sdk-s3:1.10.6")
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
- Instead of providing the jars myself, I use spark.jars.packages (the --packages mechanism), which pulls in the transitive dependencies as well. However, I end up with the following error when trying to read from S3:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
...
My research
- Spark uses s3a: java.lang.NoSuchMethodError
- NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities while reading s3 Data with spark
- https://hadoop.apache.org/docs/r3.1.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#java.lang.NoSuchMethodError_referencing_an_org.apache.hadoop_class
From that, I conclude that my problem has to do with the hadoop-aws dependencies. However, I'm having trouble verifying this: the Hadoop version running with Spark is the same as the version of the jars I provide when connecting, and I'm supplying all of the dependencies.
- Any tools or commands to run for debugging this issue would be helpful.
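For example, one thing I have considered is checking, from inside the running driver, which Hadoop build the JVM actually loaded and which jar a given class came from. A rough sketch (it relies on PySpark's internal _jvm py4j gateway, so it is a fragile, debug-only trick):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Internal py4j handle to the driver JVM; not a public API.
jvm = spark._jvm
# Hadoop version that the driver JVM actually loaded.
print(jvm.org.apache.hadoop.util.VersionInfo.getVersion())
# Which jar a specific class was loaded from (only works if the class is on the driver classpath).
cls = jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AInstrumentation")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())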