Wrong version of aws-java-sdk-bundle in sagemaker-spark 1.4.5 #149

@jobvisser03

Description

System Information

  • Spark or PySpark: PySpark (see repro commands below)
  • SDK Version: 1.4.5
  • Spark Version: 3.3.0

Describe the problem

I just spent three days trying to fix this, but to no avail. My setup on an AWS notebook instance:
jars:
aws-java-sdk-bundle-1.11.901.jar
aws-java-sdk-core-1.12.262.jar
aws-java-sdk-kms-1.12.262.jar
aws-java-sdk-s3-1.12.262.jar
aws-java-sdk-sagemaker-1.12.262.jar
aws-java-sdk-sagemakerruntime-1.12.262.jar
aws-java-sdk-sts-1.12.262.jar
hadoop-aws-3.3.1.jar
sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar
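For reference, a minimal sketch of how these jars could be put on the Spark classpath in the notebook; the jar directory path is hypothetical, and spark.jars is a standard Spark option:

from pyspark.sql import SparkSession

# Hypothetical staging directory for the jars listed above.
jar_dir = "/home/ec2-user/jars"
jar_names = [
    "aws-java-sdk-bundle-1.11.901.jar",
    "hadoop-aws-3.3.1.jar",
    "sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
    # ...plus the remaining aws-java-sdk-* jars from the list above
]

spark = (
    SparkSession.builder
    .appName("sagemaker-spark-setup")
    # spark.jars distributes these to the driver and executor classpaths
    .config("spark.jars", ",".join(f"{jar_dir}/{n}" for n in jar_names))
    .getOrCreate()
)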

Problem:

Based on the suggested workarounds in the article above, I tried four things:

  1. Upgrade aws-java-sdk-bundle to version 1.12.262 to match the other jars → didn't work
  2. Downgrade httpclient to version 4.5.10 → didn't work
  3. Disable SSL certificate checking in the aws-java-sdk via "-Dcom.amazonaws.sdk.disableCertChecking=true" (see SSLPeerUnverifiedException on S3 actions aws-sdk-java-v2#1786, and the sketch after this list) → didn't work
  4. Read from a bucket that doesn't contain dots (.) → works
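To be explicit about attempt 3: the issue doesn't record how the flag was passed, so this is a minimal sketch assuming it goes through Spark's standard extraJavaOptions settings, which is how a JVM system property normally reaches both the driver and the executors:

from pyspark.sql import SparkSession

# -Dcom.amazonaws.sdk.disableCertChecking=true must be set on every JVM that
# talks to S3, hence both driver and executor options.
opt = "-Dcom.amazonaws.sdk.disableCertChecking=true"
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", opt)
    .config("spark.executor.extraJavaOptions", opt)
    .getOrCreate()
)

Note that spark.driver.extraJavaOptions may need to be supplied at launch time rather than from an already-running session, since the driver JVM has started by then.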

Minimal repro / logs

22/08/30 11:00:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://comp.data.sci.data.tst/some/folder/export_date=20220822.
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://comp.data.sci.data.tst/some/folder/export_date=20220822: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)

  • Exact command to reproduce:
    Works:
    df = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")
    Doesn't work:
    df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")

It's not possible to rename these buckets because many data consumers depend on them.
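For context on why only the dotted bucket fails: s3a defaults to virtual-hosted-style addressing, which puts the bucket name into the hostname (comp.data.sci.data.tst.s3.amazonaws.com), and the wildcard certificate *.s3.amazonaws.com only matches a single label, so any dot in the bucket name breaks TLS verification. One workaround that may help without renaming the bucket (untested in this exact setup) is forcing path-style access via the standard hadoop-aws option fs.s3a.path.style.access:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # spark.hadoop.* settings are forwarded into the Hadoop configuration;
    # with path-style access the request URL becomes
    # s3.amazonaws.com/aws.bucket.with.dots/... so the bucket name never
    # has to match the certificate's subject alternative names.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")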
