Issue migrating directly from Hive Metastore to Glue Data Catalog #112

@vinceRicchiuti

I am trying to migrate my Hive Metastore (RDS) to my Glue Data Catalog.

I configured the job to run as a Spark job and tried every combination of:

  • Spark 2.4/3.1
  • Python 2/3
  • Glue version 3.0/2.0/1.0/0.9

I followed the README to migrate directly from the Hive Metastore to the AWS Glue Data Catalog, but I get the error "'str' object has no attribute '_jdf'" when I run the Glue ETL job. The full error message is below:

2022-01-27 16:53:53,940 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/import_into_datacatalog.py", line 130, in <module>
    main()
  File "/tmp/import_into_datacatalog.py", line 126, in main
    region=options.get('region') or 'us-east-1'
  File "/tmp/import_into_datacatalog.py", line 51, in metastore_full_migration
    sc, sql_context, db_prefix, table_prefix).transform(hive_metastore)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 753, in transform
    ms_database_params=hive_metastore.ms_database_params)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 734, in transform_databases
    dbs_with_params = self.join_with_params(df=ms_dbs, df_params=ms_database_params, id_col='DB_ID')
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 336, in join_with_params
    df_params_map = self.transform_params(params_df=df_params, id_col=id_col)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 314, in transform_params
    return self.kv_pair_to_map(params_df, id_col, key, value, 'parameters')
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 326, in kv_pair_to_map
    id_type = df.get_schema_type(id_col)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 199, in get_schema_type
    return df.select(column_name).schema.fields[0].dataType
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1671, in select
    jdf = self._jdf.select(self._jcols(*cols))
AttributeError: 'str' object has no attribute '_jdf'

I don't know how to resolve this error. Could you offer any help or suggestions?
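
For anyone reading the traceback: the last frame suggests that `select` ran with a string, rather than a DataFrame instance, in the `self` position. The sketch below reproduces only that failure mode in plain Python. `FakeDataFrame` and its `_jdf` value are hypothetical stand-ins, not PySpark or the migration script, and how the string ends up as `self` in the real job is not established here.

```python
class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (not the real class)."""

    def __init__(self):
        # Placeholder for the Java-side handle the real class keeps in _jdf.
        self._jdf = "java-handle"

    def select(self, *cols):
        # Like the method in the traceback, this reads self._jdf first.
        return (self._jdf, cols)


# Normal call: `self` is an instance, so self._jdf exists.
ok = FakeDataFrame().select("DB_ID")

# Call through the class: the column name "DB_ID" lands in `self`,
# and the attribute lookup on the string fails.
try:
    FakeDataFrame.select("DB_ID")
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)
```

If this is what happens in the job, the place to look would be wherever `get_schema_type` receives its `df` argument, since that frame is the one that calls `df.select(column_name)`.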
