This repo contains the end-to-end implementation of a GCP GCS-to-BigQuery pipeline with CI/CD across Airflow DEV and PROD environments.
Data Flow Details:
Built a data pipeline that loads a flight booking CSV file from a GCS bucket. A GitHub Actions workflow (YAML) deploys the PySpark/Python files and the required variables (JSON) to GCS; a Serverless Dataproc batch runs the transformations and loads the data into the corresponding DEV/PROD BigQuery tables; a Looker dashboard is built on top of the PROD BigQuery table.
Deepwiki documentation URL: https://deepwiki.com/ViinayKumaarMamidi/GCP_Flight_Booking_Airflow_GCS_to_BQ_to_Looker_End_to_End_Project
Project Details:
- Connected VS Code to my GitHub account, created a repo, and activated the connection
- Implemented a PySpark script that reads the flight_booking.csv file from the GCS bucket, performs transformations, and loads the results into staging and final tables in BigQuery
- Used Serverless Dataproc batch concepts inside the Airflow DAG and deployed the code
- Created a GitHub Actions YAML workflow that authenticates to the GCP account and uploads the Airflow DAG, Spark job, and variables files to the GCS bucket
- Defined the required DEV and PROD variables in JSON files and uploaded them to the appropriate GCS bucket folders via the workflow
- Once the DEV Airflow DAG ran to success in GitHub, created a pull request for PROD, and the PROD Airflow DAG ran to success
- Created a Looker dashboard on top of the final PROD table, transformed_flight_data_prod
Source Flight Booking CSV File URL:
Airflow DAG File URL:
Pyspark File URL:
DEV Variables JSON File Details:
PROD Variables JSON File Details:
Source GCS Bucket Files:
DEV Airflow DAG Details:
PROD Airflow DAG Details:
Composer Airflow DEV and PROD Details:
DEV Serverless Dataproc Cluster Log Details:
PROD Serverless Dataproc Cluster Log Details:
DEV Github Deployment Details:
PROD Github Deployment Details:
DEV BigQuery Tables Details:
PROD BigQuery Tables Details:
Looker Dashboard Details: