Gabriele Degola, June 2022
This project simulates a concrete data engineering scenario, covering the generation of business intelligence reports, the delivery of data insights, and applied machine learning.
Solutions are developed using the Python language on top of Apache Spark, leveraging the RDD API, the DataFrame API and the MLlib library. To download and install Spark, refer to the official documentation.
Task 2.5 is solved through Apache Airflow.
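As an illustration only, the sketch below shows one way such an orchestration could look: an Airflow DAG with a single `BashOperator` that launches the Spark job through `spark-submit`. The DAG id, schedule, script name, and file paths are assumptions, not the actual repository code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG sketch for task 2.5: trigger the Spark job via spark-submit.
with DAG(
    dag_id="task_2_5",                 # assumed DAG id
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,            # run on demand
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command=(
            "spark-submit src/task_2_5.py "              # assumed script path
            "data/sf-airbnb-clean.parquet out/out_2_5.txt"  # assumed arguments
        ),
    )
```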
This git repo is organized as follows:
```
.
├── data/
├── src/
├── out/
└── README.md
```
The `data/` directory contains the datasets used in the different exercises. The `src/` directory contains the source code files, named as `task_x_y.py` (solution of part `x`, task `y`). Solutions are described in the associated `README` file. The `out/` directory contains output files, named following the same convention.
Three datasets are used in total, one for each part of the challenge:
- `groceries.csv`: shopping transactions, in `csv` format
- `sf-airbnb-clean.parquet`: small version of the AirBnB dataset, in `parquet` format
- `iris.csv`: the classic iris dataset, in `csv` format
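As a hedged sketch (not part of the repository), the snippet below shows how these files could be loaded with the DataFrame API; the header and schema-inference options for the `csv` files are assumptions.

```python
from pyspark.sql import SparkSession

# Local session with a single worker thread, matching the run setup described below.
spark = SparkSession.builder.master("local[1]").appName("load-datasets").getOrCreate()

# Paths assume the data/ directory layout shown above.
groceries = spark.read.csv("data/groceries.csv", header=True, inferSchema=True)
airbnb = spark.read.parquet("data/sf-airbnb-clean.parquet")
iris = spark.read.csv("data/iris.csv", header=True, inferSchema=True)

groceries.show(5)
```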
All solutions are designed to be run through the `spark-submit` command on a local Spark cluster with a single worker thread:

```
spark-submit task_x_y.py path/to/input/file.txt path/to/output/file.txt
```
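For example, a single-threaded local run can be requested explicitly through the `--master` option (the task number and file names below are purely illustrative):

```
spark-submit --master "local[1]" src/task_1_1.py data/groceries.csv out/out_1_1.txt
```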
Specific instructions are documented in each Python script.
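For reference, a task script would roughly follow the structure sketched below; the argument handling, application name, and I/O formats are assumptions rather than the repository's actual code.

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Input and output paths are passed on the spark-submit command line.
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("task_x_y").getOrCreate()

    # Read the input, apply the task-specific logic, and write the result.
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df.write.mode("overwrite").csv(output_path)

    spark.stop()
```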