
Commit 6f5e473

Refactor readme to reflect the eng fund

1 parent: 6c98a44

File tree: 4 files changed (+24, -11 lines)


README.md

Lines changed: 24 additions & 11 deletions
@@ -1,17 +1,20 @@
# spark-with-engineering-fundamentals

![CI](https://github.com/syedhassaanahmed/spark-with-engineering-fundamentals/actions/workflows/ci.yml/badge.svg)

## Why?

[Apache Spark](https://spark.apache.org/docs/latest/) is a popular open-source analytics engine for large-scale data processing. When building Spark applications as part of a production-grade solution, developers need to take care of engineering aspects such as the inner dev loop, testing, CI/CD, infrastructure-as-code and observability.

## What?

In this **work-in-progress** sample, we demonstrate an E2E Spark data pipeline and show how to tackle the above-mentioned engineering fundamentals.

## How?

The ETL scenario is based on a time series data processing pipeline. The pipeline reads synthetic data from a message broker, processes it with PySpark and writes the enriched data to a database. The key point to note here is that the data processing logic can be shared between the cloud and local versions through the `common_lib` wheel.
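
As a rough illustration of what such shared logic can look like, the transform can be an ordinary PySpark function packaged inside the wheel; the function name, fields and schema below are hypothetical, not taken from this repo:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def enrich_timeseries(raw: DataFrame, device_metadata: DataFrame) -> DataFrame:
    """Hypothetical shared transform: parse JSON payloads and join device metadata."""
    parsed = (
        raw.selectExpr("CAST(value AS STRING) AS json")  # Kafka's 'value' column is binary
        .select(
            F.json_tuple("json", "deviceId", "time", "temperature")
            .alias("deviceId", "time", "temperature")
        )
        .withColumn("time", F.to_timestamp("time"))
        .withColumn("temperature", F.col("temperature").cast("double"))
    )
    # Enrichment step: attach static metadata about the emitting device.
    return parsed.join(device_metadata, on="deviceId", how="left")
```

Because such a function depends only on `pyspark`, the same wheel can be installed on a Databricks cluster and inside the local Docker image.
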
### Cloud

The IoT Telemetry Simulator is hosted in [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances). It sends generated data to a Kafka broker, [exposed through Azure Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/azure-event-hubs-kafka-overview).

The ETL workload is represented by a [Databricks Job](https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/jobs). This job is responsible for reading and enriching the data from the sources and storing the final output in an [Azure SQL DB](https://azure.microsoft.com/en-us/products/azure-sql/database/).
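
A minimal sketch of how such a job can consume the Kafka-enabled Event Hubs endpoint; the namespace, topic and secret handling below are placeholders rather than this repo's actual configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telemetry-etl").getOrCreate()

# Event Hubs exposes a Kafka-compatible endpoint on port 9093 (SASL_SSL);
# the literal username "$ConnectionString" is used with the namespace
# connection string as the SASL PLAIN password.
eh_namespace = "my-eventhubs-namespace"  # placeholder, not this repo's value
eh_conn_str = "Endpoint=sb://..."        # placeholder; read from a secret store

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{eh_namespace}.servicebus.windows.net:9093")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{eh_conn_str}";',
    )
    .option("subscribe", "telemetry")    # placeholder topic / event hub name
    .load()
)
```
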
@@ -21,34 +24,44 @@
</div>

### Local

The pipeline begins with the [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator) sending synthetic time series data to a [Confluent Community Kafka Server](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart). A PySpark app then processes the time series, applies some metadata and writes the enriched results to a SQL DB hosted in a [SQL Server 2022 Linux container](https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver16&pivots=cs1-bash).
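
On the write side, persisting the enriched output over JDBC might look like the sketch below; the connection details and table name are placeholders, and for a streaming source this function would typically be invoked from a `foreachBatch` sink:

```python
def write_to_sql(enriched_df):
    """Persist one (micro-)batch of enriched rows to the SQL Server container."""
    (
        enriched_df.write.format("jdbc")
        # Placeholder host/database; the MSSQL JDBC driver jar must be on the Spark classpath.
        .option("url", "jdbc:sqlserver://localhost:1433;databaseName=telemetry;encrypt=false")
        .option("dbtable", "dbo.enriched_readings")     # placeholder table
        .option("user", "sa")
        .option("password", "<read-from-environment>")  # never hard-code secrets
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .mode("append")
        .save()
    )
```
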

<div align="center">
<img src="./docs/local-architecture.png">
</div>

## Infrastructure as Code

In the Cloud version, we provision all infrastructure with [Terraform](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs). Prior to running `terraform apply`, you must ensure that the [wheel](https://wheel.readthedocs.io/en/stable/) `./src/common_lib/dist/common_lib-*.whl` exists locally by executing `python3 -m build ./src/common_lib`.
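
Assuming the Terraform configuration lives at the repository root (the working directory below is an assumption), the typical sequence is:

```sh
# Build the shared wheel first, then provision the Azure resources.
python3 -m pip install build        # one-time: provides the 'build' module
python3 -m build ./src/common_lib   # emits ./src/common_lib/dist/common_lib-*.whl

terraform init
terraform plan -out=tfplan
terraform apply tfplan
```
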
In the local version, we provision and orchestrate everything with [Docker Compose](https://docs.docker.com/compose/). Please use the `docker compose` tool instead of the [older](https://stackoverflow.com/a/66516826) `docker-compose`.
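
A typical local workflow could then look like this; the `spark-app` service name is a placeholder, not necessarily what the Compose file calls it:

```sh
docker compose up --build -d       # build images and start the whole pipeline
docker compose logs -f spark-app   # follow the PySpark app's output
docker compose down --volumes      # tear everything down, including DB state
```
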
## Tests

- To validate that the local E2E pipeline is working correctly, we can execute the script `smoke-test.sh`. This script sends messages using the IoT Telemetry Simulator and then queries the SQL DB to ensure the messages were processed correctly.
- Unit tests for the `common_lib` wheel are written in PyTest.
- Both types of tests are also executed in the CI pipeline; the sketch below shows one way to run them locally.
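
Under the repo-layout assumptions above, both kinds of tests can be run locally as follows:

```sh
# Unit tests for the shared wheel.
python3 -m pytest ./src/common_lib

# E2E smoke test against the running Docker Compose stack.
./smoke-test.sh
```
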

## CI/CD

GitHub Actions is used for CI/CD. The CI pipeline runs the tests and the CD pipeline deploys the Cloud infrastructure and the Spark Job.
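
The repository's actual workflow files are not reproduced here; as a sketch, a CI job matching this description could look like the following (the trigger, action versions and step details are assumptions):

```yaml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Run unit tests
        run: |
          python3 -m pip install pytest build
          python3 -m pytest ./src/common_lib
      - name: Run E2E smoke test
        run: |
          docker compose up --build -d
          ./smoke-test.sh
```
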
## Observability

The local version of the solution also deploys additional containers for [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/). The Grafana dashboard below relies on the [Spark 3.0 metrics](https://spark.apache.org/docs/3.0.0/monitoring.html) emitted in the Prometheus format.
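
Spark 3.0 can emit metrics in the Prometheus format through its built-in `PrometheusServlet` sink; whether this repo wires it up exactly this way is an assumption, but a minimal `metrics.properties` would be:

```properties
# Serve metrics on the Spark UI port under /metrics/prometheus
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

Prometheus can then scrape the driver's UI port (4040 by default).
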

<div align="center">
<img src="./docs/local-grafana.png">
</div>

## Inner Dev Loop

[GitHub Codespaces](https://github.com/features/codespaces) are supported through [VS Code Dev Containers](https://code.visualstudio.com/docs/devcontainers/containers). The minimum required [machine type](https://docs.github.com/en/codespaces/customizing-your-codespace/changing-the-machine-type-for-your-codespace) configuration is `4-core`.
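
One way to enforce that minimum is the `hostRequirements` field of `devcontainer.json`; the excerpt below is illustrative and may differ from this repo's actual dev container definition:

```jsonc
// .devcontainer/devcontainer.json (illustrative excerpt)
{
  "hostRequirements": {
    "cpus": 4   // Codespaces will not offer machine types below 4 cores
  }
}
```
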

## Team

- [Alexander Gassmann](https://github.com/Salazander)
- [Magda Baran](https://github.com/MagdaPaj)
- [Hassaan Ahmed](https://github.com/syedhassaanahmed)
