A complete, production-style analytics project built with Apache Spark, covering batch processing, graph analytics, and real-time streaming.
This repository demonstrates how to design modular big-data pipelines, clean large datasets, perform advanced transformations, and generate insights at scale.
The project is divided into four major analytical modules, each originally designed as a standalone task.
All questions have been re-implemented and extended into clean, professional Python modules.
| Module | Description | Core Dataset / Topic |
|---|---|---|
| Task 1 – NYC Yellow Taxi (Batch Analytics) | Cleansing, aggregating, and analysing multi-gigabyte taxi trip data to derive operational KPIs. | NYC Yellow Taxi data (CSV) |
| Task 2 – Ethereum Transactions (Exploratory Analytics) | Parsing and summarising blockchain transactions; analysing gas price dynamics and user activity. | Ethereum transaction records |
| Task 3 – NYC Taxi Network (Graph Analytics) | Building a directed graph of taxi zones to study travel flows and influential pickup areas. | NYC Taxi Zone lookup |
| Task 4 – NASA Logs (Streaming Analytics) | Real-time ingestion of NASA web logs to detect request spikes and anomalies over sliding windows. | NASA HTTP request logs |
Below are the analytical objectives rephrased from the original coursework instructions.
Each “Question X” corresponds to a dedicated, runnable script under the relevant `src/` folder.
Task 1 – NYC Yellow Taxi (Batch Analytics). Goal: Use PySpark DataFrames to perform cleaning, transformation, and exploratory analysis.
Questions implemented:
- Load large CSV files into Spark and inspect schema consistency.
- Remove records with missing or invalid values (zero fares, negative trip distances, etc.).
- Compute key KPIs – total revenue, average trip distance, peak hours.
- Identify busiest pickup and drop-off zones.
- Aggregate monthly trip counts and average fares.
- Calculate correlation between fare amount and trip distance.
- Save clean data partitions in Parquet format.
➡️ All results and code live in: `src/nyc_taxi_batch/nyc_taxi_batch_01-07.py`
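For orientation, here is a minimal PySpark sketch of the cleaning-and-KPI flow described above. The input path, column names (`tpep_pickup_datetime`, `fare_amount`, `trip_distance`), and the partitioning scheme are illustrative assumptions, not the exact module code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nyc-taxi-batch").getOrCreate()

# Load the raw trips (path and column names are illustrative).
trips = spark.read.csv("data/yellow_tripdata.csv", header=True, inferSchema=True)

# Drop records with missing values, zero fares, or negative distances.
clean = (
    trips.dropna(subset=["fare_amount", "trip_distance"])
         .filter((F.col("fare_amount") > 0) & (F.col("trip_distance") > 0))
)

# Core KPIs: total revenue and average trip distance.
clean.agg(
    F.sum("fare_amount").alias("total_revenue"),
    F.avg("trip_distance").alias("avg_trip_distance"),
).show()

# Peak pickup hours.
hourly = clean.withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
hourly.groupBy("pickup_hour").count().orderBy(F.desc("count")).show(5)

# Fare vs. distance correlation on the cleaned data.
print(clean.stat.corr("fare_amount", "trip_distance"))

# Persist the cleaned data as Parquet, partitioned by pickup hour.
hourly.write.mode("overwrite").partitionBy("pickup_hour").parquet("output/clean_trips")
```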
Task 2 – Ethereum Transactions (Exploratory Analytics). Goal: Analyse blockchain transaction data to uncover temporal and value patterns.
Questions implemented:
- Load `september_2015_document.csv` and `october_2015_document.csv`.
- Parse timestamps into Spark SQL datetime format.
- Compute the number of transactions per day.
- Determine min/max/average gas price.
- Generate histograms of transaction value and gas price distributions.
- Compare monthly transaction trends between September and October.
➡️ Results are visualised under `report/figures/ethereum/`.
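A condensed sketch of the loading and aggregation steps, assuming a schema with `block_timestamp` (Unix epoch) and `gas_price` columns; the actual column names in the course CSVs may differ:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ethereum-analytics").getOrCreate()

# Load both monthly extracts in one pass.
tx = spark.read.csv(
    ["data/september_2015_document.csv", "data/october_2015_document.csv"],
    header=True, inferSchema=True,
)

# Parse Unix-epoch timestamps into Spark SQL timestamps.
tx = tx.withColumn("ts", F.from_unixtime("block_timestamp").cast("timestamp"))

# Transactions per day.
tx.groupBy(F.to_date("ts").alias("day")).count().orderBy("day").show()

# Gas price summary statistics.
tx.agg(
    F.min("gas_price").alias("min_gas_price"),
    F.max("gas_price").alias("max_gas_price"),
    F.avg("gas_price").alias("avg_gas_price"),
).show()
```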
Task 3 – NYC Taxi Network (Graph Analytics). Goal: Construct and analyse a graph representation of taxi trips.
Questions implemented:
- Create vertices (zones) and edges (trip connections).
- Compute in-degree and out-degree for each zone.
- Identify the top 10 most connected zones.
- Run PageRank to detect influential pickup areas.
- Find the shortest path between two selected zones.
- Visualise the network structure with degree weighting.
➡️ GraphFrames is used throughout; outputs are stored in `report/figures/nyc_graph/`.
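The degree and PageRank steps follow the standard GraphFrames pattern sketched below. The file paths, zone/edge column names, and the two zone IDs are assumptions, and the `graphframes` package must be on the Spark classpath (e.g. via `spark-submit --packages`):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("nyc-taxi-graph").getOrCreate()

# Vertices: one row per taxi zone (GraphFrames requires an "id" column).
zones = (
    spark.read.csv("data/taxi_zone_lookup.csv", header=True)
         .withColumnRenamed("LocationID", "id")
)

# Edges: one row per trip, with "src"/"dst" zone IDs.
edges = (
    spark.read.parquet("output/clean_trips")
         .selectExpr("PULocationID as src", "DOLocationID as dst")
)

g = GraphFrame(zones, edges)

# In-/out-degree per zone; show the 10 most connected zones each way.
g.inDegrees.orderBy("inDegree", ascending=False).show(10)
g.outDegrees.orderBy("outDegree", ascending=False).show(10)

# PageRank to surface influential pickup areas.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy("pagerank", ascending=False).show(10)

# Shortest path between two zones via breadth-first search (IDs are placeholders).
g.bfs(fromExpr="id = '132'", toExpr="id = '230'").show()
```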
Task 4 – NASA Logs (Streaming Analytics). Goal: Build a Spark Structured Streaming pipeline for NASA web server logs.
Questions implemented:
- Stream log data from a directory using a 10-second micro-batch interval.
- Parse timestamp, host, and status fields using regexes and a Spark schema.
- Count requests per host over a sliding time window.
- Detect the most frequent status codes in the last 60 seconds.
- Identify anomalies in request rate using thresholds.
- Write aggregated output to the console and checkpoint directory.
➡️ Streaming pipeline scripts: `src/nasa_streaming/nasa_streaming_01-06.py`
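For reference, a minimal Structured Streaming sketch of this pipeline, assuming the logs arrive in Common Log Format; the input directory, regex patterns, watermark, and window sizes are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nasa-streaming").getOrCreate()

# Stream raw log lines as they land in the input directory.
lines = spark.readStream.format("text").load("data/nasa_logs/")

# Extract host, timestamp, and status from Common Log Format lines.
logs = lines.select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("host"),
    F.regexp_extract("value", r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})", 1).alias("ts_raw"),
    F.regexp_extract("value", r'"\s(\d{3})\s', 1).alias("status"),
).withColumn("ts", F.to_timestamp("ts_raw", "dd/MMM/yyyy:HH:mm:ss"))

# Requests per host over a 60-second window sliding every 10 seconds.
counts = (
    logs.withWatermark("ts", "2 minutes")
        .groupBy(F.window("ts", "60 seconds", "10 seconds"), "host")
        .count()
)

# Console sink with checkpointing, triggered in 10-second micro-batches.
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .option("checkpointLocation", "checkpoints/nasa/")
          .trigger(processingTime="10 seconds")
          .start()
)
query.awaitTermination()
```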
Getting started:

```bash
git clone git@github.com:abailey81/Big-data-spark-analytics.git
cd Big-data-spark-analytics
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```