Log Processing Automation & Anomaly Detection

This project simulates the log analysis pipeline described in my resume. It uses PySpark (to mimic Azure Databricks) to process raw Apache logs.

The script automates the analysis by:

  1. Parsing raw text logs into a structured format.
  2. Running a SQL query to identify operational patterns (e.g., status code counts).
  3. Running a second SQL query to detect anomalies (e.g., the IPs generating the most errors), cutting down on manual review time.
  4. Exporting the results to CSV files for visualization in Power BI.

Technologies: Python, PySpark, SQL, Power BI
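
For reference, the core of the pipeline looks roughly like the sketch below. This is a minimal illustration, not the exact contents of process_logs.py; the temp view name, column names, and query details are assumptions based on the steps and output files described in this README.

import os
import re
from pyspark.sql import SparkSession

# Defaults mirror the environment variables described in Step 2.
LOG_FILE = os.environ.get("LOG_FILE", "sample_logs.log")
LOG_REGEX = os.environ.get(
    "LOG_REGEX",
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)',
)

spark = SparkSession.builder.appName("LogProcessingAutomation").getOrCreate()
pattern = re.compile(LOG_REGEX)

# 1. Parse raw text lines into structured rows.
def parse_line(line):
    m = pattern.match(line)
    if m is None:
        return None
    ip, _identity, _user, timestamp, request, status, size = m.groups()
    return (ip, timestamp, request, int(status), size)

rows = (
    spark.sparkContext.textFile(LOG_FILE)
    .map(parse_line)
    .filter(lambda r: r is not None)
)
logs = spark.createDataFrame(rows, ["ip", "timestamp", "request", "status_code", "size"])
logs.createOrReplaceTempView("logs")

# 2. Operational patterns: request counts per HTTP status code.
patterns = spark.sql(
    "SELECT status_code, COUNT(*) AS count FROM logs "
    "GROUP BY status_code ORDER BY count DESC"
)

# 3. Anomalies: IPs generating the most 4xx/5xx responses.
anomalies = spark.sql(
    "SELECT ip, COUNT(*) AS error_count FROM logs "
    "WHERE status_code >= 400 GROUP BY ip ORDER BY error_count DESC"
)

# 4. Export single CSV files for Power BI via pandas.
patterns.toPandas().to_csv("operational_patterns.csv", index=False)
anomalies.toPandas().to_csv("ip_anomalies.csv", index=False)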


⚙️ How to Run the Project

Step 1: Get the Data

This project requires a sample Apache log file.

To download a sample Apache log file, run the following command in your terminal:

curl -o sample_logs.log https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs

Place the sample_logs.log file in the same directory as the process_logs.py script.

Step 2: Set Up Environment Variables (Optional)

You can configure the log file name and the regex pattern using environment variables. If not set, default values will be used.

  • LOG_FILE: The name of the log file to process (default: sample_logs.log).
  • LOG_REGEX: The regular expression pattern to parse the log entries (default: (\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)).

Example of setting environment variables:

export LOG_FILE="my_custom_logs.log"
export LOG_REGEX="your_custom_regex_pattern"
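
If you supply a custom LOG_REGEX, it needs to expose the same capture groups as the default (IP, identity, user, timestamp, request, status code, size). A quick, standalone way to check a pattern against a single log line before running the full Spark job (the sample line below is illustrative):

import re

LOG_REGEX = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)'

# Illustrative line in Apache common log style.
sample = '127.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024'

match = re.match(LOG_REGEX, sample)
print(match.groups() if match else "pattern did not match")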

Step 3: Set Up a Virtual Environment and Install Libraries

It's recommended to use a virtual environment to manage dependencies.

python3 -m venv venv
source venv/bin/activate
pip install pyspark pandas

Step 4: Run the Script

Run the script from your terminal:

python3 process_logs.py

This will run a local Spark job and create two new files:

  • operational_patterns.csv
  • ip_anomalies.csv
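
Before moving on to Power BI, you can quickly sanity-check the exports with pandas. The column names below (status_code, count, ip, error_count) are the ones referenced in the visualization guide; adjust them if your output headers differ:

import pandas as pd

patterns = pd.read_csv("operational_patterns.csv")
anomalies = pd.read_csv("ip_anomalies.csv")

# Requests per HTTP status code.
print(patterns.head())

# Top 10 IPs by error count -- the same view built as a bar chart below.
print(anomalies.sort_values("error_count", ascending=False).head(10))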

📊 Power BI Visualization Guide

Here is how to create the visualizations.

  1. Open Power BI Desktop.
  2. Click "Get data" -> "Text/CSV" and select operational_patterns.csv.
  3. Load the data.
  4. Repeat the process: Click "Get data" -> "Text/CSV" and select ip_anomalies.csv.
  5. Load the data. (You don't need to join the tables; just have both available in the Fields pane.)

Visual 1: Operational Patterns (Pie Chart)

  1. Go to the "Report" view.
  2. Select the Pie chart visualization.
  3. From the operational_patterns table, drag status_code to the Legend field.
  4. Drag count to the Values field.
  5. You can customize the labels and title as needed.

Visual 2: Anomaly Detection (Bar Chart)

  1. Create a new page in the report.
  2. Select the Clustered bar chart visualization.
  3. From the ip_anomalies table, drag ip to the Axis field.
  4. Drag error_count to the Values field.
  5. To show just the Top 10:
    • With the visual selected, find the ip field under "Filters on this visual" in the Filters pane and expand it.
    • Under "Filter type", select "Top N".
    • Set "Show items" to "Top" and value to 10.
    • Drag error_count to the "By value" field.
    • Click "Apply filter".

Visual 3: KPI Cards

  1. Create a new page in the report.
  2. Select the Card visualization.
  3. From the operational_patterns table, drag count to the Fields well.
  4. To display specific error counts (e.g., 404 errors):
    • Drag status_code from operational_patterns to the Filters on this visual pane.
    • Select "404" as the filter value.
  5. Rename the card title (e.g., "Total 404 Errors").
  6. Repeat this process for other status codes (e.g., "500" errors) on new Card visualizations.
  7. Finally, arrange your visuals on the report pages as desired (dashboards can be assembled later in the Power BI Service).
