This project simulates a log analysis pipeline, as described in my resume. It uses PySpark (to mimic Azure Databricks) to process raw Apache logs.
The script automates the analysis by:
- Parsing raw text logs into a structured format.
- Running a SQL query to identify operational patterns (e.g., status code counts).
- Running a second SQL query to detect anomalies (e.g., IPs generating the most errors), simulating a reduction in manual review time.
- Exporting the results to CSV files for visualization in Power BI.
Technologies: Python, PySpark, SQL, Power BI
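
At a glance, the pipeline looks like the sketch below. This is a condensed illustration, not the actual `process_logs.py`; the view name `logs` and the intermediate column names are assumptions, though the output columns (`status_code`, `count`, `ip`, `error_count`) match the CSVs referenced later.

```python
# Condensed sketch of the pipeline; process_logs.py may differ in details.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

LOG_PATTERN = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)'

spark = SparkSession.builder.appName("log-analysis").getOrCreate()

# Parse raw text lines into structured columns
logs = spark.read.text("sample_logs.log").select(
    regexp_extract("value", LOG_PATTERN, 1).alias("ip"),
    regexp_extract("value", LOG_PATTERN, 4).alias("timestamp"),
    regexp_extract("value", LOG_PATTERN, 5).alias("request"),
    regexp_extract("value", LOG_PATTERN, 6).cast("int").alias("status_code"),
)
logs.createOrReplaceTempView("logs")

# Query 1: operational patterns (requests per status code)
patterns = spark.sql(
    "SELECT status_code, COUNT(*) AS count "
    "FROM logs GROUP BY status_code ORDER BY count DESC"
)

# Query 2: anomalies (IPs generating the most 4xx/5xx errors)
anomalies = spark.sql(
    "SELECT ip, COUNT(*) AS error_count FROM logs "
    "WHERE status_code >= 400 GROUP BY ip ORDER BY error_count DESC"
)

# Results are small, so collect to pandas for single-file CSV exports
patterns.toPandas().to_csv("operational_patterns.csv", index=False)
anomalies.toPandas().to_csv("ip_anomalies.csv", index=False)
```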
This project requires a sample Apache log file.
To download a sample Apache log file, run the following command in your terminal:

```bash
curl -o sample_logs.log https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs
```

Place the `sample_logs.log` file in the same directory as the `process_logs.py` script.
You can configure the log file name and the regex pattern using environment variables. If not set, default values will be used.
- `LOG_FILE`: The name of the log file to process (default: `sample_logs.log`).
- `LOG_REGEX`: The regular expression pattern used to parse the log entries (default: `(\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)`).
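
As a hypothetical illustration (not necessarily how `process_logs.py` reads its configuration), the settings can be picked up with `os.environ.get`, and the default pattern parses a standard Apache log line as follows:

```python
# Hypothetical example of reading the settings and testing the default pattern.
import os
import re

log_file = os.environ.get("LOG_FILE", "sample_logs.log")
pattern = os.environ.get(
    "LOG_REGEX",
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+ .*?)" (\d{3}) (\S+)',
)

line = '127.0.0.1 - - [17/May/2015:10:05:03 +0000] "GET /index.html HTTP/1.1" 200 9316'
m = re.match(pattern, line)
# Groups: host, identity, user, timestamp, request, status code, response size
print(m.groups())
```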
Example of setting environment variables:
```bash
export LOG_FILE="my_custom_logs.log"
export LOG_REGEX="your_custom_regex_pattern"
```

It's recommended to use a virtual environment to manage dependencies:
```bash
python3 -m venv venv
source venv/bin/activate
pip install pyspark pandas
```

Run the script from your terminal:

```bash
python3 process_logs.py
```

This will run a local Spark job and create two new files:
- `operational_patterns.csv`
- `ip_anomalies.csv`
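
Before loading them into Power BI, you can optionally sanity-check the exported files with pandas (installed earlier):

```python
# Quick optional check that both CSVs exist and have the expected columns
import pandas as pd

for path in ("operational_patterns.csv", "ip_anomalies.csv"):
    df = pd.read_csv(path)
    print(path, list(df.columns), len(df), "rows")
```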
Here is how to create the visualizations.
- Open Power BI Desktop.
- Click "Get data" -> "Text/CSV" and select
operational_patterns.csv. - Load the data.
- Repeat the process: Click "Get data" -> "Text/CSV" and select
ip_anomalies.csv. - Load the data. (You don't need to join them, just have them both available in the Fields pane).
- Go to the "Report" view.
- Select the Pie chart visualization.
- From the `operational_patterns` table, drag `status_code` to the Legend field.
- Drag `count` to the Values field.
- Customize the labels and title as needed.
- Create a new page in the report.
- Select the Clustered bar chart visualization.
- From the `ip_anomalies` table, drag `ip` to the Axis field.
- Drag `error_count` to the Values field.
- To show just the top 10:
  - Click on the `ip` field in the "Visualizations" pane, then select "Filters".
  - Under "Filter type", select "Top N".
  - Set "Show items" to "Top" and the value to 10.
  - Drag `error_count` to the "By value" field.
  - Click "Apply filter".
- Create a new page in the report.
- Select the Card visualization.
- From the `operational_patterns` table, drag `count` to the Fields well.
- To display specific error counts (e.g., 404 errors):
  - Drag `status_code` from `operational_patterns` to the "Filters on this visual" pane.
  - Select "404" as the filter value.
- Rename the card title (e.g., "Total 404 Errors").
- Repeat this process for other status codes (e.g., "500" errors) on new Card visualizations.
- Finally, create a new Dashboard and arrange your visuals as desired.