60 changes: 32 additions & 28 deletions README.md
@@ -3,25 +3,24 @@

Unsupervised machine learning for web attack detection.


<p align="center">
<img width="100%" src="https://github.com/slrbl/unsupervised-learning-attack-detection-webhawk-catch/blob/master/IMAGES/hawk.jpg">
Image source: https://unsplash.com/photos/i4Y9hr5dxKc (Mathew Schwartz)
</p>

## About

Webhawk/Catch helps automatically finding web attack traces in HTTP logs and abnormal OS processes without using any preset rules. Based on the usage of Unsupervised Machine Learning, Catch groups log lines into clusters, and detects the outliers that it considers as potentially attack traces.
Webhawk/Catch helps you automatically find web attack traces in HTTP logs and abnormal OS processes without using any preset rules. Using unsupervised machine learning, Catch groups log lines into clusters and detects the outliers, which it treats as potential attack traces.

The tool is able to parse both raw HTTP log files (Apache, Nginx, ...) and files including OS statistics (generated by top command). The tool takes these files as input and returns a report with a list of findings.
The tool can parse both raw HTTP log files (Apache, Nginx, ...) and files containing OS statistics (generated by the `top` command). It takes these files as input and returns a report with a list of findings.
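
To make the parsing step more concrete, here is a minimal Python sketch that applies the `apache` pattern from settings_template.conf to a single access-log line. The field names, the `parse_apache_line` helper and the sample line are assumptions made for illustration; Catch's own parsing code may be organised differently.

```python
import re

# Pattern copied from the `apache` entry of settings_template.conf.
APACHE_PATTERN = re.compile(
    r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"'
)

# Illustrative field names; the names Catch uses internally may differ.
FIELDS = (
    "ip", "timestamp", "request", "return_code",
    "size", "referrer", "user_agent",
)


def parse_apache_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = APACHE_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# Made-up log line, used only to exercise the pattern.
sample = (
    '192.168.1.10 - - [22/Oct/2021:10:00:00 +0000] '
    '"GET /index.php?id=1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
)
parsed = parse_apache_line(sample)
print(parsed)
```

Lines that do not match the pattern return `None`, so a caller can skip or count them instead of failing on malformed input.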

Catch uses the PCA (Principal Component Analysis) technique to select the most relevant features (for example: user agent, IP address, number of transmitted parameters, etc.). It then runs the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to find the log line clusters and the anomalous points (potential attack traces).
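
As a rough illustration of this pipeline, the sketch below assumes scikit-learn and NumPy and uses synthetic numeric features in place of encoded log lines. The two-component PCA, the `eps` and `min_samples` values, and the data itself are assumptions for the example, not Catch's actual defaults. Note that PCA as used here projects the features onto principal components rather than picking individual columns.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for encoded log lines: 200 ordinary rows, 5 features each.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
# Three hand-placed rows far from the rest, playing the role of attack traces.
anomalies = np.array([
    [12.0, 12.0, 12.0, 12.0, 12.0],
    [-12.0, -12.0, -12.0, -12.0, -12.0],
    [12.0, -12.0, 12.0, -12.0, 12.0],
])
features = np.vstack([normal, anomalies])

# Standardise, then project onto two principal components (assumed value).
scaled = StandardScaler().fit_transform(features)
reduced = PCA(n_components=2).fit_transform(scaled)

# DBSCAN labels points that belong to no dense region as -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

outlier_rows = np.flatnonzero(labels == -1)
n_clusters = len(set(labels.tolist()) - {-1})
print(f"clusters: {n_clusters}, outlier rows: {outlier_rows.tolist()}")
```

Rows labelled `-1` are the candidates that would be reported as findings. Some ordinary rows on the fringe of a cluster may also end up as noise, which is why the clustering parameters matter.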

Advanced users can fine tune Catch based on a set of options that help optimising the clustering algorithm (Example: minimum number of points by cluster, or the maximum distance between two points within the same cluster).
Advanced users can fine-tune Catch through a set of options that help optimise the clustering algorithm (for example, the minimum number of points per cluster, or the maximum distance between two points within the same cluster).
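
One common heuristic for choosing the maximum distance (the `-e EPS` option) is to sort every point's distance to its k-th nearest neighbour and pick the value at the knee of that curve. The standard-library sketch below illustrates the idea on a toy data set. The `k_distances` and `knee_index` helpers, the simplified knee rule and the data are assumptions for illustration; this is not necessarily how Catch tunes its parameters.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_index(values):
    """Index where the normalised curve lies farthest below the diagonal."""
    n = len(values)
    lo, hi = values[0], values[-1]
    if n < 3 or hi == lo:
        return n - 1
    best_i, best_gap = 0, -1.0
    for i, v in enumerate(values):
        # Both axes are rescaled to [0, 1] before comparing them.
        gap = i / (n - 1) - (v - lo) / (hi - lo)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


# Toy data: a dense 5x5 grid plus two isolated points (hypothetical values).
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
points += [(20.0, 20.0), (-15.0, 30.0)]

kd = k_distances(points, k=4)
suggested_eps = kd[knee_index(kd)]
print(f"suggested eps: {suggested_eps}")
```

The suggested value is only a starting point. It can be passed to `-e`, with `k` reused for `-s MIN_SAMPLES`, and then adjusted after inspecting the resulting clusters.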

The current version of Webhawk/Catch generates an easy-to-read HTML report which includes all the findings, and the severity of each one.

Webhawk/Catch is an open-source tool. Catch is the unsupervised version of Webhawk which is a supervised machine learning based cyber-attack detection tool. In contrary to the supervised Webhawk, Catch can be used without manually pertaining a model, the thing that makes it a lightweight and flexible solution to easily identify potential attack traces. Catch is available as an independent repository in Github, it is also included as part of Webhawk which is starred 125 times and forked 68 times.
Webhawk/Catch is an open-source tool. Catch is the unsupervised version of Webhawk, which is a supervised machine-learning-based cyber-attack detection tool. In contrast to the supervised Webhawk, Catch can be used without manually pre-training a model, which makes it a lightweight and flexible solution for identifying potential attack traces. Catch is available as an independent repository on GitHub; it is also included as part of Webhawk, which has been starred 125 times and forked 68 times.

## Setup

@@ -36,7 +35,7 @@ pip install -r requirements.txt

### Create a settings.conf file

Copy settings_template.conf file to settings.conf and fill it with the required parameters as the following.
Copy the **settings_template.conf** file to **settings.conf** and fill in the required parameters as follows.

```shell
[FEATURES]
@@ -57,7 +56,7 @@ attributes:['status', 'num_ctx_switches', 'memory_full_info', 'connections', 'cm
### Catch.py script

```shell
python catch.py -h
usage: catch.py [-h] -l LOG_FILE -t LOG_TYPE [-e EPS] [-s MIN_SAMPLES] [-j LOG_LINES_LIMIT] [-y OPT_LAMDA] [-m MINORITY_THRESHOLD] [-p] [-o] [-r] [-z] [-b] [-c] [-v]

options:
@@ -86,65 +85,70 @@ options:

```


### Example with HTTP logs

Encoding is automatic for the unsupervised mode. You just need to run the catch.py script.
The following example can serve as a starting point:

```shell
python catch.py -l ../HTTP_LOGS_DTATSETS/SECREPO_LOGS/access.log.2021-10-22 --log_type apache --show_plots --standardize_data --report
python catch.py -l ./SAMPLE_DATA/RAW_APACHE_LOGS/access.log.2021-10-22 --log_type apache --show_plots --standardize_data --report
```

The output of this command is:

<p align="center">
<img width="100%" src="https://github.com/slrbl/unsupervised-learning-attack-detection-webhawk-catch/blob/master/IMAGES/screenshot_1.png">
</p>

<p align="center">
<img width="100%" src="https://github.com/slrbl/unsupervised-learning-attack-detection-webhawk-catch/blob/master/IMAGES/clusters_1.png">
</p>

<p align="center">
<img width="100%" src="https://github.com/slrbl/unsupervised-learning-attack-detection-webhawk-catch/blob/master/IMAGES/clusters_2.png">
</p>

### Example with OS processes
Before running the catch.py, you need to generate a .txt file containing the OS process statistics by taking advantage of top command:

Before running catch.py, generate a .txt file containing the OS process statistics using the `top` command:

```shell
top > PATH/os_processes.txt
```

You can then run the catch.py to detect potential abnormal OS processes:
You can then run catch.py to detect potentially abnormal OS processes:

```shell
python catch.py -l PATH/os_processes.txt --log_type os_processes --show_plots --standardize_data --report
```

## Used sample data
## Using sample data

The data you will find in SAMPLE_DATA folder comes from<br>
https://www.secrepo.com.
The data located in the **SAMPLE_DATA** folder comes from
<https://www.secrepo.com>.

## Interesting data samples

https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3QBYB5

<https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs>
<https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3QBYB5>

## TODO

Nothing for now.

## Reference

Silhouette Effeciency
<br>https://bioinformatics-training.github.io/intro-machine-learning-2017/clustering.html
Silhouette Efficiency

<https://bioinformatics-training.github.io/intro-machine-learning-2017/clustering.html>

Optimal Value of Epsilon

<https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc>

<br>Optimal Value of Epsilon
<br>https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc
Max curvature point

<br>Max curvature point
<br>https://towardsdatascience.com/detecting-knee-elbow-points-in-a-graph-d13fc517a63c
<https://towardsdatascience.com/detecting-knee-elbow-points-in-a-graph-d13fc517a63c>

## Contribution

11 changes: 11 additions & 0 deletions settings_template.conf
@@ -0,0 +1,11 @@
[FEATURES]
features:length,params_number,return_code,size,upper_cases,lower_cases,special_chars,url_depth,user_agent,http_query,ip

[LOG]
apache:([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"
nginx:([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) (.+) "(.*?)" "(.*?)"
apache_error:
nginx_error:

[PROCESS_DETAILS]
attributes:['status', 'num_ctx_switches', 'memory_full_info', 'connections', 'cmdline', 'create_time', 'num_fds', 'cpu_percent', 'terminal', 'ppid', 'cwd', 'nice', 'username', 'cpu_times', 'memory_info', 'threads', 'open_files', 'name', 'num_threads', 'exe', 'uids', 'gids', 'memory_percent', 'environ']