A machine learning project for detecting malicious URLs using Random Forest and Logistic Regression classifiers. This project includes a REST API for real-time predictions and a live dashboard for visualizing model performance.
This is my machine learning project where I built a system to classify URLs as malicious or benign. I learned about:
- Feature engineering for URL analysis
- Machine learning classification models
- REST API development with Flask
- Data visualization with Plotly
- Model deployment and serving
- Binary Classification: Detects malicious URLs with 94% accuracy
- Feature Extraction: Extracts 40+ features from URLs using regex-based tokenization
- REST API: Real-time URL prediction with <120ms latency
- Live Dashboard: Interactive dashboard with Plotly visualizations
- Two Models: Random Forest and Logistic Regression for comparison
- Docker Support: Easy deployment with Docker
Malicious/
├── app.py # Main Flask application
├── train.py # Model training script
├── test_api.py # API testing script
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose setup
├── src/
│ ├── preprocessing/
│ │ └── feature_extractor.py # Feature extraction module
│ ├── models/
│ │ ├── model_trainer.py # Model training
│ │ └── model_predictor.py # Model prediction
│ └── dashboard/
│ └── dashboard.py # Dashboard visualization
├── data/
│ ├── raw/ # Raw dataset (if you have one)
│ └── processed/ # Processed data
└── models/ # Trained model files (generated)
- Python 3.8 or higher
- pip
- Clone the repository:
git clone <repository-url>
cd Malicious
- Create virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Train the model:
python train.py
- Run the application:
python app.py

The API will be available at http://localhost:5000
The easiest way to test URLs is through the web interface:
- Open: http://localhost:5000/test
- Enter a URL and click "Check URL"
curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{"url": "http://example.com", "model": "random_forest"}'
Response:
{
"url": "http://example.com",
"prediction": "benign",
"probability": 0.95,
"benign_probability": 0.95,
"malicious_probability": 0.05,
"model": "random_forest"
}
Batch prediction:
curl -X POST http://localhost:5000/api/predict/batch \
-H "Content-Type: application/json" \
-d '{"urls": ["http://example.com", "http://github.com"]}'

Access the live dashboard at: http://localhost:5000/dashboard
The dashboard shows:
- Real-time statistics
- Prediction distribution charts
- Response time histograms
- Prediction timeline
- GET / - API information
- GET /api/health - Health check
- POST /api/predict - Single URL prediction
- POST /api/predict/batch - Batch URL prediction
- GET /api/stats - Model statistics
- GET /dashboard - Performance dashboard
- GET /test - Web interface for testing URLs
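The single-prediction endpoint can also be called from Python. The following is a minimal client sketch using only the standard library. The function names `build_predict_request` and `predict` are hypothetical helpers, not part of this repository, and the base URL assumes the default local setup described above.

```python
import json
import urllib.request

# Assumed default address of the locally running Flask app.
BASE_URL = "http://localhost:5000"


def build_predict_request(url, model="random_forest", base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/predict."""
    payload = json.dumps({"url": url, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, model="random_forest", base_url=BASE_URL, timeout=5.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(url, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict("http://example.com")` should return a dictionary with the same fields as the sample response, such as `prediction` and `malicious_probability`. Only `random_forest` appears as a `model` value in the examples here, so any other value is a guess until checked against `app.py`.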
- Total Samples: 10,000
- Benign URLs: 5,000
- Malicious URLs: 5,000
- Train/Test Split: 80/20
- Training: 8,000 samples
- Testing: 2,000 samples
- Random Forest
  - 100 trees
  - Max depth: 20
  - Accuracy: ~94%
- Logistic Regression
  - Max iterations: 1000
  - Solver: lbfgs
  - Accuracy: ~94%
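The two configurations above map directly onto scikit-learn estimators. The sketch below shows one way to train and compare them with an 80/20 split. It is not the code in `train.py`: the function name `train_and_compare`, the stratified split, the fixed seed, and the synthetic placeholder data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_compare(X, y, seed=42):
    """Fit both classifiers on an 80/20 split and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        # Hyperparameters taken from the list above.
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=20, random_state=seed
        ),
        "logistic_regression": LogisticRegression(max_iter=1000, solver="lbfgs"),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic placeholder data (NOT the project's URL features):
# 1,000 samples, 5 numeric features, label from a simple linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

scores = train_and_compare(X, y)
print(scores)
```

On real URL features the numbers will differ from this toy run. Logistic Regression usually benefits from feature scaling, which this sketch omits.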
The model extracts 40+ features from URLs including:
- URL structure (length, domain, path, query)
- Special character counts
- TLD analysis
- Entropy calculations
- Suspicious keyword detection
- Pattern matching
- Tokenization features
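To make the feature categories above concrete, here is a small sketch that computes a handful of them with the standard library. It is an illustration only and does not reproduce `feature_extractor.py`: the function names, the feature names, and the keyword list are assumptions, and it covers far fewer than the 40+ features the project extracts.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; the project's list is not shown here.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(text).values()
    )


def extract_features(url):
    """Return a dict with a few structural and lexical URL features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Regex-based tokenization: split on any run of non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "query_length": len(parsed.query),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": int("@" in url),
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "host_is_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "entropy": shannon_entropy(url),
        "num_suspicious_keywords": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
        "num_tokens": len(tokens),
        "longest_token_length": max((len(t) for t in tokens), default=0),
    }
```

Calling `extract_features("http://example.com/login?id=1")` yields a flat dictionary that can be turned into a numeric row for a classifier once categorical fields such as `tld` are encoded.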
- Python 3.8+
- scikit-learn - Machine learning models
- Flask - Web framework
- Plotly - Data visualization
- pandas, numpy - Data processing
- Docker - Containerization
With Docker Compose:
docker-compose up --build
Or build and run the image directly:
docker build -t malicious-url-detector .
docker run -p 5000:5000 malicious-url-detector
- Model Accuracy: 94%
- Response Latency: <120ms average
- Features Extracted: 40+
- Training Time: ~1-2 minutes
- Use real malicious URL dataset
- Add more features
- Implement model retraining pipeline
- Add authentication to API
- Implement rate limiting
- Add logging and monitoring
MIT License
Daksh Patel