This project applies machine learning techniques to analyze road accidents in the US and predict their severity. It covers data preprocessing, natural language processing (NLP) on accident descriptions, and training a machine learning model to classify accident severity.
- `ML_Engineer_Task.ipynb`: Preprocesses raw accident data by handling missing values, feature engineering, and scaling. Outputs a processed dataset in `.csv` or `.pkl` format.
- `ML_Engineer_Task_Colab.ipynb`: Loads the preprocessed data and applies NLP techniques to accident descriptions. Trains a Random Forest Classifier with SMOTE and outputs a trained model (`Random_Forest_Classifier_SMOTE.pkl`).
The dataset contains the following attributes. The Nullable column indicates whether an attribute may have missing values.
| Attribute | Description | Nullable |
|---|---|---|
| ID | Unique identifier for each accident record. | No |
| Severity | Severity level of the accident (1 to 4), where 1 indicates minor impact and 4 indicates significant impact. | No |
| Start_Time | Start time of the accident in the local timezone. | No |
| End_Time | End time of the accident's impact on traffic flow. | No |
| Start_Lat | Latitude of the starting point of the accident. | No |
| Start_Lng | Longitude of the starting point of the accident. | No |
| End_Lat | Latitude of the ending point of the accident. | Yes |
| End_Lng | Longitude of the ending point of the accident. | Yes |
| Distance(mi) | Length of the road segment affected by the accident. | No |
| Description | Natural language description of the accident. | No |
| Number | Street number in the address field. | Yes |
| Street | Street name in the address field. | Yes |
| Side | Side of the street (Right/Left). | Yes |
| City | City name. | Yes |
| County | County name. | Yes |
| State | State name. | Yes |
| Zipcode | Zip code of the location. | Yes |
| Country | Country name (e.g., US). | Yes |
| Timezone | Timezone of the accident's location (e.g., Eastern, Central). | Yes |
| Airport_Code | Nearest airport-based weather station to the accident location. | Yes |
| Weather_Timestamp | Timestamp of the weather observation for the accident. | Yes |
| Temperature(F) | Temperature at the time of the accident (in Fahrenheit). | Yes |
| Wind_Chill(F) | Wind chill at the time of the accident (in Fahrenheit). | Yes |
| Humidity(%) | Humidity percentage at the time of the accident. | Yes |
| Pressure(in) | Atmospheric pressure at the time of the accident (in inches). | Yes |
| Visibility(mi) | Visibility at the time of the accident (in miles). | Yes |
| Wind_Direction | Wind direction at the time of the accident. | Yes |
| Wind_Speed(mph) | Wind speed at the time of the accident (in miles per hour). | Yes |
| Precipitation(in) | Precipitation amount at the time of the accident (in inches). | Yes |
| Weather_Condition | Weather conditions (e.g., Rain, Snow, Fog). | Yes |
| Amenity | Indicates the presence of nearby amenities. | No |
| Bump | Indicates the presence of nearby speed bumps. | No |
| Crossing | Indicates the presence of a nearby crossing. | No |
| Give_Way | Indicates the presence of a nearby "Give Way" sign. | No |
| Junction | Indicates the presence of a nearby junction. | No |
| No_Exit | Indicates the presence of a nearby "No Exit" sign. | No |
| Railway | Indicates the presence of nearby railways. | No |
| Roundabout | Indicates the presence of a nearby roundabout. | No |
| Station | Indicates the presence of a nearby station. | No |
| Stop | Indicates the presence of a nearby stop sign. | No |
| Traffic_Calming | Indicates the presence of nearby traffic-calming measures. | No |
| Traffic_Signal | Indicates the presence of nearby traffic signals. | No |
| Turning_Loop | Indicates the presence of a nearby turning loop. | No |
| Sunrise_Sunset | Period of the day (Day/Night) based on sunrise and sunset. | Yes |
| Civil_Twilight | Period of the day based on civil twilight. | Yes |
| Nautical_Twilight | Period of the day based on nautical twilight. | Yes |
| Astronomical_Twilight | Period of the day based on astronomical twilight. | Yes |
- Handles missing values using group-based imputations.
- Groups `Weather_Condition` categories into broader groups (e.g., `Rain`, `Snow`, `Fog`).
- Encodes categorical variables and scales numerical features.
- Cleans, tokenizes, and lemmatizes accident descriptions.
- Vectorizes text data using `TfidfVectorizer` for feature extraction.
- Uses a Random Forest Classifier.
- Employs SMOTE to handle imbalanced class distribution.
- Outputs a model file (`Random_Forest_Classifier_SMOTE.pkl`) for future use.
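To illustrate how SMOTE balances classes before training, here is a simplified sketch that uses only NumPy and scikit-learn. The `smote_oversample` helper is a hypothetical, hand-written version of the interpolation idea, not the notebook's code; in practice the `SMOTE` class from the `imbalanced-learn` package is the usual choice. The data and hyperparameters are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: with distinct points, each sample is
    # returned as its own nearest neighbour, so the first column is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic imbalanced training data: 200 samples of severity 2, 20 of severity 4.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(200, 4))
X_minor = rng.normal(2.0, 1.0, size=(20, 4))
X_train = np.vstack([X_major, X_minor])
y_train = np.array([2] * 200 + [4] * 20)

# Oversample the minority class until both classes have the same size.
X_new = smote_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_train, X_new])
y_bal = np.concatenate([y_train, np.full(len(X_new), 4)])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_bal, y_bal)
```

Oversampling should be applied to the training split only, so that validation scores reflect the real class distribution.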
- Clone the repository:
  `git clone https://github.com/your-username/your-repo-name.git`
- Open and run the notebook `ML_Engineer_Task.ipynb`, which:
  - Loads the raw dataset.
  - Cleans and preprocesses data.
  - Saves the output as `accident_data.pkl` or `accident_data.csv`.
- Open `ML_Engineer_Task_Colab.ipynb` in Google Colab or your preferred environment.
- Load the preprocessed file (`.pkl` or `.csv`).
- Run the notebook to:
  - Perform NLP on accident descriptions.
  - Train the model and save it as `Random_Forest_Classifier_SMOTE.pkl`.
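Since the preprocessed file can be either a `.pkl` or a `.csv`, a small loader that dispatches on the file extension may be convenient. The `load_processed` helper below is hypothetical, and the demo writes a two-row stand-in dataset to a temporary directory rather than using the real output.

```python
import os
import tempfile

import pandas as pd

def load_processed(path):
    """Load the preprocessed dataset from a .pkl or .csv file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pkl":
        return pd.read_pickle(path)
    if ext == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {ext}")

# Demo with a made-up stand-in for the processed dataset.
tmp_dir = tempfile.mkdtemp()
sample = pd.DataFrame({
    "Severity": [2, 3],
    "Description": ["Lane blocked", "Road closed"],
})
sample.to_pickle(os.path.join(tmp_dir, "accident_data.pkl"))
sample.to_csv(os.path.join(tmp_dir, "accident_data.csv"), index=False)

from_pkl = load_processed(os.path.join(tmp_dir, "accident_data.pkl"))
from_csv = load_processed(os.path.join(tmp_dir, "accident_data.csv"))
```

The pickle format preserves column dtypes exactly, while CSV is easier to inspect but re-infers dtypes on load.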
- Processed Dataset: `accident_data.pkl` or `accident_data.csv`
- Trained Model: `Random_Forest_Classifier_SMOTE.pkl`
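The trained model file can be reloaded later for prediction. The sketch below assumes the model was saved with Python's `pickle` module (the `.pkl` extension suggests this, but the README does not confirm it) and uses a small stand-in model trained on made-up data instead of the real one.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on made-up data with severity labels 1 to 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.array([1, 2, 3, 4] * 10)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save the model under the file name used by this project.
model_path = os.path.join(tempfile.mkdtemp(), "Random_Forest_Classifier_SMOTE.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload it and predict severity for new rows.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

predictions = loaded.predict(X[:5])
```

New inputs must pass through the same preprocessing and vectorization as the training data, and pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.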
Install the required Python libraries:
`pip install pandas scikit-learn nltk matplotlib seaborn`
- Validation Accuracy: ~96.9%
- Class Imbalance Handling: SMOTE improves performance on underrepresented severity classes.
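Because the severity classes are imbalanced, overall accuracy is worth reading alongside per-class metrics. The sketch below uses made-up labels (not this project's results) to show how a model that always predicts the majority class can still score high accuracy while missing every minority case.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 90 accidents of severity 2, 8 of severity 3, 2 of severity 4.
y_true = [2] * 90 + [3] * 8 + [4] * 2
# A degenerate model that always predicts the majority class.
y_pred = [2] * 100

accuracy = accuracy_score(y_true, y_pred)  # 0.9
# Recall per class, in the order given by `labels`: 1.0, 0.0, 0.0
per_class_recall = recall_score(y_true, y_pred, labels=[2, 3, 4], average=None)
```

Reporting recall (or a full `classification_report`) for each severity level makes the effect of SMOTE on the underrepresented classes visible.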
- Experiment with other ML algorithms like Gradient Boosting or neural networks.
- Integrate real-time accident data for dynamic predictions.
- Enhance feature engineering for improved accuracy.
Contributions are welcome! If you have ideas for improvement:
- Fork this repository.
- Create a new branch (`feature-branch-name`).
- Commit changes and push.
- Open a Pull Request.