Successfully developed a crop yield regression system using advanced machine learning algorithms in R. This project aims to predict crop yields (in tons per hectare) using agro-environmental, soil, and weather data. The solution includes a fully automated pipeline for data preprocessing, model training, and performance evaluation—tailored for practical decision-making in agriculture.
-
Exploratory Data Analysis (EDA):
Comprehensive visualizations usingggplot2,corrplot, andDataExplorerfor pattern recognition and outlier detection. -
Feature Engineering:
- One-hot encoding for categorical variables
- Handling missing values and constant columns
- Ensured numerical format compatibility for ML models
-
Feature Scaling:
Min-max normalization with protection against division-by-zero for constant features. -
Model Building:
Implemented and evaluated 20+ regression models, including:- Linear Regression, Ridge, Lasso
- Decision Tree, Random Forest, Bagging
- Gradient Boosting, XGBoost, LightGBM, CatBoost
- K-Nearest Neighbors (KNN), Support Vector Machines (SVM)
- Bayesian Regression, Quantile Regression
- Neural Networks (
nnet,neuralnet)
-
Hyperparameter Tuning:
Optimized models usingcaret,xgboost::xgb.cv,lgb.cv, andrandomForest::tuneRF. -
Performance Metrics:
RMSE, MAE, MAPE, R², Adjusted R², and RMSLE. -
Model Comparison:
Side-by-side metric comparison to select the best-performing algorithm. -
Interpretability:
- Variable importance via
varImp,xgb.importance,plotnet,olden,garson - Uncertainty estimation using Quantile Regression
- Variable importance via
-
Top-performing models included:
- ⭐ Linear Regression (R² = 0.9132)
- ⭐ Tuned GAM / Generalized Additive Model (R² = 0.9128)
- ⭐ Gaussian GLM (R² = 0.9128)
-
Other strong performers:
Tuned CatBoost, XGBoost, LightGBM, and Neural Networks also achieved R² > 0.91 after tuning. -
Underperformers:
Simpler models like Decision Trees, Random Forests, and KNN lagged behind.
Bayesian Regression, Lasso, Ridge, and Elastic Net showed poor generalization.
- Language: R
- Libraries:
caret,xgboost,lightgbm,catboost,randomForest,rpart,neuralnet,nnet,Metrics,quantreg,ggplot2,dplyr,glmnet,e1071,DataExplorer,mlbench,MASS,corrplot,smotefamily
- Government and NGOs for agricultural planning and subsidy allocation
- Agribusiness firms for forecasting crop production and optimizing supply chain logistics
- Farm-level decision support systems for precision agriculture
- Deploy via Shiny dashboard for interactive yield prediction tools
- Incorporate remote sensing and real-time weather data
- Build a REST API for integration into farm management systems