
Predicting lung cancer survival outcomes using clinical and lifestyle data. Includes preprocessing, feature engineering, and Random Forest classification with ~95% test accuracy. Built for healthcare risk stratification and early prognosis


SBanditaDas/Lung-Cancer-Survival-Prediction-Using-ML



Lung Cancer Survival Prediction

Overview

This project trains a classifier to predict whether a patient diagnosed with lung cancer is likely to survive, based on clinical and lifestyle data. It applies preprocessing, feature engineering, and Random Forest classification to a dataset of patient records.


Dataset Description

The dataset is sourced from /kaggle/input/lung-cancer-dataset/dataset_med.csv and includes detailed patient information.

Key columns:
  • id: Unique patient identifier
  • age, gender, country: Demographics
  • diagnosis_date, end_treatment_date: Cancer timeline
  • cancer_stage: Stage I–IV
  • family_history, smoking_status: Lifestyle and genetic risk
  • bmi, cholesterol_level: Clinical metrics
  • hypertension, asthma, cirrhosis, other_cancer: Comorbidities
  • treatment_type: Surgery, radiation, chemotherapy, or combined
  • survived: Target label (yes/no)

Workflow Summary

1. Data Loading

import pandas as pd
df = pd.read_csv('/kaggle/input/lung-cancer-dataset/dataset_med.csv')

2. Preprocessing

# Convert dates and calculate treatment duration
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['end_treatment_date'] = pd.to_datetime(df['end_treatment_date'])
df['treatment_duration'] = (df['end_treatment_date'] - df['diagnosis_date']).dt.days

# Encode categorical features
df_encoded = pd.get_dummies(df.drop(['id', 'diagnosis_date', 'end_treatment_date'], axis=1), drop_first=True)
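The preprocessing step can be tried on a few made-up rows before running it on the full file. The sketch below uses a hypothetical three-row frame (the values are illustrative and not taken from dataset_med.csv) and only a subset of the columns listed above. It also shows that pd.get_dummies leaves integer columns untouched, so the name of the target column after encoding depends on how survived is stored.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative, not from dataset_med.csv
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis_date": ["2020-01-10", "2021-03-05", "2019-11-20"],
    "end_treatment_date": ["2020-07-10", "2022-01-05", "2020-05-20"],
    "cancer_stage": ["Stage I", "Stage III", "Stage I"],
    "smoking_status": ["Never Smoked", "Current Smoker", "Former Smoker"],
    "bmi": [24.5, 31.2, 27.8],
    "survived": [1, 0, 1],  # assumed to be stored as integers in this sketch
})

# Parse the dates and compute treatment duration in days
toy["diagnosis_date"] = pd.to_datetime(toy["diagnosis_date"])
toy["end_treatment_date"] = pd.to_datetime(toy["end_treatment_date"])
toy["treatment_duration"] = (toy["end_treatment_date"] - toy["diagnosis_date"]).dt.days

# One-hot encode the string columns; drop_first removes one level per feature
toy_encoded = pd.get_dummies(
    toy.drop(["id", "diagnosis_date", "end_treatment_date"], axis=1),
    drop_first=True,
)

print(toy["treatment_duration"].tolist())
print(toy_encoded.columns.tolist())
```

If survived were stored as strings such as yes/no, it would be one-hot encoded like the other categorical columns and its column name would change accordingly, so it is worth checking df_encoded.columns before selecting the target.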

3. Model Training

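from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 'survived_1' assumes the encoded target carries this name; check df_encoded.columns
X = df_encoded.drop('survived_1', axis=1)
y = df_encoded['survived_1']
# The split ratio is not stated in this README; test_size=0.2 is an assumption
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)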

4. Evaluation

from sklearn.metrics import accuracy_score, classification_report
test_preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, test_preds))
print("Classification Report:\n", classification_report(y_test, test_preds))
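Accuracy and the per-class report answer different questions, which is why the evaluation step prints both. The sketch below uses ten hypothetical labels and predictions (not results from this project) to show how a high accuracy can coexist with weak recall on the less frequent class.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical labels for ten patients (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
recall_survived = recall_score(y_true, y_pred)  # recall for the positive class (1)

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("Recall (survived):", recall_survived)
print(classification_report(y_true, y_pred, zero_division=0))
```

In this toy case nine of ten predictions are correct, yet only one of the two survivors is identified, so the per-class rows of the report are the place to look when the classes are imbalanced.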

5. Prediction on New Patient

# Build input_df as a one-row DataFrame with the same columns, in the same order, as X_train
prediction = model.predict(input_df)
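One way to build input_df is to one-hot encode the new patient's raw values and then align the result to the training columns. This is a minimal sketch under assumptions: train_columns is a hypothetical stand-in for list(X_train.columns), and the patient values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for list(X_train.columns) after one-hot encoding
train_columns = [
    "age",
    "bmi",
    "treatment_duration",
    "gender_Male",
    "cancer_stage_Stage II",
    "cancer_stage_Stage III",
]

# Hypothetical raw values for one new patient
new_patient = pd.DataFrame([{
    "age": 58,
    "bmi": 26.4,
    "treatment_duration": 240,
    "gender": "Male",
    "cancer_stage": "Stage II",
}])

# Encode without drop_first: with a single row, drop_first would remove
# the only category present in each feature
encoded = pd.get_dummies(new_patient)

# Align to the training layout: dummy columns absent from this row are
# filled with 0, and any column not in the training layout is dropped
input_df = encoded.reindex(columns=train_columns, fill_value=0)

print(input_df)
```

With a fitted model, input_df can then be passed to model.predict as in the line above. Reindexing against the training columns keeps the feature order identical to what the model saw during fit.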

Performance Metrics

  • Training Accuracy: ~98%
  • Test Accuracy: ~95%
  • Balanced Precision & Recall: Strong performance across survival classes

Dependencies

numpy
pandas
scikit-learn

Author: Sushree Bandita Das

S_Bandita_Das sushree-bandita-das-160651309 SBanditaDas dasbanditasushree

