This project builds a machine learning system to predict whether a patient diagnosed with lung cancer is likely to survive, based on clinical and lifestyle data. It applies preprocessing, feature engineering, and a random forest classifier to a dataset of patient records.
The dataset is sourced from /kaggle/input/lung-cancer-dataset/dataset_med.csv and includes detailed patient information.
Key columns:
- id: Unique patient identifier
- age, gender, country: Demographics
- diagnosis_date, end_treatment_date: Cancer timeline
- cancer_stage: Stage I–IV
- family_history, smoking_status: Lifestyle and genetic risk
- bmi, cholesterol_level: Clinical metrics
- hypertension, asthma, cirrhosis, other_cancer: Comorbidities
- treatment_type: Surgery, radiation, chemotherapy, or combined
- survived: Target label (yes/no)
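As a quick sanity check on the columns listed above, a loader could confirm that the expected fields are present before any preprocessing. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names that are not part of the project, and the one-row CSV is made-up sample data rather than a row from the dataset.

```python
import io

import pandas as pd

# Column names copied from the key columns list; the constant name is hypothetical.
EXPECTED_COLUMNS = [
    "id", "age", "gender", "country",
    "diagnosis_date", "end_treatment_date", "cancer_stage",
    "family_history", "smoking_status", "bmi", "cholesterol_level",
    "hypertension", "asthma", "cirrhosis", "other_cancer",
    "treatment_type", "survived",
]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from df, in list order."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up sample with 'other_cancer' and 'survived' left out on purpose.
sample_csv = io.StringIO(
    "id,age,gender,country,diagnosis_date,end_treatment_date,cancer_stage,"
    "family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,"
    "cirrhosis,treatment_type\n"
    "1,64,Male,Sweden,2016-04-05,2017-09-10,Stage I,Yes,Passive Smoker,"
    "29.4,199,0,0,1,Chemotherapy\n"
)
sample = pd.read_csv(sample_csv)

# Lists the two columns that were deliberately omitted from the sample.
print(missing_columns(sample))
```

Running the same check on the real file before preprocessing would surface renamed or missing columns early, rather than as a KeyError further down.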
import pandas as pd

df = pd.read_csv('/kaggle/input/lung-cancer-dataset/dataset_med.csv')

# Convert dates and calculate treatment duration
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['end_treatment_date'] = pd.to_datetime(df['end_treatment_date'])
df['treatment_duration'] = (df['end_treatment_date'] - df['diagnosis_date']).dt.days
# Encode categorical features
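df_encoded = pd.get_dummies(df.drop(['id', 'diagnosis_date', 'end_treatment_date'], axis=1), drop_first=True)

# get_dummies names encoded columns '<column>_<value>' and leaves numeric columns
# unchanged, so the target column name depends on how 'survived' is stored
X = df_encoded.drop('survived_1', axis=1)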
y = df_encoded['survived_1']
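The preprocessing steps above can be tried on a small frame to see which column names they produce. This is a minimal sketch under stated assumptions: the three rows are made up, only a few of the dataset's columns are included, and the target is shown both as a yes/no string and as a 0/1 integer because the stored type determines the encoded name.

```python
import pandas as pd

# Made-up rows for illustration; values are not taken from the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["Male", "Female", "Male"],
    "diagnosis_date": ["2020-01-01", "2020-03-15", "2021-06-30"],
    "end_treatment_date": ["2020-07-01", "2021-03-15", "2021-12-30"],
    "survived": ["yes", "no", "yes"],
})

# Parse the date columns, then take the difference in whole days.
for col in ("diagnosis_date", "end_treatment_date"):
    toy[col] = pd.to_datetime(toy[col])
toy["treatment_duration"] = (toy["end_treatment_date"] - toy["diagnosis_date"]).dt.days

dropped = ["id", "diagnosis_date", "end_treatment_date"]

# String target: get_dummies encodes it and drop_first removes the first level.
encoded = pd.get_dummies(toy.drop(dropped, axis=1), drop_first=True)
print(sorted(encoded.columns))

# Integer target: get_dummies leaves numeric columns untouched.
numeric_target = toy.assign(survived=[1, 0, 1])
encoded_numeric = pd.get_dummies(numeric_target.drop(dropped, axis=1), drop_first=True)
print(sorted(encoded_numeric.columns))
```

On this toy frame the string target ends up as `survived_yes`, while the integer target keeps the name `survived`. Printing `df_encoded.columns` on the real data is a simple way to confirm which name to pass when separating `X` and `y`.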
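from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hold out a test set (the 80/20 split and seed are assumptions; the original split is not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

test_preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, test_preds))
print("Classification Report:\n", classification_report(y_test, test_preds))

# Create input_df using X_train.columns and fill values
prediction = model.predict(input_df)

Results:
- Training Accuracy: ~98%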
- Test Accuracy: ~95%
- Balanced Precision & Recall: Strong performance across survival classes
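To make the training-versus-test comparison and the single-patient prediction step concrete, here is a self-contained sketch. Everything in it is an assumption for illustration: the data is synthetic with random labels, so its accuracies say nothing about the figures reported above, the 80/20 split is a guess, and `new_patient` is a hypothetical partial record. Using `reindex` to align the input with `X_train.columns` is one possible way to build `input_df`, not necessarily the project's own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): three features and a random binary label.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.integers(30, 90, size=200),
    "bmi": rng.normal(27.0, 4.0, size=200),
    "gender_Male": rng.integers(0, 2, size=200),
})
y = pd.Series(rng.integers(0, 2, size=200), name="survived")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score both splits; a large gap between them is a sign of overfitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")

# Hypothetical partial record: reindex adds any missing training column
# (here 'gender_Male') filled with 0 and matches the training column order.
new_patient = {"age": 61, "bmi": 24.5}
input_df = pd.DataFrame([new_patient]).reindex(columns=X_train.columns, fill_value=0)
prediction = model.predict(input_df)
print(prediction)
```

Filling absent one-hot columns with 0 treats the patient as belonging to the dropped reference category for that feature, so a real input should set every encoded column explicitly where the value is known.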
Requirements:
- numpy
- pandas
- scikit-learn