[FEATURE] Advanced Imputation Strategies for Clinical Data (beyond mean/min/max) #40

@kosuri-indu

Context:
In HealthBase.jl and OMOPCommonDataModel.jl, I previously developed an impute_missing function that fills missing values using basic statistics: mean, median, min, max, or mode. It works for simple examples, but it is too naive for real clinical use.

Why this matters:
These simple methods ignore the patient's background (such as age, gender, or health conditions). They can produce wrong or biased results, especially when values are not missing at random.

What we decided:
After a weekly review meeting with @TheCedarPrince, we decided to remove the current impute_missing function from HealthBase.jl. Going forward, we could build more advanced, context-aware methods (such as grouping by patient features) here.

Example:

| ID | Age | Gender | Race  | Cholesterol |
|----|-----|--------|-------|-------------|
| 1  | 29  | Male   | White | 180         |
| 2  | 35  | Female | Asian | 195         |
| 3  | 30  | Female | Asian | missing     |

Naively imputing the missing value with the global mean (mean([180, 195]) = 187.5) ignores clinical context. If we know that patient 3 is a 30-year-old Asian female, we can instead impute the average cholesterol of similar patients (for example, other Asian females). Here patient 2 is the only match, so we would impute 195 instead of 187.5. This context-aware imputation is more clinically meaningful and reduces the bias that a global statistic can introduce.
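
To make the comparison concrete, here is a minimal sketch of group-wise mean imputation on the three patients above. It uses only Julia's Statistics standard library. The helper name impute_group_mean, its signature, and the fallback to the global mean when a group has no observed values are all assumptions made for illustration; this is not the removed impute_missing function or a proposed final API. Grouping uses Gender and Race only, matching the "other Asian females" example, so Age is ignored here.

```julia
using Statistics  # standard library; provides `mean`

# Hypothetical helper for illustration only.
# rows   : vector of NamedTuples, one per patient
# target : column to fill
# by     : tuple of columns that define "similar patients"
function impute_group_mean(rows, target::Symbol, by)
    # Global mean of the observed values, used as a fallback (an assumption)
    # when a patient's group has no observed values at all.
    observed = [getproperty(r, target) for r in rows
                if !ismissing(getproperty(r, target))]
    fallback = isempty(observed) ? missing : mean(observed)

    # Collect the observed target values for each group key.
    groups = Dict{Any,Vector{Float64}}()
    for r in rows
        v = getproperty(r, target)
        ismissing(v) && continue
        key = map(c -> getproperty(r, c), by)
        push!(get!(groups, key, Float64[]), v)
    end

    # Return new rows; only rows with a missing target are changed.
    return map(rows) do r
        ismissing(getproperty(r, target)) || return r
        key = map(c -> getproperty(r, c), by)
        value = haskey(groups, key) ? mean(groups[key]) : fallback
        merge(r, NamedTuple{(target,)}((value,)))
    end
end

# The table from the example above.
patients = [
    (ID = 1, Age = 29, Gender = "Male",   Race = "White", Cholesterol = 180),
    (ID = 2, Age = 35, Gender = "Female", Race = "Asian", Cholesterol = 195),
    (ID = 3, Age = 30, Gender = "Female", Race = "Asian", Cholesterol = missing),
]

# Context-aware: average over patients with the same Gender and Race.
by_group = impute_group_mean(patients, :Cholesterol, (:Gender, :Race))

# Naive: an empty grouping puts everyone in one group, i.e. the global mean.
by_global = impute_group_mean(patients, :Cholesterol, ())

println(by_group[3].Cholesterol)   # 195.0
println(by_global[3].Cholesterol)  # 187.5
```

A group mean based on a single matching patient, as here, is fragile, so a real implementation would likely need a minimum group size or a coarser fallback grouping; that policy is left open in this sketch.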
