Description
Context:
In HealthBase.jl and OMOPCommonDataModel.jl, I had earlier developed an impute_missing function that filled in missing values using basic statistics: mean, median, min, max, or mode. This worked for simple examples, but it is too naive for real clinical use.
Why this matters:
These simple methods don’t consider the patient’s background (such as age, gender, or health conditions). Using them can lead to wrong or biased results, especially when the data is not missing at random.
What we decided:
After a weekly review meeting with @TheCedarPrince, we decided to remove the current impute_missing function from HealthBase.jl. Going forward, we could build more advanced, context-aware methods (like grouping by patient features) in this repository.
Example:
| ID | Age | Gender | Race | Cholesterol |
|---|---|---|---|---|
| 1 | 29 | Male | White | 180 |
| 2 | 35 | Female | Asian | 195 |
| 3 | 30 | Female | Asian | missing |
Naively imputing a missing value with the global mean (for example, mean([180, 195]) = 187.5) ignores clinical context. But if we know that patient 3 is a 30-year-old Asian female, we can impute using the average cholesterol value of similar patients (for example, other Asian females). In this case, patient 2 is the only match, so we would impute 195 instead of 187.5. This context-aware imputation is more clinically meaningful and reduces the bias that a global statistic can introduce.
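The example above can be sketched in a few lines of Julia. This is only an illustration of the idea, not a proposed API: the name `impute_by_group`, its signature, and the fallback to the global mean are assumptions made for this sketch. It uses plain named tuples and the `Statistics` standard library, and it assumes the grouping columns themselves have no missing values and that at least one value of the target column is observed.

```julia
using Statistics  # standard library; provides `mean`

# Toy rows mirroring the table above; `missing` marks the unknown value.
patients = [
    (id = 1, age = 29, gender = "Male",   race = "White", cholesterol = 180),
    (id = 2, age = 35, gender = "Female", race = "Asian", cholesterol = 195),
    (id = 3, age = 30, gender = "Female", race = "Asian", cholesterol = missing),
]

# Hypothetical helper (not part of HealthBase.jl): fill missing values of `col`
# with the mean over rows that share the same values in `groupcols`.
# Assumption: falls back to the global mean when no similar row has a value.
function impute_by_group(rows, col::Symbol, groupcols)
    observed = [r for r in rows if !ismissing(r[col])]
    global_mean = mean(r[col] for r in observed)
    return map(rows) do r
        ismissing(r[col]) || return r  # keep rows that already have a value
        # Observed values from rows matching `r` on every grouping column.
        peers = [p[col] for p in observed if all(p[g] == r[g] for g in groupcols)]
        value = isempty(peers) ? global_mean : mean(peers)
        merge(r, NamedTuple{(col,)}((value,)))  # replace only the `col` field
    end
end

grouped = impute_by_group(patients, :cholesterol, (:gender, :race))
naive   = impute_by_group(patients, :cholesterol, ())  # no grouping: global mean

println(grouped[3].cholesterol)  # 195.0
println(naive[3].cholesterol)    # 187.5
```

Passing an empty tuple of grouping columns reproduces the naive global-mean behavior, which makes the two strategies easy to compare on the same data. A real implementation would also need to decide how to handle very small groups, since a mean over a single matching patient (as here) is a fragile estimate.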