Outlier detection numerical values #944
…gorithm (Median absolute deviation).
Is Tombonfert still a reviewer, @mwojtyczka? I saw in another PR that vb-dbrks removed him as a reviewer and assigned it to you.
Hi @STEFANOVIVAS we are looking into this. Thanks for your PR
Pull request overview
This PR adds outlier detection functionality for numerical columns using the Median Absolute Deviation (MAD) statistical method. The implementation validates that only numeric columns are used and identifies values outside the calculated bounds (median ± 3.5 * MAD) as outliers.
- Implements has_no_outliers check function with MAD-based outlier detection
- Adds comprehensive test coverage including edge cases for string columns
- Integrates outlier detection into the rule building and serialization framework
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Adds has_no_outliers function and _calculate_median_absolute_deviation helper with MAD-based outlier detection logic |
| tests/integration/test_dataset_checks.py | Adds integration tests for outlier detection on numeric columns and validation that string columns are rejected |
| tests/unit/test_build_rules.py | Updates unit tests to include outlier detection rules in serialization and deserialization test cases |
| tests/integration/test_build_rules.py | Adds integration test case for outlier detection rule metadata |
mwojtyczka
left a comment
mwojtyczka
left a comment
requested changes
…,long_numeric_types,decimal_numeric_types,empty_dataframe,row_filter,none_median
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
mwojtyczka
left a comment
LGTM
Changes
Added a new check function for outlier detection on numeric values. The check uses a statistical method called MAD (Median Absolute Deviation) to determine whether the specified column's values fall within calculated limits. The lower limit is median - 3.5 * MAD and the upper limit is median + 3.5 * MAD. Values outside these limits are considered outliers.
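For reviewers who want to sanity-check the limits by hand, here is a minimal plain-Python sketch of the rule described above. It is not the code from this PR (which operates on Spark columns in check_funcs.py); the names `mad_bounds` and `find_outliers` are hypothetical, the sample data is made up, and treating values exactly on a limit as non-outliers is an assumption. It uses the raw MAD, as in the description, with no additional scaling constant.

```python
from statistics import median


def mad_bounds(values, threshold=3.5):
    """Return (lower, upper) limits as median -/+ threshold * MAD.

    Hypothetical helper for illustration only; not part of DQX.
    """
    med = median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    return med - threshold * mad, med + threshold * mad


def find_outliers(values, threshold=3.5):
    """Return the values that fall strictly outside the MAD limits."""
    lower, upper = mad_bounds(values, threshold)
    # Assumption: values exactly on a limit are not flagged.
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 11, 13, 12, 11, 100]
print(mad_bounds(sample))     # (8.5, 15.5)
print(find_outliers(sample))  # [100]
```

In this sample the median is 12 and the MAD is 1, so the limits are 8.5 and 15.5 and only 100 is flagged. Note that when more than half of the values are identical the MAD is 0 and both limits collapse onto the median, so every differing value would be flagged by this sketch.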
Linked issues
Resolves #359
Tests