@STEFANOVIVAS STEFANOVIVAS commented Nov 24, 2025

Changes

Added a new check function for outlier detection on numeric values. The check uses a statistical method called MAD (Median Absolute Deviation) to determine whether the specified column's values fall within the calculated limits. The lower limit is calculated as median - 3.5 * MAD and the upper limit as median + 3.5 * MAD. Values outside these limits are considered outliers.
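
For illustration, the limits described above can be sketched in plain Python. This is a minimal sketch, not the code in this PR: the helper names `mad_limits` and `find_outliers` and the sample data are hypothetical, and it runs on a Python list rather than a Spark column.

```python
from statistics import median


def mad_limits(values, threshold=3.5):
    """Return (lower, upper) limits as median -/+ threshold * MAD.

    Hypothetical helper for illustration only; not part of the library.
    """
    values = list(values)
    med = median(values)
    # MAD is the median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    return med - threshold * mad, med + threshold * mad


def find_outliers(values, threshold=3.5):
    """Return the values that fall outside the MAD-based limits."""
    values = list(values)
    lower, upper = mad_limits(values, threshold)
    return [v for v in values if v < lower or v > upper]


# Made-up sample: the median is 12 and the MAD is 1,
# so the limits are 12 - 3.5 and 12 + 3.5.
sample = [10, 12, 11, 13, 12, 11, 100]
lower, upper = mad_limits(sample)
outliers = find_outliers(sample)
print(lower, upper, outliers)
```

Because both the center and the spread are medians, a single extreme value such as 100 barely moves the limits, which is the usual reason to prefer MAD over mean and standard deviation for this kind of check.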

Linked issues

Resolves #359

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@STEFANOVIVAS STEFANOVIVAS requested a review from a team as a code owner November 24, 2025 15:39
@STEFANOVIVAS STEFANOVIVAS requested review from tombonfert and removed request for a team November 24, 2025 15:39
STEFANOVIVAS commented Nov 27, 2025

Is Tombonfert still a reviewer, @mwojtyczka? I saw in another PR that vb-dbrks removed him as a reviewer and assigned it to you.

@mwojtyczka mwojtyczka requested a review from Copilot December 1, 2025 14:13
@mwojtyczka

Hi @STEFANOVIVAS we are looking into this. Thanks for your PR

Copilot AI left a comment

Pull request overview

This PR adds outlier detection functionality for numerical columns using the Median Absolute Deviation (MAD) statistical method. The implementation validates that only numeric columns are used and identifies values outside the calculated bounds (median ± 3.5 * MAD) as outliers.

  • Implements has_no_outliers check function with MAD-based outlier detection
  • Adds comprehensive test coverage including edge cases for string columns
  • Integrates outlier detection into the rule building and serialization framework

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
src/databricks/labs/dqx/check_funcs.py Adds has_no_outliers function and _calculate_median_absolute_deviation helper with MAD-based outlier detection logic
tests/integration/test_dataset_checks.py Adds integration tests for outlier detection on numeric columns and validation that string columns are rejected
tests/unit/test_build_rules.py Updates unit tests to include outlier detection rules in serialization and deserialization test cases
tests/integration/test_build_rules.py Adds integration test case for outlier detection rule metadata

@mwojtyczka mwojtyczka left a comment

requested changes

@mwojtyczka mwojtyczka removed the request for review from tombonfert December 1, 2025 14:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


mwojtyczka and others added 5 commits December 7, 2025 13:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@mwojtyczka mwojtyczka left a comment

LGTM

@mwojtyczka mwojtyczka merged commit 7c2253b into databrickslabs:main Dec 7, 2025
14 checks passed

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Outlier detection for numerical values

2 participants