929 feature extend is aggr check funcs #951
base: main
Conversation
…indowing functions in DQX. Parameter ordering was changed accidentally.
✅ 457/457 passed, 1 flaky, 41 skipped, 2h47m9s total
Running from acceptance #3304
Pull request overview
This PR extends the aggregate check functions (`is_aggr_*`) from supporting 5 basic aggregates to 20 curated functions using a hybrid "Curated + Custom" approach. The implementation adds support for statistical functions (`stddev`, `variance`, `median`, `mode`), percentile functions (`percentile`, `approx_percentile`), and cardinality functions (`count_distinct`, `approx_count_distinct`), while also enabling custom aggregates with runtime validation and clear error messages.
Key Changes:
- Added 15 new curated aggregate functions (total: 20) organized by category (statistical, cardinality, percentiles)
- Implemented custom aggregate support with UserWarning mechanism and runtime validation
- Added `aggr_params` parameter to all 4 `is_aggr_*` functions for parameterized aggregates
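The dispatch behind `aggr_params` can be sketched in plain Python. This is a hedged illustration, not DQX's actual code: `percentile_approx` and the `FUNCS` registry stand in for `pyspark.sql.functions` and the `getattr(F, aggr_type)` lookup used in `check_funcs.py`.

```python
def percentile_approx(col, percentage, accuracy=10000):
    # Stand-in for pyspark.sql.functions.percentile_approx: renders the
    # expression it would build, so the parameter forwarding is visible.
    return f"percentile_approx({col}, {percentage}, {accuracy})"

# Minimal registry standing in for getattr(F, aggr_type)
FUNCS = {"percentile_approx": percentile_approx}

def build_aggregate(col, aggr_type, aggr_params=None):
    # Look up the aggregate by name and forward any extra parameters
    func = FUNCS[aggr_type]
    return func(col, **aggr_params) if aggr_params else func(col)

expr = build_aggregate(
    "response_time_ms", "percentile_approx",
    {"percentage": 0.99, "accuracy": 10000},
)
```

The same shape lets curated and custom aggregates share one code path: the registry lookup fails loudly for an unknown name, and keyword forwarding covers any parameterized aggregate.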
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Core implementation: added CURATED_AGGR_FUNCTIONS set, aggr_params parameter, _build_aggregate_expression and _validate_aggregate_return_type helper functions, enhanced validation logic |
| tests/unit/test_row_checks.py | Updated test to verify warning behavior for invalid aggr_type instead of immediate error |
| tests/integration/test_dataset_checks.py | Added comprehensive integration tests for new aggregate functions including count_distinct, statistical functions, percentiles, and custom aggregates |
| docs/dqx/docs/reference/quality_checks.mdx | Added documentation for aggregate function types, categorization, and usage examples |
| docs/dqx/docs/guide/quality_checks_definition.mdx | Added practical use case examples for extended aggregates |
| demos/dqx_demo_library.py | Added 5 demo examples showcasing new aggregate functions in real-world scenarios |
Comments suppressed due to low confidence (2)
tests/integration/test_dataset_checks.py:1
- The documentation example contradicts the implementation. According to the code in check_funcs.py (lines 2385-2391), `count_distinct` cannot be used with `group_by` due to a Spark limitation. This example will fail at runtime with an `InvalidParameterError`. Either remove the `group_by` parameter or change `aggr_type` to `approx_count_distinct`.

```python
from collections.abc import Callable
```
docs/dqx/docs/guide/quality_checks_definition.mdx:1
- The admonition correctly documents the `count_distinct` limitation, but this contradicts the example at lines 195-198, which shows `count_distinct` being used with `group_by`. The example should be updated to match this documented limitation.
---
…entation. Docs updated with more user-friendly language.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #951      +/-   ##
==========================================
+ Coverage   90.07%   90.11%   +0.04%
==========================================
  Files          64       64
  Lines        6138     6174      +36
==========================================
+ Hits         5529     5564      +35
- Misses        609      610       +1
```
…nd quality_checks.mdx
1. Removed dead code which was "just in case"
2. Added a test for incorrect parameters
3. More permissive parameter passing to aggr functions
…lude_no_tables_matching
```python
other_params = {k: v for k, v in aggr_params.items() if k != "percentile"}

aggr_func = getattr(F, aggr_type)
return aggr_func(filtered_expr, pct, **other_params)
```
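Because the percentile value is passed positionally to the PySpark function while the remaining entries are forwarded as keywords, the snippet first splits `percentile` out of `aggr_params`. The split itself is plain dict filtering, shown here with the same key names the check accepts:

```python
# Sketch of the percentile split from the diff above, run on a literal dict
aggr_params = {"percentile": 0.99, "accuracy": 10000}

# The percentile goes to the positional slot; everything else stays keyword
pct = aggr_params["percentile"]
other_params = {k: v for k, v in aggr_params.items() if k != "percentile"}
```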
Passing an invalid named argument via `**other_params` will raise `TypeError`. We will need to handle this in the except block below.
```yaml
- criticality: warn
  check:
    function: is_aggr_not_greater_than
    arguments:
      column: response_time_ms
      aggr_type: approx_percentile
      aggr_params:
        percentile: 0.99
        accuracy: 10000
        invalid_param: -1
      limit: 5000
```
ghanse left a comment
Looks good overall. Left a few minor comments.
```python
try:
    aggr_func = getattr(F, aggr_type)
    if aggr_params:
        return aggr_func(filtered_expr, **aggr_params)
    return aggr_func(filtered_expr)
except AttributeError as exc:
    raise InvalidParameterError(
        f"Aggregate function '{aggr_type}' not found in pyspark.sql.functions. "
        f"Verify the function name is correct, or check if your Databricks Runtime version supports this function. "
        f"Some newer aggregate functions (e.g., mode, median) require DBR 15.4+ (Spark 3.5+). "
        f"See: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-functions-builtin-alpha"
    ) from exc
```
See comment above. We will need a block for `TypeError`.
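One way to satisfy this comment is an extra `except TypeError` branch that converts a bad keyword into the same style of parameter error. This is a hedged sketch of the shape of the fix, not the merged change; `ValueError` stands in for DQX's `InvalidParameterError` so the snippet runs standalone, and `stddev` is a stand-in aggregate, not the real PySpark function:

```python
def stddev(col, ddof=1):
    # Stand-in aggregate that accepts only the 'ddof' keyword (illustrative,
    # not the pyspark.sql.functions.stddev signature)
    return f"stddev({col}, ddof={ddof})"

def apply_aggregate(func, filtered_expr, aggr_params):
    # Forward aggr_params, converting a bad keyword (TypeError) into a
    # parameter error instead of letting a raw traceback escape.
    try:
        if aggr_params:
            return func(filtered_expr, **aggr_params)
        return func(filtered_expr)
    except TypeError as exc:
        raise ValueError(
            f"Invalid parameter(s) for aggregate function: {exc}"
        ) from exc

ok = apply_aggregate(stddev, "x", {"ddof": 0})
```

With this branch in place, the `invalid_param: -1` YAML example above would surface as a clear parameter error rather than a bare `TypeError`.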
```python
if aggr_type in WINDOW_INCOMPATIBLE_AGGREGATES:
    # Use two-stage aggregation: groupBy + join (instead of window functions)
    # This is required for aggregates like count_distinct that don't support window DISTINCT operations
    group_cols = [F.col(col) if isinstance(col, str) else col for col in group_by]
```
We can move `group_cols` outside of the current block and reuse it when we create the window expression.
```python
    A human-readable display name for the aggregate function. If no mapping exists,
    returns the capitalized function name.
    """
    return CURATED_AGGR_FUNCTIONS.get(aggr_type, aggr_type.capitalize())
```
Maybe instead of `aggr_type.capitalize()` we can indicate a non-curated aggregate function:

```python
return CURATED_AGGR_FUNCTIONS.get(aggr_type, f"Non-curated aggregate '{aggr_type}'")
```
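The suggested default can be demonstrated with a toy mapping. The two entries below are illustrative only; the real `CURATED_AGGR_FUNCTIONS` in `check_funcs.py` covers all 20 curated aggregates (and is modeled here as a dict of display names so that `.get` applies):

```python
# Illustrative subset; not the actual DQX mapping
CURATED_AGGR_FUNCTIONS = {"count": "Count", "stddev": "Standard deviation"}

def display_name(aggr_type):
    # Curated aggregates get their mapped display name; anything else is
    # flagged as non-curated rather than silently capitalized.
    return CURATED_AGGR_FUNCTIONS.get(aggr_type, f"Non-curated aggregate '{aggr_type}'")
```

Flagging the non-curated case makes error and report output distinguish "we chose this name" from "the user supplied a custom aggregate".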
Changes
Extended the is_aggr_* check functions from supporting 5 basic aggregates to 20 curated aggregate functions with a hybrid "Curated + Custom" approach.
Added aggr_params: dict[str, Any] to all 4 is_aggr_* functions for aggregates requiring parameters (e.g., percentile, approx_percentile).
Implemented two-stage aggregation (groupBy + join) for window-incompatible aggregates like count_distinct.
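The two-stage pattern can be illustrated without Spark: aggregate per group first, then join the result back onto every row. Rows are plain dicts in this sketch; in the actual implementation the same shape is produced with `groupBy(...).agg(...)` followed by a join, since `count_distinct` cannot run over a window:

```python
from collections import defaultdict

# Toy dataset: which distinct users appear per region?
rows = [
    {"region": "us", "user": "a"},
    {"region": "us", "user": "b"},
    {"region": "eu", "user": "a"},
]

# Stage 1: count_distinct(user) per group (the groupBy + agg step)
seen = defaultdict(set)
for r in rows:
    seen[r["region"]].add(r["user"])
per_group = {region: len(users) for region, users in seen.items()}

# Stage 2: join the per-group aggregate back onto each row
joined = [dict(r, user_count_distinct=per_group[r["region"]]) for r in rows]
```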
Linked issues
closes #933 and #929
Tests