929 feature extend is aggr check funcs #951
base: main
Conversation
…indowing functions in DQX. Parameter ordering was changed accidentally.
✅ 457/457 passed, 1 flaky, 41 skipped, 2h47m9s total
Running from acceptance #3304
Pull request overview
This PR extends the aggregate check functions (`is_aggr_*`) from supporting 5 basic aggregates to 20 curated functions using a hybrid "Curated + Custom" approach. The implementation adds support for statistical functions (`stddev`, `variance`, `median`, `mode`), percentile functions (`percentile`, `approx_percentile`), and cardinality functions (`count_distinct`, `approx_count_distinct`), while also enabling custom aggregates with runtime validation and clear error messages.
Key Changes:
- Added 15 new curated aggregate functions (total: 20) organized by category (statistical, cardinality, percentiles)
- Implemented custom aggregate support with UserWarning mechanism and runtime validation
- Added `aggr_params` parameter to all 4 `is_aggr_*` functions for parameterized aggregates
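The dispatch behind `aggr_params` can be sketched in plain Python. This is a hedged illustration, not DQX's actual code: `percentile_approx` and the `FUNCS` registry stand in for `pyspark.sql.functions` and the `getattr(F, aggr_type)` lookup used in `check_funcs.py`.

```python
def percentile_approx(col, percentage, accuracy=10000):
    # Stand-in for pyspark.sql.functions.percentile_approx: renders the
    # expression it would build, so the parameter forwarding is visible.
    return f"percentile_approx({col}, {percentage}, {accuracy})"

# Minimal registry standing in for getattr(F, aggr_type)
FUNCS = {"percentile_approx": percentile_approx}

def build_aggregate(col, aggr_type, aggr_params=None):
    # Look up the aggregate by name and forward any extra parameters
    func = FUNCS[aggr_type]
    return func(col, **aggr_params) if aggr_params else func(col)

expr = build_aggregate(
    "response_time_ms", "percentile_approx",
    {"percentage": 0.99, "accuracy": 10000},
)
```

The same shape lets curated and custom aggregates share one code path: the registry lookup fails loudly for an unknown name, and keyword forwarding covers any parameterized aggregate.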
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Core implementation: added CURATED_AGGR_FUNCTIONS set, aggr_params parameter, _build_aggregate_expression and _validate_aggregate_return_type helper functions, enhanced validation logic |
| tests/unit/test_row_checks.py | Updated test to verify warning behavior for invalid aggr_type instead of immediate error |
| tests/integration/test_dataset_checks.py | Added comprehensive integration tests for new aggregate functions including count_distinct, statistical functions, percentiles, and custom aggregates |
| docs/dqx/docs/reference/quality_checks.mdx | Added documentation for aggregate function types, categorization, and usage examples |
| docs/dqx/docs/guide/quality_checks_definition.mdx | Added practical use case examples for extended aggregates |
| demos/dqx_demo_library.py | Added 5 demo examples showcasing new aggregate functions in real-world scenarios |
Comments suppressed due to low confidence (2)
tests/integration/test_dataset_checks.py:1
- The documentation example contradicts the implementation. According to the code in check_funcs.py (lines 2385-2391), `count_distinct` cannot be used with `group_by` due to a Spark limitation. This example will fail at runtime with an `InvalidParameterError`. Either remove the `group_by` parameter or change `aggr_type` to `approx_count_distinct`.

```python
from collections.abc import Callable
```
docs/dqx/docs/guide/quality_checks_definition.mdx:1
- The admonition correctly documents the `count_distinct` limitation, but this contradicts the example at lines 195-198, which shows `count_distinct` being used with `group_by`. The example should be updated to match this documented limitation.
---
…entation. Docs updated with more user-friendly language.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #951      +/-   ##
==========================================
+ Coverage   90.07%   90.11%   +0.04%
==========================================
  Files          64       64
  Lines        6138     6174      +36
==========================================
+ Hits         5529     5564      +35
- Misses        609      610       +1
```
…nd quality_checks.mdx
1. Removed dead code which was "just in case"
2. Added a test for incorrect parameters
3. More permissive parameter passing to aggr functions
…lude_no_tables_matching
```python
other_params = {k: v for k, v in aggr_params.items() if k != "percentile"}

aggr_func = getattr(F, aggr_type)
return aggr_func(filtered_expr, pct, **other_params)
```
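Because the percentile value is passed positionally to the PySpark function while the remaining entries are forwarded as keywords, the snippet first splits `percentile` out of `aggr_params`. The split itself is plain dict filtering, shown here with the same key names the check accepts:

```python
# Sketch of the percentile split from the diff above, run on a literal dict
aggr_params = {"percentile": 0.99, "accuracy": 10000}

# The percentile goes to the positional slot; everything else stays keyword
pct = aggr_params["percentile"]
other_params = {k: v for k, v in aggr_params.items() if k != "percentile"}
```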
Passing an invalid named argument via `**other_params` will raise `TypeError`. We will need to handle this in the except block below.
```yaml
- criticality: warn
  check:
    function: is_aggr_not_greater_than
    arguments:
      column: response_time_ms
      aggr_type: approx_percentile
      aggr_params:
        percentile: 0.99
        accuracy: 10000
        invalid_param: -1
      limit: 5000
```
ghanse left a comment
Looks good overall. Left a few minor comments.
```python
try:
    aggr_func = getattr(F, aggr_type)
    if aggr_params:
        return aggr_func(filtered_expr, **aggr_params)
    return aggr_func(filtered_expr)
except AttributeError as exc:
    raise InvalidParameterError(
        f"Aggregate function '{aggr_type}' not found in pyspark.sql.functions. "
        f"Verify the function name is correct, or check if your Databricks Runtime version supports this function. "
        f"Some newer aggregate functions (e.g., mode, median) require DBR 15.4+ (Spark 3.5+). "
        f"See: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-functions-builtin-alpha"
    ) from exc
```
See comment above. We will need a block for `TypeError`.
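One way to satisfy this comment is an extra `except TypeError` branch that converts a bad keyword into the same style of parameter error. This is a hedged sketch of the shape of the fix, not the merged change; `ValueError` stands in for DQX's `InvalidParameterError` so the snippet runs standalone, and `stddev` is a stand-in aggregate, not the real PySpark function:

```python
def stddev(col, ddof=1):
    # Stand-in aggregate that accepts only the 'ddof' keyword (illustrative,
    # not the pyspark.sql.functions.stddev signature)
    return f"stddev({col}, ddof={ddof})"

def apply_aggregate(func, filtered_expr, aggr_params):
    # Forward aggr_params, converting a bad keyword (TypeError) into a
    # parameter error instead of letting a raw traceback escape.
    try:
        if aggr_params:
            return func(filtered_expr, **aggr_params)
        return func(filtered_expr)
    except TypeError as exc:
        raise ValueError(
            f"Invalid parameter(s) for aggregate function: {exc}"
        ) from exc

ok = apply_aggregate(stddev, "x", {"ddof": 0})
```

With this branch in place, the `invalid_param: -1` YAML example above would surface as a clear parameter error rather than a bare `TypeError`.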
```python
if aggr_type in WINDOW_INCOMPATIBLE_AGGREGATES:
    # Use two-stage aggregation: groupBy + join (instead of window functions)
    # This is required for aggregates like count_distinct that don't support window DISTINCT operations
    group_cols = [F.col(col) if isinstance(col, str) else col for col in group_by]
```
We can move `group_cols` outside of the current block and reuse it when we create the window expression.
```python
    A human-readable display name for the aggregate function. If no mapping exists,
    returns the capitalized function name.
    """
    return CURATED_AGGR_FUNCTIONS.get(aggr_type, aggr_type.capitalize())
```
Maybe instead of `aggr_type.capitalize()` we can indicate a non-curated aggregate function:

```python
return CURATED_AGGR_FUNCTIONS.get(aggr_type, f"Non-curated aggregate '{aggr_type}'")
```
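The suggested default can be demonstrated with a toy mapping. The two entries below are illustrative only; the real `CURATED_AGGR_FUNCTIONS` in `check_funcs.py` covers all 20 curated aggregates (and is modeled here as a dict of display names so that `.get` applies):

```python
# Illustrative subset; not the actual DQX mapping
CURATED_AGGR_FUNCTIONS = {"count": "Count", "stddev": "Standard deviation"}

def display_name(aggr_type):
    # Curated aggregates get their mapped display name; anything else is
    # flagged as non-curated rather than silently capitalized.
    return CURATED_AGGR_FUNCTIONS.get(aggr_type, f"Non-curated aggregate '{aggr_type}'")
```

Flagging the non-curated case makes error and report output distinguish "we chose this name" from "the user supplied a custom aggregate".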
Changes
Extended the is_aggr_* check functions from supporting 5 basic aggregates to 20 curated aggregate functions with a hybrid "Curated + Custom" approach.
Added aggr_params: dict[str, Any] to all 4 is_aggr_* functions for aggregates requiring parameters (e.g., percentile, approx_percentile).
Implemented two-stage aggregation (groupBy + join) for window-incompatible aggregates like count_distinct.
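The two-stage pattern can be illustrated without Spark: aggregate per group first, then join the result back onto every row. Rows are plain dicts in this sketch; in the actual implementation the same shape is produced with `groupBy(...).agg(...)` followed by a join, since `count_distinct` cannot run over a window:

```python
from collections import defaultdict

# Toy dataset: which distinct users appear per region?
rows = [
    {"region": "us", "user": "a"},
    {"region": "us", "user": "b"},
    {"region": "eu", "user": "a"},
]

# Stage 1: count_distinct(user) per group (the groupBy + agg step)
seen = defaultdict(set)
for r in rows:
    seen[r["region"]].add(r["user"])
per_group = {region: len(users) for region, users in seen.items()}

# Stage 2: join the per-group aggregate back onto each row
joined = [dict(r, user_count_distinct=per_group[r["region"]]) for r in rows]
```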
Linked issues
closes #933 and #929
Tests