Conversation

@cornzyblack
Contributor

@cornzyblack cornzyblack commented Oct 15, 2025

Changes

Added new checks:

  • is_valid_json to check whether the values in the input column are valid JSON strings.
  • has_json_keys to check whether the values in the input column contain specific keys in the outermost JSON object.
  • has_valid_json_schema to check whether the values in the specified column, which contain JSON strings, conform to the expected schema. This check is not strict. Extra fields in the JSON that are not defined in the schema are ignored.
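The semantics of the first two checks can be sketched in plain Python. This is only an illustration of the intended pass/fail behavior, not the Spark-native DQX implementation; null handling and the exact function signatures are omitted:

```python
import json

def is_valid_json(value: str) -> bool:
    # Passes when the value parses as JSON at all.
    try:
        json.loads(value)
        return True
    except (TypeError, ValueError):
        return False

def has_json_keys(value: str, keys: list[str]) -> bool:
    # Passes when the outermost JSON object contains every expected key.
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return False
    return isinstance(obj, dict) and all(k in obj for k in keys)

print(is_valid_json('{"a": 1}'))                 # True
print(is_valid_json('not json'))                 # False
print(has_json_keys('{"a": 1, "b": 2}', ["a"]))  # True
print(has_json_keys('{"a": 1}', ["a", "b"]))     # False
```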

Linked issues

Resolves #595

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@mwojtyczka mwojtyczka requested a review from Copilot October 16, 2025 09:14
Contributor

Copilot AI left a comment


Pull Request Overview

Adds row-level JSON validation checks and integrates them into examples and tests.

  • Introduces is_valid_json and has_json_keys row checks.
  • Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/databricks/labs/dqx/check_funcs.py — Adds JSON validation check functions; core logic for the new checks.
  • tests/unit/test_build_rules.py — Extends metadata conversion tests to include the new JSON checks.
  • tests/integration/test_apply_checks.py — Adds col_json_str to test schemas and values; exercises the new checks in streaming and class-based tests.
  • tests/resources/all_row_checks.yaml — Includes the is_valid_json check in the "all row checks" YAML.
  • src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml — Adds examples for is_valid_json and has_json_keys.
  • docs/dqx/docs/reference/quality_checks.mdx — Documents the new checks and shows usage examples.
Comments suppressed due to low confidence (1)

docs/dqx/docs/reference/quality_checks.mdx:1

  • Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
---


@cornzyblack
Contributor Author

Updates
The current implementation of has_valid_json_schema is permissive rather than strict: row 4, for example, has an additional field c but still passes.

Making it strict would require UDFs, especially for nested JSON.

import pyspark.sql.functions as F

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule

# Run in a Databricks notebook, where `spark` and `display` are predefined.
ws = WorkspaceClient()
dq_engine = DQEngine(ws)

json_schema = "STRUCT<a: BIGINT, b: BIGINT>"

checks = [
    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_valid_json_schema,
        column="json",
        check_func_kwargs={"schema": json_schema},
    ),
]

df = spark.createDataFrame(
    [
        {"json": """{ "a" : 1 }"""},
        {"json": """{ "a" : 1, "b": 2}"""},
        {"json": """{ "a" : 1, "b": null}"""},
        {"json": """{ "a" : 1, "b": 1023455,  "c": null}"""},
    ]
)

checked_df = dq_engine.apply_checks(df, checks)
display(checked_df)

will result in

(screenshot: checked_df output showing all rows passing, including row 4 with the extra field c)
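To illustrate what a strict variant would change, here is a plain-Python sketch (the name conforms_strictly and the set-based field handling are assumptions for illustration, not the DQX implementation) that rejects row 4's extra field c:

```python
import json

EXPECTED_FIELDS = {"a", "b"}  # mirrors STRUCT<a: BIGINT, b: BIGINT>

def conforms_strictly(value: str) -> bool:
    # Strict variant: every key in the outermost JSON object must be
    # declared in the schema; an extra field such as "c" fails the check.
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return False
    return isinstance(obj, dict) and set(obj) <= EXPECTED_FIELDS

rows = [
    '{ "a" : 1 }',
    '{ "a" : 1, "b": 2}',
    '{ "a" : 1, "b": null}',
    '{ "a" : 1, "b": 1023455, "c": null}',
]
print([conforms_strictly(r) for r in rows])  # [True, True, True, False]
```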

@cornzyblack cornzyblack marked this pull request as ready for review November 13, 2025 12:03
@cornzyblack cornzyblack requested a review from a team as a code owner November 13, 2025 12:03
@cornzyblack cornzyblack requested review from pratikk-databricks and removed request for a team November 13, 2025 12:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.



@mwojtyczka mwojtyczka requested a review from Copilot December 1, 2025 18:27
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.




mwojtyczka and others added 3 commits December 7, 2025 21:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@mwojtyczka mwojtyczka left a comment


Mostly looking good, minor comments

for field in expected_schema.fields:
    field_ref = parsed_struct_col[field.name]
    if isinstance(field.dataType, types.StructType):
        validations += _generate_field_presence_checks(field.dataType, field_ref)

Please put some protection, to limit the recursion calls.
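One way to bound the recursion is a depth guard, sketched here in plain Python against a simplified dict-based schema representation (the MAX_NESTING_DEPTH name, the limit value, and the helper signature are assumptions, not the DQX API):

```python
MAX_NESTING_DEPTH = 10  # hypothetical limit; tune to the deepest expected schema

def collect_field_checks(schema_fields: dict, depth: int = 0) -> list:
    # schema_fields maps a field name to a nested mapping for struct
    # fields, or to None for leaf fields; recursion stops with an error
    # once nesting exceeds the configured limit.
    if depth > MAX_NESTING_DEPTH:
        raise ValueError(f"schema nested deeper than {MAX_NESTING_DEPTH} levels")
    checks = []
    for name, nested in schema_fields.items():
        checks.append(name)
        if nested:
            checks.extend(collect_field_checks(nested, depth + 1))
    return checks

print(collect_field_checks({"a": None, "b": {"c": None}}))  # ['a', 'b', 'c']
```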

['{"key": "value"}', 'Not a JSON string'],
['{"key": "value"}', None],
[None, '{"key": "value"}'],
['{"nested": {"inner_key": "inner_value"}}', '{"nested": {"inner_key": "inner_value"}}'],

Please also add a passing example with nested fields.

schema = "a: string, b: string"
test_data = spark.createDataFrame(
    [
        ['{"key": "value", "another_key": 123}', '{"key": "value"}'],

Please also add one passing example with a valid schema.

@mwojtyczka mwojtyczka changed the title Feat add json validation checks Added new checks for JSON validation Dec 7, 2025


Development

Successfully merging this pull request may close these issues.

[FEATURE]: JSON Validation Checks

3 participants