Added new checks for JSON validation #616
Conversation
Pull Request Overview
Adds row-level JSON validation checks and integrates them into examples and tests.
- Introduces is_valid_json and has_json_keys row checks.
- Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Adds JSON validation/check functions; core logic for new checks. |
| tests/unit/test_build_rules.py | Extends metadata conversion tests to include new JSON checks. |
| tests/integration/test_apply_checks.py | Adds col_json_str to test schemas and values; exercises new checks in streaming and class-based tests. |
| tests/resources/all_row_checks.yaml | Includes is_valid_json check in the “all row checks” YAML. |
| src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml | Adds examples for is_valid_json and has_json_keys. |
| docs/dqx/docs/reference/quality_checks.mdx | Documents new checks and shows usage examples. |
Comments suppressed due to low confidence (1)
docs/dqx/docs/reference/quality_checks.mdx:1
- Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
mwojtyczka left a comment:
Mostly looking good, minor comments
```python
for field in expected_schema.fields:
    field_ref = parsed_struct_col[field.name]
    if isinstance(field.dataType, types.StructType):
        validations += _generate_field_presence_checks(field.dataType, field_ref)
```
Please add some protection to limit the recursion depth.
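The reviewer's suggestion could be sketched as follows. This is a hypothetical, simplified illustration in plain Python (plain dicts instead of Spark `StructType`/`Column` objects); the names mirror the snippet above, and the depth limit of 32 is an assumed value, not something from this PR:

```python
# Hypothetical sketch of a recursion-depth guard for the field-presence walk.
MAX_NESTING_DEPTH = 32  # assumed limit; the real cutoff is a design choice


def generate_field_presence_checks(expected_schema, parsed, depth=0):
    """Walk an expected schema (dict of name -> type or nested dict) and
    record whether each leaf field is present in the parsed value."""
    if depth > MAX_NESTING_DEPTH:
        raise ValueError(
            f"Expected schema is nested deeper than {MAX_NESTING_DEPTH} levels"
        )
    validations = []
    for name, field_type in expected_schema.items():
        if isinstance(field_type, dict):  # nested struct: recurse with a deeper depth
            validations += generate_field_presence_checks(
                field_type, parsed.get(name, {}), depth + 1
            )
        else:
            validations.append((name, name in parsed))
    return validations
```

Raising early on pathological nesting keeps a malformed or adversarial expected schema from exhausting the Python stack.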
```python
['{"key": "value"}', 'Not a JSON string'],
['{"key": "value"}', None],
[None, '{"key": "value"}'],
['{"nested": {"inner_key": "inner_value"}}', '{"nested": {"inner_key": "inner_value"}}'],
```
Please also add passing example with nested fields
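A passing nested-field row, as requested, might look like this. The key names are purely illustrative, not taken from the PR:

```python
import json

# Hypothetical passing row for a nested-key check: both columns contain JSON
# whose nested object includes the key under check ("inner_key" is illustrative).
passing_row = [
    '{"nested": {"inner_key": "inner_value"}}',
    '{"nested": {"inner_key": "other_value"}}',
]

for value in passing_row:
    parsed = json.loads(value)
    assert "inner_key" in parsed["nested"]
```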
```python
schema = "a: string, b: string"
test_data = spark.createDataFrame(
    [
        ['{"key": "value", "another_key": 123}', '{"key": "value"}'],
```
Please also add one passing example with valid schema
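A passing row for the schema check might pair values that both satisfy the declared keys. This is a hypothetical sketch; the expected schema here is assumed rather than taken from the PR:

```python
import json

# Hypothetical passing row for has_valid_json_schema: both columns hold JSON
# matching an assumed expected schema requiring a string field "key".
# The extra "another_key" field would be ignored, since the check is non-strict.
passing_row = ['{"key": "value", "another_key": 123}', '{"key": "value"}']

for value in passing_row:
    parsed = json.loads(value)
    assert isinstance(parsed["key"], str)
```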

Changes
Added new checks:
- `is_valid_json` to check whether the values in the input column are valid JSON strings.
- `has_json_keys` to check whether the values in the input column contain specific keys in the outermost JSON object.
- `has_valid_json_schema` to check whether the values in the specified column, which contain JSON strings, conform to the expected schema. This check is not strict: extra fields in the JSON that are not defined in the schema are ignored.

Linked issues
Resolves #595
Tests
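As a rough, library-independent sketch of what the first two checks assert per row (Python stdlib only; the actual DQX implementations operate on Spark columns, and these function bodies are illustrative, not the PR's code):

```python
import json


def is_valid_json(value):
    """Rough semantics: the value parses as JSON."""
    try:
        json.loads(value)
        return True
    except (TypeError, ValueError):
        return False


def has_json_keys(value, keys):
    """Rough semantics: the outermost JSON object contains all given keys."""
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return False
    return isinstance(obj, dict) and all(k in obj for k in keys)
```

For example, `has_json_keys('{"key": "value"}', ["missing"])` would flag the row, while a row whose outermost object contains every requested key passes.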