-
Notifications
You must be signed in to change notification settings - Fork 129
feat: schema diffing #1346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: schema diffing #1346
Conversation
654aa32 to
3af7c20
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1346 +/- ##
==========================================
+ Coverage 84.96% 85.73% +0.77%
==========================================
Files 114 120 +6
Lines 29445 33039 +3594
Branches 29445 33039 +3594
==========================================
+ Hits 25017 28325 +3308
- Misses 3220 3430 +210
- Partials 1208 1284 +76 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
108283f to
ed148a5
Compare
OussamaSaoudi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some small cleanup. Looks correct to me. Thank you!
Replace string-based field paths with ColumnName type throughout the schema diffing framework for better consistency with the codebase. Changes: - Updated FieldChange, FieldUpdate, and FieldWithPath to use ColumnName - Updated SchemaDiffError to use ColumnName in error messages - Refactored path construction to use ColumnName::new() and join() - Replaced string prefix matching with proper ColumnName comparison - Updated all test assertions to use ColumnName All tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed unwrap() to expect() with a descriptive message to satisfy clippy linter requirements while maintaining safety guarantees. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add #![allow(dead_code)] to the schema diff module since this API is not yet consumed by other parts of the codebase. The functions are fully tested and ready for use in schema evolution features. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Updated schema diffing logic to classify field removals as safe operations rather than breaking changes. This aligns with the principle that dropping columns is safe - existing data files remain valid and queries referencing removed fields will fail at query time, but data integrity is maintained. Changes: - Modified compute_has_breaking_changes() to not flag field removals - Updated 5 test assertions that previously expected removals to be breaking - All 32 schema diff tests pass Rationale: Field removals should not break existing readers since they can safely ignore unknown fields in data files. Query-time failures for removed fields are acceptable as they reflect the intentional schema change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Performance optimizations: - Add ColumnName::parent() method for O(1) parent path access - Remove ColumnName::is_descendant_of() (no longer needed) - Optimize has_added_ancestor() from O(N*M) to O(M) using parent chain - Add early return in classify_data_type_change() for identical types Code quality improvements: - Add check_field_nullability_change() helper for better code reuse - Refactor physical name validation into validate_physical_name() - Simplify nested match statements with helper functions Test coverage: - Add Case 6: Field nullability tightening in deeply nested structures - Add Case 7: Container nullability tightening in deeply nested structures All 32 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2a065a3 to
f6226f5
Compare
This refactoring simplifies the API by replacing: - `change_type: FieldChangeType` with `change_types: Vec<FieldChangeType>` - Removes the `FieldChangeType::Multiple` variant Benefits: - Reduces indirection - changes can be accessed directly as a Vec - Cleaner API - no need to pattern match on Multiple vs other variants - More consistent with the semantics of tracking multiple concurrent changes Updated `classify_data_type_change()` to return `Vec<FieldChangeType>` directly instead of wrapping in Option<FieldChangeType::Multiple>. All 32 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Introduces core data structures for schema diffing: - SchemaDiff, FieldChange, FieldUpdate types - FieldChangeType enum for classifying changes - SchemaDiffError for validation errors - ColumnName::parent() helper method This is part 1/5 of the schema diffing feature implementation. The actual diffing algorithm will be added in PR 2. Note: This PR includes a temporary stub for compute_schema_diff() to allow basic tests to compile. The full implementation from the original PR delta-io#1346 will be copied exactly in PR 2. Related to delta-io#1346
Adds complete diffing functionality for non-nested schemas: - Field collection and ID-based matching - Detection of adds, removes, renames, nullability changes - Physical name validation for column mapping - Breaking change classification - Full type classification including arrays and maps - Ancestor filtering for LCA reporting Currently supports flat schemas (top-level fields only). The collect_all_fields_with_paths() function has a commented-out recursive call that will be enabled in PR 3 to support nested fields. All other functions are copied exactly from the original PR delta-io#1346 (murali-db/schema-evol) with no logic changes. This is part 2/5 of the schema diffing feature implementation. Tests included (9 tests): - test_identical_schemas - test_change_count - test_top_level_added_field - test_added_required_field_is_breaking - test_added_nullable_field_is_not_breaking - test_physical_name_validation - test_multiple_change_types - test_multiple_with_breaking_change - test_duplicate_field_id_error Related to delta-io#1346
Adds complete diffing functionality for non-nested schemas: - Field collection and ID-based matching - Detection of adds, removes, renames, nullability changes - Physical name validation for column mapping - Breaking change classification - Full type classification including arrays and maps - Ancestor filtering for LCA reporting Currently supports flat schemas (top-level fields only). The collect_all_fields_with_paths() function has a commented-out recursive call that will be enabled in PR 3 to support nested fields. All other functions are copied exactly from the original PR delta-io#1346 (murali-db/schema-evol) with no logic changes. This is part 2/5 of the schema diffing feature implementation. Tests included (9 tests): - test_identical_schemas - test_change_count - test_top_level_added_field - test_added_required_field_is_breaking - test_added_nullable_field_is_not_breaking - test_physical_name_validation - test_multiple_change_types - test_multiple_with_breaking_change - test_duplicate_field_id_error Related to delta-io#1346
Extends schema diffing to handle nested structures: - Recursive field collection through struct hierarchies - Ancestor filtering to report only LCA changes - Type change classification for struct containers - Support for arbitrarily deep nesting Arrays and maps support coming in follow-up PRs. This is part 3/5 of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Adds 8 comprehensive array tests: - Array element struct field changes - Doubly nested array type changes - Nested array nullability (loosened/tightened) - Inner array nullability (loosened/tightened) - Simple array nullability (loosened/tightened) All array implementation code was already present from PR 2. This PR only adds tests to verify array functionality. This is part 4/5 of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Completes the schema diffing implementation with comprehensive map support: - Map key and value type tracking and comparison - Map value nullability change detection (loosened/tightened) - Struct keys and values within maps - Complex deeply nested scenarios mixing structs, arrays, and maps - Comprehensive edge case testing Test additions: - test_map_value_struct_field_changes: Tests field changes within map values - test_map_nullability_loosened: Tests map value nullability relaxation (safe) - test_map_nullability_tightened: Tests map value nullability restriction (breaking) - test_map_with_struct_key: Tests maps with struct keys - test_cursed_deeply_nested_complex_types: Tests extremely nested combinations This is the final part (5/5) of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Adds complete diffing functionality for non-nested schemas: - Field collection and ID-based matching - Detection of adds, removes, renames, nullability changes - Physical name validation for column mapping - Breaking change classification - Full type classification including arrays and maps - Ancestor filtering for LCA reporting Currently supports flat schemas (top-level fields only). The collect_all_fields_with_paths() function has a commented-out recursive call that will be enabled in PR 3 to support nested fields. All other functions are copied exactly from the original PR delta-io#1346 (murali-db/schema-evol) with no logic changes. This is part 2/5 of the schema diffing feature implementation. Tests included (9 tests): - test_identical_schemas - test_change_count - test_top_level_added_field - test_added_required_field_is_breaking - test_added_nullable_field_is_not_breaking - test_physical_name_validation - test_multiple_change_types - test_multiple_with_breaking_change - test_duplicate_field_id_error Related to delta-io#1346
Adds complete diffing functionality for non-nested schemas: - Field collection and ID-based matching - Detection of adds, removes, renames, nullability changes - Physical name validation for column mapping - Breaking change classification - Full type classification including arrays and maps - Ancestor filtering for LCA reporting Currently supports flat schemas (top-level fields only). The collect_all_fields_with_paths() function has a commented-out recursive call that will be enabled in PR 3 to support nested fields. All other functions are copied exactly from the original PR delta-io#1346 (murali-db/schema-evol) with no logic changes. This is part 2/5 of the schema diffing feature implementation. Tests included (9 tests): - test_identical_schemas - test_change_count - test_top_level_added_field - test_added_required_field_is_breaking - test_added_nullable_field_is_not_breaking - test_physical_name_validation - test_multiple_change_types - test_multiple_with_breaking_change - test_duplicate_field_id_error Related to delta-io#1346
Extends schema diffing to handle nested structures: - Recursive field collection through struct hierarchies - Ancestor filtering to report only LCA changes - Type change classification for struct containers - Support for arbitrarily deep nesting Arrays and maps support coming in follow-up PRs. This is part 3/5 of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Adds 8 comprehensive array tests: - Array element struct field changes - Doubly nested array type changes - Nested array nullability (loosened/tightened) - Inner array nullability (loosened/tightened) - Simple array nullability (loosened/tightened) All array implementation code was already present from PR 2. This PR only adds tests to verify array functionality. This is part 4/5 of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Completes the schema diffing implementation with comprehensive map support: - Map key and value type tracking and comparison - Map value nullability change detection (loosened/tightened) - Struct keys and values within maps - Complex deeply nested scenarios mixing structs, arrays, and maps - Comprehensive edge case testing This is the final PR in the series that enables the schema diff functionality. The #![allow(dead_code)] attribute has been removed as the implementation is now complete. Test additions: - test_map_value_struct_field_changes: Tests field changes within map values - test_map_nullability_loosened: Tests map value nullability relaxation (safe) - test_map_nullability_tightened: Tests map value nullability restriction (breaking) - test_map_with_struct_key: Tests maps with struct keys - test_cursed_deeply_nested_complex_types: Tests extremely nested combinations This is the final part (5/5) of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Completes the schema diffing implementation with map support: - Map key and value type tracking and comparison - Map value nullability change detection (loosened/tightened) - Struct keys and values within maps - Deeply nested scenarios mixing structs, arrays, and maps This is the final PR in the series that enables the schema diff functionality. The feature gate has been removed - implementation is now active. Test additions: - test_map_value_struct_field_changes - test_map_nullability_loosened - test_map_nullability_tightened - test_map_with_struct_key - test_cursed_deeply_nested_complex_types Note: #![allow(dead_code)] is retained as the API is not yet integrated with other modules. This is the final part (5/5) of the schema diffing feature implementation. Related to delta-io#1346 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
closing for the chain of PRs starting with #1477 |
What changes are proposed in this pull request?
Implements a schema diffing framework for Delta Kernel Rust to enable schema evolution. The diff algorithm uses field IDs (from column mapping) to track fields across schema versions, enabling detection of renames, additions, removals, and modifications at all nesting levels.
Java version here.
Summary of the allowed and blocked changes is TODO.
How was this change tested?
Unit tests.