Skip to content

Conversation

@murali-db
Copy link
Collaborator

@murali-db murali-db commented Sep 24, 2025

What changes are proposed in this pull request?

Implements a schema diffing framework for Delta Kernel Rust to enable schema evolution. The diff algorithm uses field IDs (from column mapping) to track fields across schema versions, enabling detection of renames, additions, removals, and modifications at all nesting levels.
Java version here.
Summary of the allowed and blocked changes is TODO.

  • Field ID-based matching: Identifies fields by column mapping ID, not name (enables rename detection)
  • Nested field support: Handles changes in structs, arrays, and maps at any depth
  • LCA filtering: Reports only least common ancestors to reduce noise (e.g., reports "user" added, not "user.name", "user.email", etc.)
  • Breaking change detection: Distinguishes safe changes (nullability loosened) from breaking ones (nullability tightened, type changed)
  • Container nullability tracking: Detects array/map nullability changes separately from element changes

How was this change tested?

Unit tests.

@murali-db murali-db force-pushed the murali-db/schema-evol branch from 654aa32 to 3af7c20 Compare September 24, 2025 11:00
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 24, 2025
@codecov
Copy link

codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 98.85458% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.73%. Comparing base (87e3552) to head (4f8c833).
⚠️ Report is 89 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/schema/diff.rs 98.85% 14 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1346      +/-   ##
==========================================
+ Coverage   84.96%   85.73%   +0.77%     
==========================================
  Files         114      120       +6     
  Lines       29445    33039    +3594     
  Branches    29445    33039    +3594     
==========================================
+ Hits        25017    28325    +3308     
- Misses       3220     3430     +210     
- Partials     1208     1284      +76     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@murali-db murali-db force-pushed the murali-db/schema-evol branch from 108283f to ed148a5 Compare October 1, 2025 11:18
@murali-db murali-db changed the title feat: schema diffing poc feat: schema diffing Oct 1, 2025
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small cleanup. Looks correct to me. Thank you!

murali-db and others added 8 commits October 16, 2025 00:30
Replace string-based field paths with ColumnName type throughout the
schema diffing framework for better consistency with the codebase.

Changes:
- Updated FieldChange, FieldUpdate, and FieldWithPath to use ColumnName
- Updated SchemaDiffError to use ColumnName in error messages
- Refactored path construction to use ColumnName::new() and join()
- Replaced string prefix matching with proper ColumnName comparison
- Updated all test assertions to use ColumnName

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed unwrap() to expect() with a descriptive message to satisfy
clippy linter requirements while maintaining safety guarantees.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add #![allow(dead_code)] to the schema diff module since this API
is not yet consumed by other parts of the codebase. The functions
are fully tested and ready for use in schema evolution features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db and others added 6 commits October 16, 2025 00:30
Updated schema diffing logic to classify field removals as safe operations
rather than breaking changes. This aligns with the principle that dropping
columns is safe - existing data files remain valid and queries referencing
removed fields will fail at query time, but data integrity is maintained.

Changes:
- Modified compute_has_breaking_changes() to not flag field removals
- Updated 5 test assertions that previously expected removals to be breaking
- All 32 schema diff tests pass

Rationale: Field removals should not break existing readers since they can
safely ignore unknown fields in data files. Query-time failures for removed
fields are acceptable as they reflect the intentional schema change.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Performance optimizations:
- Add ColumnName::parent() method for O(1) parent path access
- Remove ColumnName::is_descendant_of() (no longer needed)
- Optimize has_added_ancestor() from O(N*M) to O(M) using parent chain
- Add early return in classify_data_type_change() for identical types

Code quality improvements:
- Add check_field_nullability_change() helper for better code reuse
- Refactor physical name validation into validate_physical_name()
- Simplify nested match statements with helper functions

Test coverage:
- Add Case 6: Field nullability tightening in deeply nested structures
- Add Case 7: Container nullability tightening in deeply nested structures

All 32 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@murali-db murali-db force-pushed the murali-db/schema-evol branch from 2a065a3 to f6226f5 Compare October 16, 2025 00:31
This refactoring simplifies the API by replacing:
- `change_type: FieldChangeType` with `change_types: Vec<FieldChangeType>`
- Removes the `FieldChangeType::Multiple` variant

Benefits:
- Reduces indirection - changes can be accessed directly as a Vec
- Cleaner API - no need to pattern match on Multiple vs other variants
- More consistent with the semantics of tracking multiple concurrent changes

Updated `classify_data_type_change()` to return `Vec<FieldChangeType>`
directly instead of wrapping in Option<FieldChangeType::Multiple>.

All 32 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Introduces core data structures for schema diffing:
- SchemaDiff, FieldChange, FieldUpdate types
- FieldChangeType enum for classifying changes
- SchemaDiffError for validation errors
- ColumnName::parent() helper method

This is part 1/5 of the schema diffing feature implementation.
The actual diffing algorithm will be added in PR 2.

Note: This PR includes a temporary stub for compute_schema_diff()
to allow basic tests to compile. The full implementation from the
original PR delta-io#1346 will be copied exactly in PR 2.

Related to delta-io#1346
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds complete diffing functionality for non-nested schemas:
- Field collection and ID-based matching
- Detection of adds, removes, renames, nullability changes
- Physical name validation for column mapping
- Breaking change classification
- Full type classification including arrays and maps
- Ancestor filtering for LCA reporting

Currently supports flat schemas (top-level fields only). The
collect_all_fields_with_paths() function has a commented-out
recursive call that will be enabled in PR 3 to support nested fields.

All other functions are copied exactly from the original PR delta-io#1346
(murali-db/schema-evol) with no logic changes.

This is part 2/5 of the schema diffing feature implementation.

Tests included (9 tests):
- test_identical_schemas
- test_change_count
- test_top_level_added_field
- test_added_required_field_is_breaking
- test_added_nullable_field_is_not_breaking
- test_physical_name_validation
- test_multiple_change_types
- test_multiple_with_breaking_change
- test_duplicate_field_id_error

Related to delta-io#1346
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds complete diffing functionality for non-nested schemas:
- Field collection and ID-based matching
- Detection of adds, removes, renames, nullability changes
- Physical name validation for column mapping
- Breaking change classification
- Full type classification including arrays and maps
- Ancestor filtering for LCA reporting

Currently supports flat schemas (top-level fields only). The
collect_all_fields_with_paths() function has a commented-out
recursive call that will be enabled in PR 3 to support nested fields.

All other functions are copied exactly from the original PR delta-io#1346
(murali-db/schema-evol) with no logic changes.

This is part 2/5 of the schema diffing feature implementation.

Tests included (9 tests):
- test_identical_schemas
- test_change_count
- test_top_level_added_field
- test_added_required_field_is_breaking
- test_added_nullable_field_is_not_breaking
- test_physical_name_validation
- test_multiple_change_types
- test_multiple_with_breaking_change
- test_duplicate_field_id_error

Related to delta-io#1346
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Extends schema diffing to handle nested structures:
- Recursive field collection through struct hierarchies
- Ancestor filtering to report only LCA changes
- Type change classification for struct containers
- Support for arbitrarily deep nesting

Arrays and maps support coming in follow-up PRs.

This is part 3/5 of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds 8 comprehensive array tests:
- Array element struct field changes
- Doubly nested array type changes
- Nested array nullability (loosened/tightened)
- Inner array nullability (loosened/tightened)
- Simple array nullability (loosened/tightened)

All array implementation code was already present from PR 2.
This PR only adds tests to verify array functionality.

This is part 4/5 of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Completes the schema diffing implementation with comprehensive map support:
- Map key and value type tracking and comparison
- Map value nullability change detection (loosened/tightened)
- Struct keys and values within maps
- Complex deeply nested scenarios mixing structs, arrays, and maps
- Comprehensive edge case testing

Test additions:
- test_map_value_struct_field_changes: Tests field changes within map values
- test_map_nullability_loosened: Tests map value nullability relaxation (safe)
- test_map_nullability_tightened: Tests map value nullability restriction (breaking)
- test_map_with_struct_key: Tests maps with struct keys
- test_cursed_deeply_nested_complex_types: Tests extremely nested combinations

This is the final part (5/5) of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds complete diffing functionality for non-nested schemas:
- Field collection and ID-based matching
- Detection of adds, removes, renames, nullability changes
- Physical name validation for column mapping
- Breaking change classification
- Full type classification including arrays and maps
- Ancestor filtering for LCA reporting

Currently supports flat schemas (top-level fields only). The
collect_all_fields_with_paths() function has a commented-out
recursive call that will be enabled in PR 3 to support nested fields.

All other functions are copied exactly from the original PR delta-io#1346
(murali-db/schema-evol) with no logic changes.

This is part 2/5 of the schema diffing feature implementation.

Tests included (9 tests):
- test_identical_schemas
- test_change_count
- test_top_level_added_field
- test_added_required_field_is_breaking
- test_added_nullable_field_is_not_breaking
- test_physical_name_validation
- test_multiple_change_types
- test_multiple_with_breaking_change
- test_duplicate_field_id_error

Related to delta-io#1346
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds complete diffing functionality for non-nested schemas:
- Field collection and ID-based matching
- Detection of adds, removes, renames, nullability changes
- Physical name validation for column mapping
- Breaking change classification
- Full type classification including arrays and maps
- Ancestor filtering for LCA reporting

Currently supports flat schemas (top-level fields only). The
collect_all_fields_with_paths() function has a commented-out
recursive call that will be enabled in PR 3 to support nested fields.

All other functions are copied exactly from the original PR delta-io#1346
(murali-db/schema-evol) with no logic changes.

This is part 2/5 of the schema diffing feature implementation.

Tests included (9 tests):
- test_identical_schemas
- test_change_count
- test_top_level_added_field
- test_added_required_field_is_breaking
- test_added_nullable_field_is_not_breaking
- test_physical_name_validation
- test_multiple_change_types
- test_multiple_with_breaking_change
- test_duplicate_field_id_error

Related to delta-io#1346
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Extends schema diffing to handle nested structures:
- Recursive field collection through struct hierarchies
- Ancestor filtering to report only LCA changes
- Type change classification for struct containers
- Support for arbitrarily deep nesting

Arrays and maps support coming in follow-up PRs.

This is part 3/5 of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Adds 8 comprehensive array tests:
- Array element struct field changes
- Doubly nested array type changes
- Nested array nullability (loosened/tightened)
- Inner array nullability (loosened/tightened)
- Simple array nullability (loosened/tightened)

All array implementation code was already present from PR 2.
This PR only adds tests to verify array functionality.

This is part 4/5 of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Completes the schema diffing implementation with comprehensive map support:
- Map key and value type tracking and comparison
- Map value nullability change detection (loosened/tightened)
- Struct keys and values within maps
- Complex deeply nested scenarios mixing structs, arrays, and maps
- Comprehensive edge case testing

This is the final PR in the series that enables the schema diff functionality.
The #![allow(dead_code)] attribute has been removed as the implementation is now complete.

Test additions:
- test_map_value_struct_field_changes: Tests field changes within map values
- test_map_nullability_loosened: Tests map value nullability relaxation (safe)
- test_map_nullability_tightened: Tests map value nullability restriction (breaking)
- test_map_with_struct_key: Tests maps with struct keys
- test_cursed_deeply_nested_complex_types: Tests extremely nested combinations

This is the final part (5/5) of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
murali-db added a commit to murali-db/delta-kernel-rs that referenced this pull request Nov 14, 2025
Completes the schema diffing implementation with map support:
- Map key and value type tracking and comparison
- Map value nullability change detection (loosened/tightened)
- Struct keys and values within maps
- Deeply nested scenarios mixing structs, arrays, and maps

This is the final PR in the series that enables the schema diff functionality.
The feature gate has been removed - implementation is now active.

Test additions:
- test_map_value_struct_field_changes
- test_map_nullability_loosened
- test_map_nullability_tightened
- test_map_with_struct_key
- test_cursed_deeply_nested_complex_types

Note: #![allow(dead_code)] is retained as the API is not yet integrated with other modules.

This is the final part (5/5) of the schema diffing feature implementation.

Related to delta-io#1346

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@zachschuermann
Copy link
Collaborator

closing for the chain of PRs starting with #1477

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants