
Conversation

@nexxeln
Member

@nexxeln nexxeln commented Dec 23, 2025

done in a similar way to the ai sdk

@nexxeln nexxeln marked this pull request as ready for review December 23, 2025 18:24
Member Author

nexxeln commented Dec 23, 2025


How to use the Graphite Merge Queue

Add the label Main to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Dec 23, 2025

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ❌ Deployment failed (View logs)
Name: supermemory-app
Latest Commit: 1e7d39f
Updated (UTC): Dec 23 2025, 07:27 PM

@claude

claude bot commented Dec 23, 2025

Code Review: Fix Deduplication in Python SDK

Thank you for implementing memory deduplication! This is an important fix for preventing duplicate memories from appearing in different sources. Here's my comprehensive review:


Strengths

  1. Clear Priority System: The deduplication logic correctly implements a priority hierarchy (Static > Dynamic > Search Results), ensuring memories appear only in their highest-priority source (a sketch of this logic follows the list).

  2. Robust Type Handling: The extract_memory_text function handles multiple input formats gracefully:

    • Dict with 'memory' key
    • Plain strings
    • None values
    • Empty strings (properly filtered)
  3. Good Logging: The added debug logging for deduplication metrics helps with observability.

  4. Proper Exports: New functions are correctly exported in __init__.py.
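
For readers without the diff open, here is a minimal sketch of how such priority-based deduplication typically looks. It is reconstructed purely from the behavior described in this review (the names deduplicate_memories, DeduplicatedMemories, and extract_memory_text come from the review itself; the actual SDK implementation may differ in detail):

from dataclasses import dataclass, field
from typing import Any, Optional

def extract_memory_text(item: Any) -> Optional[str]:
    # Accepts dicts with a "memory" key, plain strings, or None
    # (mirrors the excerpt quoted later in this thread).
    if isinstance(item, dict):
        item = item.get("memory")
    if isinstance(item, str):
        trimmed = item.strip()
        return trimmed or None
    return None

@dataclass
class DeduplicatedMemories:
    """Deduplicated memory strings organized by source."""
    static: list[str] = field(default_factory=list)
    dynamic: list[str] = field(default_factory=list)
    search_results: list[str] = field(default_factory=list)

def deduplicate_memories(
    static: Optional[list[Any]] = None,
    dynamic: Optional[list[Any]] = None,
    search_results: Optional[list[Any]] = None,
) -> DeduplicatedMemories:
    """Keep each memory only in its highest-priority source (static > dynamic > search)."""
    seen_memories: set[str] = set()
    result = DeduplicatedMemories()
    # Process sources in priority order so a higher-priority source claims a memory first.
    for source, bucket in (
        (static, result.static),
        (dynamic, result.dynamic),
        (search_results, result.search_results),
    ):
        for item in source or []:
            text = extract_memory_text(item)
            if text and text not in seen_memories:
                seen_memories.add(text)
                bucket.append(text)
    return result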


🔍 Issues & Concerns

1. Missing Test Coverage ⚠️ HIGH PRIORITY

The PR adds 95 lines of new code, including critical deduplication logic, but no tests are included. This is a significant gap.

Required tests:

  • Basic deduplication across all three sources
  • Priority ordering (static > dynamic > search)
  • Edge cases: empty lists, None values, empty strings
  • Mixed input formats (dicts vs strings)
  • Duplicate handling within the same source
  • Empty memory strings after trimming

Suggested test cases:

def test_deduplicate_memories_basic():
    """Test basic deduplication across sources."""
    result = deduplicate_memories(
        static=[{"memory": "User likes Python"}],
        dynamic=[{"memory": "User likes Python"}, {"memory": "Lives in SF"}],
        search_results=[{"memory": "User likes Python"}, {"memory": "Uses FastAPI"}]
    )
    
    assert len(result.static) == 1
    assert len(result.dynamic) == 1  # Duplicate removed
    assert len(result.search_results) == 1  # Duplicate removed

def test_deduplicate_memories_priority():
    """Test priority: static > dynamic > search."""
    # Memory appears in all three
    result = deduplicate_memories(
        static=["Important fact"],
        dynamic=["Important fact", "Other fact"],
        search_results=["Important fact"]
    )
    
    # Should only appear in static
    assert "Important fact" in result.static
    assert "Important fact" not in result.dynamic
    assert "Important fact" not in result.search_results

def test_deduplicate_memories_empty_strings():
    """Test handling of whitespace-only strings."""
    result = deduplicate_memories(
        static=["  ", {"memory": "   "}, "Valid"],
        dynamic=[None, ""],
    )
    
    assert len(result.static) == 1
    assert result.static[0] == "Valid"

2. Inconsistent Type Handling in Middleware ⚠️

In middleware.py:137-141, the code now works with already-deduplicated strings and references deduplicated.static directly, whereas the original code at lines 149-156 extracted memory text from the raw items itself.

Issue: The profile conversion now receives deduplicated string arrays, and the search results section at line 167 likewise uses the deduplicated strings. This looks correct, but the integration should be verified end-to-end.

3. Potential Memory Overhead

The seen_memories set stores full memory strings. For large datasets with long memories, this could consume significant memory. Consider the following (a sketch of the hashing option follows this list):

  • Using hashed values for comparison (though this changes semantics)
  • Documenting expected memory limits
  • Adding a max memory count parameter
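
As an illustration of the first option, a hedged sketch of hash-based tracking (hypothetical; the current implementation stores the full strings):

import hashlib

def memory_key(text: str) -> str:
    # Store a fixed-size digest instead of the full memory string.
    # Trade-off: exact-string semantics become "same digest", with a negligible collision risk.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

seen_memories: set[str] = set()
for text in ("User prefers Python for development", "Lives in SF"):
    key = memory_key(text)
    if key not in seen_memories:
        seen_memories.add(key)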

4. No Handling of Near-Duplicates

The current implementation only handles exact string matches (after trimming). Similar memories like:

  • "User prefers Python for development"
  • "User prefers Python"

Would be treated as different. This might be intentional, but worth documenting.


🔧 Code Quality

Minor Issues:

  1. Type Hints: DeduplicatedMemories could use @dataclass for cleaner implementation:
from dataclasses import dataclass

@dataclass
class DeduplicatedMemories:
    """Deduplicated memory strings organized by source."""
    static: list[str]
    dynamic: list[str]
    search_results: list[str]
  2. Documentation: The deduplicate_memories function docstring should include the following (an example docstring follows this list):

    • Examples of input/output
    • Clarification on priority behavior
    • Notes about exact string matching
  3. Naming: Consider renaming search_results to search for consistency with the priority comment.
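
For the Documentation point, something along these lines would cover the gaps (a hypothetical docstring, not the one currently in the SDK):

def deduplicate_memories(static=None, dynamic=None, search_results=None):
    """Remove memories that appear in more than one source.

    Priority is static > dynamic > search_results: a memory found in several
    sources is kept only in the highest-priority one. Matching is exact string
    comparison after trimming whitespace, so near-duplicates such as
    "User prefers Python" vs. "User prefers Python for development" are kept
    as separate memories.

    Example:
        >>> result = deduplicate_memories(
        ...     static=[{"memory": "Likes Python"}],
        ...     dynamic=["Likes Python", "Lives in SF"],
        ... )
        >>> result.static
        ['Likes Python']
        >>> result.dynamic
        ['Lives in SF']
    """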


🛡️ Security Considerations

✅ No security issues identified. The function doesn't execute code or perform unsafe operations.


Performance Considerations

Current Complexity: O(n + m + k) where n, m, k are lengths of static, dynamic, and search arrays.

Potential Optimization: If memory counts are very large (>10k), consider:

  • Early termination if seen_memories becomes too large (sketched below)
  • Batch processing for memory extraction

Current implementation is fine for typical use cases (dozens to hundreds of memories).
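
A minimal sketch of the early-termination idea, using a hypothetical cap that is not part of the current API:

MAX_SEEN_MEMORIES = 10_000  # hypothetical safety cap

def add_memory(text: str, seen: set[str], bucket: list[str]) -> None:
    """Record a memory unless it is a duplicate or the cap has been reached."""
    if len(seen) >= MAX_SEEN_MEMORIES:
        return  # early termination: stop tracking once the set grows too large
    if text not in seen:
        seen.add(text)
        bucket.append(text)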


📋 Action Items

Before Merge:

  1. ✅ Add comprehensive unit tests for deduplicate_memories
  2. ✅ Add integration tests in test_middleware.py that verify deduplication works end-to-end
  3. 📝 Update docstrings with examples and edge case documentation
  4. 🔍 Verify the middleware integration works correctly with the new string-based approach

Nice to Have:

  • Consider using @dataclass for DeduplicatedMemories
  • Add performance benchmarks for large memory sets
  • Document expected behavior for near-duplicate memories

📊 Test Coverage Gap Analysis

Current test file (test_middleware.py) has 725 lines with comprehensive middleware testing, but zero tests for the new deduplicate_memories function. This is a significant gap given that:

  • The function handles multiple input formats
  • It implements complex priority logic
  • It's a pure function (easy to test!)
  • It affects core functionality

🎯 Overall Assessment

Code Quality: 7/10 - Clean implementation with good structure, but needs dataclass and better docs
Test Coverage: 2/10 - Critical gap, no tests for new functionality
Security: 10/10 - No concerns
Performance: 8/10 - Good for typical cases, document limits for large datasets

Recommendation: 🔄 Request Changes - Add tests before merging. The implementation looks solid, but the lack of test coverage is a blocker for a function this critical to data integrity.

@claude

claude bot commented Dec 23, 2025

Pull Request Review: Fix Deduplication in Python SDK

Summary

This PR implements memory deduplication in the Python SDK, mirroring the TypeScript implementation. The changes prevent duplicate memories from appearing across static, dynamic, and search result sources.


✅ Positive Aspects

  1. Consistent Architecture: Matches the TypeScript implementation in packages/tools/src/shared.ts, ensuring consistency across SDKs
  2. Clean Implementation: The deduplicate_memories() function is well-structured with clear priority handling (Static > Dynamic > Search Results)
  3. Proper Logging: Added debug logging for deduplication metrics, which aids troubleshooting
  4. Null Safety: Good handling of None values and empty data with proper defaults

🐛 Potential Issues

1. Memory Handling in middleware.py:120-126

profile = memories_response.profile or {}
search_results_data = memories_response.search_results or {}
memory_count_static = len(profile.get("static", []))
memory_count_dynamic = len(profile.get("dynamic", []))
memory_count_search = len(search_results_data.get("results", []))

Issue: If memories_response.profile is None, this sets it to an empty dict. However, the original code assumed these fields could be accessed directly. This is actually an improvement, but it should be verified that the API can in fact return None for these fields.

Recommendation: Verify the API contract and add a comment explaining why these defaults are needed.

2. Type Consistency with Memory Items

In utils.py:222-233, the function handles both dict and string types:

def extract_memory_text(item: Any) -> Optional[str]:
    if item is None:
        return None
    if isinstance(item, dict):
        memory = item.get("memory")
        if isinstance(memory, str):
            trimmed = memory.strip()
            return trimmed if trimmed else None
        return None
    if isinstance(item, str):
        trimmed = item.strip()
        return trimmed if trimmed else None
    return None

Observation: This is more permissive than the TypeScript version which only handles MemoryItem objects. While this adds robustness, ensure this is intentional behavior for the Python SDK.
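
For reference, here is how the excerpt above behaves on the formats mentioned in this thread (assuming the function is in scope; expected results shown as comments):

extract_memory_text({"memory": "  User likes Python  "})  # -> "User likes Python"
extract_memory_text("plain string")                       # -> "plain string"
extract_memory_text({"memory": "   "})                    # -> None (whitespace only)
extract_memory_text(None)                                  # -> None
extract_memory_text(42)                                    # -> None (unsupported type)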

3. Missing Unit Tests

$ grep -r "deduplicate" tests/
# No results found

Critical: There are no unit tests for the new deduplicate_memories() function.

Recommendation: Add comprehensive tests covering:

  • Basic deduplication across sources
  • Priority handling (static > dynamic > search)
  • Edge cases (empty inputs, None values, whitespace-only strings)
  • Mixed type inputs (dict vs string)

Example test structure:

def test_deduplicate_memories_basic():
    result = deduplicate_memories(
        static=[{"memory": "User likes Python"}],
        dynamic=[{"memory": "User likes Python"}, {"memory": "Works remotely"}],
        search_results=[{"memory": "Prefers async"}]
    )
    assert result.static == ["User likes Python"]
    assert result.dynamic == ["Works remotely"]
    assert result.search_results == ["Prefers async"]

def test_deduplicate_memories_with_none_values():
    result = deduplicate_memories(static=None, dynamic=None, search_results=None)
    assert result.static == []
    assert result.dynamic == []
    assert result.search_results == []

def test_deduplicate_memories_whitespace_handling():
    result = deduplicate_memories(
        static=[{"memory": "  trimmed  "}],
        dynamic=[{"memory": "trimmed"}, {"memory": "  "}]
    )
    assert result.static == ["trimmed"]
    assert result.dynamic == []  # Duplicate removed

🔒 Security Concerns

No security issues identified. The deduplication logic is purely string-based and doesn't introduce any injection vectors.


⚡ Performance Considerations

  1. Set-based Deduplication: Uses set for O(1) lookups - excellent choice
  2. String Trimming: Trimming happens for every memory item. For large datasets, consider caching trimmed values if the same item is checked multiple times (though the current implementation likely doesn't have this issue)
  3. Overall: Performance should be good for typical use cases with hundreds of memories

📝 Code Quality

  1. Documentation: The DeduplicatedMemories class and deduplicate_memories() function have clear docstrings
  2. Type Hints: Properly using Optional, Any, and list type hints
  3. Formatting: Code appears to follow Python conventions

Minor Suggestion: The DeduplicatedMemories class could use @dataclass decorator for cleaner syntax:

from dataclasses import dataclass

@dataclass
class DeduplicatedMemories:
    """Deduplicated memory strings organized by source."""
    static: list[str]
    dynamic: list[str]
    search_results: list[str]

🎯 Test Coverage Assessment

Current Status: ❌ No test coverage for new functionality

Required Tests:

  1. Unit tests for deduplicate_memories() in tests/test_middleware.py or new tests/test_utils.py
  2. Integration tests verifying deduplication happens during middleware execution
  3. Edge case tests (see examples above)

✅ Recommendation

Status: Approve with requested changes

Required Before Merge:

  1. ✅ Add comprehensive unit tests for deduplicate_memories()
  2. ⚠️ Add an integration test in the existing test suite verifying deduplication in the middleware flow
  3. 📝 Consider using @dataclass for DeduplicatedMemories

Nice to Have:

  • Verify and document the API contract for profile and search_results potentially being None
  • Add type stubs or more specific types instead of Any in extract_memory_text (see the sketch below)
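
On the typing point, a hedged sketch of a narrower signature (the MemoryInput alias is hypothetical, not part of the SDK):

from typing import Optional, Union

MemoryInput = Union[str, dict, None]  # dict kept loose: memory items may carry extra keys

def extract_memory_text(item: MemoryInput) -> Optional[str]:
    if isinstance(item, dict):
        item = item.get("memory")
    if isinstance(item, str):
        trimmed = item.strip()
        return trimmed or None
    return None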

Great work on maintaining parity with the TypeScript SDK! The implementation is solid and follows best practices. Once tests are added, this will be ready to merge.

