Conversation
@GeigerJ2 GeigerJ2 commented Dec 2, 2025

After the filter_size batching was removed as part of the QueryBuilder IN-clause optimization in #6998, the progress bar in the _add_new_entities function was broken: it updated only once, at the very end, instead of incrementally.

Before PR #6998

for nrows, ufields_batch in batch_iter(ufields, filter_size):  # ~1000 IDs per batch
    rows = [transform(row) for row in QueryBuilder(...).dict()]  # ❌ List comprehension blocks
    bulk_insert(rows)
    progress.update(nrows)  # ✅ Updates every ~1000 entities

Progress updates:

  • From filter_size batching only (every ~1000 entities)
  • NOT from QueryBuilder streaming (blocked by list comprehension)
  • Each batch still had a blocking collection phase
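The snippets above and below rely on a batch_iter helper that chunks an iterable into fixed-size batches and yields (nrows, batch) pairs. As a hypothetical sketch (the actual AiiDA utility may differ in signature and details), such a helper could look like:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar('T')
R = TypeVar('R')


def batch_iter(
    iterable: Iterable[T],
    size: int,
    transform: Optional[Callable[[T], R]] = None,
) -> Iterator[Tuple[int, List[R]]]:
    """Yield (nrows, batch) pairs of at most `size` items.

    `transform` is applied item by item as each batch is built, so the
    input iterable is only consumed one batch at a time.
    """
    func = transform if transform is not None else (lambda item: item)
    iterator = iter(iterable)
    while batch := [func(item) for item in islice(iterator, size)]:
        yield len(batch), batch
```

Because the input is consumed lazily, a caller can update a progress bar after every yielded batch rather than after the whole iterable is exhausted.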

After PR #6998 (Broken)

rows = [transform(row) for row in QueryBuilder(...).dict()]  # ❌ Blocks for ALL
bulk_insert(rows)
progress.update(len(rows))  # ❌ Updates once at end

Progress updates: None until the very end
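The breakage can be reproduced in miniature: a list comprehension drains the entire result generator before control returns, so a progress update placed after it fires exactly once, while an explicit loop can update per row (or per batch). The names below are illustrative stand-ins; query_results plays the role of QueryBuilder(...).dict():

```python
def query_results(n):
    """Stand-in for QueryBuilder(...).dict(): a lazy generator of rows."""
    for i in range(n):
        yield {'id': i}


# Broken pattern: the comprehension collects everything first, so the
# progress bar jumps from 0 to 100% in a single update at the end.
updates_blocking = []
rows = [row for row in query_results(5)]
updates_blocking.append(len(rows))

# Streaming pattern: progress advances while rows are still arriving.
updates_streaming = []
for row in query_results(5):
    updates_streaming.append(1)
```

Here updates_blocking records a single update covering all rows, while updates_streaming records one update per row.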

With this PR (restored AND improved)

for _, ufields_batch in batch_iter(ufields, 50_000):  # 50k IDs per batch
    query = QueryBuilder(...).append(..., filters={unique_field: {'in': ufields_batch}})
    for nrows, rows_batch in batch_iter(query.dict(), batch_size, transform):  # ✅ NEW!
        bulk_insert(rows_batch)
        progress.update(nrows)  # ✅ Updates every ~10k entities

Changes:

  • Re-introduced query batching (50k instead of 1k; purely for UX, not due to SQL restrictions)
  • NEW: Added QueryBuilder streaming-based updates (every ~10k entities; previously blocked by the list comprehension)
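Put together, the restored control flow can be sketched end to end. Everything below is an illustrative stand-in (run_query for the QueryBuilder call, a list for bulk_insert, a counter for the progress bar), not the actual AiiDA API:

```python
from itertools import islice


def batch_iter(iterable, size, transform=None):
    """Chunk `iterable` into (nrows, batch) pairs, applying `transform` lazily."""
    func = transform or (lambda item: item)
    iterator = iter(iterable)
    while batch := [func(item) for item in islice(iterator, size)]:
        yield len(batch), batch


def run_query(ids):
    """Stand-in for QueryBuilder(...).dict() filtered on `ids` with an IN clause."""
    return ({'id': i} for i in ids)


def import_entities(ufields, query_batch=50_000, insert_batch=10_000):
    """Outer loop batches the IDs fed to the query; the inner loop streams
    the results in smaller batches, updating progress after each one."""
    progress = 0
    inserted = []
    for _, id_batch in batch_iter(ufields, query_batch):
        for nrows, rows_batch in batch_iter(run_query(id_batch), insert_batch, transform=dict):
            inserted.extend(rows_batch)  # stand-in for bulk_insert(rows_batch)
            progress += nrows            # progress bar advances incrementally
    return progress, inserted
```

With small batch sizes (e.g. query_batch=10, insert_batch=4 over 25 IDs) the counter advances several times instead of once at the end, which is the behavior the PR restores.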

@GeigerJ2 GeigerJ2 linked an issue Dec 2, 2025 that may be closed by this pull request

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.58%. Comparing base (e79f0a4) to head (b4cfe89).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7118      +/-   ##
==========================================
+ Coverage   79.58%   79.58%   +0.01%     
==========================================
  Files         566      566              
  Lines       43520    43523       +3     
==========================================
+ Hits        34631    34634       +3     
  Misses       8889     8889              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@GeigerJ2 GeigerJ2 force-pushed the fix/7117/progress-bar-archive-import branch from a0a7665 to 583be24 Compare December 2, 2025 17:07
@GeigerJ2 GeigerJ2 marked this pull request as ready for review December 3, 2025 09:34
Comment on lines 229 to 230
# collect the unique entities from the input backend to be added to the output backend
ufields = []
GeigerJ2 (Contributor Author) commented:
I considered also adding a report statement here:

Suggested change
# collect the unique entities from the input backend to be added to the output backend
ufields = []
# Additional report statement, as this can take a while
IMPORT_LOGGER.report(f'Finding unique new {etype.value}(s)')

This step can also take a while for a large number of nodes. However, the message would appear for every entity type, including Users and Computers, for which the step completes quickly, so the report would mostly add noise. I decided to leave it out for now.

edan-bainglass (Member) left a comment:
Thanks @GeigerJ2. LGTM! Have you tested locally?

@GeigerJ2 GeigerJ2 merged commit 4e54cd4 into aiidateam:main Dec 4, 2025
16 checks passed
@GeigerJ2 GeigerJ2 deleted the fix/7117/progress-bar-archive-import branch December 4, 2025 13:50
GeigerJ2 added a commit that referenced this pull request Dec 5, 2025
The removal of the filter_size batching in #6998 caused the progress bar
to only update once at the end instead of incrementally. Here, we
re-introduce query batching (50k IDs per batch for UX) and add
QueryBuilder streaming-based progress updates (every ~10k entities) by
replacing the blocking list comprehension with an explicit for-loop.
Progress now updates during both query batching (for UX) and result
streaming.

Cherry-pick: 4e54cd4
GeigerJ2 added a commit that referenced this pull request Dec 8, 2025

Cherry-pick: 4e54cd4
GeigerJ2 added a commit that referenced this pull request Dec 9, 2025

(cherry picked from commit 4e54cd4)

Linked issue (may be closed by merging this pull request): Progress bar stuck during archive import when adding new nodes