@The-Obstacle-Is-The-Way
Contributor

Problem

test_push_dataset_dict_to_hub_overwrite_files intermittently fails with:

BadRequestError: LFS pointer pointed to a file that does not exist

This has been causing the deps-latest integration tests to fail on main (visible in recent CI runs). I ran into this while working on the BIDS loader PR and dug into the root cause.

Root Cause

Two race conditions in the test:

  1. LFS propagation timing - Rapid successive push_to_hub calls don't wait for the Hub to finish propagating LFS objects between pushes
  2. Repo name reuse - The second test scenario reused the repo name from the first scenario, creating a race between repo deletion and recreation

Solution

  • Add _wait_for_repo_ready() helper that polls list_repo_files to ensure the repo is consistent before subsequent operations
  • Use a unique repo name (ds_name_2) for the second scenario, eliminating the delete/create race entirely
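
For readers who want a concrete picture of the polling approach, here is a minimal sketch. It is not the helper from this PR: the name `wait_for_repo_ready`, its parameters, and the "expected files are all listed" readiness check are assumptions made for illustration. The file lister is passed in as a callable so the sketch runs without network access; in a real test it could wrap `HfApi().list_repo_files(repo_id, repo_type="dataset")`.

```python
import time
from typing import Callable, Iterable


def wait_for_repo_ready(
    list_files: Callable[[], Iterable[str]],
    expected_files: Iterable[str],
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every expected file is listed, or until the timeout expires.

    Hypothetical helper: `list_files` is any zero-argument callable that
    returns the current file names of the repo.
    """
    expected = set(expected_files)
    deadline = time.monotonic() + timeout
    while True:
        present = set(list_files())
        if expected <= present:
            # All expected files are visible: treat the repo as consistent.
            return True
        if time.monotonic() >= deadline:
            # Give up and let the caller decide how to fail.
            return False
        sleep(interval)
```

A test would call this between the two pushes and fail fast if it returns `False`, rather than proceeding to a second push against a repo that is still settling.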

Testing

All 4 integration test variants now pass:

  • ubuntu-latest, deps-latest (was failing)
  • ubuntu-latest, deps-minimum
  • windows-latest, deps-latest (was failing)
  • windows-latest, deps-minimum

Validated on fork: The-Obstacle-Is-The-Way#4

Related

cc @lhoestq - small fix but should help CI reliability

@The-Obstacle-Is-The-Way force-pushed the fix/flaky-lfs-test branch 3 times, most recently from 8e00a44 to 3ea8de7 on December 9, 2025 15:37
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Member

lhoestq commented Dec 19, 2025

Maybe push_to_hub() can retry on error 400 when committing to the Hub instead? This way we make sure push_to_hub() works without having to add the waiting step.

`test_push_dataset_dict_to_hub_overwrite_files` intermittently fails with:
```
BadRequestError: LFS pointer pointed to a file that does not exist
```

Root cause: Two race conditions in the test design:
1. Rapid successive `push_to_hub` calls don't wait for Hub's LFS object
   propagation between pushes
2. Second test scenario reused the same repo name, creating a race between
   repo deletion and recreation

Fix:
- Add `_wait_for_repo_ready()` helper that ensures Hub repository is in a
  consistent state before subsequent operations
- Use unique repo name (`ds_name_2`) for second scenario to eliminate the
  delete/create race entirely

Tested: All 4 integration test variants now pass consistently (ubuntu/windows,
deps-latest/deps-minimum).
@The-Obstacle-Is-The-Way
Contributor Author

Thanks for the pointer @lhoestq! Updated to add LFS 400 retry handling directly in push_to_hub() across all implementations (Dataset, DatasetDict, IterableDataset, IterableDatasetDict). Reverted the test-side waits.
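
As a rough illustration of what retrying on the LFS 400 could look like, here is a minimal sketch. It is not the code from this PR: `retry_on_lfs_400`, its parameters, the backoff schedule, and the message match on "LFS pointer" are all assumptions. The local `BadRequestError` class is a stand-in so the example is self-contained; real code would catch the exception class raised by `huggingface_hub`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BadRequestError(Exception):
    """Stand-in for the HTTP 400 error raised by the Hub client."""


def retry_on_lfs_400(
    commit: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `commit`, retrying with exponential backoff on the LFS pointer error.

    Hypothetical helper: only 400 errors whose message mentions "LFS pointer"
    are retried; any other error is re-raised immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return commit()
        except BadRequestError as err:
            is_lfs_race = "LFS pointer" in str(err)
            if not is_lfs_race or attempt == max_retries:
                raise
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Filtering on the error message keeps genuine bad requests from being retried, at the cost of depending on the exact wording of the server response.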
