@mwiewior
Contributor

No description provided.

mwiewior and others added 9 commits September 23, 2025 11:44
…boundaries

- Fix LimitedRangeFile to properly enforce byte range limits during reading
- Add proper FASTQ record boundary detection for splits
- Implement LimitedRangeFile wrapper that stops reading at end offset
- Add FastqLocalReader::PlainRanged variant for byte range reading
- Add comprehensive unit tests for byte range functionality
- Fix missing benchmark_bgzf_threads.rs reference in bio-format-gff

This resolves the issue where FASTQ readers were reading entire files
instead of respecting byte ranges, causing poor parallelization performance.
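A range-limiting wrapper of the kind described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the type name `RangeLimited` and its constructor are hypothetical, and it assumes the range is half-open, `[start, end)`, over any `Read + Seek` source.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper (not the PR's LimitedRangeFile): exposes only the
/// bytes in the half-open range [start, end) of the inner reader.
struct RangeLimited<R> {
    inner: R,
    /// Bytes still allowed to be read before reporting end-of-stream.
    remaining: u64,
}

impl<R: Read + Seek> RangeLimited<R> {
    fn new(mut inner: R, start: u64, end: u64) -> io::Result<Self> {
        if end < start {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "end offset precedes start offset",
            ));
        }
        // Position the inner reader at the start of the range.
        inner.seek(SeekFrom::Start(start))?;
        Ok(Self { inner, remaining: end - start })
    }
}

impl<R: Read> Read for RangeLimited<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            // Range exhausted: behave like EOF even if the file continues.
            return Ok(0);
        }
        // Never hand the inner reader more buffer than the range allows.
        let max = self.remaining.min(buf.len() as u64) as usize;
        let n = self.inner.read(&mut buf[..max])?;
        self.remaining -= n as u64;
        Ok(n)
    }
}
```

Wrapping a reader this way means downstream parsers see an ordinary end-of-stream at the end offset. The standard library's `Read::take` caps the byte count in a similar way once the reader has been seeked to the start offset.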

Tests verify:
- Proper byte range enforcement
- FASTQ record boundary handling
- Various split scenarios (start, middle, end)
- Integration with FastqLocalReader
- Error handling for invalid ranges

The previous implementation only checked if a line starts with '@', which
fails because FASTQ quality scores can contain '@' characters (ASCII 64 = Q31).
This caused false positive matches and incorrect partition boundaries.

Changes:
- Implement sliding window approach that checks 4-line patterns
- Validate complete FASTQ records: @Header, sequence, +separator, quality
- Add strong validation: sequence length must equal quality length
- Search up to 2000 bytes beyond partition end to find valid boundary
- Skip initial partial line before starting pattern search

This ensures all partitions find correct record boundaries, fixing the issue
where some partitions would return 0 records due to false '@' matches in
quality scores.
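The 4-line validation described above might look roughly like the sketch below. It is an assumption-laden illustration, not the code from this PR: the function names `line_at`, `is_record_at`, and `find_record_start` are hypothetical, the input is an in-memory byte slice, and the 2000-byte search bound is not modeled.

```rust
/// Returns the line starting at `pos` (without its newline or trailing '\r')
/// and the offset of the following line. Hypothetical helper.
fn line_at(buf: &[u8], pos: usize) -> Option<(&[u8], usize)> {
    if pos >= buf.len() {
        return None;
    }
    let end = buf[pos..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(buf.len(), |i| pos + i);
    let line = &buf[pos..end];
    let line = line.strip_suffix(b"\r").unwrap_or(line);
    Some((line, (end + 1).min(buf.len())))
}

/// Checks whether the four lines starting at `pos` form a plausible record:
/// '@' header, non-empty sequence, '+' separator, quality of equal length.
fn is_record_at(buf: &[u8], pos: usize) -> bool {
    let Some((header, p1)) = line_at(buf, pos) else { return false };
    let Some((seq, p2)) = line_at(buf, p1) else { return false };
    let Some((sep, p3)) = line_at(buf, p2) else { return false };
    let Some((qual, _)) = line_at(buf, p3) else { return false };
    header.starts_with(b"@")
        && sep.starts_with(b"+")
        && !seq.is_empty()
        && seq.len() == qual.len()
}

/// Finds the first record start at or after `from`, skipping an initial
/// partial line when `from` falls in the middle of a line.
fn find_record_start(buf: &[u8], from: usize) -> Option<usize> {
    if from >= buf.len() {
        return None;
    }
    let mut pos = if from == 0 || buf[from - 1] == b'\n' {
        from
    } else {
        // Mid-line: advance to the byte after the next newline.
        let i = buf[from..].iter().position(|&b| b == b'\n')?;
        from + i + 1
    };
    while pos < buf.len() {
        if is_record_at(buf, pos) {
            return Some(pos);
        }
        // Slide the window forward by one line.
        let (_, next) = line_at(buf, pos)?;
        pos = next;
    }
    None
}
```

With a quality line such as `@III`, a check that only looks for a leading '@' would stop on the quality line. The window check rejects that position because the line two below it does not start with '+'.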

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>