# Benchmarks of various genomic ranges operations

## Prerequisites

- pyenv
- poetry

```shell
➜ polars-bio-bench git:(init) ✗ pyenv --version
pyenv 2.5.0
➜ polars-bio-bench git:(init) ✗ poetry --version
Poetry (version 2.0.0)
```

## Setup

```shell
pyenv install 3.12.8
pyenv local 3.12.8
poetry env use 3.12
poetry update
```

Please note that you need at least 64 GB of RAM to run the full benchmarks. For the default configuration, 16-32 GB should be enough.
## Running the benchmarks

All the benchmarking scenarios are defined in the `conf/benchmark_*.yaml` files. By default, `conf/benchmark_small.yaml` is used.
If you would like to run the benchmarks with a different configuration file, you can specify it using the `--bench-config` option.
```shell
export BENCH_DATA_ROOT=/tmp/polars-bio-bench/
poetry run python src/run-benchmarks.py --help
```

```text
INFO:polars_bio:Creating BioSessionContext
Usage: run-benchmarks.py [OPTIONS]

Options:
  --bench-config TEXT  Benchmark config file (default:
                       conf/benchmark_small.yaml)
  --help               Show this message and exit.
```
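For orientation, a CLI with this help output can be reproduced with a few lines of `click`; the following is a hypothetical sketch, not the actual contents of `src/run-benchmarks.py`:

```python
# Hypothetical sketch of a click-based CLI matching the help output above;
# the real src/run-benchmarks.py may be structured differently.
import click


@click.command()
@click.option(
    "--bench-config",
    default="conf/benchmark_small.yaml",
    help="Benchmark config file (default: conf/benchmark_small.yaml)",
)
def main(bench_config: str) -> None:
    """Run the benchmark scenarios defined in the given config file."""
    click.echo(f"Running benchmarks from {bench_config}")


if __name__ == "__main__":
    main()
```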
For the e2e test suite (`benchmark-e2e-overlap`) please additionally set:

```shell
export POLARS_MAX_THREADS=1
```

Available configuration files:

- `conf/benchmark_small.yaml` - small dataset, small number of operations for nearest and overlap, native DataFusion input
- `conf/benchmark_dataframes.yaml` - as above but with DataFrames (Polars/Pandas) as input
- `conf/benchmark_large.yaml` - large dataset, large number of operations for nearest and overlap, native DataFusion input
- `conf/benchmark_parallel.yaml` - comparison of parallel operations for pyranges0 and polars_bio, with bioframe as a baseline
- `conf/benchmark_count_overlaps.yaml` - comparison of the count overlaps operation for pyranges{0,1} and polars_bio, with bioframe as a baseline
- `conf/benchmark_merge.yaml` - comparison of the merge operation for pyranges{0,1} and polars_bio, with bioframe as a baseline
- `conf/benchmark_coverage.yaml` - comparison of the coverage operation for pyranges{0,1} and polars_bio, with bioframe as a baseline

Configurations used in the paper:

- `conf/paper/benchmark-e2e-overlap.yaml` - end-to-end benchmark for the overlap operation, with results written to a CSV file (1-2 and 8-7 datasets)
- `conf/paper/benchmark-4ops-1-2.yaml` - overlap, nearest, count_overlaps and coverage operations for the 1-2 dataset
- `conf/paper/benchmark-4ops-8-7.yaml` - as above but for the 8-7 dataset
- `conf/paper/benchmark-4ops-8-7-polars-bio-parallel.yaml` - as above but polars_bio only, with parallel operations on 1, 2, 4, 6 and 8 threads
- `conf/paper/benchmark-read_vcf.yaml` - read a VCF file with polars_bio on 1, 2, 4, 6 and 8 threads
## Memory profiling

Example of running the memory profiler for polars_bio with the 1-2 dataset:

```shell
PROF_FILE="polars_bio_1-2.dat"
mprof run --output $PROF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-overlap.yaml --tool polars_bio --test-case 1-2 --operation overlap
mprof plot $PROF_FILE
```

Profiling all tools and operations on the synthetic datasets:

```shell
BENCHMARK_TYPE="synthetic"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
  for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
    for test_case in "100" "10000000"; do
      PROF_FILE="${tool}_${operation}_${test_case}.dat"
      mprof run --output $PROF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
    done
  done
done
```

The same loop for the real datasets (test cases 1-2 and 8-7):

```shell
BENCHMARK_TYPE="real"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
  for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
    for test_case in "1-2" "8-7"; do
      PROF_FILE="${tool}_${operation}_${test_case}.dat"
      mprof run --output $PROF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
    done
  done
done
```

And for genomicranges, which is profiled only for the overlap operation on the 8-7 dataset:

```shell
BENCHMARK_TYPE="real"
for operation in "overlap"; do
  for tool in "genomicranges"; do
    for test_case in "8-7"; do
      PROF_FILE="${tool}_${operation}_${test_case}.dat"
      mprof run --output $PROF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
    done
  done
done
```
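Besides `mprof plot`, the peak memory of each run can be summarized directly from the generated `.dat` files. A minimal sketch, assuming the standard `memory_profiler` file format (sample lines of the form `MEM <MiB> <timestamp>`):

```python
# Minimal sketch: report the peak RSS recorded in each mprof .dat file.
# Assumes memory_profiler's default format, where each sample line is
# "MEM <memory_in_MiB> <unix_timestamp>"; other line types are skipped.
from pathlib import Path


def peak_mib(dat_file: Path) -> float:
    """Return the maximum MEM sample (in MiB) found in an mprof .dat file."""
    peak = 0.0
    for line in dat_file.read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "MEM":
            peak = max(peak, float(parts[1]))
    return peak


for dat in sorted(Path(".").glob("*.dat")):
    print(f"{dat.name}: {peak_mib(dat):.1f} MiB peak")
```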
## Dataset generation

This repository includes a unified script for generating random genomic interval datasets and uploading them to cloud storage. The script creates datasets with unique timestamps and uploads them with a proper directory structure.
Prerequisites:

- rclone - required for uploading datasets to Google Drive:

  ```shell
  # Install rclone
  curl https://rclone.org/install.sh | sudo bash
  # Configure rclone with your Google Drive (follow the interactive setup)
  rclone config
  ```

- Python dependencies - the script requires pandas, numpy, and yaml:

  ```shell
  # These are already included in the poetry environment
  poetry install
  ```
To generate a new dataset:

```shell
# From the polars-bio-bench root directory
poetry run python src/generate_dataset.py
```

The script will:
- Clean up old files - Remove previous datasets and ZIP archives
- Generate test data - Create parquet files with different sizes (100, 1K, 10K, 100K, 1M records)
- Create ZIP archive - Package the datasets into a single ZIP file
- Upload to Google Drive - Upload via rclone and generate public download link
- Generate configuration files - Create YAML configs for benchmarking
The script generates files in the following structure:

```text
polars-bio-bench/
├── tmp/
│   ├── data/                  # Generated parquet files
│   │   ├── df1-100.parquet
│   │   ├── df2-100.parquet
│   │   ├── df1-1000.parquet
│   │   ├── df2-1000.parquet
│   │   ├── ... (up to 1M records)
│   └── conf/                  # Configuration files
│       ├── common.yaml        # Dataset metadata and test cases
│       └── random.yaml        # Benchmark definitions
└── random_intervals_YYYYMMDD_HHMMSS.zip  # ZIP archive for upload
```
Dataset characteristics:

- Dataset ID: `random_intervals_YYYYMMDD_HHMMSS` (unique timestamp)
- Test cases: 5 different sizes (100, 1K, 10K, 100K, 1M records)
- File format: Parquet files with genomic intervals (chrom, start, end)
- Chromosome range: chr1 only, for simplicity
- Coordinate range: random intervals up to the dataset size
- Archive size: ~17-18 MB (compressed)
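For illustration, a pair of such interval files could be generated with pandas/numpy along these lines. This is a sketch only: the interval-length and seeding scheme are assumptions, and `src/generate_dataset.py` is the authoritative implementation.

```python
# Illustrative sketch of random-interval generation on chr1; the actual
# src/generate_dataset.py may use a different coordinate/length scheme.
# Writing parquet requires pyarrow (or fastparquet) to be installed.
from pathlib import Path

import numpy as np
import pandas as pd


def random_intervals(n: int, seed: int) -> pd.DataFrame:
    """Generate n random intervals on chr1 with start coordinates up to n."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, n, size=n)
    length = rng.integers(1, 100, size=n)  # assumed interval-length range
    return pd.DataFrame({"chrom": "chr1", "start": start, "end": start + length})


Path("tmp/data").mkdir(parents=True, exist_ok=True)
for size in (100, 1_000, 10_000, 100_000, 1_000_000):
    for i in (1, 2):
        random_intervals(size, seed=i).to_parquet(f"tmp/data/df{i}-{size}.parquet")
```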
`common.yaml` - contains dataset metadata and test cases:

```yaml
datasets:
  - name: random_intervals_20250530_231351
    source: tgambin
    unzip: true
    format: zip
    url: https://drive.google.com/open?id=...
    # ... additional metadata

test-cases:
  - name: '100'
    df_path_1: df1-100.parquet
    df_path_2: df2-100.parquet
  # ... more test cases
```

`random.yaml` - contains benchmark definitions for overlap and nearest operations with various tools and parallelization options.
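For illustration, this is roughly how a runner could consume `common.yaml`; a sketch assuming the structure shown above (the `BENCH_DATA_ROOT` path handling is an assumption):

```python
# Sketch: load test cases from common.yaml and resolve the parquet paths.
# Assumes the YAML structure shown above; how the benchmark runner actually
# resolves paths against BENCH_DATA_ROOT is an assumption.
import os
from pathlib import Path

import yaml

with open("tmp/conf/common.yaml") as f:
    conf = yaml.safe_load(f)

data_root = Path(os.environ.get("BENCH_DATA_ROOT", "/tmp/polars-bio-bench/"))
for case in conf["test-cases"]:
    df1 = data_root / case["df_path_1"]
    df2 = data_root / case["df_path_2"]
    print(f"test case {case['name']}: {df1} vs {df2}")
```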
Datasets are automatically uploaded to:

- Remote path: `tgambin:polars-bio-datasets/{dataset_id}/`
- Public URL: generated automatically via `rclone link`
- Access: public download links for easy integration
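The upload step amounts to two rclone invocations (`rclone copy` and `rclone link`). A minimal sketch of how that could be scripted, assuming a configured `tgambin` remote (how `generate_dataset.py` actually shells out may differ):

```python
# Sketch: upload a dataset archive and obtain a public link via rclone.
# Assumes an rclone remote named "tgambin" is already configured for
# Google Drive; the archive name below is an example.
import subprocess

archive = "random_intervals_20250530_231351.zip"
dataset_id = archive.removesuffix(".zip")
remote = f"tgambin:polars-bio-datasets/{dataset_id}/"

# Copy the archive into the dataset directory on the remote.
subprocess.run(["rclone", "copy", archive, remote], check=True)

# Generate a public download link for the uploaded file.
link = subprocess.run(
    ["rclone", "link", remote + archive],
    check=True, capture_output=True, text=True,
).stdout.strip()
print(f"Public URL: {link}")
```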
Once generated, the new dataset can be used in benchmarks by:

- Copying the configuration files to the main `conf/` directory
- Updating benchmark YAML files to reference the new dataset ID
- Running benchmarks with the new configuration
Example:

```shell
# Copy generated configs (optional)
cp tmp/conf/common.yaml conf/
cp tmp/conf/random.yaml conf/benchmark_random_new.yaml

# Run benchmarks with the new dataset
poetry run python src/run-benchmarks.py --bench-config conf/benchmark_random_new.yaml
```

## Troubleshooting

On macOS with Apple Silicon (M-series) chips you may encounter the following error when installing polars-bio from source with poetry:

```text
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```

To fix this, you can set the following environment variable when installing or updating polars-bio:

```shell
RUSTFLAGS="-Clink-arg=-undefined -Clink-arg=dynamic_lookup -Ctarget-cpu=native" poetry update
```