Commit 6cb242c

feat: Streaming VCF reader (#2)
* WIP: VCF
* Refactor
* Parsing info draft
* Working parser
* Refactoring infos
* async-trait downgrade
* Reverting to 0-based
* Renamed columns
* Add retry go operator
* Add IOTimeout
* Fixing streams
* Adding s3
* fix: Basic fields
* Adding support for remote reading of uncompressed VCFs
* Fix header
* Optimize variant_end
* Enabling projection
* Fixing local vcf without compression and gcs reads optimization
* Fixing local vcf reading with no compression
* Describe VCF
* fix: Tag case sensitive
* add performance/time measurement for batch processing vcf with noodles
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* add retry mechanism and adjust chunk size along with minimal concurrent fetches
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* Refactor scan to separate projected schema computation and use Field::nullable flag
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* propagate builders errors
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* Optimize OptionalField::new() to use with_capacity
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* Improve info_to_arrow_type logic
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* refactor format fields and cleanup code
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* complete bgzf compressed files format ingestion tests
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* add bgzf test in similar format to test_noodles.rs
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* add docker-compose for testing iceberg
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* add simple github workflows CI
  Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
* Cleanup a few warnings
* Bump runner image

---------

Signed-off-by: Piotr Dębski <ppdebski@interia.eu>
Co-authored-by: Piotr Dębski <ppdebski@interia.eu>
1 parent b550a93 commit 6cb242c

17 files changed, +6283 -2 lines changed

.github/workflows/ci.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  build-test:
+    runs-on: ubuntu-22.04
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v2
+        with:
+          submodules: "recursive"
+          fetch-depth: 1
+
+      - name: Setup Rust
+        uses: actions-rust-lang/setup-rust-toolchain@v1
+        with:
+          toolchain: '1.85.0'
+          components: 'clippy, rustfmt'
+
+      - name: Cache Cargo registry and build
+        uses: actions/cache@v3
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            datafusion/vcf/target
+          key: ${{ runner.os }}-cargo-${{ hashFiles('datafusion/vcf/Cargo.lock') }}
+          restore-keys: |
+            ${{ runner.os }}-cargo-
+
+      - name: Check formatting
+        working-directory: datafusion/vcf
+        run: cargo fmt --all -- --check
+
+      - name: Run tests
+        working-directory: datafusion/vcf
+        run: cargo test

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -18,4 +18,4 @@ Cargo.lock
 # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/

.pre-commit-config.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+repos:
+  - repo: https://github.com/doublify/pre-commit-rust
+    rev: v1.0
+    hooks:
+      - id: fmt
+        args: ["--all", "--"]
+      - id: cargo-check
+
