Add support for Cell Ranger #101
Merged
46 commits
8a285bc Use efetch directory with -id instead of esearch (arteymix)
4a35d2b Use conda-incubator/setup-miniconda (arteymix)
6b21de9 Fix missing GemmaTaskMixin import (arteymix)
8ad0e47 Skip checking GemmaDatasetHasBatch since it requires credentials (arteymix)
960c2a3 Add support for single-cell RNA-Seq datasets (arteymix)
749eea6 Ignore SRA runs that do not contain transcriptomic RNA-Seq data (arteymix)
1d0932d Parse the --readTypes option (arteymix)
bb11982 Improve and fix logging for extracting SRA metadata (arteymix)
7861dcc Validate SRA metadata by reading it prior to writing it to disk (arteymix)
0035f85 Do not open the browser in Google OAuth flow (arteymix)
3e88020 Add support for 10x BAM submissions to SRA (arteymix)
819368a Update Python to 3.12 (arteymix)
9310e18 Improvements for local source (arteymix)
f4d0588 fixup! Add support for single-cell RNA-Seq datasets (arteymix)
46cd60e Mark test data as generated (arteymix)
77f595b Add missing test data file (arteymix)
1396137 Fix Makefile (arteymix)
29a3cfb Replace luigi-wrapper with a simple CLI tool (arteymix)
cb32904 Skip fac-sorted dataset test since it's not public (arteymix)
f8df3d2 sra: Cache BAM headers (arteymix)
0757134 Delete organized single-cell data implement remove() to DownloadRunTa… (arteymix)
f4877ef Fix double-printing of the task summary (arteymix)
c31b580 Use the new RNASEQ_PIPELINE_REPORT file type (arteymix)
1e25051 Add wrapped tools (arteymix)
3428a5d Add a task to reorganize a split experiment (arteymix)
32139a5 Remove unused ALIGNQCDIR (arteymix)
5fb0325 Rename output files of bamtofastq not ending in '_001.fastq.gz' (arteymix)
5e8006a Check if read_types is provided when detecting layout (arteymix)
8e3fbaa Rename wrapped tools config section (arteymix)
cfd4a30 More work (arteymix)
ba16abc Add missing test data (arteymix)
31f87bd Make it possible to delete an entire run directory instead of individ… (arteymix)
bd43c52 sra: Include the SRA run identifier when dumping FASTQ files from a BAM (arteymix)
f84c3b1 Reduce the amount of configuration needed for the pipeline (arteymix)
04af7e2 gemma: Add targets for specific QTs existing and use those as target … (arteymix)
a3a7806 Update cutadapt and MultiQC (arteymix)
d8fa72c Remove redundant task definition and add keyword parameters (arteymix)
c9f2945 Migrate to pyproject.toml (arteymix)
dd711b6 Move gsheet and webviewer in optional dependencies (arteymix)
6ba8539 Remove unused IlluminaFastqHeader and CheckAfterCompleteMixin (arteymix)
2e9afee Add a chemistry option to AlignSingleCellSample (arteymix)
916e168 Fix types and imports in tasks.py (arteymix)
251b7b6 fixup! Remove unused IlluminaFastqHeader and CheckAfterCompleteMixin (arteymix)
d5e46ae Fix incorrect logger usage in sra.py (arteymix)
dcadc15 Downgrade warning for no fastq-load.py options to info (arteymix)
592ac4d Remove the timelimit for CellRanger count (arteymix)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,3 +14,5 @@ dependencies: | |
| - star==2.7.3a | ||
| - entrez-direct | ||
| - perl # rsem expects this | ||
| - samtools | ||
| - curl | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,9 +27,12 @@ submit_data_jobs=1 | |
| submit_batch_info_jobs=2 | ||
|
|
||
| [bioluigi] | ||
| scheduler=slurm | ||
| scheduler=local | ||
| scheduler_partition= | ||
| scheduler_extra_args=[] | ||
| # Default tools, override as needed | ||
| #cutadapt_bin=cutadapt | ||
| #cell_ranger_bin=cellranger | ||
|
|
||
| # | ||
| # This section contains the necessary variables for the pipeline execution | ||
|
|
@@ -40,6 +43,7 @@ scheduler_extra_args=[] | |
| OUTPUT_DIR=pipeline-output | ||
| GENOMES=genomes | ||
| REFERENCES=references | ||
| SINGLE_CELL_REFERENCES=references-single-cell | ||
| METADATA=metadata | ||
| DATA=data | ||
| DATAQCDIR=data-qc | ||
|
|
@@ -53,6 +57,13 @@ RSEM_DIR=contrib/RSEM | |
|
|
||
| SLACK_WEBHOOK_URL= | ||
|
|
||
| [rnaseq_pipeline.sources.sra] | ||
| # location where tools like prefetch and fastq-dump will store downloaded SRA files | ||
| # you can get this value with vdb-config -p | ||
| ncbi_public_dir=/cosmos/scratch/ncbi/public | ||
|
Review comment (Member, Author): I've encountered issues with parsing the output of vdb-config, so this is a more robust solution overall. |
||
| samtools_bin=samtools | ||
| bamtofastq_bin=bamtofastq | ||
|
|
||
| [rnaseq_pipeline.gemma] | ||
| cli_bin=gemma-cli | ||
| # values for $JAVA_HOME and $JAVA_OPTS environment variables | ||
|
|
@@ -63,3 +74,6 @@ appdata_dir=/space/gemmaData | |
| human_reference_id=hg38_ncbi | ||
| mouse_reference_id=mm10_ncbi | ||
| rat_reference_id=rn7_ncbi | ||
| human_single_cell_reference_id=refdata-gex-GRCh38-2024-A | ||
| mouse_single_cell_reference_id=refdata-gex-GRCm39-2024-A | ||
| rat_single_cell_reference_id=refdata-gex-mRatBN7-2-2024-A | ||
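For reviewers who want to see how the new settings could be consumed, here is a minimal sketch that reads them with the standard-library `configparser`. The pipeline reads this file through luigi, so this is an illustration only. The helper names `load_config` and `single_cell_reference_id` are hypothetical, and placing the `*_single_cell_reference_id` keys under `[rnaseq_pipeline.gemma]` is an assumption, since the lines between the two hunks are not shown.

```python
import configparser

# Trimmed copy of the values visible in the diff above. The section that holds
# the *_single_cell_reference_id keys is assumed, not confirmed by the diff.
SAMPLE_CFG = """
[rnaseq_pipeline.sources.sra]
ncbi_public_dir=/cosmos/scratch/ncbi/public
samtools_bin=samtools
bamtofastq_bin=bamtofastq

[rnaseq_pipeline.gemma]
human_single_cell_reference_id=refdata-gex-GRCh38-2024-A
mouse_single_cell_reference_id=refdata-gex-GRCm39-2024-A
rat_single_cell_reference_id=refdata-gex-mRatBN7-2-2024-A
"""


def load_config(text):
    """Parse INI-style configuration text (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def single_cell_reference_id(parser, taxon):
    """Look up '<taxon>_single_cell_reference_id' (hypothetical helper)."""
    return parser.get('rnaseq_pipeline.gemma', f'{taxon}_single_cell_reference_id')


if __name__ == '__main__':
    cfg = load_config(SAMPLE_CFG)
    # An explicit ncbi_public_dir avoids having to parse `vdb-config -p` output.
    print(cfg.get('rnaseq_pipeline.sources.sra', 'ncbi_public_dir'))
    print(single_cell_reference_id(cfg, 'mouse'))
```

Reading `ncbi_public_dir` as a plain configuration value is the point of the review comment above: no external tool output has to be parsed.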
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| import tarfile | ||
| import tempfile | ||
| from io import StringIO | ||
| from urllib.parse import urlparse, parse_qs | ||
|
|
||
| import pandas as pd | ||
| import requests | ||
| from tqdm import tqdm | ||
|
|
||
| from rnaseq_pipeline.miniml_utils import collect_geo_samples_info | ||
|
|
||
| ns = {'miniml': 'http://www.ncbi.nlm.nih.gov/geo/info/MINiML'} | ||
|
|
||
| def retrieve_geo_series_miniml_from_ftp(gse): | ||
| res = requests.get(f'https://ftp.ncbi.nlm.nih.gov/geo/series/{gse[:-3]}nnn/{gse}/miniml/{gse}_family.xml.tgz', | ||
| stream=True) | ||
| res.raise_for_status() | ||
| # we need to use a temporary file because Response.raw does not allow seeking | ||
| with tempfile.TemporaryFile() as tmp: | ||
| for chunk in res.iter_content(chunk_size=1024): | ||
| tmp.write(chunk) | ||
| tmp.seek(0) | ||
| with tarfile.open(fileobj=tmp, mode='r:gz') as fin: | ||
| reader = fin.extractfile(f'{gse}_family.xml') | ||
| return reader.read().decode('utf-8') | ||
|
|
||
| def retrieve_geo_series_miniml_from_geo_query(gse): | ||
| res = requests.get('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi', params=dict(acc=gse, form='xml', targ='gsm')) | ||
| res.raise_for_status() | ||
| return res.text | ||
|
|
||
| def fetch_sra_metadata(gse): | ||
| try: | ||
| miniml = retrieve_geo_series_miniml_from_ftp(gse) | ||
| except Exception as e: | ||
| print( | ||
| f'Failed to retrieve MINiML metadata for {gse} from NCBI FTP server will attempt to use GEO directly.', | ||
| e) | ||
| try: | ||
| miniml = retrieve_geo_series_miniml_from_geo_query(gse) | ||
| except Exception as e: | ||
| print(f'Failed to retrieve MINiML metadata for {gse} from GEO query.', e) | ||
| return [] | ||
| try: | ||
| meta = collect_geo_samples_info(StringIO(miniml)) | ||
| except Exception as e: | ||
| print('Failed to parse MINiML from input: ' + miniml[:100], e) | ||
| return [] | ||
| results = [] | ||
| for gsm in meta: | ||
| platform, srx_url = meta[gsm] | ||
| srx = parse_qs(urlparse(srx_url).query)['term'][0] | ||
| results.append((gse, gsm, srx)) | ||
| return results | ||
|
|
||
| with open('geo-sample-to-sra-experiment.tsv', 'wt') as f: | ||
| print('geo_series', 'geo_sample', 'sra_experiment', file=f, sep='\t', flush=True) | ||
| df = pd.read_table('gemma-rnaseq-datasets.tsv') | ||
| batch = [] | ||
| for gse in tqdm(df.geo_accession): | ||
| samples = fetch_sra_metadata(gse) | ||
| for sample in samples: | ||
| print(*sample, file=f, sep='\t', flush=True) | ||
|
|
||
| # def fetch_runinfo(srx_ids): | ||
| # # fetch the SRX metadata | ||
| # return pd.read_csv(StringIO(retrieve_runinfo(srx_ids))) | ||
| # | ||
| # def print_results(samples): | ||
| # srx_ids = [s[2] for s in samples] | ||
| # try: | ||
| # results = fetch_runinfo(srx_ids) | ||
| # except Exception as e: | ||
| # print('Failed to retrieve runinfo for the following SRX IDs:', srx_ids, e) | ||
| # return | ||
| # for sample in samples: | ||
| # runs = results[results['Experiment'] == sample[2]]['Run'] | ||
| # r = sample + ('|'.join(runs), len(runs)) | ||
| # print(*r, sep='\t', flush=True) | ||
| # | ||
| # BATCH_SIZE = 100 |
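The three mechanics the script relies on can be exercised offline. The sketch below is not the PR's code: the function names are hypothetical, and the network download is replaced by an iterable of byte chunks. It shows the `nnn` directory layout used to build the FTP URL, the spooling of a streamed archive to a seekable temporary file before opening it with `tarfile`, and the extraction of the SRX accession from the `term` query parameter.

```python
import io
import tarfile
import tempfile
from urllib.parse import parse_qs, urlparse


def geo_miniml_url(gse):
    """Build the MINiML archive URL with the same layout as the script above."""
    # The series directory replaces the last three characters of the accession with 'nnn'.
    return (f'https://ftp.ncbi.nlm.nih.gov/geo/series/{gse[:-3]}nnn/'
            f'{gse}/miniml/{gse}_family.xml.tgz')


def read_member_from_chunks(chunks, member):
    """Write streamed chunks to a seekable temporary file, then read one tar member."""
    with tempfile.TemporaryFile() as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        tmp.seek(0)  # rewind so tarfile reads from the start of the archive
        with tarfile.open(fileobj=tmp, mode='r:gz') as archive:
            reader = archive.extractfile(member)
            if reader is None:  # the member exists but is not a regular file
                raise FileNotFoundError(member)
            return reader.read().decode('utf-8')


def srx_from_url(srx_url):
    """Return the first value of the 'term' query parameter."""
    return parse_qs(urlparse(srx_url).query)['term'][0]


if __name__ == '__main__':
    # 'GSE123456' is a made-up accession used only to show the URL layout.
    print(geo_miniml_url('GSE123456'))

    # Build a tiny .tgz in memory to stand in for the downloaded archive.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        data = b'<MINiML/>'
        info = tarfile.TarInfo('GSE123456_family.xml')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    payload = buf.getvalue()
    chunks = (payload[i:i + 16] for i in range(0, len(payload), 16))
    print(read_member_from_chunks(chunks, 'GSE123456_family.xml'))
```

The link shape passed to `srx_from_url` (a URL carrying `?term=SRX...`) is inferred from how the script parses it. A link without a `term` parameter raises `KeyError`, which the script above does not catch in its final loop.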
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| [build-system] | ||
| requires = ["setuptools"] | ||
| build-backend = "setuptools.build_meta" | ||
|
|
||
| [tool.mypy] | ||
| plugins = ["luigi.mypy"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| [pytest] | ||
| log_cli=1 | ||
| log_cli_level=warning |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,21 +1,24 @@ | ||
| from typing import Optional | ||
|
|
||
| import luigi | ||
|
|
||
| # see luigi.cfg for details | ||
| class rnaseq_pipeline(luigi.Config): | ||
| task_namespace = '' | ||
|
|
||
| GENOMES = luigi.Parameter() | ||
| GENOMES: str = luigi.Parameter() | ||
|
|
||
| OUTPUT_DIR = luigi.Parameter() | ||
| REFERENCES = luigi.Parameter() | ||
| METADATA = luigi.Parameter() | ||
| DATA = luigi.Parameter() | ||
| DATAQCDIR = luigi.Parameter() | ||
| ALIGNDIR = luigi.Parameter() | ||
| ALIGNQCDIR = luigi.Parameter() | ||
| QUANTDIR = luigi.Parameter() | ||
| BATCHINFODIR = luigi.Parameter() | ||
| OUTPUT_DIR: str = luigi.Parameter() | ||
| REFERENCES: str = luigi.Parameter() | ||
| SINGLE_CELL_REFERENCES: str = luigi.Parameter() | ||
| METADATA: str = luigi.Parameter() | ||
| DATA: str = luigi.Parameter() | ||
| DATAQCDIR: str = luigi.Parameter() | ||
| ALIGNDIR: str = luigi.Parameter() | ||
| ALIGNQCDIR: str = luigi.Parameter() | ||
| QUANTDIR: str = luigi.Parameter() | ||
| BATCHINFODIR: str = luigi.Parameter() | ||
|
|
||
| RSEM_DIR = luigi.Parameter() | ||
| RSEM_DIR: str = luigi.Parameter() | ||
|
|
||
| SLACK_WEBHOOK_URL = luigi.OptionalParameter(default=None) | ||
| SLACK_WEBHOOK_URL: Optional[str] = luigi.OptionalParameter(default=None) |
This should be included immediately in the trunk.