TASK-5564 - Update data sources for CellBase 6.2 #696
base: develop
…aded, #TASK-5575, #TASK-5564
…loaded, #TASK-5575, #TASK-5564
…, #TASK-5575, #TASK-5564
…es, #TASK-5576, #TASK-5564
…fixing sonar issues, #TASK-5576, #TASK-5564
…peats builder, #TASK-5576, #TASK-5564
…tion builder, #TASK-5576, #TASK-5564
…s (e.g., mmusculus), #TASK-6426, #TASK-5564
…on file for species (e.g., mmusculus), and update the variant downloader according to these changes, #TASK-6426, #TASK-5564
…by the different data, e.g., repeats, #TASK-6142, #TASK-5564
- Re-using the function loadJsonFile
- Adding a MongoDB index for the genome info collection
- Adding log messages
…revious changes, #TASK-6142, #TASK-5564
… files, and rename some constants, #TASK-5776, #TASK-5564
…atest changes, #TASK-6142, #TASK-5564
TASK-7809 - Upgrade avro version from 1.9.1 to 1.11.4
And fix checkstyle after merging
…from the variation processing, #TASK-5564
…query, #TASK-5564
…ure releases, #TASK-5564
Pull request overview
This pull request updates data sources for CellBase 6.2, involving a significant refactoring of the builder infrastructure and modernization of clinical variant processing.
Key Changes:
- Refactored builder class hierarchy by replacing `CellBaseBuilder` with `AbstractBuilder` as the base class
- Updated clinical variant indexers to handle new data formats and sources (ClinVar, COSMIC, CIViC, GWAS)
- Added new builders for polygenic scores (PGS Catalog)
- Enhanced gene annotation with additional data sources (imprinted genes, gene fusions from ChimerDB, gnomAD constraints)
- Updated data source versions and file formats (e.g., UniProt to version 202502, new ClinVar XML structure)
Reviewed changes
Copilot reviewed 107 out of 226 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| CosmicBuilder.java | Changed parent class from CellBaseBuilder to AbstractBuilder |
| ClinicalVariantBuilder.java | Major refactoring: added file validation, updated to process new ClinVar/COSMIC/CIViC/GWAS formats with version tracking |
| ClinicalIndexer.java | Added version/assembly fields, updated normalization config, removed inner SequenceLocation class, added constant for original property ID |
| ClinVarParser.java | Changed parent class from CellBaseBuilder to AbstractBuilder |
| ClinVarIndexer.java | Updated to handle new ClinVar format with version parameter, added import for SequenceLocation, improved error handling |
| CivicIndexerCallback.java | New file implementing CIViC data indexing callback with evidence entry creation |
| CivicIndexer.java | New file for CIViC data source integration |
| VariationBuilder.java | Complete rewrite to handle VCF files for non-human species |
| SpliceBuilder.java | Updated parent class and constant references |
| RocksDbManager.java | Added methods for gene imprinting and gene fusion data retrieval, added missing imports |
| RevelScoreBuilder.java | Updated to use new data model ProteinSubstitutionPrediction, enhanced error handling |
| RepeatsBuilder.java | Added configuration-based file validation and support for multiple repeat data sources |
| RegulatoryRegionBuilder.java | File deleted (deprecated) |
| RegulatoryFeatureBuilder.java | Complete rewrite with PFM matrix download and new file format handling |
| RefSeqGeneBuilderIndexer.java | Simplified to delegate to common gene builder methods |
| RefSeqGeneBuilder.java | Major refactoring with configuration-based file validation and improved indexing |
| PubMedBuilder.java | Enhanced with configuration-based validation and improved logging |
| ProteinBuilder.java | Updated to UniProt 202502 format with InterPro integration and chunk processing |
| PolygenicScoreBuilder.java | New file for PGS Catalog data processing |
| OntologyBuilder.java | Refactored with configuration-based file validation |
| MiRTarBaseIndexer.java | New file extracting miRTarBase indexing logic |
| InteractionBuilder.java | Changed parent class to AbstractBuilder |
| GenomeSequenceFastaBuilder.java | Updated parent class and improved logging |
| GeneExpressionAtlasBuilder.java | Changed parent class to AbstractBuilder |
| GeneBuilderUtils.java | File deleted (deprecated) |
| GeneBuilderIndexer.java | Extensive additions for constraints, imprinted genes, gene fusions, and ChimerDB integration |
| GeneBuilder.java | Complete rewrite delegating to Ensembl and RefSeq gene builders |
| DbSnpBuilder.java | Updated constant name and parent class |
| CellBaseBuilder.java | File deleted (replaced by AbstractBuilder) |
| CaddAllAnnotationBuilder.java | Changed parent class to AbstractBuilder |
| pom.xml | Version bump to 6.7.0-SNAPSHOT, added dependencies for commons-compress and commons-csv |
| if (checked) { | ||
| return; | ||
| } |
Copilot
AI
Dec 19, 2025
The boolean field `checked` is referenced but not declared in this class. It should be declared as a class field (e.g., `private boolean checked = false;`) or inherited from `AbstractBuilder`.
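For illustration, here is a minimal sketch of the "inherited from the base class" option. The class names `SketchAbstractBuilder` and `SketchGeneBuilder` and the `doCheck()` hook are hypothetical and are not taken from the PR; the real `AbstractBuilder` may declare and use the flag differently.

```java
// Hypothetical stand-in for the PR's AbstractBuilder; all names here are assumptions.
abstract class SketchAbstractBuilder {
    // Declared once in the base class so every subclass shares the same guard flag.
    protected boolean checked = false;

    // Subclasses put their input-file validation here.
    protected abstract void doCheck();

    // Idempotent entry point: the validation body runs at most once.
    public final void check() {
        if (checked) {
            return;
        }
        doCheck();
        checked = true;
    }
}

// Hypothetical subclass; it inherits `checked` instead of declaring it again.
class SketchGeneBuilder extends SketchAbstractBuilder {
    int validations = 0; // counts how many times the validation body ran

    @Override
    protected void doCheck() {
        validations++;
    }
}
```

With this layout, calling `check()` twice on the same `SketchGeneBuilder` runs the validation body only once, which is the behavior the `if (checked) { return; }` guard appears to be aiming for.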
| if (1 == 1) { | ||
| CellBaseJsonFileSerializer refSeqGeneSerializer = new CellBaseJsonFileSerializer(buildPath, REFSEQ_GENE_BASENAME); | ||
| this.refSeqGeneBuilder = new RefSeqGeneBuilder(downloadPath.resolve(REFSEQ_DATA), speciesConfiguration, configuration, | ||
| refSeqGeneSerializer); | ||
| } |
Copilot
AI
Dec 19, 2025
The condition `if (1 == 1)` is always true and suggests incomplete or placeholder code. Either replace it with a proper condition or make the block unconditional.
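The two fixes named in this comment can be sketched as follows. This is a simplified illustration, not the PR's code: `SketchGeneBuildPlan`, the `refSeqSupported` flag, and the step names are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; class, method, and flag names are assumptions.
class SketchGeneBuildPlan {
    // Option 1: replace the constant condition with a named, meaningful one.
    static List<String> plan(boolean refSeqSupported) {
        List<String> steps = new ArrayList<>();
        steps.add("ensembl");
        if (refSeqSupported) { // was: if (1 == 1)
            steps.add("refseq");
        }
        return steps;
    }

    // Option 2: if the step must always run, drop the guard entirely.
    static List<String> planUnconditional() {
        List<String> steps = new ArrayList<>();
        steps.add("ensembl");
        steps.add("refseq");
        return steps;
    }
}
```

Which option fits depends on whether the RefSeq step is meant to be optional for some species. If it is, a flag derived from the configuration keeps that intent visible in the code; if it is not, the unconditional form is shorter and avoids a static-analysis warning.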
| if (1 == 1) { | ||
| refSeqGeneBuilder.check(); | ||
| } |
Copilot
AI
Dec 19, 2025
The condition `if (1 == 1)` is always true and suggests incomplete or placeholder code. Either replace it with a proper condition or make the block unconditional.
| if (1 == 1) { | ||
| if (!Files.exists(downloadPath.resolve(REFSEQ_DATA).resolve(REFSEQ_GENE_OUTPUT_FILENAME))) { | ||
| refSeqGeneBuilder.parse(); | ||
| } else { | ||
| tabixReader = new TabixReader(tfbsFile.toAbsolutePath().toString(), tabixFile.toAbsolutePath().toString()); | ||
| } | ||
|
|
||
| // Preparing the fasta file for fast accessing | ||
| // System.out.println("genomeSequenceFilePath.toString() = " + genomeSequenceFilePath.toString()); | ||
| FastaIndex fastaIndex = new FastaIndex(genomeSequenceFilePath); | ||
|
|
||
| // Empty transcript and exon dictionaries | ||
| transcriptDict.clear(); | ||
| exonDict.clear(); | ||
| logger.info("Parsing gtf..."); | ||
| GtfReader gtfReader = new GtfReader(gtfFile); | ||
|
|
||
| // Gene->Transcript->Feature->GTF line | ||
| Map<String, Map<String, Map<String, Object>>> gtfMap = null; | ||
| if (flexibleGTFParsing) { | ||
| gtfMap = loadGTFMap(gtfReader); | ||
| initializePointers(gtfMap); | ||
| } | ||
|
|
||
| Gtf gtf; | ||
| while ((gtf = getGTFEntry(gtfReader, gtfMap)) != null) { | ||
|
|
||
| if (gtf.getFeature().equals("gene") || gtf.getFeature().equals("transcript") | ||
| || gtf.getFeature().equals("UTR") || gtf.getFeature().equals("Selenocysteine")) { | ||
| continue; | ||
| } | ||
|
|
||
| String geneId = gtf.getAttributes().get("gene_id"); | ||
| String transcriptId = gtf.getAttributes().get("transcript_id"); | ||
| String geneName = gtf.getAttributes().get("gene_name"); | ||
| if (newGene(gene, geneId)) { | ||
| // If new geneId is different from the current then we must serialize before data new gene | ||
| if (gene != null) { | ||
| serializer.serialize(gene); | ||
| } | ||
|
|
||
| GeneAnnotation geneAnnotation = new GeneAnnotation(indexer.getExpression(geneId), indexer.getDiseases(geneName), | ||
| indexer.getDrugs(geneName), indexer.getConstraints(geneId), indexer.getMirnaTargets(geneName), | ||
| indexer.getCancerGeneCensus(geneName), indexer.getCancerHotspot(geneName)); | ||
|
|
||
| gene = new Gene(geneId, geneName, gtf.getSequenceName().replaceFirst("chr", ""), | ||
| gtf.getStart(), gtf.getEnd(), gtf.getStrand(), gtf.getAttributes().get("gene_version"), | ||
| gtf.getAttributes().get("gene_biotype"), "KNOWN", SOURCE, indexer.getDescription(geneId), | ||
| new ArrayList<>(), indexer.getMirnaGene(transcriptId), geneAnnotation); | ||
| } | ||
|
|
||
| // Check if Transcript exist in the Gene Set of transcripts | ||
| if (!transcriptDict.containsKey(transcriptId)) { | ||
| transcript = getTranscript(gene, indexer, tabixReader, gtf, transcriptId); | ||
| } else { | ||
| transcript = gene.getTranscripts().get(transcriptDict.get(transcriptId)); | ||
| } | ||
|
|
||
| // At this point gene and transcript objects are set up | ||
| // Update gene and transcript genomic coordinates, start must be the | ||
| // lower, and end the higher | ||
| updateTranscriptAndGeneCoords(transcript, gene, gtf); | ||
|
|
||
| String transcriptIdWithoutVersion = transcript.getId().split("\\.")[0]; | ||
| if (gtf.getFeature().equalsIgnoreCase("exon")) { | ||
| // Obtaining the exon sequence | ||
| String exonId = gtf.getAttributes().get("exon_id") + "." + gtf.getAttributes().get("exon_version"); | ||
| String exonSequence = fastaIndex.query(gtf.getSequenceName(), gtf.getStart(), gtf.getEnd()); | ||
|
|
||
| exon = new Exon(exonId, gtf.getSequenceName().replaceFirst("chr", ""), | ||
| gtf.getStart(), gtf.getEnd(), gtf.getStrand(), 0, 0, 0, 0, 0, 0, -1, Integer.parseInt(gtf | ||
| .getAttributes().get("exon_number")), exonSequence); | ||
| transcript.getExons().add(exon); | ||
|
|
||
| exonDict.put(transcriptIdWithoutVersion + "_" + exon.getExonNumber(), exon); | ||
| if (gtf.getAttributes().get("exon_number").equals("1")) { | ||
| cdna = 1; | ||
| cds = 1; | ||
| } else { | ||
| // with every exon we update cDNA length with the previous exon length | ||
| cdna += exonDict.get(transcriptIdWithoutVersion + "_" + (exon.getExonNumber() - 1)).getEnd() | ||
| - exonDict.get(transcriptIdWithoutVersion + "_" + (exon.getExonNumber() - 1)).getStart() + 1; | ||
| } | ||
| } else { | ||
| exon = exonDict.get(transcriptIdWithoutVersion + "_" + exon.getExonNumber()); | ||
| if (gtf.getFeature().equalsIgnoreCase("CDS")) { | ||
| // Protein ID is only present in CDS lines | ||
| String proteinId = gtf.getAttributes().get("protein_id") != null | ||
| ? gtf.getAttributes().get("protein_id") + "." + gtf.getAttributes().get("protein_version") | ||
| : ""; | ||
| transcript.setProteinId(proteinId); | ||
| transcript.setProteinSequence(indexer.getProteinFasta(proteinId)); | ||
|
|
||
| if (gtf.getStrand().equals("+") || gtf.getStrand().equals("1")) { | ||
| // CDS states the beginning of coding start | ||
| exon.setGenomicCodingStart(gtf.getStart()); | ||
| exon.setGenomicCodingEnd(gtf.getEnd()); | ||
|
|
||
| // cDNA coordinates | ||
| exon.setCdnaCodingStart(gtf.getStart() - exon.getStart() + cdna); | ||
| exon.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna); | ||
| // Set cdnaCodingEnd to prevent those cases without stop_codon | ||
|
|
||
| transcript.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna); | ||
| exon.setCdsStart(cds); | ||
| exon.setCdsEnd(gtf.getEnd() - gtf.getStart() + cds); | ||
|
|
||
| // increment in the coding length | ||
| cds += gtf.getEnd() - gtf.getStart() + 1; | ||
| transcript.setCdsLength(cds - 1); // Set cdnaCodingEnd to prevent those cases without stop_codon | ||
|
|
||
| exon.setPhase(Integer.parseInt(gtf.getFrame())); | ||
|
|
||
| if (transcript.getGenomicCodingStart() == 0 || transcript.getGenomicCodingStart() > gtf.getStart()) { | ||
| transcript.setGenomicCodingStart(gtf.getStart()); | ||
| } | ||
| if (transcript.getGenomicCodingEnd() == 0 || transcript.getGenomicCodingEnd() < gtf.getEnd()) { | ||
| transcript.setGenomicCodingEnd(gtf.getEnd()); | ||
| } | ||
| // only first time | ||
| if (transcript.getCdnaCodingStart() == 0) { | ||
| transcript.setCdnaCodingStart(gtf.getStart() - exon.getStart() + cdna); | ||
| } | ||
| // strand - | ||
| } else { | ||
| // CDS states the beginning of coding start | ||
| exon.setGenomicCodingStart(gtf.getStart()); | ||
| exon.setGenomicCodingEnd(gtf.getEnd()); | ||
| // cDNA coordinates | ||
| // cdnaCodingStart points to the same base position than genomicCodingEnd | ||
| exon.setCdnaCodingStart(exon.getEnd() - gtf.getEnd() + cdna); | ||
| // cdnaCodingEnd points to the same base position than genomicCodingStart | ||
| exon.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna); | ||
| // Set cdnaCodingEnd to prevent those cases without stop_codon | ||
| transcript.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna); | ||
| exon.setCdsStart(cds); | ||
| exon.setCdsEnd(gtf.getEnd() - gtf.getStart() + cds); | ||
|
|
||
| // increment in the coding length | ||
| cds += gtf.getEnd() - gtf.getStart() + 1; | ||
| transcript.setCdsLength(cds - 1); // Set cdnaCodingEnd to prevent those cases without stop_codon | ||
| exon.setPhase(Integer.parseInt(gtf.getFrame())); | ||
|
|
||
| if (transcript.getGenomicCodingStart() == 0 || transcript.getGenomicCodingStart() > gtf.getStart()) { | ||
| transcript.setGenomicCodingStart(gtf.getStart()); | ||
| } | ||
| if (transcript.getGenomicCodingEnd() == 0 || transcript.getGenomicCodingEnd() < gtf.getEnd()) { | ||
| transcript.setGenomicCodingEnd(gtf.getEnd()); | ||
| } | ||
| // only first time | ||
| if (transcript.getCdnaCodingStart() == 0) { | ||
| // cdnaCodingStart points to the same base position than genomicCodingEnd | ||
| transcript.setCdnaCodingStart(exon.getEnd() - gtf.getEnd() + cdna); | ||
| } | ||
| } | ||
|
|
||
| } | ||
| // if (gtf.getFeature().equalsIgnoreCase("start_codon")) { | ||
| // // nothing to do | ||
| // System.out.println("Empty block, this should be redesigned"); | ||
| // } | ||
| if (gtf.getFeature().equalsIgnoreCase("stop_codon")) { | ||
| // setCdnaCodingEnd = false; // stop_codon found, cdnaCodingEnd will be set here, | ||
| // no need to set it at the beginning of next feature | ||
| if (exon.getStrand().equals("+")) { | ||
| updateStopCodingDataPositiveExon(exon, cdna, cds, gtf); | ||
|
|
||
| cds += gtf.getEnd() - gtf.getStart(); | ||
| // If stop_codon appears, overwrite values | ||
| transcript.setGenomicCodingEnd(gtf.getEnd()); | ||
| transcript.setCdnaCodingEnd(gtf.getEnd() - exon.getStart() + cdna); | ||
| transcript.setCdsLength(cds - 1); | ||
|
|
||
| } else { | ||
| updateNegativeExonCodingData(exon, cdna, cds, gtf); | ||
|
|
||
| cds += gtf.getEnd() - gtf.getStart(); | ||
| // If stop_codon appears, overwrite values | ||
| transcript.setGenomicCodingStart(gtf.getStart()); | ||
| // cdnaCodingEnd points to the same base position than genomicCodingStart | ||
| transcript.setCdnaCodingEnd(exon.getEnd() - gtf.getStart() + cdna); | ||
| transcript.setCdsLength(cds - 1); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // last gene must be serialized | ||
| serializer.serialize(gene); | ||
|
|
||
| // cleaning | ||
| gtfReader.close(); | ||
| serializer.close(); | ||
| fastaIndex.close(); | ||
| indexer.close(); | ||
| } catch (Exception e) { | ||
| indexer.close(); | ||
| throw e; | ||
| } | ||
| } | ||
|
|
||
| private Transcript getTranscript(Gene gene, EnsemblGeneBuilderIndexer indexer, TabixReader tabixReader, Gtf gtf, String transcriptId) | ||
| throws IOException, RocksDBException { | ||
| Map<String, String> gtfAttributes = gtf.getAttributes(); | ||
|
|
||
| // To match Ensembl, we set the ID as transcript+version. This also matches the Ensembl website. | ||
| String transcriptIdWithVersion = transcriptId + "." + gtfAttributes.get("transcript_version"); | ||
| String biotype = gtfAttributes.get("transcript_biotype") != null ? gtfAttributes.get("transcript_biotype") : ""; | ||
| String transcriptChromosome = gtf.getSequenceName().replaceFirst("chr", ""); | ||
| List<TranscriptTfbs> transcriptTfbses = getTranscriptTfbses(gtf, transcriptChromosome, tabixReader); | ||
|
|
||
| List<FeatureOntologyTermAnnotation> ontologyAnnotations = getOntologyAnnotations(indexer.getXrefs(transcriptId), indexer); | ||
| TranscriptAnnotation transcriptAnnotation = new TranscriptAnnotation(ontologyAnnotations, indexer.getConstraints(transcriptId)); | ||
|
|
||
| Transcript transcript = new Transcript(transcriptIdWithVersion, gtfAttributes.get("transcript_name"), transcriptChromosome, | ||
| gtf.getStart(), gtf.getEnd(), gtf.getStrand(), biotype, "KNOWN", | ||
| 0, 0, 0, 0, 0, | ||
| indexer.getCdnaFasta(transcriptIdWithVersion), "", "", "", | ||
| gtfAttributes.get("transcript_version"), SOURCE, new ArrayList<>(), indexer.getXrefs(transcriptId), transcriptTfbses, | ||
| new HashSet<>(), transcriptAnnotation); | ||
|
|
||
| // Adding Ids appearing in the GTF to the xrefs is required, since for some unknown reason the ENSEMBL | ||
| // Perl API often doesn't return all genes resulting in an incomplete xrefs.txt file. We must ensure | ||
| // that the xrefs array contains all ids present in the GTF file | ||
| addGtfXrefs(transcript, gene, gtfAttributes); | ||
|
|
||
| // Add HGNC ID mappings, with this we can know which Ensembl and Refseq transcripts match to HGNC ID | ||
| String hgncId = indexer.getHgncId(gene.getName()); | ||
| if (StringUtils.isNotEmpty(hgncId)) { | ||
| transcript.getXrefs().add(new Xref(hgncId, "hgnc_id", "HGNC ID")); | ||
| } | ||
|
|
||
| // Add MANE Select mappings, with this we can know which Ensembl and Refseq transcripts match according to MANE | ||
| for (String suffix: Arrays.asList("refseq", "refseq_protein")) { | ||
| String maneRefSeq = indexer.getMane(transcriptIdWithVersion, suffix); | ||
| if (StringUtils.isNotEmpty(maneRefSeq)) { | ||
| transcript.getXrefs().add(new Xref(maneRefSeq, "mane_select_" + suffix, | ||
| "MANE Select RefSeq" + (suffix.contains("_") ? " Protein" : ""))); | ||
| logger.info(DATA_ALREADY_BUILT, getDataName(REFSEQ_DATA) + " gene"); | ||
| } | ||
| } |
Copilot
AI
Dec 19, 2025
The condition `if (1 == 1)` is always true and suggests incomplete or placeholder code. Either replace it with a proper condition or make the block unconditional.
| if (1 == 1) { | ||
| if (isHSapiens || isDataSupported(configuration.getDownload().getManeSelect(), prefixId)) { | ||
| dataList.add(MANE_SELECT_DATA); | ||
| } |
Copilot
AI
Dec 19, 2025
The condition `if (1 == 1)` is always true and suggests incomplete or placeholder code. Either replace it with a proper condition or make the block unconditional.