-
Notifications
You must be signed in to change notification settings - Fork 23
Description
Hi @m-jahn thanks for your great package.
This issue is not so much an error report but several suggestions. :)
I was trying to draw coverage plots of ChIP-seq data with ggcoverage. My genome version is GCA_000001405.28_GRCh38.p13_genomic.gtf that is downloaded from NCBI. Here I got the first problem because chromosome names in NCBI is not same as the ones in UCSC:
Error in stop_if_wrong_length("'seqnames'", ans_len) :
'seqnames' must have the length of the object to construct
(1) or length 1
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
I had to solve them manually with the code below:
gtf_gr <- rtracklayer::import.gff3(con = "GCF_000001405.39_GRCh38.p13_genomic.gtf", format = "gtf")
ncbi_to_ucsc <- c(
"NC_000001.11" = "chr1",
"NC_000002.12" = "chr2",
"NC_000003.12" = "chr3",
"NC_000004.12" = "chr4",
"NC_000005.10" = "chr5",
"NC_000006.12" = "chr6",
"NC_000007.14" = "chr7",
"NC_000008.11" = "chr8",
"NC_000009.12" = "chr9",
"NC_000010.11" = "chr10",
"NC_000011.10" = "chr11",
"NC_000012.12" = "chr12",
"NC_000013.11" = "chr13",
"NC_000014.9" = "chr14",
"NC_000015.10" = "chr15",
"NC_000016.10" = "chr16",
"NC_000017.11" = "chr17",
"NC_000018.10" = "chr18",
"NC_000019.10" = "chr19",
"NC_000020.11" = "chr20",
"NC_000021.9" = "chr21",
"NC_000022.11" = "chr22",
"NC_000023.11" = "chrX",
"NC_000024.10" = "chrY",
"NC_012920.1" = "chrM"
)
ncbi_to_ucsc <- c(ncbi_to_ucsc,
setNames(paste0("chrUn-", seqlevels(gtf_gr)[grep("(^NT_|^NW_)", seqlevels(gtf_gr))]),
seqlevels(gtf_gr)[grep("(^NT_|^NW_)", seqlevels(gtf_gr))]))
new_seqlevels <- ncbi_to_ucsc[seqlevels(gtf_gr)]
seqlevels(gtf_gr) <- unname(new_seqlevels[seqlevels(gtf_gr)])
Besides, gff3 files now are not offering metadata column "gene_name" and "gene_type" that are essential for ggcoverage::geom_gene(), replacing with "gene", "gene_id", "type" and something like that. It will return an error "gene_name and gene_type not provided", which should be solved by code below:
gtf_gr$gene_name <- gtf_gr$gene_id
gtf_gr$gene_type <- gtf_gr$type
And I got the third error when adding gene annotation at the bottom of peaks:
Error in `dplyr::filter()`:
! Can't transform a data frame with `NA` or `""` names.
Run `rlang::last_trace()` to see where the error occurred.
For solving this issue, I manually checked the metadata columns of gtf_gr but no NA or no-name columns appeared. Finally the clue was found in source code https://github.com/showteeth/ggcoverage/blob/93249019064b8ec66d7723396b1f21141d4a1b0b/R/geom_gene.R#L133C74-L133C74
# gtf.gr <- rtracklayer::import.gff(gtf.file,format = 'gtf')
gtf.df.used <- IRanges::subsetByOverlaps(x = gtf.gr, ranges = plot.range.gr) %>% as.data.frame()
# check information
used.gtf.columns <- c("seqnames", "start", "end", "strand", "type", "gene_name")
gene.type.names <- c("gene_type", "gene_biotype")
if (all(used.gtf.columns %in% colnames(gtf.df.used))) {
used.gene.type.name <- intersect(gene.type.names, colnames(gtf.df.used))
if (length(used.gene.type.name) < 1) {
stop("gene_type/gene_biotype is not in provided GTF file!")
} else {
# select used features
gene.info.used <- gtf.df.used %>%
dplyr::filter(type %in% c("gene", "exon", "UTR")) %>%
dplyr::select(c("seqnames", "start", "end", "strand", "type", used.gene.type.name, "gene_name"))
}
colnames(gene.info.used) <- c("seqnames", "start", "end", "strand", "type", "gene_type", "gene_name")
} else {
used.unique <- setdiff(c(used.gtf.columns, gene.type.names), colnames(gtf.df.used))
stop(paste0(paste0(used.unique, collapse = ", "), " is not in provided GTF file!"))
}
In this function, variable used.gene.type.name is a two element vector c("gene_type", "gene_biotype") but one element in your example code c("gene_type"), which resulted in one more column in gene.info.used in my data than your example data.
Thus the problem came in the next line of code colnames(gene.info.used) <- c("seqnames", "start", "end", "strand", "type", "gene_type", "gene_name"). Here you selected seven columns, making my 8th column a NA name and gene-id content column, which caused the error above. So I have to remove the biotype column to make code work:
gtf_gr$gene_biotype <- NULL
There's no more errors in my attampt. Thanks again! :)