Skip to content

Error in drawing gene annotation #57

@Rong-ao

Description

@Rong-ao

Hi @m-jahn thanks for your great package.
This issue is not so much an error report but several suggestions. :)

I was trying to draw coverage plots of ChIP-seq data with ggcoverage. My genome version is GCA_000001405.28_GRCh38.p13_genomic.gtf that is downloaded from NCBI. Here I got the first problem because chromosome names in NCBI is not same as the ones in UCSC:

Error in stop_if_wrong_length("'seqnames'", ans_len) : 
  'seqnames' must have the length of the object to construct
  (1) or length 1
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

I had to solve them manually with the code below:

gtf_gr <- rtracklayer::import.gff3(con = "GCF_000001405.39_GRCh38.p13_genomic.gtf", format = "gtf")

ncbi_to_ucsc <- c(
  "NC_000001.11" = "chr1",
  "NC_000002.12" = "chr2",
  "NC_000003.12" = "chr3",
  "NC_000004.12" = "chr4",
  "NC_000005.10" = "chr5",
  "NC_000006.12" = "chr6",
  "NC_000007.14" = "chr7",
  "NC_000008.11" = "chr8",
  "NC_000009.12" = "chr9",
  "NC_000010.11" = "chr10",
  "NC_000011.10" = "chr11",
  "NC_000012.12" = "chr12",
  "NC_000013.11" = "chr13",
  "NC_000014.9"  = "chr14",
  "NC_000015.10" = "chr15",
  "NC_000016.10" = "chr16",
  "NC_000017.11" = "chr17",
  "NC_000018.10" = "chr18",
  "NC_000019.10" = "chr19",
  "NC_000020.11" = "chr20",
  "NC_000021.9"  = "chr21",
  "NC_000022.11" = "chr22",
  "NC_000023.11" = "chrX",
  "NC_000024.10" = "chrY",
  "NC_012920.1"  = "chrM"
)

ncbi_to_ucsc <- c(ncbi_to_ucsc, 
                  setNames(paste0("chrUn-", seqlevels(gtf_gr)[grep("(^NT_|^NW_)", seqlevels(gtf_gr))]),
                           seqlevels(gtf_gr)[grep("(^NT_|^NW_)", seqlevels(gtf_gr))]))
new_seqlevels <- ncbi_to_ucsc[seqlevels(gtf_gr)]
seqlevels(gtf_gr) <- unname(new_seqlevels[seqlevels(gtf_gr)])

Besides, gff3 files now are not offering metadata column "gene_name" and "gene_type" that are essential for ggcoverage::geom_gene(), replacing with "gene", "gene_id", "type" and something like that. It will return an error "gene_name and gene_type not provided", which should be solved by code below:

gtf_gr$gene_name <- gtf_gr$gene_id
gtf_gr$gene_type <- gtf_gr$type

And I got the third error when adding gene annotation at the bottom of peaks:

Error in `dplyr::filter()`:
! Can't transform a data frame with `NA` or `""` names.
Run `rlang::last_trace()` to see where the error occurred.

For solving this issue, I manually checked the metadata columns of gtf_gr but no NA or no-name columns appeared. Finally the clue was found in source code https://github.com/showteeth/ggcoverage/blob/93249019064b8ec66d7723396b1f21141d4a1b0b/R/geom_gene.R#L133C74-L133C74

  # gtf.gr <- rtracklayer::import.gff(gtf.file,format = 'gtf')
  gtf.df.used <- IRanges::subsetByOverlaps(x = gtf.gr, ranges = plot.range.gr) %>% as.data.frame()
  # check information
  used.gtf.columns <- c("seqnames", "start", "end", "strand", "type", "gene_name")
  gene.type.names <- c("gene_type", "gene_biotype")
  if (all(used.gtf.columns %in% colnames(gtf.df.used))) {
    used.gene.type.name <- intersect(gene.type.names, colnames(gtf.df.used))
    if (length(used.gene.type.name) < 1) {
      stop("gene_type/gene_biotype is not in provided GTF file!")
    } else {
      # select used features
      gene.info.used <- gtf.df.used %>%
        dplyr::filter(type %in% c("gene", "exon", "UTR")) %>%
        dplyr::select(c("seqnames", "start", "end", "strand", "type", used.gene.type.name, "gene_name"))
    }
    colnames(gene.info.used) <- c("seqnames", "start", "end", "strand", "type", "gene_type", "gene_name")
  } else {
    used.unique <- setdiff(c(used.gtf.columns, gene.type.names), colnames(gtf.df.used))
    stop(paste0(paste0(used.unique, collapse = ", "), " is not in provided GTF file!"))
  }

In this function, variable used.gene.type.name is a two element vector c("gene_type", "gene_biotype") but one element in your example code c("gene_type"), which resulted in one more column in gene.info.used in my data than your example data.
Thus the problem came in the next line of code colnames(gene.info.used) <- c("seqnames", "start", "end", "strand", "type", "gene_type", "gene_name"). Here you selected seven columns, making my 8th column a NA name and gene-id content column, which caused the error above. So I have to remove the biotype column to make code work:

gtf_gr$gene_biotype <- NULL

There's no more errors in my attampt. Thanks again! :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions