Title: | Detect and Correct Genomic DNA Contamination in RNA-seq Data |
---|---|
Description: | RNA-seq data generated by some library preparation methods, such as rRNA-depletion-based method and the SMART-seq method, might be contaminated by genomic DNA (gDNA), if DNase I disgestion is not performed properly during RNA preparation. CleanUpRNAseq is developed to check if RNA-seq data is suffered from gDNA contamination. If so, it can perform correction for gDNA contamination and reduce false discovery rate of differentially expressed genes. |
Authors: | Haibo Liu [aut, cre] , Kevin O'Connor [ctb], Michelle Kelliher [ctb], Lihua Julie Zhu [aut], Kai Hu [aut] |
Maintainer: | Haibo Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.1.1 |
Built: | 2025-01-19 03:18:36 UTC |
Source: | https://github.com/bioc/CleanUpRNAseq |
Calculate GC content of all genes based on sequences of collapsed exons of each gene.
calc_gene_gc(ensdb_sqlite = NULL, BSgenome = NULL, batch_size = 2000)
calc_gene_gc(ensdb_sqlite = NULL, BSgenome = NULL, batch_size = 2000)
ensdb_sqlite |
A character(1) specifying a path to an SQLite file to store the an object of the ensembldb::EnsDb or an object of the ensembldb::EnsDb. |
BSgenome |
An object of BSgenome::BSgenome. Make sure the chromosome names (aka seqnames) in the BSgenome object match those in the ensembldb::EnsDb specified by ensdb_sqlite. |
batch_size |
An integer(1) vector, specifying how many regions are processed each batch. |
A data frame contains two columns: gc_content and width.
GC contents (proportion) of genes
widths of genes
require("ensembldb") ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl("_alt|fix|hap\\d+", seqnames(ucsc_BSgenome))] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = ucsc_BSgenome, genome_version = "GRCh38", seqname_alias = seqname_alias ) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) gene_gc <- calc_gene_gc( ensdb_sqlite = hs_ensdb_sqlite, BSgenome = bsgenome, batch_size = 2000 )
require("ensembldb") ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl("_alt|fix|hap\\d+", seqnames(ucsc_BSgenome))] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = ucsc_BSgenome, genome_version = "GRCh38", seqname_alias = seqname_alias ) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) gene_gc <- calc_gene_gc( ensdb_sqlite = hs_ensdb_sqlite, BSgenome = bsgenome, batch_size = 2000 )
Calculate GC content of genomic regions
calc_region_gc(region = NULL, BSgenome = NULL, batch_size = 2000)
calc_region_gc(region = NULL, BSgenome = NULL, batch_size = 2000)
region |
A data frame containing the columns: "Chr", "Start", "End",
and "Strand", such as a data frame contained in a sublist named
intergenic_region from the output of the |
BSgenome |
An object of BSgenome::BSgenome. Make sure the chromosome names (aka seqnames) in the BSgenome object match those in the region_saf data frame and the region_gr object. |
batch_size |
An integer(1) vector, specifying how many regions are processed each batch. |
A data.frame contains two columns: gc_content and width.
GC contents (proportion) of genomic regions
widths of genomic regions
ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl( "_alt|fix|hap\\d+", seqnames(ucsc_BSgenome) )] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = BSgenome.Hsapiens.UCSC.hg38, genome_version = "GRCh38", seqname_alias = seqname_alias ) region <- data.frame( GeneID = as.character(seq_len(10)), Chr = rep("1", 10), Start = 100000 * (seq_len(10)), End = 100000 * (seq_len(10)) + 1000, Strand = rep("+", 10) ) gc_contents <- calc_region_gc( region = region, BSgenome = bsgenome, batch_size = 2000 )
ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl( "_alt|fix|hap\\d+", seqnames(ucsc_BSgenome) )] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = BSgenome.Hsapiens.UCSC.hg38, genome_version = "GRCh38", seqname_alias = seqname_alias ) region <- data.frame( GeneID = as.character(seq_len(10)), Chr = rep("1", 10), Start = 100000 * (seq_len(10)), End = 100000 * (seq_len(10)) + 1000, Strand = rep("+", 10) ) gc_contents <- calc_region_gc( region = region, BSgenome = bsgenome, batch_size = 2000 )
Correct DNA contamination considering GC-bias effect on fragment amplification. Intergenic regions are binned based on their GC content ranging from 0 to 100%, with a bin size of 5%. Per gene DNA contamination is estimated as the product of count per base in a GC content matching bin of intergenic regions and the total collapsed exons of a gene and is subtracted away from the gene count matrix.
correct_GC( SummarizedCounts = NULL, gene_gc = NULL, intergenic_gc = NULL, plot = FALSE )
correct_GC( SummarizedCounts = NULL, gene_gc = NULL, intergenic_gc = NULL, plot = FALSE )
SummarizedCounts |
An object of SummarizedCounts.. |
gene_gc |
A data frame or matrix with two columns: gc_content in
proportion between 0 and 1, and width in basepair, containing
GC-content and total exon lengths of each gene. An output of the
|
intergenic_gc |
A data frame or matrix with two columns: gc_content in
proportion between 0 and 1, and width in basepair, containing
GC-content and lengths of each intergenic region, such as an output from
the |
plot |
A logical(1) vector, specifying whether to output a panel of scatter plot showing a bi-variate distribution of GC content and count per base of intergenic regions in each sample. Default is TRUE. |
A data frame containing corrected count for each gene (row) of each sample (column).
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) data("gene_GC") data("intergenic_GC") gc_bias_corrected_counts <- correct_GC( SummarizedCounts = sc, gene_gc = gene_GC, intergenic_gc = intergenic_GC, plot = FALSE )
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) data("gene_GC") data("intergenic_GC") gc_bias_corrected_counts <- correct_GC( SummarizedCounts = sc, gene_gc = gene_GC, intergenic_gc = intergenic_GC, plot = FALSE )
Correct for DNA contamination in RNA-seq data using the 2.5 times of median counts per base of intergenic regions with at least one count.
correct_global(SummarizedCounts = NULL, lambda = 1)
correct_global(SummarizedCounts = NULL, lambda = 1)
SummarizedCounts |
An object of SummarizedCounts.. |
lambda |
A positive number specifying how many times of the median read
coverage of non-zero count intergenic regions used as an estimate of DNA
contamination. The default of lambda is 1,but it could be adjusted based
on the gene-level count distributions of the resulting corrected count
table output by the |
A matrix containing an RNA-seq count table corrected for DNA contamination, with rows for genes and columns for samples.
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) corrected_counts <- correct_global(SummarizedCounts = sc)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) corrected_counts <- correct_global(SummarizedCounts = sc)
Correct gene expression with IR% as a co-variate in linear model in the limma framework.
correct_IR(SummarizedCounts = NULL, design = NULL)
correct_IR(SummarizedCounts = NULL, design = NULL)
SummarizedCounts |
An object of SummarizedCounts.. |
design |
A design matrix with rows corresponding to samples and
columns to coefficients to be estimated. See |
A numeric matrix of normalized expression values on the log2 scale,
corrected for gDNA contamination, and batch
(if any).
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) design = model.matrix(~ group + batch +IR_rate, data = sc$col_data) ir_corrected <- correct_IR(sc, design)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) design = model.matrix(~ group + batch +IR_rate, data = sc$col_data) ir_corrected <- correct_IR(sc, design)
Correct for gDNA contamination in stranded libraries based on Salmon quantitation using the real and opposite library strandedness information
correct_stranded(SummarizedCounts = NULL)
correct_stranded(SummarizedCounts = NULL)
SummarizedCounts |
An object of SummarizedCounts.. |
A list of of matrices of gene-level abundance, counts, and length.
See tximport::tximport()
. The count matrix is corrected for DNA
contamination and rounded into integers.
A numeric matrix containing corrected abundance (TPM) for each gene of each sample
An integer matrix containing corrected read count for each gene of each sample
A numeric matrix containing length (bp) for each gene of each sample
lib_strand <- 1 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) stranded_correction <- correct_stranded(SummarizedCounts = sc)
lib_strand <- 1 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) stranded_correction <- correct_stranded(SummarizedCounts = sc)
Set up an analysis by creating an object of SummarizedCounts, which
is used to store summary output from featureCounts and tximport, and store
gDNA-corrected expression data. This is a convenience wrapper for
SummarizedCounts$new()
create_summarizedcounts(lib_strand = 0, colData = NULL)
create_summarizedcounts(lib_strand = 0, colData = NULL)
lib_strand |
An integer(1), specifying the library's strandedness. It has three possible values:
|
colData |
A data frame with rows corresponding to samples. For
unstranded RNA-seq data, it at least contains the following columns:
|
An object of SummarizedCounts, which is used to store summary output from featureCounts and tximport, and store gDNA-corrected expression data.
in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data)
in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data)
GC content and lengths of 2000 human intergenic regions calculated using the
calc_region_gc()
function.
data(feature_counts_list)
data(feature_counts_list)
"intergenic_region", "intronic_region", "rRNA", "mitochondrion", "gtf",
"chloroplast" (plant only).
For genomic features, gene, exon, intron, rRNA, mitochondrion, and
chloroplast, only the stat
elements of the Rsubread::featureCounts()
output is stored.
For intergenic region, the stat
, annotation
and counts
elements
are stored; for GTF-based summarization, the stat
and counts
elements
are stored.
GC content and lengths of 2000 human Ensembl genes. For each gene exons
are collapsed to get disjoint meta-exons and GC content is calculated
for each meta-exon. A exon length-weighted GC content is calculated for
each gene using the calc_gene_gc()
function.
data(gene_GC)
data(gene_GC)
A data frame with 2000 rows and 2 columns:
metagene-level GC content, values are between 0 and 1
length of metagenes in base pair(bp)
Homo_sapiens.GRCh38.110.gtf.gz
Calculate read distribution over different types of genomic features: genes, exons, introns, intergenic regions,rRNA regions, and organelle genome(s).
get_region_stat(SummarizedCounts = NULL)
get_region_stat(SummarizedCounts = NULL)
SummarizedCounts |
An object of SummarizedCounts.. |
A data frame described as below.
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) assigned_per_region
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) assigned_per_region
Generate a SAF (simplified annotation format) file for genomic features: genicregions, intergenic regions, exonic regions, intronic regions, rRNA genes, mitochondrial genome, and chloroplast genome (only for plants).
get_saf( ensdb_sqlite = NULL, bamfile = NULL, mitochondrial_genome = c("MT", "chrM"), chloroplast_genome = c("chrPltd", "Pltd") )
get_saf( ensdb_sqlite = NULL, bamfile = NULL, mitochondrial_genome = c("MT", "chrM"), chloroplast_genome = c("chrPltd", "Pltd") )
ensdb_sqlite |
A character(1) specifying a path to an SQLite file to store the an object of the ensembldb::EnsDb or an object of the ensembldb::EnsDb. |
bamfile |
A character(1), a path to a BAM file for the experiment of interest. The BAM file is used to extract genome information: chromosome/scaffold names and lengths. |
mitochondrial_genome |
A character(1), mitochondrial genome name in the EnsDb database (ie, the mitochondrial name in column 1 of the GTF file used to generate the EnsDb SQLite database). |
chloroplast_genome |
A character(1), chloroplast genome name in the EnsDb database (ie, the chloroplast name in column 1 of the GTF file used to generate the EnsDb SQLite database). This is only relevant for plants. |
A list of data frames containing SAF for genomic features: genic regions, intergenic regions, exonic regions, intronic regions, rRNA genes, mitochondrial genome, chloroplast genome (only for plants), respectively.
a data frame containing a SAF for genes
a data frame containing a SAF for exons
a data frame containing a SAF for intergenic regions
a data frame containing a SAF for intronic region
a data frame containing a SAF for rRNA exons
a data frame containing a SAF for the mitochodrion
a data frame containing a SAF for chloroplast, plnat only
require("ensembldb") tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam", package = "CleanUpRNAseq" ) saf_list <- get_saf( ensdb_sqlite = hs_ensdb_sqlite, bamfile = bam_file, mitochondrial_genome = "MT" )
require("ensembldb") tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam", package = "CleanUpRNAseq" ) saf_list <- get_saf( ensdb_sqlite = hs_ensdb_sqlite, bamfile = bam_file, mitochondrial_genome = "MT" )
GC content and lengths of 2000 human intergenic regions calculated using the
calc_region_gc()
function.
data(intergenic_GC)
data(intergenic_GC)
A data frame with 2000 rows and 2 columns:
metagene-level GC content, values are between 0 and 1
length of metagenes in base pair(bp)
Homo_sapiens.GRCh38.110.gtf.gz
Visualize assignment statistics of reads/fragments by featureCounts
plot_assignment_stat(SummarizedCounts = NULL)
plot_assignment_stat(SummarizedCounts = NULL)
SummarizedCounts |
An object of SummarizedCounts.. |
A ggplot object, showing percentages and number of fragments in each
assignment category as determined by Rsubread::featureCounts()
based on
a GTF.
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_assignment_stat(SummarizedCounts = sc)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_assignment_stat(SummarizedCounts = sc)
Compare expression distribution by boxplot, density plot and empirical cumulative distribution plot.
plot_expr_distr( SummarizedCounts = NULL, normalization = c("DESeq2", "qsmooth", "none") )
plot_expr_distr( SummarizedCounts = NULL, normalization = c("DESeq2", "qsmooth", "none") )
SummarizedCounts |
An object of SummarizedCounts.. |
normalization |
A character(1) vector, specifying a between-sample
normalization methods: DESeq2's median of ratios method, smooth quantile
normalization method |
A list of 3 ggplot objects.
boxplots showing DESeq2-normalized gene-level count distribution, on a log scale
density plots showing DESeq2-normalized gene-level count distribution, on a log scale
plots showing empricial cumulative distribution of fraction of genes with CPM greater than or equal to a given CPM on a log scale
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) wrap_plots(plot_expr_distr(SummarizedCounts = sc, normalization = "DESeq2"), ncol = 1)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) wrap_plots(plot_expr_distr(SummarizedCounts = sc, normalization = "DESeq2"), ncol = 1)
Check the percentage of genes with counts greater than minimal CPM
plot_gene_content(SummarizedCounts = NULL, min_cpm = 1, min_tpm = 1)
plot_gene_content(SummarizedCounts = NULL, min_cpm = 1, min_tpm = 1)
SummarizedCounts |
An object of SummarizedCounts.. |
min_cpm |
A numeric(1), minimal CPM threshold. |
min_tpm |
A numeric(1), minimal TPM threshold. |
The axis title contains unicode, so please output the plot in the svg format
using the svglite package, instead of grDevices::pdf()
, for high-
resolution plot.
A ggplot object if counts
is not NULL
, showing percentages of
genes with counts above the user-specified minimal CPM (count per million)
in each sample. Or a ggplot object of two panels if both counts
and
abundance
are not NULL
, showing percentages of genes with counts
above the user-specified minimal CPM and minimal TPM (transcript per
million) in each sample.
A ggplot object.
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_gene_content( SummarizedCounts = sc, min_cpm = 1, min_tpm =1 )
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_gene_content( SummarizedCounts = sc, min_cpm = 1, min_tpm =1 )
Perform sample-level exploratory analysis of RNA-seq data, generating heatmap showing sample distances and PCA plot showing sample variations. Internally, DESeq2 is used for vst transformation of count data.
plot_pca_heatmap(SummarizedCounts = NULL, silent = TRUE)
plot_pca_heatmap(SummarizedCounts = NULL, silent = TRUE)
SummarizedCounts |
An object of SummarizedCounts.. |
silent |
A logical(1), specify whether to draw the plot. It is useful to set it to FALSE useful when using the gtable output. |
A list of a ggplot object and a gtable::gtable()
object.
A ggplot object containing the PCA score plot showing sample similarity
A gtable object containing the heatmap showing pairwise sample distances
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) p<- plot_pca_heatmap(SummarizedCounts = sc, silent = TRUE) wrap_plots(p[["pca"]], as.ggplot(p[["heatmap"]]), ncol = 1)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) p<- plot_pca_heatmap(SummarizedCounts = sc, silent = TRUE) wrap_plots(p[["pca"]], as.ggplot(p[["heatmap"]]), ncol = 1)
Generate ggplot plots showing percentages of fragments assigned to different type of genomic features: genes, exons, introns, intergenic regions, rRNA regions, and organelle genome(s).
plot_read_distr(assigned_per_region = NULL)
plot_read_distr(assigned_per_region = NULL)
assigned_per_region |
A data frame, outputted by |
A ggplot object, showing percentages of fragments assigned to different genomic features, such as genic regions, intergenic regions, exonic regions, intronic regions, rRNA genes, mitochondrial genome, chloroplast genome (only for plants)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) p <- plot_read_distr(assigned_per_region) p
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) assigned_per_region <- get_region_stat(SummarizedCounts = sc) p <- plot_read_distr(assigned_per_region) p
Generate a panel of plots based on a count table, with the diagonal showing the sample names, the lower triangle showing smoothed scatterplots for gene expression of pairwise samples, and the upper triangle showing Pearson correlation coefficients of gene expression of pairwise samples.
plot_sample_corr(SummarizedCounts = NULL)
plot_sample_corr(SummarizedCounts = NULL)
SummarizedCounts |
An object of SummarizedCounts.. |
NULL. A plot with pairwise scatter plots and Pearson correlation coefficients is generated. When the sample size is big, save it as tiff file.
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_sample_corr(SummarizedCounts = sc)
lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) data("feature_counts_list") data("salmon_quant") sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) plot_sample_corr(SummarizedCounts = sc)
GC content and lengths of 2000 human intergenic regions calculated using the
calc_region_gc()
function.
data(salmon_quant)
data(salmon_quant)
A list of three elements:
A numeric matrix containing abundance (TPM) for each gene of each sample
A numeric matrix containing read count (fraction) for each gene of each sample
A numeric matrix containing length (bp) for each gene of each sample
Import Salmon quantification output into R using tximport
salmon_tximport( SummarizedCounts = NULL, ensdb_sqlite = NULL, filtered_gene_biotypes = c("artifact", "TEC", "miRNA", "tRNA", "misc_RNA", "Mt_rRNA", "Mt_tRNA", "rRNA", "rRNA_pseudogene", "scaRNA", "scRNA", "snoRNA", "snRNA", "sRNA", "vault_RNA", "TR_V_pseudogene", "IG_C_pseudogene", "IG_J_pseudogene", "IG_V_pseudogene", "IG_pseudogene", "TR_J_pseudogene", "TR_V_pseudogene") )
salmon_tximport( SummarizedCounts = NULL, ensdb_sqlite = NULL, filtered_gene_biotypes = c("artifact", "TEC", "miRNA", "tRNA", "misc_RNA", "Mt_rRNA", "Mt_tRNA", "rRNA", "rRNA_pseudogene", "scaRNA", "scRNA", "snoRNA", "snRNA", "sRNA", "vault_RNA", "TR_V_pseudogene", "IG_C_pseudogene", "IG_J_pseudogene", "IG_V_pseudogene", "IG_pseudogene", "TR_J_pseudogene", "TR_V_pseudogene") )
SummarizedCounts |
An object of SummarizedCounts. |
ensdb_sqlite |
An object of the ensembldb::EnsDb or a character(1) vector, specifying a path to an SQLite database for an object of the ensembldb::EnsDb. |
filtered_gene_biotypes |
A character(n) vector,specifying the biotypes of genes which will not be considered for downstream gene expression analysis. By default, genes of the following biotypes are excluded: "artifact","TEC","miRNA","tRNA", "misc_RNA", "Mt_rRNA", "Mt_tRNA", "rRNA", "rRNA_pseudogene", "scaRNA", "scRNA", "snoRNA", "snRNA", "sRNA", "vault_RNA", "TR_V_pseudogene", "IG_C_pseudogene", "IG_J_pseudogene", "IG_V_pseudogene", "IG_pseudogene", "TR_J_pseudogene", "TR_V_pseudogene", because their transcripts are too short (< 200 nt) or not likely expressed at all. |
An object of SummarizedCounts with the feature_counts field
populated by a list of of matrices of gene-level abundance, counts, and
length outputted by tximport::tximport()
, as described in the folowing:
A numeric matrix containing abundance (TPM) for each gene of each sample
A numeric matrix containing read count (fraction) for each gene of each sample
A numeric matrix containing length (bp) for each gene of each sample
require("ensembldb") in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) salmon_counts <- salmon_tximport( SummarizedCounts = sc, ensdb_sqlite = hs_ensdb_sqlite )
require("ensembldb") in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) salmon_counts <- salmon_tximport( SummarizedCounts = sc, ensdb_sqlite = hs_ensdb_sqlite )
Convert a BSgenome from the UCSC style to the Ensembl style, changing "chrM" to "MT", removing the "chr" prefix from chromosome/scaffold seqnames, only keeping seqnames in the primary genome assembly, and changing the genome version name form the UCSC name to the Ensembl name, such as from "hg38" to "GRCh38".
style_BSgenome( UCSC_BSgenome = NULL, genome_version = "GRCh38", seqname_alias = data.frame(ucsc = paste0("chr", c(seq_len(22), "X", "Y", "M")), ensembl = c(seq_len(22), "X", "Y", "MT")) )
style_BSgenome( UCSC_BSgenome = NULL, genome_version = "GRCh38", seqname_alias = data.frame(ucsc = paste0("chr", c(seq_len(22), "X", "Y", "M")), ensembl = c(seq_len(22), "X", "Y", "MT")) )
UCSC_BSgenome |
An object of the BSgenome::BSgenome in the UCSC style, such as BSgenome.Hsapiens.UCSC.hg38. |
genome_version |
A character(1), specifying the genome version, such as "GRCh38". |
seqname_alias |
A data frame or a tab-delimited file to a data frame,
with two columns: |
An object of the BSgenome::BSgenome in the Ensembl style, such as BSgenome.Dvirilis.Ensembl.dvircaf1.
ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl("_alt|fix|hap\\d+", seqnames(ucsc_BSgenome))] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = ucsc_BSgenome, genome_version = "GRCh38", seqname_alias = seqname_alias )
ucsc_BSgenome <- BSgenome.Hsapiens.UCSC.hg38 ucsc_seqnames <- seqnames(ucsc_BSgenome)[!grepl("_alt|fix|hap\\d+", seqnames(ucsc_BSgenome))] ## Wierd: BSgenome.Hsapiens.UCSC.hg19 has both chrM and chrMT for ## mitochondrial genome. renomve chrMT. if (all(c("chrM", "chrMT") %in% ucsc_seqnames)) { ucsc_seqnames <- ucsc_seqnames[!ucsc_seqnames %in% "chrMT"] } ensembl_seqnames <- gsub( "^chr", "", gsub( "chrM$", "MT", gsub( "v", ".", gsub("_random|chr[^_]+_", "", ucsc_seqnames) ) ) ) ## for BSgenome.Hsapiens.UCSC.hg19, scaffold seqnames start with lower case, ## should be changed to upper case. For example, change "gl000231" to ## "GL000231". Add a suffix ".1" to these scaffold seqnames. seqname_alias <- data.frame(ucsc = ucsc_seqnames, ensembl = ensembl_seqnames) bsgenome <- style_BSgenome( UCSC_BSgenome = ucsc_BSgenome, genome_version = "GRCh38", seqname_alias = seqname_alias )
Summarize reads in alignment files, SAM or BAM, to different genomic regions,
such as genic regions, intergenic regions, exonic regions, intronic regions,
rRNA genes, mitochrondrial genome, chloroplast genome (only for plants), and
gene-level exonic regions using Rsubread::featureCounts()
.
summarize_reads( SummarizedCounts = NULL, isPairedEnd = TRUE, allowMultiOverlap = FALSE, countMultiMappingReads = TRUE, fraction = TRUE, minMQS = 0, saf_list = NULL, gtf = NULL, threads = 1, verbose = FALSE )
summarize_reads( SummarizedCounts = NULL, isPairedEnd = TRUE, allowMultiOverlap = FALSE, countMultiMappingReads = TRUE, fraction = TRUE, minMQS = 0, saf_list = NULL, gtf = NULL, threads = 1, verbose = FALSE )
SummarizedCounts |
An object of SummarizedCounts. |
isPairedEnd |
A logical(1), whether the RNA-seq data is paired-end. |
allowMultiOverlap |
A logical(1), indicating if a read is allowed to be assigned to more than one feature (or meta-feature) if it is found to overlap with more than one feature (or meta-feature). FALSE by default. A read (or read pair) will be assigned to the feature (or meta-feature) that has the largest number of overlapping bases, if the read (or read pair) overlaps with multiple features (or meta-features). |
countMultiMappingReads |
A logical(1), indicating if multi-mapping reads/fragments should be counted, TRUE by default. ‘NH’ tag is used to located multi-mapping reads in the input BAM/SAM files. |
fraction |
A logical(1) indicating if fractional counts are produced for multi-mapping reads and/or multi-overlapping reads. FALSE by default. |
minMQS |
An integer(1), giving the minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criteria. 0 by default. |
saf_list |
A list of data frames containing annotation in the SAF
format, such as the output of the |
gtf |
A character(1), specifying a path to a GTF file. |
threads |
An integer(1), number of threads for
|
verbose |
A logical(1) vector, indicating if verbose information for debugging will be generated. This may include information such as unmatched chromosomes/contigs between reads and annotation. |
An object of SummarizedCounts with the feature_counts field
populated by modified output from the Rsubread::featureCounts()
function for each type of genomic features.
tmp_dir <- tempdir() in_dir <- system.file("extdata", package = "CleanUpRNAseq") gtf.gz <- dir(in_dir, ".gtf.gz$", full.name = TRUE) gtf <- file.path(tmp_dir, gsub("\\.gz", "", basename(gtf.gz))) R.utils::gunzip(gtf.gz, destname= gtf, overwrite = TRUE, remove = FALSE) in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data) bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam", package = "CleanUpRNAseq" ) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) saf_list <- get_saf( ensdb_sqlite = hs_ensdb_sqlite, bamfile = bam_file, mitochondrial_genome = "MT" ) counts_list <- summarize_reads( SummarizedCounts = sc, saf_list = saf_list, gtf = gtf )
tmp_dir <- tempdir() in_dir <- system.file("extdata", package = "CleanUpRNAseq") gtf.gz <- dir(in_dir, ".gtf.gz$", full.name = TRUE) gtf <- file.path(tmp_dir, gsub("\\.gz", "", basename(gtf.gz))) R.utils::gunzip(gtf.gz, destname= gtf, overwrite = TRUE, remove = FALSE) in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data) bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam", package = "CleanUpRNAseq" ) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensembldb::ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) saf_list <- get_saf( ensdb_sqlite = hs_ensdb_sqlite, bamfile = bam_file, mitochondrial_genome = "MT" ) counts_list <- summarize_reads( SummarizedCounts = sc, saf_list = saf_list, gtf = gtf )
A class for storing and retrieving summarized RNA-seq data.
An R6 class
Create an object of SummarizedCounts by setting lib_strand
and
col_data
:
x <- SummarizedCounts$new(lib_strand = 0, col_data = "colData in a tab-delimited txt file")
Alternatively, users can call the wrapper function
create_summarizedcounts()
to construct an object of
SummarizedCounts.
The library stranded information, colData
, and corrected expression
data of a SummarizedCounts object, x
, can be simply accessed as follows:
”'r
x$lib_strand
x$col_data
x$global_correction
x$gc_correction
x$ir_correction
x$stranded_correction
The field and sub-fields of featureCounts-summarized RNA-seq data can be set and get as follows. The feature_counts list contains the following elements: "gene", "exon", "intergenic_region", "intronic_region", "rRNA", "mitochondrion", "gtf", "chloroplast". For genomic features, gene, exon, intron, rRNA, mitochondrion, and chloroplast, only the `stat` elements of the [Rsubread::featureCounts()] oupt is stored; for intergenic region, the `stat`, `annotation` and `counts` elements are stored; for GTF-based summarization, the `stat` and `counts` elements are stored. ```r ## the whole feature_counts list x$set_feature_counts(feature_counts_list) x$get_feature_counts() ## gene-coding regions: intron + exons x$get_gene_stat () ## exons x$get_exon_stat() ## intergenic region summaries x$get_ir_stat() x$get_ir_counts() x$get_ir_anno() ## intronic regions x$get_intron_stat() ## rRNA-coding regions x$get_rRNA_stat() ## mitochondrion x$get_mt_stat() ## chloroplast x$get_ct_stat() ## GTF meta-gene, exons as features x$get_gtf_stat() x$get_gtf_counts()
Adding Salmon quantification data (setting the library type information to the opposite type) to the SummarizedCounts object. For library type information, see https://salmon.readthedocs.io/en/latest/library_type.html.
x$set_salmon_quant(tximport_list) x$get_salmon_quant() x$get_salmon_counts() x$get_salmon_abundance() x$get_salmon_length()
Adding Salmon quantification data (setting the library type information to the opposite type) to the SummarizedCounts object. For library type information, see https://salmon.readthedocs.io/en/latest/library_type.html.
x$set_salmon_quant_opposite(tximport_list) x$get_salmon_quant_opposite() x$get_salmon_opposite_counts() x$get_salmon_opposite_abundance()
Adding corrected expression data to the SummarizedCounts object:
x$set_global_correction(corrected_expr) x$set_gc_correction(corrected_expr) x$set_ir_correction(corrected_expr) x$set_stranded_correction(salmon_correction)
lib_strand
(integer(1)
)
Library strandedness, which can be 0, 1, or 2. See
create_summarizedcounts()
.
col_data
(data.frame
)
colData for RNA-seq data, see
create_summarizedcounts()
.
feature_counts
(list()
) FeatureCounts summaries for different types of genomic
features.
salmon_quant
(list()
)
Salmon quant using the actual library strandedness,imported by tximport.
salmon_quant_opposite
(list()
)
Salmon quant using strandedness info opposite to actual
library strandedness, imported by tximport.
global_correction
(matrix()
)
Count matrix corrected for gDNA contamination by using
the "Global" method.
gc_correction
(matrix()
)
Count matrix corrected for gDNA contamination by using
the "GC%" method.
ir_correction
(matrix()
)
Count matrix corrected for gDNA contamination by using
the "IR%" method.
stranded_correction
(matrix()
)
Count matrix corrected for gDNA contamination by using
the method dedicated to stranded RNA-seq data.
new()
Creates a new instance of this R6 class.
Note that this object is typically constructed via a wrapper function,
create_summarizedcounts()
.
SummarizedCounts$new(lib_strand, col_data)
lib_strand
(integer(1)
) The library's strandedness. It has
three possible values:
0: unstranded, the default;
1: stranded, read 1 (or single-end read) comes from the forward strand;
2: reversely stranded, read 1 (or single-end read) comes from the
reverse strand
For more details, See
https://sailfish.readthedocs.io/en/master/library_type.html.
colData
col_data
(data.frame()
) A data frame with rows corresponding to
samples. For unstranded RNA-seq data, it at least contains the
following columns: sample_name
, BAM_file
, group
,
salmon_quant_file
, and batch
if the data were generated in more
than one batches. For stranded RNA-seq
data, an extra column, salmon_quant_file_opposite_strand
, should be
included.
add_ir_rate()
Add IR% to the col_data by merging data frames
SummarizedCounts$add_ir_rate(IR_rate)
IR_rate
(data.frame()
)
A data frame with sample_name as rownames, and a single column IR_rate
,
stroing the percentage of reads mapping to intergenic regions.
An object of SummarizedCounts.
set_feature_counts()
Set feature_counts
SummarizedCounts$set_feature_counts(feature_counts_list)
feature_counts_list
(data.frame()
)
The feature_counts list contains the following elements: "gene", "exon",
"intergenic_region", "intronic_region", "rRNA", "mitochondrion", "gtf",
"chloroplast". For genomic features, gene, exon, intron, rRNA,
mitochondrion, and chloroplast, only the stat
elements of the
Rsubread::featureCounts()
oupt is stored; for intergenic region,
the stat
, annotation
and counts
elements are stored; for GTF-based
summarization, the stat
and counts
elements are stored.
An object of SummarizedCounts.
set_salmon_quant()
set salmon_quant
SummarizedCounts$set_salmon_quant(tximport_list)
tximport_list
(list()
)
A three-element tximport()
list aggregated from Salmon quantification
results genrated by using the correct library strandedness information,
containing counts
, abundance
, and length
.
An object of SummarizedCounts.
set_salmon_quant_opposite()
set salmon_quant_opposite
SummarizedCounts$set_salmon_quant_opposite(tximport_list)
tximport_list
(list()
)
A three-element tximport()
list aggregated from Salmon quantification
results genrated by using the opposite library strandedness information,
containing counts
, abundance
, and length
.
An object of SummarizedCounts.
set_global_correction()
Set express matrix by using the "Global" method
SummarizedCounts$set_global_correction(corrected_expr)
corrected_expr
(matrix()
)
A count matrix containing expression corrected by the "Global" method.
An object of SummarizedCounts.
set_gc_correction()
Set express matrix by using the "GC%" method
SummarizedCounts$set_gc_correction(corrected_expr)
corrected_expr
(matrix()
)
A count matrix containing expression corrected by the "GC%" method.
An object of SummarizedCounts.
set_ir_correction()
Set express matrix by using the "IR%" method
SummarizedCounts$set_ir_correction(corrected_expr)
corrected_expr
(matrix()
)
A matrix containing expression corrected by the "Global" method in the form
of log2(count per million).
An object of SummarizedCounts.
set_stranded_correction()
Set express matrix by using the method dedicated to stranded RNA-seq data
SummarizedCounts$set_stranded_correction(corrected_expr)
corrected_expr
(matrix()
)
A count matrix containing expression corrected by the method dedicated to
stranded RNA-seq data.
An object of SummarizedCounts.
get_feature_counts()
Get featureCounts output as a whole list
SummarizedCounts$get_feature_counts()
A list containing the following elements: "gene", "exon",
"intergenic_region", "intronic_region", "rRNA", "mitochondrion", "gtf",
"chloroplast". For genomic features, gene, exon, intron, rRNA,
mitochondrion, and chloroplast, only the stat
elements of the
Rsubread::featureCounts()
output is stored; for intergenic region,
the stat
, annotation
and counts
elements are stored; for GTF-based
summarization, the stat
and counts
elements are stored.
get_gene_stat()
Get featureCounts stat
for the gene-coding regions
SummarizedCounts$get_gene_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the gene-coding regions
.
get_exon_stat()
Get featureCounts stat
for the exon regions
SummarizedCounts$get_exon_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the exon regions
get_ir_stat()
Get featureCounts stat
for the intergenic regions
SummarizedCounts$get_ir_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the intergenic regions
.
get_ir_counts()
Get featureCounts counts
for the intergenic regions
SummarizedCounts$get_ir_counts()
only the counts
elements of the Rsubread::featureCounts()
output
for the intergenic regions
.
get_ir_anno()
Get featureCounts annotation
for the intergenic regions
SummarizedCounts$get_ir_anno()
only the annotation
elements of the Rsubread::featureCounts()
output
for the intergenic regions
.
get_intron_stat()
Get featureCounts stat
for the intronic regions
SummarizedCounts$get_intron_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the intronic regions
.
get_rRNA_stat()
Get featureCounts stat
for the rRNA-encoding regions
SummarizedCounts$get_rRNA_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the rRNA-encoding regions
.
get_mt_stat()
Get featureCounts stat
for the mitochondiral genome
SummarizedCounts$get_mt_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the mitochondiral genome
.
get_ct_stat()
Get featureCounts stat
for the chloroplast genome
SummarizedCounts$get_ct_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the chloroplast genome
.
get_gtf_stat()
Get featureCounts stat
for the GTF metagene
SummarizedCounts$get_gtf_stat()
only the stat
elements of the Rsubread::featureCounts()
output
for the GTF metagenes
.
get_gtf_counts()
Get featureCounts counts
for the GTF metagene
SummarizedCounts$get_gtf_counts()
only the counts
elements of the Rsubread::featureCounts()
output
for the GTF metagenes
.
get_salmon_quant()
Get tximport()
output as a whole list, including the three matrices
counts
, abundance
, and length
SummarizedCounts$get_salmon_quant()
A list of three elements, counts
, abundance
, and length
of
the tximport::tximport()
output.
get_salmon_counts()
Get tximport()
output as a whole list, only including the counts
matrix
SummarizedCounts$get_salmon_counts()
only the counts
element of the tximport::tximport()
output.
get_salmon_abundance()
Get tximport()
output as a whole list, only including the abundance
matrix
SummarizedCounts$get_salmon_abundance()
only the abundance
element of the tximport::tximport()
output.
get_salmon_length()
Get tximport()
output as a whole list, only including the length
matrix
SummarizedCounts$get_salmon_length()
only the length
element of the tximport::tximport()
output.
get_salmon_quant_opposite()
Get tximport()
output as a whole list, including the three matrices
counts
, abundance
, and length
SummarizedCounts$get_salmon_quant_opposite()
A list of three elements, counts
, abundance
, and length
of
the tximport::tximport()
output.
get_salmon_opposite_counts()
Get tximport()
output as a whole list, only including the count
matrix
SummarizedCounts$get_salmon_opposite_counts()
only the counts
element of the tximport::tximport()
output.
get_salmon_opposite_abundance()
Get tximport()
output as a whole list, only including the abundance
matrix
SummarizedCounts$get_salmon_opposite_abundance()
only the abundance
element of the tximport::tximport()
output.
clone()
The objects of this class are cloneable with this method.
SummarizedCounts$clone(deep = FALSE)
deep
Whether to make a deep clone.