Title: | DegNorm: degradation normalization for RNA-seq data |
---|---|
Description: | This package performs degradation normalization in bulk RNA-seq data to improve differential expression analysis accuracy. |
Authors: | Bin Xiong and Ji-Ping Wang |
Maintainer: | Ji-Ping Wang <[email protected]> |
License: | LGPL (>= 3) |
Version: | 1.17.0 |
Built: | 2024-10-30 05:30:01 UTC |
Source: | https://github.com/bioc/DegNorm |
DegNorm
is an R package for degradation normalication
for bulk RNA-seq data.DegNorm, short for degradation normalization, is a
bioinformatics pipeline designed to correct for bias due to the heterogeneous
patterns of transcript degradation in RNA-seq data.
DegNorm is a data-driven approach for RNA-Seq normalization resulting in the adjusted read count matrix. This adjustment applies to each gene within each sample, accounting for sample- and gene-specific degradation bias while simultaneously controlling for the sequencing depth. The algorithm at the center of DegNorm is the rank-one over-approximation of a gene's coverage score matrix, which is comprised of the different samples' coverage score curves along the transcript for each gene. For each gene, DegNorm estimates (1) an envelope function representing the ideal shape of the gene's coverage curve when no degradation is present, and (2) scale factors for each sample (for said gene) that indicates the relative abundance of the gene within the sample.
functions:
read_coverage_batch,
degnorm,
plot_coverage,
plot_heatmap,
plot_corr,
plot_boxplot
Bin Xiong, Ji-Ping Wang
Maintainer: Ji-Ping Wang <[email protected]>
DegNorm reference:
Xiong, B., Yang, Y., Fineis, F. Wang, J.-P., DegNorm: normalization of generalized transcript degradation improves accuracy in RNA-seq analysis, Genome Biology, 2019,20:75
Example of CoverageClass
data from DegNorm package.
It is the output from read_coverage_batch
function for human
chromosome 21.
data(coverage_res_chr21)
data(coverage_res_chr21)
A coverageClass
list of the following
coverage
a list of converage matrices for all genes within each sample
counts
a data.frame of read counts for all genes within each sample.
data(coverage_res_chr21) summary_CoverageClass(coverage_res_chr21)
data(coverage_res_chr21) summary_CoverageClass(coverage_res_chr21)
degnorm
calcualtes the degradation index score for each
gene within each sample and return the degradation-normalized read counts.
degnorm(read_coverage,counts,iteration,loop,down_sampling=1,grid_size=10, cores=1)
degnorm(read_coverage,counts,iteration,loop,down_sampling=1,grid_size=10, cores=1)
read_coverage |
a list of converage matrices, one per gene |
counts |
dataframe of read counts, each row for one gene, and column for sample. The order and number of genes must match the order in read_coverage matrices. |
iteration |
iteration number for degnorm algorithm. 5 is sufficient. |
loop |
iteration number inside of nonnegative matrix factorization-over approximation. Default is 100. |
down_sampling |
1 for yes (default) and 0 for no. If yes, average coverage score is calcualted on a grid of size specified by grid_size argument. The new coverage matrix formed by the grid average score will be used for baseline selection. This increases the efficiency of algorithm while maintaining comparable accuracy. |
grid_size |
default size is 10 bp. |
cores |
number of cores. Default number if 1. Users should input the maximum possible number of cores for efficiency. |
degnorm
outputs a list of following objects:
counts |
a data.drame of read counts for each gene within each sample. |
counts_normed |
a data.drame of degradation-normalized read counts for each gene within each sample. |
DI |
a matrix of degradation index scores for each gene within each sample. |
K |
normalizing scale factor for each gene within each sample after accounting for degradation normalization. |
convergence |
convergence tag; 0 = degnorm was not done on this gene because smaller counts or too short length.1 = degnorm was done with baseline selection. 2 = degnorm done without baseline selection because gene length (after filtering out low count regions)<200 bp. 3= baseline was found, but DI score is too large. 4 = baseline selection didn't coverge. |
envelop |
list of the envelop curves for all genes. |
##coverage_res_chr21 is a \code{CoverageClass} object from DegNorm Package. data(coverage_res_chr21) res_DegNorm = degnorm(read_coverage = coverage_res_chr21[[1]], counts = coverage_res_chr21[[2]], iteration = 2, down_sampling = 1, grid_size=10, loop = 20, cores=2)
##coverage_res_chr21 is a \code{CoverageClass} object from DegNorm Package. data(coverage_res_chr21) res_DegNorm = degnorm(read_coverage = coverage_res_chr21[[1]], counts = coverage_res_chr21[[2]], iteration = 2, down_sampling = 1, grid_size=10, loop = 20, cores=2)
DegNorm provides three functions for visualization gene-/sample-wise degradation.
plot_corr(DI) plot_heatmap(DI) plot_boxplot(DI)
plot_corr(DI) plot_heatmap(DI) plot_boxplot(DI)
DI |
a matrix or data.frame of degradation index (DI) scores with each row corresponding to one gene and each column for a sample. |
plot_corr
plots the correlation matrix of DI scores between samples.
plot_heatmap
plots the heatmap of DI scores. Left is ploted in
descending order of average DI scores of genes where each row corresponds
to one gene. In the right plot, DI scores were sorted within each sample and
plotted in descending order.
plot_boxplot
plots the boxplot of DI scores by samples.
These functions return a boxplot of DI scores by sample, a heatmap of DIS scores of all genes in all samples and a correlation plot of DI scores between samples respectively.
## res_DegNorm_chr21 is degnorm otuput stored in sysdata.Rda data(res_DegNorm_chr21) plot_boxplot(res_DegNorm_chr21$DI) plot_heatmap(res_DegNorm_chr21$DI) plot_corr(res_DegNorm_chr21$DI)
## res_DegNorm_chr21 is degnorm otuput stored in sysdata.Rda data(res_DegNorm_chr21) plot_boxplot(res_DegNorm_chr21$DI) plot_heatmap(res_DegNorm_chr21$DI) plot_corr(res_DegNorm_chr21$DI)
plot_coverage
plots the before- and after-degradation coverage curves
plot_coverage(gene_name, coverage_output, degnorm_output, group=NULL, samples=NULL)
plot_coverage(gene_name, coverage_output, degnorm_output, group=NULL, samples=NULL)
gene_name |
the name of the gene whose coverage coverage to be plotted. |
coverage_output |
CoverageClass object, the output from function
|
degnorm_output |
DegNormClass object, the output from function
|
group |
a vector of integers or character strings indicating the biological conditions of the samples. Coverage curves will be plotted in the same color for the same group. Default is NULL. By default all curves will plotted in different colors. |
samples |
a string vector for the subset of samples to be plotted. NULL means all samples to be plotted. The length of samples must be of the same length of group if both specified. |
plot_coverage
outputs the coverage curves before- and after-degradation
normalization.
The coverage curve before and after degradation normalization.
## gene named "SOD1",plot coverage curves data(coverage_res_chr21) data(res_DegNorm_chr21) plot_coverage(gene_name="SOD1", coverage_output=coverage_res_chr21, degnorm_output=res_DegNorm_chr21, group=c(0,1,1))
## gene named "SOD1",plot coverage curves data(coverage_res_chr21) data(res_DegNorm_chr21) plot_coverage(gene_name="SOD1", coverage_output=coverage_res_chr21, degnorm_output=res_DegNorm_chr21, group=c(0,1,1))
This function calls read_coverage
to compute read coverage socre and
read counts for all genes and samples.
Notes: 1. Coverage score is calcualted per gene, i.e. concatenation of all exons from the same gene.
2. We follow HTseq protocol for counting valid read or read pairs for each gene.
3. When reading alignment file, isSecondaryAlignment
flag is set as
FALSE to avoid possible redundant counting.
4. For paired-end data, isPaired is set as TRUE. We don't recommend setting isProperPair as TRUE as some fragments length may exceed 200bp.
5. User can modify scanBamParam in the R codes below as needed.
read_coverage_batch(bam_file_list,gtf_file,cores=1)
read_coverage_batch(bam_file_list,gtf_file,cores=1)
bam_file_list |
a character vector of bam file names. |
gtf_file |
the gtf file that RNA-seq reads were aligned with reference to. |
cores |
number of cores to be used. Default=1. |
A list of the following:
coverage |
a list of converage matrices for all genes within each sample. |
counts |
data.frame of read counts for all genes within each sample. |
## read bam file and gtf file from the package bam_file_list <- list.files(path=system.file("extdata",package="DegNorm") ,pattern=".bam$",full.names=TRUE) gtf_file <- list.files(path=system.file("extdata",package="DegNorm"), pattern=".gtf$",full.names=TRUE) # run read_coverage_batch to calculate read coverage curves and read counts coverage_res=read_coverage_batch(bam_file_list, gtf_file,cores=2)
## read bam file and gtf file from the package bam_file_list <- list.files(path=system.file("extdata",package="DegNorm") ,pattern=".bam$",full.names=TRUE) gtf_file <- list.files(path=system.file("extdata",package="DegNorm"), pattern=".gtf$",full.names=TRUE) # run read_coverage_batch to calculate read coverage curves and read counts coverage_res=read_coverage_batch(bam_file_list, gtf_file,cores=2)
Example of DegNormClass
data from DegNorm package. It is the
output from degnorm
function for human chromosome 21.
data("res_DegNorm_chr21")
data("res_DegNorm_chr21")
A DegNormClass
list of the following items:
counts
a data.drame of read counts for each gene within each sample.
counts_normed
a data.drame of degradation-normalized read counts for each gene within each sample.
DI
a matrix of degradation index scores for each gene within each sample.
K
normalizing scale factor for each gene within each sample after accounting for degradation normalization.
convergence
convergence tag; 0 = degnorm was not done on this gene because smaller counts or too short length.1 = degnorm was done with baseline selection. 2 = degnorm done without baseline selection because gene length (after filtering out low count regions)<200 bp. 3= baseline was found, but DI score is too large. 4 = baseline selection didn't coverge.
envelop
a list of the envelop curves for all genes.
data(res_DegNorm_chr21) summary_DegNormClass(res_DegNorm_chr21)
data(res_DegNorm_chr21) summary_DegNormClass(res_DegNorm_chr21)
It prints a summary of the data objects contained in the list from
read_coverage_batch
.
summary_CoverageClass(object)
summary_CoverageClass(object)
object |
CoverageClass from coderead_coverage_batch. |
On-screen plot of summary of CoverageClass
object.
## Summary of coverage_cal_batch output (CoverageClass) data(coverage_res_chr21) summary_CoverageClass(coverage_res_chr21)
## Summary of coverage_cal_batch output (CoverageClass) data(coverage_res_chr21) summary_CoverageClass(coverage_res_chr21)
It prints a summary of the data objects contained in the list from degnorm function.
summary_DegNormClass(object)
summary_DegNormClass(object)
object |
DegNormClass from |
On-screen summary of DegNormClass
object.
## Summary of degnorm output (DegNormlass) data(res_DegNorm_chr21) summary_DegNormClass(res_DegNorm_chr21)
## Summary of degnorm output (DegNormlass) data(res_DegNorm_chr21) summary_DegNormClass(res_DegNorm_chr21)