This package provides a statistical normalization method and differential expression analysis for RNA-seq data between different species. It consider the different gene lengths and unmapped genes between species, and formulate the problem from the perspective of hypothesis testing and search for an optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. The methods on this package are described in the article A statistical normalization method and differential expression analysis for RNA-seq data between different species by Yan Zhou, Jiadi Zhu, Tiejun Tong, Junhui Wang, Bingqing Lin, Jun Zhang (2018, pending publication).
The scale based normalization (SCBN) method is developed for analyzing cross species RNE-seq data, which is different from the data in same species. We have implemented the SCBN method via a set of R functions. We make a R package named SCBN, and give a tutorial for the package. The method consist three steps.
Step 1: Data Pre-processing;
Step 2: Calculating scaling factor for data;
Step 3: Calculating p-values for each orthologous genes and select significants.
We employ a simulation dataset to illustrate the usage of the SCBN package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 3.5.0.
To install the SCBN package into your R environment, start R and enter:
Then, the SCBN package is ready to load.
In order to reproduce the presented SCBN workflow, the package includes the example data sets, which is generated by function generateDataset(). The next we will give an example for how to generate simulation dataset:
data(orthgenes)
orthgenes[, 6:9] <- round(orthgenes[, 6:9])
orthgenes1 <- orthgenes[!(is.na(orthgenes[,6])|is.na(orthgenes[,7])|
is.na(orthgenes[,8])|is.na(orthgenes[,9])), ]
sim_data <- generateDataset(commonTags=5000, uniqueTags=c(100, 300),
unmapped=c(400, 200),group=c(1, 2),
libLimits=c(.9, 1.1)*1e6,
empiricalDist=orthgenes1[, 6],
genelength=orthgenes1[, 2], randomRate=1/100,
pDifferential=.05, pUp=.5, foldDifference=2)
There are two dataset in the data subdirectory of SCBN package, that is orthgenes.RData and sim_data. orthgenes.RData is a real dataset come from The evolution of gene expression levels in mammalian organs by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. sim_data is a simulation dataset, which is generated by generateDataset() according to the previous steps.
## Ensembl.Gene.ID ExonicLength Mouse.Ensembl.Gene.ID ExonicLength.1 X..Identity
## 1 ENSG00000225566 865 ENSMUSG00000031590 1033 88
## 2 ENSG00000215750 3582 ENSMUSG00000024175 1532 36
## 3 ENSG00000241176 135 ENSMUSG00000078303 90 65
## 4 ENSG00000215700 2370 ENSMUSG00000028675 2213 82
## 5 ENSG00000215764 8571 ENSMUSG00000031424 1558 36
## 6 ENSG00000215764 8571 ENSMUSG00000057439 1932 33
## Human_Brain_Male1 Human_Brain_Male2 Mouse_Brain_Male1 Mouse_Brain_Male2
## 1 16 21 480 91
## 2 38 10 34 15
## 3 66 64 7214 801
## 4 20 12 1117 215
## 5 0 0 1 0
## 6 0 0 7 2
orthgenes.RData includes 9 columns, * the first and third columns are the ID for two species; * the second and fourth columns are the gene length for two species; * from sixth to ninth columns are RNA-seq reads for two species.
## geneLength1 counts1 geneLength2 counts2
## 1 720 2 708 2
## 2 2294 2 3488 2
## 3 3883 17 2031 27
## 4 2705 22 1380 16
## 5 1573 13 2123 22
## 6 2860 1 1294 1
sim_data includes 4 columns, * the first and third columns are gene length for two species; * the second and fourth columns are reads for two species.
For different species, we need to consider not only the total read counts but also the different gene numbers and gene lengths. Therefore, we propose SCBN method, by utilizing the available knowledge of housekeeping genes, it get the optimal scaling factor by minimizing the deviation between the empirical and the nominal type I error.
## $median_val
## [1] 0.781137
##
## $scbn_val
## [1] 0.846137
The output for SCBN() is “median_val” and “scbn_val”. median_val is a scaling factor which is calculated by method in The evolution of gene expression levels in mammalian organs by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. scbn_val is a scaling factor which in our paper. We assess the performance of two method, and find the proposed method outperforms the other method.
Getting the scaling factor, we calculate p-values with adjusted sage.test function. The step is as follows:
data(sim_data)
orth_gene <- sim_data
hkind <- 1:1000
scale <- factor$scbn_val
x <- orth_gene[, 2]
y <- orth_gene[, 4]
lengthx <- orth_gene[, 1]
lengthy <- orth_gene[, 3]
n1 <- sum(x)
n2 <- sum(y)
p_value <- sageTestNew(x, y, lengthx, lengthy, n1, n2, scale)
head(p_value)
## [1] 1.00000000 0.59818429 0.01363256 1.00000000 0.85935631 1.00000000
Getting p-values for each orthologous genes, we choose differentially expressed orthologous genes with p-values. We assume the genes to be differentially expressed genes when p-values less than pre-setting cutoffs, such as 10−3 or 10−5.