SCBN Tutorial

Introduction

This package provides a statistical normalization method and differential expression analysis for RNA-seq data between different species. It consider the different gene lengths and unmapped genes between species, and formulate the problem from the perspective of hypothesis testing and search for an optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. The methods on this package are described in the article A statistical normalization method and differential expression analysis for RNA-seq data between different species by Yan Zhou, Jiadi Zhu, Tiejun Tong, Junhui Wang, Bingqing Lin, Jun Zhang (2018, pending publication).

The scale based normalization (SCBN) method is developed for analyzing cross species RNE-seq data, which is different from the data in same species. We have implemented the SCBN method via a set of R functions. We make a R package named SCBN, and give a tutorial for the package. The method consist three steps.

Step 1: Data Pre-processing;

Step 2: Calculating scaling factor for data;

Step 3: Calculating p-values for each orthologous genes and select significants.

We employ a simulation dataset to illustrate the usage of the SCBN package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 3.5.0.

Preparations

To install the SCBN package into your R environment, start R and enter:

install.packages("BiocManager")
BiocManager::install("SCBN")

Then, the SCBN package is ready to load.

library(SCBN)

Data format

In order to reproduce the presented SCBN workflow, the package includes the example data sets, which is generated by function generateDataset(). The next we will give an example for how to generate simulation dataset:

data(orthgenes)
orthgenes[, 6:9] <- round(orthgenes[, 6:9])
orthgenes1 <- orthgenes[!(is.na(orthgenes[,6])|is.na(orthgenes[,7])|
                        is.na(orthgenes[,8])|is.na(orthgenes[,9])), ]
sim_data <- generateDataset(commonTags=5000, uniqueTags=c(100, 300),
                            unmapped=c(400, 200),group=c(1, 2),
                            libLimits=c(.9, 1.1)*1e6,
                            empiricalDist=orthgenes1[, 6],
                            genelength=orthgenes1[, 2], randomRate=1/100,
                            pDifferential=.05, pUp=.5, foldDifference=2)

There are two dataset in the data subdirectory of SCBN package, that is orthgenes.RData and sim_data. orthgenes.RData is a real dataset come from The evolution of gene expression levels in mammalian organs by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. sim_data is a simulation dataset, which is generated by generateDataset() according to the previous steps.

data(orthgenes)
head(orthgenes)
##   Ensembl.Gene.ID ExonicLength Mouse.Ensembl.Gene.ID ExonicLength.1 X..Identity
## 1 ENSG00000225566          865    ENSMUSG00000031590           1033          88
## 2 ENSG00000215750         3582    ENSMUSG00000024175           1532          36
## 3 ENSG00000241176          135    ENSMUSG00000078303             90          65
## 4 ENSG00000215700         2370    ENSMUSG00000028675           2213          82
## 5 ENSG00000215764         8571    ENSMUSG00000031424           1558          36
## 6 ENSG00000215764         8571    ENSMUSG00000057439           1932          33
##   Human_Brain_Male1 Human_Brain_Male2 Mouse_Brain_Male1 Mouse_Brain_Male2
## 1                16                21               480                91
## 2                38                10                34                15
## 3                66                64              7214               801
## 4                20                12              1117               215
## 5                 0                 0                 1                 0
## 6                 0                 0                 7                 2

orthgenes.RData includes 9 columns, * the first and third columns are the ID for two species; * the second and fourth columns are the gene length for two species; * from sixth to ninth columns are RNA-seq reads for two species.

data(sim_data)
head(sim_data)
##   geneLength1 counts1 geneLength2 counts2
## 1         720       2         708       2
## 2        2294       2        3488       2
## 3        3883      17        2031      27
## 4        2705      22        1380      16
## 5        1573      13        2123      22
## 6        2860       1        1294       1

sim_data includes 4 columns, * the first and third columns are gene length for two species; * the second and fourth columns are reads for two species.

Calculate scaling factor for data

For different species, we need to consider not only the total read counts but also the different gene numbers and gene lengths. Therefore, we propose SCBN method, by utilizing the available knowledge of housekeeping genes, it get the optimal scaling factor by minimizing the deviation between the empirical and the nominal type I error.

data(sim_data)
factor <- SCBN(orth_gene=sim_data, hkind=1:1000, a=0.05)
factor
## $median_val
## [1] 0.781137
## 
## $scbn_val
## [1] 0.846137

The output for SCBN() is “median_val” and “scbn_val”. median_val is a scaling factor which is calculated by method in The evolution of gene expression levels in mammalian organs by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. scbn_val is a scaling factor which in our paper. We assess the performance of two method, and find the proposed method outperforms the other method.

Calculate p-values for each orthologous genes and select significants

Getting the scaling factor, we calculate p-values with adjusted sage.test function. The step is as follows:

data(sim_data)
orth_gene <- sim_data
hkind <- 1:1000
scale <- factor$scbn_val
x <- orth_gene[, 2]
y <- orth_gene[, 4]
lengthx <- orth_gene[, 1]
lengthy <- orth_gene[, 3]
n1 <- sum(x)
n2 <- sum(y)
p_value <- sageTestNew(x, y, lengthx, lengthy, n1, n2, scale)
head(p_value)
## [1] 1.00000000 0.59818429 0.01363256 1.00000000 0.85935631 1.00000000

Getting p-values for each orthologous genes, we choose differentially expressed orthologous genes with p-values. We assume the genes to be differentially expressed genes when p-values less than pre-setting cutoffs, such as 10−3 or 10−5.