bayNorm has been submitted to Bioconductor, once it is accepted, it can be installed via:
Currently bayNorm can be installed via github:
The main function is bayNorm
which is a wrapper function
of prior parameters estimation and normalized array or matrix
generation.
Essential parameters for running bayNorm
are:
Data
: a SummarizedExperiment
object or
matrix (rows: genes, columns: cells).BETA_vec
: a vector of probabilities which is of length
equal to the number of cells.Conditions
: If Conditions
is provided,
prior parameters will be estimated within each group of cells (we name
this kind of procedure as “LL” procedure where “LL” stands for
estimating both μ and ϕ locally). Otherwise, bayNorm
applied “GG” procedure for estimating prior parameters (estimating both
μ and ϕ globally).Prior_type
: Even if you have specified the
Conditions
, you can still choose to estimate prior
parameters across all the cells by setting
Prior_type="GG"
.The input parameters BETA_vec
, Conditions
(if specified), UMI_sffl
(if specified),
Prior_type
, FIX_MU
, BB_SIZE
and
GR
are stored in the list input_params
which
should be output from bayNorm
. Objects PRIORS
together with input_params
output from bayNorm
should be input in bayNorm_sup
for transforming between 3D
array, mode or mean version’s output of normalized count matricies.
Apart from the raw data, another parameter which user may need to
provide is the mean capture efficiency < β> and hence β can be further calculated using
scaling factors estimated from any other methods. The default β is calculated to be total counts
normalized to 6%. Or you can use BetaFun
in bayNorm to
estimate β:
data('EXAMPLE_DATA_list')
#Suppose the input data is a SummarizedExperiment object:
#Here we just used 30 cells for illustration
rse <- SummarizedExperiment::SummarizedExperiment(assays=SimpleList(counts=EXAMPLE_DATA_list$inputdata[,seq(1,30)]))
#SingleCellExperiment object can also be input in bayNorm:
#rse <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=EXAMPLE_DATA_list$inputdata))
BETA_est<-BetaFun(Data=rse,MeanBETA=0.06)
summary(BETA_est$BETA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02299 0.03859 0.05255 0.06000 0.07473 0.11168
The function BetaFun
selects a subset of genes such that
outlier genes and genes with high proportion of zeros across cells are
excluded. Then the total counts of the subset of genes in each cell are
normalized to < β>.
Another way is to normalize scaling factors estimated from the R package
scran
to < β>.
BETA_vec
can be set to be NULL
, then the
β are estimated to be total
counts normalized to 6%.
#Return 3D array normalzied data, draw 20 samples from posterior distribution:
bayNorm_3D<-bayNorm(
Data=rse,
BETA_vec = NULL,
mode_version=FALSE,
mean_version = FALSE,S=20
,verbose =FALSE,
parallel = TRUE)
## | | | 0% | |==== | 5% | |======= | 10% | |========== | 15% | |============== | 20% | |================== | 25% | |===================== | 30% | |======================== | 35% | |============================ | 40% | |================================ | 45% | |=================================== | 50% | |====================================== | 55% | |========================================== | 60% | |============================================== | 65% | |================================================= | 70% | |==================================================== | 75% | |======================================================== | 80% | |============================================================ | 85% | |=============================================================== | 90% | |================================================================== | 95% | |======================================================================| 100%
bayNorm’s mathematical model is suitable for UMI dataset. However it
can be also applied on non-UMI dataset. In bayNorm
, you
need to specify the following parameter: * UMI_sffl
:
bayNorm can also be applied on the non-UMI dataset. However, user need
to provide a scaled number. Raw data will be divided by the scaled
number and bayNorm will be applied on the rounded scaled data. By doing
so, the Dropout vs Mean expression plots will be similar to that of UMI
dataset.
If you have run bayNorm on a dataset before but want to output
another kind of data (3D array or 2D matrix), you can use the function
bayNorm_sup
. It is important to input the existing
estimated parameters by specifying the following parameter in
bayNorm_sup
:
BETA_vec
: If Conditions
has been specified
previously, then input unlist(bayNorm_output$BETA)
PRIORS
:
input bayNorm_output$PRIORS_LIST
Conditions
: make sure to specify the same
Conditions
as before. You can find these two objects from
the previous output of bayNorm function, which is a list.#Now if you want to generate 2D matrix (MAP) using the same prior
#estimates as generated before:
bayNorm_2D<-bayNorm_sup(
Data=rse,
PRIORS=bayNorm_3D$PRIORS,
input_params=bayNorm_3D$input_params,
mode_version=TRUE,
mean_version = FALSE,
verbose =FALSE)
#Or you may want to generate 2D matrix
#(mean of posterior) using the same prior
#estimates as generated before:
bayNorm_2D<-bayNorm_sup(
Data=rse,
PRIORS=bayNorm_3D$PRIORS,
input_params=bayNorm_3D$input_params,
mode_version=FALSE,
mean_version = TRUE,
verbose =FALSE)
Prior_fun
: This is a wrapper function of
EstPrior
, BB_Fun
and
AdjustSIZE_fun
:
EstPrior
: Estimating priors using MME methods.BB_Fun
: Estimating priors by maximizing marginal
likelihood function with respect to either ϕ or both μ and ϕ.AdjustSIZE_fun
: Adjusting ϕ estimated from
BB_Fun
.noisy_gene_detection
: This is a higher wrapper
function over bayNorm
, bayNorm_sup
and
SyntheticControl
. For details about the rationale behind
this function, see the Methods part in the paper 2.
SyntheticControl
: Given a real scRNA-seq data with
estimated β, it will estimate
μ and ϕ from it and simulate control data
using Poisson
distribution. See the Methods part in the
paper 2 for
more details.
#Downstream analysis: DE genes detection DE function used in bayNorm paper, which was kindly provided by the author of SCnorm.
library(MAST)
library(reshape2)
########SCnorm_runMAST3##########
SCnorm_runMAST3 <- function(Data, NumCells) {
if(length(dim(Data))==2){
resultss<-SCnorm_runMAST(Data, NumCells)
} else if(length(dim(Data))==3){
library(foreach)
resultss<- foreach(sampleind=1:dim(Data)[3],.combine=cbind)%do%{
print(sampleind)
qq<-SCnorm_runMAST(Data[,,sampleind], NumCells)
return(qq$adjpval)
}
}
return(resultss)
}
SCnorm_runMAST <- function(Data, NumCells) {
NA_ind<-which(rowSums(Data)==0)
Data = as.matrix(log2(Data+1))
G1<-Data[,seq(1,NumCells[1])]
G2<-Data[,-seq(1,NumCells[1])]
qq_temp<- rowMeans(G2)-rowMeans(G1)
qq_temp[NA_ind]=NA
numGenes = dim(Data)[1]
datalong = melt(Data)
Cond = c(rep("c1", NumCells[1]*numGenes), rep("c2", NumCells[2]*numGenes))
dataL = cbind(datalong, Cond)
colnames(dataL) = c("gene","cell","value","Cond")
dataL$gene = factor(dataL$gene)
dataL$cell = factor(dataL$cell)
vdata = FromFlatDF(dataframe = dataL, idvars = "cell", primerid = "gene", measurement = "value", id = numeric(0), cellvars = "Cond", featurevars = NULL, phenovars = NULL)
zlm.output = zlm(~ Cond, vdata, method='glm', ebayes=TRUE)
zlm.lr = lrTest(zlm.output, 'Cond')
gpval = zlm.lr[,,'Pr(>Chisq)']
adjpval = p.adjust(gpval[,1],"BH") ## Use only pvalues from the continuous part.
adjpval = adjpval[rownames(Data)]
return(list(adjpval=adjpval, logFC_re=qq_temp))
}
#Now, detect DE genes between two groups of cells with 15 cells in each group respectively
#For 3D array
DE_out<-SCnorm_runMAST3(Data=bayNorm_3D$Bay_out, NumCells=c(15,15))
med_pvals<-apply(DE_out,1,median)
#DE genes are called with threshold 0.05:
DE_genes<-names(which(med_pvals<0.05))
#For 2D array
DE_out<-SCnorm_runMAST3(Data=bayNorm_2D$Bay_out, NumCells=c(15,15))
DE_genes<-names(which(DE_out$adjpval<0.05))
A scRNA-seq dataset is typically represented in a matrix of dimension P × Q, where P denotes the total number of genes observed and Q denotes the total number of cells studied. The element xij (i ∈ {1, 2, …, P} and j ∈ {1, 2, …, Q}) in the matrix represents the number of transcripts reported for the ith gene in the jth cell. This is equal to the total number of sequencing reads mapping to that gene in that cell for a non-UMI protocol. For UMI based protocols this is equal to the number of individual UMIs mapping to each gene. The matrix can include data from different groups or batches of cells, representing different biological conditions. This can be represented as a vector of labels for the cell groups or conditions (Cj). bayNorm generates for each gene (i) in each cell (j) a posterior distribution of original expression counts (xij0), given the observed scRNA-seq read count for that gene (xij).
A common approach for normalizing scRNA-seq data is based on the use of a global scaling factor (sj), ignoring any gene specific biases. The normalized data x̃ij is obtained by dividing the raw data for each cell j by the its global scaling factor sj:
In bayNorm, we implement global scaling using a Bayesian approach. We assume given the original number of transcripts in the cell (xij0), the number of transcripts observed (xij) follows a Binomial model with probability βj, which we refer to as capture effeiciency and it represents the probability of original transcripts in the cell to be observed. In addition, we assume that the original number or true count of the ith gene in the jth cell (xij0) follows Negative Binomial distribution with parameters mean (μ), size (or dispersion parameter, ϕ), such that: $$\Pr(x^0_{ij}=n|\phi_i,\mu_i)=\frac{\Gamma(n+\phi_i)}{\Gamma(\phi_i)n!}(\frac{\phi_i}{\mu_i+\phi_i})^{\phi_i}(\frac{\mu_i}{\mu_i+\phi_i})^{n}$$
So, overall we have the following model:
Using the Bayes rule, we have the following posterior distribution of original number of mRNAs for each gene in each cell:
The prior parameters μ and ϕ of each gene were estimated using an empirical Bayesian method as discussed in detail in Supplementary Information of 1.
For more details about the rationale of bayNorm, please see 1.
Cell specific capture efficiency βj and global scaling factor (sj) are closely related. We can transform scaling factors estimated by different methods (see below) into βj values with the following formula:
β̄ or < β>, a scalar, is an estimate of global mean capture efficiency across all cells, which ranges between 0 and 1.
There are two different methods for estimating β̄ and βj:
We finally note, that estimates of capture efficiency discussed above will assume cells have simular original transcript content. Therefore, the bayNorm outputs estimates of original transcript counts for a typical cell, which is corrected for variation in cell size and transcript content. This is usually desirable for down-stream analysis such as DE detection. However, if one is interested in absolute origianl count and has additional information such as cell size or total transcirpt content per cell, the capture efficiencies can be approporiatly rescaled for this purpose.
Using an emperical bayes approach, one can use the maximisation of marginal likelihood distribution of the observed counts across cells to estimate prior parameters. Let Mi denotes the marginal likelihood function for the ith gene across cells. Assuming independence between cells, the log-marginal distribution for the ith gene is
where Pr (xij|μi, ϕi, βj) is the Negative Binomial (see Methods). Maximizing of this equation yields the pair (μi, ϕi).
The above optimization needs to be done for each of the P genes. We refer to the ϕ and/or μ estimated by maximizing marginal distribution as BB estimates for convenience, because bayNorm utilizes spectral projected gradient method (spg) from the R package named ``BB’’. Optimizing the marginal distribution with respect to both μ and ϕ (2D optimization) is computationally intensive. If we had a good estimate μ, then we could optimize the marginal distribution with respect to ϕ alone, which would be much more efficient.
A heuristic way of estimating μi and ϕi is through a variant of the Method of Moments. The first step is to do a simple normalization of the raw data, to scale expressions given the cell specific capture efficiencies (βj). The simple normalized count xijs is calculated as following:
where the numerator of the scaling factor of xij is obtained by taking the average of scaled total counts across cells.
Based on simple normalized data, we are able to estimate prior parameters μ and ϕ of the Negative Binomial distribution using the Method of Moments Estimation (MME), which simply equates the theoretical and empirical moments. This estimation method is fast and simulations suggests it provides good estimates of μ but the drawback is that the estimation of ϕ show a systematic bias (see Supplementary Figure S24 a-b in 1).
Based on simulation studies (Supplementary Figure S24 in 1), the most robust and efficient estimation of μ and ϕ can be obtained using the following combined approach, which is the default setting in bayNorm:
Cells are grouped together for prior estimation, based on cell-specific attributes (Cj). Prior estimation can be done over all cells irrespective of the experimental condition. We refer to this procedure as “global”. Alternatively, suppose that there are multiple groups of cells in the datasets and we have reasons to believe each group could behave differently. Then we can estimate the prior parameters “μ and ϕ” within each group respectively (within groups with the same Cj value). We refer to this procedure as “local”. Estimating prior parameters across a certain group of cells based on “global” procedure allow for removing potential batch effects. Multiple groups normalization based on “local” procedure allows for amplifying the inter-groups’ differences while mitigating the intra-group’s variability, which is suitable for DE detection.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.1
## [5] GenomeInfoDb_1.43.2 IRanges_2.41.1
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [9] generics_0.1.3 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 bayNorm_1.25.0
## [13] knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.7.2 lattice_0.22-6
## [4] digest_0.6.37 evaluate_1.0.1 grid_4.4.2
## [7] iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
## [10] jsonlite_1.8.9 Matrix_1.7-1 survival_3.7-0
## [13] doSNOW_1.0.20 BiocManager_1.30.25 httr_1.4.7
## [16] UCSC.utils_1.3.0 codetools_0.2-20 jquerylib_0.1.4
## [19] abind_1.4-8 cli_3.6.3 crayon_1.5.3
## [22] rlang_1.1.4 XVector_0.47.0 splines_4.4.2
## [25] DelayedArray_0.33.2 cachem_1.1.0 yaml_2.3.10
## [28] S4Arrays_1.7.1 tools_4.4.2 parallel_4.4.2
## [31] BiocParallel_1.41.0 locfit_1.5-9.10 GenomeInfoDbData_1.2.13
## [34] fitdistrplus_1.2-1 buildtools_1.0.0 R6_2.5.1
## [37] lifecycle_1.0.4 zlibbioc_1.52.0 MASS_7.3-61
## [40] bslib_0.8.0 Rcpp_1.0.13-1 xfun_0.49
## [43] sys_3.4.3 htmltools_0.5.8.1 snow_0.4-4
## [46] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2