Title: | A dimensionality reduction tool using gene detection pattern to mitigate noisy expression profile of scRNA-seq |
---|---|
Description: | This package is designed to model gene detection pattern of scRNA-seq through a binary factor analysis model. This model allows user to pass into a cell level covariate matrix X and gene level covariate matrix Q to account for nuisance variance(e.g batch effect), and it will output a low dimensional embedding matrix for downstream analysis. |
Authors: | Ruoxin Li [aut, cre], Gerald Quon [aut] |
Maintainer: | Ruoxin Li <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.21.0 |
Built: | 2025-01-02 05:46:14 UTC |
Source: | https://github.com/bioc/scBFA |
Performs Binary PCA (as outlined in our paper). This function take the input of gene expression profile and perform PCA on gene detection pattern
BinaryPCA(scData, X = NULL, scale. = FALSE, center = TRUE)
BinaryPCA(scData, X = NULL, scale. = FALSE, center = TRUE)
scData |
can be a raw count matrix, in which rows are genes and columns are cells; can be a seurat object; can be a SingleCellExperiment object. |
X |
|
scale. |
Logical value isndicating whether the variables should be scaled to have unit variance before the analysis takes place. In general scaling is not advisable, since we think the variance in the gene detection space is potentially associated with celltypes (e.g cell type specific markers) |
center |
Logical value indicating whether the variables should be shifted to be zero centered |
A list with class "prcomp",containing the following components:
sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix).
rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.
x: the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, cov(x) is the diagonal matrix diag(sdev^2).
center, scale. centering and scaling used, or FALSE.
## Working with Seurat or SingleCellExperiment object library(Seurat) library(SingleCellExperiment) ## Input expression profile, 5 genes x 3 cells GeneExpr = matrix(rpois(15,1),nrow = 5,ncol = 3) rownames(GeneExpr) = paste0("gene",seq_len(nrow(GeneExpr))) colnames(GeneExpr) = paste0("cell",seq_len(ncol(GeneExpr))) celltype = as.factor(sample(c(1,2,3),3,replace = TRUE)) ## Create cell level technical batches batch = sample(c("replicate 1","replicate 2","replicate 2")) X = matrix(NA,nrow = length(batch),ncol = 1) X[which(batch =="replicate 1"), ] = 0 X[which(batch =="replicate 2"), ] = 1 rownames(X) = colnames(GeneExpr) ##run BFA with raw count matrix bpca_model = BinaryPCA(scData = GeneExpr,X = scale(X)) ## Create Seurat object for input to BFA scData = CreateSeuratObject(counts = GeneExpr,project = "sc",min.cells = 0) ## Standardize the covariate matrix should be a default operation bpca_model = BinaryPCA(scData = scData, X = scale(X)) ## Build the SingleCellExperiment object for input to BFA ## Set up SingleCellExperiment class sce <- SingleCellExperiment(assay = list(counts = GeneExpr)) ## Standardize the covariate matrix should be a default operation bpca_model = BinaryPCA(scData = sce, X = scale(X))
## Working with Seurat or SingleCellExperiment object library(Seurat) library(SingleCellExperiment) ## Input expression profile, 5 genes x 3 cells GeneExpr = matrix(rpois(15,1),nrow = 5,ncol = 3) rownames(GeneExpr) = paste0("gene",seq_len(nrow(GeneExpr))) colnames(GeneExpr) = paste0("cell",seq_len(ncol(GeneExpr))) celltype = as.factor(sample(c(1,2,3),3,replace = TRUE)) ## Create cell level technical batches batch = sample(c("replicate 1","replicate 2","replicate 2")) X = matrix(NA,nrow = length(batch),ncol = 1) X[which(batch =="replicate 1"), ] = 0 X[which(batch =="replicate 2"), ] = 1 rownames(X) = colnames(GeneExpr) ##run BFA with raw count matrix bpca_model = BinaryPCA(scData = GeneExpr,X = scale(X)) ## Create Seurat object for input to BFA scData = CreateSeuratObject(counts = GeneExpr,project = "sc",min.cells = 0) ## Standardize the covariate matrix should be a default operation bpca_model = BinaryPCA(scData = scData, X = scale(X)) ## Build the SingleCellExperiment object for input to BFA ## Set up SingleCellExperiment class sce <- SingleCellExperiment(assay = list(counts = GeneExpr)) ## Standardize the covariate matrix should be a default operation bpca_model = BinaryPCA(scData = sce, X = scale(X))
A vector contains the cell types as labels for cells in example scRNA-seq dataset(exprdata)
data(celltype)
data(celltype)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89232
The cell type vector is generated from the following code
data(celltype_toy)
data(celltype_toy)
celltype = as.factor(sample(c(1,2,3),5,replace = TRUE))
Perform diagnoisis of dispersion on the expression profile to check whether scBFA works on specific dataset
diagnose( scData, sampleInfo = NULL, disperType = "Fitted", diagnose_feature = "dispersion" )
diagnose( scData, sampleInfo = NULL, disperType = "Fitted", diagnose_feature = "dispersion" )
scData |
can be a raw count matrix, in which rows are genes and columns are cells; can be a seurat object; can be a SingleCellExperiment object. |
sampleInfo |
sample level feature matrix,e.g batch effect,experimental conditions in which rows are cells,columns are number of covariates.Default is NULL |
disperType |
a parameter to tell which dispersion estimate the user can plot DESeq2 offers stepwise dispersion estimate, a gene wise dispersion estimate using "GeneEst", dispersion estimate from fitted disperions ~ mean curve (using "Fitted") And final MAP estimate,using "Map". Default value is "Fitted" |
diagnose_feature |
a parameter to determine whether the user want to check GDR or dispersion. |
A Figure to tell the where the input data's dispersion ~ tpm curve align to the 14 benchmark datasets in Figure 2.a or Gene detection rate
data(exprdata) diagnose(scData = exprdata)
data(exprdata) diagnose(scData = exprdata)
A dataframe contains all the gene-wise dispersion estimates loess curve for 14 datasets we benchmarked in Figure 2.a
data(disperPlot)
data(disperPlot)
The variable in the columns are: fitted_dispersion: the log value of gene-wise dispersion after fitting a loess curve with respect to TPM value. Note that the genes at the top 2.5 meantpm is average tpm value calculated per gene dataset are nams for datasets variance is gene selection method, here is HEG vs HVG
A matrix contains 950 cells and 500 genes. The source of this dataset is cDC/ pre-DC cells(see supplementary files) We subset most variant 100 genes as example scRNA-seq dataset(exprdata)
data(exprdata)
data(exprdata)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89232
Function to extract gene expression matrix from input observation matrix
getGeneExpr(scData)
getGeneExpr(scData)
scData |
can be a raw count matrix, in which rows are genes and columns are cells; can be a seurat object; can be a SingleCellExperiment object |
a raw expression matrix in which rows are genes and columns are cells.
scData = matrix(rpois(15,1),3,5) GeneExpr = getGeneExpr(scData)
scData = matrix(rpois(15,1),3,5) GeneExpr = getGeneExpr(scData)
Function to get low dimensional loading matrix
getLoading(modelEnv)
getLoading(modelEnv)
modelEnv |
output environment variable |
:
by
compressed feature space
GeneExpr = matrix(rpois(15,1),3,5) bfa_model = scBFA(scData = GeneExpr,X = NULL,numFactors =2) A = getLoading(bfa_model)
GeneExpr = matrix(rpois(15,1),3,5) bfa_model = scBFA(scData = GeneExpr,X = NULL,numFactors =2) A = getLoading(bfa_model)
Function to get low dimensional embedding matrix
getScore(modelEnv)
getScore(modelEnv)
modelEnv |
output environment variable |
Z: by
low dimensional embedding
GeneExpr = matrix(rpois(15,1),3,5) bfa_model = scBFA(scData = GeneExpr,X = NULL,numFactors =2) Z = getScore(bfa_model)
GeneExpr = matrix(rpois(15,1),3,5) bfa_model = scBFA(scData = GeneExpr,X = NULL,numFactors =2) Z = getScore(bfa_model)
Calculate gradient of the negative log likelihood, used for calls to the optim() function.
gradient(parameters, modelEnv)
gradient(parameters, modelEnv)
parameters |
Vectorized parameter space. |
modelEnv |
Environment variable contains parameter space, and global variables such as N,G,C,detection matrix B, etc |
Vectorized gradient
Calculate gradient of the negative log likelihood, used for calls to the optim() function.
gradient_chunk(parameters, modelEnv)
gradient_chunk(parameters, modelEnv)
parameters |
Vectorized parameter space. |
modelEnv |
Environment variable contains parameter space, and global variables such as N,G,C,detection matrix B, etc |
Vectorized gradient
Perform BFA model on the expression profile
scBFA( scData, numFactors, X = NULL, Q = NULL, maxit = 300, method = "L-BFGS-B", initCellcoef = NULL, updateCellcoef = TRUE, updateGenecoef = TRUE, NUM_CELLS_PER_CHUNK = 5000, doChunking = FALSE )
scBFA( scData, numFactors, X = NULL, Q = NULL, maxit = 300, method = "L-BFGS-B", initCellcoef = NULL, updateCellcoef = TRUE, updateGenecoef = TRUE, NUM_CELLS_PER_CHUNK = 5000, doChunking = FALSE )
scData |
can be a raw count matrix, in which rows are genes and columns are cells; can be a seurat object; can be a SingleCellExperiment object. |
numFactors |
Numeric value, number of latent dimensions |
X |
|
Q |
G by T gene-specific covariate matrix(e.g quality control measures), in which rows are genes columns are number of covariates, If no such covariates are available, then Q = NULL |
maxit |
Numeric value, parameter to control the Maximum number of iterations in the optimization, default is 300. |
method |
Method of optimization,default is L-BFGS-B(Limited memory BFGS) approach. Conjugate Gradient (CG) is recommended for larger dataset (number of cells > 10k) |
initCellcoef |
Initialization of C by G gene-specific coefficient matrix
as user-defined coefficient |
updateCellcoef |
Logical value, parameter to decide whether to update C by G gene-specific coefficient matrix. Again, when the cell types are confounded with technical batches or there is no cell level covariate matrix, the user can keep the initialization of coefficients as known estimate. |
updateGenecoef |
Logical value, parameter to decide whether to update N by T gene-specific coefficient matrix. Again, when there is no gene level covariate matrix, this value should be FALSE by default. |
NUM_CELLS_PER_CHUNK |
scBFA can run out of memory on large datasets, so we can chunk up computations to avoid this if necessary. NUM_CELLS_PER_CHUNK is the number of cells per 'chunk' computed. Shrink if running out of mem. |
doChunking |
Use memory-efficient (but slower) chunking. Will do automatically if the chunk size is specified to be smaller than the # of cells in dataset. |
A model environment containing all parameter space of a BFA model as well as global variables needed for calculation:
:
by
compressed feature space matrix
:
by
low dimensional embedding matrix
:
by
cell level coefficient matrix
:
by
gene level coefficient matrix
:
by
offset matrix
:
by
offset matrix
## Working with Seurat or SingleCellExperiment object library(Seurat) library(SingleCellExperiment) ## Input expression profile, 5 genes x 3 cells GeneExpr = matrix(rpois(15,1),nrow = 5,ncol = 3) rownames(GeneExpr) = paste0("gene",seq_len(nrow(GeneExpr))) colnames(GeneExpr) = paste0("cell",seq_len(ncol(GeneExpr))) celltype = as.factor(sample(c(1,2,3),3,replace = TRUE)) ## Create cell level technical batches batch = sample(c("replicate 1","replicate 2","replicate 2")) X = matrix(NA,nrow = length(batch),ncol = 1) X[which(batch =="replicate 1"), ] = 0 X[which(batch =="replicate 2"), ] = 1 rownames(X) = colnames(GeneExpr) ## run BFA with raw count matrix bfa_model = scBFA(scData = GeneExpr,X = scale(X),numFactors =2) ## Create Seurat object for input to BFA scData = CreateSeuratObject(counts = GeneExpr,project="sc",min.cells = 0) ## Standardize the covariate matrix should be a default operation bfa_model = scBFA(scData = scData, X = scale(X), numFactors = 2) ## Build the SingleCellExperiment object for input to BFA ## Set up SingleCellExperiment class sce <- SingleCellExperiment(assay = list(counts = GeneExpr)) ## Standardize the covariate matrix should be a default operation bfa_model = scBFA(scData = sce, X = scale(X), numFactors = 2)
## Working with Seurat or SingleCellExperiment object library(Seurat) library(SingleCellExperiment) ## Input expression profile, 5 genes x 3 cells GeneExpr = matrix(rpois(15,1),nrow = 5,ncol = 3) rownames(GeneExpr) = paste0("gene",seq_len(nrow(GeneExpr))) colnames(GeneExpr) = paste0("cell",seq_len(ncol(GeneExpr))) celltype = as.factor(sample(c(1,2,3),3,replace = TRUE)) ## Create cell level technical batches batch = sample(c("replicate 1","replicate 2","replicate 2")) X = matrix(NA,nrow = length(batch),ncol = 1) X[which(batch =="replicate 1"), ] = 0 X[which(batch =="replicate 2"), ] = 1 rownames(X) = colnames(GeneExpr) ## run BFA with raw count matrix bfa_model = scBFA(scData = GeneExpr,X = scale(X),numFactors =2) ## Create Seurat object for input to BFA scData = CreateSeuratObject(counts = GeneExpr,project="sc",min.cells = 0) ## Standardize the covariate matrix should be a default operation bfa_model = scBFA(scData = scData, X = scale(X), numFactors = 2) ## Build the SingleCellExperiment object for input to BFA ## Set up SingleCellExperiment class sce <- SingleCellExperiment(assay = list(counts = GeneExpr)) ## Standardize the covariate matrix should be a default operation bfa_model = scBFA(scData = sce, X = scale(X), numFactors = 2)
simulation to generate scRNA-seq data with varying level of gene detection noise versus gene count noise
scNoiseSim(zinb, celltype, disper, var_dropout = 1, var_count = 1, delta)
scNoiseSim(zinb, celltype, disper, var_dropout = 1, var_count = 1, delta)
zinb |
a ZINB-WaVE object representing ZINB-WaVE fit to real data to get realistic simulation parameters |
celltype |
a factor to specify the ground-truth cell types in the original dataset that the parameter of zinb object is fit to. Since we filter out some simulated cells due to low amount of genes detected in that cell, we subset the ground truth cell types correspondingly |
disper |
numeric value, parameter to control the size factor
|
var_dropout |
numeric value, parameter to control the noise level added
to a common embedding space for to generate gene detection matrix.
This parameter is formulated as |
var_count |
numeric value, parameter to control the noise level
added to a common embedding space to generate gene count matrix.
This parameter is formulated as |
delta |
intercept to control the overall gene detection rate. and in the paper is selected from the set -2, -0.5, 1,2.5,4 |
GeneExpr,a count matrix with rows number of genes and columns number of cells
celltype,a vector specify the corresponding celltype after QC measures.
## raw counts matrix with rows are genes and columns are cells data("zinb_toy",package = "scBFA", envir = environment()) ## a vector specify the ground truth of cell types provided by conquer database data("celltype_toy",package = "scBFA",envir = environment()) scData = scNoiseSim(zinb = zinb_toy, celltype = celltype_toy, disper = 1, var_dropout =1, var_count = 1, delta = 1)
## raw counts matrix with rows are genes and columns are cells data("zinb_toy",package = "scBFA", envir = environment()) ## a vector specify the ground truth of cell types provided by conquer database data("celltype_toy",package = "scBFA",envir = environment()) scData = scNoiseSim(zinb = zinb_toy, celltype = celltype_toy, disper = 1, var_dropout =1, var_count = 1, delta = 1)
The toy dataset is generated from the following code require(zinbwave) GeneExpr = matrix(rpois(50,1),nrow = 10,ncol = 5) rownames(GeneExpr) = paste0("gene",seq_len(nrow(GeneExpr))) colnames(GeneExpr) = paste0("cell",seq_len(ncol(GeneExpr))) celltype = as.factor(sample(c(1,2,3),5,replace = TRUE)) zinb = zinbFit(Y = GeneExpr,K=2)
data(zinb_toy)
data(zinb_toy)