Title: | Normalization of single cell RNA-seq data |
---|---|
Description: | This package implements SCnorm — a method to normalize single-cell RNA-seq data. |
Authors: | Rhonda Bacher |
Maintainer: | Rhonda Bacher <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.29.0 |
Built: | 2024-10-31 04:46:14 UTC |
Source: | https://github.com/bioc/SCnorm |
Perform the correction within each sample (See loess normalization in original publication Risso et al., 2011 (BMC Bioinformatics)). Similar to function in EDAseq v2.8.0.
correctWithin(y, correctFactor)
correctWithin(y, correctFactor)
y |
gene to perform the regression on. |
correctFactor |
list of data needed for the regression. |
Performs within sample normalization.
within-cell normalized expression estimates
Median quantile regression is fit for each gene using the normalized gene expression values. A slope near zero indicate the sequencing depth effect has been successfully removed. Genes are divided into ten equally sized groups based on their non-zero median expression. Slope densities are plot for each group and estimated modes are calculated. If any of the ten group modes is larger than .1, the K is not sufficient to normalize the data.
evaluateK( Data, SeqDepth, OrigData, Slopes, Name, Tau, PrintProgressPlots, ditherCounts )
evaluateK( Data, SeqDepth, OrigData, Slopes, Name, Tau, PrintProgressPlots, ditherCounts )
Data |
matrix of normalized expression counts. Rows are genes and columns are samples. |
SeqDepth |
vector of sequencing depths estimated as columns sums of un-normalized expression matrix. |
OrigData |
matrix of un-normalized expression counts. Rows are genes and columns are samples. |
Slopes |
vector of slopes estimated in the GetSlopes() function. Only used here to obtain the names of genes considered in the normalization. |
Name |
plot title |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is median, Tau = .5 ). |
PrintProgressPlots |
whether to automatically produce plot as SCnorm determines the optimal number of groups (default is FALSE, highly suggest using TRUE). Plots will be printed to the current device. |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
value of largest mode and a plot of the ten normalized slope densities.
Rhonda Bacher
Data generated as in SIM I from the manuscript with K = 4.
ExampleSimSCData
ExampleSimSCData
data matrix
data(ExampleSimSCData)
data(ExampleSimSCData)
Genes are divided into NumExpressionGroups = 10 equally sized groups based on their non-zero median expression. Slope densities are plot for each group.
generateEvalPlot( MedExpr, SeqDepth, Slopes, Name, NumExpressionGroups = 10, BeforeNorm = TRUE )
generateEvalPlot( MedExpr, SeqDepth, Slopes, Name, NumExpressionGroups = 10, BeforeNorm = TRUE )
MedExpr |
non-zero median expression for all genes. |
SeqDepth |
sequencing depth for each cell/sample. |
Slopes |
per gene estimates of the count-depth relationship. |
Name |
name for plot title. |
NumExpressionGroups |
the number of groups to split the data into, genes are split into equally sized groups based on their non-zero median expression. |
BeforeNorm |
whether dat have already been normalized. |
a plot of the un-normalized slope densities.
Rhonda Bacher
Convenient helper function to extract the normalized expression matrix from the SummarizedExperiment
getCounts(DATA)
getCounts(DATA)
DATA |
An object of class |
A matrix
which contains the count data
where genes are in rows and cells are in columns
data(ExampleSimSCData) ExampleData <- SummarizedExperiment::SummarizedExperiment(assays=list("Counts"=ExampleSimSCData)) myData <- getCounts(ExampleData)
data(ExampleSimSCData) ExampleData <- SummarizedExperiment::SummarizedExperiment(assays=list("Counts"=ExampleSimSCData)) myData <- getCounts(ExampleData)
getDens
getDens(ExprGroups, byGroup, RETURN = c("Mode", "Height"))
getDens(ExprGroups, byGroup, RETURN = c("Mode", "Height"))
ExprGroups |
expression groups already split. |
byGroup |
factor (usually slopes) to get density based on ExprGroups. |
RETURN |
whether to return Mode or Height of density. |
get density of slopes in different expression groups
list, length is equal to NumGroups
This is the gene-specific fitting function, where a median (Tau = .5) quantile regression is fit for each gene. Only genes having at least 10 non-zero expression values are considered.
getSlopes( Data, SeqDepth = 0, Tau = 0.5, FilterCellNum = 10, ditherCounts = FALSE )
getSlopes( Data, SeqDepth = 0, Tau = 0.5, FilterCellNum = 10, ditherCounts = FALSE )
Data |
matrix of un-normalized expression counts. Rows are genes and columns are samples. |
SeqDepth |
vector of sequencing depths estimated as columns sums of un-normalized expression matrix. |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is median, Tau = .5 ). |
FilterCellNum |
the number of non-zero expression estimate required to include the genes into the SCnorm fitting (default = 10). The initial |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
vector of estimated slopes.
Rhonda Bacher
data(ExampleSimSCData) myslopes <- getSlopes(ExampleSimSCData)
data(ExampleSimSCData) myslopes <- getSlopes(ExampleSimSCData)
This is an internal fitting of the group regression. For a single combination of possible tau and d values the group regression is fist fit, then predicted values are obtained and regressed against the original sequencing depths. The estimates slope is passed back to the SCnorm_fit() function.
GetTD(x, InputData)
GetTD(x, InputData)
x |
specifies a column of the grid matrix of tau and d. |
InputData |
contains the expression values, sequencing depths to fit the group regression, and the quantile used in the individual gene regression for grouping. |
estimated count-depth relationship of predicted values for one value of tau and degree.
Rhonda Bacher
This function iteratively normalizes using K groups and then evaluates whether K is sufficient. If the maximum mode received from the GetK() function is larger than .1, K is increased to K + 1. Uses params sent from SCnorm.
normWrapper( Data, SeqDepth = NULL, Slopes = NULL, CondNum = NULL, PrintProgressPlots, PropToUse, Tau, Thresh, ditherCounts )
normWrapper( Data, SeqDepth = NULL, Slopes = NULL, CondNum = NULL, PrintProgressPlots, PropToUse, Tau, Thresh, ditherCounts )
Data |
can be a matrix of single-cell expression with cells
where rows are genes and columns are samples. Gene names should
not be a column in this matrix, but should be assigned to rownames(Data).
Data can also be an object of class |
SeqDepth |
sequencing depth for each cell/sample. |
Slopes |
per gene estimates of the count-depth relationship. |
CondNum |
name of group being normalized, just for printing messages. |
PrintProgressPlots |
whether to automatically produce plot as SCnorm determines the optimal number of groups (default is FALSE, highly suggest using TRUE). Plots will be printed to the current device. |
PropToUse |
proportion of genes closest to the slope mode used for the group fitting, default is set at .25. This number #' mainly affects speed. |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is median, Tau = .5 ). |
Thresh |
threshold to use in evaluating the sufficiency of K, default is .1. |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
matrix of normalized and scaled expression values for all conditions and the evaluation plots are output for each attempted value of K.
Rhonda Bacher
Quantile regression is used to estimate the dependence of read counts on sequencing depth for every gene. If multiple conditions are provided, a separate plot is provided for each and the filters are applied within each condition separately. The plot can be used to evaluate the extent of the count-depth relationship in the dataset or can be be used to evaluate data normalized by alternative methods.
plotCountDepth( Data, NormalizedData = NULL, Conditions = NULL, Tau = 0.5, FilterCellProportion = 0.1, FilterExpression = 0, NumExpressionGroups = 10, NCores = NULL, ditherCounts = FALSE )
plotCountDepth( Data, NormalizedData = NULL, Conditions = NULL, Tau = 0.5, FilterCellProportion = 0.1, FilterExpression = 0, NumExpressionGroups = 10, NCores = NULL, ditherCounts = FALSE )
Data |
can be a matrix of single-cell expression with cells
where rows are genes and columns are samples. Gene names should
not be a column in this matrix, but should be assigned to rownames(Data).
Data can also be an object of class |
NormalizedData |
matrix of normalized expression counts. Rows are genesand columns are samples. Only input this if evaluating already normalized data. |
Conditions |
vector of condition labels, this should correspond to the columns of the un-normalized expression matrix. If not provided data is assumed to come from same condition/batch. |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is Tau = .5 (median)). |
FilterCellProportion |
the proportion of non-zero expression estimates required to include the genes into the evaluation. Default is .10, and will not go below a proportion which uses less than 10 total cells/samples. |
FilterExpression |
exclude genes having median of non-zero expression below this threshold from count-depth plots (default = 0). |
NumExpressionGroups |
the number of groups to split the data into, genes are split into equally sized groups based on their non-zero median expression. |
NCores |
number of cores to use, default is detectCores() - 1. This will be used to set up a parallel environment using either MulticoreParam (Linux, Mac) or SnowParam (Windows) with NCores using the package BiocParallel. |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
returns a data.frame containing each gene's slope (count-depth relationship) and its associated expression group. A plot will be output.
Rhonda Bacher
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 90) #plotCountDepth(Data = ExampleSimSCData, Conditions = Conditions, #FilterCellProportion = .1)
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 90) #plotCountDepth(Data = ExampleSimSCData, Conditions = Conditions, #FilterCellProportion = .1)
This function can be used to evaluate the extent of gene-specific biases in the data. If a bias exists, the plots provided here will identify whether it affects cells equally or not. Correction for such features may be considered especially if the bias is different between conditions (see SCnorm vignette for details).
plotWithinFactor( Data, withinSample = NULL, Conditions = NULL, FilterExpression = 0, NumExpressionGroups = 4 )
plotWithinFactor( Data, withinSample = NULL, Conditions = NULL, FilterExpression = 0, NumExpressionGroups = 4 )
Data |
can be a matrix of single-cell expression with cells
where rows are genes and columns are samples. Gene names should
not be a column in this matrix, but should be assigned to rownames(Data).
Data can also be an object of class |
withinSample |
a vector of gene-specific features. |
Conditions |
vector of condition labels, this should correspond to the columns of the un-normalized expression matrix. If provided the cells will be colored by Condition instead of individually. |
FilterExpression |
exclude genes having median of non-zero expression below this threshold. |
NumExpressionGroups |
the number of groups to split the within sample factor into, e.g genes will be split into equally sized groups based on their GC content/Gene length/etc. |
produces a plot and returns the data the plot is based on.
Rhonda Bacher
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 90) exampleFactor = runif(dim(ExampleSimSCData)[1], 0, 1) names(exampleFactor) = rownames(ExampleSimSCData) #plotWithinFactor(Data = ExampleSimSCData, #withinSample=exampleFactor, Conditions = Conditions)
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 90) exampleFactor = runif(dim(ExampleSimSCData)[1], 0, 1) names(exampleFactor) = rownames(ExampleSimSCData) #plotWithinFactor(Data = ExampleSimSCData, #withinSample=exampleFactor, Conditions = Conditions)
Perform the single gene regressions using quantile regression.
quickReg(x, InputData)
quickReg(x, InputData)
x |
gene to perform the regression on. |
InputData |
list of data needed for the regression. |
Perform the single gene regressions using quantile regression.
gene slope.
redoBox
redoBox(DATA, smallc)
redoBox(DATA, smallc)
DATA |
data set to. |
smallc |
what value to ignore, typically is zero. |
Function to log data and turn zeros to NA to mask/ignore in functions.
the dataset has been logged with values below smallc masked.
Convenient helper function to extract the results (
normalized data, list of genes filtered out, or scale factors). Results
data.frames/matrices are stored in the
metadata
slot and can also be accessed without the help of this
convenience function by calling metadata(DataNorm)
.
results(DATA, type = c("NormalizedData", "ScaleFactors", "GenesFilteredOut"))
results(DATA, type = c("NormalizedData", "ScaleFactors", "GenesFilteredOut"))
DATA |
An object of class |
type |
A character variable specifying which output is desired,
with possible values "NormalizedData", "ScaleFactors", and "GenesFilteredOut".
By default results() will
return type="NormalizedData", which is the matrix of normalized counts from SCnorm.
By specifiying type="ScaleFactors" a matrix of scale factors (only returned if
reportSF=TRUE when running |
A data.frame
containing output as detailed in the
description of the type
input parameter
data(ExampleSimSCData) Conditions = rep(c(1), each= 90) #NormData <- SCnorm(Data=ExampleSimSCData, Conditions=Conditions) #normDataMatrix <- results(NormData)
data(ExampleSimSCData) Conditions = rep(c(1), each= 90) #NormData <- SCnorm(Data=ExampleSimSCData, Conditions=Conditions) #normDataMatrix <- results(NormData)
After conditions are independently normalized with the count-depth effect removed, conditions need to be additionally scaled prior to further analysis. Genes that were normalized in both conditions are split into quartiles based on their un-normalized non-zero medians. Genes in each quartile are scaled to the median fold change of condition specific gene means and overall gene means. This function can be used independetly if SCnorm was run across different Conditions separately. However, the input must be as follow: NormData <- list(list(NormData = normalizedDataSet1), list(NormData = normalizedDataSet2)) where normalizedDataSet1 is the normalized matrix obtained using normcounts() on the output of SCnorm().
scaleNormMultCont(NormData, OrigData, Genes, useSpikes, useZerosToScale)
scaleNormMultCont(NormData, OrigData, Genes, useSpikes, useZerosToScale)
NormData |
list of matrices of normalized expression counts and scale factors for each condition. Matrix rows are genes and columns are samples. |
OrigData |
list of matrices of un-normalized expression counts. Matrix rows are genes and columns are samples. Each item in list is a different condition. |
Genes |
vector of genes that will be used to scale conditions, only want to use genes that were normalized. |
useSpikes |
whether to use spike-ins to perform between condition scaling (default=FALSE). Assumes spike-in names start with "ERCC-". |
useZerosToScale |
whether to use zeros when scaling across conditions (default=FALSE). |
matrix of normalized and scaled expression values for all conditions.
Rhonda Bacher
Quantile regression is used to estimate the dependence of read counts on sequencing depth for every gene. Genes with similar dependence are then grouped, and a second quantile regression is used to estimate scale factors within each group. Within-group adjustment for sequencing depth is then performed using the estimated scale factors to provide normalized estimates of expression. If multiple conditions are provided, normalization is performed within condition and then normalized estimates are scaled between conditions. If withinSample=TRUE then the method from Risso et al. 2011 will be implemented.
SCnorm( Data = NULL, Conditions = NULL, PrintProgressPlots = FALSE, reportSF = FALSE, FilterCellNum = 10, FilterExpression = 0, Thresh = 0.1, K = NULL, NCores = NULL, ditherCounts = FALSE, PropToUse = 0.25, Tau = 0.5, withinSample = NULL, useSpikes = FALSE, useZerosToScale = FALSE )
SCnorm( Data = NULL, Conditions = NULL, PrintProgressPlots = FALSE, reportSF = FALSE, FilterCellNum = 10, FilterExpression = 0, Thresh = 0.1, K = NULL, NCores = NULL, ditherCounts = FALSE, PropToUse = 0.25, Tau = 0.5, withinSample = NULL, useSpikes = FALSE, useZerosToScale = FALSE )
Data |
can be a matrix of single-cell expression with cells
where rows are genes and columns are samples. Gene names should
not be a column in this matrix, but should be assigned to rownames(Data).
Data can also be an object of class |
Conditions |
vector of condition labels, this should correspond to the columns of the expression matrix. |
PrintProgressPlots |
whether to automatically produce plot as SCnorm determines the optimal number of groups (default is FALSE, highly suggest using TRUE). Plots will be printed to the current device. |
reportSF |
whether to provide a matrix of scaling counts in the output (default = FALSE). |
FilterCellNum |
the number of non-zero expression estimate required to include the genes into the SCnorm fitting (default = 10). The initial grouping fits a quantile regression to each gene, making this value too low gives unstable fits. |
FilterExpression |
exclude genes having median of non-zero expression from the normalization. |
Thresh |
threshold to use in evaluating the sufficiency of K, default is .1. |
K |
the number of groups for normalizing. If left unspecified, an evaluation procedure will determine the optimal value of K (recommended). |
NCores |
number of cores to use, default is detectCores() - 1. This will be used to set up a parallel environment using either MulticoreParam (Linux, Mac) or SnowParam (Windows) with NCores using the package BiocParallel. |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
PropToUse |
proportion of genes closest to the slope mode used for the group fitting, default is set at .25. This number #' mainly affects speed. |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is median, Tau = .5 ). |
withinSample |
a vector of gene-specific features to correct counts within a sample prior to SCnorm. If NULL(default) then no correction will be performed. Examples of gene-specific features are GC content or gene length. |
useSpikes |
whether to use spike-ins to perform across condition scaling (default=FALSE). Spike-ins must be stored in the SingleCellExperiment object using altExp() function from SingleCellExperiment. See vignette for example. |
useZerosToScale |
whether to use zeros when scaling across conditions (default=FALSE). |
List containing matrix of normalized expression (and optionally a matrix of size factors if reportSF = TRUE ).
Rhonda Bacher
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 45) #DataNorm <- SCnorm(ExampleSimSCData, Conditions, #FilterCellNum = 10) #str(DataNorm)
data(ExampleSimSCData) Conditions = rep(c(1,2), each= 45) #DataNorm <- SCnorm(ExampleSimSCData, Conditions, #FilterCellNum = 10) #str(DataNorm)
For each group K, a quantile regression is fit over all genes (PropToUse) for a grid of possible degree's d and quantile's tau. For each value of tau and d, the predicted expression values are obtained and regressed against the original sequencing depths. The optimal tau and d combination is chosen as that closest to the mode of the gene slopes.
SCnormFit(Data, SeqDepth, Slopes, K, PropToUse = 0.25, Tau = 0.5, ditherCounts)
SCnormFit(Data, SeqDepth, Slopes, K, PropToUse = 0.25, Tau = 0.5, ditherCounts)
Data |
can be a matrix of single-cell expression with cells
where rows are genes and columns are samples. Gene names should
not be a column in this matrix, but should be assigned to rownames(Data).
Data can also be an object of class |
SeqDepth |
sequencing depth for each cell/sample. |
Slopes |
per gene estimates of the count-depth relationship. |
K |
the number of groups for normalizing. If left unspecified, an evaluation procedure will determine the optimal value of K (recommended). |
PropToUse |
proportion of genes closest to the slope mode used for the group fitting, default is set at .25. This number #' mainly affects speed. |
Tau |
value of quantile for the quantile regression used to estimate gene-specific slopes (default is median, Tau = .5 ). |
ditherCounts |
whether to dither/jitter the counts, may be used for data with many ties, default is FALSE. |
normalized expression matrix and matrix of scaling factors.
Rhonda Bacher
splitGroups
splitGroups(DATA, NumGroups = 10)
splitGroups(DATA, NumGroups = 10)
DATA |
vector to be splot. |
NumGroups |
number of groups |
helper function to get split a vector into a specified number of groups
list, length is equal to NumGroups