PsiNorm: a scalable normalization for single-cellRNA-seq data

Introduction

library(SingleCellExperiment)
library(splatter)
library(scater)
library(cluster)
library(scone)

PsiNorm is a scalable between-sample normalization for single cell RNA-seq count data based on the power-law Pareto type I distribution. It can be demonstrated that the Pareto parameter is inversely proportional to the sequencing depth, it is sample specific and its estimate can be obtained for each cell independently. PsiNorm computes the shape parameter for each cellular sample and then uses it as multiplicative size factor to normalize the data. The final goal of the transformation is to align the gene expression distribution especially for those genes characterised by high expression. Note that, similar to other global scaling methods, our method does not remove batch effects, which can be dealt with downstream tools.

To evaluate the ability of PsiNorm to remove technical bias and reveal the true cell similarity structure, we used both an unsupervised and a supervised approach. We first simulate a scRNA-seq experiment with four known clusters using the splatter Bioconductor package. Then in the unsupervised approach, we i) reduce dimentionality using PCA, ii) identify clusters using the clara partitional method and then we iii) computed the Adjusted Rand Index (ARI) to compare the known and the estimated partition.

In the supervised approach, we i) reduce dimentionality using PCA, and we ii) compute the silhouette index of the known partition in the reduced dimensional space.

Citation

If you use PsiNorm in publications, please cite the following article:

Borella, M., Martello, G., Risso, D., & Romualdi, C. (2021). PsiNorm: a scalable normalization for single-cell RNA-seq data. bioRxiv. https://doi.org/10.1101/2021.04.07.438822.

Data Simulation

We simulate a matrix of counts with 2000 cellular samples and 10000 genes with splatter.

set.seed(1234)
params <- newSplatParams()
N=2000
sce <- splatSimulateGroups(params, batchCells=N, lib.loc=12,
                           group.prob = rep(0.25,4),
                           de.prob = 0.2, de.facLoc = 0.06,
                           verbose = FALSE) 

sce is a SingleCellExperiment object with a single batch and four different cellular groups.

To visualize the data we used the first two Principal Components estimated starting from the raw log-count matrix.

set.seed(1234)
assay(sce, "lograwcounts") <- log1p(counts(sce))
sce <- runPCA(sce, exprs_values="lograwcounts", scale=TRUE, ncomponents = 2)
plotPCA(sce, colour_by="Group")

PsiNorm data normalization

Data Normalization with PsiNorm

To normalize the raw counts we used the PsiNorm normalization and we visualized the data using the first two principal components.

sce<-PsiNorm(sce)
sce<-logNormCounts(sce)
head(sizeFactors(sce))
#>     Cell1     Cell2     Cell3     Cell4     Cell5     Cell6 
#> 1.1017454 0.9667386 1.0169251 0.9343382 1.0966308 1.1845135

Note that running the PsiNorm function computes a set of size factors that are added to the SingleCellExperiment object.

The logNormCounts function can be then used to normalize the data by multiplying the raw counts and the size factors.

set.seed(1234)
sce<-runPCA(sce, exprs_values="logcounts", scale=TRUE, name = "PsiNorm_PCA",
            ncomponents = 2)
plotReducedDim(sce, dimred = "PsiNorm_PCA", colour_by = "Group")

We can appreciate from the plot that PsiNorm allows a better separation among known cellular groups.

Unsupervised approach: Adusted Rand Index

We calculate ARI of both raw counts and PsiNorm normalized counts after PCA dimension reduction and clara clustering (with k equal to the simulated number of clusters); higher the ARI, better the normalization.

groups<-cluster::clara(reducedDim(sce, "PCA"), k=nlevels(sce$Group))
a<-paste("ARI from raw counts:", 
         round(mclust::adjustedRandIndex(groups$clustering, sce$Group), 
               digits = 3))

groups<-cluster::clara(reducedDim(sce, "PsiNorm_PCA"), k=nlevels(sce$Group))
b<-paste("ARI from PsiNorm normalized data:",
         round(mclust::adjustedRandIndex(groups$clustering, sce$Group), 
               digits = 3))

kableExtra::kable(rbind(a,b), row.names = FALSE)
ARI from raw counts: 0.295
ARI from PsiNorm normalized data: 0.788

Pareto normalization considerably increases the ARI index.

Supervised approach: Silhouette index

We calculate the Silhouette index of both raw counts and PsiNorm normalized counts after tSNE dimension reduction exploiting known simulated clusters; higher the Silhouette, better the normalization.

dist<-daisy(reducedDim(sce, "PCA"))
dist<-as.matrix(dist)
a<-paste("Silhouette from raw counts:", round(summary(
    silhouette(x=as.numeric(as.factor(sce$Group)),
               dmatrix = dist))$avg.width, digits = 3))

dist<-daisy(reducedDim(sce, "PsiNorm_PCA"))
dist<-as.matrix(dist)
b<-paste("Silhouette from PsiNorm normalized data:", round(summary(
    silhouette(x=as.numeric(as.factor(sce$Group)),
               dmatrix = dist))$avg.width, digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)
Silhouette from raw counts: 0.168
Silhouette from PsiNorm normalized data: 0.534

Pareto normalization considerably increases the Silhouette index.

Correlation of PC1 and PC2 with sequencing depth

To check if PsiNorm is able to capture technical noise and remove unwanted variation within a dataset (due for instance to differences in sequencing depth), we check whether the first two PCs are capturing technical variance. We computed the maximum correlation obtained between PC1 and PC2 and cell sequencing depths; a higher correlation indicates that the normalization was not able to properly remove noise.

set.seed(4444)
PCA<-reducedDim(sce, "PCA") 
PCAp<-reducedDim(sce, "PsiNorm_PCA")
depth<-apply(counts(sce), 2, sum)
a<-paste("The Correlation with the raw data is:",
            round(abs(max(cor(PCA[,1], depth), cor(PCA[,2], depth))), digits=3))
b<-paste("The Correlation with the PsiNorm normalized data is:",
            round(abs(max(cor(PCAp[,1], depth), cor(PCAp[,2], depth))), digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)
The Correlation with the raw data is: 0.269
The Correlation with the PsiNorm normalized data is: 0.189

Our results demonstrate that the correlation significantly decreases after the PsiNorm normalization.

Using PsiNorm in scone()

As for other normalizations, scone includes a wrapper function to use PsiNorm in the SCONE evaluation framework. See Section 3.2 of the “Introduction to SCONE” vignette for an example on how to use PsiNorm within the main scone() function.

Using PsiNorm with Seurat

The PsiNorm normalization method can be used as a replacement for Seurat’s default normalization methods. To do so, we need to first normalize the data stored in a SingleCellExperiment object and then coerce that object to a Seurat object. This can be done with the as.Seurat function provided in the Seurat package (tested with Seurat 4.0.3).

library(Seurat)
sce <- PsiNorm(sce)
sce <- logNormCounts(sce)
seu <- as.Seurat(sce)

From this point on, one can continue the analysis with the recommended Seurat workflow, but using PsiNorm log-normalized data.

Using PsiNorm with HDF5 files

Thanks to the HDF5Array and DelayedArray packages, PsiNorm can be applied directly to HDF5-backed matrices without the need for the user to change the code. As an example, we use a dataset from the TENxPBMCData package, which provides several SingleCellExperiment objects with HDF5-backed matrices as their assays.

library(TENxPBMCData)

sce <- TENxPBMCData("pbmc4k")
sce
#> class: SingleCellExperiment 
#> dim: 33694 4340 
#> metadata(0):
#> assays(1): counts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

In particular, we use the pbmc4k dataset that contains about 4,000 PBMCs from a healthy donor.

The counts assay of this object is a DelayedMatrix backed by a HDF5 file. Hence, the data are store on disk (out of memory).

counts(sce)
#> <33694 x 4340> sparse DelayedMatrix object of type "integer":
#>                    [,1]    [,2]    [,3]    [,4] ... [,4337] [,4338] [,4339]
#> ENSG00000243485       0       0       0       0   .       0       0       0
#> ENSG00000237613       0       0       0       0   .       0       0       0
#> ENSG00000186092       0       0       0       0   .       0       0       0
#> ENSG00000238009       0       0       0       0   .       0       0       0
#> ENSG00000239945       0       0       0       0   .       0       0       0
#>             ...       .       .       .       .   .       .       .       .
#> ENSG00000277856       0       0       0       0   .       0       0       0
#> ENSG00000275063       0       0       0       0   .       0       0       0
#> ENSG00000271254       0       0       0       0   .       0       0       0
#> ENSG00000277475       0       0       0       0   .       0       0       0
#> ENSG00000268674       0       0       0       0   .       0       0       0
#>                 [,4340]
#> ENSG00000243485       0
#> ENSG00000237613       0
#> ENSG00000186092       0
#> ENSG00000238009       0
#> ENSG00000239945       0
#>             ...       .
#> ENSG00000277856       0
#> ENSG00000275063       0
#> ENSG00000271254       0
#> ENSG00000277475       0
#> ENSG00000268674       0
seed(counts(sce))
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/github/home/.cache/R/ExperimentHub/2f856e509119_1611"
#> 
#> Slot "name":
#> [1] "/counts"
#> 
#> Slot "as_sparse":
#> [1] TRUE
#> 
#> Slot "type":
#> [1] NA
#> 
#> Slot "dim":
#> [1] 33694  4340
#> 
#> Slot "chunkdim":
#> [1] 512  66
#> 
#> Slot "first_val":
#> [1] 0

Thanks to the DelayedArray framework, we can apply PsiNorm using the same code that we have used in the case of in-memory data.

sce<-PsiNorm(sce)
sce<-logNormCounts(sce)
sce
#> class: SingleCellExperiment 
#> dim: 33694 4340 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(12): Sample Barcode ... Date_published sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Note that logNormCounts is a delayed operation, meaning that the actual log-normalized values will be computed only when needed by the user. In other words, the data are still stored out-of-memory as the original count matrix and the log-normalized data will be computed only when logcounts(sce) is realized into memory.

seed(logcounts(sce))
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/github/home/.cache/R/ExperimentHub/2f856e509119_1611"
#> 
#> Slot "name":
#> [1] "/counts"
#> 
#> Slot "as_sparse":
#> [1] TRUE
#> 
#> Slot "type":
#> [1] NA
#> 
#> Slot "dim":
#> [1] 33694  4340
#> 
#> Slot "chunkdim":
#> [1] 512  66
#> 
#> Slot "first_val":
#> [1] 0

Session Information

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] TENxPBMCData_1.24.0         HDF5Array_1.35.2           
#>  [3] rhdf5_2.51.1                DelayedArray_0.33.3        
#>  [5] SparseArray_1.7.2           S4Arrays_1.7.1             
#>  [7] abind_1.4-8                 Matrix_1.7-1               
#>  [9] scone_1.31.0                cluster_2.1.8              
#> [11] scater_1.35.0               ggplot2_3.5.1              
#> [13] scuttle_1.17.0              splatter_1.31.0            
#> [15] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [17] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [19] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#> [21] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [23] generics_0.1.3              MatrixGenerics_1.19.0      
#> [25] matrixStats_1.4.1           BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.17.1            
#>   [3] bitops_1.0-9              filelock_1.0.3           
#>   [5] tibble_3.2.1              R.oo_1.27.0              
#>   [7] XML_3.99-0.17             lifecycle_1.0.4          
#>   [9] httr2_1.0.7               pwalign_1.3.1            
#>  [11] edgeR_4.5.1               prabclus_2.3-4           
#>  [13] lattice_0.22-6            MASS_7.3-61              
#>  [15] backports_1.5.0           magrittr_2.0.3           
#>  [17] plotly_4.10.4             limma_3.63.2             
#>  [19] sass_0.4.9                rmarkdown_2.29           
#>  [21] jquerylib_0.1.4           yaml_2.3.10              
#>  [23] flexmix_2.3-19            bayesm_3.1-6             
#>  [25] DBI_1.2.3                 buildtools_1.0.0         
#>  [27] RColorBrewer_1.1-3        ShortRead_1.65.0         
#>  [29] zlibbioc_1.52.0           purrr_1.0.2              
#>  [31] mixtools_2.0.0            R.utils_2.12.3           
#>  [33] RCurl_1.98-1.16           nnet_7.3-19              
#>  [35] tensorA_0.36.2.1          rappdirs_0.3.3           
#>  [37] GenomeInfoDbData_1.2.13   ggrepel_0.9.6            
#>  [39] irlba_2.3.5.1             maketools_1.3.1          
#>  [41] RSpectra_0.16-2           svglite_2.1.3            
#>  [43] DelayedMatrixStats_1.29.0 codetools_0.2-20         
#>  [45] xml2_1.3.6                tidyselect_1.2.1         
#>  [47] farver_2.1.2              UCSC.utils_1.3.0         
#>  [49] ScaledMatrix_1.15.0       viridis_0.6.5            
#>  [51] BiocFileCache_2.15.0      GenomicAlignments_1.43.0 
#>  [53] jsonlite_1.8.9            BiocNeighbors_2.1.2      
#>  [55] survival_3.8-3            systemfonts_1.1.0        
#>  [57] segmented_2.1-3           tools_4.4.2              
#>  [59] progress_1.2.3            rARPACK_0.11-0           
#>  [61] Rcpp_1.0.13-1             glue_1.8.0               
#>  [63] gridExtra_2.3             xfun_0.49                
#>  [65] dplyr_1.1.4               withr_3.0.2              
#>  [67] BiocManager_1.30.25       fastmap_1.2.0            
#>  [69] rhdf5filters_1.19.0       latticeExtra_0.6-30      
#>  [71] boot_1.3-31               caTools_1.18.3           
#>  [73] digest_0.6.37             rsvd_1.0.5               
#>  [75] mime_0.12                 R6_2.5.1                 
#>  [77] colorspace_2.1-1          gtools_3.9.5             
#>  [79] jpeg_0.1-10               biomaRt_2.63.0           
#>  [81] RSQLite_2.3.9             diptest_0.77-1           
#>  [83] R.methodsS3_1.8.2         tidyr_1.3.1              
#>  [85] hexbin_1.28.5             data.table_1.16.4        
#>  [87] rtracklayer_1.67.0        robustbase_0.99-4-1      
#>  [89] class_7.3-22              htmlwidgets_1.6.4        
#>  [91] prettyunits_1.2.0         httr_1.4.7               
#>  [93] pkgconfig_2.0.3           gtable_0.3.6             
#>  [95] modeltools_0.2-23         blob_1.2.4               
#>  [97] hwriter_1.3.2.1           XVector_0.47.1           
#>  [99] sys_3.4.3                 htmltools_0.5.8.1        
#> [101] kableExtra_1.4.0          scales_1.3.0             
#> [103] png_0.1-8                 rstudioapi_0.17.1        
#> [105] knitr_1.49                rjson_0.2.23             
#> [107] nlme_3.1-166              checkmate_2.3.2          
#> [109] curl_6.0.1                cachem_1.1.0             
#> [111] stringr_1.5.1             BiocVersion_3.21.1       
#> [113] KernSmooth_2.23-24        parallel_4.4.2           
#> [115] vipor_0.4.7               RUVSeq_1.41.0            
#> [117] AnnotationDbi_1.69.0      restfulr_0.0.15          
#> [119] pillar_1.10.0             grid_4.4.2               
#> [121] vctrs_0.6.5               gplots_3.2.0             
#> [123] BiocSingular_1.23.0       dbplyr_2.5.0             
#> [125] beachmat_2.23.5           beeswarm_0.4.0           
#> [127] evaluate_1.0.1            GenomicFeatures_1.59.1   
#> [129] cli_3.6.3                 locfit_1.5-9.10          
#> [131] compiler_4.4.2            Rsamtools_2.23.1         
#> [133] rlang_1.1.4               crayon_1.5.3             
#> [135] labeling_0.4.3            mclust_6.1.1             
#> [137] interp_1.1-6              aroma.light_3.37.0       
#> [139] ggbeeswarm_0.7.2          stringi_1.8.4            
#> [141] viridisLite_0.4.2         deldir_2.0-4             
#> [143] BiocParallel_1.41.0       munsell_0.5.1            
#> [145] Biostrings_2.75.3         lazyeval_0.2.2           
#> [147] EDASeq_2.41.0             compositions_2.0-8       
#> [149] ExperimentHub_2.15.0      hms_1.1.3                
#> [151] sparseMatrixStats_1.19.0  bit64_4.5.2              
#> [153] Rhdf5lib_1.29.0           KEGGREST_1.47.0          
#> [155] fpc_2.2-13                statmod_1.5.0            
#> [157] AnnotationHub_3.15.0      kernlab_0.9-33           
#> [159] memoise_2.0.1             bslib_0.8.0              
#> [161] DEoptimR_1.1-3-1          bit_4.5.0.1