Dimensionality reduction and batch effect removal using NewWave

Installation

First of all we need to install NewWave:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("NewWave")

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(splatter)
  library(irlba)
  library(Rtsne)
  library(ggplot2)
  library(mclust)
  library(NewWave)
})

Introduction

NewWave is a package that assumes a negative binomial distribution for the counts and performs dimensionality reduction and batch effect removal. To reduce memory consumption it uses a PSOCK cluster combined with the R package SharedObject, which allows a matrix to be shared between cores without duplicating it in memory. This makes it possible to parallelize the estimation process heavily, with a large reduction in running time. The running time can be reduced further by using mini-batch approaches in the different steps of the optimization.
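As a rough illustration of why a negative binomial model is a natural choice for count data, the sketch below simulates counts with base R and compares their variance with what a Poisson model would predict. It is a toy example, not NewWave code; the values of mu and size are arbitrary, and size is the parameter name used by R's rnbinom (smaller values mean more overdispersion).

```r
# Toy illustration (not NewWave code): negative binomial counts are overdispersed.
set.seed(1)
mu   <- 5   # mean count (illustrative value)
size <- 2   # R's 'size' parameter; smaller values give more overdispersion

counts <- rnbinom(1e5, mu = mu, size = size)

empirical_var   <- var(counts)
theoretical_var <- mu + mu^2 / size   # negative binomial variance
poisson_var     <- mu                 # a Poisson model forces variance = mean

round(c(empirical   = empirical_var,
        theoretical = theoretical_var,
        poisson     = poisson_var), 2)
```

The empirical variance should sit close to mu + mu^2 / size and well above the Poisson value, which is the extra variability the dispersion parameter is meant to capture.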

I am going to show how to use NewWave with example data generated with Splatter.

params <- newSplatParams()
N=500
set.seed(1234)
data <- splatSimulateGroups(params,batchCells=c(N/2,N/2),
                           group.prob = rep(0.1,10),
                           de.prob = 0.2,
                           verbose = FALSE) 

The dataset now has 500 cells and 10000 genes; I will use only the 500 most variable genes. NewWave takes raw counts as input, not normalized data.

set.seed(12359)
hvg <- rowVars(counts(data))
names(hvg) <- rownames(counts(data))
data <- data[names(sort(hvg,decreasing=TRUE))[1:500],]

As you can see, there is a variable called Batch in the colData section.

colData(data)
#> DataFrame with 500 rows and 4 columns
#>                Cell       Batch    Group ExpLibSize
#>         <character> <character> <factor>  <numeric>
#> Cell1         Cell1      Batch1   Group3    37164.6
#> Cell2         Cell2      Batch1   Group3    57658.3
#> Cell3         Cell3      Batch1   Group2    74527.9
#> Cell4         Cell4      Batch1   Group1    63502.5
#> Cell5         Cell5      Batch1   Group2    43093.8
#> ...             ...         ...      ...        ...
#> Cell496     Cell496      Batch2   Group2    45875.1
#> Cell497     Cell497      Batch2   Group4    69811.6
#> Cell498     Cell498      Batch2   Group8    53963.6
#> Cell499     Cell499      Batch2   Group5    64406.7
#> Cell500     Cell500      Batch2   Group5    60056.1

IMPORTANT: For batch effect removal the batch variable must be a factor.

data$Batch <- as.factor(data$Batch)
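To see why the type of a batch variable matters in general, the sketch below builds two design matrices with base R's model.matrix. It uses a hypothetical data frame with batches coded as numbers, which is a different situation from the character Batch column above, and it only illustrates how R treats factors in a formula; it does not describe NewWave's internals.

```r
# Hypothetical cell-level annotation with three batches coded as numbers.
cells <- data.frame(batch = c(1, 1, 2, 2, 3, 3))

# Numeric coding: a single slope column, which treats batch as a quantity.
design_numeric <- model.matrix(~ batch, data = cells)

# Factor coding: an intercept plus one indicator column per non-reference batch.
cells$batch <- as.factor(cells$batch)
design_factor <- model.matrix(~ batch, data = cells)

colnames(design_numeric)  # "(Intercept)" "batch"
colnames(design_factor)   # "(Intercept)" "batch2" "batch3"
```

With the factor coding each batch gets its own coefficient, which is what a batch correction needs.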

We also have a variable called Group that represents the cell type labels.

We can see how the cells are distributed across groups and batches:

pca <- prcomp_irlba(t(counts(data)),n=10)
plot_data <-data.frame(Rtsne(pca$x)$Y)
plot_data$batch <- data$Batch
plot_data$group <- data$Group
ggplot(plot_data, aes(x=X1,y=X2,col=group, shape=batch))+ geom_point()

There is a clear batch effect between the cells.

Let’s try to correct it.

NewWave

I am going to show different implementations and the suggested way to use them depending on the available hardware.

Some advice:

  • The verbose option defaults to FALSE. In this vignette I set it to TRUE for explanatory purposes; do not do this with big datasets, because it can noticeably slow down the computation.
  • There are no strict requirements on the size of the mini-batches; I have always used 10% of the observations.

Standard usage

This is how to include the batch variable. Other cell-related variables can be included in X in the same way, and gene-related variables can be included through V.

res <- newWave(data,X = "~Batch", K=10, verbose = TRUE)
#> Time of setup
#>    user  system elapsed 
#>   0.007   0.000   0.281 
#> Time of initialization
#>    user  system elapsed 
#>   0.034   0.001   0.403
#> Iteration 1
#> penalized log-likelihood = -1292992.40542381
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.517   0.144   0.480
#> after optimize dispersion = -1056334.04416405
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.000   4.296
#> after right optimization= -1055590.7756004
#> after orthogonalization = -1055590.7364178
#> Time of left optimization
#>    user  system elapsed 
#>   0.023   0.151   4.046
#> after left optimization= -1055297.95497434
#> after orthogonalization = -1055297.95336495
#> Iteration 2
#> penalized log-likelihood = -1055297.95336495
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.571   0.295   0.534
#> after optimize dispersion = -1055291.72044053
#> Time of right optimization
#>    user  system elapsed 
#>   0.000   0.001   3.934
#> after right optimization= -1055259.41891328
#> after orthogonalization = -1055259.41804494
#> Time of left optimization
#>    user  system elapsed 
#>   0.038   0.144   3.299
#> after left optimization= -1055246.74325827
#> after orthogonalization = -1055246.7431929

To make it faster you can increase the number of cores with the children parameter:

res2 <- newWave(data,X = "~Batch", K=10, verbose = TRUE, children=2)
#> Time of setup
#>    user  system elapsed 
#>   0.008   0.000   0.302 
#> Time of initialization
#>    user  system elapsed 
#>   0.030   0.011   0.403
#> Iteration 1
#> penalized log-likelihood = -1292992.40541933
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.478   0.045   0.475
#> after optimize dispersion = -1056334.04629773
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.001   2.221
#> after right optimization= -1055590.77565548
#> after orthogonalization = -1055590.73645028
#> Time of left optimization
#>    user  system elapsed 
#>   0.021   0.129   2.056
#> after left optimization= -1055297.94951722
#> after orthogonalization = -1055297.94789765
#> Iteration 2
#> penalized log-likelihood = -1055297.94789765
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.569   0.301   0.534
#> after optimize dispersion = -1055291.7145824
#> Time of right optimization
#>    user  system elapsed 
#>   0.000   0.002   2.067
#> after right optimization= -1055259.39179845
#> after orthogonalization = -1055259.39093587
#> Time of left optimization
#>    user  system elapsed 
#>   0.024   0.125   1.717
#> after left optimization= -1055246.71697819
#> after orthogonalization = -1055246.71692541
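A reasonable value for children depends on the machine. The sketch below is one possible way to derive it with the base parallel package, leaving one core free for the main R process; the variable name n_children is my own, and the "minus one" rule is a common convention rather than a NewWave recommendation.

```r
# Pick a worker count from the machine's core count, leaving one core free.
# detectCores() can return NA on some platforms, so fall back to 1.
available  <- parallel::detectCores()
n_children <- if (is.na(available)) 1L else max(1L, available - 1L)
n_children

# Hypothetical usage with the call shown above (not run here):
# res2 <- newWave(data, X = "~Batch", K = 10, children = n_children)
```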

Common dispersion and mini-batch approaches

If you do not have a high number of cores available to run newWave, this is the fastest configuration. The optimization is carried out by three steps iterated until convergence:

  • Optimization of the dispersion parameters
  • Optimization of the gene related parameters
  • Optimization of the cell related parameters
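The idea of cycling through blocks of parameters until the objective stops improving can be sketched with a much simpler problem. The toy below fits a rank-1 least-squares approximation by alternating between a gene-side and a cell-side update; it has two blocks instead of three, it is not the NewWave objective, and the names stop_epsilon and max_iter are chosen here only for illustration.

```r
# Toy alternating optimization: rank-1 least-squares fit Y ~ u %*% t(v).
# This is NOT the NewWave objective; it only illustrates the loop structure.
set.seed(42)
n_genes <- 30
n_cells <- 20
Y <- outer(rnorm(n_genes), rnorm(n_cells)) +
  matrix(rnorm(n_genes * n_cells, sd = 0.1), n_genes, n_cells)

loss <- function(u, v) sum((Y - outer(u, v))^2)

u <- rep(1, n_genes)     # gene-side parameters
v <- rep(1, n_cells)     # cell-side parameters
stop_epsilon <- 1e-6     # relative improvement threshold (illustrative)
max_iter <- 100
history <- loss(u, v)

for (iter in seq_len(max_iter)) {
  u <- as.vector(Y %*% v) / sum(v^2)      # update gene side, cells fixed
  v <- as.vector(t(Y) %*% u) / sum(u^2)   # update cell side, genes fixed
  history <- c(history, loss(u, v))
  improvement <- (history[iter] - history[iter + 1]) / history[iter]
  if (improvement < stop_epsilon) break   # converged: stop iterating
}

length(history) - 1   # number of iterations actually run
```

Each update exactly minimizes the loss over one block while the other is held fixed, so the loss never increases from one iteration to the next.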

Each of these three steps can be accelerated using mini-batches; the number of observations used is set with these parameters:

  • n_gene_disp : Number of genes to use in the dispersion optimization
  • n_cell_par : Number of cells to use in the cells related parameters optimization
  • n_gene_par : Number of genes to use in the genes related parameters optimization

res3 <- newWave(data,X = "~Batch", verbose = TRUE,K=10, children=2,
                n_gene_disp = 100, n_gene_par = 100, n_cell_par = 100)
#> Time of setup
#>    user  system elapsed 
#>   0.007   0.000   0.303 
#> Time of initialization
#>    user  system elapsed 
#>   0.010   0.026   0.344
#> Iteration 1
#> penalized log-likelihood = -1292992.40559705
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.501   0.037   0.487
#> after optimize dispersion = -1056334.04179154
#> Time of right optimization
#>    user  system elapsed 
#>   0.000   0.000   2.213
#> after right optimization= -1055590.772186
#> after orthogonalization = -1055590.7329797
#> Time of left optimization
#>    user  system elapsed 
#>   0.000   0.013   2.099
#> after left optimization= -1055297.94719946
#> after orthogonalization = -1055297.9455802
#> Iteration 2
#> penalized log-likelihood = -1055297.9455802
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.253   0.282   0.203
#> after optimize dispersion = -1055297.9455802
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.001   0.404
#> after right optimization= -1055292.52487484
#> after orthogonalization = -1055292.52430052
#> Time of left optimization
#>    user  system elapsed 
#>   0.025   0.126   0.246
#> after left optimization= -1055292.34586096
#> after orthogonalization = -1055292.34583198
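The advice given earlier, to use about 10% of the observations, can be turned into numbers with a small helper. The function minibatch_size below is hypothetical (it is not part of NewWave), and the dimensions are those of the filtered dataset used in this vignette.

```r
# Hypothetical helper: mini-batch size as a fraction of the observations,
# never smaller than 1 and never larger than the full dimension.
minibatch_size <- function(n, fraction = 0.1) {
  min(n, max(1L, as.integer(ceiling(n * fraction))))
}

n_genes <- 500   # genes kept after filtering in this vignette
n_cells <- 500   # cells simulated in this vignette

sizes <- c(n_gene_disp = minibatch_size(n_genes),
           n_gene_par  = minibatch_size(n_genes),
           n_cell_par  = minibatch_size(n_cells))
sizes
```

The resulting values could then be passed to n_gene_disp, n_gene_par and n_cell_par in a call like the one above.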

Genewise dispersion mini-batch

If you have a lot of cores available, or you want to estimate a genewise dispersion parameter, this is the fastest configuration:

res3 <- newWave(data,X = "~Batch", verbose = TRUE,K=10, children=2,
                n_gene_par = 100, n_cell_par = 100, commondispersion = FALSE)
#> Time of setup
#>    user  system elapsed 
#>   0.004   0.004   0.302 
#> Time of initialization
#>    user  system elapsed 
#>   0.013   0.021   0.400
#> Iteration 1
#> penalized log-likelihood = -1292992.405343
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.482   0.047   0.473
#> after optimize dispersion = -1056334.05061668
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.000   2.202
#> after right optimization= -1055590.78044977
#> after orthogonalization = -1055590.7412638
#> Time of left optimization
#>    user  system elapsed 
#>   0.018   0.133   2.059
#> after left optimization= -1055297.95762802
#> after orthogonalization = -1055297.95602474
#> Iteration 2
#> penalized log-likelihood = -1055297.95602474
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.096   0.260   0.409
#> after optimize dispersion = -1051617.53170645
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.000   0.399
#> after right optimization= -1051611.39760865
#> after orthogonalization = -1051611.39673644
#> Time of left optimization
#>    user  system elapsed 
#>   0.037   0.115   0.467
#> after left optimization= -1051577.14399211
#> after orthogonalization = -1051577.1429824
#> Iteration 3
#> penalized log-likelihood = -1051577.1429824
#> Time of dispersion optimization
#>    user  system elapsed 
#>   0.111   0.243   0.208
#> after optimize dispersion = -1051577.16177987
#> Time of right optimization
#>    user  system elapsed 
#>   0.001   0.000   0.409
#> after right optimization= -1051571.88204217
#> after orthogonalization = -1051571.88134877
#> Time of left optimization
#>    user  system elapsed 
#>   0.012   0.138   0.476
#> after left optimization= -1051548.69903876
#> after orthogonalization = -1051548.69843924

NB: do not use n_gene_disp in this case; it will slow down the computation.

Now I can use the latent representation for visualization:
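The difference between a common and a genewise dispersion can be illustrated on simulated counts. The sketch below uses a simple method-of-moments estimator, which I chose because it fits in a few lines; NewWave estimates dispersion by optimizing a penalized likelihood, so this is only a toy. The gene names, the helper mom_size and the parameter values are all hypothetical.

```r
# Toy data: three genes with the same mean but different NB 'size' values.
set.seed(7)
n_cells   <- 5000
true_size <- c(geneA = 0.5, geneB = 2, geneC = 10)
counts <- sapply(true_size, function(s) rnbinom(n_cells, mu = 10, size = s))

# Method-of-moments estimate of 'size' from var = mu + mu^2 / size.
mom_size <- function(x) {
  m <- mean(x)
  v <- var(x)
  if (v <= m) return(Inf)   # no overdispersion detected
  m^2 / (v - m)
}

genewise <- apply(counts, 2, mom_size)    # one estimate per gene
common   <- mom_size(as.vector(counts))   # a single value shared by all genes

round(genewise, 2)
round(common, 2)
```

A single common value is a compromise between genes that differ in their variability, while genewise estimates follow each gene separately at the cost of more parameters to fit.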

latent <- reducedDim(res)

tsne_latent <- data.frame(Rtsne(latent)$Y)
tsne_latent$batch <- data$Batch
tsne_latent$group <- data$Group
ggplot(tsne_latent, aes(x=X1,y=X2,col=group, shape=batch))+ geom_point()

or for clustering:

cluster <- kmeans(latent, 10)

adjustedRandIndex(cluster$cluster, data$Group)
#> [1] 0.6808652
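Because kmeans starts from random centers, the adjusted Rand index can change from run to run. The sketch below shows one way to make such a check reproducible, by fixing the seed and using several random starts. It runs on a toy two-dimensional matrix rather than on the NewWave output, and the helper ari is a hypothetical base R re-implementation of the adjusted Rand index, written so that the example does not depend on mclust.

```r
# Hypothetical helper: adjusted Rand index computed from a contingency table.
ari <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(n, 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Toy latent space: three well-separated groups in two dimensions.
set.seed(123)
truth <- rep(1:3, each = 50)
toy_latent <- cbind(rnorm(150, mean = 4 * truth),
                    rnorm(150, mean = -4 * truth))

# Fix the seed and use several random starts so the result is reproducible.
set.seed(123)
fit <- kmeans(toy_latent, centers = 3, nstart = 25)
ari(fit$cluster, truth)
```

The index does not depend on how the clusters are numbered, so the labels returned by kmeans can be compared directly with the true groups.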

Session Information

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] NewWave_1.17.0              mclust_6.1.1               
#>  [3] ggplot2_3.5.1               Rtsne_0.17                 
#>  [5] irlba_2.3.5.1               Matrix_1.7-1               
#>  [7] splatter_1.31.0             SingleCellExperiment_1.29.1
#>  [9] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#> [11] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#> [13] IRanges_2.41.2              S4Vectors_0.45.2           
#> [15] BiocGenerics_0.53.3         generics_0.1.3             
#> [17] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [19] rmarkdown_2.29             
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            xfun_0.50               bslib_0.8.0            
#>  [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.2            
#>  [7] parallel_4.4.2          tibble_3.2.1            pkgconfig_2.0.3        
#> [10] SharedObject_1.21.0     checkmate_2.3.2         lifecycle_1.0.4        
#> [13] GenomeInfoDbData_1.2.13 farver_2.1.2            compiler_4.4.2         
#> [16] munsell_0.5.1           codetools_0.2-20        htmltools_0.5.8.1      
#> [19] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
#> [22] yaml_2.3.10             pillar_1.10.1           crayon_1.5.3           
#> [25] jquerylib_0.1.4         BiocParallel_1.41.0     DelayedArray_0.33.3    
#> [28] cachem_1.1.0            abind_1.4-8             rsvd_1.0.5             
#> [31] locfit_1.5-9.10         digest_0.6.37           BiocSingular_1.23.0    
#> [34] labeling_0.4.3          maketools_1.3.1         fastmap_1.2.0          
#> [37] grid_4.4.2              colorspace_2.1-1        cli_3.6.3              
#> [40] SparseArray_1.7.3       magrittr_2.0.3          S4Arrays_1.7.1         
#> [43] withr_3.0.2             UCSC.utils_1.3.1        scales_1.3.0           
#> [46] backports_1.5.0         XVector_0.47.2          httr_1.4.7             
#> [49] beachmat_2.23.6         ScaledMatrix_1.15.0     evaluate_1.0.3         
#> [52] knitr_1.49              rlang_1.1.4             Rcpp_1.0.14            
#> [55] glue_1.8.0              jsonlite_1.8.9          R6_2.5.1