Single-cell RNA sequencing (scRNA-seq) is a prevalent technique for interrogating tissue characteristics and heterogeneity in health and disease. Identifying which gene sets (pathways) are enriched in single cells helps to unravel the different cell subpopulations that exist in the interrogated tissue and to elucidate their biological and functional underpinnings. Several methods have been developed for this purpose, the most prominent of which is AUCell. However, some of these methods, including AUCell, test for enrichment using gene rankings and do not use data from other cells when calculating pathway scores for a specific cell. While AUCell and other methods have produced insightful results in prior research, we found that some important findings might be missed when using them. We therefore developed SiPSiC to allow the dissection of tissue heterogeneity and to unravel the function and biological traits of cell subpopulations. By using gene counts and the transcriptomes of other cells in the data when calculating pathway scores for an individual cell, SiPSiC can unveil subpopulation characteristics that are sometimes missed by other methods. SiPSiC is available from Bioconductor.
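Install SiPSiC from Bioconductor by running BiocManager::install("SiPSiC") in an R session. The following example then loads the package and calculates pathway scores for a small simulated count matrix: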
library(SiPSiC)
geneCountsMatrix <- matrix(rpois(16, lambda = 10), ncol = 4, nrow = 4)
geneCountsMatrix <- as(geneCountsMatrix, "dgCMatrix")
## Make sure your matrix is indeed a sparse matrix (of type dgCMatrix)!
rownames(geneCountsMatrix) <- c("Gene1", "Gene2", "Gene3", "Gene4")
colnames(geneCountsMatrix) <- c("Cell1", "Cell2", "Cell3", "Cell4")
assayData <- SingleCellExperiment(assays = list(counts = geneCountsMatrix))
pathwayGenesList <- c("Gene1", "Gene2", "Gene4")
## The third parameter, percentForNormalization, is optional; if not specified, its value is set to 5.
scoresAndIndices <- getPathwayScores(counts(assayData), pathwayGenesList)
pathwayScoresOfCells <- scoresAndIndices$pathwayScores
pathwayGeneIndices <- scoresAndIndices$index
Given an scRNA-seq data matrix and the list of genes that make up the pathway of interest, SiPSiC calculates the scores of all the cells in the data in five steps:
1. Pick only the genes which belong to the pathway.
2. For each gene separately: if none of the cells transcribe the gene, keep the values as they are (all zeros). Otherwise, calculate the median of the top X% expressing cells (X is specified by the percentForNormalization parameter and is 5 by default) and divide all values by this median. If the median is zero, however, the values are divided by the maximum value across all cells instead. The reason behind this step is that scRNA-seq data are normally sparse, that is, the fraction of zeros in the data is large. By taking the median of the top 5% of cells, the value is likely to be greater than zero for most genes, while it is also unlikely to be an outlier, which could perturb the subsequent processing steps.
3. Independently of step 2, rank the genes by their total counts (TPM or CPM) across all cells, then divide the ranks by the total number of genes. This normalization ensures that all the ranks remain within the range (0,1] regardless of the total number of genes.
4. Multiply the results of each gene from step 2 by its normalized rank from step 3.
5. Set each cell’s pathway score to the average of its values across all genes, as provided by step 4. Note that the higher the total number of counts for a gene, the more it affects the pathway scores of all the cells in the data. We find this reasonable, as the transcription of genes with higher total counts is likely to differ to a greater extent between cells, allowing us to reveal biological differences more accurately.
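The five steps above can be sketched in base R as follows. This is only an illustrative re-implementation under stated assumptions, not the code of getPathwayScores: the function name sketchPathwayScores is hypothetical, the number of top cells is assumed to be the top X% rounded up (at least one cell), genes are assumed to be ranked among the pathway genes only, and ties are assumed to receive average ranks. The package itself may resolve these details differently.

```r
# Hypothetical helper illustrating the five steps; NOT the package's implementation.
sketchPathwayScores <- function(countsMatrix, pathwayGenes,
                                percentForNormalization = 5) {
  # Work on a dense numeric matrix (for a dgCMatrix, load the Matrix package first).
  countsMatrix <- as.matrix(countsMatrix)
  storage.mode(countsMatrix) <- "double"

  # Step 1: keep only the genes that belong to the pathway.
  inPathway <- rownames(countsMatrix) %in% pathwayGenes
  pathwayCounts <- countsMatrix[inPathway, , drop = FALSE]

  # Assumption: the top X% of cells is rounded up and is never fewer than one cell.
  nTopCells <- max(1, ceiling(ncol(pathwayCounts) * percentForNormalization / 100))

  # Step 2: divide each gene by the median of its top-expressing cells,
  # falling back to the maximum when that median is zero.
  normalizeGene <- function(geneValues) {
    if (all(geneValues == 0)) {
      return(geneValues)
    }
    topValues <- sort(geneValues, decreasing = TRUE)[seq_len(nTopCells)]
    divisor <- median(topValues)
    if (divisor == 0) {
      divisor <- max(geneValues)
    }
    geneValues / divisor
  }
  normalized <- pathwayCounts
  for (geneIndex in seq_len(nrow(pathwayCounts))) {
    normalized[geneIndex, ] <- normalizeGene(pathwayCounts[geneIndex, ])
  }

  # Step 3: rank genes by total counts and scale the ranks into (0, 1].
  # Assumption: ranking is done among the pathway genes, ties get average ranks.
  normalizedRanks <- rank(rowSums(pathwayCounts)) / nrow(pathwayCounts)

  # Step 4: weight each gene (row) by its normalized rank.
  weighted <- normalized * normalizedRanks

  # Step 5: a cell's score is the mean of its weighted values across genes.
  colMeans(weighted)
}

# Toy data: GeneB is not transcribed at all, GeneC is outside the pathway.
exampleCounts <- matrix(
  c(2, 4, 0, 6,
    0, 0, 0, 0,
    1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("Cell", 1:4))
)
exampleScores <- sketchPathwayScores(exampleCounts, c("GeneA", "GeneB"),
                                     percentForNormalization = 50)
print(exampleScores)
# Cell1 Cell2 Cell3 Cell4
#   0.2   0.4   0.0   0.6
```

In this toy example the top 50% of cells for GeneA are the two cells with counts 6 and 4, so the divisor is their median, 5. GeneA has the higher total count and therefore the normalized rank 1, while the all-zero GeneB gets 0.5 and contributes nothing to the scores.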
The following is the output of ‘sessionInfo()’ on the system on which the package was built:
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SiPSiC_1.7.0 SingleCellExperiment_1.29.1
#> [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.0 GenomeInfoDb_1.43.1
#> [7] IRanges_2.41.1 S4Vectors_0.45.2
#> [9] BiocGenerics_0.53.3 generics_0.1.3
#> [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [13] Matrix_1.7-1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_1.8.9 crayon_1.5.3 compiler_4.4.2
#> [4] BiocManager_1.30.25 jquerylib_0.1.4 yaml_2.3.10
#> [7] fastmap_1.2.0 lattice_0.22-6 R6_2.5.1
#> [10] XVector_0.47.0 S4Arrays_1.7.1 knitr_1.49
#> [13] DelayedArray_0.33.2 maketools_1.3.1 GenomeInfoDbData_1.2.13
#> [16] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
#> [19] xfun_0.49 sass_0.4.9 sys_3.4.3
#> [22] SparseArray_1.7.2 cli_3.6.3 zlibbioc_1.52.0
#> [25] digest_0.6.37 grid_4.4.2 lifecycle_1.0.4
#> [28] evaluate_1.0.1 buildtools_1.0.0 abind_1.4-8
#> [31] rmarkdown_2.29 httr_1.4.7 tools_4.4.2
#> [34] htmltools_0.5.8.1 UCSC.utils_1.3.0