Single-cell RNA sequencing (scRNA-seq) is a widely used technique for profiling gene expression in individual cells. This allows molecular biology to be studied at a resolution that cannot be matched by bulk sequencing of cell populations. The scran package implements methods to perform low-level processing of scRNA-seq data, including cell cycle phase assignment, variance modelling and testing for marker genes and gene-gene correlations. This vignette provides brief descriptions of these methods and some toy examples to demonstrate their use.
Note: A more comprehensive description of the use of scran (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
We start off with a count matrix where each row is a gene and each column is a cell. These can be obtained by mapping read sequences to a reference genome, and then counting the number of reads mapped to the exons of each gene. (See, for example, the Rsubread package to do both of these tasks.) Alternatively, pseudo-alignment methods can be used to quantify the abundance of each transcript in each cell. For simplicity, we will pull out an existing dataset from the scRNAseq package.
## class: SingleCellExperiment
## dim: 20064 1728
## metadata(0):
## assays(1): counts
## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
## ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): donor sample
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC
This particular dataset is taken from a study of the human pancreas
with the CEL-seq protocol (Grun et al.
2016). It is provided as a SingleCellExperiment
object (from the SingleCellExperiment
package), which contains the raw data and various annotations. We
perform some cursory quality control to remove cells with low total
counts or high spike-in percentages:
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, percent_subsets="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
summary(qcfilter$discard)
## Mode FALSE TRUE
## logical 1291 437
Cell-specific biases are normalized using the
computeSumFactors
method, which implements the
deconvolution strategy for scaling normalization (Lun, Bach, and Marioni 2016).
library(scran)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.006722 0.442629 0.801312 1.000000 1.295809 9.227575
We identify genes that drive biological heterogeneity in the data set by modelling the per-gene variance. By only using a subset of highly variable genes in downstream analyses like clustering, we improve resolution of biological structure by removing uninteresting genes driven by technical noise. We decompose the total variance of each gene into its biological and technical components by fitting a trend to the endogenous variances (Lun, McCarthy, and Marioni 2016). The fitted value of the trend is used as an estimate of the technical component, and we subtract the fitted value from the total variance to obtain the biological component for each gene.
dec <- modelGeneVar(sce)
plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance")
curve(metadata(dec)$trend(x), col="blue", add=TRUE)
If we have spike-ins, we can use them to fit the trend instead. This provides a more direct estimate of the technical variance and avoids assuming that most genes do not exhibit biological variaility.
dec2 <- modelGeneVarWithSpikes(sce, 'ERCC')
plot(dec2$mean, dec2$total, xlab="Mean log-expression", ylab="Variance")
points(metadata(dec2)$mean, metadata(dec2)$var, col="red")
curve(metadata(dec2)$trend(x), col="blue", add=TRUE)
If we have some uninteresting factors of variation, we can block on
these using block=
. This will perform the trend fitting and
decomposition within each block before combining the statistics across
blocks for output. Statistics for each individual block can also be
extracted for further inspection.
# Turned off weighting to avoid overfitting for each donor.
dec3 <- modelGeneVar(sce, block=sce$donor, density.weights=FALSE)
per.block <- dec3$per.block
par(mfrow=c(3, 2))
for (i in seq_along(per.block)) {
decX <- per.block[[i]]
plot(decX$mean, decX$total, xlab="Mean log-expression",
ylab="Variance", main=names(per.block)[i])
curve(metadata(decX)$trend(x), col="blue", add=TRUE)
}
We can then extract some top genes for use in downstream procedures
using the getTopHVGs()
function. A variety of different
strategies can be used to define a subset of interesting genes:
# Get the top 10% of genes.
top.hvgs <- getTopHVGs(dec, prop=0.1)
# Get the top 2000 genes.
top.hvgs2 <- getTopHVGs(dec, n=2000)
# Get all genes with positive biological components.
top.hvgs3 <- getTopHVGs(dec, var.threshold=0)
# Get all genes with FDR below 5%.
top.hvgs4 <- getTopHVGs(dec, fdr.threshold=0.05)
The selected subset of genes can then be passed to the
subset.row
argument (or equivalent) in downstream steps.
This process is demonstrated below for the PCA step:
## [1] "PCA"
Principal components analysis is commonly performed to denoise and
compact the data prior to downstream analysis. A common question is how
many PCs to retain; more PCs will capture more biological signal at the
cost of retaining more noise and requiring more computational work. One
approach to choosing the number of PCs is to use the technical component
estimates to determine the proportion of variance that should be
retained. This is implemented in denoisePCA()
, which takes
the estimates returned by modelGeneVar()
or friends. (For
greater accuracy, we use the fit with the spikes; we also subset to only
the top HVGs to remove noise.)
## [1] 50
Another approach is based on the assumption that each subpopulation should be separated from each other on a different axis of variation. Thus, we choose the number of PCs that is not less than the number of subpopulations (which are unknown, of course, so we use the number of clusters as a proxy). It is then a simple matter to subset the dimensionality reduction result to the desired number of PCs.
output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <- reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
npcs
## [1] 14
Clustering of scRNA-seq data is commonly performed with graph-based
methods due to their relative scalability and robustness. scran
provides several graph construction methods based on shared nearest
neighbors (Xu and Su 2015) through the
buildSNNGraph()
function. This is most commonly generated
from the selected PCs, after which community detection methods from the
igraph
package can be used to explicitly identify clusters.
# In this case, using the PCs that we chose from getClusteredPCs().
g <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g)$membership
# Assigning to the 'colLabels' of the 'sce'.
colLabels(sce) <- factor(cluster)
table(colLabels(sce))
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 79 285 64 170 166 164 124 32 57 63 63 24
By default, buildSNNGraph()
uses the mode of shared
neighbor weighting described by Xu and Su
(2015), but other weighting methods (e.g., the Jaccard index) are
also available by setting type=
. An unweighted k-nearest neighbor graph can also be
constructed with buildKNNGraph()
.
We can then use methods from scater to
visualize this clustering on a t-SNE plot. Note that
colLabels()<-
will just add the cluster assignments to
the "label"
field of the colData
; however, any
name can be used as long as downstream functions are adjusted
appropriately.
library(scater)
sce <- runTSNE(sce, dimred="PCAsub")
plotTSNE(sce, colour_by="label", text_by="label")
For graph-based methods, another diagnostic is to examine the ratio
of observed to expected edge weights for each pair of clusters (closely
related to the modularity score used in many cluster_*
functions). We would usually expect to see high observed weights between
cells in the same cluster with minimal weights between clusters,
indicating that the clusters are well-separated. Off-diagonal entries
indicate that some clusters are closely related, which is useful to know
for checking that they are annotated consistently.
library(bluster)
ratio <- pairwiseModularity(g, cluster, as.ratio=TRUE)
library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
col=rev(heat.colors(100)))
A more general diagnostic involves bootstrapping to determine the
stability of the partitions between clusters. Given a clustering
function, the bootstrapCluster()
function uses
bootstrapping to compute the co-assignment probability for each pair of
original clusters, i.e., the probability that one randomly chosen cell
from each cluster is assigned to the same cluster in the bootstrap
replicate . Larger probabilities indicate that the separation between
those clusters is unstable to the extent that it is sensitive to
sampling noise, and thus should not be used for downstream
inferences.
ass.prob <- bootstrapStability(sce, FUN=function(x) {
g <- buildSNNGraph(x, use.dimred="PCAsub")
igraph::cluster_walktrap(g)$membership
}, clusters=sce$cluster)
pheatmap(ass.prob, cluster_cols=FALSE, cluster_rows=FALSE,
col=colorRampPalette(c("white", "blue"))(100))
If necessary, further subclustering can be performed conveniently
using the quickSubCluster()
wrapper function. This splits
the input SingleCellExperiment
into several smaller objects
containing cells from each cluster and performs another round of
clustering within that cluster, using a freshly identified set of HVGs
to improve resolution for internal structure.
subout <- quickSubCluster(sce, groups=colLabels(sce))
table(metadata(subout)$subcluster) # formatted as '<parent>.<subcluster>'
##
## 1.1 1.2 10.1 10.2 10.3 11.1 11.2 12 2.1 2.2 2.3 2.4 3.1 3.2 3.3 4.1
## 38 41 16 19 28 29 34 24 96 35 81 73 26 19 19 30
## 4.2 4.3 4.4 5.1 5.2 6.1 6.2 6.3 7.1 7.2 7.3 8 9.1 9.2
## 62 40 38 73 93 64 65 35 35 35 54 32 33 24
The scoreMarkers()
wrapper function will perform
differential expression comparisons between pairs of clusters to
identify potential marker genes. For each pairwise comparison, we
compute a variety of effect sizes to quantify the differences between
those clusters.
For each cluster, we then summarize the effect sizes from all
comparisons to other clusters. Specifically, we compute the mean,
median, minimum and maximum effect size across all comparisons. This
yields a series of statistics for each effect size and summary type,
e.g., mean.AUC
.
# Uses clustering information from 'colLabels(sce)' by default:
markers <- scoreMarkers(sce)
markers
## List of length 12
## names(12): 1 2 3 4 5 6 7 8 9 10 11 12
## [1] "self.average" "other.average" "self.detected"
## [4] "other.detected" "mean.logFC.cohen" "min.logFC.cohen"
## [7] "median.logFC.cohen" "max.logFC.cohen" "rank.logFC.cohen"
## [10] "mean.AUC" "min.AUC" "median.AUC"
## [13] "max.AUC" "rank.AUC" "mean.logFC.detected"
## [16] "min.logFC.detected" "median.logFC.detected" "max.logFC.detected"
## [19] "rank.logFC.detected"
The idea is to use this information to sort the
DataFrame
for each cluster based on one of these metrics to
obtain a ranking of the strongest candidate markers. A good default
approach is to sort by either the mean AUC or Cohen’s d in decreasing order. This focuses
on marker genes that are upregulated in the cluster of interest compared
to most other clusters (hopefully). For example, we can get a quick
summary of the best markers for cluster 1 using the code below:
# Just showing the first few columns for brevity.
markers[[1]][order(markers[[1]]$mean.AUC, decreasing=TRUE),1:4]
## DataFrame with 20064 rows and 4 columns
## self.average other.average self.detected other.detected
## <numeric> <numeric> <numeric> <numeric>
## UGDH-AS1__chr4 7.95736 2.97892 1.000000 0.916095
## KCNQ1OT1__chr11 8.63439 3.86267 1.000000 0.977219
## PGM5P2__chr9 5.93842 1.37177 0.987342 0.629164
## MAB21L3__chr1 6.52394 2.07597 0.987342 0.808723
## FBLIM1__chr1 5.05717 1.39738 0.987342 0.700471
## ... ... ... ... ...
## SKP1__chr5 0.526165 1.91257 0.329114 0.827098
## EIF1__chr17 1.453564 3.41013 0.506329 0.943458
## RPL3__chr22 1.606360 3.67678 0.544304 0.963845
## RPS25__chr11 1.400783 3.43745 0.493671 0.962767
## GNAS__chr20 1.295445 3.99483 0.493671 0.941651
The other summaries provide additional information about the comparisons to other clusters without requiring examination of all pairwise changes. For example, a positive median AUC means that the gene is upregulated in this cluster against most (i.e., at least 50%) of the other clusters. A positive minimum AUC means that the gene is upregulated against all other clusters, while a positive maximum AUC indicates upregulation against at least one other cluster. It can be helpful to sort on the median instead, if the mean is too sensitive to outliers; or to sort on the minimum, if we wish to identify genes that are truly unique to a particular cluster.
That said, if the full set of pairwise effects is desired, we can
obtain them with full.stats=TRUE
. This yields a nested
DataFrame
containing the effects from each pairwise
comparison to another cluster.
## DataFrame with 20064 rows and 11 columns
## 2 3 4 5 6 7
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## A1BG-AS1__chr19 0.193192 0.193192 0.193192 0.1511082 0.193192 0.1931922
## A1BG__chr19 -0.383283 -0.810316 0.012951 -1.1128084 0.162714 -1.0253400
## A1CF__chr10 -0.150129 -0.553733 0.451479 -0.7289524 0.361270 0.1736118
## A2M-AS1__chr12 0.185977 -0.119582 0.277078 0.0826094 0.277078 0.0205016
## A2ML1__chr12 0.492297 0.348947 0.528591 0.5659760 0.554566 0.5659760
## ... ... ... ... ... ... ...
## ZYG11A__chr1 0.618620 0.5733540 0.539024 0.644377 0.681953 0.681953
## ZYG11B__chr1 1.535652 1.1280983 1.391627 1.756829 1.601884 1.659232
## ZYX__chr7 -0.221250 0.2823101 -0.349240 0.238468 -0.589359 0.295469
## ZZEF1__chr17 -0.509875 -0.0383699 -0.375748 -0.388188 -0.394200 -0.640098
## ZZZ3__chr1 -0.433631 -0.2503621 -0.341861 -0.361327 -0.344299 -0.522964
## 8 9 10 11 12
## <numeric> <numeric> <numeric> <numeric> <numeric>
## A1BG-AS1__chr19 0.193192 -0.0847779 0.193192 0.1931922 0.193192
## A1BG__chr19 -0.960358 0.2744821 -0.929347 -0.7301001 -1.356154
## A1CF__chr10 -0.327091 0.1862410 0.390848 -0.2785467 0.570966
## A2M-AS1__chr12 0.277078 0.2770778 -0.105477 -0.0741515 0.277078
## A2ML1__chr12 0.525356 0.4527881 0.493265 0.5659760 0.565976
## ... ... ... ... ... ...
## ZYG11A__chr1 0.681953 0.3991683 0.681953 0.616473 0.593276
## ZYG11B__chr1 1.714765 1.3265656 1.539204 1.579707 1.574035
## ZYX__chr7 0.329949 0.0333088 0.155112 0.299053 -1.385087
## ZZEF1__chr17 -0.380311 -0.1101065 -0.614079 -0.544608 -0.370839
## ZZZ3__chr1 -0.183771 -0.3009860 -0.440704 -0.433413 -0.321751
In addition, we report some basic descriptive statistics such as the mean log-expression in each cluster and the proportion of cells with detectable expression. The grand mean of these values across all other clusters is also reported. Note that, because we use a grand mean, we are unaffected by differences in the number of cells between clusters; the same applies for our effect sizes as they are computed by summarizing pairwise comparisons rather than by aggregating cells from all other clusters into a single group.
The SingleCellExperiment
object can be easily converted
into other formats using the convertTo
method. This allows
analyses to be performed using other pipelines and packages. For
example, if DE analyses were to be performed using edgeR, the
count data in sce
could be used to construct a
DGEList
.
By default, rows corresponding to spike-in transcripts are dropped
when get.spikes=FALSE
. As such, the rows of y
may not correspond directly to the rows of sce
– users
should match by row name to ensure correct cross-referencing between
objects. Normalization factors are also automatically computed from the
size factors.
The same conversion strategy roughly applies to the other supported
formats. DE analyses can be performed using DESeq2 by
converting the object to a DESeqDataSet
. Cells can be
ordered on pseudotime with monocle
by converting the object to a CellDataSet
(in this case,
normalized unlogged expression values are stored).
Further information can be obtained by examining the documentation
for each function (e.g., ?convertTo
); reading the Orchestrating Single Cell
Analysis book; or asking for help on the Bioconductor support site (please read the
posting
guide beforehand).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 bluster_1.17.0
## [3] scater_1.35.0 ggplot2_3.5.1
## [5] scRNAseq_2.20.0 BiocParallel_1.41.0
## [7] scran_1.35.0 scuttle_1.17.0
## [9] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [11] Biobase_2.67.0 GenomicRanges_1.59.1
## [13] GenomeInfoDb_1.43.1 IRanges_2.41.1
## [15] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [17] generics_0.1.3 MatrixGenerics_1.19.0
## [19] matrixStats_1.4.1 knitr_1.49
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
## [4] magrittr_2.0.3 ggbeeswarm_0.7.2 GenomicFeatures_1.59.1
## [7] gypsum_1.3.0 farver_2.1.2 rmarkdown_2.29
## [10] BiocIO_1.17.0 zlibbioc_1.52.0 vctrs_0.6.5
## [13] memoise_2.0.1 Rsamtools_2.23.0 RCurl_1.98-1.16
## [16] htmltools_0.5.8.1 S4Arrays_1.7.1 AnnotationHub_3.15.0
## [19] curl_6.0.1 BiocNeighbors_2.1.0 Rhdf5lib_1.29.0
## [22] SparseArray_1.7.2 rhdf5_2.51.0 sass_0.4.9
## [25] alabaster.base_1.7.2 bslib_0.8.0 alabaster.sce_1.7.0
## [28] httr2_1.0.6 cachem_1.1.0 buildtools_1.0.0
## [31] GenomicAlignments_1.43.0 igraph_2.1.1 lifecycle_1.0.4
## [34] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [40] digest_0.6.37 colorspace_2.1-1 AnnotationDbi_1.69.0
## [43] dqrng_0.4.1 irlba_2.3.5.1 ExperimentHub_2.15.0
## [46] RSQLite_2.3.8 beachmat_2.23.1 labeling_0.4.3
## [49] filelock_1.0.3 fansi_1.0.6 httr_1.4.7
## [52] abind_1.4-8 compiler_4.4.2 withr_3.0.2
## [55] bit64_4.5.2 viridis_0.6.5 DBI_1.2.3
## [58] HDF5Array_1.35.1 alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [61] rappdirs_0.3.3 DelayedArray_0.33.2 rjson_0.2.23
## [64] tools_4.4.2 vipor_0.4.7 beeswarm_0.4.0
## [67] glue_1.8.0 restfulr_0.0.15 rhdf5filters_1.19.0
## [70] grid_4.4.2 Rtsne_0.17 cluster_2.1.6
## [73] gtable_0.3.6 ensembldb_2.31.0 BiocSingular_1.23.0
## [76] ScaledMatrix_1.15.0 metapod_1.15.0 utf8_1.2.4
## [79] XVector_0.47.0 ggrepel_0.9.6 BiocVersion_3.21.1
## [82] pillar_1.9.0 limma_3.63.2 dplyr_1.1.4
## [85] BiocFileCache_2.15.0 lattice_0.22-6 rtracklayer_1.67.0
## [88] bit_4.5.0 tidyselect_1.2.1 locfit_1.5-9.10
## [91] maketools_1.3.1 Biostrings_2.75.1 gridExtra_2.3
## [94] ProtGenerics_1.39.0 edgeR_4.5.0 xfun_0.49
## [97] statmod_1.5.0 UCSC.utils_1.3.0 lazyeval_0.2.2
## [100] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
## [103] tibble_3.2.1 alabaster.matrix_1.7.2 BiocManager_1.30.25
## [106] cli_3.6.3 munsell_0.5.1 jquerylib_0.1.4
## [109] Rcpp_1.0.13-1 dbplyr_2.5.0 png_0.1-8
## [112] XML_3.99-0.17 parallel_4.4.2 blob_1.2.4
## [115] AnnotationFilter_1.31.0 sparseMatrixStats_1.19.0 bitops_1.0-9
## [118] viridisLite_0.4.2 alabaster.se_1.7.0 scales_1.3.0
## [121] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0