scrapper implements bindings to C++ code for analyzing single-cell data, mostly from the libscran libraries. Each function performs an individual analysis step ranging from quality control to clustering and marker detection. scrapper is mostly intended for other Bioconductor package developers to build more user-friendly end-to-end workflows.
Let’s fetch a small single-cell RNA-seq dataset for demonstration purposes:
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
We run it through the scrapper analysis pipeline:
# Finding the mitochondrial genes for QC purposes.
is.mito <- grepl("^mt-", rownames(sce.z))
# We can set the number of threads higher in real applications, but
# we want to avoid stressing out Bioconductor's build system.
nthreads <- 2
library(scrapper)
res <- analyze(
sce.z,
rna.subsets=list(mito=is.mito),
num.threads=nthreads
)
Now we can have a look at some of the results. For example, we can make a t-SNE plot:
## [1] 289 257 199 392 192 321 205 177 152 318 124 34 77 137
We can have a look at the markers defining each cluster:
# Top markers for each cluster, say, based on the median AUC:
top.markers <- lapply(res$rna.markers$auc, function(x) {
head(rownames(x)[order(x$median, decreasing=TRUE)], 10)
})
head(top.markers)
## $`1`
## [1] "Gad1" "Gad2" "Ndrg4" "Stmn3" "Tspyl4" "Scg5" "Syp" "Snap25"
## [9] "Nap1l5" "Rab3a"
##
## $`2`
## [1] "Plp1" "Mbp" "Cnp" "Mobp" "Mog" "Sept4" "Ugt8a" "Trf"
## [9] "Gsn" "Cldn11"
##
## $`3`
## [1] "Stmn3" "Calm1" "Scg5" "Chn1" "Snap25" "Pcsk2" "Calm2" "Mdh1"
## [9] "Nme1" "Stmn2"
##
## $`4`
## [1] "Chn1" "Stmn3" "Hpca" "Calm2"
## [5] "Neurod6" "3110035E14Rik" "Crym" "Calm1"
## [9] "Ppp3r1" "Ppp3ca"
##
## $`5`
## [1] "Ywhah" "Syp" "Snap25" "Ndrg4" "Vsnl1" "Rab3a" "Chn1" "Stmn3"
## [9] "Uchl1" "Ppp3r1"
##
## $`6`
## [1] "Atp1b1" "Ppp3ca" "Nsg2" "Tspan13"
## [5] "Syp" "Crym" "Chn1" "3110035E14Rik"
## [9] "Thy1" "Gria1"
# Aggregating all marker statistics for the first cluster
# (delta.mean is the same as the log-fold change in this case).
df1 <- reportGroupMarkerStatistics(res$rna.markers, 1)
df1 <- df1[order(df1$auc.median, decreasing=TRUE),]
head(df1[,c("mean", "detected", "auc.median", "delta.mean.median")])
## mean detected auc.median delta.mean.median
## Gad1 4.783640 1.0000000 0.9976632 4.587905
## Gad2 4.433167 0.9965398 0.9951791 4.293656
## Ndrg4 4.395829 0.9965398 0.9897291 3.755445
## Stmn3 4.713631 0.9930796 0.9869674 4.226543
## Tspyl4 3.364068 1.0000000 0.9831016 2.447678
## Scg5 4.750306 1.0000000 0.9765381 3.095245
Finally, we can convert the results into a
SingleCellExperiment
for interoperability with other
Bioconductor packages like scater:
## class: SingleCellExperiment
## dim: 20006 2874
## metadata(0):
## assays(2): filtered normalized
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(5): means variances fitted residuals is.highly.variable
## colnames(2874): 1772071015_C02 1772071017_G12 ... 1772063068_D01
## 1772066098_A12
## colData names(5): sum detected subsets.mito sizeFactor clusters
## reducedDimNames(3): pca tsne umap
## mainExpName: rna
## altExpNames(0):
The analyze()
function just calls a series of
lower-level functions in scrapper.
First, some quality control:
counts <- assay(sce.z)
rna.qc.metrics <- computeRnaQcMetrics(counts, subsets=list(mt=is.mito), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics)
filtered <- counts[,rna.qc.filter,drop=FALSE]
Then normalization:
size.factors <- centerSizeFactors(rna.qc.metrics$sum[rna.qc.filter])
normalized <- normalizeCounts(filtered, size.factors)
Feature selection and PCA:
gene.var <- modelGeneVariances(normalized, num.threads=nthreads)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
pca <- runPca(normalized[top.hvgs,], num.threads=nthreads)
Some visualization and clustering:
umap.out <- runUmap(pca$components, num.threads=nthreads)
tsne.out <- runTsne(pca$components, num.threads=nthreads)
snn.graph <- buildSnnGraph(pca$components, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)
And finally marker detection:
Readers are referred to the OSCA book for more details on the theory behind each step.
Let’s fetch another brain dataset and combine it with the previous one.
sce.t <- TasicBrainData()
common <- intersect(rownames(sce.z), rownames(sce.t))
combined <- cbind(assay(sce.t)[common,], assay(sce.z)[common,])
block <- rep(c("tasic", "zeisel"), c(ncol(sce.t), ncol(sce.z)))
We call analyze()
with block=
to account
for the batch structure. This ensures that the various calculations are
not affected by inter-block differences. It also uses MNN correction to
batch effects in the low-dimensional space prior to further
analyses.
# No mitochondrial genes in the combined set, so we'll just skip it.
blocked_res <- analyze(combined, block=block, num.threads=nthreads)
# Visually check whether the datasets were suitably merged together.
# Note that these two datasets don't have the same cell types, so
# complete overlap should not be expected.
plot(blocked_res$tsne[,1], blocked_res$tsne[,2], pch=16, col=factor(blocked_res$block))
Again, we can see how to do this step by step.
library(scrapper)
rna.qc.metrics <- computeRnaQcMetrics(combined, subsets=list(), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics, block=block)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics, block=block)
filtered <- combined[,rna.qc.filter,drop=FALSE]
filtered.block <- block[rna.qc.filter]
size.factors <- centerSizeFactors(rna.qc.metrics$sum[rna.qc.filter], block=filtered.block)
normalized <- normalizeCounts(filtered, size.factors)
gene.var <- modelGeneVariances(normalized, num.threads=nthreads, block=filtered.block)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
pca <- runPca(normalized[top.hvgs,], num.threads=nthreads, block=filtered.block)
# Now we do a MNN correction to get rid of the batch effect.
corrected <- correctMnn(pca$components, block=filtered.block, num.threads=nthreads)
umap.out <- runUmap(corrected$corrected, num.threads=nthreads)
tsne.out <- runTsne(corrected$corrected, num.threads=nthreads)
snn.graph <- buildSnnGraph(corrected$corrected, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)
markers <- scoreMarkers(normalized, groups=clust.out$membership, block=filtered.block, num.threads=nthreads)
We can also compute pseudo-bulk profiles for each cluster-dataset combination, e.g., for differential expression analyses.
Let’s fetch a single-cell dataset with both RNA-seq and CITE-seq data. To keep the run-time short, we’ll only consider the first 5000 cells.
## class: SingleCellExperiment
## dim: 27679 5000
## metadata(0):
## assays(1): counts
## rownames(27679): A1BG A1BG-AS1 ... hsa-mir-8072 snoU2-30
## rowData names(0):
## colnames(5000): TGCCAAATCTCTAAGG ACTGCTCAGGTGTTAA ... CGCTATCGTCCGTCAG
## GGCGACTGTAAGGGAA
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(2): adt1 adt2
We extract the counts for both the RNA and the ADTs.
rna.counts <- assay(sce.s)
adt.counts <- rbind(assay(altExp(sce.s, "adt1")), assay(altExp(sce.s, "adt2")))
We pass both matrices to analyze()
, which will proceess
each modality separately before combining them for steps like clustering
and visualization.
is.mito <- grepl("^MT-", rownames(rna.counts))
is.igg <- grepl("^IgG", rownames(adt.counts))
multi_res <- analyze(
rna.counts,
adt.x=adt.counts,
rna.subsets=list(mito=is.mito),
adt.subsets=list(igg=is.igg),
num.threads=nthreads
)
Under the hood, the analysis looks like this:
# QC in both modalities, only keeping the cells that pass in both.
library(scrapper)
rna.qc.metrics <- computeRnaQcMetrics(rna.counts, subsets=list(MT=is.mito), num.threads=nthreads)
rna.qc.thresholds <- suggestRnaQcThresholds(rna.qc.metrics)
rna.qc.filter <- filterRnaQcMetrics(rna.qc.thresholds, rna.qc.metrics)
adt.qc.metrics <- computeAdtQcMetrics(adt.counts, subsets=list(IgG=is.igg), num.threads=nthreads)
adt.qc.thresholds <- suggestAdtQcThresholds(adt.qc.metrics)
adt.qc.filter <- filterAdtQcMetrics(adt.qc.thresholds, adt.qc.metrics)
keep <- rna.qc.filter & adt.qc.filter
rna.filtered <- rna.counts[,keep,drop=FALSE]
adt.filtered <- adt.counts[,keep,drop=FALSE]
rna.size.factors <- centerSizeFactors(rna.qc.metrics$sum[keep])
rna.normalized <- normalizeCounts(rna.filtered, rna.size.factors)
adt.size.factors <- computeClrm1Factors(adt.filtered, num.threads=nthreads)
adt.size.factors <- centerSizeFactors(adt.size.factors)
adt.normalized <- normalizeCounts(adt.filtered, adt.size.factors)
gene.var <- modelGeneVariances(rna.normalized, num.threads=nthreads)
top.hvgs <- chooseHighlyVariableGenes(gene.var$statistics$residuals)
rna.pca <- runPca(rna.normalized[top.hvgs,], num.threads=nthreads)
# Combining the RNA-derived PCs with ADT expression. Here, there's very few ADT
# tags so there's no need for further dimensionality reduction.
combined <- scaleByNeighbors(list(rna.pca$components, as.matrix(adt.normalized)), num.threads=nthreads)
umap.out <- runUmap(combined$combined, num.threads=nthreads)
tsne.out <- runTsne(combined$combined, num.threads=nthreads)
snn.graph <- buildSnnGraph(combined$combined, num.threads=nthreads)
clust.out <- clusterGraph(snn.graph)
rna.markers <- scoreMarkers(rna.normalized, groups=clust.out$membership, num.threads=nthreads)
adt.markers <- scoreMarkers(adt.normalized, groups=clust.out$membership, num.threads=nthreads)
The runAllNeighborSteps()
will run
runUmap()
, runTsne()
,
buildSnnGraph()
and clusterGraph()
in a single
call. This runs the UMAP/t-SNE iterations and the clustering in parallel
to maximize use of multiple threads.
The scoreGeneSet()
function will compute a gene set
score based on the input expression matrix. This can be used to
summarize the activity of pathways into a single per-cell score for
visualization.
The subsampleByNeighbors()
function will
deterministically select a representative subset of cells based on their
local neighborhood. This can be used to reduce the compute time of the
various steps downstream of the PCA.
For CRISPR data, quality control can be performed using
computeCrisprQcMetrics()
,
suggestCrisprQcThresholds()
and
filterCrisprQcMetrics()
. To normalize, we use size factors
defined by centering the total sum of guide counts for each cell. The
resulting matrix can then be directly used in
scaleByNeighbors()
.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scrapper_1.1.12 scRNAseq_2.20.0
## [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [9] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [11] generics_0.1.3 MatrixGenerics_1.19.0
## [13] matrixStats_1.4.1 knitr_1.49
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 httr2_1.0.7
## [4] rlang_1.1.4 magrittr_2.0.3 gypsum_1.3.0
## [7] compiler_4.4.2 RSQLite_2.3.9 GenomicFeatures_1.59.1
## [10] png_0.1-8 vctrs_0.6.5 ProtGenerics_1.39.1
## [13] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [16] dbplyr_2.5.0 XVector_0.47.1 Rsamtools_2.23.1
## [19] rmarkdown_2.29 UCSC.utils_1.3.0 purrr_1.0.2
## [22] bit_4.5.0.1 xfun_0.49 zlibbioc_1.53.0
## [25] cachem_1.1.0 beachmat_2.23.5 jsonlite_1.8.9
## [28] blob_1.2.4 rhdf5filters_1.19.0 DelayedArray_0.33.3
## [31] Rhdf5lib_1.29.0 BiocParallel_1.41.0 parallel_4.4.2
## [34] R6_2.5.1 bslib_0.8.0 rtracklayer_1.67.0
## [37] jquerylib_0.1.4 Rcpp_1.0.13-1 igraph_2.1.2
## [40] Matrix_1.7-1 tidyselect_1.2.1 abind_1.4-8
## [43] yaml_2.3.10 codetools_0.2-20 curl_6.1.0
## [46] lattice_0.22-6 alabaster.sce_1.7.0 tibble_3.2.1
## [49] withr_3.0.2 KEGGREST_1.47.0 evaluate_1.0.1
## [52] BiocFileCache_2.15.0 alabaster.schemas_1.7.0 ExperimentHub_2.15.0
## [55] Biostrings_2.75.3 pillar_1.10.0 BiocManager_1.30.25
## [58] filelock_1.0.3 RCurl_1.98-1.16 BiocVersion_3.21.1
## [61] ensembldb_2.31.0 alabaster.base_1.7.2 glue_1.8.0
## [64] alabaster.ranges_1.7.0 alabaster.matrix_1.7.4 lazyeval_0.2.2
## [67] maketools_1.3.1 tools_4.4.2 AnnotationHub_3.15.0
## [70] BiocIO_1.17.1 BiocNeighbors_2.1.2 sys_3.4.3
## [73] GenomicAlignments_1.43.0 buildtools_1.0.0 XML_3.99-0.18
## [76] rhdf5_2.51.1 grid_4.4.2 AnnotationDbi_1.69.0
## [79] GenomeInfoDbData_1.2.13 HDF5Array_1.35.2 restfulr_0.0.15
## [82] cli_3.6.3 rappdirs_0.3.3 S4Arrays_1.7.1
## [85] dplyr_1.1.4 AnnotationFilter_1.31.0 alabaster.se_1.7.0
## [88] sass_0.4.9 digest_0.6.37 SparseArray_1.7.2
## [91] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
## [94] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [97] bit64_4.5.2