This vignette shows an example
pseudobulk experiment testing cell size scale factors using a small
example dataset of single-nucleus RNA-seq data (snRNA-seq) from human
cortex (Darmanis et al. (2015)).
Predictions are made using lute
, and results plots are
generated using ggplot2
.
In this example, we source a real snRNA-seq dataset of human brain,
including cortex and hippocampus published in darmanis_survey_2015. The
full data along with other real single-cell RNA-seq datasets may be
accessed from the scRNAseq
package.
Load a stored subset of the example dataset with the following.
path <- system.file("extdata", "scRNAseq/darmanis_example.rda", package="lute")
data <- get(load(path))
The loaded dataset is of type SingleCellExperiment
,
which is handled by the lute()
function (see
?lute
for details). Before calling the framework function,
it needs to be processed to (1) define cell types and samples of
interest (2) subset on cell type markers, and (3) define pseudobulks for
each available sample.
For this experiment, we will consider two principal cell types for brain, neuron and glial cells (a.k.a. “K2”).
First, identify nuclei labeled from only these types and remove the
rest. Then define a new label "k2"
using the valid
remaining nuclei.
sampleIdVariable <- "experiment_sample_name"
oldTypes <- "cell.type"; newTypes <- "k2"
# remove non-k2 types
filterK2 <- data[[oldTypes]] %in%
c("neurons", "oligodendrocytes", "astrocytes", "OPC", "microglia")
data <- data[,filterK2]
# define new k2 variable
data[[newTypes]] <- ifelse(data[[oldTypes]]=="neurons", "neuron", "glial")
data[[newTypes]] <- factor(data[[newTypes]])
Next, define the samples of interest for the experiment. We will select samples having at least 20 nuclei.
minNuclei <- 20
nucleiPerSample <- table(data[[sampleIdVariable]])
sampleIdVector <- unique(data[[sampleIdVariable]])
sampleIdVector <- sampleIdVector[nucleiPerSample >= minNuclei]
sampleIdVector # view
## [1] "AB_S8" "AB_S11" "AB_S3" "AB_S4" "AB_S5" "AB_S7"
Next, save samples having non-zero amounts of neuron and glial cells.
sampleIdVector <- unlist(lapply(sampleIdVector, function(sampleId){
numTypes <- length(
unique(
data[,data[[sampleIdVariable]]==sampleId][[newTypes]]))
if(numTypes==2){sampleId}
}))
sampleIdVector
## [1] "AB_S11" "AB_S4" "AB_S5" "AB_S7"
View the summaries by sample id, then save these as the true cell type proportions. These will be used later to assess the predictions.
proportionsList <- lapply(sampleIdVector, function(sampleId){
prop.table(table(data[,data$experiment_sample_name==sampleId]$k2))
})
dfProportions <- do.call(rbind, proportionsList)
rownames(dfProportions) <- sampleIdVector
colnames(dfProportions) <- paste0(colnames(dfProportions), ".true")
dfProportions <- as.data.frame(dfProportions)
knitr::kable(dfProportions) # view
glial.true | neuron.true | |
---|---|---|
AB_S11 | 0.1525424 | 0.8474576 |
AB_S4 | 0.6724138 | 0.3275862 |
AB_S5 | 0.3571429 | 0.6428571 |
AB_S7 | 0.3000000 | 0.7000000 |
Define the cell size scale factors and use these to make the
pseudobulks. For demonstration we set these to have large difference
(i.e. neuron/glial > 3). While we set these manually, the cell scale
factors could also be defined from library sizes or by referencing the
cellScaleFactors
package (link).
Next make the pseudobulk datasets.
assayName <- "counts"
pseudobulkList <- lapply(sampleIdVector, function(sampleId){
dataIteration <- data[,data[[sampleIdVariable]]==sampleId]
ypb_from_sce(
singleCellExperiment = dataIteration, assayName = assayName,
cellTypeVariable = newTypes, cellScaleFactors = cellScalesVector)
})
dfPseudobulk <- do.call(cbind, pseudobulkList)
dfPseudobulk <- as.data.frame(dfPseudobulk)
colnames(dfPseudobulk) <- sampleIdVector
knitr::kable(head(dfPseudobulk))
AB_S11 | AB_S4 | AB_S5 | AB_S7 | |
---|---|---|---|---|
ZNHIT3 | 407.6780 | 334.3621 | 342.42857 | 489.48 |
ZYG11B | 675.4746 | 835.5517 | 774.42857 | 1380.54 |
ZNF91 | 1204.6610 | 679.3793 | 2608.28571 | 856.76 |
ZSCAN18 | 197.5254 | 249.0000 | 450.07143 | 323.62 |
ZRANB2 | 1432.9831 | 1121.3621 | 1469.21429 | 1623.72 |
ZYX | 141.7458 | 175.3621 | 39.28571 | 132.82 |
Predict the neuron proportions using non-negative least squares
(NNLS), the default deconvolution algorithm used by lute()
.
First, get the scaled proportions by setting the argument
cellScaleFactors = cellScalesVector
.
scaledResult <- lute(
singleCellExperiment = data,
bulkExpression = as.matrix(dfPseudobulk),
cellScaleFactors = cellScalesVector,
typemarkerAlgorithm = NULL,
cellTypeVariable = newTypes,
assayName = assayName)
## Parsing deconvolution arguments...
## Using NNLS...
glial | neuron | |
---|---|---|
AB_S11 | 0.1638850 | 0.8361150 |
AB_S4 | 0.7281654 | 0.2718346 |
AB_S5 | 0.0000000 | 1.0000000 |
AB_S7 | 0.4049374 | 0.5950626 |
Next, get the unscaled result without setting s
.
unscaledResult <- lute(
singleCellExperiment = data,
bulkExpression = as.matrix(dfPseudobulk),
typemarkerAlgorithm = NULL,
cellTypeVariable = newTypes,
assayName = assayName)
## Parsing deconvolution arguments...
## Using NNLS...
proportionsUnscaled <- unscaledResult[[1]]@predictionsTable
knitr::kable(proportionsUnscaled) # view
glial | neuron | |
---|---|---|
AB_S11 | 0.0555366 | 0.9444634 |
AB_S4 | 0.4455571 | 0.5544429 |
AB_S5 | 0.0000000 | 1.0000000 |
AB_S7 | 0.1695378 | 0.8304622 |
Note proportions didn’t change for samples which had all glial or all
neuron (AB_S8
and AB_S3
).
We will show the outcome of performing the cell scale factor
adjustments using scatterplots and boxplots. Begin by appending the
neuron proportion predictions from scaling treatments (scaled and
unscaled) to the true proportions table dfProportions
.
dfProportions$neuron.unscaled <- proportionsUnscaled$neuron
dfProportions$neuron.scaled <- proportions.scaled$neuron
knitr::kable(dfProportions) # view
glial.true | neuron.true | neuron.unscaled | neuron.scaled | |
---|---|---|---|---|
AB_S11 | 0.1525424 | 0.8474576 | 0.9444634 | 0.8361150 |
AB_S4 | 0.6724138 | 0.3275862 | 0.5544429 | 0.2718346 |
AB_S5 | 0.3571429 | 0.6428571 | 1.0000000 | 1.0000000 |
AB_S7 | 0.3000000 | 0.7000000 | 0.8304622 | 0.5950626 |
Calculate bias as the difference between true and predicted neuron proportions. Then calculate the error as the absolute of the bias thus defined.
# get bias
dfProportions$bias.neuron.unscaled <-
dfProportions$neuron.true-dfProportions$neuron.unscaled
dfProportions$bias.neuron.scaled <-
dfProportions$neuron.true-dfProportions$neuron.scaled
# get error
dfProportions$error.neuron.unscaled <-
abs(dfProportions$bias.neuron.unscaled)
dfProportions$error.neuron.scaled <-
abs(dfProportions$bias.neuron.scaled)
Make the tall version of dfProportions
in order to
generate a plot with facets on the scale treatment (either “scaled” or
“unscaled”).
dfPlotTall <- rbind(
data.frame(true = dfProportions$neuron.true,
predicted = dfProportions$neuron.scaled,
error = dfProportions$error.neuron.scaled,
sampleId = rownames(dfProportions),
type = rep("scaled", nrow(dfProportions))),
data.frame(true = dfProportions$neuron.true,
predicted = dfProportions$neuron.unscaled,
error = dfProportions$error.neuron.unscaled,
sampleId = rownames(dfProportions),
type = rep("unscaled", nrow(dfProportions)))
)
dfPlotTall <- as.data.frame(dfPlotTall)
Show sample results scatterplots of true (x-axis) by predicted (y-axis) neuron proportions. Also include a reference line (slope = 1, yintercept = 0) showing where agreement is absolute between proportions.
Also shows RMSE in plot titles.
dfPlotTallNew <- dfPlotTall
rmseScaled <-
rmse(
dfPlotTallNew[dfPlotTallNew$type=="scaled",]$true,
dfPlotTall[dfPlotTall$type=="scaled",]$predicted, "mean")
rmseUnscaled <-
rmse(
dfPlotTallNew[dfPlotTallNew$type=="unscaled",]$true,
dfPlotTallNew[dfPlotTallNew$type=="unscaled",]$predicted, "mean")
dfPlotTallNew$type <-
ifelse(grepl("un.*", dfPlotTallNew$type),
paste0(dfPlotTallNew$type,
" (RMSE = ", round(rmseScaled, 3), ")"),
paste0(dfPlotTallNew$type,
" (RMSE = ", round(rmseUnscaled, 3), ")"))
textSize <- 15
ggplot(dfPlotTallNew, aes(x = true, y = predicted)) +
geom_point(size = 4, alpha = 0.5) + geom_abline(slope = 1, intercept = 0) +
xlim(0, 1) + ylim(0, 1) + facet_wrap(~type) + theme_bw() +
xlab("True") + ylab("Predicted") +
theme(text = element_text(size = textSize),
axis.text.x = element_text(angle = 45, hjust = 1))
Show jitters and boxplots by sample, depicting the neuron error (y-axis) by scale treatment (x-axis). The sample IDs are depicted by the point colors.
ggplot(dfPlotTall, aes(x = type, y = error, color = sampleId)) +
geom_jitter(alpha = 0.5, size = 4) + theme_bw() +
geom_boxplot(alpha = 0, color = "cyan") +
theme(text = element_text(size = textSize),
axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Type") + ylab("Error (Neuron)")
This process could be readily repeated for the remaining cell types, or just glial cells in this case.
This vignette showed how to conduct a basic pseudobulk experiment using cell size scale factors and an example snRNAseq dataset from human brain Darmanis et al. (2015). Some key details include sourcing and snRNA-seq data, defining a new cell type variable, setting the scale factors, making predictions, and performing comparative analyses of the prediction results. Further details about the importance of cell size scale factors are discussed in Maden et al. (2023), and examples of their utilizations may be found in Monaco et al. (2019), Racle and Gfeller (2020), and Sosina et al. (2021).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 lute_1.3.0
## [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [9] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [11] generics_0.1.3 MatrixGenerics_1.19.0
## [13] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
## [4] fastmap_1.2.0 bluster_1.17.0 digest_0.6.37
## [7] rsvd_1.0.5 lifecycle_1.0.4 cluster_2.1.8
## [10] statmod_1.5.0 magrittr_2.0.3 compiler_4.4.2
## [13] rlang_1.1.4 sass_0.4.9 tools_4.4.2
## [16] igraph_2.1.2 yaml_2.3.10 knitr_1.49
## [19] S4Arrays_1.7.1 labeling_0.4.3 dqrng_0.4.1
## [22] DelayedArray_0.33.3 abind_1.4-8 BiocParallel_1.41.0
## [25] withr_3.0.2 sys_3.4.3 grid_4.4.2
## [28] beachmat_2.23.5 colorspace_2.1-1 edgeR_4.5.1
## [31] scales_1.3.0 cli_3.6.3 rmarkdown_2.29
## [34] crayon_1.5.3 metapod_1.15.0 httr_1.4.7
## [37] scuttle_1.17.0 cachem_1.1.0 nnls_1.6
## [40] parallel_4.4.2 BiocManager_1.30.25 XVector_0.47.1
## [43] vctrs_0.6.5 Matrix_1.7-1 jsonlite_1.8.9
## [46] BiocSingular_1.23.0 BiocNeighbors_2.1.2 irlba_2.3.5.1
## [49] maketools_1.3.1 locfit_1.5-9.10 limma_3.63.2
## [52] jquerylib_0.1.4 glue_1.8.0 codetools_0.2-20
## [55] gtable_0.3.6 UCSC.utils_1.3.0 ScaledMatrix_1.15.0
## [58] munsell_0.5.1 tibble_3.2.1 pillar_1.10.0
## [61] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13 R6_2.5.1
## [64] evaluate_1.0.1 lattice_0.22-6 scran_1.35.0
## [67] bslib_0.8.0 Rcpp_1.0.13-1 SparseArray_1.7.2
## [70] xfun_0.49 buildtools_1.0.0 pkgconfig_2.0.3