| Title: | Automatic Generation of Single-Cell Analyses |
|---|---|
| Description: | Implements pipelines for generating single-cell analysis reports in the augere framework. This uses scrapper to execute routine steps such as quality control, normalization, feature selection, clustering and marker detection. We also implement a pipeline for automatic cell type annotation against a labelled reference with SingleR. Each pipeline function generates a self-contained Rmarkdown report with all of the steps required to reproduce its analysis. |
| Authors: | Aaron Lun [cre, aut] (ORCID: <https://orcid.org/0000-0002-3564-4813>) |
| Maintainer: | Aaron Lun <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.99.3 |
| Built: | 2026-05-15 06:38:42 UTC |
| Source: | https://github.com/bioc/augere.solo |
Useful links:
Report bugs at https://github.com/augere-bioinfo/augere.solo/issues
Annotate cells in a scRNA-seq dataset by computing correlations against reference data with known labels.

    runAnnotate(
      test,
      references,
      test.assay = 1,
      test.id.field = NULL,
      test.block.field = NULL,
      test.is.lognorm = (test.assay == "logcounts"),
      test.is.ensembl = FALSE,
      test.symbol.field = NULL,
      cluster.field = NULL,
      reduced.dimensions = NULL,
      output.dir = "annotate",
      metadata = NULL,
      author = NULL,
      dry.run = FALSE,
      save.results = TRUE,
      suppress.plots = FALSE,
      num.threads = 1
    )

    configureReferenceAnnotation(
      ref,
      ref.label.field,
      ref.marker.method = NULL,
      ref.num.markers = NULL,
      ref.assay = "logcounts",
      ref.id.field = NULL,
      ref.block.field = NULL,
      ref.aggregate = NULL,
      ref.is.lognorm = (ref.assay == "logcounts")
    )

test |
A SummarizedExperiment object containing cells in the test dataset to be assigned labels. |
references |
A list created by configureReferenceAnnotation. Alternatively, a list of these lists may be supplied to specify multiple references. This list may be named, in which case the names will be used to identify each reference in both the report and output of this function. |
test.assay |
Integer or string specifying the assay of |
test.id.field |
String specifying the name of the |
test.block.field |
String specifying the name of the |
test.is.lognorm |
Boolean indicating whether the assay at |
test.is.ensembl |
Boolean indicating whether |
test.symbol.field |
String specifying the name of the |
cluster.field |
String specifying the column of |
reduced.dimensions |
Character vector of reduced dimensions in |
output.dir |
String containing the path to an output directory in which to write the Rmarkdown report and save results. |
metadata |
Named list of additional fields to add to each result's metadata. |
author |
Character vector of authors. |
dry.run |
Boolean indicating whether to perform a dry run. This will write the Rmarkdown report without evaluating it. |
save.results |
Boolean indicating whether the results should also be saved to file. |
suppress.plots |
Boolean indicating whether to suppress the generation of plots.
This can be set to |
num.threads |
Integer specifying the number of threads to use in the various computations. |
ref |
A SummarizedExperiment object containing reference samples with known labels in Alternatively, a string can be provided containing the name of celldex reference dataset (e.g., |
ref.label.field |
String specifying the name of the |
ref.marker.method |
String specifying the method for choosing the top markers from each pairwise comparison between labels,
see the |
ref.num.markers |
Integer specifying the number of markers to use from each pairwise comparison between labels.
See the |
ref.assay |
Integer or string specifying the assay of |
ref.id.field |
String specifying the name of the |
ref.block.field |
String specifying the name of the |
ref.aggregate |
Boolean indicating that references should be aggregated inside |
ref.is.lognorm |
Boolean indicating whether the assay at |
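
The defaults `test.is.lognorm = (test.assay == "logcounts")` and `ref.is.lognorm = (ref.assay == "logcounts")` in the usage above are evaluated lazily by R, so they follow whichever assay is selected. The sketch below mirrors that pattern with a hypothetical function, `describe_assay`, which is not part of augere.solo. It shows that the default `test.assay = 1` gives `FALSE`, while the string `"logcounts"` gives `TRUE`.

```r
# Hypothetical stand-in that only mirrors the default-argument pattern
# from the usage above; it is not part of augere.solo.
describe_assay <- function(assay = 1, is.lognorm = (assay == "logcounts")) {
    is.lognorm
}

by.index <- describe_assay()                        # assay = 1, as in runAnnotate
by.name <- describe_assay("logcounts")              # as in configureReferenceAnnotation
overridden <- describe_assay(1, is.lognorm = TRUE)  # explicit override

print(c(by.index = by.index, by.name = by.name, overridden = overridden))
```

If an assay that already holds log-normalized values is selected by position rather than by the name "logcounts", the flag presumably needs to be set explicitly.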
For runAnnotate, an Rmarkdown report named report.Rmd is written inside output.dir that contains the analysis commands.
If dry.run=FALSE, a list is returned containing:
predictions, a list of length equal to references.
Each entry is a DataFrame containing the classification results of test against the corresponding reference.
See classifySingleR for more details on the expected columns.
combined, a DataFrame containing the combined classification results across multiple references.
See combineRecomputedResults for more details on the expected format.
Only present if references contains more than one reference.
If save.results=TRUE, the results are saved in a results directory inside output.dir.
If dry.run=TRUE, NULL is returned.
Only the Rmarkdown report is saved to file.
For configureReferenceAnnotation, a list of class "reference" is returned containing the configuration details for each reference.

    library(scRNAseq)
    hESCs <- LaMannoBrainData('human-es')
    hESCs <- hESCs[,1:100] # subsetting to speed it up.

    tmp <- tempfile()
    results <- runAnnotate(
        hESCs,
        configureReferenceAnnotation(
            "HumanPrimaryCellAtlasData",
            "label.main"
        ),
        output.dir = tmp,
        num.threads = 2 # speed it up a little.
    )

    list.files(tmp, recursive=TRUE)
    results$predictions

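
The references argument also accepts a named list of configurations, in which case the returned list is documented to gain a combined entry. The sketch below extends the example above to two references. It assumes that "BlueprintEncodeData" is accepted as a celldex dataset name in the same way as "HumanPrimaryCellAtlasData" and that it has a "label.main" field; the names hpca and blueprint are arbitrary.

```r
library(augere.solo)
library(scRNAseq)

# Same test dataset as the single-reference example, subset for speed.
hESCs <- LaMannoBrainData("human-es")
hESCs <- hESCs[, 1:100]

# Assumption: both strings name celldex datasets with a "label.main" field.
refs <- list(
    hpca = configureReferenceAnnotation("HumanPrimaryCellAtlasData", "label.main"),
    blueprint = configureReferenceAnnotation("BlueprintEncodeData", "label.main")
)

tmp <- tempfile()
results <- runAnnotate(hESCs, refs, output.dir = tmp, num.threads = 2)

# One set of predictions per reference, plus the combined results.
names(results$predictions)
head(results$combined)
```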
Simple analysis of scRNA-seq or CITE-seq data, from quality control to clustering and marker gene detection.

    runSolo(
      x,
      rna.experiment = TRUE,
      adt.experiment = NULL,
      subset.factor = NULL,
      subset.levels = NULL,
      block.field = NULL,
      qc.mito.seqnames = c("MT", "M", "chrM", "chrMT"),
      qc.mito.regex = NULL,
      qc.igg.regex = "IgG|igg|IGG",
      qc.num.mads = 3,
      qc.filter = TRUE,
      num.hvgs = 2000,
      num.pcs = 25,
      mnn.num.neighbors = 15,
      mnn.num.steps = 1,
      cluster.method = "graph",
      cluster.kmeans.k = 10,
      cluster.graph.method = c("multilevel", "leiden", "walktrap"),
      cluster.graph.num.neighbors = 10,
      cluster.graph.resolution = NULL,
      reduced.dimensions = c("tsne", "umap"),
      tsne.perplexity = 30,
      umap.num.neighbors = 15,
      umap.min.dist = 0.1,
      marker.effect.size = c("cohens.d", "auc", "delta.mean", "delta.detected"),
      marker.summary = c("min.rank", "mean", "median", "min"),
      marker.lfc.threshold = 0,
      assay = 1,
      symbol.field = NULL,
      metadata = NULL,
      output.dir = "solo",
      author = NULL,
      dry.run = FALSE,
      save.results = TRUE,
      suppress.plots = FALSE,
      num.threads = 1
    )

x |
A SummarizedExperiment object where each column represents a single cell.
Rows are usually expected to contain genes or antibody-derived tags, see |
rna.experiment |
Identity of the experiment containing the RNA data, when |
adt.experiment |
Identity of the experiment containing the ADT data, when |
subset.factor |
String specifying the name of the |
subset.levels |
Vector containing the subset of levels to retain in the factor specified by |
block.field |
String specifying the name of the |
qc.mito.seqnames |
Character vector containing the sequence names of the mitochondrial chromosome.
Only used if |
qc.mito.regex |
String containing a regular expression to identify mitochondrial genes from the row names.
If |
qc.igg.regex |
String containing a regular expression to identify IgG controls from the row names.
If |
qc.num.mads |
Integer specifying the number of median absolute deviations (MADs) with which to define a quality control (QC) filtering threshold. Smaller values increase the stringency of the filter. |
qc.filter |
Boolean indicating whether putative low-quality cells should be removed.
If |
num.hvgs |
Integer specifying the number of highly variable genes (HVGs) to retain for downstream analyses.
More HVGs capture more biological signal at the cost of capturing more technical noise and increasing computational work.
Only used if |
num.pcs |
Integer specifying the number of top principal components (PCs) to retain for downstream analyses. More PCs capture more biological signal at the cost of capturing more technical noise and increasing computational work. |
mnn.num.neighbors |
Integer specifying the number of neighbors for batch correction with mutual nearest neighbors. Larger values improve stability but reduce resolution for rare subpopulations. |
mnn.num.steps |
Integer specifying the number of steps for the center of mass calculation during batch correction with mutual nearest neighbors. Larger values improve intermingling of cells from different batches but increase the risk of merging the wrong subpopulations. |
cluster.method |
Character vector specifying the clustering methods to run on the top PCs,
namely graph-based clustering ( |
cluster.kmeans.k |
Integer specifying the number of clusters to generate from k-means clustering.
Only used if |
cluster.graph.method |
String naming the community detection method to use in graph-based clustering.
These are roughly equivalent to the functions of the same name from the igraph R package.
(Note that |
cluster.graph.num.neighbors |
Integer specifying the number of nearest neighbors to use during construction of the shared-nearest neighbor graph.
Larger values increase graph connectivity and decrease cluster resolution.
Only used if |
cluster.graph.resolution |
Number specifying the resolution to use in the multi-level or Leiden community detection algorithms.
Larger values usually result in a greater number of smaller clusters.
Only used if |
reduced.dimensions |
Character vector specifying the dimensionality reduction algorithms to use for visualization.
This can be zero, one or both of |
tsne.perplexity |
Number specifying the perplexity to use in t-SNE.
Larger values increase the size of each cell's neighborhood and focus more on global structure.
Only used if |
umap.num.neighbors |
Integer specifying the number of neighbors to use in the UMAP.
Larger values increase connectivity and focus more on global structure.
Only used if |
umap.min.dist |
Number specifying the minimum distance between points in the UMAP.
Larger values favor a more even distribution of cells throughout the low-dimensional space.
Only used if |
marker.effect.size |
String naming the effect size to use when ranking marker genes. This should be one of:
This is combined with |
marker.summary |
String specifying the summary statistic to use when ranking marker genes. This should be one of:
This is combined with |
marker.lfc.threshold |
Non-negative number specifying the log-fold change threshold to test against. Larger values focus on marker genes with larger log-fold changes at the expense of those with smaller variances. |
assay |
String or integer specifying the assay in |
symbol.field |
String specifying the name of the |
metadata |
Named list of additional fields to add to each result's metadata. |
output.dir |
String containing the path to an output directory in which to write the Rmarkdown report and save results. |
author |
Character vector of authors. |
dry.run |
Boolean indicating whether to perform a dry run. This will write the Rmarkdown report without evaluating it. |
save.results |
Boolean indicating whether the results should also be saved to file. |
suppress.plots |
Boolean indicating whether to suppress the generation of plots.
This can be set to |
num.threads |
Integer specifying the number of threads to use in the various computations. |
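
To make the effect of qc.num.mads concrete, the following base R sketch flags cells whose mitochondrial proportion lies more than a given number of MADs above the median. It is a simplified illustration on simulated values: the helper `is_high_outlier` and the data are assumptions, and the filter actually applied by scrapper may differ in detail (for example, in how metrics are transformed or how blocks are handled).

```r
set.seed(42)

# Simulated mitochondrial proportions: 95 typical cells and 5 damaged cells.
mito.prop <- c(runif(95, 0.01, 0.05), runif(5, 0.3, 0.6))

# Hypothetical helper: flag values more than 'num.mads' MADs above the median.
# Note that stats::mad() scales by 1.4826 by default.
is_high_outlier <- function(x, num.mads = 3) {
    threshold <- median(x) + num.mads * mad(x)
    x > threshold
}

default.filter <- is_high_outlier(mito.prop, num.mads = 3)
strict.filter <- is_high_outlier(mito.prop, num.mads = 1)

# A smaller number of MADs never discards fewer cells.
print(c(default = sum(default.filter), strict = sum(strict.filter)))
```

Lowering num.mads moves the threshold closer to the median, which is why smaller values increase the stringency of the filter.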
An Rmarkdown report named report.Rmd is written inside output.dir that contains the analysis commands.
If dry.run=FALSE, a list is returned containing:
sce, a SingleCellExperiment object with dimensionality reduction and clustering results.
This may have fewer cells than the input x if qc.filter=TRUE.
qc.rna, a DataFrame object with RNA-based QC metrics for each cell in the input x.
Only returned if rna.experiment= indicates that RNA data is available.
qc.adt, a DataFrame object with ADT-based QC metrics for each cell in the input x.
Only returned if adt.experiment= indicates that ADT data is available.
markers.rna, a named list of DataFrame results containing marker gene statistics for each cluster.
Only returned if rna.experiment= indicates that RNA data is available.
markers.adt, a named list of DataFrame results containing marker tag statistics for each cluster.
Only returned if adt.experiment= indicates that ADT data is available.
If save.results=TRUE, the results are saved in a results directory inside output.dir.
If dry.run=TRUE, NULL is returned.
Only the Rmarkdown report is saved to file.
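
The marker statistics are ranked by combining an effect size (marker.effect.size) with a summary across pairwise comparisons (marker.summary). The base R sketch below illustrates how the four summaries named in the usage can order genes differently. The effect sizes are invented, and the calculation is a simplified reading of these summaries rather than the exact scrapper implementation.

```r
# Hypothetical effect sizes (e.g., Cohen's d) for one cluster: rows are genes,
# columns are pairwise comparisons against each other cluster.
effects <- matrix(
    c(2.0, 0.1, 1.5,
      0.5, 3.0, 0.2,
      1.0, 1.0, 1.0),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        c("GeneA", "GeneB", "GeneC"),
        c("vs.2", "vs.3", "vs.4")
    )
)

# Rank genes within each comparison, with rank 1 for the largest effect.
ranks <- apply(-effects, 2, rank, ties.method = "min")

summaries <- data.frame(
    min.rank = apply(ranks, 1, min),   # best rank in any comparison
    mean = rowMeans(effects),
    median = apply(effects, 1, median),
    min = apply(effects, 1, min)       # worst-case effect size
)
print(summaries)
```

In this toy example, min.rank favors genes that are top-ranked in at least one comparison (GeneA and GeneB), whereas min favors the gene with a consistent effect in every comparison (GeneC).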
Aaron Lun

    library(scRNAseq)
    se <- ZeiselBrainData()

    tmp <- tempfile()
    output <- runSolo(
        se,
        qc.mito.regex="^mt-",
        output.dir=tmp,
        num.threads = 2 # speed it up a little.
    )

    list.files(tmp, recursive=TRUE)
    output$sce
    output$markers

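
The example above passes qc.mito.regex="^mt-" to pick out mitochondrial genes from the row names. The base R sketch below shows what such a pattern matches and how a per-cell mitochondrial proportion could be derived from it. The count matrix and gene names are invented for illustration, and the calculation is not taken from the package itself.

```r
# Hypothetical count matrix: rows are genes, columns are cells.
counts <- matrix(
    c( 5,  0,
       3,  1,
      20, 30,
      10,  9,
       2,  0),
    ncol = 2, byrow = TRUE,
    dimnames = list(
        c("mt-Nd1", "mt-Co1", "Actb", "Gapdh", "Mtor"),
        c("cell1", "cell2")
    )
)

# The pattern is anchored and case-sensitive, so "Mtor" is not matched.
is.mito <- grepl("^mt-", rownames(counts))

# Fraction of each cell's counts assigned to mitochondrial genes.
mito.prop <- colSums(counts[is.mito, , drop = FALSE]) / colSums(counts)
print(mito.prop)

# The default qc.igg.regex from the usage, applied to made-up tag names.
is.igg <- grepl("IgG|igg|IGG", c("IgG1-control", "CD3", "igg2a"))
print(is.igg)
```

Because the match is case-sensitive, the pattern needs to follow the naming convention of the dataset at hand; human gene symbols, for instance, typically use an uppercase "MT-" prefix.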