| Title: | Feature Extraction for scRNA-seq Dataset Integration |
|---|---|
| Description: | anglemania extracts genes from multi-batch scRNA-seq experiments for downstream dataset integration. It shows improvement over the conventional usage of highly-variable genes for many integration tasks. We leverage gene-gene correlations that are stable across batches to identify biologically informative genes which are less affected by batch effects. Currently, its main use is for single-cell RNA-seq dataset integration, but it can be applied for other multi-batch downstream analyses such as NMF. |
| Authors: | Aaron Kollotzek [aut, cre] (ORCID: <https://orcid.org/0009-0009-7142-4015>), Vedran Franke [aut] (ORCID: <https://orcid.org/0000-0003-3606-6792>), Artem Baranovskii [aut], Altuna Akalin [aut], SFB1588 [fnd] (Funded by the DFG – Deutsche Forschungsgemeinschaft) |
| Maintainer: | Aaron Kollotzek <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.3.0 |
| Built: | 2026-05-29 08:53:32 UTC |
| Source: | https://github.com/bioc/anglemania |
anglemania computes critical angles between genes across all samples
provided in an
SingleCellExperiment
object. It calculates angles, transforms them to z-scores, computes
statistical measures, and selects the top genes based on mean and standard
deviation of z-scores.
These genes are biologically informative and invariant to batch effects.
This function adds a unique batch identifier to the metadata of a
SingleCellExperiment object by combining specified dataset and batch
keys. This is useful for distinguishing samples during integration or
analysis.
anglemania( sce, batch_key, dataset_key = NA_character_, max_n_genes = 2000, min_cells_per_gene = 1, min_samples_per_gene = 2, allow_missing_features = FALSE, method = "cosine", permute_row_or_column = "column", permutation_function = "sample", prefilter_threshold = 0.5, normalization_method = "divide_by_total_counts", verbose = TRUE ) check_params( sce, batch_key, dataset_key, max_n_genes, method, min_cells_per_gene, min_samples_per_gene, allow_missing_features, permute_row_or_column, permutation_function, prefilter_threshold, normalization_method, verbose ) add_unique_batch_key(sce, dataset_key = NA_character_, batch_key) get_intersect_genes( matrix_list, allow_missing_features = FALSE, min_samples_per_gene = 1, verbose = TRUE )anglemania( sce, batch_key, dataset_key = NA_character_, max_n_genes = 2000, min_cells_per_gene = 1, min_samples_per_gene = 2, allow_missing_features = FALSE, method = "cosine", permute_row_or_column = "column", permutation_function = "sample", prefilter_threshold = 0.5, normalization_method = "divide_by_total_counts", verbose = TRUE ) check_params( sce, batch_key, dataset_key, max_n_genes, method, min_cells_per_gene, min_samples_per_gene, allow_missing_features, permute_row_or_column, permutation_function, prefilter_threshold, normalization_method, verbose ) add_unique_batch_key(sce, dataset_key = NA_character_, batch_key) get_intersect_genes( matrix_list, allow_missing_features = FALSE, min_samples_per_gene = 1, verbose = TRUE )
sce |
A |
batch_key |
A character string specifying the column name in the metadata that identifies the batch. |
dataset_key |
A character string specifying the column name in the
metadata that identifies the dataset. If |
max_n_genes |
Integer specifying the maximum number of genes to select. |
min_cells_per_gene |
Integer specifying the minimum number of cells per
gene. Default is |
min_samples_per_gene |
Integer indicating the minimum number of samples per gene. |
allow_missing_features |
Logical indicating whether to allow missing features. |
method |
Character string specifying the method to use for calculating
the relationship between gene pairs. Default is |
permute_row_or_column |
Character "row" or "column", whether
permutations
should be executed row-wise or column wise. Default is |
permutation_function |
Character "sample" or "permute_nonzero". If
sample,then sample is used for constructing background distributions. If
permute_nonzero, then only non-zero values are permuted. Default is
|
prefilter_threshold |
Numeric value specifying the threshold prefiltering genes. Speeds up gene selection. |
normalization_method |
Character "divide_by_total_counts" or
"scale_by_total_counts". Default is |
verbose |
Logical indicating whether to print messages. |
matrix_list |
A list of |
This function performs the following steps:
Computes angles between genes for each batch in the
SingleCellExperiment using the specified method, via
factorise.
Transforms the angles to z-scores.
Computes statistical measures (mean z-score, signal-to-noise ratio)
across batches using get_list_stats.
Selects the top n genes based on mean and standard deviation of
z-scores using select_genes.
The computed statistics and selected genes are added to the
SingleCellExperiment object, which is returned.
An updated SingleCellExperiment object with computed
statistics and selected genes.
The results are stored in the metadata of the SingleCellExperiment
object.
A list of validated parameters
A SingleCellExperiment object with an additional metadata
column containing the unique batch key.
A character vector of intersected genes.
check_params(): Check Parameters provided to the anglemania function
add_unique_batch_key(): Temporarily add a unique batch key to the dataset
get_intersect_genes(): Extract the intersected genes from a
list of matrices (count matrices from different batches/datasets).
It also allows for missing features in individual matrices, so
that a feature does not have to be present in every single batch.
get_list_stats,
select_genes,
factorise,
big_apply,
https://arxiv.org/abs/1306.0256
# Set seed (optional) set.seed(1) sce <- sce_example() sce <- anglemania( sce, batch_key = "batch", method = "cosine" ) # Access the selected genes selected_genes <- get_anglemania_genes(sce) selected_genes[1:10] sce <- sce_example() params <- check_params( sce, batch_key = "batch", dataset_key = "dataset", max_n_genes = 2000, method = "cosine", min_cells_per_gene = 1, min_samples_per_gene = 2, allow_missing_features = FALSE, permute_row_or_column = "column", permutation_function = "sample", prefilter_threshold = 0.5, normalization_method = "divide_by_total_counts", verbose = TRUE ) sce <- sce_example() head(SummarizedExperiment::colData(sce)) sce <- add_unique_batch_key( sce, batch_key = "batch", dataset_key = "dataset" ) head(SummarizedExperiment::colData(sce)) library(SingleCellExperiment) sce <- sce_example() barcodes_by_batch <- split(colnames(sce), colData(sce)$batch) matrix_list <- lapply(barcodes_by_batch, function(barcodes) { SingleCellExperiment::counts(sce)[, barcodes] }) intersect_genes <- get_intersect_genes(matrix_list) head(intersect_genes)# Set seed (optional) set.seed(1) sce <- sce_example() sce <- anglemania( sce, batch_key = "batch", method = "cosine" ) # Access the selected genes selected_genes <- get_anglemania_genes(sce) selected_genes[1:10] sce <- sce_example() params <- check_params( sce, batch_key = "batch", dataset_key = "dataset", max_n_genes = 2000, method = "cosine", min_cells_per_gene = 1, min_samples_per_gene = 2, allow_missing_features = FALSE, permute_row_or_column = "column", permutation_function = "sample", prefilter_threshold = 0.5, normalization_method = "divide_by_total_counts", verbose = TRUE ) sce <- sce_example() head(SummarizedExperiment::colData(sce)) sce <- add_unique_batch_key( sce, batch_key = "batch", dataset_key = "dataset" ) head(SummarizedExperiment::colData(sce)) library(SingleCellExperiment) sce <- sce_example() barcodes_by_batch <- split(colnames(sce), colData(sce)$batch) matrix_list <- lapply(barcodes_by_batch, function(barcodes) { SingleCellExperiment::counts(sce)[, barcodes] }) intersect_genes <- get_intersect_genes(matrix_list) head(intersect_genes)
Constructs a matrix of gene-gene relationships based on distance metrics.
extract_angles(x_mat, method = "cosine")extract_angles(x_mat, method = "cosine")
x_mat |
An |
method |
A character string specifying the method to compute the gene-gene relationships. Options are:
|
The function returns the gene-gene angle matrix as an
FBM object.
An FBM object containing the gene-gene
correlation matrix. The matrix is square with dimensions equal to the
number of genes and contains the pairwise correlations between genes.
The diagonal elements are set to NA.
big_apply,
big_cor,
FBM,
factorise
mat <- matrix( c( 5, 3, 0, 0, 0, 0, 0, 3, 2, 1, 3, 4, 0, 0, 1, 0, 1, 2, 1, 2, 3, 4, 3, 4 ), nrow = 6, # 6 genes ncol = 4, # 4 cells byrow = TRUE ) mat <- bigstatsr::FBM(nrow = nrow(mat), ncol = ncol(mat), init = mat) angle_mat <- extract_angles(mat) angle_mat[]mat <- matrix( c( 5, 3, 0, 0, 0, 0, 0, 3, 2, 1, 3, 4, 0, 0, 1, 0, 1, 2, 1, 2, 3, 4, 3, 4 ), nrow = 6, # 6 genes ncol = 4, # 4 cells byrow = TRUE ) mat <- bigstatsr::FBM(nrow = nrow(mat), ncol = ncol(mat), init = mat) angle_mat <- extract_angles(mat) angle_mat[]
factorise computes the angle matrix of the input gene expression
matrix using the specified method, performs permutation to create a null
distribution, and transforms the correlations into z-scores. This function
is optimized for large datasets using the bigstatsr package.
factorise( x_mat, method = "cosine", seed = 1, permute_row_or_column = "column", permutation_function = "sample", normalization_method = "divide_by_total_counts" )factorise( x_mat, method = "cosine", seed = 1, permute_row_or_column = "column", permutation_function = "sample", normalization_method = "divide_by_total_counts" )
x_mat |
A |
method |
A character string specifying the method for calculating the
relationship between gene pairs. Default is |
seed |
An integer value for setting the seed for reproducibility during
permutation. Default is |
permute_row_or_column |
Character "row" or "column", whether
permutations should be executed row-wise or column wise.
Default is |
permutation_function |
Character "sample" or "permute_nonzero".
If sample, then sample is used for constructing background distributions.
If permute_nonzero, then only non-zero values are permuted.
Default is |
normalization_method |
Character "divide_by_total_counts" or
"scale_by_total_counts". Default is |
The function performs the following steps:
Permutation: The input matrix is permuted column-wise to disrupt existing angles, creating a null distribution.
Angle Computation: Computes the angle matrix for both the
original and permuted matrices using extract_angles.
Method-Specific Processing:
For other methods ("cosine", "spearman"),
statistical measures are computed from the permuted data.
Statistical Measures: Calculates mean, variance, and standard
deviation using get_dstat.
Z-Score Transformation: Transforms the original angle matrix into z-scores.
This process allows for the identification of invariant gene-gene relationships by comparing them to a null distribution derived from the permuted data.
An FBM object containing the
z-score-transformed angle matrix.
extract_angles,
get_dstat,
big_apply,
FBM
mat <- matrix( c( 5, 3, 0, 0, 0, 0, 0, 3, 2, 1, 3, 4, 0, 0, 1, 0, 1, 2, 1, 2, 3, 4, 3, 4 ), nrow = 6, # 6 genes ncol = 4, # 4 cells byrow = TRUE ) mat <- bigstatsr::FBM(nrow = nrow(mat), ncol = ncol(mat), init = mat) # Run factorise with method "cosine" and a fixed seed result_fbm <- factorise(mat, method = "cosine", seed = 1) result_fbm[]mat <- matrix( c( 5, 3, 0, 0, 0, 0, 0, 3, 2, 1, 3, 4, 0, 0, 1, 0, 1, 2, 1, 2, 3, 4, 3, 4 ), nrow = 6, # 6 genes ncol = 4, # 4 cells byrow = TRUE ) mat <- bigstatsr::FBM(nrow = nrow(mat), ncol = ncol(mat), init = mat) # Run factorise with method "cosine" and a fixed seed result_fbm <- factorise(mat, method = "cosine", seed = 1) result_fbm[]
Permutes the non-zero elements of a numeric vector.
permute_nonzero(v)permute_nonzero(v)
v |
A numeric vector. |
A numeric vector with non-zero elements permuted.
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # put in some zeroes v = c(v, 0, 0, 0) permute_nonzero(v)v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # put in some zeroes v = c(v, 0, 0, 0) permute_nonzero(v)
This function generates a SingleCellExperiment object with 2 batches and 2
datasets. The object contains 300 genes and 600 cells. The counts matrix is
generated using the rpois function with different lambda values
for the two batches.
sce_example(seed = 42)sce_example(seed = 42)
seed |
The seed for the random number generator. Because this function is only used locally, we allow to set a seed within the function. |
A SingleCellExperiment object
sce <- sce_example() scesce <- sce_example() sce