| Title: | PLAID ultrafast gene set enrichment scoring |
|---|---|
| Description: | PLAID (Pathway Level Average Intensity Detection) is an ultra-fast method to compute single-sample enrichment scores for gene expression or proteomics data. For each sample, plaid computes the gene set score as the average intensity of the genes/proteins in the gene set. The output is a gene set score matrix suitable for further analyses. |
| Authors: | Ivo Kwee [aut] (ORCID: <https://orcid.org/0000-0002-2751-4218>), Antonino Zito [cre] (ORCID: <https://orcid.org/0000-0003-1931-984X>) |
| Maintainer: | Antonino Zito <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.0 |
| Built: | 2026-05-30 09:37:35 UTC |
| Source: | https://github.com/bioc/plaid |
Compute crossprod (t(x) %*% y) for very large y by computing in chunks.
chunked_crossprod(x, y, chunk = NULL)chunked_crossprod(x, y, chunk = NULL)
x |
Matrix First matrix for multiplication. Can be sparse. |
y |
Matrix Second matrix for multiplication. Can be sparse. |
chunk |
Integer Chunk size (max number of columns) for computation. |
Matrix. Result of matrix cross product.
Computes columnwise rank of matrix. Can be sparse. Tries to call optimized functions from Rfast or matrixStats.
colranks( X, sparse = NULL, signed = FALSE, keep.zero = FALSE, ties.method = "average" )colranks( X, sparse = NULL, signed = FALSE, keep.zero = FALSE, ties.method = "average" )
X |
Input matrix |
sparse |
Logical indicating to use sparse methods |
signed |
Logical indicating using signed ranks |
keep.zero |
Logical indicating whether to keep zero as ranked zero |
ties.method |
Character Choice of ties.method |
Matrix of columnwise ranks with same dimensions as input.
# Create example matrix set.seed(123) X <- matrix(rnorm(100), nrow = 10, ncol = 10) rownames(X) <- paste0("Gene", 1:10) colnames(X) <- paste0("Sample", 1:10) # Compute column ranks ranks <- colranks(X) print(ranks[1:5, 1:5]) # Compute signed ranks signed_ranks <- colranks(X, signed = TRUE) print(signed_ranks[1:5, 1:5])# Create example matrix set.seed(123) X <- matrix(rnorm(100), nrow = 10, ncol = 10) rownames(X) <- paste0("Gene", 1:10) colnames(X) <- paste0("Sample", 1:10) # Compute column ranks ranks <- colranks(X) print(ranks[1:5, 1:5]) # Compute signed ranks signed_ranks <- colranks(X, signed = TRUE) print(signed_ranks[1:5, 1:5])
Calculate sparse correlation matrix handling missing values
cor_sparse_matrix(G, mat)cor_sparse_matrix(G, mat)
G |
Sparse matrix containing gene sets |
mat |
Matrix of values |
If mat has no missing values, calculates correlation directly using corSparse. Otherwise computes column-wise correlations only using non-missing values.
Correlation matrix between G and mat
Reimplementation of dualGSEA (Bull et al., 2024) but defaults with replaid backend. For the preranked test we still use fgsea. Should be much faster than original using fgsea + GSVA::ssGSEA.
dualGSEA( X, y, G, gmt = NULL, gsetX = NULL, fc.method = c("fgsea", "rankcor", "ztest", "ttest", "cor")[2], ss.method = c("plaid", "replaid.ssgsea", "replaid.gsva", "ssgsea", "gsva")[1], metap.method = c("stouffer", "fisher", "maxp")[1], pv1 = NULL, pv2 = NULL, sort.by = "p.dual" )dualGSEA( X, y, G, gmt = NULL, gsetX = NULL, fc.method = c("fgsea", "rankcor", "ztest", "ttest", "cor")[2], ss.method = c("plaid", "replaid.ssgsea", "replaid.gsva", "ssgsea", "gsva")[1], metap.method = c("stouffer", "fisher", "maxp")[1], pv1 = NULL, pv2 = NULL, sort.by = "p.dual" )
X |
Expression matrix with genes on rows and samples ont columns |
y |
Binary vector (0/1) indicating group membership |
G |
Sparse matrix of gene sets. Non-zero entry indicates gene/feature is part of gene sets. Features on rows, gene sets on columns. |
gmt |
List of gene sets in GMT format |
gsetX |
Optional pre-computed matrix of gene set enrichment scores with
gene sets on rows and samples on columns. If NULL (default), scores will be
computed using the method specified by |
fc.method |
Method for fold change testing ("fgsea", "ztest", "ttest", "rankcor", "cor") |
ss.method |
Method for single-sample enrichment ("plaid", "replaid.ssgsea", "replaid.gsva", "ssgsea", "gsva") |
metap.method |
Method for combining p-values ("stouffer", "fisher" or "maxp"). Default "stouffer". |
pv1 |
Pre-computed p-values from fold change test. If NULL, will be computed based on fc.test. |
pv2 |
Pre-computed p-values from single sample test. If NULL, will be computed using gset_ttest. |
sort.by |
Column name to sort results by ("p.dual", "gsetFC", "p.fc", "p.ss"). Default "p.dual". |
Data frame with columns: gsetFC (gene set fold change), size (gene set size), p.fc (p-value from fold change test), p.ss (p-value from single sample test), p.dual (combined p-value), and q.dual (FDR-adjusted combined p-value).
# Create example expression matrix set.seed(123) X <- matrix(rnorm(1000), nrow = 100, ncol = 20) rownames(X) <- paste0("GENE", 1:100) colnames(X) <- paste0("Sample", 1:20) # Create binary group vector y <- rep(c(0, 1), each = 10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) # Perform dualGSEA with correlation test (fast method) results_cor <- dualGSEA(X, y, G = NULL, gmt = gmt, fc.method = "cor", ss.method = "replaid.gsva") print(head(results_cor)) # Perform dualGSEA with fgsea (requires fgsea package) if (requireNamespace("fgsea", quietly = TRUE)) { results <- dualGSEA(X, y, G = NULL, gmt = gmt, fc.method = "fgsea", ss.method = "replaid.ssgsea") print(head(results)) }# Create example expression matrix set.seed(123) X <- matrix(rnorm(1000), nrow = 100, ncol = 20) rownames(X) <- paste0("GENE", 1:100) colnames(X) <- paste0("Sample", 1:20) # Create binary group vector y <- rep(c(0, 1), each = 10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) # Perform dualGSEA with correlation test (fast method) results_cor <- dualGSEA(X, y, G = NULL, gmt = gmt, fc.method = "cor", ss.method = "replaid.gsva") print(head(results_cor)) # Perform dualGSEA with fgsea (requires fgsea package) if (requireNamespace("fgsea", quietly = TRUE)) { results <- dualGSEA(X, y, G = NULL, gmt = gmt, fc.method = "fgsea", ss.method = "replaid.ssgsea") print(head(results)) }
This function performs statistical testing for differential enrichment using plaid
fc_ttest(fc, G, sort.by = "pvalue")fc_ttest(fc, G, sort.by = "pvalue")
fc |
Vector of logFC values |
G |
Sparse matrix of gene sets. Non-zero entry indicates gene/feature is part of gene sets. Features on rows, gene sets on columns. |
sort.by |
Column name to sort results by ("pvalue", "gsetFC", or "none") |
Data frame with columns: gsetFC (gene set fold change), pvalue (p-value from one-sample t-test), and qvalue (FDR-adjusted p-value).
This function performs statistical testing for differential enrichment using plaid
fc_ztest(fc, G, zmat = FALSE, alpha = 0.5)fc_ztest(fc, G, zmat = FALSE, alpha = 0.5)
fc |
Vector of logFC values |
G |
Sparse matrix of gene sets. Non-zero entry indicates gene/feature is part of gene sets. Features on rows, gene sets on columns. |
zmat |
Logical indicating to return z-matrix |
alpha |
Scalar weight for SD estimation. Default 0.5. |
List with element: z_statistic (z-statistic from one-sample z-test), p_value (p-value from one-sample z-test), and zmat (z-matrix).
Convert a GMT file (Gene Matrix Transposed) to a binary matrix, where rows represent genes and columns represent gene sets. The binary matrix indicates presence or absence of genes in a gene set.
gmt2mat( gmt, max.genes = -1, ntop = -1, sparse = TRUE, bg = NULL, use.multicore = TRUE )gmt2mat( gmt, max.genes = -1, ntop = -1, sparse = TRUE, bg = NULL, use.multicore = TRUE )
gmt |
List representing the GMT file: each element is a character vector representing a gene set. |
max.genes |
Max number of genes to include in the binary matrix. Default = -1 to include all genes. |
ntop |
Number of top genes to consider for each gene set. Default = -1 to include all genes. |
sparse |
Logical: create a sparse matrix. Default |
bg |
Character vector of background gene set. Default |
use.multicore |
Logical: use parallel processing ('parallel' R package). Default |
A binary matrix representing the presence or absence of genes in each gene set. Rows correspond to genes, and columns correspond to gene sets.
# Create example GMT data gmt <- list( "Pathway1" = c("GENE1", "GENE2", "GENE3"), "Pathway2" = c("GENE2", "GENE4", "GENE5"), "Pathway3" = c("GENE1", "GENE5", "GENE6") ) # Convert to binary matrix mat <- gmt2mat(gmt) print(mat) # Create dense matrix instead of sparse mat_dense <- gmt2mat(gmt, sparse = FALSE) print(mat_dense)# Create example GMT data gmt <- list( "Pathway1" = c("GENE1", "GENE2", "GENE3"), "Pathway2" = c("GENE2", "GENE4", "GENE5"), "Pathway3" = c("GENE1", "GENE5", "GENE6") ) # Convert to binary matrix mat <- gmt2mat(gmt) print(mat) # Create dense matrix instead of sparse mat_dense <- gmt2mat(gmt, sparse = FALSE) print(mat_dense)
Compute geneset expression as the average log-ration of genes in the geneset. Requires log-expression matrix X and (sparse) geneset matrix matG.
gset_averageCLR(X, matG, center = TRUE)gset_averageCLR(X, matG, center = TRUE)
X |
Log-expression matrix with genes on rows and samples on columns |
matG |
Sparse gene set matrix with genes on rows and gene sets on columns |
center |
Logical indicating whether to center the results |
Matrix of gene set expression scores with gene sets on rows and samples on columns.
Perform t-test on gene set scores
gset_ttest(gsetX, y)gset_ttest(gsetX, y)
gsetX |
Matrix of gene set scores with gene sets on rows and samples on columns |
y |
Binary vector (0/1) indicating group membership |
Data frame with columns: diff (difference in means), statistic (t-statistic), pvalue (p-value), and other t-test results.
Compute rank correlation between a gene rank vector/matrix and gene sets
gset.rankcor(rnk, gset, compute.p = FALSE, use.rank = TRUE)gset.rankcor(rnk, gset, compute.p = FALSE, use.rank = TRUE)
rnk |
Numeric vector or matrix of gene ranks, with genes as row names |
gset |
Numeric matrix of gene sets, with genes as row/column names |
compute.p |
Logical indicating whether to compute p-values |
use.rank |
Logical indicating whether to rank transform rnk before correlation |
This function calculates sparse rank correlation between rnk and each
column of gset using qlcMatrix::corSparse(). It handles missing values in
rnk by computing column-wise correlations.
P-values are computed from statistical distribution
Named list with components:
rho - Matrix of correlation coefficients between rnk and gset
p.value - Matrix of p-values for correlation (if compute.p = TRUE)
q.value - Matrix of FDR adjusted p-values (if compute.p = TRUE)
# Create example rank vector set.seed(123) ranks <- rnorm(100) names(ranks) <- paste0("GENE", 1:100) # Create example gene sets as sparse matrix gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) genesets <- gmt2mat(gmt) # Calculate rank correlation result <- gset.rankcor(ranks, genesets, compute.p = TRUE) print(result$rho) print(result$p.value)# Create example rank vector set.seed(123) ranks <- rnorm(100) names(ranks) <- paste0("GENE", 1:100) # Create example gene sets as sparse matrix gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) genesets <- gmt2mat(gmt) # Calculate rank correlation result <- gset.rankcor(ranks, genesets, compute.p = TRUE) print(result$rho) print(result$p.value)
Calculate row standard deviations for matrix
mat.rowsds(X)mat.rowsds(X)
X |
Input matrix (can be sparse or dense) |
Vector of row standard deviations.
Convert binary matrix to a GMT (Gene Matrix Transposed) list. The binary matrix indicates presence or absence of genes in each gene set. Rows represent genes and columns represent gene sets.
mat2gmt(mat)mat2gmt(mat)
mat |
Matrix with non-zero entries representing genes in each gene set. Rows represent genes and columns represent gene sets. |
A list of vector representing each gene set. Each list element correspond to a gene set and is a vector of genes
# Create example binary matrix mat <- matrix(0, nrow = 6, ncol = 3) rownames(mat) <- paste0("GENE", 1:6) colnames(mat) <- paste0("Pathway", 1:3) mat[1:3, 1] <- 1 # Pathway1: GENE1, GENE2, GENE3 mat[c(2,4,5), 2] <- 1 # Pathway2: GENE2, GENE4, GENE5 mat[c(1,5,6), 3] <- 1 # Pathway3: GENE1, GENE5, GENE6 # Convert to GMT list gmt <- mat2gmt(mat) print(gmt)# Create example binary matrix mat <- matrix(0, nrow = 6, ncol = 3) rownames(mat) <- paste0("GENE", 1:6) colnames(mat) <- paste0("Pathway", 1:3) mat[1:3, 1] <- 1 # Pathway1: GENE1, GENE2, GENE3 mat[c(2,4,5), 2] <- 1 # Pathway2: GENE2, GENE4, GENE5 mat[c(1,5,6), 3] <- 1 # Pathway3: GENE1, GENE5, GENE6 # Convert to GMT list gmt <- mat2gmt(mat) print(gmt)
Matrix version for combining p-values using fisher or stouffer method. Much faster than doing metap::sumlog() and metap::sumz()
matrix_metap(plist, method = "stouffer")matrix_metap(plist, method = "stouffer")
plist |
List of p-value vectors or matrix of p-values |
method |
Method for combining p-values ("fisher"/"sumlog" or "stouffer"/"sumz") |
Vector of combined p-values.
Perform one-sample t-test on matrix with gene sets
matrix_onesample_ttest(Fm, G)matrix_onesample_ttest(Fm, G)
Fm |
Vector of feature values (e.g., fold changes) |
G |
Sparse matrix of gene sets with genes on rows and gene sets on columns |
List containing mean, t-statistic, and p-value matrices.
This function normalizes the column medians of matrix x. It calls optimized functions from the matrixStats package.
normalize_medians(x, ignore.zero = NULL)normalize_medians(x, ignore.zero = NULL)
x |
Input matrix |
ignore.zero |
Logical indicating whether to ignore zeros to exclude for median calculation |
Matrix with normalized column medians.
# Create example matrix set.seed(123) x <- matrix(rnorm(100), nrow = 10, ncol = 10) x[1:3, 1:3] <- 0 # Add some zeros # Normalize medians x_norm <- normalize_medians(x) head(x_norm)# Create example matrix set.seed(123) x <- matrix(rnorm(100), nrow = 10, ncol = 10) x[1:3, 1:3] <- 0 # Add some zeros # Normalize medians x_norm <- normalize_medians(x) head(x_norm)
Compute single-sample geneset expression as the
average log-expression f genes in the geneset. Requires log-expression
matrix X and (sparse) geneset matrix matG. If you have gene sets
as a gmt list, please convert it first using the function gmt2mat().
plaid( X, matG, stats = c("mean", "sum"), chunk = NULL, normalize = TRUE, nsmooth = 3, assay = "logcounts", min.genes = 5, max.genes = 500 )plaid( X, matG, stats = c("mean", "sum"), chunk = NULL, normalize = TRUE, nsmooth = 3, assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Log-transformed expr. matrix. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists (named list of gene vectors). |
stats |
Score computation stats: mean or sum of intensity. Default 'mean'. |
chunk |
Logical: use chunks for large matrices. Default 'NULL' for autodetect. |
normalize |
Logical: median normalize results or not. Default 'TRUE'. |
nsmooth |
Smoothing parameter for more stable average when stats="mean". Default 3. |
assay |
Character: assay name to extract from SummarizedExperiment/SingleCellExperiment. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set (for BiocSet/GMT input). Default 5. |
max.genes |
Integer: maximum genes per gene set (for BiocSet/GMT input). Default 500. |
PLAID needs the gene sets as sparse matrix. If you have your collection of gene sets a a list, we need first to convert the gmt list to matrix format.
We recommend to run PLAID on the log transformed expression matrix, not on the counts, as the average in the logarithmic space is more robust and is in concordance to calculating the geometric mean.
It is not necessary to normalize your expression matrix before running PLAID because PLAID performs median normalization of the enrichment scores afterwards.
It is recommended to use sparse matrix as PLAID relies on sparse matrix computations. But, PLAID is also fast for dense matrices.
PLAID can also be run on the ranked matrix. This corresponds to the singscore (Fouratan et al., 2018). PLAID can also be run on the (non-logarithmic) counts which can be used to calculate the scSE score (Pont et al., 2019).
PLAID is fast and memery efficient because it uses efficient sparse matrix computation. When input matrix is very large, PLAID performs 'chunked' computation by splitting the matrix in chunks.
Although X and matG are generally sparse, the result
matrix gsetX generally is dense and can thus be very large.
Example: computing gene set scores for 10K gene sets on 1M cells
will create a 10K x 1M dense matrix which requires ~75GB memory.
PLAID now automatically detects and handles Bioconductor objects. If X is a SummarizedExperiment or SingleCellExperiment, it will extract the appropriate assay. If matG is a BiocSet object or GMT list, it will be converted to sparse matrix format automatically.
Matrix of single-sample enrichment scores. Gene sets on rows, samples on columns.
library(plaid) # Create example expression matrix set.seed(123) X <- matrix(rnorm(1000), nrow = 100, ncol = 10) rownames(X) <- paste0("GENE", 1:100) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) matG <- gmt2mat(gmt) # Compute PLAID scores gsetX <- plaid(X, matG) print(dim(gsetX)) print(gsetX[1:3, 1:5]) # Use sum statistics instead of mean gsetX_sum <- plaid(X, matG, stats = "sum") # Using real data (if available in package) extdata_path <- system.file("extdata", "pbmc3k-50cells.rda", package = "plaid") if (file.exists(extdata_path)) { load(extdata_path) hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid") gmt <- read.gmt(hallmarks) matG <- gmt2mat(gmt) gsetX <- plaid(X, matG) }library(plaid) # Create example expression matrix set.seed(123) X <- matrix(rnorm(1000), nrow = 100, ncol = 10) rownames(X) <- paste0("GENE", 1:100) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:20), "Pathway2" = paste0("GENE", 15:35), "Pathway3" = paste0("GENE", 30:50) ) matG <- gmt2mat(gmt) # Compute PLAID scores gsetX <- plaid(X, matG) print(dim(gsetX)) print(gsetX[1:3, 1:5]) # Use sum statistics instead of mean gsetX_sum <- plaid(X, matG, stats = "sum") # Using real data (if available in package) extdata_path <- system.file("extdata", "pbmc3k-50cells.rda", package = "plaid") if (file.exists(extdata_path)) { load(extdata_path) hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid") gmt <- read.gmt(hallmarks) matG <- gmt2mat(gmt) gsetX <- plaid(X, matG) }
Read data from a GMT file (Gene Matrix Transposed). The GMT format is commonly used to store gene sets or gene annotations.
read.gmt(gmt.file, dir = NULL, add.source = FALSE, nrows = -1)read.gmt(gmt.file, dir = NULL, add.source = FALSE, nrows = -1)
gmt.file |
Path to GMT file. |
dir |
(Optional) The directory where the GMT file is located. |
add.source |
(optional) Include the source information in the gene sets' names. |
nrows |
(optional) Number of rows to read from the GMT file. |
A list of gene sets: each gene set is represented as a character vector of gene names.
# Read GMT file (requires file to exist) gmt_file <- system.file("extdata", "hallmarks.gmt", package = "plaid") if (file.exists(gmt_file)) { gmt <- read.gmt(gmt_file) print(names(gmt)) print(head(gmt[[1]])) # Read with source information gmt_with_source <- read.gmt(gmt_file, add.source = TRUE) print(head(names(gmt_with_source))) }# Read GMT file (requires file to exist) gmt_file <- system.file("extdata", "hallmarks.gmt", package = "plaid") if (file.exists(gmt_file)) { gmt <- read.gmt(gmt_file) print(names(gmt)) print(head(gmt[[1]])) # Read with source information gmt_with_source <- read.gmt(gmt_file, add.source = TRUE) print(head(names(gmt_with_source))) }
Calculates single-sample enrichment AUCell (Aibar et al., 2017) using plaid back-end. The computation is 10-100x faster than the original code.
replaid.aucell( X, matG, aucMaxRank = NULL, assay = "logcounts", min.genes = 5, max.genes = 500 )replaid.aucell( X, matG, aucMaxRank = NULL, assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
aucMaxRank |
Rank threshold (see AUCell paper). Default aucMaxRank = 0.05*nrow(X). |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing the AUCell score requires to compute the ranks of the expression matrix and approximating the AUC of a gene set. We have wrapped this in a single convenience function.
We have extensively compared the results of replaid.aucell and
from the original AUCell R package and we showed good concordance
of results in the score, logFC and p-values.
Matrix of single-sample AUCell enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute AUCell scores scores <- replaid.aucell(X, matG) print(scores[1:2, 1:5])# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute AUCell scores scores <- replaid.aucell(X, matG) print(scores[1:2, 1:5])
Calculates single-sample enrichment GSVA (Hänzelmann et al., 2013) using plaid back-end. The computation is 10-100x faster than the original code.
replaid.gsva( X, matG, tau = 0, rowtf = c("z", "ecdf")[1], assay = "logcounts", min.genes = 5, max.genes = 500 )replaid.gsva( X, matG, tau = 0, rowtf = c("z", "ecdf")[1], assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
tau |
Rank weight parameter (see GSVA publication). Default tau=0. |
rowtf |
Row transformation method ("z" or "ecdf"). Default "z". |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing the GSVA score requires to compute the CDF of the expression matrix, ranking and scoring the genesets. We have wrapped this in a single convenience function.
We have extensively compared the results of replaid.gsva and
from the original GSVA R package and we showed good concordance
of results in the score, logFC and p-values.
In the original formulation, GSVA uses an emperical CDF to transform expression of each feature to a (0;1) relative expression value. For efficiency reasons, this is here approximated by a z-transform (center+scale) of each row.
Matrix of single-sample GSVA enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute GSVA scores scores <- replaid.gsva(X, matG) print(scores[1:2, 1:5])# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute GSVA scores scores <- replaid.gsva(X, matG) print(scores[1:2, 1:5])
Calculates Single-Cell Signature Explorer (Pont et al., 2019) scores using plaid back-end. The computation is 10-100x faster than the original code.
replaid.scse( X, matG, removeLog2 = NULL, scoreMean = FALSE, assay = "logcounts", min.genes = 5, max.genes = 500 )replaid.scse( X, matG, removeLog2 = NULL, scoreMean = FALSE, assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
removeLog2 |
Logical for whether to remove the Log2, i.e. will apply power transform (base2) on input (default TRUE). |
scoreMean |
Logical for whether computing sum or mean as score (default FALSE). |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing the scSE requires running plaid on the linear (not logarithmic) score and perform additional normalization by the total UMI per sample. We have wrapped this in a single convenience function:
To replicate the original "sum-of-UMI" scSE score, set removeLog2=TRUE
and scoreMean=FALSE. scSE and plaid scores become more similar for
removeLog2=FALSE and scoreMean=TRUE.
We have extensively compared the results from replaid.scse and
from the original scSE (implemented in GO lang) and we showed
almost identical results in the score, logFC and p-values.
Matrix of single-sample scSE enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix (log-transformed) set.seed(123) X <- log2(matrix(rpois(500, lambda = 10) + 1, nrow = 50, ncol = 10)) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute scSE scores (original method) scores <- replaid.scse(X, matG, removeLog2 = TRUE, scoreMean = FALSE) print(scores[1:2, 1:5]) # Compute scSE scores (mean method) scores_mean <- replaid.scse(X, matG, removeLog2 = TRUE, scoreMean = TRUE) print(scores_mean[1:2, 1:5])# Create example expression matrix (log-transformed) set.seed(123) X <- log2(matrix(rpois(500, lambda = 10) + 1, nrow = 50, ncol = 10)) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute scSE scores (original method) scores <- replaid.scse(X, matG, removeLog2 = TRUE, scoreMean = FALSE) print(scores[1:2, 1:5]) # Compute scSE scores (mean method) scores_mean <- replaid.scse(X, matG, removeLog2 = TRUE, scoreMean = TRUE) print(scores_mean[1:2, 1:5])
Calculates single-sample enrichment singscore (Fouratan et al., 2018) using plaid back-end. The computation is 10-100x faster than the original code.
replaid.sing(X, matG, assay = "logcounts", min.genes = 5, max.genes = 500)replaid.sing(X, matG, assay = "logcounts", min.genes = 5, max.genes = 500)
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing the singscore requires to compute the ranks of the expression matrix. We have wrapped this in a single convenience function.
We have extensively compared the results of replaid.sing and from
the original singscore R package and we showed identical result
in the score, logFC and p-values.
Matrix of single-sample singscore enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute singscore scores <- replaid.sing(X, matG) print(scores[1:2, 1:5])# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute singscore scores <- replaid.sing(X, matG) print(scores[1:2, 1:5])
Calculates single-sample enrichment singscore (Barbie et al., 2009; Hänzelmann et al., 2013) using plaid back-end. The computation is 10-100x faster than the original code.
replaid.ssgsea( X, matG, alpha = 0, assay = "logcounts", min.genes = 5, max.genes = 500 )replaid.ssgsea( X, matG, alpha = 0, assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
alpha |
Weighting factor for exponential weighting of ranks |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing ssGSEA score requires to compute the ranks of the expression matrix and weighting of the ranks. We have wrapped this in a single convenience function.
We have extensively compared the results of replaid.ssgsea and
from the original GSVA R package and we showed highly similar
results in the score, logFC and p-values. For alpha=0 we obtain
exact results, for alpha>0 the results are highly similar but not
exactly the same.
Matrix of single-sample ssGSEA enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute ssGSEA scores (alpha = 0) scores <- replaid.ssgsea(X, matG, alpha = 0) print(scores[1:2, 1:5]) # Compute ssGSEA scores with weighting (alpha = 0.25) scores_weighted <- replaid.ssgsea(X, matG, alpha = 0.25) print(scores_weighted[1:2, 1:5])# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute ssGSEA scores (alpha = 0) scores <- replaid.ssgsea(X, matG, alpha = 0) print(scores[1:2, 1:5]) # Compute ssGSEA scores with weighting (alpha = 0.25) scores_weighted <- replaid.ssgsea(X, matG, alpha = 0.25) print(scores_weighted[1:2, 1:5])
Calculates single-sample enrichment UCell (Andreatta et al., 2021) using plaid back-end. The computation is 10-100x faster than the original code.
replaid.ucell( X, matG, rmax = 1500, assay = "logcounts", min.genes = 5, max.genes = 500 )replaid.ucell( X, matG, rmax = 1500, assay = "logcounts", min.genes = 5, max.genes = 500 )
X |
Gene or protein expression matrix. Generally log transformed. See details. Genes on rows, samples on columns. Also accepts SummarizedExperiment or SingleCellExperiment objects. |
matG |
Gene sets sparse matrix. Genes on rows, gene sets on columns. Also accepts BiocSet objects or GMT lists. |
rmax |
Rank threshold (see Ucell paper). Default rmax = 1500. |
assay |
Character: assay name for Bioconductor objects. Default "logcounts". |
min.genes |
Integer: minimum genes per gene set. Default 5. |
max.genes |
Integer: maximum genes per gene set. Default 500. |
Computing ssGSEA score requires to compute the ranks of the expression matrix and truncation of the ranks. We have wrapped this in a single convenience function.
We have extensively compared the results of replaid.ucell and
from the original UCell R package and we showed near exacct
results in the score, logFC and p-values.
Matrix of single-sample UCell enrichment scores. Gene sets on rows, samples on columns.
# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute UCell scores (default rmax = 1500) scores <- replaid.ucell(X, matG) print(scores[1:2, 1:5]) # Compute UCell scores with custom rmax scores_custom <- replaid.ucell(X, matG, rmax = 1000) print(scores_custom[1:2, 1:5])# Create example expression matrix set.seed(123) X <- matrix(rnorm(500), nrow = 50, ncol = 10) rownames(X) <- paste0("GENE", 1:50) colnames(X) <- paste0("Sample", 1:10) # Create example gene sets gmt <- list( "Pathway1" = paste0("GENE", 1:15), "Pathway2" = paste0("GENE", 10:25) ) matG <- gmt2mat(gmt) # Compute UCell scores (default rmax = 1500) scores <- replaid.ucell(X, matG) print(scores[1:2, 1:5]) # Compute UCell scores with custom rmax scores_custom <- replaid.ucell(X, matG, rmax = 1000) print(scores_custom[1:2, 1:5])
Compute columm ranks for sparse matrix. Internally used by colranks()
sparse_colranks(X, signed = FALSE, ties.method = "average")sparse_colranks(X, signed = FALSE, ties.method = "average")
X |
Input matrix |
signed |
Logical: use or not signed ranks |
ties.method |
Character Choice of ties.method |
Sparse matrix of columnwise ranks with same dimensions as input.
Write gene sets to GMT file (Gene Matrix Transposed). The GMT format is commonly used to store gene sets or gene annotations.
write.gmt(gmt, file, source = NA)write.gmt(gmt, file, source = NA)
gmt |
A list of gene sets in GMT format: each gene set is represented as a vector of gene names. |
file |
The file path to write the GMT file. |
source |
A character vector specifying the source of each gene set. If not provided, the names of the gene sets are used as the source. |
Does not return anything.
# Create example GMT data gmt <- list( "Pathway1" = c("GENE1", "GENE2", "GENE3"), "Pathway2" = c("GENE2", "GENE4", "GENE5"), "Pathway3" = c("GENE1", "GENE5", "GENE6") ) # Write to GMT file (creates file in temp directory) temp_file <- tempfile(fileext = ".gmt") write.gmt(gmt, temp_file) # Write with custom source information temp_file2 <- tempfile(fileext = ".gmt") write.gmt(gmt, temp_file2, source = c("DB1", "DB2", "DB3")) # Clean up unlink(c(temp_file, temp_file2))# Create example GMT data gmt <- list( "Pathway1" = c("GENE1", "GENE2", "GENE3"), "Pathway2" = c("GENE2", "GENE4", "GENE5"), "Pathway3" = c("GENE1", "GENE5", "GENE6") ) # Write to GMT file (creates file in temp directory) temp_file <- tempfile(fileext = ".gmt") write.gmt(gmt, temp_file) # Write with custom source information temp_file2 <- tempfile(fileext = ".gmt") write.gmt(gmt, temp_file2, source = c("DB1", "DB2", "DB3")) # Clean up unlink(c(temp_file, temp_file2))