Title: | Tools for Molecular Signatures |
---|---|
Description: | The TMSig package contains tools to prepare, analyze, and visualize named lists of sets, with an emphasis on molecular signatures (such as gene or kinase sets). It includes fast, memory efficient functions to construct sparse incidence and similarity matrices and filter, cluster, invert, and decompose sets. Additionally, bubble heatmaps can be created to visualize the results of any differential or molecular signatures analysis. |
Authors: | Tyler Sagendorf [aut, cre] , Di Wu [ctb], Gordon Smyth [ctb] |
Maintainer: | Tyler Sagendorf <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.0 |
Built: | 2024-10-31 06:21:31 UTC |
Source: | https://github.com/bioc/TMSig |
Pre-ranked Correlation-Adjusted MEan RAnk gene set testing
(CAMERA-PR) tests whether a set of genes is highly ranked relative to other
genes in terms of some measure of differential expression, accounting for
inter-gene correlation (Wu & Smyth, 2012). See
cameraPR
for details.
While the language is gene-centric, any a priori groups of molecules may be tested.
## S3 method for class 'matrix' cameraPR( statistic, index, use.ranks = FALSE, inter.gene.cor = 0.01, sort = TRUE, alternative = c("two.sided", "less", "greater"), adjust.globally = FALSE, min.size = 2L, ... )
## S3 method for class 'matrix' cameraPR( statistic, index, use.ranks = FALSE, inter.gene.cor = 0.01, sort = TRUE, alternative = c("two.sided", "less", "greater"), adjust.globally = FALSE, min.size = 2L, ... )
statistic |
a matrix of statistics (moderated z-statistics preferred) with genes/molecules as row names and one or more contrasts or coefficients as column names. Missing values are allowed. |
index |
a named list of sets to test. Passed to
|
use.ranks |
logical; whether to perform a parametric test ( |
inter.gene.cor |
numeric; the inter-gene correlation within tested sets.
May be a single value or a named vector with names matching those of
|
sort |
logical; should the results of each contrast be sorted by
p-value? Default is |
alternative |
character; the alternative hypothesis. Must be one of
" |
adjust.globally |
logical; whether p-values from different contrasts
should be adjusted together. It is recommended to set this to |
min.size |
integer; the minimum set size. To be considered for testing,
sets must have at least |
... |
other arguments are not currently used |
A data.frame
with the following columns:
Contrast |
factor; the contrast of interest. |
GeneSet |
character; the gene set being tested. |
NGenes |
integer; number of genes in the set with values in the
|
Correlation |
numeric; inter-gene correlation (only included if
|
Direction |
character; direction of change ("Up" or "Down"). |
TwoSampleT |
numeric; two-sample t-statistic (only included if
|
df |
integer; degrees of freedom (only included if
|
ZScore |
numeric; the z-score equivalent of |
PValue |
numeric; one- or two-sided (if
|
FDR |
numeric; Benjamini and Hochberg FDR adjusted p-value. |
If use.ranks=FALSE
, the parametric version of CAMERA-PR will be
used. Since this is a modification of Student's two sample t-test, it is
assumed that the statistics in each column of statistic
are
approximately Normally distributed. In camera.default
,
the moderated t-statistics are converted to z-statistics with
zscoreT
and used for the analysis.
If use.ranks=TRUE
, a modified Wilcoxon-Mann-Whitney rank sum test
will be used.
CAMERA-PR was developed by Di Wu and Gordon Smyth (2012). With
permission, Tyler Sagendorf modified their original code to create the
matrix method. If using cameraPR.matrix
, please cite the original
paper, as well as the TMSig R package.
Wu, D., and Smyth, G. K. (2012). Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Research 40, e133. doi:10.1093/nar/gks461.
Goeman, J. J., and Bühlmann, P. (2007). Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23, 980-987. doi:10.1093/bioinformatics/btm051.
cameraPR
,
rankSumTestWithCorrelation
require(stats) # Simulate experimental data with control and treatment groups (3 samples # each) group <- rep(c("control", "treatment"), each = 3) design <- model.matrix(~ 0 + group) contrasts <- makeContrasts(contrasts = "grouptreatment - groupcontrol", levels = colnames(design)) ngenes <- 1000L nsamples <- length(group) set.seed(0) y <- matrix(data = rnorm(ngenes * nsamples), nrow = ngenes, ncol = nsamples, dimnames = list(paste0("gene", seq_len(ngenes)), make.unique(group))) # First set of 20 genes are genuinely differentially expressed # (trt1 and trt2 are lower than control) index1 <- 1:20 y[index1, 1:3] <- y[index1, 1:3] + 1 # Second set of 20 genes are not DE index2 <- 21:40 # Generate matrix of moderated t-statistics fit <- lmFit(y, design) fit.contr <- contrasts.fit(fit, contrasts = contrasts) fit.smooth <- eBayes(fit.contr) index <- list(set1 = rownames(y)[index1], set2 = rownames(y)[index2]) # Compute z-score equivalents of moderated t-statistics statistic <- zscoreT(fit.smooth$t, fit.smooth$df.total) head(statistic) # Only set1 is DE cameraPR(statistic = statistic, index = index) # Non-parametric version cameraPR(statistic = statistic, index = index, use.ranks = TRUE)
require(stats) # Simulate experimental data with control and treatment groups (3 samples # each) group <- rep(c("control", "treatment"), each = 3) design <- model.matrix(~ 0 + group) contrasts <- makeContrasts(contrasts = "grouptreatment - groupcontrol", levels = colnames(design)) ngenes <- 1000L nsamples <- length(group) set.seed(0) y <- matrix(data = rnorm(ngenes * nsamples), nrow = ngenes, ncol = nsamples, dimnames = list(paste0("gene", seq_len(ngenes)), make.unique(group))) # First set of 20 genes are genuinely differentially expressed # (trt1 and trt2 are lower than control) index1 <- 1:20 y[index1, 1:3] <- y[index1, 1:3] + 1 # Second set of 20 genes are not DE index2 <- 21:40 # Generate matrix of moderated t-statistics fit <- lmFit(y, design) fit.contr <- contrasts.fit(fit, contrasts = contrasts) fit.smooth <- eBayes(fit.contr) index <- list(set1 = rownames(y)[index1], set2 = rownames(y)[index2]) # Compute z-score equivalents of moderated t-statistics statistic <- zscoreT(fit.smooth$t, fit.smooth$df.total) head(statistic) # Only set1 is DE cameraPR(statistic = statistic, index = index) # Non-parametric version cameraPR(statistic = statistic, index = index, use.ranks = TRUE)
Determine clusters of highly similar sets. Used to reduce the redundancy of sets prior to statistical analysis.
clusterSets( x, type = c("jaccard", "overlap", "otsuka"), cutoff = 0.85, method = "complete", h = 0.9 )
clusterSets( x, type = c("jaccard", "overlap", "otsuka"), cutoff = 0.85, method = "complete", h = 0.9 )
x |
a named list of sets. Elements must be of type |
type |
character; the type of similarity measure to use. Either
|
cutoff |
numeric 0-1; minimum similarity coefficient required to classify two sets as being similar. Default is 0.85. |
method |
character; the clustering method passed to
|
h |
numeric 0-1; cut height used to define clusters. Passed to
|
A data.frame
with 3 columns:
set |
character; the name of the set. |
cluster |
integer; the cluster identifier. |
set_size |
integer; the size of the set (number of elements). |
Results are arranged in ascending order by cluster, descending order by set size, and then alphanumerically by set name.
Given a named list of sets, clusterSets
calculates all pairwise
Jaccard, overlap, or Ōtsuka similarity coefficients (see
similarity
for details). Any coefficients below cutoff
are set to 0 and complete-linkage hierarchical clustering is performed on
the dissimilarity matrix (calculated as 1 - coefficients). Lastly,
cutree
is used with cut height h
to define
clusters, and the results are stored in a data.frame
.
Clustering does not need to be performed on those sets that are not
sufficiently similar (value of similarity
below cutoff
) to
any other set, as they will always be placed in their own cluster. By
excluding these sets during the hierarchical clustering step, the speed of
clusterSets
will increase as the value of cutoff
approaches
1 (as the size of the dissimilarity matrix decreases).
Sets that are not sufficiently large will always appear as singleton
clusters, unless they are aliased or subsets (overlap similarity only). For
two sets and
to be sufficiently similar, defined as having
a similarity coefficient at least equal to some cutoff (e.g.,
), they must have minimum sizes
,
,
and intersection size
:
Jaccard:
Overlap:
Ōtsuka:
where is the ceiling function applied to some real
number
.
For example, if the cutoff is , then the minimum set and
intersection sizes are
Jaccard:
Overlap:
Ōtsuka:
That is, sets with fewer elements or smaller intersections will always appear as singleton clusters unless they are aliased or, in the case of the overlap similarity, subsets.
This function is based on the procedure described in the Molecular Signatures Database (MSigDB) v7.0 Release Notes (Liberzon 2011, 2015): https://docs.gsea-msigdb.org/#MSigDB/Release_Notes/MSigDB_7.0/. Specifically, sections "C2:CP:Reactome — Major Overhaul" and "C5 (Gene Ontology Collection) — Major Overhaul". Though hierarchical clustering is widely used, the defaults are exactly what is set for MSigDB (private correspondence).
Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. doi:10.1093/bioinformatics/btr260
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems, 1(6), 417–425. doi:10.1016/j.cels.2015.12.004
x <- list("A" = letters[1:5], "B" = letters[1:4], # subset of A "C" = letters[1:4], # aliased with B "D" = letters[1:3], # subset of A, B, C "E" = c("a", "a", NA), # duplicates and NA "F" = c("x", "y", "z"), # distinct elements "G" = letters[3:6]) # overlaps with A-E # Default clustering based on Jaccard similarity clusterSets(x) clusterSets(x, type = "overlap") # overlap similarity clusterSets(x, type = "otsuka") # Ōtsuka similarity # Relax Jaccard similarity cutoff (df <- clusterSets(x, cutoff = 0.5)) # Keep the first (largest) set from each cluster with(df, set[!duplicated(cluster)]) # A, G, E, F # Keep the smallest set from each cluster df <- df[order(df$set_size), ] with(df, set[!duplicated(cluster)]) # E, D, F, G # Cluster aliased sets (type = "otsuka" would produce # identical results) clusterSets(x, type = "jaccard", cutoff = 1) # Cluster subsets and aliased sets clusterSets(x, type = "overlap", cutoff = 1)
x <- list("A" = letters[1:5], "B" = letters[1:4], # subset of A "C" = letters[1:4], # aliased with B "D" = letters[1:3], # subset of A, B, C "E" = c("a", "a", NA), # duplicates and NA "F" = c("x", "y", "z"), # distinct elements "G" = letters[3:6]) # overlaps with A-E # Default clustering based on Jaccard similarity clusterSets(x) clusterSets(x, type = "overlap") # overlap similarity clusterSets(x, type = "otsuka") # Ōtsuka similarity # Relax Jaccard similarity cutoff (df <- clusterSets(x, cutoff = 0.5)) # Keep the first (largest) set from each cluster with(df, set[!duplicated(cluster)]) # A, G, E, F # Keep the smallest set from each cluster df <- df[order(df$set_size), ] with(df, set[!duplicated(cluster)]) # E, D, F, G # Cluster aliased sets (type = "otsuka" would produce # identical results) clusterSets(x, type = "jaccard", cutoff = 1) # Cluster subsets and aliased sets clusterSets(x, type = "overlap", cutoff = 1)
Decompose all pairs of sufficiently overlapping sets into 3
disjoint parts: the elements unique to the first set, the elements unique
to the second set, and the elements found in both sets. See the examples
section in invertSets
for a method to decompose an entire
list of sets.
decomposeSets( x, overlap = 1L, AND = "~AND~", MINUS = "~MINUS~", verbose = TRUE )
decomposeSets( x, overlap = 1L, AND = "~AND~", MINUS = "~MINUS~", verbose = TRUE )
x |
a named list of sets. Elements must be of type |
overlap |
integer; only pairs of sets with at least |
AND |
character; string used to denote the intersection of two sets. Defaut is "~AND~", which produces intersections of the form "A ~AND~ B" (i.e., elements in both A and B). |
MINUS |
character; string used to denote the difference of two sets. Defualt is "~MINUS~", which produces differences of the form "A ~MINUS~ B" (i.e., elements in A and not in B). |
verbose |
logical; whether to print warnings and messages. |
A named list of disjoint parts of sets. May contain aliases.
Since the size of the intersection between two sets is at most the size of
the smaller set, any sets with fewer than overlap
elements can be
immediately discarded.
Decomposition of sets is described by Jiang and Gentleman (2007) in section 2.3.1 "Overlap among gene sets". It is a method to reduce the redundancy of significant gene set testing results whereby the decomposed sets are reanalyzed and the following selections can be made:
If the elements unique to set 1 and set 2, elements common to both sets, or all 3 parts are statistically significant, keep both set 1 and set 2 in the original results. We can not separate their effects.
If the elements unique to set 1 or the elements unique to set 1 and common to both sets are statistically significant, only keep set 1 in the original results. (The same logic can be applied for set 2.)
Jiang, Z., & Gentleman, R. (2007). Extensions to gene set enrichment. Bioinformatics, 23(3), 306–313. doi:10.1093/bioinformatics/btl599
x <- list("A" = letters[1:10], "B" = letters[3:7], "C" = letters[1:4], "D" = letters[6:12]) decomposeSets(x) decomposeSets(x, overlap = 5L)
x <- list("A" = letters[1:10], "B" = letters[3:7], "C" = letters[1:4], "D" = letters[6:12]) decomposeSets(x) decomposeSets(x, overlap = 5L)
Create a bubble heatmap summarizing molecular signature analysis
results, such as those from cameraPR.matrix
. May also be used
to generate bubble heatmaps of differential analysis results.
enrichmap( x, n_top = 15L, set_column = "GeneSet", statistic_column = "ZScore", contrast_column = "Contrast", padj_column = "FDR", padj_legend_title = padj_column, padj_aggregate_fun = function(padj) median(-log10(padj), na.rm = TRUE), padj_cutoff = 0.05, plot_sig_only = TRUE, padj_fill = "grey", colors = c("#3366ff", "darkred"), heatmap_color_fun = cameraColorFun, scale_by = c("row", "column", "max"), cell_size = unit(14, "points"), filename, height = 5, width = 5, units = "in", heatmap_args = list(), padj_args = list(), save_args = list(), draw_args = list() )
enrichmap( x, n_top = 15L, set_column = "GeneSet", statistic_column = "ZScore", contrast_column = "Contrast", padj_column = "FDR", padj_legend_title = padj_column, padj_aggregate_fun = function(padj) median(-log10(padj), na.rm = TRUE), padj_cutoff = 0.05, plot_sig_only = TRUE, padj_fill = "grey", colors = c("#3366ff", "darkred"), heatmap_color_fun = cameraColorFun, scale_by = c("row", "column", "max"), cell_size = unit(14, "points"), filename, height = 5, width = 5, units = "in", heatmap_args = list(), padj_args = list(), save_args = list(), draw_args = list() )
x |
an object that can be coerced to a |
n_top |
integer; number of sets (rows) to display. Defaults to the top
15 sets with the highest median |
set_column |
character; the name of a column in |
statistic_column |
character; the name of a column in |
contrast_column |
character; the name of a column in |
padj_column |
character; the name of a column in |
padj_legend_title |
character; title of the background fill legend.
Defaults to |
padj_aggregate_fun |
function; a function used to aggregate the adjusted
p-values in |
padj_cutoff |
numeric; cutoff for terms to be statistically significant.
If |
plot_sig_only |
logical; whether to plot only those |
padj_fill |
character; the background color used for values in
|
colors |
character; vector of length 2 specifying the colors for the
largest negative and largest positive values of
|
heatmap_color_fun |
function; used to create the legend for the heatmap
bubble fill. See |
scale_by |
character; whether to scale the bubbles such that the term
with the largest |
cell_size |
|
filename |
character; the file name used to save the heatmap. If missing (default), the heatmap will be displayed instead. |
height |
numeric; height of the file in |
width |
numeric; width of the file in |
units |
character; units that define |
heatmap_args |
list; additional arguments passed to
|
padj_args |
list; additional arguments passed to
|
save_args |
list; additional arguments passed to the graphics device
determined by the |
draw_args |
list; additional arguments passed to
|
The diameter of each bubble is determined by the
adjusted p-values. By default, the bubbles are scaled such that the
contrast with the largest
adjusted p-value per row
(
scale_by="row"
) has a bubble diameter of 0.95 * cell_size
,
and all other bubbles in that row are scaled relative to this maximum
diameter; this is to better visualize patterns across contrasts. Bubbles
can also be scaled so that largest adjusted p-value by
column (
scale_by="column"
) or in the entire heatmap
(scale_by="max"
) has the maximum diameter. The bubble diameters will
be no smaller than 0.2 * cell_size
.
Nothing. Displays heatmap or saves the heatmap to a file (if
filename
is provided).
## Simulate results of cameraPR.matrix set.seed(1) df <- 5000L x <- data.frame( Contrast = rep(paste("Contrast", 1:3), each = 4), GeneSet = rep(paste("GeneSet", 1:4), times = 3), TwoSampleT = 5 * rt(n = 12L, df = df) ) # Calculate z-statistics, two-sided p-values, and BH adjusted p-values x$ZScore <- limma::zscoreT(x = x$TwoSampleT, df = df) x$PValue <- 2 * pnorm(abs(x$ZScore), lower.tail = FALSE) x$FDR <- p.adjust(x$PValue, method = "BH") ## Plot results # Same as enrichmap(x, statistic_column = "ZScore") enrichmap(x = x, set_column = "GeneSet", statistic_column = "ZScore", contrast_column = "Contrast", padj_column = "FDR", padj_cutoff = 0.05) # Include gene sets with adjusted p-values above padj_cutoff (0.05). # Also update adjusted p-value legend title. enrichmap(x = x, statistic_column = "ZScore", plot_sig_only = FALSE, padj_legend_title = "BH Adjusted\nP-Value")
## Simulate results of cameraPR.matrix set.seed(1) df <- 5000L x <- data.frame( Contrast = rep(paste("Contrast", 1:3), each = 4), GeneSet = rep(paste("GeneSet", 1:4), times = 3), TwoSampleT = 5 * rt(n = 12L, df = df) ) # Calculate z-statistics, two-sided p-values, and BH adjusted p-values x$ZScore <- limma::zscoreT(x = x$TwoSampleT, df = df) x$PValue <- 2 * pnorm(abs(x$ZScore), lower.tail = FALSE) x$FDR <- p.adjust(x$PValue, method = "BH") ## Plot results # Same as enrichmap(x, statistic_column = "ZScore") enrichmap(x = x, set_column = "GeneSet", statistic_column = "ZScore", contrast_column = "Contrast", padj_column = "FDR", padj_cutoff = 0.05) # Include gene sets with adjusted p-values above padj_cutoff (0.05). # Also update adjusted p-value legend title. enrichmap(x = x, statistic_column = "ZScore", plot_sig_only = FALSE, padj_legend_title = "BH Adjusted\nP-Value")
Heatmap color functions for plotting GSEA-like and CAMERA-like
results with enrichmap
.
gseaColorFun(statistics, colors = c("#3366ff", "darkred")) cameraColorFun(statistics, colors = c("#3366ff", "darkred"))
gseaColorFun(statistics, colors = c("#3366ff", "darkred")) cameraColorFun(statistics, colors = c("#3366ff", "darkred"))
statistics |
numeric matrix or vector of statistics. Missing values are removed. Used to compute limits for the color legend. |
colors |
vector of 2 colors for the most negative and most positive
values of |
For gseaColorFun
, the statistics
are expected to be
normalized enrichment scores (NES). Due to how the NES is formulated,
values between -1 and 1 are never significant or otherwise interesting, so
they are given a white fill so as to not appear in the heatmap (see
examples).
a named list of breaks and colors for the heatmap legend.
set.seed(0) x <- rnorm(10, mean = 0, sd = 2) cameraColorFun(x) cameraColorFun(x[x >= 0]) # positive only gseaColorFun(x) gseaColorFun(x[x >= 0]) # positive only
set.seed(0) x <- rnorm(10, mean = 0, sd = 2) cameraColorFun(x) cameraColorFun(x[x >= 0]) # positive only gseaColorFun(x) gseaColorFun(x[x >= 0]) # positive only
Extend the Range of Values Out to the Nearest Digit
extendRangeNum(x, nearest = 1)
extendRangeNum(x, nearest = 1)
x |
|
nearest |
numeric; the range of |
A numeric
vector of length 2 containing the minimum and
maximum values of x
after extending them outward to the value
provided by nearest
.
set.seed(0) x <- runif(5, min = -10, max = 10) range(x) # -4.689827 8.164156 extendRangeNum(x) # -5 9 extendRangeNum(x, nearest = 2) # -6 10 extendRangeNum(x, nearest = 0.1) # -4.7 8.2
set.seed(0) x <- runif(5, min = -10, max = 10) range(x) # -4.689827 8.164156 extendRangeNum(x) # -5 9 extendRangeNum(x, nearest = 2) # -6 10 extendRangeNum(x, nearest = 0.1) # -4.7 8.2
Given a named list of sets, filter to those that contain at
least min_size
and no more than max_size
elements. The sets
are optionally restricted to elements of background
before filtering
by size.
filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...) ## S4 method for signature 'GeneSet,character,numeric,numeric' filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...) ## S4 method for signature 'GeneSetCollection,character,numeric,numeric' filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...)
filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...) ## S4 method for signature 'GeneSet,character,numeric,numeric' filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...) ## S4 method for signature 'GeneSetCollection,character,numeric,numeric' filterSets(x, background = NULL, min_size = 5L, max_size = Inf, ...)
x |
a named list of sets. Elements must be of type |
background |
character; optional character vector. |
min_size |
integer ( |
max_size |
integer ( |
... |
additional arguments are not currently used. |
A named list of sets at most the same size as x
.
x <- list("A" = c("a", "b", "c"), "B" = c("a", "a", "d", "e", NA), # duplicates and NA "C" = c("f", "g")) # All sets have at least 2 elements, # so this just removes duplicates and NA filterSets(x, min_size = 2L) # Limit scope of sets before filtering filterSets(x, min_size = 2L, background = c("a", "c", "e", "z"))
x <- list("A" = c("a", "b", "c"), "B" = c("a", "a", "d", "e", NA), # duplicates and NA "C" = c("f", "g")) # All sets have at least 2 elements, # so this just removes duplicates and NA filterSets(x, min_size = 2L) # Limit scope of sets before filtering filterSets(x, min_size = 2L, background = c("a", "c", "e", "z"))
Converts an incidence matrix to a named list of sets. The
inverse of sparseIncidence
.
incidenceToList(incidence)
incidenceToList(incidence)
incidence |
incidence matrix with set names as rows and elements as
columns. For instance, the output of |
a named list of sets with the same length as nrow(incidence)
.
Currently, there are no checks to ensure incidence
is a valid
incidence matrix.
x <- list("A" = c("a", "b", "c"), "B" = c("c", "d"), "C" = c("x", "y", "z", "z"), # duplicates "D" = c("a", NA)) # missing values (imat <- sparseIncidence(x)) # incidence matrix incidenceToList(incidence = imat)
x <- list("A" = c("a", "b", "c"), "B" = c("c", "d"), "C" = c("x", "y", "z", "z"), # duplicates "D" = c("a", NA)) # missing values (imat <- sparseIncidence(x)) # incidence matrix incidenceToList(incidence = imat)
Invert a list of sets so that elements become set names and set names become elements.
invertSets(x, ...) ## S4 method for signature 'GeneSet' invertSets(x, ...) ## S4 method for signature 'GeneSetCollection' invertSets(x, ...)
invertSets(x, ...) ## S4 method for signature 'GeneSet' invertSets(x, ...) ## S4 method for signature 'GeneSetCollection' invertSets(x, ...)
x |
a named list of sets. Elements must be of type |
... |
additional arguments are not currently used. |
A named list of sets.
This function is essentially a more limited version of
purrr::transpose_list
.
x <- list("A" = c("a", "b", "c"), "B" = c("c", "d"), "C" = c("x", "y", "z"), "D" = c("a", "c", "d")) # Invert sets (y <- invertSets(x)) # Jaccard similarity of pairs of elements similarity(y) # Decompose sets into disjoint parts yc <- lapply(y, paste, collapse = ", ") invertSets(yc)
x <- list("A" = c("a", "b", "c"), "B" = c("c", "d"), "C" = c("x", "y", "z"), "D" = c("a", "c", "d")) # Invert sets (y <- invertSets(x)) # Jaccard similarity of pairs of elements similarity(y) # Decompose sets into disjoint parts yc <- lapply(y, paste, collapse = ", ") invertSets(yc)
Create a named list of sets from a GMT file, or a file structured like a GMT.
readGMT(path, check = TRUE)
readGMT(path, check = TRUE)
path |
character; path to a GMT file. Files may include one additional extension after ".gmt", such as ".gmt.gzip". |
check |
logical; check that |
The second entry in each line of the GMT file is assumed to be a URL or some other additional information, so it is discarded.
A named list of character vectors.
Similar to fgsea::gmtPathways
.
path <- system.file("extdata", "c5.go.v2023.2.Hs.symbols.gmt.gz", package = "TMSig") x <- readGMT(path) head(names(x)) # First 6 gene set names x[1] # first set
path <- system.file("extdata", "c5.go.v2023.2.Hs.symbols.gmt.gz", package = "TMSig") x <- readGMT(path) head(names(x)) # First 6 gene set names x[1] # first set
Construct a sparse matrix of similarity coefficients for each pair of sets in a list.
similarity(x, type = c("jaccard", "overlap", "otsuka"))
similarity(x, type = c("jaccard", "overlap", "otsuka"))
x |
a named list of sets. Elements must be of type |
type |
character; the type of similarity measure to use. Either
|
A symmetric dgCMatrix
containing all pairwise set similarity coefficients.
If and
are sets, we define the Jaccard similarity
coefficient
as the size of their intersection divided by the size
of their union (Jaccard, 1912):
The overlap coefficient is defined as the size of the intersection divided by the size of the smaller set (Simpson, 1943, 1947, 1960; Fallaw 1979):
The Ōtsuka coefficient is defined as the size of the intersection divided by the geometric mean of the set sizes (Ōtsuka, 1936), which is equivalent to the cosine similarity of two bit vectors:
The Jaccard and Ōtsuka coefficients can identify aliased sets (sets which contain the same elements, but have different names), while the overlap coefficient can identify both aliased sets and subsets. Aliases and subsets are not easily distinguished without also having the matrix of Jaccard (or Ōtsuka) coefficients or the set sizes.
Notice the relationship between the similarity coefficients:
Calculations are only performed for pairs of sets with nonzero
intersections in the lower triangular part of the matrix. As such,
similarity
is efficient even for large similarity matrices, and it
is especially efficient for sparse similarity matrices.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. The New Phytologist, 11(2), 37–50. doi:10.1111/j.1469-8137.1912.tb05611.x. https://www.jstor.org/stable/2427226
Ōtsuka, Y. (1936). The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan. Bulletin of the Biogeographical Society of Japan, 6(16), 165–170.
Simpson, G. G. (1943). Mammals and the nature of continents. American Journal of Science, 241(1), 1–31.
Simpson, G. G. (1947). Holarctic mammalian faunas and continental relationships during the Cenozoic. Bulletin of the Geological Society of America, 58(7), 613–688.
Simpson, G. G. (1960). Notes on the measurement of faunal resemblance. American Journal of Science, 258-A, 300–311.
Fallaw, W. C. (1979). A test of the Simpson coefficient and other binary coefficients of faunal similarity. Journal of Paleontology, 53(4), 1029–1034. http://www.jstor.org/stable/1304126
x <- list("A" = c("a", "b", "c", "d", "e"), "B" = c("d", "e", "f", "g"), # overlaps with A "C" = c("d", "e", "f", "g"), # aliased with B "D" = c("a", "b", "c")) # subset of A similarity(x) # Jaccard coefficients similarity(x, type = "overlap") # overlap coefficients similarity(x, type = "otsuka") # Ōtsuka coefficients
x <- list("A" = c("a", "b", "c", "d", "e"), "B" = c("d", "e", "f", "g"), # overlaps with A "C" = c("d", "e", "f", "g"), # aliased with B "D" = c("a", "b", "c")) # subset of A similarity(x) # Jaccard coefficients similarity(x, type = "overlap") # overlap coefficients similarity(x, type = "otsuka") # Ōtsuka coefficients
Construct a sparse incidence matrix from a named list of sets.
sparseIncidence(x, ...) ## S4 method for signature 'GeneSet' sparseIncidence(x, ...) ## S4 method for signature 'GeneSetCollection' sparseIncidence(x, ...)
sparseIncidence(x, ...) ## S4 method for signature 'GeneSet' sparseIncidence(x, ...) ## S4 method for signature 'GeneSetCollection' sparseIncidence(x, ...)
x |
a named list of sets. Elements must be of type |
... |
additional arguments are not currently used. |
sparseIncidence
differs from
GSEABase::incidence
in that it returns a
sparse matrix, rather than a dense matrix, so it is more memory efficient.
It also removes missing elements, removes empty sets, and combines sets
with duplicate names.
An object of class dgCMatrix
with unique set names as rows and unique elements as columns.
An incidence matrix, , is defined such that
incidenceToList
, similarity
,
sparseMatrix
x <- list("A" = c("a", "b"), "A" = c("c"), # duplicate sets "B" = c("c", "d"), "C" = c("x", "y", "z", "z"), # duplicates "D" = c("a", NA)) # missing values (imat <- sparseIncidence(x)) # Sizes of sets and their pairwise intersections tcrossprod(imat) # Number of sets in which each element and pair of elements appears crossprod(imat) # Count number of elements unique to each set keep <- apply(imat, 2, sum) == 1 apply(imat[, keep], 1, sum)
x <- list("A" = c("a", "b"), "A" = c("c"), # duplicate sets "B" = c("c", "d"), "C" = c("x", "y", "z", "z"), # duplicates "D" = c("a", NA)) # missing values (imat <- sparseIncidence(x)) # Sizes of sets and their pairwise intersections tcrossprod(imat) # Number of sets in which each element and pair of elements appears crossprod(imat) # Count number of elements unique to each set keep <- apply(imat, 2, sum) == 1 apply(imat[, keep], 1, sum)