License: GPL (>= 3)
GMT files, such as those available through the Molecular Signatures
Database (MSigDB)(Liberzon et al. 2011, 2015), can be read as
named lists with readGMT
:
# Path to GMT file - MSigDB Gene Ontology sets
pathToGMT <- system.file(
"extdata", "c5.go.v2023.2.Hs.symbols.gmt.gz",
package = "TMSig"
)
geneSets <- readGMT(path = pathToGMT)
length(geneSets) # 10461 gene sets
#> [1] 10461
head(names(geneSets)) # first 6 gene set names
#> [1] "GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS"
#> [2] "GOBP_2FE_2S_CLUSTER_ASSEMBLY"
#> [3] "GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS"
#> [4] "GOBP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS"
#> [5] "GOBP_3_UTR_MEDIATED_MRNA_DESTABILIZATION"
#> [6] "GOBP_3_UTR_MEDIATED_MRNA_STABILIZATION"
geneSets[[1]] # genes in first set
#> [1] "AASDHPPT" "ALDH1L1" "ALDH1L2" "MTHFD1" "MTHFD1L" "MTHFD2L"
filterSets
keeps sets that satisfy minimum and maximum
size constraints. Optionally, sets can be restricted to a smaller
background of genes before filtering by size.
# Filter by size - no background
filt <- filterSets(x = geneSets,
min_size = 10L,
max_size = 500L)
length(filt) # 7096 gene sets remain (down from 10461)
#> [1] 7096
Normally, the background is the set of genes quantified in a particular experiment. For the purposes of this example, the top 2000 most common genes will be selected to serve as the background. This ensures the gene sets will maintain some degree of overlap for later steps.
# Top 2000 most common genes
topGenes <- table(unlist(geneSets))
topGenes <- head(sort(topGenes, decreasing = TRUE), 2000L)
head(topGenes)
#>
#> TNF AKT1 WNT5A CTNNB1 TGFB1 NOTCH1
#> 809 675 669 655 609 588
background <- names(topGenes)
The gene sets will be restricted to elements of the background before filtering by size.
# Restrict genes sets to background of genes
geneSetsFilt <- filterSets(
x = geneSets,
background = background,
min_size = 10L # min. overlap of each set with background
)
length(geneSetsFilt) # 4629 gene sets pass
#> [1] 4629
The proportion of genes in each set that overlap with the background will be calculated and used to further select sets.
# Calculate proportion of overlap with background
setSizes <- lengths(geneSetsFilt)
setSizesOld <- lengths(geneSets)[names(geneSetsFilt)]
overlapProp <- setSizes / setSizesOld
plot(setSizesOld, overlapProp, main = "Set Size vs. Overlap")
Gene sets with an overlap of at least 70% will be kept, so we can be reasonably confident that the gene sets are correctly described by their labels. However, since the background is small, this will remove the majority of sets.
An incidence matrix is a representation of a graph. For a named list
of sets, the set names form the rows of the matrix, and all unique
elements are columns. A value of 1 indicates that the element is a
member of the set, while a value of 0 indicates otherwise. The incidence
matrix forms the basis for many of the functions in TMSig, including
similarity
, clusterSets
,
decomposeSets
, and cameraPR.matrix
.
imat <- sparseIncidence(x = geneSetsFilt)
dim(imat) # 1015 sets, 1914 genes
#> [1] 1015 1914
imat[seq_len(8L), seq_len(5L)] # first 8 sets, first 5 genes
#> 8 x 5 sparse Matrix of class "dgCMatrix"
#> ABL1 AGER ARG1 BTN2A2 CADM1
#> GOBP_ACTIVATED_T_CELL_PROLIFERATION 1 1 1 1 1
#> GOBP_ACTIVATION_OF_JANUS_KINASE_ACTIVITY . . . . .
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_ACTIVITY 1 . . . .
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_B_ACTIVITY . . . . .
#> GOBP_ADHESION_OF_SYMBIONT_TO_HOST . . . . .
#> GOBP_ADRENAL_GLAND_DEVELOPMENT . . . . .
#> GOBP_ALPHA_BETA_T_CELL_ACTIVATION 1 1 . . .
#> GOBP_ALPHA_BETA_T_CELL_DIFFERENTIATION 1 . . . .
The incidence matrix can be used to calculate the sizes of all pairwise set intersections or the number of sets or pairs of sets to which each gene belongs.
# Calculate sizes of all pairwise intersections
tcrossprod(imat)[1:3, 1:3] # first 3 gene sets
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#> GOBP_ACTIVATED_T_CELL_PROLIFERATION
#> GOBP_ACTIVATED_T_CELL_PROLIFERATION 40
#> GOBP_ACTIVATION_OF_JANUS_KINASE_ACTIVITY 3
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_ACTIVITY 8
#> GOBP_ACTIVATION_OF_JANUS_KINASE_ACTIVITY
#> GOBP_ACTIVATED_T_CELL_PROLIFERATION 3
#> GOBP_ACTIVATION_OF_JANUS_KINASE_ACTIVITY 13
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_ACTIVITY 13
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_ACTIVITY
#> GOBP_ACTIVATED_T_CELL_PROLIFERATION 8
#> GOBP_ACTIVATION_OF_JANUS_KINASE_ACTIVITY 13
#> GOBP_ACTIVATION_OF_PROTEIN_KINASE_ACTIVITY 67
# Calculate number of sets and pairs of sets to which each gene belongs
crossprod(imat)[1:3, 1:3] # first 3 genes
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#> ABL1 AGER ARG1
#> ABL1 76 15 3
#> AGER 15 52 6
#> ARG1 3 6 31
The similarity
function constructs a sparse symmetric
matrix of pairwise Jaccard, overlap/Simpson, or Ōtsuka similarity
coefficients for all pairs of sets A and B, where
All 3 similarity measures can identify aliasing of sets: when two or more sets contain the same elements, but have different descriptions. Only the overlap/Simpson similarity can also identify when one set is a subset of another.
# Jaccard similarity (default)
jaccard <- similarity(x = geneSetsFilt)
dim(jaccard) # 1015 1015`
#> [1] 1015 1015
class(jaccard)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
The 6 sets having the highest Jaccard similarity with GOBP_CARDIAC_ATRIUM_DEVELOPMENT are shown for each of the 3 measures of set similarity.
# 6 sets with highest Jaccard for a specific term
idx <- order(jaccard[, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT"],
decreasing = TRUE)
idx <- head(idx)
jaccard[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT 1.0000000
#> GOBP_CARDIAC_ATRIUM_MORPHOGENESIS 0.8529412
#> GOBP_ATRIAL_SEPTUM_DEVELOPMENT 0.5882353
#> GOBP_ATRIAL_SEPTUM_MORPHOGENESIS 0.4411765
#> GOBP_ATRIOVENTRICULAR_VALVE_DEVELOPMENT 0.3111111
#> GOBP_CARDIAC_CHAMBER_MORPHOGENESIS 0.3039216
GOBP_CARDIAC_ATRIUM_MORPHOGENESIS is highly similar to GOBP_CARDIAC_ATRIUM_DEVELOPMENT (J = 0.853).
overlap <- similarity(x = geneSetsFilt, type = "overlap")
overlap[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT 1.0000000
#> GOBP_CARDIAC_ATRIUM_MORPHOGENESIS 1.0000000
#> GOBP_ATRIAL_SEPTUM_DEVELOPMENT 1.0000000
#> GOBP_ATRIAL_SEPTUM_MORPHOGENESIS 1.0000000
#> GOBP_ATRIOVENTRICULAR_VALVE_DEVELOPMENT 0.5600000
#> GOBP_CARDIAC_CHAMBER_MORPHOGENESIS 0.9117647
GOBP_CARDIAC_ATRIUM_MORPHOGENESIS, GOBP_ATRIAL_SEPTUM_DEVELOPMENT, and GOBP_ATRIAL_SEPTUM_MORPHOGENESIS are subsets of GOBP_CARDIAC_ATRIUM_DEVELOPMENT, since they are smaller (each containing 15 to 29 genes compared to 34 in GOBP_CARDIAC_ATRIUM_DEVELOPMENT).
otsuka <- similarity(x = geneSetsFilt, type = "otsuka")
otsuka[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> GOBP_CARDIAC_ATRIUM_DEVELOPMENT 1.0000000
#> GOBP_CARDIAC_ATRIUM_MORPHOGENESIS 0.9235481
#> GOBP_ATRIAL_SEPTUM_DEVELOPMENT 0.7669650
#> GOBP_ATRIAL_SEPTUM_MORPHOGENESIS 0.6642112
#> GOBP_ATRIOVENTRICULAR_VALVE_DEVELOPMENT 0.4801960
#> GOBP_CARDIAC_CHAMBER_MORPHOGENESIS 0.5343239
clusterSets
employs hierarchical clustering to identify
groups of highly similar sets. This procedure was developed for the
removal of redundant Gene Ontology and Reactome gene sets for the
MSigDB. See help(topic = "clusterSets", package = "TMSig")
for details.
# clusterSets with default arguments
clusterDF <- clusterSets(x = geneSetsFilt,
type = "jaccard",
cutoff = 0.85,
method = "complete",
h = 0.9)
# First 4 clusters with 2 or more sets
subset(clusterDF, subset = cluster <= 4L)
#> set
#> 1 GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> 2 GOBP_CARDIAC_ATRIUM_MORPHOGENESIS
#> 3 GOBP_T_CELL_LINEAGE_COMMITMENT
#> 4 GOBP_CD4_POSITIVE_OR_CD8_POSITIVE_ALPHA_BETA_T_CELL_LINEAGE_COMMITMENT
#> 5 GOBP_CELL_COMMUNICATION_BY_ELECTRICAL_COUPLING
#> 6 GOBP_CELL_COMMUNICATION_BY_ELECTRICAL_COUPLING_INVOLVED_IN_CARDIAC_CONDUCTION
#> 7 GOBP_CHOLESTEROL_STORAGE
#> 8 GOBP_REGULATION_OF_CHOLESTEROL_STORAGE
#> cluster set_size
#> 1 1 34
#> 2 1 29
#> 3 2 27
#> 4 2 23
#> 5 3 24
#> 6 3 21
#> 7 4 18
#> 8 4 16
In each cluster, a single set can be retained for analysis using a combination of criteria such as set size, overlap proportion, or length of the description (shorter descriptions tend to be more general terms).
## Use clusterSets output to select sets to retain for analysis
clusterDF$overlap_prop <- overlapProp[clusterDF$set] # overlap proportion
clusterDF$n_char <- nchar(clusterDF$set) # length of set description
# Reorder rows so that the first set in each cluster is the one to keep
o <- with(clusterDF,
order(cluster, set_size, overlap_prop, n_char, set,
decreasing = c(FALSE, TRUE, TRUE, FALSE, TRUE),
method = "radix"))
clusterDF <- clusterDF[o, ]
subset(clusterDF, cluster <= 4L) # show first 4 clusters
#> set
#> 1 GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> 2 GOBP_CARDIAC_ATRIUM_MORPHOGENESIS
#> 3 GOBP_T_CELL_LINEAGE_COMMITMENT
#> 4 GOBP_CD4_POSITIVE_OR_CD8_POSITIVE_ALPHA_BETA_T_CELL_LINEAGE_COMMITMENT
#> 5 GOBP_CELL_COMMUNICATION_BY_ELECTRICAL_COUPLING
#> 6 GOBP_CELL_COMMUNICATION_BY_ELECTRICAL_COUPLING_INVOLVED_IN_CARDIAC_CONDUCTION
#> 7 GOBP_CHOLESTEROL_STORAGE
#> 8 GOBP_REGULATION_OF_CHOLESTEROL_STORAGE
#> cluster set_size overlap_prop n_char
#> 1 1 34 0.9189189 31
#> 2 1 29 1.0000000 33
#> 3 2 27 0.8437500 30
#> 4 2 23 0.8846154 70
#> 5 3 24 0.7058824 46
#> 6 3 21 0.7777778 77
#> 7 4 18 0.7200000 24
#> 8 4 16 0.8421053 38
Now that the first gene set (first row) in each cluster is the one to
keep, we can remove rows where the cluster is duplicated (this only
keeps the first row of each cluster) and then extract the vector of sets
to subset geneSetsFilt
. Note that
max(clusterDF$cluster)
will indicate how many gene sets
will remain after this redundancy filter.
# Sets to keep for analysis
keepSets <- with(clusterDF, set[!duplicated(cluster)])
head(keepSets, 4L)
#> [1] "GOBP_CARDIAC_ATRIUM_DEVELOPMENT"
#> [2] "GOBP_T_CELL_LINEAGE_COMMITMENT"
#> [3] "GOBP_CELL_COMMUNICATION_BY_ELECTRICAL_COUPLING"
#> [4] "GOBP_CHOLESTEROL_STORAGE"
# Subset geneSetsFilt to those sets
geneSetsFilt <- geneSetsFilt[keepSets]
length(geneSetsFilt) # 986 (down from 1015)
#> [1] 986
Decompose all pairs of sufficiently overlapping sets, A and B, into 3 disjoint parts:
Decomposition of sets is described in section 2.3.1 of “Extensions to Gene Set Enrichment” (Jiang and Gentleman 2007) as a method to reduce the redundancy of gene set testing results.
# Limit maximum set size for this example
geneSetsFilt2 <- filterSets(geneSetsFilt, max_size = 50L)
# Decompose all pairs of sets with at least 10 genes in common
decompSets <- decomposeSets(x = geneSetsFilt2, overlap = 10L)
# Last 3 components
tail(decompSets, 3L)
#> $`GOCC_NPBAF_COMPLEX ~MINUS~ GOCC_NBAF_COMPLEX`
#> [1] "ACTL6A" "PHF10"
#>
#> $`GOCC_NBAF_COMPLEX ~MINUS~ GOCC_NPBAF_COMPLEX`
#> [1] "ACTL6B"
#>
#> $`GOCC_NBAF_COMPLEX ~AND~ GOCC_NPBAF_COMPLEX`
#> [1] "SMARCD3" "ACTB" "ARID1A" "SMARCB1" "SMARCA4" "SMARCC1" "ARID1B"
#> [8] "SMARCA2" "SMARCC2" "SMARCE1" "SMARCD1"
A list of sets can be inverted so that elements become set names and set names become elements. This is primarily used to identify all sets to which a particular element belongs.
invertedSets <- invertSets(x = geneSetsFilt)
length(invertedSets) # 1914 genes
#> [1] 1914
head(names(invertedSets)) # names are genes
#> [1] "ABAT" "ABCA1" "ABCA12" "ABCA2" "ABCA3" "ABCA7"
invertedSets["FOXO1"] # all gene sets containing FOXO1
#> $FOXO1
#> [1] "GOBP_CELLULAR_RESPONSE_TO_REACTIVE_NITROGEN_SPECIES"
#> [2] "GOBP_MUSCLE_HYPERTROPHY"
#> [3] "GOBP_NEGATIVE_REGULATION_OF_MUSCLE_HYPERTROPHY"
#> [4] "GOBP_POSITIVE_REGULATION_OF_CARBOHYDRATE_METABOLIC_PROCESS"
#> [5] "GOBP_POSITIVE_REGULATION_OF_GLUCONEOGENESIS"
#> [6] "GOBP_POSITIVE_REGULATION_OF_GLUCOSE_METABOLIC_PROCESS"
#> [7] "GOBP_POSITIVE_REGULATION_OF_SMALL_MOLECULE_METABOLIC_PROCESS"
#> [8] "GOBP_REGULATION_OF_CARDIAC_MUSCLE_ADAPTATION"
#> [9] "GOBP_REGULATION_OF_MUSCLE_HYPERTROPHY"
#> [10] "GOBP_REGULATION_OF_REACTIVE_OXYGEN_SPECIES_METABOLIC_PROCESS"
#> [11] "GOBP_RESPONSE_TO_HYPEROXIA"
#> [12] "GOBP_RESPONSE_TO_NITRIC_OXIDE"
This can also be used to calculate the similarity of each pair of genes. That is, do pairs of genes tend to appear in the same sets?
similarity(x = invertedSets[1:5]) # first 5 genes
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> ABAT ABCA1 ABCA12 ABCA2 ABCA3
#> ABAT 1 . . . .
#> ABCA1 . 1.0000000 0.1333333 0.05 0.1333333
#> ABCA12 . 0.1333333 1.0000000 . 1.0000000
#> ABCA2 . 0.0500000 . 1.00 .
#> ABCA3 . 0.1333333 1.0000000 . 1.0000000
We will simulate a matrix of gene expression data for those 2000
genes that were selected earlier and perform differential analysis using
limma
. Then, the differential analysis results, as well as
the filtered gene sets, will be used as input for the pre-ranked version
of Correlation Adjusted MEan RAnk gene set testing (CAMERA) (Wu and Smyth
2012).
Gene expression data is simulated for 3 biological replicates in each of 3 experimental groups: ctrl (control), treat1 (treatment group 1), and treat2 (treatment group 2.
# Control and 2 treatment groups, 3 replicates each
group <- rep(c("ctrl", "treat1", "treat2"),
each = 3)
design <- model.matrix(~ 0 + group) # design matrix
contrasts <- makeContrasts(
contrasts = c("grouptreat1 - groupctrl",
"grouptreat2 - groupctrl"),
levels = colnames(design)
)
# Shorten contrast names
colnames(contrasts) <- gsub("group", "", colnames(contrasts))
ngenes <- length(background) # 2000 genes
nsamples <- length(group) # 9 samples
set.seed(0)
y <- matrix(data = rnorm(ngenes * nsamples),
nrow = ngenes, ncol = nsamples,
dimnames = list(background, make.unique(group)))
head(y)
#> ctrl ctrl.1 ctrl.2 treat1 treat1.1 treat1.2
#> TNF 1.2629543 0.444345820 0.7928027 -0.4210379 0.2277948 -1.21955126
#> AKT1 -0.3262334 0.011929380 -1.1373714 -1.2714751 1.9445279 -1.20146018
#> WNT5A 1.3297993 -0.009280045 -0.2015088 0.8393109 -2.0414912 -0.49604255
#> CTNNB1 1.2724293 -0.302377554 -1.0223649 0.5400129 0.1747742 0.06693112
#> TGFB1 0.4146414 0.492355022 -0.8237988 1.8679387 0.1490125 -0.05694914
#> NOTCH1 -1.5399500 -0.602719618 0.2414621 0.4445180 1.5725517 0.25558017
#> treat2 treat2.1 treat2.2
#> TNF -0.1984714 1.2707294 1.3388829
#> AKT1 0.1498199 -0.4548994 1.1401323
#> WNT5A -0.9205130 0.5317540 -0.1473067
#> CTNNB1 -0.5952829 0.4932479 0.9614810
#> TGFB1 -1.0171751 -0.5465015 -1.3908547
#> NOTCH1 0.5891424 0.5235780 0.5885306
Now, we introduce differential expression in two randomly selected gene sets. We will make genes in the “GOBP_CARDIAC_ATRIUM_DEVELOPMENT” set higher in control relative to treatment samples, and we will make genes in the “GOBP_ACTIVATED_T_CELL_PROLIFERATION” set higher in treatment relative to control samples. Since the contrasts are “treat1 - ctrl” and “treat2 - ctrl”, the direction of change will be “Down” for cardiac atrium development and “Up” for activated T cell proliferation. The degree of change will be less for the “treat2 - ctrl” comparison.
cardiacGenes <- geneSetsFilt[["GOBP_CARDIAC_ATRIUM_DEVELOPMENT"]]
tcellGenes <- geneSetsFilt[["GOBP_ACTIVATED_T_CELL_PROLIFERATION"]]
# Indices of treatment group samples
trt1 <- which(group == "treat1")
trt2 <- which(group == "treat2")
# Cardiac genes: higher in control relative to treatment
y[cardiacGenes, trt1] <- y[cardiacGenes, trt1] - 2
y[cardiacGenes, trt2] <- y[cardiacGenes, trt2] - 0.7
# T cell proliferation genes: higher in treatment relative to control
y[tcellGenes, trt1] <- y[tcellGenes, trt1] + 2
y[tcellGenes, trt2] <- y[tcellGenes, trt2] + 1
# Differential analysis with LIMMA
fit <- lmFit(y, design)
fit.contr <- contrasts.fit(fit, contrasts = contrasts)
fit.smooth <- eBayes(fit.contr)
# Matrix of z-score equivalents of moderated t-statistics
modz <- with(fit.smooth, zscoreT(x = t, df = df.total))
head(modz)
#> Contrasts
#> treat1 - ctrl treat2 - ctrl
#> TNF -1.6151363 -0.03685223
#> AKT1 0.3762490 0.93096154
#> WNT5A -3.5528576 -1.53460494
#> CTNNB1 0.3442251 0.37629330
#> TGFB1 0.7751130 -1.25290937
#> NOTCH1 -0.7565066 0.62257285
The results will be reformatted to make bubble heatmaps later.
# Reformat differential analysis results for enrichmap
resDA <- lapply(colnames(contrasts), function(contrast_i) {
res_i <- topTable(fit.smooth,
coef = contrast_i,
number = Inf,
sort.by = "none")
res_i$contrast <- contrast_i
res_i$gene <- rownames(res_i)
res_i$df.total <- fit.smooth$df.total
return(res_i)
})
resDA <- data.table::rbindlist(resDA)
# Add z-statistic column
resDA$z <- with(resDA, zscoreT(x = t, df = df.total))
# Reorder rows
resDA <- resDA[with(resDA, order(contrast, P.Value, z)), ]
head(resDA)
#> logFC AveExpr t P.Value adj.P.Val B
#> <num> <num> <num> <num> <num> <num>
#> 1: -3.703006 -0.4690872 -4.582075 8.487516e-06 0.01697503 3.295885
#> 2: -3.394212 -0.9283708 -4.236070 3.595590e-05 0.02756659 2.030462
#> 3: -3.473165 -1.2123347 -4.139674 5.296838e-05 0.02756659 1.692116
#> 4: -3.307502 -0.3271868 -4.129613 5.513317e-05 0.02756659 1.657166
#> 5: -3.326092 -0.4175039 -4.036147 7.971089e-05 0.03137933 1.335823
#> 6: -3.246138 -0.3336773 -3.993443 9.413798e-05 0.03137933 1.191014
#> contrast gene df.total z
#> <char> <char> <num> <num>
#> 1: treat1 - ctrl BMPR2 183.4439 -4.452503
#> 2: treat1 - ctrl SOX4 183.4439 -4.132039
#> 3: treat1 - ctrl MYH6 183.4439 -4.042128
#> 4: treat1 - ctrl EXOSC6 183.4439 -4.032728
#> 5: treat1 - ctrl MDM2 183.4439 -3.945268
#> 6: treat1 - ctrl HEY2 183.4439 -3.905224
Below is the number of significantly differentially expressed genes for each comparison.
LIMMA moderated t-statistics are converted to z-score equivalents and
used as input for cameraPR.matrix
. CAMERA-PR is a
modification of the two-sample t-test that accounts for inter-gene
correlation to correctly control the false discovery rate (FDR) (Wu and Smyth
2012). For the pre-ranked version, a default inter-gene
correlation of 0.01 is assumed for all gene sets. A non-parametric
version of CAMERA-PR is also available by specifying
use.ranks=TRUE
. See
help(topic = "cameraPR.matrix", package = "TMSig")
for
details. See help(topic = "cameraPR", package = "limma")
for the original implementation.
The main benefits of using cameraPR.matrix
over
cameraPR.default
are significantly faster execution times
and the ability to perform simultaneous inference of multiple contrasts
or coefficients.
# CAMERA-PR (matrix method)
res <- cameraPR(statistic = modz,
index = geneSetsFilt)
head(res)
#> Contrast GeneSet
#> 1 treat1 - ctrl GOBP_ACTIVATED_T_CELL_PROLIFERATION
#> 2 treat1 - ctrl GOBP_CARDIAC_ATRIUM_DEVELOPMENT
#> 3 treat1 - ctrl GOBP_POSITIVE_REGULATION_OF_ACTIVATED_T_CELL_PROLIFERATION
#> 4 treat1 - ctrl GOBP_ATRIAL_SEPTUM_DEVELOPMENT
#> 5 treat1 - ctrl GOBP_ATRIAL_SEPTUM_MORPHOGENESIS
#> 6 treat1 - ctrl GOBP_NEGATIVE_REGULATION_OF_ACTIVATED_T_CELL_PROLIFERATION
#> NGenes Direction TwoSampleT df ZScore PValue FDR
#> 1 40 Up 13.884761 1998 13.564587 6.494436e-42 6.403514e-39
#> 2 34 Down -11.679847 1998 -11.486100 1.549498e-30 7.639023e-28
#> 3 22 Up 11.164767 1998 10.994909 4.043202e-28 1.328866e-25
#> 4 20 Down -10.148402 1998 -10.019904 1.246239e-23 3.071979e-21
#> 5 15 Down -8.934331 1998 -8.845874 9.081462e-19 1.790864e-16
#> 6 13 Up 8.170612 1998 8.102558 5.381537e-16 8.843659e-14
Both cardiac atrium development and activated T cell proliferation gene sets have adjusted p-values below 0.05 and are ranked at the top of “Down” and “Up” sets. However, other gene sets are also statistically significant after adjustment due to their genes overlapping with these two terms.
# Number of sets passing FDR threshold
table(res$Contrast, res$FDR < 0.05)
#>
#> FALSE TRUE
#> treat1 - ctrl 916 70
#> treat2 - ctrl 978 8
To illustrate the above point, notice that GOBP_CARDIAC_ATRIUM_DEVELOPMENT and GOBP_ATRIAL_SEPTUM_DEVELOPMENT are both significantly down. Their Jaccard and Overlap coefficients are 0.588 and 1, respectively. That is, atrial septum development is a subset of cardiac atrium development (at least, when both sets are restricted to the background we defined earlier), so it is being driven by changes in the latter.
A bubble heatmap will be generated to visualize the top genes from
the differential expression analysis. Since
plot_sig_only=TRUE
only those 10 genes that were
significantly differentially expressed in the “treat1 - ctrl” comparison
will appear in the heatmap. The bubbles will be scaled such that the
most significant contrast/gene combination (smallest adjusted p-value)
is of maximum diameter, and all other bubbles will be scaled relative to
it based on their -log10
adjusted p-values. See
help(topic = "enrichmap", package = "TMSig")
for more
details.
# Differential analysis bubble heatmap
enrichmap(x = resDA,
scale_by = "max",
n_top = 15L, # default
plot_sig_only = TRUE, # default
set_column = "gene",
statistic_column = "z",
contrast_column = "contrast",
padj_column = "adj.P.Val",
padj_legend_title = "BH Adjusted\nP-Value",
padj_cutoff = 0.05,
cell_size = grid::unit(20, "pt"),
# Additional arguments passed to ComplexHeatmap::Heatmap. Used to
# modify default appearance.
heatmap_args = list(
name = "Z-Score",
column_names_rot = 60,
column_names_side = "top"
))
Now, a bubble heatmap will be generated to visualize the CAMERA-PR gene set analysis results. The top 20 gene sets will be shown.
# CAMERA-PR bubble heatmap
enrichmap(x = res,
scale_by = "row", # default
n_top = 20L,
set_column = "GeneSet",
statistic_column = "ZScore",
contrast_column = "Contrast",
padj_column = "FDR",
padj_legend_title = "BH Adjusted\nP-Value",
padj_cutoff = 0.05,
cell_size = grid::unit(13, "pt"),
# Additional arguments passed to ComplexHeatmap::Heatmap. Used to
# modify default appearance.
heatmap_args = list(
name = "Z-Score",
column_names_rot = 60,
column_names_side = "top"
))
Although no genes were differentially expressed in the “treat2 - ctrl” comparison, summarizing results at the gene set level uncovered several significant changes, though they are less pronounced than those in the “treat1 - ctrl” comparison, as evidenced by the sizes and colors of the bubbles.
print(sessionInfo(), locale = FALSE)
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] TMSig_1.1.0 limma_3.63.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 circlize_0.4.16 shape_1.4.6.1
#> [4] rjson_0.2.23 xfun_0.48 bslib_0.8.0
#> [7] GlobalOptions_0.1.2 Biobase_2.67.0 lattice_0.22-6
#> [10] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
#> [13] stats4_4.4.1 parallel_4.4.1 AnnotationDbi_1.69.0
#> [16] RSQLite_2.3.7 highr_0.11 cluster_2.1.6
#> [19] blob_1.2.4 Matrix_1.7-1 data.table_1.16.2
#> [22] RColorBrewer_1.1-3 S4Vectors_0.44.0 graph_1.85.0
#> [25] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.1
#> [28] Biostrings_2.75.0 statmod_1.5.0 codetools_0.2-20
#> [31] ComplexHeatmap_2.23.0 clue_0.3-65 GenomeInfoDb_1.43.0
#> [34] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
#> [37] sass_0.4.9 yaml_2.3.10 crayon_1.5.3
#> [40] jquerylib_0.1.4 cachem_1.1.0 iterators_1.0.14
#> [43] foreach_1.5.2 digest_0.6.37 maketools_1.3.1
#> [46] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
#> [49] cli_3.6.3 XML_3.99-0.17 GSEABase_1.69.0
#> [52] UCSC.utils_1.2.0 bit64_4.5.2 rmarkdown_2.28
#> [55] XVector_0.46.0 httr_1.4.7 matrixStats_1.4.1
#> [58] bit_4.5.0 png_0.1-8 GetoptLong_1.0.5
#> [61] memoise_2.0.1 evaluate_1.0.1 knitr_1.48
#> [64] IRanges_2.41.0 doParallel_1.0.17 rlang_1.1.4
#> [67] xtable_1.8-4 DBI_1.2.3 BiocManager_1.30.25
#> [70] BiocGenerics_0.53.1 annotate_1.85.0 jsonlite_1.8.9
#> [73] R6_2.5.1 zlibbioc_1.52.0