ccmap finds drugs and drug combinations that are predicted to reverse or mimic gene expression signatures. These drugs might reverse diseases or mimic healthy lifestyles.
###Query Signatures
To obtain a query gene expression signature, it is recommended that you perform a meta-analysis of all gene expression studies that have compared similar groups. This can be accomplished with the crossmeta package.
This meta-analysis approach was validated by querying the cmap drug signatures using independent drug expression data. Query signatures from meta-analyses improved the rankings of the same cmap drugs (figure 1).
Figure 1. Receiver operating characteristic (ROC) curves comparing query results using signatures from individual contrasts (AUC = 0.720) to meta-analyses (AUC = 0.913). Queries from signatures generated by meta-analyses for 10 drugs were compared to queries from the 260 contrasts used in the meta-analyses.
To use ccmap, the query signature needs to be a named vector of effect size values where the names correspond to uppercase HGNC symbols. If you used crossmeta, proceed as follows:
library(crossmeta)
library(ccmap)
# microarray data from studies using drug LY-294002
library(lydata)
data_dir <- system.file("extdata", package = "lydata")
# gather all GSEs
gse_names <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689")
# load previous crossmeta differential expression analysis
anals <- load_diff(gse_names, data_dir)
# run meta-analysis
es <- es_meta(anals)
# contribute your signature to our public meta-analysis database
# contribute(anals, subject = "LY-294002")
# extract moderated adjusted standardized effect sizes
dprimes <- get_dprimes(es)
# query signature
query_sig <- dprimes$all$meta
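If the effect sizes come from another workflow, the required format (a named numeric vector with uppercase symbols as names) can be built in base R. The following is a minimal sketch: the `de` data frame, its column names, and its values are made up for illustration, and keeping the largest absolute effect size per duplicated symbol is an assumption, not a rule stated by ccmap.

```r
# Hypothetical differential expression table (made-up values)
de <- data.frame(
  symbol = c("tp53", "Myc", "EGFR", NA, "EGFR"),
  dprime = c(1.8, -0.9, 0.4, 2.1, 0.6),
  stringsAsFactors = FALSE
)

# Drop rows without a gene symbol and convert symbols to uppercase
de <- de[!is.na(de$symbol), ]
de$symbol <- toupper(de$symbol)

# Assumption: for duplicated symbols keep the largest absolute effect size
de <- de[order(-abs(de$dprime)), ]
de <- de[!duplicated(de$symbol), ]

# Named numeric vector: names are symbols, values are effect sizes
query_sig <- setNames(de$dprime, de$symbol)
```

The resulting vector has the same shape as the `query_sig` extracted from the crossmeta results above.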
###Drug Signatures
CMAP drug signatures were generated using the raw data from the Connectivity Map build 2. The raw data from experiments with a shared platform were norm-exp background corrected, quantile normalized, and log2 transformed (RMA algorithm). After preprocessing, contrasts were specified such that all signatures for each drug were compared to all vehicle treated signatures. Non-treatment related variables (cell-line, drug dose, batch effects, etc.) were discovered using sva and accounted for during differential expression analysis by limma. Finally, moderated t-statistics calculated by limma were used by GeneMeta to calculate moderated unbiased standardised effect sizes.
The final drug signatures are available in the ccdata package.
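To give a feel for what an unbiased standardised effect size is, the sketch below converts ordinary two-sample t-statistics into Hedges' g in base R. This is the classical, unmoderated formula and only an illustration: moderated t-statistics from limma carry different degrees of freedom, and the exact computation used to build the signatures is not shown here. The function name `hedges_g`, the sample sizes, and the t-statistics are all made up.

```r
# Hedges' g from a two-sample t-statistic (classical, unmoderated case)
hedges_g <- function(t_stat, n_treated, n_control) {
  # Cohen's d recovered from the t-statistic
  d <- t_stat * sqrt(1 / n_treated + 1 / n_control)
  # Small-sample bias correction (Hedges' approximation)
  df <- n_treated + n_control - 2
  correction <- 1 - 3 / (4 * df - 1)
  d * correction
}

# Made-up t-statistics for three genes, 4 treated vs 4 vehicle samples
t_stats <- c(TP53 = 2, MYC = -3.5, EGFR = 0.4)
g <- hedges_g(t_stats, n_treated = 4, n_control = 4)
```

The correction factor is below one, so each corrected effect size is slightly smaller in magnitude than the uncorrected one.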
LINCS l1000 signatures (drugs and genetic under/over expression) were generated using the raw level 1 lxb files. For each cell line, all vehicle wells were quantile normalized. These were used as reference distributions in order to quantile normalize all treatment wells for the corresponding cell line. For deconvolution of each probe pair in each well, four Gaussian mixture models were fitted to the normalized and log2 transformed data: 1) Mclust with equal variance, 2) Mclust with equal variance and outliers initialized by spoutlier, 3/4) a modified mixtools model excluding outliers determined from (2) and biased towards the predominant peak lying at either lower (3) or higher (4) expression values. xgboost was used to choose one of these four models using a manually labeled sample as training data. In order to correct flipped peaks, another round of manual labeling was performed with summaries displayed from the first round for the same cell/treatment. As for the CMAP drug signatures above, surrogate variables were accounted for and moderated unbiased standardised effect sizes were calculated.
The final drug signatures are available in the ccdata package.
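The reference-based quantile normalization described above for the l1000 wells can be sketched in base R. This is a simplified illustration under stated assumptions, not the code used to build the signatures: the toy matrices are random, every well has the same number of measurements, and the helper name `to_reference` is hypothetical.

```r
# Toy log2 intensities (random values): rows are probes, columns are wells
set.seed(1)
vehicle   <- matrix(rnorm(20, mean = 8), nrow = 5)  # 4 vehicle wells
treatment <- matrix(rnorm(10, mean = 9), nrow = 5)  # 2 treatment wells

# Reference distribution: mean of the sorted vehicle values at each rank
reference <- rowMeans(apply(vehicle, 2, sort))

# Replace every value in a well with the reference value of the same rank
to_reference <- function(well, reference) {
  reference[rank(well, ties.method = "first")]
}

# Normalize each treatment well against the vehicle reference
treatment_qn <- apply(treatment, 2, to_reference, reference = reference)
```

After this step every treatment well shares the vehicle reference distribution while the ordering of values within each well is preserved.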
###Querying Drug Signatures
Cosine similarity is calculated between the query and drug signatures.
library(ccdata)
# load drug signatures
data(cmap_es)
data(l1000_es)
# query drug signatures
top_cmap <- query_drugs(query_sig, cmap_es)
top_l1000 <- query_drugs(query_sig, l1000_es)
# LY-294002 best match among 1309 cmap signatures
# other PI3K inhibitors are also identified among top matching drugs
head(top_cmap, 4)
##  LY-294002  sirolimus    dilazep wortmannin
##  0.7086261  0.5667522  0.5478363  0.5467520
# LY-294002 matches 4 of top 10 l1000 signatures (230,829 total)
# other PI3K inhibitors are also identified among top matching drugs
head(top_l1000, 4)
## TG-101348_MCF7_10um_24h  LY-294002_MCF7_10um_6h LY-294002_HT29_10um_24h
##               0.6065423               0.5980197               0.5968607
## LY-294002_MCF7_10um_24h
##               0.5942186
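The cosine similarity behind these rankings can be sketched in base R on toy data. This is an illustration rather than the internals of `query_drugs`: the function name `cosine_similarity` and all values are made up, and restricting the calculation to genes shared by the query and the drug matrix is an assumption.

```r
# Cosine similarity between a query signature and each drug signature
cosine_similarity <- function(query, drug_es) {
  # Assumption: use only genes present in both the query and the drug matrix
  common <- intersect(names(query), rownames(drug_es))
  q <- query[common]
  m <- drug_es[common, , drop = FALSE]
  # Dot product of each drug column with the query, divided by both norms
  sims <- as.vector(crossprod(m, q)) / (sqrt(colSums(m^2)) * sqrt(sum(q^2)))
  names(sims) <- colnames(m)
  sort(sims, decreasing = TRUE)
}

# Toy query and drug effect sizes (made-up values)
query <- c(TP53 = 2, MYC = -1, EGFR = 0.5)
drug_es <- matrix(
  c( 4, -2,  1.0,   # drug_a: same direction as the query
    -2,  1, -0.5,   # drug_b: opposite direction
     1,  2,  0.0),  # drug_c: orthogonal to the query
  nrow = 3,
  dimnames = list(c("TP53", "MYC", "EGFR"), c("drug_a", "drug_b", "drug_c"))
)

sims <- cosine_similarity(query, drug_es)
```

Drugs that mimic the query score near 1 and drugs that reverse it score near -1, so the tail of the sorted vector holds the candidates for reversing a signature.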
Note that only a subset of the cmap genes were measured in the l1000 signatures. As such, only common genes should be included if you wish to directly compare cmap and l1000 queries. To do this:
# remove genes in cmap_es that are not measured in l1000_es
cmap_lm <- cmap_es[row.names(l1000_es), ]
# query using genes common to cmap_es and l1000_es
top_cmap_lm <- query_drugs(query_sig, cmap_lm)
###Drug Combinations
To more closely mimic or reverse a gene expression signature, drug combinations may be promising. For the 1309 drugs in the Connectivity Map build 2, there are 856086 unique two-drug combinations. It is currently infeasible to assay all these combinations, but their expression profiles can be predicted.
In order to do so, I collected microarray data from GEO where single treatments and their combinations were assayed. In total, 148 studies with 257 treatment combinations were obtained.
Remarkably, simply averaging the expression profiles from the single treatments predicted the direction of differential expression of the combined treatment with 78.96% accuracy. The average expression profiles for all 856086 unique two-drug cmap combinations can be generated and queried as follows:
# query all 856086 combinations (takes ~2 minutes on Intel Core i7-6700)
# top_combos <- query_combos(query_sig, cmap_es)
# query only combinations with LY-294002
top_combos <- query_combos(query_sig, cmap_es, include='LY-294002', ncores=1)
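The averaging approach and the combination count can be illustrated in base R. This is a sketch on made-up effect sizes for three hypothetical drugs, not the implementation inside `query_combos`.

```r
# Unique two-drug combinations among the 1309 cmap drugs
n_combos <- choose(1309, 2)  # 856086

# Toy effect sizes (made-up values): rows are genes, columns are drugs
drug_es <- matrix(
  c( 2.0, -1.0,  0.5,   # drug_a
    -1.0, -1.0,  2.0,   # drug_b
     0.0,  3.0, -0.5),  # drug_c
  nrow = 3,
  dimnames = list(c("TP53", "MYC", "EGFR"), c("drug_a", "drug_b", "drug_c"))
)

# Enumerate every pair of drugs and average their single-drug signatures
pairs <- combn(colnames(drug_es), 2)
combo_es <- apply(pairs, 2, function(p) rowMeans(drug_es[, p]))
colnames(combo_es) <- apply(pairs, 2, paste, collapse = " + ")
```

The predicted combination signatures have the same layout as the single-drug signatures (genes in rows, one column per combination), so they can be scored against a query in the same way.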
Combinations of l1000 signatures can also be queried using the average method. As ~26 billion two-perturbagen combinations are possible, queries should be limited to combinations with the top few perturbagens. For example:
# query only combinations with LY-294002_NKDBA_10um_24h
top_combos <- query_combos(query_sig, l1000_es, include='LY-294002_NKDBA_10um_24h', ncores=1)
# query combinations with all LY-294002 signatures
# top_combos <- query_combos(query_sig, l1000_es, include='LY-294002')
A small improvement to 80.18% accuracy was obtained using machine learning models. Using these models requires 8-10 GB of RAM and about 2 hours (Intel Core i7-6700 with the MRO+MKL distribution of R) to predict and query all 856086 unique two-drug cmap combinations. In practice, the drug combinations that most closely mimic or reverse a query signature usually include the top few single drugs. By only predicting drug combinations that include the top few single drugs, prediction times are greatly reduced:
# Times on Intel Core i7-6700 with MRO+MKL
# requires ~8-10GB of RAM
method <- 'ml'
include <- names(head(top_cmap))
# query all 856086 combinations (~2 hours)
# top_combos <- query_combos(query_sig, cmap_es, method = method)
# query combinations with top single drugs (~1 minute)
# top_combos <- query_combos(query_sig, cmap_es, method = method, include = include)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ccdata_1.32.0 limma_3.63.2 lydata_1.32.0 ccmap_1.33.0
## [5] crossmeta_1.33.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 xfun_0.49 bslib_0.8.0
## [4] shinyjs_2.1.0 Biobase_2.67.0 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.4.2 generics_0.1.3
## [10] parallel_4.4.2 stats4_4.4.2 AnnotationDbi_1.69.0
## [13] RSQLite_2.3.9 blob_1.2.4 Matrix_1.7-1
## [16] data.table_1.16.4 S4Vectors_0.45.2 metaMA_3.1.3
## [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
## [22] Biostrings_2.75.3 statmod_1.5.0 codetools_0.2-20
## [25] httpuv_1.6.15 GenomeInfoDb_1.43.2 htmltools_0.5.8.1
## [28] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [31] fdrtool_1.2.18 yaml_2.3.10 later_1.4.1
## [34] crayon_1.5.3 jquerylib_0.1.4 cachem_1.1.0
## [37] iterators_1.0.14 foreach_1.5.2 mime_0.12
## [40] digest_0.6.37 maketools_1.3.1 fastmap_1.2.0
## [43] grid_4.4.2 cli_3.6.3 magrittr_2.0.3
## [46] UCSC.utils_1.3.0 promises_1.3.2 bit64_4.5.2
## [49] xgboost_1.7.8.1 SMVar_1.3.4 rmarkdown_2.29
## [52] XVector_0.47.0 httr_1.4.7 bit_4.5.0.1
## [55] png_0.1-8 memoise_2.0.1 shiny_1.10.0
## [58] evaluate_1.0.1 knitr_1.49 IRanges_2.41.2
## [61] doParallel_1.0.17 miniUI_0.1.1.1 rlang_1.1.4
## [64] Rcpp_1.0.13-1 xtable_1.8-4 DBI_1.2.3
## [67] BiocGenerics_0.53.3 jsonlite_1.8.9 R6_2.5.1
## [70] zlibbioc_1.52.0