mitch is an R package for multi-contrast enrichment analysis. At its heart, it uses a rank-MANOVA based statistical approach to detect sets of genes that exhibit enrichment in the multidimensional space as compared to the background. The rank-MANOVA concept dates to work by Cox and Mann (https://doi.org/10.1186/1471-2105-13-S16-S12). mitch is useful for pathway analysis of profiling studies with one, two or more contrasts, or in studies with multiple omics profiling, for example proteomic, transcriptomic and epigenomic analysis of the same samples. mitch is also well suited to pathway-level differential analysis of scRNA-seq data.
The main strengths of mitch are that it can import datasets easily from many upstream tools and has advanced plotting features to visualise these enrichments. mitch consists of five functions. A typical mitch workflow would consist of:
1: Import gene sets with gmt_import()
2: Import profiling data with mitch_import()
3: Calculate enrichments with mitch_calc()
4: And generate plots and reports with mitch_plots() and mitch_report()
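To make the rank-MANOVA idea from the introduction more concrete, here is a minimal base R sketch for a single gene set. It is an illustration of the general concept only, not mitch's own code: the scores are simulated, and the names `scores`, `set_genes` and `in_set` are made up for this example.

```r
set.seed(1)  # reproducible toy data

n_genes <- 200
# Simulated scores for two contrasts; rows are genes.
scores <- data.frame(
    contrast1 = rnorm(n_genes),
    contrast2 = rnorm(n_genes),
    row.names = paste0("gene", seq_len(n_genes))
)

# A made-up gene set of 20 genes, shifted upwards in contrast1 only.
set_genes <- rownames(scores)[1:20]
scores[set_genes, "contrast1"] <- scores[set_genes, "contrast1"] + 3

# Step 1: replace the scores by their ranks within each contrast.
ranks <- apply(scores, 2, rank)

# Step 2: MANOVA of the ranks on set membership (set versus background).
in_set <- rownames(scores) %in% set_genes
fit <- manova(ranks ~ in_set)

# P-value for the membership term (Pillai test, the summary() default).
p_manova <- summary(fit)$stats[1, "Pr(>F)"]
p_manova
```

Because the simulated set is strongly shifted in one contrast, the p-value here is very small; with real data, one such test would be run per gene set.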
mitch has a function to import GMT files to R lists (adapted from Yu et al, 2012 in the clusterProfiler package). For example, we can grab some gene sets from Reactome.org:
download.file("https://reactome.org/download/current/ReactomePathways.gmt.zip",
destfile="ReactomePathways.gmt.zip")
unzip("ReactomePathways.gmt.zip")
genesets<-gmt_import("ReactomePathways.gmt")
In this cut-down example we will be using a sample of 200 Reactome gene sets:
## $`2-LTR circle formation`
## [1] "Reactome Pathway" "BANF1" "HMGA1" "LIG4"
## [5] "PSIP1" "XRCC4" "XRCC5" "XRCC6"
## [9] "gag" "gag-pol" "rev" "vif"
## [13] "vpr" "vpu"
##
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "Reactome Pathway" "PRPS1" "PRPS1L1" "PRPS2"
##
## $`A tetrasaccharide linker sequence is required for GAG synthesis`
## [1] "Reactome Pathway" "AGRN" "B3GALT6" "B3GAT1"
## [5] "B3GAT2" "B3GAT3" "B4GALT7" "BCAN"
## [9] "BGN" "CSPG4" "CSPG5" "DCN"
## [13] "GPC1" "GPC2" "GPC3" "GPC4"
## [17] "GPC5" "GPC6" "HSPG2" "NCAN"
## [21] "SDC1" "SDC2" "SDC3" "SDC4"
## [25] "VCAN" "XYLT1" "XYLT2"
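For readers unfamiliar with the GMT format, each line holds a set name, a description field and then the member genes, all separated by tabs. The sketch below parses such a file into a named list with base R. `read_gmt_sketch` is a hypothetical helper written for this illustration, not part of mitch, and the toy file contents are made up. Like the output printed above, it keeps the description field as the first element of each set.

```r
# Hypothetical helper (not part of mitch): parse a GMT file into a named list.
read_gmt_sketch <- function(path) {
    fields <- strsplit(readLines(path), "\t")
    # Drop only the set name; the description stays as the first element.
    sets <- lapply(fields, function(f) f[-1])
    names(sets) <- vapply(fields, function(f) f[1], character(1))
    sets
}

# Write a tiny made-up GMT file to a temporary location.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
    "Set A\tdescription A\tGENE1\tGENE2\tGENE3",
    "Set B\tdescription B\tGENE2\tGENE4"
), gmt_path)

toy_sets <- read_gmt_sketch(gmt_path)
str(toy_sets)
```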
mitch accepts pre-ranked data supplied by the user, but also has a
function called mitch_import for importing tables generated by limma,
edgeR, DESeq2, ABSSeq, Sleuth, Seurat, Muscat and several other upstream
tools. By default, only the genes that are detected in all contrasts are
included, but this behaviour can be modified for sparse data by setting
joinType="full". The example below imports two edgeR tables called “rna”
and “k9a”, where gene identifiers are present as row names. Note that if
there is more than one profile being imported, they need to be part of a
list.
## Note: Mean no. genes in input = 1000
## Note: no. genes in output = 1000
## Note: estimated proportion of input genes in output = 1
## rna k9a
## NR4A3 68.07237 10.7310010
## HSPA1B 47.19114 18.8135155
## DNAJB1 35.12799 2.4326983
## MIR133A1HG -27.36199 8.9061967
## HSPH1 25.83750 10.8922577
## CXCL2 24.76570 0.8459414
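The difference between keeping only genes detected in all contrasts and keeping every gene can be illustrated with base R `merge()`. This is a sketch of the join concept only, using two made-up score tables `a` and `b`; it does not call mitch and makes no claim about how mitch performs the join internally.

```r
# Toy scored profiles (made-up values); gene identifiers are row names.
a <- data.frame(score = c(2.1, -0.5, 1.3), row.names = c("G1", "G2", "G3"))
b <- data.frame(score = c(0.7, -1.9, 0.2), row.names = c("G2", "G3", "G4"))

# Inner join: keep only genes present in both profiles.
inner <- merge(a, b, by = 0, suffixes = c(".a", ".b"))

# Full join: keep every gene; scores missing from one profile become NA.
full <- merge(a, b, by = 0, all = TRUE, suffixes = c(".a", ".b"))

inner
full
```

With sparse data, an inner join can discard many genes, which is the situation where a full join becomes attractive.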
mitch can do unidimensional analysis if you provide it a single profile as a dataframe (not in a list).
## The input is a single dataframe; one contrast only. Converting
## it to a list for you.
## Note: Mean no. genes in input = 1000
## Note: no. genes in output = 1000
## Note: estimated proportion of input genes in output = 1
## x
## NR4A3 68.07237
## HSPA1B 47.19114
## DNAJB1 35.12799
## MIR133A1HG -27.36199
## HSPH1 25.83750
## CXCL2 24.76570
If the gene identifiers are not given in the rownames, then the column
can be specified with the geneIDcol parameter like this:
# first rearrange cols
rna_mod<-rna
rna_mod$MyGeneIDs<-rownames(rna_mod)
rownames(rna_mod)<-seq(nrow(rna_mod))
head(rna_mod)
## logFC logCPM PValue adj.p.value MyGeneIDs
## 1 4.12734 5.09552 8.46507e-69 6.24341e-65 NR4A3
## 2 3.64685 7.42834 6.43968e-48 3.16639e-44 HSPA1B
## 3 2.35432 7.13208 7.44748e-36 2.74645e-32 DNAJB1
## 4 -1.02085 7.29935 4.34519e-28 1.28192e-24 MIR133A1HG
## 5 1.11729 6.03741 1.45377e-26 3.06351e-23 HSPH1
## 6 5.48158 2.88719 1.71515e-25 3.16252e-22 CXCL2
## The input is a single dataframe; one contrast only. Converting
## it to a list for you.
## Note: Mean no. genes in input = 1000
## Note: no. genes in output = 1000
## Note: estimated proportion of input genes in output = 1
## x
## NR4A3 68.07237
## HSPA1B 47.19114
## DNAJB1 35.12799
## MIR133A1HG -27.36199
## HSPH1 25.83750
## CXCL2 24.76570
By default, differential gene activity is scored using a supplied test statistic or directional p-value (D):
D = sgn(logFC) * -log10(p-value)
If this is not desired, then users can perform their own custom
scoring procedure and import with DEtype="prescored".
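The directional p-value formula above can be checked by hand in base R. The sketch below applies it to the logFC and PValue entries of the first four genes shown in the edgeR table earlier; the data frame name `edger_like` is made up for this illustration.

```r
# logFC and PValue copied from the first four rows of the table shown earlier.
edger_like <- data.frame(
    logFC = c(4.12734, 3.64685, 2.35432, -1.02085),
    PValue = c(8.46507e-69, 6.43968e-48, 7.44748e-36, 4.34519e-28),
    row.names = c("NR4A3", "HSPA1B", "DNAJB1", "MIR133A1HG")
)

# D = sgn(logFC) * -log10(p-value)
D <- sign(edger_like$logFC) * -log10(edger_like$PValue)
names(D) <- rownames(edger_like)
round(D, 3)
```

The resulting values agree with the imported scores printed above, including the negative score for the down-regulated MIR133A1HG.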
There are many cases where the gene IDs don’t match the gene sets. To
overcome this, mitch_import also accepts a two-column table (gt here)
that relates gene identifiers in the profiling data to those in the gene
sets. In this example we can create some fake gene accession numbers to
demonstrate this feature.
library("stringi")
# obtain vector of gene names
genenames<-rownames(rna)
# create fake accession numbers
accessions<-paste("Gene0",stri_rand_strings(nrow(rna)*2, 6, pattern = "[0-9]"),sep="")
accessions<-head(unique(accessions),nrow(rna))
# create a gene table file that relates gene names to accession numbers
gt<-data.frame(genenames,accessions)
# now swap gene names for accessions
rna2<-merge(rna,gt,by.x=0,by.y="genenames")
rownames(rna2)<-rna2$accessions
rna2<-rna2[,2:5]
k9a2<-merge(k9a,gt,by.x=0,by.y="genenames")
rownames(k9a2)<-k9a2$accessions
k9a2<-k9a2[,2:5]
# now have a peek at the input data before importing
head(rna2,3)
## logFC logCPM PValue adj.p.value
## Gene0696844 0.296028 6.82814 3.84512e-04 1.46941e-02
## Gene0927839 -0.375440 4.71470 1.09120e-03 3.05432e-02
## Gene0196571 0.882624 8.12078 2.11945e-11 8.01642e-09
## logFC logCPM PValue adj.p.value
## Gene0696844 0.339535 3.67309 1.62925e-03 1.43363e-02
## Gene0927839 0.585837 3.66069 3.23724e-04 4.06552e-03
## Gene0196571 1.138700 2.78713 4.94270e-14 1.69263e-11
## genenames accessions
## 1 NR4A3 Gene0337415
## 2 HSPA1B Gene0880286
## 3 DNAJB1 Gene0697406
## Note: Mean no. genes in input = 1000
## Note: no. genes in output = 1000
## Note: estimated proportion of input genes in output = 1
## rna2 k9a2
## A2M 3.415090 2.788012
## AAAS -2.962096 3.489825
## ABRA 10.673777 13.306036
?mitch_import provides more instructions on using this feature.
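For intuition, translating identifiers with a two-column table can be done by hand with `match()`. This is a sketch under simplifying assumptions (a one-to-one mapping, made-up identifiers and scores); it does not show how mitch itself handles the gene table, in particular when several identifiers map to one gene.

```r
# Toy scored profile keyed by made-up accession numbers.
scores <- data.frame(x = c(1.2, -0.4, 3.3),
                     row.names = c("Acc1", "Acc2", "Acc3"))

# Two-column table relating accessions to gene names (deliberately unordered).
gene_table <- data.frame(
    accessions = c("Acc3", "Acc1", "Acc2"),
    genenames = c("GENEC", "GENEA", "GENEB")
)

# Look up the gene name for each accession and use it as the new row name.
idx <- match(rownames(scores), gene_table$accessions)
rownames(scores) <- gene_table$genenames[idx]
scores
```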
The mitch_calc function performs multivariate enrichment analysis of
the supplied gene sets in the scored profiling data. In its simplest
form, mitch_calc accepts the scored data as the first argument and the
gene sets as the second argument. Users can prioritise enrichments by
small adjusted p-values, by the observed effect size (magnitude of “s”,
the enrichment score) or by the standard deviation of the s scores. Note
that the number of parallel cores is set here to cores=2 but the default
is to use all but one of the available cores.
## Note: When prioritising by significance (ie: small
## p-values), large effect sizes might be missed.
## set setSize
## 5 Biological oxidations 10
## 2 Antigen processing: Ubiquitination & Proteasome degradation 20
## 1 Adaptive Immune System 44
## 3 Asparagine N-linked glycosylation 17
## 4 Axon guidance 48
## pMANOVA s.rna2 s.k9a2 p.rna2 p.k9a2 s.dist
## 5 0.00899742 -0.555757576 -0.04141414 0.002429897 0.82165390 0.5572985
## 2 0.05970462 -0.055510204 0.28377551 0.670730417 0.02956405 0.2891538
## 1 0.52997174 0.000000000 0.09775580 1.000000000 0.27259775 0.0977558
## 3 0.53769811 -0.136137873 0.04589791 0.335578518 0.74549830 0.1436668
## 4 0.94522077 0.006959034 -0.02551646 0.935141485 0.76540436 0.0264484
## SD p.adjustMANOVA
## 5 0.36369573 0.0449871
## 2 0.23991123 0.1492616
## 1 0.06912379 0.6721226
## 3 0.12871874 0.6721226
## 4 0.02296364 0.9452208
## Note: Enrichments with large effect sizes may not be
## statistically significant.
## set setSize
## 5 Biological oxidations 10
## 2 Antigen processing: Ubiquitination & Proteasome degradation 20
## 3 Asparagine N-linked glycosylation 17
## 1 Adaptive Immune System 44
## 4 Axon guidance 48
## pMANOVA s.rna2 s.k9a2 p.rna2 p.k9a2 s.dist
## 5 0.00899742 -0.555757576 -0.04141414 0.002429897 0.82165390 0.5572985
## 2 0.05970462 -0.055510204 0.28377551 0.670730417 0.02956405 0.2891538
## 3 0.53769811 -0.136137873 0.04589791 0.335578518 0.74549830 0.1436668
## 1 0.52997174 0.000000000 0.09775580 1.000000000 0.27259775 0.0977558
## 4 0.94522077 0.006959034 -0.02551646 0.935141485 0.76540436 0.0264484
## SD p.adjustMANOVA
## 5 0.36369573 0.0449871
## 2 0.23991123 0.1492616
## 3 0.12871874 0.6721226
## 1 0.06912379 0.6721226
## 4 0.02296364 0.9452208
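The summary columns in the tables above can be reproduced from the per-contrast s scores and the MANOVA p-values. The sketch below is a consistency check on the printed numbers, not a description of mitch internals. It assumes that s.dist is the Euclidean length of the s vector, that SD is the sample standard deviation of the s scores, and that p.adjustMANOVA is a Benjamini-Hochberg adjustment; all three assumptions match the printed values.

```r
# Values copied from the result table printed above.
res <- data.frame(
    set = c("Biological oxidations",
            "Antigen processing: Ubiquitination & Proteasome degradation",
            "Adaptive Immune System",
            "Asparagine N-linked glycosylation",
            "Axon guidance"),
    pMANOVA = c(0.00899742, 0.05970462, 0.52997174, 0.53769811, 0.94522077),
    s.rna2 = c(-0.555757576, -0.055510204, 0.000000000, -0.136137873, 0.006959034),
    s.k9a2 = c(-0.04141414, 0.28377551, 0.09775580, 0.04589791, -0.02551646)
)

s <- as.matrix(res[, c("s.rna2", "s.k9a2")])

# Assumed definitions (they reproduce the printed columns):
res$s.dist <- sqrt(rowSums(s^2))                              # Euclidean length
res$SD <- apply(s, 1, sd)                                     # SD of the s scores
res$p.adjustMANOVA <- p.adjust(res$pMANOVA, method = "BH")    # FDR adjustment

# Two ways to prioritise: by significance or by effect size.
by_significance <- res$set[order(res$pMANOVA)]
by_effect <- res$set[order(-res$s.dist)]
by_significance
by_effect
```

Note how the two orderings differ for the third and fourth gene sets, which is the point made by the notes printed with each result.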
By default, gene sets with fewer than 10 members present in the profiling data are discarded. This threshold can be modified using the minsetsize option. There is no upper limit on gene set size.
## Note: When prioritising by significance (ie: small
## p-values), large effect sizes might be missed.
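The effect of a minimum set size can be sketched in base R by counting how many members of each gene set are present in the profiling data. The gene sets, the `detected` vector and the `min_size` variable below are made up for illustration and stand in for the minsetsize idea; this is not mitch's own filtering code.

```r
# Made-up gene sets and a made-up vector of genes present in the profile.
toy_sets <- list(
    small = c("G1", "G2", "G99"),
    large = paste0("G", 1:12)
)
detected <- paste0("G", 1:20)

min_size <- 10  # plays the role of the minsetsize threshold

# Count members of each set that are present in the profiling data.
n_present <- vapply(toy_sets, function(g) sum(g %in% detected), integer(1))

# Keep only sets that meet the threshold.
kept <- toy_sets[n_present >= min_size]
n_present
names(kept)
```

Only members that are actually detected count towards the threshold, so a large set can still be discarded if few of its genes appear in the data.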
By default, in downstream visualisation steps, charts are made from the top 50 gene sets, but this can be modified using the resrows option.
## Note: When prioritising by significance (ie: small
## p-values), large effect sizes might be missed.
The HTML reports contain several plots as raster images and interactive charts which are useful as a first-pass visualisation. These can be generated like this:
## Dataset saved as " /tmp/RtmpyDFynf/myreport.rds ".
## processing file: mitch.Rmd
## output file: /tmp/RtmpBEALCH/Rbuild24915085bd81/mitch/vignettes/mitch.knit.md
##
## Output created: /tmp/RtmpyDFynf/mitch_report.html
If you want the charts in PDF format, for example for publications, these can be generated with mitch_plots().
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] stringi_1.8.4
## [2] kableExtra_1.4.0
## [3] pkgload_1.4.0
## [4] GGally_2.2.1
## [5] ggplot2_3.5.1
## [6] reshape2_1.4.4
## [7] beeswarm_0.4.0
## [8] gplots_3.2.0
## [9] gtools_3.9.5
## [10] tibble_3.2.1
## [11] dplyr_1.1.4
## [12] echarts4r_0.4.5
## [13] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [14] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [15] minfi_1.53.1
## [16] bumphunter_1.49.0
## [17] locfit_1.5-9.10
## [18] iterators_1.0.14
## [19] foreach_1.5.2
## [20] Biostrings_2.75.1
## [21] XVector_0.47.0
## [22] SummarizedExperiment_1.37.0
## [23] Biobase_2.67.0
## [24] MatrixGenerics_1.19.0
## [25] matrixStats_1.4.1
## [26] GenomicRanges_1.59.1
## [27] GenomeInfoDb_1.43.2
## [28] IRanges_2.41.2
## [29] S4Vectors_0.45.2
## [30] BiocGenerics_0.53.3
## [31] generics_0.1.3
## [32] HGNChelper_0.8.15
## [33] mitch_1.19.3
## [34] rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 later_1.4.1
## [3] BiocIO_1.17.1 bitops_1.0-9
## [5] preprocessCore_1.69.0 XML_3.99-0.17
## [7] lifecycle_1.0.4 lattice_0.22-6
## [9] MASS_7.3-61 base64_2.0.2
## [11] scrime_1.3.5 magrittr_2.0.3
## [13] limma_3.63.2 sass_0.4.9
## [15] jquerylib_0.1.4 yaml_2.3.10
## [17] httpuv_1.6.15 doRNG_1.8.6
## [19] askpass_1.2.1 DBI_1.2.3
## [21] buildtools_1.0.0 RColorBrewer_1.1-3
## [23] abind_1.4-8 zlibbioc_1.52.0
## [25] quadprog_1.5-8 purrr_1.0.2
## [27] RCurl_1.98-1.16 GenomeInfoDbData_1.2.13
## [29] maketools_1.3.1 rentrez_1.2.3
## [31] genefilter_1.89.0 annotate_1.85.0
## [33] svglite_2.1.3 DelayedMatrixStats_1.29.0
## [35] codetools_0.2-20 DelayedArray_0.33.3
## [37] xml2_1.3.6 tidyselect_1.2.1
## [39] farver_2.1.2 UCSC.utils_1.3.0
## [41] beanplot_1.3.1 illuminaio_0.49.0
## [43] GenomicAlignments_1.43.0 jsonlite_1.8.9
## [45] multtest_2.63.0 survival_3.7-0
## [47] systemfonts_1.1.0 tools_4.4.2
## [49] Rcpp_1.0.13-1 glue_1.8.0
## [51] gridExtra_2.3 SparseArray_1.7.2
## [53] xfun_0.49 HDF5Array_1.35.2
## [55] withr_3.0.2 fastmap_1.2.0
## [57] rhdf5filters_1.19.0 fansi_1.0.6
## [59] openssl_2.2.2 caTools_1.18.3
## [61] digest_0.6.37 R6_2.5.1
## [63] mime_0.12 colorspace_2.1-1
## [65] RSQLite_2.3.9 utf8_1.2.4
## [67] tidyr_1.3.1 data.table_1.16.2
## [69] rtracklayer_1.67.0 httr_1.4.7
## [71] htmlwidgets_1.6.4 S4Arrays_1.7.1
## [73] ggstats_0.7.0 pkgconfig_2.0.3
## [75] gtable_0.3.6 blob_1.2.4
## [77] siggenes_1.81.0 sys_3.4.3
## [79] htmltools_0.5.8.1 scales_1.3.0
## [81] png_0.1-8 knitr_1.49
## [83] rstudioapi_0.17.1 tzdb_0.4.0
## [85] rjson_0.2.23 nlme_3.1-166
## [87] curl_6.0.1 cachem_1.1.0
## [89] rhdf5_2.51.0 stringr_1.5.1
## [91] KernSmooth_2.23-24 AnnotationDbi_1.69.0
## [93] restfulr_0.0.15 GEOquery_2.75.0
## [95] pillar_1.9.0 grid_4.4.2
## [97] reshape_0.8.9 vctrs_0.6.5
## [99] promises_1.3.2 xtable_1.8-4
## [101] evaluate_1.0.1 readr_2.1.5
## [103] GenomicFeatures_1.59.1 cli_3.6.3
## [105] compiler_4.4.2 Rsamtools_2.23.1
## [107] rlang_1.1.4 crayon_1.5.3
## [109] rngtools_1.5.2 labeling_0.4.3
## [111] nor1mix_1.3-3 mclust_6.1.1
## [113] plyr_1.8.9 viridisLite_0.4.2
## [115] BiocParallel_1.41.0 munsell_0.5.1
## [117] Matrix_1.7-1 hms_1.1.3
## [119] sparseMatrixStats_1.19.0 bit64_4.5.2
## [121] Rhdf5lib_1.29.0 KEGGREST_1.47.0
## [123] statmod_1.5.0 shiny_1.9.1
## [125] memoise_2.0.1 bslib_0.8.0
## [127] bit_4.5.0.1 splitstackshape_1.4.8