Due to the range of features that are available in musicatk, we have prepared several vignettes to help you get started, which are all available at https://www.camplab.net/musicatk/
#Introduction A variety of exogenous exposures or endogenous biological processes can contribute to the overall mutational load observed in human tumors. Many different mutational patterns, or “mutational signatures”, have been identified across different tumor types. These signatures can provide a record of environmental exposure and can give clues about the etiology of carcinogenesis. The Mutational Signature Comprehensive Analysis Toolkit (musicatk) has utilities for extracting variants from a variety of file formats, contains multiple methods for discovery of novel signatures or prediction of known signatures, as well as many types of downstream visualizations for exploratory analysis. This package has the ability to parse and combine multiple motif classes in the mutational signature discovery or prediction processes. Mutation motifs include single base substitutions (SBS), double base substitutions (DBS), insertions (INS) and deletions (DEL).
Currently musicatk can be installed from on Bioconductor using the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("musicatk")
To install the latest version from Github, use the following code:
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
library(devtools)
install_github("campbio/musicatk")
The package can be loaded using the library
command.
In order to discover or predict mutational signatures, we must first set up our musica object by 1) extracting variants from files or objects such as VCFs and MAFs, 2) selecting the appropriate reference genome 3) creating a musica object, and 4) building a count tables for our variants of interest. Alternatively, a musica object can be created directly from a count table.
Variants can be extracted from various formats using the following functions:
extract_variants_from_vcf_file()
function will
extract variants from a VCF file. The file will
be imported using the readVcf function from the VariantAnnotation
package and then the variant information will be extracted from this
object.extract_variants_from_vcf()
function extracts
variants from a CollapsedVCF
or ExpandedVCF
object from the VariantAnnotation
package.extract_variants_from_maf_file()
function will
extract variants from a file in Mutation
Annotation Format (MAF) used by TCGA.extract_variants_from_maf()
function will extract
variants from a MAF object created by the maftools
package.extract_variants_from_matrix()
function will get
the information from a matrix or data.frame like object that has columns
for the chromosome, start position, end position, reference allele,
mutation allele, and sample name.extract_variants()
function will extract variants
from a list of objects. These objects can be any combination of VCF
files, VariantAnnotation objects, MAF files, MAF objects, and data.frame
objects.Below are some examples of extracting variants from MAF and VCF files:
# Extract variants from a MAF File
lusc_maf <- system.file("extdata", "public_TCGA.LUSC.maf", package = "musicatk")
lusc.variants <- extract_variants_from_maf_file(maf_file = lusc_maf)
# Extract variants from an individual VCF file
luad_vcf <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
package = "musicatk"
)
luad.variants <- extract_variants_from_vcf_file(vcf_file = luad_vcf)
# Extract variants from multiple files and/or objects
melanoma_vcfs <- list.files(system.file("extdata", package = "musicatk"),
pattern = glob2rx("*SKCM*vcf"), full.names = TRUE
)
variants <- extract_variants(c(lusc_maf, luad_vcf, melanoma_vcfs))
## | | | 0% | |============== | 20%
## | |============================ | 40%
## | |========================================== | 60%
## | |======================================================== | 80%
## | |======================================================================| 100%
musicatk uses BSgenome
objects to access genome sequence information that flanks each mutation
which is used bases for generating mutation count tables. BSgenome
objects store full genome sequences for different organisms. A full list
of supported organisms can be obtained by running
available.genomes()
after loading the BSgenome library.
Custom genomes can be forged as well (see BSgenome
documentation). musicatk provides a utility function called
select_genome()
to allow users to quickly select human
genome build versions “hg19” and “hg38” or mouse genome builds “mm9” and
“mm10”. The reference sequences for these genomes are in UCSC format
(e.g. chr1).
The last preprocessing step is to create an object with the variants
and the genome using the create_musica_from_variants
function. This function will perform checks to ensure that the
chromosome names and reference alleles in the input variant object match
those in supplied BSgenome object. These checks can be turned off by
setting check_ref_chromosomes = FALSE
and
check_ref_bases = FALSE
, respectively. This function also
looks for adjacent single base substitutions (SBSs) and will convert
them to double base substitutions (DBSs). To disable this automatic
conversion, set convert_dbs = FALSE
.
## Checking that chromosomes in the 'variant' object match chromosomes in the 'genome' object.
## Checking that the reference bases in the 'variant' object match the reference bases in the 'genome' object.
## Warning in .check_variant_ref_in_genome(dt = dt, genome = genome): Reference
## bases for 6 out of 911 variants did not match the reference base in the
## 'genome'. Make sure the genome reference is correct.
## Standardizing INS/DEL style
## Converting 7 insertions
## Converting 1 deletions
## Converting adjacent SBS into DBS
## 5 SBS converted to DBS
Motifs are the building blocks of mutational signatures. Motifs themselves are a mutation combined with other genomic information. For instance, SBS96 motifs are constructed from an SBS mutation and one upstream and one downstream base sandwiched together. We build tables by counting these motifs for each sample.
## Warning in build_standard_table(musica, g = g, table_name = "SBS96"):
## table_name argument deprecated. Use modality instead.
## Building count table from SBS with SBS96 schema
Here is a list of mutation tables that can be created by setting the
table_name
parameter in the
build_standard_table
function:
"Transcript_Strand"
."Replication_Strand"
.## Building count table from SBS with SBS96 schema
## Warning in .table_exists_warning(musica, "SBS96", overwrite): Overwriting
## counts table: SBS96
## Building count table from DBS with DBS78 schema
## Warning in .table_exists_warning(musica, "DBS78", overwrite): Overwriting
## counts table: DBS78
# Subset SBS table to DBS samples so they cam be combined
count_tables <- tables(dbs_musica)
overlap_samples <- which(colnames(count_tables$SBS96@count_table) %in%
colnames(count_tables$DBS78@count_table))
count_tables$SBS96@count_table <-
count_tables$SBS96@count_table[, overlap_samples]
tables(dbs_musica) <- count_tables
combine_count_tables(
musica = dbs_musica, to_comb = c("SBS96", "DBS78"),
name = "sbs_dbs", description = "An example combined
table, combining SBS96 and DBS", overwrite = TRUE
)
## 403 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
build_custom_table(
musica = musica, variant_annotation = "Transcript_Strand",
name = "Transcript_Strand",
description = "A table of transcript strand of variants",
data_factor = c("T", "U"), overwrite = TRUE
)
Different count tables can be combined into one using the
combine_count_tables
function. For example, the SBS96 and
the DBS tables could be combined and mutational signature discovery
could be performed across both mutations modalities. Tables with
information about the same types of variants (e.g. two related SBS
tables) should generally not be combined and used together.
Custom count tables can be created from user-defined mutation-level
annotations using the build_custom_table
function.
If a count table is already available, a musica object can be created directly without need for a variant file and building tables.
Mutational signature discovery is the process of deconvoluting a
matrix containing counts for each mutation type in each sample into two
matrices: 1) a signature matrix containing the
probability of each mutation motif in signature and 2) an
exposure matrix containing the estimated counts of each
signature in each sample. Discovery and prediction results are saved in
the result_list slot of a musica
object.
discover_signatures(
musica = musica, modality = "SBS96", num_signatures = 3,
algorithm = "nmf", model_id = "ex_result", nstart = 10
)
Supported signature discovery algorithms include:
Both have built-in seed
capabilities for reproducible
results, nstarts
for multiple independent chains from which
the best final result will be chosen. NMF also allows for parallel
processing via par_cores
.
To get the signatures or exposures from a result within a musica object, the following functions can be used:
# Extract the exposure matrix
expos <- exposures(musica, "result", "SBS96", "ex_result")
expos[1:3, 1:3]
## TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
## Signature1 9.910392e+00 2.963251e-09
## Signature2 2.860983e-07 1.610374e+02
## Signature3 1.800896e+02 1.069626e+02
## TCGA-94-7557-01A-11D-2122-08
## Signature1 1.030442e+00
## Signature2 1.159696e+02
## Signature3 2.886599e-13
# Extract the signature matrix
sigs <- signatures(musica, "result", "SBS96", "ex_result")
sigs[1:3, 1:3]
## Signature1 Signature2 Signature3
## C>A_ACA 7.561675e-11 0.013628647 1.254321e-02
## C>A_ACC 7.684567e-11 0.007898625 1.641794e-02
## C>A_ACG 8.596657e-11 0.010725916 8.347563e-11
The k value is the number of signatures that are predicted from a
discovery method. To help determine an appropriate k value, the
compare_k_vals
function can be used to compare to the
stability and error associated with various k values. Generally, 100
replicates is suggested, as well as a larger span of k values to test.
Here, fewer k values and replicates are used for simplicity.
##
## k = 2:
## Calculating signatures on original count table...
## Replicate 1 of 5...
## Replicate 2 of 5...
## Replicate 3 of 5...
## Replicate 4 of 5...
## Replicate 5 of 5...
##
## k = 3:
## Calculating signatures on original count table...
## Replicate 1 of 5...
## Replicate 2 of 5...
## Replicate 3 of 5...
## Replicate 4 of 5...
## Replicate 5 of 5...
##
## k = 4:
## Calculating signatures on original count table...
## Replicate 1 of 5...
## Replicate 2 of 5...
## Replicate 3 of 5...
## Replicate 4 of 5...
## Replicate 5 of 5...
Barplots showing the probability of each mutation type in each
signature can be plotted with the plot_signatures
function:
By default, the scales on the y-axis are forced to be the same across
all signatures. This behavior can be turned off by setting
same_scale = FALSE
. Signatures can be re-named based on
prior knowledge and displayed in the plots:
Barplots showing the exposures in each sample can be plotted with the
plot_exposures
function:
The proportion of each exposure in each tumor can be shown by setting
proportional = TRUE
:
Counts for individual samples can also be plotted with the
plot_sample_counts
function:
A common analysis is to compare the signatures estimated in a dataset
to those generated in other datasets or to those in the COSMIC
database. We have a set of functions that can be used to easily
perform pairwise correlations between signatures. The
compare_results
functions compares the signatures between
two models in the same or different musica
objects. The
compare_cosmic_v2
will correlate the signatures between a
model and the SBS signatures in COSMIC V2. For example:
## cosine x_sig_index y_sig_index x_sig_name y_sig_name
## 1 0.9733111 1 7 Smoking SBS7
## 4 0.7906308 3 4 UV SBS4
## 2 0.7720306 1 11 Smoking SBS11
## 3 0.7547538 1 30 Smoking SBS30
In this example, our Signatures 1 and 2 were most highly correlated
to COSMIC Signature 7 and 4, respectively, so this may indicate that
samples in our dataset were exposed to UV radiation or cigarette smoke.
Only pairs of signatures who have a correlation above the
threshold
parameter will be returned. If no pairs of
signatures are found, then you may want to consider lowering the
threshold. The default correlation metric is the cosine similarity, but
this can be changed to using 1 - Jensen-Shannon Divergence by setting
metric = "jsd"
Signatures can also be correlated to those
in the COSMIC V3 database using the compare_cosmic_v3
function.
Instead of discovering mutational signatures and exposures from a dataset de novo, one might get better results by predicting the exposures of signatures that have been previously estimated in other datasets. We incorporate several methods for estimating exposures given a set of pre-existing signatures. For example, we can predict exposures for COSMIC signatures 1, 4, 7, and 13 in our current dataset:
# Load COSMIC V2 data
data("cosmic_v2_sigs")
# Predict pre-existing exposures using the "lda" method
predict_exposure(
musica = musica, modality = "SBS96",
signature_res = cosmic_v2_sigs, model_id = "ex_pred_result",
signatures_to_use = c(1, 4, 7, 13), algorithm = "lda"
)
# Plot exposures
plot_exposures(musica, "ex_pred_result", plot_type = "bar")
The cosmic_v2_sigs
object is a result_model
object containing COSMIC V2 signatures without any sample or exposure
information. Note that if signatures_to_use
is not supplied
by the user, then exposures for all signatures in the result object will
be estimated. We can predict exposures for samples in any
musica
object from any result_model
object as
long as the same mutation schema was utilized.
We can list which signatures are present in each tumor type according to the COSMIC V2 database. For example, we can find which signatures are present in lung cancers:
## lung adeno
## 124561317
## lung small cell
## 14515
## lung squamous
## 124513
We can predict exposures for samples in a musica
object
using the signatures from any result_model
object. Just for
illustration, we will predict exposures using the estimated signatures
from model we created earlier:
model <- get_model(musica, "result", "SBS96", "ex_result")
predict_exposure(
musica = musica, modality = "SBS96",
signature_res = model, algorithm = "lda",
model_id = "pred_our_sigs"
)
Of course, this example is not very useful because we are predicting
exposures using signatures that were learned on the same set of samples.
Most of the time, you want to give the signature_res
parameter a result_model
object that was generated using
independent samples from those in the input musica
object.
As mentioned above, different signatures in different models can be
compared to each other using the compare_results
function:
compare_results(musica,
model_id = "ex_pred_result",
other_model_id = "pred_our_sigs", threshold = 0.60
)
## cosine x_sig_index y_sig_index x_sig_name y_sig_name
## 2 0.9733111 3 1 SBS7 Smoking
## 1 0.7906308 2 3 SBS4 UV
## 3 0.7412271 4 2 SBS13 APOBEC
We first must add an annotation to the musica
object
annot <- read.table(system.file("extdata", "sample_annotations.txt",
package = "musicatk"),
sep = "\t", header = TRUE)
samp_annot(musica, "Tumor_Subtypes") <- annot$Tumor_Subtypes
Note that the annotations can also be added directly the
musica
object in the beginning using the same function:
samp_annot(musica, "Tumor_Subtypes") <- annot$Tumor_Subtypes
.
These annotations will then automatically be included in any down-stream
result object.
musica
object.
You can view the sample order with the sample_names
function:## [1] TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
## [3] TCGA-94-7557-01A-11D-2122-08 TCGA-97-7938-01A-11D-2167-08
## [5] TCGA-EE-A3J5-06A-11D-A20D-08 TCGA-ER-A197-06A-32D-A197-08
## [7] TCGA-ER-A19O-06A-11D-A197-08
## 7 Levels: TCGA-56-7582-01A-11D-2042-08 ... TCGA-ER-A19O-06A-11D-A197-08
As mentioned previously, the plot_exposures
function can
plot sample exposures in a musica_result
object. It can
group exposures by either a sample annotation or by a signature by
setting the group_by
parameter. Here will will group the
exposures by the Tumor_Subtype
annotation:
plot_exposures(musica, "ex_result",
plot_type = "bar", group_by = "annotation",
annotation = "Tumor_Subtypes"
)
The distribution of exposures with respect to annotation can be
viewed using boxplots by setting plot_type = "box"
and
group_by = "annotation"
:
plot_exposures(musica, "ex_result", plot_type = "box", group_by = "annotation",
annotation = "Tumor_Subtypes")
Note that the name of the annotation must be supplied via the
annotation
parameter. Boxplots can be converted to violin
plots by setting plot_type = "violin"
. To compare the level
of each exposure across sample groups within a signature, we can set
group_by = "signature"
and
color_by = "annotation"
:
The create_umap
function embeds samples in 2 dimensions
using the umap
function from the uwot
package. The major parameters for fine tuning the UMAP are
n_neighbors
, min_dist
, and
spread
. See ?uwot::umap
for more information
on these parameters.
## The parameter 'n_neighbors' cannot be bigger than the total number of samples. Setting 'n_neighbors' to 7.
The plot_umap
function will generate a scatter plot of
the UMAP coordinates. The points of plot will be colored by the level of
a signature by default:
By default, the exposures for each sample will share the same color
scale. However, exposures for some signatures may have really high
levels compared to others. To make a plot where exposures for each
signature will have their own color scale, you can set
same_scale = FALSE
:
Lastly, points can be colored by a Sample Annotation by setting
color_by = "annotation"
and annotation
to the
name of the annotation:
plot_umap(musica, "ex_result",
color_by = "annotation",
annotation = "Tumor_Subtypes", add_annotation_labels = TRUE
)
When add_annotation_labels = TRUE
, the centroid of each
group is identified using medians and the labels are plotted on top of
the centroid.
plot_signatures, plot_exposures, and plot_umap, all have builty in
ggplotly capabilities. Simply specifying plotly = TRUE
enables interactive plots that allows examination of individuals
sections, zooming and resizing, and turning on and off annotation types
and legend values.
Several functions make use of stochastic algorithms or procedures
which require the use of random number generator (RNG) for simulation or
sampling. To maintain reproducibility, all these functions should be
called using set_seed(1)
or
withr::with_seed(seed, function())
to make sure same
results are generatedeach time one of these functions is called. Using
with_seed for reproducibility has the advantage of not interfering with
any other user seeds, but using one or the other is important for
several functions including discover_signatures,
predict_exposure, and create_umap, as these functions
use stochastic processes that may produce different results when run
multiple times with the same settings.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [4] musicatk_2.1.0 NMF_0.28 Biobase_2.67.0
## [7] BiocGenerics_0.53.3 generics_0.1.3 cluster_2.1.6
## [10] rngtools_1.5.2 registry_0.5-1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3
## [2] sys_3.4.3
## [3] jsonlite_1.8.9
## [4] shape_1.4.6.1
## [5] magrittr_2.0.3
## [6] GenomicFeatures_1.59.1
## [7] farver_2.1.2
## [8] rmarkdown_2.29
## [9] GlobalOptions_0.1.2
## [10] BiocIO_1.17.1
## [11] zlibbioc_1.52.0
## [12] vctrs_0.6.5
## [13] memoise_2.0.1
## [14] matrixTests_0.2.3
## [15] Rsamtools_2.23.1
## [16] RCurl_1.98-1.16
## [17] rstatix_0.7.2
## [18] htmltools_0.5.8.1
## [19] S4Arrays_1.7.1
## [20] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [21] curl_6.0.1
## [22] broom_1.0.7
## [23] Formula_1.2-5
## [24] SparseArray_1.7.2
## [25] sass_0.4.9
## [26] bslib_0.8.0
## [27] htmlwidgets_1.6.4
## [28] plyr_1.8.9
## [29] plotly_4.10.4
## [30] cachem_1.1.0
## [31] buildtools_1.0.0
## [32] GenomicAlignments_1.43.0
## [33] lifecycle_1.0.4
## [34] pkgconfig_2.0.3
## [35] Matrix_1.7-1
## [36] R6_2.5.1
## [37] fastmap_1.2.0
## [38] GenomeInfoDbData_1.2.13
## [39] MatrixGenerics_1.19.0
## [40] clue_0.3-66
## [41] digest_0.6.37
## [42] colorspace_2.1-1
## [43] AnnotationDbi_1.69.0
## [44] maftools_2.23.0
## [45] S4Vectors_0.45.2
## [46] irlba_2.3.5.1
## [47] crosstalk_1.2.1
## [48] GenomicRanges_1.59.1
## [49] RSQLite_2.3.8
## [50] ggpubr_0.6.0
## [51] labeling_0.4.3
## [52] fansi_1.0.6
## [53] httr_1.4.7
## [54] abind_1.4-8
## [55] compiler_4.4.2
## [56] bit64_4.5.2
## [57] withr_3.0.2
## [58] backports_1.5.0
## [59] BiocParallel_1.41.0
## [60] carData_3.0-5
## [61] DBI_1.2.3
## [62] ggsignif_0.6.4
## [63] MASS_7.3-61
## [64] DelayedArray_0.33.2
## [65] rjson_0.2.23
## [66] gtools_3.9.5
## [67] DNAcopy_1.81.0
## [68] tools_4.4.2
## [69] MCMCprecision_0.4.0
## [70] glue_1.8.0
## [71] restfulr_0.0.15
## [72] grid_4.4.2
## [73] gridBase_0.4-7
## [74] reshape2_1.4.4
## [75] gtable_0.3.6
## [76] BSgenome_1.75.0
## [77] tidyr_1.3.1
## [78] data.table_1.16.2
## [79] car_3.1-3
## [80] utf8_1.2.4
## [81] XVector_0.47.0
## [82] RcppAnnoy_0.0.22
## [83] ggrepel_0.9.6
## [84] pillar_1.9.0
## [85] stringr_1.5.1
## [86] circlize_0.4.16
## [87] splines_4.4.2
## [88] dplyr_1.1.4
## [89] lattice_0.22-6
## [90] survival_3.7-0
## [91] rtracklayer_1.67.0
## [92] bit_4.5.0
## [93] tidyselect_1.2.1
## [94] BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [95] ComplexHeatmap_2.23.0
## [96] maketools_1.3.1
## [97] Biostrings_2.75.1
## [98] knitr_1.49
## [99] gridExtra_2.3
## [100] IRanges_2.41.1
## [101] SummarizedExperiment_1.37.0
## [102] stats4_4.4.2
## [103] xfun_0.49
## [104] matrixStats_1.4.1
## [105] stringi_1.8.4
## [106] UCSC.utils_1.3.0
## [107] lazyeval_0.2.2
## [108] yaml_2.3.10
## [109] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
## [110] evaluate_1.0.1
## [111] codetools_0.2-20
## [112] tibble_3.2.1
## [113] BiocManager_1.30.25
## [114] cli_3.6.3
## [115] uwot_0.2.2
## [116] munsell_0.5.1
## [117] jquerylib_0.1.4
## [118] Rcpp_1.0.13-1
## [119] GenomeInfoDb_1.43.2
## [120] tidyverse_2.0.0
## [121] png_0.1-8
## [122] XML_3.99-0.17
## [123] ggplot2_3.5.1
## [124] blob_1.2.4
## [125] bitops_1.0-9
## [126] viridisLite_0.4.2
## [127] VariantAnnotation_1.53.0
## [128] scales_1.3.0
## [129] combinat_0.0-8
## [130] purrr_1.0.2
## [131] crayon_1.5.3
## [132] GetoptLong_1.0.5
## [133] rlang_1.1.4
## [134] cowplot_1.1.3
## [135] KEGGREST_1.47.0
## [136] conclust_1.1