--- title: "scMitoMut demo: CRC dataset" author: - name: Wenjie Sun package: scMitoMut output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{CRC_dataset_demo} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r echo=FALSE} library(knitr) opts_chunk$set(echo = TRUE, TOC = TRUE) ``` Install {scMitoMut} from Biocondcutor: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scMitoMut") ``` or from github: ```{r, eval=FALSE} if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes") remotes::install_github("wenjie1991/scMitoMut") ``` Load required packages. ```{r} library(scMitoMut) library(data.table) library(ggplot2) library(rhdf5) ``` # Overview ## Key functions Following are the key functions used in the scMitoMut package: - `parse_mgatk()`: Parses the mgatk output and saves the result in an H5 file. - `open_h5_file()`: Opens the H5 file and returns a "mtmutObj" object. - `subset_cell()`: Subsets the cells in the mtmutObj object. - `run_model_fit()`: Runs the model fitting and saves the results in the H5 file. - `filter_loc()`: Filters the mutations based on criterias. - `plot_heatmap()`: Plots a heatmap for p-values, allele frequencies, or binary mutation status. - `export_df()`, `export_af()`, `export_pval()`, and `export_binary()`: Export the mutation data in `data.frame` format and the allele frequency, p-value, and binary mutation status in `data.matrix` format. ## Key conceptions - Somatic mutation: the mutation that occurs in cells after fertilizatoin. - AF or VAF: allele frequency (AF) or variant allele frequency (VAF). It is the ratio of the number of reads supporting the variant allele to the total number of reads covering that locus. **IMPORTANT**: In this vignette, I used the term "mutation" to refer to the lineage-related somatic mutation. For each mutation, I used the dominant allele as the reference allele. If the reference allele frequency is significantly (FDR < 0.05) lower, I will call the **locus** a mutation. ## Background Single-cell genomics technology serves as a powerful tool for gaining insights into cellular heterogeneity and diversity within complex tissues. Mitochondrial DNA (mtDNA) is characterized by its small size and the presence of multiple copies within a cell. These attributes contribute to achieving robust mtDNA sequencing coverage and depth in single-cell sequencing data, thereby enhancing the detection of somatic mutations without dropout. In this vignette, the scMitoMut package is used to identify and visualize lineage-related mtDNA single nucleic somatic mutations. In the following analysis, scMitoMut was used to analyze the allele count data, which is the output of [mgatk](https://github.com/caleblareau/mgatk). Only a few loci have been selected for demonstration purposes to reduce file size and run time. The full dataset can be access from {Signac} [vignette](https://github.com/stuart-lab/signac/blob/1.9.0/vignettes/mito.Rmd). # Loading data We load the allele count table with the `parse_table` function. The allele count table consists with following columns: 1. `loc`: the locus name 2. `cell_barcode`: the cell barcode in single-cell sequencing data 3. `fwd_depth`: the forward read count of the allele 4. `rev_depth`: the reverse read count of the allele 5. `alt`: the allele name 6. `coverage`: the read count covering the locus 7. `ref`: the reference allele name Instead of using the table above as input, the output from `mgatk` can also be directly read using the `parse_mgatk` function. Using the `parse_table` function or `parse_mgatk` function, the allele count data are read and saved into an `H5` file. The `H5` file works as a database, which does not occupy memory, and data can be randomly accessed by querying. It helps with better memory usage and faster loading. The process may take some minutes. The return value is the `H5` file path. ```{r, eval=TRUE} ## Load the allele count table f <- system.file("extdata", "mini_dataset.tsv.gz", package = "scMitoMut") f_h5_tmp <- tempfile(fileext = ".h5") f_h5 <- parse_table(f, h5_file = f_h5_tmp) ``` ```{r, eval=TRUE} f_h5 ``` After obtaining the `H5` file, the `open_h5_file` function can be utilized to load it, resulting in an object referred to as "mtmutObj". **Detail**: On this step, the `mtmutObj` has 6 slots - `h5f` is the `H5` file handle - `mut_table` is the allele count table - `loc_list` is a list of available loci - `loc_selected` is the selected loci - `cell_list` is a list of available cell ids - `cell_selected` is the selected cell ids ```{r} ## Open the h5 file as a scMitoMut object x <- open_h5_file(f_h5) str(x) ## Show what's in the h5 file h5ls(x$h5f, recursive = FALSE) ``` # Selecting cells We are only interested in annotated good-quality cells. So we will select the cells with annotation, which are good quality cells. ```{r} f <- system.file("extdata", "mini_dataset_cell_ann.csv", package = "scMitoMut") cell_ann <- read.csv(f, row.names = 1) ## Subset the cells, the cell id can be found by colnames() for the Seurat object x <- subset_cell(x, rownames(cell_ann)) ``` After subsetting the cells, the `cell_selected` slot will be updated. Only the selected cells will be used in the following p-value calculation. ```{r} head(x$cell_selected) ``` Similarly, we can select loci by using the `subset_locus` function. It saves time when we only focus on a few loci. # Calculating mutation p-value We built an null-hypothesis that there are not lineage-related mutation for specific locus in all cells. Then we fit the allele frequency distribution with beta-binomial distribution and calculate the probability of observing allele frequency for a specific locus in a cell. If the probability is small, we can reject the null hypothesis and conclude that there is a mutation for that locus in the cell. To calculate the probability value (p-value), we run `run_calling` function, which has 2 arguments: - `mtmutObj` is the `scMitoMut` object - `mc.cores` is the number of CPU threads to be used The process will take some time. The output will be stored in the `pval` group of the `H5` file. The result is stored in the hard drive, instead of in memory. We don't need to re-run the mutation calling when loading the `H5` file next time. The mutation calling is based on beta-binomial distribution. The mutation p-value is the probability that with the null hypothesis: there are no mutations for that locus in the cell. **Detail**: For each locus, we calculate the p-value using the following steps. 1. Defining the wild-type allele as the allele with the highest median allele frequency among cells. 2. Fitting a 2 components binomial-mixture distribution as classifier to select the likely wild-type cells. We define the likely wild-type cells if it has a probability >= 0.001 to be the wild type. 3. Using those likely wild-type cells, we fit the beta-binomial model. 4. At last, based on the model, we calculate the p-value of observing the allele frequency of the wild-type allele in specific cell. ```{r, eval=TRUE} ## Run the model fitting run_model_fit(x, mc.cores = 1) ## The p-value is kept in the pval group of H5 file h5ls(x$h5f, recursive = FALSE) ``` # Filter mutations Then we will filter the mutations by using the `mut_filter` function with the following criteria: - The mutation has at least 5 cells mutant. - The FDR (false discovery rate) adjusted p-value (mutation quality q-value) is less than 0.05. The output is a `data.frame` with 2 columns - `loc` is the locus - `mut_cell_n` is the cell number We can see that there are 12 loci after filtering. **Detail**: The `mut_filter` function has 4 arguments: - `mtmutObj` is the `mtmutObj` object - `min_cell` is the minimum number of mutant cells - `p_adj_method` is the method used to adjust the p-value. - `p_threshold` is the adjusted p-value (q-value) threshold ```{r} ## Filter mutation x <- filter_loc( mtmutObj = x, min_cell = 2, model = "bb", p_threshold = 0.01, p_adj_method = "fdr" ) x$loc_pass ``` # Visualization We will visualize the mutation by heatmap using the `plot_heatmap` function. It can draw a heatmap of q-value, allele frequency, or binarized mutation status. Its input is the `mtmutObj` object. It will independently apply all the filters we used in the `mut_filter` function, and select the cells and loci that pass the filter criteria. In all kinds of figures, the mutation status will be calculated, and the loci and cells are ordered by the mutation status. **Detail**: The `plot_heatmap` arguments. - `mtmutObj` is the `scMitoMut` object. - `pos_list` is the list of loci. - `cell_ann` is the cell annotation. - `ann_colors` is the color of the cell annotation. - `type` is the type of the heatmap which can be `p`, `af`, or `binary`. - `p_adj_method` is the method used to adjust the p-value. - `p_threshold` is the adjusted p-value threshold to determine if a cell has mutation when selecting the cells and loci. - `min_cell_n` is the minimum number of cells that have mutation when selecting the cells and loci. - `p_binary` is the adjusted p-value threshold to get the binary mutation status. - `percent_interp` is the percentage overlap threshold between mutations, to determine if two mutations are correlated for interpolating the mutation status - `n_interp` is the number of overlapped cells to determine if two mutations are correlated for interpolating. The interpolation is based on the assumption that the mutation are unique, it is rare to have two mutation in the same population. Therefore, when two mutations are correlated, one of them is likely a subclone of the other one. The interpolation is utilized primarily for the purpose of ordering cells during visualization. ## Binary heatmap The binary heatmap displays the mutation status of each cell corresponding to each locus. The color red suggests the presence of a mutant, whereas blue indicates its absence, and white denotes a missing value. ```{r, fig.width = 12} ## Prepare the color for cell annotation colors <- c( "Cancer Epi" = "#f28482", Blood = "#f6bd60" ) ann_colors <- list("SeuratCellTypes" = colors) ``` ```{r, fig.width = 12} ## binary heatmap plot_heatmap(x, cell_ann = cell_ann, ann_colors = ann_colors, type = "binary", percent_interp = 0.2, n_interp = 3 ) ``` Also we can turn off the interpolation by setting `percent_interp = 1`. ```{r, fig.width = 12} ## binary heatmap plot_heatmap(x, cell_ann = cell_ann, ann_colors = ann_colors, type = "binary", percent_interp = 1, n_interp = 3 ) ``` ## P value heatmap The p-value heatmap illustrates the adjusted p-values for each cell corresponding to each locus. The arrangement of the cells and loci is based on their binary mutation status. ```{r, fig.width = 12} ## p value heatmap plot_heatmap(x, cell_ann = cell_ann, ann_colors = ann_colors, type = "p", percent_interp = 0.2, n_interp = 3 ) ``` ## AF heatmap The allele frequency heatmap illustrates the allele frequency of each cell at each locus. The order of the cells and loci are determined by the mutation status too. ```{r, fig.width = 12} ## allele frequency heatmap plot_heatmap(x, cell_ann = cell_ann, ann_colors = ann_colors, type = "af", percent_interp = 0.2, n_interp = 3 ) ``` # Exporting mutation data We can export the mutation data by using the following functions: - `export_df` export the mutation data as a `data.frame` - `export_af` export the AF data as a `data.matrix` with loci as row names and cells as column names. - `export_pval` export the p-value data as a `data.matrix` with loci as row names and cells as column names. - `export_binary` export the mutation status data as a `data.matrix` with loci as row names and cells as column names. Those functions have the same filtering options as the `plot_heatmap` function. ```{r} ## Export the mutation data as data.frame m_df <- export_df(x) m_df[1:10, ] ## Export allele frequency data as data.matrix export_af(x)[1:5, 1:5] ## Export p-value data as data.matrix export_pval(x)[1:5, 1:5] ## Export binary mutation status data as data.matrix export_binary(x)[1:5, 1:5] ``` # Show the p value, af plot versus cell types Lastly, we try to show the distribution of p value and allele frequency value versus cell types. ```{r, fig.width = 5} ## The `m_df` is exported using the `export_df` function above. m_dt <- data.table(m_df) m_dt[, cell_type := cell_ann[as.character(m_dt$cell_barcode), "SeuratCellTypes"]] m_dt_sub <- m_dt[loc == "chrM.1227"] m_dt_sub[, sum((pval) < 0.01, na.rm = TRUE), by = cell_type] m_dt_sub[, sum((1 - af) > 0.05, na.rm = TRUE), by = cell_type] ## The p value versus cell types ggplot(m_dt_sub) + aes(x = cell_type, y = -log10(pval), color = cell_type) + geom_jitter() + scale_color_manual(values = colors) + theme_bw() + geom_hline(yintercept = -log10(0.01), linetype = "dashed") + ylab("-log10(FDR)") ## The allele frequency versus cell types ggplot(m_dt_sub) + aes(x = cell_type, y = 1 - af, color = factor(cell_type)) + geom_jitter() + scale_color_manual(values = colors) + theme_bw() + geom_hline(yintercept = 0.05, linetype = "dashed") + ylab("1 - Dominant Allele Frequency") ## The p value versus allele frequency ggplot(m_dt_sub) + aes(x = -log10(pval), y = 1 - af, color = factor(cell_type)) + geom_point() + scale_color_manual(values = colors) + theme_bw() + geom_hline(yintercept = 0.05, linetype = "dashed") + geom_vline(xintercept = -log10(0.01), linetype = "dashed") + xlab("-log10(FDR)") + ylab("1 - Dominant Allele Frequency") ``` # Session Info ```{r} sessionInfo() ```