In this vignette, we demonstrate the ability of mitoClone2 to identify mitochondrial variants from single-cell sequencing data of a single or multiple individuals without any a priori knowledge of pre-existing variants by capitalizing on either an extensive exclusionlist (that removes known/potential false-positive signals) OR information shared across multiple individuals. We note here that of these methods, the second, is the most powerful. The latter strategy relies on the assumption that biological noise (e.g. RNA editing) and technical noise (arising during either library preparation, RNA sequencing, read processing, or alignment steps of the experiment) should be consistent across samples from different individuals. On this end, each individual sample can then act as essentially a technical/pseudo-replicate for identification and elimination the mentioned artifacts. In situations where a single sample from one individual is analyzed, we fall back on our standard filtering method using pre-constructed exclusionlists including information related to homo-polymers, repeat regions, RNA editing sites, and known sequencing artifacts.
A major untapped resource in identifying clonal populations in single-cells can be found in the mitochondrial reads. Often discarded or used a marker for bad cells, the high mutation rate and coverage of the mitochondria makes it a prime piece of real estate for detecting genetic differences among cells. Another important note of biological importance is that in many cases multiple mitochondria exist within individual cells, each in turn containing multiple copies of the mitochondrial genome. Thus the heterogeneity of mitochondria does not follow the standard Mendelian bifurcation into homo/heterozygous mutations. The term used to describe this phenomenon is heteroplasmy - which is defined as the presence of more than one organellar genome within a single cell.
The primary goal of this package is to extract and analyze mitochondrial variants from single-cell RNA-seq data. The method should also work equally well with single-cell ATAC/DNA-seq data. Although the majority of functions provided by the package should work with BAM files from bulk sequencing experiments, a portion of the core functionality will be lost.
There is built in compatibility with the hg38/hg19 human genomes and mm10 mouse genome. IMPORTANT: The mitochondrial genome sequence used in the UCSC hg19 reference is different than that used for NCBI’s GRCh37. GRCh37 uses the same reference as hg38/GRCh38 so make sure to check your relevant FASTA files to verify the expected length of your mitochondria. The lengths are as follows: UCSC hg38 (16569 bp), UCSC hg19 (16571 bp), and UCSC mm10 (16299 bp).
Before we can begin addressing the different aspects of filtering and
identifying bonafide variants, it is important to import your data in
the appropriate format. In our case, we rely most heavily on the
pre-existing structure defined by the deepSNV::bam2R()
function which produces 22 columns matrix with a number of rows
corresponding to ranges provided as input parameters. IMPORTANT: For the
majority of this vignette, the assumption is that we are working with
the entire mitochondrial genome of a given organism. You are free to
input any position by mutation matrix to the filtering and clustering
functions, but keep in mind certain parameters should be adjusted.
To generate your count tables we have provided two functions to do
so: Firstly, these tables can be created from a list of bam files using
the function baseCountsFromBamList
this function takes a
list of bam files which can be created with list.files()
.
This function assumes that each individual single cell is represented by
a single bam file. Alternatively, with our recent updates you can now
pull variants directly from a 10x Genomics multiplexed BAM file using
bam2R_10x
. This relies on using the CB
bam tag
present within the BAM file to distinguish individual cells. Two helpful
suggestions for this process are to initially subset BAM files for reads
located on the mitochondrial chromosome and also to deduplicate reads if
UMIs are present. You may also choose to count up the different
nucleotides using a different method. In such cases, the only
requirements for downstream mutation calling functions is that your
baseCounts
object be an R list
that contains 5
to 8 columns where rows represent genomic positions and columns
represent base calls (i.e. A,G,C,T,-,N,INS,DEL)
The basic function provided is
baseCountsFromBamList
which reads in BAM files from a list
and has four parameters:
bamfiles
= This is simply a list of bamfiles for
which you would like to count the number of nucleotides at each genomic
position from the sites
parameter.
sites
= These are the chromosome locations where you
will count variants from. The default is the hg38 mitochondrial genome
which starts at position 1 and has a length of 16569 bp.
ncores
= The number of threads used to count up
reads from your bamfiles. If you have thousands of input BAM files and
your system can handle it, reading in multiple BAM files at once will
help speed things up. Default: 1.
ignore_nonstandard
= Ignores basecalls that are not
an A, G, C, T, or N (unknown). A user can enable this if their reads
tend to have a lot of spliced reads (which are likely erroneous) or
insertions/deletions. Default: FALSE.
Note: Be aware that this function calls deepSNV::bam2R()
with default parameters.
## example using function for a list of BAM files
baseCounts <- baseCountsFromBamList(bamfiles = list(system.file("extdata",
"mm10_10x.bam", package = "mitoClone2")), sites = "chrM:1-15000", ncores = 1)
print(length(baseCounts))
## [1] 1
## A T C G - N INS DEL
## [1,] 4 395 0 0 0 14 0 0
## [2,] 398 0 0 0 0 22 0 0
## [3,] 383 0 0 0 0 36 0 0
## [4,] 381 1 0 1 0 36 0 0
## [5,] 402 0 0 0 0 23 0 0
## [6,] 0 0 406 0 0 15 0 0
The next function is for reading in a multiplexed
BAM file bam2R_10x
takes multiple parameters:
file
= This is a string containing the path for
which you would like to count the number of nucleotide at each genomic
position from the sites
parameter for each cell
barcode.
sites
= These are the chromosome locations where you
will count variants from. The default is the hg38 mitochondrial genome
which starts at position 1 and has a length of 16569 bp. Again, it is
very important that you provide the appropriate chromosome names
relative to your BAM file.
ncores
= The number of threads used to combine and
summarize all the counts from your BAM file. In contrast to the previous
function this parameter will not provide a dramatic speed boost.
Default: 1.
ignore_nonstandard
= Ignores basecalls that are not
an A, G, C, T, or N (unknown). A user can enable this if their reads
tend to have a lot of spliced reads (which are likely erroneous) or
insertions/deletions. Default: FALSE.
## example using function for a multiplexed BAM files cell barcodes
## found by extracting the 'CB' BAM tag
baseCounts <- bam2R_10x(file = system.file("extdata", "mm10_10x.bam", package = "mitoClone2"),
sites = "chrM:1-15000", ncores = 1)
print(length(baseCounts))
## [1] 6
## A T C G - N INS DEL
## [1,] 0 20 0 0 0 0 0 0
## [2,] 20 0 0 0 0 0 0 0
## [3,] 20 0 0 0 0 0 0 0
## [4,] 20 0 0 0 0 0 0 0
## [5,] 20 0 0 0 0 0 0 0
## [6,] 0 0 20 0 0 0 0 0
The next step of the analysis is to take your counts of each nucleotide in each cell and identify variants that are present. A major barrier in this process is the inherent noise present in single cell datasets. Although we will only briefly discuss the important aspects of filtering this is by far the most crucial step of the analysis. The prevalence of mitochondrial variants is highly variable, in some cases more or less depending on the dataset. It is wise to adjust the parameters in order to tailor them to each particular dataset - one size fits all does not apply here. It is critical to choose optimal parameters for a dataset and experimentation is often necessary.
The simplest way to identify variants, beyond simply enriching for those that are present in only a fraction of the population, is by removing detected variants that are likely to be caused by noise (as previously discussed). One conservative approach to accomplish this is to simply remove all variants that overlap with problematic sites. As a side note, this type of approach will decrease false positive hits but may incidentally increase false negative hits. This type of method is used when a you only have a single sample from one individual/timepoint.
The function mutationCallsFromExclusionlist
performs
filtering and then removes any variants on the exclusionlists.
Parameters:
lim.cov
= The minimum number of reads overlapping a
site in a given single cell for it to be considered covered. The default
value is 20.
min.af
= The minimal allele frequency for a site to
considered mutant in an single cell. Depending on the dataset this value
may be higher or lower as illustrated in the examples at the end of
vignette. The default value is 0.2.
universal.var.cells
= The number of cells which must
be mutant for a given variant to filtered out. In this case mutations
that are in the vast majority of cells, likely germline mutations, may
be removed. The default setting for this is 95% of all cells. However,
again this parameter should be adjusted based on the number and type of
cells in a given analysis.
min.af.universal
= The parameter used in conjunction
with the previous parameter which sets a threshold as to how high the
allele frequency must be in a cell to qualify it as mutant. By default
this parameter is identical to min.af
but if needed it can
be changed.
min.num.samples
= The number of cells that must be
considered mutant for a given variant for that variant to be included.
Again this value is highly dependent on the number and type of cells
present in a given dataset. Default is set to 1% of cells.
exclusionlists
= The list of exclusionlists to use
for filtering. The parameter should be a named list where each element
in the list is a GRanges object. The GRanges object should contain the
genomic coordinates of variants to be excluded from the analysis. By
default, the exclusionlist is made up of: three
which are
elements within stretches of homopolymers on the mitochondrial
chromosome of length 3 bp or greater, masked
which are
regions masked in the Refseq or UCSC genome releases (likely repetitive
elements), mutaseq
which are regions identified from a
previous study (these can be excluded in most cases), and
rnaEDIT
which are RNA editing sites from the REDIportal.
In all cases, these are relative to the ‘hg38’ genome
coordinates and should not be used with other genomes without
applying a liftOver step.
library(mitoClone2)
library(pheatmap)
## load our count data
load(system.file("extdata/example_counts.Rda", package = "mitoClone2"))
## investigate what our ideal baseCounts object looks like
print(head(example.counts[[1]]))
## A T C G N
## [1,] 1358 0 2 1 386
## [2,] 1745 0 1 2 309
## [3,] 507 0 0 0 47
## [4,] 7 9 3 50627 6075
## [5,] 9 10 1 70151 5843
## [6,] 54 20 42 57073 5373
## calling mutations using our exclusionlist
Example <- mutationCallsFromExclusionlist(example.counts, min.af = 0.05,
min.num.samples = 5, universal.var.cells = 0.5 * length(example.counts),
binarize = 0.1)
## Subsetting for specific cells...
## Extracting sum of all nonmutant alleles
Example.meta <- data.frame(row.names = rownames(Example.rawnames), Clone = gsub("_.*",
"", gsub("Donor1_", "", rownames(Example.rawnames))))
## showing the clustering via heatmap (notice only 20 mutations were
## kept)
clustered <- quick_cluster(Example, binarize = TRUE, drop_empty = TRUE,
clustering.method = "ward.D2", annotation_col = Example.meta, show_colnames = FALSE,
fontsize_row = 7)
Now (knowing the original parental colony from the publication) we apply cohort filtering to revealing similar results.
## Calling mutations using our known clonal information
Example <- mutationCallsFromCohort(example.counts, MINFRAC = 0.01, MINCELL = 8,
MINCELLS.PATIENT = 5, patient = Example.meta$Clone, sites = "chrM:1-55")
## Making sure 'sites' parameter is set correctly...
## Looks good. Using the UCSC hg38 genome as a reference for variants. Be wary, only mitochondrial annotations are available!
## check the variants for a specific cluster
print(colnames(getAlleleCount(Example$C112, "mutant")))
## Extracting sum of all mutant alleles
## [1] "X19C.A" "X25A.G" "X50T.G" "X54G.A"
private.mutations <- unique(unlist(lapply(Example[grep("^C", names(Example))][unlist(lapply(Example[grep("^C",
names(Example))], function(x) !is.null(x)))], function(y) names(getVarsCandidate(y)))))
private.mutations.df <- pullcountsVars(example.counts, gsub("X(\\d+)([AGCT])\\.([AGCT])",
"\\1 \\2>\\3", private.mutations))
private.mutations.df <- mutationCallsFromMatrix(t(private.mutations.df$M),
t(private.mutations.df$N), cluster = rep(TRUE, length(private.mutations)))
## Showing the clustering looks via heatmap
clustered <- quick_cluster(private.mutations.df, binarize = TRUE, drop_empty = TRUE,
clustering.method = "ward.D2", annotation_col = Example.meta, show_colnames = FALSE,
fontsize_row = 7)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pheatmap_1.0.12 mitoClone2_1.13.0 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] farver_2.1.2 blob_1.2.4
## [3] Biostrings_2.75.3 bitops_1.0-9
## [5] fastmap_1.2.0 RCurl_1.98-1.16
## [7] VariantAnnotation_1.53.0 GenomicAlignments_1.43.0
## [9] XML_3.99-0.17 digest_0.6.37
## [11] lifecycle_1.0.4 KEGGREST_1.47.0
## [13] RSQLite_2.3.9 magrittr_2.0.3
## [15] compiler_4.4.2 rlang_1.1.4
## [17] sass_0.4.9 tools_4.4.2
## [19] yaml_2.3.10 rtracklayer_1.67.0
## [21] Rhtslib_3.3.1 knitr_1.49
## [23] S4Arrays_1.7.1 bit_4.5.0.1
## [25] curl_6.0.1 DelayedArray_0.33.3
## [27] plyr_1.8.9 RColorBrewer_1.1-3
## [29] abind_1.4-8 BiocParallel_1.41.0
## [31] BiocGenerics_0.53.3 sys_3.4.3
## [33] grid_4.4.2 stats4_4.4.2
## [35] colorspace_2.1-1 ggplot2_3.5.1
## [37] scales_1.3.0 SummarizedExperiment_1.37.0
## [39] cli_3.6.3 crayon_1.5.3
## [41] generics_0.1.3 httr_1.4.7
## [43] reshape2_1.4.4 rjson_0.2.23
## [45] DBI_1.2.3 cachem_1.1.0
## [47] stringr_1.5.1 zlibbioc_1.52.0
## [49] splines_4.4.2 parallel_4.4.2
## [51] AnnotationDbi_1.69.0 formatR_1.14
## [53] XVector_0.47.0 restfulr_0.0.15
## [55] matrixStats_1.4.1 vctrs_0.6.5
## [57] Matrix_1.7-1 jsonlite_1.8.9
## [59] VGAM_1.1-12 IRanges_2.41.2
## [61] S4Vectors_0.45.2 bit64_4.5.2
## [63] deepSNV_1.53.0 GenomicFeatures_1.59.1
## [65] maketools_1.3.1 jquerylib_0.1.4
## [67] glue_1.8.0 codetools_0.2-20
## [69] stringi_1.8.4 gtable_0.3.6
## [71] GenomeInfoDb_1.43.2 GenomicRanges_1.59.1
## [73] BiocIO_1.17.1 UCSC.utils_1.3.0
## [75] munsell_0.5.1 tibble_3.2.1
## [77] pillar_1.10.0 htmltools_0.5.8.1
## [79] GenomeInfoDbData_1.2.13 BSgenome_1.75.0
## [81] R6_2.5.1 evaluate_1.0.1
## [83] lattice_0.22-6 Biobase_2.67.0
## [85] png_0.1-8 Rsamtools_2.23.1
## [87] memoise_2.0.1 bslib_0.8.0
## [89] Rcpp_1.0.13-1 SparseArray_1.7.2
## [91] xfun_0.49 MatrixGenerics_1.19.0
## [93] buildtools_1.0.0 pkgconfig_2.0.3