An introduction to miQC

Installation

To install the package, please use the following.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("miQC")

Introduction

This vignette provides a basic example of how to run miQC, which allows users to perform cell-wise filtering of single-cell RNA-seq data for quality control. Single-cell RNA-seq data is very sensitive to tissue quality and choice of experimental workflow; it’s critical to ensure compromised cells and failed cell libraries are removed. A high proportion of reads mapping to mitochondrial DNA is one sign of a damaged cell, so most analyses will remove cells with mtRNA over a certain threshold, but those thresholds can be arbitrary and/or detrimentally stringent, especially for archived tumor tissues. miQC jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to identify the low-quality cells in a given dataset.

For more information about the statistical background of miQC and for biological use cases, consult the miQC paper (Hippen et al. 2021).

You’ll need the following packages installed to run this tutorial:

suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scRNAseq)
    library(scater)
    library(flexmix)
    library(splines)
    library(miQC)
})

Example data

To demonstrate how to run miQC on a single-cell RNA-seq dataset, we’ll use data from mouse brain cells, originating from an experiment by Zeisel et al (Zeisel et al. 2015), and available through the Bioconductor package scRNAseq.

sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

Scater preprocessing

In order to calculate the percent of reads in a cell that map to mitochondrial genes, we first need to establish which genes are mitochondrial. For genes listed as HGNC symbols, this is as simple as searching for genes starting with mt-. For other IDs, we recommend using a biomaRt query to map to chromosomal location and identify all mitochondrial genes.

mt_genes <- grepl("^mt-",  rownames(sce))
feature_ctrls <- list(mito = rownames(sce)[mt_genes])

feature_ctrls
## $mito
##  [1] "mt-Tl1"  "mt-Tn"   "mt-Tr"   "mt-Tq"   "mt-Ty"   "mt-Tk"   "mt-Ta"  
##  [8] "mt-Tf"   "mt-Tp"   "mt-Tc"   "mt-Td"   "mt-Tl2"  "mt-Te"   "mt-Tv"  
## [15] "mt-Tg"   "mt-Tt"   "mt-Tw"   "mt-Tm"   "mt-Ti"   "mt-Nd3"  "mt-Nd6" 
## [22] "mt-Nd4"  "mt-Atp6" "mt-Nd2"  "mt-Nd5"  "mt-Nd1"  "mt-Co3"  "mt-Cytb"
## [29] "mt-Atp8" "mt-Co2"  "mt-Co1"  "mt-Rnr2" "mt-Rnr1" "mt-Nd4l"

miQC is designed to be run with the Bioconductor package scater, which has a built-in function addPerCellQC to calculate basic QC metrics like number of unique genes detected per cell and total number of reads. When we pass in our list of mitochondrial genes, it will also calculate percent mitochondrial reads.

sce <- addPerCellQC(sce, subsets = feature_ctrls)
head(colData(sce))
## DataFrame with 6 rows and 21 columns
##                     tissue   group # total mRNA mol      well       sex
##                <character> <numeric>      <numeric> <numeric> <numeric>
## 1772071015_C02    sscortex         1          21580        11         1
## 1772071017_G12    sscortex         1          21748        95        -1
## 1772071017_A05    sscortex         1          31642        33        -1
## 1772071014_B06    sscortex         1          32916        42         1
## 1772067065_H06    sscortex         1          21531        48         1
## 1772071017_E02    sscortex         1          24799        13        -1
##                      age  diameter  level1class level2class       sum  detected
##                <numeric> <numeric>  <character> <character> <numeric> <integer>
## 1772071015_C02        21      0.00 interneurons       Int10     22354      4871
## 1772071017_G12        20      9.56 interneurons       Int10     22869      4712
## 1772071017_A05        20     11.10 interneurons        Int6     32594      6055
## 1772071014_B06        21     11.70 interneurons       Int10     33525      5852
## 1772067065_H06        25     11.00 interneurons        Int9     21694      4724
## 1772071017_E02        20     11.90 interneurons        Int9     25919      5427
##                subsets_mito_sum subsets_mito_detected subsets_mito_percent
##                       <numeric>             <integer>            <numeric>
## 1772071015_C02              774                    23             3.462468
## 1772071017_G12             1121                    27             4.901832
## 1772071017_A05              952                    27             2.920783
## 1772071014_B06              611                    28             1.822521
## 1772067065_H06              164                    23             0.755969
## 1772071017_E02             1122                    19             4.328871
##                altexps_repeat_sum altexps_repeat_detected
##                         <numeric>               <numeric>
## 1772071015_C02               8181                     419
## 1772071017_G12              11854                     480
## 1772071017_A05              18021                     582
## 1772071014_B06              13955                     512
## 1772067065_H06               6876                     363
## 1772071017_E02              17364                     618
##                altexps_repeat_percent altexps_ERCC_sum altexps_ERCC_detected
##                             <numeric>        <numeric>             <numeric>
## 1772071015_C02                21.9677             6706                    43
## 1772071017_G12                28.8012             6435                    46
## 1772071017_A05                31.6435             6335                    47
## 1772071014_B06                25.5999             7032                    43
## 1772067065_H06                19.9299             5931                    39
## 1772071017_E02                34.7600             6671                    43
##                altexps_ERCC_percent     total
##                           <numeric> <numeric>
## 1772071015_C02              18.0070     37241
## 1772071017_G12              15.6349     41158
## 1772071017_A05              11.1238     56950
## 1772071014_B06              12.8999     54512
## 1772067065_H06              17.1908     34501
## 1772071017_E02              13.3543     49954

miQC

Basic usage

Using the QC metrics generated via scater, we can use the plotMetrics function to visually inspect the quality of our dataset.

plotMetrics(sce)

We can see that most cells have a fairly low proportion of mitochondrial reads, given that the graph is much denser at the bottom. We likely have many cells that are intact and biologically meaningful. There are also a few cells that have almost half of their reads mapping to mitochondrial genes, which are likely broken or otherwise compromised and we will want to exclude from our downstream analysis. However, it’s not clear what boundaries to draw to separate the two groups of cells. With that in mind, we’ll generate a linear mixture model using the mixtureModel function.

model <- mixtureModel(sce)

This function is a wrapper for flexmix, which fits a mixture model on our data and returns the parameters of the two lines that best fit the data, as well as the posterior probability of each cell being derived from each distribution.

We can look at the parameters and posterior values directly with the functions

parameters(model)
##                        Comp.1        Comp.2
## coef.(Intercept) 26.900519082  9.5053786015
## coef.detected    -0.003756534 -0.0008346818
## sigma             6.489944246  3.3060823263
head(posterior(model))
##            [,1]      [,2]
## [1,] 0.10540275 0.8945973
## [2,] 0.09952611 0.9004739
## [3,] 0.12845440 0.8715456
## [4,] 0.14688077 0.8531192
## [5,] 0.14376299 0.8562370
## [6,] 0.11493152 0.8850685

Or we can visualize the model results using the plotModel function:

plotModel(sce, model)

As expected, the cells at the very top of the graph are almost certainly compromised, most likely to have been derived from the distribution with fewer unique genes and higher baseline mitochondrial expression.

We can use these posterior probabilities to choose which cells to keep, and visualize the consequences of this filtering with the plotFiltering function.

plotFiltering(sce, model)

To actually perform the filtering and remove the indicated cells from our SingleCellExperiment object, we can run the filterCells parameter.

sce <- filterCells(sce, model)
## Removing 359 out of 3005 cells.
sce
## class: SingleCellExperiment 
## dim: 20006 2646 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(2): featureType subsets_mito
## colnames(2646): 1772071015_C02 1772071017_G12 ... 1772066097_D04
##   1772066098_A12
## colData names(22): tissue group # ... total prob_compromised
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

Other model types

In most cases, a linear mixture model will be satisfactory as well as simpler, but miQC also supports some non-linear mixture models: currently polynomials and b-splines. A user should only need to change the model_type parameter when making the model, and all visualization and filtering functions will work the same as with a linear model.

sce <- ZeiselBrainData()
sce <- addPerCellQC(sce, subsets = feature_ctrls)

model2 <- mixtureModel(sce, model_type = "spline")
plotModel(sce, model2)

plotFiltering(sce, model2)

model3 <- mixtureModel(sce, model_type = "polynomial")
plotModel(sce, model3)

plotFiltering(sce, model3)

miQC is designed to combine information about mitochondrial percentage and library complexity (number of genes discovered) to make filtering decisions, but if an even simpler model is preferred, miQC can make a model based only on mitochondrial information. This can be done by changing the model_type parameter to “one_dimensional”, which runs a 1D gaussian mixture model. When library size is not added to the model, it is possible to calculate a single mitochondrial threshold to apply, which can be directly calculated using the get1DCutoff function.

model4 <- mixtureModel(sce, model_type = "one_dimensional")
plotModel(sce, model4)

plotFiltering(sce, model4)

get1DCutoff(sce, model4)
## [1] 14.57058

Extras

Changing posterior cutoff

miQC defaults to removing any cell with 75% or greater posterior probability of being compromised, but if we want to be more or less stringent, we can alter the posterior_cutoff parameter, like so:

plotFiltering(sce, model4, posterior_cutoff = 0.9)

Note that when performing miQC multiple times on different samples for the same experiment, it’s recommended to select the same posterior_cutoff for all, to give consistency in addition to the flexibility of sample-specific models.

Preventing exclusion of low-mito cells

miQC includes two parameters to accomodate unusual and undesired behavior in the linear distributions. These issues are especially visible in some cancer datasets with a stringent posterior cutoff. We’ve included a bare-bones version of QC data for a high-grade serous ovarian tumor (full version of the data is available at GEO, accession GSM4816047).

data("hgsoc_metrics")
sce <- SingleCellExperiment(colData = metrics)
model <- mixtureModel(sce)
plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = FALSE,
                keep_all_below_boundary = FALSE)

The first issue is the group of cells at the bottom of the distribution getting marked for removal. These cells happen to be near the x-intercept of the compromised cell line, which increases their posterior probability of being compromised. But since they have decent library complexity and a low mitochondrial percentage, so it doesn’t make biological sense to exclude them. When the keep_all_below_boundary parameter is set to True, as is the default, any cells below the intact cell line are kept:

plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = FALSE,
                keep_all_below_boundary = TRUE)

Preventing U-shaped boundary

The second issue is the U-shape in the boundary between kept and discarded cells. When this occurs, there will be cells at the bottom of the trough that are discarded, but some cells with less library complexity (farther left) and higher percentage of mitochondrial reads (higher) – meaning they are worse in both of our QC metrics – will be kept. To avoid this happening, the parameter enforce_left_cutoff will identify the cell marked for removal with the lowest mitochondrial percentage, determine its library complexity, and discard all cells with both lower complexity and higher mitochondrial percentage:

plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = TRUE,
                keep_all_below_boundary = TRUE)

This will make a de facto mitochondrial percentage cutoff for all cells with low library complexity, but will be more permissive for cells with high library complexity and high mitochondrial percentage, which are more likely to be intact cells with a biological reason for high mitochondrial expression than their low library complexity counterparts.

When not to use miQC

The miQC model is based on the assumption that there are a non-trivial number of compromised cells in the dataset, which is not true in all datasets. We recommend running plotMetrics on a dataset before running miQC to see if the two-distribution model is appropriate. Look for the distinctive triangular shape where cells have a wide variety of mitochondrial percentages at lower gene counts and taper off to lower mitochondrial percentage at higher gene counts, as can be seen in the Zeisel data.

For example of a dataset where there’s not a significant number of compromised cells, so the two-distribution assumption is not met, look at another dataset from the scRNAseq package, mouse data from Buettner et al (Buettner et al. 2015).

sce <- BuettnerESCData()
mt_genes <- grepl("^mt-", rowData(sce)$AssociatedGeneName)
feature_ctrls <- list(mito = rownames(sce)[mt_genes])
sce <- addPerCellQC(sce, subsets = feature_ctrls)

plotMetrics(sce)

The mixtureModel function will throw a warning if only one distribution is found, in which case no cells would be filtered. In these cases, we recommend using other filtering methods, such as a cutoff on mitochondrial percentage or identifying outliers using median absolute deviation (MAD).

Session Information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.29.1            AnnotationFilter_1.31.0     GenomicFeatures_1.57.1      AnnotationDbi_1.69.0       
##  [5] miQC_1.15.0                 flexmix_2.3-19              lattice_0.22-6              scater_1.33.4              
##  [9] ggplot2_3.5.1               scuttle_1.15.5              scRNAseq_2.19.1             SingleCellExperiment_1.27.2
## [13] SummarizedExperiment_1.35.5 Biobase_2.67.0              GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
## [17] IRanges_2.39.2              S4Vectors_0.43.2            BiocGenerics_0.53.0         MatrixGenerics_1.17.1      
## [21] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                jsonlite_1.8.9           magrittr_2.0.3           modeltools_0.2-23       
##   [5] ggbeeswarm_0.7.2         gypsum_1.1.6             farver_2.1.2             rmarkdown_2.28          
##   [9] BiocIO_1.17.0            zlibbioc_1.51.2          vctrs_0.6.5              memoise_2.0.1           
##  [13] Rsamtools_2.21.2         RCurl_1.98-1.16          htmltools_0.5.8.1        S4Arrays_1.5.11         
##  [17] AnnotationHub_3.15.0     curl_5.2.3               BiocNeighbors_2.1.0      Rhdf5lib_1.27.0         
##  [21] SparseArray_1.5.45       rhdf5_2.49.0             sass_0.4.9               alabaster.base_1.7.0    
##  [25] bslib_0.8.0              alabaster.sce_1.7.0      httr2_1.0.5              cachem_1.1.0            
##  [29] buildtools_1.0.0         GenomicAlignments_1.41.0 mime_0.12                lifecycle_1.0.4         
##  [33] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-1             R6_2.5.1                
##  [37] fastmap_1.2.0            GenomeInfoDbData_1.2.13  digest_0.6.37            colorspace_2.1-1        
##  [41] irlba_2.3.5.1            ExperimentHub_2.13.1     RSQLite_2.3.7            beachmat_2.23.0         
##  [45] labeling_0.4.3           filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
##  [49] abind_1.4-8              compiler_4.4.1           bit64_4.5.2              withr_3.0.2             
##  [53] BiocParallel_1.41.0      viridis_0.6.5            DBI_1.2.3                highr_0.11              
##  [57] HDF5Array_1.33.8         alabaster.ranges_1.7.0   alabaster.schemas_1.7.0  rappdirs_0.3.3          
##  [61] DelayedArray_0.33.1      rjson_0.2.23             tools_4.4.1              vipor_0.4.7             
##  [65] beeswarm_0.4.0           nnet_7.3-19              glue_1.8.0               restfulr_0.0.15         
##  [69] rhdf5filters_1.17.0      grid_4.4.1               generics_0.1.3           gtable_0.3.6            
##  [73] BiocSingular_1.23.0      ScaledMatrix_1.13.0      utf8_1.2.4               XVector_0.45.0          
##  [77] ggrepel_0.9.6            BiocVersion_3.21.1       pillar_1.9.0             dplyr_1.1.4             
##  [81] BiocFileCache_2.15.0     rtracklayer_1.65.0       bit_4.5.0                tidyselect_1.2.1        
##  [85] maketools_1.3.1          Biostrings_2.75.0        knitr_1.48               gridExtra_2.3           
##  [89] ProtGenerics_1.37.1      xfun_0.48                UCSC.utils_1.1.0         lazyeval_0.2.2          
##  [93] yaml_2.3.10              evaluate_1.0.1           codetools_0.2-20         tibble_3.2.1            
##  [97] alabaster.matrix_1.7.0   BiocManager_1.30.25      cli_3.6.3                munsell_0.5.1           
## [101] jquerylib_0.1.4          Rcpp_1.0.13              dbplyr_2.5.0             png_0.1-8               
## [105] XML_3.99-0.17            parallel_4.4.1           blob_1.2.4               bitops_1.0-9            
## [109] viridisLite_0.4.2        alabaster.se_1.7.0       scales_1.3.0             purrr_1.0.2             
## [113] crayon_1.5.3             rlang_1.1.4              KEGGREST_1.45.1

References

Buettner, Florian, Kedar N. Natarajan, F. Paolo Casale, Valentina Proserpio, Antonio Scialdone, Fabian J. Theis, Sarah A. Teichmann, John C. Marioni, and Oliver Stegle. 2015. “Computational Analysis of Cell-to-Cell Heterogeneity in Single-Cell RNA-Sequencing Data Reveals Hidden Subpopulations of Cells.” Nature Biotechnology 33 (2): 155–60. https://doi.org/10.1038/nbt.3102.
Hippen, Ariel A., Matias M. Falco, Lukas M. Weber, Erdogan Pekcan Erkan, Kaiyang Zhang, Jennifer Anne Doherty, Anna Vähärautio, Casey S. Greene, and Stephanie C. Hicks. 2021. miQC: An Adaptive Probabilistic Framework for Quality Control of Single-Cell RNA-Sequencing Data.” PLOS Computational Biology 17 (8): e1009290. https://doi.org/10.1371/journal.pcbi.1009290.
Zeisel, Amit, Ana B. Muñoz-Manchado, Simone Codeluppi, Peter Lönnerberg, Gioele La Manno, Anna Juréus, Sueli Marques, et al. 2015. “Brain Structure. Cell Types in the Mouse Cortex and Hippocampus Revealed by Single-Cell RNA-Seq.” Science (New York, N.Y.) 347 (6226). https://doi.org/10.1126/science.aaa1934.