Sample Quality Check for NGS Data using SeqSQC package

Quick start

The QC process in SeqSQC included five different steps: missing rate check, sex check, inbreeding coefficient check, Identity-by-descent (IBD) check, and population outlier check. Problematic samples identified in each step could be removed from later steps. The entire sample level QC was wrapped up in one function: sampleQC. By executing this function, a problematic sample list with criteria from each QC step as well as a QC report with interactive plots in html format will be generated.

Here we use an exemplar NGS dataset as a study cohort to demonstrate the execution of the wrap-up function sampleQC. The example dataset, with four EUR (European) samples and one AFR (African) sample, is assembled from the 1000 Genomes Project. We labeled the one AFR sample as EUR to mimic a population outlier. Samples from the 1000 Genome Project are whole-genome sequencing dataset. To mimic whole exome sequencing (WES) data, we kept in the vcf file only the variants in capture regions of Agilent SureSelect Human Exon v5, one of the most popular WES capture platforms to date.

The code chunk below assumes that you have a vcf file called infile with samples from a single population, a sample annotation file called sample.annot with sample name, population (e.g., AFR, EUR, EAS (East Asian), SAS (South Asian), or ASN (Asian)) and gender info (male/female), and a bed file called cr which contains capture regions. User need to specify the name for output files, incuding the gds file containing genotypes, and the SeqSQC(discussed in section below) object with QC results.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqSQC")
library(SeqSQC)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## Loading required package: SNPRelate
## Loading required package: gdsfmt
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning: replacing previous import 'S4Vectors::rename' by 'plotly::rename' when
## loading 'SeqSQC'
## Warning: replacing previous import 'IRanges::slice' by 'plotly::slice' when
## loading 'SeqSQC'
## Warning: replacing previous import 'ggplot2::last_plot' by 'plotly::last_plot'
## when loading 'SeqSQC'

The wrap-up function sampleQC transforms the vcf file into SeqSQC object, or take the SeqSQC object generated from LoadVfile directly as input, evaluates all sample QC steps (as a convenient all inclusive method), outputs the QC results into your designated directory, and generates a sample QC report in html format. We recommend saving the output SeqSQC object into an RData.

This is an example when we have input as vcf file, sample annotation file, and capture region bed file. Note that our example vcf file only include variants in chromosome 1, and the capture region only include the

infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outdir <- tempdir()
outfile <- file.path(outdir, "testWrapUp")
seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = FALSE)
save(seqfile, file="seqfile.RData")

This is an example when we directly use SeqSQC object as input, and evaluate all sample QC steps.

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))

seqfile <- sampleQC(vfile = seqfile, output = "testWrapUp", QCreport = TRUE)
save(seqfile, file="seqfile.Rdata")

Detailed descriptions of data structure and functionality usage of SeqSQC are provided as below. Users are recommended to use these specific QC functions for each QC step if they want to skip any of them by using sampleQC().

Data preparation

Input data

SeqSQC takes VCF file as input.

seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot)
## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 5
## Parsing "/tmp/RtmpNhK8PI/Rinst2bec4c2de841/SeqSQC/extdata/example_sub.vcf" ...
##  import 1000 variants.
## + genotype   { Bit2 5x1000, 1.2K } *
## SNP genotypes: 5 samples, 1000 SNPs
## Genotype matrix is being transposed ...
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpDKL1em/file2ca247e3c3b8' (12.4K)
##     # of fragments: 48
##     save to '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp'
##     rename '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp' (10.7K, reduced: 1.6K)
##     # of fragments: 20
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpDKL1em/file2ca2613ea3f9' (360.6M)
##     # of fragments: 36
##     save to '/tmp/RtmpDKL1em/file2ca2613ea3f9.tmp'
##     rename '/tmp/RtmpDKL1em/file2ca2613ea3f9.tmp' (279.2K, reduced: 360.4M)
##     # of fragments: 31
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpDKL1em/file2ca247e3c3b8' (14.9K)
##     # of fragments: 44
##     save to '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp'
##     rename '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp' (7.8K, reduced: 7.1K)
##     # of fragments: 31
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpDKL1em/file2ca247e3c3b8' (7.8K)
##     # of fragments: 31
##     save to '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp'
##     rename '/tmp/RtmpDKL1em/file2ca247e3c3b8.tmp' (7.8K, reduced: 0B)
##     # of fragments: 31
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpDKL1em/testWrapUp.gds' (1.4M)
##     # of fragments: 39
##     save to '/tmp/RtmpDKL1em/testWrapUp.gds.tmp'
##     rename '/tmp/RtmpDKL1em/testWrapUp.gds.tmp' (858.5K, reduced: 557.1K)
##     # of fragments: 31
## SNP pruning based on LD:
## Excluding 0 SNP on non-autosomes
## Excluding 2,879 SNPs (monomorphic: TRUE, MAF: 0.005, missing rate: 0.05)
##     # of samples: 92
##     # of SNPs: 11,443
##     using 1 thread
##     sliding window: 500,000 basepairs, Inf SNPs
##     |LD| threshold: 0.3
##     method: composite
## Chrom 1: |====================|====================|
##     29.63%, 4,243 / 14,322 (Thu Oct 31 05:20:44 2024)
## 4,243 markers are selected in total.

SeqSQC class

We define an object class SeqSQC to store the genotype data, variant / sample annotation data, and the sample QC results.

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
slotNames(seqfile)
## [1] "gdsfile"  "QCresult"

A SeqSQC object is a list of two objects. The first object is the filepath of the GDS (discussed in section below) file which stores the genotype information from the original VCF file.

gdsfile(seqfile)
## [1] "/tmp/RtmpNhK8PI/Rinst2bec4c2de841/SeqSQC/extdata/example.gds"

The second object is a list of sample information and QC results, which include the dimension (# of samples and variants), sample annotation, and QC results for sample missing rate, sex check, inbreeding outlier check, IBD check, and population outlier check.

QCresult(seqfile)
## List of length 7
## names(7): dimension sample.annot MissingRate SexCheck Inbreeding IBD PCA
head(QCresult(seqfile)$sample.annot)
##    sample population gender relation group
## 1 HG00119        EUR   male            pop
## 2 HG00133        EUR female            pop
## 3 HG00137        EUR female            pop
## 4 HG00174        EUR female            pop
## 5 HG00271        EUR   male            pop
## 6 HG00311        EUR   male            pop

GDS class

The genotype information and variant annotation from the input VCF file will be stored in Genomic Data Structure (GDS) format (gdsfmt). Compared to VCF format, the GDS format could increase the storage efficiency by 5 fold and data access speed by 2-3 fold. We only include the bi-allelic single nucleotide variants (SNVs) from the VCF input for sample QC analysis. Other information from the VCF file, including the chromosome, position, snp rs id, reference allele, alternative allele, quality score and filter can also be passed into the gds file. The functions SeqOpen and closefn.gds can be used to open and close the gds file in SeqSQC object. It is recommended to close the gds file once it has been opened.

showfile.gds(closeall=TRUE)
dat <- SeqOpen(seqfile)
dat
## File: /tmp/RtmpNhK8PI/Rinst2bec4c2de841/SeqSQC/extdata/example.gds (2.8M)
## +    [  ]
## |--+ sample.id   { Str8 92 ZIP(34.9%), 257B }
## |--+ sample.annot   [ data.frame ] *
## |  |--+ sample   { Str8 92 ZIP(34.9%), 257B }
## |  |--+ population   { Str8 92 ZIP(15.5%), 57B }
## |  |--+ gender   { Str8 92 ZIP(10.6%), 58B }
## |  |--+ relation   { Str8 92 ZIP(33.4%), 97B }
## |  \--+ group   { Str8 92 ZIP(7.67%), 29B }
## |--+ snp.id   { Int32 138320 ZIP(51.6%), 278.6K }
## |--+ snp.rs.id   { Str8 138320 ZIP(40.3%), 598.6K }
## |--+ snp.chromosome   { Str8 138320 ZIP(0.12%), 414B }
## |--+ snp.position   { Int32 138320 ZIP(75.3%), 406.6K }
## |--+ snp.allele   { Str8 138320 ZIP(15.1%), 81.5K }
## |--+ genotype   { Bit2 138320x92 ZIP(46.9%), 1.4M }
## \--+ snp.annot   [  ]
##    |--+ qual   { Float64 138320 ZIP(0.23%), 2.5K }
##    |--+ filter   { Str8 138320 ZIP(0.15%), 1.0K }
##    \--+ LDprune   { Int32,logical 138320 ZIP(5.46%), 29.5K } *
closefn.gds(dat)

Standard workflow

Sample missing rate check

Samples with missing rate > 0.1 are considered problematic. The function MissingRate and plotQC(QCstep = "MissingRate") calculate and plot the sample missing rate respectively.

The result from sample missing rate check contains three columns: sample name, sample missing rate, and an indicator of whether the sample has a missing rate greater than 0.1. The value of the outlier column is set to NA for benchmark samples. When running the QC process through the wrap-up function sampleQC, problematic samples identified in each QC step are automatically remove before getting to the next step. However when a QC step is executed separately, users need to specify the list of problematic samples to be removed using the remove.samples option.

seqfile <- MissingRate(seqfile, remove.samples=NULL)
res.mr <- QCresult(seqfile)$MissingRate
tail(res.mr)
##     sample missingRate outlier
## 21 HG00124           0    <NA>
## 22 HG00105           0      No
## 23 HG00115           0      No
## 24 HG01695           0      No
## 25 NA11930           0      No
## 26 HG02585           0      No

The function plotQC(QCstep = "MissingRate") generates the plot for the sample missing rate, where the problematic samples with missing rate greater than 0.1 are highlighted in the plot. The default plot generated is not interactive. Users can generate the interactive plot in each QC step by specifying the interactive argument to be TRUE. The interactive plot allows users to visually inspect the QC result by putting the cursor on samples of interest.

plotQC(seqfile, QCstep = "MissingRate")

In this plot, all five samples in the study cohort have missing rate of zero.

Sex check

After filtering out the pseudo-autosomal region in X chromosome, we calculate the sample inbreeding coefficient with variants on X chromosome for all samples in the study cohort and for benchmark samples of the same population as the study cohort. The function SexCheck predicts the sample gender and plotQC(QCstep = "SexCheck") draws the plot for X chromosome inbreeding coefficients.

The result contains sample name, reported gender, X chromosome inbreeding coefficient, and predicted gender. Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], “0” will be shown in the column of pred.sex to indicate ambiguous gender, which won’t be considered as problematic.

seqfile <- SexCheck(seqfile, remove.samples=NULL)
res.sexc <- QCresult(seqfile)$SexCheck
tail(res.sexc)
##     sample    sex      sexinb pred.sex
## 21 HG00124 female -0.01590996   female
## 22 HG00105   male  1.00442136     male
## 23 HG00115   male  0.91816384     male
## 24 HG01695 female  0.29287884        0
## 25 NA11930   male  0.96900002     male
## 26 HG02585   male  1.87093224     male

The function plotQC(QCstep = "SexCheck") generates the plot for the inbreeding coefficient on X chromosome where samples are labeled with different color according to their self-reported gender. If there is any sample detected to be gender mismatched by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "SexCheck")

In this plot, none of the five samples in the study cohort have dis-concordant self-reported and predicted genders.

Inbreeding check

Using LD-pruned autosomal variants, we calculate the inbreeding coefficient for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic. The function Inbreeding and plotQC(QCstep = "Inbreeding") calculates and plots the inbreeding coefficients respectively.

The result contains sample name, inbreeding coefficient, and an indicator of whether the inbreeding coefficient is five standard deviation beyond the mean. For Benchmark samples the indicator column is set to be “NA”.

seqfile <- Inbreeding(seqfile, remove.samples=NULL)
res.inb <- QCresult(seqfile)$Inbreeding
tail(res.inb)
##     sample  inbreeding outlier.5sd
## 21 HG00124 -0.04297042        <NA>
## 22 HG00105  0.10507886          No
## 23 HG00115  0.11097193          No
## 24 HG01695  0.11478018          No
## 25 NA11930  0.11011020          No
## 26 HG02585  0.32389122          No

The function plotQC(QCstep = "Inbreeding") generates the plot for the inbreeding coefficient. Problematic samples with extreme inbreeding coefficients will be highlighted in the plot.

plotQC(seqfile, QCstep = "Inbreeding")

In this plot, none of the five samples in the study cohort have extreme inbreeding coefficients.

IBD check

Using LD-pruned autosomal variants, we calculate the IBD coefficients for all sample pairs. we then predict related sample pairs in study cohort by using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set. All predicted related pairs are also required to have coefficient of kinship ≥ 0.08. The sample with higher missing rate in each related pair is selected for removal from further analysis. The function IBD calculates the IBD coefficients for each sample pair and predicts the relatedness for samples in the study cohort. The function plotQC(QCstep = "IBD") draws the descent coefficients, K0 and K1, for each pair.

The result contains sample names, the descent coefficients k0, k1 and kinship, self-reported relationship and predicted relationship for each pair of samples. Sample pairs with discordant self-reported and predicted relationship are considered as problematic.

seqfile <- IBD(seqfile, remove.samples=NULL)
res.ibd <- QCresult(seqfile)$IBD
head(res.ibd)
##         id1     id2        k0 k1        kin label pred.label
## 93  HG00119 HG00133 1.0000000  0 0.00000000    UN         UN
## 185 HG00119 HG00137 0.8775824  0 0.06120879    UN         UN
## 186 HG00133 HG00137 0.8907753  0 0.05461233    UN         UN
## 277 HG00119 HG00174 1.0000000  0 0.00000000    UN         UN
## 278 HG00133 HG00174 1.0000000  0 0.00000000    UN         UN
## 279 HG00137 HG00174 0.8851751  0 0.05741246    UN         UN

The function plotQC(QCstep = "IBD") draws the descent coefficients k0 and k1 for each sample pair. The relationship for each sample pair are labelled by different colors. If there is any problematic sample pair detected to be cryptically related by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "IBD")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

In this plot, none of the five samples in the study cohort have cryptic relationship with each other.

Population outlier check

Using LD-pruned autosomal variants we calculate the eigenvectors and eigenvalues for principle component analysis (PCA). We use the benchmark samples as training dataset, and predict the population group for each sample in the study cohort using the top four principle components. Samples with discordant predicted and self-reported population groups are considered problematic. The function PCA performs the PCA analysis and identifies population outliers in study cohort.

The result contains sample name, reported population, data resource (benchmark / study cohort), the first four eigenvectors and predicted population. In the example data, we identified one population outlier. The sample HG02585 was reported as EUR but predicted to be AFR. HG02585 is indeed a AFR sample. We put it as an intended population outlier in the example data.

seqfile <- PCA(seqfile, remove.samples=NULL)
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)
##     sample pop  type         EV1         EV2         EV3         EV4 pred.pop
## 80 NA20895 SAS   pop -0.05707629 -0.05250343 -0.03554849  0.03699505      SAS
## 81 HG00105 EUR study -0.04287047 -0.16439102  0.14085165 -0.15338474      EUR
## 82 HG00115 EUR study -0.04626500 -0.15455063  0.12749117 -0.14449323      EUR
## 83 HG01695 EUR study -0.03515475 -0.14758276  0.14503975 -0.15852459      EUR
## 84 NA11930 EUR study -0.04509767 -0.15497010  0.13152295 -0.15107911      EUR
## 85 HG02585 EUR study  0.17683331 -0.08140563  0.51735383 -0.59243887      AFR

The function plotQC(QCstep = "PCA") generates the plot of first two principle components for each sample. Samples with different population are labelled by different colors. If there is any population outlier detected by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "PCA")

In the following interactive plot, the intended population outlier HG02585 was grouped closer to the AFR benchmark samples, and far from the other four EUR samples. Users could easily pick it out and remove it from downstream analysis.

plotQC(seqfile, QCstep = "PCA", interactive=TRUE)

Summary of QC results

Problematic sample list

The SeqSQC function ProblemList generates a data frame including all problematic samples, and the reasons for removal recommendation. We recommend users to execute this function after finishing all QC steps in standard workflow, so that they can get a full list of problematic samples.

problemList(seqfile)
## $prob.list
##    sample      remove.reason
## 1 HG02585 population outlier
## 
## $remove.list
##    sample
## 1 HG02585
save(seqfile, file="seqfile.Rdata")

reporting of results

After finish all or some of the QC steps, the users can use RenderReport to generate the report in html format with optional interactive plots.

RenderReport(seqfile, output="report.html", interactive=TRUE)

How to get help for SeqSQC

Any SeqSQC question can be posted to the Bioconductor support site, which serves as a searchable knowledge base of questions and answers:

https://support.bioconductor.org

Posting a question and tagging with “SeqSQC” will automatically send an alert to the package authors to respond on the support site.

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SeqSQC_1.29.0        SNPRelate_1.40.0     gdsfmt_1.43.0       
## [4] ExperimentHub_2.15.0 AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [7] dbplyr_2.5.0         BiocGenerics_0.53.0  BiocStyle_2.35.0    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        viridisLite_0.4.2       farver_2.1.2           
##  [4] dplyr_1.1.4             blob_1.2.4              filelock_1.0.3         
##  [7] Biostrings_2.75.0       lazyeval_0.2.2          fastmap_1.2.0          
## [10] GGally_2.2.1            digest_0.6.37           mime_0.12              
## [13] lifecycle_1.0.4         KEGGREST_1.47.0         RSQLite_2.3.7          
## [16] magrittr_2.0.3          compiler_4.4.1          rlang_1.1.4            
## [19] sass_0.4.9              tools_4.4.1             utf8_1.2.4             
## [22] yaml_2.3.10             data.table_1.16.2       knitr_1.48             
## [25] labeling_0.4.3          htmlwidgets_1.6.4       bit_4.5.0              
## [28] curl_5.2.3              plyr_1.8.9              RColorBrewer_1.1-3     
## [31] withr_3.0.2             purrr_1.0.2             sys_3.4.3              
## [34] grid_4.4.1              stats4_4.4.1            fansi_1.0.6            
## [37] e1071_1.7-16            colorspace_2.1-1        ggplot2_3.5.1          
## [40] scales_1.3.0            cli_3.6.3               rmarkdown_2.28         
## [43] crayon_1.5.3            generics_0.1.3          reshape2_1.4.4         
## [46] httr_1.4.7              DBI_1.2.3               cachem_1.1.0           
## [49] proxy_0.4-27            stringr_1.5.1           zlibbioc_1.52.0        
## [52] AnnotationDbi_1.69.0    BiocManager_1.30.25     XVector_0.46.0         
## [55] vctrs_0.6.5             jsonlite_1.8.9          IRanges_2.41.0         
## [58] S4Vectors_0.44.0        bit64_4.5.2             crosstalk_1.2.1        
## [61] maketools_1.3.1         plotly_4.10.4           jquerylib_0.1.4        
## [64] tidyr_1.3.1             glue_1.8.0              ggstats_0.7.0          
## [67] stringi_1.8.4           gtable_0.3.6            BiocVersion_3.21.1     
## [70] GenomeInfoDb_1.43.0     GenomicRanges_1.59.0    UCSC.utils_1.2.0       
## [73] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
## [76] rappdirs_0.3.3          htmltools_0.5.8.1       GenomeInfoDbData_1.2.13
## [79] R6_2.5.1                evaluate_1.0.1          Biobase_2.67.0         
## [82] highr_0.11              png_0.1-8               memoise_2.0.1          
## [85] bslib_0.8.0             class_7.3-22            Rcpp_1.0.13            
## [88] xfun_0.48               buildtools_1.0.0        pkgconfig_2.0.3