Package 'SeqSQC' reference manual

Title:	A bioconductor package for sample quality check with next generation sequencing data
Description:	The SeqSQC is designed to identify problematic samples in NGS data, including samples with gender mismatch, contamination, cryptic relatedness, and population outlier.
Authors:	Qian Liu [aut, cre]
Maintainer:	Qian Liu <[email protected]>
License:	GPL-3
Version:	1.29.0
Built:	2025-03-25 05:52:22 UTC
Source:	https://github.com/bioc/SeqSQC

Sample Quality Check for NGS Data using SeqSQC package

Description

SeqSQC

Details

Sample Quality Check for NGS Data.

Author(s)

Qian Liu

Example capture region file used in vignette.

Description

This .bed file contains only CCDS capture region only in chromosome 1, which is meant to be used together with the example_sub.vcf as a runnable example in the function of LoadVfile and sampleQC in the vignette.

Author(s)

Qian Liu [email protected]

Example vcf file used in vignette.

Description

This vcf file contains only a subset (1000 lines of variants) of the original vcf file for the 5 study samples (examples assembled from the 1000 Genomes Project). This is to be used as a runnable example in the function of LoadVfile and sampleQC in the vignette.

Author(s)

Qian Liu [email protected]

example gds file used in vignette.

Description

This gds file contains genotype and phenotype for 92 whole-genome sequenced samples captured by CCDS region. This is a merged dataset of the 87 benchmark samples and the 5 study samples (all are assembled from the 1000 Genomes Project). The meta info for these 92 samples includes sample name, pupulation, age, relation note and group info (benchmark or study).

Author(s)

Qian Liu [email protected]

Example SeqSQC file used in vignette.

Description

The SeqSQC object is a list of two objects. The first object gdsfile is the filepath of the "example.gds" file which stores the genotype and meta info of the example data merged with the benchmark data. The second object QCresult contains the data dimensions (# of samples and variants), sample annotation, and QC results for sample missing rate, sex check, inbreeding outlier check, IBD check, and population outlier check.

Author(s)

Qian Liu [email protected]

Sample relationship check with SeqSQC object input file.

Description

Function to calculate the IBD coefficients for all sample pairs and to predict related sample pairs in study cohort.

Usage

IBDCheck(
  seqfile,
  remove.samples = NULL,
  LDprune = TRUE,
  kin.filter = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)
IBDCheck(
  seqfile,
  remove.samples = NULL,
  LDprune = TRUE,
  kin.filter = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)

Arguments

`seqfile`	SeqSQC object, which includes the merged gds file for study cohort and benchmark.
`remove.samples`	a vector of sample names for removal from IBD calculation. Could be problematic samples identified from previous QC steps, or user-defined samples.
`LDprune`	whether to use LD-pruned snp set. The default is TRUE.
`kin.filter`	whether to use "kinship coefficient >= 0.08" as the additional criteria for related samples. The default is TRUE.
`missing.rate`	to use the SNPs with "<= `missing.rate`" only; if NaN, no threshold. By default, we use `missing.rate = 0.1` to filter out variants with missing rate greater than 10%.
`ss.cutoff`	the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort.
`maf`	to use the SNPs with ">= `maf`" if sample size defined in above argument is greater than `ss.cutoff`; otherwise NaN is used by default for no MAF threshold.
`hwe`	to use the SNPs with Hardy-Weinberg equilibrium p >= `hwe` if sample size defined in above argument is greater than `ss.cutoff`; otherwise no hwe threshold. The default is 1e-6.
`...`	Arguments to be passed to other methods.

Details

Using LD-pruned variants (by default), we calculate the IBD coefficients for all sample pairs, and then predict related sample pairs in study cohort using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set.
Sample pairs with discordant self-reported and predicted relationship are considered as problematic. All predicted related pairs are also required to have coefficient of kinship >= 0.08 by default. The sample with higher missing rate in each related pair is selected for removal from further analysis by function of IBDRemove.

Value

a data frame with sample names, the descent coefficients of k0, k1 and kinship, self-reported relationship and predicted relationship for each pair of samples.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.ibd <- QCresult(seqfile)$IBD
tail(res.ibd)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.ibd <- QCresult(seqfile)$IBD
tail(res.ibd)

Obtain the problematic sample list from IBD relatedness.

Description

Function to extract the related sample pairs from IBD results, and to generate the sample list for removal from the related pairs based on sample missing rate.

Usage

IBDRemove(seqfile, all = FALSE)
IBDRemove(seqfile, all = FALSE)

Arguments

`seqfile`	SeqSQC object, with IBD results.
`all`	whether to check the IBD for all sample pairs (including the benchmark samples). The default is FALSE.

Value

a list of 2 elements: $ibd.pairs is a data frame with 5 columns including sample names(id1, id2), IBD coefficients of k0 and k1, and kinship for samples with cryptic relatedness. $ibd.remove is a vector of samples to be removed, which are generated by extracting the sample with higher missing rate in each problematic sample pair.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
IBDRemove(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
IBDRemove(seqfile)

Sample inbreeding check with SeqSQC object input file.

Description

Function to calculate population-specific inbreeding coefficients, and to predict inbreeding outliers that are five standard deviation beyond the mean.

Usage

Inbreeding(
  seqfile,
  remove.samples = NULL,
  LDprune = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)
Inbreeding(
  seqfile,
  remove.samples = NULL,
  LDprune = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)

Arguments

`seqfile`	SeqSQC object, which includes the merged gds file for study cohort and benchmark.
`remove.samples`	a vector of sample names for removal from inbreeding coefficient calculation. Could be problematic samples identified from previous QC steps, or user-defined samples.
`LDprune`	whether to use LD-pruned snp set. The default is TRUE.
`missing.rate`	to use the SNPs with "<= `missing.rate`" only; if NaN, no threshold. By default, we use `missing.rate = 0.1` to filter out variants with missing rate greater than 10%.
`ss.cutoff`	the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort.
`maf`	to use the SNPs with ">= `maf`" if sample size defined in above argument is greater than `ss.cutoff`; otherwise NaN is used by default for no MAF threshold.
`hwe`	to use the SNPs with Hardy-Weinberg equilibrium p >= `hwe` if sample size defined in above argument is greater than `ss.cutoff`; otherwise no hwe threshold. The default is 1e-6.
`...`	Arguments to be passed to other methods.

Details

Using LD-pruned variants (by default), we calculate the inbreeding coefficients for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic and are shown as "Yes" in the column of outlier.5sd. Benchmark samples in this column are set to be “NA”.

Value

a data frame with sample name, inbreeding coefficient, and an indicator of whether the inbreeding coefficient is five standard deviation beyond the mean.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- Inbreeding(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.inb <- QCresult(seqfile)$Inbreeding
tail(res.inb)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- Inbreeding(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.inb <- QCresult(seqfile)$Inbreeding
tail(res.inb)

Data preprocessing for VCF or plink input from NGS or GWAS data.

Description

Function to read VCF or plink files, merge with benchmark data, and output as SeqSQC object.

Usage

LoadVfile(
  vfile,
  output = "sampleqc",
  capture.region = NULL,
  sample.annot = NULL,
  LDprune = TRUE,
  vfile.restrict = FALSE,
  slide.max.bp = 5e+05,
  ld.threshold = 0.3,
  format.data = "NGS",
  format.file = "vcf",
  ...
)
LoadVfile(
  vfile,
  output = "sampleqc",
  capture.region = NULL,
  sample.annot = NULL,
  LDprune = TRUE,
  vfile.restrict = FALSE,
  slide.max.bp = 5e+05,
  ld.threshold = 0.3,
  format.data = "NGS",
  format.file = "vcf",
  ...
)

Arguments

`vfile`	vcf or PLINK input file (ped/map/bed/bim/fam with same basename). Vfile could be a vector of character strings, see details.
`output`	a character string for name of merged data of SeqSQC object. The `dirname(output)` would be used as the directory to save the QC results and plots. The default is `sampleqc` in working directory.
`capture.region`	the BED file of sequencing capture regions. The default is NULL. For exome-sequencing data, the capture region file must be provided.
`sample.annot`	sample annotation file with 3 columns (with header) in the order of sample id, sample population and sex info. The default is NULL.
`LDprune`	whether to use LD-pruned snp set. The default is TRUE.
`vfile.restrict`	whether the input vcf or plink file has already been restricted by capture region. The default is FALSE.
`slide.max.bp`	the window size of SNPs when calculating linkage disequilibrium. The default is 5e+05.
`ld.threshold`	the r^2 threshold for LD-based SNP pruning if `LDprune = TRUE`. The default is 0.3.
`format.data`	the data source. The default is `NGS` for sequencing data.
`format.file`	the data format. The default is `vcf`.
`...`	Arguments to be passed to other methods.

Details

For vfile with more than one file names, LoadVfile will merge all dataset together if they all contain the same samples. It is useful to combine genetic/genomic data together if VCF data is divided by chromosomes.
sample.annot file contains 3 columns with column names. col 1 is sample with sample ids; col 2 is population with values of "AFR/EUR/ASN/EAS/SAS"; col 3 is gender with values of "male/female".

Value

a SeqSQC object with the filepath to the gds file which stores the genotype, the summary of samples and variants, and the QCresults including the sample annotation information.

Author(s)

Qian Liu [email protected]

Examples

infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outfile <- file.path(tempdir(), "testWrapUp")
seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr,
sample.annot = sample.annot)
infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outfile <- file.path(tempdir(), "testWrapUp")
seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr,
sample.annot = sample.annot)

Sample missing rate check with SeqSQC object input file.

Description

Function to calculate sample missing rate and to identify sample outlier with high missing rate (> 0.1).

Usage

MissingRate(seqfile, remove.samples = NULL)
MissingRate(seqfile, remove.samples = NULL)

Arguments

`seqfile`	SeqSQC object, which includes the merged gds file for study cohort and benchmark.
`remove.samples`	a vector of sample names for removal from missing rate check. Could be problematic samples identified from other QC steps, or user-defined samples.

Details

The value of the outlier column is set to NA for benchmark samples.

Value

a data frame with sample name, sample missing rate, and an indicator of whether the sample has a missing rate greater than 0.1.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- MissingRate(seqfile, remove.samples=NULL)
res.mr <- QCresult(seqfile)$MissingRate
tail(res.mr)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- MissingRate(seqfile, remove.samples=NULL)
res.mr <- QCresult(seqfile)$MissingRate
tail(res.mr)

Population outlier check with SeqSQC object input file.

Description

Function to perform principle component analysis for all samples and to infer sample ancestry.

Usage

PCACheck(
  seqfile,
  remove.samples = NULL,
  npcs = 4,
  LDprune = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)
PCACheck(
  seqfile,
  remove.samples = NULL,
  npcs = 4,
  LDprune = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)

Arguments

`seqfile`	SeqSQC object, which includes the merged gds file for study cohort and benchmark.
`remove.samples`	a vector of sample names for removal from PCA calculation. Could be problematic samples identified from previous QC steps, or user-defined samples.
`npcs`	the number principle components to use for the population prediction in SVM model. The default value is 4, and it is required to be <= 10.
`LDprune`	whether to use LD-pruned snp set, the default is TRUE.
`missing.rate`	to use the SNPs with "<= `missing.rate`" only; if NaN, no threshold. By default, we use `missing.rate = 0.1` to filter out variants with missing rate greater than 10%.
`ss.cutoff`	the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort.
`maf`	to use the SNPs with ">= `maf`" if sample size defined in above argument is greater than `ss.cutoff`; otherwise NaN is used by default for no MAF threshold.
`hwe`	to use the SNPs with Hardy-Weinberg equilibrium p >= `hwe` if sample size defined in above argument is greater than `ss.cutoff`; otherwise no hwe threshold. The default is 1e-6.
`...`	Arguments to be passed to other methods.

Details

Using LD-pruned autosomal variants (by default), we calculate the eigenvectors and eigenvalues for principle component analysis (PCA). We use the benchmark samples as training dataset, and predict the population group for each sample in the study cohort based on the top four eigenvectors. Samples with discordant predicted and self-reported population groups are considered problematic. The function PCACheck performs the PCA analysis and identifies population outliers in study cohort.

Value

a data frame with sample name, reported population, data resource (benchmark vs study cohort), the first four eigenvectors and the predicted population.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)

Plot the QC results for specific QC steps.

Description

Plot QC results.

Usage

plotQC(
  seqfile,
  QCstep = c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA"),
  interactive = FALSE,
  sdcoef = 5,
  pc1 = "EV1",
  pc2 = "EV2",
  pairedScatter = FALSE,
  ...
)
plotQC(
  seqfile,
  QCstep = c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA"),
  interactive = FALSE,
  sdcoef = 5,
  pc1 = "EV1",
  pc2 = "EV2",
  pairedScatter = FALSE,
  ...
)

Arguments

`seqfile`	SeqSQC object with QC results.
`QCstep`	which QC step the user want to do plotting. Takes values of `c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA")`
`interactive`	whether to generate interactive plot. Recommend to use `interactive = TRUE` if user perform sample QC using an rmarkdown script and output plot to html format.
`sdcoef`	for inbreeding outlier check, how many standard deviation we need for identification of inbreeding outliers. The default is 5.
`pc1`	the eigenvector on x axis for PCA result. The default is "EV1" for eigenvector 1.
`pc2`	the eigenvector on y axis for PCA result. The default is "EV2" for eigenvector 2.
`pairedScatter`	for PCA result, whether to plot the paired scatterplot for the first 4 PC axes.
`...`	Arguments to be passed to other methods.

Value

the ggplot or interactive plot (if output is in html format) for specific QC result. If "interactive=FALSE", it returns a ggplot and author could have the flexibility to add on any layers, scales, faceting specifications and coordinate systems.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
p <- plotQC(seqfile, QCstep="PCA", interactive=FALSE)
p
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
p <- plotQC(seqfile, QCstep="PCA", interactive=FALSE)
p

Generate the problematic sample list.

Description

generate the problematic sample list from QC steps that have been done, and provide each problematic sample with a reason for removal (high missing rate, gender mismatch, inbreeding outlier, cryptic relationship or population outlier).

Usage

problemList(seqfile)
problemList(seqfile)

Arguments

seqfile

SeqSQC object with sample QC results.

Value

a list of 2 datasets: 1) a data frame with 2 columns: sample for problematic sample name, and remove.reason for the reason of removing the sample. 2) a data frame with 1 column sample for problematic samples to be removed.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
problemList(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
problemList(seqfile)

Render the rmarkdown file for generating a sample QC report.

Description

Function to render the pre-compiled rmarkdown file to generate the sample QC report.

Usage

RenderReport(input, output, interactive = TRUE)
RenderReport(input, output, interactive = TRUE)

Arguments

`input`	SeqSQC object with QC results.
`output`	a character string to define the file name for the QC report.
`interactive`	whether to generate interative plots in the report. The default is TRUE.

Value

Will incure the rendering of the rmarkdown file for generating the sample QC report. The report will return to the file denoted in output in the function.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
RenderReport(seqfile, output="report.html", interactive=FALSE)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
RenderReport(seqfile, output="report.html", interactive=FALSE)

Sample annotation file for the example data used in vignette.

Description

This sample annotation file is a required input from the user when using SeqSQC. It includes the sample info with sample name stored in the column of sample, the population info stored in the column of population, and the gender info stored in the column of gender. The population column must be a in the format of "AFR/EUR/ASN/EAS/SAS". The gender column must be in the format of "female/male". This file is meant to be used together with the example_sub.vcf as a runnable example in the function of LoadVfile and sampleQC in the vignette.

Author(s)

Qian Liu [email protected]

The wrap-up function for sample QC of sequencing/GWAS data.

Description

A wrap-up function for sample QC. It reads in the variant genotypes in vcf/PLINK format, merges study cohort with benchmark data, and performs sample QC for the merged dataset.

Usage

sampleQC(
  vfile = NULL,
  output = "sampleqc",
  capture.region = NULL,
  sample.annot = NULL,
  LDprune = TRUE,
  vfile.restrict = FALSE,
  slide.max.bp = 5e+05,
  ld.threshold = 0.3,
  format.data = "NGS",
  format.file = "vcf",
  QCreport = TRUE,
  out.report = "report.html",
  interactive = TRUE,
  results = TRUE,
  plotting = TRUE,
  ...
)
sampleQC(
  vfile = NULL,
  output = "sampleqc",
  capture.region = NULL,
  sample.annot = NULL,
  LDprune = TRUE,
  vfile.restrict = FALSE,
  slide.max.bp = 5e+05,
  ld.threshold = 0.3,
  format.data = "NGS",
  format.file = "vcf",
  QCreport = TRUE,
  out.report = "report.html",
  interactive = TRUE,
  results = TRUE,
  plotting = TRUE,
  ...
)

Arguments

`vfile`	vcf or PLINK input file (ped/map/bed/bim/fam with same basename). The default is NULL. Vfile could be a vector of character strings, see details. Could also take file in `SeqSQC` object generated from `LoadVfile`.
`output`	a character string for name of merged data of SeqSQC object. The `dirname(output)` would be used as the directory to save the QC result and plots. The default is `sampleqc` in the working directory.
`capture.region`	the BED file of sequencing capture regions. The default is NULL. For exome-sequencing data, the capture region file must be provided.
`sample.annot`	sample annotation file with 3 columns (with header) in the order of sample id, sample population and sex info. The default is NULL.
`LDprune`	whether to use LD-pruned snp set. The default is TRUE.
`vfile.restrict`	whether the input vcf or plink file has already been restricted by capture region. The default is FALSE.
`slide.max.bp`	the window size of SNPs when calculating linkage disequilibrium. The default is 5e+05.
`ld.threshold`	the r^2 threshold for LD-based SNP pruning if `LDprune = TRUE`. The default is 0.3.
`format.data`	the data source. The default is `NGS` for sequencing data.
`format.file`	the data format. The default is `vcf`.
`QCreport`	Whether to generate the sample QC report in html format.
`out.report`	the file name for the sample QC report. The default is `report.html`.
`interactive`	whether to generate interactive plots in the sample QC report if `QCreport = TRUE`.
`results`	whether to write out the results for each QC steps in .txt files. The default is TRUE.
`plotting`	whether to output the plots for each QC steps in .pdf files. the default is TRUE.
`...`	Arguments to be passed to other methods.

Details

For vfile with more than one file names, sampleQC will merge all dataset together if they all contain the same samples. It is useful to combine genetic/genomic data together if VCF data is divided by chromosomes.
There are 3 columns in sample.annot file. col 1 is sample with sample ids, col 2 is population with values of "AFR/EUR/ASN/EAS/SAS", col 3 is gender with values of "male/female".

Value

a SeqSQC object with the filepath to the gds file which stores the genotype, the summary of samples and variants, and the QCresults including the sample annotation information and all QC results.

Author(s)

Qian Liu [email protected]

Examples

## Not run: 
infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outfile <- file.path(tempdir(), "testWrapUp")
seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr,
sample.annot = sample.annot, format.data = "NGS", format.file = "vcf",
QCreport = TRUE, out.report="report.html", interactive = TRUE)
## save(seqfile, file="seqfile.RData")

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- sampleQC(sfile = seqfile, output = outfile, QCreport = FALSE,
out.report="report.html", interactive = TRUE)

## End(Not run)
## Not run: 
infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outfile <- file.path(tempdir(), "testWrapUp")
seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr,
sample.annot = sample.annot, format.data = "NGS", format.file = "vcf",
QCreport = TRUE, out.report="report.html", interactive = TRUE)
## save(seqfile, file="seqfile.RData")

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- sampleQC(sfile = seqfile, output = outfile, QCreport = FALSE,
out.report="report.html", interactive = TRUE)

## End(Not run)

Open the gds file in SeqSQC objects.

Description

Function to open the gds file inside the SeqSQC object.

Usage

SeqOpen(seqfile, readonly = TRUE, allow.duplicate = FALSE)
SeqOpen(seqfile, readonly = TRUE, allow.duplicate = FALSE)

Arguments

`seqfile`	SeqSQC object, which has been merged with benchmark data.
`readonly`	whether to open the gds file in read-only mode. If "FALSE", it is allowed to write data to the file. The default is TRUE.
`allow.duplicate`	whether to allow to open a GDS file with read-only mode when it has been opened in the same R session. The default is FALSE.

Value

a gds file with the filepath in the input SeqSQC object.

Author(s)

Qian Liu [email protected]

Examples

library(gdsfmt)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
dat <- SeqOpen(seqfile)
dat
closefn.gds(dat)
library(gdsfmt)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
dat <- SeqOpen(seqfile)
dat
closefn.gds(dat)

A data format to store genotype phenotype and sample QC results from SeqSQC.

Description

A SeqSQC object is a list of two objects. The first object gdsfile is the filepath of the GDS (discussed in section below) file which stores the genotype information from the original VCF file. The second object QCresult is a list of sample information and QC results, which include the dimension (# of samples and variants), sample annotation, and QC results for sample missing rate, sex check, inbreeding outlier check, IBD check, and population outlier check.

Usage

SeqSQC(gdsfile, QCresult = List())

gdsfile(x)

gdsfile(x) <- value

QCresult(x)

QCresult(x) <- value

## S4 method for signature 'SeqSQC'
gdsfile(x)

## S4 replacement method for signature 'SeqSQC'
gdsfile(x) <- value

## S4 method for signature 'SeqSQC'
QCresult(x)

## S4 replacement method for signature 'SeqSQC'
QCresult(x) <- value
SeqSQC(gdsfile, QCresult = List())

gdsfile(x)

gdsfile(x) <- value

QCresult(x)

QCresult(x) <- value

## S4 method for signature 'SeqSQC'
gdsfile(x)

## S4 replacement method for signature 'SeqSQC'
gdsfile(x) <- value

## S4 method for signature 'SeqSQC'
QCresult(x)

## S4 replacement method for signature 'SeqSQC'
QCresult(x) <- value

Arguments

`gdsfile`	A character string for the filepath of the GDS file.
`QCresult`	A list with sample information and sample QC results.
`x`	an SeqSQCClass object.
`value`	the new value for the SeqSQC object slots.

Value

The filepath to the gds file.

Slots

gdsfile: A character string for the filepath of the GDS file.
QCresult: A list with sample information and sample QC results.

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gdsfile(seqfile)
QCresult(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gdsfile(seqfile)
QCresult(seqfile)

Sample gender check with SeqSQC object input file.

Description

Function to calculate the X chromosome inbreeding coefficient and to predict sample gender.

Usage

SexCheck(
  seqfile,
  remove.samples = NULL,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  ...
)
SexCheck(
  seqfile,
  remove.samples = NULL,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  ...
)

Arguments

`seqfile`	SeqSQC object, which includes the merged gds file for study cohort and benchmark.
`remove.samples`	a vector of sample names for removal from sex check. Could be problematic samples identified from previous QC steps, or user-defined samples.
`missing.rate`	to use the SNPs with "<= `missing.rate`" only; if NaN, no threshold. By default, we use `missing.rate = 0.1` to filter out variants with missing rate greater than 10%.
`ss.cutoff`	the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort.
`maf`	to use the SNPs with ">= `maf`" if sample size defined in above argument is greater than `ss.cutoff`; otherwise NaN is used by default for no MAF threshold.
`...`	Arguments to be passed to other methods.

Details

Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8, respectively. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], “0” is shown in the column of pred.sex to indicate ambiguous gender, which is not considered as problematic.

Value

a data frame with sample name, reported gender, x chromosome inbreeding coefficient, and predicted gender.

Author(s)

Qian Liu [email protected]

Examples

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- SexCheck(seqfile, remove.samples=NULL, missing.rate=0.1)
res.sexc <- QCresult(seqfile)$SexCheck
tail(res.sexc)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- SexCheck(seqfile, remove.samples=NULL, missing.rate=0.1)
res.sexc <- QCresult(seqfile)$SexCheck
tail(res.sexc)

Package 'SeqSQC'

Help Index

Sample Quality Check for NGS Data using SeqSQC package

Description

Details

Author(s)

See Also

Example capture region file used in vignette.

Description

Author(s)

Example vcf file used in vignette.

Description

Author(s)

example gds file used in vignette.

Description

Author(s)

Example SeqSQC file used in vignette.

Description

Author(s)

Sample relationship check with SeqSQC object input file.

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Obtain the problematic sample list from IBD relatedness.

Description

Usage

Arguments

Value

Author(s)

Examples

Sample inbreeding check with SeqSQC object input file.

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Data preprocessing for VCF or plink input from NGS or GWAS data.

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Sample missing rate check with SeqSQC object input file.

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Population outlier check with SeqSQC object input file.

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Plot the QC results for specific QC steps.

Description

Usage

Arguments

Value

Author(s)

Examples

Generate the problematic sample list.

Description

Usage

Arguments

Value

Author(s)

Examples