Title: | A bioconductor package for sample quality check with next generation sequencing data |
---|---|
Description: | The SeqSQC is designed to identify problematic samples in NGS data, including samples with gender mismatch, contamination, cryptic relatedness, and population outlier. |
Authors: | Qian Liu [aut, cre] |
Maintainer: | Qian Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.29.0 |
Built: | 2024-11-25 06:12:17 UTC |
Source: | https://github.com/bioc/SeqSQC |
SeqSQC
Sample Quality Check for NGS Data.
Qian Liu
LoadVfile
for data preparation; MissingRate
PCACheck
Inbreeding
IBDCheck
PCACheck
for individual sample QC checks; problemList
for the summary of problematic samples with reason and sample list to be removed; IBDRemove
for the problematic sample pairs detected with cryptic relationship; RenderReport
to generate the sample QC report; plotQC
to generate the ggplot or interactive plots in html format for each individual QC check; sampleQC
for wrapper of data preparation, all sample QC checks, QC result summary, and sample QC report.
This .bed
file contains only CCDS capture region only in
chromosome 1, which is meant to be used together with the
example_sub.vcf
as a runnable example in the function of
LoadVfile
and sampleQC
in the vignette.
Qian Liu [email protected]
This vcf file contains only a subset (1000 lines of variants) of
the original vcf file for the 5 study samples (examples assembled
from the 1000 Genomes Project). This is to be used as a runnable
example in the function of LoadVfile
and sampleQC
in
the vignette.
Qian Liu [email protected]
This gds file contains genotype and phenotype for 92 whole-genome sequenced samples captured by CCDS region. This is a merged dataset of the 87 benchmark samples and the 5 study samples (all are assembled from the 1000 Genomes Project). The meta info for these 92 samples includes sample name, pupulation, age, relation note and group info (benchmark or study).
Qian Liu [email protected]
The SeqSQC object is a list of two objects. The first object
gdsfile
is the filepath of the "example.gds" file which
stores the genotype and meta info of the example data merged with
the benchmark data. The second object QCresult
contains the
data dimensions (# of samples and variants), sample annotation, and
QC results for sample missing rate, sex check, inbreeding outlier
check, IBD check, and population outlier check.
Qian Liu [email protected]
Function to calculate the IBD coefficients for all sample pairs and to predict related sample pairs in study cohort.
IBDCheck( seqfile, remove.samples = NULL, LDprune = TRUE, kin.filter = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
IBDCheck( seqfile, remove.samples = NULL, LDprune = TRUE, kin.filter = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names for removal from IBD calculation. Could be problematic samples identified from previous QC steps, or user-defined samples. |
LDprune |
whether to use LD-pruned snp set. The default is TRUE. |
kin.filter |
whether to use "kinship coefficient >= 0.08" as the additional criteria for related samples. The default is TRUE. |
missing.rate |
to use the SNPs with "<= |
ss.cutoff |
the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort. |
maf |
to use the SNPs with ">= |
hwe |
to use the SNPs with Hardy-Weinberg equilibrium p >=
|
... |
Arguments to be passed to other methods. |
Using LD-pruned variants (by default), we calculate the
IBD coefficients for all sample pairs, and then predict related
sample pairs in study cohort using the support vector machine
(SVM) method with linear kernel and the known relatedness
embedded in benchmark data as training set.
Sample pairs
with discordant self-reported and predicted relationship are
considered as problematic. All predicted related pairs are also
required to have coefficient of kinship >= 0.08 by default. The
sample with higher missing rate in each related pair is
selected for removal from further analysis by function of
IBDRemove
.
a data frame with sample names, the descent coefficients of k0, k1 and kinship, self-reported relationship and predicted relationship for each pair of samples.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.ibd <- QCresult(seqfile)$IBD tail(res.ibd)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.ibd <- QCresult(seqfile)$IBD tail(res.ibd)
Function to extract the related sample pairs from IBD results, and to generate the sample list for removal from the related pairs based on sample missing rate.
IBDRemove(seqfile, all = FALSE)
IBDRemove(seqfile, all = FALSE)
seqfile |
SeqSQC object, with IBD results. |
all |
whether to check the IBD for all sample pairs (including the benchmark samples). The default is FALSE. |
a list of 2 elements: $ibd.pairs
is a data frame
with 5 columns including sample names(id1, id2), IBD
coefficients of k0 and k1, and kinship for samples with cryptic
relatedness. $ibd.remove
is a vector of samples to be
removed, which are generated by extracting the sample with
higher missing rate in each problematic sample pair.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) IBDRemove(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- IBDCheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) IBDRemove(seqfile)
Function to calculate population-specific inbreeding coefficients, and to predict inbreeding outliers that are five standard deviation beyond the mean.
Inbreeding( seqfile, remove.samples = NULL, LDprune = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
Inbreeding( seqfile, remove.samples = NULL, LDprune = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names for removal from inbreeding coefficient calculation. Could be problematic samples identified from previous QC steps, or user-defined samples. |
LDprune |
whether to use LD-pruned snp set. The default is TRUE. |
missing.rate |
to use the SNPs with "<= |
ss.cutoff |
the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort. |
maf |
to use the SNPs with ">= |
hwe |
to use the SNPs with Hardy-Weinberg equilibrium p >=
|
... |
Arguments to be passed to other methods. |
Using LD-pruned variants (by default), we calculate the
inbreeding coefficients for each sample in the study cohort and
for benchmark samples of the same population as the study
cohort. Samples with inbreeding coefficients that are five
standard deviations beyond the mean are considered problematic
and are shown as "Yes" in the column of
outlier.5sd
. Benchmark samples in this column are set to
be βNAβ.
a data frame with sample name, inbreeding coefficient, and an indicator of whether the inbreeding coefficient is five standard deviation beyond the mean.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- Inbreeding(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.inb <- QCresult(seqfile)$Inbreeding tail(res.inb)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- Inbreeding(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.inb <- QCresult(seqfile)$Inbreeding tail(res.inb)
Function to read VCF or plink files, merge with benchmark data, and output as SeqSQC object.
LoadVfile( vfile, output = "sampleqc", capture.region = NULL, sample.annot = NULL, LDprune = TRUE, vfile.restrict = FALSE, slide.max.bp = 5e+05, ld.threshold = 0.3, format.data = "NGS", format.file = "vcf", ... )
LoadVfile( vfile, output = "sampleqc", capture.region = NULL, sample.annot = NULL, LDprune = TRUE, vfile.restrict = FALSE, slide.max.bp = 5e+05, ld.threshold = 0.3, format.data = "NGS", format.file = "vcf", ... )
vfile |
vcf or PLINK input file (ped/map/bed/bim/fam with same basename). Vfile could be a vector of character strings, see details. |
output |
a character string for name of merged data of SeqSQC
object. The |
capture.region |
the BED file of sequencing capture regions. The default is NULL. For exome-sequencing data, the capture region file must be provided. |
sample.annot |
sample annotation file with 3 columns (with header) in the order of sample id, sample population and sex info. The default is NULL. |
LDprune |
whether to use LD-pruned snp set. The default is TRUE. |
vfile.restrict |
whether the input vcf or plink file has already been restricted by capture region. The default is FALSE. |
slide.max.bp |
the window size of SNPs when calculating linkage disequilibrium. The default is 5e+05. |
ld.threshold |
the r^2 threshold for LD-based SNP pruning if
|
format.data |
the data source. The default is |
format.file |
the data format. The default is |
... |
Arguments to be passed to other methods. |
For vfile
with more than one file names,
LoadVfile
will merge all dataset together if they all
contain the same samples. It is useful to combine
genetic/genomic data together if VCF data is divided by
chromosomes. sample.annot
file contains 3 columns
with column names. col 1 is sample
with sample ids; col
2 is population
with values of "AFR/EUR/ASN/EAS/SAS";
col 3 is gender
with values of "male/female".
a SeqSQC object with the filepath to the gds file which stores the genotype, the summary of samples and variants, and the QCresults including the sample annotation information.
Qian Liu [email protected]
infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC") sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC") cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC") outfile <- file.path(tempdir(), "testWrapUp") seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot)
infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC") sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC") cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC") outfile <- file.path(tempdir(), "testWrapUp") seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot)
Function to calculate sample missing rate and to identify sample outlier with high missing rate (> 0.1).
MissingRate(seqfile, remove.samples = NULL)
MissingRate(seqfile, remove.samples = NULL)
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names for removal from missing rate check. Could be problematic samples identified from other QC steps, or user-defined samples. |
The value of the outlier column is set to NA for benchmark samples.
a data frame with sample name, sample missing rate, and an indicator of whether the sample has a missing rate greater than 0.1.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- MissingRate(seqfile, remove.samples=NULL) res.mr <- QCresult(seqfile)$MissingRate tail(res.mr)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- MissingRate(seqfile, remove.samples=NULL) res.mr <- QCresult(seqfile)$MissingRate tail(res.mr)
Function to perform principle component analysis for all samples and to infer sample ancestry.
PCACheck( seqfile, remove.samples = NULL, npcs = 4, LDprune = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
PCACheck( seqfile, remove.samples = NULL, npcs = 4, LDprune = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ... )
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names for removal from PCA calculation. Could be problematic samples identified from previous QC steps, or user-defined samples. |
npcs |
the number principle components to use for the population prediction in SVM model. The default value is 4, and it is required to be <= 10. |
LDprune |
whether to use LD-pruned snp set, the default is TRUE. |
missing.rate |
to use the SNPs with "<= |
ss.cutoff |
the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort. |
maf |
to use the SNPs with ">= |
hwe |
to use the SNPs with Hardy-Weinberg equilibrium p >=
|
... |
Arguments to be passed to other methods. |
Using LD-pruned autosomal variants (by default), we
calculate the eigenvectors and eigenvalues for principle
component analysis (PCA). We use the benchmark samples as
training dataset, and predict the population group for each
sample in the study cohort based on the top four
eigenvectors. Samples with discordant predicted and
self-reported population groups are considered problematic. The
function PCACheck
performs the PCA analysis and
identifies population outliers in study cohort.
a data frame with sample name, reported population, data resource (benchmark vs study cohort), the first four eigenvectors and the predicted population.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.pca <- QCresult(seqfile)$PCA tail(res.pca)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1) res.pca <- QCresult(seqfile)$PCA tail(res.pca)
Plot QC results.
plotQC( seqfile, QCstep = c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA"), interactive = FALSE, sdcoef = 5, pc1 = "EV1", pc2 = "EV2", pairedScatter = FALSE, ... )
plotQC( seqfile, QCstep = c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA"), interactive = FALSE, sdcoef = 5, pc1 = "EV1", pc2 = "EV2", pairedScatter = FALSE, ... )
seqfile |
SeqSQC object with QC results. |
QCstep |
which QC step the user want to do plotting. Takes
values of |
interactive |
whether to generate interactive plot. Recommend
to use |
sdcoef |
for inbreeding outlier check, how many standard deviation we need for identification of inbreeding outliers. The default is 5. |
pc1 |
the eigenvector on x axis for PCA result. The default is "EV1" for eigenvector 1. |
pc2 |
the eigenvector on y axis for PCA result. The default is "EV2" for eigenvector 2. |
pairedScatter |
for PCA result, whether to plot the paired scatterplot for the first 4 PC axes. |
... |
Arguments to be passed to other methods. |
the ggplot or interactive plot (if output is in html format) for specific QC result. If "interactive=FALSE", it returns a ggplot and author could have the flexibility to add on any layers, scales, faceting specifications and coordinate systems.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) p <- plotQC(seqfile, QCstep="PCA", interactive=FALSE) p
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) p <- plotQC(seqfile, QCstep="PCA", interactive=FALSE) p
generate the problematic sample list from QC steps that have been done, and provide each problematic sample with a reason for removal (high missing rate, gender mismatch, inbreeding outlier, cryptic relationship or population outlier).
problemList(seqfile)
problemList(seqfile)
seqfile |
SeqSQC object with sample QC results. |
a list of 2 datasets: 1) a data frame with 2 columns:
sample
for problematic sample name, and
remove.reason
for the reason of removing the sample. 2)
a data frame with 1 column sample
for problematic
samples to be removed.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) problemList(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) problemList(seqfile)
Function to render the pre-compiled rmarkdown file to generate the sample QC report.
RenderReport(input, output, interactive = TRUE)
RenderReport(input, output, interactive = TRUE)
input |
SeqSQC object with QC results. |
output |
a character string to define the file name for the QC report. |
interactive |
whether to generate interative plots in the report. The default is TRUE. |
Will incure the rendering of the rmarkdown file for
generating the sample QC report. The report will return to the
file denoted in output
in the function.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) RenderReport(seqfile, output="report.html", interactive=FALSE)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) RenderReport(seqfile, output="report.html", interactive=FALSE)
This sample annotation file is a required input from the user when
using SeqSQC. It includes the sample info with sample name stored
in the column of sample
, the population info stored in the
column of population
, and the gender info stored in the
column of gender
. The population
column must be a in
the format of "AFR/EUR/ASN/EAS/SAS". The gender
column must
be in the format of "female/male". This file is meant to be used
together with the example_sub.vcf
as a runnable example in
the function of LoadVfile
and sampleQC
in the
vignette.
Qian Liu [email protected]
A wrap-up function for sample QC. It reads in the variant genotypes in vcf/PLINK format, merges study cohort with benchmark data, and performs sample QC for the merged dataset.
sampleQC( vfile = NULL, output = "sampleqc", capture.region = NULL, sample.annot = NULL, LDprune = TRUE, vfile.restrict = FALSE, slide.max.bp = 5e+05, ld.threshold = 0.3, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report = "report.html", interactive = TRUE, results = TRUE, plotting = TRUE, ... )
sampleQC( vfile = NULL, output = "sampleqc", capture.region = NULL, sample.annot = NULL, LDprune = TRUE, vfile.restrict = FALSE, slide.max.bp = 5e+05, ld.threshold = 0.3, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report = "report.html", interactive = TRUE, results = TRUE, plotting = TRUE, ... )
vfile |
vcf or PLINK input file (ped/map/bed/bim/fam with same
basename). The default is NULL. Vfile could be a vector of
character strings, see details. Could also take file in
|
output |
a character string for name of merged data of SeqSQC
object. The |
capture.region |
the BED file of sequencing capture regions. The default is NULL. For exome-sequencing data, the capture region file must be provided. |
sample.annot |
sample annotation file with 3 columns (with header) in the order of sample id, sample population and sex info. The default is NULL. |
LDprune |
whether to use LD-pruned snp set. The default is TRUE. |
vfile.restrict |
whether the input vcf or plink file has already been restricted by capture region. The default is FALSE. |
slide.max.bp |
the window size of SNPs when calculating linkage disequilibrium. The default is 5e+05. |
ld.threshold |
the r^2 threshold for LD-based SNP pruning if
|
format.data |
the data source. The default is |
format.file |
the data format. The default is |
QCreport |
Whether to generate the sample QC report in html format. |
out.report |
the file name for the sample QC report. The
default is |
interactive |
whether to generate interactive plots in the
sample QC report if |
results |
whether to write out the results for each QC steps in .txt files. The default is TRUE. |
plotting |
whether to output the plots for each QC steps in .pdf files. the default is TRUE. |
... |
Arguments to be passed to other methods. |
For vfile
with more than one file names,
sampleQC
will merge all dataset together if they all
contain the same samples. It is useful to combine
genetic/genomic data together if VCF data is divided by
chromosomes.
There are 3 columns in sample.annot
file. col 1 is sample
with sample ids, col 2 is
population
with values of "AFR/EUR/ASN/EAS/SAS", col 3
is gender
with values of "male/female".
a SeqSQC object with the filepath to the gds file which stores the genotype, the summary of samples and variants, and the QCresults including the sample annotation information and all QC results.
Qian Liu [email protected]
## Not run: infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC") sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC") cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC") outfile <- file.path(tempdir(), "testWrapUp") seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report="report.html", interactive = TRUE) ## save(seqfile, file="seqfile.RData") load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- sampleQC(sfile = seqfile, output = outfile, QCreport = FALSE, out.report="report.html", interactive = TRUE) ## End(Not run)
## Not run: infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC") sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC") cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC") outfile <- file.path(tempdir(), "testWrapUp") seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report="report.html", interactive = TRUE) ## save(seqfile, file="seqfile.RData") load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- sampleQC(sfile = seqfile, output = outfile, QCreport = FALSE, out.report="report.html", interactive = TRUE) ## End(Not run)
Function to open the gds file inside the SeqSQC object.
SeqOpen(seqfile, readonly = TRUE, allow.duplicate = FALSE)
SeqOpen(seqfile, readonly = TRUE, allow.duplicate = FALSE)
seqfile |
SeqSQC object, which has been merged with benchmark data. |
readonly |
whether to open the gds file in read-only mode. If "FALSE", it is allowed to write data to the file. The default is TRUE. |
allow.duplicate |
whether to allow to open a GDS file with read-only mode when it has been opened in the same R session. The default is FALSE. |
a gds file with the filepath in the input SeqSQC object.
Qian Liu [email protected]
library(gdsfmt) load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) dat <- SeqOpen(seqfile) dat closefn.gds(dat)
library(gdsfmt) load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) dat <- SeqOpen(seqfile) dat closefn.gds(dat)
A SeqSQC object is a list of two objects. The first object
gdsfile
is the filepath of the GDS (discussed in section
below) file which stores the genotype information from the original
VCF file. The second object QCresult
is a list of sample
information and QC results, which include the dimension (# of
samples and variants), sample annotation, and QC results for sample
missing rate, sex check, inbreeding outlier check, IBD check, and
population outlier check.
SeqSQC(gdsfile, QCresult = List()) gdsfile(x) gdsfile(x) <- value QCresult(x) QCresult(x) <- value ## S4 method for signature 'SeqSQC' gdsfile(x) ## S4 replacement method for signature 'SeqSQC' gdsfile(x) <- value ## S4 method for signature 'SeqSQC' QCresult(x) ## S4 replacement method for signature 'SeqSQC' QCresult(x) <- value
SeqSQC(gdsfile, QCresult = List()) gdsfile(x) gdsfile(x) <- value QCresult(x) QCresult(x) <- value ## S4 method for signature 'SeqSQC' gdsfile(x) ## S4 replacement method for signature 'SeqSQC' gdsfile(x) <- value ## S4 method for signature 'SeqSQC' QCresult(x) ## S4 replacement method for signature 'SeqSQC' QCresult(x) <- value
gdsfile |
A character string for the filepath of the GDS file. |
QCresult |
A list with sample information and sample QC results. |
x |
an SeqSQCClass object. |
value |
the new value for the SeqSQC object slots. |
The filepath to the gds file.
gdsfile
A character string for the filepath of the GDS file.
QCresult
A list with sample information and sample QC results.
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gdsfile(seqfile) QCresult(seqfile)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gdsfile(seqfile) QCresult(seqfile)
Function to calculate the X chromosome inbreeding coefficient and to predict sample gender.
SexCheck( seqfile, remove.samples = NULL, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, ... )
SexCheck( seqfile, remove.samples = NULL, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, ... )
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names for removal from sex check. Could be problematic samples identified from previous QC steps, or user-defined samples. |
missing.rate |
to use the SNPs with "<= |
ss.cutoff |
the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort. |
maf |
to use the SNPs with ">= |
... |
Arguments to be passed to other methods. |
Samples are predicted to be female or male if the
inbreeding coefficient is below 0.2, or greater than 0.8,
respectively. The samples with discordant reported gender and
predicted gender are considered as problematic. When the
inbreeding coefficient is within the range of [0.2, 0.8], β0β
is shown in the column of pred.sex
to indicate ambiguous
gender, which is not considered as problematic.
a data frame with sample name, reported gender, x chromosome inbreeding coefficient, and predicted gender.
Qian Liu [email protected]
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- SexCheck(seqfile, remove.samples=NULL, missing.rate=0.1) res.sexc <- QCresult(seqfile)$SexCheck tail(res.sexc)
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- SexCheck(seqfile, remove.samples=NULL, missing.rate=0.1) res.sexc <- QCresult(seqfile)$SexCheck tail(res.sexc)