Title: | A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend |
---|---|
Description: | VariantExperiment is a Bioconductor package for saving data in VCF/GDS format into a RangedSummarizedExperiment object. The high-throughput genetic/genomic data are saved in GDSArray objects. The annotation data for features/samples are saved in DelayedDataFrame format, with a mono-dimensional GDSArray in each column. Keeping both assay data and annotation data on disk allows on-disk reading and processing and saves memory significantly. The RangedSummarizedExperiment interface enables easy, familiar manipulation of high-throughput genetic/genomic data using the common SummarizedExperiment metaphor in R and Bioconductor. |
Authors: | Qian Liu [aut, cre], Hervé Pagès [aut], Martin Morgan [aut] |
Maintainer: | Qian Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.21.0 |
Built: | 2024-12-30 05:22:03 UTC |
Source: | https://github.com/bioc/VariantExperiment |
The package VariantExperiment takes a GDS file or a VCF file as input and saves it in a VariantExperiment object. Assay data are saved in GDSArray objects and annotation data are saved in DelayedDataFrame format, both of which remain on disk until needed. Common manipulations such as subsetting, mathematical transformation and statistical analysis can be done easily and quickly in R.
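As a rough illustration of the on-disk behaviour described above, the sketch below builds a VariantExperiment from the example GDS file shipped with SeqArray, checks that the first assay is a DelayedArray (GDSArray is assumed here to extend DelayedArray), and takes a small subset. The 5-by-3 subset and the variable names are arbitrary choices for illustration only.

```r
library(VariantExperiment)
library(SummarizedExperiment)

## example GDS file shipped with SeqArray
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)

## the assay is a delayed, on-disk array rather than an in-memory matrix
first_assay <- assay(ve, 1)
on_disk <- is(first_assay, "DelayedArray")
on_disk

## subsetting works as for any SummarizedExperiment (first 5 variants, 3 samples)
ve_small <- ve[1:5, 1:3]
dim(ve_small)
```

Subsetting the container does not load the full assay into memory; only the values that are eventually requested are read from the GDS file.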
Maintainer: Qian Liu [email protected]
Authors: Hervé Pagès, Martin Morgan
Useful links:
Report bugs at https://github.com/Bioconductor/VariantExperiment/issues
loadVariantExperiment loads a GDS back-end SummarizedExperiment object into the R session.
loadVariantExperiment(dir = tempdir())
dir |
The directory that holds the gds format of the array data and the saved SummarizedExperiment object with array data in GDSArray format, as written by saveVariantExperiment. |
A VariantExperiment object.
gds <- SeqArray::seqExampleFileName("gds")
## ve <- makeVariantExperimentFromGDS(gds)
## ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
aa <- tempfile()
## saveVariantExperiment(ve1, dir=aa, replace=TRUE)
## loadVariantExperiment(dir = aa)
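Most of the calls in the example above are commented out. A runnable version of the same save-and-load round trip might look like the following sketch, which assumes the SeqArray example GDS file is available and uses only the functions documented here; the names ve2 and same_dim are introduced for illustration.

```r
library(VariantExperiment)
library(GenomicRanges)

gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)

## keep only the variants that overlap the given region on chromosome 22
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))

## write the subset to a fresh directory, then read it back
aa <- tempfile()
saveVariantExperiment(ve1, dir = aa, replace = TRUE)
ve2 <- loadVariantExperiment(dir = aa)

## the reloaded object should have the same dimensions as the saved one
same_dim <- identical(dim(ve1), dim(ve2))
same_dim
```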
Conversion of GDS files into a VariantExperiment object.
makeVariantExperimentFromGDS(
  file,
  ftnode,
  smpnode,
  assayNames = NULL,
  rowDataColumns = NULL,
  colDataColumns = NULL,
  rowDataOnDisk = TRUE,
  colDataOnDisk = TRUE,
  infoColumns = NULL
)

makeVariantExperimentFromSEQGDS(
  file,
  ftnode = "variant.id",
  smpnode = "sample.id",
  assayNames = NULL,
  rowDataColumns = NULL,
  colDataColumns = NULL,
  infoColumns = NULL,
  rowDataOnDisk = TRUE,
  colDataOnDisk = TRUE
)

makeVariantExperimentFromSNPGDS(
  file,
  ftnode = "snp.id",
  smpnode = "sample.id",
  assayNames = NULL,
  rowDataColumns = NULL,
  colDataColumns = NULL,
  rowDataOnDisk = TRUE,
  colDataOnDisk = TRUE
)
file |
the GDS file name to be converted. |
ftnode |
the node name for feature id (e.g., "variant.id", "snp.id", etc.). |
smpnode |
the node name for sample id (e.g., "sample.id"). |
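assayNames |
the gds node name that will be read into the assays slot. The default is NULL. |
rowDataColumns |
which columns of rowData to import. The default is NULL. |
colDataColumns |
which columns of colData to import. The default is NULL. |
rowDataOnDisk |
whether to save the rowData on disk (in DelayedDataFrame format) rather than as an in-memory DataFrame. The default is TRUE. |
colDataOnDisk |
whether to save the colData on disk (in DelayedDataFrame format) rather than as an in-memory DataFrame. The default is TRUE. |
infoColumns |
which columns of info to import. The default is NULL. |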
A VariantExperiment object.
## gds file from DNA-seq data
seqfile <- SeqArray::seqExampleFileName(type="gds")
ve <- makeVariantExperimentFromGDS(seqfile)
## all assay data
names(assays(ve))
showAvailable(seqfile)

## only read specific columns for feature / sample annotation.
assayNamess <- showAvailable(seqfile)$assayNames
rowdatacols <- showAvailable(seqfile)$rowDataColumns
coldatacols <- showAvailable(seqfile)$colDataColumns
infocols <- showAvailable(seqfile)$infoColumns
ve1 <- makeVariantExperimentFromGDS(
    seqfile,
    assayNames = assayNamess[2],
    rowDataColumns = rowdatacols[1:3],
    colDataColumns = coldatacols[1],
    infoColumns = infocols[c(1,3,5,7)],
    rowDataOnDisk = FALSE,
    colDataOnDisk = FALSE)
assay(ve1)
## the rowData(ve1) and colData(ve1) are now in DataFrame format
rowData(ve1)
colData(ve1)

## gds file from genotyping data
snpfile <- SNPRelate::snpgdsExampleFileName()
ve <- makeVariantExperimentFromGDS(snpfile)
rowData(ve)
colData(ve)
metadata(ve)
## Only read specific columns for feature annotation.
showAvailable(snpfile)
ve1 <- makeVariantExperimentFromGDS(snpfile, rowDataColumns=c("snp.allele"))
rowRanges(ve1)

## use specific conversion functions for certain gds types
veseq <- makeVariantExperimentFromSEQGDS(seqfile)
vesnp <- makeVariantExperimentFromSNPGDS(snpfile)
makeVariantExperimentFromVCF is the function to convert a VCF file into a VariantExperiment object. The genotype data are written in GDSArray format and saved in the assays slot. The annotation info for variants or samples is written as a DelayedDataFrame object and saved in the rowData or colData slot.
makeVariantExperimentFromVCF(
  vcf.fn,
  out.dir = tempfile(),
  replace = FALSE,
  header = NULL,
  info.import = NULL,
  fmt.import = NULL,
  sample.info = NULL,
  ignore.chr.prefix = "chr",
  reference = NULL,
  start = 1L,
  count = -1L,
  parallel = FALSE,
  verbose = FALSE
)
vcf.fn |
the file name(s) of (compressed) VCF format; or a ‘connection’ object. |
out.dir |
The directory to save the gds format of the vcf data, and the newly generated VariantExperiment object with array data in GDSArray format. |
replace |
Whether to replace the directory if it already exists. The default is FALSE. |
header |
if NULL, ‘header’ is set to be ‘seqVCF_Header(vcf.fn)’, which is a list (with a class name "SeqVCFHeaderClass", S3 object). |
info.import |
characters, the variable name(s) in the INFO field for import; default is ‘NULL’ for all variables. |
fmt.import |
characters, the variable name(s) in the FORMAT field for import; default is ‘NULL’ for all variables. |
sample.info |
a character string giving the file path of the sample info data. The data must have column names (for phenotypes) and row names (sample IDs). No blank lines are allowed. The default is ‘NULL’ for no sample info. |
ignore.chr.prefix |
a vector of character, indicating the prefix of chromosome which should be ignored, like "chr"; it is not case-sensitive. |
reference |
genome reference, like "hg19", "GRCh37"; if the genome reference is not available in VCF files, users could specify the reference here. |
start |
the starting variant if importing part of VCF files. |
count |
the maximum count of variant if importing part of VCF files, -1 indicates importing to the end. |
parallel |
‘FALSE’ (serial processing), ‘TRUE’ (parallel processing), a numeric value indicating the number of cores, or a cluster object for parallel processing; ‘parallel’ is passed to the argument ‘cl’ in ‘seqParallel’, see ‘?SeqArray::seqParallel’ for more details. The default is FALSE. |
verbose |
whether to print the process messages. The default is FALSE. |
A VariantExperiment object.
## the vcf file
vcf <- SeqArray::seqExampleFileName("vcf")
## conversion
ve <- makeVariantExperimentFromVCF(vcf)
ve
## the filepath to the gds file.
gdsfile(ve)

## only read in specific info columns
ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile(), info.import=c("OR", "GP"))
ve
## convert without the INFO and FORMAT fields
ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile(), info.import=character(0), fmt.import=character(0))
ve
## now the assay data does not include the
## "annotation/format/DP/data", and the rowData(ve) does not include
## any info columns.
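The start and count arguments are not exercised in the examples. The following is a minimal sketch of a partial import, assuming the SeqArray example VCF file is available; the value 100L and the name ve_part are arbitrary and only for illustration.

```r
library(VariantExperiment)

vcf <- SeqArray::seqExampleFileName("vcf")

## import at most the first 100 variants of the VCF file
ve_part <- makeVariantExperimentFromVCF(
    vcf,
    out.dir = tempfile(),
    start = 1L,
    count = 100L)

## the number of rows should not exceed the requested count
nrow(ve_part)
```

Importing a slice like this can be a quick way to check a conversion before committing to a full-size VCF file.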
saveVariantExperiment saves all the assays in GDS format, including in-memory assays. Assays with delayed operations on them are realized while they are written to disk.
saveVariantExperiment(
  ve,
  dir = tempdir(),
  replace = FALSE,
  fileFormat = NULL,
  compress = "LZMA_RA",
  chunk_size = 1000,
  rowDataOnDisk = TRUE,
  colDataOnDisk = TRUE,
  verbose = FALSE
)
ve |
A SummarizedExperiment object, with the array data being ordinary array structure. |
dir |
The directory to save the gds format of the array data, and the newly generated SummarizedExperiment object with array data in GDSArray format. The default is temporary directory within the R session. |
replace |
Whether to replace the directory if it already exists. The default is FALSE. |
fileFormat |
File format for the output gds file. See details. |
compress |
the compression method for writing the gds file. The default is "LZMA_RA". |
chunk_size |
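The chunk size (number of columns) when reading GDSArray-based assays from input ve. The default is 1000. |
rowDataOnDisk |
whether to save the rowData on disk (in DelayedDataFrame format) rather than as an in-memory DataFrame. The default is TRUE. |
colDataOnDisk |
whether to save the colData on disk (in DelayedDataFrame format) rather than as an in-memory DataFrame. The default is TRUE. |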
verbose |
whether to print the process messages. The default is FALSE. |
If the input SummarizedExperiment object has GDSArray-based assay data, there is no need to specify the argument fileFormat. Otherwise, it takes the value SEQ_ARRAY for sequencing data or SNP_ARRAY for SNP array data.
A VariantExperiment object whose gdsfile() is the new ve.gds in the directory specified by the dir argument.
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
gdsfile(ve)
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1
gdsfile(ve1)
aa <- tempfile()
obj <- saveVariantExperiment(ve1, dir=aa, replace=TRUE)
obj
gdsfile(obj)
The function to show the available entries for the arguments within makeVariantExperimentFromGDS.
showAvailable(
  file,
  args = c("assayNames", "rowDataColumns", "colDataColumns", "infoColumns"),
  ftnode,
  smpnode
)
file |
the path to the gds.class file. |
args |
the arguments in makeVariantExperimentFromGDS for which to show the available entries. |
ftnode |
the node name for feature id (e.g., "variant.id", "snp.id", etc.). Must be provided if the file format is not SEQ_ARRAY or SNP_ARRAY. |
smpnode |
the node name for sample id (e.g., "sample.id"). Must be provided if the file format is not SEQ_ARRAY or SNP_ARRAY. |
## snp gds file
gds <- SNPRelate::snpgdsExampleFileName()
showAvailable(gds)

## sequencing gds file
gds <- SeqArray::seqExampleFileName("gds")
showAvailable(gds)
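The args argument can presumably be used to restrict the output to a subset of the four entries. The sketch below assumes that showAvailable returns a named list-like object with one element per requested argument, as the `$assayNames` access in the makeVariantExperimentFromGDS example suggests; the name avail is introduced for illustration.

```r
library(VariantExperiment)

gds <- SeqArray::seqExampleFileName("gds")

## ask only for the assay names and the rowData columns
avail <- showAvailable(gds, args = c("assayNames", "rowDataColumns"))
names(avail)

## the entries can then be passed on to makeVariantExperimentFromGDS
avail$assayNames
```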
VariantExperiment can represent big genomic data in a RangedSummarizedExperiment object with on-disk GDS back-end data. The assays are represented by DelayedArray objects; rowData and colData can be represented by DelayedDataFrame or DataFrame objects.
VariantExperiment(
  assays,
  rowRanges = GRangesList(),
  colData = DelayedDataFrame(),
  metadata = list()
)

## S4 method for signature 'VariantExperiment'
gdsfile(object)
assays |
A ‘list’ or ‘SimpleList’ of matrix-like elements, or a matrix-like object. All elements of the list must have the same dimensions, and dimension names (if present) must be consistent across elements and with the row names of ‘rowRanges’ and ‘colData’. |
rowRanges |
A GRanges or GRangesList object describing the ranges of interest. Names, if present, become the row names of the SummarizedExperiment object. The length of the GRanges or GRangesList must equal the number of rows of the matrices in ‘assays’. |
colData |
An optional DataFrame describing the samples. Row names, if present, become the column names of the VariantExperiment. |
metadata |
An optional ‘list’ of arbitrary content describing the overall experiment. |
object |
a VariantExperiment object. |
VariantExperiment class and slot getters and setters. See "?RangedSummarizedExperiment" for more details.
A VariantExperiment object.