Package 'h5vc' reference manual

Title:	Managing alignment tallies using a hdf5 backend
Description:	This package contains functions to interact with tally data from NGS experiments that is stored in HDF5 files.
Authors:	Paul Theodor Pyl
Maintainer:	Paul Theodor Pyl <[email protected]>
License:	GPL (>= 3)
Version:	2.41.1
Built:	2025-03-24 03:27:40 UTC
Source:	https://git.bioconductor.org/packages/h5vc

Managing alignment tallies using a hdf5 backend

Description

This package contains functions to interact with tally data from NGS experiments that is stored in HDF5 files. For detail see vignettes shipped with this package.

Details

Package:	h5vc
Type:	Package
Version:	1.0.4
Date:	2013-10-11
License:	GPL (>= 3)

This package is desgned to facilitate the analysis of genomics data through tallies stored in a HDF5 file. Within a HDF5 file the tally is simply a table of bases times genomic positions listing for each position the count of each base observed as a mismatch in the sample at any given position. Strand and sample are additional dimension in this array, which leads to a 4D-array called 'Counts'. The total coverage is stored in a separate array of 3 dimensions (Sample x Strand x Genomic Position) called 'Coverages', there is a 3 dimensional 'Deletions' array and a 1D-vector encoding the reference base ('Reference'). Those 4 arrays are stored as datasets within a HDF5 tally file in which the group-structure of the tally file encodes for the organisatorial levels of 'Study' and 'Chromosome'. For details on the layout of HDF5 files visit (http://www.hdfgroup.org), a short description is given in the vignettes.

Creating those HDF5 tally files can be accomplished from within R or through a Python script that will generate a tally file from a set of .bam files. The workflow is described in the vignettes h5vc.creating.tallies and h5vc.creating.tallies.within.R.

Author(s)

Paul Pyl Maintainer: Paul Pyl [email protected]

Preparing the results of tallyBAM for writing to an HDF5 tally file

Description

This function tallies a set of bam files and prepares the data for writing to an HDF5 tally file.

Usage

applyTallies( bamfiles, chrom, start, stop, q=25, ncycles = 0, max.depth=1000000, prepForHDF5 = TRUE, reference = NULL)
applyTallies( bamfiles, chrom, start, stop, q=25, ncycles = 0, max.depth=1000000, prepForHDF5 = TRUE, reference = NULL)

Arguments

`bamfiles`	A character vector of filenames of the bam files that should be tallies. Note that for writing to an HDF5 file the order of this vector must match the order of the Column field in the sampledata object that corresponds to the dataset - see `setSampleData` for details.
`prepForHDF5`	Boolean flag to specify whether the data shall be structured for compatibility with the HDF5 tally file format. See the details section of this manual page.
`reference`	A DNAString object containing the reference sequence corresponding to the region that is described in the counts array – if this is `NULL` a consensus vote will be used to estimate the reference at any given position, this means you cannot detect variants with AF >= 0.5 anymore
`chrom`	Chromosome in which to tally
`start`	First position of the tally
`stop`	Last position of the tally
`q`	quality cut-off for considering a base call
`ncycles`	number of sequencing cycles form the front and back of the read that should be considered unreliable - used for stratifying the nucleotide counts
`max.depth`	only tally a position if there are less than this many reads overlapping it - can prevent long runtimes in unreliable regions

Details

This is a wrapper function for applying tallyBAM to a set of bam files specified in the bamfiles argument. If prepForHDF5 is not true the result is equivalent to calling tallyBAM with lapply on the file names, otherwise the resulting data structure has the same layout as the return value of h5readBlock and can be written to an HDF5 tally file directly. The order or samples along the sample dimension is the same as the order of the file names (i.e. the order of the bamfiles argument).

Value

A list with slots containing the Counts,Coverages,Deletions and Reference datasets for the given sample if prepForHDF5 is true, a list of 3D-arrays (Nucleotide x Strand x Position) otherwise.

Author(s)

Paul Pyl

Examples

library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- applyTallies( bamFiles, reference = Hsapiens[["chr1"]][startpos:endpos], chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
str(theData)
library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- applyTallies( bamFiles, reference = Hsapiens[["chr1"]][startpos:endpos], chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
str(theData)

Tallying bam files in parallel using `BatchJobs` on high performance compute clusters (HPC)

Description

These function tally a set of bam files in blocks spanning a specified region and write the results to an HDF5 tally file; uses BatchJobs for parallel computation on HPCs

Usage

batchTallyParam(
  bamFiles,
  destination,
  group,
  chrom, start, stop,
  blocksize = 100000,
  registryDir = tempdir(),
  resources = list("queue" = "research-rh6", "memory"="4000", "ncpus"="4", walltime="90:00"),
  q=25, ncycles = 0, max.depth=1000000,
  reference = NULL,
  sleep = 5
)

batchTallies( confList = batchTallyParam() )

rerunBatchTallies( confList, tryCollect = TRUE )

collectTallies(blocks, confList, registries )
batchTallyParam(
  bamFiles,
  destination,
  group,
  chrom, start, stop,
  blocksize = 100000,
  registryDir = tempdir(),
  resources = list("queue" = "research-rh6", "memory"="4000", "ncpus"="4", walltime="90:00"),
  q=25, ncycles = 0, max.depth=1000000,
  reference = NULL,
  sleep = 5
)

batchTallies( confList = batchTallyParam() )

rerunBatchTallies( confList, tryCollect = TRUE )

collectTallies(blocks, confList, registries )

Arguments

`bamFiles`	A character vector of filenames of the bam files that should be tallies. Note that for writing to an HDF5 file the order of this vector must match the order of the Column field in the sampledata object that corresponds to the dataset - see `setSampleData` for details.
`reference`	A DNAString object containing the reference sequence corresponding to the region that is to be tallied – if this is `NULL` a consensus vote will be used to estimate the reference at any given position, this means you cannot detect variants with AF >= 0.5 anymore – especially when tallying more than one bamFile you really should specify this
`destination`	Filename of the HDF5 tally file that will be written to – this needs to contain all the groups and datasets already – see `prepareTallyFile` for details
`group`	Location within the tally file where the data will be written – e.g. `"/ExampleStudy/22"`
`chrom`	Chromosome in which to tally
`start`	First position of the tally
`stop`	Last position of the tally
`q`	quality cut-off for considering a base call
`ncycles`	number of sequencing cycles form the front and back of the read that should be considered unreliable - used for stratifying the nucleotide counts
`max.depth`	only tally a position if there are less than this many reads overlapping it - can prevent long runtimes in unreliable regions
`blocksize`	Size of the blocks in bases that the tallying will be performed in, this influences the number of jobs send to the cluster
`registryDir`	Directory in which the registries created by `BatchJobs` wil be held, this can be temporary since we delete them when we are done
`resources`	A named list specifying the resource requirements of the cluster jobs, this must contain names for the fields specified in the cluster configuration file – see the documentation of `BatchJobs` for details
`confList`	A configuration list as returned by a call to `batchTallyParam()`
`sleep`	Number of seconds to sleep before checking if blocks are finshed, increase this if you have large blocks and find the output of `batchTallies` to verbose
`tryCollect`	Boolean flag specifying whether the `rerunBatchTallies` function should try to collect data from the specified registries before re-submitting.
`blocks`	`data.frame` defining blocks to tally in, result of a cal to `defineBlocks`
`registries`	A list mapping registry IDs to the work paths of the corresponding registries

Details

This is a wrapper function for applying tallyBAM to a set of bam files specified in the bamFiles argument. The order or samples along the sample dimension is the same as the order of the file names (i.e. the order of the bamfiles argument). The function uses BatchJobs to dispatch tallying in blocks along the genome to a HPC and collects the results and writes them into the HDF5 tally file specified in the destination parameter.

rerunBatchTallies can be used to re-submit failed blocks.

collectTallies can be used to manually collect tally data from the registries created by batchTallies

Value

[None] – prints progress messages along the way.

Author(s)

Paul Pyl

Examples

## Not run: 
library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
batchTallies( batchTallyParam(bamFiles, chrom, startpos, endpos) )

## End(Not run)
## Not run: 
library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
batchTallies( batchTallyParam(bamFiles, chrom, startpos, endpos) )

## End(Not run)

Function for binning a genome.

Description

Function for generating a GRanges representation of a binning of the genome given in the reference object.

Usage

binGenome(reference, binsize = 1e+06, chroms = seqnames(reference))
binGenome(reference, binsize = 1e+06, chroms = seqnames(reference))

Arguments

`reference`	A `BSgenome` object.
`binsize`	Size of bins along the genome.
`chroms`	Which chromosomes to use, defaults to all chromosome described as `seqnames` of the `reference` object.

Details

This function creates a GRanges object that represents bins of size binsize along the genome represented by the reference object.

Value

A GRanges object that represents bins of size binsize along the genome represented by the reference object, includes special handling of chromosomes shorter than binsize and the last bin of each chromosome.

Author(s)

Paul Theodor Pyl

Examples

library(BSgenome.Hsapiens.NCBI.GRCh38)
bins <- binGenome(Hsapiens, binsize = 100e6, chroms = c("1","2","3","X","MT"))
bins
library(BSgenome.Hsapiens.NCBI.GRCh38)
bins <- binGenome(Hsapiens, binsize = 100e6, chroms = c("1","2","3","X","MT"))
bins

Estimate allelic frequency distributions in bins along the genome

Description

This function is used to give estimates of the ditribution of observed allelic freuqencies in a regions of the genome, use in conjunction with h5dapply

Usage

binnedAFs(data, sampledata, normalise = TRUE, binWidth = 0.05, minCov = 10, minCount = 2)
binnedAFs(data, sampledata, normalise = TRUE, binWidth = 0.05, minCov = 10, minCount = 2)

Arguments

`data`	A `list` object returned by a call to `h5dapply` or `h5readBlock`.
`sampledata`	Sample metadata describing the cohort, can be extracted from an HDF5 tally file using the `getSampleData` function.
`normalise`	Boolean flag to specify whether the counts or percentages of observed allelic frequencies should be returned.
`binWidth`	Width of bins in allelic frequency space, defaults to 0.05.
`minCov`	Minimum required coverage for a position to be considered.
`minCount`	Minimum required number of mismatches for a position to be considered.

Value

A matrix of AF bins times samples.

Author(s)

Paul Theodor Pyl

Examples

library(h5vc)
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
afs <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c("Counts", "Coverages"),
  range = c(29e6, 29.05e6),
  blocksize = 1e4,
  FUN = binnedAFs,
  sampledata = sampleData
)
afs[[3]]
library(h5vc)
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
afs <- h5dapply(
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c("Counts", "Coverages"),
  range = c(29e6, 29.05e6),
  blocksize = 1e4,
  FUN = binnedAFs,
  sampledata = sampleData
)
afs[[3]]

Variant calling

Description

These functions implement various attempts at variant calling.

Usage

callVariantsPaired( data, sampledata, cl = vcConfParams() )

vcConfParams(
  minStrandCov = 5,
  maxStrandCov = 200,
  minStrandAltSupport = 2,
  maxStrandAltSupportControl = 0,
  minStrandDelSupport = minStrandAltSupport,
  maxStrandDelSupportControl = maxStrandAltSupportControl,
  minStrandInsSupport = minStrandAltSupport,
  maxStrandInsSupportControl = maxStrandAltSupportControl,
  minStrandCovControl = 5,
  maxStrandCovControl = 200,
  bases = 5:8,
  returnDataPoints = TRUE,
  annotateWithBackground = TRUE,
  mergeCalls = TRUE,
  mergeAggregator = mean,
  pValueAggregator = max
)
callVariantsPaired( data, sampledata, cl = vcConfParams() )

vcConfParams(
  minStrandCov = 5,
  maxStrandCov = 200,
  minStrandAltSupport = 2,
  maxStrandAltSupportControl = 0,
  minStrandDelSupport = minStrandAltSupport,
  maxStrandDelSupportControl = maxStrandAltSupportControl,
  minStrandInsSupport = minStrandAltSupport,
  maxStrandInsSupportControl = maxStrandAltSupportControl,
  minStrandCovControl = 5,
  maxStrandCovControl = 200,
  bases = 5:8,
  returnDataPoints = TRUE,
  annotateWithBackground = TRUE,
  mergeCalls = TRUE,
  mergeAggregator = mean,
  pValueAggregator = max
)

Arguments

`data`	A `list` with elements `Counts` (a 4d `integer` array of size [1:12, 1:2, 1:k, 1:n]), `Coverage` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Deletions` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Reference` (a 1d `integer` vector of size [1:n]) – see Details.
`sampledata`	A `data.frame` with `k` rows (one for each sample) and columns `Type`, `Column` and (`SampleGroup` or `Patient`). The tally file should contain this information as a group attribute, see `getSampleData` for an example.
`cl`	A list with parameters used by the variant calling functions. Such a list can be produced, for instance, by a call to `vcConfParams`.
`minStrandCov`	Minimum coverage per strand in the case sample.
`maxStrandCov`	Maximum coverage per strand in the case sample.
`minStrandCovControl`	Minimum coverage per strand in the control sample.
`maxStrandCovControl`	Maximum coverage per strand in the control sample.
`minStrandAltSupport`	Minimum support for the alternative allele per strand in the case sample. This should be 1 or higher.
`maxStrandAltSupportControl`	Maximum support for the alternative allele per strand in the control sample. This should usually be 0.
`minStrandDelSupport`	Minimum support for the deletion per strand in the case sample. This should be 1 or higher.
`maxStrandDelSupportControl`	Maximum support for the deletion per strand in the control sample. This should usually be 0.
`minStrandInsSupport`	Minimum support for the insertion per strand in the case sample. This should be 1 or higher.
`maxStrandInsSupportControl`	Maximum support for the insertion per strand in the control sample. This should usually be 0.
`bases`	Indices for subsetting in the bases dimension of the Counts array, 5:8 extracts only those calls made in the middle one of the sequencing cycle bins.
`returnDataPoints`	Boolean flag to specify that a data.frame with the variant calls should be returned, otherwise only position are returned as a numeric vector. If `returnDataPoints == FALSE` only the variant positions are returned.
`annotateWithBackground`	Boolean flag to specify that the background mismatch / deletion frequency estimated from all control samples in the cohort should be added to the output. A simple binomial test will be performed as well. Only usefull if `returnDataPoints == TRUE`
`mergeCalls`	Boolean flag to specify that adjacent calls should be merged where appropriate (used by `callDeletionsPaired`). Only usefull applied if `returnDataPoints == TRUE`
`mergeAggregator`	Aggregator function for merging adjacent calls, defaults to `mean`, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages
`pValueAggregator`	Aggregator function for combining the p-values of adjacent calls when merging, defaults to `max`. Is only applied if `annotateWithBackground == TRUE`

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling. This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See ?h5dapply for specifics. In order for callVariantsPaired to return the correct locations of the variants there must be the h5dapplyInfo slot present in data as well. This is itself a list (being automatically added by h5dapply and h5readBlock respectively) and contains the slots Group (location in the HDF5 file) and Blockstart, which are used to set the chromosome and the genomic positions of variants.

vcConfParams is a helper function that builds a set of variant calling parameters as a list. This list is provided to the calling functions e.g. callVariantsPaired and influences their behavior.

callVariantsPaired implements a simple pairwise variant callign approach applying the filters specified in cl, and might additionally computes an estimate of the background mismatch rate (the mean mismatch rate of all samples labeled as 'Control' in the sampledata and annotate the calls with p-values for the binom.test of the observed mismatch counts and coverage at each of the samples labeled as 'Case'.

Value

The result is either a list of positions with SNVs / deletions or a data.frame containing the calls themselves which might contain annotations. Adjacent calls might be merged and calls might be annotated with p-values depending on configuration parameters.

When the configuration parameter returnDataPoints is FALSE the functions return the positions of potential variants as a list containing one integer vector of positions for each sample, if no positions were found for a sample the list will contain NULL instead. In the case of returnDatapoints == TRUE the functions return either NULL if no poisitions were found or a data.frame with the following slots:

`Chrom`	The chromosome the potential variant / deletion is on
`Start`	The starting position of the variant / deletion
`End`	The end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
`Sample`	The `Case` sample in which the variant was observed
`altAllele`	The alternate allele for SNVs (skipped for deletions, would be `"-"`)
`refAllele`	The reference allele for SNVs (skipped for deletions since the tally file might not contain all the information necessary to extract it)
`caseCountFwd`	Support for the variant in the `Case` sample on the forward strand
`caseCountRev`	Support for the variant in the `Case` sample on the reverse strand
`caseCoverageFwd`	Coverage of the variant position in the `Case` sample on the forward strand
`caseCoverageRev`	Coverage of the variant position in the `Case` sample on the reverse strand
`controlCountFwd`	Support for the variant in the `Control` sample on the forward strand
`controlCountRev`	Support for the variant in the `Control` sample on the reverse strand
`controlCoverageFwd`	Coverage of the variant position in the `Control` sample on the forward strand
`controlCoverageRev`	Coverage of the variant position in the `Control` sample on the reverse strand

If the annotateWithBackground option is set the following extra columns are returned

`backgroundFrequencyFwd`	The averaged frequency of mismatches / deletions at the position of all samples of type `Control` on the forward strand
`backgroundFrequencyRev`	The averaged frequency of mismatches / deletions at the position of all samples of type `Control` on the reverse strand
`pValueFwd`	The `p.value` of the test `binom.test( caseCountFwd, caseCoverageFwd, p = backgroundFrequencyFwd, alternative = "greater")`
`pValueRev`	The `p.value` of the test `binom.test( caseCountRev, caseCoverageRev, p = backgroundFrequencyRev, alternative = "greater")`

The function callDeletionsPaired merges adjacent single-base deletion calls if the option mergeCalls is set to TRUE, in that case the counts and coverages ( e.g. caseCountFwd ) are aggregated using the function supplied in the mergeAggregator option of the configuration list (defaults to mean) and the p-values pValueFwd and pValueFwd (if annotateWithBackground is TRUE), are aggregated using the function supplied in the pValueAggregator option (defaults to max).

Author(s)

Paul Pyl

Examples

  library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsPaired,
    sampledata = sampleData,
    cl = vcConfParams(returnDataPoints=TRUE),
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant
library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsPaired,
    sampledata = sampleData,
    cl = vcConfParams(returnDataPoints=TRUE),
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

Paired variant calling using fisher tests

Description

This function implements a simple paired variant calling strategy based on the fisher test

Usage

callVariantsPairedFisher(data, sampledata, pValCutOff = 0.05, minCoverage = 5, mergeDels = TRUE, mergeAggregator = mean)
callVariantsPairedFisher(data, sampledata, pValCutOff = 0.05, minCoverage = 5, mergeDels = TRUE, mergeAggregator = mean)

Arguments

`data`	A `list` with elements `Counts` (a 4d `integer` array of size [1:12, 1:2, 1:k, 1:n]), `Coverage` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Reference` (a 1d `integer` vector of size [1:n]) – see Details.
`sampledata`	A `data.frame` with `k` rows (one for each sample) and columns `Type`, `Column` and (`Group` or `Patient`). The tally file should contain this information as a group attribute, see `getSampleData` for an example.
`pValCutOff`	Maximum allowed p-Value for the fisher test on contingency matrix `matrix(c(caseCounts, caseCoverage, controlCounts, controlCoverage), nrow=2)`.
`minCoverage`	Required coverage in both sample for a call to be made
`mergeDels`	Boolean flag specifying whether adjacent deletions should be merged
`mergeAggregator`	Which function to use for aggregating the values associated with adjacent deletions that are being merged

Details

data is a list which has to at least contain the Counts, Coverages and Reference datasets. This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsPairedFisher implements a simple pairwise variant callign approach based on using the fisher.test on the following contingency matrix:

caseSupport	caseCoverage - caseSupport
conttrolSupport	controlCoverage - controlSupport

The results are filtered by pValCutOff and minCoverage.

Value

The return value is a data.frame with the following slots:

`Chrom`	The chromosome the potential variant is on
`Start`	The starting position of the variant
`End`	The end position of the variant
`Sample`	The `Case` sample in which the variant was observed
`refAllele`	The reference allele
`altAllele`	The alternate allele
`caseCount`	Support for the variant in the `Case` sample
`caseCoverage`	Coverage of the variant position in the `Case` sample
`controlCount`	Support for the variant in the `Control` sample
`controlCoverage`	Coverage of the variant position in the `Control` sample
`pValue`	The `p.value` of the `fisher.test`

Author(s)

Paul Pyl

Examples

library(h5vc) # loading library
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
position <- 29979629
windowsize <- 2000
vars <- h5dapply( # Calling Variants
  filename = tallyFile,
  group = "/ExampleStudy/16",
  blocksize = 1000,
  FUN = callVariantsPairedFisher,
  sampledata = sampleData,
  pValCutOff = 0.1,
  names = c("Coverages", "Counts", "Reference"),
  range = c(position - windowsize, position + windowsize),
  verbose = TRUE
)
vars <- do.call(rbind, vars)
vars
library(h5vc) # loading library
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
position <- 29979629
windowsize <- 2000
vars <- h5dapply( # Calling Variants
  filename = tallyFile,
  group = "/ExampleStudy/16",
  blocksize = 1000,
  FUN = callVariantsPairedFisher,
  sampledata = sampleData,
  pValCutOff = 0.1,
  names = c("Coverages", "Counts", "Reference"),
  range = c(position - windowsize, position + windowsize),
  verbose = TRUE
)
vars <- do.call(rbind, vars)
vars

Single sample variant calling

Description

A simple single sample variant calling function (calling SNVs and deletions)

Usage

callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)
callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)

Arguments

`data`	A `list` with elements `Counts` (a 4d `integer` array of size [1:12, 1:2, 1:k, 1:n]), `Coverage` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Deletions` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Reference` (a 1d `integer` vector of size [1:n]) – see Details.
`sampledata`	A `data.frame` with `k` rows (one for each sample) and columns `Column` and (`Sample`. The tally file should contain this information as a group attribute, see `getSampleData` for an example.
`samples`	The samples on which variants should be called, by default all samples specified in sampledata are used
`errorRate`	The expected error rate of the sequencing technology that was used, for illumina this should be `1/1000`
`minSupport`	minimal support required for a position to be considered variant
`minAF`	minimal allelic frequency for an allele at a position to be cosidered a variant
`minStrandSupport`	minimal per-strand support for a position to be considered variant
`mergeDels`	Boolean flag to specify that adjacent deletion calls should be merged
`aggregator`	Aggregator function for merging statistics of adjacent deletion calls, defaults to `mean`, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages etc.

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling (if Deletions is not present no deletion calls will be made). This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsSingle implements a simple single sample variant callign approach for SNVs and deletions (if Deletions is a dataset present in the data parameter. The function applies three essential filters to the provided data, requiring:

- minSupport total support for the variant at the position - minStrandSupport support for the variant on each strand - an allele freqeuncy of at least minAF (for pure diploid samples this can be set relatively high, e.g. 0.3, for calling potentially homozygous variants a value of 0.8 or higher might be used)

Calls are annotated with the p-Value of a binom.test of the present support and coverage given the error rate provided in the errorRate parameter, no filtering is done on this annotation.

Adjacent deletion calls are merged based in the value of the mergeDels parameter and their statistics are aggregated with the function supplied in the aggregator parameter.

Value

This function returns a data.frame containing annotated calls with the following slots:

`Chrom`	The chromosome the potential variant / deletion is on
`Start`	The starting position of the variant / deletion
`End`	The end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
`Sample`	The sample in which the variant was called
`altAllele`	The alternate allele for SNVs (deletions will have a `"-"` in that slot)
`refAllele`	The reference allele for SNVs (deletions will have the deleted sequence here as extracted from the `Reference` dataset, if the tally file contains a sparse representation of the reference, i.e. only positions with mismatches show a reference value the missing values are substituted with `"N"`'s. It is strongly suggested to write the whole reference into the tally file prior to deletion calling - see `writeReference` for details)
`SupFwd`	Support for the variant in the sample on the forward strand
`SupRev`	Support for the variant in the sample on the reverse strand
`CovFwd`	Coverage of the variant position in the sample on the forward strand
`CovRev`	Coverage of the variant position in the sample on the reverse strand
`AF_Fwd`	Allele frequency of the variant in the sample on the forward strand
`AF_Rev`	Allele frequency of the variant in the sample on the reverse strand
`Support`	Total Support of the variant - i.e. `SupFwd + SupRev`
`Coverage`	Total Coverage of the variant position - i.e. `CovFwd + CovRev`
`AF`	Total allele frequency of the variant, i.e. `Support / Coverage`
`fBackground`	Background frequency of the variant in all samples but the one the variant is called in
`pErrorFwd`	Probablity of the observed support and coverage given the error rate on the forward strand
`pErrorRev`	Probablity of the observed support and coverage given the error rate on the reverse strand
`pError`	Probablity of the observed support and coverage given the error rate on both strands combined
`pError`	Coverage of the variant position in the `Control` sample on the forward strand
`pStrand`	p-Value of a `fisher.test` on the contingency matrix `matrix(c(CovFwd,CovRev,SupFwd,SupRev), nrow = 2)` at this position - low values could indicate strand bias

Author(s)

Paul Pyl

Examples

  library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant
library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

Coverage analysis

Description

Functions to do analyses based on coverage

Usage

binnedCoverage( data, sampledata, gccount = FALSE )
binnedCoverage( data, sampledata, gccount = FALSE )

Arguments

`data`	A `list` with element `Coverage` (a 3d `integer` array of size [1:2, 1:k, 1:n])
`sampledata`	A `data.frame` with `k` rows (one for each sample) and columns `Type`, `Column` and (`SampleGroup` or `Patient`). The tally file should contain this information as a group attribute, see `getSampleData` for an example.
`gccount`	Boolean flag to specify whether the gc count of the bin should be reported as well, `Reference` must be a slot in the `data` object

Details

Explanations:

This computes the per sample coverage in a given bin (determined by the width of data). This feature is not implemented yet!

Value

Returns a data.frame with columns containing the coverage with the current bin for all samples provided in sampledata. The binsize is determined by the blocksize argument given to h5dapply when this function is run directly on a tally file.

Author(s)

Paul Pyl

Examples

  # loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/22" )
  data <- h5dapply( # extractting coverage binned at 1000 bases
    filename = tallyFile,
    group = "/ExampleStudy/22",
    blocksize = 1000,
    FUN = binnedCoverage,
    sampledata = sampleData,
    gccount = TRUE,
    names = c( "Coverages", "Reference" ),
    range = c(38900000,39000000)
  )
  data <- do.call(rbind, data)
  rownames(data) <- NULL
  head(data)
# loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/22" )
  data <- h5dapply( # extractting coverage binned at 1000 bases
    filename = tallyFile,
    group = "/ExampleStudy/22",
    blocksize = 1000,
    FUN = binnedCoverage,
    sampledata = sampleData,
    gccount = TRUE,
    names = c( "Coverages", "Reference" ),
    range = c(38900000,39000000)
  )
  data <- do.call(rbind, data)
  rownames(data) <- NULL
  head(data)

geom_h5vc

Description

Plotting function that returns a ggplot2 layer representing the specified dataset for the specified samples in the region [positon - windowsize, position + windowsize].

Usage

geom_h5vc( data, sampledata, samples=sampledata$Sample, windowsize, position, dataset, ... )
geom_h5vc( data, sampledata, samples=sampledata$Sample, windowsize, position, dataset, ... )

Arguments

`data`	The data to be plotted. Returned by `h5dapply`. Must be centered on `position`, extend by `windowsize` in each direction and contain a slot named like the `dataset` argument
`sampledata`	The sampledata for the cohort represented by `data`. Returned by `getSampleData`
`samples`	A character vector listing the names of samples to be plotted, defaults to all samples as described in `sampledata`
`windowsize`	Size of the window in which to plot on each side. The total interval that is plotted will be [position-windowsize,position+windowsize]
`position`	The position at which the plot shall be centered
`dataset`	The slot in the `data` argument that should be plotted
`...`	Paramteters to be passed to the internally used `geom_rect`, see `geom_rect` for details

Details

Creates a ggplot layer centered on position using the specified dataset from list data, annotating it with sample information provided in the data.frame sampledata and showing all samples listed in sample. The resulting plot uses ggplot2's geom_rect to draw boxes representing the values from dataset. The x-axis is the position and will span the interval [positon - windowsize, position + windowsize]. The x-axis is centered at 0 and additional layers to be added to the plot should be centered at 0 also.

Ths function allows for fast creation of overview plots similar to mismatchPlot (without the stacking of tracks). The example below shows how one can create a plot showing the coverage and number of mismatches per position (but not the alternative allele) for a given region.

Value

A ggplot layer object containing the plot of the specified dataset, this can be used like any other ggplot layer, i.e. it may be added to another plot.

Author(s)

Paul Pyl

Examples

  # loading library and example data
  library(h5vc)
  library(ggplot2)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 30
  samples <- sampleData$Sample[sampleData$Patient == "Patient8"]
  data <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = windowsize * 3, #choose blocksize larger than range so that all needed data is collected as one block
    names = c("Coverages", "Counts", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )[[1]]
  # Summing up all mismatches irrespective of the alternative allele
  data$CountsAggregate = colSums(data$Counts)
  # Simple overview plot showing number of mismatches per position
  p <- ggplot() + 
  geom_h5vc( data=data, sampledata=sampleData, windowsize = 35, position = 500, dataset = "Coverages", fill = "gray" ) + 
  geom_h5vc( data=data, sampledata=sampleData, windowsize = 35, position = 500, dataset = "CountsAggregate", fill = "#D50000" ) + 
  facet_wrap( ~ Sample, ncol = 2 )
  print(p)
# loading library and example data
  library(h5vc)
  library(ggplot2)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 30
  samples <- sampleData$Sample[sampleData$Patient == "Patient8"]
  data <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = windowsize * 3, #choose blocksize larger than range so that all needed data is collected as one block
    names = c("Coverages", "Counts", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )[[1]]
  # Summing up all mismatches irrespective of the alternative allele
  data$CountsAggregate = colSums(data$Counts)
  # Simple overview plot showing number of mismatches per position
  p <- ggplot() + 
  geom_h5vc( data=data, sampledata=sampleData, windowsize = 35, position = 500, dataset = "Coverages", fill = "gray" ) + 
  geom_h5vc( data=data, sampledata=sampleData, windowsize = 35, position = 500, dataset = "CountsAggregate", fill = "#D50000" ) + 
  facet_wrap( ~ Sample, ncol = 2 )
  print(p)

Reading and writing sample data from / to a tally file

Description

These functions allow reading and writing of sample data to the HDF5-based tally files. The sample data is stored as group attribute.

Usage

getSampleData( filename, group )
setSampleData( filename, group, sampleData, largeAttributes = FALSE, stringSize = 64 )
getSampleData( filename, group )
setSampleData( filename, group, sampleData, largeAttributes = FALSE, stringSize = 64 )

Arguments

`filename`	The name of a tally file
`group`	The name of a group within that tally file, e.g. `/ExampleStudy/22`
`sampleData`	A `data.frame` with `k` rows (one for each sample) and columns `Type`, `Column` and (`SampleGroup` or `Patient`. Additional column will be added as well but are not required.)
`largeAttributes`	HDF5 limits the size of attributes to 64KB, if you have many samples setting this flag will write the attributes in a separate dataset instead. `getSampleData` is aware of this and automatically chooses the dataset-stored attributes if they are present
`stringSize`	Maximum length for string attributes (number of characters) - default of 64 characters should be fine for most cases; This has to be specified since we do not support variable length strings as of now.

Details

The returned data.frame contains information about the sample ids, sample columns in the sample dimension of the dataset. The type of sample must be one of c("Case","Control") to be used with the provided SNV calling function. Additional relevant per-sample information may be stored here.

Note that the following columns are required in the sample data where the rows represent samples in the cohort:

Sample: the sample id of the corresponding sample

Column: the index within the genomic position dimension of the corresponding sample, be aware that getSampleData and setSampleData automatically add / remove 1 from this value since internally the tally files store the dimension 0-based whereas within R we count 1-based.

Patient the patient id of the corresponding sample

Type the type of sample

Value

sampledata

A data.frame with k rows (one for each sample) and columns Type, Column and (SampleGroup or Patient).

Author(s)

Paul Pyl

Examples

  # loading library and example data
  library(h5vc)
  # We make a copy of the file to tmp here, this is only needed if we want to keep the original intact.
  tallyFile <- tempfile()
  stopifnot(file.copy(system.file( "extdata", "example.tally.hfs5", package = "h5vcData" ), tallyFile))
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  sampleData
  # modify  the sample data
  sampleData$AnotherColumn <- paste( sampleData$Patient, "Modified" )
  # write to tallyFile
  setSampleData( tallyFile, "/ExampleStudy/16", sampleData )
  # re-load and check if it worked
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  sampleData
# loading library and example data
  library(h5vc)
  # We make a copy of the file to tmp here, this is only needed if we want to keep the original intact.
  tallyFile <- tempfile()
  stopifnot(file.copy(system.file( "extdata", "example.tally.hfs5", package = "h5vcData" ), tallyFile))
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  sampleData
  # modify  the sample data
  sampleData$AnotherColumn <- paste( sampleData$Patient, "Modified" )
  # write to tallyFile
  setSampleData( tallyFile, "/ExampleStudy/16", sampleData )
  # re-load and check if it worked
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  sampleData

h5dapply

Description

This is the central function of the h5vc package, allows an apply operation along common dimensions of datasets in a tally file.

Usage

## S4 method for signature 'numeric'
h5dapply( ..., blocksize, range)
## S4 method for signature 'GRanges'
h5dapply( ..., group, range)
## S4 method for signature 'IRanges'
h5dapply( ..., range)
## S4 method for signature 'numeric'
h5dapply( ..., blocksize, range)
## S4 method for signature 'GRanges'
h5dapply( ..., group, range)
## S4 method for signature 'IRanges'
h5dapply( ..., range)

Arguments

`blocksize`	The size of the blocks in which to process the data (integer)
`...`	Further parameters to be handed over to `FUN`
`range`	The range along the specified dimensions which should be processed, this allows for limiting the apply to a specific region or set of samples, etc. - optional (defaults to the whole chromosome); This can be a `GRanges`, `IRanges` or numerical vector of length 2 (i.e [start, stop])
`group`	The group (location) within the HDF5 file, note that when range is `numeric` or `IRanges` this has to point to the location of the chromosome, e.g. `/ExampleTally/Chr7`. When range is a `GRanges` object, the chromosome information is encoded in the `GRanges` directly and `group` should only point to the root-group of the study, i.e. `/ExampleTally`

Details

Additional function parameters are:

filename: The name of a tally file to process
group: The name of a group in that tally file
FUN: The function to apply to each block, defaults to function(x) x, which returns the data as is (a list of arrays)
names: The names of the datasets to extract, e.g. c("Counts","Coverages") - optional (defaults to all datasets)
dims: The dimension to apply along for each dataset in the same order as names, these should correspond to compatible dimensions between the datsets. - optional (defaults to the genomic position dimension)
samples: Character vector of sample names - must match contents of sampleData stored in the tallyFile
sampleDimMap: A list mapping dataset names to their respective sample dimensions - default provides values for "Counts", "Coverages", "Deletions" and "Reference"
verbose: Boolean flag that controls the amount of messages being printed by h5dapply
BPPARAM: BPPARAM object to be passed to the bplapply call used to apply FUN to the blocks - see BiocParallel documentation for details; if this is NULL a normal lapply will be used instead of bplapply.

This function applys parameter FUN to blocks along a specified axis within the tally file, group and specified datasets. It creates a list of arrays (one for each dataset) and processes that list with the function FUN.

This is by far the most essential and powerful function within this package since it allows the user to execute their own analysis functions on the tallies stored within the HDF5 tally file.

The supplied function FUN must have a parameter data or ... (the former is the expected behaviour), which will be supplied to FUN from h5dapply for each block. This structure is a list with one slot for each dataset specified in the names argument to h5dapply containing the array corresponding to the current block in the given dataset. Furthemore the slot h5dapplyInfo is reserved and contains another list with the following content:

Blockstart is an integer specifying the starting position of the current block (in the dimension specified by the dims argument to h5dapply)

Blockend is an integer specifying the end position of the current block (in the dimension specified by the dims argument to h5dapply)

Datasets Contains a data.frame as it is returned by h5ls listing all datasets present in the other slots of data with their group, name, dimensions, number of dimensions (DimCount) and the dimension that is used for splitting into blocks (PosDim)

Group contains the name of the group as specified by the group argument to h5dapply

Value

A list with one entry per block, which is the result of applying FUN to the datasets specified in the parameter names within the block.

Author(s)

Paul Pyl

Examples

  # loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  # check the available samples and sampleData
  print(sampleData)
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Subsetting by Sample
  sampleData <- sampleData[sampleData$Patient == "Patient5",]
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    samples = sampleData$Sample,
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Using GRanges and IRanges
  library(GenomicRanges)
  library(IRanges)
  granges <- GRanges(
  c(rep("16", 10), rep("22", 10)),
  ranges = IRanges(
    start = c(seq(29000000,29009000, 1000), seq(39000000,39009000, 1000)),
    width = 1000
  ))
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = granges,
    verbose = TRUE
    )
  lapply( data, function(x) do.call(rbind, x) )
# loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  # check the available samples and sampleData
  print(sampleData)
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Subsetting by Sample
  sampleData <- sampleData[sampleData$Patient == "Patient5",]
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = c(29000000,29010000),
    samples = sampleData$Sample,
    verbose = TRUE
    )
  coverages <- do.call( rbind, data )
  colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
  coverages
  #Using GRanges and IRanges
  library(GenomicRanges)
  library(IRanges)
  granges <- GRanges(
  c(rep("16", 10), rep("22", 10)),
  ranges = IRanges(
    start = c(seq(29000000,29009000, 1000), seq(39000000,39009000, 1000)),
    width = 1000
  ))
  data <- h5dapply( #extracting coverage using h5dapply
    filename = tallyFile,
    group = "/ExampleStudy",
    blocksize = 1000,
    FUN = function(x) rowSums(x$Coverages),
    names = c( "Coverages" ),
    range = granges,
    verbose = TRUE
    )
  lapply( data, function(x) do.call(rbind, x) )

h5readBlock

Description

A simple access function for extracting a single block of data from a tally file, use h5dapply for applying functions on multiple blocks / extracting multiple blocks form a tally file.

Usage

h5readBlock( filename, group, names, dims, range, samples = NULL, sampleDimMap = .sampleDimMap, verbose = FALSE )
h5readBlock( filename, group, names, dims, range, samples = NULL, sampleDimMap = .sampleDimMap, verbose = FALSE )

Arguments

`filename`	The name of a tally file to process
`group`	The name of a group in that tally file
`names`	The names of the datasets to extract, e.g. `c("Counts","Coverages")` - optional (defaults to all datasets)
`dims`	The dimension in which the block shall be extracted for each dataset in the same order as `names`, these should correspond to compatible dimensions between the datsets. - optional (defaults to the genomic position dimension)
`range`	The range along the specified dimensions which should be extracted
`samples`	Character vector of sample names - must match contents of sampleData stored in the `tallyFile`
`sampleDimMap`	A list mapping dataset names to their respective sample dimensions - default provides values for "Counts", "Coverages", "Deletions" and "Reference"
`verbose`	Boolean flag that controls the amount of messages being printed by `h5dapply`

Details

This function extracts a block along the dimensions specified in dims (default: genomic position) from the datasets specified in names and returns it. The block is defined by the parameter range.

The function returns a list with one slot for each dataset specified in the names argument to containing the array corresponding to the specified block in the given dataset. Furthemore the slot h5dapplyInfo is reserved and contains another list with the following content:

Blockstart is an integer specifying the starting position of the current block (in the dimension specified by the dims argument to h5dapply)

Blockend is an integer specifying the end position of the current block (in the dimension specified by the dims argument to h5dapply)

Group contains the name of the group as specified by the group argument to h5dapply

Value

A list with one entry per dataset and an additional slot h5dapplyInfo containing auxiliary information.

Author(s)

Paul Pyl

Examples

library(h5vc) # loading the library
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
data <- h5readBlock( #extracting coverage, deletions and reference using h5dreadBlock
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Deletions", "Reference" ),
  range = c(29000000,29010000),
  verbose = TRUE
)
str(data)
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
#Subsetting by Sample
sampleData <- sampleData[sampleData$Patient == "Patient8",]
data <- h5readBlock( #extracting coverage, deletions and reference using h5dreadBlock
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Deletions", "Reference" ),
  range = c(29000000,29010000),
  samples = sampleData$Sample,
  verbose = TRUE
)
str(data)
library(h5vc) # loading the library
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
data <- h5readBlock( #extracting coverage, deletions and reference using h5dreadBlock
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Deletions", "Reference" ),
  range = c(29000000,29010000),
  verbose = TRUE
)
str(data)
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
#Subsetting by Sample
sampleData <- sampleData[sampleData$Patient == "Patient8",]
data <- h5readBlock( #extracting coverage, deletions and reference using h5dreadBlock
  filename = tallyFile,
  group = "/ExampleStudy/16",
  names = c( "Coverages", "Deletions", "Reference" ),
  range = c(29000000,29010000),
  samples = sampleData$Sample,
  verbose = TRUE
)
str(data)

helper functions

Description

These functions are helpers for dealing with tally data stored in HDF5 files.

Usage

formatGenomicPosition( x, unit = "Mb", divisor = 1000000, digits = 3,
nsmall = 1 )
encodeDNAString( ds )
defineBlocks( start, stop, blocksize )
getChromSize( tallyFile, group, dataset = "Reference", posDim = 1 )
formatGenomicPosition( x, unit = "Mb", divisor = 1000000, digits = 3,
nsmall = 1 )
encodeDNAString( ds )
defineBlocks( start, stop, blocksize )
getChromSize( tallyFile, group, dataset = "Reference", posDim = 1 )

Arguments

`x`	Numerical genomic position
`unit`	Which unit to convert the position to
`divisor`	divisor corresponding to the unit, i.e. 'Mb' -> `1e6`, 'Kb' -> `1e3`
`digits`	number of digits to keep
`nsmall`	nsmall parameter to the format function
`ds`	A DNAString object to be encoded in the HDF5 tally file specific encoding of nucleotides.
`start`	first position
`stop`	last position
`blocksize`	size of blocks
`tallyFile`	Tally file to work on
`group`	Group within `tallyFile` that we want to find the chromosome size for
`dataset`	Datset to extract chromosome size from - default is "Reference"
`posDim`	Which dimension of the dataset describes the genomic position

Details

formatGenomicPosition: Helps formatting genomic positions for annotating axes in mismatch plots etc.

encodeDNAString: This translates a DNAString object into a comaptible encoding that can be written to a HDF5 based tally file in the Reference dataset. Since the Python script for generating tallies only sets the Reference dataset in positions where mismatches exists updating the Reference dataset becomes necessary if one would like to perform analysis involving sequence context (GC-bias, mutationSpectrum, etc.)

defineBlocks: This function returns a data.frame with the columns Start and End for blocks of size blocksize spanning the interval [start, stop].

getChromSize: This function is a helper to quickly look-up the chromosome size of a given group and tally file.

Value

formatGenomicPosition: formatted genomic position, e.g. "123.4 Mb"

encodeDNAString: A numeric vector encoding the nucleotide sequence provided in ds according to the scheme c("A"=0,"C"=1,"G"=2,"T"=3).

defineBlocks: A data.frame with the columns Start and End for blocks of size blocksize spanning the interval [start, stop].

getChromSize: Returns a numeric that is the size of the chromosome.

Author(s)

Paul Pyl

Examples

  formatGenomicPosition(123456789)
  library(Biostrings)
  lapply( DNAStringSet( c("simple"="ACGT", "movie"="GATTACA") ), encodeDNAString )
  getChromSize( system.file("extdata", "example.tally.hfs5", package="h5vcData"), "/ExampleStudy/16" )
formatGenomicPosition(123456789)
  library(Biostrings)
  lapply( DNAStringSet( c("simple"="ACGT", "movie"="GATTACA") ), encodeDNAString )
  getChromSize( system.file("extdata", "example.tally.hfs5", package="h5vcData"), "/ExampleStudy/16" )

Merging the prepared results from multiple bam file tallies into one block that can be written to the HDF5 tally file

Description

This function merges a set of tallies that have been processed with prepareForHDF5 into one block of data.

Usage

mergeTallies( tallies )
mergeTallies( tallies )

Arguments

tallies

A list of prepared talies, i.e. a list of lists with slots for the datasets "Counts", "Coverage", "Deletions" and "Reference" in each sub-list

Details

This function merges tallies from a set of bam files / samples, note that the order of samples in the sample column will be the same as the order of samples in the provided list, so ake sure this matches your sampledata.

Value

A list with slots containing the Counts,Coverages,Deletions and Reference datasets for the samples given in tallies. Each of the slots contains an array with the contents of the provided sub-lists merged along the "sample" axis. The Reference slot os filled from the first element of tallies and it is up to the user to make sure that the tallies provided for merging have compatible references.

Author(s)

Paul Pyl

Examples

library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){ tallyBAM(bamf, chrom, startpos, endpos) } )
str(theData)
reference <- getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1", startpos, endpos)
theMergedData <- mergeTallies(lapply(theData, prepareForHDF5, reference))
str(theMergedData)
library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){ tallyBAM(bamf, chrom, startpos, endpos) } )
str(theData)
reference <- getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1", startpos, endpos)
theMergedData <- mergeTallies(lapply(theData, prepareForHDF5, reference))
str(theMergedData)

Merging multiple tally files into one

Description

Function to merge multiple tally files by genomic position (i.e. gluing samples together)

Usage

mergeTallyFiles( inputFiles, destFile, destGroup, blockSize = 1e6, sampleDims = c(), positionDims = c() )
mergeTallyFiles( inputFiles, destFile, destGroup, blockSize = 1e6, sampleDims = c(), positionDims = c() )

Arguments

`inputFiles`	A list mapping input file names to the groups within them from which the data shall be taken (e.g. "example.tally.hfs5" -> "/ExampleStudy/16")
`destFile`	Name of the file that should be created
`destGroup`	Group within `destFile` that will hold the merged data
`blockSize`	Size of the blocks in bases that the merging will be performed in
`sampleDims`	List mapping dataset names to their respective sample dimension, e.g. "Counts" -> 2 - has the standard datasets included by default
`positionDims`	List mapping dataset names to their respective position dimension, e.g. "Counts" -> 4 - has the standard datasets included by default

Details

This function merges tally data from a list of tally files into a new destination file.

Value

[None] – prints progress messages along the way.

Author(s)

Paul Pyl

Examples

## Not run: 
mergeTallyFiles{ # merging a file to itself, i.e. "doubling" it
  list(
    "example.tally.hfs5" = "/ExampleStudy/16",
    "example.tally.hfs5" = "/ExampleStudy/16"
  ),
  "test.merge.hfs5",
  "/MergedStudy/16"}

## End(Not run)
## Not run: 
mergeTallyFiles{ # merging a file to itself, i.e. "doubling" it
  list(
    "example.tally.hfs5" = "/ExampleStudy/16",
    "example.tally.hfs5" = "/ExampleStudy/16"
  ),
  "test.merge.hfs5",
  "/MergedStudy/16"}

## End(Not run)

mismatchPlot

Description

Plotting function that returns a ggplot2 object representing the mismatches and coverages of the specified samples in the specified region.

Usage

mismatchPlot( data, sampledata, samples=sampledata$Sample, windowsize = NULL, position = NULL, range = NULL, plotReference = TRUE, refHeight=8, printReference = TRUE, printRefSize = 2, tickSpacing = c(10,10) )
mismatchPlot( data, sampledata, samples=sampledata$Sample, windowsize = NULL, position = NULL, range = NULL, plotReference = TRUE, refHeight=8, printReference = TRUE, printRefSize = 2, tickSpacing = c(10,10) )

Arguments

`data`	The data to be plotted. Returned by `h5dapply` or `h5readBlock`.
`sampledata`	The sampledata for the cohort represented by `data`. Returned by `getSampleData`
`samples`	A character vector listing the names of samples to be plotted, defaults to all samples as described in `sampledata`
`windowsize`	Size of the window in which to plot on each side. The total interval that is plotted will be [position-windowsize,position+windowsize]
`position`	The position at which the plot shall be centered
`range`	Integer vector of two elements specifying a range of coordinates to be plotted, use either position + windowsize or range; if both are provided range overwrites position and windowsize.
`plotReference`	This boolean flag specifies if a reference track should be plotted, only takes effect if there is a slot named `Reference` in the `data` object passed to the function
`refHeight`	Height of the reference track in coverage units (default of 8 = reference track is as high as 8 reads coverage would be in the plot of a sample.)
`printReference`	Boolean parameter to indicate whether a text representation of the reference should be overlayed to the reference track, can only be true if `plotReference` is true.
`printRefSize`	Size parameter of the `geom_text` layer used to print the reference. This value is unitless and needs to be manually optimised for a given plot.
`tickSpacing`	Integer vector of two elements, specifying the spacing of ticks along the x and y axes respectively.

Details

If position and windowsize are specified this function creates a plot centered on position using the coverage and mismatch counts stored in data, annotating it with sample information provided in the data.frame sampledata and showing all samples listed in sample. If range is specified, the plot will cover the positions from range[1] to range[2]. The difference between specifying range or position plus windowsize lies only in the labelling of the x-axis and the coordinate system used on the x-axis. In the former case the coordinate system is that of genomic coordinates as specified in range, when using the latter the x-axis coordinates go from -windowsize through +windowsize and position 0 is marked with the calue provided in the position parameter. Furthermore when a position and windowsize are provided two black lines marking the center position are drawn (this is usefull for visualising SNVs)

If neither range, nor position and windowsize are specified the function will try to extract the information from the data object. If data is the return value of a call to h5dapply or h5readBlock this will work automagically.

The plot has the genomic position on the x-axis. The y-axis encodes values where positive values are on the forward strand and negative values on the reverse. The coverage is shown in grey, deletions in purple and the mismatches in the colors specified in the legend. Note that for each possible mismatch there is an additional color for low-quality counts (coming from the first and last sequencing cycles), so e.g. C is filled dark red and C_lq light red.

If data is the result of a call to h5dapply representing multiple blocks of data as defined in the range parameter to h5dapply then the plot will contain the mismatchPlots of each of the ranges plotted next to each other.

Value

A ggplot object containing the mismatch plot, this can be used like any other ggplot object, i.e. additional layers and styles my be applied by simply adding them to the plot.

Author(s)

Paul Pyl

Examples

  # loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979628
  windowsize <- 30
  samples <- sampleData$Sample[sampleData$Patient == "Patient8"]
  data <- h5readBlock(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c("Coverages", "Counts", "Deletions", "Reference"),
    range = c(position - windowsize, position + windowsize)
  )
  #Plotting with position and windowsize
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples,
    windowsize = windowsize,
    position = position
  )
  print(p)
  #plotting with range and modified tickSpacing and refHeight
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples,
    range = c(position - windowsize, position + windowsize),
    tickSpacing = c(20, 5),
    refHeight = 5
  )
  print(p)
  #plotting without specfiying range or position
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples
  )
  print(p)
  #Plotting multiple regions (with small overlaps)
  library(IRanges)
  dataList <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c("Coverages", "Counts", "Deletions", "Reference"),
    range = IRanges(start = seq( position - windowsize, position + windowsize, 20), width = 30 )
  )
  p <- mismatchPlot(
    data = dataList,
    sampledata = sampleData,
    samples = samples
  )
  print(p)
# loading library and example data
  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979628
  windowsize <- 30
  samples <- sampleData$Sample[sampleData$Patient == "Patient8"]
  data <- h5readBlock(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c("Coverages", "Counts", "Deletions", "Reference"),
    range = c(position - windowsize, position + windowsize)
  )
  #Plotting with position and windowsize
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples,
    windowsize = windowsize,
    position = position
  )
  print(p)
  #plotting with range and modified tickSpacing and refHeight
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples,
    range = c(position - windowsize, position + windowsize),
    tickSpacing = c(20, 5),
    refHeight = 5
  )
  print(p)
  #plotting without specfiying range or position
  p <- mismatchPlot(
    data = data,
    sampledata = sampleData,
    samples = samples
  )
  print(p)
  #Plotting multiple regions (with small overlaps)
  library(IRanges)
  dataList <- h5dapply(
    filename = tallyFile,
    group = "/ExampleStudy/16",
    names = c("Coverages", "Counts", "Deletions", "Reference"),
    range = IRanges(start = seq( position - windowsize, position + windowsize, 20), width = 30 )
  )
  p <- mismatchPlot(
    data = dataList,
    sampledata = sampleData,
    samples = samples
  )
  print(p)

Mutation spectrum analyses

Description

These functions help in analyses of mutation spectra

Usage

mutationSpectrum( variantCalls, tallyFile, study, context = 1 )
mutationSpectrum( variantCalls, tallyFile, study, context = 1 )

Arguments

`variantCalls`	A `data.frame` object that can be the output of a call to a `callVariantsPaired` or `callDeletionsPaired` function. The following columns are required: - `altAllele` - `refAllele` - `Sample` - `Start` - `End` - `Chrom`
`tallyFile`	filename of a tally file matching the variant calls
`study`	the study id used in the tally file
`context`	An integer specifying the size of the context that should be considered (i.e. the length of the prefix and suffix of the variant call)

Details

This function takes a set of variant calls (SNVs/Deletions) and a tallyFile as well as a context size and tabulates the number of observed mutations stratified by type (refAllele->altAllele) and sequence context (i.e. the prefix and suffix of size context around the variant position in the genome)

bases serves to map character representations to numeric encoding of bases

variantCalls is an example dataset of variant calls created by running callVariantsPaired on the example.tally.hfs5 file.

Value

A table listing the counts of mutations stratified by allele, sequence context and sample.

Author(s)

Paul Pyl

Examples

  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  data( "example.variants", package = "h5vcData" )
  head( mutationSpectrum( variantCalls, tallyFile, "/ExampleStudy" ) )
library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  data( "example.variants", package = "h5vcData" )
  head( mutationSpectrum( variantCalls, tallyFile, "/ExampleStudy" ) )

Plotting a mutation spectrum

Description

This function generates a mutation spectrum plot from a mutation spectrum returned by a call to mutationSPectrum

Usage

plotMutationSpectrum( ms, plotCounts = TRUE )plotMutationSpectrum( ms, plotCounts = TRUE )

Arguments

`ms`	A mutation spectrum as returned by `mutationSpectrum`
`plotCounts`	Boolean flag specifying whether `ms` contains one row per variant (default) or already contains summarized counts per type of mutation

Details

The plot is inspired by the one shown in figure 1b of Signatures of mutational processes in human cancer -- Alexandrov et. al.

Value

A ggplot object containing the mutation spectrum plot

Author(s)

Paul Pyl

Examples

  library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  data( "example.variants", package = "h5vcData" )
  plotMutationSpectrum( mutationSpectrum( variantCalls, tallyFile, "/ExampleStudy" ) )
library(h5vc)
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  data( "example.variants", package = "h5vcData" )
  plotMutationSpectrum( mutationSpectrum( variantCalls, tallyFile, "/ExampleStudy" ) )

Preparing the results of tallyBAM for writing to an HDF5 tally file

Description

This function prepares the resulting array of a call to tallyBAM for writing to an HDF5 tally file.

Usage

prepareForHDF5( counts, reference )
prepareForHDF5( counts, reference )

Arguments

`counts`	An array as produced by a call to `tallyBAM`
`reference`	A DNAString object containing the reference sequence corresponding to the region that is described in the counts array – if this is `NULL` a consensus vote will be used to estimate the reference at any given position, this means you cannot detect variants with AF >= 0.5 anymore

Details

This function performs the neccessary transformation to the array returned by tallyBAM to be compatible with the HDF5 tally file data structure.

Value

A list with slots containing the Counts,Coverages,Deletions and Reference datasets for the given sample.

Author(s)

Paul Pyl

Examples

library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){
  tallyBAM( file = bamf, chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
})
reference <- getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1", startpos, endpos)
theData <- lapply(theData, prepareForHDF5, reference)
str(theData)
library(h5vc)
library(BSgenome.Hsapiens.UCSC.hg19)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){
  tallyBAM( file = bamf, chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
})
reference <- getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1", startpos, endpos)
theData <- lapply(theData, prepareForHDF5, reference)
str(theData)

prepareTallyFile

Description

Functions for preparing an HDF5 file for storing tally data and / or modifying an existing file

Usage

prepareTallyFile( filename, study, chrom, chromlength, nsamples, maxsamples = nsamples, chunkSize = 50000, sampleChunkSize = nsamples, compressionLevel = 9, referenceFillValue = 5 )
resizeCohort( filename, study, chrom, newNumberOfSamples, dimmap = .sampleDimMap, force = FALSE )
prepareTallyFile( filename, study, chrom, chromlength, nsamples, maxsamples = nsamples, chunkSize = 50000, sampleChunkSize = nsamples, compressionLevel = 9, referenceFillValue = 5 )
resizeCohort( filename, study, chrom, newNumberOfSamples, dimmap = .sampleDimMap, force = FALSE )

Arguments

`filename`	Filename of the HDF5 file that should store the tallies
`study`	Study identifier which will be used in structuring the file
`chrom`	Chromosome for which the structure should be generated
`chromlength`	The length of the chromosom, this will be the size of genomic position dimension
`nsamples`	Number of samples that will be stored in the file
`maxsamples`	Maximum Number of samples that can be stored in the file, this relatesto the maxdim property of HDF5 datasets, which is used to specify possible re-sizing of datasets after creation - see `http:://www.hdfgroup.org` for details
`chunkSize`	The size of the chunks used in HDF5 storage, this is specified along the genomic position dimension, by default chunks will always be all data from all samples with the given width along the genomic position dimension
`compressionLevel`	Compression level to use in the HDF5 file, defaults to `9` (highest), use lower numbers to improve access time at the cost of disk space usage
`sampleChunkSize`	Size of the HDF5 chunks along the sample dimension, the dafault value is the whole dataset, i.e. all samples. For larger datasets where the typical use-case is to extract only data corresponding to a specific sample and genomic position, smaller values of `sampleChunkSize` should be used.
`referenceFillValue`	Default value to be used for the Reference dataset, this is set to `5` by default, which corresponds to the nucleotide `N`
`newNumberOfSamples`	New cohort size, this must be smaller than the value of `maxsamples` that was provided when the file was created
`dimmap`	A list mapping dataset names to the dimension in which the samples are stored (e.g. "Counts" -> 2)
`force`	Boolean parameter that controls whether a shrinking operation (i.e. newNumberOfSamples is smaller than the current number of samples) should be performed or throw an error. Shrinking will result in data loss.

Details

prepareTallyFile prepares (and creates if neccessary) an HDF5 file for storing the datasets that are associated with a tally. It creates the required groups and datasets (filled with 0's). resizeCohortResizes the datasets to a new number of samples, this is limited by the value of maxsamples that was provided in the initial call to prepareTallyFile

Value

Returns TRUE on success

Author(s)

Paul Pyl

Examples

prepareTallyFile( file.path( tempdir(), "test.tally.hfs5" ), "SomeStudy", "ChromosomeB", 1e6, 20 )
prepareTallyFile( file.path( tempdir(), "test.tally.hfs5" ), "SomeStudy", "ChromosomeB", 1e6, 20 )

tallyBAM

Description

Function for creating tallies from bam files.

Usage

tallyBAM(file, chr, start, stop, q=25, ncycles = 0, max.depth=1000000, verbose=FALSE, reference = NULL)
tallyBAM(file, chr, start, stop, q=25, ncycles = 0, max.depth=1000000, verbose=FALSE, reference = NULL)

Arguments

`file`	filename of the BAM file that should be tallies
`chr`	Chromosome in which to tally
`start`	First position of the tally
`stop`	Last position of the tally
`q`	quality cut-off for considering a base call
`ncycles`	number of sequencing cycles form the front and back of the read that should be considered unreliable
`max.depth`	only tally a position if there are less than this many reads overlapping it - can prevent long runtimes in unreliable regions
`verbose`	should additional information be printed
`reference`	`DNAString` object holding the reference sequence of the region being tallies, if this is `NULL` (the default) the raw tally is returned, otherwise `prepareForHDF5` is called with the raw tally and the reference and the prepared tally is returned instead

Details

This function tallies nucleotides and deletion counts in the specified region of a given BAM file. The results can be processed with the prepareForHDF5 function.

This function was adapted from the bam2R function provided by the deepSNV package.

Value

An array object with dimensions [stop - start + 1, 18, 2] which represent positions times nucleotides (4 bases + deletions + insertions times three for early, middle and late sequencing cycles) times strands.

Author(s)

Paul Pyl

Examples

library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){
  tallyBAM( file = bamf, chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
})
str(theData)
print(theData[[1]][,,,9491]) #position 9491 of the pileup
library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom = "1"
startpos <- 115247090
endpos <- 115259515
theData <- lapply( bamFiles, function(bamf){
  tallyBAM( file = bamf, chr = chrom, start = startpos, stop = endpos, ncycles = 10 )
})
str(theData)
print(theData[[1]][,,,9491]) #position 9491 of the pileup

Tallying function with a `GRanges` interface.

Description

Functions for tallying bam files in genomic intervals provided as GRanges objects, special version of the function for direct writing or computation on a cluster exist.

Usage

tallyRanges(bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth = 1e+06)
tallyRangesToFile(tallyFile, study, bamfiles, ranges, reference, samples = NULL, q = 25, ncycles = 0, max.depth=1e6)
tallyRangesBatch(tallyFile, study, bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth=1e6, regID = "Tally", res = list("ncpus" = 2, "memory" = 24000, "queue"="research-rh6"), written = c(), wrfile = "written.jobs.RDa", waitTime = Inf)
tallyRanges(bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth = 1e+06)
tallyRangesToFile(tallyFile, study, bamfiles, ranges, reference, samples = NULL, q = 25, ncycles = 0, max.depth=1e6)
tallyRangesBatch(tallyFile, study, bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth=1e6, regID = "Tally", res = list("ncpus" = 2, "memory" = 24000, "queue"="research-rh6"), written = c(), wrfile = "written.jobs.RDa", waitTime = Inf)

Arguments

`bamfiles`	Character vector giving the locations of the bam files to be tallied
`ranges`	A GRanges object describing the ranges that tallies shalle be generated in, e.g. the result of a call to `binGenome` or a set of exon or gene annotations provided by a `TxDB` object.
`reference`	`BSgenome` object describing the reference genome that the alignments were made against.
`samples`	The indices (within the HDF5 datasets) corresponding to the samples that the data represents. You can use this option to write sub-sets of samples from a cohort.
`q`	Read alignment quality cut-off.
`ncycles`	Number of cycles from the front and back of the reads that should be considered unreliable for mismatch detection
`max.depth`	Maximum depth of coverage to consider
`tallyFile`	Filename of the HDF5 tally file that the data shall be written to
`study`	The location within the HDF5 file that corresponds to the HDF5-group representing the study we are working on.
`regID`	Identifier for a `BatchJobs` registry which will be used to store and organise the cluster jobs used for parallelisation of the work.
`res`	Resource list specifying the compute resources to be requested for each of the cluster jobs.
`written`	Numerical vector indicating the Job IDs of jobs whose results have already been written to the tally file, this can be used to resume writing after a crash.
`wrfile`	Filename for a file to store the IDs of already written jobs in, can be used to resume writing after a crash.
`waitTime`	How long shall the function wait on cluster jobst to finish, before giving up. Default is wait forever.

Details

tallyRanges returns the tallies corresponding to the specifed ranges, tallyToFile performs the same task but writes the results to the tally file directly. tallyRangesBatch uses the BatchJobs package to set up cluster jobs for tallying and collects and writes the results of those jobs to the tally file. It is important to have a properly configured cluster (inlcuding a .BatchJobs.R as well as a template file). See the documentation of BatchJobs for that information.

Value

For tallyRanges the return value is a list of lists, where the top level corresponds to the ranges provided as an input to the function and each element is a list of the datasets in compatible format, that can directly be written to an HDF5 file using the writeToTallyFile function. The other two function perform the writing directly and return

Author(s)

Paul Theodor Pyl

Examples

suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
suppressPackageStartupMessages(require(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(require(GenomicRanges))
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
require(BiocParallel)
register(MulticoreParam())
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
str(theData)
suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
suppressPackageStartupMessages(require(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(require(GenomicRanges))
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
require(BiocParallel)
register(MulticoreParam())
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
str(theData)

Filling the Reference dataset in a tally file from a DNAString

Description

Function to fill the Reference dataset of a tally file from a DNAString object

Usage

writeReference( tallyFile, group, dnastring, blocksize = 1000000, verbose = TRUE )
writeReference( tallyFile, group, dnastring, blocksize = 1000000, verbose = TRUE )

Arguments

`tallyFile`	filename of a tally file matching the variant calls
`group`	The group that the `Reference` dataset is located in
`dnastring`	A `DNAString` object containing the new reference sequence
`blocksize`	The size of blocks in which to process the reference (higher values imply higher memory consumption)
`verbose`	Boolean flag to specify if diagnostic messages should be printed

Details

This function takes a tally file, a location within it (the group argument) and a reference sequence as a DNAString object, encodes the reference in the appropriate way and writes it to the location in the tally file in blocks of size specified in blocksize. The reference will be written to a dataset with the path paste(group, "Reference", sep = "/") within the tally file. The dataset itself must exists and have the correct dimensions to hold the sequence specified in dnastring.

Value

Returns TRUE on success.

Author(s)

Paul Pyl

Examples

library(h5vc)
library(rhdf5)
library(Biostrings)
filename = file.path(tempdir(), "write.ref.test.hfs5")
prepareTallyFile(filename=filename,study="SomeStudy",chrom="Foo",chromlength=8,nsamples=1)
writeReference(filename, group = "/SomeStudy/Foo", dnastring = DNAString("GATTACCA"))
h5dump(filename)$SomeStudy$Foo$Reference
library(h5vc)
library(rhdf5)
library(Biostrings)
filename = file.path(tempdir(), "write.ref.test.hfs5")
prepareTallyFile(filename=filename,study="SomeStudy",chrom="Foo",chromlength=8,nsamples=1)
writeReference(filename, group = "/SomeStudy/Foo", dnastring = DNAString("GATTACCA"))
h5dump(filename)$SomeStudy$Foo$Reference

Writing data to an HDF5 tally file

Description

This function is used to write the results of a call to tallyRanges to an HDF5 tally file.

Usage

writeToTallyFile( theData, file, study, ranges, samples = NULL )
writeToTallyFile( theData, file, study, ranges, samples = NULL )

Arguments

`theData`	A `list` of `list`s of datasets as returned by a call to e.g. `tallyRanges`
`file`	The target filename
`study`	The location of the Group (within the HDF5 file) representing the study the data belongs to.
`ranges`	A `GRanges` object defining the ranges that the elements of `theData` correspond to
`samples`	The indexes of the samples that the data corresponds to, this can be extracted from the 'Column'-field in the sample metadata and is used to write data corresponding to subsets of the cohort samples. The default (`NULL`) indicates that all samples are present and will be written.

Author(s)

Paul Theodor Pyl

Examples

suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
suppressPackageStartupMessages(require(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(require(GenomicRanges))
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
require(BiocParallel)
register(MulticoreParam())
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
chrom <- "2"
chromlength <- 250e6
study <- "/DNMT3A"
tallyFile <- file.path( tempdir(), "DNMT3A.tally.hfs5" )
if( file.exists(tallyFile) ){
  file.remove(tallyFile)
}
if( prepareTallyFile( tallyFile, study, chrom, chromlength, nsamples = length(files) ) ){
  h5ls(tallyFile)
}else{
  message( paste( "Preparation of:", tallyFile, "failed" ) )
}
writeToTallyFile(theData, tallyFile, study = "/DNMT3A", ranges = dnmt3a[1:3])
suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
suppressPackageStartupMessages(require(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(require(GenomicRanges))
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
require(BiocParallel)
register(MulticoreParam())
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
chrom <- "2"
chromlength <- 250e6
study <- "/DNMT3A"
tallyFile <- file.path( tempdir(), "DNMT3A.tally.hfs5" )
if( file.exists(tallyFile) ){
  file.remove(tallyFile)
}
if( prepareTallyFile( tallyFile, study, chrom, chromlength, nsamples = length(files) ) ){
  h5ls(tallyFile)
}else{
  message( paste( "Preparation of:", tallyFile, "failed" ) )
}
writeToTallyFile(theData, tallyFile, study = "/DNMT3A", ranges = dnmt3a[1:3])

Package 'h5vc'

Help Index

Managing alignment tallies using a hdf5 backend

Description

Details

Author(s)

Preparing the results of tallyBAM for writing to an HDF5 tally file

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Tallying bam files in parallel using BatchJobs on high performance compute clusters (HPC)

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Function for binning a genome.

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Estimate allelic frequency distributions in bins along the genome

Description

Usage

Arguments

Value

Author(s)

Examples

Variant calling

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Paired variant calling using fisher tests

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Single sample variant calling

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Coverage analysis

Description

Usage

Arguments

Details

Value

Author(s)

Examples

geom_h5vc

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Reading and writing sample data from / to a tally file

Description

Tallying bam files in parallel using `BatchJobs` on high performance compute clusters (HPC)