Package 'HTSeqGenie' reference manual

Title:	An NGS analysis pipeline.
Description:	Libraries to perform NGS analysis.
Authors:	Gregoire Pau, Jens Reeder
Maintainer:	Jens Reeder <[email protected]>
License:	Artistic-2.0
Version:	4.37.1
Built:	2024-12-22 02:49:25 UTC
Source:	https://github.com/bioc/HTSeqGenie

Calculate and process Variants

Description

Calculate and process Variants

Usage

analyzeVariants()

Value

Nothing

Author(s)

Jens Reeder

Annotate variants via vep

Description

Annotate variants via vep

Usage

annotateVariants(vcf.file)

Arguments

vcf.file

A character vector pointing to a VCF (or gzipped VCF) file

Value

Path to a vcf file with variant annotations

Author(s)

Jens Reeder

Build genomic features from a TxDb object

Description

Build genomic features from a TxDb object

Usage

buildGenomicFeaturesFromTxDb(txdb)

Arguments

txdb

A TxDb object.

Value

A named list of GRanges objects containing the biological entities to account for.

Author(s)

Gregoire Pau

Examples

## Not run: 
  library("TxDb.Hsapiens.UCSC.hg19.knownGene")
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
  genomic_features <- buildGenomicFeaturesFromTxDb(txdb)

## End(Not run)

Variant calling via GATK

Description

Call variants via GATK using the pipeline framework. Requires a GATK compatible genome with a name matching the alignment genome to be installed in 'path.gatk_genome'

Usage

callVariantsGATK(bam.file)

Arguments

bam.file

Path to the bam file

Value

Path to variant file

Author(s)

Jens Reeder

Check for the GATK jar file

Description

Check for the GATK jar file

Usage

checkGATKJar(path = getOption("gatk.path"))

Arguments

path

Path to the GATK jar file

Value

TRUE if tool can be called, FALSE otherwise

Detect rRNA Contamination in Reads

Description

Returns a named vector indicating if a read ID has rRNA contamination or not

Usage

detectRRNA(lreads, remove_tmp_dir = TRUE, save_dir = NULL)

Arguments

`lreads`	A list of ShortReadQ objects
`remove_tmp_dir`	boolean indicating whether or not to delete temp directory of gsnap results
`save_dir`	Save directory

Details

Given a genome and fastq data, each read in the fastq data is aligned against the rRNA sequences for that genome

Value

a named logical vector indicating if a read has rRNA contamination
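
The shape of this return value can be illustrated in base R. This is only a sketch: the read IDs are made up, and it assumes the set of flagged IDs is already known (for instance from getRRNAIds). It does not run gsnap and is not the package's implementation.

```r
## Hypothetical read IDs; real IDs would come from the ShortReadQ objects in 'lreads'
all_ids <- c("read1", "read2", "read3", "read4")

## Hypothetical IDs flagged as rRNA (e.g. as returned by getRRNAIds)
rrna_ids <- c("read2", "read4")

## Named logical vector: TRUE where the read is flagged as rRNA
is_contaminated <- setNames(all_ids %in% rrna_ids, all_ids)

## IDs of the reads to keep for downstream analysis
clean_ids <- names(is_contaminated)[!is_contaminated]
```

A vector of this form can be used directly as a logical index to drop contaminated reads.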

Author(s)

Cory Barr

Filter variants by regions

Description

Filter variants by regions

Usage

excludeVariantsByRegions(variants, mask)

Arguments

`variants`	Variants as VRanges, GRanges or VCF object
`mask`	region to mask, given as GRanges

Details

This function can be used to filter variants in a given region, e.g. low complexity and repeat regions
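
The idea of masking can be sketched in base R on a toy data frame. The coordinates and column names below are invented for illustration; the real function operates on VRanges, GRanges or VCF objects rather than data frames.

```r
## Toy variants and mask regions on a single chromosome (made-up coordinates)
variants <- data.frame(pos = c(5, 12, 27, 40),
                       alt = c("A", "T", "G", "C"),
                       stringsAsFactors = FALSE)
mask <- data.frame(start = c(10, 35), end = c(15, 45))

## TRUE for each variant whose position falls inside any masked interval
in_mask <- vapply(variants$pos,
                  function(p) any(p >= mask$start & p <= mask$end),
                  logical(1))

## Keep only the variants outside the mask
filtered <- variants[!in_mask, ]
```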

Value

The filtered variants

Author(s)

Jens Reeder

gatk

Description

Run a command from the GATK

Usage

gatk(gatk.jar.path = getOption("gatk.path"), method, args, maxheap = "4g")

Arguments

`gatk.jar.path`	Path to the gatk jar file
`method`	Name of the gatk method, e.g. UnifiedGenotyper
`args`	additional args passed to gatk
`maxheap`	Maximal heap space allocated for java, GATK recommends 4G heap for most of its apps

Details

Execute the GATK jar file using the method specified as arg. Stops if the command executed fails.
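
A minimal sketch of how such a call could be assembled and checked. The helper names buildGatkCommand and runOrStop are hypothetical, the jar path and arguments are placeholders, and the "java -Xmx... -jar ... -T method" layout is an assumption based on the GATK 3 command-line convention rather than on the package source.

```r
## Hypothetical helper: assemble a GATK command line as a single string
buildGatkCommand <- function(gatk.jar.path, method, args, maxheap = "4g") {
  paste("java", paste0("-Xmx", maxheap), "-jar", gatk.jar.path,
        "-T", method, args)
}

## Hypothetical helper: run a command and stop on a non-zero exit status
runOrStop <- function(cmd) {
  status <- system(cmd)
  if (status != 0) stop("command failed: ", cmd)
  status
}

## Placeholder jar path and arguments; nothing is executed here
cmd <- buildGatkCommand("/path/to/GenomeAnalysisTK.jar",
                        "UnifiedGenotyper",
                        "-I in.bam -o out.vcf")
```

Keeping command construction separate from execution makes the command string easy to inspect or log before it is run.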

Value

0 for success, stops otherwise

Author(s)

Jens Reeder

generateSingleGeneDERs

Description

Generate DEXSeq-ready exons

Usage

generateSingleGeneDERs(txdb)

Arguments

txdb

A transcript DB object

Details

generateSingleGeneDERs() generates exons by: 1) disjoining the whole exon set 2) keeping only the exons of coding regions 3) keeping only the exons that belong to unique genes
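
Steps 1 and 3 can be sketched in base R on a toy exon table. This is an illustration under simplifying assumptions, not the package's code: coordinates and gene labels are made up, a single chromosome and strand are assumed, and step 2 (restricting to coding regions) is omitted because it needs CDS annotation.

```r
## Toy exons on one chromosome; gene labels are made up
exons <- data.frame(start = c(1, 5, 20, 25),
                    end   = c(10, 15, 30, 40),
                    gene  = c("A", "A", "B", "C"),
                    stringsAsFactors = FALSE)

## Step 1: disjoin, i.e. cut the exon set at every exon boundary
bounds <- sort(unique(c(exons$start, exons$end + 1)))
pieces <- data.frame(start = head(bounds, -1), end = tail(bounds, -1) - 1)

## Genes overlapping each disjoint piece
genes_of <- lapply(seq_len(nrow(pieces)), function(i) {
  unique(exons$gene[exons$start <= pieces$end[i] & exons$end >= pieces$start[i]])
})

## Step 3: keep only the pieces that belong to exactly one gene
keep <- lengths(genes_of) == 1
ders <- cbind(pieces[keep, ], gene = unlist(genes_of[keep]),
              stringsAsFactors = FALSE)
```

In this toy input, the piece shared by genes B and C and the gap between exons are both dropped, since neither maps to exactly one gene.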

Value

single gene DERs

Detect reads that look like rRNA

Description

Detect reads that look like rRNA

Usage

getRRNAIds(file1, file2 = NULL, tmp_dir, rRNADb)

Arguments

`file1`	FastQ file of forward reads
`file2`	FastQ of reverse reads in paired-end sequencing, NULL otherwise
`tmp_dir`	temporary directory used for storing the gsnap results
`rRNADb`	Name of the rRNA sequence database. Must exist in the gsnap genome directory

Value

IDs of reads flagged as rRNA

Load tabular data from the NGS pipeline result directory

Description

Load tabular data from the NGS pipeline result directory

Usage

getTabDataFromFile(save_dir, object_name)

Arguments

`save_dir`	A character string containing an NGS pipeline output directory.
`object_name`	A character string containing the regular expression matching a filename in save_dir

Value

A data frame.
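
The lookup-by-regular-expression idea can be sketched with base R. The helper loadTabByPattern, the file name and the table contents are hypothetical; the actual layout of a pipeline output directory and the package's own loading logic may differ.

```r
## Create a stand-in output directory holding one tab-delimited file
save_dir <- file.path(tempdir(), "demo_pipeline_output")
dir.create(save_dir, showWarnings = FALSE)
write.table(data.frame(name = c("total_reads", "aligned_reads"),
                       value = c(2500, 2400)),
            file = file.path(save_dir, "test.summary_demo.tab"),
            sep = "\t", quote = FALSE, row.names = FALSE)

## Hypothetical loader: find the single file matching a regular expression
## below 'dir' and read it as a tab-delimited table
loadTabByPattern <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, full.names = TRUE, recursive = TRUE)
  if (length(hits) != 1) stop("expected exactly one match, found ", length(hits))
  read.delim(hits, stringsAsFactors = FALSE)
}

tab <- loadTabByPattern(save_dir, "summary_demo")
```

Failing loudly when the pattern matches zero or several files avoids silently reading the wrong table.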

Hashing function for coverage

Description

Hashing function for coverage

Usage

hashCoverage(cov)

Arguments

cov

A SimpleRleList object

Value

A numeric

Author(s)

Gregoire Pau

Hashing function for variants

Description

Hashing function for variants

Usage

hashVariants(var)

Arguments

var

A GRanges object

Value

A numeric

Author(s)

Gregoire Pau

The HTSeqGenie package is robust and efficient software for analyzing high-throughput sequencing experiments in a reproducible manner. It supports the RNA-Seq and Exome-Seq protocols and provides: quality control reporting (using the ShortRead package), detection of adapter contamination, read alignment against a reference genome (using the gmapR package), counting reads in genomic regions (using the GenomicRanges package), and read-depth coverage computation.

Package content

To run the pipeline:

runPipeline

To access the pipeline output data:

getTabDataFromFile

To build the genomic features object:

buildGenomicFeaturesFromTxDb
TP53GenomicFeatures

Examples

  ## Not run: 
    ## build genome and genomic features
    tp53Genome <- TP53Genome()
    tp53GenomicFeatures <- TP53GenomicFeatures()
    
    ## get the FASTQ files
    fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie")
    fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie")
    
    ## run the pipeline
    save_dir <- runPipeline(
        ## input
        input_file=fastq1,
        input_file2=fastq2,
        paired_ends=TRUE,
        quality_encoding="illumina1.8",
        
        ## output
        save_dir="test",
        prepend_str="test",
        overwrite_save_dir="erase",
        
        ## aligner
        path.gsnap_genomes=path(directory(tp53Genome)),
        alignReads.genome=genome(tp53Genome),
        alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1",
        
        ## gene model
        path.genomic_features=dirname(tp53GenomicFeatures),
        countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures)
        )
   
## End(Not run)

isSparse

Description

Check coverage for sparseness

Usage

isSparse(cov, threshold = 0.1)

Arguments

`cov`	A cov object as SimpleRleList
`threshold`	Fraction of number of runs over total length

Details

Some Rle-related operations become very slow when the data violates their sparseness assumption. This method estimates whether the data is dense or sparse. More precisely, it checks whether the number of runs divided by the total length is smaller than a threshold.
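
The run-fraction test can be sketched with base R's rle() on plain numeric vectors. The helper name runFraction and the toy coverage vectors are assumptions for illustration; the package works on SimpleRleList objects, and its exact computation may differ.

```r
## Number of runs divided by total length, computed with base R's rle()
runFraction <- function(x) length(rle(x)$lengths) / length(x)

## Toy coverage vectors
sparse_cov <- c(rep(0, 90), rep(5, 10))  # 2 runs over 100 positions
dense_cov  <- rep(c(1, 2), 50)           # 100 runs over 100 positions

threshold <- 0.1
sparse_is_sparse <- runFraction(sparse_cov) < threshold  # few long runs
dense_is_sparse  <- runFraction(dense_cov) < threshold   # a new run at every position
```

A run-length encoding only saves work when runs are long, which is why a high run fraction signals that Rle-based operations are likely to be slow.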

Value

Logical indicating whether the coverage is sparse (TRUE) or dense (FALSE)

Author(s)

Jens Reeder

markDuplicates

Description

Mark duplicates in bam

Usage

markDuplicates(bamfile, outfile = NULL, path = getOption("picard.path"))

Arguments

`bamfile`	Name of input bam file
`outfile`	Name of output bam file
`path`	Full path to MarkDuplicates jar

Details

Use MarkDuplicates from PicardTools to mark duplicate alignments in bam file.

Value

Path to output bam file

Author(s)

Jens Reeder

markDups

Description

Mark duplicates in pipeline context

Usage

markDups()

Details

High-level function call to mark duplicates in the analyzed.bam file of a pipeline run.

Value

Nothing

Author(s)

Jens Reeder

realignIndels

Description

Realign indels in pipeline context

Usage

realignIndels()

Details

High level function call to realign indels in the analyzed.bam file using GATK

Value

Nothing

Author(s)

Jens Reeder

Realign indels via GATK

Description

Realign indels using the GATK tools RealignerTargetCreator and IndelRealigner. Requires a GATK-compatible genome, with a name matching the alignment genome, to be installed in 'path.gatk_genome'.

Usage

realignIndelsGATK(bam.file)

Arguments

bam.file

Path to the bam file

Details

Since GATK's IndelRealigner is not parallelized, it is run in parallel, with one job per chromosome.
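
The per-chromosome pattern can be sketched with the parallel package that ships with R. The function realignOneChromosome is a hypothetical stand-in that only returns an output file name, and the chromosome names are made up; the package's actual job scheduling and merging steps are not shown.

```r
library(parallel)

## Hypothetical stand-in for realigning the reads of a single chromosome;
## here it only returns the name of a per-chromosome output file
realignOneChromosome <- function(chr) paste0("realigned.", chr, ".bam")

chromosomes <- c("chr1", "chr2", "chr17")

## One task per chromosome; raise mc.cores on platforms that support forking
per_chr_bams <- mclapply(chromosomes, realignOneChromosome, mc.cores = 1L)

## The per-chromosome results would then be merged into a single bam file
per_chr_files <- unlist(per_chr_bams)
```

Splitting by chromosome works here because each chromosome can be processed independently of the others.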

Value

Path to realigned bam file

Author(s)

Jens Reeder

Run the NGS analysis pipeline

Description

Run the NGS analysis pipeline

Usage

runPipeline(...)

Arguments

...

A list of parameters. See the vignette for details.

Details

This function starts the pipeline. It first preprocesses the input FASTQ reads, aligns them, counts the read overlaps with genomic features, and computes the coverage. See the vignette for details.

Value

The path to the NGS output directory.

Author(s)

Jens Reeder, Gregoire Pau

Examples

## Not run: 
## build genome and genomic features
tp53Genome <- TP53Genome()
tp53GenomicFeatures <- TP53GenomicFeatures()
 
 ## get the FASTQ files
fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie")
fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie")

## run the pipeline
save_dir <- runPipeline(
    ## input
    input_file=fastq1,
    input_file2=fastq2,
    paired_ends=TRUE,
    quality_encoding="illumina1.8",
    
    ## output
    save_dir="test",
    prepend_str="test",
    overwrite_save_dir="erase",
    
    ## aligner
    path.gsnap_genomes=path(directory(tp53Genome)),
    alignReads.genome=genome(tp53Genome),
    alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1",
    
    ## gene model
    path.genomic_features=dirname(tp53GenomicFeatures),
    countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures)
    )

## End(Not run)



Run the NGS analysis pipeline

Description

Run the NGS analysis pipeline from a configuration file

Usage

runPipelineConfig(config_filename, config_update)

Arguments

`config_filename`	Path to a pipeline configuration file
`config_update`	A list of name value pairs that will update the config parameters
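
How a list of name-value pairs can override file-based settings is sketched below with base R's modifyList(). The parameter names are borrowed from the runPipeline example in this manual, the values are placeholders, and the package's own merging logic may differ.

```r
## Stand-in for parameters read from a configuration file
config <- list(paired_ends = "TRUE",
               save_dir = "test",
               quality_encoding = "illumina1.8")

## Name-value pairs that override or extend the file's settings
config_update <- list(save_dir = "test_rerun",
                      overwrite_save_dir = "erase")

## Entries in config_update replace same-named entries; new names are appended
merged <- modifyList(config, config_update)
```

This keeps the configuration file as the shared baseline while allowing per-run overrides without editing the file.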

Details

This is the launcher function for all pipeline runs. It performs some preprocessing steps, then aligns the reads, counts overlaps with genomic features such as genes and exons, and applies a variant caller.

Value

Nothing

Author(s)

Jens Reeder, Gregoire Pau

Set up the test framework

Description

Set up the test framework

Usage

setupTestFramework(config.filename, config.update = list(),
  testname = "test", package = "HTSeqGenie", use.TP53Genome = TRUE)

Arguments

`config.filename`	configuration file
`config.update`	update list of config values
`testname`	name of test case
`package`	name of package
`use.TP53Genome`	Boolean indicating the use of the TP53 genome as template config

Value

the created temp directory

Demo genomic features around the TP53 gene

Description

Build the genomic features of the TP53 demo region

Usage

TP53GenomicFeatures()

Details

Returns a list of genomic features (gene, exons, transcripts) annotating a region of UCSC hg19 sequence centered on the region of the TP53 gene, with 1 Mb flanking sequence on each side. This is intended as a test/demonstration to run the NGS pipeline in conjunction with the LungCancerLines data package.

Value

A list of GRanges objects containing the genomic features

Author(s)

Gregoire Pau

Compute stats on a VCF file

Description

Compute stats on a VCF file

Usage

vcfStat(vcf.filename)

Arguments

vcf.filename

A character string pointing to a VCF (or gzipped VCF) file

Value

A numeric vector

Author(s)

Gregoire Pau

Variant calling

Description

Call Variants in the pipeline framework

Usage

wrap.callVariants(bam.file)

Arguments

bam.file

Aligned reads as bam file

Details

A wrapper around the VariantTools callVariants framework.

Value

Variants as VRanges

Author(s)

Jens Reeder

writeVCF

Description

Write variants to VCF file

Usage

writeVCF(variants.vranges, filename)

Arguments

`variants.vranges`	Genomic Variants as VRanges object
`filename`	Name of vcf file to write

Value

VCF file name

Author(s)

Jens Reeder
