Title: | A NGS analysis pipeline. |
---|---|
Description: | Libraries to perform NGS analysis. |
Authors: | Gregoire Pau, Jens Reeder |
Maintainer: | Jens Reeder <[email protected]> |
License: | Artistic-2.0 |
Version: | 4.37.0 |
Built: | 2024-11-03 02:50:54 UTC |
Source: | https://github.com/bioc/HTSeqGenie |
Calculate and process Variants
analyzeVariants()
analyzeVariants()
Nothing
Jens Reeder
Annotate variants via vep
annotateVariants(vcf.file)
annotateVariants(vcf.file)
vcf.file |
A character vector pointing to a VCF (or gzipped VCF) file |
Path to a vcf file with variant annotations
Jens Reeder
Build genomic features from a TxDb object
buildGenomicFeaturesFromTxDb(txdb)
buildGenomicFeaturesFromTxDb(txdb)
txdb |
A TxDb object. |
A list named list of GRanges objects containing the biological entities to account for.
Gregoire Pau
## Not run: library("TxDb.Hsapiens.UCSC.hg19.knownGene") txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene genomic_features <- buildGenomicFeaturesFromTxDb(txdb) ## End(Not run)
## Not run: library("TxDb.Hsapiens.UCSC.hg19.knownGene") txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene genomic_features <- buildGenomicFeaturesFromTxDb(txdb) ## End(Not run)
Call variants via GATK using the pipeline framework. Requires a GATK compatible genome with a name matching the alignment genome to be installed in 'path.gatk_genome'
callVariantsGATK(bam.file)
callVariantsGATK(bam.file)
bam.file |
Path to bam.file |
Path to variant file
Jens Reeder
Check for the GATK jar file
checkGATKJar(path = getOption("gatk.path"))
checkGATKJar(path = getOption("gatk.path"))
path |
Path to the GATK jar file |
TRUE if tool can be called, FALSE otherwise
Returns a named vector indicating if a read ID has rRNA contamination or not
detectRRNA(lreads, remove_tmp_dir = TRUE, save_dir = NULL)
detectRRNA(lreads, remove_tmp_dir = TRUE, save_dir = NULL)
lreads |
A list of ShortReadQ objects |
remove_tmp_dir |
boolean indicating whether or not to delete temp directory of gsnap results |
save_dir |
Save directory |
Given a genome and fastq data, each read in the fastq data is aligned against the rRNA sequences for that genome
a named logical vector indicating if a read has rRNA contamination
Cory Barr
Filter variants by regions
excludeVariantsByRegions(variants, mask)
excludeVariantsByRegions(variants, mask)
variants |
Variants as Vranges, GRanges or VCF object |
mask |
region to mask, given as GRanges |
This function can be used to filter variants in a given region, e.g. low complexity and repeat regions
The filtered variants
Jens Reeder
Run a command from the GATK
gatk(gatk.jar.path = getOption("gatk.path"), method, args, maxheap = "4g")
gatk(gatk.jar.path = getOption("gatk.path"), method, args, maxheap = "4g")
gatk.jar.path |
Path to the gatk jar file |
method |
Name of the gatk method, e.g. UnifiedGenotyper |
args |
additional args passed to gatk |
maxheap |
Maximal heap space allocated for java, GATK recommends 4G heap for most of its apps |
Execute the GATK jar file using the method specified as arg. Stops if the command executed fails.
0 for success, stops otherwise
Jens Reeder
Generate DEXSeq-ready exons
generateSingleGeneDERs(txdb)
generateSingleGeneDERs(txdb)
txdb |
A transcript DB object |
generateSingleGeneDERs() generates exons by: 1) disjoining the whole exon set 2) keeping only the exons of coding regions 3) keeping only the exons that belong to unique genes
single gene DERs
Detect reads that look like rRNA
getRRNAIds(file1, file2 = NULL, tmp_dir, rRNADb)
getRRNAIds(file1, file2 = NULL, tmp_dir, rRNADb)
file1 |
FastQ file of forward reads |
file2 |
FastQ of reverse reads in paired-end sequencing, NULL otherwise |
tmp_dir |
temporary directory used for storing the gsnap results |
rRNADb |
Name of the rRNA sequence database. Must exist in the gsnap genome directory |
IDs of reads flagged as rRNA
Load tabular data from the NGS pipeline result directory
getTabDataFromFile(save_dir, object_name)
getTabDataFromFile(save_dir, object_name)
save_dir |
A character string containing an NGS pipeline output directory. |
object_name |
A character string ontaining the regular expression matching a filename in dir_path |
A data frame.
Hashing function for coverage
hashCoverage(cov)
hashCoverage(cov)
cov |
A SimpleRleList object |
A numeric
Gregoire Pau
Hashing function for variants
hashVariants(var)
hashVariants(var)
var |
A GRanges object |
A numeric
Gregoire Pau
Hashing function for vector
hashVector(x)
hashVector(x)
x |
A vector |
A numeric
Gregoire Pau
The HTSeqGenie
package is a robust and efficient software
to analyze high-throughput sequencing experiments in a reproducible
manner. It supports the RNA-Seq and Exome-Seq protocols and provides:
quality control reporting (using the ShortRead
package), detection of adapter contamination, read
alignment versus a reference genome (using the gmapR
package), counting reads in genomic
regions (using the GenomicRanges
package), and read-depth coverage computation.
To run the pipeline:
runPipeline
To access the pipeline output data:
getTabDataFromFile
To build the genomic features object:
buildGenomicFeaturesFromTxDb
TP53GenomicFeatures
## Not run: ## build genome and genomic features tp53Genome <- TP53Genome() tp53GenomicFeatures <- TP53GenomicFeatures() ## get the FASTQ files fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie") fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie") ## run the pipeline save_dir <- runPipeline( ## input input_file=fastq1, input_file2=fastq2, paired_ends=TRUE, quality_encoding="illumina1.8", ## output save_dir="test", prepend_str="test", overwrite_save_dir="erase", ## aligner path.gsnap_genomes=path(directory(tp53Genome)), alignReads.genome=genome(tp53Genome), alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1", ## gene model path.genomic_features=dirname(tp53GenomicFeatures), countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures) ) ## End(Not run)
## Not run: ## build genome and genomic features tp53Genome <- TP53Genome() tp53GenomicFeatures <- TP53GenomicFeatures() ## get the FASTQ files fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie") fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie") ## run the pipeline save_dir <- runPipeline( ## input input_file=fastq1, input_file2=fastq2, paired_ends=TRUE, quality_encoding="illumina1.8", ## output save_dir="test", prepend_str="test", overwrite_save_dir="erase", ## aligner path.gsnap_genomes=path(directory(tp53Genome)), alignReads.genome=genome(tp53Genome), alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1", ## gene model path.genomic_features=dirname(tp53GenomicFeatures), countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures) ) ## End(Not run)
Check coverage for sparseness
isSparse(cov, threshold = 0.1)
isSparse(cov, threshold = 0.1)
cov |
A cov object as SimpleRleList |
threshold |
Fraction of number of runs over total length |
Some Rle related operations become very slow when they are dealing with data that violates their sparseness assumption. This method provides an estimate about whether the data is dense or sparse. More precicely it checks if the fraction of the number of runs over the total length is smaller than a threshold
Boolean whether this object is dense or sparse
Jens Reeder
Mark duplicates in bam
markDuplicates(bamfile, outfile = NULL, path = getOption("picard.path"))
markDuplicates(bamfile, outfile = NULL, path = getOption("picard.path"))
bamfile |
Name of input bam file |
outfile |
Name of output bam file |
path |
Full path to MarkDuplicates jar |
Use MarkDuplicates from PicardTools to mark duplicate alignments in bam file.
Path to output bam file
Jens Reeder
Mark duplicates in pipeline context
markDups()
markDups()
High level function call to mark duplicates in the analyzed.bam file of a pipelin run.
Nothing
Jens Reeder
Realign indels in pipeline context
realignIndels()
realignIndels()
High level function call to realign indels in the analyzed.bam file using GATK
Nothing
Jens Reeder
Realing indels using the GATK tools RealignerTargetCreator and IndelRealigner. Requires a GATK compatible genome with a name matching the alignment genome to be installed in 'path.gatk_genome'
realignIndelsGATK(bam.file)
realignIndelsGATK(bam.file)
bam.file |
Path to bam.file |
Since GATKs IndelRealigner is not parallelized, we run it in parallel per chromosome.
Path to realigned bam file
Jens Reeder
Run the NGS analysis pipeline
runPipeline(...)
runPipeline(...)
... |
A list of parameters. See the vignette for details. |
This function starts the pipeline. It first preprocesses the input FASTQ reads, align them, count the read overlaps with genomic features and compute the coverage. See the vignette for details.
The path to the NGS output directory.
Jens Reeder, Gregoire Pau
TP53Genome, TP53GenomicFeatures
## Not run: ## build genome and genomic features tp53Genome <- TP53Genome() tp53GenomicFeatures <- TP53GenomicFeatures() ## get the FASTQ files fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie") fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie") ## run the pipeline save_dir <- runPipeline( ## input input_file=fastq1, input_file2=fastq2, paired_ends=TRUE, quality_encoding="illumina1.8", ## output save_dir="test", prepend_str="test", overwrite_save_dir="erase", ## aligner path.gsnap_genomes=path(directory(tp53Genome)), alignReads.genome=genome(tp53Genome), alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1", ## gene model path.genomic_features=dirname(tp53GenomicFeatures), countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures) ) ## End(Not run)
## Not run: ## build genome and genomic features tp53Genome <- TP53Genome() tp53GenomicFeatures <- TP53GenomicFeatures() ## get the FASTQ files fastq1 <- system.file("extdata/H1993_TP53_subset2500_1.fastq.gz", package="HTSeqGenie") fastq2 <- system.file("extdata/H1993_TP53_subset2500_2.fastq.gz", package="HTSeqGenie") ## run the pipeline save_dir <- runPipeline( ## input input_file=fastq1, input_file2=fastq2, paired_ends=TRUE, quality_encoding="illumina1.8", ## output save_dir="test", prepend_str="test", overwrite_save_dir="erase", ## aligner path.gsnap_genomes=path(directory(tp53Genome)), alignReads.genome=genome(tp53Genome), alignReads.additional_parameters="--indel-penalty=1 --novelsplicing=1 --distant-splice-penalty=1", ## gene model path.genomic_features=dirname(tp53GenomicFeatures), countGenomicFeatures.gfeatures=basename(tp53GenomicFeatures) ) ## End(Not run)
Run the NGS analysis pipeline from a configuration file
runPipelineConfig(config_filename, config_update)
runPipelineConfig(config_filename, config_update)
config_filename |
Path to a pipeline configuration file |
config_update |
A list of name value pairs that will update the config parameters |
This is the launcher function for all pipeline runs. It will do some preprocessing steps, then aligns the reads, counts overlap with genomic Features such as genes, exons etc and applies a variant caller.
Nothing
Jens Reeder, Gregoire Pau
setup test framework
setupTestFramework(config.filename, config.update = list(), testname = "test", package = "HTSeqGenie", use.TP53Genome = TRUE)
setupTestFramework(config.filename, config.update = list(), testname = "test", package = "HTSeqGenie", use.TP53Genome = TRUE)
config.filename |
configuration file |
config.update |
update list of config values |
testname |
name of test case |
package |
name of package |
use.TP53Genome |
Boolean indicating the use of the TP53 genome as template config |
the created temp directory
Build the genomic features of the TP53 demo region
TP53GenomicFeatures()
TP53GenomicFeatures()
Returns a list of genomic features (gene, exons, transcripts) annotating
a region of UCSC hg19 sequence centered on the region of the TP53 gene, with 1 Mb flanking
sequence on each side. This is intended as a test/demonstration to run the NGS pipeline
in conjunction with the LungCancerLines
data package.
A list of GRanges objects containing the genomic features
Gregoire Pau
TP53Genome, buildGenomicFeaturesFromTxDb, runPipeline
Compute stats on a VCF file
vcfStat(vcf.filename)
vcfStat(vcf.filename)
vcf.filename |
A character pointing to a VCF (or gzipped VCF) file |
A numeric vector
Gregoire Pau
Call Variants in the pipeline framework
wrap.callVariants(bam.file)
wrap.callVariants(bam.file)
bam.file |
Aligned reads as bam file |
A wrapper around VariantTools callVariant framework.
Variants as Vranges
Jens Reeder
Write variants to VCF file
writeVCF(variants.vranges, filename)
writeVCF(variants.vranges, filename)
variants.vranges |
Genomic Variants as VRanges object |
filename |
Name of vcf file to write |
VCF file name
Jens Reeder