Title: Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution via the DER Finder approach
Description: This package provides functions for annotation-agnostic differential expression analysis of RNA-seq data. Two implementations of the DER Finder approach are included in this package: (1) single base-level F-statistics and (2) DER identification at the expressed regions-level. The DER Finder approach can also be used to identify differentially bound ChIP-seq peaks.
Authors: Leonardo Collado-Torres [aut, cre], Alyssa C. Frazee [ctb], Andrew E. Jaffe [aut], Jeffrey T. Leek [aut, ths]
Maintainer: Leonardo Collado-Torres <[email protected]>
License: Artistic-2.0
Version: 1.41.0
Built: 2024-10-30 05:46:35 UTC
Source: https://github.com/bioc/derfinder
This is a major wrapper for running several key functions from this package. It is meant to be used after loadCoverage has been used for a specific chromosome. The steps run include makeModels, preprocessCoverage, calculateStats, calculatePvalues and annotating with annotateTranscripts and matchGenes.
analyzeChr( chr, coverageInfo, models, cutoffPre = 5, cutoffFstat = 1e-08, cutoffType = "theoretical", nPermute = 1, seeds = as.integer(gsub("-", "", Sys.Date())) + seq_len(nPermute), groupInfo, txdb = NULL, writeOutput = TRUE, runAnnotation = TRUE, lowMemDir = file.path(chr, "chunksDir"), smooth = FALSE, weights = NULL, smoothFunction = bumphunter::locfitByCluster, ... )
chr: Used for naming the output files when writeOutput = TRUE.
coverageInfo: A list containing a DataFrame ($coverage) with the coverage data and a logical Rle ($position) with the positions that passed the cutoff, as produced by loadCoverage.
models: The output from makeModels.
cutoffPre: This argument is passed to preprocessCoverage (cutoff).
cutoffFstat: This is used to determine the cutoff argument of calculatePvalues and its behavior is determined by cutoffType.
cutoffType: Determines how cutoffFstat is interpreted: with 'theoretical' (the default) it is treated as a p-value and the F-statistic cutoff is derived from the corresponding F distribution, with 'empirical' it is treated as a quantile of the observed F-statistics, and with 'manual' it is passed directly as the cutoff to calculatePvalues.
nPermute: The number of permutations. Note that for a full chromosome, a small number (10) of permutations is sufficient. If set to 0, no permutations are performed and thus no null regions are used; the candidate regions are still returned.
seeds: An integer vector of length nPermute specifying the seed to use for each permutation.
groupInfo: A factor specifying the group membership of each sample that can later be used with the plotting functions in the derfinderPlot package.
txdb: This argument is passed to annotateTranscripts. If NULL, TxDb.Hsapiens.UCSC.hg19.knownGene is used.
writeOutput: If TRUE, the output of each step is saved to Rdata files.
runAnnotation: If TRUE, the regions are annotated with annotateTranscripts and matchGenes.
lowMemDir: If specified, each chunk is saved into a separate Rdata file under lowMemDir and loaded later when the F-statistics are calculated, reducing the memory load.
smooth: Whether to smooth the F-statistics or not.
weights: Weights used by the smoother as described in smoother.
smoothFunction: A function to be used for smoothing the F-statistics. Two functions are provided by the bumphunter package: locfitByCluster and runmedByCluster.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels, preprocessCoverage, calculateStats, calculatePvalues, annotateTranscripts, matchGenes, and define_cluster.
If you are working with data from an organism different from 'Homo sapiens'
specify so by setting the global 'species' and 'chrsStyle' options. For
example:
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')
If returnOutput = TRUE, a list with six components: the wall-clock timing information for each step, the main options used when running this function, and the outputs from preprocessCoverage, calculateStats, calculatePvalues and matchGenes. These are the same components that are written to Rdata files when writeOutput = TRUE.
Leonardo Collado-Torres
makeModels, preprocessCoverage, calculateStats, calculatePvalues, annotateTranscripts, matchGenes
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage), verbose = TRUE)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull,
    probs = c(0.5), nonzero = TRUE,
    verbose = TRUE
)

## Build the models
groupInfo <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = groupInfo, adjustvars = adjustvars)

## Analyze the chromosome
results <- analyzeChr(
    chr = "21", coverageInfo = genomeData, models = models,
    cutoffFstat = 1, cutoffType = "manual", groupInfo = groupInfo,
    mc.cores = 1, writeOutput = FALSE, returnOutput = TRUE,
    method = "regular", runAnnotation = FALSE
)
names(results)
This function takes the regions found in calculatePvalues and assigns them genomic states constructed with makeGenomicState. The main workhorse functions are countOverlaps and findOverlaps.
annotateRegions(regions, genomicState, annotate = TRUE, ...)
regions: The $regions output from calculatePvalues.
genomicState: A GRanges object created with makeGenomicState. It can be either the fullGenome or the codingGenome element.
annotate: If TRUE, the genomic states overlapping each region are returned (the annotationList element of the output) in addition to the overlap counts.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels, countOverlaps and findOverlaps-methods.
You might want to specify arguments such as minoverlap
to control
how the overlaps are determined. See findOverlaps
for further details.
A list with elements countTable and annotationList (the latter only included when annotate = TRUE).

countTable: a data.frame with the number of overlaps from the regions vs the genomic states, with one type per column. For example, if fullOrCoding = 'full' then the columns are exon, intergenic and intron.

annotationList: a GRangesList with the genomic states that overlapped with the regions. The names of this GRangesList correspond to the region index in regions.
Andrew Jaffe, Leonardo Collado-Torres
makeGenomicState, calculatePvalues
## Annotate regions, first two only
annotatedRegions <- annotateRegions(
    regions = genomeRegions$regions[1:2],
    genomicState = genomicState$fullGenome,
    minoverlap = 1
)
annotatedRegions
First, this function finds the regions of interest according to specified cutoffs. Then it permutes the samples and re-calculates the F-statistics. The areas of the statistics from these segments are then used to calculate p-values for the original regions.
calculatePvalues( coveragePrep, models, fstats, nPermute = 1L, seeds = as.integer(gsub("-", "", Sys.Date())) + seq_len(nPermute), chr, cutoff = quantile(fstats, 0.99, na.rm = TRUE), significantCut = c(0.05, 0.1), lowMemDir = NULL, smooth = FALSE, weights = NULL, smoothFunction = bumphunter::locfitByCluster, ... )
coveragePrep: A list with the pre-processed coverage data produced by preprocessCoverage.
models: A list with the model matrices produced by makeModels.
fstats: A numerical Rle with the F-statistics normally generated using calculateStats.
nPermute: The number of permutations. Note that for a full chromosome, a small number (10) of permutations is sufficient. If set to 0, no permutations are performed and thus no null regions are used; the candidate regions are still returned.
seeds: An integer vector of length nPermute specifying the seed to use for each permutation.
chr: A single element character vector specifying the chromosome name. This argument is passed to findRegions.
cutoff: F-statistic cutoff to use to determine segments.
significantCut: A vector of length two specifying the cutoffs used to determine significance. The first element is used to determine significance for the P-values, while the second element is used for the Q-values (FDR adjusted P-values).
lowMemDir: The directory where the processed chunks are saved when using preprocessCoverage with a specified lowMemDir.
smooth: Whether to smooth the F-statistics or not.
weights: Weights used by the smoother as described in smoother.
smoothFunction: A function to be used for smoothing the F-statistics. Two functions are provided by the bumphunter package: locfitByCluster and runmedByCluster.
...: Arguments passed to other methods and/or advanced arguments. Passed to findRegions and define_cluster.
A list with four components:

regions: a GRanges with the metadata columns given by findRegions plus the additional metadata columns pvalues (p-value of the region calculated via permutations of the samples), qvalues (the q-values calculated using qvalue), significant (whether the p-value is less than 0.05, by default) and significantQval (whether the q-value is less than 0.10, by default). It also includes the mean coverage of the region (mean from the mean coverage at each base calculated in preprocessCoverage). Furthermore, if groupInfo was not NULL in preprocessCoverage, then the group mean coverage is calculated as well as the log2 fold change (using group 1 as the reference).

nullStats: a numeric Rle with the mean of the null statistics by segment.

nullWidths: a numeric Rle with the length of each of the segments in the null distribution. The area can be obtained by multiplying the absolute nullStats by the corresponding lengths.

nullPermutation: an Rle with the permutation number from which each null region originated.
Leonardo Collado-Torres
findRegions, fstats.apply, qvalue
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage), verbose = TRUE)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), verbose = TRUE)

## Build the models
group <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = group, adjustvars = adjustvars)

## Preprocess the data
## Automatic chunksize used to then compare 1 vs 4 cores in the 'do not run'
## section
prep <- preprocessCoverage(genomeData,
    groupInfo = group, cutoff = 0,
    scalefac = 32, chunksize = NULL, colsubset = NULL, mc.cores = 4
)

## Get the F statistics
fstats <- genomeFstats

## We recommend determining the cutoff to use based on the F-distribution
## although you could also base it on the observed F-statistics.
## In this example we use a low cutoff for illustrative purposes
cutoff <- 1

## Calculate the p-values and define the regions of interest.
regsWithP <- calculatePvalues(prep, models, fstats,
    nPermute = 1, seeds = 1,
    chr = "chr21", cutoff = cutoff, mc.cores = 1, method = "regular"
)
regsWithP

## Not run:
## Calculate again, but with 10 permutations instead of just 1
regsWithP <- calculatePvalues(prep, models, fstats,
    nPermute = 10, seeds = 1:10,
    chr = "chr21", cutoff = cutoff, mc.cores = 2, method = "regular"
)

## Check that they are the same as the previously calculated regions
library(testthat)
expect_that(regsWithP, equals(genomeRegions))

## Histogram of the theoretical p-values by region
hist(pf(regsWithP$regions$value, df1 - df0, n - df1),
    main = "Distribution of the original p-values by region", freq = FALSE
)

## Histogram of the permuted p-values by region
hist(regsWithP$regions$pvalues,
    main = "Distribution of the permuted p-values by region", freq = FALSE
)

## MA style plot
library("ggplot2")
ma <- data.frame(
    mean = regsWithP$regions$meanCoverage,
    log2FoldChange = regsWithP$regions$log2FoldChangeYRIvsCEU
)
ggplot(ma, aes(x = log2(mean), y = log2FoldChange)) +
    geom_point() +
    ylab("Fold Change (log2)") +
    xlab("Mean coverage (log2)") +
    labs(title = "MA style plot")

## Annotate the results
library("bumphunter")
genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene)
annotation <- matchGenes(regsWithP$regions, genes)
head(annotation)
## End(Not run)
After defining the models of interest (see makeModels) and pre-processing the data (see preprocessCoverage), use calculateStats to calculate the F-statistics at base-pair resolution.
calculateStats(coveragePrep, models, lowMemDir = NULL, ...)
coveragePrep: A list with the pre-processed coverage data produced by preprocessCoverage.
models: A list with the model matrices produced by makeModels.
lowMemDir: The directory where the processed chunks are saved when using preprocessCoverage with a specified lowMemDir.
...: Arguments passed to other methods and/or advanced arguments. Passed to define_cluster.
A numeric Rle with the F-statistics per base pair that passed the cutoff.
Leonardo Collado-Torres
makeModels, preprocessCoverage
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage), verbose = TRUE)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), verbose = TRUE)

## Build the models
group <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = group, adjustvars = adjustvars)

## Preprocess the data
prep <- preprocessCoverage(genomeData,
    cutoff = 0, scalefac = 32,
    chunksize = 1e3, colsubset = NULL
)

## Run the function
fstats <- calculateStats(prep, models, verbose = TRUE, method = "regular")
fstats

## Not run:
## Compare vs pre-packaged F-statistics
library("testthat")
expect_that(fstats, is_equivalent_to(genomeFstats))
## End(Not run)
Given the output of fullCoverage, coerce the coverage to a GRanges object.
coerceGR(sample, fullCov, ...)
sample: The name or integer index of the sample of interest to coerce to a GRanges object.
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
...: Arguments passed to other methods and/or advanced arguments. Passed to define_cluster.
A GRanges object with a 'score' metadata column containing the coverage information for the specified sample. The ranges reported are only those for regions of the genome with coverage greater than zero.
Leonardo Collado-Torres
## Create a small fullCov object with data only for chr21
fullCov <- list("chr21" = genomeDataRaw)

## Coerce to a GRanges the first sample
gr <- createBwSample("ERR009101",
    fullCov = fullCov,
    seqlengths = c("chr21" = 48129895)
)

## Explore the output
gr
For a given data set this function collapses the full coverage information for each sample from all the chromosomes. The resulting information per sample is the number of bases with coverage 0, 1, etc. It is similar to using table() on a regular vector. This information is then used by sampleDepth for calculating the sample depth adjustments. The data set can be loaded into R using loadCoverage (see fullCoverage) and optionally filtered using filterData.
collapseFullCoverage(fullCov, colsubset = NULL, save = FALSE, ...)
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
colsubset: Which columns of fullCov to use.
save: If TRUE, the result is saved to an Rdata file.
...: Arguments passed to other methods and/or advanced arguments, such as verbose.
A list with one element per sample. Then per sample, a list with two vector
elements: values
and weights
. The first one is the coverage
value and the second one is the number of bases with that value.
Leonardo Collado-Torres
## Collapse the coverage information for the filtered data
collapsedFull <- collapseFullCoverage(list(genomeData),
    verbose = TRUE
)
collapsedFull

## Not run:
## You can also collapse the raw data
collapsedFullRaw <- collapseFullCoverage(list(genomeDataRaw), verbose = TRUE)
## End(Not run)
This function extracts the coverage information calculated by fullCoverage for a set of exons determined by makeGenomicState. The underlying code is similar to getRegionCoverage with additional tweaks for calculating RPKM values.
coverageToExon( fullCov = NULL, genomicState, L = NULL, returnType = "raw", files = NULL, ... )
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
genomicState: A GRanges object created with makeGenomicState. It can be either the fullGenome or the codingGenome element.
L: The width of the reads used. Either a vector of length 1 or length equal to the number of samples.
returnType: If 'raw', the raw coverage counts per exon are returned; if 'rpkm', RPKM values are returned instead.
files: A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing files.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels and define_cluster.
Parallelization is used twice.
First, it is used by strand. Second, for processing the exons by
chromosome. So there is no gain in using mc.cores
greater than the
maximum of the number of strands and number of chromosomes.
If fullCov
is NULL
and files
is specified, this function
will attempt to read the coverage from the files. Note that if you used
'totalMapped' and 'targetSize' before, you will have to specify them again
to get the same results.
A matrix (nrow = number of exons in genomicState corresponding to the chromosomes in fullCov, ncol = number of samples) with the number of reads (or RPKM) per exon. The row names correspond to the row indexes of genomicState$fullGenome (if fullOrCoding = 'full') or genomicState$codingGenome (if fullOrCoding = 'coding').
Andrew Jaffe, Leonardo Collado-Torres
fullCoverage, getRegionCoverage
## Obtain fullCov object
fullCov <- list("21" = genomeDataRaw$coverage)

## Use only the first two exons
smallGenomicState <- genomicState
smallGenomicState$fullGenome <- smallGenomicState$fullGenome[
    which(smallGenomicState$fullGenome$theRegion == "exon")[1:2]
]

## Finally, get the coverage information for each exon
exonCov <- coverageToExon(
    fullCov = fullCov,
    genomicState = smallGenomicState$fullGenome, L = 36
)
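Following the parallelization note in the Details, one core is already enough for this example (a single chromosome with unstranded coverage). A minimal sketch, assuming that mc.cores is forwarded through ... to define_cluster as listed under the advanced arguments:

## Same call as above, with the number of cores made explicit
exonCov <- coverageToExon(
    fullCov = fullCov, genomicState = smallGenomicState$fullGenome,
    L = 36, mc.cores = 1
)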
Using output from fullCoverage, export the coverage from all the samples to BigWig files using createBwSample.
createBw(fullCov, path = ".", keepGR = TRUE, ...)
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
path: The path where the BigWig files will be created.
keepGR: If TRUE, the GRanges objects created by coerceGR are returned.
...: Arguments passed to createBwSample.
Use at most one core per chromosome.
If keepGR = TRUE, a GRangesList with the output of coerceGR for each of the samples.
Leonardo Collado-Torres
GRangesList, export.bw, createBwSample, coerceGR
## Create a small fullCov object with data only for chr21
fullCov <- list("chr21" = genomeDataRaw)

## Keep only 2 samples
fullCov$chr21$coverage <- fullCov$chr21$coverage[c(1, 31)]

## Create the BigWig files for all samples in a test dir
dir.create("createBw-example")
bws <- createBw(fullCov, "createBw-example")

## Explore the output
bws

## First sample
bws[[1]]

## Note that if a sample has no bases with coverage > 0, the GRanges object
## is empty and no BigWig file is created for that sample.
bws[[2]]
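Per the one-core-per-chromosome note above, mc.cores = 1 suffices for this single-chromosome example. A minimal sketch, assuming that mc.cores reaches define_cluster through createBwSample and coerceGR via the ... argument:

## Same export as above, with the core usage made explicit
bws <- createBw(fullCov, "createBw-example", mc.cores = 1)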
Given the output of fullCoverage, this function coerces the coverage to a GRanges object using coerceGR and then exports the coverage to a BigWig file using export.bw.
createBwSample(sample, path = ".", fullCov, keepGR = TRUE, ...)
sample: The name or integer index of the sample of interest to coerce to a GRanges object.
path: The path where the BigWig file will be created.
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
keepGR: If TRUE, the GRanges object created by coerceGR is returned.
...: Arguments passed to other methods and/or advanced arguments. Passed to coerceGR.
Creates a BigWig file with the coverage information (regions with coverage greater than zero) for a given sample. If keepGR = TRUE, it returns the output from coerceGR.
Leonardo Collado-Torres
GRanges, export.bw, coerceGR
## Create a small fullCov object with data only for chr21
fullCov <- list("chr21" = genomeDataRaw)

## Create a BigWig for the first sample in a test directory
dir.create("createBwSample-example")
bw <- createBwSample("ERR009101", "createBwSample-example",
    fullCov = fullCov, seqlengths = c("chr21" = 48129895)
)

## Explore the output
bw
Define a cluster to use.
define_cluster(cores = "mc.cores", ...)
cores: The argument to use to define the number of cores. This is useful for cases with nested parallelizations.
...: Advanced arguments: mc.cores (the number of cores to use) and BPPARAM.custom (a BiocParallel *Param object which, when supplied, takes precedence).
This function is used internally in many functions.
A BiocParallel *Param object
Leonardo Collado-Torres
## Use SerialParam()
define_cluster(mc.cores = 1)

## Note that BPPARAM.custom takes precedence
define_cluster(mc.cores = 2, BPPARAM.custom = BiocParallel::SerialParam())
These functions are provided for compatibility with older versions of ‘derfinder’ only and will be defunct at the next release.
advancedArg(fun, package = "derfinder", browse = interactive())
fun: The name of a function (or functions) that has advanced arguments in derfinder.
package: The name of the package where the function is stored. Only 'derfinder', 'derfinderPlot', and 'regionReport' are accepted.
browse: Whether to open the URLs in a browser.
The following functions are deprecated and will be made defunct.
advancedArg Not needed now that the advanced arguments are documented on the help pages of each function.
A vector of URLs with the GitHub search queries.
## Open the advanced argument docs for loadCoverage()
if (interactive()) {
    advancedArg("loadCoverage")
} else {
    (advancedArg("loadCoverage", browse = FALSE))
}
If available, use the information from GenomeInfoDb for your species of interest to map the sequence names from the style currently used to another valid style. For example, for Homo sapiens map '2' (NCBI style) to 'chr2' (UCSC style). If the information from GenomeInfoDb is not available, the original sequence names will be returned. To disable this functionality, set chrsStyle to NULL.
extendedMapSeqlevels( seqnames, style = getOption("chrsStyle", "UCSC"), species = getOption("species", "homo_sapiens"), currentStyle = NULL, ... )
seqnames: A character vector with the sequence names.
style: A single character vector specifying the naming style to use for renaming the sequence names.
species: A single character vector with the species of interest: it has to match the valid species names supported in GenomeInfoDb. See names(GenomeInfoDb::genomeStyles()).
currentStyle: A single character vector with the currently used naming style. If NULL, the current style is guessed.
...: Arguments passed to other methods and/or advanced arguments, such as chrsStyle.
This function is inspired from mapSeqlevels with the difference that it will return the original sequence names if the species, current naming style or target naming style are not supported in GenomeInfoDb.
If you want to disable this function, set chrsStyle to NULL. From other functions in derfinder that pass the ... argument to this function, use chrsStyle = NULL. This can be useful when working with organisms that are absent from GenomeInfoDb as documented in https://support.bioconductor.org/p/95521/.

A vector of sequence names using the specified naming style.
L. Collado-Torres
## Without guessing any information
extendedMapSeqlevels("2", "UCSC", "Homo sapiens", "NCBI")

## Guessing the current naming style
extendedMapSeqlevels("2", "UCSC", "Homo sapiens")

## Guess species and current style
extendedMapSeqlevels("2", "NCBI")

## Guess species while supplying the current style.
## Probably an uncommon use-case
extendedMapSeqlevels("2", "NCBI", currentStyle = "TAIR10")

## Sequence names are unchanged when using an unsupported species
extendedMapSeqlevels("seq2", "NCBI", "toyOrganism")

## Disable extendedMapSeqlevels. This can be useful when working with
## organisms that are not supported by GenomeInfoDb
chrs <- c(
    "I", "II", "III", "IV", "IX", "V", "VI", "VII", "VIII", "X",
    "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII"
)
extendedMapSeqlevels(chrs, chrsStyle = NULL)

## Not run:
## Set global species and style option
options("chrsStyle" = "UCSC")
options("species" = "homo_sapiens")

## Run using global options
extendedMapSeqlevels("2")
## End(Not run)
For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff. This is a helper function for loadCoverage and preprocessCoverage.
filterData( data, cutoff = NULL, index = NULL, filter = "one", totalMapped = NULL, targetSize = 8e+07, ... )
data: Either a list of Rle objects or a DataFrame with the coverage information.
cutoff: The base-pair level cutoff to use. Its behavior is controlled by filter.
index: A logical Rle with the positions of the chromosome that passed the cutoff. If NULL, it is assumed that the data has not been filtered before.
filter: Has to be either 'one' (default) or 'mean'. In the first case, at least one sample has to have coverage above cutoff; in the second case, the mean coverage has to be greater than cutoff.
totalMapped: A vector with the total number of reads mapped for each sample. The vector should be in the same order as the samples in data.
targetSize: The target library size to adjust the coverage to. Used only when totalMapped is specified.
...: Arguments passed to other methods and/or advanced arguments (described after the Value section below).
If cutoff is NULL then the data is grouped into a DataFrame without applying any cutoffs. This can be useful if you want to use loadCoverage to build the coverage DataFrame without applying any cutoffs for other downstream purposes, like plotting the coverage values of a given region (see the continuation of the example below). You can always specify the colsubset argument in preprocessCoverage to filter the data before calculating the F statistics.
A list with up to three components:

coverage: a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base. Included only when returnCoverage = TRUE.

position: a logical Rle with the positions of the chromosome that passed the cutoff.

meanCoverage: a numeric Rle with the mean coverage at each base. Included only when returnMean = TRUE.

The advanced arguments (passed via ...) control: the column names to be used for the results DataFrame (if NULL, names from data are used); whether to smooth the mean, used only when filter = 'mean' (this option is used internally by regionMatrix); and arguments passed to the internal function .smootherFstats, see findRegions.
Leonardo Collado-Torres
loadCoverage, preprocessCoverage, getTotalMapped
## Construct some toy data
library("IRanges")
x <- Rle(round(runif(1e4, max = 10)))
y <- Rle(round(runif(1e4, max = 10)))
z <- Rle(round(runif(1e4, max = 10)))
DF <- DataFrame(x, y, z)

## Filter the data
filt1 <- filterData(DF, 5)
filt1

## Filter again but only using the first two samples
filt2 <- filterData(filt1$coverage[, 1:2], 5, index = filt1$position)
filt2
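The cutoff = NULL behaviour described in the Details can be seen by continuing the toy example above; this is a minimal sketch rather than a recommended workflow:

## No cutoff: the samples are only merged into a DataFrame, no bases are dropped
unfiltered <- filterData(DF, cutoff = NULL)
unfiltered

## A cutoff can then be applied later, e.g. right before computing F-statistics
filterData(unfiltered$coverage, cutoff = 5)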
Find genomic regions for which a numeric vector is above (or below) predefined thresholds. In other words, this function finds the candidate Differentially Expressed Regions (candidate DERs). This is similar to regionFinder and is a helper function for calculatePvalues.
findRegions( position = NULL, fstats, chr, oneTable = TRUE, maxClusterGap = 300L, cutoff = quantile(fstats, 0.99, na.rm = TRUE), segmentIR = NULL, smooth = FALSE, weights = NULL, smoothFunction = bumphunter::locfitByCluster, ... )
position: A logical Rle of genomic positions. This is generated in loadCoverage. Note that it gets updated in preprocessCoverage if colsubset is not NULL.
fstats: A numeric Rle with the F-statistics. Usually obtained using calculateStats.
chr: A single element character vector specifying the chromosome name.
oneTable: If TRUE, only one GRanges is returned; otherwise a GRangesList with one element per cluster is returned.
maxClusterGap: This determines the maximum gap between candidate DERs. It should be greater than 0.
cutoff: Threshold applied to the fstats to determine the regions.
segmentIR: An IRanges object with the genomic positions that are potential DERs. This is used in calculatePvalues to speed up permutation calculations.
smooth: Whether to smooth the F-statistics or not.
weights: Weights used by the smoother as described in smoother.
smoothFunction: A function to be used for smoothing the F-statistics. Two functions are provided by the bumphunter package: locfitByCluster and runmedByCluster.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels and the internal functions, including the smoothing function when smooth = TRUE.
regionFinder adapted to Rle world.
Either a GRanges or a GRangesList as determined by oneTable. Each of them has the following metadata variables:

value: the mean of the values of y for the given region.
area: the absolute value of the sum of the values of y for the given region.
indexStart: the start position of the region in terms of the index for y.
indexEnd: the end position of the region in terms of the index for y.
cluster: the cluster ID.
clusterL: the total length of the cluster.
Leonardo Collado-Torres
Rafael A. Irizarry, Martin Aryee, Hector Corrada Bravo, Kasper D. Hansen and Harris A. Jaffee. bumphunter: Bump Hunter. R package version 1.1.10.
## Preprocess the data
prep <- preprocessCoverage(genomeData,
    cutoff = 0, scalefac = 32,
    chunksize = 1e3, colsubset = NULL
)

## Get the F statistics
fstats <- genomeFstats

## Find the regions
regs <- findRegions(prep$position, fstats, "chr21", verbose = TRUE)
regs

## Not run:
## Once you have the regions you can proceed to annotate them
library("bumphunter")
genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene)
annotation <- matchGenes(regs, genes)
annotation
## End(Not run)

# Find regions with smoothing the F-statistics by bumphunter::runmedByCluster
regs_smooth <- findRegions(prep$position, fstats, "chr21",
    verbose = TRUE,
    smoothFunction = bumphunter::runmedByCluster
)

## Compare against the object regs obtained earlier
regs_smooth
For a group of samples this function reads the coverage information for several chromosomes directly from the BAM files. Per chromosome, it merges the unfiltered coverage by sample into a DataFrame. The end result is a list with one such DataFrame object per chromosome.
fullCoverage( files, chrs, bai = NULL, chrlens = NULL, outputs = NULL, cutoff = NULL, ... )
files: A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing files.
chrs: The chromosomes of the files to read. The format has to match the one used in the input files.
bai: The full path to the BAM index files. If NULL, the index files are assumed to be in the same location as the BAM files.
chrlens: The chromosome lengths in base pairs. If it's NULL, the chromosome lengths are extracted from the BAM files.
outputs: This argument is passed to the output argument of loadCoverage.
cutoff: This argument is passed to filterData.
...: Arguments passed to other methods and/or advanced arguments. Passed to loadCoverage, define_cluster and extendedMapSeqlevels. Note that filterData is used internally by loadCoverage (and hence fullCoverage) and has the important arguments totalMapped and targetSize.
A list with one element per chromosome.
Each element is a DataFrame with the coverage information produced by loadCoverage.
Leonardo Collado-Torres
loadCoverage, filterData, getTotalMapped
datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
    datadir = datadir, samplepatt = "*accepted_hits.bam$",
    fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))

## Read and filter the data, only for 1 file
fullCov <- fullCoverage(files = files[1], chrs = c("21", "22"))
fullCov

## Not run:
## You can then use filterData() to filter the data if you want to.
## Use bplapply() if you want to do so with multiple cores as shown below.
library("BiocParallel")
p <- SnowParam(2L)
bplapply(fullCov, function(x) {
    library("derfinder")
    filterData(x, cutoff = 0)
}, BPPARAM = p)
## End(Not run)
10kb region from chr21 processed for 31 RNA-seq samples described in genomeInfo. The TopHat BAM files are included in the package and this is the output of loadCoverage applied to it. For more information check the example of loadCoverage.
A list with two components. coverage: a DataFrame object where each column represents a sample. position: a logical Rle with the positions of the chromosome that passed a cutoff of 0.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 2010 Apr.
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 2010 Mar.
10kb region from chr21 processed for 31 RNA-seq samples described in
genomeInfo. The TopHat BAM files are included in the package and this
is the output of loadCoverage applied to it with cutoff=NULL
.
For more information check the example of loadCoverage.
A list with two components. coverage: a DataFrame object where each column represents a sample. position: NULL, because no bases were filtered.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 2010 Apr.
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 2010 Mar.
Calculated F-statistics for a 10kb region from chr21 processed for 31 RNA-seq samples described in genomeInfo. For more information check the example of calculateStats.
A numeric Rle of length 1434 with the calculated F-statistics as exemplified in calculateStats.
Information for the 31 samples downloaded from the Short Read Archive from studies comparing CEU and YRI populations. This data is used to specify the adjustment variables in calculateStats. The data is sorted according to the BAM files identifiers. Reads were 36bp long.
A data.frame with 5 columns: the short name used to identify the sample BAM file, whether it was a single-end or a paired-end library, the HapMap identifier of the person sequenced (note that some were sequenced more than once), gender (whether the person sequenced is a female or a male), and pop (the population the person belongs to).

The samples are: unrelated females from the YRI population, unrelated females from the CEU population, and unrelated males (unrelated to the females too) from the CEU population.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 2010 Apr.
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 2010 Mar.
Candidate Differentially Expressed Regions (DERs) for the example data. For more information check calculatePvalues.
A list with four components. regions: a GRanges object with the candidate DERs. nullStats: a numeric Rle with the mean F-statistics for the null DERs found from the permutations. nullWidths: an integer Rle with the width of each null candidate DER. nullPermutation: an integer Rle with the permutation number for each null candidate DER, identifying which permutation cycle created it.
Pre-computed genomic state for Hsapiens UCSC hg19 knownGene annotation built using makeGenomicState for TxDb.Hsapiens.UCSC.hg19.knownGene version 2.14.0. The object has been subset for chr21 only.
A GRangesList with two components. fullGenome: classifies each region as either being exon, intron or intergenic. codingGenome: classifies the regions as being promoter, exon, intron, 5UTR, 3UTR or intergenic.
This function extracts the raw coverage information calculated by fullCoverage at each base for a set of regions found with calculatePvalues. It can further calculate the mean coverage per sample for each region.
getRegionCoverage( fullCov = NULL, regions, totalMapped = NULL, targetSize = 8e+07, files = NULL, ... )
fullCov: A list where each element is the result from loadCoverage used with cutoff = NULL. Can be generated using fullCoverage.
regions: The $regions output from calculatePvalues.
totalMapped: The total number of reads mapped for each sample. Providing this information adjusts the coverage to reads in a library of size targetSize.
targetSize: The target library size to adjust the coverage to. Used only when totalMapped is specified.
files: A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing files.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels and define_cluster.
When fullCov is the output of loadCoverage with a non-NULL cutoff, getRegionCoverage assumes that the regions come from the same data, meaning that filterData was not used again. This ensures that the regions are a subset of the data available in fullCov.

If fullCov is NULL and files is specified, this function will attempt to read the coverage from the files. Note that if you used 'totalMapped' and 'targetSize' before, you will have to specify them again to get the same results (see the sketch after the example below).

You should use at most one core per chromosome.
A list of data.frames where each data.frame has the coverage information (nrow = width of region, ncol = number of samples) for a given region. The names of the list correspond to the region indexes in regions.
Andrew Jaffe, Leonardo Collado-Torres
fullCoverage, calculatePvalues
## Obtain fullCov object
fullCov <- list("21" = genomeDataRaw$coverage)

## Assign chr lengths using hg19 information, use only first two regions
library("GenomicRanges")
regions <- genomeRegions$regions[1:2]
seqlengths(regions) <- seqlengths(getChromInfoFromUCSC("hg19",
    as.Seqinfo = TRUE
))[
    mapSeqlevels(names(seqlengths(regions)), "UCSC")
]

## Finally, get the region coverage
regionCov <- getRegionCoverage(fullCov = fullCov, regions = regions)
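Continuing the example above, a sketch of adjusting the coverage to a common library size via totalMapped and targetSize. The read counts are taken from the example BAM files shipped with the package, and it is assumed here that their order matches the sample columns in fullCov; the targetSize of 80 million reads is simply the default value:

## Total mapped reads per example BAM file
bams <- rawFiles(
    datadir = system.file("extdata", "genomeData", package = "derfinder"),
    samplepatt = "*accepted_hits.bam$", fileterm = NULL
)
totalMapped <- sapply(bams, getTotalMapped)

## Region coverage adjusted to a library size of 80 million reads
regionCovAdj <- getRegionCoverage(
    fullCov = fullCov, regions = regions,
    totalMapped = totalMapped, targetSize = 8e7
)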
For a given BAM file this function calculates the total number of mapped reads, and for a BigWig file it calculates the area under the curve (AUC), which is related to the number of mapped reads: the exact relationship depends on whether the aligner soft clips reads and/or whether all reads have the same length. If you use the 'chrs' argument you can choose to only consider the information for your chromosomes of interest. For example, you can exclude the mitochondrial chromosome (a sketch follows the example below).
getTotalMapped(rawFile, chrs = NULL)
rawFile: Either a BAM file or a BigWig file.
chrs: The chromosomes to consider. If NULL, all of them are used.
The total number of mapped reads for a BAM file or the AUC for a BigWig file in a single element vector.
Leonardo Collado-Torres
## Get the total number of mapped reads for an example BAM file:
bam <- system.file("extdata", "genomeData", "ERR009102_accepted_hits.bam",
    package = "derfinder", mustWork = TRUE
)
getTotalMapped(bam)
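A sketch of the chrs argument mentioned in the description, reusing the bam path from the example above; the example BAM uses NCBI-style names, so chromosome 21 is '21':

## Count only the reads mapped to chromosome 21
getTotalMapped(bam, chrs = "21")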
For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff.
loadCoverage( files, chr, cutoff = NULL, filter = "one", chrlen = NULL, output = NULL, bai = NULL, ... )
files: A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing files.
chr: Chromosome to read. Should be in the format matching the one used in the raw data.
cutoff: This argument is passed to filterData.
filter: Has to be either 'one' (default) or 'mean'. In the first case, at least one sample has to have coverage above cutoff; in the second case, the mean coverage has to be greater than cutoff.
chrlen: The chromosome length in base pairs. If it's NULL, the chromosome length is extracted from the BAM files.
output: If NULL, the result is not saved to disk; otherwise this controls the output file used to save the result (see also the outputs argument of fullCoverage).
bai: The full path to the BAM index files. If NULL, the index files are assumed to be in the same location as the BAM files.
...: Arguments passed to other methods and/or advanced arguments. Passed to extendedMapSeqlevels, define_cluster, scanBamFlag and filterData. Note that filterData is used internally by loadCoverage and has the important arguments totalMapped and targetSize.
The ...
argument can be used to control which alignments to consider
when reading from BAM files. See scanBamFlag.
Parallelization for loading the data in chunks is used only when tilewidth is specified. You may use up to one core per tile.
If you set the advanced argument drop.D = TRUE
, bases with CIGAR
string "D" (deletion from reference) will be excluded from the base-level
coverage calculation.
If you are working with data from an organism different from 'Homo sapiens'
specify so by setting the global 'species' and 'chrsStyle' options. For
example:
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')
A list with two components. coverage: a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base. position: a logical Rle with the positions of the chromosome that passed the cutoff.
Leonardo Collado-Torres, Andrew Jaffe
datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
    datadir = datadir, samplepatt = "*accepted_hits.bam$",
    fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))

## Read and filter the data, only for 2 files
dataSmall <- loadCoverage(files = files[1:2], chr = "21", cutoff = 0)

## Not run:
## Export to BigWig files
createBw(list("chr21" = dataSmall))

## Load data from BigWig files
dataSmall.bw <- loadCoverage(c(
    ERR009101 = "ERR009101.bw",
    ERR009102 = "ERR009102.bw"
), chr = "chr21")

## Compare them
mapply(function(x, y) {
    x - y
}, dataSmall$coverage, dataSmall.bw$coverage)

## Note that the only difference is the type of Rle (integer vs numeric) used
## to store the data.
## End(Not run)
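Building on the example above, a sketch of restricting which alignments contribute to the coverage. It assumes that flag-related arguments such as isDuplicate and isNotPassingQualityControls are forwarded through ... to scanBamFlag, as the Details indicate:

## Exclude duplicate reads and reads failing vendor quality controls
dataFiltered <- loadCoverage(
    files = files[1:2], chr = "21", cutoff = 0,
    isDuplicate = FALSE, isNotPassingQualityControls = FALSE
)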
This function summarizes the annotation contained in a TxDb at each given base of the genome based on annotated transcripts. It groups contiguous base pairs classified as the same type into regions.
makeGenomicState(txdb, chrs = c(seq_len(22), "X", "Y"), ...)
txdb: A TxDb object with chromosome lengths.
chrs: The names of the chromosomes to use as denoted in the txdb object.
...: Arguments passed to extendedMapSeqlevels.
A GRangesList object with two elements: fullGenome and codingGenome. Both have metadata information for the type of region (theRegion), transcript IDs (tx_id), transcript name (tx_name), and gene ID (gene_id). fullGenome classifies each region as either being exon, intron or intergenic. codingGenome classifies the regions as being promoter, exon, intron, 5UTR, 3UTR or intergenic.
Andrew Jaffe, Leonardo Collado-Torres
## Load the example database from the GenomicFeatures vignette
library("GenomicFeatures")
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
    package = "GenomicFeatures"
)
txdb <- loadDb(samplefile)

## Generate genomic state object, only for chr6
sampleGenomicState <- makeGenomicState(txdb, chrs = "chr6")

## Not run:
## Create the GenomicState object for Hsapiens.UCSC.hg19.knownGene
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene

## Creating this GenomicState object takes around 8 min for all chrs and
## around 30 secs for chr21
GenomicState.Hsapiens.UCSC.hg19.knownGene.chr21 <-
    makeGenomicState(txdb = txdb, chrs = "chr21")

## For convenience, this object is already included in derfinder
library("testthat")
expect_that(
    GenomicState.Hsapiens.UCSC.hg19.knownGene.chr21,
    is_equivalent_to(genomicState)
)

## Hsapiens ENSEMBL GRCh37
library("GenomicFeatures")
## Can take several minutes and speed will depend on your internet speed
xx <- makeTxDbPackageFromBiomart(
    version = "0.99", maintainer = "Your Name",
    author = "Your Name"
)
txdb <- loadDb(file.path(
    "TxDb.Hsapiens.BioMart.ensembl.GRCh37.p11", "inst", "extdata",
    "TxDb.Hsapiens.BioMart.ensembl.GRCh37.p11.sqlite"
))

## Creating this GenomicState object takes around 13 min
GenomicState.Hsapiens.ensembl.GRCh37.p11 <- makeGenomicState(
    txdb = txdb, chrs = c(1:22, "X", "Y")
)

## Save for later use
save(GenomicState.Hsapiens.ensembl.GRCh37.p11,
    file = "GenomicState.Hsapiens.ensembl.GRCh37.p11.Rdata"
)
## End(Not run)
Builds the model matrices for testing for differential expression by comparing a model with a grouping factor versus one without it. It adjusts for the confounders specified and the median coverage of each sample. The resulting models can be used in calculateStats.
makeModels(sampleDepths, testvars, adjustvars = NULL, testIntercept = FALSE)
sampleDepths |
Per sample library size adjustments calculated with sampleDepth. |
testvars |
A vector or matrix specifying the variables to test. For
example, a factor with the group memberships when testing for differences
across groups. Its length should match the number of columns used from
|
adjustvars |
Optional matrix of adjustment variables (e.g. measured confounders, output from SVA, etc.) to use in fitting linear models to each nucleotide. These variables have to be specified by sample and the number of rows must match the number of columns used. It will also work if it is a vector of the correct length. |
testIntercept |
If |
A list with two components.
The alternative model matrix.
The null model matrix.
Leonardo Collado-Torres
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
    verbose = TRUE
)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull,
    probs = c(0.5), nonzero = TRUE,
    verbose = TRUE
)

## Build the models
group <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = group, adjustvars = adjustvars)
names(models)
models
This function merges the results from running analyzeChr on several
chromosomes and assigns genomic states using annotateRegions. It
re-calculates the p-values and q-values using the pooled areas from the null
regions from all chromosomes. Once the results have been merged,
derfinderReport::generateReport
can be used to generate an HTML
report of the results. The derfinderReport
package is available at
https://github.com/lcolladotor/derfinderReport.
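The pooled re-calculation follows the standard permutation logic: the area of each observed region is compared against the areas of all null regions across chromosomes. The sketch below is an illustration of that principle only, not the package's exact code; regionAreas and nullAreas are placeholder numeric vectors.

## Empirical p-values from the pooled null areas (illustrative only)
regionAreas <- c(250, 40, 12)                      # placeholder observed region areas
nullAreas <- c(5, 30, 18, 2, 60, 9, 14, 3, 22, 7)  # placeholder pooled null areas
pvals <- sapply(regionAreas, function(a) {
    (sum(nullAreas > a) + 1) / (length(nullAreas) + 1)
})
qvals <- p.adjust(pvals, method = "fdr")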
mergeResults( chrs = c(seq_len(22), "X", "Y"), prefix = ".", significantCut = c(0.05, 0.1), genomicState, minoverlap = 20, mergePrep = FALSE, ... )
chrs |
The chromosomes of the files to be merged. |
prefix |
The main data directory path, which can be useful if analyzeChr is used for several parameters and the results are saved in different directories. |
significantCut |
A vector of length two specifying the cutoffs used to determine significance. The first element is used to determine significance for the P-values and FWER adjusted P-values, while the second element is used for the Q-values (FDR adjusted P-values), similar to calculatePvalues. |
genomicState |
A GRanges object created with makeGenomicState.
It can be either the |
minoverlap |
Determines the minimum overlap needed when annotating regions with annotateRegions. |
mergePrep |
If |
... |
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
Passed to annotateRegions and extendedMapSeqlevels. |
If you want to calculate the FWER adjusted P-values, supply optionsStats, which is produced by analyzeChr.
Seven Rdata files.
Full F-statistics from all chromosomes in a list of Rle objects.
Timing information from all chromosomes.
A DataFrame with the null region information: statistic, width, chromosome, and permutation identifier. It is ordered by the statistics.
GRanges object with regions found and with full
annotation from matchGenes. Note that the column
strand
from matchGenes is renamed to
annoStrand
to comply with GRanges specifications.
A list with the pre-processed coverage data from all chromosomes.
A list as constructed in annotateRegions with the assigned genomic states.
A list with the options used when merging the
results. Used in derfinderReport::generateReport.
Leonardo Collado-Torres
analyzeChr, calculatePvalues, annotateRegions
## The output will be saved in the 'generateReport-example' directory
dir.create("generateReport-example", showWarnings = FALSE, recursive = TRUE)

## For convenience, the derfinder output has been pre-computed
file.copy(system.file(file.path("extdata", "chr21"),
    package = "derfinder", mustWork = TRUE
), "generateReport-example", recursive = TRUE)

## Merge the results from the different chromosomes. In this case, there's
## only one: chr21
mergeResults(
    chrs = "21", prefix = "generateReport-example",
    genomicState = genomicState$fullGenome
)

## Not run:
## You can then explore the wallclock time spent on each step
load(file.path("generateReport-example", "fullRegions.Rdata"))

## Process the time info
time <- lapply(fullTime, function(x) data.frame(diff(x)))
time <- do.call(rbind, time)
colnames(time) <- "sec"
time$sec <- as.integer(round(time$sec))
time$min <- time$sec / 60
time$chr <- paste0("chr", gsub("\\..*", "", rownames(time)))
time$step <- gsub(".*\\.", "", rownames(time))
rownames(time) <- seq_len(nrow(time))

## Make plot
library("ggplot2")
ggplot(time, aes(x = step, y = min, colour = chr)) +
    geom_point() +
    labs(title = "Wallclock time by step") +
    scale_colour_discrete(limits = chrs) +
    scale_x_discrete(limits = names(fullTime[[1]])[-1]) +
    ylab("Time (min)") +
    xlab("Step")
## End(Not run)
This function takes the coverage data from loadCoverage, scales it, applies the log2 transformation, and splits it into appropriate chunks for use with calculateStats.
preprocessCoverage( coverageInfo, groupInfo = NULL, cutoff = 5, colsubset = NULL, lowMemDir = NULL, ... )
coverageInfo |
A list containing a DataFrame – |
groupInfo |
A factor specifying the group membership of each sample. If
|
cutoff |
The base-pair level cutoff to use. Its behavior is controlled
by |
colsubset |
Optional vector of column indices of
|
lowMemDir |
If specified, each chunk is saved into a separate Rdata
file under |
... |
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
|
If chunksize
is NULL
, then mc.cores
is used to
determine the chunksize
. This is useful if you want to split the data
so each core gets the same amount of data (up to rounding).
Computing the indexes and using those for mclapply
reduces
memory copying as described by Ryan Thompson and illustrated in approach #4
at http://lcolladotor.github.io/2013/11/14/Reducing-memory-overhead-when-using-mclapply
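The pattern being referenced is to pass chunks of row indices to each worker instead of slices of the data itself, so the large object is not copied to every core. The minimal sketch below shows that idea only; covMat is a placeholder numeric matrix standing in for the filtered coverage.

## Process chunks of rows by index to reduce memory copying (illustrative only)
library("parallel")
covMat <- matrix(rpois(2e4, lambda = 3), ncol = 4)  # placeholder coverage matrix
n_cores <- 2
idx <- splitIndices(nrow(covMat), n_cores)          # one index chunk per core
res <- mclapply(idx, function(i) {
    rowMeans(covMat[i, , drop = FALSE])             # each worker only touches its chunk
}, mc.cores = n_cores)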
If lowMemDir
is specified then $coverageProcessed
is NULL and
$mclapplyIndex
is a vector with the chunk identifiers.
A list with five components.
contains the processed coverage information in a
DataFrame object. Each column represents a sample and the coverage
information is scaled and log2 transformed. Note that if colsubset
is
not NULL
the number of columns will be less than those in
coverageInfo$coverage
. The total number of rows depends on the number
of base pairs that passed the cutoff
and the information stored is
the coverage at that given base. Further note that filterData is
re-applied if colsubset
is not NULL
and could thus lead to
fewer rows compared to coverageInfo$coverage
.
is a list of logical Rle objects. They contain the
partitioning information according to chunksize
.
is a logical Rle with the positions of the chromosome that passed the cutoff.
is a numeric Rle with the mean coverage at each filtered base.
is a list of Rle objects containing the mean coverage at
each filtered base calculated by group. This list has length 0 if
groupInfo=NULL
.
Passed to filterData when colsubset
is specified.
Leonardo Collado-Torres
filterData, loadCoverage, calculateStats
## Split the data and transform appropriately before using calculateStats()
dataReady <- preprocessCoverage(genomeData,
    cutoff = 0, scalefac = 32,
    chunksize = 1e3, colsubset = NULL, verbose = TRUE
)
names(dataReady)
dataReady
Rail (available at http://rail.bio) generates coverage BigWig files. These files are faster to load in R and to process. Rail creates an unadjusted coverage BigWig file per sample and an adjusted summary coverage BigWig file (mean or median) per chromosome. railMatrix reads in the mean (or median) coverage BigWig file and applies a threshold cutoff to identify expressed regions (ERs). It then goes back to the sample coverage BigWig files and extracts the base-level coverage for each sample. Finally, it summarizes this information in a matrix with one row per ER and one column per sample. This function is similar to regionMatrix but is faster thanks to the advantages of BigWig files.
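At its core, the ER step applies the cutoff to the summary coverage and collapses the surviving bases into ranges. The sketch below shows that idea with rtracklayer and IRanges; it is illustrative only (railMatrix also handles cluster gaps, library size scaling, and the per-sample coverage extraction), and the file name 'meanChr21.bw' and the cutoff of 5 are placeholders.

## Illustrative only: expressed regions from a summary coverage BigWig
library("rtracklayer")
library("IRanges")
meanCov <- import("meanChr21.bw", as = "RleList")[["chr21"]]  # placeholder file
ers <- reduce(ranges(slice(meanCov, lower = 5)))  # bases >= cutoff collapsed into ranges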
railMatrix( chrs, summaryFiles, sampleFiles, L, cutoff = NULL, maxClusterGap = 300L, totalMapped = NULL, targetSize = 4e+07, file.cores = 1L, chrlens = NULL, ... )
chrs |
A character vector with the names of the chromosomes. |
summaryFiles |
A character vector (or BigWigFileList) with the paths to
the summary BigWig files created by Rail. Either mean or median files. These
are library size adjusted by default in Rail. The order of the files in this
vector must match the order in |
sampleFiles |
A character vector with the paths to the BigWig files by sample. These files are unadjusted for library size. |
L |
The width of the reads used. Either a vector of length 1 or length equal to the number of samples. |
cutoff |
The base-pair level cutoff to use. Its behavior is controlled
by |
maxClusterGap |
This determines the maximum gap between candidate ERs. |
totalMapped |
A vector with the total number of reads mapped for each
sample. The vector should be in the same order as the samples in |
targetSize |
The target library size to adjust the coverage to. Used
only when |
file.cores |
Number of cores used for loading the BigWig files per chr. |
chrlens |
The chromosome lengths in base pairs. If it's |
... |
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
Passed to filterData, findRegions and define_cluster. |
Given a set of un-filtered coverage data (see fullCoverage), create candidate regions by applying a cutoff on the coverage values, and obtain a count matrix where the number of rows corresponds to the number of candidate regions and the number of columns corresponds to the number of samples. The values are the mean coverage for a given sample for a given region.
This function uses several other derfinder-package functions. Inspect the code if interested.
You should use at most one core per chromosome.
A list with one entry per chromosome. Then per chromosome, a list with two components.
A set of regions based on the coverage filter cutoff as returned by findRegions.
A matrix with the mean coverage by sample for each candidate region.
Leonardo Collado-Torres
## BigWig files are not supported in Windows
if (.Platform$OS.type != "windows") {
    ## Get data
    library("derfinderData")

    ## Identify sample files
    sampleFiles <- rawFiles(system.file("extdata", "AMY",
        package = "derfinderData"
    ), samplepatt = "bw", fileterm = NULL)
    names(sampleFiles) <- gsub(".bw", "", names(sampleFiles))

    ## Create the mean bigwig file. This file is normally created by Rail
    ## but in this example we'll create it manually.
    library("GenomicRanges")
    fullCov <- fullCoverage(files = sampleFiles, chrs = "chr21")
    meanCov <- Reduce("+", fullCov$chr21) / ncol(fullCov$chr21)
    createBw(list("chr21" = DataFrame("meanChr21" = meanCov)),
        keepGR = FALSE
    )
    summaryFile <- "meanChr21.bw"

    ## Get the regions
    regionMat <- railMatrix(
        chrs = "chr21", summaryFiles = summaryFile,
        sampleFiles = sampleFiles, L = 76, cutoff = 5.1,
        maxClusterGap = 3000L
    )

    ## Explore results
    names(regionMat$chr21)
    regionMat$chr21$regions
    dim(regionMat$chr21$coverageMatrix)
}
For a group of samples, this function creates the list of paths to the raw input files, which can then be used in loadCoverage. The raw input files are either BAM or BigWig files.
rawFiles( datadir = NULL, sampledirs = NULL, samplepatt = NULL, fileterm = "accepted_hits.bam" )
datadir |
The main directory where each of the |
sampledirs |
A character vector with the names of the sample
directories. If |
samplepatt |
If specified and |
fileterm |
Name of the BAM or BigWig file used in each sample. By
default it is set to |
This function can also be used to identify a set of BigWig files.
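For instance, a directory of BigWig files can be picked up by matching the file extension instead of the default BAM file name; the directory name below is a placeholder.

## Hypothetical layout: one BigWig file per sample under 'bigwig_dir'
bwFiles <- rawFiles(
    datadir = "bigwig_dir", samplepatt = "\\.bw$",
    fileterm = NULL
)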
A vector with the full paths to the raw files and sample names stored as the vector names.
Leonardo Collado-Torres
## Get list of BAM files included in derfinder
datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
    datadir = datadir, samplepatt = "*accepted_hits.bam$",
    fileterm = NULL
)
files
Given a set of un-filtered coverage data (see fullCoverage), create candidate regions by applying a cutoff on the coverage values, and obtain a count matrix where the number of rows corresponds to the number of candidate regions and the number of columns corresponds to the number of samples. The values are the mean coverage for a given sample for a given region.
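The per-region summary amounts to taking views of each sample's coverage over the candidate regions and averaging them; the toy sketch below shows that step for a single sample with IRanges and is not the function's internal code.

## Illustrative only: mean coverage of one sample over candidate regions
library("IRanges")
sampleCov <- Rle(c(0, 0, 6, 8, 7, 0, 0, 9, 10, 0))  # toy per-base coverage
ers <- IRanges(start = c(3, 8), end = c(5, 9))      # toy candidate regions
regionMeans <- viewMeans(Views(sampleCov, ers))     # one column of the coverage matrix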
regionMatrix( fullCov, cutoff = 5, L, totalMapped = 8e+07, targetSize = 8e+07, runFilter = TRUE, returnBP = TRUE, ... )
fullCov |
A list where each element is the result from
loadCoverage used with |
cutoff |
The base-pair level cutoff to use. Its behavior is controlled
by |
L |
The width of the reads used. Either a vector of length 1 or length equal to the number of samples. |
totalMapped |
A vector with the total number of reads mapped for each
sample. The vector should be in the same order as the samples in
|
targetSize |
The target library size to adjust the coverage to. Used
only when |
runFilter |
This controls whether to run filterData or not. If
set to |
returnBP |
If |
... |
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
Passed to filterData, findRegions and define_cluster. Note that filterData is used internally
by loadCoverage (and hence |
This function uses several other derfinder-package functions. Inspect the code if interested.
You should use at most one core per chromosome.
A list with one entry per chromosome. Then per chromosome, a list with three components.
A set of regions based on the coverage filter cutoff as returned by findRegions.
A list with one element per region. Each element is a matrix with numbers of rows equal to the number of base pairs in the region and number of columns equal to the number of samples. It contains the base-level coverage information for the regions. Only returned when returnBP = TRUE
.
A matrix with the mean coverage by sample for each candidate region.
Leonardo Collado-Torres
## Create some toy data
library("IRanges")
x <- Rle(round(runif(1e4, max = 10)))
y <- Rle(round(runif(1e4, max = 10)))
z <- Rle(round(runif(1e4, max = 10)))
fullCov <- list("chr21" = DataFrame(x, y, z))

## Calculate a proxy of library size
libSize <- sapply(fullCov$chr21, sum)

## Run region matrix normalizing the coverage
regionMat <- regionMatrix(
    fullCov = fullCov, maxRegionGap = 10L,
    maxClusterGap = 300L, L = 36, totalMapped = libSize, targetSize = 4e4
)

## Not run:
## You can alternatively use filterData() on fullCov to reduce the required
## memory before using regionMatrix(). This can be useful when mc.cores > 1
filteredCov <- lapply(fullCov, filterData,
    returnMean = TRUE, filter = "mean",
    cutoff = 5, totalMapped = libSize, targetSize = 4e4
)
regionMat2 <- regionMatrix(filteredCov,
    maxRegionGap = 10L,
    maxClusterGap = 300L, L = 36, runFilter = FALSE
)
## End(Not run)

## regionMatrix() can work with multiple chrs as shown below.
fullCov2 <- list("chr21" = DataFrame(x, y, z), "chr22" = DataFrame(x, y, z))
regionMat2 <- regionMatrix(
    fullCov = fullCov2, maxRegionGap = 10L,
    maxClusterGap = 300L, L = 36, totalMapped = libSize, targetSize = 4e4
)

## Combine results from multiple chromosomes
library("GenomicRanges")

## First extract the data
regs <- unlist(GRangesList(lapply(regionMat2, "[[", "regions")))
covMat <- do.call(rbind, lapply(regionMat2, "[[", "coverageMatrix"))
covBp <- do.call(c, lapply(regionMat2, "[[", "bpCoverage"))

## Force the names to match
names(regs) <- rownames(covMat) <- names(covBp) <- seq_len(length(regs))

## Combine into a list (not really needed)
mergedRegionMat <- list(
    "regions" = regs,
    "coverageMatrix" = covMat,
    "bpCoverage" = covBp
)
For a given data set calculate the per-sample coverage adjustments. Hector Corrada's group proposed calculating the sum of the coverage for genes below a given sample quantile. In this function, we calculate the sample quantiles of interest by sample, and then the sum of the coverage for bases below or equal to the quantiles of interest. The resulting values are transformed as log2(x + scalefac) to avoid very large numbers that could potentially affect the stability of the F-statistics calculation. The sample coverage adjustments are then used in makeModels for constructing the null and alternative models.
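Put differently, for each sample the adjustment is the log2 of the summed coverage at or below the chosen quantile, plus scalefac. The sketch below computes this for one sample and is a simplification (it ignores the nonzero option and the run-length weighting handled by collapseFullCoverage); cov is a placeholder vector of collapsed coverage values.

## Illustrative only: adjustment for one sample at the median (probs = 0.5)
cov <- c(0, 2, 5, 1, 0, 8, 3)                # placeholder collapsed coverage
q <- quantile(cov, probs = 0.5)
adjustment <- log2(sum(cov[cov <= q]) + 32)  # scalefac = 32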
sampleDepth(collapsedFull, probs = c(0.5, 1), scalefac = 32, ...)
collapsedFull |
The full coverage data collapsed by sample as produced by collapseFullCoverage. |
probs |
Number(s) between 0 and 1 representing the quantile(s) of interest. For example, 0.5 is the median. |
scalefac |
Number added to the sample coverage adjustments before the log2 transformation. |
... |
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
|
A matrix (or a vector when length(probs) == 1) with the library size depth
adjustments per sample to be used in makeModels. The number of rows
corresponds to the number of quantiles used for the sample adjustments.
Leonardo Collado-Torres
Paulson, J. N., Stine, O. C., Bravo, H. C. & Pop, M. Differential abundance analysis for microbial marker-gene surveys. Nat. Methods (2013). doi:10.1038/nmeth.2658
collapseFullCoverage, makeModels
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
    verbose = TRUE
)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5, 1), verbose = TRUE)
sampleDepths