Title: | Epigenetic and gene transcription data normalization and integration with mixture models |
---|---|
Description: | A package for the integrative analysis of RNA-seq or microarray based gene transcription and histone modification data obtained by ChIP-seq. The package provides methods for data preprocessing and matching as well as methods for fitting bayesian mixture models in order to detect genes with differences in both data types. |
Authors: | Hans-Ulrich Klein, Martin Schaefer |
Maintainer: | Hans-Ulrich Klein <[email protected]> |
License: | LGPL-3 |
Version: | 1.47.0 |
Built: | 2024-12-29 06:39:30 UTC |
Source: | https://github.com/bioc/epigenomix |
This method estimates the posterior distribution of a Bayesian mixture model using Markov Chain Monte Carlo (MCMC) methods and calculates measures of this distribution. The mixture model may consist of normal components (with a fixed expectation of 0), exponential components and gamma components, which may be mirrored in order to model negative values.
bayesMixModel(z, normNull=c(), expNeg=c(), expPos=c(), gamNeg=c(), gamPos=c(), sdNormNullInit=c(), rateExpNegInit=c(), rateExpPosInit=c(), shapeGamNegInit=c(), scaleGamNegInit=c(), shapeGamPosInit=c(), scaleGamPosInit=c(), piInit, classificationsInit, dirichletParInit=1, shapeDir=1, scaleDir=1, weightsPrior="FDD", sdAlpha, shapeNorm0=c(), scaleNorm0=c(), shapeExpNeg0=c(), scaleExpNeg0=c(), shapeExpPos0=c(), scaleExpPos0=c(), shapeGamNegAlpha0=c(), shapeGamNegBeta0=c(), scaleGamNegAlpha0=c(), scaleGamNegBeta0=c(), shapeGamPosAlpha0=c(), shapeGamPosBeta0=c(), scaleGamPosAlpha0=c(), scaleGamPosBeta0=c(), itb, nmc, thin, average="mean",sdShape)
bayesMixModel(z, normNull=c(), expNeg=c(), expPos=c(), gamNeg=c(), gamPos=c(), sdNormNullInit=c(), rateExpNegInit=c(), rateExpPosInit=c(), shapeGamNegInit=c(), scaleGamNegInit=c(), shapeGamPosInit=c(), scaleGamPosInit=c(), piInit, classificationsInit, dirichletParInit=1, shapeDir=1, scaleDir=1, weightsPrior="FDD", sdAlpha, shapeNorm0=c(), scaleNorm0=c(), shapeExpNeg0=c(), scaleExpNeg0=c(), shapeExpPos0=c(), scaleExpPos0=c(), shapeGamNegAlpha0=c(), shapeGamNegBeta0=c(), scaleGamNegAlpha0=c(), scaleGamNegBeta0=c(), shapeGamPosAlpha0=c(), shapeGamPosBeta0=c(), scaleGamPosAlpha0=c(), scaleGamPosBeta0=c(), itb, nmc, thin, average="mean",sdShape)
z |
Observed values |
normNull |
Indices of the normal components (that have |
expNeg |
Indices of the mirrored exponential components. |
expPos |
Indices of the exponential components. |
gamNeg |
Indices of the mirrored gamma components. |
gamPos |
Indices of the gamma components. |
sdNormNullInit |
Initial standard deviations of the normal components. |
rateExpNegInit |
Initial rates of the mirrored exponential components. Only relevant if mirrored exponential components are specified. |
rateExpPosInit |
Initial rates of the exponential components. Only relevant if exponential components are specified. |
shapeGamNegInit |
Initial shape parameters of the mirrored gamma components. Only relevant if mirrored Gamma components are specified. |
scaleGamNegInit |
Initial scale parameters of the mirrored gamma components. Only relevant if mirrored Gamma components are specified. |
shapeGamPosInit |
Initial shape parameters of the gamma components. Only relevant if Gamma components are specified. |
scaleGamPosInit |
Initial scale parameters of the gamma components. Only relevant if Gamma components are specified. |
piInit |
Initial weights of the components. If missing, all k components get the same initial weight 1/k. |
classificationsInit |
Initial classifications of the data points. If missing, all data points are assigned to class floor(k/2) with k = number of components. |
dirichletParInit |
Initial concentration parameter of prior distribution assigned to the mixture weights. |
shapeDir |
Prior shape parameter of Gamma distribution for concentration parameter of prior distribution assigned to the mixture weights. |
scaleDir |
Prior scale parameter of Gamma distribution for concentration parameter of prior distribution assigned to the mixture weights. |
weightsPrior |
Prior distribution assigned to mixture weights. Available are the Finite-dimensional Dirichlet prior ("FDD"), also known as Dirichlet-multinomial process, and the Truncated Dirichlet process ("TDP"). Both are approximations to the Dirichlet process for a large number of components, while the Finite-dimensional Dirichlet prior is also suited for a small number of components as a special case of the Dirichlet distribution. |
sdAlpha |
Standard deviation of proposal distribution for concentration parameter of the prior distribution assigned to the mixture weights in the Metropolis-Hastings step incorporated in the Gibbs sampler. Only relevant if |
shapeNorm0 |
Prior shape parameter of Gamma distribution for precision of normal components. |
scaleNorm0 |
Prior scale parameter of Gamma distribution for precision of normal components. |
shapeExpNeg0 |
Prior shape parameter of Gamma distribution for parameter of mirrored exponential components. Only relevant if mirrored exponential components are specified. |
scaleExpNeg0 |
Prior scale parameter of Gamma distribution for parameter of mirrored exponential components. Only relevant if mirrored exponential components are specified. |
shapeExpPos0 |
Prior shape parameter of Gamma distribution for parameter of exponential components. Only relevant if exponential components are specified. |
scaleExpPos0 |
Prior scale parameter of Gamma distribution for parameter of exponential components. Only relevant if exponential components are specified. |
shapeGamNegAlpha0 |
Prior shape parameter of Gamma distribution for shape parameter of mirrored Gamma components. Only relevant if mirrored Gamma components are specified. |
shapeGamNegBeta0 |
Prior scale parameter of Gamma distribution for shape parameter of mirrored Gamma components. Only relevant if mirrored Gamma components are specified. |
scaleGamNegAlpha0 |
Prior shape parameter of Gamma distribution for scale parameter of mirrored Gamma components. Only relevant if mirrored Gamma components are specified. |
scaleGamNegBeta0 |
Prior scale parameter of Gamma distribution for scale parameter of mirrored Gamma components. Only relevant if mirrored Gamma components are specified. |
shapeGamPosAlpha0 |
Prior shape parameter of Gamma distribution for shape parameter of Gamma components. Only relevant if Gamma components are specified. |
shapeGamPosBeta0 |
Prior scale parameter of Gamma distribution for shape parameter of Gamma components. Only relevant if Gamma components are specified. |
scaleGamPosAlpha0 |
Prior shape parameter of Gamma distribution for scale parameter of Gamma components. Only relevant if Gamma components are specified. |
scaleGamPosBeta0 |
Prior scale parameter of Gamma distribution for scale parameter of Gamma components. Only relevant if Gamma components are specified. |
itb |
Number of iterations used for burn-in. |
nmc |
Number of iterations after burn-in used for analysis. |
thin |
Thinning value for the iterations after burn-in. |
average |
Way of averaging across the posterior distribution to obtain estimates of model parameters. Either |
sdShape |
Standard deviation of proposal distribution for shape parameter of Gamma components in the Metropolis-Hastings step incorporated in the Gibbs sampler. Only relevant if Gamma components are specified. |
The convergence of Markov chains must be assessed prior to an
interpretation of results. Inspection of trace plots via
plotChains
is therefore urgently recommended.
Iterations during which one of the chains has not yet reached stationarity should not be taken into account for analysis
and can be excluded by setting an appropriate burn-in value itb
.
Autocorrelation between subsequent chain values can be reduced by thinning the chain, setting an appropriate value for thin
.
To ensure a sufficient number of iterations for the chains after the burn-in, nmc
should be increased when the thinning is increased.
The standard deviations of the proposal distribution in a Metropolis-Hastings step should be tuned to achieve a medium-level acceptance rate (e.g., 0.3-0.7):
A very low acceptance rate would cause a long running time of the algorithm, while a very high acceptance rate
typically leads to autocorrelation between the values of the chain. Acceptance is documented for each iteration in the chains
slot of objects of class MixModelBayes-class
.
An object of class MixModelBayes-class
storing results, data,
priors, initial values and information about convergence.
Martin Schaefer ([email protected])
plotChains
, MixModelBayes-class
set.seed(1000) z <- c(rnorm(1000, 0, 0.5), rnorm(1000, 0, 1)) mm <- bayesMixModel(z, normNull=1:2, sdNormNullInit=c(0.1, 0.2), piInit=c(1/2, 1/2), shapeNorm0=c(1, 1), scaleNorm0=c(1, 1), shapeExpNeg0=c(), scaleExpNeg0=c(), shapeExpPos0=c(), scaleExpPos0=c(), sdAlpha=1, itb=100, nmc=1000, thin=10) mm plotComponents(mm) plotChains(mm, chain="pi") z <- c(rnorm(200, 0, 1), rnorm(200, 0, 5), rexp(200, 0.1), -rexp(200, 0.1)) mm <- bayesMixModel(z, normNull=1:2, gamNeg=3, gamPos=4, sdNormNullInit=c(1, 1), shapeGamNegInit=1, scaleGamNegInit=1, shapeGamPosInit=1, scaleGamPosInit=1, shapeNorm0=c(1,3), scaleNorm0=c(1,3), sdAlpha=1, shapeGamNegAlpha0=1, shapeGamNegBeta0=1, scaleGamNegAlpha0=1, scaleGamNegBeta0=1, shapeGamPosAlpha0=1, shapeGamPosBeta0=1, scaleGamPosAlpha0=1, scaleGamPosBeta0=1, sdShape=0.025, itb=100, nmc=1000, thin=10) mm plotComponents(mm) plotChains(mm, chain="pi")
set.seed(1000) z <- c(rnorm(1000, 0, 0.5), rnorm(1000, 0, 1)) mm <- bayesMixModel(z, normNull=1:2, sdNormNullInit=c(0.1, 0.2), piInit=c(1/2, 1/2), shapeNorm0=c(1, 1), scaleNorm0=c(1, 1), shapeExpNeg0=c(), scaleExpNeg0=c(), shapeExpPos0=c(), scaleExpPos0=c(), sdAlpha=1, itb=100, nmc=1000, thin=10) mm plotComponents(mm) plotChains(mm, chain="pi") z <- c(rnorm(200, 0, 1), rnorm(200, 0, 5), rexp(200, 0.1), -rexp(200, 0.1)) mm <- bayesMixModel(z, normNull=1:2, gamNeg=3, gamPos=4, sdNormNullInit=c(1, 1), shapeGamNegInit=1, scaleGamNegInit=1, shapeGamPosInit=1, scaleGamPosInit=1, shapeNorm0=c(1,3), scaleNorm0=c(1,3), sdAlpha=1, shapeGamNegAlpha0=1, shapeGamNegBeta0=1, scaleGamNegAlpha0=1, scaleGamNegBeta0=1, shapeGamPosAlpha0=1, shapeGamPosBeta0=1, scaleGamPosAlpha0=1, scaleGamPosBeta0=1, sdShape=0.025, itb=100, nmc=1000, thin=10) mm plotComponents(mm) plotChains(mm, chain="pi")
This method calculates the cross correlation, i.e. the Pearson correlation between the coverages of the positive and negative strand from a DNA sequencing experiment. The cross correlation can be used as a quality measure in ChIP-seq experiments (Kharchenko et al. 2008). Cross correlation can also be used to estimate the fragment size by determining the shift (given in base pairs) that maximizes the cross correlation.
calculateCrossCorrelation(object, shift=c(200,250,300), bin=10, mode="none", minReads=10000, chrs=NA, mc.cores=1)
calculateCrossCorrelation(object, shift=c(200,250,300), bin=10, mode="none", minReads=10000, chrs=NA, mc.cores=1)
object |
An |
shift |
The number of bases that the negative strand is shifted towards its three prime end. This can be a vector, if the correlation should be calculated for different shifts. |
bin |
If bin is larger than one, the coverage is calculated for bins of size
|
mode |
|
minReads |
If not at least |
chrs |
A |
mc.cores |
Number of cores to be used. |
Only 5 prime start positions of reads are used for calculating
the coverage. Therefore, after removing duplicates in a single end sequencing
experiment, the coverage can not be larger than one, if the bin size is
set to one. (In this setting, mode both
is meaningless.)
If bin is larger than one, the coverage within a bin is aggregated.
Then, the correlation is calculated for each shift. A shift
(given in basepairs) should be multiple of the bin size
(given in basepairs, too). If not, the binnend coverage is shifted by
round(shift/bin) elements.
The different modes define whether regions without coverage or with
only one covered strand should used. The original implementation in
the package "spp" does not make use of regions without
coverage. However, this seems to be a loss of information, since no
coverage has also a biological meaning in a ChIP-seq experiment. If
the fragment size is approximately 500bp, setting shift=seq(200, 800, 10)
,
bin=10
and mode="none"
should be a good setting.
After the cross correlation was calculated for each chromosome, the weighted mean correlation across all chromosomes is calculated. The weight for a specific chromosome equals the fraction of all reads that were aligned to that chromosome.
A numeric vector with the cross correlation for each shift. The names of the vector correspond to the shifts.
Hans-Ulrich Klein ([email protected])
Kharchenko PV, Tolstorukov MY and Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26(12):1351-9
Landt SG et al., ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22(9):1813-31
triangularKernel <- function(x, pos, h) { res <- 1 - (abs(x - pos) / h) res[res < 0] <- 0 return(res) } covPos <- round(triangularKernel(1:100, 60, 50) * 100) covNeg <- round(triangularKernel(1:100, 65, 50) * 100) reads <- GRanges(IRanges(start=c(rep(seq_along(covPos), covPos), rep(seq_along(covNeg), covNeg) - 9), width=10), strand=c(rep("+", sum(covPos)), rep("-", sum(covNeg))), seqnames=rep("1", sum(covPos)+sum(covNeg))) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="none", minReads=1) cor(covPos, covNeg) cor(covPos[1:(length(covPos)-10)], covNeg[11:length(covNeg)]) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="one", minReads=1) cor(covPos[covPos != 0 | covNeg != 0], covNeg[covPos != 0 | covNeg != 0]) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="both", minReads=1) cor(covPos[covPos != 0 & covNeg != 0], covNeg[covPos != 0 & covNeg != 0]) covPos2 <- round(triangularKernel(1:100, 60, 50) * 50) covNeg2 <- round(triangularKernel(1:100, 68, 50) * 50) reads2 <- GRanges(IRanges(start=c(rep(seq_along(covPos2), covPos2), rep(seq_along(covNeg2), covNeg2) - 9), width=10), strand=c(rep("+", sum(covPos2)), rep("-", sum(covNeg2))), seqnames=rep("2", sum(covPos2)+sum(covNeg2))) seqlevels(reads2) <- c("1", "2") allReads <- c(reads, reads2) calculateCrossCorrelation(allReads, shift=5, minReads=1, bin=1, mode="none") cor1 <- cor(covPos[1:(length(covPos)-5)], covNeg[6:length(covNeg)]) cor2 <- cor(covPos2[1:(length(covPos2)-5)], covNeg2[6:length(covNeg2)]) cor1 * (sum(c(covPos, covNeg))/length(allReads)) + cor2 * (sum(c(covPos2, covNeg2))/length(allReads))
triangularKernel <- function(x, pos, h) { res <- 1 - (abs(x - pos) / h) res[res < 0] <- 0 return(res) } covPos <- round(triangularKernel(1:100, 60, 50) * 100) covNeg <- round(triangularKernel(1:100, 65, 50) * 100) reads <- GRanges(IRanges(start=c(rep(seq_along(covPos), covPos), rep(seq_along(covNeg), covNeg) - 9), width=10), strand=c(rep("+", sum(covPos)), rep("-", sum(covNeg))), seqnames=rep("1", sum(covPos)+sum(covNeg))) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="none", minReads=1) cor(covPos, covNeg) cor(covPos[1:(length(covPos)-10)], covNeg[11:length(covNeg)]) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="one", minReads=1) cor(covPos[covPos != 0 | covNeg != 0], covNeg[covPos != 0 | covNeg != 0]) calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="both", minReads=1) cor(covPos[covPos != 0 & covNeg != 0], covNeg[covPos != 0 & covNeg != 0]) covPos2 <- round(triangularKernel(1:100, 60, 50) * 50) covNeg2 <- round(triangularKernel(1:100, 68, 50) * 50) reads2 <- GRanges(IRanges(start=c(rep(seq_along(covPos2), covPos2), rep(seq_along(covNeg2), covNeg2) - 9), width=10), strand=c(rep("+", sum(covPos2)), rep("-", sum(covNeg2))), seqnames=rep("2", sum(covPos2)+sum(covNeg2))) seqlevels(reads2) <- c("1", "2") allReads <- c(reads, reads2) calculateCrossCorrelation(allReads, shift=5, minReads=1, bin=1, mode="none") cor1 <- cor(covPos[1:(length(covPos)-5)], covNeg[6:length(covNeg)]) cor2 <- cor(covPos2[1:(length(covPos2)-5)], covNeg2[6:length(covNeg2)]) cor1 * (sum(c(covPos, covNeg))/length(allReads)) + cor2 * (sum(c(covPos2, covNeg2))/length(allReads))
"ChIPseqSet"
A class for storing count data obtained from ChIP-seq experiments by
counting the number of reads lying within regions. The class extends
RangedSummarizedExperiment
.
Objects can be created by calls of the form
ChIPseqSet(chipVals=countDataMatrix,
rowRanges=genomicRegions,
colData=DataFrame(row.names=colnames(countDataMatrix)),
...)
.
However, one will most likely create a ChIPseqSet
object by
calling summarizeReads
.
metadata
:An optional list
of arbitrary
content describing the overall experiment.
rowRanges
:Object of class "GRanges"
or "GRangesList"
containing the genomic regions where the reads were counted.
colData
:Object of class "DataFrame"
containing information on variable values of the samples. Some methods
require the total library size of each sample stored in a column
titled totalCounts.
assays
:Object of class SimpleList
of a
matrix, named chipVals
containing the read counts
per genomic region.
Class "RangedSummarizedExperiment"
, directly.
signature(object = "ChIPseqSet")
: Returns the
matrix with read counts.
signature(object = "ChIPseqSet", value = "matrix")
:
Sets the matrix with read counts.
signature(object = "ChIPseqSet", libSize,
log2=FALSE, priorCount=0.1)
: Returns
an object of ChIPseqSet
with read counts standardized by
library size - counts per million (cpm). If the library size is
not given, the column sums of the given object are used. Cpm
values are logarithmized after adding priorCounts
,
if log2 is TRUE
.
Hans-Ulrich Klein ([email protected])
showClass("ChIPseqSet")
showClass("ChIPseqSet")
The ExpressionSet
stores 2 replicates for each of
two different conditions. Data were obtained from Affymetrix
MouseGene 1.0 ST arrays.
data(eSet)
data(eSet)
An object of class ExpressionSet
.
The example data contains a subset of 200 probesets located on chromosome 1. Data were RMA normalized.
data(eSet) eSet pData(eSet)
data(eSet) eSet pData(eSet)
The data.frame
stores transcription values obtained
from the Cufflinks software for two samples (CEBPA_WT and
CEBPA_KO). Transcription values are given in fragments per
kilobase of transcripts per million fragments (FPKM).
data(fpkm)
data(fpkm)
An object of class data.frame
.
All transcripts sharing the TSS were grouped and one transcription values is given for each group of transcripts. The example data contains a subset of about 3500 TSS located on chromosome 1.
data(fpkm) head(fpkm)
data(fpkm) head(fpkm)
Calculates some basic alignment statistics for given bam files.
getAlignmentQuality(bamFile, verbose = FALSE, mc.cores = 1)
getAlignmentQuality(bamFile, verbose = FALSE, mc.cores = 1)
bamFile |
A |
verbose |
If set to |
mc.cores |
Number of cores to be used. |
The given bam files should have marked duplicates and not uniquely mapped reads should have a quality value of 0. In detail, this function returns a data frame with the following columns:
File name without path and suffix
ID field from bam header, if available
SM field from bam header, if available
LB field from bam header, if available
Total number of reads in bam file
Number of mapped Reads
MappedReads/TotalReads
Number of mapped reads with mapping quality larger 0
UniquelyMappedReads/MappedReads
Number of non duplicated mapped reads with mapping quality larger 0
UniquelyMappedUniqueReads/MappedReads
UniquelyMappedUniqueReads/UniquelyMappedReads
Mean mapping quality of all uniquely mapped unique reads
Standard deviation of the mapping quality of all uniquely mapped unique reads
0% quantileof the mapping quality of all uniquely mapped unique reads
25% quantile of the mapping quality of all uniquely mapped unique reads
50% quantile of the mapping quality of all uniquely mapped unique reads
75% quantile of the mapping quality of all uniquely mapped unique reads
100% quantile of the mapping quality of all uniquely mapped unique reads
Full path and file name as given in argument bamFile
Returns a data frame with one row for each given bam file and the columns as listed in the details section.
Hans-Ulrich Klein ([email protected])
## Not run: getAlignmentQuality("myFile.bam")
## Not run: getAlignmentQuality("myFile.bam")
This function calculates the product of the standardized differences between two conditions in ChIP-seq data and the respective standardized differences in gene expression data. A score close to zero means that there are no (large) differences in at least one of the two data sets. If the score is positive, equally directed differences exist in both data sets. In case of a negative score, differences have unequal signs in the two data sets.
integrateData(expr, chipseq, factor, reference)
integrateData(expr, chipseq, factor, reference)
expr |
An |
chipseq |
A |
factor |
A |
reference |
Optionally, the name of the factor level that should be used as
reference. If missing, the first level of |
Let A and B denote the gene expression value of one probe in the group
of interest and in the reference group defined by the argument
reference
. And let X and Y be the ChIP-seq values assigned to
that probe. This functions returnes for each probe
where is the standard deviation estimated from all
observed difference in the gene expression data and
the standard deviation in the ChIP-seq data.
If there is more than one sample in any group and data set, the average of the replicates is calcuated first and than plugged into the formula above.
Not all features in expr
must also be in chipseq
and
vice versa. Features present in only one of the two data types are
omitted.
A matrix with five columns. The first 4 columns store the (average)
expression values and the (average) ChIP-seq values for each of the two
conditions. The fith columns store the correlation score. The row names
equal common feature names of expr
and chipseq
.
Hans-Ulrich Klein ([email protected])
ge <- matrix(c(5,12,5,11,11,10,12,11), nrow=2) row.names(ge) <- c("100_at", "200_at") colnames(ge) <- c("c1", "c2", "t1", "t2") geDf <- data.frame(status=factor(c("control", "control", "treated", "treated")), row.names=colnames(ge)) eSet <- ExpressionSet(ge, phenoData=AnnotatedDataFrame(geDf)) chip <- matrix(c(10,20,20,22), nrow=2) row.names(chip) <- c("100_at", "200_at") colnames(chip) <- c("c", "t") rowRanges <- GRanges(IRanges(start=c(10,50), end=c(20,60)), seqnames=c("1","1")) names(rowRanges) = c("100_at", "200_at") chipDf <- DataFrame(status=factor(c("control", "treated")), totalCount=c(100, 100), row.names=colnames(chip)) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges, colData=chipDf) integrateData(eSet, cSet, factor="status", reference="control")
ge <- matrix(c(5,12,5,11,11,10,12,11), nrow=2) row.names(ge) <- c("100_at", "200_at") colnames(ge) <- c("c1", "c2", "t1", "t2") geDf <- data.frame(status=factor(c("control", "control", "treated", "treated")), row.names=colnames(ge)) eSet <- ExpressionSet(ge, phenoData=AnnotatedDataFrame(geDf)) chip <- matrix(c(10,20,20,22), nrow=2) row.names(chip) <- c("100_at", "200_at") colnames(chip) <- c("c", "t") rowRanges <- GRanges(IRanges(start=c(10,50), end=c(20,60)), seqnames=c("1","1")) names(rowRanges) = c("100_at", "200_at") chipDf <- DataFrame(status=factor(c("control", "treated")), totalCount=c(100, 100), row.names=colnames(chip)) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges, colData=chipDf) integrateData(eSet, cSet, factor="status", reference="control")
The GRangesList
contains two elements:
"CEBPA_WT_1" and "CEBPA_KO_1". Both list elements are
GRanges
objects storing mapped reads
from anti-H3K4me3 ChIP-seq experiments. The first sample was a
wild-type mouse cell line. The second sample was obtained from
the same cell line after CEPBA knock-out.
data(mappedReads)
data(mappedReads)
A GRangesList
with two
GRanges
.
Duplicated reads and reads mapping to more than one genomic location were removed. Reads were extended to the estimated DNA fragment size of 200bp towards the 3 prime end. Further, only reads lying within certain regions of chromomse 1 were kept to reduce storage space.
data(mappedReads) names(mappedReads) mappedReads[[1]]
data(mappedReads) names(mappedReads) mappedReads[[1]]
This function returns a GRangesList
object asigning promoter regions
to probes. The assignment of transcripts to probes and the
transcriptional start sites must be given as arguments.
matchProbeToPromoter(probeToTranscript, transcriptToTSS, promWidth = 4000, mode = "union", fix = "center")
matchProbeToPromoter(probeToTranscript, transcriptToTSS, promWidth = 4000, mode = "union", fix = "center")
probeToTranscript |
A |
transcriptToTSS |
A
|
promWidth |
Width of the promoter regions in base pairs. Promoters are defined
as |
mode |
How probes with multiple transcripts should be handled. Must be either "union", "keepAll" or "dropMultiple". (default "union") |
fix |
Denotes what to use as anchor when defining the promoter region. Must be either "center", "start" or "end". "Center" means that the TSS is in the middle of the promoter, whereas "end" means that the promoter is placed upsream of the TSS. (default "center") |
More than one transcript can be assigned to one probe in the given
probeToTranscript
argument. Several options how to handle
such cases can be choosen by argument mode
. "union":
The union of all promoters is calculated and assigned to the probe.
"keepAll": All promoters of all transcripts are assigned to the
probe. If some transcript have identical TSSs, the same promoter
region occurs several times. "dropMultiple": All probes that have
more than one transcript with different TSS are removed.
The argument transcriptToTSS
must have at least 4 columns
giving the information as described above. The column names are not
decisive, but their position.
An object of class GRangesList
with one element for each probe.
If mode
is not set to "dropMultiple", GRanges
may consist
of more than one range. The names of the lists' elements are the probe
IDs and additionally, each GRanges
has a meta data column
"probe" giving the corresponding probe ID.
Hans-Ulrich Klein ([email protected])
probeToTrans <- list("101"="ENST00011", "102"=c("ENST00021", "ENST00022"), "103"=NA) transToTSS <- data.frame( transID=c("ENST00011", "ENST00021", "ENST00022"), chr=c("1", "1", "1"), tss=c(100000, 200000, 201000), strand=c("-", "+", "+")) matchProbeToPromoter(probeToTrans, transToTSS, promWidth=4000, mode="union") matchProbeToPromoter(probeToTrans, transToTSS, promWidth=4000, mode="keepAll")
probeToTrans <- list("101"="ENST00011", "102"=c("ENST00021", "ENST00022"), "103"=NA) transToTSS <- data.frame( transID=c("ENST00011", "ENST00021", "ENST00022"), chr=c("1", "1", "1"), tss=c(100000, 200000, 201000), strand=c("-", "+", "+")) matchProbeToPromoter(probeToTrans, transToTSS, promWidth=4000, mode="union") matchProbeToPromoter(probeToTrans, transToTSS, promWidth=4000, mode="keepAll")
"MixModel"
This class stores a fitted mixture model.
A virtual Class: No objects may be created from it.
mmData
:Object of class "numeric"
storing the data.
configuration
:Object of class "list"
storing
configuration. See notes for details.
results
:Object of class "list"
storing
results. See notes for details.
signature(object = "MixModel")
: Returns
a data.frame containing the z-scores and classification results.
The optional argument classificationMethod
can be used to
change the default classification method.
signature(object = "MixModel", method =
"character")
: Assess classification results.
signature(object = "MixModel", method =
"missing")
: Assess classification results.
signature(object = "MixModel")
: Assess
mixture components.
signature(object = "MixModel")
: Assess data.
signature(x = "MixModel")
: Assess dimension,
i.e. numer of data points and number of components.
signature(x = "MixModel")
: Number of data points.
signature(object =
"MixModel")
: List available classification methods.
signature(object = "MixModel")
: Print an object
of MixModel
on screen.
signature(object = "MixModel")
: Returns a list
of data frames summarizing the parameter estimations for each component.
signature(object = "MixModel")
: Asses the
components weights.
Slots configuration
and results
are lists with named elements. The following
elements make up the minimum set of element that must be present. Depending on the method
that was used to fit the mixture model, more elements may be present.
Slot configuration
has at least one element.
inits A list with at least two elements: component
and pi
.
components
contains a list of objects of
MixtureComponent-class
storing the inital parameters
of the mixture components. pi
is a vector storing
the initial components' weights.
Slot results
has at least three elements.
components A list of objects of MixtureComponent-class
storing
the fitted mixture components.
pi A numeric vector holding the estimated components' weights.
classification A list of numeric vectors of the same length as data
storing the classification results.
Hans-Ulrich Klein ([email protected])
mlMixModel
bayesMixModel
MixModelML
MixModelBayes
showClass("MixModel")
showClass("MixModel")
"MixModelBayes"
This class stores a Bayesian mixture model fitted by MCMC methods.
Objects can be created by calls of the form new("MixModelBayes", ...)
.
chains
:Object of class "list"
storing the course of the Markov chains for each parameter.
mmData
:Object of class "numeric"
storing the data.
configuration
:Object of class "list"
storing
configuration. See notes for details.
results
:Object of class "list"
storing results. See notes for details.
Class "MixModel"
, directly.
signature(object = "MixModelBayes")
: Gives
access to the chains
slot of the object.
signature(object = "MixModelBayes")
: Gives
the acceptance rate for the parameter of the Dirichlet distribution.
Acceptance rates between 0.3 and 0.7 are usually desired. Values
not smaller than 0.1 (not larger than 0.9) might still be
acceptable. The acceptance rate is only meaningful if the option
weightsPrior
was set to the Finite-dimensional Dirichlet
prior.
In addition to the content described in MixModel
,
the following elements are present:
Slot configuration
:
initsAs in MixModel
.
priorsA list specifying the prior distributions for the parameters of the components and the parameter of the Dirichlet process.
chainA list with the technical specifications for the Markov Chains.
Slot results
is exactly like in MixModel
.
Slot chains
:
componentsA list giving the values for the parameters of the components in each iteration after burn-in and application of thinning.
piA matrix giving the values for the weights pi of the components in each iteration after burn-in and application of thinning.
dirichletParameterA vector giving the values for dirichlet Parameter in each iteration after burn-in and application of thinning.
classificationA matrix giving the number of genes classified to each components in each iteration after burn-in and application of thinning.
Hans-Ulrich Klein ([email protected])
showClass("MixModelBayes")
showClass("MixModelBayes")
"MixModelML"
This class stores a mixture model fitted by a maximum likelihood approach.
Objects can be created by calls of the form new("MixModelML", ...)
.
Usually, objects are created by mlMixModel
.
convergence
:Object of class "list"
storing
information about the convergence of the EM algorithm.
mmData
:Object of class "numeric"
storing the data.
configuration
:Object of class "list"
storing
configuration. See notes for details.
results
:Object of class "list"
storing
results. See notes for details.
Class "MixModel"
, directly.
signature(object = "MixModelML")
: Access
to the convergence information.
In addition to the content described in MixModel
,
the following elements are present:
Slot configuration
:
convergenceA list storing the maximum number of allowed iterations. And delta log likelihood limit, that is interpreted as convergence, if the delta log likelihood falls below that limit.
Slot results
is exactly like in MixModel
.
Slot convergence
:
iterationsNumber of iterations ran.
deltaLogLikDelta of log likelihood observed in the last iteration.
logLikLog likelihood of the model fit.
Hans-Ulrich Klein ([email protected])
showClass("MixModelML")
showClass("MixModelML")
"MixtureComponent"
A class representing a mixture component.
Objects can be created by calls of the form new("MixtureComponent", ...)
.
name
:Object of class "character"
giving the
name or type of the mixture component.
parameters
:Object of class "list"
storing the
parameters of corresponding distribution.
pdf
:Object of class "function"
giving the pdf
of the mixture component.
color
:Object of class "character"
giving the
color of the component that is used by plotting methods.
signature(object = "MixtureComponent")
: A method
plotting a summary of the component on screen.
The element in parameters
should be named by the argument names
of pdf
such that this call works:
do.call(object@pdf, c(list(x=data), object@parameters))
Hans-Ulrich Klein ([email protected])
showClass("MixtureComponent")
showClass("MixtureComponent")
This method calculates the maximum likelihood estimations of a mixture model using the expectation-maximization (EM) algorithm. The mixture model may consists of normal components (with a fixed expectation of 0) and exponential components, which may be mirrored in order to model negative values.
mlMixModel(z, normNull = c(), expNeg = c(), expPos = c(), sdNormNullInit = c(), rateExpNegInit = c(), rateExpPosInit = c(), piInit = c(), maxIter = 500, tol = 0.001)
mlMixModel(z, normNull = c(), expNeg = c(), expPos = c(), sdNormNullInit = c(), rateExpNegInit = c(), rateExpPosInit = c(), piInit = c(), maxIter = 500, tol = 0.001)
z |
Observed values. |
normNull |
Indices of the normal components (that have |
expNeg |
Indices of the mirrored exponential components. |
expPos |
Indices of the exponential components. |
sdNormNullInit |
Initial standard deviations of the normal components. |
rateExpNegInit |
Initial rates ("lambda") of the exponential components. |
rateExpPosInit |
Initial rates ("lambda") of the exponential components. |
piInit |
Initial weights of the components. |
maxIter |
Maximum number of iterations. |
tol |
Threshold for convergence. The minimum log likelihood gain between two iterations that must be achieved to continue. |
The EM algorithm is known to converge slowly in some cases and local maxima may avoid finding the optimal solution. Users should try different initial values and different convergence criteria.
The components' indices do not influence the result, but may influence the order in which components are listed or ploted by downstream methods. Indices must be successive integers from 1 to n.
An object of MixModelML-class
storing results, data,
initial values and information about the convergence.
Hans-Ulrich Klein ([email protected])
z <- c(rnorm(1000, 0, 0.5), rnorm(1000, 0, 1)) mm <- mlMixModel(z, normNull=1:2, sdNormNullInit=c(0.1, 0.2), pi=c(1/2, 1/2), maxIter=500, tol=0.001) mm z <- c(rnorm(1000, 0, 3), rnorm(1000, 0, 5), rexp(1000, 5), -rexp(1000, 5)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=8, rateExpPosInit=8, pi=c(1/4, 1/4, 1/4, 1/4), maxIter=500, tol=0.001) mm
z <- c(rnorm(1000, 0, 0.5), rnorm(1000, 0, 1)) mm <- mlMixModel(z, normNull=1:2, sdNormNullInit=c(0.1, 0.2), pi=c(1/2, 1/2), maxIter=500, tol=0.001) mm z <- c(rnorm(1000, 0, 3), rnorm(1000, 0, 5), rexp(1000, 5), -rexp(1000, 5)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=8, rateExpPosInit=8, pi=c(1/4, 1/4, 1/4, 1/4), maxIter=500, tol=0.001) mm
This function implements some methods for between-sample normalization of count data. Although these methods were developed for RNA-seq data, they are also useful for ChIP-seq data normalization after reads were counted within regions or bins. Some methods may also be applied to count data after within-sample normalization (e.g. TPM or RPKM values).
## S4 method for signature 'ChIPseqSet' normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts) ## S4 method for signature 'ExpressionSet' normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)
## S4 method for signature 'ChIPseqSet' normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts) ## S4 method for signature 'ExpressionSet' normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)
object |
An object of class |
method |
Normalization method, either "scale", "scaleMedianRegion", "quantile" or "tmm". |
isLogScale |
Indicates whether the raw data in |
trim |
Only used if |
totalCounts |
Only used if |
The following normalization methods are implemented:
scaleSamples are scaled by a factor such that all samples
have the same number of reads after normalization, where
is the median number of reads observed accross all samples. If
the argument
totalCounts
is missing, the total numbers of
reads are calculated from the given data. Otherwise, the values
in totalCounts
are used.
scaleMedianRegionThe scaling factor for the
-th sample is defined as
is the value of region
in sample
. See Anders and Huber (2010) for details.
quantileQuantile normalization is applied to the ChIP-seq values such that each sample has the same cdf after normalization.
tmmThe trimmed mean M-value (tmm) normalization was
proposed by Robinson and Oshlack (2010). Here, the logarithm
of the scaling factor for sample is calculated as
the trimmed mean of
Variable denotes the geometric mean of region
.
Argument
trim
is set to 0.3 as default value, so that
the smallest 15% and the largest 15% of the log ratios are
trimmed before calculating the mean.
An object of the same class as the input object
with
the normalized data.
Hans-Ulrich Klein ([email protected])
Anders and Huber. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.\ Robinson and Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25
set.seed(1234) chip <- matrix(c(rpois(20, lambda=10), rpois(20, lambda=20)), nrow=20, dimnames=list(paste("feature", 1:20, sep=""), c("sample1", "sample2"))) rowRanges <- GRanges(IRanges(start=1:20, end=1:20), seqnames=c(rep("1", 20))) names(rowRanges) = rownames(chip) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges) tmmSet <- normalize(cSet, method="tmm", trim=0.3) mean(log(chipVals(tmmSet))[, 1], trim=0.3) - mean(log(chipVals(tmmSet))[, 2], trim=0.3) < 0.01 quantSet <- normalize(cSet, method="quantile") all(quantile(chipVals(quantSet)[, 1]) == quantile(chipVals(quantSet)[, 2]))
set.seed(1234) chip <- matrix(c(rpois(20, lambda=10), rpois(20, lambda=20)), nrow=20, dimnames=list(paste("feature", 1:20, sep=""), c("sample1", "sample2"))) rowRanges <- GRanges(IRanges(start=1:20, end=1:20), seqnames=c(rep("1", 20))) names(rowRanges) = rownames(chip) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges) tmmSet <- normalize(cSet, method="tmm", trim=0.3) mean(log(chipVals(tmmSet))[, 1], trim=0.3) - mean(log(chipVals(tmmSet))[, 2], trim=0.3) < 0.01 quantSet <- normalize(cSet, method="quantile") all(quantile(chipVals(quantSet)[, 1]) == quantile(chipVals(quantSet)[, 2]))
This method is deprecated. Use normalize instead. This method implements some normalization approaches for ChIP-seq data after counting reads within regions or bins. Similar methods are often applied to RNA-seq data after counting reads within genes.
normalizeChIP(object, method)
normalizeChIP(object, method)
object |
A |
method |
Normalization method, either "scaleTotal", "scaleRegion", "scaleMedianRegion" or "quantile" |
The following normalization methods are implemented:
scaleTotalSamples are scaled by a factor such that all samples have the same number of reads (the median number of reads observed accross all samples before normalization). All reads are used for calculating the scaling factor.
scaleRegionSamples are scaled by a factor such that all samples
have the same number of reads (the median number of reads observed
accross all samples before normalization). In contrast to scaleTotal,
only reads falling into the regions (genes, promoters) that were used
to create the given ChIPseqSet
object are used for
calculating the scaling factor. Hence, the sum of all columns of the
returned ChIPseqSet
are equal after applying this
method.
scaleMedianRegionThe scaling factor for the
-th sample is defined as:
is the value of region
in sample
. See Anders and Huber (2010) for details.
quantileQuantile normalization is applied to the ChIP-seq values such that each sample has the same cdf after normalization.
An ChIPseqSet-class
object with normalized ChIP-seq
values.
Hans-Ulrich Klein ([email protected]
Anders and Huber; Differential expression analysis for sequence count data; Genome Biology 2010, 11:R106
chip <- matrix(c(5,6,5,6,10,12,10,12), nrow=4, dimnames=list(c("f1", "f2", "f3", "f4"), c("s1", "s2"))) rowRanges <- GRanges(IRanges(start=c(10, 20, 30, 40), end=c(11, 21, 31, 41)), seqnames=c("1", "1", "1", "1")) names(rowRanges) = rownames(chip) chipDf <- DataFrame(totalCount=c(100, 100), row.names=colnames(chip)) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges, colData=chipDf) chipVals(cSet) chipVals(normalize(cSet, method="scaleMedianRegion")) chipVals(normalize(cSet, method="quantile"))
chip <- matrix(c(5,6,5,6,10,12,10,12), nrow=4, dimnames=list(c("f1", "f2", "f3", "f4"), c("s1", "s2"))) rowRanges <- GRanges(IRanges(start=c(10, 20, 30, 40), end=c(11, 21, 31, 41)), seqnames=c("1", "1", "1", "1")) names(rowRanges) = rownames(chip) chipDf <- DataFrame(totalCount=c(100, 100), row.names=colnames(chip)) cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges, colData=chipDf) chipVals(cSet) chipVals(normalize(cSet, method="scaleMedianRegion")) chipVals(normalize(cSet, method="quantile"))
This function method draws trace plots for a Bayesian mixture model, e.g. visualizes the course of the Markov Chains. Inspection of the Markov Chains is important to determine convergence of the chains, which is necessary for sensible results.
plotChains(object, chain, component, itb = 1, thin = 1, cols, ...)
plotChains(object, chain, component, itb = 1, thin = 1, cols, ...)
object |
An object of |
chain |
A character of length one giving the name of the paramter, which
chain should be plotted. Can be omitted, if |
component |
An integer specifying the components, which parameter chains should
be plotted. Can be omitted, if |
itb |
Number of iterations used for burn-in. The burn-in is relative to
the output of |
thin |
Thinning value for the iterations after burn-in. The thinning is
relative to the output of |
cols |
Number of columns to be used in the plot. Optional, if omitted, the number of columns and rows are choosen be the method itself. |
... |
Further arguments passed to |
The number of iterations necessary until a Markov chain reaches stationarity depends on the specific model and data. For any inference based on Markov Chain Monte Carlo methods, it is therefore necessary to inspect the convergence of Markov Chains. One way to do this is visual inspection of trace plots using this method.
If argument main
is passed to this method, it should have as
many elements as chains are plotted. Otherwise, vector main
is
reapted.
Hans-Ulrich Klein ([email protected]) Martin Schaefer ([email protected])
bayesMixModel
, MixModelBayes-class
z <- c(rnorm(1000, 0, 3), rnorm(1000, 0, 5), rexp(1000, 5), -rexp(1000, 5)) mm <- bayesMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=8, rateExpPosInit=8, shapeNorm0=c(1, 1), scaleNorm0=c(1, 1), shapeExpNeg0=c(1, 1), scaleExpNeg0=c(1, 1), shapeExpPos0=c(1, 1), scaleExpPos0=c(1, 1), sdAlpha=1, itb=200, nmc=1000, thin=10) plotChains(mm, chain="pi") plotChains(mm, component=c(2,3))
z <- c(rnorm(1000, 0, 3), rnorm(1000, 0, 5), rexp(1000, 5), -rexp(1000, 5)) mm <- bayesMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=8, rateExpPosInit=8, shapeNorm0=c(1, 1), scaleNorm0=c(1, 1), shapeExpNeg0=c(1, 1), scaleExpNeg0=c(1, 1), shapeExpPos0=c(1, 1), scaleExpPos0=c(1, 1), sdAlpha=1, itb=200, nmc=1000, thin=10) plotChains(mm, chain="pi") plotChains(mm, component=c(2,3))
This method visualizes the assignment of data points to the mixture components of the given mixture model. The components are plotted on the y-axis and the data on the x-axis. Data points are plotted in the color of the respective mixture component.
plotClassification(object, method, ...)
plotClassification(object, method, ...)
object |
An object of |
method |
Depending on the type of the mixture model (ML, Bayes), different approaches to obtain a classification are available. Also the default approach may vary. |
... |
Further arguments passed to |
If method
is given, it must be a valid option for method
classification
. E.g., if bayesMixModel
was
used to create the mixture model, valid options are "maxDens",
"median" and "mode".
Arguments "col" and "pch" can be given to specify the color and the shape of the points plotted. Their length must equal to the number of components.
Hans-Ulrich Klein ([email protected])
MixModel-class
listClassificationMethods
z <- c(rnorm(100, 0, 10), rnorm(100, 0, 2), rexp(100, 1/2), -rexp(100, 1/2)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=c(1/2), rateExpPosInit=c(1/2), pi=c(1/4, 1/4, 1/4, 1/4), maxIter=50, tol=0.01) plotClassification(mm)
z <- c(rnorm(100, 0, 10), rnorm(100, 0, 2), rexp(100, 1/2), -rexp(100, 1/2)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=c(1/2), rateExpPosInit=c(1/2), pi=c(1/4, 1/4, 1/4, 1/4), maxIter=50, tol=0.01) plotClassification(mm)
This function plots the mixture pdf, the estimated data pdf and the weighted pdfs of all components of the given mixture model. The plot is useful to assess the fit of the model.
plotComponents(object, density = FALSE, ...)
plotComponents(object, density = FALSE, ...)
object |
A |
density |
A logical indicating whether the data distribution should be plotted as histogram (FALSE) or as density (TRUE) estimated using kernel density estimation. |
... |
Further arguments passed to |
If the argument "col" is given, the first color is used for the mixture
pdf. The following colors (2 to n+1) are used for the n mixture components'
pdfs. If density
is set to TRUE
, a further color (n+2) must be
given, that is used for the data pdf. The same applies for the argument "lty",
which can be given to specify the line type used to plot the densities.
Hans-Ulrich Klein ([email protected])
z <- c(rnorm(100, 0, 1), rnorm(100, 0, 2), rexp(100, 1/2), -rexp(100, 1/2)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=c(1/2), rateExpPosInit=c(1/2), pi=c(1/4, 1/4, 1/4, 1/4), maxIter=50, tol=0.01) plotComponents(mm)
z <- c(rnorm(100, 0, 1), rnorm(100, 0, 2), rexp(100, 1/2), -rexp(100, 1/2)) mm <- mlMixModel(z, normNull=1:2, expNeg=3, expPos=4, sdNormNullInit=c(1, 2), rateExpNegInit=c(1/2), rateExpPosInit=c(1/2), pi=c(1/4, 1/4, 1/4, 1/4), maxIter=50, tol=0.01) plotComponents(mm)
This function takes reads from e.g. ChIP-seq experiments and regions, e.g. promoters of genes, and assigns the number of overlapping reads to that region. This method was written particularly with regard to histone ChIP-seq experiments. Some histone modifications mainly occur at transcriptional start sites and thus ChIP-seq values can be assigned to genes by counting the number of reads within genes' pomoter regions. However, some genes may have several transcript and hence several promoters. Different options for handling multiple promoters are implemented. This method is also useful when integrating microarray expression data and ChIP-seq data, since most array platforms are gene centric and have probes that measure several transcripts.
summarizeReads(object, regions, summarize)
summarizeReads(object, regions, summarize)
object |
A |
regions |
An object of type |
summarize |
Defines how regions with several ranges are handled. "average" means that the mean count of reads across all ranges is assigned to the region whereas "add" means that all counts are simply added (default). |
This function is usually applied after calling
matchProbeToPromoter
. When
matchProbeToPromoter
is used with mode
"union",
it is recommended to use "add". If the option "keepAll"
had been used, one might want to use "average".
This method uses countOverlaps
and counts
each read that overlaps with at least one base.
Returns a ChIPseqSet
with number of rows equal to the length
of regions
and number of samples equal to the length of
object
.
Hans-Ulrich Klein ([email protected])
matchProbeToPromoter
ChIPseqSet-class
chipSeq <- GRangesList() chipSeq[[1]] <- GRanges(seqnames=c("1", "1", "1", "1"), ranges=IRanges(start=c(97900, 198200, 198600, 202500), end=c(98100, 198400, 198800, 202700)), strand=c("+", "+", "+", "+")) chipSeq[[2]] <- GRanges(seqnames=c("1", "1", "1", "1"), ranges=IRanges(start=c(97900, 198200, 198600, 300000), end=c(98100, 198400, 198800, 300200)), strand=c("+", "+", "+", "+")) names(chipSeq) = c("sample1", "sample2") promoters <- GRanges(seqnames=c("1", "1", "1"), ranges=IRanges(start=c(98000, 198000, 202000), end=c(101999, 201999, 205999)), strand=c("-", "+", "+"), probe=c("101", "102", "102")) promoters <- split(promoters, elementMetadata(promoters)$probe) chipSet <- summarizeReads(chipSeq, promoters, summarize="add") chipVals(chipSet)
chipSeq <- GRangesList() chipSeq[[1]] <- GRanges(seqnames=c("1", "1", "1", "1"), ranges=IRanges(start=c(97900, 198200, 198600, 202500), end=c(98100, 198400, 198800, 202700)), strand=c("+", "+", "+", "+")) chipSeq[[2]] <- GRanges(seqnames=c("1", "1", "1", "1"), ranges=IRanges(start=c(97900, 198200, 198600, 300000), end=c(98100, 198400, 198800, 300200)), strand=c("+", "+", "+", "+")) names(chipSeq) = c("sample1", "sample2") promoters <- GRanges(seqnames=c("1", "1", "1"), ranges=IRanges(start=c(98000, 198000, 202000), end=c(101999, 201999, 205999)), strand=c("-", "+", "+"), probe=c("101", "102", "102")) promoters <- split(promoters, elementMetadata(promoters)$probe) chipSet <- summarizeReads(chipSeq, promoters, summarize="add") chipVals(chipSet)
The data frame stores Ensemble transcript IDs and repective chromosomes, transcriptional start sites and strands for mus musculus (mm10).
data(transToTSS)
data(transToTSS)
A data frame with 277 mouse transcripts with the following 4 variables:
ensembl_transcript_id
A character giving the Ensemble transcript ID.
chromosome_name
A character with the respective chromomse name.
transcript_start
An integer storing the respective transcriptional start site.
strand
An integer storing the respective strand information.
Given a character vector transcripts
with the Ensemble
transcript IDs, a data frame like this can be obtained via biomaRt:
library("biomaRt")
mart <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
transToTSS <- getBM(attributes=c("ensembl_transcript_id",
"chromosome_name", "transcript_start", "transcript_end", "strand"),
filters="ensembl_transcript_id", values=transcripts, mart=mart)
http://www.ensembl.org
data(transToTSS) head(transToTSS)
data(transToTSS) head(transToTSS)