Title: | Transcript mapping with high-density oligonucleotide tiling arrays |
---|---|
Description: | The package provides functionality that can be useful for the analysis of high-density tiling microarray data (such as from Affymetrix genechips) for measuring transcript abundance and architecture. The main functionalities of the package are: 1. the class 'segmentation' for representing partitionings of a linear series of data; 2. the function 'segment' for fitting piecewise constant models using a dynamic programming algorithm that is both fast and exact; 3. the function 'confint' for calculating confidence intervals using the strucchange package; 4. the function 'plotAlongChrom' for generating pretty plots; 5. the function 'normalizeByReference' for probe-sequence dependent response adjustment from a (set of) reference hybridizations. |
Authors: | Wolfgang Huber, Zhenyu Xu, Joern Toedling with contributions from Matt Ritchie |
Maintainer: | Zhenyu Xu <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.85.0 |
Built: | 2024-11-07 06:14:34 UTC |
Source: | https://github.com/bioc/tilingArray |
tilingArray package overview
The package provides some functionalities that can be useful for the analysis of high-density tiling microarray data (such as Affymetrix genechips) for measuring transcript abundance and architecture. The main functionalities of the package are:
The segmentation class for representing partitionings of a linear series of data (such as microarray intensity readings along a chromosome strand).
The function segment
for fitting piecewise constant models using a dynamic programming
algorithm that is both fast and exact,
and confint
for
calculating confidence intervals using the strucchange package.
Please see the vignette Segmentation demo in the file
inst/doc/segmentation.pdf (source file inst/scripts/segmentation.Rnw).
The function plotAlongChrom
for generating pretty
plots of segmentations along with genomic features. Please also see
the vignette Segmentation demo.
The function normalizeByReference
for probe-sequence dependent response adjustment from
a (set of) reference hybridizations.
Please see the vignette Assessing signal/noise ratio before
and after normalization in the file
inst/doc/assessNorm.pdf (source file inst/scripts/assessNorm.Rnw).
W. Huber <[email protected]>
Accessor methods for breakpointsPretend objects - not to be called by the user.
These functions are used in the interface between the
segmentation
class and the
confint.breakpointsfull
method of the strucchange package. This method calls
breakpoints
and residuals
methods for its first
argument, and since we pass an argument of S3 class
breakpointsPretend
, we can avoid the overhead of the
corresponding methods for breakpointsfull
objects.
These functions are of no interest to the user.
## S3 method for class 'breakpointsPretend' residuals(object, breaks, ...) breakpoints.breakpointsPretend(obj, breaks, ...)
## S3 method for class 'breakpointsPretend' residuals(object, breaks, ...) breakpoints.breakpointsPretend(obj, breaks, ...)
object |
a breakpointsPretend object. |
obj |
a breakpointsPretend object. |
breaks |
dummy argument, is ignored. |
... |
futher arguments. |
residuals and breakpoints.
W. Huber [email protected]
This function is used for Figure 5 in the David et al. (PNAS 2006) paper and in the Huber et al. methods paper.
comparisonPlot(x, y, xscale=range(x), yscale, anno, ticks, pch=20, cex=1, bgcol="#f2f2f2")
comparisonPlot(x, y, xscale=range(x), yscale, anno, ticks, pch=20, cex=1, bgcol="#f2f2f2")
x |
numeric vector. |
y |
list of numeric vector, each of same length as |
xscale |
numeric vector of length 2. |
yscale |
matrix with 2 rows and columns corresponding to the
elements of |
anno |
dataframe with columns |
ticks |
numeric vector, where to plot the ticks. |
pch |
A numeric or character vector indicating what sort of
plotting symbol to use, see |
cex |
Multiplier applied to fontsize, see |
bgcol |
Color to use as background for some of the plot panels. |
Function is called for its side-effect.
W. Huber [email protected]
...
##
##
This function calculates the cost matrix for the segmentation model
costMatrix(x, maxk)
costMatrix(x, maxk)
x |
Numeric vector of length |
maxk |
Positive integer. |
See the package vignette Calculation of the cost matrix.
Matrix with maxk
rows and length(x)
columns.
W. Huber
d = 4 x = apply(matrix(rnorm(200), ncol=d), 2, cumsum) maxk = 50 G = costMatrix(x, maxk=maxk) G.pedestrian = matrix(NA, nrow=nrow(G), ncol=ncol(G)) for(i in 1:(ncol(G))) for(k in 1:min(nrow(G), nrow(x)-i+1)) G.pedestrian[k, i] = (k*d-1)*var(as.vector(x[i:(i+k-1), ])) stopifnot(identical(is.na(G), is.na(G.pedestrian))) stopifnot(max(abs(G-G.pedestrian), na.rm=TRUE) <= 1e-6)
d = 4 x = apply(matrix(rnorm(200), ncol=d), 2, cumsum) maxk = 50 G = costMatrix(x, maxk=maxk) G.pedestrian = matrix(NA, nrow=nrow(G), ncol=ncol(G)) for(i in 1:(ncol(G))) for(k in 1:min(nrow(G), nrow(x)-i+1)) G.pedestrian[k, i] = (k*d-1)*var(as.vector(x[i:(i+k-1), ])) stopifnot(identical(is.na(G), is.na(G.pedestrian))) stopifnot(max(abs(G-G.pedestrian), na.rm=TRUE) <= 1e-6)
This function is only here for backward compatibility - please use
segment
.
The function fits a piecewise constant curve to a sequence of numbers using a simple least squares cost function and the dynamic programming algorithm described by Picard et al. (see reference).
findSegments(x, maxcp, maxk, verbose=TRUE)
findSegments(x, maxcp, maxk, verbose=TRUE)
x |
Numeric (real) vector. |
maxcp |
Integer (length 1): maximum number of segments (= 1 + maximum number of change points). |
maxk |
Integer (length 1): maximum length of a segment. |
verbose |
Logical: if this parameter has a positive value, various diagnostic output is printed. |
The complexity of the algorithm is
length(x)*maxk
in memory and
length(x)*maxk*maxcp
in time.
An object of class "segmentation"
A list with elements
J |
likelihood criterion |
th |
matrix of segment start points |
dat |
the data used for the segmentation |
call |
the function call |
.
See the vignette, and the paper cited below for details.
This function is depracated and replaced by function
segment
, but still included for backward compability.
W. Huber [email protected], Joern Toedling [email protected]
A statistical approach for CGH microarray data analysis. Franck Picard, Stephane Robin, Marc Lavielle, Christian Vaisse, Gilles Celeux, Jean-Jacques Daudin, Rapport de recherche No. 5139, Mars 2004, Institut National de Recherche en Informatique et en Automatique (INRIA), ISSN 0249-6399. The code of this function is based on the Matlab implementation presented at http://www.inapg.fr/ens_rech/mathinfo/recherche/mathematique/outil.html, but it has evolved.
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = findSegments(x, maxcp=6, maxk=15)
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = findSegments(x, maxcp=6, maxk=15)
Example of a genomic feature object
data(gffSub)
data(gffSub)
gffSub is a data frame that contains the features from 35000bp - 50000bp in yeast chromosome one.
Zhenyu Xu <[email protected]>
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1, coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel)
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1, coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel)
Adjust the hybridization intensities from an oligonucleotide microarray
for probe-specific response effect by using one or several reference
hybridizations.
If x
contains more than one array, vsnMatrix
from
the vsn
package is called for between array normalization.
normalizeByReference(x, reference, pm, background, refSig, nrStrata=10, cutoffQuantile=0.05, plotFileNames, verbose=FALSE)
normalizeByReference(x, reference, pm, background, refSig, nrStrata=10, cutoffQuantile=0.05, plotFileNames, verbose=FALSE)
x |
ExpressionSet containing the data to be normalized. |
reference |
ExpressionSet with the same number of features as |
pm |
Indices specifying the perfect match features in
|
background |
Indices specifying a set of background features in
|
refSig |
A numeric vector of the same length as |
nrStrata |
Integer (length 1), number of strata for the estimation of the background function. |
cutoffQuantile |
Numeric (length 1), the probes whose reference signal is below this quantile are thrown out. |
plotFileNames |
Character vector whose length is the same as the
number of arrays in |
verbose |
Logical of length 1, if |
The intensities in x
are adjusted according to the
reference values.
Typically, the reference values are obtained by hybridizing a DNA
sample to the array, so that the abundance of target is
the same for all reference probes, and their signal can be used to estimate
the probe sequence effect. A reference probe is a probe that
perfectly matches the target genome exactly once.
Usually, not all probes on a chip are reference
probes, hence the subset of those that are is specified by the
argument pm
.
The background signal is estimated from the probes indicated by the
argument background
. They need to be a strict subset of the
reference
probes. I.e., they need to uniquely
match the target organism's DNA, but are not expected to match any of its transcripts. A
robust estimation method is used, so a small fraction of
background
probes that do hit transcripts is not harmful.
A limitation of this normalization method is that it only makes sense for the data from reference probes, NA values are returned for all other probes.
The functions PMindex
and BGindex
can be used to produce
the pm
and background
arguments from a probeAnno
environment such as provided in the davidTiling package.
To summarize, a reference probe (indicated by argument pm
) is a
probe that perfectly matches the target genome exactly once, a
background probe (indicated by argument background
)
is a reference probe which we expect not to be transcribed. These
should not be confused with what is called 'perfect match' and
'mismatch' probes in Affymetrix annotation.
A copy of x
with the normalized intensities.
W. Huber [email protected]
The method implemented in this function is described in detail in Section 2.3 of the article Huber W, Toedling J, Steinmetz, L. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics 22, 1963-1970 (2006).
PMindex
, BGindex
## see vignette assessNorm.Rnw in inst/scripts directory
## see vignette assessNorm.Rnw in inst/scripts directory
Return the name of the opposite strand
otherStrand(x)
otherStrand(x)
x |
Character vector whose elements are "+" or "-". |
This is a rather trivial convenience function.
An alternative would be to code strands with integers -1
and
+1
, in which case the inversion would be a trivial builtin
operation. However, many genomic databases and input data files use
the character string / factor notation.
Character vector of same length as x
, with strands reversed.
W. Huber <[email protected]>
otherStrand(c("+", "-"))
otherStrand(c("+", "-"))
Plot signals and segmentation for a region of a chromosome
plotAlongChrom(segObj, y, probeAnno, gff, isDirectHybe=FALSE, what = c("dots"), ## "heatmap" chr, coord, highlight, colors, doLegend=FALSE, featureExclude=c("chromosome", "nucleotide_match", "insertion"), featureColorScheme=1, extras, rowNamesHeatmap, rowNamesExtras, ylab, ylabExtras, main, colHeatmap=colorRamp(brewer.pal(9, "YlGnBu")), colExtras=colorRamp(brewer.pal(9, "Reds")), sepPlots=FALSE, reOrder=TRUE, ...)
plotAlongChrom(segObj, y, probeAnno, gff, isDirectHybe=FALSE, what = c("dots"), ## "heatmap" chr, coord, highlight, colors, doLegend=FALSE, featureExclude=c("chromosome", "nucleotide_match", "insertion"), featureColorScheme=1, extras, rowNamesHeatmap, rowNamesExtras, ylab, ylabExtras, main, colHeatmap=colorRamp(brewer.pal(9, "YlGnBu")), colExtras=colorRamp(brewer.pal(9, "Reds")), sepPlots=FALSE, reOrder=TRUE, ...)
segObj |
Either an environment or an object of S4 class
|
y |
a numeric vector or matrix containing the signal to be plotted. See Details. |
probeAnno |
environment with probe annotations. See
Details, and package
|
gff |
data frame with genome annotation from the GFF file. |
isDirectHybe |
logical scalar: if TRUE, the mapping of probes to genomic strands is reversed with respect to the default. This is appropriate for data from a direct RNA hybridization that used no reverse transcription. |
what |
character scalar indicating which signal visualization to plot. Can be
either |
chr |
integer of length 1 indicating the chromosome number to plot. |
coord |
integer vector of length 2 containing the start and end coordinates (in bases) for the plot. |
highlight |
(optional) list with two elements: a single numeric value
|
colors |
(optional) named character vector. If missing,
a default color scheme is used:
|
doLegend |
logical: should the plot contain a legend? |
featureExclude |
character vector of names of feature types (in
|
featureColorScheme |
numeric scalar, used to select a color scheme for the
boxes representing genomic features such as coding sequences, ncRNAs etc.
Currently the only value supported is 1 (see |
extras |
a matrix containing additional values to be plotted along the
chromosome in a separate panel (such as p-values). This option is only
available when |
rowNamesHeatmap |
character vector of row names for the main heatmap. |
rowNamesExtras |
character vector of row names for the extra heatmap. |
ylab |
character label for y-axis of main plot. |
ylabExtras |
character label for y-axis on |
main |
character: plot title. |
colHeatmap |
function describing color scheme for the main heatmap plot (defaults to
|
colExtras |
function describing color scheme for the extra heatmap plot (if specified)
(defaults to |
sepPlots |
logical scalar. If TRUE, each column of intensities in |
reOrder |
logical scalar (only used when sepPlots is TRUE). If TRUE, the first column of intensities is printed at the bottom of each plot, and the subsequent columns are plotted above. If FALSE, the first appears at the top, and the subsequent columns are plotted below. |
... |
further arguments that can be passed to the
functions that implement the |
Intensities: There are two alternative, mutually exclusive ways of providing the intensities that are to be plotted to this function.
Via the parameters y
and probeAnno
. In this case,
y
is a matrix of intensities, whose rows correspond to probes
on the array, and its columns to different conditions, time points, etc.
It is also acceptable that y
is provided as a vector, in
which case it is converted to an nrow(y) x 1
matrix.
probeAnno
is an
environment whose elements correspond to target sequences (e.g.
chromosome strands) and that contain integer vectors of length
nrow(y)
with information about the probes: start and end positions of
their alignment to the target sequence, their row indices in
y
, the type of alignment (is it perfect? is is unique?).
For example,
the start positions and indices of probes for the + strand of
chromosome 1 would be described by environment elements
"1.+.start"
and "1.+.index"
.
Via the parameter segObj
.
segObj: This can be either an object of S4 class
segmentation
or an environment that by convention contains a
certain set of objects.
Future work on this package
will focus on the S4 class segmentation
. The environment
option is provided for backward compatibility.
Explanation of the environment: the intended workflow is as follows:
Use the script segment.R
(in the inst/scripts
directory of this
package) to generate segmentations.
This can be run in parallel on several processors, separately for each
chromosome and strand. The results of this are stored in files of the
name 1.+.rda
, 1.-.rda
, 2.+.rda
, and so forth,
typically within a dedicated directory.
Then use the script readSegments.R
to collect the
R
objects in these .rda
files into the environment.
It contains three types of data:
microarray intensities in along-chromosome order.
the segmentation objects (output of findSegments).
a dataframe named segScore
with segment scores; it can
be missing iff nrBasesPerSeg
is present,
a numeric scalar names theThreshold
, which is used to
draw a horizontal "threshold" line in the plot.
... and the different signal visualization methods (what
option):
If what=="dots"
, the argument showConfidenceIntervals
can be a logical scalar to choose whether vertical dashed lines are
drawn for the confidence interval. In any case, these are only drawn
if they are present in the segmentation
object in segObj
.
Wolfgang Huber <[email protected]>
## 1. see viewSegmentation.R script in the inst/scripts directory ## 2. (newer): segmentation.Rnw ## 3. (newer): see the plotALongChrom vignette data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1,coord=c(35000,50000), gff=gffSub,rowNamesHeatmap=nmLabel) ##the dots plotAlongChrom(segnf,chr=1,coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel) ##the heatmap plotAlongChrom(segnf,chr=1,coord=c(35000,50000),gff=gffSub, showConfidenceIntervals=FALSE) ##do not show the segment confidence interval
## 1. see viewSegmentation.R script in the inst/scripts directory ## 2. (newer): segmentation.Rnw ## 3. (newer): see the plotALongChrom vignette data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1,coord=c(35000,50000), gff=gffSub,rowNamesHeatmap=nmLabel) ##the dots plotAlongChrom(segnf,chr=1,coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel) ##the heatmap plotAlongChrom(segnf,chr=1,coord=c(35000,50000),gff=gffSub, showConfidenceIntervals=FALSE) ##do not show the segment confidence interval
Plot a legend for genomic features
plotAlongChromLegend(vpr, nr=2, featureColorScheme=1, featureExclude=c("chromosome", "nucleotide_match", "insertion"), mainLegend, cexLegend=0.35, cexMain=1)
plotAlongChromLegend(vpr, nr=2, featureColorScheme=1, featureExclude=c("chromosome", "nucleotide_match", "insertion"), mainLegend, cexLegend=0.35, cexMain=1)
vpr |
vector specifying where to place the legend in figure (set up by using the
|
nr |
numeric scalar, specifying the number of rows to plot legend over (default value is 2). |
featureColorScheme |
numeric scalar, used to select a color scheme for the boxes representing genomic features such as coding sequences, ncRNAs etc. Currently the only value supported is 1. |
featureExclude |
character vector of names of feature types (in
|
mainLegend |
character vector specifying legend title. |
cexLegend |
numeric scalar specifying the magnification to be used for the legend text relative to the current text size. |
cexMain |
numeric scalar specifying the magnification to be used for the legend title relative to the current text size. |
This function is usually called by plotAlongChrom
when doLegend
is TRUE. It can also be called directly by the user to produce a separate legend.
The following features are included in the legend (unless excluded using the featuredExclude
option): "chromosome"
, "nucleotide_match"
, "pseudogene"
, "uORF"
, "nc_primary_transcript"
, "region"
, "repeat_family"
, "repeat_region"
, "transposable_element"
, "transposable_element_gene"
, "ARS"
, "centromere"
, "telomere"
, "insertion"
, "CDS"
, "CDS_dubious"
, "ncRNA"
, "tRNA"
, "snRNA"
, "rRNA"
, "snoRNA"
, "binding_site"
and "TF_binding_site"
.
Wolfgang Huber <[email protected]>
## plotAlongChromLegend(mainLegend="Legend")
## plotAlongChromLegend(mainLegend="Legend")
Plot genomic features for a region along a chromosome
plotFeatures(gff, chr, xlim, strand, vpr, featureColorScheme=1, featureExclude=c("chromosome", "nucleotide_match", "insertion"), featureNoLabel=c("uORF", "CDS"), ...)
plotFeatures(gff, chr, xlim, strand, vpr, featureColorScheme=1, featureExclude=c("chromosome", "nucleotide_match", "insertion"), featureNoLabel=c("uORF", "CDS"), ...)
gff |
data frame with genome annotation from the GFF file. |
chr |
integer of length 1 specifying the chromosome to plot the features for. |
xlim |
integer of length 2 with start and end coordinates (in bases) for plotting. |
strand |
character scalar which should be set to either |
vpr |
which viewport to plot the features in. |
featureColorScheme |
numeric scalar, used to select a color scheme for the boxes representing genomic features such as coding sequences, ncRNAs etc. Currently the only value supported is 1. |
featureExclude |
character vector of names of feature types (in
gff) that should not be plotted. Default is |
featureNoLabel |
character vector, names of feature types (in gff) that should not be labelled with their names (if they are plotted). |
... |
additional arguments. |
This function is called by plotAlongChrom
when the gff
argument has been specified. It should not be called directly by the user.
Wolfgang Huber <[email protected]>
Plot the log-likelihood and two versions of penalized log-likelihoods (AIC, BIC) for a segmentation object.
plotPenLL(seg, extrabar=numeric(0), type="b", lty=1, pch=16, lwd=2, ...)
plotPenLL(seg, extrabar=numeric(0), type="b", lty=1, pch=16, lwd=2, ...)
seg |
A |
extrabar |
In addition to the location of maximmal BIC, vertical bars are drawn at these x-positions as well. |
type , pch , lty , lwd , ...
|
Get passed on to |
This function is used in the vignette: How to use the segment function to fit a piecewise constant curve.
The function is called for its side effect, which is creating a plot in the current graphics device.
Wolfgang Huber <[email protected]>
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = segment(x, maxseg=8, maxk=15) plotPenLL(res)
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = segment(x, maxseg=8, maxk=15) plotPenLL(res)
Plot points for a region along a chromosome
plotSegmentationDots(dat, xlim, ylim, ylab, threshold=NA, chr=1, strand="+", vpr, colors, main, pointSize=unit(0.6, "mm"), showConfidenceIntervals=TRUE, sepPlots=FALSE, cexAxisLabel=1, cexAxis=1,...)
plotSegmentationDots(dat, xlim, ylim, ylab, threshold=NA, chr=1, strand="+", vpr, colors, main, pointSize=unit(0.6, "mm"), showConfidenceIntervals=TRUE, sepPlots=FALSE, cexAxisLabel=1, cexAxis=1,...)
dat |
list containing data to be plotted (see Details section below for particulars). |
xlim |
integer vector of length 2 with start and end coordinates (in bases) for plotting. |
ylim |
numeric vector containing the y limits of the plot. |
ylab |
character scalar (if |
threshold |
numeric scalar indicating the threshold of expression (default value is NA, for
no threshold. If a value is supplied, it is subtracted from the intensity measures in |
chr |
integer of length 1 indicating the chromosome to be plot (defaults to 1). |
strand |
character scalar which should be set to either |
vpr |
which viewport to plot the figure in. If this function is called directly by the user this argument should be left missing. |
colors |
named character vector, optional. If missing,
a default color scheme is used:
|
main |
character vector specifying plot title. |
pointSize |
an object of class unit which specifies the size of each point. Default value is |
showConfidenceIntervals |
logical scalar indicating whether confidence intervals for each change-point are to be plotted (only available once segmentation has occurred). |
sepPlots |
logical scalar indicating whether the intensities are plotted separately for each array (if |
cexAxisLabel |
numeric scalar specifying the magnification to be used for the y-axis label relative to the current test size. |
cexAxis |
numeric scalar specifying the magnification to be used for the y-axis annotation relative to the current text size. |
... |
additional arguments. |
This function is called by plotAlongChrom
when the argument what
is set to dots
. Although this function can be called directly by the user, this is not recommended.
The dat
list contains the following items:
items x
: x-coordinates (in bases) along chromosome,
y
: intensity matrix of probes along chromosome,
flag
: indicates probe uniqueness in the genome. Possibilities are 3: multiple perfect matches, 2: has no PM but one or more near-matches, 1: has exactly one PM and some near-matches in the genome, 0: has exactly one PM and no near-matches.
extras
: (optional) matrix of additional values (such as test-statistics/p-values) to be plotted.
Wolfgang Huber <[email protected]>
Plot a heatmap diagram for a region along a chromosome
plotSegmentationHeatmap(dat, xlim, ylab, rowNames, chr=1, strand="+", vpr, colors, colHeatmap=colorRamp(brewer.pal(9, "YlGnBu")), showConfidenceIntervals=TRUE, just=c("left","centre"), main,makeRasterImage = TRUE, ...)
plotSegmentationHeatmap(dat, xlim, ylab, rowNames, chr=1, strand="+", vpr, colors, colHeatmap=colorRamp(brewer.pal(9, "YlGnBu")), showConfidenceIntervals=TRUE, just=c("left","centre"), main,makeRasterImage = TRUE, ...)
dat |
list containing data to be plotted (see Details section below for particulars). |
xlim |
integer vector of length 2 with start and end coordinates (in bases) for plotting. |
ylab |
character scalar specifying y-axis label. |
rowNames |
character vector specifying a name for each row in the heatmap plot. |
chr |
integer of length 1 indicating the chromosome to plot (defaults to 1). |
strand |
character scalar which should be set to either |
vpr |
which viewport to plot the figure in. If this function is called directly by the user this argument should be left missing. |
colors |
named character vector, optional. If missing,
a default color scheme is used:
|
colHeatmap |
function describing color scheme for the heatmap plot (defaults to
|
showConfidenceIntervals |
logical scalar indicating whether confidence intervals for each change-point are to be plotted (only available once segmentation has occurred). |
just |
character vector specifying the justification of the
supplied values to the given coordinates; setting the first entry to
"left" indicates that the supplied x-coordinates are the start
positions of the probes, change this to "centre" if the
x-coordinates are the probe middle positions. Usually the second
entry should be "centre" (see |
main |
character vector specifying plot title. |
makeRasterImage |
logical scalar indicating whether to plot the heatmap image
by the grid.raster (see |
... |
additional arguments. |
This function is called by plotAlongChrom
if the argument
what
is set to heatmap
.
Although this function can be called directly by the user, this is not recommended.
The dat
list contains the following items:
x
x-coordinates (in bases) along chromosome
y
intensity matrix of probes along chromosome
flag
indicates probe uniqueness in the genome. Possibilities are 3: multiple perfect matches, 2: has no PM but one or more near-matches, 1: has exactly one PM and some near-matches in the genome, 0: has exactly one PM and no near-matches.
extras
(optional) matrix of additional values (such as test-statistics/p-values) to be plotted
Wolfgang Huber <[email protected]>
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1,coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel) ##using raster image
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1,coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel) ##using raster image
Find the index of the exact match (PM) or background probes from a probeAnno environment
PMindex(probeAnno) BGindex(probeAnno)
PMindex(probeAnno) BGindex(probeAnno)
probeAnno |
environment with probe annotations. See
package |
These functions extract the exact match probes (PM) or background
probes (from intergenic regions outside of known annotations) indices from probeAnno
.
These indices can be used to select the relevant rows of intensity
data from the ExpressionSet
object for plotting and normalization.
Numeric vector of indices.
Matt Ritchie <[email protected]>
## library(davidTiling) ## data(davidTiling) ## data(probeAnno) ## pmind <- PMindex(probeAnno) ## mmind <- MMindex(probeAnno) ## bgind <- BGindex(probeAnno) ## boxplot(as.data.frame(log2(exprs(davidTiling))[pmind,]), outline=FALSE)
## library(davidTiling) ## data(davidTiling) ## data(probeAnno) ## pmind <- PMindex(probeAnno) ## mmind <- MMindex(probeAnno) ## bgind <- BGindex(probeAnno) ## boxplot(as.data.frame(log2(exprs(davidTiling))[pmind,]), outline=FALSE)
Find the smallest positive number in a vector
posMin(x, ...)
posMin(x, ...)
x |
Numeric vector. |
... |
Further arguments that get passed on to |
This is a rather trivial convenience function.
Numeric of length 1.
W. Huber <[email protected]>
x = runif(5) posMin(x-0.5) posMin(x-2)
x = runif(5) posMin(x-0.5) posMin(x-2)
Generate simple diagnostic plots for Affymetrix tiling array data
qcPlots(x, html=TRUE, plotdir=NULL, probeAnno, gff, chr=4, coord=c(230000,245000), nr = 2560, nc = 2560, ylimchrom=c(5,16), nucleicAcid, pmindex, pgm=TRUE, ext=".cel", ranks=FALSE, ...)
qcPlots(x, html=TRUE, plotdir=NULL, probeAnno, gff, chr=4, coord=c(230000,245000), nr = 2560, nc = 2560, ylimchrom=c(5,16), nucleicAcid, pmindex, pgm=TRUE, ext=".cel", ranks=FALSE, ...)
x |
ExpressionSet containing the data to be plotted. |
html |
logical scalar. If |
plotdir |
optional character string specifying the filepath where the plots will be saved. Defaults to current working directory. |
probeAnno |
environment with probe annotations. See
package |
gff |
data frame with genome annotation from the GFF file. |
chr |
integer of length 1 indicating the chromosome number to plot. |
coord |
integer vector of length 2 containing the start and end coordinates (in bases) for the along chromosome intensity plot. |
nr |
integer, indicating the number of probes in each row on the array (2560 for yeast tiling arrays). |
nc |
integer, indicating the number of probes in each column on the array (2560 for yeast tiling arrays). |
ylimchrom |
numeric vector containing the y limits of the along chromosome intensity plot. |
nucleicAcid |
character vector or factor indicating what sample has been hybridised to each array. Used to color the boxplots and smoothed histograms of intensities. |
pmindex |
integer vector of indices of PM probes in |
pgm |
logical scalar. If |
ext |
character string indicating the file extension. |
ranks |
logical scalar. If |
... |
further arguments that can be passed to the plotting function |
This function creates boxplots, smoothed histogram (density) plots, imageplots and along chromosome plots of the raw (log base 2) probe intensity data.
An html page called 'qcsummary.htm' which displays the results, is created when html=TRUE
.
Imageplots of standardised intensities (i.e. (probe intensity - minimum probe intensity) divided by the difference between the maximum and minimum probe intensities, all on log base 2 scale) or the ranks of these standardised intensities are plotted depending on the ranks
argument.
The individual plots are named by replacing the file extension (specified by ext
) of each 'celfile.ext', with 'density.png' for smoothed histogram plots, 'gencoord.jpg', for along chromosome plots and either 'log.pgm' ('log.jpg' if pgm=FALSE
) or 'rank.pgm' ('rank.jpg' if pgm=FALSE
) for the imageplots, depending on the ranks
argument.
Matt Ritchie <[email protected]> and Wolfgang Huber <[email protected]>
## library(davidTiling) ## data(davidTiling) ## data(probeAnno) ## qcPlots(davidTiling, probeAnno)
## library(davidTiling) ## data(davidTiling) ## data(probeAnno) ## qcPlots(davidTiling, probeAnno)
This is a wrapper for ReadAffy
that
returns an ExpressionSet
object rather than an
AffyBatch. This is particularly usefiles for arrays for which we have
or need no CDF environment.
readCel2eSet(filename, adf, path=".", rotated=FALSE, ...)
readCel2eSet(filename, adf, path=".", rotated=FALSE, ...)
filename |
Character vector with CEL file names. Either
|
adf |
Object of class
|
path |
Character scalar with path to CEL files. |
rotated |
Logical scalar, see details. |
... |
Further arguments that are passed on to
|
The rotate
options allows to deal with different versions of
the scanner software. Older versions rotated the image by 90 degrees,
newer ones do not. Use the default rotated=FALSE
for CEL files
produced by the newer version.
ExpressionSet
object.
W. Huber
## To test the rotation, look at the scatterplot between two DNA hybes ## that were measured with scanner software that rotated (041120) and did ## not rotate (060125) ## ## cp /ebi/research/huber/Projects/tilingArray/Celfiles/041120_S96genDNA_re-hybe.cel.gz ~/p/tmp ## cp /ebi/research/huber/Projects/allelicTranscription/celfiles_allelictrans/060125_S96_genomicDNA.zip ~/p/tmp ## cd ~/p/tmp ## gunzip 041120_S96genDNA_re-hybe.cel.gz ## unzip 060125_S96_genomicDNA.zip ## ## Not run: library("affy") options(error=recover) e1 = readCel2eSet("041120_S96genDNA_re-hybe.cel", rotated=TRUE) e2 = readCel2eSet("060125_S96_genomicDNA.CEL") smoothScatter(log(exprs(e1)), log(exprs(e2)), nrpoints=0) ## End(Not run)
## To test the rotation, look at the scatterplot between two DNA hybes ## that were measured with scanner software that rotated (041120) and did ## not rotate (060125) ## ## cp /ebi/research/huber/Projects/tilingArray/Celfiles/041120_S96genDNA_re-hybe.cel.gz ~/p/tmp ## cp /ebi/research/huber/Projects/allelicTranscription/celfiles_allelictrans/060125_S96_genomicDNA.zip ~/p/tmp ## cd ~/p/tmp ## gunzip 041120_S96genDNA_re-hybe.cel.gz ## unzip 060125_S96_genomicDNA.zip ## ## Not run: library("affy") options(error=recover) e1 = readCel2eSet("041120_S96genDNA_re-hybe.cel", rotated=TRUE) e2 = readCel2eSet("060125_S96_genomicDNA.CEL") smoothScatter(log(exprs(e1)), log(exprs(e2)), nrpoints=0) ## End(Not run)
Given a vector of ascending numbers and a step width, sample the
numbers such that the difference between consecutive numbers is greater
than or equal to step
.
sampleStep(x, step)
sampleStep(x, step)
x |
Numeric or integer vector. |
step |
Numeric scalar. |
The simple algorithm works greedily from x[1]
to
x[length(x)]
. First, x[1]
is selected. Then, if x[i]
is selected, all numbers x[j]
with j>i
and
x[j]-x[i]<step
are dropped. Then, i
is set to the
smallest j with x[j]-x[i]>=step
.
A logical vector of the same length as x
, representing the
selected subsample.
W. Huber <[email protected]>
x = sort(as.integer(runif(20)*100)) sel = sampleStep(x, step=10) x x[sel]
x = sort(as.integer(runif(20)*100)) sel = sampleStep(x, step=10) x x[sel]
Wrapper around the segment
function for each strand of one or
more chromosomes specified by the user. It does some typical
preprocessing and I/O.
segChrom(y, probeAnno, chr=1:17, strands=c("+", "-"), nrBasesPerSegment = 1500, maxk = 3000, step = 7, confint = FALSE, confintLevel = 0.95, useLocks=TRUE, verbose=TRUE, savedir)
segChrom(y, probeAnno, chr=1:17, strands=c("+", "-"), nrBasesPerSegment = 1500, maxk = 3000, step = 7, confint = FALSE, confintLevel = 0.95, useLocks=TRUE, verbose=TRUE, savedir)
y |
ExpressionSet or matrix containing the data to be segmented. |
probeAnno |
an object of class |
chr |
integer scalar or vector specifying which chromosome(s) to segment. |
strands |
character scalar or vector specifying which strands to
segment; can also be |
nrBasesPerSegment |
integer (length 1): the parameter
|
maxk |
passed on to the function |
step |
integer scalar, indicating the minimum distance between
consecutive probes. In cases when probes are offset by less than
|
confint |
logical scalar. If |
confintLevel |
numeric scalar between 0 and 1 indicating the probability level for the confidence intervals that are calculated for each change-point. |
useLocks |
logical scalar. Should a file locking mechanism be used that allows for a simple-minded parallelization of this function. |
verbose |
logical scalar. Should we be chatty about our progress? |
savedir |
character scalar. If specified, resulting |
This function is a wrapper for the segment
function.
It is provided in this package for illustration. For applications to
different datasets, you will likely need to adapt it to some extent,
please refer to its source code.
An environment containing S4 objects of class "segmentation"
called "1.+", "1.-", etc. (depending on the values in chr
and
strands
),
where "+" and "-" indicate the strand and the preceding number refers
to the chromosome.
If savedir
is specified, there is also the side-effect that
a series of files "1.+.rda", "1.-.rda", etc. is saved in that directory.
Wolfgang Huber <[email protected]>
## Not run: library("davidTiling") data("davidTiling") data("probeAnno") isDNA = seq(1:3) yn = normalizeByReference(davidTiling[,-isDNA],davidTiling[,isDNA], probeAnno=probeAnno) seg = segChrom(yn, probeAnno) ## this will take a while to run! ## End(Not run)
## Not run: library("davidTiling") data("davidTiling") data("probeAnno") isDNA = seq(1:3) yn = normalizeByReference(davidTiling[,-isDNA],davidTiling[,isDNA], probeAnno=probeAnno) seg = segChrom(yn, probeAnno) ## this will take a while to run! ## End(Not run)
The function fits a piecewise constant curve to one or multiple sequences of measurements, using a least squares cost function and an O(n) dynamic programming algorithm (see references).
segment(y, maxseg, maxk)
segment(y, maxseg, maxk)
y |
Numeric matrix. Rows correspond to the
|
maxseg |
integer of length 1, maximum number of segments (= 1 + maximum number of change points). |
maxk |
integer of length 1, maximum length of a single segment. |
The complexity of the algorithm is
length(x)*maxk
in memory and
length(x)*maxk*maxseg
in time.
An object of class segmentation
.
W. Huber [email protected]
[1] Transcript mapping with high-density oligonucleotide tiling arrays. Huber W, Toedling J, Steinmetz, L. Bioinformatics 22, 1963-1970 (2006).
[2] A statistical approach for CGH microarray data analysis. Franck Picard, Stephane Robin, Marc Lavielle, Christian Vaisse, Gilles Celeux, Jean-Jacques Daudin. BMC Bioinformatics. 2005 Feb 11; 6:27.
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = segment(x, maxseg=6, maxk=15)
x = rep( sin((0:4)/2*pi), each=3) + rnorm(3*5, sd=0.1) res = segment(x, maxseg=6, maxk=15)
This class represents the result of a segmentation, usually
a call to the function segment
.
Objects can be created by calls of the function
segment
or by calls of the form new("segmentation", ...)
.
y
:A matrix with the data (the dependent variable(s)), see segment
.
x
:A numeric vector with the regressor variable.
The length of this vector must be either the same as
nrow(y)
, or 0. The latter case is equivalent to
x=1:nrow(y)
.
flag
:An integer vector, whose length must be either the same as
nrow(y)
, or 0. This can be used to flag certain
probes for special treatment, for example by
plotAlongChrom
.
breakpoints
:List of segmentations. The element
breakpoints[[j]]
corresponds to a segmentation fit of
j
segments, i.e. with j-1
breakpoints. It is a
matrix with (j-1)
rows and 1 or 3 columns. It always
contains a column named estimate
with the point estimates.
Optionally, it may contain columns lower
and upper
with the confidence intervals. The point estimates are the row
indices in y
where new segments start, for example:
let z=breakpoints[[j]]
, then the first segment
is from row 1
to z[1, "estimate"]-1
,
the second from row z[1, "estimate"]
to
z[2, "estimate"]-1
, and so on.
logLik
:Numeric vector of the same length as
breakpoints
, containing the log-likelihood of the piecewise
constant models under the data y
.
hasConfint
:Logical vector of the same length as
breakpoints
. TRUE if the confidence interval estimates
are present, i.e. if the matrix breakpoints[[j]]
has
columns lower
and upper
.
nrSegments
:A scalar integer, value must be either
NA
or between 1
and length(breakpoints)
.
Can be used to select one of the fits in breakpoints
for
special treatment, for example by
plotAlongChrom
.
The method confint(object, parm, level=0.95,
het.reg=FALSE, het.err=FALSE, ...)
computes confidence
intervals for the change point estimates of the
segmentation. Typically, these were obtained from a previous call
to the function segment
that created the object.
This is just a wrapper for the function
confint.breakpointsfull
from the strucchange
package, which does all the hard
computations.
Parameters: object
an object of class segmentation
,
parm
an integer vector, it determines for which of the segmentation fits
confidence intervals are computed. See also segment
.
The other parameters are directly passed on to
confint.breakpointsfull
.
The method logLik(object, penalty="none", ...)
returns the log-likelihoods of fitted models. Valid values for the argument
penalty
are none
, AIC
and BIC
.
The method plot(x, y, xlim, xlab="x", ylab="y",
bpcol="black", bplty=1, pch=16, ...)
provides a simple visualization of the result of a
segmentation. Parameters: x
an object of class segmentation
,
y
an integer between 1
and
length(x@breakpoints)
, selecting which of the fits
contained in x
to plot, bpcol
and bplty
color
and line type of breakpoints. The plot shows the numeric data
along with breakpoints and if available their confidence intervals.
summary.
Wolfgang Huber [email protected]
## generate random data with 5 segments: y = unlist(lapply(c(0,3,0.5,1.5,5), function(m) rnorm(10, mean=m))) seg = segment(y, maxseg=10, maxk=15) seg = confint(seg, parm=c(3,4,5)) if(interactive()) plot(seg, 5) show(seg)
## generate random data with 5 segments: y = unlist(lapply(c(0,3,0.5,1.5,5), function(m) rnorm(10, mean=m))) seg = segment(y, maxseg=10, maxk=15) seg = confint(seg, parm=c(3,4,5)) if(interactive()) plot(seg, 5) show(seg)
Example of a segmentation output object
data(segnf)
data(segnf)
segnf is an environment which contains two segmentation object for the "+" and "-" strand of yeast chromosome one. Each segmentation object contains the tiling array expression profiling between condition YPE and YPD, 3 replicates each. The raw array data are normalized, mapped to the yeast genome and segmented using the segChrom function. The object is the output of segChom with some small modification. Only a small region (35000bp-50000bp) is included in this data
Zhenyu Xu <[email protected]>
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1, coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel)
data(segnf) data(gffSub) nmLabel = colnames(segnf$"1.+"@y) plotAlongChrom(segnf,chr=1, coord=c(35000,50000),what="heatmap", gff=gffSub,rowNamesHeatmap=nmLabel)