Title: genefilter: methods for filtering genes from high-throughput experiments
Description: Some basic functions for filtering genes.
Authors: Robert Gentleman [aut], Vincent J. Carey [aut], Wolfgang Huber [aut], Florian Hahne [aut], Emmanuel Taiwo [ctb] ('howtogenefinder' vignette translation from Sweave to RMarkdown / HTML), Khadijah Amusat [ctb] (converted genefilter vignette from Sweave to RMarkdown / HTML), Bioconductor Package Maintainer [cre]
Maintainer: Bioconductor Package Maintainer <[email protected]>
License: Artistic-2.0
Version: 1.89.0
Built: 2024-12-18 03:56:11 UTC
Source: https://github.com/bioc/genefilter
Anova returns a function of one argument with bindings for cov and p. The function, when evaluated, performs an ANOVA using cov as the covariate. It returns TRUE if the p-value for a difference in means is less than p.
Anova(cov, p=0.05, na.rm=TRUE)
cov: The covariate. It must have length equal to the number of columns of the array that the filter function will be applied to.
p: The p-value for the test.
na.rm: If set to TRUE, missing values are removed before the test is performed.
The function returned by Anova uses lm to fit a linear model of the form lm(x ~ cov), where x is the set of gene expressions. The F statistic for an overall effect is computed, and if it has a p-value less than p the function returns TRUE; otherwise it returns FALSE for that gene.
Anova returns a function with bindings for cov and p that will perform a one-way ANOVA. The covariate can be continuous, in which case the test is for a linear effect of the covariate.
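For orientation, here is an illustrative sketch (not taken from the package documentation; the data and the three-level covariate are made up) of how the closure returned by Anova combines with filterfun and genefilter:

library(genefilter)
set.seed(1)
grp  <- factor(rep(c("A", "B", "C"), each = 4))     # covariate for 12 samples
expr <- matrix(rnorm(100 * 12), nrow = 100)         # 100 made-up "genes"
expr[1:5, grp == "C"] <- expr[1:5, grp == "C"] + 3  # shift a few rows in group C
afun <- Anova(grp, p = 0.01)                        # filter closure
keep <- genefilter(expr, filterfun(afun))           # apply it row by row
sum(keep)                                           # rows passing the ANOVA filter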
R. Gentleman
set.seed(123)
af <- Anova(c(rep(1,5), rep(2,5)), .01)
af(rnorm(10))
A function that performs Cox regression with bindings for surt, cens, and p is returned. This function filters genes according to the attained p-value from a Cox regression using surt as the survival times and cens as the censoring indicator. It requires the survival package.
coxfilter(surt, cens, p)
surt: Survival times.
cens: Censoring indicator.
p: The p-value to use in filtering.
Calls to the coxph function in the survival package are used to fit a Cox model. The filter function returns TRUE if the p-value in the fit is less than p.
R. Gentleman
set.seed(-5)
sfun <- coxfilter(rexp(10), ifelse(runif(10) < .7, 1, 0), .05)
ffun <- filterfun(sfun)
dat <- matrix(rnorm(1000), ncol=10)
out <- genefilter(dat, ffun)
cv returns a function with values for a and b bound. This function takes a single argument. It computes the coefficient of variation for the input vector and returns TRUE if the coefficient of variation is between a and b. Otherwise it returns FALSE.
cv(a=1, b=Inf, na.rm=TRUE)
a: The lower bound for the cv.
b: The upper bound for the cv.
na.rm: If set to TRUE, missing values are removed before the cv is computed.
The coefficient of variation is the standard deviation divided by the absolute value of the mean.
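As a quick illustration of the statistic that the returned filter thresholds on (using a made-up vector):

x <- c(8, 12, 9, 15, 11)
sd(x) / abs(mean(x))   # the coefficient of variation of x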
It returns a function of one argument. The function has an environment with bindings for a and b.
R. Gentleman
set.seed(-3)
cvfun <- cv(1,10)
cvfun(rnorm(10,10))
cvfun(rnorm(10))
Calculate an n-by-n matrix by applying a function to all pairs of columns of an m-by-n matrix.
dist2(x, fun, diagonal=0)
x: A matrix.
fun: A symmetric function of two arguments that may be columns of x.
diagonal: The value to be used for the diagonal elements of the resulting matrix.
With the default value of fun, this function calculates for each pair of columns of x the mean of the absolute values of their differences (which is proportional to the L1-norm of their difference). This is a distance metric.
The implementation assumes that fun(x[,i], x[,j]) can be evaluated for all pairs of i and j (see examples), and that fun is symmetric, i.e. fun(a, b) = fun(b, a). fun(a, a) is not actually evaluated; instead the value of diagonal is used to fill the diagonal elements of the returned matrix.
Note that dist computes distances between rows of x, while this function computes relations between columns of x (see examples).
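For instance (an illustrative sketch, not from the package documentation), a correlation-based dissimilarity between columns can be supplied as fun:

library(genefilter)
set.seed(10)
m <- matrix(rnorm(50 * 4), ncol = 4)
dist2(m, fun = function(a, b) 1 - cor(a, b))   # 4 x 4 symmetric matrix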
A symmetric matrix of size n x n.
Wolfgang Huber, James Reid
# example matrix
z = matrix(1:15693, ncol=3)
matL1 = dist2(z)
matL2 = dist2(z, fun=function(a,b) sqrt(sum((a-b)^2, na.rm=TRUE)))
euc = as.matrix(dist(t(z)))
stopifnot(identical(dim(matL2), dim(euc)), all(euc==matL2))
Given a Bioconductor ExpressionSet object, this function filters genes using a set of selected filters.
eSetFilter(eSet)
getFilterNames()
getFuncDesc(lib = "genefilter", funcs = getFilterNames())
getRdAsText(lib)
parseDesc(text)
parseArgs(text)
showESet(eSet)
setESetArgs(filter)
isESet(eSet)
Arguments: eSet, lib, funcs, text, filter.
These functions are deprecated. Please use the ‘iSee’ package instead.
A set of filters may be selected; genes are then run through each of the filters in the order the filters were selected.
A logical vector of length equal to the number of rows of 'expr'. The values in that vector indicate whether the corresponding row of 'expr' passed the set of filter functions.
Jianhua Zhang
if( interactive() ) {
  data(sample.ExpressionSet)
  res <- eSetFilter(sample.ExpressionSet)
}
Generate a volcano plot contrasting p-value with fold change (on the log scale), in order to visualize the effect of filtering on overall variance and also assign significance via p-value.
filter_volcano(
  d, p, S,
  n1, n2,
  alpha, S_cutoff,
  cex = 0.5, pch = 19,
  xlab = expression(paste(log[2], " fold change")),
  ylab = expression(paste("-", log[10], " p")),
  cols = c("grey80", "grey50", "black"),
  ltys = c(1, 3),
  use_legend = TRUE,
  ...
)
d: Fold changes, typically on the log scale, base 2.
p: The p-values.
S: The overall standard deviation filter statistics, i.e., the square roots of the overall variance filter statistics.
n1: Sample size for group 1.
n2: Sample size for group 2.
alpha: Significance cutoff used for p-values.
S_cutoff: Filter cutoff used for the overall standard deviation in S.
cex: Point size for plotting.
pch: Point character for plotting.
xlab: Label for x-axis.
ylab: Label for y-axis.
cols: A vector of three colors used for plotting. These correspond to filtered data, data which pass the filter but are not significant, and data which pass the filter and are also statistically significant.
ltys: Line types for the induced bound on log-scale fold change and for the significance cutoff for data passing the filter, both of which are drawn on the plot.
use_legend: Should a legend for point color be produced?
...: Other arguments for the plot call.
Richard Bourgon <[email protected]>
# See the vignette: Diagnostic plots for independent filtering
Given filter and test statistics in the form of unadjusted p-values, or functions able to compute these statistics from the data, filter and then correct the p-values across a range of filtering stringencies.
filtered_p(filter, test, theta, data, method = "none")
filtered_R(alpha, filter, test, theta, data, method = "none")
alpha: A cutoff to which p-values, possibly adjusted for multiple testing, will be compared.
filter: A vector of stage-one filter statistics, or a function which is able to compute this vector from data, if data is supplied.
test: A vector of unadjusted p-values, or a function which is able to compute this vector from the filtered portion of data, if data is supplied.
theta: A vector with one or more filtering fractions to consider. Actual cutoffs are then computed internally by applying quantile to the filter statistics contained in (or produced by) the filter argument.
data: If filter and/or test are functions rather than vectors of statistics, they will be applied to data.
method: The unadjusted p-values contained in (or produced by) test will be adjusted for multiple testing after filtering; this argument is passed on to p.adjust.
For filtered_p, a matrix of p-values, possibly adjusted for multiple testing, with one row per null hypothesis and one column per filtering fraction given in theta. For a given column, entries which have been filtered out are NA.
For filtered_R, a count of the entries in the filtered_p result which are less than alpha.
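A minimal sketch of a typical call (made-up data; using the row variance as the stage-one filter statistic and row t-test p-values as the stage-two statistic is one possible choice, not a prescription):

library(genefilter)
set.seed(2)
x     <- matrix(rnorm(1000 * 6), nrow = 1000)
fac   <- factor(rep(1:2, each = 3))
filt  <- rowVars(x)                      # stage-one filter statistic
pval  <- rowttests(x, fac)$p.value       # stage-two unadjusted p-values
theta <- seq(0, 0.8, by = 0.2)           # filtering fractions
pmat  <- filtered_p(filter = filt, test = pval, theta = theta, method = "BH")
filtered_R(alpha = 0.1, filter = filt, test = pval, theta = theta, method = "BH")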
Richard Bourgon <[email protected]>
See rejection_plot for visualization of filtered_p results.
# See the vignette: Diagnostic plots for independent filtering
This function creates a function that takes a single argument. The filtering functions are bound in the environment of the returned function and are applied sequentially to the argument of the returned function. When the first filter function evaluates to FALSE the function returns FALSE; otherwise it returns TRUE.
filterfun(...)
...: Filtering functions.
filterfun returns a function that takes a single argument. It binds the filter functions given to it in the environment of the returned function. These functions are applied sequentially (in the order they were given to filterfun). The function returns FALSE (and exits) when the first filter function returns FALSE; otherwise it returns TRUE.
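For instance (an illustrative sketch on made-up data), two built-in filters can be combined so that a row must pass both:

library(genefilter)
set.seed(11)
x  <- matrix(rnorm(200 * 10, mean = 5), nrow = 200)
ff <- filterfun(pOverA(p = 0.2, A = 5), cv(a = 0.1, b = Inf))
sum(genefilter(x, ff))   # number of rows passing both filters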
R. Gentleman
set.seed(333)
x <- matrix(rnorm(100,2,1), nc=10)
cvfun <- cv(.5,2.5)
ffun <- filterfun(cvfun)
which <- genefilter(x, ffun)
Most microarrays have multiple probes per gene (Entrez). This function finds all replicates, and then selects the one with the largest value of the test statistic.
findLargest(gN, testStat, data = "hgu133plus2")
gN: A vector of probe identifiers for the chip.
testStat: A vector of test statistics, of the same length as gN.
data: The character string identifying the chip.
All the probe identifiers, gN, are mapped to Entrez Gene IDs and the duplicates determined. For any set of probes that map to the same Gene ID, the one with the largest test statistic is found. The return vector is the named vector of selected probe identifiers; the names are the Entrez Gene IDs.
This could be extended in different ways, such as allowing the user to use a different selection criterion. Also, matching on different identifiers seems like another alternative.
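The selection rule itself can be sketched without a chip annotation package (the probe-to-gene mapping below is entirely made up; findLargest obtains this mapping from the chip's Entrez Gene annotation):

set.seed(3)
probes <- paste0("probe", 1:6)
gene   <- c("g1", "g1", "g2", "g2", "g2", "g3")    # hypothetical gene IDs
stat   <- setNames(abs(rnorm(6)), probes)
sapply(split(stat, gene), function(s) names(s)[which.max(s)])  # one probe per gene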
A named vector of probe IDs. The names are Entrez Gene IDs.
R. Gentleman
library("hgu95av2.db") set.seed(124) gN <- sample(ls(hgu95av2ENTREZID), 200) stats <- rnorm(200) findLargest(gN, stats, "hgu95av2")
library("hgu95av2.db") set.seed(124) gN <- sample(ls(hgu95av2ENTREZID), 200) stats <- rnorm(200) findLargest(gN, stats, "hgu95av2")
gapFilter looks for genes that might usefully discriminate between two groups (possibly unknown at the time of filtering). To do this we look for a gap in the ordered expression values. The gap must come in the central portion (we exclude jumps in the initial Prop values or the final Prop values). Alternatively, if the IQR for the gene is large, that will also pass our test and the gene will be selected.
gapFilter(Gap, IQR, Prop, na.rm=TRUE, neg.rm=TRUE)
Gap: The size of the gap required to pass the test.
IQR: The size of the IQR required to pass the test.
Prop: The proportion (or number) of samples to exclude at either end.
na.rm: If TRUE, missing values are removed before the statistics are computed.
neg.rm: If TRUE, negative values are removed before the statistics are computed.
As stated above, we are interested in genes whose ordered expression values show either a large gap in their central portion or a large IQR.
A function that returns either TRUE or FALSE depending on whether the vector supplied has a gap larger than Gap or an IQR (inter-quartile range) larger than IQR. For computing the gap we want to exclude a proportion, Prop, from either end of the sorted values. The reason for this requirement is that genes which differ in expression levels only for a few samples are not likely to be interesting.
R. Gentleman
set.seed(256)
x <- c(rnorm(10,100,3), rnorm(10, 100, 10))
y <- x + c(rep(0,10), rep(100,10))
tmp <- rbind(x,y)
Gfilter <- gapFilter(200, 100, 5)
ffun <- filterfun(Gfilter)
genefilter(tmp, ffun)
genefilter filters genes in the array expr using the filter functions in flist. It returns an array of logical values (suitable for subscripting) of the same length as there are rows in expr. For each row of expr the returned value is TRUE if the row passed all the filter functions; otherwise it is set to FALSE.
genefilter(expr, flist)
expr: A matrix or ExpressionSet that the filter functions will be applied to, row by row.
flist: A list of filter functions to apply to the array (for example, as constructed by filterfun).
This package uses a very simple but powerful protocol for filtering genes. The user simply constructs any number of tests that they want to apply. A test is simply a function (as constructed using one of the many helper functions in this package) that returns TRUE if the gene of interest passes the test (or filter) and FALSE if the gene of interest fails.
The benefit of this approach is that each test is constructed individually (and can be tested individually). The tests are then applied sequentially to each gene. The function returns a logical vector indicating whether the gene passed all test functions or failed at least one of them.
Users can construct their own filters. These filters should accept a vector of values, corresponding to a row of the expr object. The user-defined function should return a length-one logical vector, with value TRUE or FALSE. User-defined functions can be combined with filterfun, just like the built-in filters.
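A small sketch of such a user-defined filter (the cutoffs and data are made up for illustration):

library(genefilter)
medOver <- function(cutoff) function(x) median(x, na.rm = TRUE) > cutoff   # custom filter
set.seed(4)
expr <- matrix(rnorm(200 * 8, mean = 6), nrow = 200)
keep <- genefilter(expr, filterfun(medOver(6), kOverA(3, 7)))
table(keep)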
A logical vector of length equal to the number of rows of expr. The values in that vector indicate whether the corresponding row of expr passed the set of filter functions.
R. Gentleman
set.seed(-1)
f1 <- kOverA(5, 10)
flist <- filterfun(f1)
exprA <- matrix(rnorm(1000, 10), ncol = 10)
ans <- genefilter(exprA, flist)
These functions are provided for compatibility with older versions of ‘genefilter’ only, and will be defunct at the next release.
The following functions are deprecated and will be made defunct; use the replacement indicated below:
eSetFilter
getFilterNames
getFuncDesc
getRdAsText
parseDesc
parseArgs
showESet
setESetArgs
isESet
Given an ExpressionSet or a matrix of gene expressions, and the indices of the genes of interest, genefinder returns a list of the numResults closest genes. The user can specify one of the standard distance measures listed below. The number of values to return can be specified. The return value is a list with two components: the numResults genes closest to each gene of interest (measured through the desired distance method) and their distances.
genefinder(X, ilist, numResults=25, scale="none", weights, method="euclidean")
X: A numeric matrix (or ExpressionSet) of expression values, with genes as rows and samples as columns.
ilist: A vector of genes of interest: the indices of the corresponding rows of X.
numResults: Number of results to display, starting from the least distance to the greatest.
scale: One of "none", "range", or "zscore". Scaling is carried out separately on each row.
weights: A vector of weights applied across the columns of X.
method: One of "euclidean", "maximum", "manhattan", "canberra", "correlation", "binary".
If the scale option is "range", then the input matrix is scaled using genescale(). If it is "zscore", then the input matrix is scaled using the scale builtin with no arguments.
The method option specifies the metric used for gene comparisons. The metric is applied, row by row, for each gene specified in ilist.
The "correlation" option for the distance method will return a value equal to 1 - correlation(x).
See dist for a more detailed description of the distances.
The returned value is a list containing an entry for each gene specified in ilist. Each list entry contains an array of distances for that gene of interest.
J. Gentry and M. Kajen
set.seed(12345)
# create some fake expression profiles
m1 <- matrix(1:12, 4, 3)
v1 <- 1
nr <- 2
# find the 2 rows of m1 that are closest to row 1
genefinder(m1, v1, nr, method="euc")
v2 <- c(1,3)
genefinder(m1, v2, nr)
genefinder(m1, v2, nr, scale="range")
genefinder(m1, v2, nr, method="manhattan")
m2 <- matrix(rnorm(100), 10, 10)
v3 <- c(2, 5, 6, 8)
nr2 <- 6
genefinder(m2, v3, nr2, scale="zscore")
genescale returns a scaled version of the input matrix m by applying the following formula to each column (or row, depending on axis) of the matrix: (x - mean(x)) / sd(x) for method "Z", or (x - min(x)) / (max(x) - min(x)) for method "R".
genescale(m, axis=2, method=c("Z", "R"), na.rm=TRUE)
m: Input matrix or vector with numeric elements.
axis: An integer indicating which axis of m to scale.
method: Either "Z" or "R", indicating whether a Z scaling or a range scaling should be performed.
na.rm: A boolean indicating whether missing values (NA) should be removed.
Either the rows or columns of m are scaled. This is done either by subtracting the mean and dividing by the standard deviation ("Z") or by subtracting the minimum and dividing by the range ("R").
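As a quick hand computation of the range ("R") scaling on a single made-up column, which should agree with what genescale does per column when method="R":

x <- c(3, 7, 1, 9, 5)
(x - min(x)) / (max(x) - min(x))   # values rescaled to the interval [0, 1]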
A scaled version of the input.
A scaled version of the input. If m is a matrix or a data frame then the dimensions of the returned value agree with those of m; in both cases the returned value is a matrix.
R. Gentleman
m <- matrix(1:12, 4, 3)
genescale(m)
For data assumed to be drawn from a unimodal, continuous distribution, the mode is estimated by the “half-range” method. Bootstrap resampling for variance reduction may optionally be used.
half.range.mode(data, B, B.sample, beta = 0.5, diag = FALSE)
data: A numeric vector of data from which to estimate the mode.
B: Optionally, the number of bootstrap resampling rounds to use.
B.sample: If bootstrap resampling is requested, the size of the bootstrap samples drawn from data.
beta: The fraction of the remaining range to use at each iteration.
diag: Print extensive diagnostics. For internal testing only; best left at the default.
Briefly, the mode estimator is computed by iteratively identifying densest half ranges. (Other fractions of the current range can be requested by setting beta to something other than 0.5.) A densest half range is an interval whose width equals half the current range, and which contains the maximal number of observations. The subset of observations falling in the selected densest half range is then used to compute a new range, and the procedure is iterated. See the references for details.
If bootstrapping is requested, B half-range mode estimates are computed for B bootstrap samples, and their average is returned as the final estimate.
The mode estimate.
Richard Bourgon <[email protected]>
DR Bickel, “Robust estimators of the mode and skewness of continuous data.” Computational Statistics & Data Analysis 39:153-163 (2002).
SB Hedges and P Shah, “Comparison of mode estimation methods and application in molecular clock analysis.” BMC Bioinformatics 4:31-41 (2003).
## A single normal-mixture data set
x <- c(rnorm(10000), rnorm(2000, mean = 3))
M <- half.range.mode(x)
M.bs <- half.range.mode(x, B = 100)
if(interactive()){
  hist(x, breaks = 40)
  abline(v = c(M, M.bs), col = "red", lty = 1:2)
  legend(1.5, par("usr")[4],
         c("Half-range mode", "With bootstrapping (B = 100)"),
         lwd = 1, lty = 1:2, cex = .8, col = "red")
}
# Sampling distribution, with and without bootstrapping
X <- rbind(matrix(rnorm(1000 * 100), ncol = 100),
           matrix(rnorm(200 * 100, mean = 3), ncol = 100))
M.list <- list(Simple = apply(X, 2, half.range.mode),
               BS = apply(X, 2, half.range.mode, B = 100))
if(interactive()){
  boxplot(M.list, main = "Effect of bootstrapping")
  abline(h = 0, col = "red")
}
Filtering on overall variance induces a lower bound on fold change. This bound depends on the significance of the evidence against the null hypothesis, and is a multiple of the cutoff used for an overall variance filter. It also depends on sample size in both of the groups being compared. These functions compute the multiplier for the supplied p-values or t-statistics.
kappa_p(p, n1, n2 = n1)
kappa_t(t, n1, n2 = n1)
p: The p-values at which to compute the multiplier.
t: The t-statistics at which to compute the multiplier.
n1: Sample size for class 1.
n2: Sample size for class 2.
A vector of multipliers: one per p-value or t-statistic in p or t.
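A hedged sketch of how the multiplier is typically used (the numbers are made up; see the package vignette for the exact construction): the induced bound on the absolute log2 fold change is the multiplier times the standard-deviation cutoff of the filter.

library(genefilter)
n1 <- n2 <- 5
alpha <- 0.01          # significance cutoff
S_cutoff <- 0.8        # cutoff on the overall standard deviation
kappa_p(alpha, n1, n2) * S_cutoff   # smallest |log2 fold change| that can reach alpha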
Richard Bourgon <[email protected]>
# See the vignette: Diagnostic plots for independent filtering
kOverA returns a filter function with bindings for k and A. This function evaluates to TRUE if at least k of the argument's elements are larger than A.
kOverA(k, A=100, na.rm=TRUE)
A: The value you want to exceed.
k: The number of elements that have to exceed A.
na.rm: If set to TRUE, missing values are removed before counting.
A function with bindings for A and k.
R. Gentleman
fg <- kOverA(5, 100)
fg(90:100)
fg(98:110)
maxA returns a function with the parameter A bound. The returned function evaluates to TRUE if any element of its argument is larger than A.
maxA(A=75, na.rm=TRUE)
A: The value that at least one element must exceed.
na.rm: If TRUE, missing values are removed before the comparison.
maxA returns a function with an environment containing a binding for A.
R. Gentleman
ff <- maxA(30)
ff(1:10)
ff(28:31)
The function nsFilter tries to provide a one-stop shop for different options of filtering (removing) features from an ExpressionSet. Filtering features exhibiting little variation, or a consistently low signal, across samples can be advantageous for the subsequent data analysis (Bourgon et al.). Furthermore, one may decide that there is little value in considering features with insufficient annotation.
nsFilter(eset, require.entrez=TRUE,
         require.GOBP=FALSE, require.GOCC=FALSE,
         require.GOMF=FALSE, require.CytoBand=FALSE,
         remove.dupEntrez=TRUE,
         var.func=IQR, var.cutoff=0.5, var.filter=TRUE,
         filterByQuantile=TRUE, feature.exclude="^AFFX", ...)

varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)

featureFilter(eset, require.entrez=TRUE,
              require.GOBP=FALSE, require.GOCC=FALSE,
              require.GOMF=FALSE, require.CytoBand=FALSE,
              remove.dupEntrez=TRUE, feature.exclude="^AFFX")
eset: an ExpressionSet object.
var.func: The function used as the per-feature filtering statistic. This function should return a numeric vector of length one when given a numeric vector as input.
var.filter: A logical indicating whether to perform filtering based on var.func.
filterByQuantile: A logical indicating whether var.cutoff is to be interpreted as a quantile of the var.func values (the default) or as an absolute value.
var.cutoff: A numeric value. If filterByQuantile is TRUE, features whose var.func value is below the var.cutoff quantile of all var.func values are removed; if FALSE, features whose var.func value is below var.cutoff are removed.
require.entrez: If TRUE, filter out features without an Entrez Gene ID annotation.
require.GOBP, require.GOCC, require.GOMF: If TRUE, filter out features whose target genes are not annotated to at least one GO term in the BP, CC or MF ontology, respectively.
require.CytoBand: If TRUE, filter out features whose target genes have no cytoband (chromosomal location) annotation.
remove.dupEntrez: If TRUE and several features map to the same Entrez Gene ID, only the feature with the largest value of var.func is retained.
feature.exclude: A character vector of regular expressions. Feature identifiers (i.e. values of featureNames(eset)) that match one of the regular expressions will be filtered out. The default value is intended to remove AFFX control probes.
...: Unused, but available for specializing methods.
In this Section, the effect of filtering on the type I error rate estimation / control of subsequent hypothesis testing is explained. See also the paper by Bourgon et al.
Marginal type I errors: Filtering on the basis of a statistic which is independent of the test statistic used for detecting differential gene expression can increase the detection rate at the same marginal type I error. This is clearly the case for filter criteria that do not depend on the data, such as the annotation based criteria provided by the nsFilter and featureFilter functions. However, marginal type I error can also be controlled for certain types of data-dependent criteria. Call U1 the stage 1 filter statistic: a function that is applied feature by feature, based on whose value the feature is or is not accepted to pass to stage 2, and which depends only on the data for that feature and not any other feature. Call U2 the stage 2 test statistic for differential expression. Sufficient conditions for marginal type-I error control are:
- U1 is the overall (across all samples) variance or mean, U2 is the t-statistic (or any other scale and location invariant statistic), and the data are normal distributed and exchangeable across samples.
- U1 is the overall mean, U2 is the moderated t-statistic (as in limma's eBayes function), and the data are normal distributed and exchangeable.
- U1 is a sample-class label independent function (e.g. overall mean, median, variance, IQR), U2 is the Wilcoxon rank sum statistic, and the data are exchangeable.
Experiment-wide type I error: Marginal type-I error control provided by the conditions above is sufficient for control of the family wise error rate (FWER). Note, however, that common false discovery rate (FDR) methods depend not only on the marginal behaviour of the test statistics under the null hypothesis, but also on their joint distribution. The joint distribution can be affected by filtering, even when this filtering leaves the marginal distributions of true-null test statistics unchanged. Filtering might, for example, change the correlation structure. The effect of this is negligible in many cases in practice, but this depends on the dataset and the filter used, and the assessment is the responsibility of the data analyst.
Annotation Based Filtering: Arguments require.entrez, require.GOBP, require.GOCC, require.GOMF and require.CytoBand filter based on available annotation data. The annotation package is determined by calling annotation(eset).
Variance Based Filtering: The var.filter, var.func, var.cutoff and filterByQuantile arguments control numerical cutoff-based filtering. Probes for which var.func returns NA are removed. The default var.func is IQR, which we here define as rowQ(eset, ceiling(0.75 * ncol(eset))) - rowQ(eset, floor(0.25 * ncol(eset))); this choice is motivated by the observation that unexpressed genes are detected most reliably through low variability of their features across samples.
Additionally, IQR is robust to outliers (see note below). The default var.cutoff is 0.5 and is motivated by a rule of thumb that in many tissues only 40% of genes are expressed. Please adapt this value to your data and question.
By default the numerical-filter cutoff is interpreted as a quantile, so with the default settings, 50% of the genes are filtered.
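As an illustration of what a quantile-based cutoff means (made-up data; nsFilter computes its IQR statistic via rowQ as defined above, the plain IQR below is used only for this sketch):

set.seed(6)
x    <- matrix(rnorm(500 * 10), nrow = 500)
iqr  <- apply(x, 1, IQR)               # per-feature spread
keep <- iqr > quantile(iqr, 0.5)       # var.cutoff = 0.5 read as a quantile
sum(keep)                              # roughly half of the features survive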
Variance filtering is performed last, so that (if filterByQuantile=TRUE and remove.dupEntrez=TRUE) the final number of genes does indeed exclude precisely the var.cutoff fraction of unique genes remaining after all other filters were passed.
The stand-alone function varFilter does only var.func-based filtering (and no annotation based filtering). featureFilter does only annotation based filtering and duplicate removal; it always performs duplicate removal, retaining the probe with the highest IQR for each gene.
For nsFilter a list consisting of:
eset: the filtered ExpressionSet
filter.log: a list giving details of how many probe sets were removed for each filtering step performed.
For both varFilter and featureFilter, the filtered ExpressionSet.
IQR is a reasonable variance-filter choice when the dataset is split into two roughly equal and relatively homogeneous phenotype groups. If your dataset has important groups smaller than 25% of the overall sample size, or if you are interested in unusual individual-level patterns, then IQR may not be sensitive enough for your needs. In such cases, you should consider using less robust and more sensitive measures of variance (the simplest of which would be sd).
Seth Falcon (somewhat revised by Assaf Oron)
R. Bourgon, R. Gentleman, W. Huber, Independent filtering increases power for detecting differentially expressed genes, Technical Report.
library("hgu95av2.db") library("Biobase") data(sample.ExpressionSet) ans <- nsFilter(sample.ExpressionSet) ans$eset ans$filter.log ## skip variance-based filtering ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE) a1 <- varFilter(sample.ExpressionSet) a2 <- featureFilter(sample.ExpressionSet)
library("hgu95av2.db") library("Biobase") data(sample.ExpressionSet) ans <- nsFilter(sample.ExpressionSet) ans$eset ans$filter.log ## skip variance-based filtering ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE) a1 <- varFilter(sample.ExpressionSet) a2 <- featureFilter(sample.ExpressionSet)
A function that returns a function with values for A, p and na.rm bound to the specified values. The returned function takes a single vector, x, as an argument. When the returned function is evaluated it returns TRUE if the proportion of values in x that are larger than A is at least p.
pOverA(p=0.05, A=100, na.rm=TRUE)
A: The value to be exceeded.
p: The proportion of values that need to exceed A.
na.rm: If TRUE, missing values are removed before the proportion is computed.
pOverA returns a function with bindings for A, p and na.rm. This function evaluates to TRUE if the proportion of values in x that are larger than A exceeds p.
R. Gentleman
ff <- pOverA(p=.1, 10)
ff(1:20)
ff(1:5)
Plot the number, or fraction, of null hypotheses rejected as a function of the p-value cutoff. Multiple sets of p-values are accepted, in a list or in the columns of a matrix, in order to permit comparisons.
rejection_plot(p, col, lty = 1, lwd = 1,
               xlab = "p cutoff", ylab = "number of rejections",
               xlim = c(0, 1), ylim,
               legend = names(p),
               at = c("all", "sample"), n_at = 100,
               probability = FALSE, ...)
p: The p-values to be used for plotting. These may be in the columns of a matrix, or in the elements of a list. One curve will be generated for each column/element.
col: Colors to be used for each curve plotted. Recycled if necessary.
lty: Line styles to be used for each curve plotted. Recycled if necessary.
lwd: Line widths to be used for each curve plotted. Recycled if necessary.
xlab: X-axis text label.
ylab: Y-axis text label.
xlim: X-axis limits.
ylim: Y-axis limits.
legend: Text for the legend; by default, the matrix column names or list element names of p.
at: Should step functions be plotted with a step at every value in p ("all"), or only at a sample of n_at values ("sample")?
n_at: When at = "sample", the number of cutoff values at which the step functions are evaluated.
probability: Should the fraction of null hypotheses rejected be reported instead of the count?
...: Other arguments to pass to the underlying plot call.
A list of the step functions used for plotting is returned invisibly.
Richard Bourgon <[email protected]>
# See the vignette: Diagnostic plots for independent filtering
t-tests and F-tests for rows or columns of a matrix, intended to be speed efficient.
rowttests(x, fac, tstatOnly = FALSE, na.rm = FALSE)
colttests(x, fac, tstatOnly = FALSE, na.rm = FALSE)
fastT(x, ig1, ig2, var.equal = TRUE)
rowFtests(x, fac, var.equal = TRUE)
colFtests(x, fac, var.equal = TRUE)
x: Numeric matrix. The matrix must not contain NA values (but see the na.rm argument of rowttests and colttests). x can also be an ExpressionSet.
fac: Factor which codes the grouping to be tested. There must be 1 or 2 groups for the t-tests (corresponding to one- and two-sample t-test), and 2 or more for the F-tests. If fac is missing, a one-sample test against mean 0 is performed. If x is an ExpressionSet, fac may also be a character vector of length 1 naming a covariate of x.
tstatOnly: A logical variable indicating whether to calculate p-values from the t-distribution with appropriate degrees of freedom. If TRUE, the p-value computation is skipped and only the test statistics are returned.
na.rm: A logical variable indicating whether to remove NA values prior to calculating the test statistics.
ig1: The indices of the columns of x that correspond to group 1.
ig2: The indices of the columns of x that correspond to group 2.
var.equal: A logical variable indicating whether to treat the variances in the samples as equal. If 'TRUE', a simple F test for the equality of means in a one-way analysis of variance is performed. If 'FALSE', an approximate method of Welch (1951) is used, which generalizes the commonly known 2-sample Welch test to the case of arbitrarily many samples.
If fac is specified, rowttests performs for each row of x a two-sided, two-class t-test with equal variances. fac must be a factor of length ncol(x) with two levels, corresponding to the two groups. The sign of the resulting t-statistic corresponds to "group 1 minus group 2". If fac is missing, rowttests performs for each row of x a two-sided one-class t-test against the null hypothesis 'mean=0'.
rowttests and colttests are implemented in C and should be reasonably fast and memory-efficient. fastT is an alternative implementation, in Fortran, possibly useful for certain legacy code. rowFtests and colFtests are currently implemented using matrix algebra in R. Compared to the rowttests and colttests functions, they are slower and use more memory.
A data.frame with columns statistic, p.value (optional in the case of the t-test functions) and dm, the difference of the group means (only in the case of the t-test functions). The row.names of the data.frame are taken from the corresponding dimension names of x. The degrees of freedom are provided in the attribute df.
For the F-tests, if var.equal is 'FALSE', nrow(x)+1 degrees of freedom are given: the first one is the first degree of freedom (it is the same for each row) and the others are the second degrees of freedom (one for each row).
Wolfgang Huber <[email protected]>
B. L. Welch (1951), On the comparison of several mean values: an alternative approach. Biometrika, *38*, 330-336
##
## example data
##
x = matrix(runif(40), nrow=4, ncol=10)
f2 = factor(floor(runif(ncol(x))*2))
f4 = factor(floor(runif(ncol(x))*4))

##
## one- and two group row t-test; 4-group F-test
##
r1 = rowttests(x)
r2 = rowttests(x, f2)
r4 = rowFtests(x, f4)

## approximate equality
about.equal = function(x,y,tol=1e-10)
  stopifnot(is.numeric(x), is.numeric(y), length(x)==length(y), all(abs(x-y) < tol))

##
## compare with the implementation in t.test
##
for (j in 1:nrow(x)) {
  s1 = t.test(x[j,])
  about.equal(s1$statistic, r1$statistic[j])
  about.equal(s1$p.value,   r1$p.value[j])

  s2 = t.test(x[j,] ~ f2, var.equal=TRUE)
  about.equal(s2$statistic, r2$statistic[j])
  about.equal(s2$p.value,   r2$p.value[j])

  dm = -diff(tapply(x[j,], f2, mean))
  about.equal(dm, r2$dm[j])

  s4 = summary(lm(x[j,] ~ f4))
  about.equal(s4$fstatistic["value"], r4$statistic[j])
}

##
## colttests
##
c2 = colttests(t(x), f2)
stopifnot(identical(r2, c2))

##
## missing values
##
f2n = f2
f2n[sample(length(f2n), 3)] = NA
r2n = rowttests(x, f2n)
for(j in 1:nrow(x)) {
  s2n = t.test(x[j,] ~ f2n, var.equal=TRUE)
  about.equal(s2n$statistic, r2n$statistic[j])
  about.equal(s2n$p.value,   r2n$p.value[j])
}

##
## larger sample size
##
x = matrix(runif(1000000), nrow=4, ncol=250000)
f2 = factor(floor(runif(ncol(x))*2))
r2 = rowttests(x, f2)
for (j in 1:nrow(x)) {
  s2 = t.test(x[j,] ~ f2, var.equal=TRUE)
  about.equal(s2$statistic, r2$statistic[j])
  about.equal(s2$p.value,   r2$p.value[j])
}

## single row matrix
rowFtests(matrix(runif(10),1,10), as.factor(c(rep(1,5),rep(2,5))))
rowttests(matrix(runif(10),1,10), as.factor(c(rep(1,5),rep(2,5))))
Methods for fast rowwise computation of ROC curves and (partial) area under the curve (pAUC) using the simple classification rule x > theta, where theta is a value in the range of x.
rowpAUCs(x, fac, p=0.1, flip=TRUE, caseNames=c("1", "2"))
x: An ExpressionSet or a numeric matrix of expression values.
fac: A factor (or, as described in Details, a numeric vector of 0s and 1s) coding the two classes; when x is an ExpressionSet, the name of one of its covariates may be given instead.
p: Numeric scalar between 0 and 1: the limit up to which the partial area under the curve is integrated.
flip: Logical. If TRUE (the default), both classification rules x > theta and x < theta are tested and the better one is used (see Details).
caseNames: The class names that are used when plotting the data.
Rowwise calculation of Receiver Operating Characteristic (ROC) curves and the corresponding partial area under the curve (pAUC) for a given data matrix or ExpressionSet. The function is implemented in C and thus reasonably fast and memory efficient. Cutpoints (theta) are calculated before the first, in between and after the last data value. By default, both classification rules x > theta and x < theta are tested and the (partial) area under the curve of the better one of the two is returned. This is only valid for symmetric cases, where the classification is independent of the magnitude of x (e.g., both over- and under-expression of different genes in the same class). For asymmetric cases in which you expect x to be consistently higher/lower in one of the two classes (e.g. presence or absence of a single biomarker) set flip=FALSE or use the functionality provided in the ROC package. For better control over the classification (i.e., the choice of "Disease" and "Control" class in the sense of the Pepe et al. paper), the argument fac can be an integer in [0,1] where 1 indicates "Disease" and 0 indicates "Control".
An object of class rowROC with the calculated specificities and sensitivities for each row and the corresponding pAUC and AUC values. See rowROC for details.
Methods exist for rowpAUCs:
signature(x="matrix", fac="factor")
signature(x="matrix", fac="numeric")
signature(x="ExpressionSet")
signature(x="ExpressionSet", fac="character")
Florian Hahne <[email protected]>
Pepe MS, Longton G, Anderson GL, Schummer M.: Selecting differentially expressed genes from microarray experiments. Biometrics. 2003 Mar;59(1):133-42.
library(Biobase)
data(sample.ExpressionSet)

r1 = rowttests(sample.ExpressionSet, "sex")
r2 = rowpAUCs(sample.ExpressionSet, "sex", p=0.1)

plot(area(r2, total=TRUE), r1$statistic, pch=16)
sel <- which(area(r2, total=TRUE) > 0.7)
plot(r2[sel])

## this compares performance and output of rowpAUCs to function pAUC in
## package ROC
if(require(ROC)){
  ## performance
  myRule = function(x)
    pAUC(rocdemo.sca(truth = as.integer(sample.ExpressionSet$sex)-1,
                     data = x, rule = dxrule.sca), t0 = 0.1)
  nGenes = 200
  cat("computation time for ", nGenes, "genes:\n")
  cat("function pAUC: ")
  print(system.time(r3 <- esApply(sample.ExpressionSet[1:nGenes, ], 1, myRule)))
  cat("function rowpAUCs: ")
  print(system.time(r2 <- rowpAUCs(sample.ExpressionSet[1:nGenes, ], "sex", p=1)))

  ## compare output
  myRule2 = function(x)
    pAUC(rocdemo.sca(truth = as.integer(sample.ExpressionSet$sex)-1,
                     data = x, rule = dxrule.sca), t0 = 1)
  r4 <- esApply(sample.ExpressionSet[1:nGenes, ], 1, myRule2)
  plot(r4, area(r2), xlab="function pAUC", ylab="function rowpAUCs", main="pAUCs")
  plot(r4, area(rowpAUCs(sample.ExpressionSet[1:nGenes, ], "sex", p=1, flip=FALSE)),
       xlab="function pAUC", ylab="function rowpAUCs", main="pAUCs")
  r4[r4<0.5] <- 1-r4[r4<0.5]
  plot(r4, area(r2), xlab="function pAUC", ylab="function rowpAUCs", main="pAUCs")
}
A class to model ROC curves and corresponding area under the curve as produced by rowpAUCs.
Objects can be created by calls of the form new("rowROC", ...).
data: Object of class "matrix". The input data.
ranks: Object of class "matrix". The ranked input data.
sens: Object of class "matrix". Matrix of sensitivity values for each gene at each cutpoint.
spec: Object of class "matrix". Matrix of specificity values for each gene at each cutpoint.
pAUC: Object of class "numeric". The partial area under the curve (integrated from 0 to p).
AUC: Object of class "numeric". The total area under the curve.
factor: Object of class "factor". The factor used for classification.
cutpoints: Object of class "matrix". The values of the cutpoints at which specificity and sensitivity were calculated. (Note: the data is ranked prior to computation of the ROC curves; the cutpoints map to the ranked data.)
caseNames: Object of class "character". The names of the two classification cases.
p: Object of class "numeric". The limit to which pAUC is integrated.
signature(object="rowROC")
Print nice info about the object.
signature(x="rowROC", j="missing")
Subset the object according to rows/genes.
signature(x="rowROC", y="missing")
Plot the ROC
curve of the first row of the object along with the pAUC
.
To plot the curve for a specific row/gene subsetting should be done
first (i.e. plot(rowROC[1])
.
signature(object="rowROC", p="numeric", flip="logical")
Integrate
area under the curve from 0
to p
. This method
returns a new rowROC
object.
signature(object="rowROC")
Integrate
total area under the curve. This method returns a new
rowROC
object.
signature(object="rowROC")
Accessor method for sensitivity slot.
signature(object="rowROC")
Accessor method for specificity slot.
signature(object="rowROC", total="logical")
Accessor method for pAUC slot.
Florian Hahne <[email protected]>
Pepe MS, Longton G, Anderson GL, Schummer M.: Selecting differentially expressed genes from microarray experiments. Biometrics. 2003 Mar;59(1):133-42.
library("Biobase") data("sample.ExpressionSet") roc <- rowpAUCs(sample.ExpressionSet, "sex", p=0.5) roc area(roc[1:3]) if(interactive()) { par(ask=TRUE) plot(roc) plot(1-spec(roc[1]), sens(roc[2])) par(ask=FALSE) } pAUC(roc, 0.1) roc
library("Biobase") data("sample.ExpressionSet") roc <- rowpAUCs(sample.ExpressionSet, "sex", p=0.5) roc area(roc[1:3]) if(interactive()) { par(ask=TRUE) plot(roc) plot(1-spec(roc[1]), sens(roc[2])) par(ask=FALSE) } pAUC(roc, 0.1) roc
Row variance and standard deviation of a numeric array
rowVars(x, ...)
rowSds(x, ...)
x: An array of two or more dimensions, containing numeric, complex, integer or logical values, or a numeric data frame.
...: Further arguments that get passed on to rowMeans and rowSums.
These are very simple convenience functions; the main work is done in rowMeans and rowSums. See the function definition of rowVars, it is very simple.
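For data without missing values, the result agrees with a plain apply-based computation (small made-up matrix, shown only as a cross-check):

library(genefilter)
set.seed(7)
a <- matrix(rnorm(5 * 20), nrow = 5)
all.equal(rowVars(a), apply(a, 1, var))   # per-row unbiased variance, TRUE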
A numeric or complex array of suitable size, or a vector if the result is one-dimensional. The ‘dimnames’ (or ‘names’ for a vector result) are taken from the original array.
Wolfgang Huber http://www.ebi.ac.uk/huber
a = matrix(rnorm(1e4), nrow=10)
rowSds(a)
A location estimator based on the shorth
shorth(x, na.rm=FALSE, tie.action="mean", tie.limit=0.05)
x: Numeric vector.
na.rm: Logical. If TRUE, missing values are removed from x before the computation.
tie.action: Character scalar. See details.
tie.limit: Numeric scalar. See details.
The shorth is the shortest interval that covers half of the values in x. This function calculates the mean of the x values that lie in the shorth. This was proposed by Andrews (1972) as a robust estimator of location.
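A small illustration (made-up contaminated sample): the arithmetic mean is pulled toward the outliers, while the shorth-based estimate stays near the bulk of the data.

library(genefilter)
set.seed(8)
x <- c(rnorm(200), runif(20, min = 5, max = 10))   # roughly 10% contamination
c(mean = mean(x), shorth = shorth(x))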
Ties: if there are multiple shortest intervals, the action specified in tie.action is applied. Allowed values are mean (the default), max and min. For mean, the average value is considered; however, an error is generated if the start indices of the different shortest intervals differ by more than the fraction tie.limit of length(x). For min and max, the left-most or right-most, respectively, of the multiple shortest intervals is considered.
Rate of convergence: as an estimator of the location of a unimodal distribution, under regularity conditions, the quantity computed here has an asymptotic rate of only n^(-1/3) and a complicated limiting distribution.
See half.range.mode for an iterative version that refines the estimate iteratively and has a built-in bootstrapping option.
The mean of the x values that lie in the shorth.
Wolfgang Huber http://www.ebi.ac.uk/huber, Ligia Pedroso Bras
G Sawitzki, “The Shorth Plot.” Available at http://lshorth.r-forge.r-project.org/TheShorthPlot.pdf
DF Andrews, “Robust Estimates of Location.” Princeton University Press (1972).
R Grübel, “The Length of the Shorth.” Annals of Statistics 16, 2:619-628 (1988).
DR Bickel and R Fruehwirth, “On a fast, robust estimator of the mode: Comparisons to other robust estimators with applications.” Computational Statistics & Data Analysis 50, 3500-3530 (2006).
x = c(rnorm(500), runif(500) * 10)
methods = c("mean", "median", "shorth", "half.range.mode")
ests = sapply(methods, function(m) get(m)(x))
if(interactive()) {
  colors = 1:4
  hist(x, 40, col="orange")
  abline(v=ests, col=colors, lwd=3, lty=1:2)
  legend(5, 100, names(ests), col=colors, lwd=3, lty=1:2)
}
The tdata data frame has 500 rows and 26 columns. The columns correspond to samples while the rows correspond to genes. The row names are Affymetrix accession numbers.
data(tdata)
This data frame contains 26 columns.
An unknown data set.
data(tdata)
ttest returns a function of one argument with bindings for m and p. The function, when evaluated, performs a t-test using m to define the two groups. It returns TRUE if the p-value for a difference in means is less than p.
ttest(m, p=0.05, na.rm=TRUE)
m: Either the number of samples in the first group (if m has length one, the first m entries of the data vector form group one and the remaining entries group two) or a vector of group labels with length equal to the number of samples.
p: The p-value for the test.
na.rm: If set to TRUE, missing values are removed before the test is performed.
When the data can be split into two groups (diseased and normal for example) then we often want to select genes on their ability to distinguish those two groups. The t-test is well suited to this and can be used as a filter function.
This helper function creates a t-test (function) for the specified covariate and considers a gene to have passed the filter if the p-value for the gene is less than the prespecified p.
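A sketch (made-up two-group data) of the ttest filter used together with filterfun and genefilter:

library(genefilter)
set.seed(9)
grp  <- c(rep(1, 5), rep(2, 5))
expr <- matrix(rnorm(300 * 10), nrow = 300)
expr[1:20, grp == 2] <- expr[1:20, grp == 2] + 2   # shift 20 rows in group 2
keep <- genefilter(expr, filterfun(ttest(grp, p = 0.01)))
sum(keep)   # how many rows pass the t-test filter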
ttest returns a function with bindings for m and p that will perform a t-test.
R. Gentleman
dat <- c(rep(1,5), rep(2,5))
set.seed(5)
y <- rnorm(10)
af <- ttest(dat, .01)
af(y)
af2 <- ttest(5, .01)
af2(y)
y[8] <- NA
af(y)
af2(y)
y[1:5] <- y[1:5] + 10
af(y)