The genefilter
package can be used to filter (select) genes from a microarray dataset
according to a variety of different filtering mechanisms. Here, we will
consider the example dataset in the sample.ExpressionSet
example from the Biobase
package. This experiment has 26 samples, and there are 500 genes and 3
covariates. The covariates are named sex
, type
and score
. The first two have two levels and the last one
is continuous.
## [1] "sex" "type" "score"
##
## Female Male
## 11 15
##
## Case Control
## 15 11
One dichotomy that can be of interest for subsequent analyses is
whether the filter is specific or non-specific. Here,
specific means that we are filtering with reference to sample metadata,
for example, type
. For example, if we want to select genes
that are differentially expressed in the two groups defined by
type
, that is a specific filter. If on the other hand we
want to select genes that are expressed in more than 5 samples, that is
an example of a non–specific filter.
First, let us see how to perform a non–specific filter. Suppose we
want to select genes that have an expression measure above 200 in at
least 5 samples. To do that we use the function kOverA
.
There are three steps that must be performed.
Create function(s) implementing the filtering criteria.
Assemble it (them) into a (combined) filtering function.
Apply the filtering function to the expression matrix.
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(sample.ExpressionSet), ffun)
sum(wh1)
## [1] 159
Here f1
is a function that implies our “expression
measure above 200 in at least 5 samples” criterion, the function
ffun
is the filtering function (which in this case consists
of only one criterion), and we apply it using
r Biocpkg("genefilter")
. There were 159 genes that
satisfied the criterion and passed the filter. As an example for a
specific filter, let us select genes that are differentially expressed
in the groups defined by type
.
f2 <- ttest(sample.ExpressionSet$type, p=0.1)
wh2 <- genefilter(exprs(sample.ExpressionSet), filterfun(f2))
sum(wh2)
## [1] 88
Here, ttest
is a function from the genefilter
package which provides a suitable wrapper around t.test
from package stats. Now we see that there are 88 genes that
satisfy the selection criterion. Suppose that we want to combine the two
filters. We want those genes for which at least 5 have an expression
measure over 200 and which also are differentially expressed
between the groups defined by type
ffun_combined <- filterfun(f1, f2)
wh3 <- genefilter(exprs(sample.ExpressionSet), ffun_combined)
sum(wh3)
## [1] 35
Now we see that there are only 35 genes that satisfy both conditions.
The function knnCV
defined below performs k–nearest neighbour classification
using leave–one–out cross–validation. At the same time it aggregates the
genes that were selected. The function returns the predicted
classifications as its returned value. However, there is an additional
side effect. The number of times that each gene was used (provided it
was at least one) are recorded and stored in the environment of the
aggregator Agg
. These can subsequently be retrieved and
used for other purposes.
knnCV <- function(x, selectfun, cov, Agg, pselect = 0.01, scale=FALSE) {
nc <- ncol(x)
outvals <- rep(NA, nc)
for(i in seq_len(nc)) {
v1 <- x[,i]
expr <- x[,-i]
glist <- selectfun(expr, cov[-i], p=pselect)
expr <- expr[glist,]
if( scale ) {
expr <- scale(expr)
v1 <- as.vector(scale(v1[glist]))
}
else
v1 <- v1[glist]
out <- paste("iter ",i, " num genes= ", sum(glist), sep="")
print(out)
Aggregate(row.names(expr), Agg)
if( length(v1) == 1)
outvals[i] <- knn(expr, v1, cov[-i], k=5)
else
outvals[i] <- knn(t(expr), v1, cov[-i], k=5)
}
return(outvals)
}
gfun <- function(expr, cov, p=0.05) {
f2 <- ttest(cov, p=p)
ffun <- filterfun(f2)
which <- genefilter(expr, ffun)
}
Next we show how to use this function on the dataset
geneData
library("class")
##scale the genes
##genescale is a slightly more flexible "scale"
##work on a subset -- for speed only
geneData <- genescale(exprs(sample.ExpressionSet)[1:75,], 1)
Agg <- new("aggregator")
testcase <- knnCV(geneData, gfun, sample.ExpressionSet$type,
Agg,pselect=0.05)
## AFFX-MurIL4_at AFFX-TrpnX-M_at AFFX-hum_alu_at
## 26 26 26
## AFFX-YEL018w/_at 31308_at AFFX-PheX-M_at
## 20 15 15
## 31312_at AFFX-BioC-3_st AFFX-HUMRGE/M10098_M_at
## 3 3 1
## AFFX-DapX-5_at AFFX-TrpnX-5_at AFFX-BioDn-5_at
## 1 1 1
## AFFX-PheX-5_at
## 1
The environment Agg
contains, for each gene, the number
of times it was selected in the cross-validation.
The version number of R and packages loaded for generating the vignette were:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] class_7.3-22 genefilter_1.89.0 Biobase_2.67.0
## [4] BiocGenerics_0.53.3 generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 lattice_0.22-6 RSQLite_2.3.8
## [4] digest_0.6.37 grid_4.4.2 evaluate_1.0.1
## [7] fastmap_1.2.0 blob_1.2.4 Matrix_1.7-1
## [10] jsonlite_1.8.9 AnnotationDbi_1.69.0 GenomeInfoDb_1.43.1
## [13] DBI_1.2.3 survival_3.7-0 BiocManager_1.30.25
## [16] httr_1.4.7 UCSC.utils_1.3.0 XML_3.99-0.17
## [19] Biostrings_2.75.1 jquerylib_0.1.4 cli_3.6.3
## [22] rlang_1.1.4 crayon_1.5.3 XVector_0.47.0
## [25] splines_4.4.2 bit64_4.5.2 cachem_1.1.0
## [28] yaml_2.3.10 tools_4.4.2 memoise_2.0.1
## [31] annotate_1.85.0 GenomeInfoDbData_1.2.13 buildtools_1.0.0
## [34] vctrs_0.6.5 R6_2.5.1 png_0.1-8
## [37] matrixStats_1.4.1 stats4_4.4.2 lifecycle_1.0.4
## [40] zlibbioc_1.52.0 KEGGREST_1.47.0 S4Vectors_0.45.2
## [43] IRanges_2.41.1 bit_4.5.0 bslib_0.8.0
## [46] xfun_0.49 sys_3.4.3 MatrixGenerics_1.19.0
## [49] knitr_1.49 xtable_1.8-4 htmltools_0.5.8.1
## [52] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2