FeatSeekR
user guideA fundamental step in many analyses of high-dimensional data is
dimension reduction. Feature selection is one approach to dimension
reduction whose strengths include interpretability, conceptual
simplicity, transferability and modularity. Here, we introduce the
FeatSeekR
algorithm, which selects features based on the
consistency of their signal across replicates and their non-redundancy.
It takes a 2 dimensional array (features x samples) of replicated
measurements and returns a SummarizedExperiment
object storing the selected features ranked by reproducibility.
Here we simulate a data set with features generated by orthogonal latent factors. Features derived from the same latent factor are highly redundant and form distinct clusters. The function simulates 10 redundant features per latent factor. Replicates are generated by adding independent Gaussian noise.
set.seed(111)
# simulate data with 500 conditions, 3 replicates and 5 latent factors
conditions <- 500
latent_factors <- 5
replicates <- 3
# simData generates 10 features per latent_factor, so choosing latent_factors=5
# will generate 50 features.
# we simulate samples from 500 independent conditions per replicate. setting
# conditions=500 and replicates=3 will generate 1500 samples, leading to
# final data dimensions of 50 features x 1500 samples
sim <- simData(conditions=conditions, n_latent_factors=latent_factors,
replicates=replicates)
# show that simulated data dimensions are indeed 50 x 1500
dim(assay(sim, "data"))
## [1] 50 1500
# calculate the feature correlation for first replicate
data <- t(assay(sim, "data"))
cor <- cor(data, use = "pairwise.complete.obs")
# plot a heatmap of the features and color features according to their
# generating latent factors
anno <- data.frame(Latent_factor = as.factor(rep(1:5, each=10)))
rownames(anno) <- dimnames(sim)[[1]]
colors <- c("red", "blue", "darkorange", "darkgreen", "black")
names(colors) <- c("1", "2", "3", "4", "5")
anno_colors <- list(Latent_factor = colors)
range <- max(abs(cor))
pheatmap(cor, treeheight_row = 0 , treeheight_col = 0,
show_rownames = FALSE, show_colnames = FALSE,
breaks = seq(-range, range, length.out = 100), cellwidth = 6,
cellheight = 6, annotation_col = anno, annotation_colors = anno_colors,
fontsize = 8)
We first plot the correlation matrix of the data to visualize feature
redundancy. As intended by the simulation, the features derived from the
same latent factor cluster together. This suggests that the true
dimension is indeed lower than the number of features. We now run
FeatSeekR
to rank the features based on their uniqueness
and reproducibility.
## Input data has:
## 1500 samples
## 500 conditions
## 50 features
## 3 replicates
## Starting feature ranking!
## Iteration: 1 selected = Latent_factor2_Feature3, replicate F-statistic = 10873.0023316711
## Iteration: 2 selected = Latent_factor5_Feature6, replicate F-statistic = 9364.10884704127
## Iteration: 3 selected = Latent_factor3_Feature6, replicate F-statistic = 6048.5339756004
## Iteration: 4 selected = Latent_factor4_Feature1, replicate F-statistic = 1777.09431418772
## Iteration: 5 selected = Latent_factor1_Feature5, replicate F-statistic = 60.3857000568669
## Finished feature ranking procedure!
## Calculating explained variance of selected feature sets!
We again visualize the selected features by plotting their correlation matrix. As expected, the top 5 selected features are each from a different latent factor and low correlated. This suggests that we were able to obtain a compressed version of the data, while keeping most of the contained information.
DmelSGI
packageHere we use FeatSeekR
to rapidly identify unique
features with reproducible signal between measurements in an image
dataset from the DmelSGI
package. The authors of DmelSGI
performed combinatorial gene knock-outs using siRNA, followed by imaging
of the cells. The resulting images were segmented and features were
extracted using the EBImage
package. Here, conditions refer to different gene knock-outs, features
to the extracted image features and replicates to repeated measurements
of the individual conditions.
# load data from DmelSGI package
data("subSampleForStabilitySelection", package="DmelSGI")
data <- subSampleForStabilitySelection$D
# dimensions are conditions, features, replicates
data <- aperm(data, c(1, 3, 2))
# set feature names
dimnames(data)[[2]] <- subSampleForStabilitySelection$phenotype
# bind samples and create condition factor
conds <- rep(seq_len(dim(data)[1]), 2)
data<- rbind(data[, , 1], data[, , 2])
# show final data dimensions
dim(data)
## [1] 6000 162
The input data has 3000 samples, 162 features and 2 replicates. Again, we plot the correlation matrix of the data to explore the structure of the features.
# calculate correlation matrix of the first 50 features of one of the replicates
cor_mat <- cor(data[, 1:50, drop=FALSE])
# plot correlation matrix, omitting featurenames
pheatmap(cor_mat, show_rownames=FALSE, show_colnames=FALSE,
treeheight_row=0, treeheight_col=0)
Analogous to the idealized simulated example, the extracted features
formed groups of high correlation within and lower correlation between.
This supports the idea that the effective dimension of the data matrix
is substantially lower than the number of features and that feature
selection is a plausible approach to these data. We apply
FeatSeek
to identify unique features with high replicate
consistency.
# run FeatSeekR and rank up to 20 features based on their replicate
# reproducibility and uniqueness
max_features <- 30
res <- FeatSeek(t(data),
conditions=conds,
max_features=max_features,
verbose=TRUE)
## Input data has:
## 6000 samples
## 3000 conditions
## 162 features
## 2 replicates
## Starting feature ranking!
## Iteration: 1 selected = 4x.isMitoticAll, replicate F-statistic = 39.9436313137099
## Iteration: 2 selected = 4x.areaNucAll, replicate F-statistic = 30.0263582106047
## Iteration: 3 selected = 4x.ratioMitoticAll, replicate F-statistic = 11.5398616989652
## Iteration: 4 selected = 4x.LCD4, replicate F-statistic = 10.8825393255875
## Iteration: 5 selected = 4x.areaNucH1, replicate F-statistic = 7.95956181567868
## Iteration: 6 selected = 4x.areaNucH10, replicate F-statistic = 7.47519394820442
## Iteration: 7 selected = 4x.LCDratio.LCD4, replicate F-statistic = 5.8357355692092
## Iteration: 8 selected = 4x.areaNucH3, replicate F-statistic = 5.55948447657088
## Iteration: 9 selected = 4x.areapH3All, replicate F-statistic = 5.3771381704978
## Iteration: 10 selected = 4x.LCD3, replicate F-statistic = 4.33930418841255
## Iteration: 11 selected = 4x.intNucH10, replicate F-statistic = 3.72653395075801
## Iteration: 12 selected = 4x.intNucQ0.97, replicate F-statistic = 3.56658985821053
## Iteration: 13 selected = 4x.LCD8, replicate F-statistic = 3.55024557636757
## Iteration: 14 selected = 4x.areaNucH7, replicate F-statistic = 3.35197326379491
## Iteration: 15 selected = 10x.meanNonmitotic.cell.Tub.m.eccentricity, replicate F-statistic = 3.20238723367772
## Iteration: 16 selected = 4x.intNucH5, replicate F-statistic = 3.1658168982331
## Iteration: 17 selected = 4x.intH3pH4, replicate F-statistic = 3.05967610288084
## Iteration: 18 selected = 10x.meanNonmitotic.nucleus.DAPI.m.majoraxis, replicate F-statistic = 2.83259945427357
## Iteration: 19 selected = 4x.intNucH1, replicate F-statistic = 2.82822480334084
## Iteration: 20 selected = 4x.intNucQ0.10, replicate F-statistic = 2.755206092354
## Iteration: 21 selected = 4x.areaNucQ0.90, replicate F-statistic = 2.53605767224537
## Iteration: 22 selected = 4x.intNucH8, replicate F-statistic = 2.50090551118911
## Iteration: 23 selected = 4x.LCD2, replicate F-statistic = 2.35829643448643
## Iteration: 24 selected = 4x.intpH3All, replicate F-statistic = 2.34578890414463
## Iteration: 25 selected = 4x.intNucH3, replicate F-statistic = 2.22292998007601
## Iteration: 26 selected = 10x.meanNonmitotic.nucleus.DAPI.m.eccentricity, replicate F-statistic = 2.17920581314792
## Iteration: 27 selected = 4x.areaNucH5, replicate F-statistic = 2.14959536975668
## Iteration: 28 selected = 10x.meanNonmitotic.cell.0.s.perimeter, replicate F-statistic = 2.14253291864597
## Iteration: 29 selected = 10x.meanNonmitotic.cell.Tub.m.majoraxis, replicate F-statistic = 2.34896602854229
## Iteration: 30 selected = 4x.intNucQ0.25, replicate F-statistic = 2.10291760262207
## Finished feature ranking procedure!
## Calculating explained variance of selected feature sets!
In order determine the ideal number of selected features we can have a look at the fraction of explained variance per additionally selected feature.
# plotVarianceExplained plots the fraction of explained variance per
# additionally selected feature, ranked by FeatSeek.
plotVarianceExplained(res)
The increase in explained variance seems to flatten out at around 70%. We therefore select the number of features, that explain at least 70% of the total variance and plot their correlation matrix.
# get number of features which explain at least 70% of the total variance
n_feat <- min(which(rowData(res)$explained_variance > 0.7))
# plot the top n_feat features based on the ranking by FeatSeek
plotSelectedFeatures(res, n_features=n_feat)
The low correlation between the top selected features confirm their
low redundancy. Using FeatSeekR
we were able to reduce the
dimension of the data to 17 features, while still being able to explain
70% of the variance of the original data.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.35.5 Biobase_2.67.0
## [3] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [5] IRanges_2.39.2 S4Vectors_0.43.2
## [7] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
## [9] matrixStats_1.4.1 pheatmap_1.0.12
## [11] DmelSGI_1.37.0 FeatSeekR_1.7.0
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] caTools_1.18.3 rhdf5_2.49.0 lattice_0.22-6
## [7] bitops_1.0-9 rhdf5filters_1.17.0 tools_4.4.1
## [10] highr_0.11 pkgconfig_2.0.3 KernSmooth_2.23-24
## [13] Matrix_1.7-1 RColorBrewer_1.1-3 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.1
## [19] gplots_3.2.0 statmod_1.5.0 munsell_0.5.1
## [22] codetools_0.2-20 htmltools_0.5.8.1 sys_3.4.3
## [25] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
## [28] pracma_2.4.4 crayon_1.5.3 jquerylib_0.1.4
## [31] MASS_7.3-61 DelayedArray_0.31.14 cachem_1.1.0
## [34] limma_3.61.12 iterators_1.0.14 TSP_1.2-4
## [37] abind_1.4-8 foreach_1.5.2 gtools_3.9.5
## [40] digest_0.6.37 maketools_1.3.1 fastmap_1.2.0
## [43] grid_4.4.1 colorspace_2.1-1 cli_3.6.3
## [46] SparseArray_1.5.45 magrittr_2.0.3 S4Arrays_1.5.11
## [49] scales_1.3.0 UCSC.utils_1.1.0 rmarkdown_2.28
## [52] XVector_0.45.0 httr_1.4.7 igraph_2.1.1
## [55] evaluate_1.0.1 knitr_1.48 rlang_1.1.4
## [58] glue_1.8.0 BiocManager_1.30.25 jsonlite_1.8.9
## [61] R6_2.5.1 Rhdf5lib_1.27.0 zlibbioc_1.51.2