Feature rankings are important in analyzing high-throughput data, particularly for bioinformatics studies. These rankings are typically obtained by calculating the marginal correlations with the outcomes. The ordered feature list reflects the importance of features, which plays a vital role in guiding subsequent research. For instance, researchers usually focus on a small subset of important features that are associated with research objectives. However, the feature ranking can be distorted by a single case. The case exerts abnormal influence on the feature ranking is termed as influential points (IPs). The presence of IPs renders the feature ranking unreliable, consequently affecting the subsequent analysis based on feature rankings.
The findIPs R package is specifically designed to detect IPs for feature rankings. The method utilized in this package is based on the leave-one-out strategy, which involves comparing the rank difference between the original ranking and a new ranking that is obtained by removing a single observation (1). The new rankings are leave-one-out rankings. The rank difference obtained through this comparison helps to determine the influence of the deleted observation.
The whole process can be divided into three steps,
Step 1, generate the original and leave-one-out rankings using a feature ranking method, such as t-test. A dataset with n cases will result in one original ranking and n leave-one-out feature rankings.
Step 2, calculate rank changes. It is advisable to use top-prioritized weights when comparing ranks.
Step 3, calculate the cumulative rank changes for each observation. A diagnostic check is also required to identify any potential influential points.
The findIPs package can be installed from Bioconductor using the following commands:
The findIPs package includes the miller05 microarray data (2). The data contains 236 samples and 1,000 genes. It has two types of responses: binary and survival. The binary response is classified based on the p53 mutation: 58 cases with a p53 mutant and 193 cases with the wild-type p53 mutation. The survival response is associated with a total of 55 recorded events.
We use a simple example where features are ranked based on t.test to demonstrate the use of findIPs package for IPs detection. We use getdrop1ranks() to derive the original ranking and leave-one-out rankings. Features are simply ranked according to the p.values of t.test. Of note, the rank criteria is the p.value if fun = “t.test”. P.values are ranked in ascending order by specifying decreasing = FALSE. We select the top 100 important features in the original ranking. The function returns an object containing a vector of original ranking (origRank) and a matrix of leave-one-out rankings (drop1Rank).
## List of 2
## $ origRank : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## $ drop1Rank: chr [1:100, 1:236] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
After obtaining the original ranking and leave-one-out rankings using the getdrop1ranks() function, we use the sumRanks() function to compute the distance between them. This function provides three methods for comparing ranks: unweighted, weighted Spearman, and method with adaptive weights. The unweighted method directly compares the ranks and assumes that all ranks have equal importance. However, this is not always the case as the top-ranked methods are usually more important. The weighted Spearman and adaptive weights methods address this issue by emphasizing the importance of the top-ranked methods (3). The adaptive weights method can further adjust the weights based on the distribution of rank changes in the data.
results <- sumRanks(origRank = obj$origRank,
drop1Rank = obj$drop1Rank,
topN = 100,
method = "adaptive")
str(results)
## List of 6
## $ kappa : num 0.0226
## $ score : Named num [1:236] 0.2 2.476 2.01 1.648 0.254 ...
## ..- attr(*, "names")= chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
## $ origRank : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## $ drop1Rank : int [1:100, 1:236] 1 2 3 4 6 5 7 8 9 10 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
## $ origRankWeighted : num [1:100] 0.025 0.0244 0.0239 0.0233 0.0228 ...
## $ drop1RankWeighted: num [1:100, 1:236] 0.025 0.0244 0.0239 0.0233 0.0223 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
The outputs of sumRanks() are diverse across the selected methods. For method = “adaptive”, sumRanks() returns a list with the following elements:
1, kappa, the shape parameter of the adaptive weights method;
2, score, the accumulated weighted rank changes, reflecting the
influence of each sample;
3, origRank, the original ranking;
4, drop1Rank, the leave-one-out rankings;
5, origRankWeighted, weighted original ranking;
6, drop1RankWeighted, weighted leave-one-out rankings.
However, if the method is “weightedSpearman” or “unweighted”, the function will only return three elements: “score”, “origRank”, and “drop1RankWeighted”. The elements “kappa”, “origRankWeighted”, and “drop1RankWeighted” will not be returned.
findIPs() combines getdrop1ranks() and sumRanks() into one step. The output is identical to that using the two-step process.
results.ipsr <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "adaptive")
identical(results, results.ipsr)
## [1] TRUE
findIPs package offers three visualization functions: plotIPs(), plotRankScatters(), and plotAdaptiveWeights(). plotIPs() can directly utilize the output of findIPs() or sumRanks() to create a lollipop plot that displays the influence of each case. In Figure 1, we can see that the observation 68 (obs68) seems to be more influential on the final results. However, the difference between obs68 and the other observations is not that distinct, indicating a lower possibility of the presence of an influential observation.
In addition to the lollipop, findIPs also provides a simple visualization function plotRankScatters() that exhibits the distribution of rank changes using a scatter plot (Figure 2). Alike plotIPs(), plotRankScatters() simply uses the output of findIPs() or sumRanks(). According to Figure 2, we can observe more rank changes in the tail side, but less changes in the head. The black points denote the rank changes caused by the most influential case.
The plotAdaptiveWeights() function aims to visualize weight function if adaptive weights are used for rank comparison, that is method = “adaptive” for findIPs() or sumRanks(). The argument kappa refers to the shape parameter of the weight function. Here, the optimized kappa is 0.023 (Figure 3). n is the length of the feature list. We select the top 100 features, hence, n = 100. We can observe that more weights are allocated to the top-ranked features.
For survival analysis, we offer the option to rank features using univariate Cox model by setting fun = “cox”. The features are ranked in ascending order based on their P-values.
par(mar = c(4, 4, 2, 2))
results.cox <- findIPs(X, surv,
fun = "cox",
decreasing = FALSE,
method = "adaptive")
plotIPs(results.cox)
In addition to the provided ranking criteria in findIPs() or sumRanks(), which includes “t.test”, “cox”, “log2fc”, and “kruskal.test”. We can also rank features based on a specified rank criterion. To this end, we can pass a function to the fun argument. The function should take x and y as inputs and output the rank criterion, such as p-values.
As an example, we can rank features based on the p-values obtained from the kruskal.test. We can either specify fun = “kruskal.test”, as this test has been implemented in the package, or define our own function passing to getdrop1ranks. Both methods produce the same results.
fun <- function(x, y){
kruskal.test(x, y)$p.value
}
kruskal.test1 <- getdrop1ranks(X, y,
fun = fun,
decreasing = FALSE)
kruskal.test2 <- getdrop1ranks(X, y,
fun = "kruskal.test",
decreasing = FALSE)
identical(kruskal.test1, kruskal.test2)
## [1] TRUE
findIPs provides three rank comparison methods: unweighted, weighted Spearman, and adaptive weights. We recommend using the adaptive weights. Here, we compare the three methods. Weighted Spearman and the adaptive weights method demonstrates similar results. But both are different from the unweighted method.
re.unweighted <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "unweighted")
re.weighted <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "weightedSpearman")
re.adaptive <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "adaptive")
par(mfrow = c(1, 3), mar = c(4, 4, 2, 2))
plotIPs(re.unweighted)
mtext("A, unweighted", side = 3, line = 0, adj = 0)
plotIPs(re.weighted)
mtext("B, weighted Spearman", side = 3, line = 0, adj = 0)
plotIPs(re.adaptive)
mtext("C, adaptive weights", side = 3, line = 0, adj = 0)
1, Wang, Shuo, and Junyan Lu. “Detect influential points of feature
rankings.” arXiv preprint arXiv:2303.10516 (2023).
2, Miller, Lance D., et al. “An expression signature for p53 status in
human breast cancer predicts mutation status, transcriptional effects,
and patient survival.” Proceedings of the National Academy of Sciences
102.38 (2005): 13550-13555.doi:10.1073pnas.0506230102
3, Da Pinto Costa, Joaquim; Soares, Carlos (2005): A weighted rank
measure of correlation. In Australian & New Zealand Journal of
Statistics 47 (4), pp. 515–529.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] findIPs_1.3.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9
## [3] compiler_4.4.2 BiocManager_1.30.25
## [5] crayon_1.5.3 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1
## [9] parallel_4.4.2 jquerylib_0.1.4
## [11] splines_4.4.2 IRanges_2.41.1
## [13] BiocParallel_1.41.0 yaml_2.3.10
## [15] fastmap_1.2.0 lattice_0.22-6
## [17] R6_2.5.1 XVector_0.47.0
## [19] S4Arrays_1.7.1 generics_0.1.3
## [21] GenomeInfoDb_1.43.1 knitr_1.49
## [23] BiocGenerics_0.53.3 DelayedArray_0.33.2
## [25] MatrixGenerics_1.19.0 maketools_1.3.1
## [27] GenomeInfoDbData_1.2.13 bslib_0.8.0
## [29] rlang_1.1.4 cachem_1.1.0
## [31] xfun_0.49 sass_0.4.9
## [33] sys_3.4.3 SparseArray_1.7.2
## [35] cli_3.6.3 zlibbioc_1.52.0
## [37] grid_4.4.2 digest_0.6.37
## [39] lifecycle_1.0.4 S4Vectors_0.45.2
## [41] evaluate_1.0.1 codetools_0.2-20
## [43] buildtools_1.0.0 survival_3.7-0
## [45] abind_1.4-8 stats4_4.4.2
## [47] rmarkdown_2.29 httr_1.4.7
## [49] matrixStats_1.4.1 tools_4.4.2
## [51] htmltools_0.5.8.1 UCSC.utils_1.3.0