Title: | Sparse Contrastive Principal Component Analysis |
---|---|
Description: | A toolbox for sparse contrastive principal component analysis (scPCA) of high-dimensional biological data. scPCA combines the stability and interpretability of sparse PCA with contrastive PCA's ability to disentangle biological signal from unwanted variation through the use of control data. Also implements and extends cPCA. |
Authors: | Philippe Boileau [aut, cre, cph] , Nima Hejazi [aut] , Sandrine Dudoit [ctb, ths] |
Maintainer: | Philippe Boileau <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-12-30 04:24:37 UTC |
Source: | https://github.com/bioc/scPCA |
The background data consisting of 400 observations and 30 variables was simulated as follows:
Each of the first 10 variables was drawn from $N(0, 10)$
Variables 11 through 20 were drawn from $N(0, 3)$
Variables 21 through 30 were drawn from $N(0, 1)$
data(background_df)
data(background_df)
A simple data.frame
.
data(background_df)
data(background_df)
Given target and background data frames or matrices,
scPCA
will perform the sparse contrastive principal component
analysis (scPCA) of the target data for a given number of eigenvectors, a
vector of real-valued contrast parameters, and a vector of sparsity inducing
penalty terms.
If instead you wish to perform contrastive principal component analysis
(cPCA), set the penalties
argument to 0
. So long as the
n_centers
parameter is larger than one, the automated hyperparameter
tuning heuristic described in Boileau et al. (2020) is
used. Otherwise, the semi-automated approach of
Abid et al. (2018) is used to select the
appropriate hyperparameter.
scPCA( target, background, center = TRUE, scale = FALSE, n_eigen = 2, cv = NULL, alg = c("iterative", "var_proj", "rand_var_proj"), contrasts = exp(seq(log(0.1), log(1000), length.out = 40)), penalties = seq(0.05, 1, length.out = 20), clust_method = c("kmeans", "pam", "hclust"), n_centers = NULL, max_iter = 10, linkage_method = "complete", n_medoids = 8, parallel = FALSE, clusters = NULL, eigdecomp_tol = 1e-10, eigdecomp_iter = 1000, scaled_matrix = FALSE )
scPCA( target, background, center = TRUE, scale = FALSE, n_eigen = 2, cv = NULL, alg = c("iterative", "var_proj", "rand_var_proj"), contrasts = exp(seq(log(0.1), log(1000), length.out = 40)), penalties = seq(0.05, 1, length.out = 20), clust_method = c("kmeans", "pam", "hclust"), n_centers = NULL, max_iter = 10, linkage_method = "complete", n_medoids = 8, parallel = FALSE, clusters = NULL, eigdecomp_tol = 1e-10, eigdecomp_iter = 1000, scaled_matrix = FALSE )
target |
The target (experimental) data set, in a standard format such
as a |
background |
The background data set, in a standard format such as a
|
center |
A |
scale |
A |
n_eigen |
A |
cv |
A |
alg |
A |
contrasts |
A |
penalties |
A |
clust_method |
A |
n_centers |
A |
max_iter |
A |
linkage_method |
A |
n_medoids |
A |
parallel |
A |
clusters |
A |
eigdecomp_tol |
A |
eigdecomp_iter |
A |
scaled_matrix |
A |
A list containing the following components:
rotation
: The matrix of variable loadings if n_centers
is larger than one. Otherwise, a list of rotation matrices is returned,
one for each medoid. The number of medoids is specified by
n_medoids
.
x
: The rotated data, centred and scaled if requested,
multiplied by the rotation matrix if n_centers
is larger than
one. Otherwise, a list of rotated data matrices is returned, one for
each medoid. The number of medoids is specified by n_medoids
.
contrast: The optimal contrastive parameter.
penalty: The optimal L1 penalty term.
center: A logical indicating whether the target dataset was centered.
scale: A logical indicating whether the target dataset was scaled.
Abid A, Zhang MJ, Bagaria VK, Zou J (2018).
“Exploring patterns enriched in a dataset with contrastive principal component analysis.”
Nature communications, 9(1), 2134.
Boileau P, Hejazi NS, Dudoit S (2020).
“Exploring High-Dimensional Biological Data with Sparse Contrastive Principal Component Analysis.”
Bioinformatics.
ISSN 1367-4803, doi:10.1093/bioinformatics/btaa176, btaa176, https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa176/32914142/btaa176.pdf.
Erichson NB, Zeng P, Manohar K, Brunton SL, Kutz JN, Aravkin AY (2018).
“Sparse Principal Component Analysis via Variable Projection.”
ArXiv, abs/1804.00341.
Zou H, Hastie T, Tibshirani R (2006).
“Sparse principal component analysis.”
Journal of computational and graphical statistics, 15(2), 265–286.
# perform cPCA on the simulated data set scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = 0, n_centers = 4 ) # perform scPCA on the simulated data set scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = seq(0.1, 1, length.out = 3), n_centers = 4 ) # perform cPCA on the simulated data set with known clusters scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = 0, clusters = toy_df[, 31] ) # cPCA as implemented in Abid et al. scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 10)), penalties = 0, n_centers = 1 )
# perform cPCA on the simulated data set scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = 0, n_centers = 4 ) # perform scPCA on the simulated data set scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = seq(0.1, 1, length.out = 3), n_centers = 4 ) # perform cPCA on the simulated data set with known clusters scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 5)), penalties = 0, clusters = toy_df[, 31] ) # cPCA as implemented in Abid et al. scPCA( target = toy_df[, 1:30], background = background_df, contrasts = exp(seq(log(0.1), log(100), length.out = 10)), penalties = 0, n_centers = 1 )
The toy data consisting of 400 observations and 31 variables was simulated as follows:
Each of the first 10 variables was drawn from $N(0, 10)$
For group 1 and 2, variables 11 through 20 were drawn from $N(0, 1)$
For group 3 and 4, variables 11 through 20 were drawn from $N(3, 1)$
For group 1 and 3, variables 21 though 30 were drawn from $N(-3, 1)$
For group 2 and 4, variables 21 though 30 were drawn from $N(0, 1)$
The last column provides each observations group number
data(toy_df)
data(toy_df)
A simple data.frame
.
data(toy_df)
data(toy_df)