Title: | Co-Expression Analysis of Sequencing Data |
---|---|
Description: | Co-expression analysis for expression profiles arising from high-throughput sequencing data. Feature (e.g., gene) profiles are clustered using adapted transformations and mixture models or a K-means algorithm, and model selection criteria (to choose an appropriate number of clusters) are provided. |
Authors: | Andrea Rau [cre, aut] , Cathy Maugis-Rabusseau [ctb], Antoine Godichon-Baggioni [ctb] |
Maintainer: | Andrea Rau <[email protected]> |
License: | GPL-3 |
Version: | 1.31.0 |
Built: | 2024-10-31 00:40:35 UTC |
Source: | https://github.com/bioc/coseq |
Co-expression analysis for expression profiles arising from high-throughput sequencing data. Feature (e.g., gene) profiles are clustered using adapted transformations and mixture models or a K-means algorithm, and model selection criteria (to choose an appropriate number of clusters) are provided.
Package: | coseq |
Type: | Package |
Version: | 1.15.4 |
Date: | 2020-12-03 |
License: | GPL-3 |
LazyLoad: | yes |
Andrea Rau, Cathy Maugis-Rabusseau, Antoine Godichon-Baggioni
Maintainer: Andrea Rau <[email protected]>
Godichon-Baggioni, A., Maugis-Rabusseau, C. and Rau, A. (2018) Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data. Journal of Applied Statistics, doi:10.1080/02664763.2018.1454894.
Rau, A. and Maugis-Rabusseau, C. (2018) Transformation and model choice for co-expression analayis of RNA-seq data. Briefings in Bioinformatics, 19(3)-425-436.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux, G. (2015) Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, doi: 10.1093/bioinformatics/btu845.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
Provides the calculation of per-cluster entropy, equivalent to
where is the conditional probability of gene i belonging
to cluster k and
corresponds to the set of indices of genes
attributed to cluster k.
clusterEntropy(probaPost)
clusterEntropy(probaPost)
probaPost |
Matrix containing the conditional probabilities of belonging to each cluster for all observations |
Entropy per cluster
Cathy Maugis-Rabusseau
## Generate artificial matrix of conditional probabilities for K=5 clusters tmp <- matrix(runif(100*5), nrow=100, ncol=5) probaPost <- tmp / rowSums(tmp) clusterEntropy(probaPost)
## Generate artificial matrix of conditional probabilities for K=5 clusters tmp <- matrix(runif(100*5), nrow=100, ncol=5) probaPost <- tmp / rowSums(tmp) clusterEntropy(probaPost)
Provides the calculation of within-cluster inertia, equivalent to
where is the mean of cluster k and
corresponds to the set of indices of genes
attributed to cluster k.
clusterInertia(profiles, clusters)
clusterInertia(profiles, clusters)
profiles |
Matrix, data.frame, or DataFrame containing the (transformed) profiles used for the clustering |
clusters |
Vector of cluster labels corresponding to the observations in |
Within cluster inertia
Andrea Rau, Antoine Godichon-Baggioni
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") clusterInertia(profiles=tcounts(run_kmeans), clusters=clusters(run_kmeans))
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") clusterInertia(profiles=tcounts(run_kmeans), clusters=clusters(run_kmeans))
Provides the adjusted rand index (ARI) between pairs of clustering paritions.
compareARI(object, ...) ## S4 method for signature 'coseqResults' compareARI( object, K = NULL, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ... ) ## S4 method for signature 'matrix' compareARI(object, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ...) ## S4 method for signature 'data.frame' compareARI(object, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ...)
compareARI(object, ...) ## S4 method for signature 'coseqResults' compareARI( object, K = NULL, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ... ) ## S4 method for signature 'matrix' compareARI(object, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ...) ## S4 method for signature 'data.frame' compareARI(object, parallel = FALSE, BPPARAM = bpparam(), plot = TRUE, ...)
object |
Object of class |
... |
Additional optional parameters for corrplot |
K |
If |
parallel |
If |
BPPARAM |
Optional parameter object passed internally to |
plot |
If |
Matrix of adjusted rand index values calculated between each pair of models.
Andrea Rau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
Compare the corrected ICL values after applying the arcsin, logit, and logMedianRef transformations in a coseq analysis
compareICL(x)
compareICL(x)
x |
A list made up of |
A plot of corrected ICL values for the models included in x
(the list
of coseqResults
objects)
Andrea Rau, Cathy Maugis-Rabusseau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
Convert legacy coseq S3 class objects to coseqResults S4 class objects
convertLegacyCoseq(object, digits = 3)
convertLegacyCoseq(object, digits = 3)
object |
Object of S3 class |
digits |
integer indicating the number of decimal places (round) to retain in results. |
Converted object of S4 class coseqResults
compatible with
recent versions of coseq (>= 0.99.1)
This is the primary user interface for the coseq
package.
Generic S4 methods are implemented to perform co-expression or co-abudance analysis of
high-throughput sequencing data, with or without data transformation, using K-means or mixture models.
The supported classes are matrix
, data.frame
, and DESeqDataSet
.
The output of coseq
is an S4 object of class coseqResults
.
coseq(object, ...) ## S4 method for signature 'matrix' coseq( object, K, subset = NULL, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... ) ## S4 method for signature 'data.frame' coseq( object, K, subset = NULL, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... ) ## S4 method for signature 'DESeqDataSet' coseq( object, K, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... )
coseq(object, ...) ## S4 method for signature 'matrix' coseq( object, K, subset = NULL, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... ) ## S4 method for signature 'data.frame' coseq( object, K, subset = NULL, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... ) ## S4 method for signature 'DESeqDataSet' coseq( object, K, model = "kmeans", transformation = "logclr", normFactors = "TMM", meanFilterCutoff = NULL, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... )
object |
Data to be clustered. May be provided as a y (n x q)
matrix or data.frame of observed counts for n
observations and q variables, or an object of class |
... |
Additional optional parameters. |
K |
Number of clusters (a single value or a vector of values) |
subset |
Optional vector providing the indices of a subset of
genes that should be used for the co-expression analysis (i.e., row indices
of the data matrix |
model |
Type of mixture model to use (“ |
transformation |
Transformation type to be used: “ |
normFactors |
The type of estimator to be used to normalize for differences in
library size: (“ |
meanFilterCutoff |
Value used to filter low mean normalized counts if desired (by default, set to a value of 50) |
modelChoice |
Criterion used to select the best model. For Gaussian mixture models,
“ |
parallel |
If |
BPPARAM |
Optional parameter object passed internally to |
seed |
If desired, an integer defining the seed of the random number generator. If
|
An S4 object of class coseqResults
, where conditional
probabilities of cluster membership for each gene in each model is stored as a SimpleList of assay
data, and the corresponding log likelihood, ICL value, number of
clusters, and form of Gaussian model for each model are stored as metadata.
Andrea Rau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
The counts slot holds the count data as a matrix of non-negative integer count values, one row for each observational unit (gene or the like), and one column for each sample.
coseqFullResults(object, ...) clusters(object, ...) likelihood(object, ...) nbCluster(object, ...) proba(object, ...) ICL(object, ...) profiles(object, ...) tcounts(object, ...) transformationType(object, ...) model(object, ...) DDSEextract(object, ...) Djumpextract(object, ...) ## S4 method for signature 'coseqResults' clusters(object, K) ## S4 method for signature 'RangedSummarizedExperiment' clusters(object, ...) ## S4 method for signature 'matrix' clusters(object, ...) ## S4 method for signature 'data.frame' clusters(object, ...) ## S4 method for signature 'MixmodCluster' likelihood(object) ## S4 method for signature 'RangedSummarizedExperiment' likelihood(object) ## S4 method for signature 'coseqResults' likelihood(object) ## S4 method for signature ''NULL'' likelihood(object) ## S4 method for signature 'MixmodCluster' nbCluster(object) ## S4 method for signature 'RangedSummarizedExperiment' nbCluster(object) ## S4 method for signature 'coseqResults' nbCluster(object) ## S4 method for signature ''NULL'' nbCluster(object) ## S4 method for signature 'MixmodCluster' ICL(object) ## S4 method for signature 'RangedSummarizedExperiment' ICL(object) ## S4 method for signature 'coseqResults' ICL(object) ## S4 method for signature ''NULL'' ICL(object) ## S4 method for signature 'coseqResults' profiles(object) ## S4 method for signature 'coseqResults' tcounts(object) ## S4 method for signature 'coseqResults' transformationType(object) ## S4 method for signature 'coseqResults' model(object) ## S4 method for signature 'coseqResults' coseqFullResults(object) ## S4 method for signature 'coseqResults' show(object) ## S4 method for signature 'MixmodCluster' proba(object) ## S4 method for signature 'Capushe' DDSEextract(object) ## S4 method for signature 'Capushe' Djumpextract(object)
coseqFullResults(object, ...) clusters(object, ...) likelihood(object, ...) nbCluster(object, ...) proba(object, ...) ICL(object, ...) profiles(object, ...) tcounts(object, ...) transformationType(object, ...) model(object, ...) DDSEextract(object, ...) Djumpextract(object, ...) ## S4 method for signature 'coseqResults' clusters(object, K) ## S4 method for signature 'RangedSummarizedExperiment' clusters(object, ...) ## S4 method for signature 'matrix' clusters(object, ...) ## S4 method for signature 'data.frame' clusters(object, ...) ## S4 method for signature 'MixmodCluster' likelihood(object) ## S4 method for signature 'RangedSummarizedExperiment' likelihood(object) ## S4 method for signature 'coseqResults' likelihood(object) ## S4 method for signature ''NULL'' likelihood(object) ## S4 method for signature 'MixmodCluster' nbCluster(object) ## S4 method for signature 'RangedSummarizedExperiment' nbCluster(object) ## S4 method for signature 'coseqResults' nbCluster(object) ## S4 method for signature ''NULL'' nbCluster(object) ## S4 method for signature 'MixmodCluster' ICL(object) ## S4 method for signature 'RangedSummarizedExperiment' ICL(object) ## S4 method for signature 'coseqResults' ICL(object) ## S4 method for signature ''NULL'' ICL(object) ## S4 method for signature 'coseqResults' profiles(object) ## S4 method for signature 'coseqResults' tcounts(object) ## S4 method for signature 'coseqResults' transformationType(object) ## S4 method for signature 'coseqResults' model(object) ## S4 method for signature 'coseqResults' coseqFullResults(object) ## S4 method for signature 'coseqResults' show(object) ## S4 method for signature 'MixmodCluster' proba(object) ## S4 method for signature 'Capushe' DDSEextract(object) ## S4 method for signature 'Capushe' Djumpextract(object)
object |
a |
... |
Additional optional parameters |
K |
numeric indicating the model to be used (if NULL of missing, the model chosen by ICL is used by default) |
Output varies depending on the method. clusters
returns a vector of cluster
labels for each gene for the desired model.
Andrea Rau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
coseqResults
is a subclass of RangedSummarizedExperiment
,
used to store the co-expression results as well as some additional
information useful for plotting (tcounts
, y_profiles
) and
meta-information about the co-expression analysis (transformation
,
normFactors
).
coseqResults( SummarizedExperiment, allResults, model = NULL, transformation = NULL, tcounts = NULL, y_profiles = NULL, normFactors = NULL )
coseqResults( SummarizedExperiment, allResults, model = NULL, transformation = NULL, tcounts = NULL, y_profiles = NULL, normFactors = NULL )
SummarizedExperiment |
a |
allResults |
List of conditional probabilities of cluster membership for each gene, in all models fit |
model |
|
transformation |
Transformation applied to counts to obtain |
tcounts |
Transformed counts used for mixture model fitting |
y_profiles |
y profiles used for |
normFactors |
Scaling factors used for normalization |
This constructor function would not typically be used by "end users".
This simple class extends the RangedSummarizedExperiment
class of the
SummarizedExperiment package
to allow other packages to write methods for results
objects from the coseq package. It is used by coseqRun
to wrap up the results table.
a coseqResults object
Function for primary code to perform co-expression analysis, with or without data transformation,
using mixture models. The output of coseqRun
is an S4 object of class coseqResults
.
coseqRun( y, K, conds = NULL, normFactors = "TMM", model = "kmeans", transformation = "logclr", subset = NULL, meanFilterCutoff = 50, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... )
coseqRun( y, K, conds = NULL, normFactors = "TMM", model = "kmeans", transformation = "logclr", subset = NULL, meanFilterCutoff = 50, modelChoice = ifelse(model == "kmeans", "DDSE", "ICL"), parallel = FALSE, BPPARAM = bpparam(), seed = NULL, ... )
y |
(n x q) matrix of observed counts for n observations (genes) and q variables (samples). In nearly all cases, n > q. |
K |
Number of clusters (a single value or a vector of values) |
conds |
Vector of length q defining the condition (treatment
group) for each variable (column) in |
normFactors |
The type of estimator to be used to normalize for differences in
library size: (“ |
model |
Type of mixture model to use (“ |
transformation |
Transformation type to be used: “ |
subset |
Optional vector providing the indices of a subset of
genes that should be used for the co-expression analysis (i.e., row indices
of the data matrix |
meanFilterCutoff |
Value used to filter low mean normalized counts if desired (by default, set to a value of 50) |
modelChoice |
Criterion used to select the best model. For Gaussian mixture models,
“ |
parallel |
If |
BPPARAM |
Optional parameter object passed internally to |
seed |
If desired, an integer defining the seed of the random number generator. If
|
... |
Additional optional parameters. |
An S4 object of class coseqResults
whose assays contain a SimpleList
object, where each element in the list corresponds to the conditional probabilities of cluster membership
for each gene in each model. Meta data (accessible via metatdata
include the model
used
(either Normal
or Poisson
), the transformation
used on the data, the
transformed data using to estimate model (tcounts
), the normalized profiles for use in plotting
(y_profiles
), and the normalization factors used in the analysis (normFactors
).
Andrea Rau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the K-means for K = 2,3,4 with logCLR transformation ## The following are equivalent: run <- coseqRun(y=countmat, K=2:15) run <- coseq(object=countmat, K=2:15, transformation="logclr", model="kmeans") ## Run the Normal mixture model for K = 2,3,4 with arcsine transformation ## The following are equivalent: run <- coseqRun(y=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal") run <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal")
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the K-means for K = 2,3,4 with logCLR transformation ## The following are equivalent: run <- coseqRun(y=countmat, K=2:15) run <- coseq(object=countmat, K=2:15, transformation="logclr", model="kmeans") ## Run the Normal mixture model for K = 2,3,4 with arcsine transformation ## The following are equivalent: run <- coseqRun(y=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal") run <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal")
This dataset represents RNA-seq data from mouse neocortex RNA-seq data in five embryonic (day 14.5) mice by analyzing the transcriptome of three regions: the ventricular zone (VZ), subventricular zone (SVZ) and cortical place (CP).
data(fietz)
data(fietz)
An ExpressionSet named fietz.eset
containing the phenotype data and
expression data for the Fietz et al. (2012) experiment. Phenotype data may be
accessed using the pData
function, and expression data may be accessed
using the exprs
function.
Object of class ‘ExpressionSet’. Matrix of counts can be accessed after
loading the ‘Biobase’ package and calling exprs(fietz))
.
Digital Expression Explorer (http://dee.bakeridi.edu.au/).
https://perso.math.univ-toulouse.fr/maugis/mixstatseq/packages
Fietz, S. A., et al. (2012). Transcriptomes of germinal zones of human and mouse fetal neocortex suggest a role of extracellular matrix in progenitor self-renewal. Proceedings of the National Academy of Sciences, 109(29):11836-11841.
Calculate conditional probabilities of cluster membership for K-means clustering
kmeansProbaPost(clusters, tcounts)
kmeansProbaPost(clusters, tcounts)
clusters |
Cluster labels arising from K-means clustering |
tcounts |
Transformed counts clustered using K-means |
Conditional probabilities of cluster membership for each observation in each cluster
## Example of K-means taken from ?kmeans help page x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") cl <- kmeans(x, 5) probaPost <- kmeansProbaPost(cl$cluster, x) head(probaPost)
## Example of K-means taken from ?kmeans help page x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") cl <- kmeans(x, 5) probaPost <- kmeansProbaPost(cl$cluster, x) head(probaPost)
Calculate the Log Centered Log Ratio (logCLR) transformation
logclr(profiles)
logclr(profiles)
profiles |
Matrix of profiles. Note that the presence of 0 values causes an error message to be produced. |
logCLR-transformed profiles
Permute the columns of a contingency table comparing two clusterings to load the diagonal as much as possible.
matchContTable(table_1, table_2)
matchContTable(table_1, table_2)
table_1 |
Partition from a first data clustering |
table_2 |
Partition from a second data clustering |
Permuted contingency table
## Generate arbitrary labels from two separate clustering results labels_1 <- sample(1:10, 1000, replace=TRUE) ## K=10 clusters labels_2 <- sample(1:8, 1000, replace=TRUE) ## K=8 clusters matchContTable(labels_1, labels_2)
## Generate arbitrary labels from two separate clustering results labels_1 <- sample(1:10, 1000, replace=TRUE) ## K=10 clusters labels_2 <- sample(1:8, 1000, replace=TRUE) ## K=8 clusters matchContTable(labels_1, labels_2)
Perform co-expression and co-abudance analysis of high-throughput
sequencing data, with or without data transformation, using a Normal
mixture models. The output of NormMixClus
is an S4 object of
class RangedSummarizedExperiment
.
NormMixClus( y_profiles, K, subset = NULL, parallel = TRUE, BPPARAM = bpparam(), seed = NULL, ... )
NormMixClus( y_profiles, K, subset = NULL, parallel = TRUE, BPPARAM = bpparam(), seed = NULL, ... )
y_profiles |
(n x q) matrix of observed profiles for n observations and q variables |
K |
Number of clusters (a single value or a sequence of values). |
subset |
Optional vector providing the indices of a subset of
genes that should be used for the co-expression analysis (i.e., row indices
of the data matrix |
parallel |
If |
BPPARAM |
Optional parameter object passed internally to |
seed |
If desired, an integer defining the seed of the random number generator. If
|
... |
Additional optional parameters to be passed to |
An S4 object of class coseqResults
, with conditional
probabilities of cluster membership for each gene in each model stored as a list of assay
data, and corresponding log likelihood, ICL value, number of
clusters, and form of Gaussian model for each model stored as metadata.
Andrea Rau, Cathy Maugis-Rabusseau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
Perform co-expression and co-abudance analysis of high-throughput
sequencing data, with or without data transformation, using a Normal
mixture models for single number of clusters K.
The output of NormMixClusK
is an S4 object of
class RangedSummarizedExperiment
.
NormMixClusK( y_profiles, K, alg.type = "EM", init.runs = 50, init.type = "small-em", GaussianModel = "Gaussian_pk_Lk_Ck", init.iter = 20, iter = 1000, cutoff = 0.001, verbose = TRUE, digits = 3, seed = NULL )
NormMixClusK( y_profiles, K, alg.type = "EM", init.runs = 50, init.type = "small-em", GaussianModel = "Gaussian_pk_Lk_Ck", init.iter = 20, iter = 1000, cutoff = 0.001, verbose = TRUE, digits = 3, seed = NULL )
y_profiles |
y (n x q) matrix of observed profiles for n observations and q variables |
K |
Number of clusters (a single value). |
alg.type |
Algorithm to be used for parameter estimation:
“ |
init.runs |
Number of runs to be used for the Small-EM strategy, with a default value of 50 |
init.type |
Type of initialization strategy to be used:
“ |
GaussianModel |
One of the 28 forms of Gaussian models defined in Rmixmod,
by default equal to the |
init.iter |
Number of iterations to be used within each run for the Small-EM strategry, with a default value of 20 |
iter |
Maximum number of iterations to be run for the chosen algorithm |
cutoff |
Cutoff to declare algorithm convergence |
verbose |
If |
digits |
Integer indicating the number of decimal places to be used for the
|
seed |
If desired, an integer defining the seed of the random number generator. If
|
An S4 object of class RangedSummarizedExperiment
, with conditional
probabilities of cluster membership for each gene stored as assay data, and
log likelihood, ICL value, number of
clusters, and form of Gaussian model stored as metadata.
Cathy Maugis-Rabusseau, Andrea Rau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
Calculates the mean and covariance parameters for a normal mixture model of the form pK_Lk_Ck
NormMixParam( coseqResults, y_profiles = NULL, K = NULL, digits = 3, plot = FALSE, ... )
NormMixParam( coseqResults, y_profiles = NULL, K = NULL, digits = 3, plot = FALSE, ... )
coseqResults |
Object of class |
y_profiles |
y (n x q) matrix of observed profiles for n
observations and q variables, required for |
K |
The model used for parameter estimation for objects |
digits |
Integer indicating the number of decimal places to be used for output |
plot |
If |
... |
Additional optional parameters to pass to |
pi |
Vector of dimension K with the estimated cluster proportions from the Gaussian mixture model, where K is the number of clusters |
mu |
Matrix of dimension K x d containing the estimated mean
vector from the Gaussian mixture model, where d is the
number of samples in the data |
Sigma |
Array of dimension d x d x K containing the
estimated covariance matrices from the Gaussian mixture model, where d is the
number of samples in the data |
rho |
Array of dimension d x d x K containing the
estimated correlation matrices from the Gaussian mixture model, where d is the
number of samples in the data |
Andrea Rau, Cathy Maugis-Rabusseau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] profiles <- transformRNAseq(countmat, norm="none", transformation="arcsin")$tcounts conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3 ## Object of class coseqResults run <- NormMixClus(y=profiles, K=2:3, iter=5) run ## Run the Normal mixture model for K=2 ## Object of class SummarizedExperiment0 run2 <- NormMixClusK(y=profiles, K=2, iter=5) ## Summary of results summary(run) ## Re-estimate mixture parameters for the model with K=2 clusters param <- NormMixParam(run, y_profiles=profiles)
Plot a coseqResults object.
plot(x, ...) ## S4 method for signature 'coseqResults' plot( x, y_profiles = NULL, K = NULL, threshold = 0.8, conds = NULL, average_over_conds = FALSE, collapse_reps = "none", graphs = c("logLike", "ICL", "profiles", "boxplots", "probapost_boxplots", "probapost_barplots", "probapost_histogram"), order = FALSE, profiles_order = NULL, n_row = NULL, n_col = NULL, add_lines = TRUE, ... ) coseqGlobalPlots(object, graphs = c("logLike", "ICL"), ...) coseqModelPlots( probaPost, y_profiles, K = NULL, threshold = 0.8, conds = NULL, collapse_reps = "none", graphs = c("profiles", "boxplots", "probapost_boxplots", "probapost_barplots", "probapost_histogram"), order = FALSE, profiles_order = NULL, n_row = NULL, n_col = NULL, add_lines = TRUE, ... )
plot(x, ...) ## S4 method for signature 'coseqResults' plot( x, y_profiles = NULL, K = NULL, threshold = 0.8, conds = NULL, average_over_conds = FALSE, collapse_reps = "none", graphs = c("logLike", "ICL", "profiles", "boxplots", "probapost_boxplots", "probapost_barplots", "probapost_histogram"), order = FALSE, profiles_order = NULL, n_row = NULL, n_col = NULL, add_lines = TRUE, ... ) coseqGlobalPlots(object, graphs = c("logLike", "ICL"), ...) coseqModelPlots( probaPost, y_profiles, K = NULL, threshold = 0.8, conds = NULL, collapse_reps = "none", graphs = c("profiles", "boxplots", "probapost_boxplots", "probapost_barplots", "probapost_histogram"), order = FALSE, profiles_order = NULL, n_row = NULL, n_col = NULL, add_lines = TRUE, ... )
x |
An object of class |
... |
Additional optional plotting arguments (e.g., xlab, ylab, use_sample_names, facet_labels) |
y_profiles |
y (n x q) matrix of observed profiles for n
observations and q variables to be used for graphing results (optional for
|
K |
If desired, the specific model to use for plotting (or the specific cluster number(s)
to use for plotting in the case of |
threshold |
Threshold used for maximum conditional probability; only observations with maximum conditional probability greater than this threshold are visualized |
conds |
Condition labels, if desired |
average_over_conds |
If |
collapse_reps |
If |
graphs |
Graphs to be produced, one (or more) of the following:
|
order |
If |
profiles_order |
If |
n_row |
Number of rows for plotting layout of line plots and boxplots of profiles. |
n_col |
Number of columns for plotting layout of line plots and boxplots of profiles. |
add_lines |
If |
object |
An object of class |
probaPost |
Matrix or data.frame of dimension (n x K) containing the conditional probilities of cluster membership for n genes in K clusters arising from a mixture model |
Named list of plots of the coseqResults
object.
Andrea Rau, Cathy Maugis-Rabusseau
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
A function to summarize the clustering results obtained from a Poisson or
Gaussian mixture model estimated using coseq
. In particular,
the function provides the number of clusters selected for the ICL
model selection approach (or alternatively, for the capushe non-asymptotic approach
if K-means clustering is used), number of genes assigned to each cluster, and
if desired the per-gene cluster means.
## S4 method for signature 'coseqResults' summary(object, y_profiles, digits = 3, ...)
## S4 method for signature 'coseqResults' summary(object, y_profiles, digits = 3, ...)
object |
An object of class |
y_profiles |
Data used for clustering if per-cluster means are desired |
digits |
Integer indicating the number of decimal places to be used for mixture model parameters |
... |
Additional arguments |
Provides the following summary of results:
1) Number of clusters and model selection criterion used, if applicable.
2) Number of observations across all clusters with a maximum conditional probability greater than 90 observations) for the selected model.
3) Number of observations per cluster with a maximum conditional probability greater than 90 cluster) for the selected model.
4) If desired, the values and
values for the selected
model in the case of a Gaussian mixture model.
Summary of the coseqResults
object.
Andrea Rau
Rau, A. and Maugis-Rabusseau, C. (2017) Transformation and model choice for co-expression analayis of RNA-seq data. Briefings in Bioinformatics, doi: http://dx.doi.org/10.1101/065607.
Godichon-Baggioni, A., Maugis-Rabusseau, C. and Rau, A. (2017) Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data. arXiv:1704.06150.
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
## Simulate toy data, n = 300 observations set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Run the Normal mixture model for K = 2,3,4 run_arcsin <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin", model="Normal", seed=12345) run_arcsin ## Plot and summarize results plot(run_arcsin) summary(run_arcsin) ## Compare ARI values for all models (no plot generated here) ARI <- compareARI(run_arcsin, plot=FALSE) ## Compare ICL values for models with arcsin and logit transformations run_logit <- coseq(object=countmat, K=2:4, iter=5, transformation="logit", model="Normal") compareICL(list(run_arcsin, run_logit)) ## Use accessor functions to explore results clusters(run_arcsin) likelihood(run_arcsin) nbCluster(run_arcsin) ICL(run_arcsin) ## Examine transformed counts and profiles used for graphing tcounts(run_arcsin) profiles(run_arcsin) ## Run the K-means algorithm for logclr profiles for K = 2,..., 20 run_kmeans <- coseq(object=countmat, K=2:20, transformation="logclr", model="kmeans") run_kmeans
Application of common transformations for RNA-seq data prior to fitting a normal mixture model
transformRNAseq( y, normFactors = "TMM", transformation = "arcsin", geneLength = NA, meanFilterCutoff = NULL, verbose = TRUE )
transformRNAseq( y, normFactors = "TMM", transformation = "arcsin", geneLength = NA, meanFilterCutoff = NULL, verbose = TRUE )
y |
(n x q) |
normFactors |
The type of estimator to be used to normalize for differences in
library size: “ |
transformation |
Transformation type to be used: “ |
geneLength |
Vector of length equal to the number of rows in “ |
meanFilterCutoff |
Value used to filter low mean normalized counts |
verbose |
If |
tcounts |
Transformed counts |
normCounts |
Normalized counts |
snorm |
Per-sample normalization factors divided by mean normalization factor |
ellnorm |
Per-sample normalization factors |
set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Arcsin transformation, TMM normalization arcsin <- transformRNAseq(countmat, normFactors="TMM", transformation="arcsin")$tcounts ## Logit transformation, TMM normalization logit <- transformRNAseq(countmat, normFactors="TMM", transformation="logit")$tcounts ## logCLR transformation, TMM normalization logclr <- transformRNAseq(countmat, normFactors="TMM", transformation="logclr")$tcounts
set.seed(12345) countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4) countmat <- countmat[which(rowSums(countmat) > 0),] conds <- rep(c("A","B","C","D"), each=2) ## Arcsin transformation, TMM normalization arcsin <- transformRNAseq(countmat, normFactors="TMM", transformation="arcsin")$tcounts ## Logit transformation, TMM normalization logit <- transformRNAseq(countmat, normFactors="TMM", transformation="logit")$tcounts ## logCLR transformation, TMM normalization logclr <- transformRNAseq(countmat, normFactors="TMM", transformation="logclr")$tcounts