Title: | Microbiota STability ASsessment via Iterative cluStering |
---|---|
Description: | The toolkit 'µSTASIS', or microSTASIS, was developed for the stability analysis of microbiota in a temporal framework by means of iterative clustering. Concretely, the core function runs the Hartigan-Wong k-means algorithm as many times as possible to stress paired samples from the same individuals and test whether they remain clustered together across multiple numbers of clusters over a whole data set of individuals. The package also includes functions to subset samples from paired times, validate the results and visualize the output. |
Authors: | Pedro Sánchez-Sánchez [aut, cre] , Alfonso Benítez-Páez [aut] |
Maintainer: | Pedro Sánchez-Sánchez <[email protected]> |
License: | GPL-3 |
Version: | 1.7.0 |
Built: | 2024-10-30 08:51:30 UTC |
Source: | https://github.com/bioc/microSTASIS |
A dataset containing the amplicon sequence variants of 131 samples from the gut microbiota of 43 individuals. The values are transformed from counts by applying the centred log-ratio (CLR) transformation.
data(clr)
A data.frame with 131 rows and 226 variables
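For readers unfamiliar with the CLR transformation mentioned above, the following base R sketch shows the usual definition: the log of each value minus the mean of the logs within the same sample. The helper name `clrSketch` and the pseudocount used to avoid `log(0)` are assumptions made for illustration; the source does not state how zeros were handled when this dataset was built.

```r
# Hypothetical helper, not part of microSTASIS.
# counts: numeric matrix with samples as rows and features (ASVs) as columns.
# pseudocount: assumed zero-handling strategy, added before taking logs.
clrSketch <- function(counts, pseudocount = 1) {
  logged <- log(counts + pseudocount)
  # Subtract each sample's mean log value, i.e. divide by the geometric mean
  sweep(logged, 1, rowMeans(logged), FUN = "-")
}

counts <- matrix(
  c(10, 0, 5, 85,
     3, 30, 60, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), paste0("ASV", 1:4))
)
clrValues <- clrSketch(counts)
rowSums(clrValues)  # each row sums to zero up to floating-point error
```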
Gloria M. Agudelo-Ochoa, Beatriz E. Valdés-Duque, Nubia A. Giraldo-Giraldo, Ana M. Jaillier-Ramírez, Adriana Giraldo-Villa, Irene Acevedo-Castaño, Mónica A. Yepes-Molina, Janeth Barbosa-Barbosa, Alfonso Benítez-Páez, Gut microbiota profiles in critically ill patients, potential biomarkers and risk variables for sepsis, Gut Microbes, Volume 12, Issue 1, January 2020, https://doi.org/10.1080/19490976.2019.1707610
Pedro Sánchez-Sánchez, Francisco J Santonja, Alfonso Benítez-Páez, Assessment of human microbiota stability across longitudinal samples using iteratively growing-partitioned clustering, Briefings in Bioinformatics, Volume 23, Issue 2, March 2022, bbac055, https://doi.org/10.1093/bib/bbac055
Perform the Hartigan-Wong stats::kmeans() algorithm as many times as possible. The values of k range from 2 to the number of samples minus 1. For each value of k, an individual whose paired samples are clustered under the same label scores 1. If the paired samples fall in different clusters, the individual scores 0, except when the Euclidean distance between the two samples is smaller than the distance from each sample to its own centroid. This is done for all possible values of k and, finally, the sum is divided by the number of values of k tested, which gives a value between 0 and 1.
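The scoring rule described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper name `pairedStabilitySketch`, the `nstart` argument, and the reading of the exception (the pair counts as together when their mutual distance is smaller than both sample-to-centroid distances) are all assumptions. It also assumes exactly two samples per individual.

```r
# Hypothetical helper, not part of microSTASIS.
# x: numeric matrix, one row per sample; ids: individual ID for each row.
pairedStabilitySketch <- function(x, ids, nstart = 5L) {
  ids <- as.character(ids)
  stopifnot(is.matrix(x), nrow(x) == length(ids), all(table(ids) == 2L))
  ks <- 2:(nrow(x) - 1L)                      # k from 2 to n - 1
  individuals <- unique(ids)
  hits <- setNames(numeric(length(individuals)), individuals)
  for (k in ks) {
    km <- stats::kmeans(x, centers = k, algorithm = "Hartigan-Wong",
                        nstart = nstart)
    for (id in individuals) {
      rows <- which(ids == id)                # the two paired samples
      cl <- km$cluster[rows]
      if (cl[1] == cl[2]) {
        hits[id] <- hits[id] + 1              # same cluster label
      } else {
        # Exception (assumed reading): pair closer to each other than
        # each sample is to its own centroid
        dPair <- sqrt(sum((x[rows[1], ] - x[rows[2], ])^2))
        dCent <- sqrt(rowSums((x[rows, , drop = FALSE] -
                               km$centers[cl, , drop = FALSE])^2))
        if (all(dPair < dCent)) hits[id] <- hits[id] + 1
      }
    }
  }
  hits / length(ks)                           # score between 0 and 1
}

# Toy data: 5 individuals, 2 nearly identical samples each
set.seed(1)
centres <- matrix(rnorm(5 * 4, sd = 10), nrow = 5)
x <- centres[rep(1:5, each = 2), ] + matrix(rnorm(10 * 4, sd = 0.1), nrow = 10)
ids <- rep(paste0("ind", 1:5), each = 2)
scores <- pairedStabilitySketch(x, ids)
```

Because large values of k force most samples into clusters of their own, even tightly paired samples do not reach a score of 1 in this sketch.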
iterativeClustering(
  pairedTimes,
  BPPARAM = BiocParallel::bpparam(),
  common = "_"
)
pairedTimes | list of matrices with paired times, i.e. samples to be stressed to multiple iterations. Output of pairedTimes(). |
BPPARAM | supply a BiocParallel parameter object, e.g. BiocParallel::bpparam(). |
common | pattern that separates the ID and the sampling time. |
µSTASIS stability score (mS) for the individuals from the corresponding paired times.
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
Perform cross-validation of the stability results from iterativeClustering(), either as leave-one-out (LOO) or as leave-k-out, i.e. removing k individuals each time before recalculating the metric over the remaining individuals.
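To make the leave-k-out idea concrete, the base R sketch below builds the row subsets that remain after removing k individuals at a time. The helper name `leaveKOutSubsets` is hypothetical, and removing individuals in consecutive blocks of k is an assumption; the source does not specify how the package selects which individuals to drop together.

```r
# Hypothetical helper, not part of microSTASIS.
# ids: individual ID for each sample (row); k: individuals removed per subset.
leaveKOutSubsets <- function(ids, k = 1L) {
  ids <- as.character(ids)
  individuals <- unique(ids)
  # Assumed scheme: consecutive blocks of k individuals are left out in turn
  groups <- split(individuals, ceiling(seq_along(individuals) / k))
  # For each block, return the row indices of the samples that are kept
  lapply(groups, function(removed) which(!ids %in% removed))
}

ids <- rep(paste0("ind", 1:6), each = 2)
subsets <- leaveKOutSubsets(ids, k = 2L)
length(subsets)   # 3 subsets, each dropping 2 individuals
lengths(subsets)  # 8 of the 12 rows are kept in each subset
```

Each subset would then be re-clustered and the resulting scores compared with those obtained on the full matrix.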
iterativeClusteringCV(
  pairedTimes,
  results,
  name,
  common = "_",
  k = 1L,
  BPPARAM = BiocParallel::bpparam()
)
pairedTimes | list of matrices with paired times, i.e. samples to be stressed to multiple iterations. Output of pairedTimes(). |
results | the list output of iterativeClustering(). |
name | character; name of the paired times whose stability is being assessed. |
common | pattern that separates the ID and the sampling time. |
k | integer; number of individuals to remove from the data each time iterativeClustering() is run. |
BPPARAM | supply a BiocParallel parameter object, e.g. BiocParallel::bpparam(). |
Multiple lists with multiple objects of class "kmeans".
data(clr)
times <- pairedTimes(data = clr[, 1:20], sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
cv_klist_t1_t25_k2 <- iterativeClusteringCV(pairedTimes = times, results = mS, name = "t1_t25", common = "_0_", k = 2L)
The toolkit 'µSTASIS' was developed for the stability analysis of microbiota in a temporal framework by means of iterative clustering. Concretely, the core function runs the Hartigan-Wong k-means algorithm as many times as possible to stress paired samples from the same individuals and test whether they remain clustered together across multiple numbers of clusters over a whole data set of individuals. The package also includes functions to subset samples from paired times, validate the results and visualize the output.
Compute the mean absolute error after iterativeClusteringCV(), i.e. the error between the stability value of each individual in the original matrix of paired times and its values in each cross-validation subset.
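As a rough illustration of the mean absolute error (MAE) computed per individual, the base R sketch below compares a vector of original scores with a matrix of scores obtained on cross-validation subsets. The helper name `maeSketch` and the input layout (one row per subset, one column per individual, `NA` where the individual was left out) are assumptions chosen for the example; they are not the package's internal data structures.

```r
# Hypothetical helper, not part of microSTASIS.
# original: named vector of scores on the full data.
# cvScores: matrix of scores per CV subset (rows) and individual (columns),
#           with NA where the individual was removed from that subset.
maeSketch <- function(original, cvScores) {
  stopifnot(ncol(cvScores) == length(original))
  absErr <- abs(sweep(cvScores, 2, original, FUN = "-"))
  colMeans(absErr, na.rm = TRUE)   # average over the subsets that include each individual
}

original <- c(ind1 = 0.8, ind2 = 0.5, ind3 = 1.0)
cvScores <- rbind(
  c(NA,  0.6, 0.9),
  c(0.7, NA,  1.0),
  c(0.9, 0.5, NA)
)
colnames(cvScores) <- names(original)
mae <- maeSketch(original, cvScores)
```

A small MAE indicates that an individual's score changes little when other individuals are removed from the data.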
mSerrorCV(pairedTime, CVklist, k = 1L)
pairedTime | input matrix with paired times whose stability has been assessed. One of the elements of the list output by pairedTimes(). |
CVklist | list resulting from iterativeClusteringCV(). |
k | integer; number of individuals to subset from the data. The same as used in iterativeClusteringCV(). |
A vector with MAE values for each individual's mS score.
data(clr)
times <- pairedTimes(data = clr[, 1:20], sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
cv_klist_t1_t25_k2 <- iterativeClusteringCV(pairedTimes = times, results = mS, name = "t1_t25", common = "_0_", k = 2L)
MAE_t1_t25 <- mSerrorCV(pairedTime = times$t1_t25, CVklist = cv_klist_t1_t25_k2, k = 2L)
MAE <- mSpreviz(results = list(MAE_t1_t25), times = list(t1_t25 = times$t1_t25))
plotmSheatmap(results = MAE, times = c("t1_t25", "t25_t26"), label = TRUE, high = 'red2', low = 'forestgreen', midpoint = 5)
Internal function for pairedTimes().
mSinternalPairedTimes(data, specifiedTimePoints, common = "_")
data |
matrix with rownames including ID, common pattern and sampling time. |
specifiedTimePoints |
character vector to specify the selection of concrete paired times. |
common |
pattern separating the ID and the sampling time in rownames. |
A list of matrices with the same number of columns as input and with samples from paired sampling times as rows.
data(clr)
t1_t2 <- mSinternalPairedTimes(data = clr, specifiedTimePoints = c("1", "25"), common = "_0_")
Easily extract groups of individuals from sample metadata.
mSmetadataGroups(
  metadata,
  samples,
  individuals,
  variable,
  common,
  ID,
  timePoints
)
metadata |
input data.frame with data corresponding to samples. It can
be the |
samples |
vector from metadata corresponding to the samples ID, if applicable; should be NULL if ID and timePoints are provided from a TreeSummarizedExperiment, for example. |
individuals | vector of individuals; first column of the mSpreviz() output. |
variable |
column name with the variable used for grouping individuals. |
common |
pattern that separates the ID and the sampling time in rownames, if applicable. |
ID |
If applicable, one of the colData() colnames from the TreeSummarizedExperiment should be given as individuals. |
timePoints |
If applicable, one of the colData() colnames from the TreeSummarizedExperiment should be given as sampling times. |
A vector with the same length as the number of rows in the mSpreviz() output.
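The mapping from sample metadata to one group label per individual can be sketched in base R as below. The helper name `groupsSketch` and its argument names are hypothetical, and the sketch assumes that sample names follow the "ID, common pattern, sampling time" layout described for rownames and that the grouping variable is constant within an individual.

```r
# Hypothetical helper, not part of microSTASIS.
# metadata: data.frame with one row per sample.
# sampleCol: name of the column holding sample names such as "A_0_1".
# variable: name of the grouping column; individuals: IDs to look up.
groupsSketch <- function(metadata, sampleCol, variable, individuals, common = "_") {
  parts <- strsplit(as.character(metadata[[sampleCol]]), common, fixed = TRUE)
  id <- vapply(parts, `[`, character(1), 1L)   # text before the common pattern
  # Take the group of the first sample found for each individual
  metadata[[variable]][match(individuals, id)]
}

metadata <- data.frame(
  Sample = c("A_0_1", "A_0_25", "B_0_1", "B_0_25"),
  age = c("youth", "youth", "old", "old")
)
groups <- groupsSketch(metadata, sampleCol = "Sample", variable = "age",
                       individuals = c("B", "A"), common = "_0_")
groups  # "old" "youth"
```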
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
results <- mSpreviz(results = mS, times = times)
metadata <- data.frame(Sample = rownames(clr), age = c(rep("youth", 65), rep("old", 131-65)))
group <- mSmetadataGroups(metadata = metadata, samples = metadata$Sample, common = "_0_", individuals = results$individual, variable = "age")
Process the iterativeClustering() output into a new format ready for the implemented visualization functions.
mSpreviz(results, times)
results | list; output of iterativeClustering(). |
times | list; output of pairedTimes(). |
A data frame ready for its use under the implemented visualization functions and others.
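One way to picture this reshaping is the base R sketch below, which turns a named list of per-individual score vectors into a single data frame with an `individual` column and one column per paired time. The helper name `previzSketch` is hypothetical, and the output layout is inferred from the examples in this manual (`results$individual`, and paired-time names used as column names); the actual mSpreviz() output may differ in detail.

```r
# Hypothetical helper, not part of microSTASIS.
# results: named list; each element is a named numeric vector of scores,
#          with names giving the individual IDs.
previzSketch <- function(results) {
  individuals <- sort(unique(unlist(lapply(results, names))))
  # Align every paired time on the same individuals; missing ones become NA
  scores <- lapply(results, function(v) unname(v[individuals]))
  data.frame(individual = individuals, scores,
             stringsAsFactors = FALSE, check.names = FALSE)
}

toy <- list(
  t1_t25  = c(A = 0.9, B = 0.4),
  t25_t26 = c(A = 0.7, C = 1.0)
)
wide <- previzSketch(toy)
```

Individuals missing from a given paired time are kept with `NA`, so every row of the result corresponds to one individual.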
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
results <- mSpreviz(results = mS, times = times)
Generate one or multiple matrices with paired times.
pairedTimes(data, ...)

## S4 method for signature 'matrix'
pairedTimes(data, sequential, common, specifiedTimePoints)

## S4 method for signature 'TreeSummarizedExperiment'
pairedTimes(
  data,
  sequential,
  assay,
  alternativeExp,
  ID,
  timePoints,
  specifiedTimePoints
)
data |
input object: either a matrix with rownames including ID, common pattern and sampling time, or a TreeSummarizedExperiment object. |
... |
Additional argument list that might not ever be used. |
sequential |
TRUE if paired times to analyse are sequential and present the desired alphanumerical order. |
common |
If is.matrix(data), pattern that separates the ID and the sampling time in rownames. |
specifiedTimePoints |
character vector to specify the selection of concrete paired times. |
assay |
If class(data) == "TreeSummarizedExperiment", name of the assay to use. |
alternativeExp |
If class(data) == "TreeSummarizedExperiment", name of the alternative experiment to use (if applicable). |
ID |
If class(data) == "TreeSummarizedExperiment", one of the colData(data) colnames should be given as individuals. |
timePoints |
If class(data) == "TreeSummarizedExperiment", one of the colData(data) colnames should be given as sampling times. |
A list of matrices with the same number of columns as input and with samples from paired sampling times as rows.
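For the matrix case, the selection of paired samples from rownames can be sketched in base R as follows. The helper name `pairedTimesSketch` is hypothetical and handles a single pair of time points only; it assumes rownames of the form "ID, common pattern, sampling time" and keeps only individuals sampled at both requested times.

```r
# Hypothetical helper, not part of microSTASIS.
# data: matrix with rownames such as "A_0_1"; times: two sampling times.
pairedTimesSketch <- function(data, times, common = "_") {
  parts <- strsplit(rownames(data), common, fixed = TRUE)
  id <- vapply(parts, `[`, character(1), 1L)   # individual ID
  tp <- vapply(parts, `[`, character(1), 2L)   # sampling time
  keep <- tp %in% times
  both <- names(which(table(id[keep]) == 2L))  # individuals with both times
  sel <- keep & id %in% both
  out <- data[sel, , drop = FALSE]
  out[order(id[sel], tp[sel]), , drop = FALSE] # paired rows kept adjacent
}

toy <- matrix(
  seq_len(12), nrow = 6,
  dimnames = list(c("A_0_1", "A_0_25", "B_0_1", "C_0_1", "C_0_25", "C_0_26"),
                  c("ASV1", "ASV2"))
)
paired <- pairedTimesSketch(toy, times = c("1", "25"), common = "_0_")
rownames(paired)  # "A_0_1" "A_0_25" "C_0_1" "C_0_25"
```

Individual B is dropped because it has no sample at time 25, and the sample C_0_26 is dropped because time 26 was not requested.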
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
times_b <- pairedTimes(data = clr, sequential = FALSE, common = "_0_", specifiedTimePoints = c("1", "26"))
Generate boxplots of the stability dynamics throughout sampling times by groups.
plotmSdynamics(results, groups, points = TRUE, linetype = 2)
results | input data.frame resulting from mSpreviz(). |
groups | vector with the same length as individuals, i.e. the number of rows in the mSpreviz() output. |
points |
logical; FALSE to only visualize boxplots or TRUE to also add individual points. |
linetype |
numeric; type of line to connect the median value of paired times; 0 to avoid the line. |
A plot with as many boxes as paired times by group, in the form of a ggplot2::ggplot() object.
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
results <- mSpreviz(results = mS, times = times)
metadata <- data.frame(Sample = rownames(clr), age = c(rep("youth", 65), rep("old", 131-65)))
group <- mSmetadataGroups(metadata = metadata, samples = metadata$Sample, common = "_0_", individuals = results$individual, variable = "age")
plotmSdynamics(results, groups = group, points = TRUE, linetype = 0)
Plot a heatmap of the stability results.
plotmSheatmap(
  results,
  order = NULL,
  times,
  label = FALSE,
  low = "red2",
  mid = "yellow",
  high = "forestgreen",
  midpoint = 0.5
)
results | input data.frame resulting from mSpreviz(). |
order |
NULL object or character: none, mean or median; if the individuals should be sorted by any of those statistics of the stability values. |
times |
character; names of the paired times to plot, i.e. colnames of results. |
label |
logical; TRUE to print the mS score or FALSE to not. |
low |
color for the lowest value. |
mid |
color for the middle value. |
high |
color for the highest values. |
midpoint |
value to situate the middle. |
A heatmap of the stability values in the form of a ggplot2::ggplot() object.
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
results <- mSpreviz(results = mS, times = times)
plotmSheatmap(results = results, order = "mean", times = c("t1_t25", "t25_t26"), label = TRUE)
Plot lines connecting the mS scores obtained from iterativeClusteringCV() for each subset of the original matrix of paired times.
plotmSlinesCV(pairedTime, CVklist, k = 1L, points = TRUE, sizeLine = 0.5)
pairedTime | input matrix with paired times whose stability has been assessed. One of the elements of the list output by pairedTimes(). |
CVklist | list resulting from iterativeClusteringCV(). |
k | integer; number of individuals to subset from the data. The same as used in iterativeClusteringCV(). |
points | logical; if plotting, FALSE to only plot lines and TRUE to add points on the mS score, i.e. the result from iterativeClustering(). |
sizeLine |
numeric; if plotting, size of the multiple lines. |
A line plot in the form of a ggplot2::ggplot() object with the values of stability for the multiple subsets and the original matrix of paired samples (points).
data(clr)
times <- pairedTimes(data = clr[, 1:20], sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
cv_klist_t1_t25_k2 <- iterativeClusteringCV(pairedTimes = times, results = mS, name = "t1_t25", common = "_0_", k = 2L)
plotmSlinesCV(pairedTime = times$t1_t25, CVklist = cv_klist_t1_t25_k2, k = 2L)
Plot a scatter and side boxplot of the stability results.
plotmSscatter(results, order = NULL, times, gridLines = FALSE, sideScale = 0.3)
results | input data.frame resulting from mSpreviz(). |
order |
NULL object or character: mean or median; if the individuals should be sorted by any of those statistics of the stability values. |
times |
a vector with the names of each paired time, e.g. "t1_t2". |
gridLines |
logical; FALSE to print a blank background or TRUE to include a gray grid. |
sideScale |
numeric; scale of the side boxplot. |
A scatter plot and a side boxplot of the stability values in the form of a ggplot2::ggplot() object.
data(clr)
times <- pairedTimes(data = clr, sequential = TRUE, common = "_0_")
mS <- iterativeClustering(pairedTimes = times, common = "_")
results <- mSpreviz(results = mS, times = times)
plotmSscatter(results = results, order = "median", times = c("t1_t25", "t25_t26"), gridLines = TRUE, sideScale = 0.2)