Title: | Time Series Clustering of Gene Expression with Gaussian Mixed-Effects Models and Smoothing Splines |
---|---|
Description: | Implementation of a clustering method for time series gene expression data based on mixed-effects models with Gaussian variables and non-parametric cubic splines estimation. The method can robustly account for the high levels of noise present in typical gene expression time series datasets. |
Authors: | Monica Golumbeanu <[email protected]> |
Maintainer: | Monica Golumbeanu <[email protected]> |
License: | GPL (>=2) |
Version: | 1.29.0 |
Built: | 2024-11-30 05:20:18 UTC |
Source: | https://github.com/bioc/TMixClust |
Description of package TMixClust
TMixClust is a soft-clustering method that employs mixed-effects models with nonparametric smoothing spline fitting and robustly stratifies genes by their complex time series patterns. Besides the main clustering method, the package provides a set of functions that help the user visualise and assess the clustering results and choose the optimal clustering solution.
Monica Golumbeanu, [email protected]
Golumbeanu M, Desfarges S, Hernandez C, Quadroni M, Rato S, Mohammadi P, Telenti A, Beerenwinkel N, Ciuffi A. (2017) Dynamics of Proteo-Transcriptomic Response to HIV-1 Infection.
analyse_stability
Performs multiple clustering runs with TMixClust, analyses the agreement between runs with the Rand index and returns the clustering solution with the largest likelihood. A plot of agreement probability between all the runs and the run with the maximum likelihood is produced.
analyse_stability(time_series_df, time_points = seq_len(ncol(time_series_df)), nb_clusters = 2, em_iter_max = 1000, mc_em_iter_max = 10, em_ll_convergence = 0.001, nb_clustering_runs = 3, nb_cores = 1)
time_series_df |
data frame containing the time series. Each row is a time series comprised of the time series name which is also the row name, and the time series values at each time point. |
time_points |
vector containing numeric values for the time points.
Default: seq_len(ncol(time_series_df)). |
nb_clusters |
desired number of clusters |
em_iter_max |
maximum number of iterations for the expectation-maximization (EM) algorithm. Default: 1000. |
mc_em_iter_max |
maximum number of iterations for Monte-Carlo resampling. Default is 10. |
em_ll_convergence |
convergence threshold for likelihood improvement. Default is 0.001. |
nb_clustering_runs |
number of times the clustering procedure is repeated on the input data. Default is 3. |
nb_cores |
number of cores to be used to run the separate clustering operations in parallel. Default is 1. |
TMixClust object with the highest likelihood. Renders a plot showing the overall distribution of the Rand index, which allows the user to assess clustering stability.
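The Rand index used to compare clustering runs can be sketched in base R as below. This is a minimal illustration of the standard pairwise-agreement definition, not the package's internal code; the helper name `rand_index` and the example assignment vectors are hypothetical.

```r
# Rand index: fraction of item pairs on which two clusterings agree,
# i.e. both put the pair in the same cluster or both separate it.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")   # TRUE where two items share a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where two items share a cluster in b
  pairs <- upper.tri(same_a)    # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

# Hypothetical cluster assignments from three clustering runs
run_1 <- c(1, 1, 2, 2, 3, 3)
run_2 <- c(2, 2, 1, 1, 3, 3)    # same partition as run_1, labels permuted
run_3 <- c(1, 2, 1, 2, 3, 3)    # a genuinely different partition

print(rand_index(run_1, run_2))
print(rand_index(run_1, run_3))
```

Because the index depends only on which pairs are grouped together, it is unaffected by label permutations between runs, which is why it suits comparing independent EM runs.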
# Load the toy time series data provided with the TMixClust package
data(toy_data_df)
# Identify the optimal clustering solution with 3 clusters
best_clust_obj = analyse_stability(toy_data_df, nb_clusters = 3, nb_clustering_runs = 4, nb_cores = 1)
# Plot the time series from each cluster
for (i in seq_len(3)) {
  # Extract the time series in the current cluster and plot them
  c_df = toy_data_df[which(best_clust_obj$em_cluster_assignment == i), ]
  plot_time_series_df(c_df, plot_title = paste("cluster", i))
}
This object contains the result of clustering and stability analysis corresponding to the clustering solution with the highest likelihood among 10 different runs of clustering on the toy data with K=3 clusters.
best_clust_toy_obj
A TMixClust object.
optimal clustering solution for the toy data
# Load the optimal clustering solution for the toy data
# provided with the TMixClust package
data("best_clust_toy_obj")
# Print the first lines of the toy clustering object
head(best_clust_toy_obj)
This object contains the result of clustering and stability analysis corresponding to the clustering solution with the highest likelihood among 10 different runs of clustering on the yeast data with K=4 clusters.
best_clust_yeast_obj
A TMixClust object.
optimal clustering solution for the yeast data
# Load the optimal clustering solution for the yeast data
# provided with the TMixClust package
data("best_clust_yeast_obj")
# Print the first lines of the yeast clustering object
head(best_clust_yeast_obj)
generate_TMixClust_report
generate_TMixClust_report(TMixClust_object, report_folder = paste(getwd(), "/TMixClust_report/", sep = ""), data_color = "#fd8d3c", x_label = "time", y_label = "value")
TMixClust_object |
list object created by the TMixClust function |
report_folder |
full path of the folder where the report files will be saved. Default is TMixClust_report/ folder in current working directory. |
data_color |
color of the time series to be used when generating the cluster plots. Default is orange. |
x_label |
label of the x axis for the cluster plots. Default is "time" |
y_label |
label of the y axis for the cluster plots. Default is "value" |
Produces a series of files containing information about the clustering results and saves them in the provided folder location. The folder contains the following:
log-likelihood.txt - file with the log-likelihood values at each iteration on separate lines
log-likelihood.pdf - plot of the log-likelihood at each iteration
posterior.txt - file with the posterior probabilities of all the time series for each cluster
estimated_curves/ - folder containing a number of files equal to the number of clusters; each file has 4 lines consisting of curve values and their confidence intervals (first 3 lines) for a discrete time grid (last line)
clusters/ - folder containing a plot with the time series in each cluster, a silhouette plot of the clustering configuration, as well as, for each cluster, a file containing the names of the time series in the respective cluster and a file containing the names and time series values for the time series in each cluster
## Not run:
# Load the toy time series data provided with the TMixClust package
data(toy_data_df)
# Cluster the toy data with default parameters
TMixClust_obj = TMixClust(toy_data_df)
# Generate a TMixClust report in the current working directory
generate_TMixClust_report(TMixClust_obj)
## End(Not run)
get_time_series_df
get_time_series_df creates a data frame containing time series data from a file.
get_time_series_df(data_file)
data_file |
path to a tab-delimited text file containing the time series data formatted such that each row contains a time-series represented by its name (e.g. gene name, protein name, etc.) and the values at each time point. |
A data frame containing the time series
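The expected file layout can be illustrated with base R alone. The sketch below writes a small tab-delimited file and reads it back with `read.table`; it is not the package's implementation, the file contents are hypothetical, and the absence of a header line is an assumption.

```r
# Write a small tab-delimited file: the first field is the series name,
# the remaining fields are the values at each time point (hypothetical data).
example_file <- tempfile(fileext = ".txt")
writeLines(c("gene_A\t0.1\t0.5\t0.9",
             "gene_B\t1.2\t0.8\t0.3"),
           example_file)

# Assumption: the file has no header line. The first column becomes the
# row names, so the data frame holds only the numeric time series values.
ts_df <- read.table(example_file, sep = "\t", header = FALSE, row.names = 1)

print(ts_df)
```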
# Load the simulated toy time series data provided with the package
toy_data_file = system.file("extdata", "toy_time_series.txt", package = "TMixClust")
toy_data = get_time_series_df(toy_data_file)
# Print the first lines of the resulting data frame
print(head(toy_data))
get_time_series_df_bio
get_time_series_df_bio creates a data frame with time series data from a Bioconductor Biobase ExpressionSet object.
get_time_series_df_bio(bio_obj)
bio_obj |
Bioconductor Biobase ExpressionSet object. The assayData has to contain a matrix where each row is a gene time series and each column contains the time series values at each time point. The number of columns is equal to the number of time points, while the number of rows is equal to the number of genes. |
A data frame containing the time series
# Load the SOS pathway data from Bioconductor package SPEM
library(SPEM)
data(sos)
sos_data = get_time_series_df_bio(sos)
# Print the first lines of the retrieved time series data frame
print(head(sos_data))
plot_silhouette
plot_silhouette(TMixClust_object, sim_metric = "euclidean", sil_color = "#bdbdbd")
TMixClust_object |
list object created by the TMixClust function |
sim_metric |
character string taking one of the possible values: "euclidean", "gower" or "manhattan". Default is "euclidean". |
sil_color |
color of the bars representing the silhouette widths on the plot |
List object with the following components:
similarity_m - similarity matrix
silh - silhouette object
Renders a plot comprised of a set of barplots with the distributions of silhouette coefficients for the data points in each cluster. Each barplot has indicated on its right hand side the total number of points in the corresponding cluster. The plot also indicates with a dotted line, the overall average silhouette width, whose value is specified at the bottom of the plot.
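The quantities behind such a plot can be sketched with the recommended `cluster` package. This is an illustration under assumptions, not the package's code: that `daisy` and `silhouette` are the functions used internally is a guess based on the accepted `sim_metric` values, and the toy series and cluster labels are hypothetical.

```r
library(cluster)

# Hypothetical data: six short time series forming two tight groups
series <- rbind(c(0.0, 0.0, 0.0),
                c(0.1, 0.0, 0.1),
                c(0.0, 0.1, 0.0),
                c(5.0, 5.0, 5.0),
                c(5.1, 5.0, 5.0),
                c(5.0, 5.1, 5.1))
assignment <- c(1L, 1L, 1L, 2L, 2L, 2L)

# Pairwise dissimilarities; "manhattan" and "gower" are also valid metrics
dissimilarity <- daisy(series, metric = "euclidean")

# One silhouette width per series, each in [-1, 1]
sil <- silhouette(assignment, dist = dissimilarity)
avg_width <- mean(sil[, "sil_width"])

print(avg_width)
```

An average width close to 1 indicates compact, well-separated clusters, while values near 0 or below suggest that many series sit between clusters.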
# Load the TMixClust object associated with the toy time series data
# provided with the TMixClust package
data(best_clust_toy_obj)
# Plot the silhouette for the clustering stored in the toy TMixClust object
plot_silhouette(best_clust_toy_obj)
plot_time_series_df
plot_time_series_df allows the user to visualise the time series from a given data set.
plot_time_series_df(ts_df, time_points = seq_len(ncol(ts_df)), data_color = "#fd8d3c", x_label = "time", y_label = "value", plot_title = "Time series plot")
ts_df |
data frame containing on each row a time-series |
time_points |
vector containing the values of the time points.
Default: seq_len(ncol(ts_df)). |
data_color |
color of the time series to be used for the plot. Default is orange. |
x_label |
label of the x axis of the plot. Default is "time" |
y_label |
label of the y axis of the plot. Default is "value" |
plot_title |
title of the plot. Default is "Time series plot". |
Plots a figure with all the time series in the data set.
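A comparable figure can be drawn with base R graphics. The sketch below is an assumption about what such a plot looks like, not the package's plotting code; the data frame and its values are hypothetical, and only the colour and label defaults are taken from the usage above.

```r
# Hypothetical data: three time series measured at six time points
time_points <- c(0, 2, 4, 8, 16, 24)
ts_df <- data.frame(rbind(gene_A = c(0.1, 0.4, 0.9, 1.3, 1.1, 0.8),
                          gene_B = c(0.2, 0.5, 1.0, 1.2, 1.0, 0.7),
                          gene_C = c(1.5, 1.1, 0.6, 0.3, 0.2, 0.1)))

# matplot draws one line per column, so put the series in columns
series_matrix <- t(as.matrix(ts_df))

matplot(time_points, series_matrix,
        type = "l", lty = 1, col = "#fd8d3c",
        xlab = "time", ylab = "value", main = "Time series plot")
```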
# Load the toy time series data provided with the TMixClust package
data(toy_data_df)
# Plot the time series
plot_time_series_df(toy_data_df)
TMixClust
TMixClust is the central function of the package. It clusters the given time series data into a specified number of clusters.
TMixClust(time_series_df, time_points = seq_len(ncol(time_series_df)), nb_clusters = 2, em_iter_max = 1000, mc_em_iter_max = 10, em_ll_convergence = 0.001)
time_series_df |
data frame containing the time series. Each row is a time series comprised of the time series name which is also the row name, and the time series values at each time point. |
time_points |
vector containing numeric values for the time points.
Default: seq_len(ncol(time_series_df)). |
nb_clusters |
desired number of clusters |
em_iter_max |
maximum number of iterations for the expectation-maximization (EM) algorithm. Default: 1000. |
mc_em_iter_max |
maximum number of iterations for Monte-Carlo resampling. Default is 10. |
em_ll_convergence |
convergence threshold for likelihood improvement. Default is 0.001. |
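The input format described by these arguments can be sketched as follows. The gene names, time points and random values are hypothetical; the commented call at the end uses only the argument names listed above and is left unexecuted because it requires the TMixClust package.

```r
set.seed(42)

# Hypothetical, unevenly spaced measurement times (e.g. hours)
time_points <- c(0, 2, 4, 8, 16, 24)
n_genes <- 5

# One row per time series, one column per time point;
# the series names are stored as row names, not as a column.
time_series_df <- as.data.frame(
  matrix(rnorm(n_genes * length(time_points)), nrow = n_genes)
)
rownames(time_series_df) <- paste0("gene_", seq_len(n_genes))
colnames(time_series_df) <- paste0("t", time_points)

# time_points must supply one numeric value per column
stopifnot(ncol(time_series_df) == length(time_points))

print(time_series_df)

# With TMixClust installed, a call would then look like:
# clust_obj <- TMixClust(time_series_df, time_points = time_points, nb_clusters = 2)
```

Passing `time_points` explicitly matters when measurements are unevenly spaced, since the default `seq_len(ncol(time_series_df))` treats the columns as equally spaced.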
List object with the following attributes:
em_gss_obj_list - object of class gss containing estimated parameters of the mixed-effects model (see package vignette for more details)
em_pi_k - vector containing the mixing coefficients corresponding to each cluster
em_mat_post - matrix containing the posterior values for each time series and cluster
em_cluster_assignment - vector with the cluster assignment for each time series
em_ll - vector containing the log-likelihood values at each iteration of the EM algorithm
ts_data - the same as the input time series data frame
ts_time_points - the same as the input time points vector
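Since the method is a soft clustering, a posterior matrix such as `em_mat_post` carries more information than the hard labels in `em_cluster_assignment`. The sketch below shows how hard labels are commonly derived from posteriors; that the package assigns each series to its highest-posterior cluster is an assumption, and the matrix values are hypothetical.

```r
# Hypothetical posterior matrix: one row per time series, one column per cluster
posterior <- rbind(gene_A = c(0.90, 0.10),
                   gene_B = c(0.20, 0.80),
                   gene_C = c(0.55, 0.45))

# Each row is a probability distribution over clusters
stopifnot(all(abs(rowSums(posterior) - 1) < 1e-12))

# Hard assignment: index of the cluster with the largest posterior per row
hard_assignment <- max.col(posterior, ties.method = "first")
names(hard_assignment) <- rownames(posterior)

# The maximum posterior shows how confident each assignment is
confidence <- apply(posterior, 1, max)

print(hard_assignment)
print(confidence)
```

Here the third series is assigned to cluster 1 with a posterior of only 0.55, the kind of borderline case that hard labels alone would hide.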
# Load the toy time series data provided with the TMixClust package
data(toy_data_df)
# Cluster the toy data with default parameters
TMixClust_obj = TMixClust(toy_data_df)
This data set contains a toy example of time-series gene expression data.
toy_data_df
A data frame with 91 rows and 6 columns.
The columns correspond to different time points, while the rownames of the
data frame correspond to gene names.
toy data
# Load the toy time series data provided with the TMixClust package
data("toy_data_df")
# Print the first lines of the toy data frame
head(toy_data_df)