Title: | Clustering of Time Series Gene Expression data |
---|---|
Description: | Methodology for supervised clustering of potentially many predictor variables, such as genes, in time series datasets. Provides functions that help the user assign genes to a predefined set of model profiles. |
Authors: | Michal Sharabi-Schwager [aut, cre], Ron Ophir [aut] |
Maintainer: | Michal Sharabi-Schwager <[email protected]> |
License: | GPL-2 |
Version: | 1.33.0 |
Built: | 2024-12-07 05:57:34 UTC |
Source: | https://github.com/bioc/ctsGE |
Clusters each index that was predefined by PreparingTheIndexes, using kmeans.
ClustIndexes(x, scaling = TRUE)
x |
list of expression data and their indexes after running PreparingTheIndexes |
scaling |
Boolean; whether the data should be standardized before clustering. Default = TRUE |
The clustering is done with K-means. To choose an optimal k for K-means clustering, the Elbow method is applied. This method looks at the percentage of variance explained as a function of the number of clusters: the chosen number of clusters should be such that adding another cluster does not give much better modeling of the data. First, the ratio of the within-cluster sum of squares (WSS) to the total sum of squares (TSS) is computed for different values of k (i.e., 1, 2, 3 ...). The WSS, also known as the sum of squared errors (SSE), decreases as k gets larger. The Elbow method chooses the k at which the SSE decreases abruptly. This happens when the computed WSS-to-TSS ratio first drops below 0.2.
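The rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's internal code: the helper name `elbow_k`, the `k_max` bound and the toy data are hypothetical, and the 0.2 threshold is taken from the description above.

```r
# Hypothetical helper (not part of ctsGE): return the first k whose
# WSS-to-TSS ratio drops below the threshold.
elbow_k <- function(m, k_max = 10, threshold = 0.2) {
  # kmeans needs fewer centers than distinct rows
  k_max <- min(k_max, nrow(unique(m)) - 1)
  for (k in seq_len(k_max)) {
    fit <- kmeans(m, centers = k, nstart = 10)
    # tot.withinss is the WSS, totss is the TSS
    if (fit$tot.withinss / fit$totss < threshold) {
      return(k)
    }
  }
  k_max  # fall back to the largest k tried
}

# Toy data (assumption): two well-separated groups of 20 points each
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))
k_toy <- elbow_k(toy)
```

For k = 1 the ratio is always 1, so the search effectively starts at k = 2.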
Running kmeans and calculating the optimal k for each one of the indexes in the data could take a long time. To shorten the procedure, the user can skip this step altogether and directly view a specific index and its clusters by running either the PlotIndexesClust or the ctsGEShinyApp function.
By default the data are standardized before clustering; to cluster the raw counts, set the scaling parameter to FALSE.
A list object is returned as output, with the clustered indexes table in object$ClusteredIdxTable and the number of clusters for each index in object$optimalK.
data_dir <- system.file("extdata", package = "ctsGE")
files <- dir(path = data_dir, pattern = "\\.xls$")
rts <- readTSGE(files, path = data_dir,
                labels = c("0h", "6h", "12h", "24h", "48h", "72h"),
                skip = 10625)
prts <- PreparingTheIndexes(rts)
tsCI <- ClustIndexes(prts)
head(tsCI$ClusteredIdxTable) # the table with the clustered indexes
head(tsCI$optimalK) # the table with the number of clusters for each index
Produces and launches a Shiny app for interactive exploration of gene expression data. For more information about Shiny apps, see http://shiny.rstudio.com/
ctsGEShinyApp(rts, min_cutoff = 0.5, max_cutoff = 0.7, mad.scale = TRUE, title = NULL)
rts |
list of expression data created by readTSGE |
min_cutoff |
A numeric; the lower limit of the range searched for the optimal cutoff for the data. Defaults to 0.5. See PreparingTheIndexes. |
max_cutoff |
A numeric; the upper limit of the range searched for the optimal cutoff for the data. Defaults to 0.7. See PreparingTheIndexes. |
mad.scale |
A boolean selecting the scaling method. TRUE (default): median-based scaling. FALSE: mean-based scaling. |
title |
Character; the title shown in the header panel. Defaults to NULL. |
The 'ctsGEShinyApp' function takes the ctsGE object and opens an html page as a GUI. On the web page, the user chooses the profile to visualize and the number of clusters (k parameter for K-means) to show. The line graph of the profile separated into the clusters will show in the main panel, and a list of the genes and their expressions will also be available. The tables and figures can be downloaded.
Creates a shiny application and opens a shinyapp.io web page
shiny::ShinyApp
## Not run:
data_dir <- system.file("extdata", package = "ctsGE")
files <- dir(path = data_dir, pattern = "\\.xls$")
rts <- readTSGE(files, path = data_dir,
                labels = c("0h", "6h", "12h", "24h", "48h", "72h"))
ctsGEShinyApp(rts)
## End(Not run)
Takes a numeric vector and returns an expression index (i.e., a sequence of 1, -1, and 0).
index(x, cutoff)
x |
A numeric |
cutoff |
A numeric; determines the threshold for indexing |
The function defines limits around the center (median or mean) at +/- the cutoff value, in median absolute deviation (MAD) or standard deviation (SD) units, respectively. The user-defined cutoff parameter determines the limits around the gene-expression center. The function then calculates the index value at each time point according to:
0: standardized value is within the limits (+/- cutoff)
1: standardized value exceeds the upper limit (+ cutoff)
-1: standardized value exceeds the lower limit (- cutoff)
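The three rules above amount to thresholding the standardized values. A minimal sketch, assuming the input has already been standardized; `index_sketch` is a hypothetical name used here for illustration and is not the package's `index` implementation:

```r
# Hypothetical re-implementation of the indexing rule (illustration only).
# x: numeric vector of standardized expression values
# cutoff: positive numeric threshold
index_sketch <- function(x, cutoff) {
  ifelse(x > cutoff, 1,          # above the upper limit
         ifelse(x < -cutoff, -1, # below the lower limit
                0))              # within the limits
}

idx_example <- index_sketch(c(-1.2, 0.1, 0.9), cutoff = 0.5)
idx_example
```

A larger cutoff widens the band around the center, so more time points are indexed as 0.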
Gene expression index
rawCounts <- c(103.5, 75.1, 97.3, 27.12, 34.83, 35.53, 40.59, 30.84, 16.39, 29.29)
(sCounts <- scale(rawCounts)[, 1]) # standardized, mean-based scaling
cutoff <- seq(0.2, 2, 0.1) # different cutoffs produce different indexes
for (i in cutoff) {
  print(index(sCounts, i))
}
The function generates graphs and tables of a specific index and its clusters. The user decides whether to supply the k or let the function calculate the k for the selected index
PlotIndexesClust(x, idx, k = NULL, scaling = TRUE)
x |
list of expression data and their indexes after running PreparingTheIndexes |
idx |
A character, the index to plot (e.g., for 8 time points "11100-1-1-1") |
k |
A numeric; the number of clusters. If not given, the function calculates the optimal k for the index. |
scaling |
A boolean; whether the data should be standardized before clustering with K-means. Defaults to TRUE. |
A list with two objects:
Table of a specific index and its clusters
Gene expression pattern graphs for each one of the clusters
data_dir <- system.file("extdata", package = "ctsGE")
files <- dir(path = data_dir, pattern = "\\.xls$")
rts <- readTSGE(files, path = data_dir,
                labels = c("0h", "6h", "12h", "24h", "48h", "72h"),
                skip = 10625)
prts <- PreparingTheIndexes(rts)
pp <- PlotIndexesClust(prts, idx = "00101-1")
pp$graphs # plots the line graphs
Reads the gene expression table and returns an expression index for each gene.
PreparingTheIndexes(x, min_cutoff = 0.5, max_cutoff = 0.7, mad.scale = TRUE)
x |
list of expression data created by readTSGE |
min_cutoff |
A numeric; the lower limit of the range searched for the optimal cutoff for the data. Defaults to 0.5. See Details. |
max_cutoff |
A numeric; the upper limit of the range searched for the optimal cutoff for the data. Defaults to 0.7. See Details. |
mad.scale |
A boolean selecting the scaling method. TRUE (default): median-based scaling. FALSE: mean-based scaling. |
1. First, the expression matrix is standardized. The function default standardizing method is a median-based scaling; alternatively, a mean-based scaling can be used. The new scaled values represent the distance of each gene at a certain time point from its center, median or mean, in median absolute deviation (MAD) units or standard deviation (SD) units, respectively.
2. The function computes the cutoff value. Because the clustering will be performed on small gene groups, an optimal cutoff value is one that minimizes the number of genes in each group, i.e., generates index groups of equal size. A chi-squared value is generated for each candidate cutoff (from min_cutoff to max_cutoff in increments of 0.05), and the cutoff that generates the lowest chi-squared value is chosen.
3. Next, the standardized values are converted to index values that indicate whether gene expression is above, below or within the limits around the center of the time series, i.e., **1 / -1 / 0**, respectively. The cutoff parameter determines the limits around the gene-expression center. Then the function calculates the index value at each time point according to:
0: standardized value is within the limits (+/- cutoff)
1: standardized value exceeds the upper limit (+ cutoff)
-1: standardized value exceeds the lower limit (- cutoff)
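The three steps above can be sketched on a toy matrix as follows. This is an illustration under several assumptions rather than the package's actual code: the helper names (`mad_scale`, `to_index`, `uniformity_chisq`) are hypothetical, `mad()` is used with its default scaling constant, and the chi-squared statistic is computed only over the index groups that actually occur, compared against equal group sizes.

```r
# Toy expression matrix (assumption): 50 genes x 6 time points
set.seed(1)
expr <- matrix(rnorm(50 * 6), nrow = 50)

# Step 1: median-based scaling, row by row (distance from the median in MAD units)
mad_scale <- function(m) {
  (m - apply(m, 1, median)) / apply(m, 1, mad)
}

# Step 3 rule: 1 above +cutoff, -1 below -cutoff, 0 otherwise
to_index <- function(s, cutoff) {
  (s > cutoff) - (s < -cutoff)
}

# Step 2: chi-squared statistic of the observed index-group sizes
# against groups of equal size (lower means more even groups)
uniformity_chisq <- function(s, cutoff) {
  profiles <- apply(to_index(s, cutoff), 1, paste, collapse = "")
  sizes <- as.numeric(table(profiles))
  expected <- mean(sizes)
  sum((sizes - expected)^2 / expected)
}

scaled <- mad_scale(expr)
cutoffs <- seq(0.5, 0.7, by = 0.05)  # min_cutoff to max_cutoff in steps of 0.05
scores <- vapply(cutoffs, function(cf) uniformity_chisq(scaled, cf), numeric(1))
best_cutoff <- cutoffs[which.min(scores)]
index_table <- to_index(scaled, best_cutoff)
```

Rows whose MAD is zero would produce non-finite scaled values, which is why such rows need to be removed before this step.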
A list object is returned as output, with the standardization table in object$scaled and the indexes table in object$index.
data_dir <- system.file("extdata", package = "ctsGE")
files <- dir(path = data_dir, pattern = "\\.xls$")
rts <- readTSGE(files, path = data_dir,
                labels = c("0h", "6h", "12h", "24h", "48h", "72h"),
                skip = 10625)
prts <- PreparingTheIndexes(rts)
prts$cutoff # the optimal cutoff
Reads and merges a set of text files containing normalized gene expression data
readTSGE(files, path = NULL, columns = c(1, 2), labels = NULL, desc = NULL, ...)
files |
character vector of filenames, or alternatively a named list of tables for each time point. |
path |
character string giving the directory containing the files. The default is the current working directory. |
columns |
numeric vector stating which two columns contain the tag names and counts, respectively |
labels |
character vector giving short names to associate with the libraries. Defaults to the file names. |
desc |
character vector with gene descriptions (annotation). Defaults to NULL. |
... |
other arguments are passed to read.delim |
As input, the ctsGE package expects a normalized expression table, where rows are genes and columns are samples. Each file is assumed to contain digital gene expression data for one sample (or library), with transcript or gene identifiers in the first column and expression values in the second column. Transcript identifiers are assumed to be unique and not repeated in any one file. By default, the files are assumed to be tab-delimited and to contain column headings. The function forms the union of all transcripts and creates one big table with zeros where necessary. When reading the normalized expression values, the function checks whether there are rows whose median absolute deviation (MAD) equals zero and removes these rows. This step is required for the subsequent indexing of the data. The function outputs a message stating how many genes were removed.
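The merging and filtering behavior described above can be illustrated with small in-memory tables. This is a sketch under assumptions, not the code of readTSGE: the toy data, the column names `tag` and `value`, and the variable names are all hypothetical.

```r
# Toy per-sample tables (assumption): identifiers in column 1, values in column 2
samples <- list(
  "0h"  = data.frame(tag = c("g1", "g2", "g3"), value = c(5, 0, 2),
                     stringsAsFactors = FALSE),
  "6h"  = data.frame(tag = c("g2", "g3", "g4"), value = c(0, 7, 1),
                     stringsAsFactors = FALSE),
  "12h" = data.frame(tag = c("g1", "g3"), value = c(9, 4),
                     stringsAsFactors = FALSE)
)

# Union of all identifiers; cells stay 0 where a sample lacks the gene
all_tags <- sort(unique(unlist(lapply(samples, function(s) s$tag))))
merged <- matrix(0, nrow = length(all_tags), ncol = length(samples),
                 dimnames = list(all_tags, names(samples)))
for (nm in names(samples)) {
  s <- samples[[nm]]
  merged[s$tag, nm] <- s$value
}

# Drop rows whose median absolute deviation is zero and report the count
keep <- apply(merged, 1, mad) != 0
message(sum(!keep), " genes were removed")
filtered <- merged[keep, , drop = FALSE]
```

In this toy input, g2 is zero everywhere and g4 is non-zero at only one of three time points, so both have a MAD of zero and are dropped.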
A list with four objects:
expression matrix
samples names
tags - genes name
timePoints - number of time points
## Read all .xls files from the package's extdata directory
data_dir <- system.file("extdata", package = "ctsGE")
files <- dir(path = data_dir, pattern = "\\.xls$")
# reading only 2000 genes
rts <- readTSGE(files, path = data_dir,
                labels = c("0h", "6h", "12h", "24h", "48h", "72h"),
                skip = 10625)