Title: | Clustering of high-throughput sequencing data by identifying co-expression patterns |
---|---|
Description: | Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples. |
Authors: | Thomas J. Hardcastle [aut], Irene Papatheodorou [aut], Samuel Granjeaud [cre] |
Maintainer: | Samuel Granjeaud <[email protected]> |
License: | GPL-3 |
Version: | 1.31.0 |
Built: | 2024-11-25 06:12:33 UTC |
Source: | https://github.com/bioc/clusterSeq |
Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.
The DESCRIPTION file:
Package: | clusterSeq |
Type: | Package |
Title: | Clustering of high-throughput sequencing data by identifying co-expression patterns |
Version: | 1.31.0 |
Depends: | R (>= 3.0.0), methods, BiocParallel, baySeq, graphics, stats, utils |
Imports: | BiocGenerics |
Suggests: | BiocStyle |
Date: | 2016-01-19 |
Authors@R: | c(person("Thomas J.", "Hardcastle", role = "aut"), person("Irene", "Papatheodorou", role = "aut"), person("Samuel", "Granjeaud", role = "cre", email = "[email protected]", comment=c(ORCID="0000-0001-9245-1535"))) |
Description: | Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples. |
License: | GPL-3 |
LazyLoad: | yes |
biocViews: | Sequencing, DifferentialExpression, MultipleComparison, Clustering, GeneExpression |
URL: | https://github.com/samgg/clusterSeq |
BugReports: | https://github.com/samgg/clusterSeq/issues |
Config/pak/sysreqs: | libssl-dev |
Repository: | https://bioc.r-universe.dev |
RemoteUrl: | https://github.com/bioc/clusterSeq |
RemoteRef: | HEAD |
RemoteSha: | 72a65eb492771b58555ae0ec1a5eb97f14b66fc4 |
Author: | Thomas J. Hardcastle [aut], Irene Papatheodorou [aut], Samuel Granjeaud [cre] (<https://orcid.org/0000-0001-9245-1535>) |
Maintainer: | Samuel Granjeaud <[email protected]> |
Index of help topics:
associatePosteriors Associates posterior likelihood to generate co-expression dissimilarities between genes cD.ratThymus Data from female rat thymus tissue taken from the Rat BodyMap project (Yu et al, 2014) and processed by baySeq. clusterSeq-package Clustering of high-throughput sequencing data by identifying co-expression patterns kCluster Constructs co-expression dissimilarities from k-means analyses. makeClusters Creates clusters from a co-expression minimal linkage data.frame. makeClustersFF Creates clusters from a file containing a full dissimilarity matrix. plotCluster Plots data from clusterings. ratThymus Data from female rat thymus tissue taken from the Rat BodyMap project (Yu et al, 2014). wallace Computes Wallace scores comparing two clustering methods.
Thomas J. Hardcastle [aut], Irene Papatheodorou [aut], Samuel Granjeaud [cre] (<https://orcid.org/0000-0001-9245-1535>)
Maintainer: Samuel Granjeaud <[email protected]>
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1) # or using likelihood data from a Bayesian analysis of the data # load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus, threshold = 0.5) # plot first six clusters par(mfrow = c(2,3)) plotCluster(sX[1:6], cD.ratThymus)
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1) # or using likelihood data from a Bayesian analysis of the data # load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus, threshold = 0.5) # plot first six clusters par(mfrow = c(2,3)) plotCluster(sX[1:6], cD.ratThymus)
This function aims to find pairwise dissimilarities between genes. It does this by comparing the posterior likelihoods of patterns of differential expression for each gene, and estimating the likelihood that the two genes are not equivalently expressed.
associatePosteriors(cD, maxsize = 250000, matrixFile = NULL)
associatePosteriors(cD, maxsize = 250000, matrixFile = NULL)
cD |
A |
maxsize |
The maximum size (in MB) to use when partitioning the data. |
matrixFile |
If given, a file to write the complete (gzipped) matrix of pairwise distances between genes. Defaults to NULL. |
In comparing two genes, we find all patterns of expression considered
in the '@groups' slot of the 'cD' (countData
)
object for which the expression of the two genes can be considered
monotonic. We then subtract the sum the posterior likelihods of these
patterns of expression from 1 to define a likelihood of dissimilarity
between the two genes.
A data.frame which for each gene defines its nearest neighbour of higher row index and the dissimilarity with that neighbour.
Thomas J. Hardcastle
makeClusters
makeClustersFF
kCluster
# load in analysed countData (baySeq package) object library(baySeq) data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,])
# load in analysed countData (baySeq package) object library(baySeq) data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,])
This data set is a countData
object
for 17230 genes from 16 samples of female rat thymus tissue. The
tissues are extracted from four different age groups (2, 6, 21 and 104
week) with four replicates at each age. Posterior likelihoods for the
15 possible patterns of differential expression have been
precalculated using the baySeq-package
functions.
data(cD.ratThymus)
data(cD.ratThymus)
A countData
object
A countData
object
Illumina sequencing.
Yu Y. et al. A rat RNA-Seq transcriptomic BodyMap across 11 organs and 4 developmental stages. Nature Communications (2014)
This function aims to find pairwise distances between genes. It does this by constructing k-means clusterings of the observed (log) expression for each gene, and for each pair of genes, finding the maximum value of k for which the centroids of the clusters are monotonic between the genes.
kCluster(cD, maxK = 100, matrixFile = NULL, replicates = NULL, algorithm = "Lloyd", B = 1000, sdm = 1)
kCluster(cD, maxK = 100, matrixFile = NULL, replicates = NULL, algorithm = "Lloyd", B = 1000, sdm = 1)
cD |
A |
maxK |
The maximum value of k for which k-means clustering will be performed. Defaults to 100. |
matrixFile |
If given, a file to write the complete (gzipped) matrix of pairwise distances between genes. Defaults to NULL. |
replicates |
If given, a factor or vector that can be cast to a factor that defines the replicate structure of the data. See Details. |
algorithm |
The algorithm to be used by the kmeans function. |
B |
Number of iterations of bootstrapping algorithm used to establish clustering validity |
sdm |
Thresholding parameter for validity; see Details. |
In comparing two genes, we find the maximum value of k for which separate k-means clusterings of the two genes lead to a monotonic relationship between the centroids of the clusters. For this value of k, the maximum difference between expression levels observed within a cluster of either gene is reported as a measure of the dissimilarity between the two genes.
There is a potential issue in that for genes non-differentially expressed across all samples (i.e., the appropriate value of k is 1), there will nevertheless exist clusterings for k > 1. For some arrangements of data, this leads to misattribution of non-differentially expressed genes. We identify these cases by adapting Tibshirani's gap statistic; bootstrapping uniformly distributed data on the same range as the observed data, calculating the dissimilarity score as above, and finding those cases for which the gap between the bootstrapped mean dissimilarity and the observed dissimilarity for k = 1 exceeds that for k = 2 by more than some multiple (sdm) of the standard error of the bootstrapped dissimilarities of k = 2. These cases are forced to be treated as non-differentially expressed by discarding all dissimilarity data for k > 1.
If the replicates vector is given, or if the replicates slot of a
countData
given as the 'cD' variable is
complete, then the k-means clustering will be done on the median of
the expression values of each replicate group. Dissimilarity calculations
will still be made on the full data.
A data.frame which for each gene defines its nearest neighbour of higher row index and the dissimilarity with that neighbour.
Thomas J. Hardcastle
makeClusters
makeClustersFF
associatePosteriors
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) head(kClust) # Alternatively, run on a count data object: # load in analysed countData (baySeq package) object library(baySeq) data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set kClust2 <- kCluster(cD.ratThymus[1:1000,]) head(kClust2)
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) head(kClust) # Alternatively, run on a count data object: # load in analysed countData (baySeq package) object library(baySeq) data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set kClust2 <- kCluster(cD.ratThymus[1:1000,]) head(kClust2)
This function uses minimal linkage data to perform rapid clustering by
singleton agglomeration (i.e., a gene will always cluster with its
nearest neighbours provided the distance to those neighbours does not
exceed some threshold). For alternative (but slower) clustering options,
see the makeClustersFF
function.
makeClusters(aM, cD, threshold = 0.5)
makeClusters(aM, cD, threshold = 0.5)
aM |
A data frame constructed by |
cD |
The data given as input to |
threshold |
A threshold on the maximum dissimilarity at which two genes can cluster. Defaults to 0.5. |
An IntegerList object, each member of whom defines a cluster of co-expressed genes. The object is ordered decreasingly by the size of each cluster.
Thomas J Hardcastle
makeClustersFF
kCluster
associatePosteriors
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1)
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1)
This function uses the complete pairwise dissimilarity scores to construct a hierarchical clustering of the genes.
makeClustersFF(file, method = "complete", cut.height = 5)
makeClustersFF(file, method = "complete", cut.height = 5)
file |
Filename containing the dissimilarity data. |
method |
Method to use in |
cut.height |
Cut height to use in |
An IntegerList object containing the clusters derived from a cut hierarchical clustering.
Thomas J Hardcastle
makeClusters
kCluster
associatePosteriors
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. For speed, one thousand bootstraps are # used, but higher values should be used in real analyses. # Write full dissimilarity matrix to file "kclust.gz" normRT <- normRT[1:1000,] kClust <- kCluster(normRT, B = 1000, matrixFile = "kclust.gz") # make the clusters from these data. mkClustR <- makeClustersFF("kclust.gz") # no need to clean up (specific to Bioconductor pipeline) file.remove("kclust.gz")
#Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. For speed, one thousand bootstraps are # used, but higher values should be used in real analyses. # Write full dissimilarity matrix to file "kclust.gz" normRT <- normRT[1:1000,] kClust <- kCluster(normRT, B = 1000, matrixFile = "kclust.gz") # make the clusters from these data. mkClustR <- makeClustersFF("kclust.gz") # no need to clean up (specific to Bioconductor pipeline) file.remove("kclust.gz")
Given clusterings and expression data, plots representative expression data for each clustering.
plotCluster(cluster, cD, sampleSize = 1000)
plotCluster(cluster, cD, sampleSize = 1000)
cluster |
A list object defining the clusters, produced by |
cD |
The data object used to produce the clusters. |
sampleSize |
The maximum number of genes that will be ploted. |
Expression data are normalised and rescaled before plotting.
Plotting function.
Thomas J Hardcastle
# load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus, threshold = 0.5) # plot first six clusters par(mfrow = c(2,3)) plotCluster(sX[1:6], cD.ratThymus)
# load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus, threshold = 0.5) # plot first six clusters par(mfrow = c(2,3)) plotCluster(sX[1:6], cD.ratThymus)
This data set is a matrix ('mobData') of raw count data acquired for 17230 genes from 16 samples of female rat thymus tissue. The tissues are extracted from four different age groups (2, 6, 21 and 104 week) with four replicates at each age. Gene annotation is given in the rownames of the matrix.
data(ratThymus)
data(ratThymus)
A matrix of RNA-Seq counts in which each of the sixteen columns represents a sample, and each row a gene locus.
A matrix
Illumina sequencing.
Yu Y. et al. A rat RNA-Seq transcriptomic BodyMap across 11 organs and 4 developmental stages. Nature Communications (2014)
Given two clusterings A and B we can calculate the likelihood that two elements are in the same cluster in B given that they are in the same cluster in A, and vice versa.
wallace(v1, v2)
wallace(v1, v2)
v1 |
SimpleIntegerList object (output from makeClusters or makeClustersFF). |
v2 |
SimpleIntegerList object (output from makeClusters or makeClustersFF). |
Vector of length 2 giving conditional likelihoods.
Thomas J. Hardcastle
# using likelihood data from a Bayesian analysis of the data # load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus[1:1000,], threshold = 0.5) # or using k-means clustering on raw count data #Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT, replicates = cD.ratThymus@replicates) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1) # compare clusterings wallace(sX, mkClust)
# using likelihood data from a Bayesian analysis of the data # load in analysed countData object data(cD.ratThymus, package = "clusterSeq") # estimate likelihoods of dissimilarity on reduced set aM <- associatePosteriors(cD.ratThymus[1:1000,]) # make clusters from dissimilarity data sX <- makeClusters(aM, cD.ratThymus[1:1000,], threshold = 0.5) # or using k-means clustering on raw count data #Load in the processed data of observed read counts at each gene for each sample. data(ratThymus, package = "clusterSeq") # Library scaling factors are acquired here using the getLibsizes # function from the baySeq package. libsizes <- getLibsizes(data = ratThymus) # Adjust the data to remove zeros and rescale by the library scaling # factors. Convert to log scale. ratThymus[ratThymus == 0] <- 1 normRT <- log2(t(t(ratThymus / libsizes)) * mean(libsizes)) # run kCluster on reduced set. normRT <- normRT[1:1000,] kClust <- kCluster(normRT, replicates = cD.ratThymus@replicates) # make the clusters from these data. mkClust <- makeClusters(kClust, normRT, threshold = 1) # compare clusterings wallace(sX, mkClust)