Title: | Simplify Functional Enrichment Results |
---|---|
Description: | A new clustering algorithm, "binary cut", for clustering similarity matrices of functional terms is implemeted in this package. It also provides functions for visualizing, summarizing and comparing the clusterings. |
Authors: | Zuguang Gu [aut, cre] |
Maintainer: | Zuguang Gu <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.1.0 |
Built: | 2024-12-30 04:37:44 UTC |
Source: | https://github.com/bioc/simplifyEnrichment |
Word cloud annotations
anno_word_cloud( align_to, term, exclude_words = NULL, max_words = 10, word_cloud_grob_param = list(), fontsize_range = c(4, 16), value_range = NULL, bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA"), side = c("right", "left"), add_new_line = FALSE, count_words_param = list(), ..., return_gbl = FALSE )
anno_word_cloud( align_to, term, exclude_words = NULL, max_words = 10, word_cloud_grob_param = list(), fontsize_range = c(4, 16), value_range = NULL, bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA"), side = c("right", "left"), add_new_line = FALSE, count_words_param = list(), ..., return_gbl = FALSE )
align_to |
How to align the annotations to the heatmap. Similar as in |
term |
The description text used for constructing the word clouds. The value should have the same format as |
exclude_words |
The words excluced for construcing word cloud. |
max_words |
Maximal number of words visualized in the word cloud. |
word_cloud_grob_param |
A list of graphics parameters passed to |
fontsize_range |
The range of the font size. The value should be a numeric vector with length two. The font size interpolation is linear. |
value_range |
The range of values to map to font sizes. |
bg_gp |
Graphics parameters for controlling the background. |
side |
Side of the annotation relative to the heatmap. |
add_new_line |
Whether to add new line after every word? If |
count_words_param |
A list of parameters passed to |
... |
Other parameters. |
return_gbl |
Internally used. |
The word cloud annotation is constructed by ComplexHeatmap::anno_link
.
If the annotation is failed to construct or no keyword is found, the function returns a ComplexHeatmap::anno_empty
with 1px width.
English stop words, punctuation and numbers are removed by default when counting words. As specific stop words might
coincide with gene or pathway names, and numbers in genes names might be meaningful it is recommended to adjust this
behaviour by passing appropriate arguments to the count_words
function using count_words_param
.
gm = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) go_id = rownames(gm) go_term = AnnotationDbi::select(GO.db::GO.db, keys = go_id, columns = "TERM")$TERM split = sample(letters[1:4], 100, replace = TRUE) align_to = split(1:100, split) term = lapply(letters[1:4], function(x) sample(go_term, sample(100:400, 1))) names(term) = letters[1:4] require(ComplexHeatmap) mat = matrix(rnorm(100*10), nrow = 100) Heatmap(mat, cluster_rows = FALSE, row_split = split, right_annotation = rowAnnotation(foo = anno_word_cloud(align_to, term)))
gm = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) go_id = rownames(gm) go_term = AnnotationDbi::select(GO.db::GO.db, keys = go_id, columns = "TERM")$TERM split = sample(letters[1:4], 100, replace = TRUE) align_to = split(1:100, split) term = lapply(letters[1:4], function(x) sample(go_term, sample(100:400, 1))) names(term) = letters[1:4] require(ComplexHeatmap) mat = matrix(rnorm(100*10), nrow = 100) Heatmap(mat, cluster_rows = FALSE, row_split = split, right_annotation = rowAnnotation(foo = anno_word_cloud(align_to, term)))
Word cloud annotations from GO
anno_word_cloud_from_GO( align_to, go_id, stat = c("pvalue", "count"), min_stat = ifelse(stat == "count", 5, 0.05), term = NULL, exclude_words = NULL, ... )
anno_word_cloud_from_GO( align_to, go_id, stat = c("pvalue", "count"), min_stat = ifelse(stat == "count", 5, 0.05), term = NULL, exclude_words = NULL, ... )
align_to |
The same format as in |
go_id |
The value should be in the same format as |
stat |
What type of value to map to font sizes of the keywords. There are two possible values. "pvalue": enrichment is applied to keywords and -log10(p-value) is used to map to font size; "count": simply word frequency of keywords. |
min_stat |
Minimal value for |
term |
Alternatively the GO description can be set via the |
exclude_words |
The words excluced for construcing word cloud. Some words are internally exclucded: |
... |
All other arguments passed to |
Area above the eCDF curve
area_above_ecdf(x)
area_above_ecdf(x)
x |
A vector of similarity values. |
Denote F(x)
as the eCDF (empirical Cumulative Distribution Function) of the similarity vector x
,
this function calculates the area above the eCDF curve, which is 1 - \int_0^1 F(x)dx
.
A numeric value.
Cluster terms based on their similarity matrix
cluster_terms( mat, method = "binary_cut", control = list(), verbose = se_opt$verbose ) cluster_by_kmeans(mat, max_k = max(2, min(round(nrow(mat)/5), 100)), ...) cluster_by_pam(mat, max_k = max(2, min(round(nrow(mat)/10), 100)), ...) cluster_by_dynamicTreeCut(mat, minClusterSize = 5, ...) cluster_by_fast_greedy(mat, ...) cluster_by_leading_eigen(mat, ...) cluster_by_louvain(mat, ...) cluster_by_walktrap(mat, ...) cluster_by_mclust(mat, G = seq_len(max(2, min(round(nrow(mat)/5), 100))), ...) cluster_by_apcluster(mat, s = apcluster::negDistMat(r = 2), ...) cluster_by_hdbscan(mat, minPts = 5, ...) cluster_by_MCL(mat, addLoops = TRUE, ...)
cluster_terms( mat, method = "binary_cut", control = list(), verbose = se_opt$verbose ) cluster_by_kmeans(mat, max_k = max(2, min(round(nrow(mat)/5), 100)), ...) cluster_by_pam(mat, max_k = max(2, min(round(nrow(mat)/10), 100)), ...) cluster_by_dynamicTreeCut(mat, minClusterSize = 5, ...) cluster_by_fast_greedy(mat, ...) cluster_by_leading_eigen(mat, ...) cluster_by_louvain(mat, ...) cluster_by_walktrap(mat, ...) cluster_by_mclust(mat, G = seq_len(max(2, min(round(nrow(mat)/5), 100))), ...) cluster_by_apcluster(mat, s = apcluster::negDistMat(r = 2), ...) cluster_by_hdbscan(mat, minPts = 5, ...) cluster_by_MCL(mat, addLoops = TRUE, ...)
mat |
A similarity matrix. |
method |
The clustering methods. Value should be in |
control |
A list of parameters passed to the corresponding clustering function. |
verbose |
Whether to print messages. |
max_k |
Maximal k for k-means/PAM clustering. K-means/PAM clustering is applied from k = 2 to k = max_k. |
... |
Other arguments. |
minClusterSize |
Minimal number of objects in a cluster. Pass to |
G |
Passed to the |
s |
Passed to the |
minPts |
Passed to the |
addLoops |
Passed to the |
New clustering methods can be registered by register_clustering_methods()
.
Please note it is better to directly use cluster_terms()
for clustering while not the individual cluster_by_*
functions
because cluster_terms()
does additional cluster label adjustment.
By default, there are the following clustering methods and corresponding clustering functions:
kmeans
see cluster_by_kmeans()
.
dynamicTreeCut
see cluster_by_dynamicTreeCut()
.
mclust
see cluster_by_mclust()
.
apcluster
see cluster_by_apcluster()
.
hdbscan
see cluster_by_hdbscan()
.
fast_greedy
see cluster_by_fast_greedy()
.
louvain
see cluster_by_louvain()
.
walktrap
see cluster_by_walktrap()
.
MCL
see cluster_by_MCL()
.
binary_cut
see binary_cut()
.
The additional argument in individual clustering functions can be set with the control
argument
in cluster_terms()
.
cluster_by_kmeans()
: The best k for k-means clustering is determined according to the "elbow" or "knee" method on
the distribution of within-cluster sum of squares (WSS) on each k. All other arguments are passed
from ...
to stats::kmeans()
.
cluster_by_pam()
: PAM is applied by fpc::pamk()
which can automatically select the best k.
All other arguments are passed from ...
to fpc::pamk()
.
cluster_by_dynamicTreeCut()
: All other arguments are passed from ...
to dynamicTreeCut::cutreeDynamic()
.
cluster_by_fast_greedy()
: All other arguments are passed from ...
to igraph::cluster_fast_greedy()
.
cluster_by_leading_eigen()
: All other arguments are passed from ...
to igraph::cluster_leading_eigen()
.
cluster_by_louvain()
: All other arguments are passed from ...
to igraph::cluster_louvain()
.
cluster_by_walktrap()
: All other arguments are passed from ...
to igraph::cluster_walktrap()
.
cluster_by_mclust()
: All other arguments are passed from ...
to mclust::Mclust()
.
cluster_by_apcluster()
: All other arguments are passed from ...
to apcluster::apcluster()
.
cluster_by_hdbscan()
: All other arguments are passed from ...
to dbscan::hdbscan()
.
cluster_by_MCL()
: All other arguments are passed from ...
to MCL::mcl()
.
A vector of numeric cluster labels.
Compare clustering methods
cmp_make_clusters( mat, method = setdiff(all_clustering_methods(), "mclust"), verbose = TRUE ) cmp_make_plot(mat, clt, plot_type = c("mixed", "heatmap"), nrow = 3) compare_clustering_methods( mat, method = setdiff(all_clustering_methods(), "mclust"), plot_type = c("mixed", "heatmap"), nrow = 3, verbose = TRUE )
cmp_make_clusters( mat, method = setdiff(all_clustering_methods(), "mclust"), verbose = TRUE ) cmp_make_plot(mat, clt, plot_type = c("mixed", "heatmap"), nrow = 3) compare_clustering_methods( mat, method = setdiff(all_clustering_methods(), "mclust"), plot_type = c("mixed", "heatmap"), nrow = 3, verbose = TRUE )
mat |
The similarity matrix. |
method |
Which methods to compare. All available methods are in |
verbose |
Whether to print messages. Ddetails The function compares following default clustering methods by default: - Also the user-defined methods in |
clt |
A list of clusterings from |
plot_type |
What type of plots to make. See Details. |
nrow |
Number of rows of the layout when |
For cmp_make_plot()
, if plot_type
is the default value "mixed"
, a figure with three panels will be generated:
A heatmap of the similarity matrix with different classifications as row annotations.
A heatmap of the pair-wise concordance of the classifications of every two clustering methods.
Barplots of the difference scores for each method (calculated by difference_score
), the number
of clusters (total clusters and the clusters with size >= 5) and the mean similarity of the terms
that are in the same clusters.
If plot_type
is "heatmap"
. There are heatmaps for the similarity matrix under clusterings
from different methods. The last panel is a table with the number of clusters under different
clusterings.
compare_clustering_methods()
is basically a wrapper function of cmp_make_clusters()
and cmp_make_plot()
.
cmp_make_clusters()
returns a list of cluster label vectors from different clustering methods.
cmp_make_plot()
returns no value.
compare_clustering_methods()
returns no value.
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) compare_clustering_methods(mat) compare_clustering_methods(mat, plot_type = "heatmap")
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) compare_clustering_methods(mat) compare_clustering_methods(mat, plot_type = "heatmap")
Calculate word frequency
count_words( term, exclude_words = NULL, stop_words = stopwords(), min_word_length = 1, tokenizer = "words", transform_case = tolower, remove_numbers = TRUE, remove_punctuation = TRUE, custom_transformer = NULL, stemming = FALSE, dictionary = NULL )
count_words( term, exclude_words = NULL, stop_words = stopwords(), min_word_length = 1, tokenizer = "words", transform_case = tolower, remove_numbers = TRUE, remove_punctuation = TRUE, custom_transformer = NULL, stemming = FALSE, dictionary = NULL )
term |
A vector of description texts. |
exclude_words |
The words that should be excluded. |
stop_words |
The stop words that should be be removed. |
min_word_length |
Minimum length of the word to be counted. |
tokenizer |
The tokenizer function, one of the values accepted by |
transform_case |
The function normalizing lettercase of the words. |
remove_numbers |
Whether to remove numbers. |
remove_punctuation |
Whether to remove punctuation. |
custom_transformer |
Custom function that transforms words. |
stemming |
Whether to only keep the roots of inflected words. |
dictionary |
A vector of words to be counted (if given all other words will be excluded). |
The text preprocessing followings the instruction from http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need.
A data frame with words and frequencies.
gm = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) go_id = rownames(gm) go_term = AnnotationDbi::select(GO.db::GO.db, keys = go_id, columns = "TERM")$TERM count_words(go_term) |> head()
gm = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) go_id = rownames(gm) go_term = AnnotationDbi::select(GO.db::GO.db, keys = go_id, columns = "TERM")$TERM count_words(go_term) |> head()
Apply functions on every node in a dendrogram
dend_node_apply(dend, fun) edit_node(dend, fun = function(d, index) d)
dend_node_apply(dend, fun) edit_node(dend, fun = function(d, index) d)
dend |
A |
fun |
A self-defined function. |
dend_node_apply()
returns a vector or a list as the same length as the number of nodes in the dendrogram.
The self-defined function can have one single argument which is the sub-dendrogram at a certain node. E.g. to get the number of members at every node:
dend_node_apply(dend, function(d) attr(d, "members"))
The self-defined function can have a second argument, which is the index of current sub-dendrogram in
the complete dendrogram. E.g. dend[[1]]
is the first child node of the complete dendrogram and
dend[[c(1, 2)]]
is the second child node of dend[[1]]
, et al. This makes that at a certain node,
it is possible to get information of its child nodes and parent nodes.
dend_node_apply(dend, function(d, index) { dend[[c(index, 1)]] # is the first child node of d, or simply d[[1]] dend[[index[-length(index)]]] # is the parent node of d ... })
Note for the top node, the value of index
is NULL
.
In edit_node()
, if fun
only has one argument, it is basically the same as stats::dendrapply()
,
but it can have a second argument which is the index of the node in the dendrogram,
which makes it possible to get information of child nodes and parent nodes for
a specific node.
As an example, we first assign random values to every node in the dendrogram:
mat = matrix(rnorm(100), 10) dend = as.dendrogram(hclust(dist(mat))) dend = edit_node(dend, function(d) {attr(d, 'score') = runif(1); d})
Then for every node, we take the maximal absolute difference to all its child nodes
and parent node as the attribute abs_diff
.
dend = edit_node(dend, function(d, index) { n = length(index) s = attr(d, "score") if(is.null(index)) { # d is the top node s_children = sapply(d, function(x) attr(x, "score")) s_parent = NULL } else if(is.leaf(d)) { # d is the leaf s_children = NULL s_parent = attr(dend[[index[-n]]], "score") } else { s_children = sapply(d, function(x) attr(x, "score")) s_parent = attr(dend[[index[-n]]], "score") } abs_diff = max(abs(s - c(s_children, s_parent))) attr(d, "abs_diff") = abs_diff return(d) })
dend_node_apply()
returns a vector or a list, depends on whether fun
returns a scalar or more complex values.
edit_node()
returns a dendrogram
object.
mat = matrix(rnorm(100), 10) dend = as.dendrogram(hclust(dist(mat))) # number of members on every node dend_node_apply(dend, function(d) attr(d, "members")) # the depth on every node dend_node_apply(dend, function(d, index) length(index))
mat = matrix(rnorm(100), 10) dend = as.dendrogram(hclust(dist(mat))) # number of members on every node dend_node_apply(dend, function(d) attr(d, "members")) # the depth on every node dend_node_apply(dend, function(d, index) length(index))
Difference score
difference_score(mat, cl)
difference_score(mat, cl)
mat |
The similarity matrix. |
cl |
Cluster labels. |
This function measures the different between the similarity values for the terms that belong to the same clusters and in different clusters. The difference score is the Kolmogorov-Smirnov statistic between the two distributions.
A numeric scalar.
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) difference_score(mat, cl)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) difference_score(mat, cl)
Interactively visualize the similarity heatmap
export_to_shiny_app(mat, cl = binary_cut(mat))
export_to_shiny_app(mat, cl = binary_cut(mat))
mat |
A similarity matrix. |
cl |
Cluster labels inferred from the similarity matrix, e.g. from |
A shiny application.
if(interactive()) { mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) export_to_shiny_app(mat, cl) }
if(interactive()) { mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) export_to_shiny_app(mat, cl) }
Calculate Gene Ontology (GO) semantic similarity matrix
GO_similarity( go_id, ont = NULL, db = "org.Hs.eg.db", measure = "Sim_XGraSM_2013" ) guess_ont(go_id, db = "org.Hs.eg.db") random_GO(n, ont = c("BP", "CC", "MF"), db = "org.Hs.eg.db")
GO_similarity( go_id, ont = NULL, db = "org.Hs.eg.db", measure = "Sim_XGraSM_2013" ) guess_ont(go_id, db = "org.Hs.eg.db") random_GO(n, ont = c("BP", "CC", "MF"), db = "org.Hs.eg.db")
go_id |
A vector of GO IDs. |
ont |
Sub-ontology of GO. Value should be one of "BP", "CC" or "MF". If it is not specified,
the function automatically identifies it by random sampling 10 IDs from |
db |
Annotation database. It should be an OrgDb package name from https://bioconductor.org/packages/release/BiocViews.html#___OrgDb. The value
can also directly be an |
measure |
Semantic measure for the GO similarity, pass to |
n |
Number of GO IDs. |
The default similarity method is "Sim_XGraSM_2013". Since the semantic similarities are calculated based on gene annotations to GO terms, I suggest users also try the following methods:
"Sim_Lin_1998"
"Sim_Resnik_1999"
"Sim_Relevance_2006"
"Sim_SimIC_2010"
"Sim_XGraSM_2013"
"Sim_EISI_2015"
"Sim_AIC_2014"
"Sim_Wang_2007"
"Sim_GOGO_2018"
In guess_ont()
, only 10 random GO IDs are checked.
In random_GO()
, only GO terms with gene annotations are sampled.
GO_similarity()
returns a symmetric matrix.
guess_ont()
returns a single character scalar of "BP", "CC" or "MF".
If there are more than one ontologies detected. It returns NULL
.
random_GO()
returns a vector of GO IDs.
go_id = random_GO(100) mat = GO_similarity(go_id) go_id = random_GO(100) guess_ont(go_id)
go_id = random_GO(100) mat = GO_similarity(go_id) go_id = random_GO(100) guess_ont(go_id)
Visualize the similarity matrix and the clustering
ht_clusters( mat, cl, dend = NULL, col = c("white", "red"), draw_word_cloud = TRUE, min_term = round(nrow(mat) * 0.01), order_by_size = FALSE, stat = "pvalue", min_stat = ifelse(stat == "count", 5, 0.05), exclude_words = character(0), max_words = 10, word_cloud_grob_param = list(), fontsize_range = c(4, 16), bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA"), column_title = NULL, ht_list = NULL, use_raster = TRUE, run_draw = TRUE, ... )
ht_clusters( mat, cl, dend = NULL, col = c("white", "red"), draw_word_cloud = TRUE, min_term = round(nrow(mat) * 0.01), order_by_size = FALSE, stat = "pvalue", min_stat = ifelse(stat == "count", 5, 0.05), exclude_words = character(0), max_words = 10, word_cloud_grob_param = list(), fontsize_range = c(4, 16), bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA"), column_title = NULL, ht_list = NULL, use_raster = TRUE, run_draw = TRUE, ... )
mat |
A similarity matrix. |
cl |
Cluster labels inferred from the similarity matrix, e.g. from |
dend |
Used internally. |
col |
A vector of colors that map from 0 to the 97.5^th percentile of the similarity values. The value can also be a color mapping function
generated by |
draw_word_cloud |
Whether to draw the word clouds. |
min_term |
Minimal number of functional terms in a cluster. All the clusters
with size less than |
order_by_size |
Whether to reorder clusters by their sizes. The cluster
that is merged from small clusters (size < |
stat |
Type of value for mapping to the font size of keywords in the word clouds. There are two options: "count": simply number of keywords; "pvalue": enrichment on keywords is performed (by fisher's exact test) and -log10(pvalue) is used to map to font sizes. |
min_stat |
Minimal value for |
exclude_words |
Words that are excluded in the word cloud. |
max_words |
Maximal number of words visualized in the word cloud. |
word_cloud_grob_param |
A list of graphic parameters passed to |
fontsize_range |
The range of the font size. The value should be a numeric vector with length two. The font size interpolation is linear. |
bg_gp |
Graphics parameters for controlling word cloud annotation background. |
column_title |
Column title for the heatmap. |
ht_list |
A list of additional heatmaps added to the left of the similarity heatmap. |
use_raster |
Whether to write the heatmap as a raster image. |
run_draw |
Internally used. |
... |
Other arguments passed to |
A ComplexHeatmap::HeatmapList
object.
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) ht_clusters(mat, cl, word_cloud_grob_param = list(max_width = 80)) ht_clusters(mat, cl, word_cloud_grob_param = list(max_width = 80), order_by_size = TRUE)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) cl = binary_cut(mat) ht_clusters(mat, cl, word_cloud_grob_param = list(max_width = 80)) ht_clusters(mat, cl, word_cloud_grob_param = list(max_width = 80), order_by_size = TRUE)
Keyword enrichment for GO terms
keyword_enrichment_from_GO(go_id, min_bg = 5, min_term = 2)
keyword_enrichment_from_GO(go_id, min_bg = 5, min_term = 2)
go_id |
A vector of GO IDs. |
min_bg |
Minimal number of GO terms (in the background, i.e. all GO temrs in the GO database) that contain a specific keyword. |
min_term |
Minimal number of GO terms (GO terms in |
The enrichment is applied by Fisher's exact test. For a keyword, there is the following 2x2 contigency table:
| contains the keyword | does not contain the keyword In the GO set | s11 | s12 Not in the GO set | s21 | s22
where s11
, s12
, s21
and s22
are the counts of GO terms in the four categories.
A data frame with keyword enrichment results.
go_id = random_GO(100) keyword_enrichment_from_GO(go_id)
go_id = random_GO(100) keyword_enrichment_from_GO(go_id)
Partition the matrix
partition_by_kmeans(mat, n_repeats = 10) partition_by_pam(mat) partition_by_hclust(mat) partition_by_kmeanspp(mat)
partition_by_kmeans(mat, n_repeats = 10) partition_by_pam(mat) partition_by_hclust(mat) partition_by_kmeanspp(mat)
mat |
The submatrix in the binary cut clustering process. |
n_repeats |
Number of repeated runs of k-means clustering. |
These functions can be set to the partition_fun
argument in binary_cut()
.
partition_by_kmeans()
: Since k-means clustering brings randomness, this function performs
k-means clustering several times (controlled by n_repeats
) and uses the final consensus partitioning results.
partition_by_pam()
: The clustering is performed by cluster::pam()
with the pamonce
argument set to 5.
partition_by_hclust()
: The "ward.D2" clusering method was used.
partition_by_kmeanspp()
: It uses the kmeanspp method from the flexclust package.
All partitioning functions split the matrix into two groups and return a categorical vector of labels of 1 and 2.
Cluster functional terms by recursively binary cutting the similarity matrix
plot_binary_cut( mat, value_fun = area_above_ecdf, cutoff = 0.85, partition_fun = partition_by_pam, dend = NULL, dend_width = unit(3, "cm"), depth = NULL, show_heatmap_legend = TRUE, ... ) binary_cut( mat, value_fun = area_above_ecdf, partition_fun = partition_by_hclust, cutoff = 0.85, try_all_partition_fun = TRUE, partial = nrow(mat) > 1500 )
plot_binary_cut( mat, value_fun = area_above_ecdf, cutoff = 0.85, partition_fun = partition_by_pam, dend = NULL, dend_width = unit(3, "cm"), depth = NULL, show_heatmap_legend = TRUE, ... ) binary_cut( mat, value_fun = area_above_ecdf, partition_fun = partition_by_hclust, cutoff = 0.85, try_all_partition_fun = TRUE, partial = nrow(mat) > 1500 )
mat |
A similarity matrix. |
value_fun |
A function that calculates the scores for the four submatrices on a node. |
cutoff |
The cutoff for splitting the dendrogram. |
partition_fun |
A function to split each node into two groups. Pre-defined functions
in this package are |
dend |
A dendrogram object, used internally. |
dend_width |
Width of the dendrogram on the plot. |
depth |
Depth of the recursive binary cut process. |
show_heatmap_legend |
Whether to show the heatmap legend. |
... |
Other arguments. |
try_all_partition_fun |
Different |
partial |
Whether to generate the complete clustering or the clustering stops when sub-matrices cannot be split anymore. |
After the functions which perform clustering are executed, such as simplifyGO()
or
binary_cut()
, the dendrogram is temporarily saved and plot_binary_cut()
directly
uses this dendrogram.
binary_cut()
returns a vector of numeric cluster labels.
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) plot_binary_cut(mat, depth = 1) plot_binary_cut(mat, depth = 2) plot_binary_cut(mat) mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) binary_cut(mat)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) plot_binary_cut(mat, depth = 1) plot_binary_cut(mat, depth = 2) plot_binary_cut(mat) mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) binary_cut(mat)
Configure clustering methods
register_clustering_methods(...) all_clustering_methods() remove_clustering_methods(method) reset_clustering_methods()
register_clustering_methods(...) all_clustering_methods() remove_clustering_methods(method) reset_clustering_methods()
... |
A named list of clustering functions, see in Details. |
method |
A vector of method names. |
The user-defined functions should accept at least one argument which is the input matrix.
The second optional argument should always be ...
so that parameters
for the clustering function can be passed by the control
argument from cluster_terms()
, simplifyGO()
or simplifyEnrichment()
.
If users forget to add ...
, it is added internally.
Please note, the user-defined function should automatically identify the optimized number of clusters.
The function should return a vector of cluster labels. Internally it is converted to numeric labels.
The default clustering methods are:
kmeans
see cluster_by_kmeans()
.
dynamicTreeCut
see cluster_by_dynamicTreeCut()
.
mclust
see cluster_by_mclust()
.
apcluster
see cluster_by_apcluster()
.
hdbscan
see cluster_by_hdbscan()
.
fast_greedy
see cluster_by_fast_greedy()
.
louvain
see cluster_by_louvain()
.
walktrap
see cluster_by_walktrap()
.
MCL
see cluster_by_MCL()
.
binary_cut
see binary_cut()
.
all_clustering_methods()
returns a vector of clustering method names.
register_clustering_methods( # assume there are 5 groups random = function(mat, ...) sample(5, nrow(mat), replace = TRUE) ) all_clustering_methods() remove_clustering_methods("random") all_clustering_methods() remove_clustering_methods(c("kmeans", "mclust")) all_clustering_methods() reset_clustering_methods() all_clustering_methods()
register_clustering_methods( # assume there are 5 groups random = function(mat, ...) sample(5, nrow(mat), replace = TRUE) ) all_clustering_methods() remove_clustering_methods("random") all_clustering_methods() remove_clustering_methods(c("kmeans", "mclust")) all_clustering_methods() reset_clustering_methods() all_clustering_methods()
Scale font size
scale_fontsize(x, rg = c(1, 30), fs = c(4, 16))
scale_fontsize(x, rg = c(1, 30), fs = c(4, 16))
x |
A numeric vector. |
rg |
The range. |
fs |
Range of the font size. |
It is a linear interpolation.
A numeric vector.
x = runif(10, min = 1, max = 20) # scale x to fontsize 4 to 16. scale_fontsize(x)
x = runif(10, min = 1, max = 20) # scale x to fontsize 4 to 16. scale_fontsize(x)
Global parameters
se_opt(..., RESET = FALSE, READ.ONLY = NULL, LOCAL = FALSE, ADD = FALSE)
se_opt(..., RESET = FALSE, READ.ONLY = NULL, LOCAL = FALSE, ADD = FALSE)
... |
Arguments for the parameters, see "details" section. |
RESET |
Whether to reset to default values. |
READ.ONLY |
Please ignore. |
LOCAL |
Please ignore. |
ADD |
Please ignore. |
There are the following global options:
verobse
: Whether to print messages.
A GlobalOptionsFun
object.
Select the cutoff for binary cut
select_cutoff( mat, cutoff = seq(0.6, 0.98, by = 0.01), verbose = se_opt$verbose, ... )
select_cutoff( mat, cutoff = seq(0.6, 0.98, by = 0.01), verbose = se_opt$verbose, ... )
mat |
A similarity matrix. |
cutoff |
A list of cutoffs to test. Note the range of the cutoff values should be inside |
verbose |
Whether to print messages. |
... |
Pass to |
Binary cut is applied to each cutoff and the clustering results are evaluated by following metrics:
difference score, calculated by difference_score()
.
number of clusters.
block mean, which is the mean similarity in the blocks in the diagonal of the heatmap.
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) select_cutoff(mat)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment")) select_cutoff(mat)
Simplify Gene Ontology (GO) enrichment results
simplifyGO( mat, method = "binary_cut", control = list(), plot = TRUE, verbose = TRUE, column_title = qq("@{nrow(mat)} GO terms clustered by '@{method}'"), ht_list = NULL, ... ) simplifyEnrichment(...)
simplifyGO( mat, method = "binary_cut", control = list(), plot = TRUE, verbose = TRUE, column_title = qq("@{nrow(mat)} GO terms clustered by '@{method}'"), ht_list = NULL, ... ) simplifyEnrichment(...)
mat |
A GO similarity matrix. You can also provide a vector of GO IDs to this argument. |
method |
Method for clustering the matrix. See |
control |
A list of parameters for controlling the clustering method, passed to |
plot |
Whether to make the heatmap. |
verbose |
Whether to print messages. |
column_title |
Column title for the heatmap. |
ht_list |
A list of additional heatmaps added to the left of the similarity heatmap. |
... |
Arguments passed to |
This is basically a wrapper function that it first runs cluster_terms()
to cluster
GO terms and then runs ht_clusters()
to visualize the clustering.
The arguments in simplifyGO()
passed to ht_clusters()
are:
draw_word_cloud
: Whether to draw the word clouds.
min_term
: Minimal number of GO terms in a cluster. All the clusters
with size less than min_term
are all merged into one single cluster in the heatmap.
order_by_size
: Whether to reorder GO clusters by their sizes. The cluster
that is merged from small clusters (size < min_term
) is always put to the bottom of the heatmap.
stat
: What values of keywords are used to map to font sizes in the word clouds.
exclude_words
: Words that are excluded in the word cloud.
max_words
: Maximal number of words visualized in the word cloud.
word_cloud_grob_param
: A list of graphic parameters passed to word_cloud_grob()
.
fontsize_range
The range of the font size. The value should be a numeric vector with length two.
The minimal font size is mapped to word frequency value of 1 and the maximal font size is mapped
to the maximal word frequency. The font size interlopation is linear.
bg_gp
: Graphic parameters for controlling the background of word cloud annotations.
A data frame with two columns: GO IDs and cluster labels.
set.seed(123) go_id = random_GO(500) mat = GO_similarity(go_id) df = simplifyGO(mat, word_cloud_grob_param = list(max_width = 80)) head(df)
set.seed(123) go_id = random_GO(500) mat = GO_similarity(go_id) df = simplifyGO(mat, word_cloud_grob_param = list(max_width = 80)) head(df)
Perform simplifyGO analysis with multiple lists of GO IDs
simplifyGOFromMultipleLists( lt, go_id_column = NULL, padj_column = NULL, padj_cutoff = 0.01, filter = function(x) any(x < padj_cutoff), default = 1, ont = NULL, db = "org.Hs.eg.db", measure = "Sim_XGraSM_2013", heatmap_param = list(NULL), show_barplot = TRUE, method = "binary_cut", control = list(), min_term = NULL, verbose = TRUE, column_title = NULL, ... )
simplifyGOFromMultipleLists( lt, go_id_column = NULL, padj_column = NULL, padj_cutoff = 0.01, filter = function(x) any(x < padj_cutoff), default = 1, ont = NULL, db = "org.Hs.eg.db", measure = "Sim_XGraSM_2013", heatmap_param = list(NULL), show_barplot = TRUE, method = "binary_cut", control = list(), min_term = NULL, verbose = TRUE, column_title = NULL, ... )
lt |
A data frame, a list of numeric vectors (e.g. adjusted p-values) where each numeric vector has GO IDs as names, or a list of GO IDs. |
go_id_column |
Column index of GO ID if |
padj_column |
Column index of adjusted p-values if |
padj_cutoff |
Cut off for adjusted p-values. |
filter |
A self-defined function for filtering GO IDs. By default it requires GO IDs should be significant in at least one list. |
default |
The default value for the adjusted p-values. See Details. |
ont |
Pass to |
db |
Pass to |
measure |
Pass to |
heatmap_param |
Parameters for controlling the heatmap, see Details. |
show_barplot |
Whether draw barplots which shows numbers of significant GO terms in clusters. |
method |
Pass to |
control |
Pass to |
min_term |
Pass to |
verbose |
Pass to |
column_title |
Pass to |
... |
Pass to |
The input data can have three types of formats:
A list of numeric vectors of adjusted p-values where each vector has the GO IDs as names.
A data frame. The column of the GO IDs can be specified with go_id_column
argument and the column of the adjusted p-values can be
specified with padj_column
argument. If these columns are not specified, they are automatically identified. The GO ID column
is found by checking whether a column contains all GO IDs. The adjusted p-value column is found by comparing the column names of the
data frame to see whether it might be a column for adjusted p-values. These two columns are used to construct a numeric vector
with GO IDs as names.
A list of character vectors of GO IDs. In this case, each character vector is changed to a numeric vector where all values take 1 and the original GO IDs are used as names of the vector.
Now let's assume there are n
GO lists, we first construct a global matrix where columns correspond to the n
GO lists and rows correspond
to the "union" of all GO IDs in the lists. The value for the ith GO ID and in the jth list are taken from the corresponding numeric vector
in lt
. If the jth vector in lt
does not contain the ith GO ID, the value defined by default
argument is taken there (e.g. in most cases the numeric
values are adjusted p-values, default
is set to 1). Let's call this matrix as M0
.
Next step is to filter M0
so that we only take a subset of GO IDs of interest. We define a proper function via argument filter
to remove
GO IDs that are not important for the analysis. Functions for filter
is applied to every row in M0
and filter
function needs
to return a logical value to decide whether to remove the current GO ID. For example, if the values in lt
are adjusted p-values, the filter
function
can be set as function(x) any(x < padj_cutoff)
so that the GO ID is kept as long as it is signfiicant in at least one list. After the filter, let's call
the filtered matrix M1
.
GO IDs in M1
(row names of M1
) are used for clustering. A heatmap of M1
is attached to the left of the GO similarity heatmap so that
the group-specific (or list-specific) patterns can be easily observed and to corresponded to GO functions.
Argument heatmap_param
controls several parameters for heatmap M1
:
transform
: A self-defined function to transform the data for heatmap visualization. The most typical case is to transform adjusted p-values by -log10(x)
.
breaks
: break values for color interpolation.
col
: The corresponding values for breaks
.
labels
: The corresponding labels.
name
: Legend title.
# perform functional enrichment on the signatures genes from cola anlaysis require(cola) data(golub_cola) res = golub_cola["ATC:skmeans"] require(hu6800.db) x = hu6800ENTREZID mapped_probes = mappedkeys(x) id_mapping = unlist(as.list(x[mapped_probes])) lt = functional_enrichment(res, k = 3, id_mapping = id_mapping) # you can check the value of `lt` # a list of data frames simplifyGOFromMultipleLists(lt, padj_cutoff = 0.001) # a list of numeric values lt2 = lapply(lt, function(x) structure(x$p.adjust, names = x$ID)) simplifyGOFromMultipleLists(lt2, padj_cutoff = 0.001) # a list of GO IDS lt3 = lapply(lt, function(x) x$ID[x$p.adjust < 0.001]) simplifyGOFromMultipleLists(lt3)
# perform functional enrichment on the signatures genes from cola anlaysis require(cola) data(golub_cola) res = golub_cola["ATC:skmeans"] require(hu6800.db) x = hu6800ENTREZID mapped_probes = mappedkeys(x) id_mapping = unlist(as.list(x[mapped_probes])) lt = functional_enrichment(res, k = 3, id_mapping = id_mapping) # you can check the value of `lt` # a list of data frames simplifyGOFromMultipleLists(lt, padj_cutoff = 0.001) # a list of numeric values lt2 = lapply(lt, function(x) structure(x$p.adjust, names = x$ID)) simplifyGOFromMultipleLists(lt2, padj_cutoff = 0.001) # a list of GO IDS lt3 = lapply(lt, function(x) x$ID[x$p.adjust < 0.001]) simplifyGOFromMultipleLists(lt3)
A simplified way to visualize enrichment in GO clusters
summarizeGO( go_id, value = NULL, aggregate = mean, method = "binary_cut", control = list(), verbose = TRUE, axis_label = "Value", title = "", legend_title = axis_label, min_term = round(nrow(mat) * 0.01), stat = "pvalue", min_stat = ifelse(stat == "count", 5, 0.05), exclude_words = character(0), max_words = 6, word_cloud_grob_param = list(), fontsize_range = c(4, 16), bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA") )
summarizeGO( go_id, value = NULL, aggregate = mean, method = "binary_cut", control = list(), verbose = TRUE, axis_label = "Value", title = "", legend_title = axis_label, min_term = round(nrow(mat) * 0.01), stat = "pvalue", min_stat = ifelse(stat == "count", 5, 0.05), exclude_words = character(0), max_words = 6, word_cloud_grob_param = list(), fontsize_range = c(4, 16), bg_gp = gpar(fill = "#DDDDDD", col = "#AAAAAA") )
go_id |
A vector of GO IDs. |
value |
A list of numeric value associate with |
aggregate |
Function to aggregate values in each GO cluster. |
method |
Method for clustering the matrix. See |
control |
A list of parameters for controlling the clustering method, passed to |
verbose |
Whether to print messages. |
axis_label |
X-axis label. |
title |
Title for the whole plot. |
legend_title |
Title for the legend. |
min_term |
Minimal number of functional terms in a cluster. All the clusters
with size less than |
stat |
Type of value for mapping to the font size of keywords in the word clouds. There are two options: "count": simply number of keywords; "pvalue": enrichment on keywords is performed (by fisher's exact test) and -log10(pvalue) is used to map to font sizes. |
min_stat |
Minimal value for |
exclude_words |
Words that are excluded in the word cloud. |
max_words |
Maximal number of words visualized in the word cloud. |
word_cloud_grob_param |
A list of graphic parameters passed to |
fontsize_range |
The range of the font size. The value should be a numeric vector with length two. The font size interpolation is linear. |
bg_gp |
Graphics parameters for controlling word cloud annotation background. |
There are several other ways to specify GO IDs and the associated values.
specify value
as a named vector where GO IDs are the names.
specify value
as a list of numeric named vectors. In this case, value
contains multiple enrichment results.
Please refer to https://jokergoo.github.io/2023/10/02/simplified-simplifyenrichment-plot/ for more examples of this function.
A simple grob for the word cloud
word_cloud_grob( text, fontsize, line_space = unit(4, "pt"), word_space = unit(4, "pt"), max_width = unit(80, "mm"), col = function(fs) circlize::rand_color(length(fs), luminosity = "dark"), add_new_line = FALSE, test = FALSE ) ## S3 method for class 'word_cloud' widthDetails(x) ## S3 method for class 'word_cloud' heightDetails(x)
word_cloud_grob( text, fontsize, line_space = unit(4, "pt"), word_space = unit(4, "pt"), max_width = unit(80, "mm"), col = function(fs) circlize::rand_color(length(fs), luminosity = "dark"), add_new_line = FALSE, test = FALSE ) ## S3 method for class 'word_cloud' widthDetails(x) ## S3 method for class 'word_cloud' heightDetails(x)
text |
A vector of words. |
fontsize |
The corresponding font size. With the frequency of the words known, |
line_space |
Space between lines. The value can be a |
word_space |
Space between words. The value can be a |
max_width |
The maximal width of the viewport to put the word cloud. The value can be a |
col |
Colors for the words. The value can be a vector, in numeric or character, which should have the same
length as |
add_new_line |
Whether to add new line after every word? If |
test |
Internally used. It basically adds borders to the words and the viewport. |
x |
The |
A grid::grob
object. The width and height of the grob can be get by grid::grobWidth
and grid::grobHeight
.
# very old R versions do not have strrep() function if(!exists("strrep")) { strrep = function(x, i) paste(rep(x, i), collapse = "") } words = sapply(1:30, function(x) strrep(sample(letters, 1), sample(3:10, 1))) require(grid) gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100) grid.newpage(); grid.draw(gb) # color as a single scalar gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = 1) grid.newpage(); grid.draw(gb) # color as a vector gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = 1:30) grid.newpage(); grid.draw(gb) # color as a function require(circlize) col_fun = colorRamp2(c(5, 17, 30), c("blue", "black", "red")) gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = function(fs) col_fun(fs)) grid.newpage(); grid.draw(gb)
# very old R versions do not have strrep() function if(!exists("strrep")) { strrep = function(x, i) paste(rep(x, i), collapse = "") } words = sapply(1:30, function(x) strrep(sample(letters, 1), sample(3:10, 1))) require(grid) gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100) grid.newpage(); grid.draw(gb) # color as a single scalar gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = 1) grid.newpage(); grid.draw(gb) # color as a vector gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = 1:30) grid.newpage(); grid.draw(gb) # color as a function require(circlize) col_fun = colorRamp2(c(5, 17, 30), c("blue", "black", "red")) gb = word_cloud_grob(words, fontsize = runif(30, min = 5, max = 30), max_width = 100, col = function(fs) col_fun(fs)) grid.newpage(); grid.draw(gb)