The Self-training Gene Clustering Pipeline (SGCP
) is a
robust framework designed for constructing and analyzing gene
co-expression networks. Its primary objective is to organize genes into
clusters, or modules, based on their similar expression patterns within
these networks. SGCP
introduces several innovative steps
that facilitate the identification of highly enriched modules through an
unsupervised approach. What sets SGCP
apart from existing
frameworks is its incorporation of a unique semi-supervised clustering
method that integrates Gene Ontology (GO) information. This additional
step significantly enhances the quality of the identified modules.
SGCP
ultimately produces modules that are notably enriched
with relevant biological insights. For a comprehensive overview of
SGCP
, including detailed methodology and results, refer to
the preprint of the manuscript available here.
SGCP
installationTo install the package SGCP
package and its
dependencies, follow these steps. For more information please visit here). In this manual guide,
SGCP
also relies on two more packages;
SummarizedExperiment
and org.Hs.eg.db
.
Let’s start.
SGCP
General InputSGCP
requires three main inputs: gene expression data,
gene identifiers, and a genome-wide annotation database.
expData: A matrix or dataframe of size
m*n
, where m
is the number of genes and
n
is the number of samples. It can be either DNA-microarray
or RNA-seq data. In expData
, rows represent genes, columns
represent samples, and the entry at position (i, j)
is the
expression value for gene i
in sample j.
SGCP
does not perform normalization or batch effect
correction; these preprocessing steps should be completed
beforehand.
geneID: A vector of size m, where each entry at
index i
denotes the gene identifier for gene
i
. There is a one-to-one correspondence between the rows in
expData
and the entries in the geneID vector, meaning index
i
in geneID corresponds to row i in
expData.
annotation_db: The name of a genome-wide
annotation package for the organism of interest, used for the gene
ontology (GO) enrichment step. The annotation_db
must be
installed by the user before using SGCP.
Here, are some important annotation_db
along with its
corresponding identifiers.
organism | annotation_db | gene identifier |
---|---|---|
Homo sapiens (Hs) | org.Hs.eg.db | Entrez Gene identifiers |
Drosophila melanogaster (Dm) | org.Dm.eg.db | Entrez Gene identifiers |
Rattus norvegicus (Rn) | org.Rn.eg.db | Entrez Gene identifiers |
Mus musculus (Mm) | org.Mm.eg.db | Entrez Gene identifiers |
Arabidopsis thaliana (At) | org.At.tair.db | TAIR identifiers |
Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.
Note that SGCP
depends on GOstats
for GO enrichment, thus (geneID) and (annotation_db)
must be compatible to this package standard.
SGCP
Input ExampleFor illustrative purposes we will use an example of a gene expression
provided with SGCP
. This dataset originally represent 5000
genes in 57 samples for Homo sapiens, for more information see (Cheng
et al.) Here, to ease the computation, 1000 genes with the highest
variance and 5 samples have been selected from the original dataset.
For this input example we need to install the following packages.
The input example is organized as an object of
SummarizedExperiment
. The assay
field contains
the expression matrix, where rows correspond to genes and columns
correspond to samples. The rowData
field denotes the gene
Entrez IDs. There is a one-to-one correspondence between rows in the
assay and rowData
fields; thus, rowData
at
index i
shows the gene Entrez ID for the gene at row
i
in the assay field. We call this object
cheng.
The annotation_db
used is org.Hs.eg.db
,
which must be installed if it is not already. For installation, see link.
Let’s look at the input example.
## class: SummarizedExperiment
## dim: 1500 10
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
## [1] "gene expression..."
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
## [1] " \n gene ids..."
## [1] "rownames are the gene Entrez ids"
## DataFrame with 6 rows and 2 columns
## SYMBOL ENTREZID
## <character> <character>
## 7503 XIST 7503
## 100462977 MTRNR2L1 100462977
## 4878 NPPA 4878
## 4879 NPPB 4879
## 6192 RPS4Y1 6192
## 9086 EIF1AY 9086
This data has the dimension of 1500 genes and 10 samples.
Now, we are ready to create the three main inputs. Using
assay
and rowData
functions we can have access
to expression matrix and gene Entrez IDs respectively.
## expData
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
## [1] 1500 10
##
## geneID
## [1] "7503" "100462977" "4878" "4879" "6192" "9086"
## [1] 1500
## Loading required package: AnnotationDbi
SGCP
Pipeline Parameters and WorkflowSGCP
operates through five main steps to produce the
final modules. Each step provides parameters that can be adjusted by the
user. Each step is implemented as a single function.
adjacencyMatrix
function)
calibration
: Boolean, default FALSE
. If
TRUE
SGCP
performs calibration step for more
information see link.norm
: Boolean, default TRUE
. If
TRUE
SGCP
divides each genes (rows) by its
norm2.tom
: Boolean, default TRUE
. If
TRUE
SGCP
adds TOM to the network.saveAdja
: Boolean, default FALSE
. If
TRUE
, the adjacency matrix will be saved .adjaNameFile
: string indicating the name of file for
saving adjacency matrix.hm
: string indicates the name of file for saving
adjacency matrix heat map.clustering
function)
kopt
: An integer, default is NULL
. Denotes
the optimal number of clusters k
by the user.method
: Either “relativeGap”, “secondOrderGap”,
“additiveGap”, or NULL
(defualt), Defines the method for
number of cluster.func.GO
: A function for gene ontology validation,
default is sum.
func.conduct
: A function for conductance validation,
default is min.
maxIter
: An integer identifying the maximum number of
iteration in kmeans.
numStart
: An integer identifying the number of starts
in kmeans.
saveOrig
: Boolean, default TRUE
. If
TRUE
, keeps the transformed matrix.n_egvec
: Either “all” or an integer, default is
100
. Indicates the number of columns of transformed matrix
to be kept.sil
: Boolean, default FALSE
. If
TRUE
, calculates silhouette index per gene. at the end of
this step, initial clusters are produced.geneOntology
function)
direction
: Test direction, default is c(“over”,
“under”), for over-represented, or under-represented GO termsontology
: GO ontologies, default c(“BP”, “CC”, “MF”),
where BP stands for Biological Process, CC for Cellular Component, and
MF for Molecular Function.hgCutoff
: A numeric value in 0
and
1
, default is 0.05
. GO terms with p-values
smaller than hgCutoff
are retained.cond
: Boolean, default is TRUE
. If
TRUE
, performs conditional hypergeometric test.semiLabeling
function)
cutoff
: a numeric in 0 and 1, default is
NULL
. This baseline is used to determined the significance
of GO terms, distinguishing between remarkable and unremarkable
genes.percent
: a number in 0 and 1, default is
0.1
. Indicates the percentile used to identify top GO
terms.stp
: a number in 0 and ,1 default is 0.01
.
Specifies the increment added to the percent
parameter.semiSupevised
function)
model
: Either “knn” for k-nearest neighbors or “lr” for
logistic regression, specifying the classification model.kn
: An integer, default is NULL
,
indicating the number of neighbors in k-nearest neighbors
(knn
). If kn
is NULL
, then
kn
is determined as follow kn = 20
if 2 * k
< 30 otherwise kn = 20 : 30
,At the end of this step, final modules are produced. Subsequently,
SGCP
performs an additional step of Gene Ontology
Enrichment on those final modules.
Detailed of the steps are available in the manuscript in SGCP.
In Network Construction, user can identify of any of
steps to be added to the network. In Network
Clustering, SGCP
behaves as follows; If
kopt
is not NULL
, SGCP
finds
clusters based on kopt
. If kopt
is
NULL
and method
is not NULL
,
SGCP
picks k
based on the selected method. If
geneID
and annotation_db
are
NULL
, SGCP
determines the optimal method and
its corresponding number of cluster based on conductance validation. It
selects the method that func.conduct
(default
min
) on its cluster is minimum. Otherwise,
SGCP
will use gene ontology validation (default) to find
the optimal method and its corresponding number of clusters. It performs
gene ontology enrichment on the cluster with minimum conductance index
per method and selects the method that has the maximum
func.GO
(default sum
) over -log10 of p-values.
In Gene Semi-labeling step, if cutoff
is
not NULL
, SGCP
considers genes associated to
GO terms more significant than cutoff
as remarkable. Otherwise, SGCP
collects all GO terms from all clusters and selects the top
percent
(default 0.1
) most significant GO
terms among them. If genes associated to these terms come from more than
a single cluster, SGCP
these genes as remarkable.
Otherwise, it increases percent
to stp
and repeat this process until remarkable
genes come from at least two clusters. In Semi-supervised
Classification, remarkable clusters are the clusters that have
at least one remarkable gene.
ezSGCP
FunctionThe ezSGCP
function implements the SGCP
pipeline in one function. Parameters align with those specified in the
here
with specific mappings for clarity.
*__Network Construction__: Parameter `calib` corresponds to `calibration`.
*__Network Clustering__: Parameters `method_k`, `f.GO`, `f.conduct`, `maxIteration`, and `numberStart` correspond to `method`, `func.GO`, `func.conduct`, `maxIter`, and `numStart`, respectively.
*__Gene Ontology Enrichment__: Parameters `dir`, `onto`, `hgCut`, and `condTest` correspond to `direction`, `ontology`, `hgCutoff`, and `cond`, respectively.
* __Gene Ontology Enrichment__: The boolean parameter `semilabel` controls whether `SGCP` continues to the Semi-supervised Classification step (`TRUE`) or stops after initial clusters (`FALSE`).
For a comprehensive description of each parameter, refer to the help documentation.
Below is an example of how to call this function. Due to its
computation time (up to 15
minutes), the result has been
precomputed and stored as sgcp
, with the function call
commented out. To run it yourself, uncomment the function call. In this
example, the sil
parameter is set to TRUE to compute the
gene silhouette index.
# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm = NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
## Length Class Mode
## semilabel 1 -none- logical
## clusterLabels 3 data.frame list
## clustering 13 -none- list
## initial.GO 2 -none- list
## semiLabeling 2 -none- list
## semiSupervised 3 -none- list
## final.GO 2 -none- list
If you run the ezSGCP
function, it may provide status
updates during execution. Here’s an example of what you might see:
## starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
## starting network clustering step...
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 24-nearest neighbor model
## Training set outcome distribution:
## 1 2
## 498 667
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 130 205
## semi-supervised done!..
## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## ezSGCP done!..
As observed, SGCP
provides detailed information at each
step of its execution. In this example, the analysis considers three
possible numbers of clusters: 2
, 3
, and
5
. Following gene ontology validation, the optimal number
of clusters is determined to be 2
.
While SGCP generally performs well with default parameters, users have the option to customize settings according to their specific needs. For example, enabling calib can enhance the enrichment of the final modules in certain cases. Similarly, the addition of TOM may not always be beneficial.
By default, SGCP operates under the assumption that users do not have prior knowledge of the exact number of clusters or the optimal method for clustering. Therefore, it utilizes gene ontology validation to establish initial clusters. We highly recommend that users evaluate SGCP using all three potential methods: “relativeGap”, “secondOrderGap”, and “additiveGap”.
SGCP
offers visualization capabilities to users. The
SGCP_ezPLOT
function accepts the sgcp
result
from ezSGCP
, along with expData
and
geneID
inputs. It generates two files, an Excel spreadsheet
and a PDF, depending on the initial call. Users can also retain the
plotting object by setting keep = TRUE
. In this example, we
enable silhouette_in to visualize the silhouette index plot.
message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)
Note that in order to store files in xlsx format, Excel must be installed on your system.
For a detailed explanation of parameters, please refer to the help document.
SGCP
In DetailThe ezSGCP
function described earlier acts as a wrapper,
sequentially calling several other SGCP
functions in the
following order. Let’s explore each step in detail using the example data.
adjacencyMatrix
function, performs Network
Construction stepclustering
function, performs Network
Clustering stepgeneOntology
function, performs Gene Ontology
Enrichment step (produces initial clusters)semiLabeling
function, performs
Semi-labeling stepsemiSupervised
function, performs
Semi-supervised stepgeneOntology
function, performs Gene Ontology
Enrichment step (produces final modules)To conserve space, let’s omit the details about
sgcp
.
Here, we’ll skip the details of the function’s input and output, as
they are described in the help document. SGCP
enables users
to visualize the Principal Component Analysis (PCA
) of the
input gene expression data using the SGCP_plot_pca
function.
## Plotting PCA of expression data
adjacencyMatrix
functionThe Network Construction step utilizes the
adjacencyMatrix
function. By default, this function
normalizes each gene vector using its L2 norm and employs a Gaussian
kernel metric for similarity computation. Additionally, it integrates
the second-order neighborhood information of genes into the network as
Topological Overlap Matrix (TOM). The code snippet below demonstrates
its usage:
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
## 1 2 3 4 5
## 1 0.000000e+00 1.561704e-23 1.684533e-18 5.096042e-21 2.304447e-34
## 2 1.561704e-23 0.000000e+00 3.073427e-08 9.883634e-11 1.560804e-14
## 3 1.684533e-18 3.073427e-08 0.000000e+00 2.990119e-04 3.639579e-13
## 4 5.096042e-21 9.883634e-11 2.990119e-04 0.000000e+00 2.664746e-14
## 5 2.304447e-34 1.560804e-14 3.639579e-13 2.664746e-14 0.000000e+00
## 6 1.880926e-37 1.859059e-16 1.732464e-15 6.205653e-17 6.756836e-01
## 7 7.700277e-16 3.234158e-06 1.891155e-02 3.050233e-04 7.223951e-11
## 8 4.549591e-34 2.956936e-14 5.344484e-13 1.850700e-14 6.616417e-01
## 9 2.052374e-34 1.424767e-14 2.918285e-13 2.113249e-14 7.306804e-01
## 10 5.767117e-34 3.294000e-14 6.190725e-13 3.546475e-14 7.471861e-01
resAdja
represents the adjacency matrix of the gene
co-expression network. To visualize the heatmap of this adjacency
matrix, you can use the SGCP_plot_heatMap
function.
Uncomment the following lines of code to view the heatmap.
clustering
functionThe Network Clustering step is implemented using
clustering
function, which takes the adjacency matrix,
geneID
, and annotation_db
as inputs. Depending
on the initial call, it returns a list of results. In this example, the
sil
parameter is set to TRUE
to compute the
silhouette index per cluster
dropped.indices
: A vector of indices corresponding
to dropped genes, which are considered noise and removed during
processing.
geneID
: A vector containing gene
identifiers.
method
: Indicates the method used to determine the
number of clusters.
k
: The number of clusters identified.
Y
: Transformed matrix with dimensions
(number of genes) x (2*k)
.
X
: Eigenvalues corresponding to the first
2 * k
columns in Y
.
cluster
: An object of class “kmeans” representing
the clustering results.
clusterLabels
: A vector containing the cluster
labels assigned to each gene. There is a one-to-one correspondence
between geneID
and clusterLabels.
conductance
: A list containing mean and median
conductance indices, along with individual cluster conductance indices
for each method. The clusterConductance
field indexes
denote cluster labels, and their values indicate the conductance
index.
cvGOdf
: A dataframe used for gene ontology
validation, presenting enrichment results for each method on the cluster
with the minimum conductance index.
cv
: A string indicating the validation method for
the number of clusters; “cvGO”:gene ontology validation,
“cvConductance”: conductance validation, “userMethod”: user-defined
method, “userkopt”: user-defined kopt.
clusterNumberPlot
: A plotting object displaying
results for methods such as “relativeGap”, “secondOrderGap”, and
“additiveGap”.
silhouette
: A dataframe showing silhouette indices
for genes.
original
:A list containing matrix transformation
details, including eigenvalues (n_egvec
), with the top
columns of transformation retained.
Uncomment and run the following code snippet to execute the function.
Since it may take some time, the results are stored in
resClus
.
# resClus <- clustering(adjaMat = resAdja, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE)
data(resClus)
summary(resClus)
## Length Class Mode
## dropped.indices 0 -none- numeric
## geneID 1500 -none- character
## method 1 -none- character
## k 1 -none- numeric
## Y 6000 -none- numeric
## X 4 -none- numeric
## cluster 9 kmeans list
## clusterLabels 1500 -none- character
## conductance 3 -none- list
## cvGOdf 9 data.frame list
## cv 1 -none- character
## clusterNumberPlot 3 -none- list
## silhouette 3 data.frame list
# lets drop noisy genes from expData
geneID <- resClus$geneID
if(length(resClus$dropped.indices)>0 ){ expData <- expData[-resClus$dropped.indices, ]}
# removing the adjacency matrix
rm(resAdja)
If you execute the code, the following information will be printed out for you.
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
We observed that three methods (“relativeGap”, “secondOrderGap”,
“additiveGap”) produced three potentially distinct numbers of clusters.
Following gene ontology validation, SGCP
selected the
“secondOrderGap” method, resulting in 2 clusters. The cv
field under “cvGO” indicates the number of clusters determined through
gene ontology validation. The original field provides the first
n_egvec
columns of eigenvectors and their corresponding
eigenvalues from the transformed matrix, which can be useful for further
investigation.
To visualize PCA plots of the expression and transformed data with
and without labels, use SGCP_plot_pca
. Uncomment the
following lines to view the resulting PCAs.
## Plotting PCA of expression data with label
# pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
## Plptting PCA of transformed data without label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
## Plotting PCA of transformed data with label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = resClus$clusterLabels, tit = "PCA plot with label", ps = .5)
# print(pl)
In resClus
, you can find plotting objects for each
method used to determine the number of clusters.
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$relativeGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$secondOrderGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$additiveGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
In SGCP
, you can visualize the conductance index per
cluster using the SGCP_plot_conductance
function. The
conductance index is calculated only when both kopt
and
method
are NULL.
If either parameter is
specified, SGCP
skips conductance computation as it does
not need to validate the number of clusters.
# checking if conductance index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
message("plotting cluster conductance index...")
pl <- SGCP_plot_conductance(conduct = resClus$conductance,
tit = "Cluster Conductance Index",
xname = "clusterLabel", yname = "conductance index")
print(pl)}
## plotting cluster conductance index...
In here We observed that
clusters with lower conductance indices tend to exhibit higher
enrichment. This analysis confirms our observation. In the clustering
function, gene ontology enrichment is compared among clusters with the
minimum conductance for the methods “relativeGap”
(cluster rg1
), “secondOrderGap” (cluster sg4
),
and “additiveGap” (cluster ag1
). Among them, cluster
rg1
shows higher enrichment. Interestingly, this cluster
also demonstrates a lower conductance index compared to ag1
and sg4.
In SGCP, you can also plot the silhouette index per gene if the
sil
parameter is set to TRUE
in the clustering
function.
# checking if silhouette index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
message("plotting gene silhouette index...")
pl <- SGCP_plot_silhouette(resClus$silhouette, tit = "Gene Silhouette Index",
xname = "genes", yname = "silhouette index")
print(pl)}
## plotting gene silhouette index...
The silhouette index ranges between -1 and 1
, where
values closer to 1
indicate better clustering fit for
genes. As observed, the majority of genes have a silhouette index very
close to 1
, suggesting they are well-described by the
clusters. Interestingly, in the worst-case scenario, some genes have a
silhouette index of zero, with no negative indices observed for any
gene.
geneOntology
functionThe Gene Ontology Enrichment step is executed using
the geneOntology
function. This function calculates gene
ontology enrichment on the clusters and returns two outputs;
GOresults
, results of gene ontology enrichment and
FinalGOTermGenes
, a list that maps gene IDs to the
corresponding GO terms identified in GOresults.
r Please
uncomment the code above to execute the function and store the results
in resInitialGO, keeping in mind that it may take some time to
complete.
# resInitialGO <- geneOntology(geneUniv = geneID, clusLab = resClus$clusterLabels,
# annotation_db = annotation_db)
data(resInitialGO)
head(resInitialGO$GOresults)
## clusterNum GOtype GOID Pvalue OddsRatio ExpCount Count Size
## 1 1 underMF GO:0005201 5.280629e-15 0.1084677 41.36170 10 72
## 2 1 underMF GO:0005198 2.081705e-11 0.2448824 62.04255 29 108
## 3 1 underCC GO:0031012 1.276834e-10 0.3605085 107.53125 67 186
## 4 1 underCC GO:0030312 2.159404e-10 0.3663565 108.10938 68 187
## 5 1 underCC GO:0005576 1.076738e-09 0.5227524 350.34375 294 606
## 6 1 underCC GO:0071944 1.653199e-09 0.5292955 446.31250 390 772
## Term
## 1 extracellular matrix structural constituent
## 2 structural molecule activity
## 3 extracellular matrix
## 4 external encapsulating structure
## 5 extracellular region
## 6 cell periphery
If you execute the code, the following information will be printed out for you.
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
Above shows the enrichment dataframe returned from GOstats (link)[https://bioconductor.org/packages/release/bioc/html/GOstats.html]
package. clusterNum
indicates the cluster label.
SGCP
offers four functions to visualize gene ontology
enrichment results: SGCP_plot_jitter
,
SGCP_plot_density
, SGCP_plot_bar
, and
SGCP_plot_pie.
It’s important to note that p-values are
log-transformed in these visualizations. Therefore, higher points
indicate more significant GO terms.
## plotting initial GO p-values jitters...
pl <- SGCP_plot_jitter(df = resInitialGO$GOresults,
tit = "Initial GO p-values",
xname = "cluster", yname = "-log10 p-value", ps = 3)
print(pl)
## plotting initial GO p-values density
pl <- SGCP_plot_density(df = resInitialGO$GOresults,
tit = "Initial GO p-values Density",
xname = "cluster", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.199
## plotting initial GO p-values mean
pl <- SGCP_plot_bar(df = resInitialGO$GOresults, tit = "Cluster Performance",
xname = "cluster", yname = "mean -log10 p-value")
print(pl)
## plotting initial GO p-values pie chart...
pl <- SGCP_plot_pie(df = resInitialGO$GOresults, tit = "Initial GO Analysis",
xname = "cluster", yname = "count", posx = 1.8)
print(pl)
Each segment in the plot is labeled with the log-transformed p-value of the most significant term found in that segment.
semiLabeling
functionThe Gene Semi-labeling step utilizes the
semiLabeling
function. This function takes the output from
the geneOntology
function and returns a dataframe that
identifies remarkable and unremarkable genes, along with the associated
cutoff value.
resSemiLabel <- semiLabeling(geneID = geneID,
df_GO = resInitialGO$GOresults,
GOgenes = resInitialGO$FinalGOTermGenes)
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
## geneID label
## 1 7503 1
## 2 100462977 <NA>
## 3 4878 2
## 4 4879 2
## 5 6192 1
## 6 9086 <NA>
##
## 1 2
## 667 498
In the table above, genes are listed with their corresponding cluster
labels if they are remarkable; otherwise, they are labeled as NA. A
cluster is considered remarkable if it contains at least one remarkable
gene. Therefore, both clusters shown are remarkable here. In our
discussion here, we highlighted that if the number of clusters exceeds
2, SGCP
will eliminate unremarkable clusters through the
Semi-labeling and Semi-supervised steps. Genes from these clusters are
reassigned to remarkable clusters. Consequently, the number of modules
may decrease while module enrichment increases. These steps modify the
original clusters produced by the clustering function, resulting in a
different set of clusters. Therefore, SGCP
returns two sets
of results: (i) immediately after the clustering function, referred to
as clusters, and (ii) after the semiSupervised
function,
referred to as modules. In our manuscript, we distinguish these as
pSGCP
and SGCP
, respectively.
semiSupervised
functionThe Gene Semi-supervised step utilizes the
semiSupervised
function. The final module labeling is
provided in FinalLabeling
, which is a dataframe containing
geneIDs
with their corresponding semi-labels and final
labels. Uncomment and run the following code to perform gene
semi-supervised labeling.
resSemiSupervised <- semiSupervised(specExp = resClus$Y,
geneLab = resSemiLabel$geneLabel,
model = "knn", kn = NULL)
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:generics':
##
## train
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 24-nearest neighbor model
## Training set outcome distribution:
##
## 1 2
## 667 498
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 205 130
## semi-supervised done!..
## geneID semiLabel FinalLabel
## 1 7503 1 1
## 2 100462977 <NA> 2
## 3 4878 2 2
## 4 4879 2 2
## 5 6192 1 1
## 6 9086 <NA> 1
##
## 1 2
## 872 628
In this step, the k-nearest neighbor model is selected, and its hyper-parameters are chosen through cross-validation based on the accuracy metric. The table above displays the gene semi-labeling and final labeling.
You can visualize the PCA plot with the final labeling using the
SGCP_plot_pca
function. Uncomment the following code to
view the resulting PCAs:
geneOntology
functionFinally, the Gene Ontology Enrichment step is
performed once more on the final modules to enhance module enrichment.
Again, due to its time-consuming nature, the code is commented out, and
the output is stored in resFinalGO
. Uncomment and run the
following code for practice:
# resFinalGO <- geneOntology(geneUniv = geneID, clusLab = resSemiSupervised$FinalLabeling$FinalLabel,
# annotation_db = annotation_db)
data(resFinalGO)
head(resFinalGO$GOresults)
## clusterNum GOtype GOID Pvalue OddsRatio ExpCount Count Size
## 1 1 underMF GO:0005201 4.792355e-15 0.1081310 41.41277 10 72
## 2 1 underMF GO:0005198 1.872816e-11 0.2440998 62.11915 29 108
## 3 1 underCC GO:0031012 1.118311e-10 0.3593320 107.65761 67 186
## 4 1 underCC GO:0030312 1.893638e-10 0.3651603 108.23641 68 187
## 5 1 underCC GO:0005576 8.138271e-10 0.5201268 350.75543 294 606
## 6 1 underCC GO:0071944 1.165513e-09 0.5259400 446.83696 390 772
## Term
## 1 extracellular matrix structural constituent
## 2 structural molecule activity
## 3 extracellular matrix
## 4 external encapsulating structure
## 5 extracellular region
## 6 cell periphery
Upon running the code, you will observe the following output:
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
Now, let’s examine the corresponding plots for final module enrichment.
## plotting final GO p-values jitters...
pl <- SGCP_plot_jitter(df = resFinalGO$GOresults, tit = "Final GO p-values",
xname = "module", yname = "-log10 p-value", ps = 3)
print(pl)
## plotting final GO p-values density...
pl <- SGCP_plot_density(df = resFinalGO$GOresults,
tit = "Final GO p-values Density",
xname = "module", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.197
## plotting final GO p-values mean...
pl <- SGCP_plot_bar(df = resFinalGO$GOresults, tit = "Module Performance",
xname = "module", yname = "mean -log10 p-value")
print(pl)
## plotting final GO p-values pie xhart...
pl <- SGCP_plot_pie(df = resFinalGO$GOresults, tit = "Final GO Analysis",
xname = "module", yname = "count", posx = 1.9)
print(pl)
Comparing this result with the initial clusters, we observe a slight increase in gene ontology enrichment. We have found that when all initial clusters are remarkable, as observed here, the Semi-labeling and *Semi-supervised** steps do not fundamentally change the gene cluster labels, resulting in convergence of initial clusters towards final clusters. However, if initial clusters include non-significant ones, SGCP eliminates these clusters, enhancing the enrichment of remarkable clusters. Consequently, the final modules differ from the initial clusters. Further details are available in the manuscript. SGCP manuscript.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_7.0-1 lattice_0.22-6
## [3] ggplot2_3.5.1 org.Hs.eg.db_3.20.0
## [5] AnnotationDbi_1.69.0 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 MatrixGenerics_1.19.1
## [15] matrixStats_1.5.0 SGCP_1.7.0
## [17] knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 rstudioapi_0.17.1
## [4] jsonlite_1.8.9 magrittr_2.0.3 farver_2.1.2
## [7] rmarkdown_2.29 vctrs_0.6.5 memoise_2.0.1
## [10] RCurl_1.98-1.16 htmltools_0.5.8.1 S4Arrays_1.7.1
## [13] forcats_1.0.0 haven_2.5.4 cellranger_1.1.0
## [16] pROC_1.18.5 SparseArray_1.7.3 parallelly_1.41.0
## [19] sass_0.4.9 bslib_0.8.0 plyr_1.8.9
## [22] lubridate_1.9.4 rootSolve_1.8.2.4 cachem_1.1.0
## [25] buildtools_1.0.0 lifecycle_1.0.4 iterators_1.0.14
## [28] pkgconfig_2.0.3 Matrix_1.7-1 R6_2.5.1
## [31] fastmap_1.2.0 GenomeInfoDbData_1.2.13 future_1.34.0
## [34] digest_0.6.37 Exact_3.3 colorspace_2.1-1
## [37] RSpectra_0.16-2 RSQLite_2.3.9 labeling_0.4.3
## [40] timechange_0.3.0 httr_1.4.7 abind_1.4-8
## [43] compiler_4.4.2 proxy_0.4-27 bit64_4.6.0-1
## [46] withr_3.0.2 DBI_1.2.3 MASS_7.3-64
## [49] lava_1.8.1 DelayedArray_0.33.3 gld_2.6.6
## [52] ModelMetrics_1.2.2.2 Category_2.73.0 tools_4.4.2
## [55] zip_2.3.1 future.apply_1.11.3 nnet_7.3-20
## [58] glue_1.8.0 nlme_3.1-166 grid_4.4.2
## [61] reshape2_1.4.4 recipes_1.1.0 gtable_0.3.6
## [64] class_7.3-23 data.table_1.16.4 lmom_3.2
## [67] hms_1.1.3 XVector_0.47.2 foreach_1.5.2
## [70] pillar_1.10.1 stringr_1.5.1 genefilter_1.89.0
## [73] splines_4.4.2 dplyr_1.1.4 survival_3.8-3
## [76] bit_4.5.0.1 annotate_1.85.0 tidyselect_1.2.1
## [79] RBGL_1.83.0 GO.db_3.20.0 maketools_1.3.1
## [82] Biostrings_2.75.3 xfun_0.50 expm_1.0-0
## [85] hardhat_1.4.0 timeDate_4041.110 stringi_1.8.4
## [88] UCSC.utils_1.3.1 yaml_2.3.10 boot_1.3-31
## [91] evaluate_1.0.3 codetools_0.2-20 tibble_3.2.1
## [94] Rgraphviz_2.51.0 BiocManager_1.30.25 graph_1.85.1
## [97] cli_3.6.3 rpart_4.1.24 xtable_1.8-4
## [100] DescTools_0.99.58 munsell_0.5.1 jquerylib_0.1.4
## [103] Rcpp_1.0.14 readxl_1.4.3 globals_0.16.3
## [106] png_0.1-8 parallel_4.4.2 XML_3.99-0.18
## [109] gower_1.0.2 blob_1.2.4 GOstats_2.73.0
## [112] AnnotationForge_1.49.0 bitops_1.0-9 listenv_0.9.1
## [115] mvtnorm_1.3-3 GSEABase_1.69.0 ipred_0.9-15
## [118] scales_1.3.0 prodlim_2024.06.25 e1071_1.7-16
## [121] ggridges_0.5.6 purrr_1.0.2 openxlsx_4.2.7.1
## [124] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0