Title: | Gene Expression Signature based Similarity Metric |
---|---|
Description: | This package gives the implementations of the gene expression signature and its distance to each. Gene expression signature is represented as a list of genes whose expression is correlated with a biological state of interest. And its distance is defined using a nonparametric, rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic. Gene expression signature and its distance can be used to detect similarities among the signatures of drugs, diseases, and biological states of interest. |
Authors: | Yang Cao [aut, cre], Fei Li [ctb], Lu Han [ctb] |
Maintainer: | Yang Cao <[email protected]> |
License: | GPL-2 |
Version: | 1.53.0 |
Built: | 2024-10-30 07:21:52 UTC |
Source: | https://github.com/bioc/GeneExpressionSignature |
sample data, a subset of the C-MAP as , which is a collection of 50 genome-wide transcriptional expression data from cultured human cells treated with 15 different small molecules
A ExpressionSet: assay data represents the 50 genome-wide transcriptional expression data, phenotypic data describes 15 different small molecules corresponds to the expression data (assay data).
http://www.sciencemag.org/content/313/5795/1929.short Lamb et al., The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease, science 2006
This package gives the implementations of the gene expression signature and its distance to each. Gene expression signature is represented as a list of genes whose expression is correlated with a biological state of interest. And its distance is defined using a nonparametric, rank-based pattern- matching strategy based on the Kolmogorov-Smirnov statistic. Gene expression signature and its distance can be used to detect similarities among the signatures of drugs, diseases, and biological states of interest.
Sorting the micro-array probe-set identifiers according to the differential expression values with respect to the untreated hybridization to obtain a ranked list. Gene-expression profiles in are represented in a nonparametric fashion.
getRLs(control, treatment)
getRLs(control, treatment)
control |
a matrix, including the vehicle control gene expression profiles corresponding to the treatment gene expression profiles. |
treatment |
a matrix, is composed of gene expression profiles. |
The genes on the array are rank-ordered according to their differential expression relative to the control. First, control and treatment values less than a primary threshold value (quartile) were set to that threshold value. Finally, probe sets were ranked in descending order of d, where d is the ratio of the corresponding treatment-to-control values. For probe sets where d=1, a lower threshold was applied to the original difference values and a new treatment to control ratio (d') calculated. These probe sets were then sub-sorted in descending order of d.
A matrix is composed of ranked lists, a ranked list represents the corresponding gene expression profiles.
if (require(GEOquery)){ # treatment gene-expression profiles file1 <- system.file( "extdata/GSM118720.soft", package = "GeneExpressionSignature" ) GSM118720 <- getGEO(filename = file1) # control gene-expression profiles file2 <- system.file( "extdata/GSM118721.soft", package = "GeneExpressionSignature" ) GSM118721 <- getGEO(filename = file2) # data ranking according to the different expression values control <- as.matrix(as.numeric(Table(GSM118721)[, 2])) treatment <- as.matrix(as.numeric(Table(GSM118720)[, 2])) ranked_list <- getRLs(control, treatment) }
if (require(GEOquery)){ # treatment gene-expression profiles file1 <- system.file( "extdata/GSM118720.soft", package = "GeneExpressionSignature" ) GSM118720 <- getGEO(filename = file1) # control gene-expression profiles file2 <- system.file( "extdata/GSM118721.soft", package = "GeneExpressionSignature" ) GSM118721 <- getGEO(filename = file2) # data ranking according to the different expression values control <- as.matrix(as.numeric(Table(GSM118721)[, 2])) treatment <- as.matrix(as.numeric(Table(GSM118720)[, 2])) ranked_list <- getRLs(control, treatment) }
Merging the assay data according to phenotypic data of the input ExpressionSet. Each group of the ranked lists with the same phenotypic data is aggregated into a single list, return it as an ExpressionSet object.
RankMerging( exprSet, MergingDistance = c("Spearman", "Kendall"), weighted = TRUE )
RankMerging( exprSet, MergingDistance = c("Spearman", "Kendall"), weighted = TRUE )
exprSet |
an ExpressionSet object, each column of assay data represents a ranked list obtained by preprocessing the corresponding gene expression profile, and phenotypic data represents the short description (characteristics of gene expression profile, such as the drug type, the disease state) about the assay data. |
MergingDistance |
distance to be used which "measures" the similarity of ordered lists, the default is "Spearman" |
weighted |
there are tow rank merging approaches for two cases: if
|
The krubor function is used in the aggregating procedure. And the following methods are used in the implementation: a measure of the distance between two ranked lists (Spearman's Footrule), a method to merge two or more ranked lists the (Borda Merging Method), and a algorithm to obtain a single ranked list from a set of them in a hierarchical way (the Kruskal Algorithm). If choose Kendall as distance, the effectiveness of this function is certainly limited by the size of the merging problem.
a Biobase::ExpressionSet
object.
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet, "Spearman", weighted = TRUE)
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet, "Spearman", weighted = TRUE)
Compute pairwise distances between sample according to their (Prototype
Ranked List) PRL, a N x N
distance matrix is generated by calling this
function, N
is the length of PRL.
ScoreGSEA( MergingSet, SignatureLength, ScoringDistance = c("avg", "max"), p.value = FALSE )
ScoreGSEA( MergingSet, SignatureLength, ScoringDistance = c("avg", "max"), p.value = FALSE )
MergingSet |
an |
SignatureLength |
the length of "gene signature". In order to compute pairwise distances among samples, genes lists are ranked according to the gene expression ratio (fold change). And the "gene signature" includes the most up-regulated genes (near the top of the list) and the most down-regulated genes (near the bottom of the list). |
ScoringDistance |
the distance measurements between PRLs: the Average Enrichment Score Distance (:avg"), and the Maximum Enrichment Score Distance ("max"). |
p.value |
logical, if |
Once the PRL obtained for each sample, the distances between samples
are calculated base on gene signature, including the expression of genes
that seemed to consistently vary in response to the across different
experimental conditions (e.g., different cell lines and different dosages).
We take two distance measurements between PRLs: the Average
Enrichment-Score Distance Davg = (TES{x,y} + TES{y,x}) / 2
, and the Maximum
Enrichment-Score Distance Dmax = Min(TES{x,y}, TES{y,x}) / 2
.The avg is more
stringent than max, where max is more sensitive to weak similarities, with
lower precision but large recall.
an distance-matrix, the max distance is more sensitive to weak
similarities, providing a lower precision but a larger recall.If p.value
is set to TRUE
, then a list is returned that consists of the distance
matrix as well as their p.values, otherwise, without p.values in the
result.
ScorePGSEA()
,SignatureDistance()
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet,"Spearman") # get the distance matrix ds <- ScoreGSEA(MergingSet, 250, "avg")
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet,"Spearman") # get the distance matrix ds <- ScoreGSEA(MergingSet, 250, "avg")
Compute pairwise distances between sample according to their
(Prototype Ranked List) PRL, get a N x N
distance matrix is generated by
calling this function , N
is the length of PRL.
ScorePGSEA( MergingSet, SignatureLength, ScoringDistance = c("avg", "max"), p.value = FALSE )
ScorePGSEA( MergingSet, SignatureLength, ScoringDistance = c("avg", "max"), p.value = FALSE )
MergingSet |
an |
SignatureLength |
the length of "gene signature". In order to compute pairwise distances among samples, genes lists are ranked according to the gene expression ratio (fold change). And the "gene signature" includes the most up-regulated genes (near the top of the list) and the most down-regulated genes (near the bottom of the list). |
ScoringDistance |
the distance measurements between PRLs: the Average Enrichment Score Distance ("avg"), or the Maximum Enrichment Score Distance ("max"). |
p.value |
logical, if |
an distance-matrix, the max distance is more sensitive to weak
similarities, providing a lower precision but a larger recall.If p.value
is set to TRUE
, then a list is returned that consists of the distance
matrix as well as their p.values, otherwise, without p.values in the
result.
ScoreGSEA()
,SignatureDistance()
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet,"Spearman") # get the distance matrix ds <- ScorePGSEA(MergingSet,250, ScoringDistance="avg")
# load the sample expressionSet data(exampleSet) # Merging each group of the ranked lists in the exampleSet with the same # phenotypic data into a single PRL MergingSet <- RankMerging(exampleSet,"Spearman") # get the distance matrix ds <- ScorePGSEA(MergingSet,250, ScoringDistance="avg")
This function integrated the function for rank merging and distance scoring, we can do the rank merging and distance scoring simply with it.
SignatureDistance( exprSet, SignatureLength, MergingDistance = c("Spearman", "Kendall"), ScoringMethod = c("GSEA", "PGSEA"), ScoringDistance = c("avg", "max"), weighted = TRUE, ... )
SignatureDistance( exprSet, SignatureLength, MergingDistance = c("Spearman", "Kendall"), ScoringMethod = c("GSEA", "PGSEA"), ScoringDistance = c("avg", "max"), weighted = TRUE, ... )
exprSet |
an |
SignatureLength |
the length of "gene signature". In order to compute pairwise distances among samples, genes lists are ranked according to the gene expression ratio (fold change). And the "gene signature" includes the most up-regulated genes (near the top of the list) and the most down-regulated genes (near the bottom of the list). |
MergingDistance |
distance to be used which "measures" the similarity of ordered lists, "Spearman" or "Kendall". |
ScoringMethod |
method to be used to perform distance scoring, "GSEA" or "PGSEA". |
ScoringDistance |
the distance measurements between PRLs: the Average Enrichment Score Distance ("avg"), or the Maximum Enrichment Score Distance ("max"). |
weighted |
there are tow rank merging approaches for two cases: if
|
... |
additional arguments can be passed to |
the result from ScoreGSEA()
or ScorePGSEA()
.
RankMerging()
,ScoreGSEA()
,ScorePGSEA()
#load the sample expressionSet data(exampleSet) # distance scoring SignatureDistance( exampleSet, SignatureLength = 250, MergingDistance = "Spearman", ScoringMethod = "GSEA", ScoringDistance = "avg", weighted = TRUE )
#load the sample expressionSet data(exampleSet) # distance scoring SignatureDistance( exampleSet, SignatureLength = 250, MergingDistance = "Spearman", ScoringMethod = "GSEA", ScoringDistance = "avg", weighted = TRUE )