Title: | Comparison of disease and drug profiles using Gene set Enrichment Analysis |
---|---|
Description: | This package generates ranked lists of differential gene expression for either disease or drug profiles. Input data can be downloaded from Array Express or GEO, or from local CEL files. Ranked lists of differential expression and associated p-values are calculated using Limma. Enrichment scores (Subramanian et al. PNAS 2005) are calculated to a reference set of default drug or disease profiles, or a set of custom data supplied by the user. Network visualisation of significant scores are output in Cytoscape format. |
Authors: | C. Pacini |
Maintainer: | j. Saez-Rodriguez <[email protected]> |
License: | GPL-3 |
Version: | 2.49.0 |
Built: | 2024-12-29 06:24:13 UTC |
Source: | https://github.com/bioc/DrugVsDisease |
This package generates ranked lists of differential gene expression for either disease or drug profiles. Input data can be downloaded from Array Express [1] or GEO [2], or from local CEL files. Ranked lists of differential expression and associated p-values are calculated using Limma [3]. Enrichment scores [4] are calculated to a reference set of default drug or disease profiles, or a set of custom data supplied by the user. Significance scores are output in Cytoscape http://www.cytoscape.org/ format.
Package: | DvD |
Type: | Package |
Version: | 1.0 |
Date: | 2012-06-15 |
License: | GPL-2 |
LazyLoad: | yes |
Profiles are calculated via generateprofiles and selected profiles are then classified using classifyprofile.
C. Pacini
Maintainer: J Saez-Rodriguez <[email protected]>
[1]Parkinson et al. (2010) ArrayExpress update an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucl. Acids Res.,doi: 10.1093/nar/gkq1040.
[2]Barrett T et al. (2011) NCBI GEO: archive for functional genomics data sets-10 years on. Nucl. Acids Res, 39, D1005-D1010.
[3]Smyth et al. (2004). Linear models and empirical Bayes method for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.
[4]Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. & Mesirov, J. P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545-15550.
generateprofiles
, selectrankedlists
, classifyprofile
#profileAE<-generateprofiles(input="AE",accession="E-GEOD-22528") #selprofiles<-selectrankedlists(profileAE,1) #default<-classifyprofile(data=selprofiles$ranklist, signif.fdr=1.1)
#profileAE<-generateprofiles(input="AE",accession="E-GEOD-22528") #selprofiles<-selectrankedlists(profileAE,1) #default<-classifyprofile(data=selprofiles$ranklist, signif.fdr=1.1)
For a set of ranked gene expression profiles, enrichment scores to a default or custom set of profiles. The ranked gene expression profiles can have been generated from generate profiles or provided by the user. The enrichment scores are assessed for significance using permutations to generate random profiles.
classifyprofile(data, pvalues = NULL, case = c("disease", "drug"), type = c("fixed", "dynamic", "range"), lengthtest = 100, ranges = seq(100, 2000, by = 100), adj = c("BH","qvalue"), dynamic.fdr = 0.05, signif.fdr = 0.05, customRefDB = NULL, noperm = 1000, customClusters = NULL, clustermethod = c("single", "average"), avgstat = c("mean", "median"), cytoout = FALSE, customsif = NULL, customedge = NULL, cytofile = NULL, no.signif = 10,stat=c("KS","WSR"))
classifyprofile(data, pvalues = NULL, case = c("disease", "drug"), type = c("fixed", "dynamic", "range"), lengthtest = 100, ranges = seq(100, 2000, by = 100), adj = c("BH","qvalue"), dynamic.fdr = 0.05, signif.fdr = 0.05, customRefDB = NULL, noperm = 1000, customClusters = NULL, clustermethod = c("single", "average"), avgstat = c("mean", "median"), cytoout = FALSE, customsif = NULL, customedge = NULL, cytofile = NULL, no.signif = 10,stat=c("KS","WSR"))
data |
Matrix gene expression profiles. Rows are genes whose names match the genelist from DvDdata and columns are the different profiles. This can also be a path to a txt file containing the data. |
pvalues |
Optional numerical matrix of pvalues associated with the differential expression of the ranked lists in data argument. This can also be a path to a txt file containing the data. |
case |
Character string indicating whether or not the input profiles are disease (default) or drug profiles |
type |
Character string giving the method of gene set sizes, one of fixed (default), dynamic or range |
lengthtest |
Integer giving the number of genes to generate a set size for use with type=fixed, default 100. |
ranges |
A vector of integer values giving a range of gene set sizes, for use with type=range, default 100 to 2000 every 100. |
adj |
Character string to set the method for multiple hypothesis testing, one of qvalue or BH (default) |
dynamic.fdr |
Double giving the false discovery rate to determine the gene set sizes (default 5 |
signif.fdr |
Double giving the false discovery rate to determine significance of enrichment scores (default 5 |
customRefDB |
Optional matrix of reference ranked profiles to compare the input profiles to. Alternatively a string giving the name of the path where the database is stored. |
noperm |
Integer for the number of permutation profiles (default 1000) |
customClusters |
Optional data frame of cluster assignments which relate to customRefDB |
clustermethod |
Character string to give the cluster method one of single (default) or average |
avgstat |
Character string for the statistic used with average cluster method. One of mean (default) or median. |
cytoout |
Logical if Cytoscape SIF and Edge Attribute files should be produced (default is FALSE) |
customsif |
Optional SIF input needed for use with custom clusters. Can be an R object or a character string containing a file path |
customedge |
Optional Edge attribute input needed for use with custom clusters. Can be an R object or a character string containing a file path |
cytofile |
Character string for the filename of the Cytoscape output |
no.signif |
Integer giving the maximum number of significant enrichment scores to return. Default is 10. |
stat |
One of KS (default) or WSR. KS is the Kolmogorov-Smirnov type statistic which equally weights all elements of the gene set. WSR uses both the sign and the position in the ranked list to calculate an enrichment score. |
The classify profile function contains a default set of drug (from the Connectivity Map [2]) and disease profiles (from various GEO profiles) with corresponding clusters which input profiles of disease and drug respectively are compared to. Enrichment scores [1] are calculated between the input profiles and the corresponding inverted reference profiles such that, the score measures the enrichment of up-regulated genes in the input profiles in the down-regulated genes in the reference set[3]. The gene set sizes to use in the Enrichment scores can be specified using one of three methods - fixed, range or dynamic. With the fixed method the user specifies a fixed number of genes to use with all input profiles. The range option takes a vector of integers - enrichment scores are calculated for each profile using a gene set size as given by the vector. The dynamic option uses the p-values to determine the gene set size according to the number of significantly differential expressed genes (following multiple hypothesis correction). Multiple hypothesis correction is done using one of two methods, qvalue or Benjamini-Hochberg. Two enrichment scores are calculated for the up and down regulated scores. These contain the scores where the input profiles gene set is compared to the reference profile and where the reference profiles equivalent gene set (as determined by the type method) are compared to the input profile when using the KS option [4]. The WSR option implements the score of [5]. Various optional parameters exist for comparing an input profile to a users own reference set. For using custom reference data the user needs to provide the custom ranked lists of differential expression (customRefDB), the corresponding clusters (or network) between nodes in the reference data set (customClusters) and the SIF and Edge attribute files for Cytoscape option if cytoout=TRUE. Default clusters are provided by the DvDdata package, and include a drug and disease network. Input profiles to classifyprofile can be assigned to clusters using either single or average linkage. With single linkage an edge is drawn between the input profile and any significant scoring (up to a user defined maximum of no.signif) reference profiles. For average linkage either the mean or median (specified through avgstat) of the scores to each member in a cluster is calculated and the profile is assigned to the cluster with the highest average score.
To use your own preprocessed data, make sure the txt files for the data (and optional pvalues) have rownames with genes matching those in the reference data set. The files should have genes as row names in the first column and the header (col names) giving the names of the input profile(s). The input to classifyprofile is then a string of the path to the files.
Data Frame for each profile with elements:
Node |
Names of the profiles with significant scores to the input profile(s) |
ES Distance |
The distance between the input profile and corresponding reference profile. Defined as 1-ES |
Cluster |
The cluster number of the node |
RPS |
Running sum Peak Sign, taking values -1 to indicate an inverse relationship (potentially) therapeutic, or 1 to indicate a similar profile. |
C. Pacini
[1]Subramanian A et~al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545-15550.
[2]Lamb J et~al. (2006) The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science, 313(5795), 1929-1935.
[3]Sirota M et~al. (2011) Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Sci Trans Med,3:96ra77.
[4]Iorio et al. (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. PNAS, 107(33), 14621-14626.
[5]Zhang S et al. (2008) A simple and robust method for connecting small- molecule drugs using gene-expression signatures. BMC Bioinformatics, 9:258.
Function for generating profiles for input to classifyprofile: generateprofiles
.
data(selprofile) classification<-classifyprofile(data=selprofile$ranklist,signif.fdr=1,noperm=20)
data(selprofile) classification<-classifyprofile(data=selprofile$ranklist,signif.fdr=1,noperm=20)
Combine a set of hybridisations using the median rank method as proposed by Warnat
combineProfiles(data)
combineProfiles(data)
data |
A list of R objects, each object should contain a matrix of expression values to be merged with each other. The first object will be used as the reference data set. |
The combine profiles function merges a set of expression values, normalising across different experiments to facilitate the meta-analysis of similar experiments i.e. cell lines treated with the same drug but which may be from different "batches" or platforms.
Matrix of rank normalised expression data. Rows are genes, columns are hybridisations from the different experiments.
C. Pacini
R code from CONOR package. Need also reference to the method.
Matrix containing a set of drugs and their corresponding clusters which can be used as input to the classifyprofile function.
data(customClust)
data(customClust)
Data frame with 47 rows and 2 columns. Columns are headed "Drug" and "Cluster".
Each row refers to a compound in the customDB data set
link{customDB}
with its corresponding cluster assignment in the
second column. These profiles are a subset of the Connectivity Map data
[1] (full set available in the DrugVsDiseasedata package
DrugVsDiseasedata, data object drugRL
, for example use. Clusters
were generated using affinity propagation clustering [2]
http://www.broadinstitute.org/cmap/
[1] Lamb J et~al. (2006) The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science, 313(5795), 1929-1935. [2] U. Bodenhofer, A. Kothmeier, and S. Hochreiter. APCluster: an R package for affinity propagation clustering. Bioinformatics, 27(17):2463-2464, 2011.
data(customClust)
data(customClust)
Matrix containing a set of ranked gene expression profiles which can be used as custom input into classifyprofiles.
data(customDB)
data(customDB)
This is an example subset of the drugRL available in the DrugVsDiseasedata package
DrugVsDiseasedata, data object drugRL
, based on data from the Connectivity
Map
http://www.broadinstitute.org/cmap/
data(customDB) ## maybe str(profiles) ; plot(profiles) ...
data(customDB) ## maybe str(profiles) ; plot(profiles) ...
Data frame containing the required format of an edge attribute file input into Cytoscape
data(customedge)
data(customedge)
Example subset of edge attributes taken from the full reference data set of drug compounds in the DvDdata package.
http://www.broadinstitute.org/cmap/
data(customedge)
data(customedge)
Data frame containing SIF format file of a network which could be input into the classifyprofile function.
data(customsif)
data(customsif)
Example subset of edge attributes taken from the full reference data set of drug compounds in the DvDdata package.
http://www.broadinstitute.org/cmap/
data(customsif)
data(customsif)
Processing Affymetrix data to generate ranked lists of differential gene expression and associated p-values.
generateprofiles(input = c("AE", "GEO", "localAE", "local"), normalisation = c("rma", "mas5"), accession = NULL, customfile = NULL, celfilepath = NULL, sdrfpath = NULL, case = c("disease", "drug"), statistic = c("coef", "t", "diff"), annotation = NULL, factorvalue = NULL,annotationmap=NULL,type=c("average","medpolish","maxvar","max"),outputgenedata=FALSE)
generateprofiles(input = c("AE", "GEO", "localAE", "local"), normalisation = c("rma", "mas5"), accession = NULL, customfile = NULL, celfilepath = NULL, sdrfpath = NULL, case = c("disease", "drug"), statistic = c("coef", "t", "diff"), annotation = NULL, factorvalue = NULL,annotationmap=NULL,type=c("average","medpolish","maxvar","max"),outputgenedata=FALSE)
input |
Character string denoting the source of the data. One of AE (default), GEO, localAE or local. |
normalisation |
Character string denoting the normalisation procedure as implemented in the affy package. One of mas5 (default) or rma. |
accession |
Optional character string giving the database reference for use with either the AE or GEO options. |
customfile |
Optional character string giving the path of a file containing the factor values associated with the CEL files specified in folder celfilepath |
celfilepath |
Optional character string giving the path of a folder containing CEL files to analyse. |
sdrfpath |
Optional character string giving path of an sdrf file corresponding to CEL files in celfilepath |
case |
Character string, one of disease (default) or drug denoting whether the input profiles are disease or drug profiles. |
statistic |
Character string, one of coef (default), t or diff. |
annotation |
Optional character string giving the platform of the affymetrix files |
factorvalue |
Optional character string giving the name of the factor value in the GEO database. |
annotationmap |
Optional matrix, or string to text file, containing an annotation map to convert from probes (first column) to HUGO gene symbols (second column). If passing a file path name the text file should have only two columns without rownames or headers. |
type |
The type of statistic to use to combine multiple probes to a single gene. Can be one of average (default) expression values, median polish, maxvar: the single probe to represent the set which has maximum variance or max to use the probe with maximal variance. |
outputgenedata |
Boolean set to default FALSE. Outputs the gene data produced by generate profiles instead of the fitted coefficients from the linear models. |
Input types of AE and GEO use raw data download from Array Express using the ArrayExpress [1] package or processed GDS files from GEO using the GEOquery package [2]. CEL files and sdrf files downloaded from Array Express and stored locally can be processed using localAE option with the sdrf file path specified in sdrfpath and the path of the folder containing the CEL files contained in celfilepath. Users data stored locally can be processed using the local option with CEL file folders in celfilepath and factors associated with the CEL files in customfile. Where metadata may be missing from the GEO database, platform annotations can be specified using the annotation parameters and the name of main factor value (e.g. disease status, or compound treatment) using factorvalue option. Raw CEL files are normalised (rma or mast)[3] and data is converted from probes to genes using BioMart annotations [4]. Linear models are fitted using the database factor vales or user provided factors for locally stored data [5]. The differential expression is calculated for HUGO genes with the mapping performed automatically for Affymetrix platforms, HGU133A, HGU133Plus2 and HGU133A2 using BioMart. The differential expression statistic is one of coef (default), which corresponds to log (base 2) FC, diff (which is the difference between raw (non-logged) expression values, or t for the t-statistic based on log base 2 expression values.
List with two elements:
Ranklist |
Matrix containing the ranks of gene expression. Rows containing the genes, columns the different profiles |
Pvalues |
Matrix containing the associated p-values to the differential expression profiles in Ranklist |
C. Pacini
[1]Kauffmann et al. (2009) Importing Array Express datasets into R/Bioconductor. Bioinformatics, 25(16):2092-4.
[2]Davis et al. (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 14, 1846-1847.
[3]Irizarry et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4); e15.
[4]Durinck et al. (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols 4, 1184-1191.
[5]Smyth et al. (2004). Linear models and empirical Bayes method for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.
#profileAE<-generateprofiles(input="AE",accession="E-GEOD-22528")
#profileAE<-generateprofiles(input="AE",accession="E-GEOD-22528")
List containing ranked lists of gene expression and associated p-values for a set of profiles.
data(profiles)
data(profiles)
List containing rank of differential gene expression and pvalues in a list. Each item in the list contains a matrix. Matrix has number of rows equal number genes and columns for the different profiles.
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-22528
data(profiles) ## maybe str(profiles) ; plot(profiles) ...
data(profiles) ## maybe str(profiles) ; plot(profiles) ...
Given a list with two elements, one containing ranklists for a set of regression models and the second containing the associated p-values. This function is used to extract a subset of the models.
selectrankedlists(ranklist, colsinc)
selectrankedlists(ranklist, colsinc)
ranklist |
List with two elements (as output from generate profiles): Ranklist: Matrix containing the ranks of gene expression. Rows containing the genes, columns the different profiles. Pvalues: Matrix containing the associated p-values to the differential expression profiles in Ranklist. |
colsinc |
Vector of integers containing the column references of the profiles to select. |
Format of list provided to selectrankedlists is the same as is output from generateprofiles generateprofiles
. The output from selectrankedlists can be used as input to the classify profile function classifyprofile
.
Ranklist |
Matrix containing the ranks of gene expression. Rows containing the genes, columns the selected profiles |
Pvalues |
Matrix containing the associated p-values to the differential expression profiles in Ranklist |
C. Pacini
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-22528
generateprofiles
, classifyprofile
.
data(profiles) selectprofile<-selectrankedlists(profiles,1) classification<-classifyprofile(data=selectprofile$ranklist,noperm=10,signif.fdr=1)
data(profiles) selectprofile<-selectrankedlists(profiles,1) classification<-classifyprofile(data=selectprofile$ranklist,noperm=10,signif.fdr=1)
Ranklist which has the rank of each gene according to differential expression. P-values is the second element in the list containing the associated p-value for gene differential expression.
data(selprofile)
data(selprofile)
List with two elements, the first ranklist containing the ranked position of the gene according to its differential expression. The second item in the list is the associated pvalues.
Example ranked list of differential expression. Taken from Array express experiment E-GEOD-22528
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-22528
data(selprofile) ## maybe str(selprofile) ; plot(selprofile) ...
data(selprofile) ## maybe str(selprofile) ; plot(selprofile) ...