Title: | Visualise microarray and RNAseq data using gene ontology annotations |
---|---|
Description: | The package contains methods to visualise the expression profile of genes from a microarray or RNA-seq experiment, and offers a supervised clustering approach to identify GO terms containing genes with expression levels that best classify two or more predefined groups of samples. Annotations for the genes present in the expression dataset may be obtained from Ensembl through the biomaRt package, if not provided by the user. The default random forest framework is used to evaluate the capacity of each gene to cluster samples according to the factor of interest. Finally, GO terms are scored by averaging the rank (alternatively, score) of their respective gene sets to cluster the samples. P-values may be computed to assess the significance of GO term ranking. Visualisation function include gene expression profile, gene ontology-based heatmaps, and hierarchical clustering of experimental samples using gene expression data. |
Authors: | Kevin Rue-Albrecht [aut, cre], Tharvesh M.L. Ali [ctb], Paul A. McGettigan [ctb], Belinda Hernandez [ctb], David A. Magee [ctb], Nicolas C. Nalpas [ctb], Andrew Parnell [ctb], Stephen V. Gordon [ths], David E. MacHugh [ths] |
Maintainer: | Kevin Rue-Albrecht <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.41.0 |
Built: | 2024-10-31 06:23:31 UTC |
Source: | https://github.com/bioc/GOexpress |
Integrates gene expression data with gene ontology annotations to score and
visualise genes and gene ontologies best clustering groups of experimental
samples. Supports custom annotations, or alternatively provides an interface
to the Ensembl annotations using the biomaRt
package. The default
scoring approach is based on the random forest framework, while a one-way
ANOVA is available as an alteranative. GO term are scored and ranked
according to the average rank (alternatively, average power) of all
associated genes. P-values can be generated to assess the significance of
GO term ranking. The ranked list of GO terms is returned, with tools
allowing to visualise the statistics on a gene- and ontology-basis.
Package: | GOexpress |
Type: | Package |
Version: | 1.7.1 |
Date: | 2016-02-06 |
License: | GPL (>= 3) |
This package requires only two input variables
An ExpressionSet
containing assayData and phenoData.
The former
should be a gene-by-sample matrix providing gene expression values for
each gene in each sample. The latter should be an
AnnotatedDataFrame
from the Biobase
package providing
phenotypic information and grouping factors
with two or more levels.
The name of the grouping factor to investigate, which must be a
valid column name in the phenoData
slot of the above
ExpressionSet
.
Following analysis, visualisation methods include:
Histogram and quantiles representations of the scores of GO terms
Permutation-based P-values to assess the significance of GO term ranking
Filtering of results on various criteria (e.g. number of genes annotated to GO term)
Re-ordering of GO terms and gene result tables based on score or rank metric
Table of statistics for genes annotated to a given GO term
Hierarchical clustering of samples based on the expression level of genes annotated to a given GO term
Heatmap of samples and genes based on the expression level of genes annotated to a given GO term
Expression profile of a gene against one given factor (e.g. Time) while grouping samples on another given factor (e.g. Treatment)
Univariate analysis of the expression level of a gene in the different groups of each experimental factor.
Venn diagram of the counts of genes shared between a list of GO terms.
Maintainer: Kevin Rue-Albrecht <[email protected]
>
Main method for an example usage:
GO_analyse
.
Packages
Biobase
,
randomForest
,
RColorBrewer
,
VennDiagram
.
Methods
biomaRt:getBM
,
ggplot2:ggplot
,
gplots:heatmap.2
,
gplots:bluered
,
gplots:greenred
,
grid:grid.newpage
,
grid:grid.layout
,
stringr:str_extract
.
# Sample input data available with package: data(AlvMac) # Sample output data available with package: data(AlvMac_results) # Supported species and microarrays: data(microarray2dataset) data(prefix2dataset)
# Sample input data available with package: data(AlvMac) # Sample output data available with package: data(AlvMac_results) # Supported species and microarrays: data(microarray2dataset) data(prefix2dataset)
An example ExpressionSet
including expression data and phenotypic
information about the samples.
The expression data is saved in the assayData
slot of the
ExpressionSet
. It is a gene-by-sample matrix, containing a subset of
data from an in vitro stimulation of bovine macrophages with different
mycobacterial strains. Column names are sample names, and row names are
Ensembl gene identifiers of the Bos taurus species. Each cell contains
the log2-transformed normalised expression level of each gene in each sample.
The phenotypic information is saved in the phenoData
slot of the
ExpressionSet
. Row names are sample names and columns contain
descriptive information about each sample, including experimental factors(e.g.
Treatment, Timepoint, Animal).
data(AlvMac)
data(AlvMac)
Gene expression was measured in poly-A purified strand-specific RNA libraries using the RNA-Sequencing Illumina(R) HiSeq(R) 2000 platform as paired-end 2 x 90 nucleotide reads. Raw reads from pooled RNA libraries were first deconvoluted according to sample-specific nucleotide barcodes. Read pairs containing adapter sequence in either read mate were discarded, and similarly read pairs of low overall quality in either mate were also discarded. Paired-end reads from each filtered individual library were aligned to the Bos taurus reference genome (B. taurus UMD3.1.71 genome release) using the STAR aligner software. For each library, raw counts for each gene based on sense strand data were obtained using the featureCounts software from the Subread package. The featureCounts parameters were set to unambiguously assign uniquely aligned paired-end reads in a stranded manner to the exons of genes within the Bos taurus reference genome annotation (B. taurus UMD3.1.71 genome annotation). The gene count outputs were further processed using the edgeR Bioconductor package.
The gene expression quantitation pipeline within the edgeR package was customised to: (1) filter out all bovine rRNA genes; (2) filter out genes displaying expression levels below the minimally-set threshold of one count per million [CPM] in at least ten individual libraries (number of biological replicates); (3) calculate normalisation factors for each library using the trimmed mean of M-values method; (4) log2-transform CPM values based on the normalised library size.
To generate this test data subset, we extracted 100 genes from the original
dataset of 12,121 genes. All 7 genes associated with the GO term "GO:0034142"
(i.e. "toll-like receptor 4 signaling pathway") present in the original
full-size filtered-normalised dataset were kept, all 3 Ensembl
gene identifiers annotated to the gene symbol 'RPL36A', and finally another
random 90 random genes, making a total of 100 genes measured in 117 samples.
Samples include all 10 biological replicates collected at four different
time-points, see data(targets)
. The TLR4 pathway was found in the full
dataset as the top-ranking biological pathway discriminating the different
mycobacterial infections (unpublished observations).
assayData
is a matrix of expression levels for 100 genes (rows)
measured in 117 samples (columns).
rownames
are Ensembl gene identifiers of the
Bos taurus species.
colnames
are samples identifiers.
phenoData
is a data frame with 117 samples and 7 descriptive fields
(e.g. experimental factors) in the columns listed below:
rownames
are unique identifiers. Here, sample names.
File
contains local filenames where the RNAseq counts were
obtained from.
Sample
contains individual sample name.
Animal
contains the unique identifier of the animal
corresponding to the biological replicate, stored as a factor.
Treatment
contains the infection status of the sample,
stored as a factor (CN: Control, MB: M. bovis, TB:
M. tuberculosis)
Time
contains the time of measurement in hours
post-infection,stored as a factor.
Group
contains a combination of the Treatment and Time
factors above, stored as a factor itself.
Timepoint
contains the time of measurement, stored as a
numeric value. This field is useful to use on the X-axis of expression
plots. See function expression_plot()
.
Publication in review process.
# Load the data data(AlvMac) # Structure of the data str(AlvMac) # Dimensions (rows, columns) of the data dim(AlvMac) # Subset of first 5 features and 5 samples AlvMac[1:5, 1:5] # Phenotypic information pData(AlvMac) # Phenotypic information about factor "Group" AlvMac$Group # Conversion of a factor to a character vector as.character(AlvMac$Group) # Number of samples (rows) and annotations (columns) dim(pData(AlvMac))
# Load the data data(AlvMac) # Structure of the data str(AlvMac) # Dimensions (rows, columns) of the data dim(AlvMac) # Subset of first 5 features and 5 samples AlvMac[1:5, 1:5] # Phenotypic information pData(AlvMac) # Phenotypic information about factor "Group" AlvMac$Group # Conversion of a factor to a character vector as.character(AlvMac$Group) # Number of samples (rows) and annotations (columns) dim(pData(AlvMac))
An example data.frame
providing the identifier, name and description
corresponding to Ensembl gene identifiers present in the AlvMac
example ExpressionSet
.
data("AlvMac_allgenes")
data("AlvMac_allgenes")
This data-frame includes only Ensembl gene identifiers present in the
AlvMac
example ExpressionSet
.
See the help page of the GO_analyse
function for an example usage.
A data frame detailing information about the 100 gene features present in
the AlvMac
example ExpressionSet
:
gene_id
are Ensembl gene identifiers.
external_gene_name
contains the corresponding gene name.
description
contains the corresponding description.
These annotations were obtained from the Ensembl BioMart server using the
biomaRt
package to access the Ensembl release 75
http://feb2014.archive.ensembl.org.
# Load the data data(AlvMac_allgenes) # Structure of the data str(AlvMac_allgenes) # First few rows head(AlvMac_allgenes)
# Load the data data(AlvMac_allgenes) # Structure of the data str(AlvMac_allgenes) # First few rows head(AlvMac_allgenes)
An example data.frame
providing the identifier, name and namespace
corresponding
to gene ontology identifiers, compatible with the AlvMac
example
ExpressionSet
.
data("AlvMac_allGO")
data("AlvMac_allGO")
This data-frame includes all gene ontologies present in the
btaurus_gene_ensembl
dataset of the Ensembl BioMart server, including
those associated with no gene identifier in the AlvMac
example
ExpressionSet
.
See the help page of the GO_analyse
function for an example usage.
A data frame detailing information about 13,302 gene ontologies:
go_id
are gene ontology identifiers.
name_1006
contains the corresponding gene ontology name.
namespace_1003
contains the corresponding gene ontology
namespace (i.e. "biological_process", "molecular_function", or
"cellular_component")
These annotations were obtained from the Ensembl BioMart server using the
biomaRt
package to access the Ensembl release 75
http://feb2014.archive.ensembl.org.
# Load the data data(AlvMac_allGO) # Structure of the data str(AlvMac_allGO) # First few rows head(AlvMac_allGO)
# Load the data data(AlvMac_allGO) # Structure of the data str(AlvMac_allGO) # First few rows head(AlvMac_allGO)
An example data.frame
associating Ensembl gene identifiers to gene
ontology identifiers, compatible with the AlvMac
example
ExpressionSet
.
data("AlvMac_GOgenes")
data("AlvMac_GOgenes")
This data-frame includes all annotations between Ensembl gene identifiers
and gene ontology identifiers
present in the btaurus_gene_ensembl
dataset of the Ensembl BioMart
server, including gene identifiers absent from the AlvMac
example
ExpressionSet
.
Importantly, gene identifiers present in these annotations but absent from
the ExpressionSet
will be given a score of 0 (minimum valid
score; indicates no power to discriminates the predefined groups of
sample) and a rank equal to the number of genes present in the entire
dataset plus one (worst rank, while preserving discrete continuity of
the ranking). This is helpful where features considered uninformative
were filtered out of the ExpressionSet
.
See the help page of the GO_analyse
function for an example usage.
A data frame with 191,614 associations between the following 2 variables.
gene_id
are Ensembl gene identifiers of the
Bos taurus species.
go_id
are gene ontology identifiers.
These annotations were obtained from the Ensembl BioMart server using the
biomaRt
package to access the Ensembl release 75
http://feb2014.archive.ensembl.org.
# Load the data data(AlvMac_GOgenes) # Structure of the data str(AlvMac_GOgenes) # First few rows head(AlvMac_GOgenes)
# Load the data data(AlvMac_GOgenes) # Structure of the data str(AlvMac_GOgenes) # First few rows head(AlvMac_GOgenes)
GO_analyse()
function on an RNAseq experiment.
This variable may be used to test the filtering and visualisation methods
implemented in the package. It contains the output of the command
AlvMac_results = GO_analyse(eSet=AlvMac, f="Treatment")
applied to the toy input data AlvMac
.
data(AlvMac_results)
data(AlvMac_results)
A list of 9 slots summarising the input and results of the analysis:
GO
contains a table ranking all GO terms related to genes in
the expression dataset based on the average ability of their related
genes to cluster the samples according to the predefined grouping
factor.
mapping
contains the table mapping genes present in the
dataset to GO terms.
genes
contains a table ranking all genes present in the
expression dataset based on their ability to cluster the samples
according to the predefined grouping factor (see 'factor' below).
factor
contains the grouping factor analysed.
method
contains the statistical framework used.
subset
contains the filters used to select a subset of
samples from the original ExpressionSet
for analysis.
rank.by
contains the metric used to rank the scoring tables.
ntree
contains number of trees built during the randomForest
analysis.
mtry
contains the number of features randomly sampled as
candidates at each split in each tree built during the randomForest
analysis.
Running the above command again, you might obtain slightly different scores and ranks due to the stochastic process of sampling used by the random forest algorithm. However, the ranking metric was found to be robust and stable across run, given adequate number of trees and predictor variables sampled.
To produce reproducible results, use the set.seed()
function prior
to running any randomising or sampling function.
Source data are part of a publication in review.
data(AlvMac_results) str(AlvMac_results) head(AlvMac_results$GO, n=20) head(AlvMac_results$GO$genes, n=20)
data(AlvMac_results) str(AlvMac_results) head(AlvMac_results$GO, n=20) head(AlvMac_results$GO$genes, n=20)
pValue_GO()
function on an RNAseq experiment.
This variable may be used to test the filtering and visualisation methods
implemented in the package. It contains the output of
successively applying the commands
AlvMac_results = GO_analyse(eSet=AlvMac, f="Treatment")
and
AlvMac_results.pVal = pValue_GO(AlvMac_results, N=100,
ranked.by=result$rank.by, rank.by='P')
to the toy input data AlvMac
.
data("AlvMac_results.pVal")
data("AlvMac_results.pVal")
A list of 9 slots summarising the input and results of the analysis:
GO
contains a table ranking all GO terms related to genes in
the expression dataset based on the average ability of their related
genes to cluster the samples according to the predefined grouping
factor.
mapping
contains the table mapping genes present in the
dataset to GO terms.
genes
contains a table ranking all genes present in the
expression dataset based on their ability to cluster the samples
according to the predefined grouping factor (see 'factor' below).
factor
contains the grouping factor analysed.
method
contains the statistical framework used.
subset
contains the filters used to select a subset of
samples from the original ExpressionSet
for analysis.
rank.by
contains the metric used to rank the scoring tables.
ntree
contains number of trees built during the randomForest
analysis.
mtry
contains the number of features randomly sampled as
candidates at each split in each tree built during the randomForest
analysis.
p.iterations
contains the number of permutations performed
to compute the P-value in the GO
slot.
Running the above command again, you might obtain slightly different scores and ranks due to the stochastic process of sampling used by the random forest algorithm. However, the ranking metric was found to be robust and stable across run, given adequate number of trees and predictor variables sampled.
To produce reproducible results, use the set.seed()
function prior
to running any randomising or sampling function.
Source data are part of a publication in review.
data(AlvMac_results.pVal) str(AlvMac_results.pVal) head(AlvMac_results.pVal, n=20)
data(AlvMac_results.pVal) str(AlvMac_results.pVal) head(AlvMac_results.pVal, n=20)
Clusters the samples using only the expression levels of genes associated with a given go_id.
cluster_GO( go_id, result, eSet, f=result$factor, subset=NULL, method_dist="euclidean", method_hclust="average", cex=0.8, main=paste(go_id, result$GO[result$GO$go_id == go_id, "name_1006"]), xlab="Distance", cex.main=1, main.Lsplit=NULL, ...)
cluster_GO( go_id, result, eSet, f=result$factor, subset=NULL, method_dist="euclidean", method_hclust="average", cex=0.8, main=paste(go_id, result$GO[result$GO$go_id == go_id, "name_1006"]), xlab="Distance", cex.main=1, main.Lsplit=NULL, ...)
go_id |
A Gene Ontology (GO) identifier. |
result |
The output of |
eSet |
|
f |
The grouping factor in |
subset |
A named list to subset |
method_dist |
The method used to calculate distance between samples. See the
|
method_hclust |
The method used to cluster samples. See the
|
cex |
A numeric value defining the character expansion of text in the plot. |
main |
A character string for the main title of the plot. |
xlab |
A label for the x axis, defaults to "Distance". |
cex.main |
Scaling factor of the main title font size. Default is 1. We suggest to
use it in combination with the argument |
main.Lsplit |
Number of characters after which a new-line character will be inserted in the main title. If this would occur within a word, the new-line character will be inserted before this word. Default is NULL, leaving the title on a single line. |
... |
Additional parameters passed on to |
Returns the output of the plot()
function.
Kevin Rue-Albrecht
Method GO_analyse
.
# load the sample output data data(AlvMac_results) # Hierarchical clustering of samples based on the same GO term cluster_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cex=0.7 ) # Re-label sample by another factor cluster_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cex=0.7, f="Group" )
# load the sample output data data(AlvMac_results) # Hierarchical clustering of samples based on the same GO term cluster_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cex=0.7 ) # Re-label sample by another factor cluster_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cex=0.7, f="Group" )
This function will plot the expression profile of a gene across a valid
X-axis variable in the phenodata
while representing the mean
and confidence interval of groups of samples defined by levels of another
valid grouping factor in the phenodata
.
expression_plot( gene_id, result, eSet, x_var, f=result$factor, subset=NULL, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,f])), name=col.palette), level=0.95, title=NULL, title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
expression_plot( gene_id, result, eSet, x_var, f=result$factor, subset=NULL, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,f])), name=col.palette), level=0.95, title=NULL, title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
gene_id |
An gene or probeset identifier present in |
result |
An output of the |
eSet |
|
x_var |
A column name in |
f |
A column name in |
subset |
A named list to subset |
xlab |
Title of the X-axis. Default is tha value of |
ylab |
Title of the Y-axis. Default is "log2(cpm)". |
ylim |
Numeric vector of length 2 specifying the lower and upper bounds of the Y axis. Default is scaled to the full range of expression values in the expression dataset, to ease comparison of different genes. If set to NULL, the axis will be scaled to fit the plotted data only. |
col.palette |
A valid |
col |
A vector of color names or codes. The number of colors provided must match
the number of levels of the grouping factor. If specified, overrides
argument |
level |
The confidence interval level to visualise around the mean of each group. Default is 0.95. |
title |
Changes the plot title. Default is a combination of the gene id and the associated gene. |
title.size |
Changes the font size of the title. Default is 2. |
axis.title.size |
Changes the font size of the axes title. Defalt is 20. |
axis.text.size |
Changes the font size of the axes text labels. Default is 15. |
axis.text.angle |
Changes the angle of the X axis text labels. Default is 0 (horizontal). |
legend.title.size |
Changes the font size of the legend title. Default is 20. |
legend.text.size |
Changes the font size of the legend text labels. Default is 15. |
legend.key.size |
Changes the size of the legend keys (in points). Default is 30. |
The ggplot object.
Common issues:
It may not be possible to produce plots where the combination of X-axis variable and grouping factor leaves too few replicates to compute a confidence interval for each X value. This is a limitation imposed by the ggplot2 package to produce proper statistics and confidence intervals. In such cases, it may be preferrable to use the expression_profiles() method.
Kevin Rue-Albrecht
ggplot2 package.
Package Biobase
, methods
expression_plot_symbol
,
GO_analyse
and
ggplot
.
# load the sample output data data(AlvMac_results) # Expression by gene identifier (TNIP3) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Timepoint" ) # Same gene, plotted by animal and grouped by treatment (merging time points) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Treatment") # Same gene, plotted by animal and grouped by time-point (merging treatments) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Time")
# load the sample output data data(AlvMac_results) # Expression by gene identifier (TNIP3) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Timepoint" ) # Same gene, plotted by animal and grouped by treatment (merging time points) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Treatment") # Same gene, plotted by animal and grouped by time-point (merging treatments) expression_plot( gene_id="ENSBTAG00000047107", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Time")
This function will plot the expression profile of a gene across a valid
X-axis variable from the AnnotatedDataFrame
while representing the mean
and confidence interval of groups of samples defined by levels of a valid
grouping factor from the AnnotatedDataFrame.
In the case of a gene name represented by multiple gene or probeset identifiers in the dataset, a lattice of plots will be produced. Each of the plots in this lattice can subsequently be plotted separately using its associated index.
expression_plot_symbol( gene_symbol, result, eSet, x_var, f=result$factor, subset=NULL, index=0, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,f])), name=col.palette), level=0.95, titles=c(), title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=20, legend.key.size=30)
expression_plot_symbol( gene_symbol, result, eSet, x_var, f=result$factor, subset=NULL, index=0, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,f])), name=col.palette), level=0.95, titles=c(), title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=20, legend.key.size=30)
gene_symbol |
A gene name present in |
result |
An output of the |
eSet |
|
x_var |
A column name in |
f |
A column name in |
subset |
A named list to subset |
index |
In the case where multiple gene or probeset identifiers are associated
with the gene name, |
xlab |
Title of the X-axis. Default is tha value of |
ylab |
Title of the Y-axis. Default is "log2(cpm)". |
ylim |
Numeric vector of length 2 specifying the lower and upper bounds of the Y axis. Default is scaled to the full range of expression values in the expression dataset, to ease comparison of different genes. If set to NULL, the axis will be scaled to fit the plotted data only. |
col.palette |
A valid |
col |
A vector of color names or codes. The number of colors provided must match
the number of levels of the grouping factor. Default to a palette with an
adequate set of colors. If specified, overrides argument
|
level |
The confidence interval level to visualise around the mean of each group. Default is 0.95. |
titles |
Character vector providing as many titles as there are plots to replace the default titles. Default is a combination of the gene id and the associated gene. |
title.size |
Changes the font size of the title. Default is 2. |
axis.title.size |
Changes the font size of the axes title. Defalt is 20. |
axis.text.size |
Changes the font size of the axes text labels. Default is 15. |
axis.text.angle |
Changes the angle of the X axis text labels. Default is 0 (horizontal). |
legend.title.size |
Changes the font size of the legend title. Default is 20. |
legend.text.size |
Changes the font size of the legend text labels. Default is 15. |
legend.key.size |
Changes the size of the legend keys (in points). Default is 30. |
The ggplot object, or the vector of feature identifiers if multiple features are annotated to the gene symbol.
Common issues:
It may not be possible to produce plots where the combination of X-axis variable and grouping factor leaves too few replicates to compute a confidence interval for each X value. This is a limitation imposed by the ggplot2 package to produce proper statistics and confidence intervals. In such cases, it may be preferrable to use the expression_profiles_symbol() method.
Kevin Rue-Albrecht
ggplot2 package.
Package Biobase
, methods
expression_plot
,
GO_analyse
and
ggplot
.
# load the sample output data data(AlvMac_results) # Expression by gene identifier (TNIP3) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Timepoint" ) # Same gene, plotted by animal and grouped by treatment (merging time points) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Treatment" ) # Same gene, plotted by animal and grouped by time-point (merging treatments) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Time")
# load the sample output data data(AlvMac_results) # Expression by gene identifier (TNIP3) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Timepoint" ) # Same gene, plotted by animal and grouped by treatment (merging time points) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Treatment" ) # Same gene, plotted by animal and grouped by time-point (merging treatments) expression_plot_symbol( gene_symbol="PIK3AP1", result=AlvMac_results, eSet=AlvMac, x_var="Animal", f="Time")
This function will plot the expression profile of a gene in individual samples
series across a valid X-axis variable in the phenodata
, while colouring
sample groups according to another variable in the phenodata
, and
using different line types according to yet another (or the same) variable
in the phenodata
.
expression_profiles( gene_id, result, eSet, x_var, seriesF, subset=NULL, colourF=result$factor, linetypeF=colourF, line.size=1.5, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,colourF])), name=col.palette), lty=1:length(levels(pData(eSet)[,linetypeF])), title=NULL, title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
expression_profiles( gene_id, result, eSet, x_var, seriesF, subset=NULL, colourF=result$factor, linetypeF=colourF, line.size=1.5, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,colourF])), name=col.palette), lty=1:length(levels(pData(eSet)[,linetypeF])), title=NULL, title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
gene_id |
An gene or probeset identifier present in |
result |
An output of the |
eSet |
|
x_var |
A column name in |
seriesF |
A column name in |
subset |
A named list to subset |
colourF |
A column name in |
linetypeF |
A column name in |
line.size |
The width of the plotted lines. Default is 1.5. |
xlab |
Title of the X-axis. Default is tha value of |
ylab |
Title of the Y-axis. Default is "log2(cpm)". |
ylim |
Numeric vector of length 2 specifying the lower and upper bounds of the Y axis. Default is scaled to the full range of expression values in the expression dataset, to ease comparison of different genes. If set to NULL, the axis will be scaled to fit the plotted data only. |
col.palette |
A valid |
col |
A vector of color names or codes. The number of colors provided must match
the number of levels of the factor |
lty |
A vector of numeric values corresponding to line types. The number of
line types provided must match the number of levels of the factor
|
title |
Changes the plot title. Default is a combination of the gene id and the associated gene. |
title.size |
Changes the font size of the title. Default is 2. |
axis.title.size |
Changes the font size of the axes title. Defalt is 20. |
axis.text.size |
Changes the font size of the axes text labels. Default is 15. |
axis.text.angle |
Changes the angle of the X axis text labels. Default is 0 (horizontal). |
legend.title.size |
Changes the font size of the legend title. Default is 20. |
legend.text.size |
Changes the font size of the legend text labels. Default is 15. |
legend.key.size |
Changes the size of the legend keys (in points). Default is 30. |
In order to track and visualise individual sample series, each sample in the
ExpressionSet
should be associated with a unique combination of
seriesF
and x_var
. This may require the generation of a new
factor in the phenodata
, combining all experimental factors except
that plotted on the X-axis. See below for an example on the training dataset.
The ggplot object.
Kevin Rue-Albrecht
ggplot2 package.
Package Biobase
, methods
expression_plot_symbol
,
expression_plot_symbol
and
ggplot
.
# load the sample output data data(AlvMac_results) AlvMac$Series <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", seriesF="Series", legend.title.size=10, legend.text.size=10, legend.key.size=15) expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", seriesF="Series", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15)
# load the sample output data data(AlvMac_results) AlvMac$Series <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", seriesF="Series", legend.title.size=10, legend.text.size=10, legend.key.size=15) expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", seriesF="Series", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15)
This function will plot the expression profile of a gene in individual samples
series across a valid X-axis variable in the phenodata
, while colouring
sample groups according to another variable in the phenodata
, and
using different line types according to yet another (or the same) variable
in the phenodata
.
expression_profiles_symbol( gene_symbol, result, eSet, x_var, seriesF, subset=NULL, colourF=result$factor, linetypeF=colourF, line.size=1.5, index=0, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,colourF])), name=col.palette), lty=1:length(levels(pData(eSet)[,linetypeF])), titles=c(), title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
expression_profiles_symbol( gene_symbol, result, eSet, x_var, seriesF, subset=NULL, colourF=result$factor, linetypeF=colourF, line.size=1.5, index=0, xlab=x_var, ylab="log2(cpm)", ylim=range(exprs(eSet)), col.palette="Accent", col=brewer.pal(n=length(levels(pData(eSet)[,colourF])), name=col.palette), lty=1:length(levels(pData(eSet)[,linetypeF])), titles=c(), title.size=2, axis.title.size=20, axis.text.size=15, axis.text.angle=0, legend.title.size=20, legend.text.size=15, legend.key.size=30)
gene_symbol |
A gene name present in |
result |
An output of the |
eSet |
|
x_var |
A column name in |
seriesF |
A column name in |
subset |
A named list to subset |
colourF |
A column name in |
linetypeF |
A column name in |
line.size |
The width of the plotted lines. Default is 1.5. |
index |
In the case where multiple gene or probeset identifiers are associated
with the gene name, |
xlab |
Title of the X-axis. Default is tha value of |
ylab |
Title of the Y-axis. Default is "log2(cpm)". |
ylim |
Numeric vector of length 2 specifying the lower and upper bounds of the Y axis. Default is scaled to the full range of expression values in the expression dataset, to ease comparison of different genes. If set to NULL, the axis will be scaled to fit the plotted data only. |
col.palette |
A valid |
col |
A vector of color names or codes. The number of colors provided must match
the number of levels of the factor |
lty |
A vector of numeric values corresponding to line types. The number of
line types provided must match the number of levels of the factor
|
titles |
Character vector providing as many titles as there are plots to replace the default titles. Default is a combination of the gene id and the associated gene. |
title.size |
Changes the font size of the title. Default is 2. |
axis.title.size |
Changes the font size of the axes title. Defalt is 20. |
axis.text.size |
Changes the font size of the axes text labels. Default is 15. |
axis.text.angle |
Changes the angle of the X axis text labels. Default is 0 (horizontal). |
legend.title.size |
Changes the font size of the legend title. Default is 20. |
legend.text.size |
Changes the font size of the legend text labels. Default is 15. |
legend.key.size |
Changes the size of the legend keys (in points). Default is 30. |
In order to track and visualise individual sample series, each sample in the
ExpressionSet
should be associated with a unique combination of
seriesF
and x_var
. This may require the generation of a new
factor in the phenodata
, combining all experimental factors except
that plotted on the X-axis. See below for an example on the training dataset.
The ggplot object, or the vector of feature identifiers if multiple features are annotated to the gene symbol.
Kevin Rue-Albrecht
ggplot2 package.
Package Biobase
, methods
expression_plot_symbol
,
expression_plot_symbol
and
ggplot
.
# load the sample output data data(AlvMac_results) AlvMac$Series <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles_symbol( gene_symbol="TNIP3", result = AlvMac_results, eSet=AlvMac, seriesF="Series", x_var = "Timepoint", linetypeF="Animal", line.size=1.5, title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=30) expression_profiles_symbol( gene_symbol="TNIP3", result = AlvMac_results, eSet=AlvMac, seriesF="Series", x_var = "Timepoint", linetypeF="Animal", line.size=1.5, lty=rep(1,10), title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=30)
# load the sample output data data(AlvMac_results) AlvMac$Series <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles_symbol( gene_symbol="TNIP3", result = AlvMac_results, eSet=AlvMac, seriesF="Series", x_var = "Timepoint", linetypeF="Animal", line.size=1.5, title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=30) expression_profiles_symbol( gene_symbol="TNIP3", result = AlvMac_results, eSet=AlvMac, seriesF="Series", x_var = "Timepoint", linetypeF="Animal", line.size=1.5, lty=rep(1,10), title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=30)
Combines gene expression data with Gene Ontology (GO) annotations to rank and visualise genes and GO terms enriched for genes best clustering predefined groups of samples based on gene expression levels.
This methods semi-automatically retrieves the latest information from Ensembl
using the biomaRt
package, except if custom GO annotations are
provided. Custom GO annotations have two main benefits: firstly
they allow the analysis of species not supported in the Ensembl BioMart
server, and secondly they save time skipping calls to the Ensembl BioMart
server for species that are supported.
The latter also presents the possiblity of using an older release of the
Ensembl annotations, for example.
Using default settings, a random forest analysis is performed to evaluate the ability of each gene to cluster samples according to a predefined grouping factor (one-way ANOVA available as an alteranative). Each GO term is scored and ranked according to the average rank (alternatively, average power) of all associated genes to cluster the samples according to the factor. The ranked list of GO terms is returned, with tools allowing to visualise the statistics on a gene- and ontology-basis.
GO_analyse( eSet, f, subset=NULL, biomart_name = "ENSEMBL_MART_ENSEMBL", biomart_dataset="", microarray="", method="randomForest", rank.by="rank", do.trace=100, ntree=1000, mtry=ceiling(2*sqrt(nrow(eSet))), GO_genes=NULL, all_GO=NULL, all_genes=NULL, FUN.GO=mean, ...)
GO_analyse( eSet, f, subset=NULL, biomart_name = "ENSEMBL_MART_ENSEMBL", biomart_dataset="", microarray="", method="randomForest", rank.by="rank", do.trace=100, ntree=1000, mtry=ceiling(2*sqrt(nrow(eSet))), GO_genes=NULL, all_GO=NULL, all_genes=NULL, FUN.GO=mean, ...)
eSet |
|
f |
A column name in |
subset |
A named list to subset |
biomart_name |
The Ensembl BioMart database to which connect to. |
biomart_dataset |
The Ensembl BioMart dataset identifier corresponding to the species studied. If not specified and no custom annotations were provided, the method will attempt to automatically identify the adequate dataset from the first feature identifier in the dataset. Use data(prefix2dataset) to access a table listing valid choices. |
microarray |
The identifier in the Ensembl BioMart corresponding to the microarray
platform used. If not specified and no custom annotations were provided,
the method will attempt to
automatically identify the platform used from the first feature identifier
in the dataset.
Use |
method |
The statistical framework to score genes and gene ontologies. Either "randomForest" or "rf" to use the random forest algorithm, or alternatively either of "anova" or "a" to use the one-way ANOVA model. Default is "randomForest". |
rank.by |
Either of "rank" or "score" to chose the metric used to order the gene and GO term result tables. Default to 'rank'. |
do.trace |
Only used if method="randomForest". If set to TRUE, gives a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees. Default is 100. |
ntree |
Only used if method="randomForest". Number of trees to grow. This should be set to a number large enough to ensure that every input row gets predicted at least a few times |
mtry |
Only used if method="randomForest". Number of features randomly sampled as candidates at each split. Default value is 2*sqrt(gene_count) which is approximately 220 genes for a dataset of 12,000 genes. |
GO_genes |
Custom annotations associating features present in the expression dataset
to gene ontology identifiers. This must be provided as a data-frame of
two columns, named |
all_GO |
Custom annotations used to annotate each GO identifier present in
|
all_genes |
Custom annotations used to annotate each feature identifier in the
expression dataset with the gene name or symbol (e.g. "TNF"), and an
optional description. This must be provided as a data-frame containing at
least a column named |
FUN.GO |
Function to summarise the score and rank of all feature associated with
each gene ontology. Default is |
... |
Additional arguments passed on to the |
The default scoring functions strongly favor GO terms associated with fewer genes at the top of the ranking. This bias may actually be seen as a valuable feature which enables the user to browse through GO terms of increasing "granularity", i.e. associated with increasingly large sets of genes, although consequently being increasingly vague and blurry (e.g. "protein binding" molecular function associated with over 6,000 genes).
It is suggested to use the subset_scores()
function to subsequently
filter out GO terms with fewer than 5+ genes associated with it. Indeed,
those GO terms are more sensitive to outlier genes as they were scored on
the average of a handful of genes.
Additionally, the pValue_GO
function may be used to generate
a permutation-based P-value indicating the chance of seeing each GO term
reaching an equal or higher rank – or score – by chance.
A list containing the results of the analysis. Some elements are specific to the output of each analysis method.
Core elements:
GO |
A table ranking all GO terms related to genes in the expression dataset based on the average ability of their related genes to cluster the samples according to the predefined grouping factor. |
mapping |
The table mapping genes present in the dataset to GO terms. |
genes |
A table ranking all genes present in the expression dataset based on their ability to cluster the samples according to the predefined grouping factor. |
factor |
The predefined grouping factor. |
method |
The statistical framework used. |
subset |
The filters used to run the analysis only on a subet of the samples. NULL if no filter was applied. |
rank.by |
The metric used to rank order the genes and gene ontologies. |
FUN.GO |
The function used to summarise the score and rank of all gene features associated with each gene ontology. |
Random Forest additional elements:
ntree |
Number of trees grown. |
mtry |
Number of variables randomly sampled as candidates at each split. |
One-way ANOVA does not have additional arguments.
Make sure that the factor f
is an actual factor in the R language
meaning. This is important for the underlying statistical framework to
identify the groups of samples defined by their level of this factor.
If the column defining the factor (e.g. "Treatment") in phenodata
is not an R factor, use
pData(targets)$Treatment = factor(pData(targets)$Treatment)
to convert the character values into an actual R factor with appropriate
levels.
Kevin Rue-Albrecht
Methods
subset_scores
,
pValue_GO
,
getBM
,
randomForest
,
and oneway.test
.
# Load example data subset data(AlvMac) # Load a local copy of annotations obtained from the Ensembl Biomart server data(AlvMac_GOgenes) data(AlvMac_allgenes) data(AlvMac_allGO) # Run the analysis on factor "Treatment", # considering only treatments "MB" and "TB" at time-point "48H" # using a local copy of annotations obtained from the Ensembl BioMart server AlvMac_results <- GO_analyse( eSet=AlvMac, f="Treatment", subset=list(Time=c("48H"), Treatment=c("MB", "TB")), GO_genes=AlvMac_GOgenes, all_genes=AlvMac_allgenes, all_GO=AlvMac_allGO ) # Valid Ensembl BioMart datasets are listed in the following variable data(prefix2dataset) # Valid microarray= values are listed in the following variable data(microarray2dataset) ## Not run: # Other valid but time-consuming examples: # Run the analysis on factor "Treatment" including all samples GO_analyse(eSet=AlvMac, f="Treatment") # Run the analysis on factor "Treatment" using ANOVA method GO_analyse(eSet=AlvMac, f="Treatment", method="anova") # Use alternative GO scoring/summarisation functions (Default is: average) # Named functions GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = median) # Anonymous functions (simple example without scientific value) GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = function(x){median(x)/100}) # Syntax examples without actual data: # To force the use of the Ensembl BioMart for the human species, use: GO_analyse(eSet, f, biomart_dataset = "hsapiens_gene_ensembl") # To force use of the bovine affy_bovine microarray annotations use: GO_analyse(eSet, f, microarray = "affy_bovine") ## End(Not run)
# Load example data subset data(AlvMac) # Load a local copy of annotations obtained from the Ensembl Biomart server data(AlvMac_GOgenes) data(AlvMac_allgenes) data(AlvMac_allGO) # Run the analysis on factor "Treatment", # considering only treatments "MB" and "TB" at time-point "48H" # using a local copy of annotations obtained from the Ensembl BioMart server AlvMac_results <- GO_analyse( eSet=AlvMac, f="Treatment", subset=list(Time=c("48H"), Treatment=c("MB", "TB")), GO_genes=AlvMac_GOgenes, all_genes=AlvMac_allgenes, all_GO=AlvMac_allGO ) # Valid Ensembl BioMart datasets are listed in the following variable data(prefix2dataset) # Valid microarray= values are listed in the following variable data(microarray2dataset) ## Not run: # Other valid but time-consuming examples: # Run the analysis on factor "Treatment" including all samples GO_analyse(eSet=AlvMac, f="Treatment") # Run the analysis on factor "Treatment" using ANOVA method GO_analyse(eSet=AlvMac, f="Treatment", method="anova") # Use alternative GO scoring/summarisation functions (Default is: average) # Named functions GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = median) # Anonymous functions (simple example without scientific value) GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = function(x){median(x)/100}) # Syntax examples without actual data: # To force the use of the Ensembl BioMart for the human species, use: GO_analyse(eSet, f, biomart_dataset = "hsapiens_gene_ensembl") # To force use of the bovine affy_bovine microarray annotations use: GO_analyse(eSet, f, microarray = "affy_bovine") ## End(Not run)
Clusters the samples and the genes associated with a GO term using the expression levels of genes related to a given ontology. Represents expression levels of those genes in a heatmap.
heatmap_GO( go_id, result, eSet, f=result$factor, subset=NULL, gene_names=TRUE, NA.names=FALSE, margins=c(7 ,5), scale="none", cexCol=1.2, cexRow=0.5, labRow=NULL, cex.main=1, trace="none", expr.col=bluered(75), row.col.palette="Accent", row.col=c(), main=paste( go_id, result$GO[result$GO$go_id == go_id,"name_1006"] ), main.Lsplit=NULL, ...)
heatmap_GO( go_id, result, eSet, f=result$factor, subset=NULL, gene_names=TRUE, NA.names=FALSE, margins=c(7 ,5), scale="none", cexCol=1.2, cexRow=0.5, labRow=NULL, cex.main=1, trace="none", expr.col=bluered(75), row.col.palette="Accent", row.col=c(), main=paste( go_id, result$GO[result$GO$go_id == go_id,"name_1006"] ), main.Lsplit=NULL, ...)
go_id |
A Gene Ontology (GO) identifier. |
result |
The output of |
eSet |
|
f |
The grouping factor in |
subset |
A named list to subset |
gene_names |
A boolean value. Default is TRUE, to label genes by their associated gene name. If FALSE, labels the genes by their feature identifier in the expression dataset (i.e. Ensembl gene identifier or microarray probeset). |
NA.names |
A boolean value. Default is TRUE. If labelling genes by their associated
gene name (see argument |
margins |
A numeric vector of length 2 specifying the number of lines of margins to
apply for the bottom and right margins, respectively.
Defaults are 7 (bottom) and 5 (right). For Ensembl gene identifiers, we
suggest setting manually a bottom margin of 13. See |
scale |
Character indicating if the values should be centered and scaled in either
the row direction or the column direction, or none. Default is "none".
See |
cexCol , cexRow
|
Positive numbers, used as cex.axis in for the row or column axis labeling.
Defaults are 1.2 and 1, respectively. See |
labRow |
A character vector of names to re-label the rows (samples).
If vector of length 1, it is assumed to be a column name in the
|
cex.main |
Scaling factor of the main title font size. Default is 1. We suggest to
use it in combination with the argument |
trace |
Character string indicating whether a solid "trace" line should be drawn across each 'row' or down each 'column', both' or 'none'. The distance of the line from the center of each color-cell is proportional to the size of the measurement. Defaults to 'none'. |
expr.col |
Character vector indicating the colors to represent the different levels
of gene expression. Defaults to a colormap of 75 shades ranging from blue
(low) to red (high) and centered around white. If using differential
expression data, you should probably use |
row.col.palette |
A valid |
row.col |
A vector of color names or codes. The number of colors provided must match the number of levels of the grouping factor. Default to an palette of up to 9 colors marking the different levels of the predefined grouping factor on the left side of the heatmap. |
main |
Main title of the figure. Default is paste(go_id, go_name). |
main.Lsplit |
Number of characters after which a new-line character will be inserted in the main title. If this would occur within a word, the new-line character will be inserted before this word. Default is NULL, leaving the title on a single line. |
... |
Additional arguments passed on to |
Returns the output of the heatmap.2()
function.
Kevin Rue-Albrecht
Method heatmap.2
,
GO_analyse
,
and brewer.pal.info
.
# load the sample output data data(AlvMac_results) # Heatmap the top-ranked GO term (toll-like receptor 4 signaling pathway) as # example heatmap_GO(go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac) # Same with larger sample labels on the right hand side. heatmap_GO(go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cexRow=1) # Change the color-coding to green-black-red gradient (more appropriate for # differential expression values) library(gplots) heatmap_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, expr.col=greenred(75) )
# load the sample output data data(AlvMac_results) # Heatmap the top-ranked GO term (toll-like receptor 4 signaling pathway) as # example heatmap_GO(go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac) # Same with larger sample labels on the right hand side. heatmap_GO(go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, cexRow=1) # Change the color-coding to green-black-red gradient (more appropriate for # differential expression values) library(gplots) heatmap_GO( go_id="GO:0034142", result=AlvMac_results, eSet=AlvMac, expr.col=greenred(75) )
Plots the an histogram representing the frequencies of scores in the output variable of the GO_analyse() function.
This function can also be used on the output of subset_scores()
function as it returns a value formatted identically to the output of the
GO_analyse()
function.
hist_scores(result, main=paste("Distribution of average scores in", deparse(substitute(result))), xlab="Average score", ...)
hist_scores(result, main=paste("Distribution of average scores in", deparse(substitute(result))), xlab="Average score", ...)
result |
The output of the |
main , xlab
|
These arguments to title have useful defaults here. |
... |
Additional arguments passed on to |
Returns the output of the hist()
function.
Kevin Rue-Albrecht
Method hist
,
and GO_analyse
.
# load the sample output data with p.values computed data(AlvMac_results.pVal) # Histogram of scores (labelled with counts) hist_scores(result=AlvMac_results, breaks=20, labels=TRUE) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # Histogram of scores (labelled with counts) hist_scores(result=filtered_results, breaks=20, labels=TRUE)
# load the sample output data with p.values computed data(AlvMac_results.pVal) # Histogram of scores (labelled with counts) hist_scores(result=AlvMac_results, breaks=20, labels=TRUE) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # Histogram of scores (labelled with counts) hist_scores(result=filtered_results, breaks=20, labels=TRUE)
Given a Gene Ontology (GO) identifier represented in the dataset, returns a character vector listing the feature identifiers annotated to it.
list_genes(go_id, result, data.only=TRUE)
list_genes(go_id, result, data.only=TRUE)
go_id |
A Gene Ontology (GO) identifier represented in the dataset. |
result |
The output of |
data.only |
Whether to return only the feature identifiers present in the given dataset or alternatively returns all feature identifiers associated with the GO term in the Ensembl BioMart. Default is TRUE. |
A character vector listing the feature identifiers of the genes associated with the GO term.
Kevin Rue-Albrecht
Method GO_analyse
.
# load the sample output data data(AlvMac_results) # List of genes associated with the GO term "toll-like receptor 4 signaling # pathway" list_genes(result=AlvMac_results, go_id="GO:0034142")
# load the sample output data data(AlvMac_results) # List of genes associated with the GO term "toll-like receptor 4 signaling # pathway" list_genes(result=AlvMac_results, go_id="GO:0034142")
The species corresponding to an probeset identifier can often be identified from the prefix of the identifier (e.g. Bt.457.1.S1_at corresponds to Bos taurus which is only associated with a single microarray platform in the Ensembl BioMart). This table maps some manually curated unique patterns to the corresponding species and microarray platform.
data(microarray2dataset)
data(microarray2dataset)
All Agilent microarray share the same prefix pattern, making it very difficult to differentiate. Many Affymetrix microarrays also share the same prefix "AFFX" for several probesets.
The dataset
and microarray
arguments of the GO_analyse
method are the best way to specify which BioMart information should be
used to annotate the features in your expression dataset.
A data frame with 158 rows and the 6 columns. Rows are sorted alphabetically by species name and each refer to a unique combination of dataset and microarray identifier in the Ensembl BioMart. The columns are described below:
dataset
contains species-specific biomaRt dataset names.
microarray
contains the microarray identifier in the Ensembl
BioMart dataset above.
sample
contains a sample probeset from this combination of
Ensembl BioMart dataset and microarray.
species
contains an example probeset sampled from the
corresponding Ensembl BioMart dataset and microarray.
pattern
contains the corresponding manually curated pattern
representative of probesets for this microarray platform. This pattern
may or may not be unique to the microarray. Therefore
we encourage users to use the 'microarray' argument of the
GO_analyse
method to specify the Ensembl BioMart dataset to use.
unique
contains a boolean value stating whether the
pattern was found unique to the microarray or not.
The microarray2dataset.build
method stored in the toolkit.R
script was used to query the Ensembl BioMart server and build this table.
data(microarray2dataset) microarray2dataset
data(microarray2dataset) microarray2dataset
Successively plots univariate effects of one or more
factors
, typically for a designed experiment
as analyzed by aov
().
plot_design( go_id, result, eSet, subset=NULL, factors=colnames(pData(eSet)), main="", main.Lsplit=NULL, ...)
plot_design( go_id, result, eSet, subset=NULL, factors=colnames(pData(eSet)), main="", main.Lsplit=NULL, ...)
go_id |
A Gene Ontology (GO) identifier represented by at least one gene in the dataset. |
result |
The output of |
eSet |
|
subset |
A named list to subset |
factors |
A set of column names from |
main |
Changes the main title of the plots. |
main.Lsplit |
Number of characters after which a new-line character will be inserted in the main title. If this would occur within a word, the new-line character will be inserted before this word. Default is NULL, leaving the title on a single line. |
... |
Additional arguments which will be passed on to the |
The output of the plot.design
() function.
Kevin Rue-Albrecht
Method plot.design
.
# load the sample output data data(AlvMac_results) # Univariate plot plot_design(go_id="GO:0034142", eSet=AlvMac, result=AlvMac_results)
# load the sample output data data(AlvMac_results) # Univariate plot plot_design(go_id="GO:0034142", eSet=AlvMac, result=AlvMac_results)
The species corresponding to an Ensembl gene identifier can typically be identified from the prefix of the identifier (e.g. ENSBTAG corresponds to Bos taurus). This table maps each known unique prefix to the corresponding species.
data(prefix2dataset)
data(prefix2dataset)
C. elegans, D. melanogaster, and S. cerevisiae have atypical identifier pattern and prefixes in their Ensembl gene identifiers. However, the automatically extracted prefix for C. elegans and D. melanogaster – respectively "WBgene" and "FBgn" – can be used as such to identify datasets from those species. On the oher hand, prefixes used for the S. cerevisiae include "YHR", "YAL", and many others. Consequently, expression data from S. cerevisiae species is identified without referring to the "prefix2dataset"" table; instead, such datasets are identified if the first gene identifier in the dataset starts with "Y".
A data frame with 69 rows and 4 columns. Each row refers to one dataset in the Ensembl BioMart. The columns are described below:
dataset
contains the biomaRt dataset name.
species
contains the corresponding species name.
prefix
contains the corresponding unique prefix.
sample
contains a sample Ensembl gene identifier from this
dataset.
The prefix2dataset.build
method stored in the toolkit.R
script was used to query the Ensembl BioMart server and build this table.
data(prefix2dataset) str(prefix2dataset) prefix2dataset
data(prefix2dataset) str(prefix2dataset) prefix2dataset
Permutes the rank of genes processed by GO_analyse()
to compute
a P-value for each ontology
pValue_GO( result, N=1000, ranked.by=result$rank.by, rank.by='P', FUN.GO=result$FUN.GO)
pValue_GO( result, N=1000, ranked.by=result$rank.by, rank.by='P', FUN.GO=result$FUN.GO)
result |
The output of |
N |
The number of permutation desired. Default to 1000. |
ranked.by |
Either of 'rank' or 'score'. The metric used to compare whether GO terms
in the randomised data arelative to their original value. Default to the
current ordering method of the |
rank.by |
Either of 'rank', 'score', or 'p.val'. The metric used to order the GO terms after computing of the P-value. Default to 'p.val'. |
FUN.GO |
Function to summarise the score and rank of all feature associated with
each gene ontology. Logically, default is the function used in the call
to |
This function is relatively lengthy. Its procedure can be divided in several steps: (1) assemble a copy of all genes used in the original scoring step, then for each of the N permutations, (2) permute the gene labels, (3) aggregate the rank – or score – of all genes belonging to each GO term, (4) decide whether each GO term is ranked – or scored –better or worst than originally, (5) return a p-value based on the number of permutations where each GO term was better ranked – or scored – than originally.
A list formatted identically to the results of the analysis, with an added 'p.val' column in the GO slot, and re-ordered by the chosen metric.
Kevin Rue-Albrecht
Method GO_analyse
.
## Not run: # Time-consuming examples: # Limited here to N=1 for time constraint # Recommended N=1000 (or at least 100). # Run the analysis on factor "Treatment" including all samples AlvMac.Pvals <- pValue_GO(result, N=1, ranked.by=result$rank.by, rank.by='P') ## End(Not run)
## Not run: # Time-consuming examples: # Limited here to N=1 for time constraint # Recommended N=1000 (or at least 100). # Run the analysis on factor "Treatment" including all samples AlvMac.Pvals <- pValue_GO(result, N=1, ranked.by=result$rank.by, rank.by='P') ## End(Not run)
Returns a set of quantiles in indicating the scores reached by given proportions of the GO terms.
This function can also be used on the output of subset_scores()
function as it returns a value formatted identically to the output of the
GO_analyse()
function.
quantiles_scores(result, probs=c(0.9, 0.95, 0.99, 0.999, 0.9999), quartiles=FALSE)
quantiles_scores(result, probs=c(0.9, 0.95, 0.99, 0.999, 0.9999), quartiles=FALSE)
result |
The output of |
probs |
Numeric vector of probabilities with values in [0,1]. (Values up
to 2e-14 outside that range are accepted and moved to the nearby
endpoint.) See |
quartiles |
A numeric vector of the percentiles for which the scores are desired. |
A named vector of percentiles and corresponding scores.
Kevin Rue-Albrecht
Method quantile
.
# load the sample output data with p.values computed data(AlvMac_results.pVal) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # Quantiles of scores quantiles_scores(result=filtered_results)
# load the sample output data with p.values computed data(AlvMac_results.pVal) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # Quantiles of scores quantiles_scores(result=filtered_results)
Reorder the ranked tables of GO terms and genes either by increasing (average) rank or decreasing (average) score.
rerank(result, rank.by = 'rank')
rerank(result, rank.by = 'rank')
result |
The output of |
rank.by |
Either of 'rank', 'score' or 'p.val'; the metric to rank the GO terms and
genes. Note that 'pval' is only applicable on the output of the
|
Taking an example, to rank GO terms by P-value and break ties by average rank, rerank first by 'rank', and then rerank the resulting object by 'p.val'.
A list formatted identically to the results of the analysis, but ordered by the chosen metric.
The name reorder() was not used to avoid conflict with package stats
.
Kevin Rue-Albrecht
Method GO_analyse
.
# load the sample output data data(AlvMac_results) # Re-rank the GO terms and genes based on the actual score instead of the rank reranked.byScore <- rerank(result=AlvMac_results, rank.by="score") # load the sample output data with p.values computed data(AlvMac_results.pVal) # To rank by P-value, while breaking the ties by ave_rank, # rank first by ave_rank reranked.byRank <- rerank(result=AlvMac_results.pVal, rank.by="rank") # rank the result by P-value reranked.pVal_rank <- rerank(result=reranked.byRank, rank.by="p.val")
# load the sample output data data(AlvMac_results) # Re-rank the GO terms and genes based on the actual score instead of the rank reranked.byScore <- rerank(result=AlvMac_results, rank.by="score") # load the sample output data with p.values computed data(AlvMac_results.pVal) # To rank by P-value, while breaking the ties by ave_rank, # rank first by ave_rank reranked.byRank <- rerank(result=AlvMac_results.pVal, rank.by="rank") # rank the result by P-value reranked.pVal_rank <- rerank(result=reranked.byRank, rank.by="p.val")
Given a list of column names and corresponding values present in the phenoData slot of an ExpressionSet object, this method returns the subset of the ExpressionSet restricted to samples associated with the given values in the given columns.
subEset(eSet, subset = list())
subEset(eSet, subset = list())
eSet |
|
subset |
A named list to subset |
A subset of the given ExpressionSet restricted to samples associated with the given values in the given columns
Kevin Rue-Albrecht
Class ExpressionSet
.
# Load example data set data(AlvMac) # Subset it to only samples of "CN" and "MB" treatments, and also only "2H", # "6H", and "24H" time-points sub.AlvMac <- subEset( eSet=AlvMac, subset=list( Treatment=c("CN","MB"), Time=c("2H","6H") ) )
# Load example data set data(AlvMac) # Subset it to only samples of "CN" and "MB" treatments, and also only "2H", # "6H", and "24H" time-points sub.AlvMac <- subEset( eSet=AlvMac, subset=list( Treatment=c("CN","MB"), Time=c("2H","6H") ) )
Given an output variable from a GO_analyse
analysis and a set of valid
filters and thresholds, returns an identically formatted list keeping only the
rows of the score table passing all the filters.
subset_scores(result, ...) # Suggested use: # on the output of GO_analyse() # subset_scores(result, total=5, namespace="BP") # or after application of pValue_GO() # subset_scores(result, total=5, namespace="BP", p.val=0.05)
subset_scores(result, ...) # Suggested use: # on the output of GO_analyse() # subset_scores(result, total=5, namespace="BP") # or after application of pValue_GO() # subset_scores(result, total=5, namespace="BP", p.val=0.05)
result |
The output of the |
... |
Additional pairs of filter and threshold values in the format "filter=threshold". Filters must be valid names from colnames(result$GO). |
It is highly recommended to filter out GO terms with very few genes (e.g. less than 5 genes), as the scoring function is biased for those GO terms (see UsersGuide).
Suggested filters:
total_count, total: | Filter keeping only GO terms associated with at least the given count of genes in the annotations provided. |
p.val, p, P: | Filter keeping only the GO terms with P-value lower or equal
to the given cutoff (This filter is only applicable after the use of the
pValue_GO function). |
namespace, namespace_1003: | Filter keeping only the GO terms of a given type. Valid values are "biological_process", "molecular_function", and "cellular_component". Abbreviations "BP", "MF", and "CC" are also accepted. |
Additional filters:
data_count, data: | Filter keeping only GO terms associated with at
least the given count of genes, present in the ExpressionSet . |
ave_rank, rank: | Filter keeping only GO terms with an ave_rank value equal or lower than the given cutoff (average of the rank of all genes annotated to the GO term). |
ave_score, score: | Filter keeping only GO terms with an ave_score value equal or higher than the given cutoff (average of the score of all genes annotated to the GO term) Scores are the mean decrease in Gini index for the random forest, or the F-value for the ANOVA approach. |
A list formatted identically to the result
object provided, with an
added or updated filters.GO
slot stating the filters and cutoffs
applied, and restricted to:
the gene ontologies passing the given filters and cutoff values.
the genes mapped to those remaining ontologies (i.e. all genes not associated with an ontology are discarded in the output object).
gene-mapping annotations related to the remaining gene ontologies.
It is possible to further filter a filter result object. Some warnings may
appear if the new filter cutoff conflict with the ones previously applied
(stored in the filters.GO
slot of the filtered object).
Example of conflicting filter values are:
Filtering for 'biological_process'
a result object that was
previously filtered for 'molecular_function'
(no more
'BP'
terms are present in the first filtered object).
filtering for a total_count
of 5 or more a result object
previously filtered for a total_count
of 10 or more.
Kevin Rue-Albrecht
Method GO_analyse
.
# load the sample output data data(AlvMac_results.pVal) # have an overview of the result variable str(AlvMac_results.pVal) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # have an overview of the filtered result variable str(filtered_results)
# load the sample output data data(AlvMac_results.pVal) # have an overview of the result variable str(AlvMac_results.pVal) # filter for Biological Processes associated with 5+ genes and <=0.05 P-value filtered_results <- subset_scores( result=AlvMac_results.pVal, total_count=5, p.val=0.05, namespace="BP") # have an overview of the filtered result variable str(filtered_results)
Given a Gene Ontology (GO) identifier represented in the dataset and the
output variable of a GO_analyse()
function, table_genes()
returns a table listing the genes associated with that go_id, their
associated gene name, and description.
table_genes(go_id, result, data.only=FALSE, order.by='rank')
table_genes(go_id, result, data.only=FALSE, order.by='rank')
go_id |
A Gene Ontology (GO) identifier. |
result |
The output of the |
data.only |
Whether to return only the feature identifiers present in the given dataset or alternatively returns all feature identifiers associated with the GO term in the Ensembl BioMart. |
order.by |
The metric or name to order the output table. Default is increasing 'rank', which corresponds to decreasing score. Valid values are 'rank' (increasing) equivalent to 'score' (decreasing), 'gene_id', 'name', equivalent to 'external_gene_name' and 'external_gene_id'. Text ranking in alphabetical order. |
A data frame listing the statistics and annotations for the genes present in the expression dataset and associated with the GO term.
Kevin Rue-Albrecht
Method GO_analyse
.
# load the sample output data data(AlvMac_results) # Table of result for genes associated with the GO term # "toll-like receptor 4 signaling pathway" table_genes(result=AlvMac_results, go_id="GO:0034142") # ordered by gene name table_genes(result=AlvMac_results, go_id="GO:0034142", order.by='name')
# load the sample output data data(AlvMac_results) # Table of result for genes associated with the GO term # "toll-like receptor 4 signaling pathway" table_genes(result=AlvMac_results, go_id="GO:0034142") # ordered by gene name table_genes(result=AlvMac_results, go_id="GO:0034142", order.by='name')