Title: | Take command of set enrichment analyses through a unified interface |
---|---|
Description: | Provides a unified interface to a variety of GSEA techniques from different bioconductor packages. Results are harmonized into a single object and can be interrogated uniformly for quick exploration and interpretation of results. Interactive exploration of GSEA results is enabled through a shiny app provided by a sparrow.shiny sibling package. |
Authors: | Steve Lianoglou [aut, cre] , Arkadiusz Gladki [ctb], Aratus Informatics, LLC [fnd] (2023+), Denali Therapeutics [fnd] (2018-2022), Genentech [fnd] (2014 - 2017) |
Maintainer: | Steve Lianoglou <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.13.4 |
Built: | 2024-12-04 20:27:46 UTC |
Source: | https://github.com/bioc/sparrow |
Subset whole genesets from a GeneSetDb
## S4 method for signature 'GeneSetDb,ANY,ANY,ANY' x[i, j, ..., drop = FALSE]
## S4 method for signature 'GeneSetDb,ANY,ANY,ANY' x[i, j, ..., drop = FALSE]
x |
GeneSetDb |
i |
a logical vector as long as |
j |
ignored |
... |
pass through arguments |
drop |
ignored |
GeneSetDb x
with a subset of the genesets it came in with.k
This function adds/updates columns entries in the geneSets(gdb)
table.
If there already are defined meta values for the columns of meta
in x
,
these will be updated with the values in meta
.
addGeneSetMetadata(x, meta, ...)
addGeneSetMetadata(x, meta, ...)
x |
a |
meta |
a |
... |
not used yet |
TODO: should this be a setReplaceMethod, Issue #13 (?) https://github.com/lianos/multiGSEA/issues/13
the updated GeneSetDb
object x
.
gdb <- exampleGeneSetDb() meta.info <- transform( geneSets(gdb)[, c("collection", "name")], someinfo = sample(c("one", "two"), nrow(gdb), replace = TRUE)) gdb <- addGeneSetMetadata(gdb, meta.info)
gdb <- exampleGeneSetDb() meta.info <- transform( geneSets(gdb)[, c("collection", "name")], someinfo = sample(c("one", "two"), nrow(gdb), replace = TRUE)) gdb <- addGeneSetMetadata(gdb, meta.info)
Checks equality (feature parity) between GeneSetDb objects
## S3 method for class 'GeneSetDb' all.equal(target, current, features.only = TRUE, ...)
## S3 method for class 'GeneSetDb' all.equal(target, current, features.only = TRUE, ...)
target |
The reference |
current |
The |
features.only |
Only compare the "core" columns of |
... |
moar args. |
TRUE
if equal, or character
vector of messages if not.
This is helpful when you don't have a monsterly sized GeneSetDb. There will
be as many new columns added to x
as there are active genesets in gdb
.
annotateGeneSetMembership(x, gdb, x.ids = NULL, ...)
annotateGeneSetMembership(x, gdb, x.ids = NULL, ...)
x |
A data.frame with genes/features in rows |
gdb |
A |
x.ids |
The name of the column in |
... |
parameters passed down into |
Returns the original x
with additional columns: each is a
logical vector that indicates membership for genesets defined in
gdb
.
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') lfc <- logFC(mg) annotated <- annotateGeneSetMembership(lfc, gdb, 'feature_id') ## Show only genes that are part of 'HALLMARK_ANGIOGENESIS' geneset angio <- subset(annotated, `c2;;BIOCARTA_AGPCR_PATHWAY`)
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') lfc <- logFC(mg) annotated <- annotateGeneSetMembership(lfc, gdb, 'feature_id') ## Show only genes that are part of 'HALLMARK_ANGIOGENESIS' geneset angio <- subset(annotated, `c2;;BIOCARTA_AGPCR_PATHWAY`)
Utility function to run limma differential expression analysis
calculateIndividualLogFC( x, design, contrast = ncol(design), robust.fit = FALSE, robust.eBayes = FALSE, trend.eBayes = FALSE, treat.lfc = NULL, weights = NULL, confint = TRUE, with.fit = FALSE, use.qlf = TRUE, ..., xmeta. = NULL, as.dt = FALSE )
calculateIndividualLogFC( x, design, contrast = ncol(design), robust.fit = FALSE, robust.eBayes = FALSE, trend.eBayes = FALSE, treat.lfc = NULL, weights = NULL, confint = TRUE, with.fit = FALSE, use.qlf = TRUE, ..., xmeta. = NULL, as.dt = FALSE )
x |
The expression object. This can be 1 column matrix if you are not running any analysis, and this function essentially is just a "pass through" |
design |
The design matrix for the experiment |
contrast |
The contrast you want to test and provide stats for. By
default this tests the last column of the |
robust.fit |
The value of the |
robust.eBayes |
The value of the |
trend.eBayes |
The value of the |
treat.lfc |
If this is numeric, this activates limma's "treat"
functionality and tests for differential expression against this
specified log fold change threshold. This defaults to |
weights |
an option matrix of weights to use in |
confint |
add confidence intervals to |
with.fit |
If |
use.qlf |
If |
... |
parameters passed down into the relevant limma/edgeR based functions. |
xmeta. |
a data.frame to add meta data (symbol, primarly) to the outgoing
logFC data.frame. This is used when |
as.dt |
If |
This function fits linear modles (or glms) to perform differential
expression analyses. If the x
object is a DGEList
the
analysis will be performed using edgeR's quasi-likelihood framework,
otherwise limma will be used for all other scenarios.
If x
is a edgeR::DGEList()
we require that edgeR::estimateDisp()
has
already been called. If you prefer to analyze rnaseq data using voom, be sure
that x
is the object that has been returned from a call to limma::voom()
(or limma::voomWithQualityWeights()
.
The documentation here is speaking the language of a "limma" analysis, however for each parameter, there is an analagous function/parameter that will be delegated to.
Lastly, if x
is simply a single column matrix, we assume that we are
just passing a single pre-ranked vector of statistics through sparrow::seas's
analysis pipelines (for use in methods like "fgsea", "cameraPR", etc.), and
a logFC-like data.frame is constructed with these statistics in the
logFC
and t
columns.
If with.fit == FALSE
(the default) a data.table
of
logFC statistics for the contrast under test. Otherwise, a list is
returned with $result
containing the logFC statistics, and
$fit
has the limma fit for the data/design/contrast under test.
vm <- exampleExpressionSet(do.voom = TRUE) lfc <- calculateIndividualLogFC(vm, vm$design, "tumor")
vm <- exampleExpressionSet(do.voom = TRUE) lfc <- calculateIndividualLogFC(vm, vm$design, "tumor")
Associates key:value metadata to a gene set collection of a GeneSetDb()
.
collectionMetadata(x, collection, name, ...) geneSetURL(x, i, j, ...) featureIdType(x, i, ...) featureIdType(x, i) <- value ## S4 method for signature 'GeneSetDb,missing,missing' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb,character,missing' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb,character,character' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb' geneSetURL(x, i, j, ...) ## S4 replacement method for signature 'GeneSetDb' featureIdType(x, i) <- value ## S4 method for signature 'GeneSetDb' featureIdType(x, i, ...) addCollectionMetadata( x, xcoll, xname, value, validate.value.fn = NULL, allow.add = TRUE ) ## S4 method for signature 'SparrowResult' geneSetURL(x, i, j, ...)
collectionMetadata(x, collection, name, ...) geneSetURL(x, i, j, ...) featureIdType(x, i, ...) featureIdType(x, i) <- value ## S4 method for signature 'GeneSetDb,missing,missing' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb,character,missing' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb,character,character' collectionMetadata(x, collection, name, as.dt = FALSE) ## S4 method for signature 'GeneSetDb' geneSetURL(x, i, j, ...) ## S4 replacement method for signature 'GeneSetDb' featureIdType(x, i) <- value ## S4 method for signature 'GeneSetDb' featureIdType(x, i, ...) addCollectionMetadata( x, xcoll, xname, value, validate.value.fn = NULL, allow.add = TRUE ) ## S4 method for signature 'SparrowResult' geneSetURL(x, i, j, ...)
x |
|
collection |
The geneset collection to to query |
name |
The name of the metadata variable to get the value for |
... |
not used yet |
i , j
|
The collection,name compound key identifier of the gene set |
value |
The value of the metadata variable |
as.dt |
If |
xcoll |
The collection name |
xname |
The name of the metadata variable |
validate.value.fn |
If a function is provided, it is run on
|
allow.add |
If |
The design of the GeneSetDb is such that we assume that groups of gene sets are usually defined together and will therefore share similar metadata. These groups of gene sets will fall into the same "collection", and, therefore, metadata for particular gene sets are tracked at the collection level.
Types of metadata being referred to could be things like the organism
that a batch of gene sets were defined in, the type of feature identifiers
that a collection of gene sets are using (ie. GSEABase::EntrezIdentifier()
)
or a URL pattern that combines the collection,name compound key that one
can browse to in order to find out more information about the gene set.
There are explicit helper functions that set and get these aforementioned
metadata, namely featureIdType()
, geneSetCollectionURLfunction()
, and
geneSetURL()
. Aribtrary metadata can be stored at the collection level
using the addCollectionMetadata()
function. More details are provided
below.
A character vector of URLs for each of the genesets identified by
i, j
. NA
is returned for genesets i,j
that are not found in x
.
The updated GeneSetDb
.
collectionMetadata(x = GeneSetDb, collection = missing, name = missing)
: Returns metadata for all collections
collectionMetadata(x = GeneSetDb, collection = character, name = missing)
: Returns all metadata for a specific collection
collectionMetadata(x = GeneSetDb, collection = character, name = character)
: Returns the name
metadata value for a given
collection
.
geneSetURL(GeneSetDb)
: returns the URL for a geneset
featureIdType(GeneSetDb) <- value
: sets the feature id type for a collection
featureIdType(GeneSetDb)
: retrieves the feature id type for a collection
geneSetURL(SparrowResult)
: returns the URL for a geneset from a
SparrowResult object
A URL function can be defined per collection that takes the collection,name
compound key and generates a URL for the gene set that the user can browse
to for futher information. For instance, the
geneSetCollectionURLfunction()
for the MSigDB collections are defined
like so:
url.fn <- function(collection, name) { url <- 'http://www.broadinstitute.org/gsea/msigdb/cards/%s.html' sprintf(url, name) } gdb <- getMSigGeneSetDb('H') geneSetCollectionURLfunction(gdb, 'H') <- url.fn
In this way, a call to geneSetURL(gdb, 'H', 'HALLMARK_ANGIOGENESIS')
will return
http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_ANGIOGENESIS.html.
This function is vectorized over i
and j
When defining a set of gene sets in a collection, the identifiers used must be of the same type. Most often you'll probably be working with Entrez identifiers, simply because that's what most of the annotations work with.
As such, you'd define that your collection uses geneset identifiers like so:
gdb <- getMSigGeneSetDb('H') featureIdType(gdb, 'H') <- GSEABase::ENSEMBLIdentifier() ## or, equivalently (but you don't want to use this) gdb <- addCollectionMetadata(gdb, 'H', 'id_type', GSEABase::ENSEMBLIdentifier())
Adds arbitrary metadata to a gene set collection of a GeneSetDb
Note that this is not a replacement method! You must catch the returned
object to keep the one with the updated collectionMetadata
. Although this
function is exported, I imagine this being used mostly through predefined
replace methods that use this as a utility function, such as the replacement
methods featureIdType<-
, geneSetURLfunction<-
, etc.
gdb <- getMSigGeneSetDb('H') gdb <- addCollectionMetadata(gdb, 'H', 'foo', 'bar')
gdb <- exampleGeneSetDb() # Gene Set URLs geneSetURL(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') geneSetURL(gdb, c('c2', 'c7'), c('BIOCARTA_AGPCR_PATHWAY', 'GSE14308_TH2_VS_TH1_UP')) # feature id types featureIdType(gdb, "c2") <- GSEABase::EntrezIdentifier() featureIdType(gdb, "c2") ## Arbitrary metadata gdb <- addCollectionMetadata(gdb, 'c2', 'foo', 'bar') cmh <- collectionMetadata(gdb, 'c2', as.dt = TRUE) ## print this to see
gdb <- exampleGeneSetDb() # Gene Set URLs geneSetURL(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') geneSetURL(gdb, c('c2', 'c7'), c('BIOCARTA_AGPCR_PATHWAY', 'GSE14308_TH2_VS_TH1_UP')) # feature id types featureIdType(gdb, "c2") <- GSEABase::EntrezIdentifier() featureIdType(gdb, "c2") ## Arbitrary metadata gdb <- addCollectionMetadata(gdb, 'c2', 'foo', 'bar') cmh <- collectionMetadata(gdb, 'c2', as.dt = TRUE) ## print this to see
Combines two GeneSetDb objects together
## S4 method for signature 'GeneSetDb,GeneSetDb' combine(x, y, ...)
## S4 method for signature 'GeneSetDb,GeneSetDb' combine(x, y, ...)
x |
a GeneSetDb object |
y |
a GeneSetDb object |
... |
more things |
a new GeneSetDb that contains all genesets from x
and y
gdb1 <- exampleGeneSetDb() gdb2 <- GeneSetDb(exampleGeneSetDF()) gdb <- combine(gdb1, gdb2)
gdb1 <- exampleGeneSetDb() gdb2 <- GeneSetDb(exampleGeneSetDF()) gdb <- combine(gdb1, gdb2)
This would be useful when you want to add a GSEA result to an already
existing one. append
would be more appropriate, but ...
## S4 method for signature 'SparrowResult,SparrowResult' combine(x, y, rename.x = NULL, rename.y = NULL, ...)
## S4 method for signature 'SparrowResult,SparrowResult' combine(x, y, rename.x = NULL, rename.y = NULL, ...)
x |
A |
y |
A |
rename.x |
A named vector that used to match resultNames(x) and remane
them to something different. |
rename.y |
Same as |
... |
more things |
When would you want to do that? Imagine a shiny app that drives sparrow. You might want to present the results of each analysis as they come "online", so you would run them independently and make them available to the user immediately after they each finish (ie. in combination with the promises package).
A combined SparrowResult
object
mg1 <- exampleSparrowResult() mg2 <- exampleSparrowResult() mgc <- combine(mg1, mg2)
mg1 <- exampleSparrowResult() mg2 <- exampleSparrowResult() mgc <- combine(mg1, mg2)
conform
-ing, a GeneSetDb
to a target expression
object is an important step required prior to perform any type of GSEA. This
function maps the featureIds used in the GeneSetDb to the elements of a
target expression object (ie. the rows of an expression matrix, or the
elements of a vector of gene-level statistics).
After conform
-ation, each geneset in the GeneSetDb
is flagged
as active (or inactive) given the number of its features that are
successfully mapped to target
and the minimum and maximum number of
genes per geneset required as specified by the min.gs.size
and
max.gs.size
parameters, respectively.
Only genesets that are marked with active = TRUE
will be used in any
downstream gene set operations.
conform(x, ...) unconform(x, ...) ## S4 method for signature 'GeneSetDb' conform( x, target, unique.by = c("none", "mean", "var"), min.gs.size = 2L, max.gs.size = Inf, match.tolerance = 0.25, ... ) ## S4 method for signature 'GeneSetDb' unconform(x, ...) is.conformed(x, to)
conform(x, ...) unconform(x, ...) ## S4 method for signature 'GeneSetDb' conform( x, target, unique.by = c("none", "mean", "var"), min.gs.size = 2L, max.gs.size = Inf, match.tolerance = 0.25, ... ) ## S4 method for signature 'GeneSetDb' unconform(x, ...) is.conformed(x, to)
x |
The GeneSetDb |
... |
moar args |
target |
The expression object/matrix to conform to. This could also just be a character vector of IDs. |
unique.by |
If there are multiple rows that map to the identifiers used in the genesets, this is a means to pick the single row for that ID |
min.gs.size |
Ensure that the genesets that make their way to the
|
max.gs.size |
Ensure that the genesets that make their way to the
|
match.tolerance |
Numeric value between [0,1]. If the fraction of
|
to |
the object to test conformation to |
A GeneSetDb()
that has been matched/conformed to an expression
object target y
.
is.conformed()
: Checks to see if GeneSetDb x
is conformed to a target
object to
unconform()
: Resets the conformation mapping.
is.conformed()
: If to
is missing, looks for evidence that conform
has
been called (at all) on x
. If to
is provided, specifically checks that
x
has been conformed to the target object to
.
es <- exampleExpressionSet() gdb <- exampleGeneSetDb() head(geneSets(gdb)) gdb <- conform(gdb, es) ## Note the updated values `active` flag, and n (the number of features ## mapped per gene set) head(geneSets(gdb))
es <- exampleExpressionSet() gdb <- exampleGeneSetDb() head(geneSets(gdb)) gdb <- conform(gdb, es) ## Note the updated values `active` flag, and n (the number of features ## mapped per gene set) head(geneSets(gdb))
As awesome as a GeneSetDb is, you might find a time when you'll need your gene set information in an other format. To do that, we provide the following functions:
as(gdb, "BiocSetf')
: convert to a BiocSet::BiocSet()
.
as(gdb, "GeneSetCollection")
: Convert to a
GSEABase::GeneSetCollection()
object.
as.data.(table|frame)(gdb)
: Perhaps the most natural format to convert to in
order to save locally and examine outside of Bioconductor's GSEA universe,
but not many other tools accept gene set definitions in this format.
as.list(gdb)
: A named list of feature identifiers. This is the format
that many of the limma gene set testing methods use
## S3 method for class 'GeneSetDb' as.data.table( x, keep.rownames = FALSE, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S3 method for class 'GeneSetDb' as.data.frame( x, row.names = NULL, optional = NULL, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... )
## S3 method for class 'GeneSetDb' as.data.table( x, keep.rownames = FALSE, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S3 method for class 'GeneSetDb' as.data.frame( x, row.names = NULL, optional = NULL, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... )
x |
A |
keep.rownames |
included here just for consistency with
|
value |
The value type to export for the feature ids, defaults to
|
active.only |
If the |
... |
pass through arguments (not used) |
row.names , optional
|
included here for consistency with |
The as.*
functions accept a value
parameter which indicates the type of
IDs you want to export in the conversion:
"feature_id"
: The ID used as originally entered into the GeneSetDb
.
"x.idx"
: Only valid if the GeneSetDb x
has been conform
-ed to an
expession container. This option will export the features as the integer
rows of the expression container.
"x.id"
: Only valid if the GeneSetDb x
has been conform
-ed. The
target expression container might use feature identifiers that are
different than what is in the GeneSetDb. If an active featureMap is
set on the GeneSetDb, this will convert the original feature identifiers
into a different target space (entrez to ensembl, for instance). Using
this option, the features will be provided in the target space.
a converted GeneSetDb
as.data.frame(GeneSetDb)
: convert a GeneSetDb to data.frame
es <- exampleExpressionSet() gdb <- conform(exampleGeneSetDb(), es) bs <- as(gdb, "BiocSet") gdf <- as.data.frame(gdb) gdb <- conform(gdb, es) gdfi <- as.data.frame(gdb, value = 'x.idx') gdl <- as.list(gdb)
es <- exampleExpressionSet() gdb <- conform(exampleGeneSetDb(), es) bs <- as(gdb, "BiocSet") gdf <- as.data.frame(gdb) gdb <- conform(gdb, es) gdfi <- as.data.frame(gdb, value = 'x.idx') gdl <- as.list(gdb)
The various GeneSetDb data providers (MSigDb, KEGG, etc). limit the identifier types that they return. Use this function to map the given identifiers to whichever type you like.
convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... ) ## S4 method for signature 'BiocSet' convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... ) ## S4 method for signature 'GeneSetDb' convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... )
convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... ) ## S4 method for signature 'BiocSet' convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... ) ## S4 method for signature 'GeneSetDb' convertIdentifiers( x, from = NULL, to = NULL, id.type = c("ensembl", "entrez", "symbol"), xref = NULL, extra.cols = NULL, allow.cartesian = FALSE, min_support = 3, top = TRUE, ... )
x |
The GeneSetDb with identifiers to convert |
from , to
|
If you are doing identifier and/orspecies conversion using
babelgene, |
id.type |
If you are using babelgene conversion, this specifies the
type of identifier you want to convert to. It can be any of |
xref |
a data.frame used to map current identifiers to target ones. |
extra.cols |
a character vector of columns from |
allow.cartesian |
a boolean used to temporarily set the
|
min_support , top
|
Parameters used in the internal call to
|
... |
pass through args (not used) |
For best results, provide your own identifier mapping reference, but we
provide a convenience wrapper around the babelgene::orthologs()
function to
change between identifier types and species.
When there are multiple target id's for the source id, they will all be returned. When there is no target id for the source id, the soucre feature will be axed.
A new GeneSetDb object with converted identifiers. We try to retain
any metadata in the original object, but no guarantees are given. If
id_type
was stored previously in the collectionMetadata, that will be
dropped.
convertIdentifiers(BiocSet)
: converts identifiers in a BiocSet
convertIdentifiers(GeneSetDb)
: converts identifiers in a GeneSetDb
You need to provide a data.frame via the xref
paramater that has a column
for the current identifiers and another column for the target identifiers.
The columns are specified by the from
and to
paramters, respectively.
If you don't provide a data.frame, you can provide a species name. We will
rely on the {babelgene}
package for the conversion, so you will have to
provide a species name that it recognizes.
We plan to provide a quick wrapper to babelgene's ortholog mapping function to make identifier conversion a easier through this function. You can track this in sparrow issue #2.
# You can convert the identifiers within a GeneSetDb to some other type # by providing a "translation" table. Check out the unit tests for more # examples. gdb <- exampleGeneSetDb() # this has no symbols in it # Define a silly conversion table. xref <- data.frame( current_id = featureIds(gdb), new_id = paste0(featureIds(gdb), "_symbol")) gdb2 <- convertIdentifiers(gdb, from = "current_id", to = "new_id", xref = xref, extra.cols = "original_id") geneSet(gdb2, name = "BIOCARTA_AGPCR_PATHWAY") # Convert entrez to ensembl id's using babelgene ## Not run: # The conversion functionality via babelgene isn't yet implemented, but # will look like this. # 1. convert the human entrez identifiers to ensembl gdb.ens <- convertIdentifiers(gdb, "human", id.type = "ensembl") # 2. convert the human entrez to mouse entrez gdb.entm <- convertIdentifiers(gdb, "human", "mouse", id.type = "entrez") # 3. convert the human entrez to mouse ensembl gdb.ensm <- convertIdentifiers(gdb, "human", "mouse", id.type = "ensembl") ## End(Not run)
# You can convert the identifiers within a GeneSetDb to some other type # by providing a "translation" table. Check out the unit tests for more # examples. gdb <- exampleGeneSetDb() # this has no symbols in it # Define a silly conversion table. xref <- data.frame( current_id = featureIds(gdb), new_id = paste0(featureIds(gdb), "_symbol")) gdb2 <- convertIdentifiers(gdb, from = "current_id", to = "new_id", xref = xref, extra.cols = "original_id") geneSet(gdb2, name = "BIOCARTA_AGPCR_PATHWAY") # Convert entrez to ensembl id's using babelgene ## Not run: # The conversion functionality via babelgene isn't yet implemented, but # will look like this. # 1. convert the human entrez identifiers to ensembl gdb.ens <- convertIdentifiers(gdb, "human", id.type = "ensembl") # 2. convert the human entrez to mouse entrez gdb.entm <- convertIdentifiers(gdb, "human", "mouse", id.type = "entrez") # 3. convert the human entrez to mouse ensembl gdb.ensm <- convertIdentifiers(gdb, "human", "mouse", id.type = "ensembl") ## End(Not run)
We assume that this is a sample x gene expression matrix, but it can (of course) be any numeric matrix of your choosing. The column names appear in the main diagonal of the plot. Note that you might prefer the corrplot package for similar functionality, and this functionality is intentionally named different from that..
corplot( E, title, cluster = FALSE, col.point = "#00000066", diag.distro = TRUE, smooth.scatter = nrow(E) > 400, max.cex.cor = NULL, ... )
corplot( E, title, cluster = FALSE, col.point = "#00000066", diag.distro = TRUE, smooth.scatter = nrow(E) > 400, max.cex.cor = NULL, ... )
E |
the matrix used to plot a pairs correlation plot. The vectors used to assess all pairwise correlation should be in the columns of the matrix. |
title |
The title of the plot |
cluster |
|
col.point |
the color of the points in the scatterplots |
diag.distro |
show the distribution of values on the diagnols? |
smooth.scatter |
boolean to indicate wether to use a normal scatter, or
a |
max.cex.cor |
the numeric value defining the maximum text size (cor) in the correlation panel.
By default there is no limit on the maximum text size and the text size is calculated with |
... |
pass through arguments to internal panel functions |
TODO: Add with.signature parameter to allow a box to plot the signature score of all genes in E.
nothing, just creates the plot
The corrplot package
x <- matrix(rnorm(1000), ncol=5) corplot(x)
x <- matrix(rnorm(1000), ncol=5) corplot(x)
Weights for the genes in x
are calculated by the percent of which
they contribute to the principal component indicated by eigengene
.
eigenWeightedMean( x, eigengene = 1L, center = TRUE, scale = TRUE, uncenter = center, unscale = scale, retx = FALSE, weights = NULL, normalize = FALSE, all.x = NULL, ..., .drop.sd = 1e-04 )
eigenWeightedMean( x, eigengene = 1L, center = TRUE, scale = TRUE, uncenter = center, unscale = scale, retx = FALSE, weights = NULL, normalize = FALSE, all.x = NULL, ..., .drop.sd = 1e-04 )
x |
An expression matrix of genes x samples. When using this to score
geneset activity, you want to reduce the rows of |
eigengene |
the PC used to extract the gene weights from |
center , scale
|
center and/or scale data before scoring? |
uncenter , unscale
|
uncenter and unscale the data data on the way out?
Defaults to the respective values of |
retx |
Works the same as |
weights |
a user can pass in a prespecified set of waits using a named
numeric vector. The names must be a superset of |
normalize |
If |
all.x |
if the user is trying to normalize these scores, an expression
matrix that has superset of the control genes needs to be provided, where
the columns of |
... |
these aren't used in here |
.drop.sd |
When zero-sd (non varying) features are scaled, their values
are |
You will generally want the rows of the gene x sample matrix “xto be z-transformed. If it is not already, ensure that
center' and
'scale' are set to 'TRUE'.
When uncenter and/or unscale are FALSE
, it means that the scores
should be applied on the centered or scaled values, respectively.
A list of useful transformation information. The caller is likely
most interested in the $score
vector, but other bits related to
the SVD/PCA decomposition are included for the ride.
Scores can be normalized against a set of control genes. This results in negative and postiive sample scores. Positive scores are ones where the specific geneset score is higher than the aggregate control-geneset score.
Genes used for the control set can either be randomly sampled from the
rows of the all.x
expression matrix (when normalize = TRUE
), or
explicitly specified by a row-identifier character vectore passed to the
normalize
parameter. In both cases, the code prefers to select a
random-control geneset to be of equal size as nrow(x)
. If that's not
possible, we use as many genes as we can get.
Note that normalization requires an expression matrix to be passed into
the all.x
parameter, whose columns match 1:1 to the columns in x
.
Calling scoreSingleSamples()
with method = "ewm", normalize = TRUE
handles this transparently.
This idea to implement this method of normalizatition was inspired from
the ctrl.score
normalization found in Seurat's AddModuleScore()
function.
scoreSingleSamples
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, 'c2', 'BURTON_ADIPOGENESIS_PEAK_AT_2HR', value='x.idx') scores <- eigenWeightedMean(vm[features,])$score ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'ewm') s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, scores)
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, 'c2', 'BURTON_ADIPOGENESIS_PEAK_AT_2HR', value='x.idx') scores <- eigenWeightedMean(vm[features,])$score ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'ewm') s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, scores)
The "key" form often comes out as rownames to matrices and such, or particularly for sending genesets down into wrapped methods, like do.camera.
splt_gskey
is the inverse function of encode_gskey()
encode_gskey(x, y, sep = ";;") split_gskey(x, sep = ";;")
encode_gskey(x, y, sep = ";;") split_gskey(x, sep = ";;")
x |
a character vector of encoded geneset keys from |
y |
if |
sep |
the separator used in the encoding of geneset names |
a character vector
a data.frame with (collection,name) columns
gdf <- exampleGeneSetDF() gskeys <- encode_gskey(gdf) gscols <- split_gskey(gskeys)
gdf <- exampleGeneSetDF() gskeys <- encode_gskey(gdf) gscols <- split_gskey(gskeys)
We provide examplar expression data (counts or voomed) as well as exemplar gene sets in different forms.
exampleExpressionSet( dataset = c("tumor-vs-normal", "tumor-subtype"), do.voom = TRUE ) exampleGeneSets(x, unlist = !missing(x)) exampleGeneSetDb() exampleBiocSet() exampleGeneSetDF() exampleSparrowResult(cached = TRUE, methods = c("cameraPR", "fry")) exampleDgeResult( species = "human", id.type = c("entrez", "ensembl"), induce.bias = NULL )
exampleExpressionSet( dataset = c("tumor-vs-normal", "tumor-subtype"), do.voom = TRUE ) exampleGeneSets(x, unlist = !missing(x)) exampleGeneSetDb() exampleBiocSet() exampleGeneSetDF() exampleSparrowResult(cached = TRUE, methods = c("cameraPR", "fry")) exampleDgeResult( species = "human", id.type = c("entrez", "ensembl"), induce.bias = NULL )
dataset |
Character vector indicating what samples wanted, either
|
do.voom |
If TRUE, a voomed EList is returned, otherwise an ExpressionSet of counts. |
x |
If provided, an expression/matrix object so that the genesets are returned as (integer) index vectors into the rows of x whose rownames match the ids in the geneset. |
unlist |
return the genesets as nested list of lists (default: |
cached |
If |
methods |
the methods to use to create a new SparrowResult for. |
species |
the species to return the example result from (right now, only "human") |
id.type |
the type of identifiers to use: |
induce.bias |
We can simulate a bias on the pvalue by the gene's
|
A list of lists of entrezIDs when as == 'lol'
, or
a list of integers into the rows of x
.
The expression data is a subset of the TCGA BRCA indication. Calling
exampleExpressionSet(do.voom = TRUE)
will return a voomed EList
version
of the data. When do.voom = FALSE
, you will get a DGEList of the counts
Returns gene sets as either a list of feature identifiers. Entrez identifiers
are used. If x
is provided, integers that index into the expression
container x
are used (this is a legacy feature that we should nuke).
Returns gene sets as a GeneSetDb
object
Returns gene sets as a BiocSet
object
Returns a data.frame of gene set definitions. A data.frame of this form
can be passed into the GeneSetDb()
contructor.
vm <- exampleExpressionSet() head(exampleGeneSets())
vm <- exampleExpressionSet() head(exampleGeneSets())
Inspired from one of Hadley's functions (in plyr or something?)
failWith( default = NULL, expr, frame = parent.frame(), message = geterrmessage(), silent = FALSE, file = stderr() )
failWith( default = NULL, expr, frame = parent.frame(), message = geterrmessage(), silent = FALSE, file = stderr() )
default |
the value to return if |
expr |
the expression to take a shot at |
frame |
the frame to evaluate the expression in |
message |
the error message to display if |
silent |
if |
file |
where msg sends the message |
the result of expr
if successful, otherwise default
value.
# look, this doesn't throw an error, it just returns NULL x <- failWith(NULL, stop("no error, just NULL"), silent = TRUE)
# look, this doesn't throw an error, it just returns NULL x <- failWith(NULL, stop("no error, just NULL"), silent = TRUE)
GeneSetDb
The GeneSetDb has an internal data structure that is used to cross reference the feature_id's used in the database construction to the features in the expression object that is used to run GSEA methods against.
featureIdMap(x, ...) ## S4 method for signature 'GeneSetDb' featureIdMap(x, as.dt = FALSE)
featureIdMap(x, ...) ## S4 method for signature 'GeneSetDb' featureIdMap(x, as.dt = FALSE)
x |
the object to retrieve the featureIdMap from |
... |
pass through arguments |
as.dt |
If |
a data.frame of input feature_id's to conformed id's/rows/etc
featureIdMap(GeneSetDb)
: extract featureIdMap from a GeneSetDb
gdb <- exampleGeneSetDb() vm <- exampleExpressionSet() gdb <- conform(gdb, vm) fmap <- featureIdMap(gdb)
gdb <- exampleGeneSetDb() vm <- exampleExpressionSet() gdb <- conform(gdb, vm) fmap <- featureIdMap(gdb)
Gene sets are defined by the unique compound key consisting of their
collection
and name
. To fetch the featureIds associated with
a specific geneset, you must provide values for i
and j
. If
these are missing, then a character vector of all the unique feature ids
within x
are returned.
If the GeneSetDb x
has been conformed to an expression object this
will default to return only the feature_id's that are matched to the target
expression object, and they will be returned using the same identifiers that
the target expression object uses. To change this behavior, tweak the values
for the active.only
and value
parameters, respectively.
x
can be either a GeneSetDb
or a SparrowResult
. If its the latter,
then this call simply delegates to the internal GeneSetDb
.
featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S4 method for signature 'GeneSetDb' featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S4 method for signature 'SparrowResult' featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = TRUE, ... )
featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S4 method for signature 'GeneSetDb' featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = is.conformed(x), ... ) ## S4 method for signature 'SparrowResult' featureIds( x, i, j, value = c("feature_id", "x.id", "x.idx"), active.only = TRUE, ... )
x |
Object to retrieve the gene set from, either a |
i , j
|
The collection,name compound key identifier of the gene set |
value |
What form do you want the id's in?
|
active.only |
only look for gene sets that are "active"? Defaults to
|
... |
pass through arguments |
A vector of identifiers (or indexes into an expression object,
depending on the value
argument) for the features in the specified
geneset. NA
is returned if the geneset is not "active" (ie. listed in
geneSets()
)
gdb <- exampleGeneSetDb() fids.gs <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') fids.c2 <- featureIds(gdb, 'c2') fids.all <- featureIds(gdb) vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(gdb, vm) ## fewer than before fids.gs2 <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') ## same as before fids.gs3 <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY', active.only=FALSE) ## returned as row indices into vm fids.idxs <- featureIds(gdb, 'c2', value='x.idx')
gdb <- exampleGeneSetDb() fids.gs <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') fids.c2 <- featureIds(gdb, 'c2') fids.all <- featureIds(gdb) vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(gdb, vm) ## fewer than before fids.gs2 <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY') ## same as before fids.gs3 <- featureIds(gdb, 'c2', 'BIOCARTA_AGPCR_PATHWAY', active.only=FALSE) ## returned as row indices into vm fids.idxs <- featureIds(gdb, 'c2', value='x.idx')
Gene sets inside a GeneSetDb()
are indexed by their collection,name
compound key. There is no special class to represent an individual gene set.
Instead, gene sets are returned as a data.frame, the rows of which enumerate
the features that belong to them.
When x
is a SparrowResult()
, this function will append
the differential expression statistics for the individual features generated
across the contrast that defined x
.
geneSet(x, i, j, ...) ## S4 method for signature 'GeneSetDb' geneSet( x, i, j, active.only = is.conformed(x), with.feature.map = FALSE, ..., collection = NULL, name = NULL, as.dt = FALSE ) ## S4 method for signature 'SparrowResult' geneSet( x, i, j, active.only = TRUE, with.feature.map = FALSE, ..., collection = NULL, name = NULL, as.dt = FALSE )
geneSet(x, i, j, ...) ## S4 method for signature 'GeneSetDb' geneSet( x, i, j, active.only = is.conformed(x), with.feature.map = FALSE, ..., collection = NULL, name = NULL, as.dt = FALSE ) ## S4 method for signature 'SparrowResult' geneSet( x, i, j, active.only = TRUE, with.feature.map = FALSE, ..., collection = NULL, name = NULL, as.dt = FALSE )
x |
Object to retrieve the gene set from, either a |
i , j
|
The collection,name compound key identifier of the gene set |
... |
passed down to inner functinos |
active.only |
only look for gene sets that are "active"? Defaults to
|
with.feature.map |
If |
collection |
using |
name |
the same for the |
as.dt |
If |
a data.(frame|table)
of gene set information. If x
is a
SparrowResult
object, then differential expression statistics
are added as columns to this result.
gdb <- exampleGeneSetDb() geneSet(gdb, "c2", "KOMMAGANI_TP63_GAMMA_TARGETS") geneSet(gdb, collection = "c2", name = "KOMMAGANI_TP63_GAMMA_TARGETS") geneSet(gdb, name = "KOMMAGANI_TP63_GAMMA_TARGETS")
gdb <- exampleGeneSetDb() geneSet(gdb, "c2", "KOMMAGANI_TP63_GAMMA_TARGETS") geneSet(gdb, collection = "c2", name = "KOMMAGANI_TP63_GAMMA_TARGETS") geneSet(gdb, name = "KOMMAGANI_TP63_GAMMA_TARGETS")
Reference collectionMetadata()
for more info.
geneSetCollectionURLfunction(x, i, ...) geneSetCollectionURLfunction(x, i) <- value ## S4 method for signature 'GeneSetDb' geneSetCollectionURLfunction(x, i, ...) ## S4 replacement method for signature 'GeneSetDb' geneSetCollectionURLfunction(x, i) <- value ## S4 method for signature 'SparrowResult' geneSetCollectionURLfunction(x, i, ...)
geneSetCollectionURLfunction(x, i, ...) geneSetCollectionURLfunction(x, i) <- value ## S4 method for signature 'GeneSetDb' geneSetCollectionURLfunction(x, i, ...) ## S4 replacement method for signature 'GeneSetDb' geneSetCollectionURLfunction(x, i) <- value ## S4 method for signature 'SparrowResult' geneSetCollectionURLfunction(x, i, ...)
x |
The GeneSetDb |
i |
The collection to get the url function from |
... |
pass through arguments (not used) |
value |
the function to set as the geneset url function for the given
collection |
the function that maps collection,name combinations to an informational URL.
geneSetCollectionURLfunction(GeneSetDb)
: returns the gene set collection
url function from a GeneSetDb
geneSetCollectionURLfunction(GeneSetDb) <- value
: sets the gene set collection url
function for a GeneSetDb : Collection
combination.
geneSetCollectionURLfunction(SparrowResult)
: return the url function from a
SparrowResult
object.
gdb <- exampleGeneSetDb() geneSetCollectionURLfunction(gdb, "c2", "BIOCARTA_AGPCR_PATHWAY")
gdb <- exampleGeneSetDb() geneSetCollectionURLfunction(gdb, "c2", "BIOCARTA_AGPCR_PATHWAY")
Fetches the GeneSetDb from SparrowResult
geneSetDb(x)
geneSetDb(x)
x |
|
The GeneSetDb
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') geneSetDb(mg)
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') geneSetDb(mg)
Please refer to the sparrow vignette (vignette("sparrow")
),
(and the "The GeneSetDb Class" section, in particular) for a more deatiled
description of the sematnics of this central data object.
The GeneSetDb class serves the same purpose as the
GSEABase::GeneSetCollection()
class does: it acts as a centralized
object to hold collections of Gene Sets. The reason for its existence is
because there are things that I wanted to know about my gene set
collections that weren't easily inferred from what is essentially a
"list of GeneSets" that is the GeneSetCollection
class.
Gene Sets are internally represented by a data.table
in "a tidy"
format, where we minimally require non NA
values for the following
three character
columns:
collection
name
feature_id
The (collection
, name
) compound key is the primary key of a gene set.
There will be as many entries with the same (collection
, name
) as there
are genes/features in that set.
The GeneSetDb
tracks metadata about genesets at the collection
level. This means that we assume that all of the feature_id
's used
within a collection use the same type of feature identifier (such as
a GSEABase::EntrezIdentifier()
, were defined in the same organism,
etc.
Please refer to the "GeneSetDb" section of the vignette for more
details regarding the construction and querying of a GeneSetDb
object.
GeneSetDb(x, featureIdMap = NULL, collectionName = NULL, ...)
GeneSetDb(x, featureIdMap = NULL, collectionName = NULL, ...)
x |
A |
featureIdMap |
A data.frame with 2 character columns. The first
column is the ids of the genes (features) used to identify the genes in
|
collectionName |
If |
... |
these aren't used for anything in particular, but are here to catch extra arguments that may get passed down if this function is part of some call chain. |
The functionality in the class is useful for the functionality in this
package, but for your own personal usage, you probably want a {BiocSet}
.
A GeneSetDb object
table
The "gene set table": a data.table with geneset information,
one row per gene set. Columns include collection, name, N, and n. The end
user can add more columns to this data.table as desired. The actual
feature IDs are computed on the fly by doing a db[J(group, id)]
db
A data.table
to hold all of the original geneset id
information that was used to construct this GeneSetDb
.
featureIdMap
Maps the ids used in the geneset lists to the ids (rows) over the expression data the GSEA is run on
collectionMetadata
A data.table
to keep metadata about each
individual geneset collection, ie. the user might want to keep track of
where the geneset definitions come from. Perhaps a function that parses
the collection,name combination to generate an URL that points the user
to more information about that geneset, etc.
The GeneSetDb()
constructor is sufficiently flexible enough to create
a GeneSetDb
object from a variety of formats that are commonly used
in the bioconductor echosystem, such as:
GSEABase::GeneSetCollection()
: If you already have a GeneSetCollection
on your hands, you can simply pass it to the GeneSetDb()
constructor.
list of ids: This format is commonly used to define gene sets in the
edgeR/limma universe for testing with camera, roast, romer, etc. The names
of the list items are the gene set names, and their values are a character
vector of gene identifiers. When it's a single list of lists, you must
provide a value for collectionName
. You can embed multiple
collections of gene sets by having a three-deep list-of-lists-of-ids.
The top level list define the different collections, the second level
are the genesets, and the third level are the feature identifiers for
each gene set. See the examples for clarification.
a data.frame
-like object: To keep track of your own custom gene sets, you
have probably realized the importance of maintaing your own sanity, and
likely have gene sets organized in a table like object that has something
like the collection
, name
, and feature_id
required for a GeneSetDb
.
Simply rename the appropriate columns to the ones prescribed here, and pass
that into the constructor. Any other additional columns (symbol, direction,
etc.) will be copied into the GeneSetDb
.
You might wonder what gene sets are defined in a GeneSetDb
: see
the geneSets()
function.
Curious about what features are defined in your GeneSetDb
? See
the featureIds()
function.
Want the details of a particular gene set? Try the geneSet()
function.
This will return a data.frame
of the gene set definition. Calling
geneSet()
on a SparrowResult()
will return the same data.frame
along
with the differential expression statistics for the individual members of the
geneSet across the contrast that was tested in the seas()
call that
created the SparrowResult()
.
You can subset a GeneSetDb to include a subset of genesets defined in it.
To do this, you need to provide an indexing vector that is as long as
length(gdb)
, ie. the number of gene sets defined in GeneSetDb. You
can construct such a vector by performing your boolean logic over the
geneSets(gdb)
table.
Look at the Examples section to see how this works, where we take the MSIgDB c7 collection (aka. "ImmuneSigDB") and only keep gene sets that were defined in experiments from mouse.
?conversion
## exampleGeneSetDF provides gene set definitions in "long form". We show ## how this can easily turned into a GeneSetDb from this form, or convert ## it to other forms (list of features, or list of list of features) to ## do the same. gs.df <- exampleGeneSetDF() gdb.df <- GeneSetDb(gs.df) ## list of ids gs.df$key <- encode_gskey(gs.df) gs.list <- split(gs.df$feature_id, gs.df$key) gdb.list <- GeneSetDb(gs.list, collectionName='custom-sigs') ## A list of lists, where the top level list splits the collections. ## The name of the collection in the GeneSetDb is taken from this top level ## hierarchy gs.lol <- as.list(gdb.df, nested=TRUE) ## examine this list-of lists gdb.lol <- GeneSetDb(gs.lol) ## note that collection is set propperly ## GeneSetDb Interrogation gsets <- geneSets(gdb.df) nkcells <- geneSet(gdb.df, 'cellularity', 'NK cells') fids <- featureIds(gdb.df) # GeneSetDb Manipulation .................................................... # Subset down to only t cell related gene sets gdb.t <- gdb.df[grepl("T cell", geneSets(gdb.df)$name)] gdb.t
## exampleGeneSetDF provides gene set definitions in "long form". We show ## how this can easily turned into a GeneSetDb from this form, or convert ## it to other forms (list of features, or list of list of features) to ## do the same. gs.df <- exampleGeneSetDF() gdb.df <- GeneSetDb(gs.df) ## list of ids gs.df$key <- encode_gskey(gs.df) gs.list <- split(gs.df$feature_id, gs.df$key) gdb.list <- GeneSetDb(gs.list, collectionName='custom-sigs') ## A list of lists, where the top level list splits the collections. ## The name of the collection in the GeneSetDb is taken from this top level ## hierarchy gs.lol <- as.list(gdb.df, nested=TRUE) ## examine this list-of lists gdb.lol <- GeneSetDb(gs.lol) ## note that collection is set propperly ## GeneSetDb Interrogation gsets <- geneSets(gdb.df) nkcells <- geneSet(gdb.df, 'cellularity', 'NK cells') fids <- featureIds(gdb.df) # GeneSetDb Manipulation .................................................... # Subset down to only t cell related gene sets gdb.t <- gdb.df[grepl("T cell", geneSets(gdb.df)$name)] gdb.t
Fetch the active (or all) gene sets from a GeneSetDb or SparrowResult
geneSets(x, ...) ## S4 method for signature 'GeneSetDb' length(x) ## S4 method for signature 'GeneSetDb' geneSets(x, active.only = is.conformed(x), ..., as.dt = FALSE) ## S4 method for signature 'GeneSetDb' nrow(x) ## S4 method for signature 'SparrowResult' geneSets(x, ..., as.dt = FALSE)
geneSets(x, ...) ## S4 method for signature 'GeneSetDb' length(x) ## S4 method for signature 'GeneSetDb' geneSets(x, active.only = is.conformed(x), ..., as.dt = FALSE) ## S4 method for signature 'GeneSetDb' nrow(x) ## S4 method for signature 'SparrowResult' geneSets(x, ..., as.dt = FALSE)
x |
Object to retrieve the gene set from, either a |
... |
pass through arguments |
active.only |
only look for gene sets that are "active"? Defaults to
|
as.dt |
If |
a data.table with geneset information.
length(GeneSetDb)
: Returns the number of genesets in a GeneSetDb
geneSets(GeneSetDb)
: return all genesets from a GeneSetDb
nrow(GeneSetDb)
: return number of genesets in GeneSetDb
geneSets(SparrowResult)
: return the active genesets from a SparrowResult
gdb <- exampleGeneSetDb() gs <- geneSets(gdb)
gdb <- exampleGeneSetDb() gs <- geneSets(gdb)
This function calculates the number of genes that move up/down for the given contrasts, as well as mean and trimmed mean of the logFC and t-statistics. Note that the statistics calculated and returned here are purely a function of the statistics generated at the gene-level stage of the analysis.
geneSetsStats( x, feature.min.logFC = 1, feature.max.padj = 0.1, trim = 0.1, reannotate.significance = FALSE, as.dt = FALSE )
geneSetsStats( x, feature.min.logFC = 1, feature.max.padj = 0.1, trim = 0.1, reannotate.significance = FALSE, as.dt = FALSE )
x |
A |
feature.min.logFC |
used with |
feature.max.padj |
used with |
trim |
The amount to trim when calculated trimmed |
reannotate.significance |
this is internally by the package, and should
left as |
as.dt |
If |
A data.table with statistics at the gene set level across the
prescribed contrast run on x
. These statistics are independent
of any particular GSEA method, but rather summarize aggregate shifts
of the gene sets individual features. The columns included in the output
are summarized below:
n.sig
: The number of individual features whose abs(logFC)
and padj
thersholds satisfy the criteria of the feature.min.logFC
and
feature.max.padj
parameters of the original seas()
call
n.neutral
: The number of individual features whose abs(logFC) and padj
thersholds do not satisfy the feature.*
criteria named above.
n.up, n.down
: The number of individual features with logFC > 0
or
logFC < 0
, respectively, irrespective of the feature.*
thresholds
referenced above.
n.sig.up, n.sig.down
: The number of individual features that pass the
feature.*
thresholds and have logFC > 0 or logFC < 0, respectively.
mean.logFC, mean.logFC.trim
: The mean (or trimmed mean) of the individual
logFC estimates for the features in the gene set. The amount of trim is
specified in the trim
parameter of the seas()
call.
mean.t, mean.t.trim
: The mean (or trimmed mean) of the individual
t-statistics for the features in the gene sets. These are NA
if the input
expression object was a DGEList
.
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') head(geneSetsStats(mg))
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') head(geneSetsStats(mg))
This function creates a geneset by feature table with geneset membership
information for the features
specified by the user. Only the gene sets that
have any of the features
are included in the table returned.
geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, ..., as.dt = FALSE ) ## S4 method for signature 'GeneSetDb' geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, ..., as.dt = FALSE ) ## S4 method for signature 'SparrowResult' geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, method = NULL, max.p = 0.3, p.col = c("padj", "padj.by.collection", "pval"), ..., as.dt = FALSE )
geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, ..., as.dt = FALSE ) ## S4 method for signature 'GeneSetDb' geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, ..., as.dt = FALSE ) ## S4 method for signature 'SparrowResult' geneSetSummaryByGenes( x, features, with.features = TRUE, feature.rename = NULL, method = NULL, max.p = 0.3, p.col = c("padj", "padj.by.collection", "pval"), ..., as.dt = FALSE )
x |
|
features |
a character vector of featureIds |
with.features |
Include columns for |
feature.rename |
if |
... |
pass through arguments |
as.dt |
If |
method |
The GSEA method to pull statistics from |
max.p |
the maximum p-value from the analysis |
p.col |
which p-value column to select from: |
a data.frame of geneset <-> feature incidence/feature matrix.
geneSetSummaryByGenes(SparrowResult)
: get geneset:feature incidence table from
a SparrowResult, optionally filtered by statistical significance from
a given gsea method
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') features <- c("55839", "8522", "29087") gsm.hit <- geneSetSummaryByGenes(gdb, features) gsm.fid <- geneSetSummaryByGenes(mg, features, feature.rename=NULL) gsm.sym <- geneSetSummaryByGenes(mg, features, feature.rename='symbol')
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') features <- c("55839", "8522", "29087") gsm.hit <- geneSetSummaryByGenes(gdb, features) gsm.fid <- geneSetSummaryByGenes(mg, features, feature.rename=NULL) gsm.sym <- geneSetSummaryByGenes(mg, features, feature.rename='symbol')
Uses limma::getGeneKEGGLinks()
and limma::getKEGGPathwayNames()
internally.
getKeggCollection(species = "human", id.type = c("ensembl", "entrez"), ...) getKeggGeneSetDb(species = "human", id.type = c("ensembl", "entrez"), ...)
getKeggCollection(species = "human", id.type = c("ensembl", "entrez"), ...) getKeggGeneSetDb(species = "human", id.type = c("ensembl", "entrez"), ...)
species |
|
id.type |
Gene identifiers are returned by the REST service as
entrez identifiers. Set this to |
... |
pass through arguments |
Currently we just support the pathway database, and only entrez ids.
Note that it is your responsibility to ensure that you can use the KEGG database according to their licensing requirements.
A BiocSet of the kegg stuffs
getKeggGeneSetDb()
: method that returns a GeneSetDb
# connects to the internet and takes a while mouse.entrez <- getKeggCollection("mouse", id.type = "entrez") human.enrez <- getKeggCollection("human", id.type = "entrez")
# connects to the internet and takes a while mouse.entrez <- getKeggCollection("mouse", id.type = "entrez") human.enrez <- getKeggCollection("human", id.type = "entrez")
This provides versioned genesets from gene set collections defined in
MSigDB. Collections can
be retrieved by their collection name, ie c("H", "C2", "C7")
.
getMSigCollection( collection = NULL, species = "human", id.type = c("ensembl", "entrez", "symbol"), with.kegg = FALSE, promote.subcategory.to.collection = FALSE, prefix.collection = FALSE, ... ) getMSigGeneSetDb( collection = NULL, species = "human", id.type = c("ensembl", "entrez", "symbol"), with.kegg = FALSE, promote.subcategory.to.collection = FALSE, prefix.collection = FALSE, ... )
getMSigCollection( collection = NULL, species = "human", id.type = c("ensembl", "entrez", "symbol"), with.kegg = FALSE, promote.subcategory.to.collection = FALSE, prefix.collection = FALSE, ... ) getMSigGeneSetDb( collection = NULL, species = "human", id.type = c("ensembl", "entrez", "symbol"), with.kegg = FALSE, promote.subcategory.to.collection = FALSE, prefix.collection = FALSE, ... )
collection |
character vector specifying the collections you want
(c1, c2, ..., c7, h). By default we load just the hallmark collecitons.
Setting this to |
species |
|
id.type |
do you want the feature id's used in the gene sets to be
|
with.kegg |
The Broad distributes the latest versions of the KEGG
genesets as part of the c2 collection. These genesets come with a
restricted license, so by default we do not return them as part of the
GeneSetDb. To include the KEGG gene sets when asking for the c2
collection, set this flag to |
promote.subcategory.to.collection |
there are different sources of
genesets for a number of the collections in MSigDB. These are included
in the |
prefix.collection |
When |
... |
pass through parameters |
a BiocSet
of the MSigDB collections
getMSigGeneSetDb()
: retrieval method for a GeneSetDb container
This function utilizes the functionality from the {msigdbr}
and
{babelgene}
packages to retrieve gene set definitions from a variety of
organisms and identifier types.
Due to the licensing restrictions over the KEGG collections, they are not
returned from this function unless they are explicitly asked for. You can
ask for them through this function by either (i) querying for the "c2"
collection while setting with.kegg = TRUE
; or (ii) explicitly calling with
collection = "kegg"
.
To cite your use of the Molecular Signatures Database (MSigDB), please reference Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550) and one or more of the following as appropriate:
Liberzon, et al. (2011, Bionformatics);
Liberzon, et al. (2015, Cell Systems); and
The source for the gene set as listed on the gene set page.
# these take a while to load initially, so put them in dontrun blocks. # you should run these interactively to understand what they return bcs <- getMSigCollection("h", "human", "entrez") bcs.h.entrez <- getMSigCollection(c("h", "c2"), "human", "entrez") bcs.h.ens <- getMSigCollection(c("h", "c2"), "human", "ensembl") bcs.m.entrez <- getMSigCollection(c("h", "c2"), "mouse", "entrez") gdb <- getMSigGeneSetDb("h", "human", "entrez")
# these take a while to load initially, so put them in dontrun blocks. # you should run these interactively to understand what they return bcs <- getMSigCollection("h", "human", "entrez") bcs.h.entrez <- getMSigCollection(c("h", "c2"), "human", "entrez") bcs.h.ens <- getMSigCollection(c("h", "c2"), "human", "ensembl") bcs.m.entrez <- getMSigCollection(c("h", "c2"), "mouse", "entrez") gdb <- getMSigGeneSetDb("h", "human", "entrez")
This is a convience function that orchestrates the PANTHER.db package to return GeneSetDb objects for either pathway or GOslim information for human or mouse.
getPantherCollection( type = c("pathway", "goslim"), species = c("human", "mouse") ) getPantherGeneSetDb( type = c("pathway", "goslim"), species = c("human", "mouse") )
getPantherCollection( type = c("pathway", "goslim"), species = c("human", "mouse") ) getPantherGeneSetDb( type = c("pathway", "goslim"), species = c("human", "mouse") )
type |
"pathway" or, "goslim" |
species |
"human" or "mouse" |
Note that for some reason the PANTHER.db
package needs to be
installed in a user-writable package location for this to work properly.
If you see an error like "Error in resqlite_send_query ... attempt to
write a readonly database", this is the problem. Please install another
version of the PANTHER.db
package in a user-writable directory using
BiocManager::install()
.
A BiocSet of panther pathways
getPantherGeneSetDb()
: returns a GeneSetDb
GO Slims are "cut down" versions of the GO ontology that contain a subset of the terms in the whole GO.
PANTHER provides their own set of GO slims, although it's not clear how often these get updated.
# this requires you have the PANTHER.db package installed via BiocManager bsc.panther <- getPantherCollection(species = "human")
# this requires you have the PANTHER.db package installed via BiocManager bsc.panther <- getPantherCollection(species = "human")
Retrieve gene set collections from from reactome.db
getReactomeCollection( species = "human", id.type = c("entrez", "ensembl"), rm.species.prefix = TRUE ) getReactomeGeneSetDb( species = "human", id.type = c("entrez", "ensembl"), rm.species.prefix = TRUE )
getReactomeCollection( species = "human", id.type = c("entrez", "ensembl"), rm.species.prefix = TRUE ) getReactomeGeneSetDb( species = "human", id.type = c("entrez", "ensembl"), rm.species.prefix = TRUE )
species |
the species to get pathay information for |
id.type |
|
rm.species.prefix |
pathways are provided with species prefixes from
|
a reactome BiocSet object
getReactomeGeneSetDb()
: returns a GeneSetDb object
bsc.h <- getReactomeCollection("human") gdb.h <- getReactomeGeneSetDb("human")
bsc.h <- getReactomeCollection("human") gdb.h <- getReactomeGeneSetDb("human")
Note that we do not import things from goseq directly, and only load it if this function is fired. I can't figure out a way to selectively import functions from the goseq package without it having to load its dependencies, which take a long time – and I don't want loading sparrow to take a long time. So, the goseq package has moved to Suggests and then is loaded within this function when necessary.
goseq( gsd, selected, universe, feature.bias, method = c("Wallenius", "Sampling", "Hypergeometric"), repcnt = 2000, use_genes_without_cat = TRUE, plot.fit = FALSE, do.conform = TRUE, as.dt = FALSE, .pipelined = FALSE )
goseq( gsd, selected, universe, feature.bias, method = c("Wallenius", "Sampling", "Hypergeometric"), repcnt = 2000, use_genes_without_cat = TRUE, plot.fit = FALSE, do.conform = TRUE, as.dt = FALSE, .pipelined = FALSE )
gsd |
The |
selected |
The ids of the selected features |
universe |
The ids of the universe |
feature.bias |
a named vector as long as |
method |
The method to use to calculate the unbiased category enrichment scores |
repcnt |
Number of random samples to be calculated when random sampling
is used. Ignored unless |
use_genes_without_cat |
A boolean to indicate whether genes without a categorie should still be used. For example, a large number of gene may have no GO term annotated. If this option is set to FALSE, those genes will be ignored in the calculation of p-values (default behaviour). If this option is set to TRUE, then these genes will count towards the total number of genes outside the category being tested. |
plot.fit |
parameter to pass to |
do.conform |
By default |
as.dt |
If |
.pipelined |
If this is being called external to a seas pipeline, then
some additional cleanup of columns name output will be done when
|
A data.table
of results, similar to goseq output. The output
from nullp
is added to the outgoing data.table as
an attribue named "pwf"
.
Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11, R14. http://genomebiology.com/2010/11/2/R14
vm <- exampleExpressionSet() gdb <- conform(exampleGeneSetDb(), vm) # Identify DGE genes mg <- seas(vm, gdb, design = vm$design) lfc <- logFC(mg) # wire up params selected <- subset(lfc, significant)$feature_id universe <- rownames(vm) mylens <- setNames(vm$genes$size, rownames(vm)) degenes <- setNames(integer(length(universe)), universe) degenes[selected] <- 1L gostats <- sparrow::goseq( gdb, selected, universe, mylens, method = "Wallenius", use_genes_without_cat = TRUE)
vm <- exampleExpressionSet() gdb <- conform(exampleGeneSetDb(), vm) # Identify DGE genes mg <- seas(vm, gdb, design = vm$design) lfc <- logFC(mg) # wire up params selected <- subset(lfc, significant)$feature_id universe <- rownames(vm) mylens <- setNames(vm$genes$size, rownames(vm)) degenes <- setNames(integer(length(universe)), universe) degenes[selected] <- 1L gostats <- sparrow::goseq( gdb, selected, universe, mylens, method = "Wallenius", use_genes_without_cat = TRUE)
This method was developed by Jason Hackney and first introduced in the following paper doi:10.1038/ng.3520. It produces a single sample gene set score in values that are in "expression space," the innards of which mimic something quite similar to an eigengene based score.
To easily use this method to score a number of gene setes across an
experiment, you'll want to have the scoreSingleSamples()
method
drive this function via specifying "svd"
as one of the
methods
.
gsdScore( x, eigengene = 1L, center = TRUE, scale = TRUE, uncenter = center, unscale = scale, retx = FALSE, ..., .use_irlba = FALSE, .drop.sd = 1e-04 )
gsdScore( x, eigengene = 1L, center = TRUE, scale = TRUE, uncenter = center, unscale = scale, retx = FALSE, ..., .use_irlba = FALSE, .drop.sd = 1e-04 )
x |
An expression matrix of genes x samples. When using this to score
geneset activity, you want to reduce the rows of |
eigengene |
the "eigengene" you want to get the score for. only accepts a single value for now. |
center , scale
|
center and/or scale data before scoring? |
uncenter , unscale
|
uncenter and unscale the data data on the way out?
Defaults to the respective values of |
retx |
Works the same as |
... |
these aren't used in here |
.use_irlba |
when |
.drop.sd |
When zero-sd (non varying) features are scaled, their values
are |
The difference between this method vs the eigengene score is that the SVD is used to calculate the eigengene. The vector of eigengenes (one score per sample) is then multiplied through by the SVD's left matrix. This produces a matrix which we then take the colSums of to get back to a single sample score for the geneset.
Why do all of that? You get data that is back "in expression space" and we
also run around the problem of sign of the eigenvector. The scores you get
are very similar to average zscores of the genes per sample, where the
average is weighted by the degree to which each gene contributes to the
principal component chosen by eigengene
, as implemented in the
eigenWeightedMean()
function.
The core functionality provided here is taken from the soon to be released GSDecon package by Jason Hackney
A list of useful transformation information. The caller is likely
most interested in the $score
vector, but other bits related to
the SVD/PCA decomposition are included for the ride.
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, "c2", "BURTON_ADIPOGENESIS_PEAK_AT_2HR") scores <- gsdScore(vm[features,])$score ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'gsd') s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, scores)
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, "c2", "BURTON_ADIPOGENESIS_PEAK_AT_2HR") scores <- gsdScore(vm[features,])$score ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'gsd') s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, scores)
Check to see if the GeneSetDb has a collection,name GeneSet defined
hasGeneSet(x, collection, name, as.error = FALSE)
hasGeneSet(x, collection, name, as.error = FALSE)
x |
GeneSetDb |
collection |
character indicating the collection |
name |
character indicating the name of the geneset |
as.error |
If |
logical indicating whether or not the geneset is defined.
gdb <- exampleGeneSetDb() hasGeneSet(gdb, c('c2', 'c7'), c('BIOCARTA_AGPCR_PATHWAY', 'something'))
gdb <- exampleGeneSetDb() hasGeneSet(gdb, c('c2', 'c7'), c('BIOCARTA_AGPCR_PATHWAY', 'something'))
GeneSetDb
Check if a collection exists in the GeneSetDb
hasGeneSetCollection(x, collection, as.error = FALSE)
hasGeneSetCollection(x, collection, as.error = FALSE)
x |
|
collection |
character vector of name(s) of the collections to query |
as.error |
logical if TRUE, this will error instead of returning FALSE |
logical indicating if this collection exists
gdb <- exampleGeneSetDb() hasGeneSetCollection(gdb, "c2") hasGeneSetCollection(gdb, "unknown collection")
gdb <- exampleGeneSetDb() hasGeneSetCollection(gdb, "c2") hasGeneSetCollection(gdb, "unknown collection")
Generates an inidcator matrix to indicate membership of genes (columns)
to gene sets (rows). If y
is provided, then the incidence is mapped
across the entire feature-space of y
.
incidenceMatrix(x, y, ...)
incidenceMatrix(x, y, ...)
x |
|
y |
(optional) A target (expression) object |
... |
parameters passed down into |
incidence matrix with nrows = number of genesets and columns are
featureIDs. If y
is passed in, the columns of the returned value
match the rows of y
.
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() im <- incidenceMatrix(gdb) imv <- incidenceMatrix(gdb, vm)
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() im <- incidenceMatrix(gdb) imv <- incidenceMatrix(gdb, vm)
It is informative to look at the individual log fold changes of the genes within a gene set to explore the degree to which they (1) are coherent with respect to each other; and (2) see how the compare to the background distribution of log fold changes of the entire transcriptome.
You can visualize this behavior via a type = "density"
plot, or a
type = "boxplot". It is also common to plot either the individual log fold changes
value = "logFC"or t-statistics
value = "t"'.
iplot( x, name, value = "logFC", type = c("density", "gsea", "boxplot"), tools = c("wheel_zoom", "box_select", "reset", "save"), main = NULL, with.legend = TRUE, collection = NULL, shiny_source = "mggenes", width = NULL, height = NULL, ggtheme = ggplot2::theme_bw(), trim = 0.005, ... )
iplot( x, name, value = "logFC", type = c("density", "gsea", "boxplot"), tools = c("wheel_zoom", "box_select", "reset", "save"), main = NULL, with.legend = TRUE, collection = NULL, shiny_source = "mggenes", width = NULL, height = NULL, ggtheme = ggplot2::theme_bw(), trim = 0.005, ... )
x |
A |
name |
the name of the geneset to plot |
value |
A string indicating the column name for the value of the
gene-level metadata to plot. Default is |
type |
plot the distributions as a |
tools |
the tools to display in the rbokeh plot |
main |
A title to display. If not specified, the gene set name
will be used, otherwise you can pass in a custom title, or |
with.legend |
Draws a legend to map point color to meaning. There are three levels a point (gene level statistic) can be color as, "notsig", "psig", and "sig". "notsig" implies that the FDR >= 10%, "psig" means that FDR <= 10%, but the logFC is "unremarkable" (< 1), and "sig" means that both the FDR <= 10% and the logFC >= 1 |
collection |
If you have genesets with duplicate names in |
shiny_source |
the name of this element that is used in shiny callbacks.
Defaults to |
width , height
|
the width and height of the output plotly plot |
ggtheme |
a ggplot theme, like the thing returned from
|
trim |
used to define the upper and lower quantiles to max out the individual gene statistics in the selected geneset. |
... |
pass through parameters to internal boxplot/density/gsea plotting functions |
the ploty plot object
mgr <- exampleSparrowResult() iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("t-statistic" = "t"), type = "density") iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("log2FC" = "logFC"), type = "boxplot") iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("-statistic" = "t"), type = "gsea")
mgr <- exampleSparrowResult() iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("t-statistic" = "t"), type = "density") iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("log2FC" = "logFC"), type = "boxplot") iplot(mgr, "BURTON_ADIPOGENESIS_PEAK_AT_2HR", value = c("-statistic" = "t"), type = "gsea")
Returns the active
status of genesets, which are specified by
their collection,name compound keys. This function is vectorized and
supports query of multiple gene sets at a time. If a requested
collection,name gene set doesn't exist, this throws an error.
is.active(x, i, j)
is.active(x, i, j)
x |
|
i |
collection of geneset(s) |
j |
name of geneset(s) (must be same length as |
logical indicating if geneset is active. throws an error if
any requested geneset does not exist in x
.
dge.stats <- exampleDgeResult() y <- exampleExpressionSet(do.voom = FALSE) gdb <- conform(exampleGeneSetDb(), y, min.gs.size = 10) # size 9 geneset: geneSet(gdb, "c2", "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_IL3RA") is.active(gdb, "c2", "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_IL3RA") # geneset with >100 genes is.active(gdb, "c7", "GSE3982_MAC_VS_NEUTROPHIL_LPS_STIM_DN")
dge.stats <- exampleDgeResult() y <- exampleExpressionSet(do.voom = FALSE) gdb <- conform(exampleGeneSetDb(), y, min.gs.size = 10) # size 9 geneset: geneSet(gdb, "c2", "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_IL3RA") is.active(gdb, "c2", "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_IL3RA") # geneset with >100 genes is.active(gdb, "c7", "GSE3982_MAC_VS_NEUTROPHIL_LPS_STIM_DN")
Extract the individual fold changes statistics for elements in the expression object.
logFC(x, as.dt = FALSE)
logFC(x, as.dt = FALSE)
x |
|
as.dt |
If |
The log fold change 'data.table“
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') lfc <- logFC(mg)
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, design = vm$design, contrast = 'tumor') lfc <- logFC(mg)
Before we get started, note that you probably want to use mgheatmap2()
.
This function encapsulates many common "moves" you'll make when trying to make a heatmap, especially if you are trying to show geneset activity across a panel of samples.
NOTE: this function will almost certainly reorder the rows of the
input matrix. If you are concatentating Heatmap objects together horizontally
(ie. you if you want to use a rowAnnotation along side the returned heatmap),
you must reorder the rows of the annotation data.frame, ie.
ranno.df <- ranno.df[rownames(out@matrix),]
mgheatmap( x, gdb = NULL, col = NULL, aggregate.by = c("none", "ewm", "ewz", "zscore"), split = TRUE, scores = NULL, gs.order = NULL, name = NULL, rm.collection.prefix = TRUE, rm.dups = FALSE, recenter = FALSE, rescale = FALSE, center = TRUE, scale = TRUE, rename.rows = NULL, zero_center_colramp = NULL, zlim = NULL, transpose = FALSE, ... )
mgheatmap( x, gdb = NULL, col = NULL, aggregate.by = c("none", "ewm", "ewz", "zscore"), split = TRUE, scores = NULL, gs.order = NULL, name = NULL, rm.collection.prefix = TRUE, rm.dups = FALSE, recenter = FALSE, rescale = FALSE, center = TRUE, scale = TRUE, rename.rows = NULL, zero_center_colramp = NULL, zlim = NULL, transpose = FALSE, ... )
x |
the data matrix |
gdb |
|
col |
a colorRamp(2) function |
aggregate.by |
the method used to generate single-sample geneset
scores. Default is |
split |
introduce row-segmentation based on genesets or collections?
Defaults is |
scores |
If |
gs.order |
This is experimental, and is here to help order the order
of the genesets (or genesets collection) in a different way than the
default. By default, |
name |
passed down to |
rm.collection.prefix |
When |
rm.dups |
if |
recenter |
do you want to mean center the rows of the heatmap matrix
prior to calling |
rescale |
do you want to standardize the row variance to one on the
values of the heatmap matrix prior to calling
|
center , scale
|
boolean parameters passed down into the the single
sample gene set scoring methods defined by |
rename.rows |
defaults to |
zero_center_colramp |
Used to specify the type of color ramp to generate
when |
zlim |
Used to control the color saturation of the heatmap when the
|
transpose |
Flip display so that rows are columns. Default is |
... |
parameters to send down to |
More info here.
A Heatmap
object.
This function leverages renameRows()
so that you can better customize the
output of your heatmaps by tweaking its rownames.
If you are plotting a gene-level heatmap (ie. aggregate.by == "none"``) and the
rownames()are gene identifiers, but you want the rownames of the heatmap to be gene symbols. You can perform this renaming using the
rename.rows' parameter.
If rename.rows
is NULL
, then nothing is done.
If rename.rows
is a string
, then we assume that x
has an associated
metadata data.frame
over its rows and that rename.rows
names one of
its columns, ie. DGEList$genes[[rename.rows]]
or
fData(ExpressionSet)[[rename.rows]]
. The values in that column will
be swapped out for x
's rownames
If rename.rows
is a two-column data.frame, the first column is assumed
to be rownames(x)
and the second is what you want to rename it to.
When there are duplicates in the renamed rownames, the rename.duplicates
...
parameter dictates the behavior. This will happen, for instance, if
you are trying to rename the rows of an affy matrix to gene symbols, where
we have multiple probe ids for one gene. When rename.duplicates
is set to
"original"
, one of the rows will get the new name, and the remaning
duplicate rows will keep the rownames they came in with. When set to
"make.unique"
, the new names will contain *.1
, *.2
, etc. suffixes,
as you would get from using base::make.unique()
.
Maybe you are aggregating the expression scores into geneset scores, and
you don't want the rownames of the heatmap to be collection;;name
(or just
name
when rm.collection.prefx = TRUE
), you can pass in a two column
data.frame
, where the first column is collection;name
and the second
is the name you want to rename that to. There is an example of this in
the "Examples" section here.
library(ComplexHeatmap) vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() col.anno <- ComplexHeatmap::HeatmapAnnotation( df = vm$targets[, c("Cancer_Status", "PAM50subtype")], col = list( Cancer_Status = c(normal = "grey", tumor = "red"), PAM50subtype = c(Basal = "purple", Her2 = "green", LumA = "orange"))) mgh <- mgheatmap(vm, gdb, aggregate.by = "ewm", split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset") # Maybe you want the rownames of the matrix to use spaces instead of "_" rr <- geneSets(gdb)[, "name", drop = FALSE] rr$newname <- gsub("_", " ", rr$name) mg2 <- mgheatmap(vm, gdb, aggregate.by='ewm', split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset", rename.rows = rr)
library(ComplexHeatmap) vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() col.anno <- ComplexHeatmap::HeatmapAnnotation( df = vm$targets[, c("Cancer_Status", "PAM50subtype")], col = list( Cancer_Status = c(normal = "grey", tumor = "red"), PAM50subtype = c(Basal = "purple", Her2 = "green", LumA = "orange"))) mgh <- mgheatmap(vm, gdb, aggregate.by = "ewm", split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset") # Maybe you want the rownames of the matrix to use spaces instead of "_" rr <- geneSets(gdb)[, "name", drop = FALSE] rr$newname <- gsub("_", " ", rr$name) mg2 <- mgheatmap(vm, gdb, aggregate.by='ewm', split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset", rename.rows = rr)
Encapsulates many common "moves" you'll make when trying to make a heatmap, especially if you are trying to show geneset activity across a panel of samples.
NOTE: this function will almost certainly reorder the rows of the
input matrix. If you are concatentating Heatmap objects together horizontally
(ie. you if you want to use a rowAnnotation along side the returned heatmap),
you must reorder the rows of the annotation data.frame, ie.
ranno.df <- ranno.df[rownames(out@matrix),]
mgheatmap2( x, gdb = NULL, col = NULL, aggregate.by = c("none", "ewm", "ewz", "zscore"), split = TRUE, scores = NULL, gs.order = NULL, name = NULL, rm.collection.prefix = TRUE, rm.dups = FALSE, recenter = FALSE, rescale = FALSE, center = FALSE, scale = FALSE, uncenter = FALSE, unscale = FALSE, rename.rows = NULL, zlim = NULL, transpose = FALSE, ... )
mgheatmap2( x, gdb = NULL, col = NULL, aggregate.by = c("none", "ewm", "ewz", "zscore"), split = TRUE, scores = NULL, gs.order = NULL, name = NULL, rm.collection.prefix = TRUE, rm.dups = FALSE, recenter = FALSE, rescale = FALSE, center = FALSE, scale = FALSE, uncenter = FALSE, unscale = FALSE, rename.rows = NULL, zlim = NULL, transpose = FALSE, ... )
x |
the data matrix |
gdb |
|
col |
a colorRamp(2) function |
aggregate.by |
the method used to generate single-sample geneset
scores. Default is |
split |
introduce row-segmentation based on genesets or collections?
Defaults is |
scores |
If |
gs.order |
This is experimental, and is here to help order the order
of the genesets (or genesets collection) in a different way than the
default. By default, |
name |
passed down to |
rm.collection.prefix |
When |
rm.dups |
if |
recenter |
do you want to mean center the rows of the heatmap matrix
prior to calling |
rescale |
do you want to standardize the row variance to one on the
values of the heatmap matrix prior to calling
|
center , scale , uncenter , unscale
|
boolean parameters passed down into the
the single sample gene set scoring methods defined by |
rename.rows |
defaults to |
zlim |
A |
transpose |
Flip display so that rows are columns. Default is |
... |
parameters to send down to |
More info here.
A Heatmap
object.
This function leverages renameRows()
so that you can better customize the
output of your heatmaps by tweaking its rownames.
If you are plotting a gene-level heatmap (ie. aggregate.by == "none"``) and the
rownames()are gene identifiers, but you want the rownames of the heatmap to be gene symbols. You can perform this renaming using the
rename.rows' parameter.
If rename.rows
is NULL
, then nothing is done.
If rename.rows
is a string
, then we assume that x
has an associated
metadata data.frame
over its rows and that rename.rows
names one of
its columns, ie. DGEList$genes[[rename.rows]]
or
fData(ExpressionSet)[[rename.rows]]
. The values in that column will
be swapped out for x
's rownames
If rename.rows
is a two-column data.frame, the first column is assumed
to be rownames(x)
and the second is what you want to rename it to.
When there are duplicates in the renamed rownames, the rename.duplicates
...
parameter dictates the behavior. This will happen, for instance, if
you are trying to rename the rows of an affy matrix to gene symbols, where
we have multiple probe ids for one gene. When rename.duplicates
is set to
"original"
, one of the rows will get the new name, and the remaning
duplicate rows will keep the rownames they came in with. When set to
"make.unique"
, the new names will contain *.1
, *.2
, etc. suffixes,
as you would get from using base::make.unique()
.
Maybe you are aggregating the expression scores into geneset scores, and
you don't want the rownames of the heatmap to be collection;;name
(or just
name
when rm.collection.prefx = TRUE
), you can pass in a two column
data.frame
, where the first column is collection;name
and the second
is the name you want to rename that to. There is an example of this in
the "Examples" section here.
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() col.anno <- ComplexHeatmap::HeatmapAnnotation( df = vm$targets[, c("Cancer_Status", "PAM50subtype")], col = list( Cancer_Status = c(normal = "grey", tumor = "red"), PAM50subtype = c(Basal = "purple", Her2 = "green", LumA = "orange"))) mgh <- mgheatmap2(vm, gdb, aggregate.by = "ewm", split = TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset") ComplexHeatmap::draw(mgh) # Center to "normal" group mgc <- mgheatmap2(vm, gdb, aggregate.by = "ewm", split = TRUE, top_annotation = col.anno, show_column_names = FALSE, recenter = vm$targets$Cancer_Status == "normal", column_title = "Gene Set Activity in BRCA subset") ComplexHeatmap::draw(mgc) # Maybe you want the rownames of the matrix to use spaces instead of "_" rr <- geneSets(gdb)[, "name", drop = FALSE] rr$newname <- gsub("_", " ", rr$name) mg2 <- mgheatmap2(vm, gdb, aggregate.by='ewm', split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset", rename.rows = rr)
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() col.anno <- ComplexHeatmap::HeatmapAnnotation( df = vm$targets[, c("Cancer_Status", "PAM50subtype")], col = list( Cancer_Status = c(normal = "grey", tumor = "red"), PAM50subtype = c(Basal = "purple", Her2 = "green", LumA = "orange"))) mgh <- mgheatmap2(vm, gdb, aggregate.by = "ewm", split = TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset") ComplexHeatmap::draw(mgh) # Center to "normal" group mgc <- mgheatmap2(vm, gdb, aggregate.by = "ewm", split = TRUE, top_annotation = col.anno, show_column_names = FALSE, recenter = vm$targets$Cancer_Status == "normal", column_title = "Gene Set Activity in BRCA subset") ComplexHeatmap::draw(mgc) # Maybe you want the rownames of the matrix to use spaces instead of "_" rr <- geneSets(gdb)[, "name", drop = FALSE] rr$newname <- gsub("_", " ", rr$name) mg2 <- mgheatmap2(vm, gdb, aggregate.by='ewm', split=TRUE, top_annotation = col.anno, show_column_names = FALSE, column_title = "Gene Set Activity in BRCA subset", rename.rows = rr)
Utility function to cat a message to stderr (by default)
msg(..., file = stderr())
msg(..., file = stderr())
... |
pieces of the message |
file |
where to send the message. Defaults to |
Nothing, dumps text to file
msg("this is a message", "to stderr")
msg("this is a message", "to stderr")
This function wraps limma::kegga()
to perform biased overrepresntation
analysis over gene set collection stored in a GeneSetDb (gsd
) object. Its
easiest to use this function when the biases and selection criteria are
stored as columns of the input data.frame dat
.
ora( x, gsd, selected = "significant", groups = NULL, feature.bias = NULL, universe = NULL, restrict.universe = FALSE, plot.bias = FALSE, ..., as.dt = FALSE ) plot_ora_bias(x, selected, feature.bias, ...)
ora( x, gsd, selected = "significant", groups = NULL, feature.bias = NULL, universe = NULL, restrict.universe = FALSE, plot.bias = FALSE, ..., as.dt = FALSE ) plot_ora_bias(x, selected, feature.bias, ...)
x |
A data.frame with feature-level statistics. Minimally, this should
have a |
gsd |
The GeneSetDb |
selected |
Either the name of a logical column in |
groups |
Encodes groups of features that we can use to test selected
features individual, as well as "all" together. This can be specified by:
(1) specifying a name of a column in |
feature.bias |
If |
universe |
Defaults to all elements in |
restrict.universe |
See same parameter in |
plot.bias |
See |
... |
parameters passed to |
as.dt |
If |
In principle, this test does what goseq
does, however I found that
sometimes calling goseq would throw errors within goseq::nullp()
when
calling makesplines
. I stumbled onto this implementation when googling
for these errors and landing here:
https://support.bioconductor.org/p/65789/#65914
The meat and potatoes of this function's code was extracted from
limma::kegga()
, written by Gordon Smyth and Yifang Hu.
Note that the BiasedUrn CRAN package needs to be installed to support biased enrichment testing
A data.frame of pathway enrichment. The last N colums are enrichment
statistics per pathway, grouped by the groups
parameter. P.all
are the
stats for all selected features, and the remaingin P.*
columns are for
the features specifed by groups
.
plot_ora_bias()
: plots the bias of coviarate to DE / selected status. Code
taken from limma::kegga()
Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11, R14. http://genomebiology.com/2010/11/2/R14
dgestats <- exampleDgeResult() gdb <- randomGeneSetDb(dgestats) # Run enrichmnent without accounting for any bias nobias <- ora(dgestats, gdb, selected = "selected", groups = "direction", feature.bias = NULL) # Run enrichment and account for gene length lbias <- ora(dgestats, gdb, selected = "selected", feature.bias = "effective_length") # plot length bias with DGE status plot_ora_bias(dgestats, "selected", "effective_length") # induce length bias and see what is the what ............................... biased <- dgestats[order(dgestats$pval),] biased$effective_length <- sort(biased$effective_length, decreasing = TRUE) plot_ora_bias(biased, "selected", "effective_length") etest <- ora(biased, gdb, selected = "selected", groups = "direction", feature.bias = "effective_length")
dgestats <- exampleDgeResult() gdb <- randomGeneSetDb(dgestats) # Run enrichmnent without accounting for any bias nobias <- ora(dgestats, gdb, selected = "selected", groups = "direction", feature.bias = NULL) # Run enrichment and account for gene length lbias <- ora(dgestats, gdb, selected = "selected", feature.bias = "effective_length") # plot length bias with DGE status plot_ora_bias(dgestats, "selected", "effective_length") # induce length bias and see what is the what ............................... biased <- dgestats[order(dgestats$pval),] biased$effective_length <- sort(biased$effective_length, decreasing = TRUE) plot_ora_bias(biased, "selected", "effective_length") etest <- ora(biased, gdb, selected = "selected", groups = "direction", feature.bias = "effective_length")
You might want a matrix of pvalues (or FDRs) for the gene sets across all GSEA methods you tried. I think I did, once, so here it is.
p.matrix( x, names = resultNames(x), pcol = c("padj", "padj.by.collection", "pval") )
p.matrix( x, names = resultNames(x), pcol = c("padj", "padj.by.collection", "pval") )
x |
A |
names |
the entries from |
pcol |
The name of the column in |
A matrix of the desired pvalues for all genesets
mg <- exampleSparrowResult() pm <- p.matrix(mg)
mg <- exampleSparrowResult() pm <- p.matrix(mg)
I wrote this because initial fetching from msigdbr can be slow, and also having some weird crashes in the unit tests of bioc3.14-devel.
randomGeneSetDb(x, n = 10, bias = NULL, ...)
randomGeneSetDb(x, n = 10, bias = NULL, ...)
x |
an input container to |
n |
number of genesets |
bias |
column in |
... |
pass through args |
This is a helper function for development, and shouldn't be used by normal users of this package.
A randomly generated GeneSetDb you can use against x
for testing.
gdb.rando <- randomGeneSetDb(exampleDgeResult(), 10, bias = "t")
gdb.rando <- randomGeneSetDb(exampleDgeResult(), 10, bias = "t")
This function remaps names of collections in the database from their current
names to ones specified by the user, folows the dplyr::rename
convenction
where names()
of the rename vector are the new names you want, and its
values are the old names it came from.
renameCollections(x, rename = NULL, ...)
renameCollections(x, rename = NULL, ...)
x |
A GeneSetDb object |
rename |
a named character vector. |
... |
pass it along |
GeneSetDb x
with renamed geneSets(x)$collection
values.
gdb <- exampleGeneSetDb() ngdb <- renameCollections(gdb, c("MSigDB C2" = "c2", "ImmuneSigDb" = "c7")) all.equal( unname(geneSetURL(gdb, "c7", "GSE3982_BCELL_VS_TH2_DN")), unname(geneSetURL(ngdb, "ImmuneSigDb", "GSE3982_BCELL_VS_TH2_DN")))
gdb <- exampleGeneSetDb() ngdb <- renameCollections(gdb, c("MSigDB C2" = "c2", "ImmuneSigDb" = "c7")) all.equal( unname(geneSetURL(gdb, "c7", "GSE3982_BCELL_VS_TH2_DN")), unname(geneSetURL(ngdb, "ImmuneSigDb", "GSE3982_BCELL_VS_TH2_DN")))
The most common usecase for this is when you have a SummarizedExperiment, DGEList, matrix, etc. that is "rownamed" by some gene idnetifiers (ensembl, entrez, etc) that you want to "easily" convert to be rownamed by symbols. And perhaps the most common use-case for this, again, would be able to easily change rownames of a heatmap to symbols.
renameRows(x, xref, duplicate.policy = "original", ...)
renameRows(x, xref, duplicate.policy = "original", ...)
x |
an object to whose rows need renaming |
xref |
an object to help with the renaming.
|
duplicate.policy |
The policy used to deal with duplicates in the
renamed values. If Multiple elements in the source can be renamed to
the same elements in the target (think of microarray probes to gene
symbols), what to do? By deafult ( |
... |
pass through variable down to default method |
The rownames that can't successfully remapped will keep their old names. This function should also guarantee that the rows of the incoming matrix are the same as the outgoing one.
An updated version of x
with freshly minted rownames.
eset <- exampleExpressionSet(do.voom = FALSE) ess <- renameRows(eset, "symbol") vm <- exampleExpressionSet(do.voom = TRUE) vms <- renameRows(vm, "symbol")
eset <- exampleExpressionSet(do.voom = FALSE) ess <- renameRows(eset, "symbol") vm <- exampleExpressionSet(do.voom = TRUE) vms <- renameRows(vm, "symbol")
The resultNames
, result
, and results
functions enable
you to explore the results of the analysis run with seas()
.
The results that are stored within a SparrowResult
object have a
more or less 1:1 mapping with the values passed as methods
, parameter
of the seas()
call.
Generates a table to indicate the number of genesets per collection that
pass a given FDR. The table provides separate groups of rows for each of
the methods
run in the seas()
call that generated that
generated x
.
resultNames(x) result(x, ...) ## S3 method for class 'SparrowResult' result( x, name = NULL, stats.only = FALSE, rank.by = c("pval", "t", "logFC"), add.suffix = FALSE, as.dt = FALSE, ... ) results( x, names = resultNames(x), stats.only = TRUE, rank.by = c("pval", "logFC", "t"), add.suffix = length(names) > 1L, as.dt = FALSE ) tabulateResults( x, names = resultNames(x), max.p = 0.2, p.col = c("padj", "padj.by.collection", "pval"), as.dt = FALSE )
resultNames(x) result(x, ...) ## S3 method for class 'SparrowResult' result( x, name = NULL, stats.only = FALSE, rank.by = c("pval", "t", "logFC"), add.suffix = FALSE, as.dt = FALSE, ... ) results( x, names = resultNames(x), stats.only = TRUE, rank.by = c("pval", "logFC", "t"), add.suffix = length(names) > 1L, as.dt = FALSE ) tabulateResults( x, names = resultNames(x), max.p = 0.2, p.col = c("padj", "padj.by.collection", "pval"), as.dt = FALSE )
x |
A |
... |
pass through arguments |
name |
the names of the results desired |
stats.only |
logical, set to |
rank.by |
the statistic to use to append a |
add.suffix |
If |
as.dt |
If |
names |
the names of the GSEA methods to be reported. By default, this function will display results for all methods. |
max.p |
The maximum padj value to consider a result significant |
p.col |
use padj or padj.by.collection? |
The product of an indivdual GSEA is consumed by the corresponding
do.<METHOD>
function and converted into a data.table of results that
is internally stored.
Use the resultNames()
function to identify which results are available
for interrogation. The result()
function returns the statistics of
one individual result, and the results()
function combines the results
from the specified methods into an arbitrarily wide data.table with
method
-suffixed column names.
Use the tabulateResults()
function to create a summary table that
tallies the number of significant genesets per collection, per method at
the specified FDR thresholds.
a data.table with the results from the requested method.
a data.table that summarizes the significant results per method per collection for the GSEA that was run
res <- exampleSparrowResult() resultNames(res) head(result(res, "camera")) head(results(res))
res <- exampleSparrowResult() resultNames(res) head(result(res, "camera")) head(results(res))
This was for two reasons: (1) to avoid the (more commonly used)
t(scale(t(x), ...)
idiom; and (2) to specify what values, columns of x,
etc. to use to calculate means and sd's to use in the scaling function.
scale_rows(x, center = TRUE, scale = TRUE, ...)
scale_rows(x, center = TRUE, scale = TRUE, ...)
x |
the matrix-like object |
center |
Either a logical, character, or numeric-like value that specifies what to center |
scale |
Either a logical, characeter, or numeric-like value that specifies what to scale |
... |
pass through arguments |
For instance, you might want to subtract the mean of a subset of columns from each row in the matrix (like the columns that come from control samples)
Note that this method returns different attrs() for scaling and center than base:scale does. Our values are always named.
a scaled version of x
center
and scale
can be a logical, character, or numeric-like vector.
The flexibility enables the following scenarios:
The user can set it to TRUE
to center all values on the mean of their
row. (FALSE
does no centering)
A (named) vector of values that is a superset of rownames(x). These will be the values that are subtracted from each row.
A logical vector as long as ncol(x). Each value will be centered to the mean of the values of the columns specified as TRUE.
An integer vector, the is the analog of 3 but specifies the columns to use for centering.
# see tests/testthat/test-scale_rows.R for more examples m <- matrix(rnorm(50, mean = 1, sd = 2), nrow = 5, dimnames = list(LETTERS[1:5], letters[1:10])) s0 <- scale_rows(m, center = TRUE, scale = FALSE) all.equal(s0, t(scale(t(m), center = TRUE, scale = FALSE))) # mean center rows to a specific group of control samples (columns) ctrl <- sample(colnames(m), 3) s.ctrl <- scale_rows(m, center = ctrl, scale = FALSE) ctrl.means <- Matrix::rowMeans(m[, ctrl]) all.equal(s.ctrl, t(scale(t(m), center = ctrl.means, scale = FALSE)))
# see tests/testthat/test-scale_rows.R for more examples m <- matrix(rnorm(50, mean = 1, sd = 2), nrow = 5, dimnames = list(LETTERS[1:5], letters[1:10])) s0 <- scale_rows(m, center = TRUE, scale = FALSE) all.equal(s0, t(scale(t(m), center = TRUE, scale = FALSE))) # mean center rows to a specific group of control samples (columns) ctrl <- sample(colnames(m), 3) s.ctrl <- scale_rows(m, center = ctrl, scale = FALSE) ctrl.means <- Matrix::rowMeans(m[, ctrl]) all.equal(s.ctrl, t(scale(t(m), center = ctrl.means, scale = FALSE)))
It is common to assess the activity of a gene set in a given sample. There
are many ways to do that, and this method is analogous to the
seas()
function in that it enables the user to run a multitude of
single-sample-gene-set-scoring algorithms over a target expression matrix
using a GeneSetDb()
object.
scoreSingleSamples( gdb, y, methods = "ewm", as.matrix = FALSE, drop.sd = 1e-04, drop.unconformed = FALSE, verbose = FALSE, recenter = FALSE, rescale = FALSE, ..., as.dt = FALSE )
scoreSingleSamples( gdb, y, methods = "ewm", as.matrix = FALSE, drop.sd = 1e-04, drop.unconformed = FALSE, verbose = FALSE, recenter = FALSE, rescale = FALSE, ..., as.dt = FALSE )
gdb |
A GeneSetDb |
y |
An expression matrix to score genesets against |
methods |
A character vector that enumerates the scoring methods you want to run over the samples. Please reference the "Single Sample Scoring Methods" section for more information. |
as.matrix |
Return results as a list of matrices instead of a melted
data.frame? Defaults to |
drop.sd |
Genes with a standard deviation across columns in |
drop.unconformed |
When |
verbose |
make some noise? Defaults to |
recenter , rescale
|
If |
... |
these parameters are passed down into the the individual single sample scoring funcitons to customize them further. |
as.dt |
If |
Please refer to the "Generating Single Sample Gene Set Scores" of the sparrow vignette for further exposition.
A long data.frame with sample_id,method,score values per row. If
as.matrix=TRUE
, a matrix with as many rows as geneSets(gdb)
and as many columns as ncol(x)
The following methods
are currenly provided.
"ewm"
: The eigenWeightedMean()
calculates the fraction each gene
contributes to a pre-specified principal component. These contributions
act as weights over each gene, which are then used in a simple weighted
mean calculation over all the genes in the geneset per sample. This is
similar, in spirit, to the svd/gsdecon method (ie. method = "gsd"``) You can use this method to perform an "eigenweighted" zscore by setting
unscaleand
uncenterto
FALSE.
"ewz": with
unscaleand
uncenterset to
FALSE'.
"gsd"
: This method was first introduced by Jason Hackney in
doi:10.1038/ng.3520. Please refer to
the gsdScore()
function for more information.
"ssgsea"
: Using ssGSEA as implemented in the GSVA package.
"zscore"
: The features in the expression matrix are rowwise z transformed.
The gene set level score is then calculated by adding up the zscores for
the genes in the gene set, then dividing that number by either the the
size (or its sqaure root (default)) of the gene set.
"mean"
: Simply take the mean of the values from the expression matrix
that are in the gene set. Right or wrong, sometimes you just want the mean
without transforming the data.
"gsva"
: The gsva method of GSVA package.
"plage"
: Using "plage" as implemented in the GSVA package
gdb <- exampleGeneSetDb() vm <- exampleExpressionSet() scores <- scoreSingleSamples( gdb, vm, methods = c("ewm", "gsva", "zscore"), center = TRUE, scale = TRUE, ssgsea.norm = TRUE, as.dt = TRUE) sw <- data.table::dcast(scores, name + sample_id ~ method, value.var='score') corplot( sw[, c("ewm", "gsva", "zscore")], title = "Single Sample Score Comparison") zs <- scoreSingleSamples( gdb, vm, methods = c('ewm', 'ewz', 'zscore'), summary = "mean", center = TRUE, scale = TRUE, uncenter = FALSE, unscale = FALSE, as.dt = TRUE) zw <- data.table::dcast(zs, name + sample_id ~ method, value.var='score') corplot(zw[, c("ewm", "ewz", "zscore")], title = "EW zscores")
gdb <- exampleGeneSetDb() vm <- exampleExpressionSet() scores <- scoreSingleSamples( gdb, vm, methods = c("ewm", "gsva", "zscore"), center = TRUE, scale = TRUE, ssgsea.norm = TRUE, as.dt = TRUE) sw <- data.table::dcast(scores, name + sample_id ~ method, value.var='score') corplot( sw[, c("ewm", "gsva", "zscore")], title = "Single Sample Score Comparison") zs <- scoreSingleSamples( gdb, vm, methods = c('ewm', 'ewz', 'zscore'), summary = "mean", center = TRUE, scale = TRUE, uncenter = FALSE, unscale = FALSE, as.dt = TRUE) zw <- data.table::dcast(zs, name + sample_id ~ method, value.var='score') corplot(zw[, c("ewm", "ewz", "zscore")], title = "EW zscores")
This is a wrapper function that delegates GSEA analyses to different
"workers", each of which implements the flavor of GSEA of your choosing.
The particular analyses that are performed are specified by the
methods
argument, and these methods are fine tuned by passing their
arguments down through the ...
of this wrapper function.
seas( x, gsd, methods = NULL, design = NULL, contrast = NULL, use.treat = FALSE, feature.min.logFC = if (use.treat) log2(1.25) else 1, feature.max.padj = 0.1, trim = 0.1, verbose = FALSE, ..., score.by = c("t", "logFC", "pval"), rank_by = NULL, rank_order = c("ordered", "descending", "ascending"), xmeta. = NULL, BPPARAM = BiocParallel::SerialParam() )
seas( x, gsd, methods = NULL, design = NULL, contrast = NULL, use.treat = FALSE, feature.min.logFC = if (use.treat) log2(1.25) else 1, feature.max.padj = 0.1, trim = 0.1, verbose = FALSE, ..., score.by = c("t", "logFC", "pval"), rank_by = NULL, rank_order = c("ordered", "descending", "ascending"), xmeta. = NULL, BPPARAM = BiocParallel::SerialParam() )
x |
An object to run enrichment analyses over. This can be an ExpressoinSet-like object that you can differential expression over (for roast, fry, camera), a named (by feature_id) vector of scores to run ranked-based GSEA, a data.frame with feature_id's, ranks, scores, etc. |
gsd |
The |
methods |
A character vector indicating the GSEA methods you want to run. Refer to the GSEA Methods section for more details. If no methods are specified, only differential gene expression and geneset level statistics for the contrast are computed. |
design |
A design matrix for the study |
contrast |
The contrast of interest to analyze. This can be a column
name of |
use.treat |
should we use limma/edgeR's "treat" functionality for the gene-level differential expression analysis? |
feature.min.logFC |
The minimum logFC required for an individual
feature (not geneset) to be considered differentialy expressed. Used in
conjunction with |
feature.max.padj |
The maximum adjusted pvalue used to consider an
individual feature (not geneset) to be differentially expressed. Used in
conjunction with |
trim |
The amount to trim when calculated trimmed |
verbose |
make some noise during execution? |
... |
The arguments are passed down into
|
score.by |
This tells us how to rank the features after differential
expression analysis when |
rank_by |
Only works when |
rank_order |
Only used when |
xmeta. |
A hack to support data.frame inputs for |
BPPARAM |
a BiocParallel parameter definition, like one generated from
|
Set enrichment analyses can either be performed over an expression object, which requires the specification of the experiment design and contrast of interest, or over a set of features to rank (stored as a data.frame or vector).
Note that we are currently in the middle of a refactor to accept and fully
take advantage of data.frame
as inputs for x
, which will be used for
preranked type of GSEA methods. See the following issue for more details:
https://github.com/lianos/multiGSEA/issues/24
The bulk of the GSEA methods currently available in this package come from edgeR/limma, however others are included (and are being added), as well. GSEA Methods and GSEA Method Parameterization sections for more details.
In addition to performing GSEA, this function also internally orchestrates
a differential expression analysis, which can be tweaked by identifying
the parameters in the calculateIndividualLogFC()
function, and
passing them down through ...
here. The results of the differential
expression analysis (ie. the limma::topTable()
) are accessible by calling
the logFC()
function on the SparrowResult()
object returned from this
function call.
Please Note: be sure to cite the original GSEA method when using results generated from this function.
A SparrowResult()
which holds the results of all the analyses
specified in the methods
parameter.
You can choose the methods you would like to run by providing a character
vector of GSEA method names to the methods
parameter. Valid methods
you can select from include:
"camera"
: from limma::camera()
(*)
"cameraPR"
: from limma::cameraPR()
"ora"
: overrepresentation analysis optionally accounting for bias
(ora()
). This method is a wrapper function that makes the functionality
in limma::kegga()
available to an arbitrary GeneSetDb.
"roast"
: from limma::roast()
(*)
"fry"
: from limma::fry()
(*)
"romer"
: from limma::romer()
(*)
"geneSetTest"
: from limma::geneSetTest()
"goseq"
: from goseq::goseq()
"fgsea"
: from fgsea::fgsea()
Methods annotated with a (*)
indicate that these methods require
a complete expression object, a valid design matrix, and a contrast
specification in order to run. These are all of the same things you need to
provide when performing a vanilla differential gene expression analysis.
Methods missing a (*)
can be run on a feature-named input vector
of gene level statistics which will be used for ranking (ie. a named vector
of logFC's or t-statistics for genes). They can also be run by providing
an expression, design, and contrast vector, and the appropriate statistics
vector will be generated internally from the t-statistics, p-values, or
log-fold-changes, depending on the value provided in the score.by
parameter.
The worker functions that execute these GSEA methods are functions named
do.METHOD
within this package. These functions are not meant to be executed
directly by the user, and are therefore not exported. Look at the respective
method's help page (ie. if you are running "camera"
, look at the
limma::camera()
help page for full details. The formal parameters that
these methods take can be passed to them via the ...
in this seas()
function.
Each GSEA method can be tweaked via a custom set of parameters. We leave the
documentation of these parameters and how they affect their respective GSEA
methods to the documentation available in the packages where they are
defined. The seas()
call simply has to pass these parameters down
into the ...
parameters here. The seas
function will then pass these
along to their worker functions.
What happens when two different GSEA methods have parameters with the same name?
Unfortunately you currently cannot provide different values for these parameters. An upcoming version version of sparrow will support this feature via slightly different calling semantics. This will also allow the caller to call the same GSEA method with different parameterizations so that even these can be compared against each other.
When the seas()
call is given an expression matrix, design, and
contrast, it will also internally orchestrate a gene level differential
expression analysis. Depending on the type of expression object passed in
via x
, this function will guess on the best method to use for this
analysis.
If x
is a DGEList
, then ensure that you have already called
edgeR::estimateDisp()
on x
and edgeR's quasilikelihood framework will be
used, otherwise we'll use limma (note that x
can be an EList
run through
voom()
, voomWithQuailityWeights()
, or when where you have leveraged
limma's limma::duplicateCorrelation()
functionality, even.
The parameters of this differential expression analysis can also be
customized. Please refer to the calculateIndividualLogFC()
function for
more information. The use.treat
, feature.min.logFC
,
feature.max.padj
, as well as the ...
parameters from this function are
passed down to that funciton.
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, c('camera', 'fry'), design = vm$design, contrast = 'tumor', # customzie camera parameter: inter.gene.cor = 0.04) resultNames(mg) res.camera <- result(mg, 'camera') res.fry <- result(mg, 'fry') res.all <- results(mg)
vm <- exampleExpressionSet() gdb <- exampleGeneSetDb() mg <- seas(vm, gdb, c('camera', 'fry'), design = vm$design, contrast = 'tumor', # customzie camera parameter: inter.gene.cor = 0.04) resultNames(mg) res.camera <- result(mg, 'camera') res.fry <- result(mg, 'fry') res.all <- results(mg)
Lists the supported GSEA methods by sparrow
sparrow_methods()
sparrow_methods()
a character vector of GSEA names, or a list of metadata for each method.
sparrow_methods()
sparrow_methods()
A call to seas()
will produce analyses for an
arbitrary number of GSEA methods, the results of which will be stored and
accessible here using the result()
, results()
, and resultNames()
.
In addition, the GeneSetDb()
used for the analysis is accessible
via geneSetDb(), and the results from the differential
expression analysis is available via logFC()
.
Visualizing results of a geneset based analysis also are functions that
operate over a SparrowResult
object, for instance see the
iplot()
and the sparrow.shiny
package.
gsd
The GeneSetDb()
this analysis was run over
results
The list of individual results generated by each of the GSEA methods that were run.
logFC
The differential expression statistics for each individual feature measured in the experiment.
Match a species query to the regularized species info.
species_info(query = NULL, ...)
species_info(query = NULL, ...)
query |
the species name to lookup, if |
... |
pass through |
a data.frame of species-related information that is used to fetch appropriate annotation files and conversion functions between species for gene identifiers, and such.
species_info() species_info("human")
species_info() species_info("human")
ssGSEA normalization (as implemented in GSVA (ssgsea.norm)) normalizes the individual scores based on ALL scores calculated across samples AND genesets. It does NOTE normalize the scores within each geneset independantly of the others.
ssGSEA.normalize(x, bounds = range(x))
ssGSEA.normalize(x, bounds = range(x))
x |
a |
bounds |
the maximum and minimum scores obvserved used to normalize against. |
This method is an internal utilit function and not exported on purpose
normalized numeric
vector of x
This is a utility function that is called by [.GeneSetDb
and is not
exported because it is not meant for external use.
## S3 method for class 'GeneSetDb' subset(x, keep)
## S3 method for class 'GeneSetDb' subset(x, keep)
x |
|
keep |
logical vector as long as |
DEBUG: If keep
is all FALSE, this will explode. What does an empty
GeneSetDb look like, anyway? Something ...
We want to support a better, more fluent subsetting of GeneSetDb objects. See Issue #12 (https://github.com/lianos/multiGSEA/issues/12)
a GeneSetDb
that has only the results for the specified
genesets.
gdb.all <- exampleGeneSetDb() gs <- geneSets(gdb.all) gdb <- gdb.all[gs$collection %in% c("c2", "c7")]
gdb.all <- exampleGeneSetDb() gs <- geneSets(gdb.all) gdb <- gdb.all[gs$collection %in% c("c2", "c7")]
Subset a GeneSetDb to only include geneSets with specified features.
subsetByFeatures(x, features, value = c("feature_id", "x.id", "x.idx"), ...) ## S4 method for signature 'GeneSetDb' subsetByFeatures(x, features, value = c("feature_id", "x.id", "x.idx"), ...)
subsetByFeatures(x, features, value = c("feature_id", "x.id", "x.idx"), ...) ## S4 method for signature 'GeneSetDb' subsetByFeatures(x, features, value = c("feature_id", "x.id", "x.idx"), ...)
x |
|
features |
Character vector of featureIds |
value |
are you feature id's entered as themselves ( |
... |
pass through arguments |
A subset of x
which contains only the geneSets that contain
features found in featureIds
subsetByFeatures(GeneSetDb)
: subset GeneSetDb by feature id's
gdb <- exampleGeneSetDb() features <- c("55839", "8522", "29087") (gdb.sub <- subsetByFeatures(gdb, features))
gdb <- exampleGeneSetDb() features <- c("55839", "8522", "29087") (gdb.sub <- subsetByFeatures(gdb, features))
Checks to ensure that the values for x
, design
, and contrast
are
appropriate for the GSEA methods
being used. If they are kosher, then
"normalized" versions of these objects are returned in an (aptly) named list,
otheerwise an error is thrown.
validateInputs( x, design = NULL, contrast = NULL, methods = NULL, xmeta. = NULL, require.x.rownames = TRUE, ... )
validateInputs( x, design = NULL, contrast = NULL, methods = NULL, xmeta. = NULL, require.x.rownames = TRUE, ... )
x |
The expression object to use |
design |
A design matrix, if the GSEA method(s) require it |
contrast |
A contrast vector (if the GSEA method(s) require it) |
methods |
A character vector of the GSEA methods that these inputs will be used for. |
xmeta. |
hack for supportin data.frame inputs. |
require.x.rownames |
Leave this alone, should always be |
... |
other variables that called methods can check if they want |
This function is strange in that we both want to verify the objects, and
return them in some canonical form, so it is normal for the caller to then
use the values for x
, design
, and contrast
that are returned from this
call, and not the original values for these objects themselves
I know that the validation/checking logic is a bit painful (and repetitive) here. I will (perhaps) clean that up some day.
A list with "normalized" versions of $x
, $design
, and $contrast
for downstream use.
dge.stats <- exampleDgeResult() ranks <- setNames(dge.stats$t, dge.stats$feature_id) gdb <- exampleGeneSetDb() ok <- validateInputs(ranks, gdb, methods = c("cameraPR", "fgsea")) # need full expressionset & design for romer null <- failWith(NULL, validateInputs(ranks, gdb, methods = "romer"))
dge.stats <- exampleDgeResult() ranks <- setNames(dge.stats$t, dge.stats$feature_id) gdb <- exampleGeneSetDb() ok <- validateInputs(ranks, gdb, methods = c("cameraPR", "fgsea")) # need full expressionset & design for romer null <- failWith(NULL, validateInputs(ranks, gdb, methods = "romer"))
Convenience function to create volcano plots from results generated within
this package. This is mostly used by {sparrow.shiny}
.
volcanoPlot( x, stats = "dge", xaxis = "logFC", yaxis = "pval", idx, xtfrm = base::identity, ytfrm = function(vals) -log10(vals), xlab = xaxis, ylab = sprintf("-log10(%s)", yaxis), highlight = NULL, horiz_line = c(padj = 0.1), xhex = NULL, yhex = NULL, width = NULL, height = NULL, shiny_source = "mgvolcano", ggtheme = ggplot2::theme_bw(), ... )
volcanoPlot( x, stats = "dge", xaxis = "logFC", yaxis = "pval", idx, xtfrm = base::identity, ytfrm = function(vals) -log10(vals), xlab = xaxis, ylab = sprintf("-log10(%s)", yaxis), highlight = NULL, horiz_line = c(padj = 0.1), xhex = NULL, yhex = NULL, width = NULL, height = NULL, shiny_source = "mgvolcano", ggtheme = ggplot2::theme_bw(), ... )
x |
A |
stats |
One of |
xaxis , yaxis
|
the column of the the provided (or extracted)
|
idx |
The column of the |
xtfrm |
A function that transforms the |
ytfrm |
A function that transforms the |
xlab , ylab
|
x and y axis labels |
highlight |
A vector of featureIds to highlight, or a GeneSetDb that we can extract the featureIds from for this purpose. |
horiz_line |
A (optionally named) number vecor (length 1) that indicates
where a line should be drawn across the volcano plot. This is usually done
to signify statistical significance. When the number is "named", this
indicates that you want to find an approximation of the values plotted
on y based on some transformation of the values that is the named column
of x (like "padj"). The default value |
xhex |
The raw |
yhex |
the |
width , height
|
the width and height of the output plotly plot |
shiny_source |
the name of this element that is used in shiny callbacks.
Defaults to |
ggtheme |
a ggplot theme, like the thing returned from
|
... |
pass through arguments (not used) |
a ploty plot object
mg <- exampleSparrowResult() volcanoPlot(mg) volcanoPlot(mg, xhex=1, yhex=0.05)
mg <- exampleSparrowResult() volcanoPlot(mg) volcanoPlot(mg, xhex=1, yhex=0.05)
You can, in theory, create a volcano plot from a number of different parts
of a SparrowResult()
object. Most often you want to create a volcano
plot from the differential expressino results, but you could imagine
building a volcan plot where each point is a geneset. In this case, you
would extract the pvalues from the method you like in the
SparrowResult()
object using the stats
parameter.
volcanoStatsTable( x, stats = "dge", xaxis = "logFC", yaxis = "pval", idx = "idx", xtfrm = identity, ytfrm = function(vals) -log10(vals) )
volcanoStatsTable( x, stats = "dge", xaxis = "logFC", yaxis = "pval", idx = "idx", xtfrm = identity, ytfrm = function(vals) -log10(vals) )
x |
A |
stats |
One of |
xaxis , yaxis
|
the column of the the provided (or extracted)
|
idx |
The column of the |
xtfrm |
A function that transforms the |
ytfrm |
A function that transforms the |
Like the volcanoPlot()
function, this is mostly used by the
sparrow.shiny package.
a data.frame
with .xv
, .xy
, .xvt
and
.xvy
columns that represent the xvalues, yvalues, transformed
xvalues, and transformed yvalues, respectively
mg <- exampleSparrowResult() v.dge <- volcanoStatsTable(mg) v.camera <- volcanoStatsTable(mg, 'camera')
mg <- exampleSparrowResult() v.dge <- volcanoStatsTable(mg) v.camera <- volcanoStatsTable(mg, 'camera')
Calculate single sample geneset score by average z-score method
zScore(x, summary = c("mean", "sqrt"), trim = 0, ...)
zScore(x, summary = c("mean", "sqrt"), trim = 0, ...)
x |
gene x sample matrix with rows already subsetted to the ones you care about. |
summary |
sqrt or mean |
trim |
calculate trimmed mean? |
... |
pass through arguments |
A list of stats related to the zscore. You care mostly about
$score
.
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, 'c2', 'BURTON_ADIPOGENESIS_PEAK_AT_2HR', value='x.idx') zscores <- zScore(vm[features,]) ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'zscore', summary = "mean") s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, zscores$score)
vm <- exampleExpressionSet(do.voom=TRUE) gdb <- conform(exampleGeneSetDb(), vm) features <- featureIds(gdb, 'c2', 'BURTON_ADIPOGENESIS_PEAK_AT_2HR', value='x.idx') zscores <- zScore(vm[features,]) ## Use scoreSingleSamples to facilitate scoring of all gene sets scores.all <- scoreSingleSamples(gdb, vm, 'zscore', summary = "mean") s2 <- with(subset(scores.all, name == 'BURTON_ADIPOGENESIS_PEAK_AT_2HR'), setNames(score, sample_id)) all.equal(s2, zscores$score)