Title: | Single Sample directioNAl Pathway Perturbation analYsis |
---|---|
Description: | A single sample pathway perturbation testing method for RNA-seq data. The method propagates changes in gene expression down gene-set topologies to compute single-sample directional pathway perturbation scores that reflect potential direction of change. Perturbation scores can be used to test significance of pathway perturbation at both individual-sample and treatment levels. |
Authors: | Wenjun Liu [aut, cre] , Stephen Pederson [aut] |
Maintainer: | Wenjun Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.11.0 |
Built: | 2024-10-31 05:36:23 UTC |
Source: | https://github.com/bioc/sSNAPPY |
Simulate null distributions of perturbation scores for each pathway through sample permutation.
generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'matrix' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'data.frame' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'DGEList' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'SummarizedExperiment' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE )
generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'matrix' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'data.frame' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'DGEList' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE ) ## S4 method for signature 'SummarizedExperiment' generate_permuted_scores( expreMatrix, NB = NULL, testScore = NULL, gsTopology, weight, drop = TRUE )
expreMatrix |
|
NB |
Number of permuted sample pairs to compute permuted logFCs for. Default to be the maximum number of possibilities (See details). |
testScore |
Optional. Output of |
gsTopology |
List of pathway topology matrices generated using function |
weight |
A vector of gene-wise weights derived from function |
drop |
logic(1). Whether to drop pathways with all zero scores |
This generate_permuted_scores
function firstly randomly permute sample labels
to form permuted pairs and generate permuted logFCs, which are then used to
compute permuted perturbation scores for each pathway.
The function outputs a list that is of the same length as the list storing
pathway topology matrices (i.e. gsTopology
). Each element of the output list
corresponds to one pathway and contains a vector of permuted perturbation scores.
The permuted perturbation scores will be used to approximate the null distributions
of perturbation scores and compute permuted p-values.
If the input is S4 object of DGEList or SummarizedExperiment
, gene expression
matrix will be extracted and converted to a logCPM matrix.
The sample permutation is operated by randomly choosing combination of 2 samples
from the column names of the expreMatrix
. Hence, sample metadata is not a
required input. The number of maximum unique permutations in a dataset with N
samples is N x (N-1). By default, permuted logFCs will be computed for all N x (N-1)
permuted pairs. However, this can be overwritten by the NB
parameter.
If the NB
specified is smaller than N x (N-1), NB
possibilities will be randomly
sampled from all the possible permutation. If the NB
specified is larger than
the maximum number of permutation possible, the parameter will be ignored.
Since the smallest achievable permutation p-value is 1/NB
, accurate estimation
of small p-value requires a large number of permutations that is only feasible
in data with a large sample size.
A list where each element is a vector of perturbation scores for a pathway.
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") ## Not run: load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) # simulate the null distribution of scores through sample permutation permutedScore <- generate_permuted_scores(logCPM_example, gsTopology = gsTopology, weight = ls$weight) ## End(Not run)
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") ## Not run: load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) # simulate the null distribution of scores through sample permutation permutedScore <- generate_permuted_scores(logCPM_example, gsTopology = gsTopology, weight = ls$weight) ## End(Not run)
gsAnnotation_df: Categorization of KEGG pathways used for community annotation
data(gsAnnotation_df)
data(gsAnnotation_df)
A data.frame
with 549 rows and 2 columns containing categorization of 549 KEGG pathways
Gene-set name
Category
This data was adopted from a study by Singhal H, et al., which was published as Genomic agonism and phenotypic antagonism between estrogen and progesterone receptors in breast cancer in 2016.
data(logCPM_example)
data(logCPM_example)
A data.frame
with 7672 rows and 15 columns
In this study, 12 primary malignant breast tissues (8PR+ and 4 PR-) were developed into patient-derived explants and treated with Vehicle, E2, E2+R5020, or R5020 for 24 or 48 hrs.
Raw data for 48-hr Vehicle-, R5020-treated and E2+R5020-treated samples were retrieved from GEO (GSE80098) and pre-processed into raw count. Filtration was sequentially performed to remove undetectable genes and the filtered counts were normalised using conditional quantile normalisation to offset effects of systematic artefacts, such as gene length and GC contents.
To reduce computing time, we randomly sampled half of the genes after filtration and used their logCPM value as the example data.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80098
metadata_example: Sample metadata for malignant breast cancer tumours PDE from 5 ER+ breast cancer tumour (GSE80098)
data(metadata_example)
data(metadata_example)
A data.frame
with 15 rows and 4 columns
patient N2-3, P4-6
treatment: Vehicle, R5020 or E2+R5020
progesterone receptor status
sample name, corresponding to column names of the logCPM matrix
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4928895/
Normalise test perturbation scores by permutation result and compute permutation p-values
normalise_by_permu( permutedScore, testScore, pAdj_method = "fdr", sortBy = c("gs_name", "sample", "pvalue") )
normalise_by_permu( permutedScore, testScore, pAdj_method = "fdr", sortBy = c("gs_name", "sample", "pvalue") )
permutedScore |
A list. Output of |
testScore |
A |
pAdj_method |
Method for adjusting p-values for multiple comparisons.
See |
sortBy |
Sort the output by p-value, gene-set name or sample names. |
Normalise the test perturbation scores generated by weight_ss_fc()
through the permuted perturbation scores derived from the
generate_permuted_scores()
function. The mean absolute deviation(MAD) and
median of perturbation scores for each pathway are firstly derived from the
permuted perturbation scores. The test perturbation scores are then converted
to robust z-scores using MADs and medians calculated.
Additionally, by assessing the proportion of permuted scores that are more extreme than the test perturbation score within each pathway, the permuted p-value of individual test perturbation scores will be computed.
A data.frame
## Not run: load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", sampleColumn = "sample", treatColumn = "treatment") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level # single-sample perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC) # simulate the null distribution of scores through sample permutation permutedScore <- generate_permuted_scores(logCPM_example, gsTopology = gsTopology, weight = ls$weight) # normlise the test perturbation scores using the permutation results normalisedScores <- normalise_by_permu(permutedScore, pathwayPertScore, sortBy = "pvalue") ## End(Not run)
## Not run: load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", sampleColumn = "sample", treatColumn = "treatment") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level # single-sample perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC) # simulate the null distribution of scores through sample permutation permutedScore <- generate_permuted_scores(logCPM_example, gsTopology = gsTopology, weight = ls$weight) # normlise the test perturbation scores using the permutation results normalisedScores <- normalise_by_permu(permutedScore, pathwayPertScore, sortBy = "pvalue") ## End(Not run)
Substract ssFC from the raw gene-level perturbation scores within each sample and sum gene-wise raw perturbation scores to derive single-sample perturbation scores for each pathway.
pathway_pert(genePertScore, weightedFC, drop = TRUE)
pathway_pert(genePertScore, weightedFC, drop = TRUE)
genePertScore |
List of gene-wise raw perturbation score matrices
generated using function |
weightedFC |
A matrix of weighted ssFC generated using function |
drop |
logic(1). Whether to drop pathways with all zero scores |
A data.frame with 3 columns: score (single-sample pathway-level perturbation score), sample, and gs_name (gene-set name)
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan 1;25(1):75-82.
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level single-sample # perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC)
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level single-sample # perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC)
Visualise the community structure in significantly perturbed gene-set network
plot_community( normalisedScores, gsTopology, gsAnnotation = NULL, colorBy = "community", communityMethod = c("louvain", "walktrap", "spinglass", "leading_eigen", "edge_betweenness", "fast_greedy", "label_prop", "leiden"), foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), markCommunity = "ellipse", markAlpha = 0.2, color_lg_title = NULL, edgeAlpha = 0.8, scale_edgeWidth = c(0.5, 3), edgeLegend = FALSE, scale_nodeSize = c(3, 6), nodeShape = 16, lb_size = 3, lb_color = "black", plotIsolated = FALSE, ... )
plot_community( normalisedScores, gsTopology, gsAnnotation = NULL, colorBy = "community", communityMethod = c("louvain", "walktrap", "spinglass", "leading_eigen", "edge_betweenness", "fast_greedy", "label_prop", "leiden"), foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), markCommunity = "ellipse", markAlpha = 0.2, color_lg_title = NULL, edgeAlpha = 0.8, scale_edgeWidth = c(0.5, 3), edgeLegend = FALSE, scale_nodeSize = c(3, 6), nodeShape = 16, lb_size = 3, lb_color = "black", plotIsolated = FALSE, ... )
normalisedScores |
A |
gsTopology |
List of pathway topology matrices generated using retrieve_topology |
gsAnnotation |
A |
colorBy |
Can be any column with in the |
communityMethod |
A community detection method supported by |
foldGSname |
|
foldafter |
The number of words after which gene-set names should be folded. Defaults to 2 |
labelFun |
function to manipulate or modify gene-set labels. By default, any database will be stripped from the prefix using a regex pattern |
layout |
The layout algorithm to apply. Accepted layouts are
|
markCommunity |
|
markAlpha |
Transparency of annotation areas. |
color_lg_title |
Title for the color legend |
edgeAlpha |
Transparency of edges. |
scale_edgeWidth |
A numerical vector of length 2 to be provided to
|
edgeLegend |
|
scale_nodeSize |
A numerical vector of length 2 to be provided to
|
nodeShape |
The shape to use for nodes |
lb_size |
Size of node text labels |
lb_color |
Color of node text labels |
plotIsolated |
|
... |
Used to pass various potting parameters to |
A community detection strategy specified by communityMethod
will
be applied to the pathway-pathway network, and communities will be annotated
with the pathway category that had the highest number of occurrence,
denoting the main biological processes perturbed in that community.
At the moment, only KEGG pathway categories are provided with the
package, so if the provided normalisedScores
contains perturbation
scores of pathways derived from other databases, annotation of communities
will not be performed unless pathway information is provided through
the gsAnnotation
object. The category information needs to be
provided in a data.frame
containing gs_name
(gene-set names) and
category
(categorising the given pathways).
Plotting parameters accepted by geom_mark_*
could be passed to the
function to adjust the annotation area or the annotation label. See
geom_mark_ellipse for more details.
A ggplot2 object
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) #Subset the first 10 rows of the normalisedScores data.frame as an example subset <- normalisedScores[1:15,] subset$status <- ifelse(subset$robustZ > 0, "Activated", "Inhibited") # Color network plot nodes by the community they were assigned to and mark # nodes belonging to the same community by ellipses plot_community(subset, gsTopology, colorBy = "community",layout = "kk", color_lg_title = "Community") # Color network plot nodes by pathways' directions of changes and mark nodes # belonging to the same community by ellipses plot_community(subset, gsTopology, colorBy = "status",layout = "kk", color_lg_title = "Direction of pathway perturbation") # To change the colour and fill of `geom_mark_*` annotation, use any # `scale_fill_*` and/or `scale_color_*` # functions supported by `ggplot2`. For example: p <- plot_community(subset, gsTopology, colorBy = "status",layout = "kk", markCommunity = "rect",color_lg_title = "Direction of pathway perturbation") p + ggplot2::scale_color_ordinal() + ggplot2::scale_fill_ordinal()
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) #Subset the first 10 rows of the normalisedScores data.frame as an example subset <- normalisedScores[1:15,] subset$status <- ifelse(subset$robustZ > 0, "Activated", "Inhibited") # Color network plot nodes by the community they were assigned to and mark # nodes belonging to the same community by ellipses plot_community(subset, gsTopology, colorBy = "community",layout = "kk", color_lg_title = "Community") # Color network plot nodes by pathways' directions of changes and mark nodes # belonging to the same community by ellipses plot_community(subset, gsTopology, colorBy = "status",layout = "kk", color_lg_title = "Direction of pathway perturbation") # To change the colour and fill of `geom_mark_*` annotation, use any # `scale_fill_*` and/or `scale_color_*` # functions supported by `ggplot2`. For example: p <- plot_community(subset, gsTopology, colorBy = "status",layout = "kk", markCommunity = "rect",color_lg_title = "Direction of pathway perturbation") p + ggplot2::scale_color_ordinal() + ggplot2::scale_fill_ordinal()
Plot individual genes' contributions to the pathway-level perturbation score
plot_gene_contribution( genePertMatr, mapEntrezID = NULL, topGene = 10, filterBy = c("mean", "sd", "max.abs"), tieMethod = "min", annotation_df = NULL, ... )
plot_gene_contribution( genePertMatr, mapEntrezID = NULL, topGene = 10, filterBy = c("mean", "sd", "max.abs"), tieMethod = "min", annotation_df = NULL, ... )
genePertMatr |
A matrix of gene-wise perturbation scores corresponding
to a pathway. An element of the output generated using function |
mapEntrezID |
Optional. A |
topGene |
Numeric(1). The number of top genes to plot |
filterBy |
Filter top genes by the mean, variability (sd), maximum value, or maximum absolute values |
tieMethod |
Method for handling ties in ranking (i.e. in values returned
by |
annotation_df |
A |
... |
Used to pass various potting parameters to
|
The single-sample pathway-level perturbation score for a given pathway is derived from aggregating all the gene-wise perturbation scores of genes in that pathway. This function visualises individual pathway genes' perturbation scores as a heatmap to demonstrate pathway genes' contribution to a pathway perturbation.
Plotting of the heatmap is done through pheatmap::pheatmap()
so all
plotting parameters accepted by pheatmap::pheatmap()
could also be passed
to this function.
Kolde R (2019). pheatmap: Pretty Heatmaps. R package version 1.0.12, https://CRAN.R-project.org/package=pheatmap.
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) # compute single-sample logFCs for all treated samples ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level single-sample perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC) # Genes with top 10 mean absolute gene-wise perturbation scores in the # Estrogen signaling pathway was visualised. plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, filterBy = "mean", topGene = 10) # Columns of the heatmap could be annotated by the pathway-level perturbation # and treatments. Firstly, create a `data.frame` with the two annotation # attributes and sample names matching the column names of the perturbation # score matrix. annotation_df <- dplyr::select(metadata_example, sample, treatment) pathwayLevel <- dplyr::filter(pathwayPertScore, gs_name == "kegg.Estrogen signaling pathway") pathwayLevel$`pathway-level` <- ifelse( pathwayLevel$score > 0, "Activated", "Inhibited") annotation_df <- dplyr::left_join( dplyr::select(pathwayLevel, sample, `pathway-level`), annotation_df, unmatched = "drop") # To make the gene labels more informative, also map genes' entrez id # to chosen identifiers. load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY")) plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, topGene = 10, filterBy = "mean", annotation_df = annotation_df, mapEntrezID = entrez2name) # Plotting parameters accepted by `pheatmap::pheatmap()` could be passed to # this function to customise the plot. For example, changin the color of annotations plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, topGene = 10, filterBy = "mean", annotation_df = annotation_df, mapEntrezID = entrez2name, annotation_colors = list( treatment = c(R5020 = "black", `E2+R5020` = "white"), `pathway-level` = c(Activated = "darkred", Inhibited = "lightskyblue")))
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) # compute single-sample logFCs for all treated samples ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology) # sum gene-wise perturbation scores to derive the pathway-level single-sample perturbation scores pathwayPertScore <- pathway_pert(genePertScore, ls$weighted_logFC) # Genes with top 10 mean absolute gene-wise perturbation scores in the # Estrogen signaling pathway was visualised. plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, filterBy = "mean", topGene = 10) # Columns of the heatmap could be annotated by the pathway-level perturbation # and treatments. Firstly, create a `data.frame` with the two annotation # attributes and sample names matching the column names of the perturbation # score matrix. annotation_df <- dplyr::select(metadata_example, sample, treatment) pathwayLevel <- dplyr::filter(pathwayPertScore, gs_name == "kegg.Estrogen signaling pathway") pathwayLevel$`pathway-level` <- ifelse( pathwayLevel$score > 0, "Activated", "Inhibited") annotation_df <- dplyr::left_join( dplyr::select(pathwayLevel, sample, `pathway-level`), annotation_df, unmatched = "drop") # To make the gene labels more informative, also map genes' entrez id # to chosen identifiers. load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY")) plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, topGene = 10, filterBy = "mean", annotation_df = annotation_df, mapEntrezID = entrez2name) # Plotting parameters accepted by `pheatmap::pheatmap()` could be passed to # this function to customise the plot. For example, changin the color of annotations plot_gene_contribution(genePertScore$`kegg.Estrogen signaling pathway`, topGene = 10, filterBy = "mean", annotation_df = annotation_df, mapEntrezID = entrez2name, annotation_colors = list( treatment = c(R5020 = "black", `E2+R5020` = "white"), `pathway-level` = c(Activated = "darkred", Inhibited = "lightskyblue")))
Plot significantly perturbed gene-sets as a network
plot_gs_network( normalisedScores, gsTopology, colorBy = NULL, foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), edgeAlpha = 0.8, edgeWidthScale = c(0.5, 3), edgeLegend = FALSE, nodeSizeScale = c(3, 6), nodeShape = 16, showLegend = TRUE, gsLegTitle = NULL, gsNameSize = 3, gsNameColor = "black", plotIsolated = FALSE, maxOverlaps = 10, ... )
plot_gs_network( normalisedScores, gsTopology, colorBy = NULL, foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), edgeAlpha = 0.8, edgeWidthScale = c(0.5, 3), edgeLegend = FALSE, nodeSizeScale = c(3, 6), nodeShape = 16, showLegend = TRUE, gsLegTitle = NULL, gsNameSize = 3, gsNameColor = "black", plotIsolated = FALSE, maxOverlaps = 10, ... )
normalisedScores |
A |
gsTopology |
List of pathway topology matrices generated using function
|
colorBy |
Choose to color nodes either by robustZ or pvalue. A
column must exist in the |
foldGSname |
logical(1) Should long gene-set names fold across multiple lines |
foldafter |
The number of words after which gene-set names should be folded. Defaults to 2 |
labelFun |
function to manipulate or modify gene-set labels. By default, any database will be stripped from the prefix using a regex pattern |
layout |
The layout algorithm to apply. Accept all layout supported by
|
edgeAlpha |
numeric(1) Transparency of edges. Default to 0.8 |
edgeWidthScale |
A numerical vector of length 2 to be provided to
|
edgeLegend |
logical(1) Should edge weight legend be shown |
nodeSizeScale |
A numeric vector of length 2 to be provided to
|
nodeShape |
Shape of nodes |
showLegend |
logical(1) Should color legend be shown |
gsLegTitle |
Optional title for the color legend |
gsNameSize |
Size of node text labels |
gsNameColor |
Color of node text labels |
plotIsolated |
logical(1) Should nodes not connected to any other nodes be plotted. Defaults to FALSE |
maxOverlaps |
passed to geom_node_text |
... |
Not used |
A ggplot2 object
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) # Subset pathways significantly perturbed in sample R5020_N2_48 subset <- dplyr::filter( normalisedScores, adjPvalue < 0.05, sample == "R5020_N2_48" ) subset[["status"]] <- ifelse(subset[["robustZ"]]> 0, "Activated", "Inhibited") # Color network plot nodes by status plot_gs_network(subset, gsTopology, colorBy = "status", layout = "dh", gsLegTitle = "Direction of pathway Perturbation") # Color network plot nodes by p-values plot_gs_network(subset, gsTopology, layout = "dh", colorBy = "pvalue", gsLegTitle = "P-value")
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) # Subset pathways significantly perturbed in sample R5020_N2_48 subset <- dplyr::filter( normalisedScores, adjPvalue < 0.05, sample == "R5020_N2_48" ) subset[["status"]] <- ifelse(subset[["robustZ"]]> 0, "Activated", "Inhibited") # Color network plot nodes by status plot_gs_network(subset, gsTopology, colorBy = "status", layout = "dh", gsLegTitle = "Direction of pathway Perturbation") # Color network plot nodes by p-values plot_gs_network(subset, gsTopology, layout = "dh", colorBy = "pvalue", gsLegTitle = "P-value")
Plot pathways and genes contained in them as a network
plot_gs2gene( normalisedScores, gsTopology, geneFC = NULL, mapEntrezID = NULL, colorGsBy = NULL, foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, filterGeneBy = 2, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), edgeColor = "darkgrey", edgeAlpha = 0.8, edgeArc = 0.5, geneNodeSize = 3, geneNodeShape = 17, geneNameFace = c("italic", "plain", "bold", "bold-italic"), geneNameColor = "grey30", geneNameSize = 3, labelGene = TRUE, gsNodeSize = 2, gsNodeShape = 21, gsNodeStroke = 0.5, gsNodeOutline = "white", gsNameSize = 6, gsNameColor = "black", geneLegTitle = "Mean logFC", gsLegTitle = colorGsBy, maxOverlaps = 10, ... )
plot_gs2gene( normalisedScores, gsTopology, geneFC = NULL, mapEntrezID = NULL, colorGsBy = NULL, foldGSname = TRUE, foldafter = 2, labelFun = .rm_prefix, filterGeneBy = 2, layout = c("fr", "dh", "gem", "graphopt", "kk", "lgl", "mds", "sugiyama"), edgeColor = "darkgrey", edgeAlpha = 0.8, edgeArc = 0.5, geneNodeSize = 3, geneNodeShape = 17, geneNameFace = c("italic", "plain", "bold", "bold-italic"), geneNameColor = "grey30", geneNameSize = 3, labelGene = TRUE, gsNodeSize = 2, gsNodeShape = 21, gsNodeStroke = 0.5, gsNodeOutline = "white", gsNameSize = 6, gsNameColor = "black", geneLegTitle = "Mean logFC", gsLegTitle = colorGsBy, maxOverlaps = 10, ... )
normalisedScores |
A |
gsTopology |
List of pathway topology matrices generated using function
|
geneFC |
An optional named vector of pathways' fold changes |
mapEntrezID |
Optional. A |
colorGsBy |
Column within |
foldGSname |
|
foldafter |
The number of words after which gene-set names should be folded. |
labelFun |
function to manipulate or modify gene-set labels. By default, any database will be stripped from the prefix using a regex pattern |
filterGeneBy |
Filtration cut-off applied to genes' connectivity (ie. how many pathways was a gene involved in). |
layout |
The layout algorithm to apply. Accepts all layout supported by
|
edgeColor , edgeAlpha
|
Color and transparency of edges |
edgeArc |
The bend of edges. 1 approximates a semi-circle whilst 0 will give a straight line. |
geneNodeSize , geneNodeShape
|
Size and shape for gene nodes |
geneNameSize , geneNameColor , geneNameFace
|
Size, color and fontface to use for gene labels |
labelGene |
|
gsNodeSize |
Size for gene-set/pathway nodes |
gsNodeShape |
Shape for gene-set/pathway nodes. Should be a shape with a fill parameter, such as 21:25 |
gsNodeStroke , gsNodeOutline
|
Border thickness and color for gene-set/pathway nodes |
gsNameSize , gsNameColor
|
Size and color of gene-set/pathway labels |
geneLegTitle |
|
gsLegTitle |
|
maxOverlaps |
passed to geom_node_text |
... |
Not used |
Taking the perturbation scores of a list of gene-sets derived from
normalise_by_permu()
as input, this function matches gene-sets to
their associated genes by utilizing information from pathway topology matrices.
If providing logFC values as a named vector, the names must be entrezgene IDs
in the format of "ENTREZID:XXXX" for compatibility with the values returned
by retrieve_topology()
. If not providing this vector, only genes associated
with two or more pathways will be added to the plot, however, it should be
noted that if omitting this vector, network plots can easily become
unmanageable.
Users can also choose to provide a mapEntrezID
data.frame
to match
entrezgene IDs to their chosen identifiers. The data.frame
should contain
the columns: "entrezid"
and "mapTo"
.
If geneFC
is provided, gene nodes will be colored by values provided,
otherwise all gene nodes will drawn in grey.
Since some gene-sets could can contain hundreds of genes, it is not
recommended to plot all genes. If mapEntrezID
data.frame
is provided,
only genes included in that data.frame
will be used in the plot.
It is strongly recommended to filter genes using some criteria, such as those
with the largest magnitude of change. If all pathway genes are desired,
please consider setting labelGene
to FALSE to remove gene names.
A ggplot2 object
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) # Subset pathways significantly perturbed in sample R5020_N2_48 subset <- dplyr::filter(normalisedScores, adjPvalue < 0.05, sample == "R5020_N2_48") subset$response <- ifelse(subset$robustZ > 0, "Activated", "Inhibited") # Color gene-sets nodes by robust z-scores. plot_gs2gene( subset, gsTopology, colorGsBy = "robustZ", labelGene = FALSE, geneNodeSize = 1, gsNodeSize = 4 ) + scale_fill_gradient2() # When fold-changes are not provided, gene nodes are colored grey. # To color genes by their direction of change, firstly compute single-sample logFC data(logCPM_example) data(metadata_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc( logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample" ) # Provide fold-changes of sample R5020_N2_48 as a named vector plot_gs2gene( subset, gsTopology, geneFC = ls$weighted_logFC[,"R5020_N2_48"], colorGsBy = "response", labelGene = FALSE ) + scale_colour_gradient2() # By default, the function only include genes involved in at least 2 pathways, # which can be overwritten by the `filterGeneBy` parameter. But there are still # a large number of genes, making the plot cumbersome. Instead, only include # fold-changes of genes within the top 500 absolute values for fold-change top500 <- rank(1/abs(ls$weighted_logFC[,"R5020_N2_48"])) <= 500 fcByDir <- ifelse(ls$weighted_logFC[top500,"R5020_N2_48"] > 0, "Up-Regulated", "Down-Regulated") plot_gs2gene(subset, gsTopology, geneFC = fcByDir, colorGsBy = "response") + scale_fill_manual(values = c("darkred", "lightskyblue")) + scale_colour_manual(values = c("red", "blue")) # To make the gene labels more informative, map genes' entrez id to chosen identifiers. load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY")) plot_gs2gene( subset, gsTopology, geneFC = fcByDir, mapEntrezID = entrez2name, colorGsBy = "response", gsNodeSize = 4 ) + scale_fill_manual(values = c("darkred", "lightskyblue"), name = "Pathway") + scale_colour_manual(values = c("blue", "red"), name = "Gene\nDirection")
load(system.file("extdata", "gsTopology.rda", package = "sSNAPPY")) load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY")) # Subset pathways significantly perturbed in sample R5020_N2_48 subset <- dplyr::filter(normalisedScores, adjPvalue < 0.05, sample == "R5020_N2_48") subset$response <- ifelse(subset$robustZ > 0, "Activated", "Inhibited") # Color gene-sets nodes by robust z-scores. plot_gs2gene( subset, gsTopology, colorGsBy = "robustZ", labelGene = FALSE, geneNodeSize = 1, gsNodeSize = 4 ) + scale_fill_gradient2() # When fold-changes are not provided, gene nodes are colored grey. # To color genes by their direction of change, firstly compute single-sample logFC data(logCPM_example) data(metadata_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc( logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample" ) # Provide fold-changes of sample R5020_N2_48 as a named vector plot_gs2gene( subset, gsTopology, geneFC = ls$weighted_logFC[,"R5020_N2_48"], colorGsBy = "response", labelGene = FALSE ) + scale_colour_gradient2() # By default, the function only include genes involved in at least 2 pathways, # which can be overwritten by the `filterGeneBy` parameter. But there are still # a large number of genes, making the plot cumbersome. Instead, only include # fold-changes of genes within the top 500 absolute values for fold-change top500 <- rank(1/abs(ls$weighted_logFC[,"R5020_N2_48"])) <= 500 fcByDir <- ifelse(ls$weighted_logFC[top500,"R5020_N2_48"] > 0, "Up-Regulated", "Down-Regulated") plot_gs2gene(subset, gsTopology, geneFC = fcByDir, colorGsBy = "response") + scale_fill_manual(values = c("darkred", "lightskyblue")) + scale_colour_manual(values = c("red", "blue")) # To make the gene labels more informative, map genes' entrez id to chosen identifiers. load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY")) plot_gs2gene( subset, gsTopology, geneFC = fcByDir, mapEntrezID = entrez2name, colorGsBy = "response", gsNodeSize = 4 ) + scale_fill_manual(values = c("darkred", "lightskyblue"), name = "Pathway") + scale_colour_manual(values = c("blue", "red"), name = "Gene\nDirection")
Propagate weighted single sample logFCs down the pathway topologies to compute gene-wise perturbation score per gene per sample per pathway
raw_gene_pert(weightedFC, gsTopology)
raw_gene_pert(weightedFC, gsTopology)
weightedFC |
A matrix of weighted single sample logFCs
derived from function |
gsTopology |
List of pathway topology matrices generated using function
|
This function use the algorithm adopted from SPIA
(see citation) to
integrate genes' changes in expression and gene-gene interaction to compute
gene-wise perturbation score per gene per sample per pathway. The rownames of
the weighted single sample logFC matrix and the pathway topology matrices must
use the same type of gene identifier (ie. entrez ID).
Pathways with zero perturbation scores across all genes and samples will be dropped from the output.
A list where each element is a matrix corresponding to a pathway. Each column of an element corresponds to a sample, and each row corresponds to a pathway gene.
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan 1;25(1):75-82.
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology)
#compute weighted single sample logFCs data(metadata_example) data(logCPM_example) metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", treatColumn = "treatment", sampleColumn = "sample") # extract all the KEGG pathways gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # compute raw gene-wise perturbation scores genePertScore <- raw_gene_pert(ls$weighted_logFC, gsTopology)
Retrieve pathway topology matrices and convert them to normalized weighted directed adjacency matrices describing gene signaling networks.
retrieve_topology( database = c("kegg", "wikipathways", "reactome"), species = c("hsapiens", "athaliana", "btaurus", "celegans", "cfamiliaris", "dmelanogaster", "drerio", "ecoli", "ggallus", "mmusculus", "rnorvegicus", "scerevisiae", "sscrofa", "xlaevis"), keyword = NULL, beta = NULL )
retrieve_topology( database = c("kegg", "wikipathways", "reactome"), species = c("hsapiens", "athaliana", "btaurus", "celegans", "cfamiliaris", "dmelanogaster", "drerio", "ecoli", "ggallus", "mmusculus", "rnorvegicus", "scerevisiae", "sscrofa", "xlaevis"), keyword = NULL, beta = NULL )
database |
A character vector of supported databases. |
species |
One of the supported species. Currently support |
keyword |
Optional. Areas of interests as a character vector. |
beta |
Optional. A named numeric vector of weights to be assigned to each type of gene/protein relation type.See details for more information. |
This function takes pathway topology information retrieved using graphite
and converts these to normalized weighted directed adjacency matrices
describing the gene signaling network, which can be used to compute gene-wise
and pathway-level perturbation score through the scoring algorithm derived
from the SPIA algorithm. See cited document for more details.
The database
parameter may specify multiple databases but only one species
can be provided to the species
parameter.
Users can provide areas of interests as keywords, which will be matched to pathway names from chosen databases to subset pathways. Cases will be ignored in string matching.
The beta parameter specifies weights to be assigned to each type of gene-gene
interaction. It should be a named numeric vector of length 25,
whose names must be: c("activation","compound","binding/association", "expression","inhibition","activation_phosphorylation","phosphorylation", "indirect","inhibition_phosphorylation","dephosphorylation_inhibition", "dissociation","dephosphorylation","activation_dephosphorylation", "state","activation_indirect","inhibition_ubiquination","ubiquination", "expression_indirect","indirect_inhibition","repression", "binding/association_phosphorylation","dissociation_phosphorylation", "indirect_phosphorylation")
.
If unspecified, beta will be set as an integer vector with: a) values of 1 for interactions which match 'expression' or 'activation'; b) values of -1 for interactions which match 'repression' or 'inhibition'; and c) 0 elsewhere.
The retrieved topology matrices will be processed as decribed by SPIA to: 1) scale gene-gene interactions by the number of downstream genes and 2) subtract an identity matrix of the same size from the topology matrix.
The converted weighted adjacent matrices will be stored in a list. We recommend users to store the returned list as a file so this step only needs to be performed once for each database.
A list where each element is a weighted directed adjacency matrix corresponding to a pathway
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan 1;25(1):75-82. Sales, G., Calura, E., Cavalieri, D. et al. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics 13, 20 (2012).
# retrieve pathway topology matrices of all KEGG pathway gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # If only interested in selected pathways, specify the areas of interest as # keywords gsTopology <- retrieve_topology(database = "kegg", keyword = c("metabolism", "signaling"), species = "hsapiens")
# retrieve pathway topology matrices of all KEGG pathway gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") # If only interested in selected pathways, specify the areas of interest as # keywords gsTopology <- retrieve_topology(database = "kegg", keyword = c("metabolism", "signaling"), species = "hsapiens")
A package for testing directional single sample pathway perturbation
Compute weighted single sample logFC for each treated samples using normalized logCPM values. Fit a lowess curve on variances ~ mean of logCPM values, and use it to predict gene-wise weights. The weighted single sample logFC are ready to be used for computing perturbation scores.
weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'matrix' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'data.frame' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'DGEList' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'SummarizedExperiment' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" )
weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'matrix' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'data.frame' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'DGEList' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" ) ## S4 method for signature 'SummarizedExperiment' weight_ss_fc( expreMatrix, metadata = NULL, sampleColumn, treatColumn, groupBy, prefix = "ENTREZID:" )
expreMatrix |
|
metadata |
Sample metadata |
sampleColumn |
Name of the column in the |
treatColumn |
Name of the column in the |
groupBy |
Name of the column in the |
prefix |
Character(1). Prefix to be add to the |
This function computes weighted single-sample logFC (ssFC) from normalised logCPM values, used for computing single-sample perturbation scores.
Genes with low expression tend to have larger variances, which will introduce biases to the ssFC computed. Therefore, a lowess curve will be fitted to estimate the relationship between the variances and the mean of logCPM, and the relationship will be used to estimate the variance of each gene-wise mean logCPM value. Gene-wise weights, which are defined to be the inverse of predicted variances that are scaled to have a sum of 1, will then be multiplied to ssFC to down-weight genes with low counts.
It is assumed that the genes with extremely low counts have been removed and the count matrix has been normalised prior to the logCPM matrix was derived. Row names of the matrix must be in genes' entrez IDs to align with the fomrat of pathway topology matrices.
If a S4 object of DGEList or SummarizedExperiment
is provided as input
to expreMatrix
, the gene expression matrix will be extracted from it and
converted to a logCPM matrix. Sample metadata will also be extracted from the
same S4 object unless otherwise specified.
Provided sample metadata should have the same number of rows as the number of columns in the logCPM matrix and must contain the a column for treatment, one for sample names and a column for how samples should be matched into pairs.
The treatment column in the sample metadata must be a factor with the reference level set to be the control condition.
A list with two elements: $weight gene-wise weights; $weighted_logFC weighted single sample logFC matrix
# Inspect metadata data frame to make sure it has treatment, sample # and patient columns data(metadata_example) data(logCPM_example) # Set the treatment column to be a factor where the reference is the control #treatment metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, sampleColumn = "sample", groupBy = "patient", treatColumn = "treatment")
# Inspect metadata data frame to make sure it has treatment, sample # and patient columns data(metadata_example) data(logCPM_example) # Set the treatment column to be a factor where the reference is the control #treatment metadata_example <- dplyr::mutate(metadata_example, treatment = factor( treatment, levels = c("Vehicle", "E2+R5020", "R5020"))) ls <- weight_ss_fc(logCPM_example, metadata = metadata_example, sampleColumn = "sample", groupBy = "patient", treatColumn = "treatment")