---
title: "Introduction to the GraphExperiment class"
author:
  - name: Fabricio Almeida-Silva
    affiliation: |
      VIB-UGent Center for Plant Systems Biology,
      Ghent University, Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: |
      VIB-UGent Center for Plant Systems Biology,
      Ghent University, Ghent, Belgium
output:
  BiocStyle::html_document:
    toc: true
    number_sections: yes
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{Introduction to the GraphExperiment class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```

# Introduction

Networks (or graphs) have become widely used data representations in biology, as they can efficiently encode node-node interactions and neighborhoods. For high-throughput, quantitative omics data (e.g., transcriptomics, proteomics, metabolomics, epigenomics, etc.), widely used network representations at the feature level include gene coexpression, protein-protein interaction, gene regulatory, and co-abundance networks. At the observation level, they include cell-cell communication, sample-sample similarity, and species-species networks.

While data structures to store quantitative data and associated metadata exist (e.g., `SummarizedExperiment`, `SingleCellExperiment`, `SpatialExperiment`, etc.), support for networks describing how features (i.e., assay rows) and/or observations (i.e., assay columns) relate to each other is currently missing. `GraphExperiment` is an S4 class that extends `SingleCellExperiment` [@sce] to include additional containers for networks associated with assay features (`rowGraphs`) and assay observations (`colGraphs`).

Of note, trees are an alternative way of representing how assay features/observations are related to each other. Users interested in tree representations of assay rows/columns can use the `r BiocStyle::Biocpkg("TreeSummarizedExperiment")` package.
Trees are essentially *a kind of graph* (i.e., all trees are graphs, but not all graphs are trees). Here, we chose to use a more general graph representation (namely `igraph` objects) to provide users and developers with more flexibility.

# Installation

`GraphExperiment` can be installed from Bioconductor with the following code:

```{r installation, eval=FALSE}
if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')

BiocManager::install("GraphExperiment")
```

```{r load_package, message=FALSE}
# Load package after installation
library(GraphExperiment)

set.seed(777) # for reproducibility
```

# Anatomy of a `GraphExperiment` object

Since the `GraphExperiment` class extends the `SingleCellExperiment` class, all `SingleCellExperiment` slots are present in `GraphExperiment`, including:

- `assays`: list of matrices with primary (e.g., counts) and transformed (e.g., log-normalized counts, TPM, etc.) data, with features in rows and observations in columns.
- `colData`: a data frame with column (observation) metadata, such as cell type, sample ID, condition, batch ID, genotype, etc.
- `rowData`: a data frame with row (feature) metadata, such as gene ID, genomic coordinates, functional annotation, etc.
- `reducedDims`: list of data frames with reduced dimensions, such as PCA, t-SNE, and UMAP embeddings.

Compared to `SingleCellExperiment` objects, `GraphExperiment` provides two additional containers[^1]:

- `rowGraphs`: list of `igraph` objects representing feature-feature relationships, optionally with node and edge attributes.
- `colGraphs`: list of `igraph` objects representing observation-observation relationships, optionally with node and edge attributes.

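
To make "node and edge attributes" concrete, the sketch below builds a toy graph with `igraph` alone and attaches one edge attribute (`weight`) and one node attribute (`pathway`). It is only an illustration: the gene names, weights, and pathway labels are made up, and nothing here depends on `GraphExperiment` itself.

```r
library(igraph)

# Toy feature-feature graph; gene names and weights are illustrative only
toy_edges <- data.frame(
    from = c("geneA", "geneA", "geneB"),
    to = c("geneB", "geneC", "geneC"),
    weight = c(0.9, 0.5, 0.7) # extra columns become edge attributes
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Node attribute: one value per vertex (geneA, geneB, geneC)
toy_graph <- set_vertex_attr(toy_graph, "pathway", value = c("P1", "P1", "P2"))

# List the attributes stored in the graph
vertex_attr_names(toy_graph) # "name" and "pathway"
edge_attr_names(toy_graph)   # "weight"
```

A graph like `toy_graph` is the kind of object that each element of `rowGraphs`/`colGraphs` holds.
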

```{r fig, echo=FALSE, out.width = "100%", fig.cap="The GraphExperiment class."}
knitr::include_graphics("GraphExperiment.png")
```

[^1]: **Note on software design:** if you're familiar with `SingleCellExperiment` objects, you probably know that they offer `rowPairs`/`colPairs` slots to store pairwise relationships between rows and columns of assays, respectively. In theory, some of the data stored in `rowGraphs`/`colGraphs` (of a `GraphExperiment` object) could be stored in `rowPairs`/`colPairs` (of a `SingleCellExperiment`). However, we chose to implement a dedicated slot with `igraph` objects to guarantee (i) seamless interoperability with other packages, given that `igraph` is the de facto standard class for graphs in R; and (ii) convenience in methods (e.g., subsetting, integration with `rowData`/`colData`, integration across multiple graphs, etc.).

The `igraph` data class from the `r BiocStyle::CRANpkg("igraph")` package is the standard data structure for graph representation in R. If you are unfamiliar with `igraph` objects, you can learn more about them by reading the `r BiocStyle::CRANpkg("igraph")` vignettes.

# Building a `GraphExperiment` object

`GraphExperiment` objects can be created from scratch using the constructor function `GraphExperiment()`. Below, we simulate a scRNA-seq count matrix with some gene (row) and cell (column) metadata, and create graphs based on gene-gene and cell-cell correlations[^2].

[^2]: **Tip:** in day-to-day single-cell RNA-seq analyses, researchers typically infer cell-cell graphs based on shared nearest-neighbors (SNN), which are then used to find clusters that can be mapped to cell types. Readers interested in this sort of graph can have a look at the `buildSNNGraph()` function from the `r BiocStyle::Biocpkg("scran")` package.

```{r simulate_slots, message=FALSE}
# Simulate parts of a `GraphExperiment` object
## Assays
gene_ids <- paste0("gene", seq_len(200))
cell_ids <- paste0("cell", seq_len(100))
mat <- matrix(
    rpois(20000, 5),
    ncol = 100,
    dimnames = list(gene_ids, cell_ids)
)
mat[1:5, 1:5]

## rowData
rdata <- data.frame(
    row.names = gene_ids,
    pathway = sample(c("P1", "P2"), size = length(gene_ids), replace = TRUE),
    coding = sample(c(TRUE, FALSE), size = length(gene_ids), replace = TRUE)
)
head(rdata)

## colData
cdata <- data.frame(
    row.names = cell_ids,
    cell_type = sample(c("ct1", "ct2"), size = length(cell_ids), replace = TRUE)
)
head(cdata)

## rowGraph (with node attribute `degree`; `strength()` sums edge weights per node)
rg <- graph_from_adjacency_matrix(
    cor(t(mat)), mode = "undirected", weighted = TRUE
)
rg <- set_vertex_attr(rg, "degree", value = strength(rg))
rg

## colGraph
cg <- graph_from_adjacency_matrix(
    cor(mat), mode = "undirected", weighted = TRUE
)
```

To create a `GraphExperiment` object with the constructor function, you would run:

```{r create_ge}
# Create a `GraphExperiment` object
ge <- GraphExperiment(
    assays = list(counts = mat),
    rowData = rdata,
    colData = cdata,
    rowGraphs = list(gene_cor = rg),
    colGraphs = list(cell_cor = cg)
)
ge
```

If you're familiar with `SummarizedExperiment` and `SingleCellExperiment` objects, you will recognize nearly everything you see in `ge`. Compared to `SingleCellExperiment` objects, the only difference is in the last two lines of the output, which indicate that this object contains a `rowGraph` named 'gene_cor' and a `colGraph` named 'cell_cor'.

Importantly, since nodes of `rowGraphs`/`colGraphs` are always in sync with `rownames`/`colnames`, **feature IDs in rownames and rowGraphs must be the same**, and likewise for observation IDs in colnames and colGraphs.
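
A quick way to check this requirement before calling the constructor is to compare node names with assay rownames. The sketch below uses only `igraph` and base R on toy objects (`toy_mat`, `toy_g`, and `toy_g2` are hypothetical names). It is not the validity check that `GraphExperiment` runs internally, just a pre-check that assumes graph nodes carry a `name` attribute.

```r
library(igraph)

# Toy assay and a graph whose nodes are named after the assay rows
toy_ids <- paste0("gene", 1:5)
toy_mat <- matrix(
    rpois(50, 5), nrow = 5,
    dimnames = list(toy_ids, paste0("cell", 1:10))
)
toy_g <- make_full_graph(5)
V(toy_g)$name <- toy_ids

# Do graph nodes and assay rownames contain the same IDs?
setequal(V(toy_g)$name, rownames(toy_mat)) # TRUE

# After dropping a node, list the features missing from the graph
toy_g2 <- delete_vertices(toy_g, "gene1")
setdiff(rownames(toy_mat), V(toy_g2)$name) # "gene1"
```

The same comparison applies to `colnames` and the nodes of a column graph.
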
Attempting to create a `GraphExperiment` object with a graph that lacks some of the features in `rownames` leads to an error:

```{r error_missing_from_graph, error = TRUE}
# Remove 'gene1' to 'gene10' from the rowGraph and try to recreate object
rg2 <- delete_vertices(rg, paste0("gene", 1:10))

GraphExperiment(
    assays = list(counts = mat),
    rowData = rdata,
    colData = cdata,
    rowGraphs = list(gene_cor = rg2)
)
```

Alternatively, you can create a `GraphExperiment` object by coercing from an existing `(Ranged)SummarizedExperiment` or `SingleCellExperiment` object. For example:

```{r coerce_se}
# Coercing from `SummarizedExperiment`
se <- SummarizedExperiment(list(counts = mat))
ge1 <- as(se, "GraphExperiment")
ge1
```

Note that the `rowGraphs`/`colGraphs` containers are still there, but empty.

To access the names of all graphs, use the `rowGraphNames()` and `colGraphNames()` functions.

```{r graphNames}
# Get rowGraph names
rowGraphNames(ge) # 'gene_cor'
rowGraphNames(ge1) # empty (NULL)

# Get colGraph names
colGraphNames(ge) # 'cell_cor'
colGraphNames(ge1) # empty (NULL)
```

# Accessing `rowGraphs`/`colGraphs` and `rowData`/`colData` (a.k.a. 'getters')

To access graphs in `rowGraphs`/`colGraphs`, you can use one of the following getter functions:

- `rowGraphs(x)`/`colGraphs(x)`: retrieves **all** (row/col)Graphs as a list of `igraph` objects.
- `rowGraph(x, i)`/`colGraph(x, i)`: retrieves only graph $i$ from the list. Note that $i$ can be a numeric scalar (index) or a character scalar (name).

The design here is equivalent to `assays()` versus `assay()` for `SummarizedExperiment` objects.

```{r getters}
# Get rowGraphs
rowGraphs(ge)

# Get colGraphs
colGraphs(ge)

# Get first rowGraph by index
rowGraph(ge, 1)

# Get first rowGraph by index (alternative)
rowGraphs(ge)[[1]]

# Get graph by name
rowGraph(ge, "gene_cor")
```

Careful readers will notice that this `igraph` object has node attributes that were not present in the original graph: 'pathway' and 'coding'.
This is because `rowGraphs()`/`rowGraph()` automatically extract `rowData` variables (if any) and add them to node attributes. `colGraphs()`/`colGraph()` work in the same way (but with `colData`, of course).

The same happens in the other direction: the `rowData()`/`colData()` methods for `GraphExperiment` objects automatically add node attributes (if any) to `rowData`/`colData`.

```{r rowdata_getter}
# `rowGraphs` and `rowData` are always in sync!
rowData(ge)

# `colGraphs` and `colData` too - yay!
colGraph(ge, 1) # note the `cell_type` attribute extracted from `colData`
```

Variables 'pathway' and 'coding' were in the original data frame we used as `rowData`, but variable 'gene_cor__degree' was added by extracting the *degree* attribute of nodes in rowGraph `gene_cor`.

# Modifying `GraphExperiment` objects (a.k.a. 'setters')

As in the `SummarizedExperiment` and `SingleCellExperiment` classes, all getter methods specific to `GraphExperiment` objects have a corresponding setter method, named by appending `<-` to the getter's name. For example, to add or replace a particular graph, you would use the `rowGraph<-`/`colGraph<-` method as follows:

```{r graph_setter}
# Create a new rowGraph keeping only edges with |correlation| >= 0.4
rg_filt <- rowGraph(ge, "gene_cor") |>
    delete_vertex_attr("pathway") |>
    delete_vertex_attr("degree") |>
    delete_vertex_attr("coding")

todelete <- abs(E(rg_filt)$weight) < 0.4
rg_filt <- delete_edges(rg_filt, which(todelete))
rg_filt

# Add filtered graph as a new graph named `filt_genecor`
rowGraph(ge, "filt_genecor") <- rg_filt
ge
```

If you'd like to replace all graphs at once, you could use the `rowGraphs<-`/`colGraphs<-` setters.
For example, let's add a few graphs to the `GraphExperiment` object we created before by coercing from `SummarizedExperiment`:

```{r graphs_setter}
# Taking a quick look (note: nothing in `rowGraphs`/`colGraphs`)
ge1

# Adding graphs from `ge`
rowGraphs(ge1) <- rowGraphs(ge)
colGraphs(ge1) <- colGraphs(ge)

ge1
```

Lastly, you can also rename graphs by updating `rowGraphNames`/`colGraphNames` as follows:

```{r graphNames_setter}
# Rename graphs
rowGraphNames(ge1) <- c("correlations", "correlations_filtered_0.4")
colGraphNames(ge1) <- c("cell_correlations")

ge1
```

# Subsetting `GraphExperiment` objects

In `SummarizedExperiment` objects, subsetting rows and columns (using square brackets, `[`) automatically subsets `rowData` and `colData` besides the assays. The same is true for `SingleCellExperiment` objects: subsetting columns automatically subsets `colData` and `reducedDims`.

Since graphs in `GraphExperiment` objects are linked to rows and columns, subsetting rows of a `GraphExperiment` object automatically subsets rows of the `assays`, `rowData`, and all graphs in `rowGraphs`, and subsetting columns automatically subsets columns of the `assays`, `colData`, and graphs in `colGraphs`. For example:

```{r subset}
# Subsetting `GraphExperiment` object
ge_subset <- ge[1:10, 1:10]
ge_subset

rowGraph(ge_subset, "gene_cor")
```

# Session information {.unnumbered}

This document was created under the following conditions:

```{r session_info}
sessioninfo::session_info()
```

# References {.unnumbered}