BiocSet: Representing Element Sets in the Tidyverse

Introduction

BiocSet is a package that represents element sets in a tibble format with the BiocSet class. Element sets are read in and converted into a tibble format. From here, typical dplyr operations can be performed on the element set. BiocSet also provides functionality for mapping different ID types and providing reference urls for elements and sets.

Installation

Install the most recent version from Bioconductor:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocSet")

The development version is also available for install from GitHub:

BiocManager::install("Kayla-Morrell/BiocSet")

Then load BiocSet:

library(BiocSet)

BiocSet

Input and Output

BiocSet can create a BiocSet object using two different input methods. The first is to input named character vectors of element sets. BiocSet returns three tibbles, es_element which contains the elements, es_set which contains the sets and es_elementset which contains elements and sets together.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # ℹ 49 more rows

The second method of creating a BiocSet object would be to read in a .gmt file. Using import(), a path to a downloaded .gmt file is read in and a BiocSet object is returned. The example below uses a hallmark element set downloaded from GSEA, which is also included with this package. This BiocSet includes a source column within the es_elementset tibble for reference as to where the element set came from.

gmtFile <- system.file(package = "BiocSet",
                        "extdata",
                        "hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 4,386 × 1
#>   element
#>   <chr>  
#> 1 JUNB   
#> 2 CXCL2  
#> 3 ATF3   
#> # ℹ 4,383 more rows
#> 
#> es_set():
#> # A tibble: 50 × 2
#>   set                              source                                       
#>   <chr>                            <chr>                                        
#> 1 HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.broadinstitute.org/gsea/msigdb/ca…
#> 2 HALLMARK_HYPOXIA                 http://www.broadinstitute.org/gsea/msigdb/ca…
#> 3 HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.broadinstitute.org/gsea/msigdb/ca…
#> # ℹ 47 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 7,324 × 2
#>   element set                             
#>   <chr>   <chr>                           
#> 1 JUNB    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 2 CXCL2   HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 3 ATF3    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> # ℹ 7,321 more rows

export() allows for a BiocSet object to be exported into a temporary file with the extention .gmt.

fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt
#> GMTFile object
#> resource: /tmp/RtmpaJFiO1/file1ac543ad8ba8.gmt

We also support importing files to create an OBOSet object which extends the BiocSet class. The default only displays the most basic amount of columns needed to perform certain tasks, but the argument extract_tag = everything allows for additional columns to be imported as well. The following workflow will demonstrate the use of both export and import to create a smaller subset of the large obo file that is accessable from GeneOntology.

download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo")

foo <- import("obo_file.obo", extract_tag = "everything")

small_tst <- es_element(foo)[1,] %>%
    unnest("ancestors") %>%
    select("element", "ancestors") %>%
    unlist() %>%
    unique()

small_oboset <- foo %>% filter_elementset(element %in% small_tst)

fl <- tempfile(fileext = ".obo")
export(small_oboset, fl)

Then you can import this smaller obo file which we utilize here, in tests, and example code.

oboFile <- system.file(package = "BiocSet",
                        "extdata",
                        "sample_go.obo")
obo <- import(oboFile)
obo
#> class: OBOSet
#> 
#> es_element():
#> # A tibble: 8 × 3
#>   element    name                             obsolete
#>   <chr>      <chr>                            <lgl>   
#> 1 GO:0033955 mitochondrial DNA inheritance    FALSE   
#> 2 GO:0000002 mitochondrial genome maintenance FALSE   
#> 3 GO:0007005 mitochondrion organization       FALSE   
#> # ℹ 5 more rows
#> 
#> es_set():
#> # A tibble: 8 × 2
#>   set        name                            
#>   <chr>      <chr>                           
#> 1 GO:0000002 mitochondrial genome maintenance
#> 2 GO:0006996 organelle organization          
#> 3 GO:0007005 mitochondrion organization      
#> # ℹ 5 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 36 × 3
#>   element    set        is_a 
#>   <chr>      <chr>      <lgl>
#> 1 GO:0033955 GO:0000002 TRUE 
#> 2 GO:0000002 GO:0000002 FALSE
#> 3 GO:0007005 GO:0006996 TRUE 
#> # ℹ 33 more rows

Implemented functions

A feature available to BiocSet is the ability to activate different tibbles to perform certain functions on. When a BiocSet is created, the tibble es_elementset is automatically activated and all functions will be performed on this tibble. BiocSet adopts the use of many common dplyr functions such as filter(), select(), mutate(), summarise(), and arrange(). With each of the functions the user is able to pick a different tibble to activate and work on by using ‘verb_tibble’. After the function is executed than the ‘active’ tibble is returned back to the tibble that was active before the function call. Some examples are shown below of how these functions work.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # ℹ 49 more rows
tbl %>% filter_element(element == "a" | element == "A")
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 2 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 A      
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 2 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 A       set2
tbl %>% mutate_set(pval = rnorm(1:2))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 2
#>   set     pval
#>   <chr>  <dbl>
#> 1 set1  -1.43 
#> 2 set2  -0.745
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # ℹ 49 more rows
tbl %>% arrange_elementset(desc(element))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 z       set1 
#> 2 y       set1 
#> 3 x       set1 
#> # ℹ 49 more rows

Set operations

BiocSet also allows for common set operations to be performed on the BiocSet object. union() and intersection() are the two set operations available in BiocSet. We demonstate how a user can find the union between two BiocSet objects or within a single BiocSet object. Intersection is used in the same way.

# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 8 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 5 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 8 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # ℹ 5 more rows

# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 20 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # ℹ 17 more rows
#> 
#> es_set():
#> # A tibble: 1 × 1
#>   set  
#>   <chr>
#> 1 union
#> 
#> es_elementset() <active>:
#> # A tibble: 20 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       union
#> 2 b       union
#> 3 c       union
#> # ℹ 17 more rows

Case study I

Next, we demonstrate the a couple uses of BiocSet with an experiment dataset airway from the package airway. This data is from an RNA-Seq experiment on airway smooth muscle (ASM) cell lines.

The first step is to load the library and the necessary data.

library(airway)
data("airway")
se <- airway

The function go_sets() discovers the keys from the org object and uses AnnotationDbi::select to create a mapping to GO ids. go_sets() also allows the user to indicate which evidence type or ontology type they would like when selecting the GO ids. The default is using all evidence types and all ontology types. We represent these identifieres as a BiocSet object. Using the go_sets function we are able to map the Ensembl ids and GO ids from the genome wide annotation for Human data in the org.Hs.eg.db package. The Ensembl ids are treated as elements while the GO ids are treated as sets.

library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,555 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # ℹ 23,552 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000009
#> 3 GO:0000012
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 339,946 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # ℹ 339,943 more rows

# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 18,282 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000076248
#> 3 ENSG00000168685
#> # ℹ 18,279 more rows
#> 
#> es_set():
#> # A tibble: 4,524 × 2
#>   set        evidence    
#>   <chr>      <named list>
#> 1 GO:0000002 <chr [1]>   
#> 2 GO:0000012 <chr [1]>   
#> 3 GO:0000018 <chr [1]>   
#> # ℹ 4,521 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 59,623 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000076248 GO:0000012
#> 3 ENSG00000168685 GO:0000018
#> # ℹ 59,620 more rows

Some users may not be interested in reporting the non-descriptive elements. We demonstrate subsetting the airway data to include non-zero assays and then filter out the non-descriptive elements.

se1 = se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 16,658 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # ℹ 16,655 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000009
#> 3 GO:0000012
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 263,707 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # ℹ 263,704 more rows

It may also be of interest to users to know how many elements are in each set. Using the count function we are able to calculate the elements per set.

go %>% group_by(set) %>% dplyr::count()
#> Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> ℹ Please use the `.add` argument instead.
#> ℹ The deprecated feature was likely used in the dplyr package.
#>   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> # A tibble: 18,677 × 2
#> # Groups:   set [18,677]
#>    set            n
#>    <chr>      <int>
#>  1 GO:0000002    12
#>  2 GO:0000009     2
#>  3 GO:0000012    12
#>  4 GO:0000014    10
#>  5 GO:0000015     4
#>  6 GO:0000016     1
#>  7 GO:0000017     2
#>  8 GO:0000018     4
#>  9 GO:0000019     2
#> 10 GO:0000022     2
#> # ℹ 18,667 more rows

It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.

drop <- es_activate(go, elementset) %>% group_by(set) %>%
    dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,555 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # ℹ 23,552 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000009
#> 3 GO:0000012
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 339,946 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # ℹ 339,943 more rows

To simplify mapping we created a couple map functions. map_unique() is used when there is known 1:1 mapping, This takes four arguements, a BiocSet object, an AnnotationDbi object, the id to map from, and the id to map to. map_multiple() needs a fifth argument to indicate how the function should treat an element when there is multiple mapping. Both functions utilize mapIds from AnnotationDbi and return a BiocSet object. In the example below we show how to use map_unique to map go’s ids from Ensembl to gene symbols.

go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
#> 'select()' returned 1:many mapping between keys and columns
#> Joining with `by = join_by(element)`
#> `mutate_if()` ignored the following grouping variables:
#> Joining with `by = join_by(element)`
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 20,468 × 1
#>   element
#>   <chr>  
#> 1 AKT3   
#> 2 LONP1  
#> 3 MEF2A  
#> # ℹ 20,465 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000009
#> 3 GO:0000012
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 298,525 × 2
#>   element set       
#>   <chr>   <chr>     
#> 1 AKT3    GO:0000002
#> 2 LONP1   GO:0000002
#> 3 MEF2A   GO:0000002
#> # ℹ 298,522 more rows

Another functionality of BiocSet is the ability to add information to the tibbles. Using the GO.db library we are able to map definitions to the GO ids. From there we can add the mapping to the tibble using map_add() and the mutate function.

library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,555 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # ℹ 23,552 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 2
#>   set        definition                                                         
#>   <chr>      <chr>                                                              
#> 1 GO:0000002 The maintenance of the structure and integrity of the mitochondria…
#> 2 GO:0000009 Catalysis of the transfer of a mannose residue to an oligosacchari…
#> 3 GO:0000012 The repair of single strand breaks in DNA. Repair of such breaks i…
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 339,946 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # ℹ 339,943 more rows

The library KEGGREST is a client interface to the KEGG REST server. KEGG contains pathway maps that represent interaction, reaction and relation networks for various biological processes and diseases. BiocSet has a function that utilizes KEGGREST to develop a BiocSet object that contains the elements for every pathway map in KEGG.

Due to limiations of the KEGGREST package, kegg_sets can take some time to run depending on the amount of pathways for the species of interest. Therefore we demonstrate using BiocFileCache to make the data available to the user.

library(BiocFileCache)
rname <- "kegg_hsa"
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
    kegg <- kegg_sets("hsa")
    fl <- bfcnew(rname = rname, ext = ".gmt")
    export(kegg_sets("hsa"), fl)
}
#> GMTFile object
#> resource: /github/home/.cache/R/BiocFileCache/1ac52820c0d0_1ac52820c0d0.gmt
kegg <- import(bfcrpath(rname=rname))

Within the kegg_sets() function we remove pathways that do not contain any elements. We then mutate the element tibble using the map_add function to contain both Ensembl and Entrez ids.

map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
kegg <- kegg %>% mutate_element(ensembl = map)

Since we are working with ASM data we thought we would subset the airway data to contain only the elements in the asthma pathway. This filter is performed on the KEGG id, which for asthma is “hsa05310”.

asthma <- kegg %>% filter_set(set == "hsa05310")

se <- se[rownames(se) %in% es_element(asthma)$ensembl,]

se
#> class: RangedSummarizedExperiment 
#> dim: 8153 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(8153): ENSG00000000419 ENSG00000000938 ... ENSG00000273079
#>   ENSG00000273085
#> rowData names(10): gene_id gene_name ... seq_coord_system symbol
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample

The filtering can also be done for multiple pathways.

pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)

multipaths
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 8,527 × 2
#>   element ensembl        
#>   <chr>   <chr>          
#> 1 3101    ENSG00000160883
#> 2 3098    ENSG00000156515
#> 3 3099    ENSG00000159399
#> # ℹ 8,524 more rows
#> 
#> es_set():
#> # A tibble: 4 × 2
#>   set      source
#>   <chr>    <chr> 
#> 1 hsa04110 <NA>  
#> 2 hsa04970 <NA>  
#> 3 hsa05224 <NA>  
#> # ℹ 1 more row
#> 
#> es_elementset() <active>:
#> # A tibble: 435 × 2
#>   element set     
#>   <chr>   <chr>   
#> 1 1017    hsa04110
#> 2 1019    hsa04110
#> 3 1021    hsa04110
#> # ℹ 432 more rows

Case study II

This example will start out the same way that Case Study I started, by loading in the airway dataset, but we will also do some reformating of the data. The end goal is to be able to perform a Gene Set Enrichment Analysis on the data and return a BiocSet object of the gene sets.

data("airway")
airway$dex <- relevel(airway$dex, "untrt")

Similar to other analyses we perform a differential expression analysis on the airway data using the library DESeq2 and then store the results into a tibble.

library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
res <- results(des)

tbl <- res %>% 
    as.data.frame() %>%
    as_tibble(rownames = "ENSEMBL") 

Since we want to use limma::goana() to perform the GSEA we will need to have ENTREZ identifiers in the data, as well as filter out some NA information. Later on this will be considered our ‘element’ tibble.

tbl <- tbl %>% 
    mutate(
        ENTREZID = mapIds(
            org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
        ) %>% unname()
    )
#> 'select()' returned 1:many mapping between keys and columns

tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl
#> # A tibble: 14,987 × 8
#>    ENSEMBL        baseMean log2FoldChange  lfcSE   stat  pvalue    padj ENTREZID
#>    <chr>             <dbl>          <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <chr>   
#>  1 ENSG000000000…    709.         -0.381  0.101  -3.79  1.52e-4 1.28e-3 7105    
#>  2 ENSG000000004…    520.          0.207  0.112   1.84  6.53e-2 1.96e-1 8813    
#>  3 ENSG000000004…    237.          0.0379 0.143   0.264 7.92e-1 9.11e-1 57147   
#>  4 ENSG000000004…     57.9        -0.0882 0.287  -0.307 7.59e-1 8.95e-1 55732   
#>  5 ENSG000000009…   5817.          0.426  0.0883  4.83  1.38e-6 1.82e-5 3075    
#>  6 ENSG000000010…   1282.         -0.241  0.0887 -2.72  6.58e-3 3.28e-2 2519    
#>  7 ENSG000000010…    610.         -0.0476 0.167  -0.286 7.75e-1 9.03e-1 2729    
#>  8 ENSG000000011…    369.         -0.500  0.121  -4.14  3.48e-5 3.42e-4 4800    
#>  9 ENSG000000014…    183.         -0.124  0.180  -0.689 4.91e-1 7.24e-1 90529   
#> 10 ENSG000000014…   2814.         -0.0411 0.103  -0.400 6.89e-1 8.57e-1 57185   
#> # ℹ 14,977 more rows

Now that the data is ready for GSEA we can go ahead and use goana() and make the results into a tibble. This tibble will be considered our ‘set’ tibble.

library(limma)
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:DESeq2':
#> 
#>     plotMA
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
    as.data.frame() %>%
    as_tibble(rownames = "GOALL")
go_ids
#> # A tibble: 20,916 × 6
#>    GOALL      Term                                    Ont       N    DE     P.DE
#>    <chr>      <chr>                                   <chr> <dbl> <dbl>    <dbl>
#>  1 GO:0002376 immune system process                   BP     1745   583 3.82e-17
#>  2 GO:0002682 regulation of immune system process     BP     1097   376 7.16e-13
#>  3 GO:0002764 immune response-regulating signaling p… BP      351   109 5.80e- 3
#>  4 GO:0006955 immune response                         BP     1120   363 4.64e- 9
#>  5 GO:0007154 cell communication                      BP     4386  1478 3.29e-54
#>  6 GO:0007165 signal transduction                     BP     4043  1359 1.44e-47
#>  7 GO:0008150 biological_process                      BP    12439  3405 6.29e-54
#>  8 GO:0009987 cellular process                        BP    11767  3228 2.18e-41
#>  9 GO:0023052 signaling                               BP     4381  1478 1.40e-54
#> 10 GO:0048583 regulation of response to stimulus      BP     3050  1004 1.91e-28
#> # ℹ 20,906 more rows

The last thing we need to do is create a tibble that we will consider our ‘elementset’ tibble. This tibble will be a mapping of all the elements and sets.

foo <- AnnotationDbi::select(
    org.Hs.eg.db,
    tbl$ENTREZID,
    "GOALL",
    "ENTREZID") %>% as_tibble()
#> 'select()' returned many:many mapping between keys and columns
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo
#> # A tibble: 961,404 × 2
#>    ENTREZID GOALL     
#>    <chr>    <chr>     
#>  1 7105     GO:0002218
#>  2 7105     GO:0002221
#>  3 7105     GO:0002253
#>  4 7105     GO:0002376
#>  5 7105     GO:0002682
#>  6 7105     GO:0002683
#>  7 7105     GO:0002684
#>  8 7105     GO:0002753
#>  9 7105     GO:0002757
#> 10 7105     GO:0002758
#> # ℹ 961,394 more rows

The function BiocSet_from_elementset() allows for users to create a BiocSet object from tibbles. This function is helpful when there is metadata contained in the tibble that should be in the BiocSet object. For this function to work properly, the columns that are being joined on need to be named correctly. For instance, in order to use this function on the tibbles we created we need to change the column in the ‘element’ tibble to ‘element’, the column in the ‘set’ tibble to ‘set’ and the same will be for the ‘elementset’ tibble. We demonstrate this below and then create the BiocSet object with the simple function.

foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
#> more elements in 'element' than in 'elementset'
#> more elements in 'set' than in 'elementset'
es
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 12,444 × 8
#>   element ENSEMBL         baseMean log2FoldChange  lfcSE   stat   pvalue    padj
#>   <chr>   <chr>              <dbl>          <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
#> 1 3980    ENSG00000005156     206.        -0.393  0.172  -2.29  2.23e- 2 8.66e-2
#> 2 1890    ENSG00000025708     145.         1.23   0.199   6.18  6.58e-10 1.49e-8
#> 3 50484   ENSG00000048392    1407.        -0.0701 0.0902 -0.778 4.37e- 1 6.80e-1
#> # ℹ 12,441 more rows
#> 
#> es_set():
#> # A tibble: 14,360 × 6
#>   set        Term                             Ont       N    DE  P.DE
#>   <chr>      <chr>                            <chr> <dbl> <dbl> <dbl>
#> 1 GO:0000002 mitochondrial genome maintenance BP       29     6 0.770
#> 2 GO:0000012 single strand break repair       BP       13     2 0.874
#> 3 GO:0000018 regulation of DNA recombination  BP      122    24 0.933
#> # ℹ 14,357 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 961,404 × 2
#>   element set       
#>   <chr>   <chr>     
#> 1 3980    GO:0000002
#> 2 1890    GO:0000002
#> 3 50484   GO:0000002
#> # ℹ 961,401 more rows

For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame.

tibble_from_element(es)
#> Joining with `by = join_by(set)`
#> Joining with `by = join_by(element)`
#> # A tibble: 12,439 × 14
#>    element   set   Term  Ont   N     DE    P.DE  ENSEMBL baseMean log2FoldChange
#>    <chr>     <lis> <lis> <lis> <lis> <lis> <lis> <list>  <list>   <list>        
#>  1 1         <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [16]>    
#>  2 100       <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [430]>   
#>  3 1000      <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [186]>   
#>  4 10000     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [132]>   
#>  5 10001     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [81]>    
#>  6 10003     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [6]>     
#>  7 10004     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [12]>    
#>  8 100048912 <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [35]>    
#>  9 10005     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [97]>    
#> 10 10006     <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>   <dbl>    <dbl [122]>   
#> # ℹ 12,429 more rows
#> # ℹ 4 more variables: lfcSE <list>, stat <list>, pvalue <list>, padj <list>

head(data.frame_from_elementset(es))
#> Joining with `by = join_by(set)`
#> Joining with `by = join_by(element)`
#>   element        set                             Term Ont  N DE      P.DE
#> 1    3980 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#> 2    1890 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#> 3   50484 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#> 4    4205 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#> 5   64863 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#> 6    9093 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7697388
#>           ENSEMBL  baseMean log2FoldChange      lfcSE       stat       pvalue
#> 1 ENSG00000005156  205.6627    -0.39272067 0.17180717 -2.2858223 2.226466e-02
#> 2 ENSG00000025708  144.6965     1.23104228 0.19933257  6.1758212 6.582046e-10
#> 3 ENSG00000048392 1406.6344    -0.07014829 0.09020226 -0.7776778 4.367590e-01
#> 4 ENSG00000068305 1565.5675     0.52905471 0.09186326  5.7591547 8.453618e-09
#> 5 ENSG00000101574  179.0598     0.04924321 0.15446981  0.3187886 7.498869e-01
#> 6 ENSG00000103423  670.6312    -0.04006479 0.09900566 -0.4046718 6.857188e-01
#>           padj
#> 1 8.658851e-02
#> 2 1.493644e-08
#> 3 6.802838e-01
#> 4 1.636061e-07
#> 5 8.909142e-01
#> 6 8.543864e-01

Information look up

A final feature to BiocSet is the ability to add reference information about all of the elements/sets. A user could utilize the function url_ref() to add information to the BiocSet object. If a user has a question about a specific id then they can follow the reference url to get more informtion. Below is an example of using url_ref() to add reference urls to the go data set we worked with above.

url_ref(go)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,555 × 2
#>   element         url                                                           
#>   <chr>           <chr>                                                         
#> 1 ENSG00000151729 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 2 ENSG00000025708 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 3 ENSG00000068305 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> # ℹ 23,552 more rows
#> 
#> es_set():
#> # A tibble: 18,677 × 2
#>   set        url                                                           
#>   <chr>      <chr>                                                         
#> 1 GO:0000002 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000002
#> 2 GO:0000009 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000009
#> 3 GO:0000012 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000012
#> # ℹ 18,674 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 339,946 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # ℹ 339,943 more rows

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] limma_3.63.2                tibble_3.2.1               
#>  [3] DESeq2_1.47.1               BiocFileCache_2.15.0       
#>  [5] dbplyr_2.5.0                GO.db_3.20.0               
#>  [7] org.Hs.eg.db_3.20.0         AnnotationDbi_1.69.0       
#>  [9] airway_1.26.0               SummarizedExperiment_1.37.0
#> [11] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [13] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#> [15] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [17] generics_0.1.3              MatrixGenerics_1.19.0      
#> [19] matrixStats_1.4.1           BiocSet_1.21.0             
#> [21] dplyr_1.1.4                 BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        blob_1.2.4              filelock_1.0.3         
#>  [4] Biostrings_2.75.3       fastmap_1.2.0           digest_0.6.37          
#>  [7] lifecycle_1.0.4         statmod_1.5.0           KEGGREST_1.47.0        
#> [10] RSQLite_2.3.9           magrittr_2.0.3          compiler_4.4.2         
#> [13] rlang_1.1.4             sass_0.4.9              tools_4.4.2            
#> [16] utf8_1.2.4              yaml_2.3.10             knitr_1.49             
#> [19] S4Arrays_1.7.1          ontologyIndex_2.12      bit_4.5.0.1            
#> [22] curl_6.0.1              DelayedArray_0.33.3     plyr_1.8.9             
#> [25] abind_1.4-8             BiocParallel_1.41.0     withr_3.0.2            
#> [28] purrr_1.0.2             sys_3.4.3               grid_4.4.2             
#> [31] colorspace_2.1-1        ggplot2_3.5.1           scales_1.3.0           
#> [34] cli_3.6.3               rmarkdown_2.29          crayon_1.5.3           
#> [37] httr_1.4.7              DBI_1.2.3               cachem_1.1.0           
#> [40] zlibbioc_1.52.0         parallel_4.4.2          formatR_1.14           
#> [43] BiocManager_1.30.25     XVector_0.47.0          vctrs_0.6.5            
#> [46] Matrix_1.7-1            jsonlite_1.8.9          bit64_4.5.2            
#> [49] maketools_1.3.1         locfit_1.5-9.10         tidyr_1.3.1            
#> [52] jquerylib_0.1.4         glue_1.8.0              codetools_0.2-20       
#> [55] gtable_0.3.6            BiocIO_1.17.1           UCSC.utils_1.3.0       
#> [58] munsell_0.5.1           pillar_1.10.0           htmltools_0.5.8.1      
#> [61] GenomeInfoDbData_1.2.13 R6_2.5.1                evaluate_1.0.1         
#> [64] lattice_0.22-6          png_0.1-8               memoise_2.0.1          
#> [67] bslib_0.8.0             Rcpp_1.0.13-1           SparseArray_1.7.2      
#> [70] xfun_0.49               buildtools_1.0.0        pkgconfig_2.0.3