BugSigDB is a manually curated database of microbial signatures from the published literature of differential abundance studies of human and other host microbiomes.
BugSigDB provides:
The bugsigdbr package implements convenient access to BugSigDB from within R/Bioconductor. The goal of the package is to facilitate import of BugSigDB data into R/Bioconductor, provide utilities for extracting microbe signatures, and enable export of the extracted signatures to plain text files in standard file formats such as GMT.
The bugsigdbr package is primarily a data package. For descriptive statistics and comprehensive analysis of BugSigDB contents, please see the BugSigDBStats package and analysis vignette.
We start by loading the package.
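A minimal sketch of the loading step, assuming the package has been installed from Bioconductor. The availability check and the variable name has_bugsigdbr are illustrative additions, not something the package requires.

```r
# Check whether bugsigdbr is installed before attaching it.
has_bugsigdbr <- requireNamespace("bugsigdbr", quietly = TRUE)

if (has_bugsigdbr) {
    library(bugsigdbr)
} else {
    # Bioconductor packages are usually installed through BiocManager.
    message("bugsigdbr is not installed; try BiocManager::install(\"bugsigdbr\")")
}
```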
The function importBugSigDB can be used to import the complete collection of curated signatures from BugSigDB. The dataset is downloaded once and subsequently cached. Use cache = FALSE to force a fresh download of BugSigDB and overwrite the local copy in your cache.
bsdb <- importBugSigDB()
dim(bsdb)
#> [1] 5520 50
colnames(bsdb)
#> [1] "BSDB ID" "Study"
#> [3] "Study design" "PMID"
#> [5] "DOI" "URL"
#> [7] "Authors list" "Title"
#> [9] "Journal" "Year"
#> [11] "Keywords" "Experiment"
#> [13] "Location of subjects" "Host species"
#> [15] "Body site" "UBERON ID"
#> [17] "Condition" "EFO ID"
#> [19] "Group 0 name" "Group 1 name"
#> [21] "Group 1 definition" "Group 0 sample size"
#> [23] "Group 1 sample size" "Antibiotics exclusion"
#> [25] "Sequencing type" "16S variable region"
#> [27] "Sequencing platform" "Statistical test"
#> [29] "Significance threshold" "MHT correction"
#> [31] "LDA Score above" "Matched on"
#> [33] "Confounders controlled for" "Pielou"
#> [35] "Shannon" "Chao1"
#> [37] "Simpson" "Inverse Simpson"
#> [39] "Richness" "Signature page name"
#> [41] "Source" "Curated date"
#> [43] "Curator" "Revision editor"
#> [45] "Description" "Abundance in Group 1"
#> [47] "MetaPhlAn taxon names" "NCBI Taxonomy IDs"
#> [49] "State" "Reviewer"
Each row of the resulting data.frame corresponds to a microbe signature from differential abundance analysis, i.e. a set of microbes that has been found with increased or decreased abundance in one sample group when compared to another sample group (e.g. in a case-vs.-control setup). The curated signatures are richly annotated with additional metadata columns providing information on study design, antibiotics exclusion criteria, sample size, and experimental and statistical procedures, among others.
Subsetting the full dataset to certain conditions, body sites, or other metadata columns of interest can be done along the usual lines for subsetting data.frames.
For example, the subset function can restrict the dataset to signatures obtained from microbiome studies on obesity, based on fecal samples from participants in the US.
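The obesity filter described above can be sketched on a small stand-in table. The data.frame toy and its rows are hypothetical, and the exact value spellings ("Obesity", "Feces", "United States of America") are assumptions that should be checked against the real columns, e.g. with unique(bsdb[["Condition"]]).

```r
# Toy stand-in for bsdb (hypothetical rows; only three of the 50 columns).
toy <- data.frame(
    "Condition" = c("Obesity", "Obesity", "Gastric cancer"),
    "Body site" = c("Feces", "Mouth", "Feces"),
    "Location of subjects" = c("United States of America",
                               "United States of America",
                               "China"),
    check.names = FALSE  # keep the spaces in the column names
)

# Column names containing spaces must be quoted with backticks inside subset().
us.obesity.feces <- subset(toy,
                           `Location of subjects` == "United States of America" &
                           Condition == "Obesity" &
                           `Body site` == "Feces")
nrow(us.obesity.feces)
#> [1] 1
```

Replacing toy with bsdb applies the same filter to the imported collection, provided the values are spelled as assumed here.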
Given the full BugSigDB collection (or a subset of interest), the function getSignatures can be used to obtain the microbes annotated to each signature.
By default, microbes annotated to a signature are returned as NCBI Taxonomy IDs.
sigs <- getSignatures(bsdb)
length(sigs)
#> [1] 5269
sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "91061" "1236" "1654" "1716" "1301" "162289" "189330" "33024"
#> [9] "40544" "2037" "2049" "506" "186826" "1300" "31977" "91347"
#> [17] "1653" "57037" "1386" "186817"
#>
#> $`bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN`
#> [1] "100883" "1117"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "207244" "57037"
It is also possible to obtain signatures based on the full taxonomic classification in MetaPhlAn format …
mp.sigs <- getSignatures(bsdb, tax.id.type = "metaphlan")
mp.sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "k__Bacteria|p__Bacillota|c__Bacilli"
#> [2] "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria"
#> [3] "k__Bacteria|p__Actinomycetota|c__Actinomycetes|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces"
#> [4] "k__Bacteria|p__Actinomycetota|c__Actinomycetes|o__Mycobacteriales|f__Corynebacteriaceae|g__Corynebacterium"
#> [5] "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus"
#> [6] "k__Bacteria|p__Bacillota|c__Tissierellia|o__Tissierellales|f__Peptoniphilaceae|g__Peptoniphilus"
#> [7] "k__Bacteria|p__Bacillota|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__Dorea"
#> [8] "k__Bacteria|p__Bacillota|c__Negativicutes|o__Acidaminococcales|f__Acidaminococcaceae|g__Phascolarctobacterium"
#> [9] "k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella"
#> [10] "k__Bacteria|p__Actinomycetota|c__Actinomycetes|o__Actinomycetales"
#> [11] "k__Bacteria|p__Actinomycetota|c__Actinomycetes|o__Actinomycetales|f__Actinomycetaceae"
#> [12] "k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|f__Alcaligenaceae"
#> [13] "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales"
#> [14] "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales|f__Streptococcaceae"
#> [15] "k__Bacteria|p__Bacillota|c__Negativicutes|o__Veillonellales|f__Veillonellaceae"
#> [16] "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales"
#> [17] "k__Bacteria|p__Actinomycetota|c__Actinomycetes|o__Mycobacteriales|f__Corynebacteriaceae"
#> [18] "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lacticaseibacillus|s__Lacticaseibacillus zeae"
#> [19] "k__Bacteria|p__Bacillota|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus"
#> [20] "k__Bacteria|p__Bacillota|c__Bacilli|o__Bacillales|f__Bacillaceae"
#>
#> $`bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN`
#> [1] "k__Bacteria|p__Bacillota|c__Erysipelotrichia|o__Erysipelotrichales|f__Coprobacillaceae|g__Coprobacillus"
#> [2] "k__Bacteria|p__Cyanobacteriota"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "k__Bacteria|p__Bacillota|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__Anaerostipes"
#> [2] "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lacticaseibacillus|s__Lacticaseibacillus zeae"
… or using the taxonomic name only:
tn.sigs <- getSignatures(bsdb, tax.id.type = "taxname")
tn.sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "Bacilli" "Gammaproteobacteria"
#> [3] "Actinomyces" "Corynebacterium"
#> [5] "Streptococcus" "Peptoniphilus"
#> [7] "Dorea" "Phascolarctobacterium"
#> [9] "Sutterella" "Actinomycetales"
#> [11] "Actinomycetaceae" "Alcaligenaceae"
#> [13] "Lactobacillales" "Streptococcaceae"
#> [15] "Veillonellaceae" "Enterobacterales"
#> [17] "Corynebacteriaceae" "Lacticaseibacillus zeae"
#> [19] "Bacillus" "Bacillaceae"
#>
#> $`bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN`
#> [1] "Coprobacillus" "Cyanobacteriota"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "Anaerostipes" "Lacticaseibacillus zeae"
As metagenomic profiling with 16S rRNA sequencing or whole-metagenome shotgun sequencing is typically conducted on a certain taxonomic level, it is also possible to obtain signatures restricted to e.g. the genus level …
gn.sigs <- getSignatures(bsdb,
                         tax.id.type = "taxname",
                         tax.level = "genus")
gn.sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "Actinomyces" "Corynebacterium" "Streptococcus"
#> [4] "Peptoniphilus" "Dorea" "Phascolarctobacterium"
#> [7] "Sutterella" "Bacillus"
#>
#> $`bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN`
#> [1] "Coprobacillus"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "Anaerostipes"
… or the species level:
sp.sigs <- getSignatures(bsdb,
                         tax.id.type = "taxname",
                         tax.level = "species")
sp.sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "Lacticaseibacillus zeae"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "Lacticaseibacillus zeae"
#>
#> $`bsdb:1/6/1_Colorectal-adenoma:Non-advanced-conventional-adenoma-cases_vs_controls_UP`
#> [1] "Lacticaseibacillus zeae"
Note that restricting signatures to microbes given at the genus level will by default exclude microbes given at a more specific taxonomic rank such as species or strain.
For certain applications, it might be desirable not to exclude microbes given at a more specific taxonomic rank, but rather to extract the more general tax.level for them.
This can be achieved by setting the argument exact.tax.level to FALSE, which here extracts genus-level taxon names for taxa given at the species or strain level.
gn.sigs <- getSignatures(bsdb,
                         tax.id.type = "taxname",
                         tax.level = "genus",
                         exact.tax.level = FALSE)
gn.sigs[1:3]
#> $`bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP`
#> [1] "Actinomyces" "Corynebacterium" "Streptococcus"
#> [4] "Peptoniphilus" "Dorea" "Phascolarctobacterium"
#> [7] "Sutterella" "Lacticaseibacillus" "Bacillus"
#>
#> $`bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN`
#> [1] "Coprobacillus"
#>
#> $`bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP`
#> [1] "Anaerostipes" "Lacticaseibacillus"
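To make the difference between the two settings concrete, the following sketch applies the idea to two MetaPhlAn strings copied from the output above. The helper extract_level is hypothetical and is not how bugsigdbr implements the option; it only mimics the behaviour described in the text.

```r
# Hypothetical helper: pull the name at one rank out of MetaPhlAn strings.
extract_level <- function(x, prefix = "g__", exact = TRUE) {
    ranks <- strsplit(x, "|", fixed = TRUE)
    out <- vapply(ranks, function(r) {
        hit <- r[startsWith(r, prefix)]
        if (length(hit) == 0) return(NA_character_)
        # exact: keep a taxon only if its most specific rank is the requested one
        if (exact && !startsWith(r[length(r)], prefix)) return(NA_character_)
        sub(prefix, "", hit[1], fixed = TRUE)
    }, character(1))
    unique(out[!is.na(out)])
}

taxa <- c(
    "k__Bacteria|p__Bacillota|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__Anaerostipes",
    "k__Bacteria|p__Bacillota|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lacticaseibacillus|s__Lacticaseibacillus zeae"
)

extract_level(taxa, exact = TRUE)
#> [1] "Anaerostipes"
extract_level(taxa, exact = FALSE)
#> [1] "Anaerostipes"       "Lacticaseibacillus"
```

The second call recovers the genus of the species-level entry, which matches the third signature in the output above.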
Once signatures have been extracted using a taxonomic identifier type of choice, the function writeGMT can be used to write the signatures to plain text files in GMT format.
This is the standard file format for gene sets used by MSigDB and GeneSigDB and is compatible with most enrichment analysis software.
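As an illustration of the GMT layout, the sketch below writes two of the signatures shown earlier with base R only and reads the first line back. The names toy.sigs, gmt.lines and gmt.file are hypothetical, and filling the description field with "NA" is an assumption. With bugsigdbr attached, the corresponding step would be a call such as writeGMT(sigs, gmt.file = "bugsigdb_signatures.gmt"), where the file name is arbitrary.

```r
# Two small signatures copied from the getSignatures() output shown earlier.
toy.sigs <- list(
    "bsdb:1/1/2_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_DOWN" =
        c("100883", "1117"),
    "bsdb:1/2/1_Hyperplastic-Polyp:hyperplastic-polyp-cases_vs_controls_UP" =
        c("207244", "57037")
)

# One line per set: name, description, then the members, all tab-separated.
# The description field is filled with "NA" as a placeholder (assumption).
gmt.lines <- vapply(names(toy.sigs), function(n) {
    paste(c(n, "NA", toy.sigs[[n]]), collapse = "\t")
}, character(1))

gmt.file <- tempfile(fileext = ".gmt")
writeLines(gmt.lines, gmt.file)

# Read the file back and split the first line into its fields.
fields <- strsplit(readLines(gmt.file)[1], "\t", fixed = TRUE)[[1]]
fields[3:4]
#> [1] "100883" "1117"
```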
Leveraging BugSigDB’s semantic MediaWiki web interface, we can also programmatically access annotations for individual microbes and microbe signatures.
The browseSignature function can be used to display BugSigDB signature pages in an interactive session. For programmatic access in a non-interactive setting, the URL of the signature page is returned.
Analogously, the browseTaxon function displays BugSigDB taxon pages in an interactive session, and returns the URL of the corresponding taxon page otherwise.
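A hedged sketch of both calls, using a signature name and an NCBI Taxonomy ID taken from the getSignatures output earlier in this document. The guard on requireNamespace and the variables sig.url and tax.url are illustrative additions so that the snippet also runs where the package is not installed.

```r
# Signature name and taxon ID as they appear in the output shown earlier.
sig.name <- "bsdb:1/1/1_Colorectal-adenoma:conventional-adenoma-cases_vs_controls_UP"
tax.id <- "91061"

sig.url <- NULL
tax.url <- NULL
if (requireNamespace("bugsigdbr", quietly = TRUE)) {
    # Interactive session: opens the page in a browser.
    # Non-interactive session: returns the URL of the page.
    sig.url <- bugsigdbr::browseSignature(sig.name)
    tax.url <- bugsigdbr::browseTaxon(tax.id)
}
```

With the full collection loaded, the same arguments can be taken directly from a signature list, e.g. names(sigs)[1] and sigs[[1]][1].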
The Semantic MediaWiki curation interface at bugsigdb.org enforces metadata annotation of signatures to follow established ontologies such as the Experimental Factor Ontology (EFO) for condition, and the Uber-Anatomy Ontology (UBERON) for body site.
The getOntology function can be used to import both ontologies into R. The result is an object of class ontology_index from the ontologyIndex package.
efo <- getOntology("efo")
#> Loading required namespace: ontologyIndex
efo
#> Ontology with 51510 terms
#>
#> format-version: 1.2
#> data-version: http://www.ebi.ac.uk/efo/releases/v3.72.0/efo.owl
#> ontology: http://www.ebi.ac.uk/efo/efo.owl
#>
#> Properties:
#> id: character
#> name: character
#> parents: list
#> children: list
#> ancestors: list
#> obsolete: logical
#> equivalent_to: list
#> Roots:
#> EFO:0000001 - experimental factor
#> EFO:0000824 - relationship
#> RO:0000053 - bearer_of
#> RO:0000057 - has_participant
#> RO:0000056 - participates_in
#> RO:0002323 - mereotopologically related to
#> RO:0002502 - depends on
#> located_in - located_in
#> location_of - location_of
#> CHEBI:16422 - NA
#> ... 59 more
uberon <- getOntology("uberon")
uberon
#> Ontology with 14107 terms
#>
#> format-version: 1.2
#> data-version: releases/2020-09-16
#> default-namespace: uberon
#> ontology: uberon
#>
#> Properties:
#> id: character
#> name: character
#> parents: list
#> children: list
#> ancestors: list
#> obsolete: logical
#> Roots:
#> part_of - part of
#> has_part - has part
#> functionally_related_to - functionally related to
#> UBERON:0001062 - anatomical entity
#> adjacent_to - adjacent_to
#> UBERON:0000000 - processual entity
#> anterior_to - anterior_to
#> posterior_to - posterior_to
#> attaches_to_part_of - attaches_to_part_of
#> bearer_of - bearer of
#> ... 127 more
As demonstrated above, subsets of BugSigDB signatures can be obtained for signatures associated with certain experimental factors or specific body sites of interest. Higher-level queries can be performed with the subsetByOntology function, which implements subsetting by more general ontology terms. This facilitates grouping of signatures that semantically belong together.
More specifically, subsetting BugSigDB signatures by an EFO term restricts the Condition column to the term itself and to all descendants of that term in the EFO ontology that are present in the Condition column. Here, we demonstrate the usage by subsetting to signatures associated with cancer.
sdf <- subsetByOntology(bsdb,
                        column = "Condition",
                        term = "cancer",
                        ontology = efo)
dim(sdf)
#> [1] 441 50
table(sdf[,"Condition"])
#>
#> Acute myeloid leukemia
#> 11
#> Bladder carcinoma
#> 2
#> Breast cancer
#> 26
#> Breast carcinoma
#> 8
#> Cervical cancer
#> 18
#> Cervical glandular intraepithelial neoplasia,Cervical cancer
#> 2
#> Chronic gastritis,Gastric cancer
#> 6
#> Colorectal adenocarcinoma
#> 9
#> Colorectal carcinoma
#> 13
#> Cutaneous T-cell lymphoma
#> 2
#> Cutaneous melanoma
#> 9
#> Digestive System Carcinoma
#> 4
#> Digestive system cancer
#> 2
#> Endometrial cancer
#> 2
#> Esophageal adenocarcinoma
#> 22
#> Esophageal cancer
#> 10
#> Esophageal carcinoma
#> 2
#> Esophageal squamous cell carcinoma
#> 7
#> Essential thrombocythemia
#> 8
#> Gastric adenocarcinoma
#> 6
#> Gastric cancer
#> 79
#> Gastric carcinoma
#> 2
#> Genital neoplasm, female
#> 2
#> HER2 Positive Breast Carcinoma
#> 2
#> Head and neck carcinoma
#> 7
#> Head and neck squamous cell carcinoma
#> 12
#> Hepatitis virus-related hepatocellular carcinoma
#> 5
#> Hepatocellular carcinoma
#> 12
#> Human papilloma virus infection,Cervical cancer
#> 2
#> Lung cancer
#> 26
#> Metastatic colorectal cancer
#> 6
#> Multiple myeloma
#> 11
#> Mycobacterium tuberculosis,Lung cancer
#> 1
#> Nasopharyngeal squamous cell carcinoma
#> 4
#> Non-small cell lung carcinoma,Renal cell carcinoma,Disease progression measurement
#> 5
#> Oral cavity carcinoma
#> 19
#> Oral squamous cell carcinoma
#> 20
#> Ovarian cancer
#> 6
#> Pancreatic carcinoma
#> 14
#> Papillary thyroid carcinoma
#> 9
#> Prostate cancer
#> 4
#> Prostate carcinoma
#> 2
#> Squamous cell carcinoma
#> 20
#> Thyroid carcinoma
#> 2
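The "term plus all descendants" logic described above can be sketched in base R on a toy hierarchy. The children list, the descendants helper and the toy table are all hypothetical and are not taken from EFO or from the package internals. Case-insensitive matching and splitting multi-valued entries on commas are assumptions made for the illustration.

```r
# Toy parent -> children map (hypothetical terms, not taken from EFO).
children <- list(
    "cancer" = c("carcinoma", "lymphoma"),
    "carcinoma" = "gastric carcinoma"
)

# Collect a term together with all of its descendants (depth-first).
descendants <- function(term, children) {
    c(term, unlist(lapply(children[[term]], descendants, children = children)))
}
terms <- descendants("cancer", children)

# Toy Condition column; one entry lists two conditions separated by a comma.
toy <- data.frame(Condition = c("Gastric carcinoma",
                                "Obesity",
                                "Chronic gastritis,Lymphoma"))

# Keep a row if any of its comma-separated conditions is among the terms.
keep <- vapply(strsplit(tolower(toy$Condition), ",", fixed = TRUE),
               function(v) any(v %in% terms), logical(1))
toy$Condition[keep]
#> [1] "Gastric carcinoma"          "Chronic gastritis,Lymphoma"
```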
Analogously, subsetting by an UBERON term restricts the Body site column to the term itself and to all descendants of that term in the UBERON ontology that are present in the Body site column. For example, we can use subsetByOntology to subset to signatures for which microbiome samples have been obtained from parts of the digestive system.
sdf <- subsetByOntology(bsdb,
                        column = "Body site",
                        term = "digestive system",
                        ontology = uberon)
dim(sdf)
#> [1] 1015 50
table(sdf[,"Body site"])
#>
#> Alimentary part of gastrointestinal system
#> 3
#> Body of stomach
#> 5
#> Bronchus,Mouth
#> 2
#> Buccal mucosa
#> 16
#> Buccal mucosa,Lower lip
#> 8
#> Caecum
#> 24
#> Cardia of stomach
#> 7
#> Cavity of pharynx
#> 2
#> Cecum mucosa
#> 12
#> Cecum mucosa,Colonic mucosa,Mucosa of rectum,Ileal mucosa
#> 5
#> Colon
#> 53
#> Colon,Feces
#> 2
#> Colonic mucosa
#> 8
#> Colorectal mucosa
#> 18
#> Colorectal mucosa,Feces
#> 2
#> Colorectum
#> 3
#> Dental plaque
#> 13
#> Dental plaque,Internal cheek pouch,Saliva
#> 2
#> Digestive tract
#> 8
#> Duodenal mucosa
#> 6
#> Duodenum
#> 16
#> Duodenum,Bile duct
#> 1
#> Epithelium of oropharynx
#> 3
#> Esophagus
#> 53
#> Feces,Colon
#> 4
#> Feces,Colonic mucosa
#> 1
#> Feces,Colorectal mucosa
#> 2
#> Feces,Large intestine
#> 4
#> Feces,Mucosa of descending colon
#> 5
#> Feces,Mucosa of small intestine
#> 4
#> Feces,Spleen
#> 2
#> Feces,Stomach,Caecum,Small intestine,Colon
#> 2
#> Gastric pit
#> 4
#> Gingiva
#> 5
#> Hypopharynx
#> 1
#> Ileum
#> 21
#> Ileum,Colon
#> 2
#> Ileum,Feces
#> 2
#> Ileum,Jejunum
#> 2
#> Ileum,Rectum
#> 2
#> Ileum,Rectum,Feces
#> 2
#> Internal cheek pouch
#> 8
#> Intestinal mucosa
#> 12
#> Intestine
#> 61
#> Jejunum
#> 25
#> Lumen of duodenum
#> 2
#> Midgut
#> 2
#> Mouth
#> 113
#> Mouth mucosa
#> 4
#> Mucosa of ascending colon
#> 2
#> Mucosa of body of stomach
#> 1
#> Mucosa of oral region
#> 2
#> Mucosa of oral region,Vagina,Skin of forehead,Skin of forearm
#> 1
#> Mucosa of oropharynx
#> 4
#> Mucosa of rectum
#> 5
#> Mucosa of sigmoid colon
#> 3
#> Mucosa of small intestine
#> 2
#> Mucosa of stomach
#> 3
#> Nasopharyngeal gland
#> 2
#> Nasopharyngeal gland,Saliva
#> 6
#> Nasopharynx
#> 69
#> Nasopharynx,Lung
#> 3
#> Nasopharynx,Oropharynx
#> 8
#> Nasopharynx,Throat
#> 12
#> Nose,Mouth
#> 2
#> Oral cavity
#> 38
#> Oral cavity,Esophagus
#> 2
#> Oral cavity,Feces
#> 2
#> Oral opening
#> 6
#> Oropharyngeal gland,Saliva
#> 6
#> Oropharynx
#> 32
#> Pharynx
#> 4
#> Posterior wall of oropharynx
#> 6
#> Rectal lumen
#> 2
#> Rectum
#> 23
#> Rumen
#> 7
#> Saliva,Subgingival dental plaque
#> 1
#> Saliva,Supragingival dental plaque
#> 6
#> Small intestine
#> 14
#> Sputum,Mouth
#> 2
#> Stomach
#> 44
#> Subgingival dental plaque
#> 68
#> Subgingival dental plaque,Feces
#> 3
#> Subgingival dental plaque,Saliva
#> 2
#> Superior surface of tongue
#> 2
#> Supragingival dental plaque
#> 8
#> Supragingival dental plaque,Saliva
#> 2
#> Surface of tongue
#> 10
#> Tongue
#> 29
#> Tonsillar fossa
#> 5
#> Wall of small intestine
#> 2
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bugsigdbr_1.13.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
#> [4] RSQLite_2.3.8 digest_0.6.37 magrittr_2.0.3
#> [7] evaluate_1.0.1 fastmap_1.2.0 blob_1.2.4
#> [10] jsonlite_1.8.9 ontologyIndex_2.12 DBI_1.2.3
#> [13] BiocManager_1.30.25 httr_1.4.7 purrr_1.0.2
#> [16] fansi_1.0.6 jquerylib_0.1.4 cli_3.6.3
#> [19] rlang_1.1.4 crayon_1.5.3 dbplyr_2.5.0
#> [22] bit64_4.5.2 withr_3.0.2 cachem_1.1.0
#> [25] yaml_2.3.10 tools_4.4.2 parallel_4.4.2
#> [28] tzdb_0.4.0 memoise_2.0.1 dplyr_1.1.4
#> [31] filelock_1.0.3 curl_6.0.1 buildtools_1.0.0
#> [34] vctrs_0.6.5 R6_2.5.1 BiocFileCache_2.15.0
#> [37] lifecycle_1.0.4 bit_4.5.0 vroom_1.6.5
#> [40] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.8.0
#> [43] glue_1.8.0 xfun_0.49 tibble_3.2.1
#> [46] tidyselect_1.2.1 sys_3.4.3 knitr_1.49
#> [49] htmltools_0.5.8.1 rmarkdown_2.29 maketools_1.3.1
#> [52] compiler_4.4.2