| Title: | Taxon Set Enrichment Analysis |
|---|---|
| Description: | TaxSEA is an R package for Taxon Set Enrichment Analysis. It uses a Kolmogorov-Smirnov test to investigate whether differential abundance analysis output shows alterations in a priori defined sets of taxa drawn from public databases (BugSigDB, MiMeDB, GutMGene, mBodyMap, BacDive and GMRepoV2) and collated from the literature. TaxSEA takes as input a list of taxonomic identifiers (e.g. species names, NCBI IDs) and a rank for each (e.g. fold change, correlation coefficient). TaxSEA can be applied to any microbiota taxonomic profiling technology (array-based, 16S rRNA gene sequencing, shotgun metagenomics, metatranscriptomics etc.) and enables researchers to rapidly contextualize their findings within the broader literature to accelerate interpretation of results. |
| Authors: | Feargal Ryan [aut, cre, fnd] (ORCID: <https://orcid.org/0000-0002-1565-4598>, funding: Supported by NHMRC Investigator Grant) |
| Maintainer: | Feargal Ryan <[email protected]> |
| License: | GPL-3 |
| Version: | 1.5.0 |
| Built: | 2026-05-30 09:44:01 UTC |
| Source: | https://github.com/bioc/TaxSEA |
This function takes a vector of taxon names and returns a vector of NCBI taxonomy IDs by querying the NCBI Entrez API.
get_ncbi_taxon_ids(taxon_names)
| taxon_names | A character vector of taxon names |
|---|---|
A character vector of NCBI taxonomy IDs corresponding to the input taxon names
taxon_names <- c("Escherichia coli", "Staphylococcus aureus")
taxon_ids <- get_ncbi_taxon_ids(taxon_names)
Retrieve from the TaxSEA database which taxon sets (metabolite producers and disease signatures) contain a taxon of interest.
get_taxon_sets(taxon_to_fetch = taxon)
| taxon_to_fetch | The taxon to search for in the TaxSEA database. |
|---|---|
A character vector containing the names of taxonomic sets where the specified taxon is present.
# Retrieve sets for Bifidobacterium longum
get_taxon_sets(taxon_to_fetch = "Bifidobacterium_longum")
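The lookup described above can be pictured as a membership test over a named list of character vectors. The following is a minimal base R sketch, not the package's implementation: `toy_db`, its set names and the helper `find_sets_containing()` are all hypothetical and exist only for illustration.

```r
# Toy stand-in for a taxon set database: a named list of character vectors.
# Set names and members are invented for illustration.
toy_db <- list(
  producers_of_X = c("Bifidobacterium_longum", "Escherichia_coli"),
  disease_Y      = c("Staphylococcus_aureus", "Escherichia_coli"),
  producers_of_Z = c("Bifidobacterium_longum", "Bifidobacterium_breve")
)

# Hypothetical helper: return the names of all sets that contain the taxon.
find_sets_containing <- function(taxon, db) {
  hits <- vapply(db, function(members) taxon %in% members, logical(1))
  names(db)[hits]
}

sets_for_longum <- find_sets_containing("Bifidobacterium_longum", toy_db)
print(sets_for_longum)
```

The result is a character vector of set names, which matches the shape of the return value documented for get_taxon_sets.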
A dataset for mapping NCBI IDs to species/genus names. This named vector allows for lookup of NCBI IDs associated with species or genus names.
NCBI_ids
A named vector where:
NCBI IDs
Species or genus names
A named vector mapping NCBI IDs to species or genus names.
NCBI
data(NCBI_ids)
# Can look up either with or without spaces
NCBI_ids["Bifidobacterium_breve"]
NCBI_ids["Bifidobacterium breve"]
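One way to get a lookup that tolerates both spellings is to normalise the query before indexing a named vector. The sketch below assumes a vector keyed by underscore-separated names; `toy_ids`, its placeholder IDs and the helper `lookup_id()` are hypothetical, and the real NCBI_ids object may be organised differently.

```r
# Toy named vector: names are taxon names, values are IDs stored as strings.
# These IDs are placeholders, not real NCBI identifiers.
toy_ids <- c(Genus_speciesA = "id_001", Genus_speciesB = "id_002")

# Hypothetical helper: replace spaces with underscores, then index by name.
# Taxa that are absent from the map come back as NA.
lookup_id <- function(taxa, id_map) {
  keys <- gsub(" ", "_", taxa, fixed = TRUE)
  unname(id_map[keys])
}

res <- lookup_id(c("Genus speciesA", "Genus_speciesB", "Unknown taxon"), toy_ids)
print(res)
```

Returning NA for unknown names keeps the output the same length as the input, which makes it easy to see which taxa would need a remote lookup.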
Computes per-sample enrichment scores for taxon sets using an ssGSEA-style approach. Counts are CLR-transformed, then each taxon is z-scored across samples to capture between-sample variation. Per-sample enrichment is then computed by ranking each sample's z-scores and applying a weighted running-sum statistic against taxon sets from the TaxSEA database.
ssTaxSEA(
  counts,
  lookup_missing = FALSE,
  min_set_size = 5,
  max_set_size = 300,
  custom_db = NULL
)
| counts | A numeric matrix, data.frame, or |
|---|---|
| lookup_missing | Logical indicating whether to fetch missing NCBI IDs via the NCBI API. Default is FALSE. |
| min_set_size | Minimum size of taxon sets to include. Default is 5. |
| max_set_size | Maximum size of taxon sets to include. Default is 300. |
| custom_db | A user-provided list of taxon sets. If NULL (default), the built-in TaxSEA database is used (excluding BugSigDB). |
The approach works as follows:
Raw counts are CLR-transformed (centered log-ratio with pseudocount of 0.5 for zeros).
Each taxon is then z-scored across all samples, so that values represent how much higher or lower a taxon is in a given sample relative to the cohort mean.
For each sample, the z-scores are ranked and an ssGSEA-style weighted running-sum enrichment score is computed for each taxon set.
A KS test p-value is also computed per sample per set.
This cohort-relative approach ensures that taxon sets which are consistently elevated in a subset of samples (e.g. disease samples) will produce high enrichment scores in those samples, even if the taxa are not the most abundant within any single sample.
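The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: zeros are replaced by 0.5 before taking logs, the running-sum statistic uses |z|^alpha weights with an arbitrary alpha = 0.25 and sums the difference between the hit and miss cumulative distributions, and the helper names (`clr_transform`, `running_sum_score`) are hypothetical. TaxSEA's exact weighting and normalisation may differ.

```r
set.seed(1)
# Toy count matrix, taxa x samples
counts <- matrix(rpois(200, lambda = 10), nrow = 20, ncol = 10,
                 dimnames = list(paste0("Taxon_", 1:20),
                                 paste0("Sample_", 1:10)))
counts[1, 1] <- 0  # force one zero so the pseudocount is exercised

# Step 1: CLR per sample (column). Assumption: zeros become 0.5.
clr_transform <- function(x) {
  x[x == 0] <- 0.5
  log_x <- log(x)
  sweep(log_x, 2, colMeans(log_x), "-")
}
clr <- clr_transform(counts)

# Step 2: z-score each taxon (row) across samples.
z <- t(scale(t(clr)))

# Step 3: weighted running-sum score for one sample and one set (assumed form).
running_sum_score <- function(stats, in_set, alpha = 0.25) {
  ord <- order(stats, decreasing = TRUE)
  w <- abs(stats[ord])^alpha
  hit <- in_set[ord]
  p_hit <- cumsum(ifelse(hit, w, 0)) / sum(w[hit])
  p_miss <- cumsum(!hit) / sum(!hit)
  sum(p_hit - p_miss)
}

set_members <- list(set1 = paste0("Taxon_", 1:5),
                    set2 = paste0("Taxon_", 10:15))
in_set_list <- lapply(set_members, function(m) rownames(z) %in% m)

# Samples x sets matrix of enrichment scores
scores <- t(vapply(colnames(z), function(s) {
  vapply(in_set_list, function(in_set) running_sum_score(z[, s], in_set),
         numeric(1))
}, numeric(length(in_set_list))))

# Step 4: KS test p-value per sample per set (set members vs. all other taxa)
pvalues <- t(vapply(colnames(z), function(s) {
  vapply(in_set_list, function(in_set) {
    ks.test(z[in_set, s], z[!in_set, s])$p.value
  }, numeric(1))
}, numeric(length(in_set_list))))

print(dim(scores))
```

Because the z-scoring is done per taxon before ranking within a sample, a set scores highly in a sample when its members sit above their own cohort means, regardless of their absolute abundance.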
A list with two elements:
scores: A matrix (samples x taxon sets) of enrichment scores. Positive scores indicate the set taxa tend to have higher abundance in that sample relative to the cohort.
pvalues: A matrix (samples x taxon sets) of KS test p-values for each sample-set combination.
## Not run:
# From a count matrix (taxa x samples)
counts <- matrix(rpois(500, lambda = 10), nrow = 50, ncol = 10)
rownames(counts) <- paste0("Taxon_", seq_len(50))
colnames(counts) <- paste0("Sample_", seq_len(10))
res <- ssTaxSEA(counts,
  custom_db = list(
    set1 = paste0("Taxon_", 1:10),
    set2 = paste0("Taxon_", 20:30)
  ),
  min_set_size = 2)
head(res$scores)
head(res$pvalues)
## End(Not run)
Groups species by taxonomic ranks and performs TaxSEA enrichment analysis at each rank level. Returns a named list of data frames, one per taxonomic rank (excluding species).
taxon_rank_sets(taxon_ranks, lineage_df, min_set_size = 5, max_set_size = 100)
taxon_ranks |
A named numeric vector of log2 fold changes.
Names should be feature identifiers matching the
|
lineage_df |
Either a data frame or a
Data frame input: Must include a SummarizedExperiment input: Taxonomy is extracted from
|
min_set_size |
Minimum number of species in a set to include in the analysis. Default is 5. |
max_set_size |
Maximum number of species in a set to include in the analysis. Default is 100. |
A named list of data frames, one per taxonomic rank. Each data frame contains columns: taxonSetName, median_rank_of_set_members, PValue, Test_statistic, and FDR. Ranks that produce no valid sets (e.g., due to size filtering) are included as empty data frames with a message.
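To make the grouping and size filtering concrete, the sketch below builds taxon sets from a lineage table with `split()` and drops sets outside the size limits. It is a simplified illustration only: `toy_lineage` and the helper `build_rank_sets()` are hypothetical, and the enrichment test that taxon_rank_sets runs on each set is not reproduced here.

```r
# Toy lineage table; species, genus and family labels are invented.
toy_lineage <- data.frame(
  species = paste0("sp", 1:6),
  genus   = c("G1", "G1", "G1", "G2", "G2", "G3"),
  family  = c("F1", "F1", "F1", "F1", "F1", "F2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one set per value of the chosen rank,
# keeping only sets whose size lies within [min_size, max_size].
build_rank_sets <- function(lineage, rank, min_size, max_size) {
  sets <- split(lineage$species, lineage[[rank]])
  sizes <- lengths(sets)
  sets[sizes >= min_size & sizes <= max_size]
}

genus_sets  <- build_rank_sets(toy_lineage, "genus",  min_size = 2, max_size = 100)
family_sets <- build_rank_sets(toy_lineage, "family", min_size = 2, max_size = 100)

print(names(genus_sets))
print(names(family_sets))
```

Here the single-species groups (G3 and F2) fall below the minimum size and are dropped, which is the situation in which a rank can end up with few or no valid sets.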
# --- Example 1: Data frame input ---
# Create a lineage data frame (e.g., parsed from curatedMetagenomicData)
# The 'species' column must match the names in taxon_ranks.
lineage_df <- data.frame(
  species = c("Cutibacterium_acnes", "Klebsiella_pneumoniae",
              "Propionibacterium_humerusii", "Moraxella_osloensis",
              "Enhydrobacter_aerosaccus", "Staphylococcus_capitis",
              "Staphylococcus_epidermidis", "Staphylococcus_aureus",
              "Escherichia_coli", "Enterobacter_cloacae",
              "Pseudomonas_aeruginosa", "Acinetobacter_baumannii",
              "Lactobacillus_rhamnosus", "Lactobacillus_acidophilus",
              "Bifidobacterium_longum", "Bifidobacterium_breve"),
  kingdom = rep("Bacteria", 16),
  phylum = c("Actinobacteria", "Proteobacteria", "Actinobacteria",
             "Proteobacteria", "Proteobacteria", "Firmicutes",
             "Firmicutes", "Firmicutes", "Proteobacteria",
             "Proteobacteria", "Proteobacteria", "Proteobacteria",
             "Firmicutes", "Firmicutes", "Actinobacteria",
             "Actinobacteria"),
  class = c("Actinobacteria", "Gammaproteobacteria", "Actinobacteria",
            "Gammaproteobacteria", "Alphaproteobacteria", "Bacilli",
            "Bacilli", "Bacilli", "Gammaproteobacteria",
            "Gammaproteobacteria", "Gammaproteobacteria",
            "Gammaproteobacteria", "Bacilli", "Bacilli",
            "Actinobacteria", "Actinobacteria"),
  order = c("Propionibacteriales", "Enterobacterales",
            "Propionibacteriales", "Pseudomonadales",
            "Rhodospirillales", "Bacillales", "Bacillales",
            "Bacillales", "Enterobacterales", "Enterobacterales",
            "Pseudomonadales", "Pseudomonadales", "Lactobacillales",
            "Lactobacillales", "Bifidobacteriales", "Bifidobacteriales"),
  family = c("Propionibacteriaceae", "Enterobacteriaceae",
             "Propionibacteriaceae", "Moraxellaceae",
             "Rhodospirillaceae", "Staphylococcaceae",
             "Staphylococcaceae", "Staphylococcaceae",
             "Enterobacteriaceae", "Enterobacteriaceae",
             "Pseudomonadaceae", "Moraxellaceae", "Lactobacillaceae",
             "Lactobacillaceae", "Bifidobacteriaceae",
             "Bifidobacteriaceae"),
  genus = c("Cutibacterium", "Klebsiella", "Cutibacterium", "Moraxella",
            "Enhydrobacter", "Staphylococcus", "Staphylococcus",
            "Staphylococcus", "Escherichia", "Enterobacter",
            "Pseudomonas", "Acinetobacter", "Lactobacillus",
            "Lactobacillus", "Bifidobacterium", "Bifidobacterium"),
  stringsAsFactors = FALSE
)
set.seed(42)
fc <- setNames(rnorm(16), lineage_df$species)
results <- taxon_rank_sets(fc, lineage_df, min_set_size = 2)
names(results)
results$family

# --- Example 2: SummarizedExperiment / TreeSummarizedExperiment input ---
## Not run:
library(mia)
data(GlobalPatterns, package = "mia")
tse <- GlobalPatterns
# Run differential abundance (e.g., ALDEx2) to get fold changes
# aldex_out <- ... (your DA analysis)
# fc <- aldex_out$effect
# names(fc) <- rownames(aldex_out)
# Run taxon rank set enrichment directly from the TSE
results <- taxon_rank_sets(fc, tse, min_set_size = 5)
names(results) # Kingdom, Phylum, Class, Order, Family, Genus
results$Family
## End(Not run)
Modular TaxSEA implementation supporting enrichment analysis (Kolmogorov-Smirnov test) and over-representation analysis (ORA; Fisher's exact test). Provide either taxon_ranks for enrichment or input_taxa for ORA.
TaxSEA(
  taxon_ranks = NULL,
  input_taxa = NULL,
  mode = NULL,
  lookup_missing = FALSE,
  min_set_size = 5,
  max_set_size = 300,
  custom_db = NULL
)
| taxon_ranks | Named numeric vector of statistics (e.g., log2 fold changes). Required for enrichment. |
|---|---|
| input_taxa | Character vector of taxa to treat as "hits"/selected taxa. Required for ORA. |
| mode | Character. One of |
| lookup_missing | Logical indicating whether to fetch missing NCBI IDs. Default is FALSE. |
| min_set_size | Minimum size of taxon sets to include in the analysis. Default is 5. |
| max_set_size | Maximum size of taxon sets to include in the analysis. Default is 300. |
| custom_db | A user-provided list of taxon sets. If NULL (default), the built-in database is used. |
A list of data frames with taxon set results.
data("TaxSEA_test_data")
res <- TaxSEA(taxon_ranks = TaxSEA_test_data)
head(res$All_databases)

# ORA example (toy): treat taxa with positive values as "hits"
hits <- names(TaxSEA_test_data)[TaxSEA_test_data > 0]
res_ora <- TaxSEA(input_taxa = hits, mode = "ora")
head(res_ora$All_databases)
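The difference between the two modes can be illustrated with base R's `ks.test()` and `fisher.test()` on simulated data. This is a sketch of the general techniques, not of TaxSEA's internals: the simulated ranks, the single toy set, the hit threshold of 1 and the choice of all input taxa as the ORA background are assumptions made for the example.

```r
set.seed(7)
universe <- paste0("Taxon_", 1:40)
ranks <- setNames(rnorm(40), universe)

# One toy taxon set whose members are shifted upward on purpose
taxon_set <- paste0("Taxon_", 1:8)
ranks[taxon_set] <- ranks[taxon_set] + 2
in_set <- names(ranks) %in% taxon_set

# Enrichment mode: compare the rank distribution of set members
# against all other taxa with a two-sample KS test.
ks <- ks.test(ranks[in_set], ranks[!in_set])

# ORA mode: start from a list of "hits" and test whether set members
# are over-represented among them with a one-sided Fisher's exact test.
hits <- names(ranks)[ranks > 1]  # arbitrary threshold for illustration
is_hit <- names(ranks) %in% hits
tab <- table(in_set = factor(in_set, levels = c(TRUE, FALSE)),
             is_hit = factor(is_hit, levels = c(TRUE, FALSE)))
ora <- fisher.test(tab, alternative = "greater")

print(ks$p.value)
print(ora$p.value)
```

Enrichment uses the full ranked vector and needs no cutoff, whereas ORA depends on how the hit list was chosen, so the two modes can give different answers for the same set.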
TaxSEA database. A dataset containing taxon sets. Each item in the list is a taxon set, and each member within a taxon set is a taxon.
TaxSEA_db
A list of vectors. Each vector contains character strings representing taxa.
See README.
data(TaxSEA_db)
all_sets <- names(TaxSEA_db)
GABA_producers <- TaxSEA_db[["MiMeDB_producers_of_GABA"]]
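Because the database is a plain named list, the min_set_size and max_set_size arguments used elsewhere in the package can be pictured as a filter on list element lengths. The sketch below assumes that sets are first restricted to taxa present in the input and then filtered by size; that ordering, the toy list `toy_db` and the vector `observed` are assumptions for illustration rather than documented behaviour.

```r
# Toy database and toy list of observed taxa (all names invented)
toy_db <- list(
  set_small  = c("t1", "t2"),
  set_medium = c("t1", "t2", "t3", "t4"),
  set_large  = paste0("t", 1:12)
)
observed <- paste0("t", 1:8)

# Keep only the set members that were actually observed
trimmed <- lapply(toy_db, intersect, y = observed)

# Then keep sets whose remaining size lies between 3 and 10
sizes <- lengths(trimmed)
kept <- trimmed[sizes >= 3 & sizes <= 10]

print(sizes)
print(names(kept))
```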
A dataset containing taxon ranks and taxon IDs.
TaxSEA_test_data
A data frame with two columns:
rank: Character vector representing taxon ranks
id: Character vector representing taxon IDs
A data frame with columns 'rank' and 'id' representing taxon ranks and taxon IDs, respectively.
See README.
data(TaxSEA_test_data)
test_results <- TaxSEA(taxon_ranks = TaxSEA_test_data)