Metagenomics bioinformatics at MGnify

Introduction

In this notebook we aim to demonstrate how the MGnifyR tool can be used to fetch data of a MGnify microbiome data resource. We then showcase how to analyze the datausing advanced microbiome data science tools, including estimating alpha and beta diversity, as well as performing differential abundance analysis.

MGnifyR is an R/Bioconductor package that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the MGnify API.

The benefit of MGnifyR is that it streamlines data access, allowing users to fetch data either in its “raw” format or directly as a TreeSummarizedExperiment (TreeSE) object. This enables seamless integration with custom workflows for analysis.

Utilizing TreeSE provides access to a wide range of tools within Bioconductor’s SummarizedExperiment (SE) ecosystem. It also integrates with the mia package, which offers microbiome-specific methods within the SE framework.

For more information on microbiome data science in Bioconductor, refer to Orchestrating Microbiome Analysis (OMA) online book.

Load packages

# List of packages that we need
packages <- c("ANCOMBC", "MGnifyR", "mia",  "miaViz", "scater")

# Get packages that are already installed
packages_already_installed <- packages[ packages %in% installed.packages() ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )
# Loads BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}
# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
   install(packages_need_to_install, ask = FALSE)
}

# Load all packages into session. Stop if there are packages that were not
# successfully loaded
pkgs_not_loaded <- !sapply(packages, require, character.only = TRUE)
pkgs_not_loaded <- names(pkgs_not_loaded)[ pkgs_not_loaded ]
if( length(pkgs_not_loaded) > 0 ){
    stop(
        "Error in loading the following packages into the session: '",
        paste0(pkgs_not_loaded, collapse = "', '"), "'")
}

Data import

To interact with the MGnify database, we need to create a MgnifyClient object. This object allows us to store options for data fetching. For instance, we can configure it to use a cache for improved efficiency.

# Create the MgnifyClient object with caching enabled
mg <- MgnifyClient(
  useCache = TRUE,
  cacheDir = "/home/training" # Set this to your desired cache directory
  )

In this workflow, we will fetch taxonomy annotations and metadata from the study “MGYS00005154”. The dataset focuses on the human gut microbiome, analyzed across different geographic regions.

We can now search for all analyses associated with the certain study. The analysis refers to metagenomic runs performed to samples. Each sample can have multiple runs made, which is why we work with analyses and not with samples; analysis identifier points to a single entity.

study_id <- "MGYS00005154"
analysis_id <- searchAnalysis(mg, "studies", study_id)

Now we are ready to load the metadata on the analyses to get idea on what kind of data we are dealing with.

There are currently (17 Sep 2024) almost 1,000 analyses available. Downloading whole dataset will take some time, which is why we use memory cache.

metadata <- getMetadata(mg, accession = analysis_id)

We can see that there are analyses that are performed with different pipelines. Let’s take only those analyses that are generated with the pipeline version 5.0.

metadata <- metadata[metadata[["analysis_pipeline-version"]] == "5.0", ]

We have now analyses that each point to unique sample. The final step is to fetch abundance tables in TreeSummarizedExperiment (TreeSE) format.

tse <- getResult(
    mg,
    accession = metadata[["analysis_accession"]],
    get.func = FALSE
    )
tse

The fetched data is TreeSE object, including taxonomy annotations. See OMA online book on how to handle the data in this format.

Preprocessing

Below, we agglomerate the data to the Order level, meaning we summarize the abundances at this specific taxonomic rank. The OMA provides a detailed chapter explaining agglomeration in more depth.

tse_order <- agglomerateByRank(tse, rank = "Order")

Because of the unique properties of microbiome data, we have to apply transformations. Here, we perform relative transformation. You can find more information on transformations from OMA.

# Transform the main TreeSE
tse <- transformAssay(tse, method = "relabundance")
# Transform the agglomerated TreeSE
tse_order <- transformAssay(tse_order, method = "relabundance")

Alpha diversity

Alpha diversity measures community diversity within a sample. Learn more on community diversity from here.

# Calculate alpha diversity
tse <- addAlpha(tse)

# Create a plot
p <- plotColData(
  tse,
  y = "shannon_diversity",
  x = "sample_geographic.location..country.and.or.sea.region.",
  show_boxplot = TRUE
  )
p

We can test whether the diversity differences are statistically significant. We utilize Mann Whithney U test (or Wilcoxon test).

pairwise.wilcox.test(
    tse[["shannon_diversity"]],
    tse[["sample_geographic.location..country.and.or.sea.region."]],
    p.adjust.method = "fdr"
    )

To add p-values to the plot, see OMA.

Beta diversity

We can assess the differences in microbial compositions between samples, aiming to identify patterns in the data that are associated with covariates.

To achieve this, we perform Principal Coordinate Analysis (PCoA) using Bray-Curtis dissimilarity.

# Perform PCoA
tse <- runMDS(
  tse,
  FUN = getDissimilarity,
  method = "bray",
  assay.type = "relabundance"
)
# Visualize PCoA
p <- plotReducedDim(
  tse,
  dimred = "MDS",
  colour_by = "sample_geographic.location..country.and.or.sea.region."
  )
p

See community similarity chapter from OMA for more information.

Differential abundance analysis (DAA)

In DAA, we analyze whether abundances of certain features vary between study groups. Again, OMA has a dedicated chapter also on this topic.

# Perform DAA
res <- ancombc2(
    data = tse_order,
    assay.type = "counts",
    fix_formula = "sample_geographic.location..country.and.or.sea.region.",
    p_adj_method = "fdr",
    )

Next we visualize features that have the lowest p-values.

# Get the most significant features
n_top <- 4
res_tab <- res[["res"]]
res_tab <- res_tab[order(res_tab[["q_(Intercept)"]]), ]
top_feat <- res_tab[seq_len(n_top), "taxon"]

# Create a plot
p <- plotExpression(
  tse_order,
  features = top_feat,
  assay.type = "relabundance",
  x = "sample_geographic.location..country.and.or.sea.region.",
  show_boxplot = TRUE, show_violin = FALSE, point_shape = NA
  ) +
  scale_y_log10()
p

Session info

sessionInfo()