Introduction to dar

An Example

The package includes a dataset from a study by Noguera-Julian, M., et al. 2016, which investigated the differential abundance of microbial species between men who have sex with men (MSM) and non-MSM (hts). This data is stored as an object of the phyloseq class, a standard input format for creating recipes with dar in conjunction with TreeSummarizedExperiment. To begin the analysis, we first load and inspect the data:

library(dar)

data("metaHIV_phy", package = "dar")

metaHIV_phy
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 451 taxa and 156 samples ]
#> sample_data() Sample Data:       [ 156 samples by 3 sample variables ]
#> tax_table()   Taxonomy Table:    [ 451 taxa by 7 taxonomic ranks ]

An Initial Recipe

First, we will create a recipe object from the original data and then specify the processing and differential analysis steps.

Recipes can be created manually by sequentially adding roles to variables in a data set.

The easiest way to create the initial recipe is:

rec_obj <- recipe(metaHIV_phy, var_info = "RiskGroup2", tax_info = "Species") 
rec_obj
#> ── DAR Recipe ──────────────────────────────────────────────────────────────────
#> Inputs:
#> 
#>      ℹ phyloseq object with 451 taxa and 156 samples 
#>      ℹ variable of interes RiskGroup2 (class: character, levels: hts, msm, pwid) 
#>      ℹ taxonomic level Species

The var_info argument corresponds to the variable to be considered in the modeling process and tax_info indicates the taxonomic level that will be used for the analyses.

Preprocessing Steps

From here, preprocessing steps for some step X can be added sequentially in one of two ways:

rec_obj <- step_{X}(rec_obj, arguments)   
## or
rec_obj <- rec_obj |> step_{X}(arguments)

Note that all step_ancom and the other functions will always return updated recipes.

We have two types of steps, those in charge of processing (prepro) and those destined to define the methods of differential analysis (da).

The prepro steps are used to modify the data loaded into the recipe which will then be used for the da steps. The dar package include 3 main preprocessing functionalities.

  • step_subset_taxa: Is used for subsetting the columns and values within the tax_table.

  • step_filter_taxa: Is used for filtering OTUs from recipe objects.

  • step_rarefaction: Is used to resample an OTU table such that all samples have the same library size.

Additionally, the dar package provides convenient wrappers for the step_filter_taxa function, designed to filter Operational Taxonomic Units (OTUs) based on specific criteria: prevalence, variance, abundance, and rarity.

  • step_filter_by_prevalence: Filters OTUs according to the number of samples in which the OTU appears.
  • step_filter_by_variance: Filters OTUs based on the variance of the OTU’s presence across samples.
  • step_filter_by_abundance: Filters OTUs according to the OTU’s abundance across samples.
  • step_filter_by_rarity: Filters OTUs based on the rarity of the OTU across samples.

For our data, we can add an operation to preprocessing the data stored in the initial recpie. First, we will use step_subset_taxa to retain only Bacteria and Archaea OTUs from the Kingdom taxonomic level. We will then filter out OTUs where at least 3% of the samples have counts greater than 0.

rec_obj <- rec_obj |>
  step_subset_taxa(tax_level = "Kingdom", taxa = c("Bacteria", "Archaea")) |>
  step_filter_by_prevalence(0.03)
  
rec_obj
#> ── DAR Recipe ──────────────────────────────────────────────────────────────────
#> Inputs:
#> 
#>      ℹ phyloseq object with 451 taxa and 156 samples 
#>      ℹ variable of interes RiskGroup2 (class: character, levels: hts, msm, pwid) 
#>      ℹ taxonomic level Species 
#> 
#> Preporcessing steps:
#> 
#>      ◉ step_subset_taxa() id = subset_taxa__Jachnun 
#>      ◉ step_filter_by_prevalence() id = filter_by_prevalence__Bougatsa 
#> 
#> DA steps:

Differential Analysis

Now that we have defined the preprocessing of the input data for all the da methods that will be used, we need to define them. For this introduction we will use metagenomeSeq and maaslin2 methods with default parameters (those defined by the authors of each method).

rec_obj <- rec_obj |>
  step_deseq() |>
  step_metagenomeseq(rm_zeros = 0.01) |>
  step_maaslin()

rec_obj
#> ── DAR Recipe ──────────────────────────────────────────────────────────────────
#> Inputs:
#> 
#>      ℹ phyloseq object with 451 taxa and 156 samples 
#>      ℹ variable of interes RiskGroup2 (class: character, levels: hts, msm, pwid) 
#>      ℹ taxonomic level Species 
#> 
#> Preporcessing steps:
#> 
#>      ◉ step_subset_taxa() id = subset_taxa__Jachnun 
#>      ◉ step_filter_by_prevalence() id = filter_by_prevalence__Bougatsa 
#> 
#> DA steps:
#> 
#>      ◉ step_deseq() id = deseq__Nunt 
#>      ◉ step_metagenomeseq() id = metagenomeseq__Kringle 
#>      ◉ step_maaslin() id = maaslin__Kouign_amann

The dar package includes more da steps than those defined above. Below is the full list:

grep(
  "_new|_to_expr|filter|subset|rarefaction",
  grep("^step_", ls("package:dar"), value = TRUE),
  value = TRUE,
  invert = TRUE
)
#> [1] "step_aldex"         "step_ancom"         "step_corncob"      
#> [4] "step_deseq"         "step_lefse"         "step_maaslin"      
#> [7] "step_metagenomeseq" "step_wilcox"

Prep

To ensure the reproducibility and consistency of the generated results, all the steps defined in the recipe are executed at the same time using the prep function.

da_results <- prep(rec_obj, parallel = TRUE)
da_results
#> ── DAR Results ─────────────────────────────────────────────────────────────────
#> Inputs:
#> 
#>      ℹ phyloseq object with 278 taxa and 156 samples 
#>      ℹ variable of interes RiskGroup2 (class: character, levels: hts, msm, pwid) 
#>      ℹ taxonomic level Species 
#> 
#> Results:
#> 
#>      ✔ deseq__Nunt diff_taxa = 166 
#>      ✔ metagenomeseq__Kringle diff_taxa = 233 
#>      ✔ maaslin__Kouign_amann diff_taxa = 146 
#> 
#>      ℹ 64 taxa are present in all tested methods

Note that the resulting object print shows information about the amount of differentially abundant OTUs in each of the methods, as well as the number of OTUs that have been detected by all methods (consensus).

Bake and cool

Now that we have the results we need to extract them, however for this we first need to define a consensus strategy with the bake. For this example we are only interested in those OTUs detected as differentially abundant in the three methods used.

## Number of used methods
count <- steps_ids(da_results, type = "da") |> length()

## Define the bake 
da_results <- bake(da_results, count_cutoff = count)

Finally we can extract the table with the results using the cool function.

cool(da_results)
#> # A tibble: 64 × 2
#>    taxa_id taxa                        
#>    <chr>   <chr>                       
#>  1 Otu_15  Bifidobacterium_catenulatum 
#>  2 Otu_34  Olsenella_scatoligenes      
#>  3 Otu_35  Collinsella_aerofaciens     
#>  4 Otu_37  Collinsella_stercoris       
#>  5 Otu_38  Enorma_massiliensis         
#>  6 Otu_45  Slackia_isoflavoniconvertens
#>  7 Otu_47  Bacteroides_cellulosilyticus
#>  8 Otu_48  Bacteroides_clarus          
#>  9 Otu_63  Bacteroides_plebeius        
#> 10 Otu_69  Bacteroides_sp_CAG_530      
#> # ℹ 54 more rows

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       Ubuntu 24.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2024-10-30
#>  pandoc   3.2.1 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version date (UTC) lib source
#>  ade4               1.7-22  2023-02-06 [2] RSPM (R 4.4.0)
#>  ape                5.8     2024-04-11 [2] RSPM (R 4.4.0)
#>  assertthat         0.2.1   2019-03-21 [2] RSPM (R 4.4.0)
#>  Biobase            2.67.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  BiocGenerics       0.53.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  biomformat         1.35.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  Biostrings         2.75.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  brio               1.1.5   2024-04-24 [2] RSPM (R 4.4.0)
#>  bslib              0.8.0   2024-07-29 [2] RSPM (R 4.4.0)
#>  buildtools         1.0.0   2024-10-28 [3] local (/pkg)
#>  ca                 0.71.1  2020-01-24 [2] RSPM (R 4.4.0)
#>  cachem             1.1.0   2024-05-16 [2] RSPM (R 4.4.0)
#>  cli                3.6.3   2024-06-21 [2] RSPM (R 4.4.0)
#>  cluster            2.1.6   2023-12-01 [2] RSPM (R 4.4.0)
#>  codetools          0.2-20  2024-03-31 [2] RSPM (R 4.4.0)
#>  colorspace         2.1-1   2024-07-26 [2] RSPM (R 4.4.0)
#>  crayon             1.5.3   2024-06-20 [2] RSPM (R 4.4.0)
#>  crosstalk          1.2.1   2023-11-23 [2] RSPM (R 4.4.0)
#>  dar              * 1.3.0   2024-10-30 [1] https://bioc.r-universe.dev (R 4.4.1)
#>  data.table         1.16.2  2024-10-10 [2] RSPM (R 4.4.0)
#>  dendextend         1.18.1  2024-10-13 [2] RSPM (R 4.4.0)
#>  devtools           2.4.5   2022-10-11 [2] RSPM (R 4.4.0)
#>  digest             0.6.37  2024-08-19 [2] RSPM (R 4.4.0)
#>  dplyr              1.1.4   2023-11-17 [2] RSPM (R 4.4.0)
#>  ellipsis           0.3.2   2021-04-29 [2] RSPM (R 4.4.0)
#>  evaluate           1.0.1   2024-10-10 [2] RSPM (R 4.4.0)
#>  fansi              1.0.6   2023-12-08 [2] RSPM (R 4.4.0)
#>  farver             2.1.2   2024-05-13 [2] RSPM (R 4.4.0)
#>  fastmap            1.2.0   2024-05-15 [2] RSPM (R 4.4.0)
#>  foreach            1.5.2   2022-02-02 [2] RSPM (R 4.4.0)
#>  fs                 1.6.4   2024-04-25 [2] RSPM (R 4.4.0)
#>  furrr              0.3.1   2022-08-15 [2] RSPM (R 4.4.0)
#>  future             1.34.0  2024-07-29 [2] RSPM (R 4.4.0)
#>  generics           0.1.3   2022-07-05 [2] RSPM (R 4.4.0)
#>  GenomeInfoDb       1.41.2  2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  GenomeInfoDbData   1.2.13  2024-10-30 [2] Bioconductor
#>  ggplot2            3.5.1   2024-04-23 [2] RSPM (R 4.4.0)
#>  globals            0.16.3  2024-03-08 [2] RSPM (R 4.4.0)
#>  glue               1.8.0   2024-09-30 [2] RSPM (R 4.4.0)
#>  gridExtra          2.3     2017-09-09 [2] RSPM (R 4.4.0)
#>  gtable             0.3.6   2024-10-25 [2] RSPM (R 4.4.0)
#>  heatmaply          1.5.0   2023-10-06 [2] RSPM (R 4.4.0)
#>  highr              0.11    2024-05-26 [2] RSPM (R 4.4.0)
#>  htmltools          0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#>  htmlwidgets        1.6.4   2023-12-06 [2] RSPM (R 4.4.0)
#>  httpuv             1.6.15  2024-03-26 [2] RSPM (R 4.4.0)
#>  httr               1.4.7   2023-08-15 [2] RSPM (R 4.4.0)
#>  igraph             2.1.1   2024-10-19 [2] RSPM (R 4.4.0)
#>  IRanges            2.39.2  2024-10-25 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  iterators          1.0.14  2022-02-05 [2] RSPM (R 4.4.0)
#>  jquerylib          0.1.4   2021-04-26 [2] RSPM (R 4.4.0)
#>  jsonlite           1.8.9   2024-09-20 [2] RSPM (R 4.4.0)
#>  knitr              1.48    2024-07-07 [2] RSPM (R 4.4.0)
#>  labeling           0.4.3   2023-08-29 [2] RSPM (R 4.4.0)
#>  later              1.3.2   2023-12-06 [2] RSPM (R 4.4.0)
#>  lattice            0.22-6  2024-03-20 [2] RSPM (R 4.4.0)
#>  lazyeval           0.2.2   2019-03-15 [2] RSPM (R 4.4.0)
#>  lifecycle          1.0.4   2023-11-07 [2] RSPM (R 4.4.0)
#>  listenv            0.9.1   2024-01-29 [2] RSPM (R 4.4.0)
#>  magrittr           2.0.3   2022-03-30 [2] RSPM (R 4.4.0)
#>  maketools          1.3.1   2024-10-28 [3] Github (jeroen/maketools@d46f92c)
#>  MASS               7.3-61  2024-06-13 [2] RSPM (R 4.4.0)
#>  Matrix             1.7-1   2024-10-18 [2] RSPM (R 4.4.0)
#>  memoise            2.0.1   2021-11-26 [2] RSPM (R 4.4.0)
#>  mgcv               1.9-1   2023-12-21 [2] RSPM (R 4.4.0)
#>  microbiome         1.27.0  2024-10-22 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  mime               0.12    2021-09-28 [2] RSPM (R 4.4.0)
#>  miniUI             0.1.1.1 2018-05-18 [2] RSPM (R 4.4.0)
#>  multtest           2.61.0  2024-10-20 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  munsell            0.5.1   2024-04-01 [2] RSPM (R 4.4.0)
#>  nlme               3.1-166 2024-08-14 [2] RSPM (R 4.4.0)
#>  parallelly         1.38.0  2024-07-27 [2] RSPM (R 4.4.0)
#>  permute            0.9-7   2022-01-27 [2] RSPM (R 4.4.0)
#>  phyloseq           1.49.0  2024-10-24 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  pillar             1.9.0   2023-03-22 [2] RSPM (R 4.4.0)
#>  pkgbuild           1.4.5   2024-10-28 [2] RSPM (R 4.4.0)
#>  pkgconfig          2.0.3   2019-09-22 [2] RSPM (R 4.4.0)
#>  pkgload            1.4.0   2024-06-28 [2] RSPM (R 4.4.0)
#>  plotly             4.10.4  2024-01-13 [2] RSPM (R 4.4.0)
#>  plyr               1.8.9   2023-10-02 [2] RSPM (R 4.4.0)
#>  profvis            0.4.0   2024-09-20 [2] RSPM (R 4.4.0)
#>  promises           1.3.0   2024-04-05 [2] RSPM (R 4.4.0)
#>  purrr              1.0.2   2023-08-10 [2] RSPM (R 4.4.0)
#>  R6                 2.5.1   2021-08-19 [2] RSPM (R 4.4.0)
#>  RColorBrewer       1.1-3   2022-04-03 [2] RSPM (R 4.4.0)
#>  Rcpp               1.0.13  2024-07-17 [2] RSPM (R 4.4.0)
#>  registry           0.5-1   2019-03-05 [2] RSPM (R 4.4.0)
#>  remotes            2.5.0   2024-03-17 [2] RSPM (R 4.4.0)
#>  reshape2           1.4.4   2020-04-09 [2] RSPM (R 4.4.0)
#>  rhdf5              2.49.0  2024-10-28 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  rhdf5filters       1.17.0  2024-10-03 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  Rhdf5lib           1.27.0  2024-10-03 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  rlang              1.1.4   2024-06-04 [2] RSPM (R 4.4.0)
#>  rmarkdown        * 2.28    2024-08-17 [2] RSPM (R 4.4.0)
#>  Rtsne              0.17    2023-12-07 [2] RSPM (R 4.4.0)
#>  S4Vectors          0.43.2  2024-10-17 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  sass               0.4.9   2024-03-15 [2] RSPM (R 4.4.0)
#>  scales             1.3.0   2023-11-28 [2] RSPM (R 4.4.0)
#>  seriation          1.5.6   2024-08-19 [2] RSPM (R 4.4.0)
#>  sessioninfo        1.2.2   2021-12-06 [2] RSPM (R 4.4.0)
#>  shiny              1.9.1   2024-08-01 [2] RSPM (R 4.4.0)
#>  stringi            1.8.4   2024-05-06 [2] RSPM (R 4.4.0)
#>  stringr            1.5.1   2023-11-14 [2] RSPM (R 4.4.0)
#>  survival           3.7-0   2024-06-05 [2] RSPM (R 4.4.0)
#>  sys                3.4.3   2024-10-04 [2] RSPM (R 4.4.0)
#>  testthat           3.2.1.1 2024-04-14 [2] RSPM (R 4.4.0)
#>  tibble             3.2.1   2023-03-20 [2] RSPM (R 4.4.0)
#>  tidyr              1.3.1   2024-01-24 [2] RSPM (R 4.4.0)
#>  tidyselect         1.2.1   2024-03-11 [2] RSPM (R 4.4.0)
#>  TSP                1.2-4   2023-04-04 [2] RSPM (R 4.4.0)
#>  UCSC.utils         1.1.0   2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  UpSetR             1.4.0   2019-05-22 [2] RSPM (R 4.4.0)
#>  urlchecker         1.0.1   2021-11-30 [2] RSPM (R 4.4.0)
#>  usethis            3.0.0   2024-07-29 [2] RSPM (R 4.4.0)
#>  utf8               1.2.4   2023-10-22 [2] RSPM (R 4.4.0)
#>  vctrs              0.6.5   2023-12-01 [2] RSPM (R 4.4.0)
#>  vegan              2.6-8   2024-08-28 [2] RSPM (R 4.4.0)
#>  viridis            0.6.5   2024-01-29 [2] RSPM (R 4.4.0)
#>  viridisLite        0.4.2   2023-05-02 [2] RSPM (R 4.4.0)
#>  webshot            0.5.5   2023-06-26 [2] RSPM (R 4.4.0)
#>  withr              3.0.2   2024-10-28 [2] RSPM (R 4.4.0)
#>  xfun               0.48    2024-10-03 [2] RSPM (R 4.4.0)
#>  xtable             1.8-4   2019-04-21 [2] RSPM (R 4.4.0)
#>  XVector            0.45.0  2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  yaml               2.3.10  2024-07-26 [2] RSPM (R 4.4.0)
#>  zlibbioc           1.51.2  2024-10-21 [2] Bioconductor 3.20 (R 4.4.1)
#> 
#>  [1] /tmp/Rtmplc7Qga/Rinst44d2321d7110
#>  [2] /github/workspace/pkglib
#>  [3] /usr/local/lib/R/site-library
#>  [4] /usr/lib/R/site-library
#>  [5] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────