--- title: "DirichletMultinomial for Clustering and Classification of Microbiome Data" date: "`r BiocStyle::doc_date()`" author: - name: Martin Morgan affiliation: - Roswell Park Comprehensive Cancer Center, Buffalo, NY vignette: > %\VignetteIndexEntry{DirichletMultinomial for Clustering and Classification of Microbiome Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true package: DirichletMultinomial --- Modified: 6 March 2012, 19 October 2024 (HTML version) ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This document illustrates the main features of the *DirichletMultinomial* package, and in the process replicates key tables and figures from Holmes et al., . We start by loading the package, in addition to the packages *lattice* (for visualization) and *parallel* (for use of multiple cores during cross-validation). ```{r library, message = FALSE} library(DirichletMultinomial) library(lattice) library(parallel) ``` We set the width of [R]{.sans-serif} output to 70 characters, and the number of floating point digits displayed to two. The `full` flag is set to `FALSE`, so that cached values are used instead of re-computing during production of this vignette. The package defines a set of standard colors; we use `.qualitative` during visualization. ```{r colors} options(width=70, digits=2) full <- FALSE .qualitative <- DirichletMultinomial:::.qualitative ``` # Data The data used in Homes et al. is included in the package. We read the data in to a matrix `count` of samples by taxa. ```{r data-input} fl <- system.file(package="DirichletMultinomial", "extdata", "Twins.csv") count <- t(as.matrix(read.csv(fl, row.names=1))) count[1:5, 1:3] ``` The figure below shows the distribution of reads from each taxon, on a log scale. ```{r taxon-counts} cnts <- log10(colSums(count)) densityplot( cnts, xlim=range(cnts), xlab="Taxon representation (log 10 count)" ) ``` # Clustering The `dmn` function fits a Dirichlet-Multinomial model, taking as input the count data and a parameter $k$ representing the number of Dirichlet components to model. Here we fit the count data to values of $k$ from 1 to 7, displaying the result for $k = 4$. A sense of the model return value is provided by the documentation for the [R]{.sans-serif} object `fit`, `class ? DMN`. ```{r fit} if (full) { fit <- mclapply(1:7, dmn, count=count, verbose=TRUE) save(fit, file=file.path(tempdir(), "fit.rda")) } else data(fit) fit[[4]] ``` The return value can be queried for measures of fit (Laplace, AIC, BIC); these are plotted for different $k$ in The figure. The best fit is for $k=4$ distinct Dirichlet components. ```{r min-laplace, figure=TRUE} lplc <- sapply(fit, laplace) plot(lplc, type="b", xlab="Number of Dirichlet Components", ylab="Model Fit") (best <- fit[[which.min(lplc)]]) ``` In addition to `laplace` goodness of fit can be assessed with the `AIC` and `BIC` functions. The `mixturewt` function reports the weight $\pi$ and homogeneity $\theta$ (large values are more homogeneous) of the fitted model. `mixture` returns a matrix of sample x estimated Dirichlet components; the argument `assign` returns a vector of length equal to the number of samples indicating the component with maximum value. ```{r mix-weight} mixturewt(best) head(mixture(best), 3) ``` The `fitted` function describes the contribution of each taxonomic group (each point in the panels of the figure to the Dirichlet components; the diagonal nature of the points in a panel suggest that the Dirichlet components are correlated, perhaps reflecting overall numerical abundance. ```{r fitted} splom(log(fitted(best))) ``` The posterior mean difference between the best and single-component Dirichlet multinomial model measures how each component differs from the population average; the sum is a measure of total difference from the mean. ```{r posterior-mean-diff} p0 <- fitted(fit[[1]], scale=TRUE) # scale by theta p4 <- fitted(best, scale=TRUE) colnames(p4) <- paste("m", 1:4, sep="") (meandiff <- colSums(abs(p4 - as.vector(p0)))) sum(meandiff) ``` The table below summarizes taxonomic contributions to each Dirichlet component. ```{r table-1} diff <- rowSums(abs(p4 - as.vector(p0))) o <- order(diff, decreasing=TRUE) cdiff <- cumsum(diff[o]) / sum(diff) df <- cbind(Mean=p0[o], p4[o,], diff=diff[o], cdiff) DT::datatable(df) |> DT::formatRound(colnames(df), digits = 4) ``` The figure shows samples arranged by Dirichlet component, with samples placed into the component for which they had the largest fitted value. ```{r heatmap-similarity} heatmapdmn(count, fit[[1]], best, 30) ``` # Generative classifier The following reads in phenotypic information ('Lean', 'Obese', 'Overweight') for each sample. ```{r twin-pheno} fl <- system.file(package="DirichletMultinomial", "extdata", "TwinStudy.t") pheno0 <- scan(fl) lvls <- c("Lean", "Obese", "Overwt") pheno <- factor(lvls[pheno0 + 1], levels=lvls) names(pheno) <- rownames(count) table(pheno) ``` Here we subset the count data into sub-counts, one for each phenotype. We retain only the Lean and Obese groups for subsequent analysis. ```{r subsets} counts <- lapply(levels(pheno), csubset, count, pheno) sapply(counts, dim) keep <- c("Lean", "Obese") count <- count[pheno %in% keep,] pheno <- factor(pheno[pheno %in% keep], levels=keep) ``` The `dmngroup` function identifies the best (minimum Laplace score) Dirichlet-multinomial model for each group. ```{r fit-several-} if (full) { bestgrp <- dmngroup( count, pheno, k=1:5, verbose=TRUE, mc.preschedule=FALSE ) save(bestgrp, file=file.path(tempdir(), "bestgrp.rda")) } else data(bestgrp) ``` The Lean group is described by a model with one component, the Obese group by a model with three components. Three of the four Dirichlet components of the original single group (`best`) model are represented in the Obese group, the other in the Lean group. The total Laplace score of the two group model is less than of the single-group model, indicating information gain from considering groups separately. ```{r best-several} bestgrp lapply(bestgrp, mixturewt) c( sapply(bestgrp, laplace), 'Lean+Obese' = sum(sapply(bestgrp, laplace)), Single = laplace(best) ) ``` The `predict` function assigns samples to classes; the confusion matrix shows that the classifier is moderately effective. ```{r confusion} xtabs(~pheno + predict(bestgrp, count, assign=TRUE)) ``` The `cvdmngroup` function performs cross-validation. This is a computationally expensive step. ```{r cross-validate} if (full) { ## full leave-one-out; expensive! xval <- cvdmngroup( nrow(count), count, c(Lean=1, Obese=3), pheno, verbose=TRUE, mc.preschedule=FALSE ) save(xval, file=file.path(tempdir(), "xval.rda")) } else data(xval) ``` The figure shows an ROC curve for the single and two-group classifier. The single group classifier is performing better than the two-group classifier. ```{r ROC-dmngroup} bst <- roc(pheno[rownames(count)] == "Obese", predict(bestgrp, count)[,"Obese"]) bst$Label <- "Single" two <- roc(pheno[rownames(xval)] == "Obese", xval[,"Obese"]) two$Label <- "Two group" both <- rbind(bst, two) pars <- list(superpose.line=list(col=.qualitative[1:2], lwd=2)) xyplot( TruePostive ~ FalsePositive, group=Label, both, type="l", par.settings=pars, auto.key=list(lines=TRUE, points=FALSE, x=.6, y=.1), xlab="False Positive", ylab="True Positive" ) ``` ```{r sessionInfo} sessionInfo() ```