--- title: "Introduction to SpiecEasi" author: - name: Zachary Kurtz email: zdkurtz@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r format(Sys.time(), '%B %d, %Y')`" package: "SpiecEasi" version: "2.1.1" vignette: > %\VignetteIndexEntry{Introduction to SpiecEasi} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo = FALSE, eval=TRUE, message=FALSE} library(BiocStyle) knitr::opts_knit$set( upload.fun = NULL, base.url = NULL) # use local files for images knitr::opts_chunk$set( collapse = TRUE, comment = "#" ) # Override BiocStyle plot hook to avoid css_align issues knitr::knit_hooks$set(plot = function(x, options) { paste0('![', basename(x), '](', x, ')') }) runchunks = if (Sys.getenv("FORCE_VIGNETTE_REBUILD", "FALSE") == "TRUE") TRUE else FALSE cache_file <- '../vignette_cache/SpiecEasi.RData' if (!runchunks && file.exists(cache_file)) { load(cache_file) # If we loaded trimmed objects, reassign them to original names if (exists("se.mb.trimmed")) { se.mb.amgut <- se.mb.trimmed rm(se.mb.trimmed) } if (exists("se.gl.trimmed")) { se.gl.amgut <- se.gl.trimmed rm(se.gl.trimmed) } if (exists("se.trimmed")) { se <- se.trimmed rm(se.trimmed) } message("Loaded cached data from SpiecEasi.RData") } else { if (!runchunks) { message("Cache file SpiecEasi.RData not found - building from scratch") } runchunks <- TRUE } saveout = runchunks ``` # Introduction to SpiecEasi Sparse InversE Covariance estimation for Ecological Association and Statistical Inference This package will be useful to anybody who wants to infer graphical models for all sorts of compositional data, though primarily intended for microbiome relative abundance data (generated from 16S amplicon sequence data). It also includes a generator for [overdispersed, zero inflated] multivariate, correlated count data. Please see the paper published in [PLoS Comp Bio](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004226). One small point on notation: we refer to the method as "SPIEC-EASI" and reserve "SpiecEasi" for this package. ## Installation To install SpiecEasi from Bioconductor: ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SpiecEasi") ``` For development versions, you can install from GitHub: ```{r, eval=FALSE} if (!require("remotes", quietly = TRUE)) install.packages("remotes") remotes::install_github("zdk123/SpiecEasi") ``` ## Available vignettes This package includes several vignettes covering different aspects of SpiecEasi: - **[Introduction to SpiecEasi](SpiecEasi.html)** (this vignette): Basic usage and introduction - **[Working with phyloseq](phyloseq-integration.html)**: Integration with phyloseq objects - **[Learning latent variable graphical models](latent-variable-models.html)**: Advanced methods for handling latent variables - **[Cross Domain SPIEC-EASI](cross-domain-interactions.html)**: Analyzing multiple data types together - **[pulsar: parallel utilities for model selection](pulsar-parallel.html)**: Parallel processing and performance optimization - **[Troubleshooting](troubleshooting.html)**: Common issues and solutions ## Basic Usage Lets simulate some multivariate data under zero-inflated negative binomial model, based on (high depth/count) round 1 of the American gut project, with a sparse network. The basic steps are: 1. load the data and normalize counts to to common scale (min depth) 2. fit count margins to the model 3. generate a synthetic network 4. generate some synthetic data 5. clr transformation 6. inverse covariance estimation along a lambda (sparsity) path 7. stability selection using the StARS criterion 8. evaluate performance Obviously, for real data, skip 1-4. Session info: ```{r, eval=TRUE} sessionInfo() ``` Setup: ```{r, eval=TRUE} library(SpiecEasi) data(amgut1.filt) ``` ```{r, eval=runchunks} depths <- rowSums(amgut1.filt) amgut1.filt.n <- t(apply(amgut1.filt, 1, norm_to_total)) amgut1.filt.cs <- round(amgut1.filt.n * min(depths)) d <- ncol(amgut1.filt.cs) n <- nrow(amgut1.filt.cs) e <- d ``` Synthesize the data: ```{r, eval=runchunks} set.seed(10010) graph <- SpiecEasi::make_graph('cluster', d, e) Prec <- graph2prec(graph) Cor <- cov2cor(prec2cov(Prec)) X <- synth_comm_from_counts(amgut1.filt.cs, mar=2, distr='zinegbin', Sigma=Cor, n=n) ``` The main SPIEC-EASI pipeline: Data transformation, sparse inverse covariance estimation and model selection: ```{r, eval=runchunks} se <- spiec.easi(X, method='mb', lambda.min.ratio=1e-2, nlambda=15) ``` Examine ROC over lambda path and PR over the stars index for the selected graph: ```{r, eval=TRUE, fig.width=5, fig.height=5} huge::huge.roc(se$est$path, graph, verbose=FALSE) stars.pr(getOptMerge(se), graph, verbose=FALSE) # stars selected final network under: getRefit(se) ``` The above example does not cover all possible options and parameters. For example, other generative network models are available, the lambda.min.ratio (the scaling factor that determines the minimum sparsity/lambda parameter) shown here might not be right for your dataset, and its possible that you'll want more repetitions (number of subsamples) for StARS. ## Analysis of American Gut data Now let's apply SpiecEasi directly to the American Gut data. Don't forget that the normalization is performed internally in the `spiec.easi` function. Also, we should use a larger number of stars repetitions for real data. We can pass in arguments to the inner stars selection function as a list via the parameter `pulsar.params`. If you have more than one processor available, you can also supply a number to `ncores`. Also, let's compare results from the MB and glasso methods as well as SparCC (correlation). **Note**: On Windows systems, `mc.cores > 1` is not supported by default. For Windows users, we recommend using `ncores=1` for serial processing or the `snow` cluster type for parallel processing. See the [pulsar-parallel vignette](pulsar-parallel.html) for detailed Windows-specific guidance. ```{r, eval=runchunks, message=FALSE, warning=FALSE} se.mb.amgut <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-2, nlambda=10, pulsar.params=list(rep.num=20)) se.gl.amgut <- spiec.easi(amgut1.filt, method='glasso', lambda.min.ratio=1e-2, nlambda=10, pulsar.params=list(rep.num=20)) sparcc.amgut <- sparcc(amgut1.filt) ``` Create igraph objects for visualization: ```{r, eval=TRUE} ## Define arbitrary threshold for SparCC correlation matrix for the graph sparcc.graph <- abs(sparcc.amgut$Cor) >= 0.3 diag(sparcc.graph) <- 0 library(Matrix) sparcc.graph <- Matrix(sparcc.graph, sparse=TRUE) ## Create igraph objects ig.mb <- adj2igraph(getRefit(se.mb.amgut)) ig.gl <- adj2igraph(getRefit(se.gl.amgut)) ig.sparcc <- adj2igraph(sparcc.graph) ``` Visualize using igraph plotting: ```{r, eval=TRUE, fig.width=15, fig.height=7, message=FALSE} library(igraph) library(Matrix) ## set size of vertex proportional to clr-mean vsize <- rowMeans(clr(amgut1.filt, 1))+6 am.coord <- layout.fruchterman.reingold(ig.mb) par(mfrow=c(1,3)) plot(ig.mb, layout=am.coord, vertex.size=vsize, vertex.label=NA, main="MB") plot(ig.gl, layout=am.coord, vertex.size=vsize, vertex.label=NA, main="glasso") plot(ig.sparcc, layout=am.coord, vertex.size=vsize, vertex.label=NA, main="sparcc") ``` We can evaluate the weights on edges networks using the terms from the underlying model. SparCC correlations can be used directly, while SpiecEasi networks need to be massaged a bit. Note that since SPIEC-EASI is based on penalized estimators, the edge weights are not directly comparable to SparCC (or Pearson/Spearman correlation coefficients): ```{r, eval=TRUE, fig.width=8, fig.height=5} # Create edge lists for weight comparison secor <- cov2cor(getOptCov(se.gl.amgut)) sebeta <- symBeta(getOptBeta(se.mb.amgut), mode='maxabs') elist.gl <- summary(triu(secor*getRefit(se.gl.amgut), k=1)) elist.mb <- summary(sebeta) elist.sparcc <- summary(sparcc.graph*sparcc.amgut$Cor) # Plot edge weight distributions hist(elist.sparcc[,3], main='', xlab='edge weights') hist(elist.mb[,3], add=TRUE, col='forestgreen') hist(elist.gl[,3], add=TRUE, col='red') ``` Lets look at the degree statistics from the networks inferred by each method: ```{r, eval=TRUE, fig.width=9, fig.height=6} dd.gl <- degree.distribution(ig.gl) dd.mb <- degree.distribution(ig.mb) dd.sparcc <- degree.distribution(ig.sparcc) plot(0:(length(dd.sparcc)-1), dd.sparcc, ylim=c(0,.35), type='b', ylab="Frequency", xlab="Degree", main="Degree Distributions") points(0:(length(dd.gl)-1), dd.gl, col="red" , type='b') points(0:(length(dd.mb)-1), dd.mb, col="forestgreen", type='b') legend("topright", c("MB", "glasso", "sparcc"), col=c("forestgreen", "red", "black"), pch=1, lty=1) ``` ## Next steps For more advanced usage, please refer to the other vignettes: - **[Working with phyloseq](phyloseq-integration.html)**: Learn how to integrate SpiecEasi with phyloseq objects for easier data management and visualization - **[Learning latent variable graphical models](latent-variable-models.html)**: Advanced methods for handling unobserved variables that can affect network inference - **[Cross Domain SPIEC-EASI](cross-domain-interactions.html)**: Analyze multiple data types (e.g., 16S and ITS) together - **[pulsar: parallel utilities for model selection](pulsar-parallel.html)**: Optimize performance with parallel processing - **[Troubleshooting](troubleshooting.html)**: Solutions for common issues and parameter tuning guidelines ```{r, echo = FALSE, eval=TRUE, message=FALSE} # Save only the expensive computations if they exist and we're rebuilding if (runchunks && exists("se.mb.amgut") && exists("se.gl.amgut")) { cache_file <- '../vignette_cache/SpiecEasi.RData' tryCatch({ # Load trimming function and trim objects to reduce size source('../vignette_cache/trim_spiec_easi.R') se.mb.trimmed <- trim_spiec_easi(se.mb.amgut) se.gl.trimmed <- trim_spiec_easi(se.gl.amgut) se.trimmed <- trim_spiec_easi(se) # Save the trimmed expensive spiec.easi results plus other needed variables save(se.mb.trimmed, se.gl.trimmed, se.trimmed, sparcc.amgut, graph, X, Prec, Cor, depths, amgut1.filt.cs, d, n, e, file=cache_file) message("Saved trimmed cache file: ", cache_file, " in directory: ", getwd()) }, error = function(e) { message("Failed to save cache file: ", e$message) }) } ```