--- title: "Testing non-linear effects" subtitle: 'Categories and continuous variables' author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)" date: "Run on `r format(Sys.time())`" documentclass: article vignette: > %\VignetteIndexEntry{Testing non-linear effects} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} output: BiocStyle::html_document: toc: true toc_float: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, error = FALSE, tidy = FALSE, dev = c("png"), cache = TRUE ) ``` # Introduction Typical analysis using regression models assumes a linear affect of the covariate on the response. Here we consider testing non-linear effects in the case of 1) continuous and 2) ordered categorical variables. We demonstrate this feature on a lightly modified analysis of PBMCs from 8 individuals stimulated with interferon-β ([Kang, et al, 2018, Nature Biotech](https://www.nature.com/articles/nbt.4042)). # Standard processing Here is the code from the main vignette: ```{r preprocess.data} library(dreamlet) library(muscat) library(ExperimentHub) library(scater) # Download data, specifying EH2259 for the Kang, et al study eh <- ExperimentHub() sce <- eh[["EH2259"]] # only keep singlet cells with sufficient reads sce <- sce[rowSums(counts(sce) > 0) > 0, ] sce <- sce[, colData(sce)$multiplets == "singlet"] # compute QC metrics qc <- perCellQCMetrics(sce) # remove cells with few or many detected genes ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE) sce <- sce[, !ol] # set variable indicating stimulated (stim) or control (ctrl) sce$StimStatus <- sce$stim sce$id <- paste0(sce$StimStatus, sce$ind) # Create pseudobulk pb <- aggregateToPseudoBulk(sce, assay = "counts", cluster_id = "cell", sample_id = "id", verbose = FALSE ) ``` # Continuous variable Consider the continuous variable `Age`. Typical analysis only considers linear effects using a single regression coefficient, but we also want to consider the non-linear effects of age. We can peform a [basis expansion using splines](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3) instead use 3 coefficients to model the age effect. ```{r Age} # Simulate age between 18 and 65 pb$Age <- runif(ncol(pb), 18, 65) # formula included non-linear effects of Age # by using a natural spline of degree 3 # This corresponds to using 3 coefficients instead of 1 form <- ~ splines::ns(Age, 3) # Normalize and apply voom/voomWithDreamWeights res.proc <- processAssays(pb, form, min.count = 5) # Differential expression analysis within each assay res.dl <- dreamlet(res.proc, form) # The spline has degree 3, so there are 3 coefficients # estimated for Age effects coefNames(res.dl) # Jointly test effects of the 3 spline components # The test of the 3 coefficients is performed with an F-statistic topTable(res.dl, coef = coefNames(res.dl)[2:4], number = 3) ``` # Ordered categorical We can also test non-linear effects in the case of categorical variables with a natural ordering to the categories. Consider time course data with 4 time points. Each time point is a category and has a natural ordering from first to last. We have multiple options to model the time course. * **Continuous:** Modeling time point as a continuous variable uses a single regression coefficient to model the linear effects of the time course. This is simple, models the order of the time points, but ignores non-linear effects Model using `as.numeric(TimePoint)` * **Categorical:** Including time point as a typical categorical variable uses estimated the mean response value for each category. So it estimates 4 coefficients. While this can be useful for comparing two categories, it ignores the order of the time points. Model using `factor(TimePoint)` * **Ordered categorical:** Here, the trend across ordered time points is modled using orthogonal polynomials. The trend is decomposed into independent linear, quadratic, etc., effects that can be tested either jointly or by themselves. Model using: ```{r eval=FALSE} ord <- c("time_1", "time_2", "time_3", "time_4") ordered(factor(TimePoint), ord) ``` Here we simulated 4 time points, and perform differential expression analysis. ```{r timepoint} # Consider data generated across 4 time points # While there are no time points in the real data # we can add some for demonstration purposes pb$TimePoint <- ordered(paste0("time_", rep(1:4, 4))) # examine the ordering pb$TimePoint # Use formula including time point form <- ~TimePoint # Normalize and apply voom/voomWithDreamWeights res.proc <- processAssays(pb, form, min.count = 5) # Differential expression analysis within each assay res.dl <- dreamlet(res.proc, form) # Examine the coefficient estimated # for TimePoint it estimates # linear (i.e. L) # quadratic (i.e. Q) # and cubic (i.e. C) effects coefNames(res.dl) # Test only linear effect topTable(res.dl, coef = "TimePoint.L", number = 3) # Test linear, quadratic and cubic effcts coefs <- c("TimePoint.L", "TimePoint.Q", "TimePoint.C") topTable(res.dl, coef = coefs, number = 3) ``` ## Sample filtering Due to variation in cell and read count for each sample, `processAssays()` filters out some sample. This filtering is summarized here: ```{r details} details(res.dl) ``` Whle all 16 samples are detained in B cells, only 9 are retained for megakaryocytes. This can result in a time point being dropped, and so the polynomial expansion for some cell types can have a lower degree. The combined results will then have `NA` values for these coefficients. For example, for `TIMP1` in `Megakaryocytes` above there is not enought data to fit the cubic term, so `TimePoint.C` is `NA`. # Session Info

```{r session, echo=FALSE} sessionInfo() ```