--- title: "Assessing Differential Gene Expression Experiments with the *erccdashboard*" author: - name: Sarah Munro - name: Steve Lund date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 toc_float: true vignette: > %\VignetteIndexEntry{erccdashboard introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, message=FALSE} #options(SweaveHooks=list(fig=function() # par(mar=c(5.1,4.1,1.1,2.1)))) library( "erccdashboard" ) knitr::opts_chunk$set(echo = TRUE, warning=FALSE,message=FALSE, error = FALSE) ``` # Introduction This vignette describes the use of the erccdashboard R package to analyze External RNA Controls Consortium (ERCC) spike-in control ratio mixtures in gene expression experiments. If you use this package for method validation of your gene expression experiments please cite our manuscript that describes this R package using citation("erccdashboard"). ```{r} citation("erccdashboard") ``` In this vignette we demonstrate analysis of two types of gene expression experiments from the SEQC project that used ERCC control ratio mixture spike-ins: 1. Rat toxicogenomics methimazole-treated and control samples 2. Human reference RNA samples from the MAQC I project, Universal Human Reference RNA (UHRR) and Human Brain Reference RNA (HBRR) A subset of the large data set produced in the SEQC study are provided here as examples. The three sets of example data are: 1. Rat toxicogenomics RNA-Seq gene expression count data 2. UHRR/HBRR RNA-Seq gene expression count data 3. UHRR/HBRR Microarray gene expression fluorescent intensity data # Example: Rat Toxicogenomics In this example Rat Toxicogenomics experiment we are comparing MET (methimazole treatment) and CTL (control) rat RNA-Seq samples. ## Load Example Data Load the package gene expression data. ```{r loadExampleData} data(SEQC.Example) ``` The R workspace should now contain 5 objects. Three of these objects are gene expression experiment expression measures: 1. UHRR.HBRR.arrayDat - Fluorescent signal data from an Illumina beadarray microarray experiment with UHRR and HBRR in the SEQC interlaboratory study 2. MET.CTL.countDat - RNA-Seq count data from a rat toxicogenomics experiment 3. UHRR.HBRR.countDat - RNA-Seq count data from Lab 5 in the SEQC interlaboratory study with UHRR and HBRR The other two objects are vectors of total reads for the 2 sequencing experiments 1. MET.CTL.totalReads - total sequenced reads factors for each column in the corresponding rat experiment count table 2. UHRR.HBRR.totalReads - total sequenced reads factors for each column in the corresponding UHRR/HBRR count table ## Define Input Parameters To run the default analysis function runDashboard on the MET-CTL rat toxicogenomics RNA-Seq experiment, the following input arguments are required: ```{r defineInputData} datType = "count" # "count" for RNA-Seq data, "array" for microarray data isNorm = FALSE # flag to indicate if input expression measures are already # normalized, default is FALSE exTable = MET.CTL.countDat # the expression measure table filenameRoot = "RatTox" # user defined filename prefix for results files sample1Name = "MET" # name for sample 1 in the experiment sample2Name = "CTL" # name for sample 2 in the experiment erccmix = "RatioPair" # name of ERCC mixture design, "RatioPair" is default erccdilution = 1/100 # dilution factor used for Ambion spike-in mixtures spikeVol = 1 # volume (in microliters) of diluted spike-in mixture added to # total RNA mass totalRNAmass = 0.500 # mass (in micrograms) of total RNA choseFDR = 0.05 # user defined false discovery rate (FDR), default is 0.05 ``` The first input argument, datType, indicates whether that data is integer count data from an RNA-Seq experiment ("count") or data from a microarray experiment ("array"). The isNorm argument indicates if the input expression measures are already normalized, the default value is FALSE. If you want to use normalized RNA-Seq or microarray data in your analysis, the isNorm argument must be set to TRUE. If isNorm is TRUE, then the software asks if the input data is length normalized. Type Y at the command line in the R console if the data is length normalized (e.g. FPKM or RPKM data) otherwise type N. If the data is normalized, then limma will be used for array data differential expression (DE) testing, but for normalized RNA-Seq data, DE testing results must be generated outside of the erccdashboard pipeline and the DE testing results should be provided in the working directory in a file named 'filenameRoot ERCC Pvals.csv' (for details on how to do this see section "Flexibility in Differential Expression Testing". The third argument, exTable, is the expression measure table. Take a look at the expression measure table from the RatTox experiment, to see an example exTable argument: ```{r inspectRatCount} head(MET.CTL.countDat) ``` The first column of the expression measure table, Feature, contains unique names for all the transcripts that were quantified in this experiment. The remaining columns represent replicates of the pair of samples, in this expression measure table the control sample is labeled CTL and the treatment sample is labeled MET. An underscore is included to separate the sample names from the replicate numbers during analysis. This column name format 'Sample_Rep' is required for the columns of any input expression measure table. Only one underscore should be used in the column names. The default differential expression testing of RNA-Seq experiments in the erccdashboard is done with the QuasiSeq package, which requires the use of integer count data. If you are trying to use either a different approach for DE testing of RNA-Seq count data (e.g. edgeR or DESeq) or RNA-Seq data that is not integer count data (e.g. FPKM data from Cufflinks) please see section "Flexibility in Differential Expression Testing". The erccdashboard default normalization for RNA-Seq count data is 75th percentile (also known as upper quartile) normalization. It is optional to provide a vector of per replicate normalization factors through the input argument repNormFactor, such as a vector of total reads for each replicate. The example total reads vectors we provide here were derived from the FASTQ files associated with each column in the RNA-Seq experiment count tables. Any repNormFactor vector will be used as a library size normalization factor for each column of exTable. This will be adjusted to be a per million reads factor. For any experiment the sample spiked with ERCC Mix 1 is sample1Name and the sample spiked with ERCC Mix 2 is sample2Name. In this experiment sample1Name = MET and sample2Name = CTL. For a more robust experimental design the reverse spike-in design could be created using additional replicates of the treatment and control samples. ERCC Mix 2 would be spiked into MET samples and ERCC Mix 1 would be spiked into CTL control replicates. The dilution factor of the pure Ambion ERCC mixes prior to spiking into total RNA samples is erccdilution. The amount of diluted ERCC mix spiked into the total RNA sample is spikeVol (units are $\mu$L). The mass of total RNA spiked with the diluted ERCC mix is totalRNAmass (units are $\mu$g). The final required input parameter, choseFDR, is the False Discovery Rate (FDR) for differential expression testing. A typical choice would be 0.05 (5\% FDR), so this is the default choseFDR value. For the rat data since most genes are not differentially expressed a less conservative FDR is chosen (FDR = 0.1) and for the UHRR and HBRR reference RNA samples FDR = 0.01 is chosen, because there is a large number of differentially expressed genes for this pair of samples. The function runDashboard.R is provided for convenient default erccdashboard analysis. Execution of the runDashboard function calls the default functions for erccdashboard analysis and reports parameters and progress to the R console. The functions called within runDashboard.R are also available to the user (details provided in Section 4). All data and analysis results are stored in the list object exDat. For convenience the main diagnostic figures are saved to a pdf file and the exDat object is saved to an .RData object named using the filenameRoot provided by the user. ## Run Example and View Output Use the following command to run the default runDashboard script: ```{r runDashboardRatcount,results='hold'} exDat <- runDashboard(datType="count", isNorm = FALSE, exTable=MET.CTL.countDat, filenameRoot="RatTox", sample1Name="MET", sample2Name="CTL", erccmix="RatioPair", erccdilution=1/100, spikeVol=1, totalRNAmass=0.500, choseFDR=0.1) ``` The `summary` function will give a top level view of the `exDat` list structure. The `str` function will give more detail. It is a good idea to set the `max.level` argument in the `str` function, because by the end of the analysis the `exDat` structure is quite large. ```{r initializeData} summary(exDat) ``` The figures from the analysis are stored in `exDat$Figures`. The four main diagnostic figures that are saved to a pdf file are the dynRangePlot, rocPlot, lodrERCCPlot, and maPlot. ```{r ratPlotA} exDat$Figures$dynRangePlot ``` For this particular experiment the relationship between abundance and signal for the ERCC controls shows that the measurement results span a 2$^{15}$ dynamic range. These ERCC mixtures were designed to span a 2$^{20}$ dynamic range, but there was insufficient evidence to reliably quantify ERCC transcripts at low abundances. ```{r ratPlotB} grid.arrange(exDat$Figures$rocPlot) ``` The receiver operator characteristic (ROC) curve and the Area Under the Curve (AUC) statistic provide evidence of the diagnostic power for detecting differential expression in this rat toxicogenomics experiment. As expected with increased fold change, diagnostic power increases. The AUC summary statistic for different experiments can be used to compare diagnostic performance. ```{r ratPlotC} grid.arrange(exDat$Figures$lodrERCCPlot) ``` By modeling the relationship between average signal and p-values we can obtain Limit of Detection of Ratios (LODR) estimates for each differential fold change (or Ratio,indicated by color) and a threshold p-value, p.thresh, indicated by the dotted black line. LODR values can be compared between experiments to evaluate detection of differences between samples as a function of transcript abundance. If the input data used in erccdashboard analysis are unnormalized RNA-Seq counts, then the LODR estimates from the lodrERCCPlot are unnormalized counts. If normalized RNA-Seq data or microarray data are used for analysis then the LODR estimates will have corresponding units. ```{r ratPlotD} grid.arrange(exDat$Figures$maPlot) ``` An MA plot (Ratio of Signals vs Average Signals) shows the ratio measurements of transcripts in the pair of samples as a function of abundance. The ERCC control ratios measurements are coded to indicate which controls are above a given LODR (solid circles) or below the LODR (open circles). This plot also shows the variability in ratio measurements as a function of dynamic range and the bias in control ratio measurements (r$_m$), which is influenced by the mRNA fraction difference between the pair of samples. # Comparison of Performance Between Experiments The performance metrics provided here derived from measurements of ERCC ratios in gene expression experiments (AUC, LODR, r$_m$, and the standard deviations of the ERCC ratio measurements) can be used to assess performance between experiments within the same laboratory, or between different laboratories or technology platforms. To illustrate the difference between technology platforms example data are also provided from the SEQC interlaboratory study. The interlaboratory experiments were performed with aliquots from single large batches of commercially available reference RNA samples, Universal Human Reference RNA (UHRR) and Human Brain Reference RNA (HBRR).To prepare these large batches of reference RNA samples, 50 $\mu$L of undiluted Ambion ERCC Mix 1 was spiked into 2500 $\mu$g of the UHRR total RNA and 50 $\mu$L of undiluted Ambion ERCC Mix 2 was spiked into 2500 $\mu$g the HBRR sample before these samples were aliquoted and shared amongst laboratories and across platforms. ## UHRR vs. HBRR Microarray Experiment To analyze the example reference RNA experiment microarray data use the following command: ``` exDat <- runDashboard(datType="array", isNorm = FALSE, exTable=UHRR.HBRR.arrayDat, filenameRoot = "Lab13.array", sample1Name = "UHRR", sample2Name="HBRR", erccmix = "RatioPair", erccdilution = 1, spikeVol = 50, totalRNAmass = 2.5*10^(3), choseFDR=0.01) ``` Note that unnormalized fluorescent signals are expected for erccdashboard analysis of microarray data (with the argument isNorm = FALSE) and input data should not be log transformed. As with RNA-Seq experiments for microarray experiments it is optional to provide a vector of per replicate normalization factors through the input argument repNormFactor. Normalized microarray data may be analyzed with the erccdashboard if `isNorm = TRUE`. ## UHRR vs. HBRR RNA-Seq Experiment To analyze the RNA-Seq reference RNA experiment data simply repeat the same command that you used in the previous section with the microarray data, but change the `datType` to "count", use `UHRR.HBRR.countDat` as your `exTable`, and change your filenameRoot to a different character string, so that your microarray data is not overwritten. # Analysis Details and Options The analysis functions contained in the convenience wrapper function `runDashboard` can also be used directly by the user. Comments are provided above each analysis step included in `runDashboard` to describe the purpose and ordering constraints. View the `runDashboard` script to see comments describing the analysis functions and the ordering constraints: ```{r viewDashboardOrder} runDashboard ``` ## Flexibility in Differential Expression Testing The `geneExprTest` function has 2 major actions: 1. differential expression testing 2. selecting a p-value threshold based on DE testing For differential expression testing, `geneExprTest` first looks for a correctly formatted filenameRoot.All.Pvals.csv file in the working directory. If this file is not found then `geneExprTest`proceeds with differential expression testing with the `edgeR` package for `datType = "count"` or uses the `limma` package for differential expression testing if `datType = "array"`. To use results from another DE testing tool the working directory must contain a csv file with the name "filenameRoot.All.Pvals.csv" and correct column headers. A correctly formatted "filenameRoot.All.Pvals.csv" file will have column names "Feature", "MnSignal", "Pval", and "Fold". "Feature" is a column of transcript names including ERCC controls and endogenous transcripts, "MnSignal" is the mean signal across sample replicates (e.g. counts or FPKM, etc.), "Pval" are the unadjusted P-values from differential expression testing (Q-values will be automatically estimated for the P-values, and take into account multiple hypothesis testing), and "Fold" are the numeric fold changes for the ERCC controls (0.5, 4, 1, and 0.667) and for endogenous transcripts the "Fold" is NA. The `geneExprTest` function must locate the "filenameRoot.All.Pvals.csv" file in the working directory to select a threshold p-value for use in the estLODR function for LODR estimation. ## LODR Estimation Options The default behavior of `runDashboard` is to use the `estLODR` function to obtain an LODR estimate using empirical data from the ERCCs and a model-based simulation using the endogenous genes in the sample. The type of LODR estimation is selected using the argument kind in the estLODR function. The other parameter that may be adjusted is the probability for the LODR estimate, in the default analysis prob = 0.9 is selected. ## Printing Plots to File The function `saveERCCPlots` will save selected figures to a pdf file. The default is to print the 4 main erccdashboard figures to a single page (`plotsPerPg = "main"`). If `plotsPerPg = "single"` then each plot is placed on an individual page in one pdf file. If plotlist is not defined (`plotlist = NULL`) then all plots in `exDat$Figures` are printed to the pdf file. ## Analysis of Alternative Spike-in Designs By default the package is configured to analyze the ERCC ratio mixtures produced by Ambion (ERCC ExFold RNA Spike-In Mixes, Catalog Number 4456739). This pair of control ratio mixtures were designed to have 1:1, 4:1, 1:1.5, and 1:2 ratios of 92 distinct RNA transcripts (23 different RNA control sequences are in each of these four ratio subpools). Alternative ERCC RNA control ratio mixture designs can be produced using the NIST DNA Plasmid Library for External Spike-in Controls (NIST Standard Reference Material 2374, https://tsapps.nist.gov/srmext/certificates/2374.pdf). For example, a pair of RNA control mixtures could be created with a ternary ratio design, three subpools of RNA controls with either no change (1:1) or 2-fold increased (2:1) and 2-fold decreased (1:2) relative abundances between the pair of mixtures (Mix 1/Mix 2). To use alternative spike-in mixture designs with the dashboard a csv file must be provided to the package with the argument userMixFile for the initDat function. If all samples from both conditions were only spiked with a single ERCC mixture (e.g. Ambion Catalog Number 4456740, ERCC RNA Spike-In Mix) a limited subset of the package functions can be used: `initDat`, `est_r_m`, and `dynRangePlot`. For `initDat` use `ERCCMixes="Single"` and `est_r_m` and `dynRangePlot` functions can then be used to examine the mRNA fraction differences for the pair of samples and evaluate the dynamic range of the experiment. # R version and sessionInfo() output The results shown in this R vignette were obtained with the following R session information. ```{r sessionData} sessionInfo() ```