Non-detects in qPCR data: methods to model and impute non-detects in the results of qPCR experiments (nondetects)

1 Background on non-detects in qPCR data

Quantitative real-time PCR (qPCR) measures gene expression for a subset of genes through repeated cycles of sequence-specific DNA amplification and expression measurements. During the exponential amplification phase, each cycle results in an approximate doubling of the quanitity of each target transcript. The threshold cycle (Ct) – the cycle at which the target gene’s expression first exceeds a predetermined threshold – is used to quantify the expression of each target gene. These Ct values typically represent the raw data from a qPCR experiment.

One challenge of qPCR data is the presence of non-detects those reactions failing to attain the expression threshold. While most current software replaces these non-detects with the maximum possible Ct value (typically 40), recent work has shown that this introduces large biases in estimation of both absolute and differential expression. Here, we treat the non-detects as missing data, model the missing data mechanism, and use this model to impute Ct values for the non-detects.

2 A statistical model for qPCR non-detects

We propose the following generative model for qPCR data in which Yij is the measured expression value for gene i in sample j, some of which are missing (non-detects), Xij represents the fully observed expression values, and Zij indicates whether a value is detected: Xij = f(θij, η) + εij $$ Y_{ij} = \left\{ \begin{array} {rr} X_{ij} & \textrm{if $Z_{ij}=1$}\\ \textrm{non-detect} & \textrm{if $Z_{ij}=0$} \end{array} \right.$$

In this model, we assume that the fully observed expression values, Xij are a function of the true gene expression, θij, non-biological factors, η, and random measurement error, εij. The probability of an expressed value being detected is assumed to be a function of the expression value itself, g(Xij), for values below the detection limit, S. The following logistic regression model is a natural choice for such a relationship:

$$ Pr(Z_{ij}=1) = \left\{ \begin{array} {rr} g(X_{ij}) & \textrm{if $X_{ij} < 40$} \\ 0 & \textrm{otherwise} \end{array} \right.$$

Here, g(Xij) can be estimated via the following logistic regression: logit(Pr(Zij = 1)) = β0 + β1Xij

3 Example

Data from Sampson et al. Oncogene 2013

Two cell types – young adult mouse colon (YAMC) cells and mutant-p53/activated-Ras transformed YAMC cells – in combination with three treatments – untreated, sodium butyrate, or valproic acid. Four replicates were performed for each cell-type/treatment combination (Sampson et al. 2013).

Load the data

library(HTqPCR)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: RColorBrewer
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
## Warning in fun(libname, pkgname): Package 'HTqPCR' is deprecated and will be removed from Bioconductor
##   version 3.21
library(mvtnorm)
library(nondetects)
data(oncogene2013)

Examine residuals when non-detects are replaced by 40

Normalize to Becn1:

normCt <- normalizeCtData(oncogene2013, norm = "deltaCt", deltaCt.genes = "Becn1")
## Calculating deltaCt values
##  Using control gene(s): Becn1 
##  Card 1: Mean=26.17  Stdev=NA
##  Card 2: Mean=25.5   Stdev=NA
##  Card 3: Mean=25.85  Stdev=NA
##  Card 4: Mean=26.22  Stdev=NA
##  Card 5: Mean=26.59  Stdev=NA
##  Card 6: Mean=25.35  Stdev=NA
##  Card 7: Mean=25.73  Stdev=NA
##  Card 8: Mean=26.36  Stdev=NA
##  Card 9: Mean=26.13  Stdev=NA
##  Card 10:    Mean=25.38  Stdev=NA
##  Card 11:    Mean=25.61  Stdev=NA
##  Card 12:    Mean=26.52  Stdev=NA
##  Card 13:    Mean=26.12  Stdev=NA
##  Card 14:    Mean=26.56  Stdev=NA
##  Card 15:    Mean=26.02  Stdev=NA
##  Card 16:    Mean=25.5   Stdev=NA
##  Card 17:    Mean=25.76  Stdev=NA
##  Card 18:    Mean=26.03  Stdev=NA
##  Card 19:    Mean=26.67  Stdev=NA
##  Card 20:    Mean=26.69  Stdev=NA
##  Card 21:    Mean=26.11  Stdev=NA
##  Card 22:    Mean=25.86  Stdev=NA
##  Card 23:    Mean=26.27  Stdev=NA
##  Card 24:    Mean=26.26  Stdev=NA

Calculate residuals for each set of replicates:

conds <- paste(pData(normCt)$sampleType,pData(normCt)$treatment,sep=":")
resids <- matrix(nrow=nrow(normCt), ncol=ncol(normCt))
for(i in 1:nrow(normCt)){
  for(j in 1:ncol(normCt)){
    ind <- which(conds==conds[j])
    resids[i,j] <- exprs(normCt)[i,j]-mean(exprs(normCt)[i,ind])
  }
}

Create boxplots of residuals stratified by the presence of a non-detect:

iND <- which(featureCategory(normCt)=="Undetermined", arr.ind=TRUE)
iD <- which(featureCategory(normCt)!="Undetermined", arr.ind=TRUE)
boxes <- list("observed"=-resids[iD], "non-detect"=-resids[iND])
boxplot(boxes, main="",ylim=c(-5,5),
        ylab=expression(paste("-",Delta,"Ct residuals",sep="")))

Impute non-detects

oncogene2013_1 <- qpcrImpute(oncogene2013, 
groupVars=c("sampleType","treatment"), outform = c("Multy"), 
vary_fit=FALSE, vary_model=TRUE, add_noise=TRUE, numsam=2,
linkglm = c("logit"))
## ~0 + nrep
## <environment: 0x55bcdde8db48>
## [1] "1 / 100"
## -1585.93719357229
## Warning: fitted probabilities numerically 0 or 1 occurred
## [1] "2 / 100"
## -1547.65473798079
## [1] "3 / 100"
## -1525.63747493401
## [1] "4 / 100"
## -1507.70854344257
## [1] "5 / 100"
## -1494.34791647616
## [1] "6 / 100"
## -1486.84145953593
## [1] "7 / 100"
## -1482.65081095015
## [1] "8 / 100"
## -1480.02741565203
## [1] "9 / 100"
## -1478.28522499918
## [1] "10 / 100"
## -1477.09557013291
## [1] "11 / 100"
## -1476.26802386548
## [1] "Multy"
## vary model= TRUE vary_fit= FALSE add_noise= TRUE
##  creating data set  1
## Warning in rbind(deparse.level, ...): number of columns of result, 1, is not a
## multiple of vector length 5 of arg 2
## 
##  creating data set  2
## Warning in rbind(deparse.level, ...): number of columns of result, 1, is not a
## multiple of vector length 5 of arg 2

Examine residuals when non-detects are replaced by imputed values

Normalize to Becn1:

normCt <- normalizeCtData(oncogene2013_1[[1]], norm = "deltaCt", deltaCt.genes = "Becn1")
## Calculating deltaCt values
##  Using control gene(s): Becn1 
##  Card 1: Mean=26.17  Stdev=NA
##  Card 2: Mean=25.5   Stdev=NA
##  Card 3: Mean=25.85  Stdev=NA
##  Card 4: Mean=26.22  Stdev=NA
##  Card 5: Mean=26.59  Stdev=NA
##  Card 6: Mean=25.35  Stdev=NA
##  Card 7: Mean=25.73  Stdev=NA
##  Card 8: Mean=26.36  Stdev=NA
##  Card 9: Mean=26.13  Stdev=NA
##  Card 10:    Mean=25.38  Stdev=NA
##  Card 11:    Mean=25.61  Stdev=NA
##  Card 12:    Mean=26.52  Stdev=NA
##  Card 13:    Mean=26.12  Stdev=NA
##  Card 14:    Mean=26.56  Stdev=NA
##  Card 15:    Mean=26.02  Stdev=NA
##  Card 16:    Mean=25.5   Stdev=NA
##  Card 17:    Mean=25.76  Stdev=NA
##  Card 18:    Mean=26.03  Stdev=NA
##  Card 19:    Mean=26.67  Stdev=NA
##  Card 20:    Mean=26.69  Stdev=NA
##  Card 21:    Mean=26.11  Stdev=NA
##  Card 22:    Mean=25.86  Stdev=NA
##  Card 23:    Mean=26.27  Stdev=NA
##  Card 24:    Mean=26.26  Stdev=NA

Remove the normalization gene:

normCt <- normCt[-which(featureNames(normCt)=="Becn1"),]

Calculate residuals for each set of replicates:

conds <- paste(pData(normCt)$sampleType,
               pData(normCt)$treatment,sep=":")
resids <- matrix(nrow=nrow(normCt), ncol=ncol(normCt))
for(i in 1:nrow(normCt)){
  for(j in 1:ncol(normCt)){
    ind <- which(conds==conds[j])
    resids[i,j] <- exprs(normCt)[i,j]-mean(exprs(normCt)[i,ind])
  }
}

Create boxplots of residuals stratified by the presence of a non-detect:

iI <- which(featureCategory(normCt)=="Imputed", arr.ind=TRUE)
iD <- which(featureCategory(normCt)!="Imputed", arr.ind=TRUE)
boxes <- list("observed"=-resids[iD], "imputed"=-resids[iI])
boxplot(boxes, main="",ylim=c(-5,5),
        ylab=expression(paste("-",Delta,"Ct residuals",sep="")))

Additional examples

Two additional example data sets are used in the paper and included in the package. These are each briefly described below.

Data from Almudevar et al. SAGMB 2011

Cells transformed to malignancy by mutant p53 and activated Ras are perturbed with the aim of restoring gene expression to levels found in non-transformed parental cells via retrovirus-mediated re-expression of corresponding cDNAs or shRNA-dependent stable knock-down. The data contain 4-6 replicates for each perturbation, and each perturbation has a corresponding control sample in which only the vector has been added (Almudevar et al. 2011).

library(nondetects)
data(sagmb2011)

Data from McMurray et al. Nature 2008

A study of the effect of p53 and/or Ras mutations on gene expression. The third dataset is a comparison between four cell types – YAMC cells, mutant-p53 YAMC cells, activated-Ras YAMC cells, and p53/Ras double mutant YAMC cells. Three replicates were performed for the untransformed YAMC cells, and four replicates were performed for each of the other cell types (McMurray et al. 2008).

library(nondetects)
data(nature2008)

Funding

This work was supported by National Institutes of Health [grant numbers CA009363, CA138249, HG006853]; and an Edelman-Gardner Foundation Award.

Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] nondetects_2.35.0   mvtnorm_1.2-5       HTqPCR_1.59.1      
## [4] limma_3.61.7        RColorBrewer_1.1-3  Biobase_2.65.0     
## [7] BiocGenerics_0.51.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9            gplots_3.1.3.1        bitops_1.0-8         
##  [4] KernSmooth_2.23-24    gtools_3.9.5          lattice_0.22-6       
##  [7] lme4_1.1-35.5         digest_0.6.36         caTools_1.18.2       
## [10] evaluate_0.24.0       grid_4.4.1            fastmap_1.2.0        
## [13] jsonlite_1.8.8        Matrix_1.7-0          BiocManager_1.30.23  
## [16] preprocessCore_1.67.0 jquerylib_0.1.4       abind_1.4-5          
## [19] cli_3.6.3             rlang_1.1.4           splines_4.4.1        
## [22] cachem_1.1.0          yaml_2.3.10           tools_4.4.1          
## [25] nloptr_2.1.1          coda_0.19-4.1         minqa_1.2.7          
## [28] boot_1.3-30           buildtools_1.0.0      R6_2.5.1             
## [31] stats4_4.4.1          lifecycle_1.0.4       zlibbioc_1.51.1      
## [34] MASS_7.3-61           affyio_1.75.0         bslib_0.8.0          
## [37] Rcpp_1.0.13           statmod_1.5.0         arm_1.14-4           
## [40] xfun_0.46             affy_1.83.0           highr_0.11           
## [43] sys_3.4.2             knitr_1.48            htmltools_0.5.8.1    
## [46] nlme_3.1-165          rmarkdown_2.27        maketools_1.3.0      
## [49] compiler_4.4.1

References

Almudevar, Anthony, Matthew N McCall, Helene McMurray, and Hartmut Land. 2011. “Fitting Boolean Networks from Steady State Perturbation Data.” Statistical Applications in Genetics and Molecular Biology 10 (1): 47.
McMurray, Helene R, Erik R Sampson, George Compitello, Conan Kinsey, Laurel Newman, Bradley Smith, Shaw-Ree Chen, et al. 2008. “Synergistic Response to Oncogenic Mutations Defines Gene Class Critical to Cancer Phenotype.” Nature 453 (7198): 1112–16.
Sampson, ER, HR McMurray, DC Hassane, L Newman, P Salzman, CT Jordan, and H Land. 2013. “Gene Signature Critical to Cancer Phenotype as a Paradigm for Anticancer Drug Discovery.” Oncogene 32 (33): 3809–18.