Missing value handling

Goal of this vignette

Proteomics data suffer from a high rate of missing values, which need to be accounted for. Different methods have been applied to deal with this issue, including multiple imputation methods (see for example Lazar et al). The different options to deal with missing values in DEP are described in this vignette.

Simulate data

To exemplify the missing value handling, we work with a simulated dataset. This is very useful, because we know the ground truth of our dataset, meaning we know which proteins are belonging to the background (null distribution) and which proteins are differentially expressed between the control and sample conditions.

Generate intensity values

We generate a dataset with 3300 proteins, of which 300 proteins are differentially expressed. The variables used to generate the data are depicted below.

# Variables
replicates = 3
bg_proteins = 3000
DE_proteins = 300
log2_mean_bg = 27
log2_sd_bg = 2
log2_mean_DE_control = 25
log2_mean_DE_treatment = 30
log2_sd_DE = 2
# Loading DEP and packages required for data handling
library("DEP")
library("dplyr")
library("tidyr")
library("purrr")
library("ggplot2")
library("SummarizedExperiment")

# Background data (sampled from null distribution)
sim_null <- data_frame(
  name = paste0("bg_", rep(1:bg_proteins, rep((2*replicates), bg_proteins))),
  ID = rep(1:bg_proteins, rep((2*replicates), bg_proteins)),
  var = rep(c("control_1", "control_2", "control_3", 
    "treatment_1", "treatment_2", "treatment_3"), bg_proteins), 
  val = 2^rnorm(2*replicates*bg_proteins, mean = log2_mean_bg, sd = log2_sd_bg))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Data for DE proteins (sampled from alternative distribution)
sim_diff <- rbind(
  data_frame(
    name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))),
    ID = rep((bg_proteins+1):(bg_proteins+DE_proteins), 
      rep(replicates, DE_proteins)),
    var = rep(c("control_1", "control_2", "control_3"), DE_proteins), 
    val = 2^rnorm(replicates*DE_proteins, 
      mean = log2_mean_DE_control, sd = log2_sd_DE)),
  data_frame(
    name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))),
    ID = rep((bg_proteins+1):(bg_proteins+DE_proteins), 
      rep(replicates, DE_proteins)),
    var = rep(c("treatment_1", "treatment_2", "treatment_3"), DE_proteins),
    val = 2^rnorm(replicates*DE_proteins, 
      mean = log2_mean_DE_treatment, sd = log2_sd_DE)))

# Combine null and DE data
sim <- rbind(sim_null, sim_diff) %>% 
  spread(var, val) %>% 
  arrange(ID)

# Generate experimental design
experimental_design <- data_frame(
  label = colnames(sim)[!colnames(sim) %in% c("name", "ID")],
  condition = c(rep("control", replicates), 
    rep("treatment", replicates)),
  replicate = rep(1:replicates, 2))

Introduce missing values

Data can be missing at random (MAR) or missing not at random (MNAR). MAR means that values are randomly missing from all samples. In the case of MNAR, values are missing in specific samples and/or for specific proteins. For example, certain proteins might not be quantified in specific conditions, because they are below the detection limit in these specific samples.

To mimick these two types of missing values, we introduce missing values randomly over all data points (MAR) and we introduce missing values in the control samples of 100 differentially expressed proteins (MNAR). The variables used to introduce missing values are depicted below.

# Variables
MAR_fraction = 0.05
MNAR_proteins = 100
# Generate a MAR matrix
MAR_matrix <- matrix(
  data = sample(c(TRUE, FALSE), 
    size = 2*replicates*(bg_proteins+DE_proteins), 
    replace = TRUE, 
    prob = c(MAR_fraction, 1-MAR_fraction)), 
  nrow = bg_proteins+DE_proteins, 
  ncol = 2*replicates)

# Introduce missing values at random (MAR)
controls <- grep("control", colnames(sim))
treatments <- grep("treatment", colnames(sim))
sim[, c(controls, treatments)][MAR_matrix] <- 0
sim$MAR <- apply(MAR_matrix, 1, any)

# Introduce missing values not at random (MNAR)
DE_protein_IDs <- grep("DE", sim$name)
sim[DE_protein_IDs[1:MNAR_proteins], controls] <- 0
sim$MNAR <- FALSE
sim$MNAR[DE_protein_IDs[1:MNAR_proteins]] <- TRUE

Generate a SummarizedExperiment

The data is stored in a SummarizedExperiment, as described in the ‘Introduction to DEP’ vignette.

# Generate a SummarizedExperiment object
sim_unique_names <- make_unique(sim, "name", "ID", delim = ";")
se <- make_se(sim_unique_names, c(controls, treatments), experimental_design)

Filter proteins based on missing values

A first consideration with missing values is whether or not to filter out proteins with too many missing values.

Visualize the extend of missing values

The number of proteins quantified over the samples can be visualized to investigate the extend of missing values.

# Plot a barplot of the protein quantification overlap between samples
plot_frequency(se)

Many proteins are quantified in all six samples and only a small subset of proteins were detected in less than half of the samples.

Filter options

We can choose to not filter out any proteins at all, filter for only the proteins without missing values, filter for proteins with a certain fraction of quantified samples, and for proteins that are quantified in all replicates of at least one condition.

# No filtering
no_filter <- se

# Filter for proteins that are quantified in all replicates of at least one condition
condition_filter <- filter_proteins(se, "condition", thr = 0)

# Filter for proteins that have no missing values
complete_cases <- filter_proteins(se, "complete")

# Filter for proteins that are quantified in at least 2/3 of the samples.
frac_filtered <- filter_proteins(se, "fraction", min = 0.66)

To check the consequences of filtering, we calculate the number of background and DE proteins left in the dataset.

# Function to extract number of proteins
number_prots <- function(se) {
  names <- rownames(get(se))
  data_frame(Dataset = se,
    bg_proteins = sum(grepl("bg", names)),
    DE_proteins = sum(grepl("DE", names)))
}

# Number of bg and DE proteins still included
objects <- c("no_filter", 
  "condition_filter",
  "complete_cases",
  "frac_filtered")

map_df(objects, number_prots)
## # A tibble: 4 × 3
##   Dataset          bg_proteins DE_proteins
##   <chr>                  <int>       <int>
## 1 no_filter               3000         300
## 2 condition_filter        2937         267
## 3 complete_cases          2147         141
## 4 frac_filtered           2997         200

For our simulated dataset, filtering for complete cases drastically reduces the number of proteins, only keeping half of the DE proteins. Filtering protein to be quantified in at least 2/3 of the samples, keeps many more DE proteins and filters out all proteins with values MNAR. Filtering for proteins quantified in all replicates of at least one condition only filters out a minimal amount of DE proteins.

Scaling and variance stabilization

The data is scaled and variance stabilized using vsn.

# Scale and variance stabilize
no_filter <- normalize_vsn(se)
condition_filter <- normalize_vsn(condition_filter)
complete_cases <- normalize_vsn(complete_cases)
frac_filtered <- normalize_vsn(frac_filtered)
# Mean versus Sd plot
meanSdPlot(no_filter)
## Warning: Computation failed in `stat_binhex()`.
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_bin_hex()`.

Data imputation of missing data

A second important consideration with missing values is whether or not to impute the missing values.

MAR and MNAR (see Introduce missing values) require different imputation methods. See the MSnbase vignette and more specifically the impute function descriptions for detailed information.

Explore the pattern of missing values

To explore the pattern of missing values in the data, a heatmap can be plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.

# Plot a heatmap of proteins with missing values
plot_missval(no_filter)

The missing values seem to be randomly distributed across the samples (MAR). However, we do note a block of values that are missing in all control samples (bottom left side of the heatmap). These proteins might have missing values not at random (MNAR).

To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.

# Plot intensity distributions and cumulative fraction of proteins 
# with and without missing values
plot_detect(no_filter)

In our example data, there is no clear difference between the two distributions.

Imputation options

DEP borrows the imputation functions from MSnbase. See the MSnbase vignette and more specifically the impute function description for more information on the imputation methods.

# All possible imputation methods are printed in an error, if an invalid function name is given.
impute(no_filter, fun = "")
## Error in match.arg(fun): 'arg' should be one of "bpca", "knn", "QRILC", "MLE", "MinDet", "MinProb", "man", "min", "zero", "mixed", "nbavg"

Some examples of imputation methods.

# No imputation
no_imputation <- no_filter

# Impute missing data using random draws from a 
# Gaussian distribution centered around a minimal value (for MNAR)
MinProb_imputation <- impute(no_filter, fun = "MinProb", q = 0.01)

# Impute missing data using random draws from a 
# manually defined left-shifted Gaussian distribution (for MNAR)
manual_imputation <- impute(no_filter, fun = "man", shift = 1.8, scale = 0.3)

# Impute missing data using the k-nearest neighbour approach (for MAR)
knn_imputation <- impute(no_filter, fun = "knn", rowmax = 0.9)

The effect of data imputation on the distributions can be visualized.

# Plot intensity distributions before and after imputation
plot_imputation(no_filter, MinProb_imputation, 
  manual_imputation, knn_imputation)

Advanced imputation methods

Mixed imputation on proteins (rows)

One can also perform a mixed imputation on the proteins, which uses a MAR and MNAR imputation method on different subsets of proteins. First, we have to define a logical vector defining the rows that are to be imputed with the MAR method. Here, we consider a protein to have missing values not at random (MNAR) if it has missing values in all replicates of at least one condition.

# Extract protein names with missing values 
# in all replicates of at least one condition
proteins_MNAR <- get_df_long(no_filter) %>%
  group_by(name, condition) %>%
  summarize(NAs = all(is.na(intensity))) %>% 
  filter(NAs) %>% 
  pull(name) %>% 
  unique()

# Get a logical vector
MNAR <- names(no_filter) %in% proteins_MNAR

# Perform a mixed imputation
mixed_imputation <- impute(
  no_filter, 
  fun = "mixed",
  randna = !MNAR, # we have to define MAR which is the opposite of MNAR
  mar = "knn", # imputation function for MAR
  mnar = "zero") # imputation function for MNAR

Mixed imputation on samples (columns)

Additionally, the imputation can also be performed on a subset of samples. To peform a sample specific imputation, we first need to transform our SummarizedExperiment into a MSnSet object. Subsequently, we imputed the controls using the “MinProb” method and the samples using the “knn” method.

# SummarizedExperiment to MSnSet object conversion
sample_specific_imputation <- no_filter
MSnSet <- as(sample_specific_imputation, "MSnSet")

# Impute differently for two sets of samples
MSnSet_imputed1 <- MSnbase::impute(MSnSet[, 1:3], method = "MinProb")
MSnSet_imputed2 <- MSnbase::impute(MSnSet[, 4:6], method = "knn")

# Combine into the SummarizedExperiment object
assay(sample_specific_imputation) <- cbind(
  MSnbase::exprs(MSnSet_imputed1), 
  MSnbase::exprs(MSnSet_imputed2))

The effect of data imputation on the distributions can be visualized.

# Plot intensity distributions before and after imputation
plot_imputation(no_filter, mixed_imputation, sample_specific_imputation)

Test for differential expression

We perform differential analysis on the different imputated data sets. The following datasets are compared:

  • No imputation
  • knn imputation
  • MinProb imputation
  • Mixed imputation
# Function that wraps around test_diff, add_rejections and get_results functions
DE_analysis <- function(se) {
  se %>% 
    test_diff(., type = "control", control = "control") %>%
    add_rejections(., alpha = 0.1, lfc = 0) %>% 
    get_results()
}

# DE analysis on no, knn, MinProb and mixed imputation
no_imputation_results <- DE_analysis(no_imputation)
knn_imputation_results <- DE_analysis(knn_imputation)
MinProb_imputation_results <- DE_analysis(MinProb_imputation)
mixed_imputation_results <- DE_analysis(mixed_imputation)

Number of identified differentially expressed proteins

As an initial parameter we look at the number of differentially expressed proteins identified (adjusted P ≤ 0.05).

# Function to extract number of DE proteins
DE_prots <- function(results) {
  data_frame(Dataset = gsub("_results", "", results),
    significant_proteins = get(results) %>% 
      filter(significant) %>% 
      nrow())
}

# Number of significant proteins
objects <- c("no_imputation_results", 
  "knn_imputation_results",
  "MinProb_imputation_results",
  "mixed_imputation_results")

map_df(objects, DE_prots)
## # A tibble: 4 × 2
##   Dataset            significant_proteins
##   <chr>                             <int>
## 1 no_imputation                        97
## 2 knn_imputation                      123
## 3 MinProb_imputation                  147
## 4 mixed_imputation                    204

For our simulated dataset, no and knn imputation result in less identified differentially expressed proteins compared to the MinProb and mixed imputation. Mixed imputation results in the identification of the most differentially expressed proteins in our simulated dataset with many proteins missing values not at random.

Note that the performance of the different imputation methods is dataset-depedent. It is recommended to always carefully check the effect of filtering and data imputation on your results.

ROC curves

To further compare the results of the different imputation methods, ROC curves are plotted.

# Function to obtain ROC data
get_ROC_df <- function(results) {
  get(results) %>% 
    select(name, treatment_vs_control_p.val, significant) %>% 
    mutate(
      DE = grepl("DE", name),
      BG = grepl("bg", name)) %>% 
    arrange(treatment_vs_control_p.val) %>% 
    mutate(
      TPR = cumsum(as.numeric(DE)) / 300,
      FPR = cumsum(as.numeric(BG)) / 3000,
      method = results)
}

# Get ROC data for no, knn, MinProb and mixed imputation
ROC_df <- map_df(objects, get_ROC_df)

# Plot ROC curves
ggplot(ROC_df, aes(FPR, TPR, col = method)) +
  geom_line() +
  theme_DEP1() +
  ggtitle("ROC-curve")

# Plot ROC curves zoom
ggplot(ROC_df, aes(FPR, TPR, col = method)) +
  geom_line() +
  theme_DEP1() +
  xlim(0, 0.1) +
  ggtitle("ROC-curve zoom")
## Warning: Removed 11022 rows containing missing values or values outside the scale range
## (`geom_line()`).

The ROC curves also show that mixed imputation has the best performance for our simulated dataset. Because the performance of the different imputation methods is dataset-depedent, it is again recommended to carefully check the effect of filtering and imputation on your dataset.

Differences in response

To start the investigation of imputation method performance on the dataset, we ask ourselves the following questions: Which proteins are not identified as differentially expressed proteins in the datasets with no or knn imputation? And which proteins are specifically for mixed imputation? We look at both true and false positive hits as well as the missing values.

# Function to obtain summary data
get_rejected_proteins <- function(results) {
  get(results) %>% 
    filter(significant) %>% 
    left_join(., select(sim, name, MAR, MNAR), by = "name") %>% 
    mutate(
      DE = grepl("DE", name),
      BG = grepl("bg", name),
      method = results)
}

# Get summary data for no, knn, MinProb and mixed imputation
objects <- c("no_imputation_results", 
  "knn_imputation_results",
  "MinProb_imputation_results",
  "mixed_imputation_results")

summary_df <- map_df(objects, get_rejected_proteins)

# Plot number of DE proteins (True and False)
summary_df %>% 
  group_by(method) %>% 
  summarize(TP = sum(DE), FP = sum(BG)) %>% 
  gather(category, number, -method) %>% 
  mutate(method = gsub("_results", "", method)) %>% 
  ggplot(aes(method, number, fill = category)) +
  geom_col(position = position_dodge()) +
  theme_DEP2() +
  labs(title = "True and False Hits",
    x = "",
    y = "Number of DE proteins",
    fill = "False or True")

MinProb and mixed imputation identify many more truely differentially expressed proteins (TP) and only minimally increase the number of false positivies (FP) in our simulated dataset.

# Plot number of DE proteins with missing values
summary_df %>% 
  group_by(method) %>% 
  summarize(MNAR = sum(MNAR), MAR = sum(MAR)) %>% 
  gather(category, number, -method) %>% 
  mutate(method = gsub("_results", "", method)) %>% 
  ggplot(aes(method, number, fill = category)) +
  geom_col(position = position_dodge()) +
  theme_DEP2() +
  labs(title = "Category of Missing Values",
    x = "",
    y = "Number of DE proteins",
    fill = "")

The gain of identification of truely differentially expressed proteins in MinProb and mixed imputation clearly comes from increased sensitivity for data MNAR. This is also expected as these imputation methods perform left-censored imputation, which is specific for data MNAR.

Session information

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [5] IRanges_2.41.2              S4Vectors_0.45.2           
##  [7] BiocGenerics_0.53.3         generics_0.1.3             
##  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [11] ggplot2_3.5.1               purrr_1.0.2                
## [13] tidyr_1.3.1                 dplyr_1.1.4                
## [15] DEP_1.29.0                  BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] gridExtra_2.3               fdrtool_1.2.18             
##   [3] sandwich_3.1-1              rlang_1.1.4                
##   [5] magrittr_2.0.3              shinydashboard_0.7.2       
##   [7] clue_0.3-66                 GetoptLong_1.0.5           
##   [9] compiler_4.4.2              png_0.1-8                  
##  [11] vctrs_0.6.5                 reshape2_1.4.4             
##  [13] stringr_1.5.1               ProtGenerics_1.39.0        
##  [15] pkgconfig_2.0.3             shape_1.4.6.1              
##  [17] crayon_1.5.3                fastmap_1.2.0              
##  [19] XVector_0.47.0              labeling_0.4.3             
##  [21] utf8_1.2.4                  promises_1.3.2             
##  [23] rmarkdown_2.29              tzdb_0.4.0                 
##  [25] UCSC.utils_1.3.0            preprocessCore_1.69.0      
##  [27] xfun_0.49                   MultiAssayExperiment_1.33.1
##  [29] zlibbioc_1.52.0             cachem_1.1.0               
##  [31] jsonlite_1.8.9              gmm_1.8                    
##  [33] later_1.4.1                 DelayedArray_0.33.3        
##  [35] BiocParallel_1.41.0         parallel_4.4.2             
##  [37] cluster_2.1.8               R6_2.5.1                   
##  [39] bslib_0.8.0                 stringi_1.8.4              
##  [41] RColorBrewer_1.1-3          limma_3.63.2               
##  [43] jquerylib_0.1.4             assertthat_0.2.1           
##  [45] Rcpp_1.0.13-1               iterators_1.0.14           
##  [47] knitr_1.49                  zoo_1.8-12                 
##  [49] readr_2.1.5                 httpuv_1.6.15              
##  [51] Matrix_1.7-1                igraph_2.1.2               
##  [53] tidyselect_1.2.1            abind_1.4-8                
##  [55] yaml_2.3.10                 doParallel_1.0.17          
##  [57] codetools_0.2-20            affy_1.85.0                
##  [59] lattice_0.22-6              tibble_3.2.1               
##  [61] plyr_1.8.9                  withr_3.0.2                
##  [63] shiny_1.10.0                tmvtnorm_1.6               
##  [65] evaluate_1.0.1              norm_1.0-11.1              
##  [67] circlize_0.4.16             pillar_1.10.0              
##  [69] affyio_1.77.1               BiocManager_1.30.25        
##  [71] DT_0.33                     foreach_1.5.2              
##  [73] MSnbase_2.33.2              MALDIquant_1.22.3          
##  [75] ncdf4_1.23                  hms_1.1.3                  
##  [77] munsell_0.5.1               scales_1.3.0               
##  [79] xtable_1.8-4                glue_1.8.0                 
##  [81] lazyeval_0.2.2              maketools_1.3.1            
##  [83] tools_4.4.2                 mzID_1.45.0                
##  [85] sys_3.4.3                   QFeatures_1.17.0           
##  [87] vsn_3.75.0                  mzR_2.41.1                 
##  [89] imputeLCMD_2.1              mvtnorm_1.3-2              
##  [91] XML_3.99-0.17               buildtools_1.0.0           
##  [93] grid_4.4.2                  impute_1.81.0              
##  [95] MsCoreUtils_1.19.0          colorspace_2.1-1           
##  [97] GenomeInfoDbData_1.2.13     PSMatch_1.11.0             
##  [99] cli_3.6.3                   S4Arrays_1.7.1             
## [101] ComplexHeatmap_2.23.0       AnnotationFilter_1.31.0    
## [103] pcaMethods_1.99.0           gtable_0.3.6               
## [105] sass_0.4.9                  digest_0.6.37              
## [107] ggrepel_0.9.6               SparseArray_1.7.2          
## [109] farver_2.1.2                htmlwidgets_1.6.4          
## [111] rjson_0.2.23                htmltools_0.5.8.1          
## [113] lifecycle_1.0.4             httr_1.4.7                 
## [115] mime_0.12                   GlobalOptions_0.1.2        
## [117] statmod_1.5.0               MASS_7.3-61