This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as MaxQuant or IsobarQuant. Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report (see the chapter on Workflow functions). Even easier to use are the interactive Shiny apps that are provided by the package (see the chapter on Shiny apps).
The most recent version of R can be downloaded from CRAN. Additional system requirements are Pandoc, as well as Rtools for Windows and NetCDF for Linux.
Start R and install the DEP package:
The package contains shiny apps, which allow for the interactive analysis entailing the entire workflow described below. These apps are especially relevant for users with limited or no experience in R. Currently, there are two different apps. One for label-free quantification (LFQ)-based analysis (output from MaxQuant) and the second for tandem-mass-tags (TMT)-based analysis (output from IsobarQuant). To run the apps, simply run this single command:
We analyze a proteomics dataset in which Ubiquitin-protein interactors were characterized (Zhang et al. Mol Cell 2017). The raw mass spectrometry data were first analyzed using MaxQuant (Cox and Mann, Nat Biotech 2007) and the resulting “proteinGroups.txt” file is used as input for the downstream analysis.
# Loading a package required for data handling
library("dplyr")
# The data is provided with the package
data <- UbiLength
# We filter for contaminant proteins and decoy database hits, which are indicated by "+" in the columns "Potential.contaminants" and "Reverse", respectively.
data <- filter(data, Reverse != "+", Potential.contaminant != "+")
The data.frame has the following dimensions:
## [1] 2941 23
The data.frame has the following column names:
## [1] "Protein.IDs" "Majority.protein.IDs"
## [3] "Protein.names" "Gene.names"
## [5] "Fasta.headers" "Peptides"
## [7] "Razor...unique.peptides" "Unique.peptides"
## [9] "LFQ.intensity.Ubi4_1" "LFQ.intensity.Ubi4_2"
## [11] "LFQ.intensity.Ubi4_3" "LFQ.intensity.Ubi6_1"
## [13] "LFQ.intensity.Ubi6_2" "LFQ.intensity.Ubi6_3"
## [15] "LFQ.intensity.Ctrl_1" "LFQ.intensity.Ctrl_2"
## [17] "LFQ.intensity.Ctrl_3" "LFQ.intensity.Ubi1_1"
## [19] "LFQ.intensity.Ubi1_2" "LFQ.intensity.Ubi1_3"
## [21] "Only.identified.by.site" "Reverse"
## [23] "Potential.contaminant"
The “LFQ.intensity” columns will be used for subsequent analysis.
The dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique.
## [1] TRUE
# Make a table of duplicated gene names
data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>%
arrange(desc(frequency)) %>% filter(frequency > 1)
## # A tibble: 51 × 2
## Gene.names frequency
## <chr> <int>
## 1 "" 7
## 2 "ATXN2" 4
## 3 "ATXN2L" 4
## 4 "SF1" 4
## 5 "HSPA8" 3
## 6 "RBM33" 3
## 7 "UGP2" 3
## 8 "ACTL6A" 2
## 9 "BCLAF1" 2
## 10 "BRAP" 2
## # ℹ 41 more rows
For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.
# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name.
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")
# Are there any duplicated names?
data$name %>% duplicated() %>% any()
## [1] FALSE
Many Bioconductor packages use SummarizedExperiment objects as input and/or output. This class of objects contains and coordinates the actual (assay) data, information on the samples as well as feature annotation. We can generate the SummarizedExperiment object from our data using two different approaches. We can extract sample information directly from the column names or we add sample information using an experimental design template. An experimental design is included in the package for our example dataset.
The experimental design must contain ‘label’, ‘condition’ and ‘replicate’ columns. The ‘label’ column contains the identifiers of the different samples and they should correspond to the column names containing the assay data. The ‘condition’ and ‘replicate’ columns contain the annotation of these samples as defined by the user.
label | condition | replicate |
---|---|---|
Ubi4_1 | Ubi4 | 1 |
Ubi4_2 | Ubi4 | 2 |
Ubi4_3 | Ubi4 | 3 |
Ubi6_1 | Ubi6 | 1 |
Ubi6_2 | Ubi6 | 2 |
Ubi6_3 | Ubi6 | 3 |
Ctrl_1 | Ctrl | 1 |
Ctrl_2 | Ctrl | 2 |
Ctrl_3 | Ctrl | 3 |
Ubi1_1 | Ubi1 | 1 |
Ubi1_2 | Ubi1 | 2 |
Ubi1_3 | Ubi1 | 3 |
# Generate a SummarizedExperiment object using an experimental design
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
experimental_design <- UbiLength_ExpDesign
data_se <- make_se(data_unique, LFQ_columns, experimental_design)
# Generate a SummarizedExperiment object by parsing condition information from the column names
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
data_se_parsed <- make_se_parse(data_unique, LFQ_columns)
# Let's have a look at the SummarizedExperiment object
data_se
## class: SummarizedExperiment
## dim: 2941 12
## metadata(0):
## assays(1): ''
## rownames(2941): RBM47 UBA6 ... ATXN2.3 X6RHB9
## rowData names(13): Protein.IDs Majority.protein.IDs ... name ID
## colnames(12): Ubi4_1 Ubi4_2 ... Ubi1_2 Ubi1_3
## colData names(4): label ID condition replicate
The make_se and make_se_parse functions generate a SummarizedExperiment object that has a couple of specifications. The assay data is log2-transformed and its rownames depict the protein names. The rowData contains, amongst others, the ‘name’ and ‘ID’ columns that were generated by make_unique. The colData contains the experimental design and thereby the sample annotation. Thereby the colData includes the ‘label’, ‘condition’ and ‘replicate’ columns as well as a newly generated ‘ID’ column. The log2-transformed assay data and the specified rowData and colData columns are prerequisites for the subsequent analysis steps.
The dataset contains proteins which are not quantified in all replicates. Some proteins are even only quantified in a single replicate.
This leaves our dataset with missing values, which need to be imputed. However, this should not be done for proteins that contain too many missing values. Therefore, we first filter out proteins that contain too many missing values. This is done by setting the threshold for the allowed number of missing values per condition in the filter_missval function.
# Filter for proteins that are identified in all replicates of at least one condition
data_filt <- filter_missval(data_se, thr = 0)
# Less stringent filtering:
# Filter for proteins that are identified in 2 out of 3 replicates of at least one condition
data_filt2 <- filter_missval(data_se, thr = 1)
After filtering, the number of identified proteins per sample can be plotted as well as the overlap in identifications between samples.
The data is background corrected and normalized by variance stabilizing transformation (vsn).
The normalization can be inspected by checking the distributions of the samples before and after normalization.
The remaining missing values in the dataset need to be imputed. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others. Data can also be missing not at random (MNAR), for example if proteins are not quantified in specific conditions (e.g. in the control samples). MNAR can indicate that proteins are below the detection limit in specific samples, which could be very well the case in proteomics experiments. For these different conditions, different imputation methods have to be used, as described in the MSnbase vignette and more specifically in the impute function descriptions.
To explore the pattern of missing values in the data, a heatmap is plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.
This heatmap indicates that missing values are highly biased to specific samples. The example dataset is an affinity enrichment dataset of ubiquitin interactors, which is likely to have proteins which are below the detection limit in specific samples. These can be proteins that are specifically enriched in the ubiquitin purifications, but are not enriched in the controls samples, or vice versa. To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.
# Plot intensity distributions and cumulative fraction of proteins with and without missing values
plot_detect(data_filt)
Indeed the proteins with missing values have on average low intensities. This data (MNAR and close to the detection limit) should be imputed by a left-censored imputation method, such as the quantile regression-based left-censored function (“QRILC”) or random draws from a left-shifted distribution (“MinProb” and “man”). In contrast, MAR data should be imputed with methods such as k-nearest neighbor (“knn”) or maximum likelihood (“MLE”) functions. See the MSnbase vignette and more specifically the impute function description for more information.
# All possible imputation methods are printed in an error, if an invalid function name is given.
impute(data_norm, fun = "")
## Error in match.arg(fun): 'arg' should be one of "bpca", "knn", "QRILC", "MLE", "MinDet", "MinProb", "man", "min", "zero", "mixed", "nbavg"
# Impute missing data using random draws from a Gaussian distribution centered around a minimal value (for MNAR)
data_imp <- impute(data_norm, fun = "MinProb", q = 0.01)
# Impute missing data using random draws from a manually defined left-shifted Gaussian distribution (for MNAR)
data_imp_man <- impute(data_norm, fun = "man", shift = 1.8, scale = 0.3)
# Impute missing data using the k-nearest neighbour approach (for MAR)
data_imp_knn <- impute(data_norm, fun = "knn", rowmax = 0.9)
The effect of the imputation on the distributions can be visualized.
Protein-wise linear models combined with empirical Bayes statistics are used for the differential enrichment analysis (or differential expression analysis). The test_diff function introduced here uses limma and automatically generates the contrasts to be tested. For the contrasts generation, the control sample has to be specified. Additionally, the types of contrasts to be produced need to be indicated, allowing the generation of all possible comparisons (“all”) or the generation of contrasts of every sample versus control (“control”). Alternatively, the user can manually specify the contrasts to be tested (type = “manual”), which need to be specified in the argument test.
# Differential enrichment analysis based on linear models and empherical Bayes statistics
# Test every sample versus control
data_diff <- test_diff(data_imp, type = "control", control = "Ctrl")
## Tested contrasts: Ubi4_vs_Ctrl, Ubi6_vs_Ctrl, Ubi1_vs_Ctrl
# Test all possible comparisons of samples
data_diff_all_contrasts <- test_diff(data_imp, type = "all")
## Tested contrasts: Ubi4_vs_Ubi6, Ubi4_vs_Ctrl, Ubi4_vs_Ubi1, Ubi6_vs_Ctrl, Ubi6_vs_Ubi1, Ctrl_vs_Ubi1
# Test manually defined comparisons
data_diff_manual <- test_diff(data_imp, type = "manual",
test = c("Ubi4_vs_Ctrl", "Ubi6_vs_Ctrl"))
## Tested contrasts: Ubi4_vs_Ctrl, Ubi6_vs_Ctrl
Finally, significant proteins are defined by user-defined cutoffs using add_rejections.
The results from the previous analysis can be easily visualized by a number of functions. These visualizations assist in the determination of the optimal cutoffs to be used, highlight the most interesting samples and contrasts, and pinpoint differentially enriched/expressed proteins.
The PCA plot can be used to get a high-level overview of the data. This can be very useful to observe batch effects, such as clear differences between replicates.
# Plot the first and second principal components
plot_pca(dep, x = 1, y = 2, n = 500, point_size = 4)
## Warning: Use of `pca_df[[indicate[1]]]` is discouraged.
## ℹ Use `.data[[indicate[1]]]` instead.
## Warning: Use of `pca_df[[indicate[2]]]` is discouraged.
## ℹ Use `.data[[indicate[2]]]` instead.
A correlation matrix can be plotted as a heatmap, to visualize the Pearson correlations between the different samples.
The heatmap representation gives an overview of all significant proteins (rows) in all samples (columns). This allows to see general trends, for example if one sample or replicate is really different compared to the others. Additionally, the clustering of samples (columns) can indicate closer related samples and clustering of proteins (rows) indicates similarly behaving proteins. The proteins can be clustered by k-means clustering (kmeans argument) and the number of clusters can be defined by argument k.
# Plot a heatmap of all significant proteins with the data centered per protein
plot_heatmap(dep, type = "centered", kmeans = TRUE,
k = 6, col_limit = 4, show_row_names = FALSE,
indicate = c("condition", "replicate"))
The heatmap shows a clustering of replicates and indicates that 4Ubi and 6Ubi enrich a similar repertoire of proteins. The k-means clustering of proteins (general clusters of rows) nicely separates protein classes with different binding behaviors.
Alternatively, a heatmap can be plotted using the contrasts, i.e. the direct sample comparisons, as columns. Here, this emphasises the enrichment of ubiquitin interactors compared to the control sample.
Volcano plots can be used to visualize a specific contrast (comparison between two samples). This allows to inspect the enrichment of proteins between the two samples (x axis) and their corresponding adjusted p value (y axis). The add_names argument can be set to FALSE if the protein labels should be omitted, for example if there are too many names.
# Plot a volcano plot for the contrast "Ubi6 vs Ctrl""
plot_volcano(dep, contrast = "Ubi6_vs_Ctrl", label_size = 2, add_names = TRUE)
## Warning: ggrepel: 50 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
It can also be useful to plot the data of a single protein, for example if this protein is of special interest.
Proteins can be differentially enriched/expressed in multiple comparisons. To visualize the distribution of significant conditions per protein and the overlap between conditions, the plot_cond function can be used.
To extract a table containing the essential results, the get_results function can be used.
# Generate a results table
data_results <- get_results(dep)
# Number of significant proteins
data_results %>% filter(significant) %>% nrow()
## [1] 258
The resulting table contains the following columns:
## [1] "name" "ID"
## [3] "Ubi1_vs_Ctrl_p.val" "Ubi4_vs_Ctrl_p.val"
## [5] "Ubi6_vs_Ctrl_p.val" "Ubi1_vs_Ctrl_p.adj"
## [7] "Ubi4_vs_Ctrl_p.adj" "Ubi6_vs_Ctrl_p.adj"
## [9] "Ubi1_vs_Ctrl_significant" "Ubi4_vs_Ctrl_significant"
## [11] "Ubi6_vs_Ctrl_significant" "significant"
## [13] "Ubi1_vs_Ctrl_ratio" "Ubi4_vs_Ctrl_ratio"
## [15] "Ubi6_vs_Ctrl_ratio" "Ctrl_centered"
## [17] "Ubi1_centered" "Ubi4_centered"
## [19] "Ubi6_centered"
Of these columns, the p.val and p.adj columns contain the raw and adjusted p values, respectively, for the contrast as depicted in the column name. The ratio columns contain the average log2 fold changes. The significant columns indicate whether the protein is differentially enriched/expressed, as defined by the chosen cutoffs. The centered columns contain the average log2 fold changes scaled by protein-wise centering.
You might want to obtain an ordinary data.frame of the results. For this purpose, the package provides functions to convert SummarizedExperiment objects to data.frames. get_df_wide will generate a wide table, whereas get_df_long will generate a long table.
To facilitate future analysis and/or visualization of your current data, saving your analyzed data is highly recommended. We save the final data object (dep) as well as intermediates of the analysis, i.e. the initial SummarizedExperiment object (data_se), normalized data (data_norm), imputed data (data_imp) and differentially expression analyzed data (data_diff). This allows us to easily change parameters in future analysis.
The package contains workflow functions that entail the complete analysis and generate a report.
Differential enrichment analysis of label-free proteomics data can be performed using the LFQ workflow function.
# The data is provided with the package
data <- UbiLength
experimental_design <- UbiLength_ExpDesign
# The wrapper function performs the full analysis
data_results <- LFQ(data, experimental_design, fun = "MinProb",
type = "control", control = "Ctrl", alpha = 0.05, lfc = 1)
## [1] 0.2608395
This wrapper produces a list of objects, which can be used to create a report and/or for further analysis. The report function produces two reports (pdf and html), a results table and a RData object, which are saved in a generated “Report” folder.
The results generated by the LFQ function contain, among others, the results object (data.frame object) and the dep object (SummarizedExperiment object).
## [1] "data" "se" "filt" "norm" "imputed" "diff" "dep"
## [8] "results" "param"
The results object contains the essential results and the dep object contains the full SummarizedExperiment object. The results table can be explored by selecting the $results object
# Extract the results table
results_table <- data_results$results
# Number of significant proteins
results_table %>% filter(significant) %>% nrow()
## [1] 233
The full data (dep object) can be used for the plotting functions as described in the chapter “Visualization of the results”, for example a heatmap.
Differential enrichment analysis of Tandem-Mass-Tag labeled proteomics data is also supported. Protein results tables from IsobarQuant can be directly analyzed using the TMT wrapper function.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 DEP_1.29.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 shape_1.4.6.1
## [5] MultiAssayExperiment_1.33.0 magrittr_2.0.3
## [7] farver_2.1.2 MALDIquant_1.22.3
## [9] rmarkdown_2.29 GlobalOptions_0.1.2
## [11] zlibbioc_1.52.0 vctrs_0.6.5
## [13] htmltools_0.5.8.1 S4Arrays_1.7.1
## [15] SparseArray_1.7.2 mzID_1.45.0
## [17] sass_0.4.9 bslib_0.8.0
## [19] htmlwidgets_1.6.4 plyr_1.8.9
## [21] sandwich_3.1-1 impute_1.81.0
## [23] zoo_1.8-12 cachem_1.1.0
## [25] buildtools_1.0.0 igraph_2.1.1
## [27] mime_0.12 lifecycle_1.0.4
## [29] iterators_1.0.14 pkgconfig_2.0.3
## [31] Matrix_1.7-1 R6_2.5.1
## [33] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [35] MatrixGenerics_1.19.0 shiny_1.9.1
## [37] clue_0.3-66 fdrtool_1.2.18
## [39] digest_0.6.37 pcaMethods_1.99.0
## [41] colorspace_2.1-1 S4Vectors_0.45.2
## [43] GenomicRanges_1.59.0 labeling_0.4.3
## [45] fansi_1.0.6 httr_1.4.7
## [47] abind_1.4-8 compiler_4.4.2
## [49] withr_3.0.2 doParallel_1.0.17
## [51] BiocParallel_1.41.0 MASS_7.3-61
## [53] DelayedArray_0.33.2 rjson_0.2.23
## [55] mzR_2.41.0 tools_4.4.2
## [57] PSMatch_1.11.0 httpuv_1.6.15
## [59] glue_1.8.0 QFeatures_1.17.0
## [61] promises_1.3.0 grid_4.4.2
## [63] cluster_2.1.6 reshape2_1.4.4
## [65] generics_0.1.3 gtable_0.3.6
## [67] tzdb_0.4.0 preprocessCore_1.69.0
## [69] tidyr_1.3.1 hms_1.1.3
## [71] utf8_1.2.4 XVector_0.47.0
## [73] BiocGenerics_0.53.3 ggrepel_0.9.6
## [75] foreach_1.5.2 pillar_1.9.0
## [77] stringr_1.5.1 limma_3.63.2
## [79] later_1.3.2 circlize_0.4.16
## [81] gmm_1.8 lattice_0.22-6
## [83] tidyselect_1.2.1 ComplexHeatmap_2.23.0
## [85] maketools_1.3.1 knitr_1.49
## [87] gridExtra_2.3 IRanges_2.41.1
## [89] ProtGenerics_1.39.0 imputeLCMD_2.1
## [91] SummarizedExperiment_1.37.0 stats4_4.4.2
## [93] xfun_0.49 shinydashboard_0.7.2
## [95] Biobase_2.67.0 statmod_1.5.0
## [97] MSnbase_2.33.2 matrixStats_1.4.1
## [99] DT_0.33 stringi_1.8.4
## [101] UCSC.utils_1.3.0 lazyeval_0.2.2
## [103] yaml_2.3.10 evaluate_1.0.1
## [105] codetools_0.2-20 MsCoreUtils_1.19.0
## [107] tibble_3.2.1 BiocManager_1.30.25
## [109] cli_3.6.3 affyio_1.77.0
## [111] xtable_1.8-4 munsell_0.5.1
## [113] jquerylib_0.1.4 Rcpp_1.0.13-1
## [115] GenomeInfoDb_1.43.1 norm_1.0-11.1
## [117] png_0.1-8 XML_3.99-0.17
## [119] parallel_4.4.2 ggplot2_3.5.1
## [121] readr_2.1.5 assertthat_0.2.1
## [123] AnnotationFilter_1.31.0 mvtnorm_1.3-2
## [125] scales_1.3.0 tmvtnorm_1.6
## [127] affy_1.85.0 ncdf4_1.23
## [129] purrr_1.0.2 crayon_1.5.3
## [131] GetoptLong_1.0.5 rlang_1.1.4
## [133] vsn_3.75.0