--- title: "Introduction to AlpsNMR" author: "AlpsNMR authors" package: AlpsNMR abstract: > An introduction to the AlpsNMR package, showing the most relevant functions and a proposed workflow. This includes loading bruker NMR samples, adding sample annotations, preprocessing the spectra, detecting outliers, detecting peaks, aligning the samples and integrating the peaks to build a peak table. date: "`r format(Sys.Date(), '%F')`" output: BiocStyle::pdf_document: latex_engine: lualatex vignette: > %\VignetteIndexEntry{Vignette 01: Introduction to AlpsNMR (start here)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Getting started The `AlpsNMR` package has most of its functions prefixed with `nmr_`. The main reason for this is to avoid conflicts with other packages. Besides, it helps for autocompletion: Most coding environments such as RStudio will let you see most of the function names by typing `nmr_` followed by pressing the tab key. This vignette assumes some basic knowledge of NMR and data analysis, and some basic R programming. We will start by loading `AlpsNMR` along some convenience packages: ```{r} library(dplyr) library(ggplot2) library(readxl) library(BiocParallel) library(AlpsNMR) ``` # Enable parallellization This package is able to parallellize several functions through the use of the `BiocParallel` package. Whether to parallelize or not is left to the user that can control the parallellization registering backends. Please check `vignette("Introduction_To_BiocParallel", package = "BiocParallel")`. ```{r} #register(SerialParam(), default = TRUE) # disable parallellization register(SnowParam(workers = 2, exportglobals = FALSE), default = TRUE) # enable parallellization with 2 workers ``` # Data: The `MeOH_plasma_extraction` dataset To explore the basics of the AlpsNMR package, we have included three NMR samples acquired in a 600 MHz Bruker instrument bundled with the package. The samples are pooled quality control plasma samples, that were extracted with methanol. They only contain small molecules. If you have installed this package, you can obtain the directory where the samples are with the command: ```{r} MeOH_plasma_extraction_dir <- system.file("dataset-demo", package = "AlpsNMR") MeOH_plasma_extraction_dir ``` The demo directory includes three zipped Bruker samples and a dummy Excel metadata file: ```{r} list.files(MeOH_plasma_extraction_dir) ``` Since these are quality control samples, the metadata is a dummy table: ```{r} MeOH_plasma_extraction_xlsx <- file.path(MeOH_plasma_extraction_dir, "dummy_metadata.xlsx") annotations <- readxl::read_excel(MeOH_plasma_extraction_xlsx) annotations ``` # Loading samples The function to read samples is called `nmr_read_samples()`. It expects a character vector with the samples to load that can be paths to directories of Bruker format samples or paths to JDX files. Additionally, this function can filter by pulse sequences (e.g. load only NOESY samples) or loading only metadata. ```{r load-samples} zip_files <- fs::dir_ls(MeOH_plasma_extraction_dir, glob = "*.zip") zip_files dataset <- nmr_read_samples(sample_names = zip_files) dataset ``` If your samples happen to be in different **folders per class**, AlpsNMR provides convenience functions to read them as well. With this example: ``` - your_dataset/ + control/ * 10/ * 20/ * 30/ + mutated/ * 10/ * 20/ * 30/ ``` You could use: ``` dataset <- nmr_read_samples_dir(c("your_dataset/control", "your_dataset/mutated")) dataset ``` If after reading the `?nmr_read_samples` page you still have issues, feel free to open an issue at https://github.com/sipss/AlpsNMR/issues and ask for clarification. # Adding annotations We can embed the external annotations we loaded above into the dataset: ```{r} dataset <- nmr_meta_add(dataset, metadata = annotations, by = "NMRExperiment") ``` And retrieve them from the dataset: ```{r} nmr_meta_get(dataset, groups = "external") ``` If you want to learn more about sample metadata (including acquisition and FID processing parameters), as well as more complex ways of adding annotations, check out the `vignette("Vig02-handling-metadata-and-annotations", package = "AlpsNMR")`. # Phasing It might be the case that automatically reconstructed metabolite NMR spectra have a first-order phase error. AlpsNMR provides a convenient wrapper function the `NMRphasing` package, offering a variety of algorithms to estimate and correct for phase errors, which arise on physical grounds. ```{r} #dataset <- nmr_autophase(dataset, method="MPC_DANM") ``` # Interpolation 1D NMR samples can be interpolated together, in order to arrange all the spectra into a matrix, with one row per sample. Here we choose the range of ppm values that we want to include in further analyses. ```{r} dataset <- nmr_interpolate_1D(dataset, axis = c(min = -0.5, max = 10)) ``` If the `axis = NULL` then the ppm axis is autodetected from the samples. See `nmr_interpolate_1D()` for further reference on the axis options. # Plotting samples Plotting many spectra with so many points is quite expensive so it is possible to include only some regions of the spectra or plot only some samples. ```{r} plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8)) ``` # Exclude regions Some regions can easily be excluded from the spectra with `nmr_exclude_region()`: ```{r} regions_to_exclude <- list(water = c(4.6, 5), methanol = c(3.33, 3.39)) dataset <- nmr_exclude_region(dataset, exclude = regions_to_exclude) plot(dataset, chemshift_range = c(4.2, 5.5)) ``` # Filter samples Maybe we just want to analyze a subset of the data, e.g., only a class group or a particular gender. We can filter some samples according to their metadata as follows: ```{r} samples_10_20 <- filter(dataset, SubjectID == "Ana") nmr_meta_get(samples_10_20, groups = "external") ``` # Robust PCA for outlier detection The AlpsNMR package includes robust PCA analysis for outlier detection. ```{r} pca_outliers_rob <- nmr_pca_outliers_robust(dataset, ncomp = 3) nmr_pca_outliers_plot(dataset, pca_outliers_rob) ``` Samples with greater QResiduals and Tscores than the threshold defined by the red line are candidates for further exploration and exclusion. With this small dataset, there is not much to see. # Baseline estimation Spectra may display an unstable baseline, specially when processing blood/fecal samples. The peak detection and integration algorithms benefit from having an estimation of the baseline, so it is advisable to compute it first and check it fits as expected. See before: ```{r} plot(dataset, chemshift_range = c(1.37, 2.5)) ``` ```{r} plot(dataset, chemshift_range = c(3.5,3.8)) ``` Estimate the baseline: ```{r} dataset <- nmr_baseline_estimation(dataset, lambda = 9, p = 0.01) ``` And after: ```{r fig.height=10, fig.width=10} # TODO: Simplify this plot spectra_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5)) baseline_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5), matrix_name = "data_1r_baseline") ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) + geom_line(data = spectra_to_plot) + geom_line(data = baseline_to_plot, linetype = "dashed") + facet_wrap(~NMRExperiment, ncol = 1) ``` ```{r fig.height=10, fig.width=10} # TODO: Simplify this plot spectra_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8)) baseline_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8), matrix_name = "data_1r_baseline") ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) + geom_line(data = spectra_to_plot) + geom_line(data = baseline_to_plot, linetype = "dashed") + facet_wrap(~NMRExperiment, ncol = 1) ``` # Peak detection The peak detection is performed on short spectra segments using a continuous wavelet transform. Peaks below a threshold intensity are automatically discarded. Our current approach relies on the use of the baseline threshold (`baselineThresh`) automatically calculated (see `?nmr_baseline_threshold`) and the Signal to Noise Threshold (`SNR.Th`) to discriminate valid peaks from noise. See `?nmr_detect_peaks` for more information. ```{r} baselineThresh <- nmr_baseline_threshold(dataset, range_without_peaks = c(9.5, 10), method = "median3mad") nmr_baseline_threshold_plot(dataset, baselineThresh) ``` ```{r} peak_list_initial <- nmr_detect_peaks( dataset, nDivRange_ppm = 0.1, scales = seq(1, 16, 2), baselineThresh = baselineThresh, SNR.Th = 3, fit_lorentzians = TRUE ) ``` We can get an overview of the number of peaks we detect on each sample and each chemical shift region: ```{r} nmr_detect_peaks_plot_overview(peak_list_initial) ``` We can explore in a more detailed way the detected peaks: ```{r} nmr_detect_peaks_plot(dataset, peak_list_initial, NMRExperiment = "10", chemshift_range = c(3, 3.3)) ``` Let's the detected peaks in a smaller region across samples: ```{r} peak_list_in_range <- filter(peak_list_initial, ppm > 3.22, ppm < 3.24) peak_list_in_range ``` ```{r} plot(dataset, chemshift_range = c(3.22, 3.25)) ``` ```{r} nmr_detect_peaks_plot_peaks( dataset, peak_list_initial, peak_ids = peak_list_in_range$peak_id, caption = paste("{peak_id}", "(NMRExp.\u00A0{NMRExperiment},", "gamma(ppb)\u00a0=\u00a0{gamma_ppb},", "\narea\u00a0=\u00a0{area},", "nrmse\u00a0=\u00a0{norm_rmse})") ) ``` ```{r} peak_list_initial_accepted <- peaklist_accept_peaks( peak_list_initial, dataset, area_min = 50, keep_rejected = FALSE, verbose = TRUE ) ``` # Spectra alignment Once we have a preliminary peak list, we can align the spectra using the `nmr_align()` function. We expect shifts between the spectra, this becomes necessary so we can cluster the peaks correctly afterwards and build a peak table. The alignment process takes several parameters, including: - `NMRExp_ref`: An NMRExperiment with a reference sample. Usually it should be a pool of all samples if it is available. Otherwise, you can use `nmr_align_find_ref()` to find a sample. Depending on how heterogeneous your dataset is, there may not be a good reference sample (even if the function picks one, the alignment might not succeed), so please always check the results afterwards. - `maxShift_ppm`: The maximum shift allowed when aligning the spectra. - `acceptLostPeak`: Set it to `TRUE` if you want to accept some peaks getting lost during the alignment process. Since the peak detection is never perfect, it is reasonable to accept some lost peaks. ```{r} NMRExp_ref <- nmr_align_find_ref(dataset, peak_list_initial_accepted) message("Your reference is NMRExperiment ", NMRExp_ref) ``` ```{r} dataset_align <- nmr_align( nmr_dataset = dataset, peak_data = peak_list_initial_accepted, NMRExp_ref = NMRExp_ref, maxShift_ppm = 0.0015, acceptLostPeak = TRUE ) ``` Compare the dataset before and after alignment, to verify the quality of the alignment: ```{r} plot(dataset, chemshift_range = c(3.025, 3.063)) plot(dataset_align, chemshift_range = c(3.025, 3.063)) ``` ```{r} cowplot::plot_grid( plot(dataset, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none"), plot(dataset_align, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none") ) ``` # Normalization With the spectra correctly aligned, you can use spectra normalization techniques. We normalize after alignment because some of the normalization techniques are sensitive to misalignments. There are multiple normalization techniques available. The most strongly recommended is the Probabilistic Quantile Normalization (`pqn`), but it requires more samples for its internal estimations to be reliable, as it needs a computation of the median spectra. Nevertheless, it is possible to compute it: ```{r} dataset_norm <- nmr_normalize(dataset_align, method = "pqn") ``` The normalization essentially computes a normalization factor for each sample. The plot shows the dispersion with respect to the median of the normalization factors, and can highlight samples with abnormally large or small normalization factors. ```{r} normalization_info <- nmr_normalize_extra_info(dataset_norm) normalization_info$norm_factor normalization_info$plot ``` We can confirm sample 20 is now slightly more diluted: ```{r} to_plot <- dplyr::bind_rows( tidy(dataset_align, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>% mutate(Normalized = "No"), tidy(dataset_norm, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>% mutate(Normalized = "Yes"), ) ggplot(data = to_plot, mapping = aes(x = chemshift, y = intensity, color = Normalized)) + geom_line() + scale_x_reverse() + labs(y = "Intensity", x = "Chemical shift (ppm)", caption = "The normalization slightly diluted experiment 20") ``` And all samples are more homogeneous now: ```{r} cowplot::plot_grid( plot(dataset_align, chemshift_range = c(2, 2.5)) + labs(title="Before Normalization"), plot(dataset_norm, chemshift_range = c(2, 2.5)) + labs(title="After Normalization"), ncol = 1 ) ``` # Peak grouping If you align or normalize your samples, you should rerun the peak detection to ensure the peak positions and estimations are well calculated: ```{r} baselineThresh <- nmr_baseline_threshold(dataset_norm, range_without_peaks = c(9.5, 10), method = "median3mad") nmr_baseline_threshold_plot(dataset_norm, baselineThresh) ``` ```{r} peak_list_for_clustering_unfiltered <- nmr_detect_peaks( dataset_norm, nDivRange_ppm = 0.1, scales = seq(1, 16, 2), baselineThresh = baselineThresh, SNR.Th = 3, fit_lorentzians = TRUE, verbose = TRUE ) peak_list_for_clustering <- peaklist_accept_peaks( peak_list_for_clustering_unfiltered, dataset_norm, area_min = 50, keep_rejected = FALSE, verbose = TRUE ) ``` Feel free to plot, explore and further curate your peak list. Or proceed with the current one: Once we have a peak list for each sample `peak_list`, we need to turn it into a table, merging peaks from different samples together. ```{r} clustering <- nmr_peak_clustering(peak_list_for_clustering, verbose = TRUE) ``` ```{r} cowplot::plot_grid( clustering$num_cluster_estimation$plot + labs(title = "Full"), clustering$num_cluster_estimation$plot + xlim(clustering$num_cluster_estimation$num_clusters-50, clustering$num_cluster_estimation$num_clusters+50) + ylim(0, 10*clustering$num_cluster_estimation$max_dist_thresh_ppb) + labs(title = "Fine region") ) ``` ```{r} peak_list_clustered <- clustering$peak_data ``` We can plot the samples, with the detected peaks and how they have been connected. This allows us to compare the peak detection across samples, and check how good the peak matching is. If peaks are matched they are connected with a black segment. If peaks are detected but not matched, they appear as a dot. If you see a peak, without a point on top then it means the peak was not detected or it was filtered out. ```{r} nmr_peak_clustering_plot( dataset = dataset_norm, peak_list_clustered = peak_list_clustered, NMRExperiments = c("10", "20"), chemshift_range = c(2.4, 3.0) ) ``` Sometimes we see peaks and we wonder why are they not detected. We can include the `baselineThresh` to plot it in the sample as well. This can help to diagnose if the `baselineThresh` argument is the cause of a peak not being detected. ```{r} nmr_peak_clustering_plot( dataset_norm, peak_list_clustered, NMRExperiments = c("10", "20"), chemshift_range = c(4.2, 4.6), baselineThresh = baselineThresh ) ``` ```{r} peak_table <- nmr_build_peak_table(peak_list_clustered, dataset_norm) peak_table ``` ```{r} peak_matrix <- nmr_data(peak_table) peak_matrix[1:3, 1:8] ``` Or you can get a data frame with the corresponding annotations: ```{r} peak_table_df <- as.data.frame(peak_table) peak_table_df ``` ```{r} saveRDS(peak_table, "demo_peak_table.rds") ``` From this peak table you can proceed to use statistical testing, machine learning, and any downstream analysis you may be interested in. # Session Info: ```{r} sessionInfo() ```