---
title: "Introduction to AlpsNMR"
author: "AlpsNMR authors"
package: AlpsNMR
abstract: >
  An introduction to the AlpsNMR package, showing the most relevant functions and
  a proposed workflow. This includes loading bruker NMR samples, adding sample
  annotations, preprocessing the spectra, detecting outliers, detecting peaks,
  aligning the samples and integrating the peaks to build a peak table.
date: "`r format(Sys.Date(), '%F')`"
output:
  BiocStyle::pdf_document:
    latex_engine: lualatex
vignette: >
  %\VignetteIndexEntry{Vignette 01: Introduction to AlpsNMR (start here)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


# Getting started

The `AlpsNMR` package has most of its functions prefixed with `nmr_`.
The main reason for this is to avoid conflicts with other packages.
Besides, it helps for autocompletion:
Most coding environments such as RStudio will let you see most of the function
names by typing `nmr_` followed by pressing the tab key.

This vignette assumes some basic knowledge of NMR and data analysis, and some
basic R programming.

We will start by loading `AlpsNMR` along some convenience packages:

```{r}
library(dplyr)
library(ggplot2)
library(readxl)
library(BiocParallel)
library(AlpsNMR)
```


# Enable parallellization

This package is able to parallellize several functions through the use of the
`BiocParallel` package. Whether to parallelize or not is left to the user
that can control the parallellization registering backends. Please check
`vignette("Introduction_To_BiocParallel", package = "BiocParallel")`.

```{r}
#register(SerialParam(), default = TRUE)  # disable parallellization
register(SnowParam(workers = 2, exportglobals = FALSE), default = TRUE)  # enable parallellization with 2 workers
```


# Data: The `MeOH_plasma_extraction` dataset

To explore the basics of the AlpsNMR package, we have included three NMR samples
acquired in a 600 MHz Bruker instrument bundled with the package.
The samples are pooled quality control plasma samples, that were extracted with methanol.
They only contain small molecules.

If you have installed this package, you can obtain the directory where the
samples are with the command:

```{r}
MeOH_plasma_extraction_dir <- system.file("dataset-demo", package = "AlpsNMR")
MeOH_plasma_extraction_dir
```

The demo directory includes three zipped Bruker samples and a dummy Excel metadata file:

```{r}
list.files(MeOH_plasma_extraction_dir)
```

Since these are quality control samples, the metadata is a dummy table:

```{r}
MeOH_plasma_extraction_xlsx <- file.path(MeOH_plasma_extraction_dir, "dummy_metadata.xlsx")
annotations <- readxl::read_excel(MeOH_plasma_extraction_xlsx)
annotations
```

# Loading samples

The function to read samples is called `nmr_read_samples()`. It expects a
character vector with the samples to load that can be paths to directories of
Bruker format samples or paths to JDX files.

Additionally, this function can filter by pulse sequences (e.g. load only NOESY
samples) or loading only metadata.

```{r load-samples}
zip_files <- fs::dir_ls(MeOH_plasma_extraction_dir, glob = "*.zip")
zip_files
dataset <- nmr_read_samples(sample_names = zip_files)
dataset
```

If your samples happen to be in different **folders per class**, AlpsNMR provides
convenience functions to read them as well. With this example:

```
- your_dataset/
  + control/
     * 10/
     * 20/
     * 30/
  + mutated/
     * 10/
     * 20/
     * 30/
```

You could use:

```
dataset <- nmr_read_samples_dir(c("your_dataset/control", "your_dataset/mutated"))
dataset
```

If after reading the `?nmr_read_samples` page you still have issues, feel free
to open an issue at https://github.com/sipss/AlpsNMR/issues and ask for
clarification.

# Adding annotations

We can embed the external annotations we loaded above into the dataset:

```{r}
dataset <- nmr_meta_add(dataset, metadata = annotations, by = "NMRExperiment")
```

And retrieve them from the dataset:

```{r}
nmr_meta_get(dataset, groups = "external")
```


If you want to learn more about sample metadata (including acquisition and FID 
processing parameters), as well as more complex ways of adding annotations,
check out the `vignette("Vig02-handling-metadata-and-annotations", package = "AlpsNMR")`.

# Phasing 

It might be the case that automatically reconstructed metabolite NMR spectra have 
a first-order phase error. AlpsNMR provides a convenient wrapper function 
the `NMRphasing` package, offering a variety of algorithms to estimate 
and correct for phase errors, which arise on physical grounds.

```{r}
#dataset <- nmr_autophase(dataset, method="MPC_DANM")
```


# Interpolation

1D NMR samples can be interpolated together, in order to arrange all the spectra 
into a matrix, with one row per sample. Here we choose the 
range of ppm values that we want to include in further analyses.

```{r}
dataset <- nmr_interpolate_1D(dataset, axis = c(min = -0.5, max = 10))
```

If the `axis = NULL` then the ppm axis is autodetected from the samples.

See `nmr_interpolate_1D()` for further reference on the axis options.

# Plotting samples

Plotting many spectra with so many points is quite expensive so it is possible
to include only some regions of the spectra or plot only some samples.

```{r}
plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))
```

<!--
See `vignette("Vig03-plotting-samples", package="AlpsNMR")` for more advanced plot options, or
having interactive plots.
-->


# Exclude regions

Some regions can easily be excluded from the spectra with `nmr_exclude_region()`:

```{r}
regions_to_exclude <- list(water = c(4.6, 5), methanol = c(3.33, 3.39))
dataset <- nmr_exclude_region(dataset, exclude = regions_to_exclude)
plot(dataset, chemshift_range = c(4.2, 5.5))
```


# Filter samples

Maybe we just want to analyze a subset of the data, e.g., only a class group or
a particular gender. We can filter some samples according to their metadata as 
follows:

```{r}
samples_10_20 <- filter(dataset, SubjectID == "Ana")
nmr_meta_get(samples_10_20, groups = "external")
```


# Robust PCA for outlier detection

The AlpsNMR package includes robust PCA analysis for outlier detection.

```{r}
pca_outliers_rob <- nmr_pca_outliers_robust(dataset, ncomp = 3)
nmr_pca_outliers_plot(dataset, pca_outliers_rob)
```

Samples with greater QResiduals and Tscores than the threshold defined by the red line
are candidates for further exploration and exclusion. With this small dataset,
there is not much to see.

# Baseline estimation

Spectra may display an unstable baseline, specially when processing
blood/fecal samples. 

The peak detection and integration algorithms benefit from having
an estimation of the baseline, so it is advisable to compute it first
and check it fits as expected.


See before:

```{r}
plot(dataset, chemshift_range = c(1.37, 2.5))
```


```{r}
plot(dataset, chemshift_range = c(3.5,3.8))
```

Estimate the baseline:

```{r}
dataset <- nmr_baseline_estimation(dataset, lambda = 9, p = 0.01)
```


And after:


```{r fig.height=10, fig.width=10}
# TODO: Simplify this plot
spectra_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5))
baseline_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5), matrix_name = "data_1r_baseline")

ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) +
    geom_line(data = spectra_to_plot) +
    geom_line(data = baseline_to_plot, linetype = "dashed") + 
    facet_wrap(~NMRExperiment, ncol = 1)

```


```{r fig.height=10, fig.width=10}
# TODO: Simplify this plot
spectra_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8))
baseline_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8), matrix_name = "data_1r_baseline")

ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) +
    geom_line(data = spectra_to_plot) +
    geom_line(data = baseline_to_plot, linetype = "dashed") + 
    facet_wrap(~NMRExperiment, ncol = 1)

```


# Peak detection

The peak detection is performed on short spectra segments using a continuous
wavelet transform. Peaks below a threshold intensity are automatically discarded.

Our current approach relies on the use of the baseline threshold 
(`baselineThresh`) automatically calculated (see `?nmr_baseline_threshold`) 
and the Signal to Noise Threshold (`SNR.Th`) to discriminate valid peaks 
from noise.

See `?nmr_detect_peaks` for more information.

```{r}
baselineThresh <- nmr_baseline_threshold(dataset, range_without_peaks = c(9.5, 10), method = "median3mad")
nmr_baseline_threshold_plot(dataset, baselineThresh)
```


```{r}
peak_list_initial <- nmr_detect_peaks(
    dataset,
    nDivRange_ppm = 0.1,
    scales = seq(1, 16, 2),
    baselineThresh = baselineThresh,
    SNR.Th = 3,
    fit_lorentzians = TRUE
)
```

We can get an overview of the number of peaks we detect on each sample and each chemical shift region:

```{r}
nmr_detect_peaks_plot_overview(peak_list_initial)
```


We can explore in a more detailed way the detected peaks:

```{r}
nmr_detect_peaks_plot(dataset, peak_list_initial, NMRExperiment = "10", chemshift_range = c(3, 3.3))
```

Let's the detected peaks in a smaller region across samples:

```{r}
peak_list_in_range <- filter(peak_list_initial, ppm > 3.22, ppm < 3.24)
peak_list_in_range
```

```{r}
plot(dataset, chemshift_range = c(3.22, 3.25))
```


```{r}
nmr_detect_peaks_plot_peaks(
    dataset,
    peak_list_initial,
    peak_ids = peak_list_in_range$peak_id,
    caption = paste("{peak_id}",
                    "(NMRExp.\u00A0{NMRExperiment},",
                    "gamma(ppb)\u00a0=\u00a0{gamma_ppb},",
                    "\narea\u00a0=\u00a0{area},",
                    "nrmse\u00a0=\u00a0{norm_rmse})")
)

```


```{r}
peak_list_initial_accepted <- peaklist_accept_peaks(
    peak_list_initial,
    dataset,
    area_min = 50, 
    keep_rejected = FALSE,
    verbose = TRUE
)
```

# Spectra alignment

Once we have a preliminary peak list, we can align the spectra using the
`nmr_align()` function. We expect shifts between the spectra, this becomes
necessary so we can cluster the peaks correctly afterwards and build a peak table.

The alignment process takes several parameters, including:

- `NMRExp_ref`: An NMRExperiment with a reference sample. Usually it should be a pool
  of all samples if it is available. Otherwise, you can use `nmr_align_find_ref()`
  to find a sample. Depending on how heterogeneous your dataset is, there may
  not be a good reference sample (even if the function picks one, the alignment
  might not succeed), so please always check the results afterwards.
 
- `maxShift_ppm`: The maximum shift allowed when aligning the spectra.
- `acceptLostPeak`: Set it to `TRUE` if you want to accept some peaks getting
   lost during the alignment process. Since the peak detection is never perfect,
   it is reasonable to accept some lost peaks.


```{r}
NMRExp_ref <- nmr_align_find_ref(dataset, peak_list_initial_accepted)
message("Your reference is NMRExperiment ", NMRExp_ref)
```


```{r}
dataset_align <- nmr_align(
    nmr_dataset = dataset, 
    peak_data = peak_list_initial_accepted, 
    NMRExp_ref = NMRExp_ref, 
    maxShift_ppm = 0.0015, 
    acceptLostPeak = TRUE
)
```

Compare the dataset before and after alignment, to verify the quality of the alignment:

```{r}
plot(dataset, chemshift_range = c(3.025, 3.063))
plot(dataset_align, chemshift_range = c(3.025, 3.063))
```

```{r}
cowplot::plot_grid(
    plot(dataset, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none"),
    plot(dataset_align, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none")
)
```


# Normalization

With the spectra correctly aligned, you can use spectra normalization techniques.
We normalize after alignment because some of the normalization techniques are
sensitive to misalignments.

There are multiple normalization techniques available. The most strongly 
recommended is the Probabilistic Quantile Normalization (`pqn`), but it requires
more samples for its internal estimations to be reliable, as it needs a
computation of the  median spectra. Nevertheless, it is possible to compute it:

```{r}
dataset_norm <- nmr_normalize(dataset_align, method = "pqn")
```

The normalization essentially computes a normalization factor for each sample.

The plot shows the dispersion with respect to the median of the normalization
factors, and can highlight samples with abnormally large or small normalization
factors.

```{r}
normalization_info <- nmr_normalize_extra_info(dataset_norm)
normalization_info$norm_factor
normalization_info$plot
```

We can confirm sample 20 is now slightly more diluted:

```{r}
to_plot <- dplyr::bind_rows(
    tidy(dataset_align, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>%
        mutate(Normalized = "No"),
    tidy(dataset_norm, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>%
        mutate(Normalized = "Yes"),
)
ggplot(data = to_plot, mapping = aes(x = chemshift, y = intensity, color = Normalized)) + 
    geom_line() +
    scale_x_reverse() +
    labs(y = "Intensity", x = "Chemical shift (ppm)",
         caption = "The normalization slightly diluted experiment 20")


```

And all samples are more homogeneous now:

```{r}
cowplot::plot_grid(
    plot(dataset_align, chemshift_range = c(2, 2.5)) + labs(title="Before Normalization"),
    plot(dataset_norm, chemshift_range = c(2, 2.5)) + labs(title="After Normalization"),
    ncol = 1
)
```


# Peak grouping


If you align or normalize your samples, you should rerun the peak detection to ensure the peak
positions and estimations are well calculated:


```{r}
baselineThresh <- nmr_baseline_threshold(dataset_norm, range_without_peaks = c(9.5, 10), method = "median3mad")
nmr_baseline_threshold_plot(dataset_norm, baselineThresh)
```

```{r}
peak_list_for_clustering_unfiltered <- nmr_detect_peaks(
    dataset_norm,
    nDivRange_ppm = 0.1,
    scales = seq(1, 16, 2),
    baselineThresh = baselineThresh,
    SNR.Th = 3,
    fit_lorentzians = TRUE,
    verbose = TRUE
)

peak_list_for_clustering <- peaklist_accept_peaks(
    peak_list_for_clustering_unfiltered,
    dataset_norm,
    area_min = 50, 
    keep_rejected = FALSE,
    verbose = TRUE
)

```

Feel free to plot, explore and further curate your peak list. Or proceed with the current one:


Once we have a peak list for each sample `peak_list`, we need to turn it into
a table, merging peaks from different samples together.

```{r}
clustering <- nmr_peak_clustering(peak_list_for_clustering, verbose = TRUE)
```


```{r}
cowplot::plot_grid(
    clustering$num_cluster_estimation$plot + labs(title = "Full"),
    clustering$num_cluster_estimation$plot +
        xlim(clustering$num_cluster_estimation$num_clusters-50, clustering$num_cluster_estimation$num_clusters+50) +
        ylim(0, 10*clustering$num_cluster_estimation$max_dist_thresh_ppb) +
        labs(title = "Fine region")
)
```


```{r}
peak_list_clustered <- clustering$peak_data

```

We can plot the samples, with the detected peaks and how they have been connected.
This allows us to compare the peak detection across samples, and check how good
the peak matching is.

If peaks are matched they are connected with a black segment. If peaks are detected
but not matched, they appear as a dot. If you see a peak, without a point on top
then it means the peak was not detected or it was filtered out.


```{r}
nmr_peak_clustering_plot(
    dataset = dataset_norm,
    peak_list_clustered = peak_list_clustered,
    NMRExperiments = c("10", "20"),
    chemshift_range = c(2.4, 3.0)
)
```

Sometimes we see peaks and we wonder why are they not detected. We can include
the `baselineThresh` to plot it in the sample as well. This can help to diagnose
if the `baselineThresh` argument is the cause of a peak not being detected.


```{r}
nmr_peak_clustering_plot(
    dataset_norm,
    peak_list_clustered, 
    NMRExperiments = c("10", "20"),
    chemshift_range = c(4.2, 4.6),
    baselineThresh = baselineThresh
)
```


```{r}
peak_table <- nmr_build_peak_table(peak_list_clustered, dataset_norm)
peak_table
```


```{r}
peak_matrix <- nmr_data(peak_table)
peak_matrix[1:3, 1:8]
```

Or you can get a data frame with the corresponding annotations:

```{r}
peak_table_df <- as.data.frame(peak_table)
peak_table_df
```

```{r}
saveRDS(peak_table, "demo_peak_table.rds")
```


From this peak table you can proceed to use statistical testing, machine learning,
and any downstream analysis you may be interested in.

<!--

See `vignette("Vig04-from-peak-table-to-differential-analysis", package = "AlpsNMR")`
for more.
-->


# Session Info:

```{r}
sessionInfo()
```