Load the required packages
This Vignette provides an example workflow for how to use the package MSstatsLOBD.
To install this package, start R (version “4.0”) and enter:
We will estimate the LOB/LOD for a few peptides an assay available on the CPTAC (Clinical Proteomic Tumor Analysis Consortium) assay portal c.f. [@Thomas]. The dataset contains spike in data for 43 distinct peptides. For each peptide, 8 distinct concentration spikes for 3 different replicates are measured. The Skyline files for the assay along with details about the experiment can be obtained from this webpage: https://assays.cancer.gov. The particular dataset examined here (called JHU_DChan_HZhang_ZZhang) can be found at https://panoramaweb.org/labkey/project/CPTAC%20Assay%20Portal/JHU_DChan_HZhang_ZZhang/Serum_QExactive_GlycopeptideEnrichedPRM/begin.view?. It should be downloaded from the MSStats website http://msstats.org/?smd_process_download=1&download_id=548.
The data is then exported in a csv file (${\tt calibration\_data\_raw.csv}$) from
Skyline. This is done in Skyline by selecting File → Export →
Report → QuaSAR input and then clicking
Export. The csv file contains the measured peak area for each fragment
of each light and heavy version of each peptide. Depending on the format
of the Skyline file and depending on whether standards were used, the
particular outputs obtained in the csv file may vary. In this particular
case the following variables are obtained in the output file ${\tt calibration\_data\_raw.csv}$: ${\tt File\ Name, Sample\ Name, Replicate\
Name, Protein\ Name, Peptide\ Sequence, Peptide\ Modified\
Sequence,}$
${\tt Precursor\ Charge, Product\
Charge,}$ ${\tt Fragment\
Ion, Average\ Measured\ Retention\ Time}$, ${\tt SampleGroup, IS\ Spike,}$ ${\tt Concentration, Replicate,light\ Area,
heavy\ Area}$. A number of variables are byproducts of the
acquisition process and will not be considered for the following, i.e.
${\tt File\ Name, Sample\ Name, Replicate\
Name, SampleGroup, IS\ Spike}$.
Variables that are important for the assay characterization are detailed
below (others are assumed to be self explanatory):
## File.Name Sample.Name Replicate.Name Protein.Name
## 1 Blank_0_1.raw NA Blank_0_1 sp|Q9HDC9|APMAP_HUMAN
## 2 Blank_0_2.raw NA Blank_0_2 sp|Q9HDC9|APMAP_HUMAN
## 3 Blank_0_3.raw NA Blank_0_3 sp|Q9HDC9|APMAP_HUMAN
## 4 A_1.raw NA A_1 sp|Q9HDC9|APMAP_HUMAN
## 5 B_1.raw NA B_1 sp|Q9HDC9|APMAP_HUMAN
## 6 C_1.raw NA C_1 sp|Q9HDC9|APMAP_HUMAN
## Peptide.Sequence Peptide.Modified.Sequence Precursor.Charge Product.Charge
## 1 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 2 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 3 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 4 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 5 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 6 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## Fragment.Ion Average.Measured.Retention.Time SampleGroup IS.Spike
## 1 y10 35.19 Bl NA
## 2 y10 35.19 Bl NA
## 3 y10 35.19 Bl NA
## 4 y10 35.19 A NA
## 5 y10 35.19 B NA
## 6 y10 35.19 C NA
## Concentration Replicate light.Area heavy.Area
## 1 0.0000 1 59322 0
## 2 0.0000 2 75627 0
## 3 0.0000 3 62117 0
## 4 0.0576 1 75369 0
## 5 0.2880 1 77955 21216
## 6 1.4400 1 81893 329055
We normalize the intensity of the light peptides using that of the heavy peptides. This corrects any systematic errors that can occur during a run or across replicates. The calculation is greatly simplified by the use of the ${\tt tidyr}$ and ${\tt dplyr}$ packages. The area from all the different peptide fragments is first summed then log transformed. The median intensity of the reference heavy peptides ${\tt medianlog2heavy}$ is calculated. Their intensities should ideally remain constant across runs since the spiked concentration of the heavy peptide is constant. The difference between the median for all the heavy peptide spikes is calculated. It is then used to correct (i.e. to normalize) the intensity of the light peptides ${\tt log2light}$ to obtain the adjusted intensity ${\tt log2light\_norm}$. The intensity is finally converted back to original space.
#Select variable that are need
df <- raw_data %>% select(Peptide.Sequence,Precursor.Charge,
Product.Charge, Fragment.Ion,Concentration,
Replicate, light.Area, heavy.Area,
SampleGroup, File.Name)
#Convert factors to numeric and remove NA values:
df <- df %>% mutate(heavy.Area = as.numeric(heavy.Area)) %>%
filter(!is.na(heavy.Area))
#Sum area over all fragments
df2 <- df %>% group_by(Peptide.Sequence, Replicate, SampleGroup,
Concentration, File.Name) %>%
summarize(A_light = sum(light.Area), A_heavy = sum(heavy.Area))
## `summarise()` has grouped output by 'Peptide.Sequence', 'Replicate',
## 'SampleGroup', 'Concentration'. You can override using the `.groups` argument.
#Convert to log scale
df2 <- df2 %>% mutate(log2light = log2(A_light), log2heavy = log2(A_heavy))
#Calculate median of heavy(reference) for a run
df3 <- df2 %>% group_by(Peptide.Sequence) %>%
summarize(medianlog2light = median(log2light),
medianlog2heavy= median(log2heavy))
#Modify light intensity so that the intensity of the heavy is constant (=median) across a run.
df4 <- left_join(df2,df3, by = "Peptide.Sequence") %>%
mutate(log2light_delta = medianlog2light - log2light) %>%
mutate(log2heavy_norm = log2heavy + log2light_delta,
log2light_norm = log2light + log2light_delta) %>%
mutate(A_heavy_norm = 2**log2heavy_norm, A_light_norm = 2**log2light_norm)
#Format the data for MSstats:
#Select the heavy area, concentration, peptide name and Replicate
df_out <- df4 %>% ungroup() %>%
select(A_heavy_norm, Concentration, Peptide.Sequence, Replicate)
#Change the names according to MSStats requirement:
df_out <- df_out %>% rename(INTENSITY = A_heavy_norm,
CONCENTRATION = Concentration,
NAME = Peptide.Sequence, REPLICATE = Replicate)
# We choose NAME as the peptide sequence
head(df_out)
## # A tibble: 6 × 4
## INTENSITY CONCENTRATION NAME REPLICATE
## <dbl> <dbl> <chr> <int>
## 1 34177. 0.0576 AAPAPQEATATFNSTADR 1
## 2 448636. 0.288 AAPAPQEATATFNSTADR 1
## 3 0 0 AAPAPQEATATFNSTADR 1
## 4 0 0 AAPAPQEATATFNSTADR 1
## 5 62968. 0 AAPAPQEATATFNSTADR 1
## 6 43668. 0 AAPAPQEATATFNSTADR 1
In the following we estimate the LOB and LOD for individual peptides. The first step in the estimation is to fit a function to all the (Spiked Concentration, Measured Intensity) points. When the ${\tt nonlinear\_quantlim}$ function is used, the function that is fit automatically adapts to the data. For instance, when the data is linear, a straight line is used, while when a threshold (i.e. a leveling off of the measured intensity at low concentrations) an elbow like function is fit. The fit is called ${\tt MEAN}$ in the output of function ${\tt nonlinear\_quantlim}$ as shown in Fig.1. Each value of ${\tt MEAN}$ is given for a particular ${\tt CONCENTRATION}$ value ${\tt CONCENTRATION}$ is thus a discretization of x–Spiked Concentration axis. The lower and upper bound of the 90% prediction interval of the fit are called ${\tt LOW}$ and ${\tt UP}$ in the output of ${\tt nonlinear\_quantlim}$. They correspond respectively to the 5% and 95% percentile of predictions.
The second step in the procedure is to estimate the upper bound of the noise in the blank sample (blue dashed line in Fig. 1). It is found by assuming that blank sample measurements are normally distributed.
We define the LOB as the highest apparent concentration of a peptide expected when replicates of a blank sample containing no peptides are measured. This amounts to finding the concentration at the intersection of the fit (which represents the averaged measured intensity) with the 95% upper prediction bound of the noise.
The LOD is defined as the measured concentration value for which the probability of falsely claiming the absence of a peptide in the sample is 0.05, given a probability 0.05 of falsely claiming its presence. Estimating the LOD thus amounts to finding the concentration at the intersection between the 5% percentile line of the prediction interval of the fit (i.e. the lower bound of the 90% prediction interval) and the 95% percentile line of the blank sample. At the LOB concentration, there is an 0.05 probability of false positive and a 50% chance of false negative. At the LOD concentration there is 0.05 probability of false negative and a false positive probability of 0.05 in accordance with its definition. By default, a probability of 0.05 for the LOB/LOD estimation is used but it can be changed, as detailed in the manual.
#Select peptide of interest: LPPGLLANFTLLR
spikeindata <- df_out %>% filter(NAME == "LPPGLLANFTLLR")
#This contains the measured intensity for the peptide of interest
head(spikeindata)
## # A tibble: 6 × 4
## INTENSITY CONCENTRATION NAME REPLICATE
## <dbl> <dbl> <chr> <int>
## 1 26291. 0.0576 LPPGLLANFTLLR 1
## 2 244841. 0.288 LPPGLLANFTLLR 1
## 3 0 0 LPPGLLANFTLLR 1
## 4 774274. 0 LPPGLLANFTLLR 1
## 5 482008. 0 LPPGLLANFTLLR 1
## 6 780792. 0 LPPGLLANFTLLR 1
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 183235.5 -316962.1 747151.8 0.8254855 0.9292975 2381531
## 2 1.776357e-15 183235.5 -292651.0 736814.7 0.8254855 0.9292975 2381531
## 3 5.760000e-02 183235.5 156942.3 212979.1 0.8254855 0.9292975 2381531
## 4 2.880000e-01 192908.1 144372.7 246439.5 0.8254855 0.9292975 2381531
## 5 7.040283e-01 775234.2 664882.7 917963.6 0.8254855 0.9292975 2381531
## 6 1.440000e+00 1805393.8 1580835.4 2007628.7 0.8254855 0.9292975 2381531
## INTERCEPT NAME METHOD
## 1 -5750521 LPPGLLANFTLLR NONLINEAR
## 2 -5750521 LPPGLLANFTLLR NONLINEAR
## 3 -5750521 LPPGLLANFTLLR NONLINEAR
## 4 -5750521 LPPGLLANFTLLR NONLINEAR
## 5 -5750521 LPPGLLANFTLLR NONLINEAR
## 6 -5750521 LPPGLLANFTLLR NONLINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
The threshold is captured by the fit at low concentrations. The ${\tt MEAN}$ of the output of the function is the red line (mean prediction) in the plots. ${\tt LOW}$ is the orange line (5% percentile of predictions) while ${\tt UP}$ is the upper boundary of the red shaded area. The LOB is the concentration at the intersection of the fit and the estimate for the 95% upper bound of the noise (blue line). A more accurate “smoother” fit can be obtained by increasing the number of points ${\tt Npoints}$ used to discretize the concentration axis (see manual for ${\tt nonlinear\_quantlim}$).
The nonlinear MSStats function (${\tt nonlinear\_quantlim}$) works for all peptides (those with a linear response and those with a non-linear response). We now examine a peptide with a linear behavior.
#Select peptide of interest: FLNDTMAVYEAK
spikeindata2 <- df_out %>% filter(NAME == "FVGTPEVNQTTLYQR")
#This contains the measured intensity for the peptide of interest
head(spikeindata2)
FALSE # A tibble: 6 × 4
FALSE INTENSITY CONCENTRATION NAME REPLICATE
FALSE <dbl> <dbl> <chr> <int>
FALSE 1 323763. 0.0576 FVGTPEVNQTTLYQR 1
FALSE 2 2036098. 0.288 FVGTPEVNQTTLYQR 1
FALSE 3 205. 0 FVGTPEVNQTTLYQR 1
FALSE 4 1431235. 0 FVGTPEVNQTTLYQR 1
FALSE 5 1244348. 0 FVGTPEVNQTTLYQR 1
FALSE 6 1455085. 0 FVGTPEVNQTTLYQR 1
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 481422.2 -524215.0 1461807.4 0.2580243 0.2685278 5020096
## 2 1.776357e-15 481422.2 -458815.7 1569586.2 0.2580243 0.2685278 5020096
## 3 5.760000e-02 782241.6 665346.2 888950.3 0.2580243 0.2685278 5020096
## 4 2.880000e-01 1985519.4 1936392.0 2035837.5 0.2580243 0.2685278 5020096
## 5 7.040283e-01 4158252.1 4018744.4 4292107.7 0.2580243 0.2685278 5020096
## 6 1.440000e+00 8001908.5 7800466.0 8220279.9 0.2580243 0.2685278 5020096
## INTERCEPT NAME METHOD
## 1 10349523 FVGTPEVNQTTLYQR NONLINEAR
## 2 10349523 FVGTPEVNQTTLYQR NONLINEAR
## 3 10349523 FVGTPEVNQTTLYQR NONLINEAR
## 4 10349523 FVGTPEVNQTTLYQR NONLINEAR
## 5 10349523 FVGTPEVNQTTLYQR NONLINEAR
## 6 10349523 FVGTPEVNQTTLYQR NONLINEAR
#plot results in the directory: "/Users/cyrilg/Desktop/Workflow/Results"
#Change directory appropriately for your computer
plot_quantlim(spikeindata = spikeindata2, quantlim_out = quant_out2,
address = FALSE)
## [[1]]
##
## [[2]]
The plots indicate that the fit is observed to be linear as the response is linear.
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 -134085.07 -629908.58 324620.79 0.7904698 0.9085273 2381531
## 2 1.776357e-15 -134085.07 -614139.96 337654.07 0.7904698 0.9085273 2381531
## 3 5.760000e-02 -55436.66 -88473.36 -23992.85 0.7904698 0.9085273 2381531
## 4 2.880000e-01 259157.01 214578.70 304459.44 0.7904698 0.9085273 2381531
## 5 7.040283e-01 827211.96 699807.07 964957.65 0.7904698 0.9085273 2381531
## 6 1.440000e+00 1832125.31 1583100.08 2054120.90 0.7904698 0.9085273 2381531
## INTERCEPT NAME METHOD
## 1 -5750521 LPPGLLANFTLLR LINEAR
## 2 -5750521 LPPGLLANFTLLR LINEAR
## 3 -5750521 LPPGLLANFTLLR LINEAR
## 4 -5750521 LPPGLLANFTLLR LINEAR
## 5 -5750521 LPPGLLANFTLLR LINEAR
## 6 -5750521 LPPGLLANFTLLR LINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 352468.4 -540201.2 1391397.5 0.2419664 0.2516009 5020096
## 2 1.776357e-15 352468.4 -575601.5 1505734.4 0.2419664 0.2516009 5020096
## 3 5.760000e-02 703948.9 599796.7 803185.8 0.2419664 0.2516009 5020096
## 4 2.880000e-01 2109871.0 2059591.0 2157673.3 0.2419664 0.2516009 5020096
## 5 7.040283e-01 4648513.8 4520007.0 4772710.3 0.2419664 0.2516009 5020096
## 6 1.440000e+00 9139481.0 8954748.9 9333780.2 0.2419664 0.2516009 5020096
## INTERCEPT NAME METHOD
## 1 10349523 FVGTPEVNQTTLYQR LINEAR
## 2 10349523 FVGTPEVNQTTLYQR LINEAR
## 3 10349523 FVGTPEVNQTTLYQR LINEAR
## 4 10349523 FVGTPEVNQTTLYQR LINEAR
## 5 10349523 FVGTPEVNQTTLYQR LINEAR
## 6 10349523 FVGTPEVNQTTLYQR LINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata2, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
C. Galitzine et al. “Nonlinear regression improves accuracy of characterization of multiplexed mass spectrometric assays.” Molecular & Cellular Proteomics (2018), doi:10.1074/mcp.RA117.000322