Title: | Peak Matrix Processing and signal batch correction for metabolomics datasets |
---|---|
Description: | Methods and tools for (pre-)processing of metabolomics datasets (i.e. peak matrices), including filtering, normalisation, missing value imputation, scaling, and signal drift and batch effect correction methods. Filtering methods are based on: the fraction of missing values (across samples or features); Relative Standard Deviation (RSD) calculated from the Quality Control (QC) samples; the blank samples. Normalisation methods include Probabilistic Quotient Normalisation (PQN) and normalisation to total signal intensity. A unified user interface for several commonly used missing value imputation algorithms is also provided. Supported methods are: k-nearest neighbours (knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean or median value of the given feature and a constant small value. The generalised logarithm (glog) transformation algorithm is available to stabilise the variance across low and high intensity mass spectral features. Finally, this package provides an implementation of the Quality Control-Robust Spline Correction (QCRSC) algorithm for signal drift and batch effect correction of mass spectrometry-based datasets. |
Authors: | Andris Jankevics [aut], Gavin Rhys Lloyd [aut, cre], Ralf Johannes Maria Weber [aut] |
Maintainer: | Gavin Rhys Lloyd <[email protected]> |
License: | GPL-3 |
Version: | 1.19.0 |
Built: | 2024-10-31 03:34:09 UTC |
Source: | https://github.com/bioc/pmp |
Metabolomics datasets often contain many features of non-biological origin e.g. those associated with extraction and analysis solvents. This tool facilitates the removal of such features from the data matrix, as defined using an appropriate blank sample.
filter_peaks_by_blank( df, fold_change, classes, blank_label, qc_label = NULL, remove_samples = TRUE, remove_peaks = TRUE, fraction_in_blank = 0 )
filter_peaks_by_blank( df, fold_change, classes, blank_label, qc_label = NULL, remove_samples = TRUE, remove_peaks = TRUE, fraction_in_blank = 0 )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
fold_change |
|
classes |
|
blank_label |
|
qc_label |
|
remove_samples |
|
remove_peaks |
|
fraction_in_blank |
|
If parameter qc_label
is not NULL
, QC samples which will be
used to calculate the median signal intensity.
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
numeric()
matrix-like object of filtered data set. Function
flags
are added to the object attributes
and is a
DataFrame-class with five columns. The same
DataFrame
object containing flags is added to rowData()
element of SummarizedExperiment
object as well.
Columns in rowData()
or flags
element contain: median_non_blanks
median intensities of features of non-blank
samples; median_blanks
median intensities of features of blank samples; fold_change
fold change between analytical and blank samples; blank_flags
integer()
, if 0 feature is flagged to be
removed; blank_fraction_flags
numeric()
, fraction in how many blank
samples peaks is present.
df <- MTBLS79[ ,MTBLS79$Batch == 1] df$Class[1:2] <- "Blank" out <- filter_peaks_by_blank(df=df, fold_change=1.2, classes=df$Class, blank_label="Blank", qc_label=NULL, remove_samples=FALSE, remove_peaks=TRUE, fraction_in_blank=0)
df <- MTBLS79[ ,MTBLS79$Batch == 1] df$Class[1:2] <- "Blank" out <- filter_peaks_by_blank(df=df, fold_change=1.2, classes=df$Class, blank_label="Blank", qc_label=NULL, remove_samples=FALSE, remove_peaks=TRUE, fraction_in_blank=0)
Metabolomics datasets often contain 'features' with irreproducible peak intensity values, or with large numbers of missing values. This tool facilitates the remove of such features from a data matrix, based upon the relative proportion (minimum fraction) of samples containing non-missing values.
filter_peaks_by_fraction( df, min_frac, classes = NULL, method = "QC", qc_label = "QC", remove_peaks = TRUE )
filter_peaks_by_fraction( df, min_frac, classes = NULL, method = "QC", qc_label = "QC", remove_peaks = TRUE )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
min_frac |
|
classes |
|
method |
|
qc_label |
|
remove_peaks |
|
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
numeric()
matrix-like object of filtered data set. Function
flags
are added to the object attributes
and is a
DataFrame-class with five columns. The same
DataFrame
object containing flags is added to rowData()
element of SummarizedExperiment
object as well.
Columns in rowData()
or flags
element contain fractions
of missing values per feature within QC samples (mehtod QC
),
across (method across
) or within (mehtod within
) each sample
group.
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='QC', qc_label='QC') out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='across', qc_label='QC') out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='within', qc_label='QC')
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='QC', qc_label='QC') out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='across', qc_label='QC') out <- filter_peaks_by_fraction(df=df, min_frac=1, classes=df$Class, method='within', qc_label='QC')
Metabolomics datasets often contain 'features' with irreproducible peak intensity values, or with large numbers of missing values. This tool facilitates the remove of such features from a data matrix, based upon relative standard deviation of intensity values for a given feature within specified QC samples.
filter_peaks_by_rsd(df, max_rsd, classes, qc_label, remove_peaks = TRUE)
filter_peaks_by_rsd(df, max_rsd, classes, qc_label, remove_peaks = TRUE)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
max_rsd |
|
classes |
|
qc_label |
|
remove_peaks |
|
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
numeric()
matrix-like object of filtered data set. Function
flags
are added to the object attributes
and is a
DataFrame-class with five columns. The same
DataFrame-class
object containing flags is added to rowData()
element of SummarizedExperiment
object as well.
Columns in rowData()
or flags
element contain: rsd_QC
numeric()
, RSD% value of QC samples per feature; rsd_flags
integer()
,if 0 feature is flagged to be removed.
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- filter_peaks_by_rsd(df=df, max_rsd=20, classes=df$Class, qc_label='QC')
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- filter_peaks_by_rsd(df=df, max_rsd=20, classes=df$Class, qc_label='QC')
Missing values in mass spectrometry metabolomic datasets occur widely and can originate from a number of sources, including for both technical and biological reasons. In order for robust conclusions to be drawn from down-stream statistical testing procedures, the issue of missing values must first be addressed. This tool facilitates the removal of samples containing a user-defined maximum percentage of missing values.
filter_samples_by_mv(df, max_perc_mv, classes = NULL, remove_samples = TRUE)
filter_samples_by_mv(df, max_perc_mv, classes = NULL, remove_samples = TRUE)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
max_perc_mv |
|
classes |
|
remove_samples |
|
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
numeric()
matrix-like object of filtered data set. Function
flags
are added to the object attributes
and is a
DataFrame-class with five columns. The same
DataFrame
object containing flags is added to rowData()
element of SummarizedExperiment
object as well. If element
colData()
already exists flags are appended to existing values.
Columns in colData()
or flags
element contain: perc_mv
numeric()
, fraction of missing values per sample; flags
integer()
,if 0 feature is flagged to be removed.
df <- MTBLS79 out <- filter_samples_by_mv (df=df, max_perc_mv=0.8)
df <- MTBLS79 out <- filter_samples_by_mv (df=df, max_perc_mv=0.8)
Plot SSE error of lambda optimisation process
glog_plot_optimised_lambda( df, optimised_lambda, classes, qc_label, plot_grid = 100 )
glog_plot_optimised_lambda( df, optimised_lambda, classes, qc_label, plot_grid = 100 )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
optimised_lambda |
|
classes |
|
qc_label |
|
plot_grid |
|
Class ggplot
object containing optimisation plot.
data <- MTBLS79[, MTBLS79$Batch == 1] classes <- data$Class data <- mv_imputation(df=data, method='knn') out <- glog_transformation (df=data, classes=classes, qc_label='QC') optimised_lambda <- S4Vectors::metadata(out) optimised_lambda <- optimised_lambda$processing_history$glog_transformation$lambda_opt glog_plot_optimised_lambda(df=data, classes=classes, qc_label="QC", optimised_lambda=optimised_lambda)
data <- MTBLS79[, MTBLS79$Batch == 1] classes <- data$Class data <- mv_imputation(df=data, method='knn') out <- glog_transformation (df=data, classes=classes, qc_label='QC') optimised_lambda <- S4Vectors::metadata(out) optimised_lambda <- optimised_lambda$processing_history$glog_transformation$lambda_opt glog_plot_optimised_lambda(df=data, classes=classes, qc_label="QC", optimised_lambda=optimised_lambda)
Performs glog transformation on the data set. QC samples can be used to
estimate technical variation in the data set and calculate transformation
parameter (lambda). QC samples usually comprise a pool of
aliquots taken from all other samples in the study and analysed repeatedly
throughout an analytical batch.
glog_transformation(df, classes, qc_label, lambda = NULL)
glog_transformation(df, classes, qc_label, lambda = NULL)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
classes |
|
qc_label |
|
lambda |
|
Many univariate and multivariate statistical tests require homogeneity and
n ormality of dataset variance. Real-world metabolomics datasets often fail
to meet these criteria due to asymmetric (i.e. non-'normal') and/or
heteroscedatic (i.e. non-homogenous) variance structure. To address this
issue, glog
data transformations may be applied.
For each cell within the data matrix, transform the raw value (x) according
to: . The parameter
is
typically calculated using quality control (QC) samples analysed throughout
an analysis batch.
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
the same R data structure as input with all value of data type
numeric()
.
Parsons HM et. al., BMC Bionf., 8(234), 2007. https://doi.org/10.1186/1471-2105-8-234
df <- MTBLS79[, MTBLS79$Batch == 1] out <- mv_imputation(df=df, method="knn") out <- glog_transformation (df=out, classes=df$Class, qc_label="QC")
df <- MTBLS79[, MTBLS79$Batch == 1] out <- mv_imputation(df=df, method="knn") out <- glog_transformation (df=out, classes=df$Class, qc_label="QC")
Data set of 20 biological (cow vs sheep) serum samples that were analysed repeatedly, in 8 batches across 7 days.
MTBLS79
MTBLS79
A RangedSummarizedExperiment-class
object. assay(MTBLS79)
Peak intensities of the DIMS data set.
Contains 172 samples and 2488 features. colData(MTBLS79)
Sample meta data containing 4 columns. Batch
- character()
, sample batch name. Sample_Rep
- character()
, sample replicate code. Class
- character()
, sample class labels. Class2
- character()
, alternative sample class labels
grouping together replicate samples.
Code below includes all commands used to generate MTBLS79 object.
library (openxlsx) library (SummarizedExperiment) download.file(destfile = "MTBLS79.xlsx", mode="wb", url = "ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/ MTBLS79/Dataset07__SFPM.xlsx") wb <- openxlsx::loadWorkbook(xlsxFile="MTBLS79.xlsx") MTBLS79 <- list() MTBLS79$assay <- openxlsx::readWorkbook(wb, "data", colNames=T, rowNames=T) # Last row of the peak matrix represent mean intensities across all samples. MTBLS79$assay <- MTBLS79$assay[-c(nrow(MTBLS79$assay)), ] # Transpose peak matrix, so that features are in rows and samples in columns. MTBLS79$assay <- as.matrix(t(MTBLS79$assay)) # Missing values in the input data are stored as 0, replace with NA MTBLS79$assay[MTBLS79$assay == 0] <- NA rownames(MTBLS79$assay) <- round(as.numeric(rownames(MTBLS79$assay)), 5) MTBLS79$colData <- openxlsx::readWorkbook(wb, "meta", colNames=T, rowNames=F) MTBLS79$colData <- MTBLS79$colData[-c(nrow(MTBLS79$colData)), 1:4] MTBLS79 <- SummarizedExperiment(assays=list(MTBLS79$assay), colData=DataFrame(MTBLS79$colData))
https://www.ebi.ac.uk/metabolights/MTBLS79
Kirwan et al, Scientific Data volume 1, Article number: 140012 (2014) https://www.nature.com/articles/sdata201412
Missing values in metabolomics data sets occur widely and can originate from
a number of sources, including technical and biological reasons.
Missing values imputation is applied to replace non-existing values
with an estimated values while maintaining the data structure. A number of
different methods are available as part of this function.
mv_imputation( df, method, k = 10, rowmax = 0.5, colmax = 0.5, maxp = NULL, check_df = TRUE )
mv_imputation( df, method, k = 10, rowmax = 0.5, colmax = 0.5, maxp = NULL, check_df = TRUE )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
method |
|
k |
|
rowmax |
|
colmax |
|
maxp |
|
check_df |
|
Supported missing value imputation methods are: knn
- K-nearest neighbour. For each feature in each sample, missing
values are replaced by the mean average value (non-weighted) calculated
from its k
closest neighbours in multivariate space (default distance
metric: euclidean distance); rf
- Random Forest. This method is a wrapper of
missForest function. For each feature, missing values are
iteratively imputed until a maximum number of iterations (10), or until the
difference between consecutively-imputed matrices becomes positive.
Trees per forest are set to 100, variables included per tree are calculate
using formula ;
bpca
- Bayesian principal component analysis. This method is a
wrapper of pca function. Missing values are replaced by
the values obtained from principal component analysis regression with a
Bayesian method. Therefore every imputed missing value does not occur
multiple times, neither across the samples nor across the metabolite
features; sv
- Small value. For each feature, replace missing values with half
of the lowest value recorded in the entire data matrix; 'mn'
- Mean. For each feature, replace missing values with the mean
average (non-weighted) of all other non-missing values for that variable;'md'
- Median. For each feature, replace missing values with the
median of all other non-missing values for that variable.
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
the same R data structure as input with all value of data type
numeric()
.
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- mv_imputation(df=df, method='knn')
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- mv_imputation(df=df, method='knn')
For each sample, every feature intensity value is divided by the total sum of
all feature intensity values measured in that sample (NA
values
ignored by default), before multiplication by 100; the unit is %.
normalise_to_sum(df, check_df = TRUE)
normalise_to_sum(df, check_df = TRUE)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
check_df |
|
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
the same R data structure as input with all value of data type
numeric()
.
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- normalise_to_sum (df=df)
df <- MTBLS79[ ,MTBLS79$Batch == 1] out <- normalise_to_sum (df=df)
For every feature the mean response is calculated across all QC samples. A reference vector is then generated. The median between the reference vector and every sample is computed obtaining a vector of coefficients related to each sample. Each sample is then divided by the median value of the vector of coefficients; this median value is different for each sample. This method was adapted by Dieterle et al. (2006) (see references). Its purpose is to take into account the concentration changes of some metabolite features that affect limited regions of the data.
pqn_normalisation( df, classes, qc_label, ref_mean = NULL, qc_frac = 0, sample_frac = 0, ref_method = "mean" )
pqn_normalisation( df, classes, qc_label, ref_mean = NULL, qc_frac = 0, sample_frac = 0, ref_method = "mean" )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
classes |
|
qc_label |
|
ref_mean |
|
qc_frac |
|
sample_frac |
|
ref_method |
|
Object of class SummarizedExperiment
. If input data are
matrix-like (e.g. an ordinary matrix, a data frame) object, the same R data
structure as the input will be returned with all values of the data type.
numeric()
.
Dieterle F. et al., Anal. Chem., 78(13), 2006. http://dx.doi.org/10.1021/ac051632c
df <- MTBLS79[ , MTBLS79$Batch==1] pqn_normalisation(df=df, classes=df$Class, qc_label='QC')
df <- MTBLS79[ , MTBLS79$Batch==1] pqn_normalisation(df=df, classes=df$Class, qc_label='QC')
Return history of applied functions and argument from pmp package.
processing_history(df)
processing_history(df)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
List of function names and argument values.
df <- MTBLS79[ ,MTBLS79$Batch == 1] df$Class[1:2] <- "Blank" out <- filter_peaks_by_blank(df=df, fold_change=1.2, classes=df$Class, blank_label="Blank", qc_label=NULL, remove_samples=FALSE, remove_peaks=TRUE, fraction_in_blank=0) processing_history(out)
df <- MTBLS79[ ,MTBLS79$Batch == 1] df$Class[1:2] <- "Blank" out <- filter_peaks_by_blank(df=df, fold_change=1.2, classes=df$Class, blank_label="Blank", qc_label=NULL, remove_samples=FALSE, remove_peaks=TRUE, fraction_in_blank=0) processing_history(out)
Implementation of Quality QC-RSC algorithm for signal drift and batch effect correction within/across a multi-batch direct infusion mass spectrometry (DIMS) and liquid chromatography mass spectrometry (LCMS) datasets. This version supports missing values, but requires at least 4 data point for quality control (QC) samples measured within each analytical batch. The smoothing parameter (spar) can be optimised using leave-one-out cross validation to avoid overfitting.
QCRSC( df, order, batch, classes, spar = 0, log = TRUE, minQC = 5, qc_label = "QC", spar_lim = c(-1.5, 1.5) )
QCRSC( df, order, batch, classes, spar = 0, log = TRUE, minQC = 5, qc_label = "QC", spar_lim = c(-1.5, 1.5) )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
order |
|
batch |
|
classes |
|
spar |
|
log |
|
minQC |
|
qc_label |
|
spar_lim |
A 2 element numeric vector containing the min and max
values of spar when searching for an optimum. Default |
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
the same R data structure as input with all value of data type
numeric()
.
Andris Jankevics [email protected]
Kirwan et al, Anal. Bioanal. Chem., 405 (15), 2013 https://dx.doi.org/10.1007/s00216-013-6856-7
classes <- MTBLS79$Class batch <- MTBLS79$Batch order <- c(1:ncol(MTBLS79)) out <- QCRSC(df = MTBLS79[1:10, ], order = order, batch = MTBLS79$Batch, classes = MTBLS79$Class, spar = 0, minQC = 4)
classes <- MTBLS79$Class batch <- MTBLS79$Batch order <- c(1:ncol(MTBLS79)) out <- QCRSC(df = MTBLS79[1:10, ], order = order, batch = MTBLS79$Batch, classes = MTBLS79$Class, spar = 0, minQC = 4)
Filter to remove features.
remove_peaks(df, rem_index)
remove_peaks(df, rem_index)
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
rem_index |
|
Object of class SummarizedExperiment
. If input data are a
matrix-like (e.g. an ordinary matrix, a data frame) object, function returns
the same R data structure as input with all value of data type
numeric()
.
df <- MTBLS79[ ,MTBLS79$Batch == 1] rem_index <- vector(mode="logical", length=nrow(SummarizedExperiment::assay(df))) rem_index[c(1, 20, 456, 789)] <- TRUE out <- remove_peaks(df=df, rem_index=rem_index)
df <- MTBLS79[ ,MTBLS79$Batch == 1] rem_index <- vector(mode="logical", length=nrow(SummarizedExperiment::assay(df))) rem_index[c(1, 20, 456, 789)] <- TRUE out <- remove_peaks(df=df, rem_index=rem_index)
Plot the output from signal batch correction for the selected or the first 100 features.
sbc_plot( df, corrected_df, classes, batch, indexes = NULL, qc_label = "QC", output = "sbcms_plots.pdf" )
sbc_plot( df, corrected_df, classes, batch, indexes = NULL, qc_label = "QC", output = "sbcms_plots.pdf" )
df |
A matrix-like (e.g. an ordinary matrix, a data frame) or
RangedSummarizedExperiment-class object with
all values of class |
corrected_df |
Output from QCRSC function. |
classes |
|
batch |
|
indexes |
|
qc_label |
|
output |
|
Pdf file or list()
object ggplot
class showing data
before and after signal correction.
order <- c(1:ncol(MTBLS79)) data <- MTBLS79[1:10, ] out <- QCRSC(df =data, order=order, batch=MTBLS79$Batch, classes=MTBLS79$Class, spar=0, minQC=4) plots <- sbc_plot (df=data, corrected_df=out, classes=MTBLS79$Class, batch=MTBLS79$Batch, output=NULL)
order <- c(1:ncol(MTBLS79)) data <- MTBLS79[1:10, ] out <- QCRSC(df =data, order=order, batch=MTBLS79$Batch, classes=MTBLS79$Class, spar=0, minQC=4) plots <- sbc_plot (df=data, corrected_df=out, classes=MTBLS79$Class, batch=MTBLS79$Batch, output=NULL)