| Title: | Workflow for non-targeted LC-MS metabolic profiling |
|---|---|
| Description: | Provides functionality for untargeted LC-MS metabolomics research as specified in the associated protocol article in the 'Metabolomics Data Processing and Data Analysis—Current Best Practices' special issue of the Metabolites journal (2020). This includes tabular data preprocessing and quality control, uni- and multivariate analysis as well as quality control visualizations, feature-wise visualizations and results visualizations. Raw data preprocessing and functionality related to biological context, such as pathway analysis, is not included. |
| Authors: | Anton Klåvus [aut, cph] (ORCID: <https://orcid.org/0000-0003-2612-0230>), Jussi Paananen [aut, cph] (ORCID: <https://orcid.org/0000-0001-5100-4907>), Oskari Timonen [aut, cph] (ORCID: <https://orcid.org/0000-0002-6317-6260>), Atte Lihtamo [aut], Vilhelm Suksi [aut, cre] (ORCID: <https://orcid.org/0009-0005-1108-518X>), Retu Haikonen [aut] (ORCID: <https://orcid.org/0000-0003-0830-3850>), Leo Lahti [aut] (ORCID: <https://orcid.org/0000-0001-5537-637X>), Kati Hanhineva [aut] (ORCID: <https://orcid.org/0000-0001-6834-7375>), Ville Koistinen [ctb] (ORCID: <https://orcid.org/0000-0003-1587-8361>), Olli Kärkkäinen [ctb] (ORCID: <https://orcid.org/0000-0003-0825-4956>), Artur Sannikov [ctb] (ORCID: <https://orcid.org/0000-0001-7765-123X>) |
| Maintainer: | Vilhelm Suksi <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2026-05-22 09:52:52 UTC |
| Source: | https://github.com/bioc/notame |
notame package.Provides tabular data preprocessing and data wrangling functionality for untargeted LC-MS metabolomics research.
Maintainer: Vilhelm Suksi [email protected] (ORCID)
Authors:
Anton Klåvus (ORCID) [copyright holder]
Jussi Paananen (ORCID) [copyright holder]
Oskari Timonen (ORCID) [copyright holder]
Atte Lihtamo
Retu Haikonen (ORCID)
Leo Lahti (ORCID)
Kati Hanhineva (ORCID)
Other contributors:
Ville Koistinen (ORCID) [contributor]
Olli Kärkkäinen (ORCID) [contributor]
Artur Sannikov (ORCID) [contributor]
Klåvus et al. (2020). "notame": Workflow for Non-Targeted LC-MS Metabolic Profiling. Metabolites, 10: 135.
Useful links:
Report bugs at https://github.com/hanhineva-lab/notame/issues
Assess features using the quality metrics defined in (Broadhurst
2018). The quality metrics are described in Details section of
flag_quality
assess_quality(object, assay.type = NULL)assess_quality(object, assay.type = NULL)
object |
a |
assay.type |
character, assay to be used in case of multiple assays |
A SummarizedExperiment object with quality metrics in feature data.
data(toy_notame_set) ex_set <- assess_quality(toy_notame_set) rowData(ex_set)data(toy_notame_set) ex_set <- assess_quality(toy_notame_set) rowData(ex_set)
This function lists citations behind the notame functions that have been
called during the session. All notame functions
update the list automatically. The citations are taken from the call to
'citation("package"), and complemented with a brief description of
what the package was used for.
NOTE: the citations might not point to the correct paper if the package
authors have not supplied correct citation information for their package.
The output is written to the current log file, if specified.
citations()citations()
None, the function is invoked for its side effect.
citations() data(toy_notame_set) ex_set <- flag_quality(toy_notame_set) # Broadhurst et al.(2018) added to citations citations()citations() data(toy_notame_set) ex_set <- flag_quality(toy_notame_set) # Broadhurst et al.(2018) added to citations citations()
Clusters features potentially originating from the same compound. Features with high Pearson correlation coefficient and small retention time difference are linked together. Then clusters are formed by setting a threshold for the relative degree that each node in a cluster needs to fulfil. Each cluster is named after the feature with the highest median peak area (median abundance). This is a wrapper around numerous functions that are based on the MATLAB code by David Broadhurst.
cluster_features( object, mz_col = NULL, rt_col = NULL, all_features = FALSE, rt_window = 1/60, corr_thresh = 0.9, d_thresh = 0.8, assay.type = NULL )cluster_features( object, mz_col = NULL, rt_col = NULL, all_features = FALSE, rt_window = 1/60, corr_thresh = 0.9, d_thresh = 0.8, assay.type = NULL )
object |
a |
mz_col |
the column name in feature data that holds mass-to-charge ratios |
rt_col |
the column name in feature data that holds retention times |
all_features |
logical, should all features be included in the clustering? If FALSE, as the default, flagged features are not included in clustering |
rt_window |
the retention time window for potential links NOTE: use the same unit as the retention time |
corr_thresh |
the correlation threshold required for potential links between features |
d_thresh |
the threshold for the relative degree required by each node |
assay.type |
character, assay to be used in case of multiple assays |
a SummarizedExperiment object, with median peak area (MPA), the cluster ID, the features in the cluster, and cluster size added to feature data.
data(toy_notame_set) # The parameters are really weird because example data is imaginary clustered <- cluster_features(toy_notame_set, rt_window = 1, corr_thresh = 0.5, d_thresh = 0.6)data(toy_notame_set) # The parameters are really weird because example data is imaginary clustered <- cluster_features(toy_notame_set, rt_window = 1, corr_thresh = 0.5, d_thresh = 0.6)
Retrieve both sample information and features
combined_data(object, assay.type = NULL)combined_data(object, assay.type = NULL)
object |
a |
assay.type |
character, assay to be used in case of multiple assays |
A data frame with sample information plus all features as columns, one row per sample.
data(toy_notame_set) combined_data(toy_notame_set)data(toy_notame_set) combined_data(toy_notame_set)
This function compresses clusters found by cluster_features, keeping only the feature with the highest median peak area. The features that were discarded are recorded in feature data, under Cluster_features.
compress_clusters(object)compress_clusters(object)
object |
a |
A SummarizedExperiment object with only one feature per cluster.
data(toy_notame_set) clustered <- cluster_features(toy_notame_set, rt_window = 1, corr_thresh = 0.5, d_thresh = 0.6) compressed <- compress_clusters(clustered)data(toy_notame_set) clustered <- cluster_features(toy_notame_set, rt_window = 1, corr_thresh = 0.5, d_thresh = 0.6) compressed <- compress_clusters(clustered)
A wrapper function for applying cubic spline drift correction and saving before and after plots.
correct_drift( object, log_transform = TRUE, spar = NULL, spar_lower = 0.5, spar_upper = 1.5, check_quality = FALSE, condition = "RSD_r < 0 & D_ratio_r < 0", file = NULL, width = 16, height = 8, color = "QC", shape = color, color_scale = getOption("notame.color_scale_dis"), shape_scale = scale_shape_manual(values = c(15, 16)), assay.type = NULL, name = NULL )correct_drift( object, log_transform = TRUE, spar = NULL, spar_lower = 0.5, spar_upper = 1.5, check_quality = FALSE, condition = "RSD_r < 0 & D_ratio_r < 0", file = NULL, width = 16, height = 8, color = "QC", shape = color, color_scale = getOption("notame.color_scale_dis"), shape_scale = scale_shape_manual(values = c(15, 16)), assay.type = NULL, name = NULL )
object |
a |
log_transform |
logical, should drift correction be done on log-transformed values? See Details |
spar |
smoothing parameter as in
|
spar_lower, spar_upper
|
lower and upper limits for the smoothing parameter |
check_quality |
logical, whether quality should be monitored. |
condition |
a character specifying the condition used to decide whether drift correction works adequately, see Details |
file |
path to the PDF file where the plots should be saved |
width, height
|
width and height of the plots in inches |
color |
character, name of the column used for coloring the points |
shape |
character, name of the column used for shape |
color_scale, shape_scale
|
the color and shape scales as returned by a ggplot function |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay |
If log_transform = TRUE, the correction will be done on
log-transformed values.
The correction formula depends on whether the correction is run on original
values or log-transformed values.
In log-space: .
In original space: .
We recommend doing the correction in the log-space since the log-transfomred
data better follows the assumptions of cubic spline regression. The drift
correction in the original space also sometimes results
in negative values, and results in rejection of the drift corrrection
procedure.
If check_quality = TRUE, the condition parameter should be a
character giving a condition compatible with filter.
The condition is applied on the changes in the quality metrics
RSD, RSD_r, D_ratio and D_ratio_r. For example, the default is "RSD_r < 0
and D_ratio_r < 0", meaning that both RSD_r and D_ratio_r need to decrease
in the drift correction, otherwise the drift corrected feature is discarded
and the original is retained.
By default, the column used for color is also used for shape.
A SummarizedExperiment object as the one supplied, with drift corrected features.
smooth.spline for details about the regression
data(toy_notame_set) corrected <- correct_drift(mark_nas(toy_notame_set[1:5, ], value = 0))data(toy_notame_set) corrected <- correct_drift(mark_nas(toy_notame_set[1:5, ], value = 0))
Removes all features that have been flagged by quality control functions. Only features that do not have a flag (Flag == NA) are retained.
drop_flagged(object, all_features = FALSE)drop_flagged(object, all_features = FALSE)
object |
a |
all_features |
logical, should all features be retained? Mainly used by internal functions |
A SummarizedExperiment object without the previously flagged features.
data(toy_notame_set) dim(toy_notame_set) flagged <- flag_quality(toy_notame_set) noflags <- drop_flagged(flagged) dim(noflags)data(toy_notame_set) dim(toy_notame_set) flagged <- flag_quality(toy_notame_set) noflags <- drop_flagged(flagged) dim(noflags)
Drop QC samples
drop_qcs(object)drop_qcs(object)
object |
a |
A SummarizedExperiment object as the one supplied, without QC samples.
data(toy_notame_set) dim(toy_notame_set) noqc <- drop_qcs(toy_notame_set) dim(noqc)data(toy_notame_set) dim(toy_notame_set) noqc <- drop_qcs(toy_notame_set) dim(noqc)
Logs the current date and time and session info, and switches logging off.
finish_log()finish_log()
None, the function is invoked for its side effect.
finish_log()finish_log()
Change the MS/MS output from MS-DIAL format to publication-ready format. Original spectra is sorted according to abundance percentage and clarified. See the example below.
fix_MSMS( object, ms_ms_spectrum_col = "MS_MS_spectrum", peak_num = 10, min_abund = 5, deci_num = 3 )fix_MSMS( object, ms_ms_spectrum_col = "MS_MS_spectrum", peak_num = 10, min_abund = 5, deci_num = 3 )
object |
a |
ms_ms_spectrum_col |
name of column with original MS/MS spectra |
peak_num |
maximum number of peak that is kept (Recommended: 4-10) |
min_abund |
minimum relative abundance to be kept (Recommended: 1-5) |
deci_num |
maximum number of decimals to m/z value (Recommended: >2) |
Original MS/MS spectra from MS-DIAL: m/z:Raw Abundance
23.193:254 26.13899:5 27.50986:25 55.01603:82 70.1914:16 73.03017:941 73.07685:13 73.13951:120
Spectra after transformation: m/z (Abundance)
73.03 (100), 23.193 (27), 73.14 (12.8), 55.016 (8.7)
A SummarizedExperiment object as the one supplied, with publication-ready MS/MS peak information.
data(toy_notame_set) # Spectra before fixing ex_set <- toy_notame_set rowData(ex_set)$MS_MS_spectrum <- NA rowData(ex_set)[1, ]$MS_MS_spectrum <- "28.769:53 44.933:42 52.106:89 69.518:140" rowData(ex_set)$MS_MS_spectrum[ !is.na(rowData(ex_set)$MS_MS_spectrum)] # Fixing spectra with default settings fixed_MSMS_peaks <- fix_MSMS(ex_set) # Spectra after fixing rowData(fixed_MSMS_peaks)$MS_MS_Spectrum_clean[ !is.na(rowData(fixed_MSMS_peaks)$MS_MS_Spectrum_clean)]data(toy_notame_set) # Spectra before fixing ex_set <- toy_notame_set rowData(ex_set)$MS_MS_spectrum <- NA rowData(ex_set)[1, ]$MS_MS_spectrum <- "28.769:53 44.933:42 52.106:89 69.518:140" rowData(ex_set)$MS_MS_spectrum[ !is.na(rowData(ex_set)$MS_MS_spectrum)] # Fixing spectra with default settings fixed_MSMS_peaks <- fix_MSMS(ex_set) # Spectra after fixing rowData(fixed_MSMS_peaks)$MS_MS_Spectrum_clean[ !is.na(rowData(fixed_MSMS_peaks)$MS_MS_Spectrum_clean)]
Attempts to create missing columns needed for notame in pheno and feature data. Optionally cleans the object and splits the object by mode.
fix_object( object, id_prefix = "ID_", id_column = NULL, split_by = NULL, name = NULL, clean = TRUE, split_data = FALSE, assay.type = NULL )fix_object( object, id_prefix = "ID_", id_column = NULL, split_by = NULL, name = NULL, clean = TRUE, split_data = FALSE, assay.type = NULL )
object |
a |
id_prefix |
character, prefix for autogenerated sample IDs, see Details |
id_column |
character, column name for unique identification of samples |
split_by |
character vector, in the case where all the modes are in the same object, the column names of feature data used to separate the modes (usually Mode and Column) |
name |
in the case where object only contains one mode, the name of the mode, such as "Hilic_neg" |
clean |
boolean, whether to select best classes, reorder columns and consistently rename columns in pheno and feature |
split_data |
logical, whether to split data by analytical mode recorded in the "Split" column of feature data. If TRUE (the default), will return a list of objects, one per analytical mode. If FALSE, will return a single object. |
assay.type |
character, assay to be used in case of multiple assays |
Only specify one of split_by and name. The feature
data will contain columns named "Split", used to separate features from
different modes, and "Flag" for recording flagged features. Unless a column
named "Feature_ID" is found in feature data, a feature ID will be generated
based on the value of "Split", mass and retention time. The function will
try to find columns for mass and retention time by looking at a few common
alternatives, and throw
an error if no matching column is found. Sample information needs to contain
a row called "Injection_order", and the values need to be unique. In
addition, a possible sample identifier row needs to be named "Sample_ID",
or to be specified in id_column, and the values need to be unique,
with an exception of QC samples: if there are any "QC" identifiers, they
will be replaced with "QC_1", "QC_2" and so on.
If a "Sample_ID" column is not found, it will be created using the
id_prefix and injection order or by renaming id_column.
A new SummarizedExperiment object with a single peak table. If split_data = TRUE, a list containing separate objects for analytical modes.
data(toy_notame_set) ex_set <- toy_notame_set rowData(ex_set)$Flag <- NULL fixed <- fix_object(ex_set)data(toy_notame_set) ex_set <- toy_notame_set rowData(ex_set)$Flag <- NULL fixed <- fix_object(ex_set)
Get and set the values in the flag column
flag(object) flag(object) <- valueflag(object) flag(object) <- value
object |
a |
value |
character vector, values for flag column |
Character vector of feature flags.
For the endomorphism, an object with a modified flag column.
data(toy_notame_set) # Get values in flag column of rowData flag(toy_notame_set) data(toy_notame_set) # Flag a suspicious feature manually flag(toy_notame_set)[1] <- "Contaminant, known from experience"data(toy_notame_set) # Get values in flag column of rowData flag(toy_notame_set) data(toy_notame_set) # Flag a suspicious feature manually flag(toy_notame_set)[1] <- "Contaminant, known from experience"
Flags contaminant features by comparing either median, mean or max of blanks and biological samples. Biological samples are defined as samples that are not marked as blanks and are not QCs.
flag_contaminants( object, blank_col, blank_label, blank_type = c("mean", "median", "max"), sample_type = c("max", "median", "mean"), flag_thresh = 5, flag_label = "Contaminant", assay.type = NULL )flag_contaminants( object, blank_col, blank_label, blank_type = c("mean", "median", "max"), sample_type = c("max", "median", "mean"), flag_thresh = 5, flag_label = "Contaminant", assay.type = NULL )
object |
a |
blank_col |
character, the column name in colData with blank labels |
blank_label |
character, the label for blank samples in blank_col |
blank_type |
character, one of "mean", "median", or "max" |
sample_type |
character, one of "max", "median", or "mean" |
flag_thresh |
numeric, the scaled ratio threshold for flagging contaminants. If blank_type(blanks) * flag_thresh > sample_type(biological samples), the feature gets flagged. |
flag_label |
character, the label used when flagging contaminants. Can be changed if sample processing contaminants and carryover contaminants are flagged separately. |
assay.type |
character, assay to be used in case of multiple assays |
If the calculation(biological samples) < the calculation(blanks) times a set ratio, the feature is flagged as contaminant. Default calculations are "max" for biological samples and "mean" for blanks.
A SummarizedExperiment object with contaminant features flagged.
data(toy_notame_set) # Make a blank sample which has one (first) feature exceeding the threshold ## Abundance matrix max_sample <- max(assay(toy_notame_set)[1, toy_notame_set$QC != "QC"]) assay <- matrix(c(max_sample * 0.20 + 1, rep(0, 79)), ncol = 1, nrow = 80, dimnames = list(NULL, "Demo_51")) assay <- cbind(assay(toy_notame_set), assay) ## Sample metadata pheno_data <- colData(toy_notame_set)[1, ] rownames(pheno_data) <- "Demo_51" pheno_data$Sample_ID <- "Demo_51" pheno_data$Injection_order <- 51 pheno_data[c("Subject_ID", "Group", "QC", "Time")] <- "Blank" pheno_data <- rbind(colData(toy_notame_set), pheno_data) ## Feature metadata feature_data <- rowData(toy_notame_set) # Construct SummarizedExperiment object with blank sample ex_set <- SummarizedExperiment(assays = assay, colData = pheno_data, rowData = feature_data) # Flag contaminant(s) contaminants_flagged <- flag_contaminants(ex_set, blank_col = "QC", blank_label = "Blank")data(toy_notame_set) # Make a blank sample which has one (first) feature exceeding the threshold ## Abundance matrix max_sample <- max(assay(toy_notame_set)[1, toy_notame_set$QC != "QC"]) assay <- matrix(c(max_sample * 0.20 + 1, rep(0, 79)), ncol = 1, nrow = 80, dimnames = list(NULL, "Demo_51")) assay <- cbind(assay(toy_notame_set), assay) ## Sample metadata pheno_data <- colData(toy_notame_set)[1, ] rownames(pheno_data) <- "Demo_51" pheno_data$Sample_ID <- "Demo_51" pheno_data$Injection_order <- 51 pheno_data[c("Subject_ID", "Group", "QC", "Time")] <- "Blank" pheno_data <- rbind(colData(toy_notame_set), pheno_data) ## Feature metadata feature_data <- rowData(toy_notame_set) # Construct SummarizedExperiment object with blank sample ex_set <- SummarizedExperiment(assays = assay, colData = pheno_data, rowData = feature_data) # Flag contaminant(s) contaminants_flagged <- flag_contaminants(ex_set, blank_col = "QC", blank_label = "Blank")
Flags features with too high amount of missing values. There are two
detection rate limits, both defined as the minimum proportion of samples
that need to have a value (not NA) for the feature to be kept.
'qc_limit is the detection rate limit for QC samples,
'group_limit is the detection rate limit for the actual study groups.
If the group limit is passed for AT LEAST ONE GROUP, then the feature is
kept. Features with low detection rate in QCs are flagged as
"Low_qc_detection", while low detection rate in the study groups is flagged
as "Low_group_detection". The detection rates for all the groups are
recorded in feature data.
flag_detection( object, qc_limit = 0.7, group_limit = 0.5, group = NULL, assay.type = NULL )flag_detection( object, qc_limit = 0.7, group_limit = 0.5, group = NULL, assay.type = NULL )
object |
a |
qc_limit |
the detection rate limit for QC samples |
group_limit |
the detection rate limit for study groups |
group |
the columns name in sample information to use as the grouping variable |
assay.type |
character, assay to be used in case of multiple assays |
A SummarizedExperiment object with the features flagged.
data(toy_notame_set) ex_set <- mark_nas(toy_notame_set, value = 0) ex_set <- flag_detection(ex_set, group = "Group") rowData(ex_set)data(toy_notame_set) ex_set <- mark_nas(toy_notame_set, value = 0) ex_set <- flag_detection(ex_set, group = "Group") rowData(ex_set)
Flags low-quality features using the quality metrics defined in (Broadhurst
2018). The metrics are described in more detain in Details. A condition for
keeping the features is given as a character, which is passed to
filter.
flag_quality flag_quality(object, assay.type = NULL, condition = "(RSD_r < 0.2 & D_ratio_r < 0.4) | (RSD < 0.1 & RSD_r < 0.1 & D_ratio < 0.1)")flag_quality flag_quality(object, assay.type = NULL, condition = "(RSD_r < 0.2 & D_ratio_r < 0.4) | (RSD < 0.1 & RSD_r < 0.1 & D_ratio < 0.1)")
object |
a |
assay.type |
character, assay to be used in case of multiple assays |
condition |
character, condition for keeping the features, see Details |
The quality metrics measure two things: internal spread of the QCs, and spread of the QCs compared to the spread of the biological samples. Internal spread is measured with relative standard deviation (RSD), also known as coefficient of variation (CV).
Where is the standard deviation of the QC samples and
' is the sample mean of the signal in the QC samples.
RSD can also be replaced by a non-parametric, robust version based on the
median and median absolute deviation (MAD):
The spread of the QC samples compared to the biological samples is measured using a metric called D-ratio:
Or, as before, a non-parametric, robust alternative:
The default condition keeps features that pass either of the two following conditions:
a SummarizedExperiment object with the features flagged.
Broadhurst, David et al. Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics : Official journal of the Metabolomic Society vol. 14,6 (2018): 72. doi:10.1007/s11306-018-1367-3
data(toy_notame_set) ex_set <- flag_quality(toy_notame_set) rowData(ex_set) # Custom condition ex_set <- flag_quality(toy_notame_set, condition = "RSD_r < 0.3 & D_ratio_r < 0.6") rowData(ex_set)data(toy_notame_set) ex_set <- flag_quality(toy_notame_set) rowData(ex_set) # Custom condition ex_set <- flag_quality(toy_notame_set, condition = "RSD_r < 0.3 & D_ratio_r < 0.6") rowData(ex_set)
Computes the number of features at each stage of flagging for each mode.
flag_report(object)flag_report(object)
object |
a |
A data frame with the number of features at each stage of flagging.
data(toy_notame_set) flagged <- toy_notame_set |> mark_nas(0) |> flag_detection(group = "Group") |> flag_quality() flag_report(flagged)data(toy_notame_set) flagged <- toy_notame_set |> mark_nas(0) |> flag_detection(group = "Group") |> flag_quality() flag_report(flagged)
Reads data from an Excel file of the following format:
Left side of the sheet contains information about the features, size features x feature info columns
Top part contains sample information, size sample info variables x samples
The middle contains the actual abundances, size features x samples
Exported data from MS-DIAL converted to an excel file creates such format.
import_from_excel( file, sheet = 1, id_column = NULL, corner_row = NULL, corner_column = NULL, id_prefix = "ID_", split_by = NULL, name = NULL, mz_limits = c(10, 2000), rt_limits = c(0, 20), skip_checks = FALSE )import_from_excel( file, sheet = 1, id_column = NULL, corner_row = NULL, corner_column = NULL, id_prefix = "ID_", split_by = NULL, name = NULL, mz_limits = c(10, 2000), rt_limits = c(0, 20), skip_checks = FALSE )
file |
path to the Excel file |
sheet |
the sheet number or name |
id_column |
character, column name for unique identification of samples |
corner_row |
integer, the bottom row of sample information, usually contains data file names and feature info column names. If set to NULL, will be detected automatically. |
corner_column |
integer or character, the corresponding column number or the column name (letter) in Excel. If set to NULL, will be detected automatically. |
id_prefix |
character, prefix for autogenerated sample IDs |
split_by |
character vector, in the case where all the modes are in the same Excel file, the column names of feature data used to separate the modes (usually Mode and Column) |
name |
in the case where the Excel file only contains one mode, the name of the mode, such as "Hilic_neg" |
mz_limits |
numeric vector of two, all m/z values should be in between these |
rt_limits |
numeric vector of two, all retention time values should be in between these |
skip_checks |
logical: skip checking and fixing of data integrity. Not recommended, but sometimes useful when you just want to read the data in as is and fix errors later. The data integrity checks are important for functioning of notame. |
A SummarizedExperiment with assays, rowData and colData filled from the Excel file.
data <- import_from_excel( file = system.file("extdata", "toy_notame_set.xlsx", package = "notame"), sheet = 1, corner_row = 11, corner_column = "H", split_by = c("Column", "Ion_mode"))data <- import_from_excel( file = system.file("extdata", "toy_notame_set.xlsx", package = "notame"), sheet = 1, corner_row = 11, corner_column = "H", split_by = c("Column", "Ion_mode"))
Reads a tab-delimited peak table (TSV) exported from MS-DIAL. This works
similarly to import_from_excel, but removes the need to
manually open and re-save the TSV files in Excel. Excludes any summary
statistics computed by MS-DIAL, such as average abundances, since these are
not valid after notame processing.
import_from_msdial( file, id_column = NULL, id_prefix = "ID_", split_by = NULL, name = NULL, mz_limits = c(10, 2000), rt_limits = c(0, 20), skip_checks = FALSE )import_from_msdial( file, id_column = NULL, id_prefix = "ID_", split_by = NULL, name = NULL, mz_limits = c(10, 2000), rt_limits = c(0, 20), skip_checks = FALSE )
file |
path to the TSV file |
id_column |
character, column name for unique identification of samples |
id_prefix |
character, prefix for autogenerated sample IDs |
split_by |
character vector, in the case where all the modes are in the same Excel file, the column names of feature data used to separate the modes (usually Mode and Column) |
name |
in the case where the Excel file only contains one mode, the name of the mode, such as "Hilic_neg" |
mz_limits |
numeric vector of two, all m/z values should be in between these |
rt_limits |
numeric vector of two, all retention time values should be in between these |
skip_checks |
logical: skip checking and fixing of data integrity. Not recommended, but sometimes useful when you just want to read the data in as is and fix errors later. The data integrity checks are important for functioning of notame. |
A SummarizedExperiment with assays, rowData and colData filled from the TSV file.
# Read a TSV peak table # data <- import_from_msdial(file = "path/to/peak_table.tsv", name = "RP_neg")# Read a TSV peak table # data <- import_from_msdial(file = "path/to/peak_table.tsv", name = "RP_neg")
Impute the missing values in the peak table of the object using a
random forest. The estimated error in the imputation is logged.
It is recommended to set the seed number for reproducibility
(it is called random forest for a reason).
This a wrapper around missForest.
Use parallelize = "variables" to run in parallel for faster testing.
NOTE: running in parallel prevents user from setting a seed number.
impute_rf(object, all_features = FALSE, assay.type = NULL, name = NULL, ...)impute_rf(object, all_features = FALSE, assay.type = NULL, name = NULL, ...)
object |
a |
all_features |
logical, should all features be used? If FALSE (the default), flagged features are removed before imputation. |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
... |
passed to |
An object as the one supplied, with missing values imputed.
missForest for detail about the algorithm
and the parameters
data(toy_notame_set) missing <- mark_nas(toy_notame_set, 0) set.seed(38) imputed <- impute_rf(missing)data(toy_notame_set) missing <- mark_nas(toy_notame_set, 0) set.seed(38) imputed <- impute_rf(missing)
Impute missing values using a simple imputation strategy. All missing values of a feature are imputed with the same value. It is possible to only impute features with a large number of missing values this way. This can be useful for using this function before random forest imputation to speed things up. The imputation strategies available are:
a numeric value: impute all missing values in all features with the same value, e.g. 1
"mean": impute missing values of a feature with the mean of observed values of that feature
"median": impute missing values of a feature with the median of observed values of that feature
"min": impute missing values of a feature with the minimum observed value of that feature
"half_min": impute missing values of a feature with half the minimum observed value of that feature
"small_random": impute missing values of a feature with random numbers between 0 and the minimum of that feature (uniform distribution, remember to set the seed number!).
impute_simple(object, value, na_limit = 0, assay.type = NULL, name = NULL)impute_simple(object, value, na_limit = 0, assay.type = NULL, name = NULL)
object |
a |
value |
the value used for imputation, either a numeric or one of '"min", "half_min", "small_random", see above |
na_limit |
only impute features with the proportion of NAs over this
limit. For example, if |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
A SummarizedExperiment object with imputed peak table.
data(toy_notame_set) missing <- mark_nas(toy_notame_set, 0) imputed <- impute_simple(missing, value = "min")data(toy_notame_set) missing <- mark_nas(toy_notame_set, 0) imputed <- impute_simple(missing, value = "min")
Initialize a log file with the current data and time. All major operations run after this will be logged to the specified file.
init_log(log_file)init_log(log_file)
log_file |
Path to the log file |
None, the function is invoked for its side effect.
file_name <- "log.txt" init_log(file_name) # Print the contents of the file scan(file_name, sep = "\n", what = "character")file_name <- "log.txt" init_log(file_name) # Print the contents of the file scan(file_name, sep = "\n", what = "character")
Applies inverse rank normalization to all features to approximate a normal distribution.
inverse_normalize(object, assay.type = NULL, name = NULL)inverse_normalize(object, assay.type = NULL, name = NULL)
object |
a |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
An object as the one supplied, with normalized features.
data(toy_notame_set) normalized <- inverse_normalize(toy_notame_set)data(toy_notame_set) normalized <- inverse_normalize(toy_notame_set)
Join a new data frame of information to pheno data of a SummarizedExperiment object.
join_colData(object, dframe)join_colData(object, dframe)
object |
a |
dframe |
a data frame with the new information |
A SummarizedExperiment object with the new information added to colData(object).
data(toy_notame_set) new_info <- data.frame( Sample_ID = colnames(toy_notame_set), BMI = stats::runif(ncol(toy_notame_set), 22, 26) ) with_new_info <- join_colData(toy_notame_set, new_info) colnames(colData(with_new_info))data(toy_notame_set) new_info <- data.frame( Sample_ID = colnames(toy_notame_set), BMI = stats::runif(ncol(toy_notame_set), 22, 26) ) with_new_info <- join_colData(toy_notame_set, new_info) colnames(colData(with_new_info))
Join a new data frame of information to feature data of a SummarizedExperiment object. The data frame needs to have a column "Feature_ID". This function is usually used internally by some of the functions in the package, but can be useful.
join_rowData(object, dframe)join_rowData(object, dframe)
object |
|
dframe |
a data frame with the new information |
A SummarizedExperiment object with the new information added to rowData(object).
data(toy_notame_set) new_info <- data.frame( Feature_ID = rownames(toy_notame_set), Feature_number = seq_len(nrow(toy_notame_set)) ) with_new_info <- join_rowData(toy_notame_set, new_info) colnames(rowData(with_new_info))data(toy_notame_set) new_info <- data.frame( Feature_ID = rownames(toy_notame_set), Feature_number = seq_len(nrow(toy_notame_set)) ) with_new_info <- join_rowData(toy_notame_set, new_info) colnames(rowData(with_new_info))
The specified text is printed and written to the current log file. Does not overwrite the file. Also used internally by many functions in the package.
log_text(text)log_text(text)
text |
The text to be logged |
None, the function is invoked for its side effect.
file_name <- "log.txt" init_log(file_name) log_text("Hello World!") # Print the contents of the file scan(file_name, sep = "\n", what = "character")file_name <- "log.txt" init_log(file_name) log_text("Hello World!") # Print the contents of the file scan(file_name, sep = "\n", what = "character")
Replaces all values in the peak table that equal the specified value with NA. For example, vendor software might use 0 or 1 to signal a missing value, which is not understood by R.
mark_nas(object, value, assay.type = NULL, name = NULL)mark_nas(object, value, assay.type = NULL, name = NULL)
object |
a |
value |
the value to be converted to NA |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
SummarizedExperiment object as the one supplied, with missing values correctly set to NA.
data(toy_notame_set) nas_marked <- mark_nas(toy_notame_set, value = 0)data(toy_notame_set) nas_marked <- mark_nas(toy_notame_set, value = 0)
Merges two or more SummarizedExperiment objects together. Can be used to merge analytical modes or batches.
merge_notame_sets(..., merge = c("features", "samples"), assay.type = NULL)merge_notame_sets(..., merge = c("features", "samples"), assay.type = NULL)
... |
|
merge |
what to merge? features is used for combining analytical modes, samples is used for batches |
assay.type |
character, assay to be used in case of multiple assays. The same assay needs to be present in all objects to be merged, and the resultant object contains this single assay. |
When merging samples, sample IDs that begin with "QC" or "Ref" are combined so that they have running numbers on them. This means that if both batches have samples called "QC_1", this will not result in an error, but the sample IDs will be adjusted so that they are unique. Column names in the feature data that are shared between batches but have different content are renamed by adding a suffix to avoid data loss. The suffix is the index of the batch in the input list.
A merged SummarizedExperiment object.
# Merge analytical modes data(hilic_neg_sample, hilic_pos_sample, rp_neg_sample, rp_pos_sample) merged <- merge_notame_sets( hilic_neg_sample, hilic_pos_sample, rp_neg_sample, rp_pos_sample ) # Merge batches batch1 <- toy_notame_set[, toy_notame_set$Batch == 1] batch2 <- toy_notame_set[, toy_notame_set$Batch == 2] merged <- merge_notame_sets(batch1, batch2, merge = "samples")# Merge analytical modes data(hilic_neg_sample, hilic_pos_sample, rp_neg_sample, rp_pos_sample) merged <- merge_notame_sets( hilic_neg_sample, hilic_pos_sample, rp_neg_sample, rp_pos_sample ) # Merge batches batch1 <- toy_notame_set[, toy_notame_set$Batch == 1] batch2 <- toy_notame_set[, toy_notame_set$Batch == 2] merged <- merge_notame_sets(batch1, batch2, merge = "samples")
Computes Bhattacharyya distance between all pairs of batches using
fpc:bhattacharyya.matrix
after projecting the samples into PCA space with
pca.
pca_bhattacharyya_dist( object, batch, all_features = FALSE, center = TRUE, scale = "uv", nPcs = 3, assay.type = NULL, ... )pca_bhattacharyya_dist( object, batch, all_features = FALSE, center = TRUE, scale = "uv", nPcs = 3, assay.type = NULL, ... )
object |
a |
batch |
column name of pheno data giving the batch labels |
all_features |
logical, should all features be used? If FALSE (the default), flagged features are removed before imputation. |
center |
logical, should the data be centered prior to PCA? (usually yes) |
scale |
scaling used, as in
|
nPcs |
the number of principal components to use |
assay.type |
character, assay to be used in case of multiple assays |
... |
other parameters to |
A matrix of Bhattacharyya distances between batches.
data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction pca_bhattacharyya_dist(toy_notame_set, batch = "Batch") pca_bhattacharyya_dist(batch_corrected, batch = "Batch")data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction pca_bhattacharyya_dist(toy_notame_set, batch = "Batch") pca_bhattacharyya_dist(batch_corrected, batch = "Batch")
Computes repeatability for each feature with the following formula:
The repeatability ranges from 0 to 1. Higher repeatability depicts less variation between batches.
perform_repeatability(object, group, assay.type = NULL)perform_repeatability(object, group, assay.type = NULL)
object |
a |
group |
column name of pheno data giving the group labels |
assay.type |
character, assay to be used in case of multiple assays |
A data frame with one row per feature with the repeatability measure.
data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction rep_orig <- perform_repeatability(toy_notame_set, group = "Group") mean(rep_orig$Repeatability, na.rm = TRUE) rep_corr <- perform_repeatability(batch_corrected, group = "Group") mean(rep_corr$Repeatability, na.rm = TRUE)data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction rep_orig <- perform_repeatability(toy_notame_set, group = "Group") mean(rep_orig$Repeatability, na.rm = TRUE) rep_corr <- perform_repeatability(batch_corrected, group = "Group") mean(rep_corr$Repeatability, na.rm = TRUE)
Apply probabilistic quotient normalization (PQN) to the peak table of a SummarizedExperiment object. By default, reference is calculated from high-quality QC samples and the median of the reference is used for normalization. Check parameters for more options.
pqn_normalization( object, ref = c("qc", "all"), method = c("median", "mean"), all_features = FALSE, assay.type = NULL, name = NULL )pqn_normalization( object, ref = c("qc", "all"), method = c("median", "mean"), all_features = FALSE, assay.type = NULL, name = NULL )
object |
a |
ref |
character, the type of reference samples to use for normalization. |
method |
character, the method to use for calculating the reference sample. |
all_features |
logical, should all features be used for calculating the reference sample? |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
A SummarizedExperiment object with altered feature abundances.
data(toy_notame_set) pqn_set <- pqn_normalization(toy_notame_set)data(toy_notame_set) pqn_set <- pqn_normalization(toy_notame_set)
Extract quality information of features
quality(object)quality(object)
object |
a |
A data frame with quality metrics for each feature.
data(toy_notame_set) ex_set <- assess_quality(toy_notame_set) quality(ex_set)data(toy_notame_set) ex_set <- assess_quality(toy_notame_set) quality(ex_set)
An interface for link[RUVSeq]{RUVs} method
ruvs_qc(object, replicates, k = 3, assay.type = NULL, name = NULL, ...)ruvs_qc(object, replicates, k = 3, assay.type = NULL, name = NULL, ...)
object |
a |
replicates |
list of numeric vectors, indexes of replicates |
k |
The number of factors of unwanted variation to be estimated from the data. |
assay.type |
character, assay to be used in case of multiple assays |
name |
character, name of the resultant assay in case of multiple assays |
... |
other parameters passed to |
A SummarizedExperiment object with the normalized data.
data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction pca_bhattacharyya_dist(toy_notame_set, batch = "Batch") pca_bhattacharyya_dist(batch_corrected, batch = "Batch")data(toy_notame_set) # Batch correction replicates <- list(which(toy_notame_set$QC == "QC")) batch_corrected <- ruvs_qc(toy_notame_set, replicates = replicates) # Evaluate batch correction pca_bhattacharyya_dist(toy_notame_set, batch = "Batch") pca_bhattacharyya_dist(batch_corrected, batch = "Batch")
Contains imaginary data used in testing the package functions. The dataset has 50 samples and 80 features. This dataset includes multiple observations from same subjects, sampled at two timepoints in separate batches and divided to two groups. The analytical modes are also available as separate SummarizedExperiment objects. Note that across batches, the features don't have different feature ID's, m/z and retention time as would be the case with real-world data. In essence, the example data reflects that features were aligned perfectly between batches.
toy_notame_set hilic_neg_sample hilic_pos_sample rp_neg_sample rp_pos_sampletoy_notame_set hilic_neg_sample hilic_pos_sample rp_neg_sample rp_pos_sample
An object of class SummarizedExperiment with 80 rows and 50 columns.
An object of class SummarizedExperiment with 20 rows and 50 columns.
An object of class SummarizedExperiment with 20 rows and 50 columns.
An object of class SummarizedExperiment with 20 rows and 50 columns.
An object of class SummarizedExperiment with 20 rows and 50 columns.
Writes all the data in a SummarizedExperiment object to an Excel spreadsheet. The format is similar to the one used to read data in, except for the fact that EVERYTHING NEEDS TO BE WRITTEN AS TEXT. To fix numeric values in Excel, choose any cell with a number, press Ctrl + A, then go to the dropdown menu in upper left corner and choose "Convert to Number". This will fix the file, but can take quite a while.
write_to_excel(object, file, ...)write_to_excel(object, file, ...)
object |
a |
file |
path to the file to write |
... |
Additional parameters passed to
|
None, the function is invoked for its side effect.
data(toy_notame_set) write_to_excel(toy_notame_set, file = "toy_notame_set.xlsx")data(toy_notame_set) write_to_excel(toy_notame_set, file = "toy_notame_set.xlsx")