Title: Quantitative Metabolomics Data Processing Tools
Description: The qmtools (quantitative metabolomics tools) package provides basic tools for processing quantitative metabolomics data with the standard SummarizedExperiment class. This includes functions for imputation, normalization, feature filtering, feature clustering, dimension-reduction, and visualization to help users prepare data for statistical analysis. The package also offers a convenient way to compute empirical Bayes statistics identifying metabolic features that differ between two sets of study samples. Several functions in this package can also be applied to other types of omics data.
Authors: Jaehyun Joo [aut, cre], Blanca Himes [aut]
Maintainer: Jaehyun Joo <[email protected]>
License: GPL-3
Version: 1.11.0
Built: 2024-12-18 03:57:07 UTC
Source: https://github.com/bioc/qmtools
Function to cluster LC-MS features according to their retention time and intensity correlation across samples with a SummarizedExperiment.
clusterFeatures( x, i, rtime_var = "rtime", rt_cut = 10, cor_cut = 0.7, rt_grouping = c("hclust", "closest", "consecutive"), cor_grouping = c("louvain", "SimilarityMatrix", "connected", "none"), cor_use = c("everything", "all.obs", "complete.obs", "na.or.complete", "pairwise.complete.obs"), cor_method = c("pearson", "kendall", "spearman"), log2 = FALSE, hclust_linkage = "complete" )
x |
A SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use. |
rtime_var |
A string specifying the name of the variable containing a numeric vector of retention times in rowData(x). |
rt_cut |
A numeric value specifying a cut-off for the retention-time based feature grouping. |
cor_cut |
A numeric value specifying a cut-off for the correlation-based feature grouping. |
rt_grouping |
A string specifying which method to use for the retention-time based feature grouping. |
cor_grouping |
A string specifying which method to use for the correlation-based feature grouping. |
cor_use |
A string specifying which method to compute correlations in the presence of missing values. Refer to ?cor for details. |
cor_method |
A string specifying which correlation coefficient is to be computed. See ?cor for details. |
log2 |
A logical specifying whether feature intensities need to be log2-transformed before calculating a correlation matrix. |
hclust_linkage |
A string specifying the linkage method to be used when rt_grouping is "hclust". |
For soft ionization methods (e.g., LC/ESI-MS) commonly used in metabolomics, one or more ions can be generated from an individual compound upon ionization. The redundancy of feature data needs to be addressed since we are typically interested in compounds rather than in different ion species. This function attempts to identify a group of features from the same compound with the following steps:
Features are grouped by their retention times to identify co-eluting compounds.
For each retention time-based group, features are further clustered by patterns of the intensity correlations across samples to identify a subset of features from the same compound.
The retention time-based grouping is performed either by hierarchical clustering via hclust or by the methods available in the MsFeatures package via MsFeatures::groupClosest and MsFeatures::groupConsecutive. With rt_grouping = "hclust", complete-linkage clustering is conducted by default using the Manhattan distance (i.e., the difference in retention times), so that the distance between two clusters is defined as the difference in retention times between the farthest pair of elements in the two clusters. Group memberships are assigned by specifying the cut height for this distance metric (rt_cut). Other linkage methods can be specified with hclust_linkage; please refer to ?hclust for details. For "closest" and "consecutive", please refer to ?MsFeatures::groupClosest and ?MsFeatures::groupConsecutive for details of the algorithms.
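As an illustration only (not the package's internal code), the hierarchical step can be mimicked with base R on a made-up vector of retention times, cutting the tree at a height analogous to rt_cut = 10:
## Made-up retention times (in seconds) for six hypothetical features
rt <- c(10.2, 10.4, 10.9, 55.0, 55.3, 120.1)
d <- dist(rt, method = "manhattan")  # pairwise |rt_i - rt_j|
hc <- hclust(d, method = "complete") # complete linkage (the default here)
cutree(hc, h = 10)                   # retention time-based group labels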
For the correlation-based grouping, cor_grouping = "connected" creates an undirected graph using feature correlations as an adjacency matrix (i.e., correlations serve as edge weights). Edges whose weights are below the cut-off specified by cor_cut are removed from the graph, separating features into several disconnected subgroups. Features in the same subgroup are assigned to the same feature cluster. With "louvain", the function further applies the Louvain algorithm to the graph in order to identify densely connected features via igraph::cluster_louvain. With "SimilarityMatrix", MsFeatures::groupSimilarityMatrix is used for feature grouping; please refer to ?MsFeatures::groupSimilarityMatrix for details of the algorithm.
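A minimal sketch of the "connected"/"louvain" idea with igraph on a small made-up intensity matrix (this approximates the grouping logic described above; it is not the package code):
library(igraph)
set.seed(1)
f1 <- rnorm(10)
m <- rbind(f1 = f1, f2 = f1 + rnorm(10, sd = 0.1), f3 = rnorm(10)) # features x samples
cc <- cor(t(m), use = "pairwise.complete.obs") # feature-feature correlations
cc[cc < 0.7] <- 0                              # drop edges below a cor_cut-like threshold
diag(cc) <- 0                                  # no self-loops
g <- graph_from_adjacency_matrix(cc, mode = "undirected", weighted = TRUE)
membership(cluster_louvain(g))                 # feature cluster labels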
A SummarizedExperiment object with the grouping results added to the columns "rtime_group" (initial grouping on retention times) and "feature_group" in its rowData.
Johannes Rainer (2022). MsFeatures: Functionality for Mass Spectrometry Features. R package version 1.3.0. https://github.com/RforMassSpectrometry/MsFeatures
Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre: Fast unfolding of communities in large networks. J. Stat. Mech. (2008) P10008
Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. https://igraph.org
See hclust, cutree, MsFeatures::groupClosest, MsFeatures::groupConsecutive, MsFeatures::groupSimilarityMatrix, and igraph::cluster_louvain for the underlying functions that do the work.
See plotRTgroup to visualize the grouping result.
data(faahko_se)
se <- clusterFeatures(faahko_se, i = "knn_vsn", rtime_var = "rtmed")
rowData(se)[, c("rtmed", "rtime_group", "feature_group")]
Function to make a comparison between two groups of study samples with a SummarizedExperiment.
compareSamples( x, i, group, class1, class2, covariates = NULL, confint = TRUE, number = nrow(x), adjust.method = "BH", sort.by = "B", resort.by = NULL, p.value = 1, fc = NULL, lfc = NULL, ... )
x |
A SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use. The assay is expected to contain log-transformed intensities. |
group |
A string specifying the name of the variable containing a class label for each sample in colData(x). |
class1 , class2
|
A string specifying the class label of samples to be compared. Must be one of the class labels in the group variable. |
covariates |
A vector indicating the names of variables to be included in the model as covariates. The covariates must be found in colData(x). |
confint |
A logical specifying whether 95% confidence intervals of log-fold-change need to be reported. Alternatively, a numeric value between zero and one specifying the confidence level required. |
number |
The maximum number of metabolic features to list. |
adjust.method |
A string specifying which p-value adjustment method to use. Options, in increasing conservatism, include "none", "BH", "BY" and "holm". See p.adjust for the complete list of options. A NULL value will result in the default adjustment method, which is "BH". |
sort.by |
A string specifying which statistic to rank the metabolic features by. Possible values for topTable are "logFC", "AveExpr", "t", "P", "p", "B" or "none" (Permitted synonyms are "M" for "logFC", "A" or "Amean" for "AveExpr", "T" for "t" and "p" for "P"). |
resort.by |
A string specifying statistic to sort the selected metabolic features by in the output. Possibilities are the same as for sort.by. |
p.value |
A numeric value specifying a cut-off for adjusted p-values. Only metabolic features with lower p-values are listed. |
fc |
A numeric value specifying a minimum fold-change to be required. If specified, the function output only includes metabolic features with an absolute fold-change greater than fc. |
lfc |
A numeric value specifying a minimum log2-fold-change required; an alternative way of specifying fc. |
... |
Additional arguments passed to limma::eBayes. |
This function provides a simplified interface for fitting a linear model to make a comparison of interest using the limma::lmFit, limma::eBayes, and limma::topTable functions. For more flexible model specifications (e.g., interaction models, multi-level models), please use a standard workflow outlined in the limma package user's guide.
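For orientation, the steps this wrapper performs look roughly like the limma workflow below on the example data; the exact design and contrast construction inside compareSamples() may differ, so treat this as a sketch:
library(qmtools)
library(limma)
library(SummarizedExperiment)
data(faahko_se)
y <- assay(faahko_se, "knn_vsn")               # log-scale (glog) intensities
grp <- factor(colData(faahko_se)$sample_group)
design <- model.matrix(~ grp)                  # second coefficient: group difference
fit <- eBayes(lmFit(y, design))
topTable(fit, coef = 2, number = 5, adjust.method = "BH")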
A data.frame with a row for each metabolic feature and the following columns:
logFC: an estimate of log-fold-change corresponding to the contrast tested
CI.L: the left limit of the confidence interval for logFC (if confint is enabled)
CI.R: the right limit of the confidence interval for logFC (if confint is enabled)
AveExpr: an average log-expression/abundance of metabolic features
t: a moderated t-statistic
P.Value: a raw p-value
adj.P.Value: an adjusted p-value
B: a log-odds that the metabolic feature is differentially expressed
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20;43(7):e47. doi: 10.1093/nar/gkv007. Epub 2015 Jan 20. PMID: 25605792; PMCID: PMC4402510.
See limma::lmFit, limma::eBayes, and limma::topTable for the underlying functions that do the work.
data(faahko_se)
compareSamples(faahko_se, i = "knn_vsn", group = "sample_group", number = 5)
A SummarizedExperiment object containing FAAH knockout LC/MS feature intensity data from the faahKO package, created using the faahko3 data.
data(faahko_se)
An object of class SummarizedExperiment with 206 rows and 12 columns.
Colin A. Smith (2021). faahKO: Saghatelian et al. (2004) FAAH knockout LC/MS data. R package version 1.32.0. http://dx.doi.org/10.1021/bi0480335
data(faahko_se)
Performs a variety of data imputation methods on a matrix-like object or SummarizedExperiment object. The methods include k-Nearest Neighbors (kNN), Random Forest (RF), and many others from the MsCoreUtils::impute_matrix. See the details below.
## S4 method for signature 'ANY'
imputeIntensity( x, method = c("knn", "rf", "bpca", "QRILC", "MLE", "MinDet", "MinProb", "min", "zero", "mixed", "nbavg", "with", "none"), ... )

## S4 method for signature 'SummarizedExperiment'
imputeIntensity( x, method = c("knn", "rf", "bpca", "QRILC", "MLE", "MinDet", "MinProb", "min", "zero", "mixed", "nbavg", "with", "none"), i, name, ... )
x |
A matrix-like object or SummarizedExperiment object. |
method |
A string specifying which imputation method to use. |
... |
Arguments passed to a specific imputation method. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
name |
A string specifying the name to be used to store the imputed intensities in x when x is a SummarizedExperiment object. |
The method argument can be one of "knn", "rf", "bpca", "QRILC", "MLE", "MinDet", "MinProb", "min", "zero", "mixed", "nbavg", "with", "none". Please choose one that best describes the nature of missing data. While this function provides several simple imputation methods, they may only work under restrictive assumptions.
"knn" performs kNN imputation based on the Gower distance or Euclidean distance. See imputeKNN for details.
"rf" performs random forest imputation using the missForest::missForest, as described in Stekhoven D. J., & Buehlmann, P. (2012). This method is not sensitive to monotonic transformations of the intensity matrix.
For the other method arguments, please refer to the MsCoreUtils::impute_matrix. Briefly,
"bpca": Bayesian PCA missing value imputation.
"QRILC": Quantile regression approach for the imputation of left-censored missing data.
"MLE": Maximum likelihood-based imputation.
"MinDet": Deterministic minimal value approach for the imputation of left-censored data.
"MinProb": Stochastic minimal value approach for the imputation of left-censored data.
"min": Replace the missing values with the smallest non-missing value in the data.
"zero": Replace the missing values with 0.
"mixed": Mixed imputation applying two methods.
"nbavg": Average neighbour imputation for fractions collected along a fractionation/separation gradient.
"with": Replace the missing values with a user-provided value.
"none": Reserved for the "mixed" method.
A matrix or SummarizedExperiment object of the same dimension as x containing the imputed intensities.
Laurent Gatto, Johannes Rainer and Sebastian Gibb (2021). MsCoreUtils: Core Utils for Mass Spectrometry Data. R package version 1.4.0. https://github.com/RforMassSpectrometry/MsCoreUtils
Stekhoven D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
See imputeKNN, missForest::missForest, and MsCoreUtils::impute_matrix for the underlying functions that do the work.
data(faahko_se)

## SummarizedExperiment object
se <- imputeIntensity(faahko_se, i = "raw", name = "imp1", method = "knn")
assayNames(se)

## Matrix
m <- assay(faahko_se, i = "raw")
imputeIntensity(m, method = "min")
Performs k-nearest neighbor (kNN) imputation on a matrix-like object where rows represent features and columns represent samples. This function finds k-nearest neighbors using either Gower distance or Euclidean distance.
imputeKNN( x, k = 10, type = c("gower", "euclidean"), by = c("feature", "sample"), scale = FALSE, ... )
x |
A matrix-like object. |
k |
An integer specifying the number of nearest neighbors to be used in imputation. |
type |
A string specifying the distance metric to be used. Either "gower" or "euclidean". |
by |
A string specifying whether the imputation is performed by k-nearest features or by k-nearest samples. Either "feature" or "sample". |
scale |
A logical specifying whether x needs to be standardized prior to the imputation when Euclidean distance is used. |
... |
Arguments passed to VIM::kNN (Gower distance) or impute::impute.knn (Euclidean distance). |
The kNN imputation based on Euclidean distance typically requires standardization of the input data to avoid variance-based weighting of variables (i.e., to put variables on similar scales). When Gower distance is used, the imputation can be done with the original units (the standardized input would give the same result on a different scale). The type "gower" utilizes VIM::kNN and "euclidean" uses impute::impute.knn.
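A sketch of what the two distance options roughly map to on the example data; the wrapper's own pre-processing (e.g., how it standardizes or transposes the matrix) may differ from this illustration:
library(qmtools)
library(SummarizedExperiment)
data(faahko_se)
m <- assay(faahko_se, "raw")
## Euclidean distance: impute.knn() works on features in rows; intensities are
## often log-transformed/standardized first so no feature dominates the distance.
euc <- impute::impute.knn(log2(m), k = 10)$data
## Gower distance: VIM::kNN() takes a data.frame and is insensitive to monotone
## transformations, so the original units can be used directly.
gow <- VIM::kNN(as.data.frame(m), k = 10, imp_var = FALSE)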
A matrix of the same dimension as x containing the imputed intensities.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan and Gilbert Chu (2021). impute: Imputation for microarray data. R package version 1.66.0.
Alexander Kowarik, Matthias Templ (2016). Imputation with the R Package VIM. Journal of Statistical Software, 74(7), 1-16. doi:10.18637/jss.v074.i07
See imputeIntensity that provides a SummarizedExperiment-friendly wrapper for this function.
See VIM::kNN and impute::impute.knn for the underlying functions that do the work.
data(faahko_se)
m <- assay(faahko_se, "raw")
imputeKNN(m)
Performs a few data-driven normalization methods on a matrix-like object or SummarizedExperiment object. The methods include probabilistic quotient normalization (PQN), cyclic loess normalization, feature-based scaling, and many others from the MsCoreUtils::normalize_matrix. See the details below.
## S4 method for signature 'ANY'
normalizeIntensity( x, method = c("pqn", "div.sum", "div.mean", "div.median", "div.mad", "center.mean", "center.median", "diff.median", "cyclicloess", "vsn", "quantiles", "quantiles.robust", "feature.scale"), ... )

## S4 method for signature 'SummarizedExperiment'
normalizeIntensity( x, method = c("pqn", "div.sum", "div.mean", "div.median", "div.mad", "center.mean", "center.median", "diff.median", "cyclicloess", "vsn", "quantiles", "quantiles.robust", "feature.scale"), i, name, ... )
x |
A matrix-like object or SummarizedExperiment object. |
method |
A string specifying which normalization method to use. |
... |
Arguments passed to a specific normalization method. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
name |
A string specifying the name to be used to store the normalized intensities in x when x is a SummarizedExperiment object. |
The method argument can be one of "pqn", "cyclicloess", "vsn", "feature.scale", "div.sum", "div.mean", "div.median", "div.mad", "center.mean", "center.median", "diff.median", "quantiles", and "quantiles.robust".
"pqn" performs probabilistic quotient normalization, as described in Dieterle et al. (2006). See normalizePQN for details.
"cyclicloess" performs cyclic LOESS normalization using the
limma::normalizeCyclicLoess. The input x
is expected to contain
log-transformed intensities. See Bolstad et al. (2003) and Ballman et al.
(2004) for details. Please use type
if you want to specify a cyclic loess
method due to a name conflict with the existing argument in this function.
"vsn" performs variance stabilizing normalization (VSN), as described in Huber et al. (2002). It produces normalized intensities based on a glog (generalized logarithm) scale. See the vsn::vsn2 for details.
"feature.scale" performs feature-based scaling (applied along the rows) as described in van den Berg et al. (2006). See scaleRows for details.
For "div.sum", "div.mean", "div.median", and "div.mad", the respective sample intensities are divided by the column sums, means, medians, or median absolute deviations. See scaleCols for details.
"center.mean" and "center.median" center the intensities by subtracting the column means or medians, respectively.
"diff.median" centers all samples so that they all match the grand median by subtracting the respective columns medians differences to the grand median.
"quantiles" and "quantiles.robust" perform quantiles normalization, as described in Bolstad et al. (2003). See the preprocessCore::normalize.quantiles and preprocessCore::normalize.quantiles.robust for details.
A matrix or SummarizedExperiment object of the same dimension as x containing the normalized intensities.
Laurent Gatto, Johannes Rainer and Sebastian Gibb (2021). MsCoreUtils: Core Utils for Mass Spectrometry Data. R package version 1.4.0. https://github.com/RforMassSpectrometry/MsCoreUtils
Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem. 2006 Jul 1;78(13):4281-90. doi: 10.1021/ac051632c. PMID: 16808434.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20;43(7):e47. doi: 10.1093/nar/gkv007. Epub 2015 Jan 20. PMID: 25605792; PMCID: PMC4402510.
Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003 Jan 22;19(2):185-93. doi: 10.1093/bioinformatics/19.2.185. PMID: 12538238.
Ballman KV, Grill DE, Oberg AL, Therneau TM. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics. 2004 Nov 1;20(16):2778-86. doi: 10.1093/bioinformatics/bth327. Epub 2004 May 27.
Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18 Suppl 1:S96-104. doi: 10.1093/bioinformatics/18.suppl_1.s96. PMID: 12169536.
van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics. 2006 Jun 8;7:142. doi: 10.1186/1471-2164-7-142. PMID: 16762068; PMCID: PMC1534033.
See normalizePQN, scaleRows, scaleCols, limma::normalizeCyclicLoess, and MsCoreUtils::normalize_matrix for the underlying functions that do the work.
data(faahko_se)

## SummarizedExperiment object
se <- normalizeIntensity(faahko_se, i = "knn", name = "knn_pqn", method = "pqn")
assayNames(se)

## Matrix
m <- assay(faahko_se, "knn")
normalizeIntensity(m, method = "feature.scale", type = "pareto")
Performs probabilistic quotient normalization (PQN) on a matrix-like object where rows represent features and columns represent samples.
normalizePQN(x, ref_samples = NULL, min_frac = 0.5, type = c("median", "mean"))
x |
A matrix-like object. |
ref_samples |
A vector of sample names or column indices specifying reference samples for the calculation of quotients. Must be a subset of colnames(x). |
min_frac |
A numeric value between 0 and 1 specifying a minimum proportion of reference samples for features to be included in the calculation of a reference spectrum. |
type |
A method to compute a reference spectrum. Either "median" or "mean". |
For the calculation of quotients, a reference spectrum needs to be obtained from a median or mean spectrum based on all spectra of the study or a subset of the study. Feature intensities are normalized by the median of quotients. See Dieterle et al. (2006) for details.
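The calculation can be illustrated with a few lines of base R on a made-up positive intensity matrix (features in rows, samples in columns); this mirrors the description above rather than reproducing the function's exact code:
set.seed(1)
m <- matrix(rlnorm(50), nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
ref <- apply(m, 1, median, na.rm = TRUE)             # reference spectrum
quot <- sweep(m, 1, ref, "/")                        # feature-wise quotients
norm_factors <- apply(quot, 2, median, na.rm = TRUE) # dilution factor per sample
m_pqn <- sweep(m, 2, norm_factors, "/")              # normalized intensities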
A matrix of the same dimension as x containing the normalized intensities.
Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem. 2006 Jul 1;78(13):4281-90. doi: 10.1021/ac051632c. PMID: 16808434.
See normalizeIntensity that provides a SummarizedExperiment-friendly wrapper for this function.
data(faahko_se)
m <- assay(faahko_se, "knn")
normalizePQN(m)
Produces a box-and-whisker plot with a SummarizedExperiment or matrix-like object where rows represent features and columns represent samples.
plotBox(x, i, group, log2 = FALSE, violin = FALSE, ylab = "Intensity")
x |
A matrix-like object or SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
group |
A discrete variable to visualize the grouping structure. |
log2 |
A logical specifying whether feature intensities need to be log2-transformed before visualization. |
violin |
A logical specifying whether a violin plot is shown instead of a boxplot. |
ylab |
A string specifying the title of y-axis. |
A ggplot object.
data(faahko_se)

## Sample group
g <- colData(faahko_se)$sample_group

## SummarizedExperiment object
plotBox(faahko_se, i = "knn", group = g, log2 = TRUE) # before normalization

## Matrix
m <- assay(faahko_se, "knn_vsn")
plotBox(m, group = g) # after normalization
Visualizes correlations between samples or features with a SummarizedExperiment or matrix-like object where rows represent features and columns represent samples. A correlation matrix is visualized using a heatmap with dendrograms.
plotCorr( x, i, type = c("sample", "feature"), log2 = FALSE, use = c("everything", "all.obs", "complete.obs", "na.or.complete", "pairwise.complete.obs"), method = c("pearson", "kendall", "spearman"), dendrogram = TRUE, colors = (scales::viridis_pal())(256), label = FALSE, digits = 2, widths = c(0.8, 0.2), heights = c(0.2, 0.8), hide_colorbar = FALSE, showticklabels = c(TRUE, TRUE), row_dend_left = FALSE, k_row = 1, k_col = 1, ... )
x |
A matrix-like object or SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
type |
A string specifying whether a correlation matrix is computed based on samples or features. |
log2 |
A logical specifying whether feature intensities need to be log2-transformed before calculating a correlation matrix. |
use |
A string specifying which method to compute correlations in the presence of missing values. Refer to ?cor for details. |
method |
A string specifying which correlation coefficient is to be computed. See ?cor for details. |
dendrogram |
A logical specifying whether a dendrogram is computed and reordering is performed. |
colors |
A vector of colors for the heatmap. |
label |
A logical specifying whether cell values are shown. |
digits |
A numeric value specifying the desired number of digits when label = TRUE. |
widths |
A numeric vector specifying the relative widths of the heatmap and dendrogram. |
heights |
A numeric vector specifying the relative heights of the heatmap and dendrogram. |
hide_colorbar |
A logical specifying whether the color bar (legend) is hidden. |
showticklabels |
A logical vector of length 2 (x-axis, y-axis) specifying whether the ticks are removed from the sides of the plot. |
row_dend_left |
A logical controlling whether the row dendrogram is placed on the left on the plot. |
k_row |
A numeric value specifying the desired number of groups by which to color the dendrogram's branches in the rows. |
k_col |
A numeric value specifying the desired number of groups by which to color the dendrogram's branches in the columns. |
... |
Additional arguments passed to heatmaply::heatmaply. |
A patchwork object of aligned ggplots.
Tal Galili, Alan O'Callaghan, Jonathan Sidi, Carson Sievert; heatmaply: an R package for creating interactive cluster heatmaps for online publishing, Bioinformatics, btx657, https://doi.org/10.1093/bioinformatics/btx657
data(faahko_se)

## Sample group
g <- colData(faahko_se)$sample_group

## SummarizedExperiment object
plotCorr(faahko_se, i = "knn_vsn", method = "spearman", k_col = 4)

## Matrix
m <- assay(faahko_se, "knn_vsn")
plotCorr(m[1:50, ], type = "feature", method = "spearman")
Visualizes missing values with a SummarizedExperiment object or matrix of intensity data where rows represent features and columns represent samples. All values in a data matrix are re-coded (1: missing; 0: non-missing). The left panel displays the amount of missing values in each sample. The right panel displays the pattern of missing values using a heatmap with dendrograms.
plotMiss( x, i, group, dendrogram_row = TRUE, dendrogram_col = FALSE, colors = (scales::viridis_pal())(2), hide_colorbar = TRUE, showticklabels = c(TRUE, FALSE), row_dend_left = FALSE, k_row = 1, k_col = 1, ... )
x |
A matrix-like object or SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
group |
A discrete variable to change colors of the barplot by sample groups. |
dendrogram_row |
A logical specifying whether a dendrogram is computed and reordering is performed based on rows. |
dendrogram_col |
A logical specifying whether a dendrogram is computed and reordering is performed based on columns. |
colors |
A vector of colors for the heatmap. |
hide_colorbar |
A logical specifying whether the color bar (legend) in the heatmap is hidden. |
showticklabels |
A logical vector of length 2 (x-axis, y-axis) specifying whether the ticks are removed from the sides of the heatmap. |
row_dend_left |
A logical controlling whether the row dendrogram is placed on the left on the heatmap. |
k_row |
A numeric value specifying the desired number of groups by which to color the dendrogram's branches in the rows. |
k_col |
A numeric value specifying the desired number of groups by which to color the dendrogram's branches in the columns. |
... |
Additional arguments passed to heatmaply::heatmaply. |
A patchwork object of aligned ggplots
Tal Galili, Alan O'Callaghan, Jonathan Sidi, Carson Sievert; heatmaply: an R package for creating interactive cluster heatmaps for online publishing, Bioinformatics, btx657, https://doi.org/10.1093/bioinformatics/btx657
data(faahko_se)

## Sample group
g <- colData(faahko_se)$sample_group

## SummarizedExperiment object
plotMiss(faahko_se, i = 1, group = g)

## Matrix
m <- assay(faahko_se, i = 1)
plotMiss(m, group = g, dendrogram_col = TRUE)
Function to visualize dimension-reduced data matrices mainly produced by reduceFeatures, including reduced.pca, reduced.tsne, and reduced.plsda objects (or a matrix with the same structure).
plotReduced( x, comp = c(1, 2), biplot = FALSE, group, group_col = NULL, point_size = 1.5, point_shape_by_group = FALSE, label = FALSE, label_size = 3.88, label_subset = NULL, ellipse = FALSE, xlab = NULL, ylab = NULL, title = NULL, legend = TRUE, arrow_len = 0.2, arrow_col = "orange", arrow_alpha = 0.3, arrow_label = TRUE, arrow_label_ext = 1.05, arrow_label_size = 3.88, arrow_label_col = "orange", arrow_label_subset = NULL )
x |
A matrix containing the dimension-reduced data typically produced by reduceFeatures. |
comp |
A numeric vector of length 2 specifying the components to display. |
biplot |
A logical specifying whether to visualize an overlay of scores and loadings. Ignored if x is a reduced.tsne object. |
group |
A discrete variable to visualize the grouping structure. For a reduced.plsda object, the group information used to fit the model is used by default if not specified. |
group_col |
A vector of colors with the same length as the number of unique values in group. |
point_size |
A numeric value specifying the size of points. |
point_shape_by_group |
A logical specifying whether each group has a different shape of data points. Also can be a numeric vector with the same length as the number of unique values in group to manually set point shapes. |
label |
A logical specifying whether score labels are shown instead of points. |
label_size |
A numeric value controlling the size of labels. |
label_subset |
A character vector specifying a subset of score labels to display. |
ellipse |
A logical specifying whether data ellipses are shown. |
xlab |
A string specifying the title of x-axis. |
ylab |
A string specifying the title of y-axis. |
title |
A string specifying the main title of the plot. |
legend |
A logical specifying whether the plot legend is shown. |
arrow_len |
A numeric value specifying the length of arrow head. |
arrow_col |
A string specifying the color of arrows. |
arrow_alpha |
A numeric value specifying the transparency of the arrows. |
arrow_label |
A logical specifying whether text labels for arrows are shown. |
arrow_label_ext |
A numeric value specifying the scalar extension for arrow labels. |
arrow_label_size |
A numeric value specifying the size of arrow labels. |
arrow_label_col |
A string specifying the color of arrow labels. |
arrow_label_subset |
A character vector specifying a subset of arrow labels to display. |
A ggplot object.
data(faahko_se)

## Sample group
g <- colData(faahko_se)$sample_group

## PCA
pca_res <- reduceFeatures(faahko_se, i = "knn_vsn", method = "pca")

## Visualizes the result
plotReduced(pca_res, group = g)
plotReduced(pca_res, group = g, label = TRUE, ellipse = TRUE)
Visualizes feature grouping results produced by clusterFeatures. A retention-time based feature group is displayed with its sub-groups based on the feature intensity correlations, using either a pair plot or a graph. Features with the same color indicate that they are in the same group.
plotRTgroup( x, i, group, type = c("graph", "pairs"), rtime_group_var = "rtime_group", feature_group_var = "feature_group", cor_cut = 0.7, cor_use = c("everything", "all.obs", "complete.obs", "na.or.complete", "pairwise.complete.obs"), cor_method = c("pearson", "kendall", "spearman"), log2 = FALSE )
x |
A SummarizedExperiment object. |
i |
A string or integer value specifying which assay values to use. Choose the same value used in the feature grouping. |
group |
A string specifying the label of retention time-based group to visualize. |
type |
A string specifying which type of plots to visualize. |
rtime_group_var |
A string specifying the name of the variable containing the retention time-based grouping result in rowData(x). |
feature_group_var |
A string specifying the name of the variable containing the final feature grouping result in rowData(x). |
cor_cut |
A numeric value specifying a cut-off for visualizing correlations as edges in a graph. Ignored if type is "pairs". |
cor_use |
A string specifying which method to compute correlations in the presence of missing values. Refer to ?cor for details. |
cor_method |
A string specifying which correlation coefficient is to be computed. See ?cor for details. |
log2 |
A logical specifying whether feature intensities need to be log2-transformed before calculating a correlation matrix. Ignored if type is "pairs". Choose the same value used in the feature grouping. |
A graph or pair plot.
See clusterFeatures for feature grouping.
data(faahko_se)

## Clustering
se <- clusterFeatures(faahko_se, i = "knn_vsn", rtime_var = "rtmed")

## Graph
plotRTgroup(se, i = "knn_vsn", group = "FG.22")

## Pairwise scatter
plotRTgroup(se, i = 3, group = "FG.22", cor_method = "spearman",
            log2 = TRUE, type = "pairs")
Performs dimensionality reduction on a matrix-like object or SummarizedExperiment object.
## S4 method for signature 'ANY'
reduceFeatures(x, method = c("pca", "tsne", "plsda"), ncomp = 2, y, ...)

## S4 method for signature 'SummarizedExperiment'
reduceFeatures(x, method = c("pca", "tsne", "plsda"), ncomp = 2, i, y, ...)
x |
A matrix-like object or SummarizedExperiment object. |
method |
A string specifying which dimension-reduction method to use. |
ncomp |
An integer specifying the number of components to extract. |
y |
A factor vector for the information about each sample's group. |
... |
Arguments passed to a specific dimension-reduction method. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
Currently, principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA) are supported. For the method argument,
pca performs PCA using singular value decomposition. If there is any missing value, PCA is instead performed using the non-linear iterative partial least squares (NIPALS) algorithm via pcaMethods::nipalsPca (see the sketch after this list). See reducePCA for details.
tsne performs t-SNE using Rtsne::Rtsne. See reduceTSNE for details.
plsda performs PLS-DA using a standard PLS model for classification with pls::plsr. See reducePLSDA for details.
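The sketch below illustrates the PCA fallback mentioned above: with complete data a standard SVD-based PCA applies, while a matrix containing missing values can still be handled by the NIPALS algorithm from pcaMethods (shown via pcaMethods::pca for illustration; the package itself calls pcaMethods::nipalsPca):
library(qmtools)
library(SummarizedExperiment)
data(faahko_se)
m <- assay(faahko_se, "knn_vsn")     # complete, glog-scale intensities
p_svd <- prcomp(t(m), center = TRUE) # SVD-based PCA on samples
m_na <- m
m_na[1, 1] <- NA                     # introduce a missing value
p_nipals <- pcaMethods::pca(t(m_na), method = "nipals", nPcs = 2)
pcaMethods::scores(p_nipals)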
A matrix containing custom attributes related to the dimension-reduction method used.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnajah (Ed.), Multivariate analysis (pp. 391-420). NewYork: Academic Press.
Stacklies, W., Redestig, H., Scholz, M., Walther, D. and Selbig, J. pcaMethods – a Bioconductor package providing PCA methods for incomplete data. Bioinformatics, 2007, 23, 1164-1167
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne
Kristian Hovde Liland, Bjørn-Helge Mevik and Ron Wehrens (2021). pls: Partial Least Squares and Principal Component Regression. R package version 2.8-0. https://CRAN.R-project.org/package=pls
See reducePCA, reduceTSNE, and reducePLSDA for the underlying functions that do the work.
data(faahko_se)

## SummarizedExperiment object
res_pca <- reduceFeatures(faahko_se, i = "knn_vsn", method = "pca")
summary(res_pca)

## Matrix
y <- factor(colData(faahko_se)$sample_group)
m <- assay(faahko_se, i = "knn_vsn")
res_plsda <- reduceFeatures(m, method = "plsda", y = y, ncomp = 3)
summary(res_plsda)
Performs PCA on a matrix-like object where rows represent features and columns represent samples.
reducePCA(x, ncomp = 2, center = TRUE, scale = FALSE, ...)
x |
A matrix-like object. |
ncomp |
An integer specifying the number of components to extract. |
center |
A logical specifying whether x needs to be mean-centered prior to PCA. |
scale |
A logical specifying whether unit variance scaling needs to be performed on x prior to PCA. |
... |
Additional arguments passed to pcaMethods::nipalsPca. Ignored if x has no missing values. |
For data without missing values, PCA is performed on the transpose of x via singular value decomposition. Otherwise, PCA is performed on the transpose of x using the non-linear iterative partial least squares (NIPALS) algorithm via pcaMethods::nipalsPca. The function returns a reduced.pca object, which is a matrix with custom attributes to summarize (via summary) and visualize (via plotReduced) the PCA result. The custom attributes include the following:
method: The method used to reduce the dimension of data.
ncomp: The number of components extracted.
R2: A vector indicating the amount of variance explained by each principal component.
R2cum: A vector of cumulative R2.
loadings: A matrix of variable loadings.
sdev: A vector indicating the standard deviations of the principal components.
centered: A logical indicating whether the data was mean-centered prior to PCA.
scaled: A logical indicating whether the data was scaled prior to PCA.
A reduced.pca object with the same number of rows as ncol(x) containing the dimension reduction result.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnajah (Ed.), Multivariate analysis (pp. 391-420). NewYork: Academic Press.
Stacklies, W., Redestig, H., Scholz, M., Walther, D. and Selbig, J. pcaMethods – a Bioconductor package providing PCA methods for incomplete data. Bioinformatics, 2007, 23, 1164-1167
See reduceFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
See plotReduced for visualization.
See pcaMethods::nipalsPca for the underlying function that does the work.
data(faahko_se)
m <- assay(faahko_se, "knn_vsn")
res <- reducePCA(m, ncomp = 3)
summary(res)
Performs PLS-DA on a matrix-like object where rows represent features and columns represent samples.
reducePLSDA( x, y, ncomp = 2, center = TRUE, scale = FALSE, validation = c("none", "CV", "LOO"), return_mvr = FALSE, ... )
x |
A matrix-like object. |
y |
A factor vector for the information about each sample's group. |
ncomp |
An integer specifying the number of components to extract. |
center |
A logical specifying whether x needs to be mean-centered prior to PLS-DA. |
scale |
A logical specifying whether unit variance scaling needs to be performed on x prior to PLS-DA. |
validation |
A string specifying a validation method to use. See pls::plsr for the details. |
return_mvr |
A logical indicating whether the underlying mvr object from pls::plsr is returned instead of a reduced.plsda object. |
... |
Additional arguments passed to pls::plsr. |
This function performs standard PLS for classification with the transpose of x using pls::plsr. Since PLS-DA is a supervised method, users must supply the information about each sample's group. Here, y must be a factor so that it can be internally converted to an indicator matrix. The function returns a reduced.plsda object, which is a matrix with custom attributes to summarize (via summary) and visualize (via plotReduced) the PLS-DA result. The custom attributes include the following:
method: The method used to reduce the dimension of data.
ncomp: The number of components extracted.
explvar: A vector indicating the amount of X variance explained by each component.
responses: A vector indicating the levels of factor y.
predictors: A vector of predictor variables.
coefficient: An array of regression coefficients.
loadings: A matrix of loadings.
loadings.weights: A matrix of loading weights.
Y.observed: A vector of observed responses.
Y.predicted: A vector of predicted responses.
Y.scores: A matrix of Y-scores.
Y.loadings: A matrix of Y-loadings.
projection: The projection matrix.
fitted.values: An array of fitted values.
residuals: An array of regression residuals.
vip: An array of VIP (Variable Importance in the Projection) coefficients.
centered: A logical indicating whether the data was mean-centered prior to PLS-DA.
scaled: A logical indicating whether the data was scaled prior to PLS-DA.
validation: Results of the validation if requested.
A reduced.plsda object with the same number of rows as ncol(x) containing the dimension reduction result.
Kristian Hovde Liland, Bjørn-Helge Mevik and Ron Wehrens (2021). pls: Partial Least Squares and Principal Component Regression. R package version 2.8-0. https://CRAN.R-project.org/package=pls
See reduceFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
See plotReduced for visualization.
See pls::plsr for the underlying function that does the work.
data(faahko_se)
m <- assay(faahko_se, "knn_vsn")
y <- factor(colData(faahko_se)$sample_group)
res <- reducePLSDA(m, y = y)
summary(res)
Performs t-SNE on a matrix-like object where rows represent features and columns represent samples.
reduceTSNE(x, ncomp = 2, normalize = TRUE, ...)
x |
A matrix-like object. |
ncomp |
An integer specifying the number of components to extract. Must be either 1, 2, or 3. |
normalize |
A logical specifying whether the input matrix is mean-centered and scaled so that the largest absolute value of the centered matrix is equal to 1. See Rtsne::normalize_input for details. |
... |
Additional arguments passed to Rtsne::Rtsne. |
t-SNE is well-suited for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. This function performs t-SNE with the transpose of x using Rtsne::Rtsne and returns a reduced.tsne object, which is a matrix with custom attributes to summarize (via summary) and visualize (via plotReduced) the t-SNE result. The custom attributes include the following:
method: The method used to reduce the dimension of data.
ncomp: The number of components extracted.
perplexity: The perplexity parameter used.
theta: The speed/accuracy trade-off parameter used.
normalized: A logical indicating whether the data was normalized prior to t-SNE.
A reduced.tsne object with the same number of rows as ncol(x) containing the dimension reduction result.
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne
See reduceFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
See plotReduced for visualization.
See Rtsne::Rtsne for the underlying function that does the work.
data(faahko_se)
m <- assay(faahko_se, "knn_vsn")
res <- reduceTSNE(m, perplexity = 3)
summary(res)
Removes features based on QC/blank intensity ratios using a data matrix where rows represent features and columns represent samples. Features with a QC/blank ratio below a cut-off will be discarded. A feature will be retained if there are not enough blank samples to calculate an intensity ratio (or if the feature is completely absent in blank samples). Use removeMiss to remove features based on a proportion of missing values.
removeBlankRatio( x, blank_samples, qc_samples, cut = 2, type = c("median", "mean"), blank_min_n = 3 )
x |
A matrix-like object. |
blank_samples |
A vector of sample names or column indices specifying blank samples for the calculation of ratios. Must be a subset of colnames(x). |
qc_samples |
A vector of sample names or column indices specifying QC samples for the calculation of ratios. Must be a subset of colnames(x). |
cut |
A numeric value greater than 1 specifying a QC/blank ratio cut-off to retain a feature. |
type |
A method to compute a QC/blank ratio. Either "median" or "mean". |
blank_min_n |
An integer value specifying the minimum number of blank samples to calculate a ratio. |
A matrix containing the filtered features.
See removeFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
set.seed(1e7)
m_blank <- matrix(rlnorm(200), ncol = 5)
m_qc <- matrix(rlnorm(400, 1), ncol = 10)
m <- cbind(m_blank, m_qc)
colnames(m) <- c(paste0("B", seq_len(5)), paste0("Q", seq_len(10)))
removeBlankRatio(m, blank_samples = paste0("B", seq_len(5)),
                 qc_samples = paste0("Q", seq_len(10)))
Removes features based on missing values, QC samples, and blank samples. See the details below.
## S4 method for signature 'ANY'
removeFeatures(x, method = c("missing", "blankratio", "rsd", "icc"), ...)

## S4 method for signature 'SummarizedExperiment'
removeFeatures(x, method = c("missing", "blankratio", "rsd", "icc"), i, ...)
x |
A matrix-like object or SummarizedExperiment object. |
method |
A string specifying which filtering method to use. |
... |
Arguments passed to a specific filtering method. |
i |
A string or integer value specifying which assay values to use when x is a SummarizedExperiment object. |
The method argument can be one of "missing", "blankratio", "rsd", "icc".
"missing" removes features based on proportions of missing values. Users can specify one or more groups in samples. For multiple groups, a feature is retained if there is at least one group with a proportion of non-missing values above a cut-off.
For "blankratio", QC/blank intensity ratios are calculated for features present at the blank samples. Features with a ratio below a cut-off will be discarded.
"rsd" calculates a relative standard deviation (also known as coefficient of variation) for each feature using QC samples. Features with a RSD above a cut-off will be removed.
"icc" calculates an intraclass correlation coefficient (ICC) for each feature using both biological and QC samples to identify how much of the total variation is explained by biological variability, as described in Schiffman, C et al (2019). Features with an ICC below a cut-off will be removed.
A matrix or SummarizedExperiment object.
Schiffman, C., Petrick, L., Perttula, K. et al. Filtering procedures for untargeted LC-MS metabolomics data. BMC Bioinformatics 20, 334 (2019). https://doi.org/10.1186/s12859-019-2871-9
See removeMiss, removeBlankRatio, removeRSD, and removeICC for the underlying functions that do the work.
data(faahko_se)
g <- colData(faahko_se)$sample_group

## SummarizedExperiment object
se <- removeFeatures(faahko_se, i = "raw", method = "missing",
                     group = g, cut = 0.9)

## Matrix
m <- assay(faahko_se, i = "raw")
removeFeatures(m, method = "missing", group = g, levels = "WT", cut = 0.9)
Removes features based on an intraclass correlation coefficient (ICC) using the data matrix where rows represent features and columns represent samples. For each feature, the ICC will be calculated using both biological and QC samples to identify how much of the total variation is explained by biological variability, as described in Schiffman, C et al (2019). Informative features are expected to have relatively high variability across the biological samples, compared to QC replicates. Features with an ICC below a cut-off will be removed.
removeICC( x, qc_samples, bio_samples = setdiff(colnames(x), qc_samples), cut = 0.4 )
x |
A matrix-like object. |
qc_samples |
A vector of sample names or column indices specifying QC samples for the calculation of ICC. Must be a subset of colnames(x). |
bio_samples |
A vector of sample names or column indices specifying biological samples for the calculation of ICC. Must be a subset of colnames(x). |
cut |
A numeric value between 0 and 1 specifying an ICC cut-off to retain a feature. |
A matrix containing the filtered features.
Schiffman, C., Petrick, L., Perttula, K. et al. Filtering procedures for untargeted LC-MS metabolomics data. BMC Bioinformatics 20, 334 (2019). https://doi.org/10.1186/s12859-019-2871-9
See removeFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
set.seed(1e7)
m_bio_1 <- matrix(rlnorm(600, sdlog = 1), ncol = 20)
m_bio_2 <- matrix(rlnorm(200, sdlog = 0.3), ncol = 20)
m_bio <- rbind(m_bio_1, m_bio_2)
m_qc <- matrix(rlnorm(400, sdlog = 0.25), ncol = 10)
m <- cbind(m_bio, m_qc)
colnames(m) <- c(paste0("S", seq_len(20)), paste0("Q", seq_len(10)))
removeICC(m, qc_samples = paste0("Q", seq_len(10)),
          bio_samples = paste0("S", seq_len(20)))
Removes features based on proportions of missing values in the matrix where rows represent features and columns represent samples. Features can be removed based on missing values within a specific group or multiple groups. A feature will be retained if there is at least one group with a proportion of non-missing values above a cut-off.
removeMiss(x, group, levels = NULL, cut = 0.7)
x |
A matrix-like object. |
group |
A character vector for the information about each sample's group. |
levels |
A string or character vector specifying one or more groups used for filtering based on missing values. If NULL, all groups in group are considered. |
cut |
A numeric value between 0 and 1 specifying a minimum proportion of non-missing values to retain a feature. |
A matrix containing the filtered features.
See removeFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
data(faahko_se)
m <- assay(faahko_se, "raw")
g <- colData(faahko_se)$sample_group
table(g)

## Filter based on missing values in "KO" and "WT" groups
removeMiss(m, group = g, cut = 0.9)

## Consider only "KO" group (can be useful for QC-based filtering)
removeMiss(m, group = g, levels = "KO", cut = 0.9)
Removes features with low reproducibility based on a relative standard deviation (also known as coefficient of variation) of QC samples, using the data matrix where rows represent features and columns represent samples. Features with an RSD above a cut-off will be removed from the data.
removeRSD(x, qc_samples, cut = 0.3)
x |
A matrix-like object. |
qc_samples |
A vector of sample names or column indices specifying QC samples for the calculation of RSD. Must be a subset of colnames(x). |
cut |
A numeric value specifying an RSD cut-off to retain a feature. |
A matrix containing the filtered features.
See removeFeatures that provides a SummarizedExperiment-friendly wrapper for this function.
set.seed(1e7)
m_bio <- matrix(rlnorm(800, sdlog = 1), ncol = 20)
m_qc <- matrix(rlnorm(400, sdlog = 0.25), ncol = 10)
m <- cbind(m_bio, m_qc)
colnames(m) <- c(paste0("S", seq_len(20)), paste0("Q", seq_len(10)))
removeRSD(m, qc_samples = paste0("Q", seq_len(10)))
Function to scale a matrix of intensity data along the columns (samples).
scaleCols( x, type = c("div.sum", "div.mean", "div.median", "div.mad"), restrict = FALSE, rescale = FALSE )
x |
A matrix-like object. |
type |
A scaling method to use. |
restrict |
A logical specifying whether only features that are common to all samples are used in the calculation of scaling factors. |
rescale |
A logical specifying whether the normalized intensities are re-scaled by multiplying by the median of the normalization factors, to make them look similar to the original scale. |
Sample intensities are divided by the column sums ("div.sum"), means ("div.mean"), medians ("div.median"), or median absolute deviations ("div.mad").
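For instance, the "div.median" case corresponds to dividing each column by its median; a manual sketch on a made-up matrix (the restrict and rescale adjustments are not shown here):
set.seed(1)
m <- matrix(rlnorm(30), nrow = 6)                   # features x samples
sweep(m, 2, apply(m, 2, median, na.rm = TRUE), "/") # divide each sample by its median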
A matrix of the same dimension as x containing the scaled intensities.
See normalizeIntensity that provides a SummarizedExperiment-friendly wrapper for this function.
data(faahko_se)
m <- assay(faahko_se, "knn")
scaleCols(m)
Function to scale a matrix of intensity data along the rows (features), as described in van den Berg et al. (2006).
scaleRows( x, type = c("auto", "range", "pareto", "vast", "level", "sum", "max") )
x |
A matrix-like object. |
type |
A scaling method to use. |
This function will do one of the following (a manual sketch of the Pareto case follows this list):
Auto scaling (unit variance scaling): each feature is mean-centered and divided by its standard deviation.
Range scaling: each feature is mean-centered and divided by its range.
Pareto scaling: each feature is mean-centered and divided by the square root of its standard deviation.
Vast scaling (variance stability scaling): it is an extension of auto scaling, using the product of standard deviation and coefficient of variation as a scale factor.
Level scaling: each feature is mean-centered and divided by its mean.
Sum scaling: each feature is divided by its sum.
Max scaling: each feature is divided by its maximum.
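A manual sketch of Pareto scaling on a made-up matrix, mirroring the description above (mean-center each feature, then divide by the square root of its standard deviation):
set.seed(1)
m <- matrix(rlnorm(30), nrow = 6)          # features x samples
ctr <- m - rowMeans(m, na.rm = TRUE)       # mean-center each feature
ctr / sqrt(apply(m, 1, sd, na.rm = TRUE))  # Pareto-scaled intensities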
A matrix of the same dimension as x containing the scaled intensities.
van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics. 2006 Jun 8;7:142. doi: 10.1186/1471-2164-7-142. PMID: 16762068; PMCID: PMC1534033.
See normalizeIntensity that provides a SummarizedExperiment-friendly wrapper for this function.
data(faahko_se)
m <- assay(faahko_se, "knn")
scaleRows(m, type = "pareto")