Title: | Mass Spectrometry-Based Single-Cell Proteomics Data Analysis |
---|---|
Description: | Utility functions for manipulating, processing, and analyzing mass spectrometry-based single-cell proteomics data. The package is an extension to the 'QFeatures' package and relies on 'SingleCellExpirement' to enable single-cell proteomics analyses. The package offers the user the functionality to process quantitative table (as generated by MaxQuant, Proteome Discoverer, and more) into data tables ready for downstream analysis and data visualization. |
Authors: | Christophe Vanderaa [aut, cre] , Laurent Gatto [aut] |
Maintainer: | Christophe Vanderaa <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.17.0 |
Built: | 2024-10-31 04:51:02 UTC |
Source: | https://github.com/bioc/scp |
The function will add the component results computed
by scpComponentAnalysis()
to a SingleCellExperiment
's
reducedDims
slot, to all using the many scater
functions,
such as scater::plotReducedDim()
, scater::plotTSNE()
, ...
addReducedDims(sce, x)
addReducedDims(sce, x)
sce |
An instance of class SingleCellExperiment. |
x |
A |
A SingleCellExperiment
with updated reducedDims
.
Laurent Gatto and Christophe Vanderaa
library("scater") data("leduc_minimal") pcs <- scpComponentAnalysis( leduc_minimal, method = "ASCA", effects = "SampleType")$bySample reducedDims(leduc_minimal) leduc_minimal <- addReducedDims(leduc_minimal, pcs) reducedDims(leduc_minimal) plotReducedDim(leduc_minimal, dimred = "ASCA_SampleType", colour_by = "SampleType") leduc_minimal <- runTSNE(leduc_minimal, dimred = "ASCA_SampleType") plotTSNE(leduc_minimal, colour_by = "SampleType")
library("scater") data("leduc_minimal") pcs <- scpComponentAnalysis( leduc_minimal, method = "ASCA", effects = "SampleType")$bySample reducedDims(leduc_minimal) leduc_minimal <- addReducedDims(leduc_minimal, pcs) reducedDims(leduc_minimal) plotReducedDim(leduc_minimal, dimred = "ASCA_SampleType", colour_by = "SampleType") leduc_minimal <- runTSNE(leduc_minimal, dimred = "ASCA_SampleType") plotTSNE(leduc_minimal, colour_by = "SampleType")
This function is a wrapper function around
QFeatures::aggregateFeatures.
It allows the user to provide multiple assays for which
aggregateFeatures
will be applied sequentially.
aggregateFeaturesOverAssays(object, i, fcol, name, fun, ...)
aggregateFeaturesOverAssays(object, i, fcol, name, fun, ...)
object |
A |
i |
A |
fcol |
The feature variables for each assays |
name |
A |
fun |
A function used for quantitative feature aggregation. |
... |
Additional parameters passed the |
A QFeatures
object
data("scp1") scp1 <- aggregateFeaturesOverAssays(scp1, i = 1:3, fcol = "peptide", name = paste0("peptides", 1:3), fun = colMeans, na.rm = TRUE) scp1
data("scp1") scp1 <- aggregateFeaturesOverAssays(scp1, i = 1:3, fcol = "peptide", name = paste0("peptides", 1:3), fun = colMeans, na.rm = TRUE) scp1
The function computes the ratio of the intensities of sample channels over the intentisty of the carrier channel for each feature. The ratios are averaged within the assay.
computeSCR( object, i, colvar, samplePattern, sampleFUN = "mean", carrierPattern, carrierFUN = sampleFUN, rowDataName = "SCR" )
computeSCR( object, i, colvar, samplePattern, sampleFUN = "mean", carrierPattern, carrierFUN = sampleFUN, rowDataName = "SCR" )
object |
A |
i |
A |
colvar |
A |
samplePattern |
A |
sampleFUN |
A |
carrierPattern |
A |
carrierFUN |
A |
rowDataName |
A |
A QFeatures
object for which the rowData
of the given
assay(s) is augmented with the mean SCR.
data("scp1") scp1 <- computeSCR(scp1, i = 1, colvar = "SampleType", carrierPattern = "Carrier", samplePattern = "Blank|Macrophage|Monocyte", sampleFUN = "mean", rowDataName = "MeanSCR") ## Check results rowData(scp1)[[1]][, "MeanSCR"]
data("scp1") scp1 <- computeSCR(scp1, i = 1, colvar = "SampleType", carrierPattern = "Carrier", samplePattern = "Blank|Macrophage|Monocyte", sampleFUN = "mean", rowDataName = "MeanSCR") ## Check results rowData(scp1)[[1]][, "MeanSCR"]
The cumulative sensitivity curve is used to evaluate if the sample size is sufficient to accurately estimate the total sensitivity. If it is not the case, an asymptotic regression model may provide a prediction of the total sensitivity if more samples would have been acquired.
cumulativeSensitivityCurve( object, i, by = NULL, batch = NULL, nsteps = 30, niters = 10 ) predictSensitivity(df, nSamples)
cumulativeSensitivityCurve( object, i, by = NULL, batch = NULL, nsteps = 30, niters = 10 ) predictSensitivity(df, nSamples)
object |
An object of class QFeatures. |
i |
The index of the assay in |
by |
A vector of length equal to the number of columns in
assay |
batch |
A vector of length equal to the number of columns in
assay |
nsteps |
The number of equally spaced sample sizes to compute the sensitivity. |
niters |
The number of iteration to compute |
df |
The output from |
nSamples |
A |
As more samples are added to a dataset, the total number of distinct features increases. When sufficient number of samples are acquired, all peptides that are identifiable by the technology and increasing the sample size no longer increases the set of identified features. The cumulative sensitivity curve depicts the relationship between sensitivity (number of distinct peptides in the data) and the sample size. More precisely, the curve is built by sampling cells in the data and count the number of distinct features found across the sampled cells. The sampling is repeated multiple times to account for the stochasticity of the approach. Datasets that have a sample size sufficiently large should have a cumulative sensitivity curve with a plateau.
The set of features present in a cell depends on the cell type.
Therefore, we suggest to build the cumulative sensitivity curve
for each cell type separately. This is possible when providing the
by
argument.
For multiplexed experiments, several cells are acquired in a run.
In that case, when a features is identified in a cell, it is
frequently also identified in all other cells of that run, and
this will distort the cumulative sensitivity curve. Therefore, the
function allows to compute the cumulative sensitivity curve at the
batches level rather than at the cell level. This is possible when
providing the batch
argument.
Once the cumulative sensitivity curve is computed, the returned data can be visualized to explore the relationship between the sensitivity and the sample size. If enough samples are acquired, the curve should plateau at high numbers of samples. If it is not the case, the total sensitivity can be predicted using an asymptotic regression curve. To predict the total sensitivity, the model is extrapolated to infinite sample size. Therefore, the accuracy of the extrapolation will highly depend on the available data. The closer the curve is to the plateau, the more accurate the prediction.
A data.frame
with groups as many rows as pairs of cells
and the following column(s):
jaccard
: the computed Jaccard index
by
: if by
is not NULL
, the group of the pair of cells
for which the Jaccard index is computed.
## Simulate data ## 1000 features in 100 cells library(SingleCellExperiment) id <- matrix(FALSE, 1000, 1000) id[sample(1:length(id), 5000)] <- TRUE dimnames(id) <- list( paste0("feat", 1:1000), paste0("cell", 1:1000) ) sce <- SingleCellExperiment(assays = List(id)) sim <- QFeatures(experiments = List(id = sce)) sim$batch <- rep(1:100, each = 10) sim$SampleType <- rep(c("A", "B"), each = 500) sim ## Compute the cumulative sensitivity curve, take batch and sample ## type into account csc <- cumulativeSensitivityCurve( sim, "id", by = sim$SampleType, batch = sim$batch ) predCSC <- predictSensitivity(csc, nSample = 1:50) library(ggplot2) ggplot(csc) + aes(x = SampleSize, y = Sensitivity, colour = by) + geom_point() + geom_line(data = predCSC) ## Extrapolate the total sensitivity predictSensitivity(csc, nSamples = Inf) ## (real total sensitivity = 1000)
## Simulate data ## 1000 features in 100 cells library(SingleCellExperiment) id <- matrix(FALSE, 1000, 1000) id[sample(1:length(id), 5000)] <- TRUE dimnames(id) <- list( paste0("feat", 1:1000), paste0("cell", 1:1000) ) sce <- SingleCellExperiment(assays = List(id)) sim <- QFeatures(experiments = List(id = sce)) sim$batch <- rep(1:100, each = 10) sim$SampleType <- rep(c("A", "B"), each = 500) sim ## Compute the cumulative sensitivity curve, take batch and sample ## type into account csc <- cumulativeSensitivityCurve( sim, "id", by = sim$SampleType, batch = sim$batch ) predCSC <- predictSensitivity(csc, nSample = 1:50) library(ggplot2) ggplot(csc) + aes(x = SampleSize, y = Sensitivity, colour = by) + geom_point() + geom_line(data = predCSC) ## Extrapolate the total sensitivity predictSensitivity(csc, nSamples = Inf) ## (real total sensitivity = 1000)
The function divides the sample columns by a reference column. The sample
and reference columns are defined based on the provided colvar
variable and on regular expression matching.
divideByReference(object, i, colvar, samplePattern = ".", refPattern)
divideByReference(object, i, colvar, samplePattern = ".", refPattern)
object |
A |
i |
A |
colvar |
A |
samplePattern |
A |
refPattern |
A |
The supplied assay(s) are replaced with the values computed after reference division.
A QFeatures
object
data("scp1") scp1 <- divideByReference(scp1, i = 1, colvar = "SampleType", samplePattern = "Macrophage", refPattern = "Ref")
data("scp1") scp1 <- divideByReference(scp1, i = 1, colvar = "SampleType", samplePattern = "Macrophage", refPattern = "Ref")
The function computes the Jaccard index between all pairs of cells.
jaccardIndex(object, i, by = NULL)
jaccardIndex(object, i, by = NULL)
object |
An object of class QFeatures. |
i |
The index of the assay in |
by |
A vector of length equal to the number of columns in
assay |
A data.frame
with as many rows as pairs of cells
and the following column(s):
jaccard
: the computed Jaccard index
by
: if by
is not NULL
, the group of the pair of cells
for which the Jaccard index is computed.
data("scp1") ## Define the identification matrix peps <- scp1[["peptides"]] assay(peps) <- ifelse(is.na(assay(peps)), FALSE, TRUE) scp1 <- addAssay(scp1, peps, "id") ## Compute Jaccard indices jaccardIndex(scp1, "id") ## Compute Jaccard indices by sample type jaccardIndex(scp1, "id", scp1$SampleType)
data("scp1") ## Define the identification matrix peps <- scp1[["peptides"]] assay(peps) <- ifelse(is.na(assay(peps)), FALSE, TRUE) scp1 <- addAssay(scp1, peps, "id") ## Compute Jaccard indices jaccardIndex(scp1, "id") ## Compute Jaccard indices by sample type jaccardIndex(scp1, "id", scp1$SampleType)
A SingleCellExperiment
object that has been minimally processed.
The data set is published by Leduc et al. 2022 (see references)
and retrieved using scpdata::leduc2022_pSCoPE()
. The data
processing was conducted with QFeatures
and scp
. Quality control
was performed, followed by building the peptide data and
log2-transformation. To limit the size of the data, only cells
associated to the 3 first and 3 last MS acquisition runs were
kept. For the same reason, 200 peptides were randomly
sampled. Therefore, the data set consists of 200 peptides and
73 cells. Peptide annotations can be retrieved from the rowData
and cell annotations can be retrieved from the colData
.
data("leduc_minimal")
data("leduc_minimal")
An object of class SingleCellExperiment
with 200 rows and 73 columns.
Any zero value has been replaced by NA.
A peptide was removed from the data set if:
it matched to a decoy or contaminant peptide
it had an parental ion fraction below 60 \
it had a DART-ID adjusted q-value superior to 1\
it had an average sample to carrier ratio above 0.05
A cell was removed from the data set if:
it had a median coefficient of variation superior to 0.6
it had a log2 median intensity outside (6, 8)
it contained less than 750 peptides
PSMs belonging to the same peptide were aggregating using the
median value. Some peptides were mapped to a different protein
depending on the MS acquisition run. To solve this issue, a
majority vote was applied to assign a single protein to each
peptide. Protein IDs were translated into gene symbols using the
ensembldb
package.
Christophe Vanderaa, Laurent Gatto
Leduc, Andrew, R. Gray Huffman, Joshua Cantlon, Saad Khan, and Nikolai Slavov. 2022. “Exploring Functional Protein Covariation across Single Cells Using nPOP.” Genome Biology 23 (1): 261.
The function computes for each cell the median CV and stores them
accordingly in the colData
of the QFeatures
object. The CVs in
each cell are computed from a group of features. The grouping is
defined by a variable in the rowData
. The function can be
applied to one or more assays, as long as the samples (column
names) are not duplicated. Also, the user can supply a minimal
number of observations required to compute a CV to avoid that CVs
computed on too few observations influence the distribution within
a cell. The quantification matrix can be optionally normalized
before computing the CVs. Multiple normalizations are possible.
medianCVperCell( object, i, groupBy, nobs = 5, na.rm = TRUE, colDataName = "MedianCV", norm = "none", ... )
medianCVperCell( object, i, groupBy, nobs = 5, na.rm = TRUE, colDataName = "MedianCV", norm = "none", ... )
object |
A |
i |
A |
groupBy |
A |
nobs |
An |
na.rm |
A |
colDataName |
A |
norm |
A |
... |
Additional arguments that are passed to the normalization method. |
A new column is added to the colData
of the object. The samples
(columns) that are not present in the selection i
will get
assigned an NA.
A QFeatures
object.
Specht, Harrison, Edward Emmott, Aleksandra A. Petelski, R. Gray Huffman, David H. Perlman, Marco Serra, Peter Kharchenko, Antonius Koller, and Nikolai Slavov. 2021. “Single-Cell Proteomic and Transcriptomic Analysis of Macrophage Heterogeneity Using SCoPE2.” Genome Biology 22 (1): 50.
data("scp1") scp1 <- filterFeatures(scp1, ~ !is.na(Proteins)) scp1 <- medianCVperCell(scp1, i = 1:3, groupBy = "Proteins", nobs = 5, na.rm = TRUE, colDataName = "MedianCV", norm = "div.median") ## Check results hist(scp1$MedianCV)
data("scp1") scp1 <- filterFeatures(scp1, ~ !is.na(Proteins)) scp1 <- medianCVperCell(scp1, i = 1:3, groupBy = "Proteins", nobs = 5, na.rm = TRUE, colDataName = "MedianCV", norm = "div.median") ## Check results hist(scp1$MedianCV)
A data.frame
with 1088 observations and 139 variables, as
produced by reading a MaxQuant output file with
read.delim()
.
Sequence: a character vector
Length: a numeric vector
Modifications: a character vector
Modified.sequence: a character vector
Deamidation..N..Probabilities: a character vector
Oxidation..M..Probabilities: a character vector
Deamidation..N..Score.Diffs: a character vector
Oxidation..M..Score.Diffs: a character vector
Acetyl..Protein.N.term.: a numeric vector
Deamidation..N.: a numeric vector
Oxidation..M.: a numeric vector
Missed.cleavages: a numeric vector
Proteins: a character vector
Leading.proteins: a character vector
protein: a character vector
Gene.names: a character vector
Protein.names: a character vector
Type: a character vector
Set: a character vector
MS.MS.m.z: a numeric vector
Charge: a numeric vector
m.z: a numeric vector
Mass: a numeric vector
Resolution: a numeric vector
Uncalibrated...Calibrated.m.z..ppm.: a numeric vector
Uncalibrated...Calibrated.m.z..Da.: a numeric vector
Mass.error..ppm.: a numeric vector
Mass.error..Da.: a numeric vector
Uncalibrated.mass.error..ppm.: a numeric vector
Uncalibrated.mass.error..Da.: a numeric vector
Max.intensity.m.z.0: a numeric vector
Retention.time: a numeric vector
Retention.length: a numeric vector
Calibrated.retention.time: a numeric vector
Calibrated.retention.time.start: a numeric vector
Calibrated.retention.time.finish: a numeric vector
Retention.time.calibration: a numeric vector
Match.time.difference: a logical vector
Match.m.z.difference: a logical vector
Match.q.value: a logical vector
Match.score: a logical vector
Number.of.data.points: a numeric vector
Number.of.scans: a numeric vector
Number.of.isotopic.peaks: a numeric vector
PIF: a numeric vector
Fraction.of.total.spectrum: a numeric vector
Base.peak.fraction: a numeric vector
PEP: a numeric vector
MS.MS.count: a numeric vector
MS.MS.scan.number: a numeric vector
Score: a numeric vector
Delta.score: a numeric vector
Combinatorics: a numeric vector
Intensity: a numeric vector
Reporter.intensity.corrected.0: a numeric vector
Reporter.intensity.corrected.1: a numeric vector
Reporter.intensity.corrected.2: a numeric vector
Reporter.intensity.corrected.3: a numeric vector
Reporter.intensity.corrected.4: a numeric vector
Reporter.intensity.corrected.5: a numeric vector
Reporter.intensity.corrected.6: a numeric vector
Reporter.intensity.corrected.7: a numeric vector
Reporter.intensity.corrected.8: a numeric vector
Reporter.intensity.corrected.9: a numeric vector
Reporter.intensity.corrected.10: a numeric vector
RI1: a numeric vector
RI2: a numeric vector
RI3: a numeric vector
RI4: a numeric vector
RI5: a numeric vector
RI6: a numeric vector
RI7: a numeric vector
RI8: a numeric vector
RI9: a numeric vector
RI10: a numeric vector
RI11: a numeric vector
Reporter.intensity.count.0: a numeric vector
Reporter.intensity.count.1: a numeric vector
Reporter.intensity.count.2: a numeric vector
Reporter.intensity.count.3: a numeric vector
Reporter.intensity.count.4: a numeric vector
Reporter.intensity.count.5: a numeric vector
Reporter.intensity.count.6: a numeric vector
Reporter.intensity.count.7: a numeric vector
Reporter.intensity.count.8: a numeric vector
Reporter.intensity.count.9: a numeric vector
Reporter.intensity.count.10: a numeric vector
Reporter.PIF: a logical vector
Reporter.fraction: a logical vector
Reverse: a character vector
Potential.contaminant: a logical vector
id: a numeric vector
Protein.group.IDs: a character vector
Peptide.ID: a numeric vector
Mod..peptide.ID: a numeric vector
MS.MS.IDs: a character vector
Best.MS.MS: a numeric vector
AIF.MS.MS.IDs: a logical vector
Deamidation..N..site.IDs: a numeric vector
Oxidation..M..site.IDs: a logical vector
remove: a logical vector
dart_PEP: a numeric vector
dart_qval: a numeric vector
razor_protein_fdr: a numeric vector
Deamidation..NQ..Probabilities: a logical vector
Deamidation..NQ..Score.Diffs: a logical vector
Deamidation..NQ.: a logical vector
Reporter.intensity.corrected.11: a logical vector
Reporter.intensity.corrected.12: a logical vector
Reporter.intensity.corrected.13: a logical vector
Reporter.intensity.corrected.14: a logical vector
Reporter.intensity.corrected.15: a logical vector
Reporter.intensity.corrected.16: a logical vector
RI12: a logical vector
RI13: a logical vector
RI14: a logical vector
RI15: a logical vector
RI16: a logical vector
Reporter.intensity.count.11: a logical vector
Reporter.intensity.count.12: a logical vector
Reporter.intensity.count.13: a logical vector
Reporter.intensity.count.14: a logical vector
Reporter.intensity.count.15: a logical vector
Reporter.intensity.count.16: a logical vector
Deamidation..NQ..site.IDs: a logical vector
input_id: a logical vector
rt_minus: a logical vector
rt_plus: a logical vector
mu: a logical vector
muij: a logical vector
sigmaij: a logical vector
pep_new: a logical vector
exp_id: a logical vector
peptide_id: a logical vector
stan_peptide_id: a logical vector
exclude: a logical vector
residual: a logical vector
participated: a logical vector
peptide: a character vector
data("mqScpData")
data("mqScpData")
An object of class data.frame
with 1361 rows and 149 columns.
The dataset is a subset of the SCoPE2 dataset (version 2, Specht
et al. 2019,
BioRXiv). The
input file evidence_unfiltered.csv
was downloaded from a
Google Drive repository.
The MaxQuant evidence file was loaded and the data was cleaned
(renaming columns, removing duplicate fields,...). MS runs that
were selected in the scp1
dataset (see ?scp1
) were kept along
with a blank run. The data is stored as a data.frame
.
readSCP()
for an example on how mqScpData
is
parsed into a QFeatures object.
This function normalises an assay in a QFeatures
according to
the supplied method (see Details). The normalized data is added as
a new assay
normalizeSCP(object, i, name = "normAssay", method, ...)
normalizeSCP(object, i, name = "normAssay", method, ...)
object |
An object of class |
i |
A numeric vector or a character vector giving the index or the name, respectively, of the assay(s) to be processed. |
name |
A |
method |
|
... |
Additional parameters passed to
|
The method
parameter in normalize
can be one of "sum"
,
"max"
, "center.mean"
, "center.median"
, "div.mean"
,
"div.median"
, "diff.meda"
, "quantiles
", "quantiles.robust
"
or "vsn"
. The MsCoreUtils::normalizeMethods()
function returns
a vector of available normalisation methods.
For "sum"
and "max"
, each feature's intensity is divided by
the maximum or the sum of the feature respectively. These two
methods are applied along the features (rows).
"center.mean"
and "center.median"
center the respective
sample (column) intensities by subtracting the respective column
means or medians. "div.mean"
and "div.median"
divide by the
column means or medians. These are equivalent to sweep
ing the
column means (medians) along MARGIN = 2
with FUN = "-"
(for
"center.*"
) or FUN = "/"
(for "div.*"
).
"diff.median"
centers all samples (columns) so that they all
match the grand median by subtracting the respective columns
medians differences to the grand median.
Using "quantiles"
or "quantiles.robust"
applies (robust)
quantile normalisation, as implemented in
preprocessCore::normalize.quantiles()
and
preprocessCore::normalize.quantiles.robust()
. "vsn"
uses the
vsn::vsn2()
function. Note that the latter also glog-transforms
the intensities. See respective manuals for more details and
function arguments.
For further details and examples about normalisation, see
MsCoreUtils::normalize_matrix()
.
A QFeatures
object with an additional assay containing the
normalized data.
QFeatures::normalize for more details about normalize
data("scp1") scp1 normalizeSCP(scp1, i = "proteins", name = "normproteins", method = "center.mean")
data("scp1") scp1 normalizeSCP(scp1, i = "proteins", name = "normproteins", method = "center.mean")
This function computes q-values from the posterior error
probabilities (PEPs). The functions takes the PEPs from the given
assay's rowData
and adds a new variable to it that contains the
computed q-values.
pep2qvalue(object, i, groupBy, PEP, rowDataName = "qvalue")
pep2qvalue(object, i, groupBy, PEP, rowDataName = "qvalue")
object |
A |
i |
A |
groupBy |
A |
PEP |
A |
rowDataName |
A |
The q-value of a feature (PSM, peptide, protein) is the minimum FDR at which that feature will be selected upon filtering (Savitski et al.). On the other hand, the feature PEP is the probability that the feature is wrongly matched and hence can be seen as a local FDR (Kall et al.). While filtering on PEP is guaranteed to control for FDR, it is usually too conservative. Therefore, we provide this function to convert PEP to q-values.
We compute the q-value of a feature as the average of the PEPs associated to PSMs that have equal or greater identification confidence (so smaller PEP). See Kall et al. for a visual interpretation.
We also allow inference of q-values at higher level, for instance
computing the protein q-values from PSM PEP. This can be performed
by supplying the groupBy
argument. In this case, we adopt the
best feature strategy that will take the best (smallest) PEP for
each group (Savitski et al.).
A QFeatures
object.
Käll, Lukas, John D. Storey, Michael J. MacCoss, and William Stafford Noble. 2008. “Posterior Error Probabilities and False Discovery Rates: Two Sides of the Same Coin.” Journal of Proteome Research 7 (1): 40–44.
Savitski, Mikhail M., Mathias Wilhelm, Hannes Hahne, Bernhard Kuster, and Marcus Bantscheff. 2015. “A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets.” Molecular & Cellular Proteomics: MCP 14 (9): 2394–2404.
data("scp1") scp1 <- pep2qvalue(scp1, i = 1, groupBy = "protein", PEP = "dart_PEP", rowDataName = "qvalue_protein") ## Check results rowData(scp1)[[1]][, c("dart_PEP", "qvalue_protein")]
data("scp1") scp1 <- pep2qvalue(scp1, i = 1, groupBy = "protein", PEP = "dart_PEP", rowDataName = "qvalue_protein") ## Check results rowData(scp1)[[1]][, c("dart_PEP", "qvalue_protein")]
Function to import and convert tabular data from a spreadsheet or
a data.frame
into a SingleCellExperiment
and QFeatures
object.
readSCP(...) readSCPfromDIANN(...) readSingleCellExperiment(...)
readSCP(...) readSCPfromDIANN(...) readSingleCellExperiment(...)
... |
Parameters passed to |
An instance of class SingleCellExperiment
or a
QFeatures
, composed of SingleCellExperiment
objects.
The SingleCellExperiment
class is built on top of the
RangedSummarizedExperiment
class. This means that some column names
are forbidden in the rowData
. Avoid using the following names:
seqnames
, ranges
, strand
, start
, end
,
width
, element
The more general QFeatures::readQFeatures()
function, which
this function depends on.
The more general QFeatures::readQFeaturesFromDIANN()
function,
for details and an example on how to read label-free and plexDIA
(mTRAQ) data processed with DIA-NN.
The QFeatures::readSummarizedExperiment()
function, which
readSingleCellExperiment()
depends on.
The SingleCellExperiment::SingleCellExperiment()
class.
###################################################### ## Load a single acquisition as a SingleCellExperiment ## Load a data.frame with PSM-level data data("mqScpData") ## Create the QFeatures object sce <- readSingleCellExperiment(mqScpData, quantCols = grep("RI", colnames(mqScpData))) sce ###################################################### ## Load multiple acquisitions as a QFeatures ## Load an example table containing MaxQuant output data("mqScpData") ## Load the (user-generated) annotation table data("sampleAnnotation") ## Format the tables into a QFeatures object readSCP(assayData = mqScpData, colData = sampleAnnotation, runCol = "Raw.file")
###################################################### ## Load a single acquisition as a SingleCellExperiment ## Load a data.frame with PSM-level data data("mqScpData") ## Create the QFeatures object sce <- readSingleCellExperiment(mqScpData, quantCols = grep("RI", colnames(mqScpData))) sce ###################################################### ## Load multiple acquisitions as a QFeatures ## Load an example table containing MaxQuant output data("mqScpData") ## Load the (user-generated) annotation table data("sampleAnnotation") ## Format the tables into a QFeatures object readSCP(assayData = mqScpData, colData = sampleAnnotation, runCol = "Raw.file")
The function computes four metrics to report missing values in single-cell proteomics.
reportMissingValues(object, i, by = NULL)
reportMissingValues(object, i, by = NULL)
object |
An object of class QFeatures. |
i |
The index of the assay in |
by |
A vector of length equal to the number of columns in
assay |
A data.frame
with groups as rows and 5 columns:
LocalSensitivityMean
: the average number of features per cell.
LocalSensitivitySd
: the standard deviation of the local
sensitivity.
TotalSensitivity
: the total number of features found in the
dataset.
Completeness
: the proportion of values that are not missing in
the data.
NumberCells
: the number of cells in the dataset.
data("scp1") ## Define the identification matrix peps <- scp1[["peptides"]] assay(peps) <- !is.na(assay(peps)) scp1 <- addAssay(scp1, peps, "id") ## Report metrics reportMissingValues(scp1, "id") ## Report metrics by sample type reportMissingValues(scp1, "id", scp1$SampleType) data
data("scp1") ## Define the identification matrix peps <- scp1[["peptides"]] assay(peps) <- !is.na(assay(peps)) scp1 <- addAssay(scp1, peps, "id") ## Report metrics reportMissingValues(scp1, "id") ## Report metrics by sample type reportMissingValues(scp1, "id", scp1$SampleType) data
A data frame with 48 observations on the following 6 variables.
Set: a character vector
Channel: a character vector
SampleType: a character vector
lcbatch: a character vector
sortday: a character vector
digest: a character vector
data("sampleAnnotation")
data("sampleAnnotation")
An object of class data.frame
with 64 rows and 6 columns.
##' The dataset is a subset of the SCoPE2 dataset (version 2, Specht
et al. 2019,
BioRXiv). The
input files batch.csv
and annotation.csv
were downloaded from a
Google Drive repository.
The two files were loaded and the columns names were adapted for
consistency with mqScpData
table (see ?mqScpData
). The two
tables were filtered to contain only sets present in “mqScpData. The tables were then merged based on the run ID, hence merging the sample annotation and the batch annotation. Finally, annotation for the blank run was added manually. The data is stored as a
data.frame'.
readSCP()
to see how this file is used.
A small QFeatures object with SCoPE2 data. The object is composed of 5 assays, including 3 PSM-level assays, 1 peptide assay and 1 protein assay.
data("scp1")
data("scp1")
An object of class QFeatures
of length 5.
The dataset is a subset of the SCoPE2 dataset (version 2, Specht
et al. 2019,
BioRXiv).
This dataset was converted to a QFeatures
object where each
assay in stored as a SingleCellExperiment
object. One assay
per chromatographic batch ("LCA9", "LCA10", "LCB3") was randomly
sampled. For each assay, 100 proteins were randomly sampled. PSMs
were then aggregated to peptides and joined in a single assay.
Then peptides were aggregated to proteins.
data("scp1") scp1
data("scp1") scp1
The function takes as input a list of DFrame
and a table with
additional annotations. The annotation tables is automatically
merged into all tables of the list by matching the specified
columns (given by the arguments by
and by2
). This function is
useful to add annotation to analysis results generated by
scpVarianceAnalysis()
, scpDifferentialAnalysis()
, or
scpComponentAnalysis()
. The annotation table is typically the
colData
or rowData
of the object used for modelling. In case
of shared column names between the input tables and the annotation
table, any annotation that is already present in the list of
tables will be overwritten by the new annotations.
scpAnnotateResults(tableList, annotations, by, by2 = NULL)
scpAnnotateResults(tableList, annotations, by, by2 = NULL)
tableList |
A list of tables, typically the output of
|
annotations |
A table of class 'data.frame' or 'DFrame' containing the annotations to add. If no further arguments are provided, the table must have row names. |
by |
A |
by2 |
A |
Christophe Vanderaa, Laurent Gatto
data("leduc_minimal") var <- scpVarianceAnalysis(leduc_minimal) colnames(var$Residuals) ## Add peptide annotations available from the rowData var <- scpAnnotateResults( var, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) colnames(var$Residuals)
data("leduc_minimal") var <- scpVarianceAnalysis(leduc_minimal) colnames(var$Residuals) ## Add peptide annotations available from the rowData var <- scpAnnotateResults( var, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) colnames(var$Residuals)
An ScpModel
object must be always stored in the metadata()
of
an object that inherits from the SingleCellExperiment
class. The
ScpModel
object should never be accessed directly by the
user. Instead, we provide several setter function to retrieve
information that may be useful to the user.The ScpModel
class
contains several slots:
scpModelFormula
: a formula
object controlling which
variables are to be modelled.
scpModelInputIndex
: a numeric(1)
, selecting the assay to use
in the SingleCellExperiment
object as input matrix. Note that
this slot serves as a pointer, meaning that the quantitative
data is not duplicated. Any change to the assay in the
SingleCellExperiment
will impact the estimation of the
ScpModel
object.
scpModelFilterThreshold
: A numeric(1)
indicating the minimal
n/p ratio required for a feature to be included in further model
exploration. n/p is the number of measured values for a features
divided by the number of coefficients to estimate. n/p cannot be
smaller than 1 because this would lead to over-specified models.
scpModelFitList
: A List
that contains the model results for
each feature. Each element is a ScpModelFit
object (see
ScpModelFit
)
scpModelFormula(object, name) scpModelInput(object, name, filtered = TRUE) scpModelFilterThreshold(object, name) scpModelFilterNPRatio(object, name, filtered = TRUE) scpModelResiduals(object, name, join = TRUE, filtered = TRUE) scpModelEffects(object, name, join = TRUE, filtered = TRUE) scpModelNames(object) scpModelFilterThreshold(object, name) <- value
scpModelFormula(object, name) scpModelInput(object, name, filtered = TRUE) scpModelFilterThreshold(object, name) scpModelFilterNPRatio(object, name, filtered = TRUE) scpModelResiduals(object, name, join = TRUE, filtered = TRUE) scpModelEffects(object, name, join = TRUE, filtered = TRUE) scpModelNames(object) scpModelFilterThreshold(object, name) <- value
object |
An object that inherits from the
|
name |
A |
filtered |
A |
join |
A |
value |
An |
Each slot has a getter function associated:
scpModelNames()
: returns a vector of names of ScpModel
objects stored in the SingleCellExperiment
object.
scpModelFormula()
: returns the formula
slot of the ScpModel
within an object that inherits from the SummarizedExperiment
class.
scpModelFilterThreshold()
: returns the n/p ration threshold
used for feature filtering.
scpModelInput()
: returns a matrix
with the quantitative
values used as input of the model. Hence, the matrix contains
the data before modelling. If filtered = TRUE
, the feature of
the matrix are restricted to the features that satisfy the n/p
ratio threshold.
scpModelFilterNPRatio()
: returns the computed n/p ratio for
each feature. If filtered = TRUE
, the function returns only
the n/p of the features that satisfy the n/p ratio threshold.
scpModelResiduals()
: when join = FALSE
, the function returns
a list where each element corresponds to a feature and contains
the estimated residuals. When join = TRUE
(default), the function
combines the list into a matrix with features in rows and cells
in columns, and filling the gaps with NA
. If filtered = TRUE
,
the feature of the matrix are restricted to the features that
satisfy the n/p ratio threshold.
scpModelEffects()
: when join = FALSE
, the function return a
list where each element of the list corresponds to a feature.
Each element contains another list with as many elements as
variable in the model and each element contains the data effect vector
for that vector. When join = TRUE
(default), each element of the list is
a matrix with features in rows and cells in columns where gaps
are filled with NA
. If filtered = TRUE
, the feature of the
matrix are restricted to the features that satisfy the n/p
ratio threshold.
Setter:
scpModelFilterThreshold<-()
: the function changes the n/p
ratio threshold used for filtering features.
Christophe Vanderaa, Laurent Gatto
ScpModelFit for a description of the class that store modelling results
ScpModel-Workflow that uses the class to store the estimated model.
data("leduc_minimal") ####---- Getters ----#### scpModelNames(leduc_minimal) scpModelFormula(leduc_minimal) dim(leduc_minimal) dim(scpModelInput(leduc_minimal)) dim(scpModelInput(leduc_minimal, filtered = FALSE)) head(scpModelFilterNPRatio(leduc_minimal)) dim(scpModelResiduals(leduc_minimal)) dim(scpModelResiduals(leduc_minimal, filtered = FALSE)) scpModelResiduals(leduc_minimal, join = FALSE) scpModelEffects(leduc_minimal) dim(scpModelEffects(leduc_minimal)$Set) dim(scpModelEffects(leduc_minimal, filtered = FALSE)$Set) scpModelEffects(leduc_minimal, join = FALSE)[[1]] scpModelFilterThreshold(leduc_minimal) ####---- Setter ----#### scpModelFilterThreshold(leduc_minimal) <- 2 scpModelFilterThreshold(leduc_minimal)
data("leduc_minimal") ####---- Getters ----#### scpModelNames(leduc_minimal) scpModelFormula(leduc_minimal) dim(leduc_minimal) dim(scpModelInput(leduc_minimal)) dim(scpModelInput(leduc_minimal, filtered = FALSE)) head(scpModelFilterNPRatio(leduc_minimal)) dim(scpModelResiduals(leduc_minimal)) dim(scpModelResiduals(leduc_minimal, filtered = FALSE)) scpModelResiduals(leduc_minimal, join = FALSE) scpModelEffects(leduc_minimal) dim(scpModelEffects(leduc_minimal)$Set) dim(scpModelEffects(leduc_minimal, filtered = FALSE)$Set) scpModelEffects(leduc_minimal, join = FALSE)[[1]] scpModelFilterThreshold(leduc_minimal) ####---- Setter ----#### scpModelFilterThreshold(leduc_minimal) <- 2 scpModelFilterThreshold(leduc_minimal)
The function uses the data modelling output to generate corrected
data that can be used for downstream analysis. The input
is expected to be a SingleCellExperiment
object that contains an
estimated ScpModel
. There are two approaches:
scpKeepEffect()
: keep the effects of interests. The
reconstructed data is the sum of the effect matrices for the
variable of interest and the residuals. Note that the intercepts
(baseline intensity of each feature) are not included by
default, but they can be added when intercept = TRUE
.
scpRemoveBatchEffect()
: remove any undesired effect. The batch
corrected data is the input data minus the effect matrices that
correspond to batch effect variables. Note that the intercepts
(baseline intensity of each feature) are removed by default, but
they can be kept when intercept = FALSE
.
Despite the two approaches are conceptually different, they can lead to similar results if the effects that are used to reconstruct the data are the ones that are not removed when performing batch correction (see examples).
The function returns a new SingleCellExperiment
that contains an
assay with the batch corrected data. Note that the 'ScpModel' is
erased in this new object.
scpKeepEffect(object, effects = NULL, intercept = FALSE, name) scpRemoveBatchEffect(object, effects = NULL, intercept = TRUE, name)
scpKeepEffect(object, effects = NULL, intercept = FALSE, name) scpRemoveBatchEffect(object, effects = NULL, intercept = TRUE, name)
object |
An object that inherits from the
|
effects |
A |
intercept |
A |
name |
A |
Christophe Vanderaa, Laurent Gatto
ScpModel for functions to extract information from the
ScpModel
object
ScpModel-Workflow to run a model on SCP data required for batch correction.
data("leduc_minimal") scpModelFormula(leduc_minimal) reconstructed <- scpKeepEffect(leduc_minimal, effects = "SampleType") batchCorreced <- scpRemoveBatchEffect( leduc_minimal, effects = c("Channel", "Set", "MedianIntensity") ) ## The two approaches are identical identical(reconstructed, batchCorreced)
data("leduc_minimal") scpModelFormula(leduc_minimal) reconstructed <- scpKeepEffect(leduc_minimal, effects = "SampleType") batchCorreced <- scpRemoveBatchEffect( leduc_minimal, effects = c("Channel", "Set", "MedianIntensity") ) ## The two approaches are identical identical(reconstructed, batchCorreced)
Differential abundance analysis assess the statistical significance of the differences observed between group of samples of interest.
scpDifferentialAnalysis(object, coefficients = NULL, contrasts = NULL, name) scpDifferentialAggregate(differentialList, fcol, ...) scpVolcanoPlot( differentialList, fdrLine = 0.05, top = 10, by = "padj", decreasing = FALSE, textBy = "feature", pointParams = list(), labelParams = list() )
scpDifferentialAnalysis(object, coefficients = NULL, contrasts = NULL, name) scpDifferentialAggregate(differentialList, fcol, ...) scpVolcanoPlot( differentialList, fdrLine = 0.05, top = 10, by = "padj", decreasing = FALSE, textBy = "feature", pointParams = list(), labelParams = list() )
object |
An object that inherits from the
|
coefficients |
A |
contrasts |
A |
name |
A |
differentialList |
A list of tables returned by
|
fcol |
A |
... |
Further arguments passed to
|
fdrLine |
A |
top |
A |
by |
A |
decreasing |
A |
textBy |
A |
pointParams |
A |
labelParams |
A |
scpDifferentialAnalysis()
performs statistical inference by
means of a t-test on the estimatated parameters. There are 2 use
cases:
Statistical inference for differences between 2 groups
You can contrast 2 groups of interest through the contrasts
argument. Multiple contrasts, that is multiple pairwise group
comparisons, can be performed. Therefore, contrasts
must be
provided as a list where each element describes the comparison to
perform as a three-element character vector (see examples). The
first element is the name of the annotation variable that contains
the two groups to compare. This variable must be categorical.
The second element is the name of the reference group. The third
element is the name of the other group to compare against the
reference.
Statistical inference for numerical variables
Numerical variables can be tested by providing the coefficient
argument, that is the name of the numerical annotation variable.
The statistical tests in both use cases are conducted for each
feature independently. The p-values are adjusted using
IHW::ihw()
, where each test is weighted using the feature
intercept (that is the average feature intensity). The function
returns a list of DataFrame
s with one table for each test
contrast and/or coefficient. It provides the adjusted p-values and
the estimates. For contrast, the estimates represent the estimated
log fold changes between the groups. For coefficients, the
estimates are the estimated slopes. Results are only provided for
features for which contrasts or coefficients are estimable, that
are features for which there is sufficient observations for
inference.
scpDifferentialAggregate()
combines the differential abundance
analysis results for groups of features. This is useful, for
example, to return protein-level results when data is modelled at
the peptide level. The function heavily relies on the approaches
implemented in metapod::combineGroupedPValues()
. The p-values
are combined into a single value using one of the following
methods: Simes' method
(default), Fisher's method, Berger's method, Pearson's method,
minimum Holm's approach, Stouffer's Z-score method, and
Wilkinson's method. We refer to the metapod
documentation for
more details on the assumptions underlying each approach. The
estimates are combined using the representative estimate, as
defined by metapod
. Which estimate is representative depends on
the selected combination method. The function takes the list of
tables generated by scpDifferentialAnalysis()
and returns a new
list of DataFrame
s with aggregated results. Note that we cannot
meaningfully aggregate degrees of freedom. Those are hence removed
from the aggregated result tables.
scpAnnotateResults()
adds annotations to the differential abundance
analysis results. The annotations are added to all elements of the
list returned by ()
. See the associated
man page for more information.
scpVolcanoPlot()
takes the list of tables generated by
scpDifferentialAnalysis()
and returns a ggplot2
scatter plot.
The plots show the adjusted p-values with respect to the estimate.
A horizontal bar also highlights the significance threshold
(defaults to 5%, fdrLine
). The top (default 10) features with lowest
p-values are labeled on the plot. You can control which features
are labelled using the top
, by
and decreasing
arguments.
Finally, you can change the point and label aesthetics thanks to
the pointParams
and the labelParams
arguments, respectively.
Christophe Vanderaa, Laurent Gatto
ScpModel-Workflow to run a model on SCP data upstream of differential abundance analysis.
scpAnnotateResults()
to annotate analysis of variance results.
library("patchwork") library("ggplot2") data("leduc_minimal") ## Add n/p ratio information in rowData rowData(leduc_minimal)$npRatio <- scpModelFilterNPRatio(leduc_minimal, filtered = FALSE) ####---- Run differential abundance analysis ----#### (res <- scpDifferentialAnalysis( leduc_minimal, coefficients = "MedianIntensity", contrasts = list(c("SampleType", "Melanoma", "Monocyte")) )) ## IHW return a message because of the example data set has only few ## peptides, real dataset should not have that problem. ####---- Annotate results ----#### ## Add peptide annotations available from the rowData res <- scpAnnotateResults( res, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### scpVolcanoPlot(res, textBy = "gene") |> wrap_plots(guides = "collect") ## Modify point and label aesthetics scpVolcanoPlot( res, textBy = "gene", top = 20, pointParams = list(aes(colour = npRatio), alpha = 0.5), labelParams = list(size = 2, max.overlaps = 20)) |> wrap_plots(guides = "collect") ####---- Aggregate results ----#### ## Aggregate to protein-level results byProteinDA <- scpDifferentialAggregate( res, fcol = "Leading.razor.protein.id" ) scpVolcanoPlot(byProteinDA) |> wrap_plots(guides = "collect")
library("patchwork") library("ggplot2") data("leduc_minimal") ## Add n/p ratio information in rowData rowData(leduc_minimal)$npRatio <- scpModelFilterNPRatio(leduc_minimal, filtered = FALSE) ####---- Run differential abundance analysis ----#### (res <- scpDifferentialAnalysis( leduc_minimal, coefficients = "MedianIntensity", contrasts = list(c("SampleType", "Melanoma", "Monocyte")) )) ## IHW return a message because of the example data set has only few ## peptides, real dataset should not have that problem. ####---- Annotate results ----#### ## Add peptide annotations available from the rowData res <- scpAnnotateResults( res, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### scpVolcanoPlot(res, textBy = "gene") |> wrap_plots(guides = "collect") ## Modify point and label aesthetics scpVolcanoPlot( res, textBy = "gene", top = 20, pointParams = list(aes(colour = npRatio), alpha = 0.5), labelParams = list(size = 2, max.overlaps = 20)) |> wrap_plots(guides = "collect") ####---- Aggregate results ----#### ## Aggregate to protein-level results byProteinDA <- scpDifferentialAggregate( res, fcol = "Leading.razor.protein.id" ) scpVolcanoPlot(byProteinDA) |> wrap_plots(guides = "collect")
Analysis of variance investigates the contribution of each effects in capturing the variance in the data.
scpVarianceAnalysis(object, name) scpVarianceAggregate(varianceList, fcol) scpVariancePlot( varianceList, effect = "Residuals", by = "percentExplainedVar", top = Inf, decreasing = TRUE, combined = TRUE, fcol = NULL, colourSeed = 1234 )
scpVarianceAnalysis(object, name) scpVarianceAggregate(varianceList, fcol) scpVariancePlot( varianceList, effect = "Residuals", by = "percentExplainedVar", top = Inf, decreasing = TRUE, combined = TRUE, fcol = NULL, colourSeed = 1234 )
object |
An object that inherits from the
|
name |
A |
varianceList |
A list of tables returned by
|
fcol |
A |
effect |
A |
by |
A |
top |
A |
decreasing |
A |
combined |
A |
colourSeed |
A |
scpVarianceAnalysis()
computes the amount of data (measured as
the sums of squares) that is captured by each model variable, but
also that is not modelled and hence captured in the residuals. The
proportion of variance explained by each effect is the sums of
squares for that effect divided by the sum of all sums of squares
for each effect and residuals. This is computed for each feature
separately. The function returns a list of DataFrame
s with one
table for each effect.
scpVarianceAggregate()
combines the analysis of variance results
for groups of features. This is useful, for example, to
return protein-level results when data is modelled at the peptide
level. The function takes the list of tables generated by
scpVarianceAnalysis()
and returns a new list of DataFrame
s
with aggregated results.
scpAnnotateResults()
adds annotations to the component
analysis results. The annotations are added to all elements of the
list returned by scpComponentAnalysis()
. See the associated man
page for more information.
scpVariancePlot()
takes the list of tables generated by
scpVarianceAnalysis()
and returns a ggplot2
bar plot. The
bar plot shows the proportion of explained variance by each effect
and the residual variance. By default, the function will combine
the results over all features, showing the effect's contributions
on the complete data set. When combine = FALSE
, the results
are shown for individual features, with additional arguments to
control how many and which features are shown. Bars can also be
grouped by fcol
. This is particularly useful when exploring
peptide level results, but grouping peptides that belong to the
same protein (note that you should not use scpVarianceAggregate()
in that case).
Christophe Vanderaa, Laurent Gatto
ScpModel-Workflow to run a model on SCP data upstream of analysis of variance.
scpAnnotateResults()
to annotate analysis of variance results.
data("leduc_minimal") ####---- Run analysis of variance ----#### (var <- scpVarianceAnalysis(leduc_minimal)) ####---- Annotate results ----#### ## Add peptide annotations available from the rowData var <- scpAnnotateResults( var, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### ## Plot the analysis of variance through the whole data scpVariancePlot(var) ## Plot the analysis of variance for the top 20 peptides with highest ## percentage of variance explained by the cell type scpVariancePlot( var, effect = "SampleType", top = 20, combined = FALSE ) ## Same but grouped by protein scpVariancePlot( var, effect = "SampleType", top = 20, combined = FALSE, fcol = "gene" ) ####---- Aggregate results ----#### ## Aggregate to protein-level results varProtein <- scpVarianceAggregate(var, fcol = "gene") scpVariancePlot( varProtein, effect = "SampleType", top = 20, combined = FALSE )
data("leduc_minimal") ####---- Run analysis of variance ----#### (var <- scpVarianceAnalysis(leduc_minimal)) ####---- Annotate results ----#### ## Add peptide annotations available from the rowData var <- scpAnnotateResults( var, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### ## Plot the analysis of variance through the whole data scpVariancePlot(var) ## Plot the analysis of variance for the top 20 peptides with highest ## percentage of variance explained by the cell type scpVariancePlot( var, effect = "SampleType", top = 20, combined = FALSE ) ## Same but grouped by protein scpVariancePlot( var, effect = "SampleType", top = 20, combined = FALSE, fcol = "gene" ) ####---- Aggregate results ----#### ## Aggregate to protein-level results varProtein <- scpVarianceAggregate(var, fcol = "gene") scpVariancePlot( varProtein, effect = "SampleType", top = 20, combined = FALSE )
Function to estimate a linear model for each feature (peptide or protein) of a single-cell proteomics data set.
scpModelWorkflow(object, formula, i = 1, name = "model", verbose = TRUE) scpModelFilterPlot(object, name)
scpModelWorkflow(object, formula, i = 1, name = "model", verbose = TRUE) scpModelFilterPlot(object, name)
object |
An object that inherits from the
|
formula |
A |
i |
A |
name |
A |
verbose |
A |
The main input is object
that inherits from the
SingleCellExperiment
class. The quantitative data will be
retrieve using assay(object)
. If object
contains multiple
assays, you can specify which assay to take as input thanks to the
argument i
, the function will then assume assay(object, i)
as
quantification input .
The objective of modelling single-cell proteomics data is to
estimate, for each feature (peptide or protein), the effect of
known cell annotations on the measured intensities. These annotations
may contain biological information such as the cell line,
FACS-derived cell type, treatment, etc. We also highly recommend
including technical information, such as the MS acquisition run
information or the chemical label (in case of multiplexed
experiments). These annotation must be available from
colData(object)
. formula
specifies which annotations to use
during modelling.
The modelling worflow starts with generating a model matrix for
each feature given the colData(object)
and formula
. The model
matrix for peptide , denoted
, is adapted to the
pattern of missing values (see section below). Then, the functions
fits the model matrix against the quantitative data. In other
words, the function determines for each feature
(row in
the input data) the contribution of each variable in the model.
More formally, the general model definition is:
where is the feature by cell quantification matrix,
contains the estimated coefficients for feature
with as many coefficients as variables to estimate,
is the model matrix generated for feature
,
and
is the feature by cell matrix with
residuals.
The coefficients are estimated using penalized least squares
regression. Next, the function computes the residual matrix and
the effect matrices. An effect matrix contains the data that is
captured by a given cell annotation. Formally, for each feature
:
where is a cell by feature matrix containing the
variables associated to annotation
,
are the estimated coefficients associated to annotation
and estimated for feature
, and
is the
model matrix for peptide
containing only the variables to
annotation
.
All the results are stored in an ScpModel object which is stored
in the object
's metadata. Note that multiple models can be
estimated for the same object
. In that case, provide the name
argument to store the results in a separate ScpModel
.
The proportion of missing values for each features is high in
single-cell proteomics data. Many features can typically contain
more coefficients to estimate than observed values. These features
cannot be estimated and will be ignored during further steps.
These features are identified by computing the ratio between the
number of observed values and the number of coefficients to
estimate. We call it the n/p ratio. Once the model is
estimated, use scpModelFilterPlot(object)
to explore the
distribution of n/p ratios across the features. You can also
extract the n/p ratio for each feature using
scpModelFilterNPRatio(object)
. By default, any feature that has
an n/p ratio lower than 1 is ignored. However, feature with an
n/p ratio close to 1 may lead to unreliable outcome because there
are not enough observed data. You could consider the n/p ratio as
the average number of replicate per coefficient to estimate.
Therefore, you may want to increase the n/p threshold. You can do
so using scpModelFilter(object) <- npThreshold
.
The data modelling workflow is designed to take the presence of missing values into account. We highly recommend to not impute the data before modelling. Instead, the modelling approach will ignore missing values and will generate a model matrix using only the observed values for each feature. However, the model matrices for some features may contain highly correlated variables, leading to near singular designs. We include a small ridge penalty to reduce numerical instability associated to correlated variables.
Christophe Vanderaa, Laurent Gatto
ScpModel for functions to extract information from the
ScpModel
object
ScpModel-VarianceAnalysis, ScpModel-DifferentialAnalysis, ScpModel-ComponentAnalysis to explore the model results
scpKeepEffect and scpRemoveBatchEffect to perform batch correction for downstream analyses.
data("leduc_minimal") leduc_minimal ## Overview of available cell annotations colData(leduc_minimal) ####---- Model data ----#### f <- ~ 1 + ## intercept Channel + Set + ## batch variables MedianIntensity +## normalization SampleType ## biological variable leduc_minimal <- scpModelWorkflow(leduc_minimal, formula = f) ####---- n/p feature filtering ----#### ## Get n/p ratios head(scpModelFilterNPRatio(leduc_minimal)) ## Plot n/p ratios scpModelFilterPlot(leduc_minimal) ## Change n/p ratio threshold scpModelFilterThreshold(leduc_minimal) <- 2 scpModelFilterPlot(leduc_minimal)
data("leduc_minimal") leduc_minimal ## Overview of available cell annotations colData(leduc_minimal) ####---- Model data ----#### f <- ~ 1 + ## intercept Channel + Set + ## batch variables MedianIntensity +## normalization SampleType ## biological variable leduc_minimal <- scpModelWorkflow(leduc_minimal, formula = f) ####---- n/p feature filtering ----#### ## Get n/p ratios head(scpModelFilterNPRatio(leduc_minimal)) ## Plot n/p ratios scpModelFilterPlot(leduc_minimal) ## Change n/p ratio threshold scpModelFilterThreshold(leduc_minimal) <- 2 scpModelFilterPlot(leduc_minimal)
Component analysis is a powerful tool for exploring data. The package implements the ANOVA-principal component analysis extended to linear models (APCA+) and derivatives (suggested by Thiel at al. 2017). This framework is based on principal component analysis (PCA) and allows exploring the data captured by each model variable individually.
scpModelComponentMethods scpComponentAnalysis( object, method = NULL, effects = NULL, pcaFUN = "auto", residuals = TRUE, unmodelled = TRUE, name, ... ) scpComponentAggregate(componentList, fcol, fun = colMedians, ...) scpComponentPlot( componentList, comp = 1:2, pointParams = list(), maxLevels = NULL ) scpComponentBiplot( scoreList, eigenvectorList, comp = 1:2, pointParams = list(), arrowParams = list(arrow = arrow(length = unit(0.2, "cm"))), labelParams = list(size = 2, max.overlaps = 10), textBy = "feature", top = 10, maxLevels = NULL )
scpModelComponentMethods scpComponentAnalysis( object, method = NULL, effects = NULL, pcaFUN = "auto", residuals = TRUE, unmodelled = TRUE, name, ... ) scpComponentAggregate(componentList, fcol, fun = colMedians, ...) scpComponentPlot( componentList, comp = 1:2, pointParams = list(), maxLevels = NULL ) scpComponentBiplot( scoreList, eigenvectorList, comp = 1:2, pointParams = list(), arrowParams = list(arrow = arrow(length = unit(0.2, "cm"))), labelParams = list(size = 2, max.overlaps = 10), textBy = "feature", top = 10, maxLevels = NULL )
object |
An object that inherits from the
|
method |
A |
effects |
A |
pcaFUN |
A |
residuals |
A |
unmodelled |
A |
name |
A |
... |
For |
componentList |
A list of components analysis results. This
is typically the |
fcol |
A |
fun |
A |
comp |
An |
pointParams |
A |
maxLevels |
An |
scoreList |
A list of components analysis results. This
is typically the |
eigenvectorList |
A list of components analysis results. This
is typically the |
arrowParams |
A |
labelParams |
A |
textBy |
A |
top |
An |
An object of class character
of length 3.
Given a m x n matrix, PCA can be summarized as the
following decomposition:
Where is a m x k orthogonal matrix, that is
,
with k the number of components.
is called the matrix of
eigenvectors.
is the k x k diagonal matrix of eigenvalues
that contains the variance associated to each component, ordered
from highest to lowest variance. The unscaled PC scores are given
by
.
There are 2 available algorithm to perform PCA:
nipals
: The non-linear iterative partial least squares
(NIPALS) algorithm can handle missing values and
approximates classical PCA, although it does not explicitly
maximize the variance. This is implemented in nipals::nipals()
.
svd
: The singular value decomposition (SVD) is used to perform
an exact PCA, but it cannot handle missing values. This is
implemented in base::svd()
.
Which algorithm to use is controlled by the pcaFUN
argument, by
default ("auto"
), the function automatically uses svd
when
there is no missing values and nipals
when there is at least
one missing value.
scpComponentAnalysis()
performs a PCA on the modelling output.
What modelling output the function will use depends on the
method
. The are 3 PCA approaches:
ASCA
performs a PCA on the effect matrix, that is
where
is one of the effects in the
model. This PCA is useful to explore the modelled effects and
the relationship between different levels of a factor.
ASCA.E
: perform PCA on the effect matrix, just like ASCA. The
scores are then updated by projecting the effect matrix added to
the residuals using the eigenvectors, that is
. This PCA is useful
to explore the modelled effects while blurring these effects
with the unmodelled variability. Note however that for this
approach, the scores are no longer guaranteed to be orthogonal
and the eigenvalues are no longer meaningful. The percentage of
variation should not be interpreted.
APCA
(default) performs PCA on the effect matrix plus the
residuals, that is . This PCA
is useful to explore the modelled effects in relation with the
unmodelled variability that is remaining in the residuals.
Available methods are listed in scpModelComponentMethods
.
Note that for all three methods, a PCA on the residual matrix is
also performed when residuals = TRUE
, that is
. A PCA on the residuals is
useful to explore residual effects that are not captured by any
effect in the model. Similarly, a PCA on the input data matrix,
that is on the data before modelling is also performed when
unmodelled = TRUE
, that is .
scpComponentAnalysis()
always returns a list with 2 elements.
The first element, bySample
is a list where each element
contains the PC scores for the desired model variable(s). The
second element, byFeature
is a list where each element
contains the eigenvectors for the desired model variable(s).
scpAnnotateResults()
adds annotations to the component
analysis results. The annotations are added to all elements of the
list returned by scpComponentAnalysis()
. See the associated man
page for more information.
scpComponentPlot()
takes one of the two elements of the list
generated by scpComponentAnalysis()
and returns a list of
ggplot2
scatter plots. Commonly, the first two components,
that bear most of the variance, are explored for visualization,
but other components can be explored as well thanks to the comp
argument. Each point represents either a sample or a feature,
depending on the provided component analysis results
(see examples). Change the point aesthetics by providing ggplot
arguments in a list (see examples).
scpComponentBiplot()
simultaneously explores the PC scores
(sample-space) and the eigenvectors (feature-space). Scores are
shown as points while eigenvectors are shown as arrows. Point
aesthetics and arrow aesthetics can be controlled with the
pointParams
and the arrowParams
arguments, respectively.
Moreover, arrows are also labelled and label aesthetics can be
controlled using labelParams
and textBy
. Plotting all
eigenvectors as arrows leads to overcrowded plots. You can limit the plotting to
the top longest arrows (default to the top 10) as defined by the
distance on the two selected PCs.
scpComponentAggregate()
offers functionality to aggregate the
results from multiple features. This can be used to obtain, for
example, component analysis results for proteins when modelling at
the peptide level. The approach is inspired from
scuttle::aggregateAcrossCells()
and combines, for each group, multiple values for each component
using QFeatures::aggregateFeatures()
. By default, values are
aggregated using the median, but QFeatures
offers other methods
as well. The annotation of the component results are automatically
aggregated as well. See the aggregateFeatures()
man page for
more information on available methods and expected behavior.
Christophe Vanderaa, Laurent Gatto
Thiel, Michel, Baptiste Féraud, and Bernadette Govaerts. 2017. "ASCA+ and APCA+: Extensions of ASCA and APCA in the Analysis of Unbalanced Multifactorial Designs." Journal of Chemometrics 31 (6): e2895.
ScpModel-Workflow to run a model on SCP data upstream of component analysis.
The nipals::nipals()
function and package for detailed
information about the algorithm and associated parameters.
The ggplot2::ggplot()
functions and associated tutorials to
manipulate and save the visualization output
scpAnnotateResults()
to annotate component analysis results.
library("patchwork") library("ggplot2") data("leduc_minimal") leduc_minimal$cell <- rownames(colData(leduc_minimal)) ####---- Run component analysis ----#### (pcs <- scpComponentAnalysis( leduc_minimal, method = "ASCA", effects = "SampleType", pcaFUN = "auto", residuals = FALSE, unmodelled = FALSE )) ####---- Annotate results ----#### ## Add cell annotation available from the colData bySamplePCs <- scpAnnotateResults( pcs$bySample, colData(leduc_minimal), by = "cell" ) ## Add peptide annotations available from the rowData byFeaturePCs <- scpAnnotateResults( pcs$byFeature, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### ## Plot result in cell-space, ie each dot is a cell scpComponentPlot( bySamplePCs, pointParams = list( ## ggplot arguments aes(colour = SampleType, shape = lcbatch), alpha = 0.6 ) ) |> wrap_plots(guides = "collect") ## Plot result in peptide-space, ie each dot is a peptide scpComponentPlot( byFeaturePCs, pointParams = list(colour = "dodgerblue", alpha = 0.6) ) |> wrap_plots(guides = "collect") ## Plot both scpComponentBiplot( bySamplePCs, byFeaturePCs, pointParams = list(aes(colour = SampleType), alpha = 0.6), labelParams = list(max.overlaps = 20), textBy = "gene" ) |> wrap_plots(guides = "collect") ####---- Aggregate results ----#### ## Aggregate to protein-level results byProteinPCs <- scpComponentAggregate( byFeaturePCs, fcol = "Leading.razor.protein.id" ) ## Plot result in protein-space, ie each dot is a protein scpComponentPlot( byProteinPCs, pointParams = list(colour = "firebrick", alpha = 0.6) ) |> wrap_plots(guides = "collect")
library("patchwork") library("ggplot2") data("leduc_minimal") leduc_minimal$cell <- rownames(colData(leduc_minimal)) ####---- Run component analysis ----#### (pcs <- scpComponentAnalysis( leduc_minimal, method = "ASCA", effects = "SampleType", pcaFUN = "auto", residuals = FALSE, unmodelled = FALSE )) ####---- Annotate results ----#### ## Add cell annotation available from the colData bySamplePCs <- scpAnnotateResults( pcs$bySample, colData(leduc_minimal), by = "cell" ) ## Add peptide annotations available from the rowData byFeaturePCs <- scpAnnotateResults( pcs$byFeature, rowData(leduc_minimal), by = "feature", by2 = "Sequence" ) ####---- Plot results ----#### ## Plot result in cell-space, ie each dot is a cell scpComponentPlot( bySamplePCs, pointParams = list( ## ggplot arguments aes(colour = SampleType, shape = lcbatch), alpha = 0.6 ) ) |> wrap_plots(guides = "collect") ## Plot result in peptide-space, ie each dot is a peptide scpComponentPlot( byFeaturePCs, pointParams = list(colour = "dodgerblue", alpha = 0.6) ) |> wrap_plots(guides = "collect") ## Plot both scpComponentBiplot( bySamplePCs, byFeaturePCs, pointParams = list(aes(colour = SampleType), alpha = 0.6), labelParams = list(max.overlaps = 20), textBy = "gene" ) |> wrap_plots(guides = "collect") ####---- Aggregate results ----#### ## Aggregate to protein-level results byProteinPCs <- scpComponentAggregate( byFeaturePCs, fcol = "Leading.razor.protein.id" ) ## Plot result in protein-space, ie each dot is a protein scpComponentPlot( byProteinPCs, pointParams = list(colour = "firebrick", alpha = 0.6) ) |> wrap_plots(guides = "collect")
An ScpModelFit
object is expected to be stored as a list element
in the scpModelFitList
of an ScpModel
object. The
ScpModelFit
object should never be accessed directly by the
user. Refer to the ScpModel for a list of function to
access the information in an ScpModelFit
. The ScpModelFit
class contains several slots that contain the model output for a
feature:
n
: an integer
, the number of observations for the feature
p
: an integer
, the number of coefficient to estimate
coefficients
: a numeric
vector with the estimated
coefficients
residuals
: a numeric
vector with the estimated residuals
effects
: a List
with the
df
: an integer
providing the number of degrees of freedom
of the model estimation
var
: a numeric
vector with the residual variance of the
model estimation
uvcov
: the unscaled variance covariance matrix
levels
: a named List
where each elements corresponds to a
categorical model variable and contains a vector with the
possible categories.
Christophe Vanderaa, Laurent Gatto
ScpModel for a description of the class that relies on
ScpModelFit
new("ScpModelFit") ## this should never be used by the user
new("ScpModelFit") ## this should never be used by the user