Title: | Quantification and Differential Analysis of Proteomics Data |
---|---|
Description: | Quantification and differential analysis of mass-spectrometry proteomics data, with probabilistic recovery of information from missing values. Estimates the detection probability curve (DPC), which relates the probability of successful detection to the underlying expression level of each peptide, and uses it to incorporate peptide missing values into protein quantification and into subsequent differential expression analyses. The package produces objects suitable for downstream analysis in limma. The package accepts peptide-level data with missing values and produces complete protein quantifications without missing values. The uncertainty introduced by missing value imputation is propagated through to the limma analyses using variance modeling and precision weights. The package name "limpa" is an acronym for "Linear Models for Proteomics Data". |
Authors: | Mengbo Li [aut] |
Maintainer: | Gordon Smyth <[email protected]> |
License: | GPL (>=2) |
Version: | 0.99.10 |
Built: | 2025-03-27 03:34:46 UTC |
Source: | https://github.com/bioc/limpa |
This package implements a pipeline for quantification and differential expression analysis of mass-spectrometry-based proteomics data.
Mass spectrometry (MS)-based proteomics is a powerful tool in biomedical research. Recent advancements in label-free methods and MS instruments have enabled the quantitative characterisation of large-scale complex biological samples with the increasingly deeper coverage of the proteome. However, missing values are still ubiquitous in MS-based proteomics data. We observe from a wide range of real datasets that missingness in label-free data is intensity-dependent, so that the missing values are missing not at random or, in other words, are non-ignorable.
This package implements statistical and computational methods for analysing MS-based label-free proteomics data with non-ignorable missing values. The package use the observed proteomics data to estimate the detection probability curve (DPC), which provides a formal probabilistic model for the intensity-dependent missingness. Based on exponential tilting, the DPC estimates the detection probabilities given the underlying intensity of each observation, observed or unobserved. Importantly, the DPC evaluates how much statistical information can or cannot be recovered from the missing value pattern, and can be used to inform downstream analyses such as differential expression (DE) analysis.
Next, the package implements a novel protein quantification method, called DPC-quant, where missing values are represented by the DPC. An empirical Bayes scheme is employed to borrow information across the tens of thousands of peptides measured in a typical experiment. A multivariate normal prior is estimated empirically from data to describe the variability in log-intensities across the samples and across the peptides.
Finally, quantification uncertainty is incorporated into the differential expression analysis using precision weights.
Leveraging the limma package, a new variance modelling approach with multiple predictors is used, which allows the DPC-quant precisions to be propagated to the differential expression analysis while simultaneously assuming a mean-variance relationship.
The new differential expression pipeline has been implemented in the limma R package in the vooma()
function.
The limpa package is fully compatible with limma pipelines, allowing any arbitrarily complex experimental design and other downstream tasks such as the gene ontology or pathway analysis.
Mengbo Li and Gordon K Smyth
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. 10.1093/bioinformatics/btad200 )
Mean and standard-deviation of the complete data distribution under the observed normal model.
completeMomentsON(mean.obs=6, sd.obs=1, dpc=c(-4,0.7))
completeMomentsON(mean.obs=6, sd.obs=1, dpc=c(-4,0.7))
mean.obs |
mean of observed normal distribution. |
sd.obs |
standard deviation of observed normal distribution. |
dpc |
numeric vector of length 2 giving the DPC intercept and slope. |
Under the observed normal model, calculate the mean and standard deviation of the complete data distribution that would have occurred if the missing value mechanism hadn't operated.
A list with compoenents
mean.comp |
mean of complete data distribution. |
sd.comp |
standard deviation of complete data distribution. |
prob.obs |
unconditional probability that values are observed. |
completeMomentsON(mean.obs=6, sd.obs=2)
completeMomentsON(mean.obs=6, sd.obs=2)
Detection probability curve for label free shotgun proteomics data assuming observed normal intensities.
dpc(y, maxit = 100, eps = 1e-04, b1.upper = 1)
dpc(y, maxit = 100, eps = 1e-04, b1.upper = 1)
y |
numeric matrix of log2-transformed intensities. Rows correspond to peptide precursors and columns to samples. Any object such as an EList that can be coerced to a matrix is also acceptable. |
maxit |
maximum number of iterations. |
eps |
convergence tolerance. |
b1.upper |
upper bound for beta1. |
Estimate the detection probability curve (DPC) for label-free shotgun proteomics data using the method described by Li & Smyth (2023). This function assumes that the observed log-intensities are normally distributed (the "observed normal" model), and uses exponential tilting to reformulate the DPC in terms of observed statistics instead of in terms of unobserved quantities.
A list with components
dpc |
estimated DPC coefficients. |
history |
iteration history. |
dpc.start |
initial values estimated for the DPC coefficients. |
prop.detected |
proportion of observed values for each row. |
mu.prior |
prior value for row-wise means for observed values. |
n.prior |
precision of prior for row-wise means, expressed as effective number of observations. |
s2.prior |
prior value for row-wise variances for observed values. |
df.prior |
precision of prior for row-wise variances, expressed as effective degrees of freedom. |
mu.obs |
posterior row-wise means for observed values. |
s2.obs |
posterior row-wise variances for observed values. |
mu.mis |
posterior row-wise means for values that are missing. |
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. 10.1093/bioinformatics/btad200
y <- simProteinDataSet(n.peptides=100, n.groups=1) out <- dpc(y) out$dpc
y <- simProteinDataSet(n.peptides=100, n.groups=1) out <- dpc(y) out$dpc
Detection probability curve for label-free shotgun proteomics data assuming a complete normal model for the peptide intensities.
dpcCN(y, dpc.start= c(-4,0.7), iterations = 3, verbose = TRUE)
dpcCN(y, dpc.start= c(-4,0.7), iterations = 3, verbose = TRUE)
y |
numeric matrix of log2-intensities. Rows correspond to peptide precursors and columns to samples. |
dpc.start |
numeric vector of length 2 giving starting estimates for the DPC intercept and slope. |
iterations |
number of outer iterations. |
verbose |
if |
Estimate the detection probability curve (DPC) for label-free shotgun proteomics data by maximum posterior assuming that the complete log-intensities are normally distributed (the "complete normal" model). The complete log-intensities are the values that would have been observed if the missing value mechanism had not operated.
The algorithm uses an alternating iteration (Smyth, 1996), alternately estimating the row-wise means and standard deviations (mu and sigma) for fixed DPC and estimating the DPC for fixed mu and sigma.
The inner estimations use the BFGS algorithm implemented in the optim
function.
Three outer iterations are usually sufficient.
dpc
estimates the DPC by a different method, described in Li & Smyth (2023), based on exponential tilting and assuming that only the observed values are normally distributed (the "observed normal" model).
A list with components
dpc |
numeric vector of length 2 giving estimated DPC coefficients. |
mu |
numeric vector of length |
sigma |
numeric vector of length |
This function may underestimate the DPC slope if entirely missing peptides are omitted and the proportion of peptides that are entirely missing by chance is not small.
dpcCN
can take several minutes on large datasets so, by default, progress information is turned on with verbose=TRUE
.
The function will run quietly if verbose=FALSE
is set.
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. 10.1093/bioinformatics/btad200
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Smyth GK (1996). Partitioned algorithms for maximum likelihood and other non-linear estimation. Statistics and Computing 6, 201-216. doi:10.1007/BF00140865 https://gksmyth.github.io/pubs/partitio.pdf
dpc
.
y <- simProteinDataSet(n.peptides=100, n.groups=1) out <- dpcCN(y) out$dpc
y <- simProteinDataSet(n.peptides=100, n.groups=1) out <- dpcCN(y) out$dpc
Fit linear models and make precision weights from the DPC-Quant standard errors.
dpcDE(y, design, plot=TRUE, ...)
dpcDE(y, design, plot=TRUE, ...)
y |
protein-level EList produced by dpcQuant(). |
design |
design matrix. |
plot |
should the variance trend be plotted? |
... |
other arguments are passed to |
Calls voomaLmFit
to compute vooma precision weights from the DPC-Quant standard errors stored in y
and to use those weights to fit protein-wise linear models.
Any voomaLmFit
functinality can be used, giving access to optional empirical sample weights or random blocks.
An MArrayLM
object suitable for analysis in limma.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
voomaLmFit
y.peptide <- simProteinDataSet() y.protein <- dpcQuant(y.peptide, "Protein", dpc=c(-4,0.7)) Group <- factor(y.peptide$targets$Group) design <- model.matrix(~Group) fit <- dpcDE(y.protein, design)
y.peptide <- simProteinDataSet() y.protein <- dpcQuant(y.peptide, "Protein", dpc=c(-4,0.7)) Group <- factor(y.peptide$targets$Group) design <- model.matrix(~Group) fit <- dpcDE(y.protein, design)
Use the DPC to quantify protein expression values.
## S3 method for class 'EList' dpcQuant(y, protein.id = "Protein.Group", dpc = NULL, dpc.slope = 0.8, verbose = TRUE, chunk = 1000, ...) ## S3 method for class 'EList' dpcImpute(y, dpc = NULL, dpc.slope = 0.8, verbose = TRUE, chunk = 1000, ...)
## S3 method for class 'EList' dpcQuant(y, protein.id = "Protein.Group", dpc = NULL, dpc.slope = 0.8, verbose = TRUE, chunk = 1000, ...) ## S3 method for class 'EList' dpcImpute(y, dpc = NULL, dpc.slope = 0.8, verbose = TRUE, chunk = 1000, ...)
y |
a numeric matrix or EList of peptide-level log2-expression values. Columns are samples and rows are peptides or precursors. |
protein.id |
protein IDs.
Either an annotation column name (if |
dpc |
numeric vector giving intercept and slope of DPC.
Alternatively the output objects from |
dpc.slope |
slope coefficient of DPC.
Only used if |
verbose |
should progress information be output?
If |
chunk |
When |
... |
other arguments are passed to |
Implements the DPC-Quant method, which quantifies protein log2-expression values from peptide data. The method represents missing values probabilistically using the PDC and returns maximum posterior estimates for all the protein log2-expression values, so that there are no missing values in the final summary.
The dpc
function is usually used to estimate the detection probability curve (DPC) before running dpcQuant
, however a preset DPC slope can also be used.
If the dpc
argument is NULL
, then dpc.slope
will be used as the DPC together with a DPC intercept estimated by estimateDPCIntercept
.
The output from dpcQuant
can be input to dpcDE
.
dpcImpute
performs imputation without summarization by treating each row as a separate protein.
dpcQuant()
produces an EList
object with a row for each protein, with the following extra components:
other$n.observations |
matrix giving the number of missing non-missing peptide observations supporting each protein expression value. |
other$standard.error |
matrix giving the standard error of each protein expression value. |
dpcImpute()
produces an EList
object with the same number of rows as y
.
dpcQuant
can take several minutes on large datasets so, by default, progress information is turned on with verbose=TRUE
.
The function will run quietly if verbose=FALSE
is set.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
dpc
, dpcQuantHyperparam
, dpcDE
, EList-class
.
y.peptide <- simProteinDataSet(n.groups=1,samples.per.group=4,prop.missing=0.2) y.protein <- dpcQuant(y.peptide, "Protein", dpc.slope=0.7)
y.peptide <- simProteinDataSet(n.groups=1,samples.per.group=4,prop.missing=0.2) y.protein <- dpcQuant(y.peptide, "Protein", dpc.slope=0.7)
Estimate hyperparameters for the DPC-based protein quantification method (DPC-Quant).
dpcQuantHyperparam(y, protein.id, dpc.slope = 0.7, sd.quantile.for.logFC = 0.9, robust = FALSE, ...) dpcImputeHyperparam(y, dpc.slope = 0.7, sd.quantile.for.logFC = 0.9, robust = FALSE, ...)
dpcQuantHyperparam(y, protein.id, dpc.slope = 0.7, sd.quantile.for.logFC = 0.9, robust = FALSE, ...) dpcImputeHyperparam(y, dpc.slope = 0.7, sd.quantile.for.logFC = 0.9, robust = FALSE, ...)
y |
a numeric matrix of peptide-level log2-expression values. Columns are samples and rows are peptides or precursors. |
protein.id |
a character vector of length |
dpc.slope |
slope of the DPC. |
sd.quantile.for.logFC |
a number between 0 and 1. The quantile of the precursor-level variances to represent the typical between-sample variation. |
robust |
should robust empirical Bayes moderation be applied to the protein standard deviations?
|
... |
other arguments are passed to |
Estimates and returns the empirical Bayes hyperparameters required for DPC-Quant protein quantification.
dpcQuantHyperparam
is called by dpcQuant
function, and dpcImputeHyperparam
is called by dpcImpute
.
A list with components
prior.mean |
mean of the global prior distribution for protein log-expression values. Represents the typical average log-expression of a protein. |
prior.sd |
standard deviation of the global prior distribution for protein log-expression values. Represents the standard deviation of average log-expression across proteins. |
prior.logFC |
standard deviation to be expected between log-expression values for the same protein across conditions. |
sigma |
protein standard deviations from additive model fitted to peptide log expression values.
Numeric vector of same length as |
The last component is omitted in the dpcImputeHyperparam
output.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Estimate the DPC intercept given a value for the slope.
estimateDPCIntercept(y, dpc.slope = 0.8, trace = FALSE)
estimateDPCIntercept(y, dpc.slope = 0.8, trace = FALSE)
y |
numeric matrix of log2-intensities, or any data object than can be coerced to a matrix. Includes NAs. Rows correspond to peptide precursors and columns to samples. |
dpc.slope |
DPC slope. |
trace |
if |
Estimates the intercept coefficient of the detection probability curve (DPC) by using imputeByExpTilt
to impute complete data, then fitting a binomial glm model with the slope as an offset vector.
If the dataset is large, then similar y
values are aggregated before fitting the glm.
A single numeric value giving the intercept.
y <- simProteinDataSet(n.peptides=100, n.groups=1, dpc.slope=0.7) estimateDPCIntercept(y, dpc.slope=0.7)
y <- simProteinDataSet(n.peptides=100, n.groups=1, dpc.slope=0.7) estimateDPCIntercept(y, dpc.slope=0.7)
Filter peptides or proteins from the dataset based on uniqueness of annotation.
## Default S3 method: filterCompoundProteins(y, protein.group, ...) ## S3 method for class 'EList' filterCompoundProteins(y, protein.group="Protein.Group", ...) ## Default S3 method: filterSingletonPeptides(y, protein.group, min.n.peptides = 2, ...) ## S3 method for class 'EList' filterSingletonPeptides(y, protein.group="Protein.Group", min.n.peptides = 2, ...) ## Default S3 method: filterNonProteotypicPeptides(y, proteotypic, ...) ## S3 method for class 'EList' filterNonProteotypicPeptides(y, proteotypic="Proteotypic", ...)
## Default S3 method: filterCompoundProteins(y, protein.group, ...) ## S3 method for class 'EList' filterCompoundProteins(y, protein.group="Protein.Group", ...) ## Default S3 method: filterSingletonPeptides(y, protein.group, min.n.peptides = 2, ...) ## S3 method for class 'EList' filterSingletonPeptides(y, protein.group="Protein.Group", min.n.peptides = 2, ...) ## Default S3 method: filterNonProteotypicPeptides(y, proteotypic, ...) ## S3 method for class 'EList' filterNonProteotypicPeptides(y, proteotypic="Proteotypic", ...)
y |
a matrix, |
protein.group |
protein group for each row of |
proteotypic |
indicates whether each peptide is proteotypic (detectable and unique to one protein).
Should contain 0/1 or TRUE/FALSE values.
Can be either a vector of length |
min.n.peptides |
minimum number of peptides required in a protein. |
... |
other arguments are not currently used. |
Filter peptide or proteins from the dataset based on uniqueness of annotation.
filterCompoundProteins
removes compound protein groups consisting of multiple proteins separated by ";" delimiters.
filterSingletonPeptides
removes proteins with only one peptide.
filterNonProteotypicPeptides
removes peptides that belong to more than one protein, using the "Proteotypic" annotation column that is returned by DIA-NN and other proteomics quantification software.
An object the same as y
but with non-compliant rows removed.
Estimate a logistic regression, with optionally capped probabilities, by maximum likelihood with zero-truncated data.
fitZTLogit(n.successes, n.trials, X = NULL, capped = FALSE, beta.start = NULL, alpha.start = 0.95)
fitZTLogit(n.successes, n.trials, X = NULL, capped = FALSE, beta.start = NULL, alpha.start = 0.95)
n.successes |
number of binomial successes (numeric vector).
Should be bounded below by 1 and bounded above by |
n.trials |
number of binomial trials (numeric vector). |
X |
the regression design matrix. Number of rows should match |
capped |
if |
beta.start |
starting values for the regression coefficients. Of same length as |
alpha.start |
starting value for alpha. |
Estimates a logistic regression equation for zero-truncated binomial observations. Optionally estimates a limiting value for the probabilities that may be less than one.
The function maximizes the zero-truncated binomial likelihood using the optim
function with method="BFGS"
.
The fitted probabilities are equal to alpha * plogis(X %*% beta)
.
A list with components
beta |
linear predictor coefficients. |
alpha |
capping parameter, maximum or asymptotic value for the probabilities. |
p |
fitted probabilities. |
deviance |
minus twice the maximized log-likelihood. |
calls |
number of function calls used in the optimization. |
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. 10.1093/bioinformatics/btad200
# Generate binomial data n <- 30 n.trials <- rep(4,n) x <- seq(from=3, to=9, length.out=n) X <- model.matrix(~x) beta <- c(-4,0.7) p <- plogis(X %*% beta) n.successes <- rbinom(n, size=n.trials, prob=p) # Zero truncation is.pos <- (n.successes > 0) n.successes <- n.successes[is.pos] n.trials <- n.trials[is.pos] x <- x[is.pos] X <- X[is.pos,] # Zero-truncated regression fit <- fitZTLogit(n.successes, n.trials, X) p.observed <- n.successes / n.trials plot(x, p.observed) lines(x, fit$p)
# Generate binomial data n <- 30 n.trials <- rep(4,n) x <- seq(from=3, to=9, length.out=n) X <- model.matrix(~x) beta <- c(-4,0.7) p <- plogis(X %*% beta) n.successes <- rbinom(n, size=n.trials, prob=p) # Zero truncation is.pos <- (n.successes > 0) n.successes <- n.successes[is.pos] n.trials <- n.trials[is.pos] x <- x[is.pos] X <- X[is.pos,] # Zero-truncated regression fit <- fitZTLogit(n.successes, n.trials, X) p.observed <- n.successes / n.trials plot(x, p.observed) lines(x, fit$p)
Impute missing values in a log-expression matrix by applying exponential tilting to rows, columns or both.
## Default S3 method: imputeByExpTilt(y, dpc.slope = 0.7, prior.logfc = NULL, by = "both", ...) expTiltByRows(y, dpc.slope = 0.7, sigma.obs = NULL) expTiltByColumns(y, dpc.slope = 0.7)
## Default S3 method: imputeByExpTilt(y, dpc.slope = 0.7, prior.logfc = NULL, by = "both", ...) expTiltByRows(y, dpc.slope = 0.7, sigma.obs = NULL) expTiltByColumns(y, dpc.slope = 0.7)
y |
an |
dpc.slope |
slope of detection probability curve. |
prior.logfc , sigma.obs
|
simple standard deviation to be expected between observed values for the same peptide or protein. Can a single value or vector of length |
by |
character value. Should imputation by rows ( |
... |
other arguments are not used. |
Implements exponential tilting strategy outlined by Li & Smyth (2023). The imputed values are the expected values of the missing value distribution.
The strategy can be applied to rows or columns.
If by="both"
, the imputated values are an average of the row and column imputations, weighted inversely by the prediction variances.
An object of the same class as y
but with NAs imputed.
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200
y <- matrix(rnorm(25),5,5) y[1,1] <- NA imputeByExpTilt(y)
y <- matrix(rnorm(25),5,5) y[1,1] <- NA imputeByExpTilt(y)
Mean and standard-deviation of the observed data distribution under the complete normal model.
observedMomentsCN(mean.comp=6, sd.comp=1, dpc=c(-4,0.7))
observedMomentsCN(mean.comp=6, sd.comp=1, dpc=c(-4,0.7))
mean.comp |
mean of complete normal distribution. |
sd.comp |
standard deviation of complete normal distribution. |
dpc |
numeric vector of length 2 giving the DPC intercept and slope. |
Under the complete normal model, calculate the mean and standard deviation of the observed data distribution.
A list with compoenents
mean.obs |
mean of observed data distribution. |
sd.obs |
standard deviation of observed data distribution. |
prob.obs |
unconditional probability that values are observed. |
observedMomentsCN(mean.comp=6, sd.comp=2)
observedMomentsCN(mean.comp=6, sd.comp=2)
Convert a matrix of peptide log-expression values for one protein to protein-level expression values by the DPC-Quant method.
peptides2ProteinBFGS(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = TRUE, newton.polish = TRUE, start = NULL) peptides2ProteinNewton(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = TRUE, tol=1e-6, maxit=10, start = NULL, verbose = FALSE) peptides2ProteinWithoutNAs(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2)
peptides2ProteinBFGS(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = TRUE, newton.polish = TRUE, start = NULL) peptides2ProteinNewton(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = TRUE, tol=1e-6, maxit=10, start = NULL, verbose = FALSE) peptides2ProteinWithoutNAs(y, sigma = 0.5, weights = NULL, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2)
y |
a numeric matrix of log-expression values.
Columns are samples and rows are peptides or precursors.
Typically contains NAs, but NAs are not allowed for |
sigma |
standard deviation of peptide-level expression values after allowing for peptide and sample baseline differences. |
weights |
numeric matrix of same size as |
dpc |
numeric vector giving intercept and slope of the detection probability curve (DPC). |
prior.mean |
mean of the global prior distribution for protein log-expression values. Represents the typical average log-expression of a protein. |
prior.sd |
standard deviation of the global prior distribution for protein log-expression values. Represents the standard deviation of average log-expression across proteins. |
prior.logFC |
standard deviation to be expected between log-expression values for the same protein. |
standard.errors |
logical, should standard errors for the protein expression values be returned? |
newton.polish |
logical.
If |
start |
numeric vector of starting values for the linear model coefficients.
Of length |
tol |
stopping criterion tolerance for Newton's method, to be achieved by the average local slope statistic. |
maxit |
maximum number of iterations for Newton's method. |
verbose |
logical.
If |
Implements the DPC-Quant method, which returns maximum posterior estimates for protein expression values.
peptides2ProteinBFGS
maximizes the posterior using the BFGS algorithm with analytic first derivatives.
The standard errors are computed from analytic second derivatives.
peptides2ProteinNewton
maximizes the posterior using Newton's method.
peptides2ProteinBFGS
and peptides2ProteinNewton
return a list with components.
protein.expression |
numeric vector giving the estimated protein log-expression value for each sample. |
standard.error |
numeric vector giving standard errors for the protein log-expression values. |
value |
the minimized objective function, minus twice the log-posterior distribution. |
peptides2ProteinWithoutNAs
returns a numeric vector of protein expression values.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Smyth GK (2005). Optimization and nonlinear equations. In: Encyclopedia of Biostatistics Second Edition, Volume 6, P. Armitage and T. Colton (eds.), Wiley, London, pages 3857-3863. https://gksmyth.github.io/pubs/OptimNonlinEqnPreprint.pdf
y <- matrix(rnorm(12),3,4) y[1:2,1] <- NA y[1,2] <- NA peptides2ProteinBFGS(y)
y <- matrix(rnorm(12),3,4) y[1:2,1] <- NA y[1,2] <- NA peptides2ProteinBFGS(y)
Quantify protein expression values by the DPC-Quant method.
peptides2Proteins(y, protein.id, sigma = 0.5, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = FALSE, newton.polish = FALSE, verbose = FALSE, chunk = 1000L)
peptides2Proteins(y, protein.id, sigma = 0.5, dpc = c(-4, 0.7), prior.mean = 6, prior.sd = 10, prior.logFC = 2, standard.errors = FALSE, newton.polish = FALSE, verbose = FALSE, chunk = 1000L)
y |
a numeric matrix of log-expression values. Columns are samples and rows are peptides or precursors. |
protein.id |
protein IDs.
Character vector of length |
sigma |
standard deviations of peptide-level expression values.
Numeric vector of same length as |
dpc |
numeric vector giving intercept and slope of detection probability curve (DPC). |
prior.mean |
mean of the global prior distribution for protein log-expression values. Represents the typical average log-expression of a protein. |
prior.sd |
standard deviation of the global prior distribution for protein log-expression values. Represents the standard deviation of average log-expression across proteins. |
prior.logFC |
standard deviation to be expected between log-expression values for the same protein. |
standard.errors |
logical, should standard errors for the protein expression values be returned? |
newton.polish |
logical.
If |
verbose |
should progress information be output?
If |
chunk |
When |
Implements the DPC-Quant method, which returns maximum posterior estimates for protein expression values.
An EList
object with a row for each protein.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
y.peptide <- simProteinDataSet(8,n.groups=1,samples.per.group=4,prop.missing=0.2) y.protein <- peptides2Proteins(y.peptide$E, y.peptide$genes$Protein)
y.peptide <- simProteinDataSet(8,n.groups=1,samples.per.group=4,prop.missing=0.2) y.protein <- peptides2Proteins(y.peptide$E, y.peptide$genes$Protein)
Plot the detection probability curve using output from the dpc
function.
plotDPC(dpcfit, add.jitter = TRUE, point.cex = 0.2, lwd = 2, ylim = c(0, 1), main = "Detection probability curve", ...)
plotDPC(dpcfit, add.jitter = TRUE, point.cex = 0.2, lwd = 2, ylim = c(0, 1), main = "Detection probability curve", ...)
dpcfit |
object produced by |
add.jitter |
logical, whether to add jitter to the detected proportion axis. |
point.cex |
relative size of points. |
lwd |
relative line width. |
ylim |
limits of the y-axis. |
main |
main title of plot. |
... |
other arguments are passed to |
A plot is produced on the current device.
A list with components x
and y
is also invisibly returned.
y <- simProteinDataSet(n.peptides=100, n.groups=1) dpcfit <- dpc(y) plotDPC(dpcfit)
y <- simProteinDataSet(n.peptides=100, n.groups=1) dpcfit <- dpc(y) plotDPC(dpcfit)
Plot samples on a two-dimensional scatterplot so that distances on the plot approximate the typical z-statistic of differences between the samples.
plotMDSUsingSEs(y, top = 500, labels = NULL, pch = NULL, cex = 1, dim.plot = c(1,2), gene.selection = "pairwise", xlab = NULL, ylab = NULL, plot = TRUE, var.explained = TRUE, ...)
plotMDSUsingSEs(y, top = 500, labels = NULL, pch = NULL, cex = 1, dim.plot = c(1,2), gene.selection = "pairwise", xlab = NULL, ylab = NULL, plot = TRUE, var.explained = TRUE, ...)
y |
|
top |
number of top genes used to calculate pairwise distances. |
labels |
character vector of sample names or labels. Defaults to |
pch |
plotting symbol or symbols. See |
cex |
numeric vector of plot symbol expansions. |
dim.plot |
integer vector of length two specifying which principal components should be plotted. |
gene.selection |
character, |
xlab |
title for the x-axis. |
ylab |
title for the y-axis. |
plot |
logical. If |
var.explained |
logical. If |
... |
any other arguments are passed to |
This function uses multidimensional scaling (MDS) to produce a principal coordinate (PCoA) plot showing the relationships between the expression profiles represented by the columns of x
.
Distances on the plot represent the leading z-statistic.
The leading log-fold-change between a pair of samples is defined as the root-mean-square average of the top
largest z-statistics between those two samples.
If pch=NULL
, then each sample is represented by a text label, defaulting to the column names of x
.
If pch
is not NULL
, then plotting symbols are used.
See text
for possible values for col
and cex
.
If plot=TRUE
or if x
is an object of class "MDS"
, then a plot is created on the current graphics device.
An object of class "MDS"
is also invisibly returned.
This is a list containing the following components:
eigen.values |
eigen values |
eigen.vectors |
eigen vectors |
var.explained |
proportion of variance explained by each dimension |
distance.matrix.squared |
numeric matrix of squared pairwise distances between columns of |
dim.plot |
dimensions plotted |
x |
x-xordinates of plotted points |
y |
y-cordinates of plotted points |
gene.selection |
gene selection method |
Gordon Smyth
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, and Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. http://nar.oxfordjournals.org/content/43/7/e47
plotMDS
in the limma package.
# See dpcQuant()
# See dpcQuant()
Plot the log-intensity of a protein summarized by DPC-Quant for each sample with error bars.
plotProtein(y, protein, col = "black", cex = 2, lwd = 2, ...)
plotProtein(y, protein, col = "black", cex = 2, lwd = 2, ...)
y |
protein-level EList produced by |
protein |
A vector of length 1. Can be the name of the protein or the numeric index that locates
the protein to plot from rows of |
col |
Color for the points and error bars. |
cex |
Size for the points. |
lwd |
Line width for the error bars. |
... |
other arguments are passed to |
Plot the sample-wise protein quantification results from dpcQuant()
for a specified protein.
The error bars (standard errors) indicate the quantification uncertainty associated with each estimate.
Typically within a dataset, the larger the error bar is, the more missing values there are in the
precursor/peptide-level data for that protein.
A plot is created on the current graphics device.
A list with components y
and se
is also invisibly returned.
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
y.peptide <- simProteinDataSet() y.protein <- dpcQuant(y.peptide, "Protein", dpc=c(-4,0.7)) plotProtein(y.protein, protein = "Protein01", col = rep(c("blue", "red"), each = 5)) y.protein$other$standard.error["Protein01",]
y.peptide <- simProteinDataSet() y.protein <- dpcQuant(y.peptide, "Protein", dpc=c(-4,0.7)) plotProtein(y.protein, protein = "Protein01", col = rep(c("blue", "red"), each = 5)) y.protein$other$standard.error["Protein01",]
Get protein-wise residual variances by fitting a two-way additive model to the complete (imputed) peptide data for each protein.
proteinResVarFromCompletePeptideData(y, protein.id, reorder=FALSE)
proteinResVarFromCompletePeptideData(y, protein.id, reorder=FALSE)
y |
a numeric matrix of complete peptide log2-expression values without NAs. Columns are samples and rows are peptides or precursors. |
protein.id |
a character vector of length |
reorder |
does the data need to sorted into protein order?
If |
This function operates on complete data after imputation of missing values, and is used to get the sigma
hyperparameters required by peptides2Proteins
and dpcQuant
.
The function fits an additive linear model (~ sample + peptide) to the peptide data for each protein and returns the residual variances.
A list with components
prior.mean |
mean of the global prior distribution for protein log-expression values. Represents the typical average log-expression of a protein. |
prior.sd |
standard deviation of the global prior distribution for protein log-expression values. Represents the standard deviation of average log-expression across proteins. |
prior.logFC |
standard deviation to be expected between log-expression values for the same protein across conditions. |
sigma |
protein standard deviations from additive model fitted to peptide log expression values.
Numeric vector of same length as |
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
y <- simProteinDataSet(8, n.groups=1, samples.per.group=4, prop.missing=0) proteinResVarFromCompletePeptideData(y$E, y$genes$Protein)
y <- simProteinDataSet(8, n.groups=1, samples.per.group=4, prop.missing=0) proteinResVarFromCompletePeptideData(y$E, y$genes$Protein)
Read DIA-NN Reports.tsv file into EList object.
readDIANN(file = "Report.tsv", path = NULL, format = "tsv", sep = "\t", log = TRUE, q.columns = c("Global.Q.Value", "Lib.Q.Value"), q.cutoffs = c(0.01, 0.01))
readDIANN(file = "Report.tsv", path = NULL, format = "tsv", sep = "\t", log = TRUE, q.columns = c("Global.Q.Value", "Lib.Q.Value"), q.cutoffs = c(0.01, 0.01))
file |
the name of the file from which the data are to be read. Or it can also be the data.frame read from the report in the long format, where each row is an observation. |
path |
character string giving the directory containing the file. Defaults to the current working directory. |
format |
character string giving the format of the file. Possible values are "tsv" or "parquet". Default is "tsv". |
sep |
the field separator character |
log |
logical. If |
q.columns |
column headings in the DIA-NN output containing Q-values for peptide identification. Character vector. |
q.cutoffs |
cutoffs to apply to the Q-value columns. Only peptides with values below the cutoffs will be retained. Numeric vector of same length as |
DIA-NN (Demichev et al 2020) writes a file in long (data.frame) format, typically called Report.tsv
, containing normalized intensities for peptide precursors.
readDIANN
reads this file and produces an object in limma EList or EListRaw format.
From version 2.0, DIA-NN returns the main report in Apache Parquet format (https://github.com/vdemichev/DiaNN/releases).
readDIANN
can read the Parquet file directly or, alternatively, one can read the Parquet file into a data.frame, and use readDIANN
to process the long-format data.frame into a limma EList or EListRaw object.
If log=FALSE
, an EListRaw object containing precursor-level unlogged intensities with zeros and protein annotation.
If log=TRUE
, an EList object containing precursor-level log2 intensities with NAs and protein annotation.
Rows are peptide-precursors and columns are samples.
Peptide precursor and protein annotation is stored in the 'genes' output component.
Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17(1), 41-44.
## Not run: ypep <- readDIAN() ypep <- filterCompoundProteins(ypep) ypep <- filterNonProteotypicPeptides(ypep) dpcfit <- dpc(ypep) yprot <- dpcQuant(ypep, dpc=dpcfit) ## End(Not run)
## Not run: ypep <- readDIAN() ypep <- filterCompoundProteins(ypep) ypep <- filterNonProteotypicPeptides(ypep) dpcfit <- dpc(ypep) yprot <- dpcQuant(ypep, dpc=dpcfit) ## End(Not run)
Read Spectronaut Reports.tsv file into EList object.
readSpectronaut(file = "Report.tsv", path = NULL, sep = "\t", log = TRUE, run.column = "R.Raw File Name", qty.column = "EG.TotalQuantity", q.columns = c("EG.Qvalue", "PG.Qvalue"), q.cutoffs = 0.01)
readSpectronaut(file = "Report.tsv", path = NULL, sep = "\t", log = TRUE, run.column = "R.Raw File Name", qty.column = "EG.TotalQuantity", q.columns = c("EG.Qvalue", "PG.Qvalue"), q.cutoffs = 0.01)
file |
the name of the file from which the data are to be read. |
path |
character string giving the directory containing the file. Defaults to the current working directory. |
sep |
the field separator character |
log |
logical. If |
run.column |
column containing run |
qty.column |
column containing intensities |
q.columns |
column headings in the Spectronaut output containing Q-values for peptide identification. Character vector. |
q.cutoffs |
cutoffs to apply to the Q-value columns. Only peptides with values below the cutoffs will be retained. Numeric vector of same length as |
Spectronaut (https://biognosys.com/software/spectronaut/) writes a file in long (data.frame) format, typically called Report.tsv
, containing normalized intensities for peptide precursors.
readSpectronaut
reads this file and produces an object in limma EList or EListRaw format.
If log=FALSE
, an EListRaw object containing precursor-level unlogged intensities with zeros and protein annotation.
If log=TRUE
, an EList object containing precursor-level log2 intensities with NAs and protein annotation.
Rows are peptide-precursors and columns are samples.
Peptide precursor and protein annotation is stored in the 'genes' output component.
## Not run: y <- readSpectronaut() dpcfit <- dpc(y) ## End(Not run)
## Not run: y <- readSpectronaut() dpcfit <- dpc(y) ## End(Not run)
Remove rows from a matrix that have fewer than a user-specified minimum number of non-missing observations.
## Default S3 method: removeNARows(y, nobs.min = 1, ...)
## Default S3 method: removeNARows(y, nobs.min = 1, ...)
y |
a matrix or an EList object. |
nobs.min |
minimum number of non-missing observations for rows to be kept. |
... |
other arguments are not currently used. |
Produces a new matrix keeping only those rows that have at least the specified number of non-missing values.
A matrix or EList the same as y
but with entirely or mostly missing rows removed.
y <- matrix(rnorm(25),5,5) y[y < -0.5] <- NA removeNARows(y)
y <- matrix(rnorm(25),5,5) y[y < -0.5] <- NA removeNARows(y)
Simulate a vector complete data together with the associated missing value events, under two different models.
simCompleteDataCN(n, mean.comp=6, sd.comp=1, dpc=c(-4,0.7)) simCompleteDataON(n, mean.obs=6, sd.obs=1, dpc=c(-4,0.7))
simCompleteDataCN(n, mean.comp=6, sd.comp=1, dpc=c(-4,0.7)) simCompleteDataON(n, mean.obs=6, sd.obs=1, dpc=c(-4,0.7))
n |
number of values to simulate. |
mean.comp |
mean of complete normal distribution. |
sd.comp |
standard deviation of complete normal distribution. |
mean.obs |
mean of observed normal distribution. |
sd.obs |
standard deviation of observed normal distribution. |
dpc |
numeric vector of length 2 giving the DPC intercept and slope. |
These functions simulate a vector of complete log2-expression data and identify which will be observed and which will be missing.
The complete values themselves are all non-missing, but some will be undetected in a hypothetical real dataset.
simCompleteDataCN
simulates data according to the complete normal model (CN), while simCompleteDataON
simulates data according to the observed normal model (ON).
These functions can be used to explore the differences between the complete and observed normal models. Under the CN model, the complete values (including both observed and unobserved) are exactly normally distributed, while the subset that are observed are only approximately normal. Under the ON model, the opposite is true. The observed values are exactly normal while the complete values are only approximately normal.
A list with compoenents
y.complete |
vector of complete values. |
is.missing |
vector of |
prob.missing |
conditional probability given |
# Complete values are only approximately normal under the ON model. out <- simCompleteDataON(100, mean.obs=6, sd.obs=1) mean(out$prob.missing) qqnorm(out$y.complete) qqline(out$y.complete)
# Complete values are only approximately normal under the ON model. out <- simCompleteDataON(100, mean.obs=6, sd.obs=1) mean(out$prob.missing) qqnorm(out$y.complete) qqline(out$y.complete)
Simulate peptide-level log2-expression values from a mass spectrometry experiment.
simProteinDataSet(n.peptides = 100, n.groups = 2, samples.per.group = 5, peptides.per.protein = 4, mu.range = c(2,10), sigma = 0.4, prop.de = 0.2, fc = 2, dpc.intercept = NULL, dpc.slope = 0.7, prop.missing = 0.4)
simProteinDataSet(n.peptides = 100, n.groups = 2, samples.per.group = 5, peptides.per.protein = 4, mu.range = c(2,10), sigma = 0.4, prop.de = 0.2, fc = 2, dpc.intercept = NULL, dpc.slope = 0.7, prop.missing = 0.4)
n.peptides |
number of peptides (rows of output). |
n.groups |
number of experimental groups (conditions). |
samples.per.group |
number of samples per group. |
peptides.per.protein |
number of peptides per protein. |
mu.range |
range of log2-expression values, in terms of expected value per peptide. |
sigma |
standard deviation of log2-expression values for each peptide in each group. |
prop.de |
proportion of differentially expressed proteins. |
fc |
true fold-change for differentially expressed proteins. |
dpc.intercept |
intercept of detection probability curve. Usually determined from |
dpc.slope |
slope of detection probability curve. |
prop.missing |
proportion of missing values (at average log2-expression). Ignored if |
Simulate peptide-level log2-expression values (log2-intensities) from a mass spectrometry experiment. Values are generated and missing values assigned according to the complete normal model.
Each group of successive peptides is assumed to belong to one protein. If the protein is differentially expressed (DE), then each peptide belonging to that protein is also DE with the same fold-change.
If dpc.intercept
is not specified, then it is chosen to ensure that the proportion of missing values is equal to prop.missing
at the average log2-expression value.
The simulated data is stored in an EList object, the standard limma package data class for log-expression values. Peptides are ordered by average expected expression level. Some of the more lowly expressed peptides may be entirely NA, depending on the argument settings.
EList containing simulated log2-expression values with n.peptides
rows and n.groups * n.samples.per.group
columns.
The EList contains the following components:
E |
matrix of peptide log2-expression values with NAs. |
other$E.complete |
matrix of complete log2-expression values without NAs. |
genes |
data.frame with columns |
targets |
data.frame with column |
y <- simProteinDataSet(n.peptides=10, n.groups=1) show(y)
y <- simProteinDataSet(n.peptides=10, n.groups=1) show(y)
Estimate the variance trend, use it to compute observational weights and use the weights to a fit a linear model. Includes automatic estimation of sample weights and block correlation. Equivalent to calling vooma(), arrayWeights(), duplicateCorrelation() and lmFit() iteratively.
voomaLmFitWithImputation(y, design = NULL, prior.weights = NULL, imputed = NULL, block = NULL, sample.weights = FALSE, var.design = NULL, var.group = NULL, prior.n = 10, predictor = NULL, span = NULL, legacy.span = FALSE, plot = FALSE, save.plot = FALSE, keep.EList = TRUE)
voomaLmFitWithImputation(y, design = NULL, prior.weights = NULL, imputed = NULL, block = NULL, sample.weights = FALSE, var.design = NULL, var.group = NULL, prior.n = 10, predictor = NULL, span = NULL, legacy.span = FALSE, plot = FALSE, save.plot = FALSE, keep.EList = TRUE)
y |
a numeric |
design |
design matrix with rows corresponding to samples and columns to coefficients to be estimated. Defaults to the unit vector meaning that samples are treated as replicates. |
prior.weights |
prior weights.
Can be a numeric matrix of individual weights of same dimensions as the |
imputed |
logical matrix of the same size as |
block |
vector or factor specifying a blocking variable on the arrays.
Has length equal to |
sample.weights |
logical value. If |
var.design |
design matrix for predicting the sample variances. Defaults to the sample-specific model whereby each sample has a different variance. |
var.group |
vector or factor indicating groups to have different sample weights.
This is another way to specify |
prior.n |
prior number of genes for squeezing the weights towards equality. Larger values squeeze the sample weights more strongly towards equality. |
predictor |
precision predictor.
Either a column vector of length |
span |
width of the smoothing window, as a proportion of the data set.
Defaults to a value between 0.3 and 1 that depends the number of genes ( |
legacy.span |
logical.
If |
plot |
logical.
If |
save.plot |
logical, should the coordinates and line of the plot be saved in the output? |
keep.EList |
logical. If |
This function is a modification of voomaLmFit
in the limma package, to give special treatment to imputed values.
This function gives more accurate estimation of the row-wise variances because it discounts fitted values and associated residuals that are determined entirely by imputed values that are all identical.
In a regular limma pipeline, such residuals will be structurally zero and will cause underestimation of the residual variance for that gene.
In voomaLmFit
, such residuals do not contribute to the genewise variances and the genewise residual degrees of freedom (df) are correspondingly reduced.
The principle is the same as for Lun & Smyth (2017), but here the loss of df is from imputed values instead of from zero counts.
This function is analogous to voomLmFit
in the edgeR package but for continuous log-expression values instead of count data.
voomLmFit
is a refinement of voom
adjusting for loss of residual df from all zero groups, whereas voomaLmFitWithImputation
is a refinement of voomaLmFit
adjusting for loss of residual df from all imputed groups.
The results from voomaLmFitWithImputation
are similar to those from voomaLmFit
, but the df.residual
values are equal or smaller and the sigma
values are equal or larger.
voomaLmFitWithImputation
is similar to calling vooma()
followed by lmFit()
, optionally with arrayWeights()
and duplicateCorrelation()
to estimate sample weights and block correlation.
The function finishes with lmFit()
and returns a fitted model object.
Like vooma
, voomaLmFitWithImputation
estimates the mean-variance relationship in the data and uses it to compute appropriate precision weights for each observation.
The mean-variance trend is estimated from gene-level data but is extrapolated back to individual observations to obtain a precision weight (inverse variance) for each observation.
The weights are then used by lmFit()
to adjust for heteroscedasticity.
Like voomLmFit
, which corrects for loss of residual degrees of freedom due to entirely zero counts in a group (Lun & Smyth 2017), voomaLmFitWithImputation
corrects for loss of residual degrees of freedom due to entirely imputed values in a group.
This adjustment prevents from the residual standard deviations from being underestimated due to zero variance between identical imputed values in a group.
If span=NULL
, then an optimal span value is estimated depending on nrow(y)
.
The span is chosen by chooseLowessSpan
with n=nrow(y)
, small.n=50
, min.span=0.3
and power=1/3
.
If legacy.span = TRUE
, then the chooseLowessSpan
arguments are reset to small.n=10
, min.span=0.3
and power = 0.5
to match the settings used by vooma
in limma version 3.59.1 and earlier.
If predictor
is not NULL
, then the variance trend is modeled as a function of both the mean log-expression and the predictor
using a multiple linear regression with the two predictors.
In this case, the predictor
is assumed to be some prior predictor of the precision or standard deviation of each log-expression value.
Any predictor
that is correlated with the precision of each observation should give good results.
This ability to model the variance trend using two covariates (mean log-expression and the predictor covariate) was described for the first time by Li (2024).
Sample weights will be estimated using arrayWeights
if sample.weights = TRUE
or if either var.design
or var.group
are non-NULL.
An intra-block correlation will be estimated using duplicateCorrelation
if block
is non-NULL.
In either case, the whole estimation pipeline will be repeated twice to update the sample weights and/or block correlation.
An MArrayLM object containing linear model fits for each row of data.
If sample weights are estimated, then the output object will include a targets
data.frame component with the sample weights as a column with heading "sample.weights"
.
If save.plot=TRUE
then the output object will include components voom.xy
and voom.line
.
voom.xy
contains the x and y coordinates of the points in the vooma variance-trend plot and voom.line
contains the estimated trend line.
If keep.EList=TRUE
, then the output includes component EList
with sub-components Elist$E
and EList$weights
.
If y
was an EList object, then the output EList
preserves all the components of y
and adds the weights.
Mengbo Li and Gordon Smyth
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Lun ATL, Smyth GK (2017). No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data. Statistical Applications in Genetics and Molecular Biology 16(2), 83-93. doi:10.1515/sagmb-2017-0010
vooma
, lmFit
, voomLmFit
(in the edgeR package).
# Example with a precision predictor group <- gl(2,4) design <- model.matrix(~group) y <- matrix(rnorm(500*8),500,8) u <- matrix(runif(length(y)),500,8) yu <- y*u fit <- voomaLmFitWithImputation(yu,design,plot=TRUE,predictor=u) # Reproducing vooma plot from output object fit <- voomaLmFitWithImputation(yu,design,predictor=u,save.plot=TRUE) do.call(plot,fit$voom.xy) do.call(lines,fit$voom.line)
# Example with a precision predictor group <- gl(2,4) design <- model.matrix(~group) y <- matrix(rnorm(500*8),500,8) u <- matrix(runif(length(y)),500,8) yu <- y*u fit <- voomaLmFitWithImputation(yu,design,plot=TRUE,predictor=u) # Reproducing vooma plot from output object fit <- voomaLmFitWithImputation(yu,design,predictor=u,save.plot=TRUE) do.call(plot,fit$voom.xy) do.call(lines,fit$voom.line)
Density and distribution function for the zero-truncated binomial distribution, using the same arguments as for the R stats binomial distribution functions.
dztbinom(x, size, prob, log = FALSE) pztbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
dztbinom(x, size, prob, log = FALSE) pztbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
x , q
|
vector of quantiles. |
p |
vector of probabilities. |
size |
number of trials (zero or more). |
prob |
probability of success on each trial. |
log |
logical; if |
lower.tail |
logical; if |
log.p |
logical; if |
These functions perform simmilarly to the R stats functions dbinom
and pbinom
except for the zero-truncation.
Output values give density (dztbinom
) or cumulative probability (pztbinom
)
for the zero-truncated binomial distribution with parameters size
and prob
.
Output is a vector of length equal to the maximum length of any of the arguments x
, q
, size
or prob
.
If the first argument is the longest, then all the attributes of the input argument are preserved on output, for example, a matrix x
will give a matrix on output.
Elements of input vectors that are missing will cause the corresponding elements of the result to be missing, as will non-positive values for size
or prob
.
# Compare to binomial x <- 1:3 dztbinom(x, size=3, prob=0.5) dbinom(x, size=3, prob=0.5) pztbinom(x, size=3, prob=0.5) pbinom(x, size=3, prob=0.5)
# Compare to binomial x <- 1:3 dztbinom(x, size=3, prob=0.5) dbinom(x, size=3, prob=0.5) pztbinom(x, size=3, prob=0.5) pbinom(x, size=3, prob=0.5)