Title: | Exploratory Data Analysis of LC-MS/MS data by spectral counts |
---|---|
Description: | Exploratory data analysis to assess the quality of a set of LC-MS/MS experiments, and visualize the influence of the involved factors. |
Authors: | Josep Gregori, Alex Sanchez, and Josep Villanueva |
Maintainer: | Josep Gregori <[email protected]> |
License: | GPL-2 |
Version: | 1.45.0 |
Built: | 2024-10-30 09:18:48 UTC |
Source: | https://github.com/bioc/msmsEDA |
Exploratory data analysis to assess the quality of a set of label-free LC-MS/MS experiments, quantified by spectral counts, and visualize the influence of the involved factors. Visualization tools to assess quality and to discover outliers and possible confounding.
Package: | msmsEDA |
Type: | Package |
Version: | 1.2.0 |
Date: | 2014-01-18 |
License: | GPL-2 |
pp.msms.data | data preprocessing |
gene.table | extract gene symbols from protein description |
count.stats | summaries by sample |
counts.pca | principal components analysis |
counts.hc | hierarchical clustering of samples |
norm.counts | normalization of spectral counts matrix |
counts.heatmap | experiment heatmap |
disp.estimates | dispersion analysis and plots |
filter.flags | flag informative features |
spc.barplots | sample sizes barplots |
spc.boxplots | samples SpC boxplots |
spc.densityplots | samples SpC density plots |
spc.scatterplot | scatterplot comparing two conditions |
batch.neutralize | batch effects correction |
Josep Gregori, Alex Sanchez and Josep Villanueva
Maintainer: Josep Gregori <[email protected]>
Gregori J, Villarreal L, Mendez O, Sanchez A, Baselga J, Villanueva J, "Batch effects correction improves the sensitivity of significance tests in spectral counting-based comparative discovery proteomics." J Proteomics. 2012 Jul 16;75(13):3938-51. doi: 10.1016/j.jprot.2012.05.005. Epub 2012 May 12.
Computes the SpC matrix where the fixed effects of a blocking factor are subtracted.
batch.neutralize(dat, fbatch, half=TRUE, sqrt.trans=TRUE)
dat |
A SpC matrix with proteins in the rows and samples in the columns. |
fbatch |
A blocking factor of length equal to the number of columns in the expression matrix. |
half |
When FALSE, the contrast coefficients are of the contr.treatment style. When TRUE, the contrast coefficients are of the contr.sum style, which distributes the effect equally among the batch levels instead of leaving a reference level untouched. |
sqrt.trans |
When TRUE the fit is done on the square root transformed SpC matrix. |
A model with intercept and the blocking factor is fitted. The batch effects
corrected SpC matrix is computed by subtracting the estimated effect of the
given blocking factor. When there is no clear reference batch level, the
default option half=TRUE should be preferred.
The square root transformation is known to stabilize the variance of Poisson
distributed counts (with variance equal to the mean). The linear model fitting
gives more accurate errors and p-values on the square root transformed SpC
matrix. Nevertheless, for exploratory data analysis purposes, both the raw and
the square root transformed SpC matrix may give good results.
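The effect of the two contrast styles can be seen on a toy matrix. The following is a minimal base R sketch, not the package implementation: `neutralize_sketch` is a hypothetical helper, the data are invented, and clamping negative values to zero before squaring back is an assumption of this sketch.

```r
# Toy SpC matrix: 3 proteins x 6 samples, two batches of 3 samples (invented data)
spc <- rbind(P1 = c(10, 12, 11, 20, 22, 21),
             P2 = c(5, 6, 7, 5, 6, 7),
             P3 = c(30, 28, 32, 15, 14, 16))
colnames(spc) <- paste0("S", 1:6)
fbatch <- factor(rep(c("b1", "b2"), each = 3))

# Hypothetical helper illustrating the idea; not the msmsEDA code
neutralize_sketch <- function(dat, fbatch, half = TRUE, sqrt.trans = TRUE) {
  y <- if (sqrt.trans) sqrt(dat) else dat
  ctr <- if (half) "contr.sum" else "contr.treatment"
  # Design matrix: intercept plus the blocking factor
  X <- model.matrix(~ fbatch, contrasts.arg = list(fbatch = ctr))
  # One linear model per protein (proteins are the columns of t(y))
  coefs <- lm.fit(X, t(y))$coefficients
  # Estimated batch effect per sample and protein, intercept excluded
  batch.eff <- X[, -1, drop = FALSE] %*% coefs[-1, , drop = FALSE]
  ycorr <- y - t(batch.eff)
  if (sqrt.trans) {
    ycorr[ycorr < 0] <- 0   # assumption: negative values are clamped to zero
    ycorr <- ycorr^2        # back to the SpC scale
  }
  ycorr
}

# contr.sum style: the effect is shared by both batches
corr.half <- neutralize_sketch(spc, fbatch, half = TRUE, sqrt.trans = FALSE)
# contr.treatment style: batch b1 is the untouched reference
corr.ref <- neutralize_sketch(spc, fbatch, half = FALSE, sqrt.trans = FALSE)
corr.half["P1", ]
corr.ref["P1", ]
```

With half=TRUE both batches of P1 move towards their common mean, whereas with half=FALSE only the second batch is shifted.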
The batch effects corrected SpC matrix.
Josep Gregori
The MSnSet class documentation and normalize
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
### Plot the PCA on the two first PC, and colour by treatment level
ftreat <- pData(msnset)$treat
counts.pca(msnset, facs=ftreat, do.plot=TRUE, snms=as.character(ftreat))
### Correct the batch effects
spcm <- exprs(msnset)
fbatch <- pData(msnset)$batch
spcm2 <- batch.neutralize(spcm, fbatch, half=TRUE, sqrt.trans=TRUE)
### Plot the PCA on the two first PC, and colour by treatment level
### to visualize the improvement.
exprs(msnset) <- spcm2
counts.pca(msnset, facs=ftreat, do.plot=TRUE, snms=as.character(ftreat))
### Incidence of the correction
summary(as.vector(spcm-spcm2))
plot(density(as.vector(spcm-spcm2)))
Computes the number of proteins identified, the total spectral counts, and a summary of each sample
count.stats(msnset)
msnset |
A MSnSet with spectral counts in the expression matrix. |
A data frame with one row per sample and the following variables:
proteins |
Number of identified proteins in sample |
counts |
Total spectral counts in sample |
min |
Min spectral counts |
lwh |
Tukey's lower hinge spectral counts |
med |
Median spectral counts |
hgh |
Tukey's upper hinge spectral counts |
max |
Max spectral counts |
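The same kind of per-sample summary can be sketched in base R on a plain matrix, which may help to read the columns above. The toy data are invented, and computing the hinges over all cells (zeros included) is an assumption of this sketch rather than documented behaviour of count.stats.

```r
# Toy SpC matrix: 5 proteins x 2 samples (invented data)
spc <- cbind(S1 = c(0, 3, 5, 12, 1),
             S2 = c(2, 0, 0, 9, 4))

stats <- data.frame(
  proteins = colSums(spc > 0),     # proteins with at least one count
  counts   = colSums(spc),         # total spectral counts
  t(apply(spc, 2, fivenum))        # Tukey's five numbers, one row per sample
)
names(stats)[3:7] <- c("min", "lwh", "med", "hgh", "max")
stats
```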
Josep Gregori
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
res <- count.stats(msnset)
res
Hierarchical clustering of samples in a spectral counts matrix, coloring tree branches according to factor levels.
counts.hc(msnset, do.plot=TRUE, facs=NULL, wait=TRUE)
msnset |
A MSnSet with spectral counts in the expression matrix. |
do.plot |
A logical indicating whether to plot the dendrograms. |
facs |
NULL, or a data frame with factors. See details below. |
wait |
This function may draw different plots, one by given factor in facs. |
The hierarchical clustering is done by means of hclust with default parameters.
If do.plot is TRUE, a dendrogram is plotted for each factor, with branches colored as per factor level. If facs is NULL then the factors are taken from pData(msnset).
Invisibly returns the value obtained from hclust.
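As a rough illustration of clustering samples with hclust, the sketch below works on an invented matrix with two obvious groups. The Euclidean distance between samples is an assumption of this sketch; the distance actually used by counts.hc is not stated in this page.

```r
# Toy SpC matrix: 4 proteins x 4 samples, conditions A and B (invented data)
spc <- cbind(A1 = c(10, 0, 5, 2), A2 = c(11, 1, 4, 2),
             B1 = c(2, 9, 0, 7),  B2 = c(1, 10, 1, 8))

# Samples are in the columns, so the matrix is transposed before dist()
hc <- hclust(dist(t(spc)))

# Cutting the tree in two branches should recover the two conditions
groups <- cutree(hc, k = 2)
groups
```

The returned object can be inspected with str(hc) or drawn with plot(hc).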
Josep Gregori
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
hc <- counts.hc(msnset)
str(hc)
Heatmap showing the clustering of proteins and samples in a matrix of spectral counts
counts.heatmap(msnset,etit=NULL,fac=NULL,to.pdf=FALSE)
msnset |
A MSnSet with spectral counts in the expression matrix. |
etit |
The root name of the pdf file names where the heatmaps are sent. |
fac |
A factor which is used for the column color bar. |
to.pdf |
A logical indicating whether the heatmaps are sent to a pdf file. |
A heatmap of the msnset expression matrix is plotted.
If to.pdf is TRUE two heatmaps are plotted: the first is fitted on an A4
page, the second is plotted with 3mm by row, allocating enough height to make
the rownames readable.
If fac is not NULL then a column color bar will show the levels of the factor.
If to.pdf is TRUE the heatmaps are sent to pdf files whose names
are the concatenation of etit and "-HeatMap.pdf" and "-FullHeatMap.pdf",
otherwise etit has no effect.
No value is returned
Josep Gregori
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
counts.heatmap(msnset,fac = pData(msnset)$treat)
A summary and different plots are given as a result of principal components analysis of a spectral counts matrix.
counts.pca(msnset, facs = NULL, do.plot = TRUE, snms = NULL, wait = TRUE)
msnset |
A MSnSet with spectral counts in the expression matrix. |
do.plot |
A logical indicating whether to plot the PCA PC1/PC2 map. |
facs |
NULL or a data frame with factors. See details below. |
snms |
Character vector with sample short names to be plotted. If NULL then 'Xnn' is plotted where 'nn' is the column number in the dataset. |
wait |
This function may draw different plots, one by given factor in facs. |
The spectral counts matrix is decomposed by means of prcomp.
If do.plot is TRUE, a plot is generated for each factor showing the PC1/PC2 samples map, with samples colored as per factor level. If facs is NULL then the factors are taken from pData(msnset).
Invisibly returns a list with values:
pca |
The return value obtained from prcomp. |
pc.vars |
The percentage of variability corresponding to each principal component. |
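The two returned elements can be approximated with base R alone. The sketch below runs prcomp on an invented matrix and derives the percentage of variance per component from the standard deviations; the centering and scaling options used by counts.pca are not stated in this page, so the prcomp defaults are an assumption here.

```r
# Toy SpC matrix: 4 proteins x 4 samples, conditions A and B (invented data)
spc <- cbind(A1 = c(10, 0, 5, 2), A2 = c(11, 1, 4, 2),
             B1 = c(2, 9, 0, 7),  B2 = c(1, 10, 1, 8))

# Samples must be in the rows for prcomp, hence the transpose
pca <- prcomp(t(spc))

# Percentage of total variance explained by each principal component
pc.vars <- 100 * pca$sdev^2 / sum(pca$sdev^2)
names(pc.vars) <- colnames(pca$x)
round(pc.vars, 1)

# PC1/PC2 coordinates of each sample, the basis of the samples map
scores <- pca$x[, 1:2]
scores
```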
Josep Gregori
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
lst <- counts.pca(msnset)
str(lst)
print(lst$pc.vars[,1:4])
Estimates the residual dispersion of each row of a spectral counts matrix as the ratio of the residual variance to the mean of the mean values by level, for each factor in facs. Different plots are drawn to help in the interpretation of the results.
disp.estimates(msnset, facs=NULL, do.plot=TRUE, etit=NULL, to.pdf=FALSE, wait=TRUE)
msnset |
A MSnSet with spectral counts in the expression matrix. |
facs |
A factor or a data frame with factors. |
do.plot |
A logical indicating whether to produce dispersion distribution plots. |
etit |
Root name of the pdf file where to send the plots. |
to.pdf |
A logical indicating whether a pdf file should be produced. |
wait |
This function draws different plots, two by given factor in facs. |
Estimates the residual dispersion of each protein in the spectral counts matrix, for each factor in facs, and returns the quantiles at c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1) of the distribution of dispersion values for each factor. If facs is NULL the factors are taken from pData(msnset). If do.plot is TRUE this function produces a density plot of dispersion values, and the scatterplot of residual variance vs mean values, in log10 scale. If to.pdf is TRUE, etit provides the root name for the pdf file name, ending with "-DispPlots.pdf". If etit is NULL a default value of "MSMS" is provided. A different set of plots is produced for each factor in facs.
Invisibly returns a matrix with the quantiles at c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1) of the residual dispersion estimates. Each row has the residual dispersion values attributable to each factor in facs.
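The definition of residual dispersion given above can be written out for a single factor. This is a minimal base R sketch under stated assumptions: `residual_dispersion` is a hypothetical helper, the data are invented, and the residual variance is taken as the within-level sum of squares divided by its degrees of freedom.

```r
# Toy SpC matrix: 3 proteins x 6 samples, two treatment levels (invented data)
spc <- rbind(P1 = c(10, 12, 11, 20, 22, 21),
             P2 = c(5, 6, 7, 5, 6, 7),
             P3 = c(30, 28, 32, 15, 14, 16))
treat <- factor(rep(c("A", "B"), each = 3))

# Hypothetical helper: residual variance over the mean of the level means
residual_dispersion <- function(y, f) {
  level.means <- tapply(y, f, mean)
  resid <- y - level.means[as.character(f)]
  res.var <- sum(resid^2) / (length(y) - nlevels(f))
  res.var / mean(level.means)
}

disp <- apply(spc, 1, residual_dispersion, f = treat)
disp

# Quantiles of the dispersion distribution, at the probabilities listed above
quantile(disp, c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1))
```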
Josep Gregori
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
disp.q <- disp.estimates(msnset)
disp.q
In general the spectral counts (SpC) matrix of an LC-MS/MS experiment is a sparse matrix, where most of the features have very low signal. Besides, the features with a low variance to mean ratio (dispersion) will be scarcely informative in a biomarker discovery experiment. Given a minimum number of spectral counts and/or a fraction of the features to be excluded by low dispersion, this function returns a vector of logicals flagging all features with values above the given thresholds.
filter.flags(data,minSpC=2,frac.out=0.4)
data |
A SpC matrix with proteins in the rows and samples in the columns. |
minSpC |
All features with SpC below this threshold will be flagged as FALSE. |
frac.out |
The fraction of features to be excluded, with the lowest observed dispersion. These will be flagged as FALSE. |
The less informative features in a SpC matrix are flagged as FALSE. Those with high enough signal and dispersion are flagged as TRUE. This vector of logicals may be used to filter the SpC matrix used in plots where only the relevant information matters, and where the high number of zeros may distort the plot and hinder its interpretation.
A vector of logical values.
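The flagging logic can be sketched in base R as follows. Several details are assumptions of this sketch and not documented behaviour of filter.flags: the signal of a feature is taken as its maximum SpC over samples, the dispersion as the variance to mean ratio over all samples, and the low dispersion cut-off as the frac.out quantile. `filter_flags_sketch` is a hypothetical name.

```r
# Toy SpC matrix: 5 proteins x 4 samples (invented data)
spc <- rbind(P1 = c(0, 1, 0, 1),
             P2 = c(5, 5, 5, 5),
             P3 = c(2, 10, 3, 12),
             P4 = c(8, 9, 8, 9),
             P5 = c(1, 20, 2, 25))

# Hypothetical helper, not the msmsEDA code
filter_flags_sketch <- function(data, minSpC = 2, frac.out = 0.4) {
  signal <- apply(data, 1, max)                    # assumption: max SpC per feature
  mn <- rowMeans(data)
  disp <- ifelse(mn > 0, apply(data, 1, var) / mn, 0)
  cutoff <- quantile(disp, frac.out)               # lowest fraction to be excluded
  signal >= minSpC & disp > cutoff
}

flags <- filter_flags_sketch(spc, minSpC = 2, frac.out = 0.4)
flags
spc[flags, , drop = FALSE]   # informative features only
```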
Josep Gregori
data(msms.dataset)
fraction <- 0.3
msnset <- pp.msms.data(msms.dataset)
flags <- filter.flags(exprs(msnset),minSpC=2,frac.out=fraction)
cat("\nNumber of informative features:",sum(flags),"\n")
Given a character vector with protein accessions, and a character vector with protein descriptions including gene symbols, returns a character vector with gene symbols whose names are the protein accessions. A character pattern should also be given to match the gene symbols.
gene.table(Accession, Protein, patt = "GN=[A-Z0-9_]*", off = 3)
Accession |
A character vector with protein accessions |
Protein |
A character vector of protein descriptions including gene name symbols. |
patt |
A character pattern to match the gene symbol within the protein description. |
off |
Offset from the first character in the pattern corresponding to the gene symbol. |
NA is inserted where no match is found
A character vector with gene symbols, whose names are the corresponding protein accessions.
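How the pattern and the offset interact can be shown with base R string functions. The accessions and descriptions below are invented for illustration, and the code is a sketch of the idea rather than the body of gene.table.

```r
# Invented accessions and UniProt-like descriptions; the second has no gene symbol
accession <- c("P12345", "Q67890", "O00001")
protein <- c("Serum albumin OS=Homo sapiens GN=ALB PE=1 SV=2",
             "Hypothetical protein OS=Homo sapiens PE=4 SV=1",
             "Keratin, type I OS=Homo sapiens GN=KRT10 PE=1 SV=6")
patt <- "GN=[A-Z0-9_]*"
off <- 3   # skip the three characters of "GN="

m <- regexpr(patt, protein)            # -1 where there is no match
hit <- m > 0
gene <- rep(NA_character_, length(protein))
gene[hit] <- substring(protein[hit],
                       m[hit] + off,
                       m[hit] + attr(m, "match.length")[hit] - 1)
names(gene) <- accession
gene
```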
Josep Gregori
data(pnms)
head(pnms)
gene.smb <- gene.table(pnms$Accession,pnms$Proteins)
head(gene.smb)
A MSnSet with a spectral counts matrix as expression and two factors in the phenoData.
The spectral counts matrix has samples in the columns, and proteins in the rows.
The factors give the treatment and batch conditions of each sample in the dataset.
data(msms.dataset)
A MSnSet
Josep Gregori, Laura Villarreal, Olga Mendez, Alex Sanchez, Jose Baselga, Josep Villanueva, "Batch effects correction improves the sensitivity of significance tests in spectral counting-based comparative discovery proteomics." J Proteomics. 2012 Jul 16;75(13):3938-51. doi: 10.1016/j.jprot.2012.05.005. Epub 2012 May 12.
Laurent Gatto and Kathryn S. Lilley, MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation, Bioinformatics 28(2), 288-289 (2012).
See MSnSet for detail on the class, and the exprs and pData accessors.
data(msms.dataset)
msms.dataset
dim(msms.dataset)
head(exprs(msms.dataset))
head(pData(msms.dataset))
table(pData(msms.dataset)$treat)
table(pData(msms.dataset)$batch)
table(pData(msms.dataset)$treat, pData(msms.dataset)$batch)
A spectral counts matrix is normalized by means of a set of per-sample divisors.
norm.counts(msnset, div)
msnset |
A MSnSet with spectral counts in the expression matrix. |
div |
A vector of divisors by sample |
Each column in the data matrix is divided by the corresponding divisor to obtain the normalized matrix.
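On a plain matrix the column-wise division amounts to a single sweep call. The sketch below uses invented data and divisors scaled to the median of the total SpC by sample.

```r
# Toy SpC matrix: 3 proteins x 3 samples with very different sample sizes (invented data)
spc <- cbind(S1 = c(10, 0, 5), S2 = c(20, 2, 8), S3 = c(40, 4, 16))

tspc <- colSums(spc)          # total SpC by sample: 15, 30, 60
div <- tspc / median(tspc)    # divisors: 0.5, 1, 2

# Divide each column by its divisor
spc.norm <- sweep(spc, 2, div, "/")
colSums(spc.norm)             # every sample now totals the median, 30
```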
A MSnSet object with the normalized spectral counts.
Josep Gregori
The MSnSet class documentation and normalize
data(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
(tspc <- apply(exprs(msnset),2,sum))
div <- tspc/median(tspc)
e.norm <- norm.counts(msnset, div)
apply(exprs(e.norm),2,sum)
e.norm
A data frame with accessions in one column, and protein description including gene symbols in the second column.
data(pnms)
A data frame with 1160 observations on the following 2 variables.
Accession
a character vector with the protein accessions
Proteins
a character vector with a description of each protein, including the gene symbol
data(pnms)
str(pnms)
head(pnms)
Given a MSnSet, possibly subsetted from a bigger dataset, removes the all-zero rows and those with row names (accessions) ending with '-R' in the corresponding expression matrix. NAs are replaced by zeroes, as an NA in a spectral counts matrix usually corresponds to a protein not identified in a sample.
pp.msms.data(msnset)
msnset |
A MSnSet with spectral counts in the expression matrix. |
An '-R' protein corresponds to an artefactual identification.
Rows with all zeros are uninformative and may give rise to errors in the analysis.
An NA is understood as an unidentified protein in a sample.
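The three cleaning rules can be sketched on a plain matrix with base R. The data and accessions are invented, and the real function operates on a MSnSet and also updates its processing information, which this sketch does not attempt.

```r
# Toy SpC matrix with NAs, an '-R' accession and an all-zero row (invented data)
spc <- rbind("P1"   = c(3, NA, 5),
             "P2-R" = c(1, 2, 0),
             "P3"   = c(0, NA, 0),
             "P4"   = c(0, 7, 1))

spc[is.na(spc)] <- 0                       # NA means not identified: zero counts
keep <- rowSums(spc) > 0 &                 # drop all-zero rows
        !grepl("-R$", rownames(spc))       # drop accessions ending with '-R'
spc.clean <- spc[keep, , drop = FALSE]
dim(spc)
dim(spc.clean)
```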
Returns an updated MSnSet object.
Its processingData slot shows that the object has been processed by pp.msms.data.
Josep Gregori
data(msms.dataset)
dim(msms.dataset)
msnset <- pp.msms.data(msms.dataset)
dim(msnset)
Draws bars of height proportional to the sample size of each column in a SpC matrix. The sizes are scaled to the median of the total SpC by sample.
spc.barplots(msms.counts,fact=NULL,...)
msms.counts |
A SpC matrix with proteins in the rows and samples in the columns. |
fact |
NULL or a factor of length equal to the number of columns in the expression matrix. If provided the bars are colored by factor level. |
... |
Extra parameters passed to the plot function. |
Josep Gregori
data(msms.dataset)
spc.barplots(exprs(msms.dataset),fact=pData(msms.dataset)[,1],
      main="UPS1 200fm vs 600fm")
Draws a boxplot for each column (sample) in a SpC matrix. The SpC are previously transformed by log2, with an offset of 0.1. If a factor is provided the boxplots are colored by factor level to better visualize the differences.
spc.boxplots(msms.counts,fact=NULL,minSpC=2,...)
msms.counts |
A SpC matrix with proteins in the rows and samples in the columns. |
minSpC |
All matrix cells with values below this threshold are excluded. |
fact |
NULL or a factor of length equal to the number of columns in the expression matrix. If provided the boxplots are colored by factor level. |
... |
Extra parameters passed to the plot function. |
More informative plots are obtained when excluding the cells with values below
2, the default for minSpC.
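The transformation behind these boxplots can be reproduced by hand. In the sketch below the data are invented, and applying the minSpC threshold before adding the offset and taking log2 is an assumption about the order of operations.

```r
# Toy SpC matrix: 5 proteins x 2 samples (invented data)
spc <- cbind(S1 = c(0, 1, 4, 16, 30),
             S2 = c(2, 0, 8, 3, 64))
minSpC <- 2

# Per sample: keep cells at or above the threshold, then log2 with offset 0.1
log.spc <- lapply(as.data.frame(spc),
                  function(x) log2(x[x >= minSpC] + 0.1))
lengths(log.spc)   # cells kept in each sample

# One box per sample, colored here by a hypothetical two-level factor
boxplot(log.spc, col = c("lavender", "lightblue"),
        ylab = "log2(SpC + 0.1)")
```

Dropping the cells below the threshold keeps the many zero and one count cells from collapsing the boxes towards the lower end of the scale.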
Josep Gregori
data(msms.dataset)
spc.boxplots(exprs(msms.dataset),fact=pData(msms.dataset)[,1],
      main="UPS1 200fm vs 600fm")
Draws superposed density plots, one for each column (sample) in a SpC matrix. The SpC are previously transformed by log2, with an offset of 0.1. If a factor is provided the density curves are colored by factor level to better visualize the differences.
spc.densityplots(msms.counts,fact=NULL,minSpC=2,...)
msms.counts |
A SpC matrix with proteins in the rows and samples in the columns. |
minSpC |
All matrix cells with values below this threshold are excluded. |
fact |
NULL or a factor of length equal to the number of columns in the expression matrix. If provided the density curves are colored by factor level. |
... |
Extra parameters passed to the plot function. |
More informative plots are obtained when excluding the cells with values below
2, the default for minSpC.
Josep Gregori
data(msms.dataset)
spc.densityplots(exprs(msms.dataset),fact=pData(msms.dataset)[,1],
      main="UPS1 200fm vs 600fm")
Given a SpC matrix and a two-level factor, draws a scatterplot with the SpC means of one condition on the x axis and the SpC means of the second condition on the y axis.
spc.scatterplot(msms.counts, treat, trans="log2", minSpC=2, minLFC=1, ...)
msms.counts |
A SpC matrix with proteins in the rows and samples in the columns. |
treat |
A two level factor of length equal to the number of columns in the expression matrix. The two levels represent the conditions to be compared. |
trans |
The transformation made on the means before plotting. One among "log2", "sqrt", or "none". The default is "log2". |
minSpC |
Used as signal threshold. |
minLFC |
Used as size effect threshold. |
... |
Extra parameters passed to the plot function. |
The transformed means are plotted, one condition versus the other. The borders
representing absolute log fold change 1 are drawn as dashed lines.
All features with log fold change equal to or greater than minLFC and with
mean SpC in the most abundant condition equal to or greater than minSpC
are colored in red.
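The coloring rule can be checked on a small example. The sketch below uses invented data and the log2 transformation; the 0.1 offset added to the means before taking logs is an assumption of this sketch, borrowed from the boxplot and density plot functions.

```r
# Toy SpC matrix: 3 proteins x 6 samples, conditions A and B (invented data)
spc <- rbind(P1 = c(4, 5, 6, 10, 11, 12),
             P2 = c(8, 9, 10, 9, 10, 11),
             P3 = c(0, 1, 0, 1, 1, 1))
treat <- factor(rep(c("A", "B"), each = 3))
minSpC <- 2
minLFC <- 1

mA <- rowMeans(spc[, treat == "A"])
mB <- rowMeans(spc[, treat == "B"])
lfc <- log2(mB + 0.1) - log2(mA + 0.1)   # assumption: offset of 0.1 on the means

# Red when both the effect size and the signal thresholds are met
flag <- abs(lfc) >= minLFC & pmax(mA, mB) >= minSpC

plot(log2(mA + 0.1), log2(mB + 0.1), pch = 19,
     col = ifelse(flag, "red", "black"),
     xlab = "log2 mean SpC (A)", ylab = "log2 mean SpC (B)")
abline(a = minLFC, b = 1, lty = 2)    # log fold change +1
abline(a = -minLFC, b = 1, lty = 2)   # log fold change -1
```

P3 exceeds the fold change threshold but stays black, because its mean SpC in the most abundant condition is below minSpC.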
Josep Gregori
data(msms.dataset)
spc.scatterplot(exprs(msms.dataset),treat=pData(msms.dataset)[,1],trans="log2",
      minSpC=2,minLFC=1,main="UPS1 200fm vs 600fm")