Title: | Differential Abundant/Expression Analysis for Metabolomics, Proteomics and single-cell RNA sequencing Data |
---|---|
Description: | This Package utilizes a Semi-parametric Differential Abundance/expression analysis (SDA) method for metabolomics and proteomics data from mass spectrometry as well as single-cell RNA sequencing data. SDA is able to robustly handle non-normally distributed data and provides a clear quantification of the effect size. |
Authors: | Yuntong Li <[email protected]>, Chi Wang <[email protected]>, Li Chen <[email protected]> |
Maintainer: | Yuntong Li <[email protected]> |
License: | GPL |
Version: | 1.27.0 |
Built: | 2024-11-19 04:26:48 UTC |
Source: | https://github.com/bioc/SDAMS |
SDAMS is an R package for differential abundance/expression analysis of metabolomics, proteomics and single-cell RNA sequencing data. The main function for differential abundance/expression analysis is SDA. See the examples at SDA for basic analysis steps. SDAMS considers a two-part model: a logistic regression for the zero proportion and a semi-parametric log-linear model for the non-zero values.
Li, Y., Fan, T.W., Lane, A.N. et al. SDA: a semi-parametric differential abundance analysis method for metabolomics and proteomics data. BMC Bioinformatics 20, 501 (2019).
Two ways to input metabolomics or proteomics data from mass spectrometry, or single-cell RNA sequencing data, as a SummarizedExperiment object:
createSEFromCSV creates a SummarizedExperiment object from csv files;
createSEFromMatrix creates a SummarizedExperiment object from separate matrices: one for feature/gene data and the other for colData.
createSEFromCSV(featurePath, colDataPath, rownames1 = 1, rownames2 = 1, header1 = TRUE, header2 = TRUE)
createSEFromMatrix(feature, colData)
featurePath | path for feature/gene data. |
colDataPath | path for colData. |
rownames1 | indicator for feature/gene data with row names. If NULL, row numbers are automatically generated. |
rownames2 | indicator for colData with row names. If NULL, row numbers are automatically generated. |
header1 | a logical value indicating whether the first row of the feature/gene data contains column names. The default value is TRUE. |
header2 | a logical value indicating whether the first row of colData contains column names. The default value is TRUE. If the colData input is a vector, set to FALSE. |
feature | a matrix with rows being features/genes and columns being subjects/cells. |
colData | column-type data containing information about the subjects/cells. |
An object of SummarizedExperiment class.
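The file layout implied by the default arguments (a header row and row names in the first column for both files) can be sketched with base R alone. This is only an illustration: the toy values and names are hypothetical, and reading the files back with `read.csv` is an assumption about how the arguments are interpreted, not a statement about the package internals.

```r
# Hypothetical toy data: 3 features (rows) x 4 subjects (columns).
feature <- matrix(c(0,   2.5, 1.2, 0,
                    3.1, 0,   0,   4.4,
                    1.0, 1.5, 2.0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("feature", 1:3),
                                  paste0("subject", 1:4)))

# One row per subject, with row names matching the feature column names.
group <- data.frame(grouping = c(0, 0, 1, 1),
                    row.names = colnames(feature))

# Write both tables to temporary csv files (row names go into column 1).
featurePath <- tempfile(fileext = ".csv")
colDataPath <- tempfile(fileext = ".csv")
write.csv(feature, featurePath)
write.csv(group, colDataPath)

# Read them back the way header = TRUE / row names in column 1 would suggest
# (assumption: this mirrors header1/header2 = TRUE and rownames1/rownames2 = 1).
featureBack <- as.matrix(read.csv(featurePath, header = TRUE, row.names = 1))
groupBack <- read.csv(colDataPath, header = TRUE, row.names = 1)

# Subjects must line up: feature columns correspond to colData rows.
stopifnot(identical(colnames(featureBack), rownames(groupBack)))
```

With SDAMS installed, the two temporary files could then be passed on as `createSEFromCSV(featurePath, colDataPath)`, or the in-memory objects as `createSEFromMatrix(feature = feature, colData = group)`.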
SDA input requires an object of SummarizedExperiment class.
# ---------- csv input -------------
directory1 <- system.file("extdata", package = "SDAMS", mustWork = TRUE)
path1 <- file.path(directory1, "ProstateFeature.csv")
directory2 <- system.file("extdata", package = "SDAMS", mustWork = TRUE)
path2 <- file.path(directory2, "ProstateGroup.csv")
exampleSE <- createSEFromCSV(path1, path2)
exampleSE

# ---------- matrix input -------------
set.seed(100)
featureInfo <- matrix(runif(800, -2, 5), ncol = 40)
featureInfo[featureInfo < 0] <- 0
rownames(featureInfo) <- paste("gene", 1:20, sep = '')
colnames(featureInfo) <- paste('cell', 1:40, sep = '')
groupInfo <- data.frame(grouping = matrix(sample(0:1, 40, replace = TRUE), ncol = 1))
rownames(groupInfo) <- colnames(featureInfo)
exampleSE <- createSEFromMatrix(feature = featureInfo, colData = groupInfo)
exampleSE
The SDAMS package provides two types of example datasets: prostate cancer proteomics data from mass spectrometry, and single-cell RNA sequencing data.
The prostate cancer proteomics data come from the human urinary proteome database (http://mosaiques-diagnostics.de/mosaiques-diagnostics/human-urinary-proteom-database). The full dataset has 526 prostate cancer subjects and 1503 healthy subjects, with a total of 5605 proteomic features measured for each subject. For illustration purposes, we took a random 10% subsample of this real data. The example data contain 560 proteomic features for 202 experimental subjects: 49 prostate cancer subjects and 153 healthy subjects. The SDAMS package provides the prostate cancer proteomics data in two different formats.
exampleSumExp.rda is an object of SummarizedExperiment class which stores the information of both proteomic features and experimental subjects.
ProstateFeature.csv contains matrix-like proteomic feature data, and ProstateGroup.csv contains a single column of experimental subject group data.
The single-cell RNA sequencing data are in the form of transcripts per kilobase million (TPM). The count data can be found in the Gene Expression Omnibus (GEO) database under Accession No. GSE29087. A total of 92 single cells were analyzed: 48 mouse embryonic stem (ES) cells and 44 mouse embryonic fibroblasts (MEF). The example data provided by SDAMS contain 10% of the genes, randomly sampled from the raw dataset.
exampleSingleCell.rda is an object of SummarizedExperiment class which stores both the gene expression and the cell information.
data(exampleSumExp)
data(exampleSingleCell)
An object of SummarizedExperiment class.
Siwy, J., Mullen, W., Golovko, I., Franke, J., and Zurbig, P. (2011). Human urinary peptide database for multiple disease biomarker discovery. PROTEOMICS-Clinical Applications 5, 367-374.
Islam, S., Kjallquist, U., Moliner, A., Zajac, P., Fan, J. B., Lonnerberg, P., and Linnarsson, S. (2011). Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Research 21(7), 1160-1167.
#------ load data --------
data(exampleSumExp)
exampleSumExp
feature = assay(exampleSumExp) # access feature data
group = colData(exampleSumExp)$grouping # access grouping information
SDA(exampleSumExp)
This function fits a two-part semi-parametric model to metabolomics, proteomics and single-cell RNA sequencing data. A kernel-smoothed method is applied to estimate the regression coefficients, and a likelihood ratio test is constructed for differential abundance/expression analysis.
SDA(sumExp, VOI = NULL, ...)
sumExp | An object of 'SummarizedExperiment' class. |
VOI | Variable of interest. The default is NULL, which applies when there is only one covariate; otherwise it must be one of the column names in colData. |
... | Additional arguments passed to |
The differential abundance/expression analysis compares metabolomic or proteomic profiles, or gene expression, between different experimental groups. It uses a two-part model: a logistic regression model to characterize the zero proportion and a semi-parametric model to characterize the non-zero values. Let $Y_i$ be the random variable and $\boldsymbol{X}_i$ a vector of covariates. The two-part model has the following form:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \gamma_0 + \boldsymbol{\gamma}^T \boldsymbol{X}_i,$$
$$\log(Y_i) = \boldsymbol{\beta}^T \boldsymbol{X}_i + \varepsilon_i \quad \text{for } Y_i > 0,$$

where $\pi_i = \Pr(Y_i = 0)$. The model parameters $\boldsymbol{\gamma}$ quantify the covariate effects on the fraction of zero values and $\gamma_0$ is the intercept. $\boldsymbol{\beta}$ are the model parameters quantifying the covariate effects on the non-zero values, and $\varepsilon_i$ are independent error terms with a common but completely unspecified density function $f$.

For differential abundance analysis on data from mass spectrometry, $Y_i$ represents the abundance of a certain feature for subject $i$, and $\pi_i$ is the probability of point mass. $\boldsymbol{X}_i = (X_{i1}, X_{i2}, \ldots, X_{iQ})^T$ is a Q-vector of covariates that specifies the treatment conditions applied to subject $i$. The corresponding Q-vectors of model parameters $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_Q)^T$ and $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_Q)^T$ quantify the covariate effects for a certain feature. Hypothesis testing on the effect of the $q$th covariate on a certain feature is performed by assessing $\gamma_q$ and $\beta_q$. Consider the null hypothesis $H_0$: $\gamma_q = 0$ and $\beta_q = 0$ against the alternative hypothesis $H_1$: at least one of the two parameters is non-zero. We also consider the hypotheses for testing $\gamma_q = 0$ and $\beta_q = 0$ individually.

For differential expression analysis on single-cell RNA sequencing data, $Y_i$ represents the expression (TPM value) of a certain gene in the $i$th cell, and $\pi_i$ is the drop-out probability. $\boldsymbol{X}_i = (X_{i1}, \boldsymbol{X}_{i2}^T)^T$ is a vector of covariates with $X_{i1}$ being a binary indicator of the cell population under comparison and $\boldsymbol{X}_{i2}$ being a vector of other covariates, e.g. cell size, and $\boldsymbol{\gamma} = (\gamma_1, \boldsymbol{\gamma}_2^T)^T$ and $\boldsymbol{\beta} = (\beta_1, \boldsymbol{\beta}_2^T)^T$ are model parameters. Hypothesis testing on the effect of different cell subpopulations on a certain gene is performed by assessing $\gamma_1$ and $\beta_1$. For each gene, the likelihood ratio test is performed on the null hypothesis $H_0$: $\gamma_1 = 0$ and $\beta_1 = 0$ against the alternative hypothesis $H_1$: at least one of the two parameters is non-zero. We also consider the hypotheses for testing $\gamma_1 = 0$ and $\beta_1 = 0$ individually.
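The structure of the two-part test described above can be illustrated with a base R sketch for a single simulated feature. This is not the SDA estimator: the non-zero part here uses an ordinary linear model on the log scale with normal errors as a parametric stand-in for the kernel-smoothed semi-parametric fit, and all simulation settings (sample size, effect sizes, the variable names) are hypothetical.

```r
set.seed(1)
n <- 200
x <- rep(0:1, each = n / 2)            # binary group indicator (one covariate)

# Simulate one feature: zero with probability pi_i, positive otherwise.
pi_i <- plogis(-1 + 1.0 * x)           # logit(pi_i) = gamma_0 + gamma * x
zero <- rbinom(n, 1, pi_i)
y <- ifelse(zero == 1, 0, exp(2 + 0.8 * x + rnorm(n)))

# Part 1: logistic regression for Pr(Y = 0), with and without the covariate.
fit_zero1 <- glm(zero ~ x, family = binomial)
fit_zero0 <- glm(zero ~ 1, family = binomial)
lrt_gamma <- fit_zero0$deviance - fit_zero1$deviance

# Part 2: log-linear model on the non-zero values only.
# Assumption: normal errors stand in for the unspecified density f.
pos <- y > 0
log_y <- log(y[pos])
x_pos <- x[pos]
fit_pos1 <- lm(log_y ~ x_pos)
fit_pos0 <- lm(log_y ~ 1)
lrt_beta <- as.numeric(2 * (logLik(fit_pos1) - logLik(fit_pos0)))

# One-part tests (1 df each) and the combined two-part test (2 df).
pv_gamma <- pchisq(lrt_gamma, df = 1, lower.tail = FALSE)
pv_beta  <- pchisq(lrt_beta,  df = 1, lower.tail = FALSE)
pv_2part <- pchisq(lrt_gamma + lrt_beta, df = 2, lower.tail = FALSE)

c(gamma = unname(coef(fit_zero1)["x"]),
  beta = unname(coef(fit_pos1)["x_pos"]),
  pv_gamma = pv_gamma, pv_beta = pv_beta, pv_2part = pv_2part)
```

Because the zero indicator and the non-zero values contribute separate factors to the likelihood, the two likelihood ratio statistics can be added to test both parameters jointly, which is the role the two-part p-value plays in this sketch.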
The p-value is calculated based on an asymptotic chi-squared distribution. To adjust for multiple comparisons across features, the false discovery rate (FDR) q-value is calculated using the qvalue function in R/Bioconductor.
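The idea of adjusting per-feature p-values for multiple comparisons can be sketched in base R. Note that this sketch swaps in the Benjamini-Hochberg adjustment from `p.adjust`, which is a different procedure from the q-value method named above; the p-values and feature names are made up for illustration.

```r
# Hypothetical per-feature p-values (not produced by SDA).
pv <- c(0.0002, 0.004, 0.019, 0.03, 0.2, 0.45, 0.7, 0.95)
names(pv) <- paste0("feature", seq_along(pv))

# Benjamini-Hochberg adjusted p-values (stand-in for q-values).
adj <- p.adjust(pv, method = "BH")

# Features declared significant at a 5% FDR threshold.
significant <- names(pv)[adj < 0.05]
significant
```

In this toy input the third and fourth features have raw p-values below 0.05 but adjusted values above it, which shows why the adjusted values, rather than the raw p-values, should be thresholded when many features are tested.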
A list containing the following components:
gamma | a matrix of point estimators for $\gamma$ |
beta | a matrix of point estimators for $\beta$ |
pv_gamma | a matrix of one-part p-values for $\gamma$ |
pv_beta | a matrix of one-part p-values for $\beta$ |
qv_gamma | a matrix of one-part q-values for $\gamma$ |
qv_beta | a matrix of one-part q-values for $\beta$ |
pv_2part | a matrix of two-part p-values for overall test |
qv_2part | a matrix of two-part q-values for overall test |
feat.names | a vector of feature/gene names |
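A result list with these components could be summarised along the following lines. The object below is a hand-made mock that only borrows the component names listed above; its values are invented, and the one-column shape of each matrix is an assumption made for the single-covariate case.

```r
# Mock result list (values invented; shapes assumed, not taken from SDA output).
res <- list(
  gamma      = matrix(c(0.9, -0.1, 1.4), ncol = 1),
  beta       = matrix(c(0.7, 0.05, -1.1), ncol = 1),
  qv_2part   = matrix(c(0.001, 0.62, 0.03), ncol = 1),
  feat.names = c("feature1", "feature2", "feature3")
)

# Keep features whose two-part q-value is below 0.05.
keep <- res$qv_2part[, 1] < 0.05

# Collect names, effect estimates and q-values in one table, smallest q first.
hits <- data.frame(
  feature  = res$feat.names[keep],
  gamma    = res$gamma[keep, 1],
  beta     = res$beta[keep, 1],
  qv_2part = res$qv_2part[keep, 1]
)
hits <- hits[order(hits$qv_2part), ]
hits
```

Reporting the gamma and beta estimates next to the q-values keeps the effect on the zero proportion and the effect on the non-zero values visible for each selected feature.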
##--------- load data ------------
data(exampleSumExp)
results = SDA(exampleSumExp)

##------ two part q-values -------
results$qv_2part