Title: | R package for RIVER (RNA-Informed Variant Effect on Regulation) |
---|---|
Description: | An implementation of a probabilistic modeling framework that jointly analyzes personal genome and transcriptome data to estimate the probability that a variant has regulatory impact in that individual. It is based on a generative model that assumes that genomic annotations, such as the location of a variant with respect to regulatory elements, determine the prior probability that variant is a functional regulatory variant, which is an unobserved variable. The functional regulatory variant status then influences whether nearby genes are likely to display outlier levels of gene expression in that person. See the RIVER website for more information, documentation and examples. |
Authors: | Yungil Kim [aut, cre], Alexis Battle [aut] |
Maintainer: | Yungil Kim <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.31.0 |
Built: | 2024-11-30 03:48:50 UTC |
Source: | https://github.com/bioc/RIVER |
appRIVER
trains RIVER with all instances and computes posterior
probabilities of FR for downstream analyses.
appRIVER(dataInput, pseudoc = 50, theta_init = matrix(c(0.99, 0.01, 0.3, 0.7), nrow = 2), costs = c(100, 10, 1, 0.1, 0.01, 0.001, 1e-04), verbose = FALSE)
appRIVER(dataInput, pseudoc = 50, theta_init = matrix(c(0.99, 0.01, 0.3, 0.7), nrow = 2), costs = c(100, 10, 1, 0.1, 0.01, 0.001, 1e-04), verbose = FALSE)
dataInput |
An object of ExpressionSet class which contains input data required for all functions in RIVER including genomic features, outlier status, and N2 pairs. |
pseudoc |
Pseudo count. |
theta_init |
Initial values of theta. |
costs |
Candidate penalty parameter values for L2-regularized logistic regression. |
verbose |
Logical option for showing extra information on progress. |
A list which contains subject IDs, gene names, posterior probabilities from GAM and RIVER, and estimated parameters from RIVER with used hyperparameters.
To input a vector of candidate penalty values makes
glmnet
faster than to input a single penalty value
Yungil Kim, [email protected]
cv.glmnet
, predict
,
integratedEM
, testPosteriors
,
getData
, exprs
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) postprobs <- appRIVER(dataInput, verbose=TRUE)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) postprobs <- appRIVER(dataInput, verbose=TRUE)
evaRIVER
trains RIVER by holding out a list of individual and gene
pairs having same rare variants for evaluation, computes test
posterior probabilities of FR for 1st individual, and compares
them with outlier status of 2nd individual from the list.
evaRIVER(dataInput, pseudoc = 50, theta_init = matrix(c(0.99, 0.01, 0.3, 0.7), nrow = 2), costs = c(100, 10, 1, 0.1, 0.01, 0.001, 1e-04), verbose = FALSE)
evaRIVER(dataInput, pseudoc = 50, theta_init = matrix(c(0.99, 0.01, 0.3, 0.7), nrow = 2), costs = c(100, 10, 1, 0.1, 0.01, 0.001, 1e-04), verbose = FALSE)
dataInput |
An object of ExpressionSet class which contains input data required for all functions in RIVER including genomic features, outlier status, and N2 pairs. |
pseudoc |
Pseudo count. |
theta_init |
Initial values of theta. |
costs |
Candidate penalty parameter values for L2-regularized logistic regression. |
verbose |
Logical option for showing extra information on progress. |
A list which contains two AUC values from RIVER and GAM, computed specificities and sensitivities from two models, and P-value of comparing the two AUC values.
A vector of candidate penalty values make glmnet
faster than to input a single penalty value
Yungil Kim, [email protected]
cv.glmnet
, predict
,
integratedEM
, testPosteriors
,
getData
, exprs
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) evaROC <- evaRIVER(dataInput, verbose=TRUE)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) evaROC <- evaRIVER(dataInput, verbose=TRUE)
getData
extracts genomic features, z-scores of gene expression, and
N2 pairs having same rare variants from an imported compressed data,
computes outlier status from z-scores given a z-score threshold and
coordinates the genomic features, outlier status, and a list of N2
pairs into ExpressionSet class having standardized data structure.
getData(filename = system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd = 1.5)
getData(filename = system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd = 1.5)
filename |
A full path of a compressed input file that consists of all samples in rows and subject ID, gene name, genomic features, z-scores of corresponding gene expression, and a list of N2 pairs in columns from left to right. In N2 pairs, samples not paired with othersamples have NA while two samples sharing same rare variant near a gene have same pre-assigend integers. |
ZscoreThrd |
A |Z-score| threshold for defining outlier status of samples |
dataInput An object of ExpressionSet class which contains input data required for all functions in RIVER including genomic features, outlier status, and N2 pairs.
Yungil Kim, [email protected]
InputData <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5)
InputData <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5)
getFuncRvFeat
computes posterior probabilities of FR (functionality of
regulatory variant) given G (genomic features) and current estimate
of beta (parameters between FR and G).
getFuncRvFeat(Feat, logistic.model, lambda)
getFuncRvFeat(Feat, logistic.model, lambda)
Feat |
Genomic features (G) |
logistic.model |
Logistic regression model with current estimate of beta |
lambda |
Selected lambda |
probabilities of FR given genomic features, P(FR | G)
Yungil Kim, [email protected]
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logistic.model=logisticAllCV$glmnet.fit, lambda=logisticAllCV$lambda.min)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logistic.model=logisticAllCV$glmnet.fit, lambda=logisticAllCV$lambda.min)
getFuncRvPosteriors
computes posterior probabilities of functionality
of regulatory variant (FR) given genomic features (G) and outlier
status (E) with current estimate of beta (parameters between FR and G)
and theta (parameters between FR and E).
getFuncRvPosteriors(Out, probFuncRvFeat, theta)
getFuncRvPosteriors(Out, probFuncRvFeat, theta)
Out |
Binary values of outlier status (E). |
probFuncRvFeat |
probabilities of FR given genomic features and estimated
beta, P(FR | G, beta), from |
theta |
Current estimate of theta. |
posterior probabilities of FR (P(FR | G, E, beta, theta)) and probable status of FR.
Yungil Kim, [email protected]
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init<-matrix(c(.99, .01, .3, .7), nrow=2) costs<-c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha=0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init<-matrix(c(.99, .01, .3, .7), nrow=2) costs<-c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha=0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init)
integratedEM
iteratively executes e-step and m-step until it
converges. This is a main function of RIVER.
integratedEM(Feat, Out, lambda, logistic.init, pseudoc, theta.init, costs, verbose = FALSE)
integratedEM(Feat, Out, lambda, logistic.init, pseudoc, theta.init, costs, verbose = FALSE)
Feat |
Genomic features (G). |
Out |
Binary values of outlier status (E). |
lambda |
Selected lambda. |
logistic.init |
Smart initialization of beta (parameters between FR and G) from estimate of beta with E via multivariate logistic regression. |
pseudoc |
Pseudo count. |
theta.init |
Initial theta (parameters between FR (functionality of regulatory variant) and E). |
costs |
Candidate penalty parameter values for L2-regularization within logistic regression. |
verbose |
Logical option for showing extra information on progress. |
Best estimate of beta and theta, final multivariate logistic regression model, and posterior probabilities of FR.
Yungil Kim, [email protected]
getFuncRvFeat
, getFuncRvPosteriors
,
mleTheta
, mleBeta
,
cv.glmnet
,
and https://github.com/ipw012/RIVER
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init=matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) emModelAll <- integratedEM(Feat, Out, lambda=logisticAllCV$lambda.min, logistic.init=logisticAllCV$glmnet.fit, pseudoc=50, theta=theta.init, costs, verbose=FALSE)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init=matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) emModelAll <- integratedEM(Feat, Out, lambda=logisticAllCV$lambda.min, logistic.init=logisticAllCV$glmnet.fit, pseudoc=50, theta=theta.init, costs, verbose=FALSE)
mleBeta
computes maximum likelihoood estimate of beta (parameters
between FR (functionality of regulatory variant) and G (genomic
annotations); multivariate logistic regression).
mleBeta(Feat, FuncRv, costs)
mleBeta(Feat, FuncRv, costs)
Feat |
Genomic features (G) |
FuncRv |
Soft-assignments of FR from E-step |
costs |
Candidate penalty parameter values for L2-regularization within logistic regression. |
MLE of beta
To input a vector of candidate penalty values makes
glmnet
faster than to input a single penalty value
Yungil Kim, [email protected]
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha=0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init) logistic.cur <- mleBeta(Feat, FuncRv=posteriors$posterior, costs)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha=0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init) logistic.cur <- mleBeta(Feat, FuncRv=posteriors$posterior, costs)
mleTheta
computes maximum likelihoood estimate of theta (parameters
between FR (functionality of regulatory variant) and E (outlier
status); Naive-Bayes).
mleTheta(Out, FuncRv, pseudocount)
mleTheta(Out, FuncRv, pseudocount)
Out |
Binary values of outlier status (E). |
FuncRv |
Soft-assignments of FR from E-step |
pseudocount |
Pseudo count. |
MLE of theta
Yungil Kim, [email protected]
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init) thetaCur <- mleTheta(Out, FuncRv=posteriors$posterior, pseudoc=50)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) probFuncRvFeat <- getFuncRvFeat(Feat, logisticAllCV$glmnet.fit, logisticAllCV$lambda.min) posteriors <- getFuncRvPosteriors(Out, probFuncRvFeat, theta=theta.init) thetaCur <- mleTheta(Out, FuncRv=posteriors$posterior, pseudoc=50)
plotPosteriors
draws scatter plots of posterior probabilities from
both RIVER GAM (genomic annotation model) in terms of outlier
status.
plotPosteriors(postprobs, outliers)
plotPosteriors(postprobs, outliers)
postprobs |
Output of |
outliers |
Outlier status of examples |
A figure of posteriors from RIVER (y-axis) and GAM (x-axis) models for ouliters and non-outliers separately
Yungil Kim, [email protected]
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) postprobs <- appRIVER(dataInput) plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) postprobs <- appRIVER(dataInput) plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)
The RIVER package provides two applicable functions of RIVER to actual datasets including genomic features and outlier status and a list of required functions for RIVER: evalRIVER, appRIVER, getFRPosteriors, mleTheta, mleBeta, getFRgivenG, testPosteriors, and integratedEM.
The getFuncRvFeat
computes posterior probabilities
of FR (functionality of regulatory variant) given genomic features (G)
and current estimate of beta (parameters between FR and G).
The getFuncRvPosteriors
computes posterior
probabilities of FR (functionality of regulatory variant) given genomic
features (G) and outlier status (E) with current estimate of beta
(parameters between FR and G) and theta (parameters between FR and
E).
The mleBeta
computes maximum likelihoood estimate of
beta (parameters between FR and G; multivariate logistic regression).
The mleTheta
computes maximum likelihoood estimate
of theta (parameters between FR and E; Naive-Bayes).
The testPosteriors
computes posterior
probabilities of FR (functionality of regulatory variant) given G
(genomic annotations) and E (outlier status) with estimate of beta
(parameters between FR and G) and theta (parameters between FR and E).
The integratedEM
iteratively executes e-step
and m-step until it converges. This is a main function of RIVER.
The plotPosteriors
draw scatter plots of
posterior probabilities from both RIVER GAM in terms of outliers status.
The evaRIVER
trains RIVER by holding out a list of
individual and gene pairs having same rare variants for evaluation,
computes test posterior probabilities of FR (functionality of regulatory
variant) for 1st individual, and compares them with outlier status of
2nd individual from the list.
The appRIVER
trains RIVER with all instances and
computes posterior probabilities of FR (functionality of regulatory
variant) for downstream analyses.
https://github.com/ipw012/RIVER
testPosteriors
computes posterior probabilities of FR (functionality
of regulatory variant) given G (genomic annotations) and E (outlier
status) with estimate of beta (parameters between FR and G) and
theta (parameters between FR and E).
testPosteriors(Feat, Out, emModel)
testPosteriors(Feat, Out, emModel)
Feat |
Genomic features (G) |
Out |
Binary values of outlier status (E). |
emModel |
Estimated parameters including beta and theta via EM and selected lambdas |
test posterior probabilities of FR given new outlier status (E) and genomic features (G), P(FR | G, E, beta, theta), and probable status of FR.
Yungil Kim, [email protected]
getFuncRvFeat
and getFuncRvPosteriors
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) emModelAll <- integratedEM(Feat, Out, logisticAllCV$lambda.min, logisticAllCV$glmnet.fit, pseudoc=50, theta.init, costs, verbose=FALSE) trainedpost <- testPosteriors(Feat, Out, emModel=emModelAll)
dataInput <- getData(filename=system.file("extdata", "simulation_RIVER.gz", package = "RIVER"), ZscoreThrd=1.5) Feat <- scale(t(Biobase::exprs(dataInput))) # genomic features (G) Out <- as.vector(as.numeric(unlist(dataInput$Outlier))-1) # outlier status (E) theta.init <- matrix(c(.99, .01, .3, .7), nrow=2) costs <- c(100, 10, 1, .1, .01, 1e-3, 1e-4) logisticAllCV <- glmnet::cv.glmnet(Feat, Out, lambda=costs, family="binomial", alpha = 0, nfolds=10) emModelAll <- integratedEM(Feat, Out, logisticAllCV$lambda.min, logisticAllCV$glmnet.fit, pseudoc=50, theta.init, costs, verbose=FALSE) trainedpost <- testPosteriors(Feat, Out, emModel=emModelAll)