Title: | Impute Covariate Data in RNA Sequencing Studies |
---|---|
Description: | The RNAseqCovarImpute package makes linear model analysis for RNA sequencing read counts compatible with multiple imputation (MI) of missing covariates. A major problem with implementing MI in RNA sequencing studies is that the outcome data must be included in the imputation prediction models to avoid bias. This is difficult in omics studies with high-dimensional data. The first method we developed in the RNAseqCovarImpute package surmounts the problem of high-dimensional outcome data by binning genes into smaller groups to analyze pseudo-independently. This method implements covariate MI in gene expression studies by 1) randomly binning genes into smaller groups, 2) creating M imputed datasets separately within each bin, where the imputation predictor matrix includes all covariates and the log counts per million (CPM) for the genes within each bin, 3) estimating gene expression changes using `limma::voom` followed by `limma::lmFit` functions, separately on each M imputed dataset within each gene bin, 4) un-binning the gene sets and stacking the M sets of model results before applying the `limma::squeezeVar` function to apply a variance shrinking Bayesian procedure to each M set of model results, 5) pooling the results with Rubins’ rules to produce combined coefficients, standard errors, and P-values, and 6) adjusting P-values for multiplicity to account for false discovery rate (FDR). A faster method uses principal component analysis (PCA) to avoid binning genes while still retaining outcome information in the MI models. Binning genes into smaller groups requires that the MI and limma-voom analysis is run many times (typically hundreds). The more computationally efficient MI PCA method implements covariate MI in gene expression studies by 1) performing PCA on the log CPM values for all genes using the Bioconductor `PCAtools` package, 2) creating M imputed datasets where the imputation predictor matrix includes all covariates and the optimum number of PCs to retain (e.g., based on Horn’s parallel analysis or the number of PCs that account for >80% explained variation), 3) conducting the standard limma-voom pipeline with the `voom` followed by `lmFit` followed by `eBayes` functions on each M imputed dataset, 4) pooling the results with Rubins’ rules to produce combined coefficients, standard errors, and P-values, and 5) adjusting P-values for multiplicity to account for false discovery rate (FDR). |
Authors: | Brennan Baker [aut, cre] , Sheela Sathyanarayana [aut], Adam Szpiro [aut], James MacDonald [aut], Alison Paquette [aut] |
Maintainer: | Brennan Baker <[email protected]> |
License: | GPL-3 |
Version: | 1.5.0 |
Built: | 2024-10-31 04:29:21 UTC |
Source: | https://github.com/bioc/RNAseqCovarImpute |
Combines results from each imputed dataset using Rubin's rules.
combine_rubins( DGE, model_results, predictor, covariate = NULL, robust = FALSE, winsor.tail.p = c(0.05, 0.1) )
combine_rubins( DGE, model_results, predictor, covariate = NULL, robust = FALSE, winsor.tail.p = c(0.05, 0.1) )
DGE |
A DGEList object. |
model_results |
Output from limmavoom_imputed_datalist. |
predictor |
Independent variable of interest, in the form of a linear model contrast. Must be a variable in voom_formula. |
covariate |
Arguments passed to limma::squeezeVar. If non-NULL, var.prior will depend on this numeric covariate. Otherwise, var.prior is constant. |
robust |
Arguments passed to limma::squeezeVar. Llogical, should the estimation of df.prior and var.prior be robustified against outlier sample variances? |
winsor.tail.p |
Arguments passed to limma::squeezeVar. Numeric vector of length 1 or 2, giving left and right tail proportions of x to Winsorize. Used only when robust=TRUE. |
Dataframe with one row per gene containing coefficients standard errors, degrees of freedom, t-statistics, P-Values, and adjusted P-values from the limma-voom pipeline.
coef_combined |
combined logFCs across the multiple imputed datasets using Rubin's rules |
SE_P |
pooled standard error across the multiple imputed datasets using Rubin's rules |
SE_P_bayes |
pooled standard error across the multiple imputed datasets using Rubin's rules squeezed to global mean variance trend curve with limma-voom Bayesian procedure |
df |
limma-voom residual degrees of freedom adjusted for Rubin's rules |
df_bayes |
limma-voom residual degrees of freedom adjusted for Rubin's rules and Bayesian procedure |
rubins_t |
t-statistic = coef_combined divided by SE_p |
rubins_t_bayes |
t-statistic = coef_combined divided by SE_p_bayes |
combined_p |
p-value from two-sided t-distribution alpha = 0.05 using rubins_t |
combined_p_bayes |
p-value from two-sided t-distribution alpha = 0.05 using rubins_t_bayes |
combined_p_adj |
false discovery rate (FDR) adjusted combined_p |
combined_p_adj_bayes |
false discovery rate (FDR) adjusted combined_p_bayes |
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
The exact code used to generate these data are found in the Example_Data_for_RNAseqCovarImpute
vignette. In short, example_data
contains 500 rows with data for variables x, y, and z, which
are continuous normally distributed, and a and b, which are binary variables. Missigness
was simulated for all variables other than x such that a complete case analysis would drop
24.2% of participants. example_DGE
contains random count data from the Poisson distribution
for 500 made up genes, ENS1-ENS500
data(example_data)
data(example_data)
example_data
data frame with 500 rows and 5 variables
continuous normally distributed
continuous normally distributed
continuous normally distributed
binary
binary
...
Tibble with 500 rows of data for variables x, y, and z
data(example_data)
data(example_data)
The exact code used to generate these data are found in the Example_Data_for_RNAseqCovarImpute
vignette. In short, example_data
contains 500 rows with data for variables x, y, and z, which
are continuous normally distributed, and a and b, which are binary variables. Missigness
was simulated for all variables other than x such that a complete case analysis would drop
24.2% of participants. example_DGE
contains random count data from the Poisson distribution
for 500 made up genes, ENS1-ENS500
data(example_DGE)
data(example_DGE)
example_DGE
A DGElist with 500 genes and 500 samples
DGElist for 500 made up genes, ENS1-ENS500
data(example_DGE)
data(example_DGE)
Creates gene bins. Input DGE list, sample data, and 'n' number of individuals per genes. By default, number of bins and genes per bin are set so that each bin has approximately 1 gene per 10 individuals in the data.
get_gene_bin_intervals(DGE, data, n = 10)
get_gene_bin_intervals(DGE, data, n = 10)
DGE |
A DGEList object. |
data |
Sample data with one row per sample. Sample row order should match the col order in the DGEList. |
n |
Genes per bin are set so that each bin has approximately 1 gene per n individuals in the data. |
Data frame with one row per gene bin. Columns indicate the start and end positions and the number of genes of each bin.
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
Loops through DGE list using the gene bin intervals from the "get_gene_bin_intervals" function and makes imputed datasets. For instance, if n = 100 and intervals contains 200 gene bin intervals, output will be a list of 200 sets of 100 imputed datasets. Each of the 200 sets are imputed using only the genes in one gene bin.
impute_by_gene_bin(data, intervals, DGE, m, maxit = 10, BPPARAM = bpparam())
impute_by_gene_bin(data, intervals, DGE, m, maxit = 10, BPPARAM = bpparam())
data |
Sample data with one row per sample. Sample row order should match the col order in the DGEList. |
intervals |
Output from get_gene_bin_intervals function. A dataframe where each row contains the start (first col) and end (second col) values for each gene bin interval. |
DGE |
A DGEList object. |
m |
Number of imputed data sets. |
maxit |
Used by mice function. |
BPPARAM |
A BiocParallelParam object |
A list of sets of n imputed datasets, one per gene bin.
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
Loops through the imputed data list (output from "impute_by_gene_bin" function) and runs limma-voom RNA seq analysis.
limmavoom_imputed_data_list( gene_intervals, DGE, imputed_data_list, m, voom_formula, BPPARAM = bpparam() )
limmavoom_imputed_data_list( gene_intervals, DGE, imputed_data_list, m, voom_formula, BPPARAM = bpparam() )
gene_intervals |
Output from get_gene_bin_intervals function. A dataframe where each row contains the start (first col) and end (second col) values for each gene bin interval. |
DGE |
A DGEList object. |
imputed_data_list |
Output from impute_by_gene_bin. |
m |
Number of imputed data sets. |
voom_formula |
Formula for design matrix. |
BPPARAM |
A BiocParallelParam object |
A dataframe with coefficient, standard error, sigma, and residual degrees of freedom values from limma-voom gene expression analysis. One row per gene and one set of values per imputed dataset.
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
data(example_data) data(example_DGE) intervals <- get_gene_bin_intervals(example_DGE, example_data, n = 10) gene_bin_impute <- impute_by_gene_bin(example_data, intervals, example_DGE, m = 2 ) coef_se <- limmavoom_imputed_data_list( gene_intervals = intervals, DGE = example_DGE, imputed_data_list = gene_bin_impute, m = 2, voom_formula = "~x + y + z + a + b" ) final_res <- combine_rubins( DGE = example_DGE, model_results = coef_se, predictor = "x" )
Combines results from each imputed dataset using Rubin's rules.
limmavoom_imputed_data_pca(imp, DGE, voom_formula, BPPARAM = bpparam())
limmavoom_imputed_data_pca(imp, DGE, voom_formula, BPPARAM = bpparam())
imp |
Imputed data from mice (mids object) |
DGE |
A DGEList object. |
voom_formula |
Formula for design matrix. |
BPPARAM |
A BiocParallelParam object |
Dataframe with one row per gene. Columns contain coefficients, standard errors, and p-values from the limma-voom pipeline.
data(example_data) data(example_DGE) pca_data = limma::voom(example_DGE)$E p = PCAtools::pca(pca_data) pcs = p$rotated[,1:78] example_data = cbind(example_data, pcs) imp = mice::mice(example_data, m=3) mi_pca_res = limmavoom_imputed_data_pca( imp = imp, DGE = example_DGE, voom_formula = "~x + y + z + a + b" )
data(example_data) data(example_DGE) pca_data = limma::voom(example_DGE)$E p = PCAtools::pca(pca_data) pcs = p$rotated[,1:78] example_data = cbind(example_data, pcs) imp = mice::mice(example_data, m=3) mi_pca_res = limmavoom_imputed_data_pca( imp = imp, DGE = example_DGE, voom_formula = "~x + y + z + a + b" )