Title: | Batch Effects Correction with Unknown Subtypes |
---|---|
Description: | High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed "batch effects," and the latter is often modelled by "subtypes." The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity. |
Authors: | Xiangyu Luo <[email protected]>, Yingying Wei <[email protected]> |
Maintainer: | Xiangyu Luo <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.25.0 |
Built: | 2025-01-17 03:55:02 UTC |
Source: | https://github.com/bioc/BUScorrect |
High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed "batch effects," and the latter is often modelled by "subtypes." The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity.
BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying a linear-order computation complexity.
Xiangyu Luo <[email protected]>, Yingying Wei <[email protected]>
Maintainer: Xiangyu Luo <[email protected]>
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
############################################################################### #Generate Simulation Data ############################################################################### rm(list = ls(all = TRUE)) set.seed(123) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 3000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of the kth subtype in the bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(100, 110, 120) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } ############################################################################### #Apply the BUSgibbs Function ############################################################################### set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) summary(BUSfits) ## Not run: #estimated subtypes est_subtypes <- Subtypes(BUSfits) #estimated subtype effects est_subtype_effects <- subtype_effects(BUSfits) #estimated location batch effects est_location_batch_effects <- location_batch_effects(BUSfits) #BIC BIC_val <- BIC(BUSfits) #adjusted expression values adjusted_data <- adjusted_values(BUSfits, Data) #estimate intrinsic gene indicators by using the default postprob_DE_threshold = 0.5 est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold = 0.5) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L) ## End(Not run)
############################################################################### #Generate Simulation Data ############################################################################### rm(list = ls(all = TRUE)) set.seed(123) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 3000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of the kth subtype in the bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(100, 110, 120) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } ############################################################################### #Apply the BUSgibbs Function ############################################################################### set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) summary(BUSfits) ## Not run: #estimated subtypes est_subtypes <- Subtypes(BUSfits) #estimated subtype effects est_subtype_effects <- subtype_effects(BUSfits) #estimated location batch effects est_location_batch_effects <- location_batch_effects(BUSfits) #BIC BIC_val <- BIC(BUSfits) #adjusted expression values adjusted_data <- adjusted_values(BUSfits, Data) #estimate intrinsic gene indicators by using the default postprob_DE_threshold = 0.5 est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold = 0.5) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L) ## End(Not run)
Call the function to obtain the corrected expression values with batch effects removed from the original input data.
adjusted_values(BUSfits, original_data)
adjusted_values(BUSfits, original_data)
BUSfits |
The BUSfits object output by the function |
original_data |
The original gene expression data list with length equal to the batch number. Each of its element is a gene expression matrix for a specific batch, in which each row corresponds to a gene and each column corresponds to a sample. |
adjusted_data |
An R list with length equal to the batch number. Each of its element is a corrected gene expression matrix for a specific batch, in which each row corresponds to a gene and each column corresponds to a sample. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) adjusted_data <- adjusted_values(BUSfits, example_Data)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) adjusted_data <- adjusted_values(BUSfits, example_Data)
Call the function to obtain the baseline expression values for subtype 1 from the output by BUSgibbs.
baseline_expression_values(BUSfits)
baseline_expression_values(BUSfits)
BUSfits |
The BUSfits object output by the function |
est_baseline_expression_values |
The estimated baseline expression values for subtype 1, a vector with length equal to the gene number. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_baseline_expression_values <- baseline_expression_values(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_baseline_expression_values <- baseline_expression_values(BUSfits)
The BIC value can be used to determine the subtype number if it is unknown to the users.
BIC_BUS(BUSfits)
BIC_BUS(BUSfits)
BUSfits |
The BUSfits object from the function |
BIC_val |
The BIC value for the BUS model with the subtype number being n.subtypes, the input subtype number for the BUSgibbs function that generates the BUSfits object. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) BIC_val <- BIC_BUS(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) BIC_val <- BIC_BUS(BUSfits)
A simulated data set for demonstrating how to use the BUScorrect package
## Not run: #This data set is simulated according to the following R code rm(list = ls(all = TRUE)) set.seed(123456) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 2000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of kth subtype in bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(70, 80, 70) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } BUSexample_data <- example_Data ## End(Not run)
## Not run: #This data set is simulated according to the following R code rm(list = ls(all = TRUE)) set.seed(123456) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 2000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of kth subtype in bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(70, 80, 70) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } BUSexample_data <- example_Data ## End(Not run)
The function "BUSgibbs" stands for fitting the Batch effects correction with Unknown Subtypes model (BUS) with the Gibbs Sampler. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity. After correcting the batch effects with BUS, the corrected value can be used for other analysis as if all samples are measured in a single batch. We adopt the Bayesian framework and use the Gibbs sampler to conduct posterior inference for the BUS model.
BUSgibbs(Data, n.subtypes, n.iterations = 500, n.records = floor(n.iterations/2), hyperparameters = c(1, sqrt(5), sqrt(5), 2, 2, 1, 2, 0.005, 1, 3, 10), showIteration = TRUE)
BUSgibbs(Data, n.subtypes, n.iterations = 500, n.records = floor(n.iterations/2), hyperparameters = c(1, sqrt(5), sqrt(5), 2, 2, 1, 2, 0.005, 1, 3, 10), showIteration = TRUE)
Data |
|
n.subtypes |
|
n.iterations |
|
n.records |
The posterior samples in the last |
hyperparameters |
|
showIteration |
If TRUE, the iteration number will be displayed when conducting Gibbs sampler. The default is TRUE. |
Notice that Data
, the input original gene expression values, are organized in the format of an R list with length equal to the batch number. Its b
th element Data[[b]]
is a G
by n_b
matrix, where G
is the gene number and n_b
is the sampler size of batch b
.
L_PosterSamp |
The posterior samples of the intrinsic gene indicators. The return is a G by K-1 by |
Subtypes |
The estimated subtypes, an R list with length equal to the batch number B, in which Subtypes[[b]] is an integer vector showing the subtype indicators of samples in batch b. |
tau_mu_zero |
The estimated tau_mu 0, which is the prior normal distribution's standard deviation of the subtype effects when there is no differential expression. |
p |
The estimated proportion of intrinsic genes. |
pi |
The estimated subtype proportions across batches, a B by K matrix, whose [b,k] element is the estimated proportion of subtype k in the batch b. |
alpha |
The estimated baseline expression levels, a G-dimension vector, whose gth element is the estimated mean gene expression level of gene g in subtype one. |
gamma_PosterSamp |
The posterior samples of location batch effects, a G by B by |
gamma |
The estimated location batch effects, a G by B matrix, where gamma_gb is the “location” batch effect on gene g in the batch b. Note that the first column is zero as the first batch is taken as the reference batch without batch effects. |
sigma_sq_PosterSamp |
The posterior samples of variances, a G by B by |
sigma_sq |
The estimated variance, a G by B matrix, whose [g,b] element is the variance of gene g's expression in the batch b. |
mu_PosterSamp |
The posterior samples of subtype effects, a G by K by |
mu |
The estimated subtype effects, a G by K matrix, whose [g,k] element is the subtype k effect on gene g. Note that the first column is zero as the fist subtype is taken as the baseline subtype. |
BIC |
the BIC value when K = |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
############################################################################### #Generate Simulation Data ############################################################################### rm(list = ls(all = TRUE)) set.seed(123) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 3000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of kth subtype in bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(100, 110, 120) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } ############################################################################### #Apply the BUSgibbs Function ############################################################################### set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE)
############################################################################### #Generate Simulation Data ############################################################################### rm(list = ls(all = TRUE)) set.seed(123) B <- 3 #total number of batches K <- 3 #total number of subtypes G <- 3000 #total number of genes pi <- matrix(NA, B, K) # pi[b,k] stands for the proportion of kth subtype in bth batch pi[1, ] <- c(0.2, 0.3, 0.5) pi[2, ] <- c(0.4, 0.2, 0.4) pi[3, ] <- c(0.3, 0.4, 0.3) #total number of samples in each bacth. n_vec <- rep(NA, B) #n_vec[b] represents the total number of samples in batch b. n_vec <- c(100, 110, 120) #Data list example_Data <- list() #baseline expression level alpha <- rep(2, G) #subtype effect mu <- matrix(NA, G, K) #subtype effect, mu[g,k] stands for the kth-subtype effect of gene g mu[ ,1] <- 0 #the first subtype is taken as the baseline subtype #the subtype effect of subtype 1 is set to zero mu[ ,2] <- c(rep(2,G/20), rep(0,G/20),rep(0, G-G/20-G/20)) mu[ ,3] <- c(rep(0,G/20), rep(2,G/20),rep(0, G-G/20-G/20)) #batch effect gamma <- matrix(NA, B, G) #'location' batch effect of gene g in batch b gamma[1, ] <- 0 #the first batch is taken as the reference batch without batch effects #the batch effect of batch 1 is set to zero gamma[2, ] <- c(rep(3,G/5),rep(2,G/5),rep(1,G/5), rep(2,G/5),rep(3,G/5)) gamma[3, ] <- c(rep(1,G/5),rep(2,G/5),rep(3,G/5), rep(2,G/5),rep(1,G/5)) sigma_square <- matrix(NA, B,G) #sigma_square[b,g] denotes the error variance of gene g in batch b. sigma_square[1,] <- rep(0.1, G) sigma_square[2,] <- rep(0.2, G) sigma_square[3,] <- rep(0.15, G) Z <- list() #subtype indicator. Z[b,j] represents the subtype of sample j in batch b Z[[1]] <- as.integer(c(rep(1,floor(pi[1,1]*n_vec[1])),rep(2,floor(pi[1,2]*n_vec[1])), rep(3,floor(pi[1,3]*n_vec[1])))) Z[[2]] <- as.integer(c(rep(1,floor(pi[2,1]*n_vec[2])),rep(2,floor(pi[2,2]*n_vec[2])), rep(3,floor(pi[2,3]*n_vec[2])))) Z[[3]] <- as.integer(c(rep(1,floor(pi[3,1]*n_vec[3])),rep(2,floor(pi[3,2]*n_vec[3])), rep(3,floor(pi[3,3]*n_vec[3])))) for(b in 1:B){ #generate data num <- n_vec[b] example_Data[[b]] <- sapply(1:num, function(j){ tmp <- alpha + mu[ ,Z[[b]][j]] + gamma[b, ] + rnorm(G, sd = sqrt(sigma_square[b, ])) tmp }) } ############################################################################### #Apply the BUSgibbs Function ############################################################################### set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE)
When Gibbs sampler attains stationary, the distances between multiple chains (with multiple initial values) should be small. The EPSR factors are calculated to help decide the iteration number of Gibbs sampler.
calculate_EPSR_gamma(gamma_PosterSamp_chain1, gamma_PosterSamp_chain2)
calculate_EPSR_gamma(gamma_PosterSamp_chain1, gamma_PosterSamp_chain2)
gamma_PosterSamp_chain1 |
posterior samples of location batch effects from chain 1. |
gamma_PosterSamp_chain2 |
posterior samples of location batch effects from chain 2. |
EPSR_gamma |
The EPSR factors for gamma, a G by B matrix. Note that EPSR_gamma[,1] is a NA vector. gamma[,1] are fixed at zero, so their EPSR factors are not taken into account. |
Xiangyu Luo
#2 batches, 10 genes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) calculate_EPSR_gamma(chain1,chain2)
#2 batches, 10 genes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) calculate_EPSR_gamma(chain1,chain2)
When Gibbs sampler attains stationary, the distances between multiple chains (with multiple initial values) should be small. The EPSR factors are calculated to help decide the iteration number of Gibbs sampler.
calculate_EPSR_mu(mu_PosterSamp_chain1, mu_PosterSamp_chain2)
calculate_EPSR_mu(mu_PosterSamp_chain1, mu_PosterSamp_chain2)
mu_PosterSamp_chain1 |
posterior samples of subtype effects from chain 1. |
mu_PosterSamp_chain2 |
posterior samples of subtype effects from chain 1. |
EPSR_gamma |
The EPSR factors for mu, a G by K matrix. Note that EPSR_mu[,1] is a NA vector. mu[,1] are fixed at zero, so their EPSR factors are not taken into account. |
Xiangyu Luo
#10 genes, 2 subtypes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(10,2,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(10,2,100)) calculate_EPSR_mu(chain1,chain2)
#10 genes, 2 subtypes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(10,2,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(10,2,100)) calculate_EPSR_mu(chain1,chain2)
When Gibbs sampler attains stationary, the distances between multiple chains (with multiple initial values) should be small. The EPSR factors are calculated to help decide the iteration number of Gibbs sampler.
calculate_EPSR_sigma_sq(sigma_sq_PosterSamp_chain1, sigma_sq_PosterSamp_chain2)
calculate_EPSR_sigma_sq(sigma_sq_PosterSamp_chain1, sigma_sq_PosterSamp_chain2)
sigma_sq_PosterSamp_chain1 |
posterior samples of variances from chain 1. |
sigma_sq_PosterSamp_chain2 |
posterior samples of variances from chain 2. |
EPSR_sigma_sq |
The EPSR factors for sigma_sq, a G by B matrix. |
Xiangyu Luo
#2 batches, 10 genes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) calculate_EPSR_sigma_sq(chain1,chain2)
#2 batches, 10 genes, 100 posterior samples per parameter chain1 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) chain2 <- 1+array(rnorm(10*2*100,sd=0.05), dim=c(2,10,100)) calculate_EPSR_sigma_sq(chain1,chain2)
Call the function to estimate the intrinsic gene indicators.
estimate_IG_indicators(BUSfits, postprob_DE_threshold = 0.5)
estimate_IG_indicators(BUSfits, postprob_DE_threshold = 0.5)
BUSfits |
The BUSfits object output by the function |
postprob_DE_threshold |
the threshold to call an intrinsic gene indicator to be one or not according to whether its posterior probability is higher than postprob_DE_threshold or not. The default is 0.5. |
est_L |
the estimated intrinsic gene indicators, a matrix where the rows represent genes and the columns correspond to subtypes k=2,...,K |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select the posterior probability threshold to estimate the intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select the posterior probability threshold to estimate the intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
Call the function to obtain the indices of the intrinsic genes.
IG_index(EstIGindicators)
IG_index(EstIGindicators)
EstIGindicators |
The estimated intrinsic gene indicators. |
intrinsic_gene_indices |
The intrinsic gene indices |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select the posterior probability threshold to estimate intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select the posterior probability threshold to estimate intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
Call the function to obtain the location batch effects from the output by BUSgibbs
.
location_batch_effects(BUSfits)
location_batch_effects(BUSfits)
BUSfits |
The BUSfits object output by the function |
est_location_batch_effects |
The estimated location batch effects, a matrix whose rows represent genes and columns correspond to batches. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_location_batch_effects <- location_batch_effects(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_location_batch_effects <- location_batch_effects(BUSfits)
Calculate the posterior probability of being differentially expressed for genes in subtypes k (k>=2) compared to subtype 1.
postprob_DE(BUSfits)
postprob_DE(BUSfits)
BUSfits |
The BUSfits object output by the function BUSgibbs. |
postprob_DE_matr |
the matrix of posterior probabilities of being differentially expressed |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) postprob_DE(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) postprob_DE(BUSfits)
To control the false discovery rate at the targeted level, call postprob_DE_thr_fun to obtain the threshold for the posterior probability of being differentially expressed.
postprob_DE_thr_fun(BUSfits, fdr_threshold = 0.1)
postprob_DE_thr_fun(BUSfits, fdr_threshold = 0.1)
BUSfits |
The BUSfits object output by the function |
fdr_threshold |
the false discovery rate level we want to control. |
thre0 |
the posterior probability threshold that controls the false discovery rate. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select kappa to estimate intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) #select kappa to estimate intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.1) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L)
Call the function to print the output by BUSgibbs.
## S3 method for class 'BUSfits' print(x, ...)
## S3 method for class 'BUSfits' print(x, ...)
x |
The BUSfits object output by the function |
... |
not used. |
print the results from the output by BUSgibbs.
Xiangyu Luo
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) print(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) print(BUSfits)
BUSgibbs
Call the function to obtain the scale batch effects from the output by BUSgibbs
.
scale_batch_effects(BUSfits)
scale_batch_effects(BUSfits)
BUSfits |
The BUSfits object output by the function |
est_scale_batch_effects |
The estimated scale batch effects, a matrix where rows are genes and columns are batches. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_scale_batch_effects <- scale_batch_effects(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_scale_batch_effects <- scale_batch_effects(BUSfits)
BUSgibbs
Call the function to obtain the subtype effects from the output by BUSgibbs
.
BUSfits |
The BUSfits object output by the function |
est_subtype_effects |
The estimated subtype effects, a matrix where rows are genes and columns are subtypes. |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_subtype_effects <- subtype_effects(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_subtype_effects <- subtype_effects(BUSfits)
BUSgibbs
Call the function to obtain the subtype indicators from the output by BUSgibbs
.
Subtypes(BUSfits)
Subtypes(BUSfits)
BUSfits |
The BUSfits object output by the function |
est_subtypes |
The estimated subtypes, an R list with length equal to the batch number. The |
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_subtypes <- Subtypes(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) est_subtypes <- Subtypes(BUSfits)
BUSgibbs
Call the function to summarize the output object by BUSgibbs
.
## S3 method for class 'BUSfits' summary(object, ...)
## S3 method for class 'BUSfits' summary(object, ...)
object |
The BUSfits object output by the function |
... |
not used. |
summarize the results from the output by BUSgibbs.
Xiangyu Luo
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) summary(BUSfits)
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect set.seed(123) BUSfits <- BUSgibbs(example_Data, n.subtypes = 3, n.iterations = 100, showIteration = FALSE) summary(BUSfits)
Use "heatmap.2" in R package "gplots" to visualize the gene expression data across multiple batches.
visualize_data(Data, title_name="Heatmap", gene_ind_set, color_key_range=seq(-0.5,8.5,1))
visualize_data(Data, title_name="Heatmap", gene_ind_set, color_key_range=seq(-0.5,8.5,1))
Data |
The gene expression data, an R list with length equal to the batch number. Each of its element is a gene expression matrix, where rows are genes and columns represent samples. |
title_name |
The title name of the heatmap. |
gene_ind_set |
The indices of the set of genes the user wants to display in the heatmap. |
color_key_range |
The color range in the color key. |
The values displayed in the heatmap are the raw values in the argument Data
without scaling.
visualize the gene expression data matrix, where one row is a gene and one column represents a sample.
Xiangyu Luo
Xiangyu Luo, Yingying Wei. Batch Effects Correction with Unknown Subtypes. Journal of the American Statistical Association. Accepted.
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect visualize_data(example_Data, title_name="Heatmap", gene_ind_set = 1:20, color_key_range=seq(0,10,2))
rm(list = ls(all = TRUE)) set.seed(123) #a toy example, there are 6 samples and 20 genes in each batch example_Data <- list() #batch 1 example_Data[[1]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) #batch 2 batch2_effect <- c(2,2,2,1,1) example_Data[[2]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch2_effect #batch 3 batch3_effect <- c(3,2,1,1,2) example_Data[[3]] <- rbind(matrix(c(1,1,5,5,10,10, 3,3,7,7,12,12), ncol=6, byrow=TRUE), matrix(c(1,2),nrow=18, ncol=6)) + batch3_effect visualize_data(example_Data, title_name="Heatmap", gene_ind_set = 1:20, color_key_range=seq(0,10,2))