Title: | GEMINI: Variational inference approach to infer genetic interactions from pairwise CRISPR screens |
---|---|
Description: | GEMINI uses log-fold changes to model sample-dependent and independent effects, and uses a variational Bayes approach to infer these effects. The inferred effects are used to score and identify genetic interactions, such as lethality and recovery. More details can be found in Zamanighomi et al. 2019 (in press). |
Authors: | Mahdi Zamanighomi [aut], Sidharth Jain [aut, cre] |
Maintainer: | Sidharth Jain <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-10-30 07:59:39 UTC |
Source: | https://github.com/bioc/gemini |
Median normalization
.median_normalize(counts, CONSTANT = 32)
.median_normalize(counts, CONSTANT = 32)
counts |
a counts matrix derived from Input |
CONSTANT |
a pseudo-count to use (default = 32) |
median-normalized counts object
## Not run: .median_normalize(rnorm(n = 1000)) ## End(Not run)
## Not run: .median_normalize(rnorm(n = 1000)) ## End(Not run)
Compound pipe
lhs , rhs
|
An object modified in place, and a function to apply to it |
a <- 1:5 a %<>% sum() # a is modified in place
a <- 1:5 a %<>% sum() # a is modified in place
Pipe
lhs , rhs
|
An object, and a function to apply to it |
a <- 1:5 b <- a %>% sum()
a <- 1:5 b <- a %>% sum()
Counts matrix from the Big Papi SynLet dataset. The Big Papi dataset was published in Najm et al. 2018 (doi: 10.1038/nbt.4048). The counts were acquired from Supplementary Table 3.
counts
counts
An object of class matrix
(inherits from array
) with 9216 rows and 16 columns.
https://www.nature.com/articles/nbt.4048
A function to visualize the results of GEMINI over the raw data.
gemini_boxplot( Model, g, h, nc_gene = NULL, sample, show_inference = TRUE, color_x = FALSE, identify_guides = FALSE )
gemini_boxplot( Model, g, h, nc_gene = NULL, sample, show_inference = TRUE, color_x = FALSE, identify_guides = FALSE )
Model |
a gemini.model object |
g |
a character naming a gene to visualize |
h |
a character naming another gene to visualize |
nc_gene |
a character naming the gene to use as a negative control,
to be paired with each individual g and h. Defaults to |
sample |
a character naming the sample to visualize |
show_inference |
a logical indicating whether to show the inferred individual or combined values for each gene/gene pair (default TRUE) |
color_x |
a logical indicating whether to visualize the sample-independent effects for each individual guide or guide pair (default FALSE) |
identify_guides |
a logical indicating whether to identify guides with unique colors and shapes (default FALSE) |
Raw LFC data is plotted for each gene combination ('g'-'nc_gene', 'h'-'nc_gene', 'g'-'h') in a standard boxplot.
Horizontal green line segments are plotted over the box plots indicating the individual gene effects or
the inferred total effect of a particular gene combination. Each guide
pair can be colored based on the inferred sample independent effects
,
, and
. Additionally, colors and shapes
can be used to distinguish unique guides targeting gene g and h, respectively.
a ggplot2 object
data("Model", package = "gemini") gemini_boxplot(Model, g = "BRCA2", h = "PARP1", nc_gene = "CD81", sample = "A549", show_inference = TRUE, color_x = FALSE, identify_guides = FALSE)
data("Model", package = "gemini") gemini_boxplot(Model, g = "BRCA2", h = "PARP1", nc_gene = "CD81", sample = "A549", show_inference = TRUE, color_x = FALSE, identify_guides = FALSE)
Given a gemini.input object, calculates log-fold change from counts
.
gemini_calculate_lfc(Input, counts = "counts", sample.column.name = "samplename", normalize = TRUE, CONSTANT = 32)
gemini_calculate_lfc(Input, counts = "counts", sample.column.name = "samplename", normalize = TRUE, CONSTANT = 32)
Input |
a gemini.input object containing an object named |
counts |
a character indicating the name of a matrix of counts within |
sample.column.name |
a character or integer indicating which column of |
normalize |
a logical indicating if counts should be median-normalized, see Details (default = TRUE) |
CONSTANT |
a numeric indicating a constant value that shifts counts to reduce outliers, see Details (default = 32). |
See Methods from Zamanighomi et al. 2019 for a comprehensive description of the calculation of log-fold change, normalization, and count processing.
If multiple early time-points are provided for a given sample, they are treated as replicates and averaged, and used to compute log-fold change against any specified late time points.
If a sample has a specific early-time point, these are matched as long as the sample names are identical between
the early and late timepoints in sample.column.name
a gemini object identical to Input
that also contains new objects called LFC
and sample.annot
.
data("Input", package = "gemini") Input <- gemini_calculate_lfc(Input) head(Input$LFC)
data("Input", package = "gemini") Input <- gemini_calculate_lfc(Input) head(Input$LFC)
Creates a gemini.input object from a counts matrix with given annotations.
gemini_create_input( counts.matrix, sample.replicate.annotation = NULL, guide.annotation = NULL, samplesAreColumns = TRUE, sample.column.name = "samplename", gene.column.names = NULL, ETP.column = 1, LTP.column = NULL, verbose = FALSE )
gemini_create_input( counts.matrix, sample.replicate.annotation = NULL, guide.annotation = NULL, samplesAreColumns = TRUE, sample.column.name = "samplename", gene.column.names = NULL, ETP.column = 1, LTP.column = NULL, verbose = FALSE )
counts.matrix |
a matrix of counts with rownames corresponding to features (e.g. guides) and colnames corresponding to samples. |
sample.replicate.annotation |
a data.frame of annotations for each sample/replicate pair.
Note that at least one column in |
guide.annotation |
a data.frame of annotations for each guide. Note that at least one column in |
samplesAreColumns |
a logical indicating if samples are on the columns or rows of counts.matrix. (default = TRUE) |
sample.column.name |
a character or integer indicating which column of |
gene.column.names |
a character or integer vector of length(2) indicating which columns of |
ETP.column |
a character or integer vector indicating which column(s) of |
LTP.column |
a character or integer vector indicating which column(s) is the later time-point of the screen (i.e. day21, post-treatment, etc.). Defaults to |
verbose |
Verbosity (default FALSE) |
This function initializes a gemini.input object from a counts matrix. There are a few key assumptions made in the input format.
The counts matrix is regular.
The counts matrix structure is in accordance with the samplesAreColumns
parameter.
The first column of sample.replicate.annotation
matches with the existing dimension names of the counts matrix.
The first column of guide.annotations
matches with the existing dimension names of the counts matrix.
sample.column.name
must specify a column in sample.replicate.annotation
(either by name or index) that describes unique samples.
gene.column.names
must specify two columns in sample.replicate.annotation
(either by name or index) that describe genes.
a gemini.input object
data("counts", package = "gemini") data("sample.replicate.annotation", package = "gemini") data("guide.annotation", package = "gemini") Input <- gemini_create_input( counts.matrix = counts, sample.replicate.annotation = sample.replicate.annotation, guide.annotation = guide.annotation, sample.column.name = "samplename", gene.column.names = c("U6.gene", "H1.gene") )
data("counts", package = "gemini") data("sample.replicate.annotation", package = "gemini") data("guide.annotation", package = "gemini") Input <- gemini_create_input( counts.matrix = counts, sample.replicate.annotation = sample.replicate.annotation, guide.annotation = guide.annotation, sample.column.name = "samplename", gene.column.names = c("U6.gene", "H1.gene") )
Estimate the posterior using a variational inference technique. Inference is performed through an iterative process until convergence.
gemini_inference( Model, n_iterations = 20, mean_x = 1, sd_x = 1, mean_xx = 1, sd_xx = 1, mean_y = 0, sd_y = 10, mean_s = 0, sd_s = 10, threshold = 0.001, cores = 1, force_results = FALSE, verbose = FALSE, save_iterations = FALSE )
gemini_inference( Model, n_iterations = 20, mean_x = 1, sd_x = 1, mean_xx = 1, sd_xx = 1, mean_y = 0, sd_y = 10, mean_s = 0, sd_s = 10, threshold = 0.001, cores = 1, force_results = FALSE, verbose = FALSE, save_iterations = FALSE )
Model |
an object of class gemini.model |
n_iterations |
a numeric indicating the maximum number of iterations (default=20). |
mean_x |
a numeric indicating prior mean of x (default=1). |
sd_x |
a numeric indicating prior sd of x (default=1). |
mean_xx |
a numeric indicating prior mean of xx (default=1). |
sd_xx |
a numeric indicating prior sd of xx (default=1) |
mean_y |
a numeric indicating prior mean of y (default=0) |
sd_y |
a numeric indicating prior sd of y (default=10). |
mean_s |
a numeric indicating prior mean of s(default=0) |
sd_s |
a numeric indicating prior sd of s (default=10) |
threshold |
a numeric indicating the threshold of change in MAE at which to stop the iterative process (default=0.001). |
cores |
a numeric indicating the number of cores to use. See details in gemini_parallel. (default=1) |
force_results |
a logical indicating if the CAVI algorithm should be halted if non-convergence is detected. (default=FALSE) |
verbose |
default FALSE |
save_iterations |
for especially large libraries that require long computations, saves the latest iteration of each update. default FALSE |
GEMINI uses the following parameters, which are described in Zamanighomi et al. and translated here for clarity:
y: individual gene effect
s: combination effect
x: screen variation corresponding to individual guides
xx: screen variation corresponding to paired guides
Default parameters may need to be changed if convergence is not achieved. See README for more details.
a gemini.model object with estimated posteriors
data("Model", package = "gemini") Model %<>% gemini_inference(verbose = FALSE, n_iterations = 1) # iterations set to 1 for testing
data("Model", package = "gemini") Model %<>% gemini_inference(verbose = FALSE, n_iterations = 1) # iterations set to 1 for testing
Creates a gemini.model object given Input data and initialization parameters.
gemini_initialize( Input, guide.pair.annot = "guide.pair.annot", replicate.map = "replicate.map", sample.column.name = "samplename", LFC.name = "LFC", nc_gene, CONSTANT = 32, concordance = 1, prior_shape = 0.5, pattern_join = ";", pattern_split = ";", window = NULL, monotonize = FALSE, verbose = FALSE, cores = 1, save = NULL )
gemini_initialize( Input, guide.pair.annot = "guide.pair.annot", replicate.map = "replicate.map", sample.column.name = "samplename", LFC.name = "LFC", nc_gene, CONSTANT = 32, concordance = 1, prior_shape = 0.5, pattern_join = ";", pattern_split = ";", window = NULL, monotonize = FALSE, verbose = FALSE, cores = 1, save = NULL )
Input |
an object of class gemini.input |
guide.pair.annot |
the name of an object within |
replicate.map |
the name of an object within |
sample.column.name |
a character or integer indicating which column of |
LFC.name |
a character indicating an object within Input to treat as LFC. By default, "LFC" is used,
which is the output of |
nc_gene |
a character naming the gene to use as a negative control. See details for more. |
CONSTANT |
a numeric indicating a constant value that shifts counts to reduce outliers, see Details (default = 32). |
concordance |
a numeric value to initialize x (default = 1) |
prior_shape |
shape parameter of Gamma distribution used to model the variation in the
data in |
pattern_join |
a character to join the gene combinations found in
|
pattern_split |
a character to split the guide combinations found in
|
window |
a numeric if window smoothing should be done on initialized tau values, otherwise NULL (default) for no window smoothing |
monotonize |
a logical specifying whether the variance should be monotonically increasing. (default FALSE) |
verbose |
default FALSE |
cores |
numeric indicating the number of cores to use. See details in |
save |
a character (file path) or logical indicating whether the initialized model should be saved. |
guide.pair.annot is created with the following format:
1st column contains guide pairs, joined by pattern_split
.
2nd column contains the name of gene mapping to the first guide in the 1st column.
3rd column contains the name of gene mapping to the second guide in the 1st column.
Additional columns may be appended to guide.pair.annot, but the first three columns must be defined as above.
Use of negative control As described in Zamanighomi et al., it is highly recommended to specify a negative control. In the event that no negative control is specified, GEMINI will use the median LFC of all guide pairs targeting a gene to initialize that gene's effect. While this may be reasonable in large all-by-all screens, it is not recommended in smaller screens or some-by-some screens. As a result, when possible, be sure to specify a negative control.
a Model object of class gemini.model
data("Input", package = "gemini") Model <- gemini_initialize(Input, nc_gene = "CD81")
data("Input", package = "gemini") Model <- gemini_initialize(Input, nc_gene = "CD81")
Notes about parallelization in combinatorial CRISPR analysis
To improve efficiency and scalability, GEMINI employs parallelization, enabling a rapid initialization and update routine.
Parallelization was implemented using the R pbmclapply
package, and specifically through the pbmclapply
function.
As pbmclapply (and it's parent function, mclapply
) relies on forking, parallelization is currently limited to Unix-like (Linux flavors and macOS) machines, but may be extended to other OS in later versions through socket parallelization. In the event that GEMINI is used on a Windows machine, all computations are performed in serial.
GEMINI enables parallelization of both initialization and inference. For initialization, parallelization is used to quickly hash the data. For the inference procedure, parallelization is used to speed up the CAVI approach by performing updates independently across cores.
To note, there is usually a trade-off in terms of number of cores and shared resources/memory transfer. See inst/figs/gemini-benchmarking.png for a visual depiction.
Also, while most functions in GEMINI have been parallelized, initialization of tau, update of x (group 3), and updates of y are performed in serial. As such, especially in large libraries, tau initialization and x (group 3) updates may take some time. Updating y is usually fast, as the number of genes tends to be the smallest in parameter space. However, these functions may be parallelized in the future as well.
Noticed that in some cases, after using multicore processing through pbmclapply, warnings have been produced: "In selectChildren(pids[!fin], -1) : cannot wait for child ... as it does not exist" "In parallel::mccollect(...) : 1 parallel job did not deliver a result" R.version=3.6.0 These warnings, although they appear menacing, are in fact harmless. See: http://r.789695.n4.nabble.com/Strange-error-messages-from-parallel-mcparallel-family-under-3-6-0-td4756875.html#a4756939
Plots the average MAE of gemini model at each iteration step.
gemini_plot_mae(Model)
gemini_plot_mae(Model)
Model |
a gemini.model object |
a ggplot2 object
data("Model", package = "gemini") gemini_plot_mae(Model)
data("Model", package = "gemini") gemini_plot_mae(Model)
This is an internal function to GEMINI, allowing for data cleanup and preprocessing before a Model object is created. This removes any gene pairs targeting the same gene twice, and removes any empty samples/replicates.
gemini_prepare_input(Input, gene.columns, sample.col.name = "samplename")
gemini_prepare_input(Input, gene.columns, sample.col.name = "samplename")
Input |
An object of class gemini.input |
gene.columns |
a character vector of length(2) |
sample.col.name |
a character indicating the name of the sample column (default = "samplename") |
a (prepared) gemini.input object
data("Input", package = "gemini") Input %<>% gemini_prepare_input(gene.columns = c("U6.gene", "H1.gene"))
data("Input", package = "gemini") Input %<>% gemini_prepare_input(gene.columns = c("U6.gene", "H1.gene"))
Score genetic interactions from a gemini.model and produce a gemini.score object, and generate p-values and FDRs if negative control
pairs (nc_pairs
) are available.
gemini_score( Model, pc_gene = NA, pc_threshold = NULL, pc_weight = 0.5, nc_pairs = NA )
gemini_score( Model, pc_gene = NA, pc_threshold = NULL, pc_weight = 0.5, nc_pairs = NA )
Model |
an object of class gemini.model |
pc_gene |
a character vector of any length naming genes to use as positive controls |
pc_threshold |
a numeric value to indicate the LFC corresponding to a positive control, if no pc_gene is specified. |
pc_weight |
a weighting applied to the positive control
( |
nc_pairs |
a set of non-interacting gene pairs to define statistical significance. |
An object of class gemini.score containing score values for
strong interactions and sensitive lethality and recovery, and if nc_pairs
is specified, statistical significance for each scoring type.
data("Model", package = "gemini") Score <- gemini_score(Model, pc_gene = "EEF2")
data("Model", package = "gemini") Score <- gemini_score(Model, pc_gene = "EEF2")
Guide and gene annotations for the Big Papi SynLet dataset. The Big Papi dataset was published in Najm et al. 2018 (doi: 10.1038/nbt.4048). The guide and gene annotations were acquired from Supplementary Table 2.
guide.annotation
guide.annotation
An object of class data.frame
with 9216 rows and 7 columns.
https://www.nature.com/articles/nbt.4048
Initialize s values using initialized y values and data from Input
initialize_s(Model, cores = 1, verbose = FALSE)
initialize_s(Model, cores = 1, verbose = FALSE)
Model |
an object of class gemini.model |
cores |
a numeric indicating the number of cores to use. See |
verbose |
default FALSE |
a Model object of class gemini.model including new slots for s values
data("Model", package = "gemini") Model <- initialize_s(Model)
data("Model", package = "gemini") Model <- initialize_s(Model)
Initialize all tau values based on the observed replicate variance.
initialize_tau( Model, CONSTANT = 32, prior_shape = 0.5, window = NULL, monotonize = FALSE, verbose = FALSE )
initialize_tau( Model, CONSTANT = 32, prior_shape = 0.5, window = NULL, monotonize = FALSE, verbose = FALSE )
Model |
an object of class gemini.model |
CONSTANT |
a numeric indicating a constant value that shifts counts to reduce outliers (default = 32). |
prior_shape |
shape parameter of Gamma distribution used to model the variation in the data in |
window |
numeric if window smoothing should be done on initialized tau values, otherwise NULL (default) for no window smoothing |
monotonize |
logical specifying whether the variance should be monotonically increasing (default FALSE) |
verbose |
default FALSE |
a Model object of class gemini.model including new slots for alpha and beta values
data("Model", package = "gemini") Model <- initialize_tau(Model, CONSTANT = 32, prior_shape = 0.5)
data("Model", package = "gemini") Model <- initialize_tau(Model, CONSTANT = 32, prior_shape = 0.5)
Initialize all x values to the value of concordance
initialize_x(Model, concordance = 1, cores = 1, verbose = FALSE)
initialize_x(Model, concordance = 1, cores = 1, verbose = FALSE)
Model |
a Model object of class gemini.model |
concordance |
a numeric value to initialize x |
cores |
a numeric indicating the number of cores to use. See |
verbose |
default FALSE |
a Model object of class gemini.model including new slots for x values and internal-use hashes
As there is much hashing involved in this function, this tends to be computationally intensive. As such, we have enabled parallelization of most hash steps, but this may still be rate-limited by the amount of memory consumed.
data("Model", package = "gemini") Model %<>% initialize_x()
data("Model", package = "gemini") Model %<>% initialize_x()
Initialize all y values from guide pairs including a negative control.
initialize_y(Model, verbose = FALSE, cores = 1)
initialize_y(Model, verbose = FALSE, cores = 1)
Model |
an object of class gemini.model |
verbose |
default FALSE |
cores |
a numeric indicating the number of cores to use. See details in |
a Model object of class gemini.model including new slots for y values
data("Model", package = "gemini") Model %<>% initialize_y()
data("Model", package = "gemini") Model %<>% initialize_y()
A gemini.input object created from the Big Papi SynLet dataset.
The Big Papi dataset was published in Najm et al. 2018 (doi: 10.1038/nbt.4048).
The Input object here is created from the data in counts,
guide.annotation, and sample.replicate.annotation
using the gemini_create_input
function.
Input
Input
An object of class list
(inherits from gemini.input
, gemini.input
) of length 6.
https://www.nature.com/articles/nbt.4048
A gemini.model object created from the Big Papi SynLet dataset after gemini_inference
was run.
The Big Papi dataset was published in Najm et al. 2018 (doi: 10.1038/nbt.4048).
The Model object here is created from the data in Input using the gemini_initialize
function.
Model
Model
An object of class list
(inherits from gemini.model
) of length 24.
https://www.nature.com/articles/nbt.4048
Sample and replicate annotations for the Big Papi SynLet dataset. The Big Papi dataset was published in Najm et al. 2018 (doi: 10.1038/nbt.4048). The sample and replicate annotations were acquired from Supplementary Table 3
sample.replicate.annotation
sample.replicate.annotation
An object of class data.frame
with 16 rows and 3 columns.
https://www.nature.com/articles/nbt.4048
Gene hashing
Sgene2Pguides_hash(guide2gene, cores = 1)
Sgene2Pguides_hash(guide2gene, cores = 1)
guide2gene |
derived from Input object |
cores |
number of cores to use (default 1) |
Gene hash/list
## Not run: #' data("Input", package = "gemini") Sgene2Pguides_hash(Input$guide.pair.annot[1:10,]) ## End(Not run)
## Not run: #' data("Input", package = "gemini") Sgene2Pguides_hash(Input$guide.pair.annot[1:10,]) ## End(Not run)
Guide hashing
Sguide2Pguides_hash(guide2gene, split = ";", cores = 1)
Sguide2Pguides_hash(guide2gene, split = ";", cores = 1)
guide2gene |
derived from Input object |
split |
character to split guides |
cores |
number of cores to use (default 1) |
Guide hash/list
## Not run: data("Input", package = "gemini") Sguide2Pguides_hash(gemini::Input$guide.pair.annot[1:10,]) ## End(Not run)
## Not run: data("Input", package = "gemini") Sguide2Pguides_hash(gemini::Input$guide.pair.annot[1:10,]) ## End(Not run)
Calculate mean absolute error across all guide combinations.
update_mae(Model, verbose = FALSE)
update_mae(Model, verbose = FALSE)
Model |
a Model object of class gemini.model |
verbose |
default FALSE |
An object of class gemini.model
data("Model", package = "gemini") Model %<>% update_mae()
data("Model", package = "gemini") Model %<>% update_mae()
Update values of s using data from Input
and current values of other parameters.
update_s_pb(Model, mean_s = 0, sd_s = 10, cores = 1, verbose = FALSE)
update_s_pb(Model, mean_s = 0, sd_s = 10, cores = 1, verbose = FALSE)
Model |
a Model object of class gemini.model |
mean_s |
numeric indicating prior mean of s (default 0) |
sd_s |
numeric indicating prior sd of s (default 10) |
cores |
a numeric indicating the number of cores to use, see |
verbose |
default FALSE |
An object of class gemini.model
data("Model", package = "gemini") Model %<>% update_s_pb()
data("Model", package = "gemini") Model %<>% update_s_pb()
Update parameters of tau using data from Input
and current values of other parameters.
update_tau_pb(Model, cores = 1, verbose = FALSE)
update_tau_pb(Model, cores = 1, verbose = FALSE)
Model |
a Model object of class gemini.model |
cores |
a numeric indicating the number of cores to use. See |
verbose |
default FALSE |
An object of class gemini.model
data("Model", package = "gemini") Model %<>% update_tau_pb()
data("Model", package = "gemini") Model %<>% update_tau_pb()
Update values of x using data from Input
and current values of other parameters.
update_x_pb( Model, mean_x = 1, sd_x = 1, mean_xx = 1, sd_xx = 1, cores = 1, verbose = FALSE )
update_x_pb( Model, mean_x = 1, sd_x = 1, mean_xx = 1, sd_xx = 1, cores = 1, verbose = FALSE )
Model |
a Model object of class gemini.model |
mean_x |
a numeric indicating prior mean of x |
sd_x |
a numeric indicating prior sd of x |
mean_xx |
a numeric indicating prior mean of xx |
sd_xx |
a numeric indicating prior sd of xx |
cores |
a numeric indicating the number of cores to use. See |
verbose |
default FALSE |
An object of class gemini.model
The structure of the screen may impede parallelization. Our ability to parallelize the updates is contingent upon the independence between guides in position 1 and 2. To account for potential dependence, we define three groups of guides: group 1 (guides only in position 1 ), group 2 (guides only in position 2), and group 3 (guides in both position 1 and position 2). Parallelization is possible for groups 1 and 2, but only serial updates are possible for group 3. As such, updates for this group will take longer.
data("Model", package = "gemini") Model %<>% update_x_pb()
data("Model", package = "gemini") Model %<>% update_x_pb()
Update values of y using data from Input
and current values of other parameters.
update_y_pb(Model, mean_y = 0, sd_y = 10, verbose = FALSE)
update_y_pb(Model, mean_y = 0, sd_y = 10, verbose = FALSE)
Model |
a Model object of class gemini.model |
mean_y |
numeric indicating prior mean of y |
sd_y |
numeric indicating prior sd of y |
verbose |
default FALSE |
An object of class gemini.model
data("Model", package = "gemini") Model %<>% update_y_pb()
data("Model", package = "gemini") Model %<>% update_y_pb()