Title: | Predict synthetic lethal partners of tumour mutations |
---|---|
Description: | An integrated pipeline to predict the potential synthetic lethality partners (SLPs) of tumour mutations, based on gene expression, mutation profiling and cell line genetic screens data. It has built-in support for data from cBioPortal. The primary SLPs correlating with mutations in WT and compensating for the loss of function of mutations are predicted by random forest based methods (GENIE3) and Rank Products, respectively. Genetic screens are employed to identify consensus SLPs that lead to reduced cell viability when perturbed. |
Authors: | Chunxuan Shao [aut, cre] |
Maintainer: | Chunxuan Shao <[email protected]> |
License: | GPL-3 |
Version: | 1.9.0 |
Built: | 2024-12-29 07:24:44 UTC |
Source: | https://github.com/bioc/mslp |
Identify SLPs compensating for the loss of function of mutations. The up-regulated SLPs are selected via the rank products algorithm, with the option calculateProduct = FALSE for robust results and the capacity to handle large datasets.
comp_slp( zscore_data, mut_data, mutgene = NULL, positive_perc = 0.5, p_thresh = 0.01, ... )
zscore_data |
a matrix (genes by patients) reflecting gene expression relative to wild type samples. For example, generated from |
mut_data |
a data.table with columns "patientid" and "mut_entrez". |
mutgene |
identify SLPs for specific mutations (gene symbols). If NULL (by default), the intersection genes between zscore_data and mut_data are used. |
positive_perc |
keep genes with a positive zscore in at least positive_perc * number of mutation patients. |
p_thresh |
pvalue threshold to filter out results. |
... |
additional parameters to |
A data.table with predicted SLPs.
Entrez ids of mutations.
Gene symbols of mutations.
Entrez ids of SLPs.
Gene symbols of SLPs.
p_value from RankProducts.
"BH" adjusted pvalue via p.adjust.
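The `positive_perc` filter and the "BH" adjustment described above can be illustrated with base R only. This is a minimal sketch under stated assumptions, not the code used inside `comp_slp`: the toy z-score matrix, the patient ids and the p-values are all hypothetical, and the rank products step itself is not reproduced here.

```r
# Hypothetical z-score matrix: 3 genes by 4 patients carrying the mutation.
z <- matrix(c( 1.2,  0.8, -0.3,  2.0,
              -0.5, -1.1,  0.4, -0.2,
               0.9, -0.4,  1.5,  0.1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g_1", "g_2", "g_3"), paste0("p_", 1:4)))
mut_patients <- colnames(z)   # assumption: every column is a mutated patient
positive_perc <- 0.5

# Keep genes with a positive z-score in at least
# positive_perc * number of mutation patients.
n_pos <- rowSums(z[, mut_patients, drop = FALSE] > 0)
keep <- n_pos >= positive_perc * length(mut_patients)
z_kept <- z[keep, , drop = FALSE]

# Hypothetical p-values for the kept genes, adjusted with the "BH" method.
p_value <- c(g_1 = 0.001, g_3 = 0.02)
fdr <- p.adjust(p_value, method = "BH")
```

In this toy input, g_2 is dropped because only one of its four z-scores is positive.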
#- Toy examples, see vignette for more.
#- Add the parallel backend.
require(future)
require(doFuture)
plan(multisession, workers = 2)
data("example_z")
data("example_comp_mut")
res <- comp_slp(example_z, example_comp_mut)
plan(sequential)
Identify consensus SLPs based on Cohen's Kappa or hypergeometric test.
cons_slp(screen_slp, tumour_slp)
screen_slp |
screen hits data annotated with SLPs information, generated by |
tumour_slp |
Consensus SLPs are enriched screen hits that are SLPs of the same mutations in different cell lines. For each common mutation, the SLPs predicted from human tumour data are used as the total set. Either Cohen's Kappa coefficient on a confusion matrix or a hypergeometric test is used to test the significance of the overlap of screen hits.
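The two overlap tests can be sketched in base R as follows. This is an illustration under stated assumptions rather than the implementation in `cons_slp`: the gene names, the universe size and the hit sets are hypothetical, and only the Kappa coefficient (not its pvalue) is computed.

```r
# Hypothetical universe: SLPs predicted from tumour data for one mutation.
universe <- paste0("gene_", 1:100)
# Hypothetical screen hits (already restricted to the universe) in two cell lines.
hits_a <- universe[1:30]
hits_b <- universe[11:40]

overlap <- length(intersect(hits_a, hits_b))

# Hypergeometric test: probability of at least `overlap` shared hits when
# length(hits_b) genes are drawn at random from the universe.
p_hyper <- phyper(overlap - 1,
                  length(hits_a),
                  length(universe) - length(hits_a),
                  length(hits_b),
                  lower.tail = FALSE)

# Cohen's Kappa from the 2 x 2 confusion matrix (hit or not hit in each line).
in_a <- factor(universe %in% hits_a, levels = c(TRUE, FALSE))
in_b <- factor(universe %in% hits_b, levels = c(TRUE, FALSE))
conf_mat <- table(in_a, in_b)
p_obs <- sum(diag(conf_mat)) / sum(conf_mat)
p_exp <- sum(rowSums(conf_mat) * colSums(conf_mat)) / sum(conf_mat)^2
kappa <- (p_obs - p_exp) / (1 - p_exp)
```

With these toy sets the observed agreement is 0.8 and the agreement expected by chance is 0.58, so Kappa is 0.22 / 0.42.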
A data.table.
Entrez ids of mutations.
Gene symbols of mutations.
Entrez ids of consensus SLPs.
Gene symbols of Consensus SLPs.
From which pair of cell lines the consensus SLPs predicted.
Judgement based on Cohen's Kappa.
Cohen's Kappa coefficient
pvalue for Cohen's Kappa coefficient.
"BH" adjusted pvalue via p.adjust.
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics, 33: 159-174.
#- See the examples in the vignette.
if (FALSE) k_res <- cons_slp(scr_res, merged_res)
Identify SLPs of mutations based on co-expression. GENIE3 is employed to find genes highly correlated with mutations in wild type patients.
corr_slp( expr_data, mut_data, mutgene = NULL, im_thresh = 0.001, topgene = 2000, ... )
expr_data |
an expression matrix, genes by patients. |
mut_data |
a data.table with columns "patientid" and "mut_entrez". |
mutgene |
identify SLPs for specific mutations (gene symbols). If NULL (by default), the intersection genes between expr_data and mut_data are used. |
im_thresh |
minimum importance threshold. |
topgene |
top N genes above the |
... |
further parameters to |
A data.table with predicted SLPs.
Entrez ids of mutations.
Gene symbols of mutations.
Entrez ids of SLPs.
Gene symbols of SLPs.
"BH" adjusted pvalue via p.adjust.
The importance value returned by genie3.
#- Toy examples, see vignette for more.
require(future)
require(doFuture)
plan(multisession, workers = 2)
data("example_expr")
data("example_corr_mut")
res <- corr_slp(example_expr, example_corr_mut)
plan(sequential)
Estimate the importance threshold from repeated GENIE3 results via ROC analysis.
est_im(permu_data, fdr_thresh = 0.001)
permu_data |
permuted |
fdr_thresh |
fdr threshold to select "TRUE" SLPs. |
We first generate an SLPs-by-repetition matrix from the repeated GENIE3 results. SLPs with high im values across repetitions are selected via the rank product algorithm and considered "TRUE" SLPs. Then, for each repetition, we perform receiver operating characteristic (ROC) curve analysis and select an optimal threshold by the "youden" approach. The optimal thresholds are averaged to get the final threshold.
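The thresholding step can be sketched in base R. This is an illustration only and not the code inside `est_im`: the helper `youden_thresh`, the toy im values and the "TRUE" labels are hypothetical, the rank product selection of "TRUE" SLPs is skipped, and a dedicated ROC package may place the cut-off between observed values rather than on one.

```r
# Youden's J = sensitivity + specificity - 1; pick the candidate threshold
# (taken from the observed scores) that maximises it.
youden_thresh <- function(score, is_true) {
  cand <- sort(unique(score))
  j <- vapply(cand, function(th) {
    pred <- score >= th
    sens <- sum(pred & is_true) / sum(is_true)
    spec <- sum(!pred & !is_true) / sum(!is_true)
    sens + spec - 1
  }, numeric(1))
  cand[which.max(j)]
}

# Hypothetical labels: the first three SLPs are the "TRUE" ones.
is_true <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
# Hypothetical im values from two repetitions.
reps <- list(
  c(0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.05),
  c(0.6, 0.5, 0.9, 0.1, 0.2, 0.3, 0.4)
)

# One optimal threshold per repetition, then the average as the final value.
per_rep <- vapply(reps, youden_thresh, numeric(1), is_true = is_true)
final_thresh <- mean(per_rep)
```

Here the two repetitions give thresholds of 0.7 and 0.5, which average to 0.6.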
A data.table with mut_entrez (mutation entrez_id) and roc_thresh (estimated im threshold).
#- Toy examples.
require(future)
require(doFuture)
plan(multisession, workers = 2)
data(example_expr)
data(example_corr_mut)
mutgene <- sample(intersect(example_corr_mut$mut_entrez, rownames(example_expr)), 2)
nperm <- 5
res <- lapply(seq_len(nperm), function(x) corr_slp(example_expr, example_corr_mut, mutgene = mutgene))
roc_thresh <- est_im(res)
plan(sequential)
Mutations and related TCGA ids.
data(example_comp_mut)
A data.table.
SLPs predicted by comp_slp
data(example_compSLP)
A data.table.
Mutations and related TCGA ids.
data(example_corr_mut)
A data.table.
SLPs predicted by corr_slp
data(example_corrSLP)
A data.table.
Expression matrix, genes by samples.
data(example_expr)
A matrix.
Z score matrix, genes by samples.
data(example_z)
A matrix.
Calculate the weight matrix between genes via randomForest, modified from the original code by Huynh-Thu, V.A.
genie3( expr.matrix, ngene = NULL, K = "sqrt", nb.trees = 1000, input.idx = NULL, importance.measure = "IncNodePurity", trace = FALSE, ... )
expr.matrix |
expression matrix (genes by samples). |
ngene |
an integer, only up to the first ngene (included) targets (response variables). |
K |
choice of the number of randomly selected input genes; must be one of "sqrt", "all", or an integer. |
nb.trees |
number of trees in ensemble for each target gene (default 1000). |
input.idx |
subset of genes used as input genes (default all genes). A vector of indices or gene names is accepted. |
importance.measure |
type of variable importance measure, "IncNodePurity" or "%IncMSE". |
trace |
index of currently computed gene is reported (default FALSE). |
... |
parameters passed to randomForest. |
A weighted adjacency matrix of inferred network, element w_ij (row i, column j) gives the importance of the link from regulatory gene i to target gene j.
Huynh-Thu, V.A., Irrthum, A., Wehenkel, L., and Geurts, P. (2010). Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLoS ONE 5, e12776.
#- Toy examples.
mtx <- matrix(sample(1000, 100), nrow = 5)
mtx <- rbind(mtx[1, ] * 2 + rnorm(20), mtx)
colnames(mtx) <- paste0("s_", seq_len(ncol(mtx)))
rownames(mtx) <- paste0("g_", seq_len(nrow(mtx)))
res <- genie3(mtx, nb.trees = 100)
Take genie3 output and sort the links.
getlink(weight.matrix, report.max = NULL)
weight.matrix |
a weighted adjacency matrix as returned by genie3. |
report.max |
maximum number of links to report (default all links). |
A data.table of links with columns "from.gene", "to.gene", "im".
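The conversion from a weight matrix to a sorted link list can be sketched in base R. This is a hypothetical illustration, not the body of `getlink`: it returns a plain data.frame instead of a data.table, the toy weights are invented, and dropping self-links is an assumption.

```r
# Hypothetical weight matrix: element w[i, j] is the importance of the link
# from regulatory gene i to target gene j.
genes <- c("g_1", "g_2", "g_3")
w <- matrix(c(0.0, 0.2, 0.5,
              0.7, 0.0, 0.1,
              0.3, 0.4, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(genes, genes))

# Flatten the matrix into one row per (from, to) pair.
links <- data.frame(
  from.gene = rownames(w)[as.vector(row(w))],
  to.gene = colnames(w)[as.vector(col(w))],
  im = as.vector(w),
  stringsAsFactors = FALSE
)

# Drop self-links (an assumption) and sort by decreasing importance.
links <- links[links$from.gene != links$to.gene, ]
links <- links[order(links$im, decreasing = TRUE), ]

# Optionally keep only the strongest links, analogous to report.max.
report.max <- 2
top_links <- head(links, report.max)
```

The strongest link in this toy matrix runs from g_2 to g_1 with an importance of 0.7.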
mtx <- matrix(sample(1000, 100), nrow = 5)
mtx <- rbind(mtx[1, ] * 2 + rnorm(20), mtx)
colnames(mtx) <- paste0("s_", seq_len(ncol(mtx)))
rownames(mtx) <- paste0("g_", seq_len(nrow(mtx)))
res <- genie3(mtx, nb.trees = 10)
res_link <- getlink(res)
Merge predicted SLPs from comp_slp and corr_slp.
merge_slp(comp_data, corr_data)
comp_data |
predicted SLPs from |
corr_data |
predicted SLPs from |
A data.table.
Entrez ids of mutations.
Gene symbols of mutations.
Entrez ids of SLPs.
Gene symbols of SLPs.
p_value from RankProducts.
"BH" adjusted pvalue via p.adjust.
The importance value returned by genie3.
data("example_z")
data("example_comp_mut")
comp_res <- comp_slp(example_z, example_comp_mut)
data("example_expr")
data("example_corr_mut")
corr_res <- corr_slp(example_expr, example_corr_mut)
res <- merge_slp(comp_res, corr_res)
Preprocess mutation, CNA, expression and zscore datasets in TCGA PanCancer Atlas provided by cBioPortal.
pp_tcga( p_mut, p_cna, p_exprs, p_score, freq_thresh = 0.02, expr_thresh = 10, hypermut_thresh = 300 )
p_mut |
path of mutation data, like "data_mutations_uniprot.txt" provided by cBioPortal. |
p_cna |
path of copy number variation data, like "data_CNA.txt". |
p_exprs |
path of normalized RNAseq expression data, like "data_RNA_Seq_v2_expression_median.txt". |
p_score |
path of zscore data, like "data_RNA_Seq_v2_mRNA_median_Zscores.txt". |
freq_thresh |
threshold to select recurrent mutations. |
expr_thresh |
threshold to remove low expression genes. |
hypermut_thresh |
threshold for hypermutations. |
It is designed to process the TCGA data provided by cBioPortal. In mutation data, "Missense_Mutation", "Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins" and "Nonstop_Mutation" are selected for the downstream analysis. In CNA data, genes with a GISTIC value equal to -2 are used. Patients with hypermutations are removed. Low expression genes, and genes that are not detected in any patient, are filtered out.
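The mutation filtering described above can be sketched on a toy table in base R. This is a hypothetical illustration, not the code inside `pp_tcga`: the `type` column name, the toy rows and the threshold of 2 are assumptions, as is counting mutations per patient after the type filter.

```r
# Mutation types kept for the downstream analysis (from the text above).
keep_types <- c("Missense_Mutation", "Nonsense_Mutation", "Frame_Shift_Del",
                "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
                "Nonstop_Mutation")

# Hypothetical mutation table; `type` stands in for the variant class column.
mut <- data.frame(
  patientid = c("p1", "p1", "p2", "p3", "p3", "p3"),
  mut_entrez = c(1, 2, 1, 3, 4, 5),
  type = c("Missense_Mutation", "Silent", "Nonsense_Mutation",
           "Missense_Mutation", "Frame_Shift_Del", "In_Frame_Ins"),
  stringsAsFactors = FALSE
)

# Keep only the selected mutation types.
mut <- mut[mut$type %in% keep_types, ]

# Remove patients with more mutations than the threshold (toy value here;
# the documented default of hypermut_thresh is 300).
hypermut_thresh <- 2
n_mut <- table(mut$patientid)
hyper <- names(n_mut)[n_mut > hypermut_thresh]
mut_data <- mut[!mut$patientid %in% hyper, c("patientid", "mut_entrez")]
```

In this toy table the silent mutation is dropped first, then patient p3 is removed for carrying three retained mutations.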
Returns a list of mut_data, expr_data and zscore_data. expr_data and zscore_data are matrices (entrez_id by patients); mut_data is a data.table with two columns, "patientid" and "mut_entrez".
Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. May 2012 2; 401.
Gao et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013).
#- See vignette for more details.
if (FALSE) {
  P_mut <- "data_mutations_extended.txt"
  P_cna <- "data_CNA.txt"
  P_expr <- "data_RNA_Seq_v2_expression_median.txt"
  P_z <- "data_RNA_Seq_v2_mRNA_median_Zscores.txt"
  res <- pp_tcga(P_mut, P_cna, P_expr, P_z)
  saveRDS(res$mut_data, "mut_data.rds")
  saveRDS(res$expr_data, "expr_data.rds")
  saveRDS(res$zscore_data, "zscore_data.rds")
}
Identify whether screen hits are SLPs of mutations detected in both patients and cell lines, based on predicted SLPs in corr_slp and comp_slp.
scr_slp(cell, screen_data, cell_mut, tumour_slp)
cell |
a cell line. |
screen_data |
a data.table of genomic screen results with three columns, "screen_entrez", "screen_symbol" and "cell_line". |
cell_mut |
cell line mutation data. |
tumour_slp |
merged SLPs. |
A data.table.
Name of cell lines.
Entrez ids of hits.
Gene symbols of hits.
Entrez ids of mutations.
Gene symbols of mutations.
Whether the targeted gene is a SLP.
p_value from RankProducts.
"BH" adjusted pvalue via p.adjust.
The importance value returned by genie3.
require(future)
require(doFuture)
plan(multisession, workers = 2)
library(magrittr)
library(data.table)
data(example_compSLP)
data(example_corrSLP)
merged_res <- merge_slp(example_compSLP, example_corrSLP)

#- Toy hits data.
screen_1 <- merged_res[, .(slp_entrez, slp_symbol)] %>%
  unique %>%
  .[sample(nrow(.), round(.8 * nrow(.)))] %>%
  setnames(c(1, 2), c("screen_entrez", "screen_symbol")) %>%
  .[, cell_line := "cell_1"]
screen_2 <- merged_res[, .(slp_entrez, slp_symbol)] %>%
  unique %>%
  .[sample(nrow(.), round(.8 * nrow(.)))] %>%
  setnames(c(1, 2), c("screen_entrez", "screen_symbol")) %>%
  .[, cell_line := "cell_2"]
screen_hit <- rbind(screen_1, screen_2)

#- Toy mutations data.
mut_1 <- merged_res[, .(mut_entrez)] %>%
  unique %>%
  .[sample(nrow(.), round(.8 * nrow(.)))] %>%
  .[, cell_line := "cell_1"]
mut_2 <- merged_res[, .(mut_entrez)] %>%
  unique %>%
  .[sample(nrow(.), round(.8 * nrow(.)))] %>%
  .[, cell_line := "cell_2"]
mut_info <- rbind(mut_1, mut_2)

#- Hits that are identified as SLPs.
scr_res <- lapply(c("cell_1", "cell_2"), scr_slp, screen_hit, mut_info, merged_res)
scr_res[lengths(scr_res) == 0] <- NULL
scr_res <- rbindlist(scr_res)
plan(sequential)