Title: | Open Cancer TherApeutic Discovery (OCTAD) |
---|---|
Description: | OCTAD provides a platform for virtually screening compounds targeting precise cancer patient groups. The essential idea is to identify drugs that reverse the gene expression signature of disease by tamping down over-expressed genes and stimulating weakly expressed ones. The package offers deep-learning based reference tissue selection, disease gene expression signature creation, pathway enrichment analysis, drug reversal potency scoring, cancer cell line selection, drug enrichment analysis and in silico hit validation. It currently covers ~20,000 patient tissue samples covering 50 cancer types, and expression profiles for ~12,000 distinct compounds. |
Authors: | E. Chekalin [aut, cre], S. Paithankar [aut], B. Zeng [aut], B. Glicksberg [ctb], P. Newbury [ctb], J. Xing [ctb], K. Liu [ctb], A. Wen [ctb], D. Joseph [ctb], B. Chen [aut] |
Maintainer: | E. Chekalin <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.9.0 |
Built: | 2024-11-29 08:31:50 UTC |
Source: | https://github.com/bioc/octad |
Select top CCLE cell lines sharing similar expression profiles with input case samples. Input case sample ids and output correlation scores for every cell line and/or output file. The results could be used for in-silico validation of predictions or used to weight cell lines in RGES computation.
CellLineCorrelations.csv
, correlation between CCLE cell lines and input disease samples.
computeCellLine(case_id = case_id, expSet = NULL, LINCS_overlaps = TRUE, source = c("octad.small", "octad.whole", "expSet"), file = NULL, output = TRUE, outputFolder = NULL)
computeCellLine(case_id = case_id, expSet = NULL, LINCS_overlaps = TRUE, source = c("octad.small", "octad.whole", "expSet"), file = NULL, output = TRUE, outputFolder = NULL)
case_id |
vector of ids from octad database. Ids can be obtained from |
output |
by default |
outputFolder |
Folder to store results. |
LINCS_overlaps |
vector of cell line ids from octad database. If |
source |
the file for the octad expression matrix. By default, set to |
expSet |
input expression matrix. By default set to |
file |
if |
topline |
|
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data case_id=HCC_primary$sample.id #select cases cell_line_computed=computeCellLine(case_id=case_id,source='octad.small')
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data case_id=HCC_primary$sample.id #select cases cell_line_computed=computeCellLine(case_id=case_id,source='octad.small')
Compute reference control samples from OCTAD database using precomputed EncoderDF
models.
computeRefTissue(case_id = NULL, adjacent = FALSE, source = "octad", n_varGenes = 500, method = c("varGenes",'random'), expSet = NULL, control_size = length(case_id), outputFolder = NULL, cor_cutoff = "0", output = TRUE)
computeRefTissue(case_id = NULL, adjacent = FALSE, source = "octad", n_varGenes = 500, method = c("varGenes",'random'), expSet = NULL, control_size = length(case_id), outputFolder = NULL, cor_cutoff = "0", output = TRUE)
case_id |
vector of cases used to compute references. |
source |
by default set |
adjacent |
by default set to |
expSet |
input for expression matrix. By default NULL, since autoencoder results are used. |
n_varGenes |
number of genes used to select control cases. |
method |
one of two options is avaliable. |
control_size |
number of control samples to be selected. |
outputFolder |
path to output folder. By default, the function produces result files in working directory. |
cor_cutoff |
cut-off for correlation values, by default |
output |
if |
Return
control_id |
a vector of an appropriate set of control samples. |
Besides, if output=TRUE
, two files are created in the working directory:
case_normal_corMatrix.csv |
contains pairwise correlation between case samples vs control samples. |
case_normal_median_cor.csv |
contains median correlation values with case samples for returned control samples. |
#select data #load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='Liver Hepatocellular Carcinoma'& sample.type == 'primary'&data.source == 'TCGA') #select cases case_id=HCC_primary$sample.id #computing reference tissue, by default using small autoEncoder, #but can use custom expression set, #by default output=TRUE and outputFolder option is empty, #which creates control corMatrix.csv to working directory control_id=computeRefTissue(case_id,outputFolder='',output=TRUE, expSet = "octad",control_size = 50)
#select data #load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='Liver Hepatocellular Carcinoma'& sample.type == 'primary'&data.source == 'TCGA') #select cases case_id=HCC_primary$sample.id #computing reference tissue, by default using small autoEncoder, #but can use custom expression set, #by default output=TRUE and outputFolder option is empty, #which creates control corMatrix.csv to working directory control_id=computeRefTissue(case_id,outputFolder='',output=TRUE, expSet = "octad",control_size = 50)
Compute differential expression for case vs control samples. Will produce the file computedEmpGenes.csv
listing empiricaly differentially expressed genes used for RNA-Seq normalization.
diffExp(case_id = NULL, control_id = NULL, source = "octad.small", file = "octad.counts.and.tpm.h5", normalize_samples = TRUE, k = 1, expSet = NULL, n_topGenes = 500, DE_method = c("edgeR",'DESeq2','wilcox','limma'), output = FALSE, outputFolder = NULL, annotate = TRUE)
diffExp(case_id = NULL, control_id = NULL, source = "octad.small", file = "octad.counts.and.tpm.h5", normalize_samples = TRUE, k = 1, expSet = NULL, n_topGenes = 500, DE_method = c("edgeR",'DESeq2','wilcox','limma'), output = FALSE, outputFolder = NULL, annotate = TRUE)
case_id |
vector of cases used for differential expression. |
control_id |
vector of controls used for differential expression. |
source |
the file for the octad expression matrix. By default, set to |
expSet |
input expression matrix. By default set to |
file |
if |
normalize_samples |
if TRUE, RUVSeq normalization is applied to either EdgeR or DESeq. No normalization needed for limma+voom. |
k |
eiter k=1 (by default), k=2 or k=3, number of factors used in model matrix construction in RUVSeq normalization if |
n_topGenes |
number of empiricaly differentially expressed genes estimated for RUVSeq normalization. Default is 5000. |
DE_method |
edgeR, DESeq2, limma or wilcox DE analysis. |
output |
if |
outputFolder |
path to output folder. By default, the function produces result files in working directory. |
annotate |
if |
res |
|
computedEmpGenes.csv |
|
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data case_id=HCC_primary$sample.id #select cases HCC_adjacent=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'adjacent'&data.source == 'TCGA') #select data control_id=HCC_adjacent$sample.id #select cases res=diffExp(case_id,control_id,source='octad.small',output=FALSE)
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data case_id=HCC_primary$sample.id #select cases HCC_adjacent=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'adjacent'&data.source == 'TCGA') #select data control_id=HCC_adjacent$sample.id #select cases res=diffExp(case_id,control_id,source='octad.small',output=FALSE)
Create TPM or count expression matrix for the selected samples from OCTAD.
loadOctadCounts(sample_vector='',type='tpm',file='')
loadOctadCounts(sample_vector='',type='tpm',file='')
sample_vector |
vector of samples to be selected. Use |
type |
either |
file |
full path to |
exprData |
|
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') #load expression data for raw counts or tpm values. HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data #case_id=HCC_primary$sample.id #select cases #expression_tmp=loadOctadCounts(case_id,type='tpm', #file='octad.counts.and.tpm.h5') #expression_log2=loadOctadCounts(case_id,type='counts', #file='octad.counts.and.tpm.h5')
#load data.frame with samples included in the OCTAD database phenoDF=get_ExperimentHub_data('EH7274') #load expression data for raw counts or tpm values. HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'& sample.type == 'primary') #select data #case_id=HCC_primary$sample.id #select cases #expression_tmp=loadOctadCounts(case_id,type='tpm', #file='octad.counts.and.tpm.h5') #expression_log2=loadOctadCounts(case_id,type='counts', #file='octad.counts.and.tpm.h5')
Open Cancer TherApeutic Discovery (OCTAD) package implies sRGES approach for the drug discovery. The essential idea is to identify drugs that reverse the gene expression signature of a disease by tamping down over-expressed genes and stimulating weakly expressed ones. The following package contains all required precomputed data for whole OCTAD pipeline computation.
The main functions are:
computeRefTissue
- Compute reference control samples from OCTAD database using precomputed EncoderDF
models.
diffExp
- Compute differential expression for case vs control samples. Will produce the file computedEmpGenes.csv
listing empiricaly differentially expressed genes used for RNA-Seq normalization.
runsRGES
- Compute sRGES, a score indicating the reveral potency of each drug. It first computes RGES (Reverse Gene Expression Score) for individual instances and then summarizes RGES of invididual drugs (one drug may have multiple instances under different treatment conditions).
computeCellLine
- Compute Correlation between cell lines and vector of case ids.
topLineEval
- Evaluate predictions using pharmacogenomics data. Given a cell line, the function computes the correlation between sRGES and drug sensitivity data taken from CTRP. A higher correlation means a better prediction. The cell line could be computed from computeCellLine.
octadDrugEnrichment
- Perform enrichment analysis of drug hits based on chemical structures, drug-targets, and pharmacological classifications. An enrichment score calculated using ssGSEA and a p-value computed through a permutation test are provided.
For detailed information on usage, see the package vignette, by typing
vignette('octad')
, or the workflow linked to on the first page
of the vignette.
The code can be viewed at the GitHub repository, which also lists the contributor code of conduct:
https://github.com/Bin-Chen-Lab/OCTAD
Zeng, B., Glicksberg, B.S., Newbury, P., Chekalin, E., Xing, J., Liu, K., Wen, A., Chow, C. and Chen, B., 2021. OCTAD: an open workspace for virtually screening therapeutics targeting precise cancer patient groups using gene expression features. Nature protocols, 16(2), pp.728-753. https://www.nature.com/articles/s41596-020-00430-z _PACKAGE package
Perform enrichment analysis of drug hits based on chemical structures, drug-targets, and pharmacological classifications. An enrichment score calculated using ssGSEA and a p-value computed through a permutation test are provided.
octadDrugEnrichment(sRGES = NULL, target_type = "chembl_targets", enrichFolder = "enrichFolder", outputFolder = NULL, outputRank = FALSE)
octadDrugEnrichment(sRGES = NULL, target_type = "chembl_targets", enrichFolder = "enrichFolder", outputFolder = NULL, outputRank = FALSE)
sRGES |
sRGES data frame produced by |
target_type |
one or several of |
enrichFolder |
folder to store output. |
outputFolder |
path where to store enrichFolder, in case of |
outputRank |
output detailed rank if |
Following files are created:
enriched_*_targets.csv
and top_enriched_*_*_targets.pdf
. In the case of chemical structural analysis, additional files are created: *drugstructureClusters.csv
and *misc.csv
. The results provide useful information for following candidate selection and experimental design. For example, if two structurally similar drugs are both predicted as top hits, the chance of each drug as a true positive is high.
exprData |
|
data("sRGES_example",package='octad') #load example sRGES #run drug enrichment octadDrugEnrichment(sRGES = sRGES_example, target_type = c('chembl_targets'))
data("sRGES_example",package='octad') #load example sRGES #run drug enrichment octadDrugEnrichment(sRGES = sRGES_example, target_type = c('chembl_targets'))
Differential expression example for HCC vs adjacent liver tissue computed in diffExp() function
data(res_example)
data(res_example)
A data.frame with 963 rows and 18 variables:
Ensg ID
Log2 fold-change
log CPM value
LR value
p.value
FDR
taxon id
Gene id
Locus tag
Chromosome
Chromosome location
Full gene name
type of gene
HGNC symbol
Gene function
To generate this dataset use the following code from the octad package
#load data.frame with samples included in the OCTAD database. phenoDF=.eh[['EH7274']]
#select data HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'&sample.type == 'primary')
#select cases case_id=HCC_primary$sample.id
control_id=subset(phenoDF,biopsy.site=='LIVER'&sample.type=='normal')$sample.id[1:50]
res=diffExp(case_id,control_id,source='octad.small',output=FALSE)
Compute sRGES, a score indicating the reveral potency of each drug. It first computes RGES (Reverse Gene Expression Score) for individual instances and then summarizes RGES of invididual drugs (one drug may have multiple instances under different treatment conditions).
runsRGES(dz_signature=NULL,choose_fda_drugs = FALSE,max_gene_size=500, cells=NULL,output=FALSE,outputFolder='',weight_cell_line=NULL,permutations=10000)
runsRGES(dz_signature=NULL,choose_fda_drugs = FALSE,max_gene_size=500, cells=NULL,output=FALSE,outputFolder='',weight_cell_line=NULL,permutations=10000)
dz_signature |
disease signature. Make sure input data frame has a gene |
choose_fda_drugs |
if |
max_gene_size |
maximum number of disease genes used for drug prediction. By default 50 for each side (up/down). |
cells |
cell ids in |
weight_cell_line |
by default |
permutations |
number of permutations, by default 10000. |
output |
if |
outputFolder |
folder path to store drug results, by default write results to working directory. |
The function returns RGES data.frame |
containing scores and p.values for every instance. |
Besides, a number of additional files in the sourced directory:
dz_sig_used.csv |
contains genes in the disease signature used for computing reverse gene expression scores. |
sRGES.csv |
contains the same data as returned data.frame. |
all__lincs_score.csv |
includes information of RGES. |
diffExp, octadDrugEnrichment, computeCellLine, topLineEval
#load differential expression example for HCC #vs adjacent liver tissue computed in diffExp() function data("res_example",package='octad') res_example=subset(res_example,abs(log2FoldChange)>1&padj<0.001)[1:10,] #run sRGES computation #sRGES=runsRGES(dz_signature=res_example,max_gene_size=100,permutations=1000,output=FALSE)
#load differential expression example for HCC #vs adjacent liver tissue computed in diffExp() function data("res_example",package='octad') res_example=subset(res_example,abs(log2FoldChange)>1&padj<0.001)[1:10,] #run sRGES computation #sRGES=runsRGES(dz_signature=res_example,max_gene_size=100,permutations=1000,output=FALSE)
Data of computed example sRGEs for HCC vs liver adjacent tissues on octad.small dataset
data(sRGES_example)
data(sRGES_example)
A tibble with 12,442 rows and 6 variables:
dbl Year price was recorded
mean sRGES for obtained drug if n>1
times this drug was obtained
median sRGES for drug if n>1
standart deviation for obtained drug if n>1
sRGES score of the drug
To generate this dataset use the following code from the octad package
load differential expression example for HCC vs adjacent liver tissue computed in diffExp()
function from res_example
. data('res_example',package='octad.db')
res=subset(res_example,abs(log2FoldChange)>1&padj<0.001) #load example expression dataset
sRGES=runsRGES(res,max_gene_size=100,permutations=1000,output=FALSE)
Evaluate predictions using pharmacogenomics data. Given a cell line, the function computes the correlation between sRGES and drug sensitivity data taken from CTRP. A higher correlation means a better prediction. The cell line could be computed from computeCellLine.
topLineEval(topline=NULL,mysRGES=NULL,outputFolder="")
topLineEval(topline=NULL,mysRGES=NULL,outputFolder="")
topline |
list of cell lines to be used for prediction. |
mysRGES |
sRGES data.frame produced by |
outputFolder |
Path to store results. |
The function produces 3 feils in the output directory:
CellLineEval*_drug_sensitivity_insilico_results.txt |
with drug sensitivity information. |
*_auc_insilico_validation.html |
correlation between drug AUC and sRGES in a related cell line. |
*_ic50_insilico_validation.html |
correlation between drug IC50 and sGRES in a related cell line. |
#load example sRGES computed by runsRGES() function for HCC #vs liver adjacent tissues on octad.small dataset data("sRGES_example",package='octad') #load example sRGES #Pick up cell lines topLineEval(topline = 'HEPG2',mysRGES = sRGES_example,outputFolder=tempdir())
#load example sRGES computed by runsRGES() function for HCC #vs liver adjacent tissues on octad.small dataset data("sRGES_example",package='octad') #load example sRGES #Pick up cell lines topLineEval(topline = 'HEPG2',mysRGES = sRGES_example,outputFolder=tempdir())