Title: | ssPATHS: Single Sample PATHway Score |
---|---|
Description: | This package generates pathway scores from expression data for single samples after training on a reference cohort. The score is generated by taking the expression of a gene set (pathway) from a reference cohort and performing linear discriminant analysis to distinguish the samples in the cohort in which the pathway is augmented from those in which it is not. The separating hyperplane is then used to score new samples. |
Authors: | Natalie R. Davidson |
Maintainer: | Natalie R. Davidson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-10-31 05:31:32 UTC |
Source: | https://github.com/bioc/ssPATHS |
Gene Expression Values for PDAC Cancer Cell Lines exposed to Hypoxia
data(expected_score_output)
A data frame with columns:
String. The name of the sample. Samples with "hyp" or "norm" in the sample id are cell lines that were exposed to hypoxic or normoxic conditions, respectively. Samples with "ctrl" or "noHIF" in the sample id were or were not, respectively, able to produce a HIF-mediated hypoxic response.
Float. The estimated hypoxia score for this sample.
Derived Data
## Not run: 
expected_score_output
## End(Not run)
Gene Expression Values for PDAC Cancer Cell Lines exposed to Hypoxia
data(gene_weights_reference)
A data frame with columns:
Float. Gene weighting learned from reference data.
String. The Ensembl id of the gene.
Derived data
## Not run: 
gene_weights_reference
## End(Not run)
Get the AUC-ROC, AUC-PR, and ROC/PR curves for plotting.
get_classification_accuracy(sample_scores, positive_val)
sample_scores | This is a data.frame containing the sample id, score, and true label Y. This object is returned by the method get_gene_weights. |
positive_val | This is the value that will denote a true positive. It must be one of the two values in the Y column in sample_scores. |
This returns a list of performance metrics
auc_pr | Area under the PR-curve |
auc_roc | Area under the ROC-curve |
perf_pr | ROCR object for plotting the PR-curve |
perf_roc | ROCR object for plotting the ROC-curve |
Natalie R. Davidson
library(ssPATHS)
library(SummarizedExperiment)

data(tcga_expr_df)

# transform from data.frame to SummarizedExperiment
tcga_se <- SummarizedExperiment(t(tcga_expr_df[ , -(1:4)]),
                                colData=tcga_expr_df[ , 2:4])
colnames(tcga_se) <- tcga_expr_df$tcga_id
colData(tcga_se)$sample_id <- tcga_expr_df$tcga_id

# get the hypoxia genes that are present in the expression data
hypoxia_gene_ids <- get_hypoxia_genes()
hypoxia_gene_ids <- intersect(hypoxia_gene_ids, rownames(tcga_se))

# label the samples: 0 for adjacent normal, 1 for tumor
colData(tcga_se)$Y <- ifelse(colData(tcga_se)$is_normal, 0, 1)

# now we can get the gene weightings
res <- get_gene_weights(tcga_se, hypoxia_gene_ids, unidirectional=TRUE)
sample_scores <- res[[2]]

# check how well we did
training_res <- get_classification_accuracy(sample_scores, positive_val=1)
print(training_res[[2]])

# PR-curve, annotated with the area under it
plot(training_res[[3]], col="orange", ylim=c(0, 1))
legend(0.1,0.8,c(training_res$auc_pr,"\n"), border="white", cex=1.7, box.col = "white")

# ROC-curve, annotated with the area under it
plot(training_res[[4]], col="blue", ylim=c(0, 1))
legend(0.1,0.8,c(training_res$auc_roc,"\n"),border="white",cex=1.7, box.col = "white")
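The area under the ROC-curve can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal base-R sketch of that quantity on toy data. It is not the package's implementation, which returns ROCR objects as listed above. The helper auc_roc_rank, the toy data frame, and its column names sample_id and sample_score are assumptions made for illustration; only the Y column is named in the documentation.

```r
# Hypothetical helper (not part of ssPATHS): AUC-ROC from the rank-sum statistic.
auc_roc_rank <- function(scores, labels, positive_val) {
  is_pos <- labels == positive_val
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data shaped like the sample_scores argument (column names are assumptions).
toy_scores <- data.frame(
  sample_id    = paste0("s", 1:6),
  sample_score = c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9),
  Y            = c(0, 0, 1, 1, 0, 1)
)

toy_auc <- auc_roc_rank(toy_scores$sample_score, toy_scores$Y, positive_val = 1)
print(toy_auc)
```

Swapping positive_val to the other label gives one minus this value when there are no ties, which is one reason the argument has to match one of the two values in Y.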
This method performs linear discriminant analysis on a reference dataset using a pre-defined set of genes related to a pathway of interest.
get_gene_weights(expression_se, gene_ids, unidirectional)
expression_se | This is a SummarizedExperiment object of the reference samples. Rows are genes and columns are samples. The colData component must contain a |
gene_ids | This is a vector of strings, where each element is a |
unidirectional | This is a boolean, |
A list containing the gene weights and estimated scores of the reference samples.
proj_vector_df | A dataframe containing the gene weights and gene ids |
dca_proj | A dataframe containing the sample scores and sample ids. |
Natalie R. Davidson
Steven C.H. Hoi, W. Liu, M.R. Lyu and W.Y. Ma (2006). Learning Distance Metrics with Contextual Constraints for Image Retrieval. Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR2006).
library(ssPATHS)
library(SummarizedExperiment)

data(tcga_expr_df)

# transform from data.frame to SummarizedExperiment
tcga_se <- SummarizedExperiment(t(tcga_expr_df[ , -(1:4)]),
                                colData=tcga_expr_df[ , 2:4])
colnames(tcga_se) <- tcga_expr_df$tcga_id
colData(tcga_se)$sample_id <- tcga_expr_df$tcga_id

# get related genes, for us hypoxia
hypoxia_gene_ids <- get_hypoxia_genes()
hypoxia_gene_ids <- intersect(hypoxia_gene_ids, rownames(tcga_se))

# setup labels for classification
colData(tcga_se)$Y <- ifelse(colData(tcga_se)$is_normal, 0, 1)

# now we can get the gene weightings
res <- get_gene_weights(tcga_se, hypoxia_gene_ids, unidirectional=TRUE)
gene_weights_test <- res[[1]]
sample_scores <- res[[2]]
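To illustrate the general idea of learning a weight per gene from two labelled groups and then scoring samples by projection, here is a minimal base-R sketch using a plain two-class Fisher discriminant on simulated data. This is a stand-in technique chosen for illustration: the package cites Hoi et al. (2006), and its actual procedure, preprocessing, and handling of the unidirectional option may differ. The function fisher_direction and all simulated values are assumptions.

```r
set.seed(1)
n_genes <- 5
n_per_class <- 40

# Simulated log-expression; here rows are samples and columns are genes.
ctrl <- matrix(rnorm(n_per_class * n_genes), ncol = n_genes)
shift <- c(1.5, 1.0, 0.5, 0, 0)  # only the first three genes respond
pos <- sweep(matrix(rnorm(n_per_class * n_genes), ncol = n_genes), 2, shift, "+")

# Hypothetical helper: Fisher discriminant direction for two classes.
fisher_direction <- function(x0, x1) {
  mu_diff <- colMeans(x1) - colMeans(x0)
  # pooled within-class scatter matrix
  s_w <- crossprod(scale(x0, scale = FALSE)) +
         crossprod(scale(x1, scale = FALSE))
  w <- solve(s_w, mu_diff)
  w / sqrt(sum(w^2))  # unit length, so weights are comparable across genes
}

w <- fisher_direction(ctrl, pos)

# A sample's score is its projection onto the learned direction.
score_ctrl <- as.vector(ctrl %*% w)
score_pos <- as.vector(pos %*% w)
print(round(w, 3))
print(c(ctrl = mean(score_ctrl), pos = mean(score_pos)))
```

By construction the positive group has the higher mean score, and genes that do not differ between groups tend to receive weights near zero.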
Returns a vector of Ensembl ids of hypoxia related genes.
get_hypoxia_genes()
Vector of Ensembl ids.
Natalie R. Davidson
library(ssPATHS)
library(SummarizedExperiment)

# read in the reference expression data for hypoxia score generation
data(tcga_expr_df)

# transform from data.frame to SummarizedExperiment
tcga_se <- SummarizedExperiment(t(tcga_expr_df[ , -(1:4)]),
                                colData=tcga_expr_df[ , 2:4])
colnames(tcga_se) <- tcga_expr_df$tcga_id
colData(tcga_se)$sample_id <- tcga_expr_df$tcga_id

# let's get the expression of hypoxia associated genes
hypoxia_gene_ids <- get_hypoxia_genes()
hypoxia_gene_ids <- intersect(hypoxia_gene_ids, rownames(tcga_se))
hypoxia_se <- tcga_se[hypoxia_gene_ids,]
Using the gene weights learned from the reference cohort, we apply the weightings to new samples to estimate their pathway activity.
get_new_samp_score(gene_weights, expression_se, gene_ids, run_normalization = TRUE)
gene_weights | This is a data.frame containing gene ids and gene weights, output by get_gene_weights. The gene ids must be among the row names of expression_se. |
expression_se | This is a SummarizedExperiment object of the new samples to be scored. Rows are genes and columns are samples. The colData component must contain columns |
gene_ids | This is a vector of strings, where each element is a |
run_normalization | Boolean value. If TRUE, the data will be log-transformed, centered and scaled. This is recommended because the same is done to the reference set when learning the gene weights. |
A data.frame containing the sample id, sample score, and associated Y value if it was included in expression_se.
Natalie R. Davidson
library(ssPATHS)
library(SummarizedExperiment)

data(tcga_expr_df)

# transform from data.frame to SummarizedExperiment
tcga_se <- SummarizedExperiment(t(tcga_expr_df[ , -(1:4)]),
                                colData=tcga_expr_df[ , 2:4])
colnames(tcga_se) <- tcga_expr_df$tcga_id
colData(tcga_se)$sample_id <- tcga_expr_df$tcga_id

# get the genes of interest, here hypoxia genes
hypoxia_gene_ids <- get_hypoxia_genes()
hypoxia_gene_ids <- intersect(hypoxia_gene_ids, rownames(tcga_se))

# label the samples for classification
colData(tcga_se)$Y <- ifelse(colData(tcga_se)$is_normal, 0, 1)

# now we can get the gene weightings
res <- get_gene_weights(tcga_se, hypoxia_gene_ids, unidirectional=TRUE)
gene_weights <- res[[1]]
sample_scores <- res[[2]]

# get the new data so we can apply our score to it
data(new_samp_df)
new_samp_se <- SummarizedExperiment(t(new_samp_df[ , -(1)]),
                                    colData=new_samp_df[ , 1, drop=FALSE])
colnames(colData(new_samp_se)) <- "sample_id"

new_score_df_calculated <- get_new_samp_score(gene_weights, new_samp_se)
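The scoring step described for this function (log-transform, center, scale, then apply the learned weights) can be sketched in base R as a weighted sum over the genes shared between the weights and the new data. This is an illustration under stated assumptions, not the package's code: the helper score_new_samples, the use of log1p as the log transform, per-gene z-scoring across the new samples, and the toy gene ids and weights are all hypothetical.

```r
# Hypothetical helper (not the ssPATHS implementation).
# expr:    numeric matrix, genes in rows (gene ids as row names), samples in columns
# weights: named numeric vector of gene weights
score_new_samples <- function(expr, weights, run_normalization = TRUE) {
  shared <- intersect(names(weights), rownames(expr))
  stopifnot(length(shared) > 0)
  x <- t(expr[shared, , drop = FALSE])  # samples x genes
  if (run_normalization) {
    x <- scale(log1p(x))  # assumption: log(1 + x), then z-score each gene
    x[is.na(x)] <- 0      # genes with zero variance carry no information
  }
  scores <- as.vector(x %*% weights[shared])
  names(scores) <- colnames(expr)
  scores
}

# Toy expression matrix: each line below is one sample (one column).
toy_expr <- matrix(
  c(10, 200, 5,
    20, 100, 5,
    40,  50, 5),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# geneD is absent from the data and is silently dropped.
toy_weights <- c(geneA = 0.8, geneB = -0.6, geneD = 0.3)

new_scores <- score_new_samples(toy_expr, toy_weights)
print(new_scores)
```

Normalizing the new samples the same way as the reference cohort keeps the weights on a comparable scale; with centering, the scores in a batch sum to zero, so they are relative to the other samples scored together.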
A data frame with columns:
String. The name of the sample. Samples with "hyp" or "norm" in the sample id are cell lines that were exposed to hypoxic or normoxic conditions, respectively. Samples with "ctrl" or "noHIF" in the sample id were or were not, respectively, able to produce a HIF-mediated hypoxic response.
Int. Gene expression value for this gene.
data(new_samp_df)
An object of class data.frame with 12 rows and 27 columns.
Generated by Philipp Markolin, files will be uploaded on GEO
## Not run: 
new_samp_df
## End(Not run)
A data frame with columns:
String. TCGA aliquot barcode
String. TCGA study abbreviation
Boolean. TRUE if sample is adjacent normal, FALSE if tumor.
Float. Library size as estimated by the 75th percentile (upper quartile).
String. Library size normalized gene expression value for this gene.
data(tcga_expr_df)
An object of class data.frame with 9461 rows and 54 columns.
These data were generated by the TCGA Research Network (https://www.cancer.gov/tcga) and downloaded from the NCI Genomic Data Commons.
## Not run: 
tcga_expr_df
## End(Not run)