Title: | Outlier Analysis for pairwise differential comparison |
---|---|
Description: | Blacksheep is a tool designed for outlier analysis in the context of pairwise comparisons in an effort to find distinguishing characteristics from two groups. This tool was designed to be applied for biological applications such as phosphoproteomics or transcriptomics, but it can be used for any data that can be represented by a 2D table, and has two sub populations within the table to compare. |
Authors: | MacIntosh Cornwell [aut], RugglesLab [cre] |
Maintainer: | RugglesLab <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-12-19 03:09:33 UTC |
Source: | https://github.com/bioc/blacksheepr |
Create the annotation object for plotting in a heatmap
annotationlist_builder(metatable, customcolorlist = NULL)
annotationlist_builder(metatable, customcolorlist = NULL)
metatable |
the metatable containing information for the columns |
customcolorlist |
DEFAULT: NULL, enter colorlist to manually set colors |
return the annotation object
metatable <- data.frame(row.names = c("samp1", "samp2", "samp3", "samp4"), A = c(rep("high", 2), rep("low", 2)), B = seq(1,7,2)) customcolorlist <- list(A = c("high" = "red", "low" = "blue"), B = circlize::colorRamp2(seq(-5, 5, length = 3), RColorBrewer::brewer.pal(3, "Reds"))) annotationlist_builder(metatable, customcolorlist)
metatable <- data.frame(row.names = c("samp1", "samp2", "samp3", "samp4"), A = c(rep("high", 2), rep("low", 2)), B = seq(1,7,2)) customcolorlist <- list(A = c("high" = "red", "low" = "blue"), B = circlize::colorRamp2(seq(-5, 5, length = 3), RColorBrewer::brewer.pal(3, "Reds"))) annotationlist_builder(metatable, customcolorlist)
Create all of the groups based on the input metadata
comparison_groupings(comptable)
comparison_groupings(comptable)
comptable |
table where each column will have comparisons drawn from it |
a list with each of the groups as an entry in the list NOTE - this list will be ncol*2 long where ncol is the number comparisons
data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata)
data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata)
Count up the outlier information for each of the groups you have made. If aggregating then you will have to turn the parameter on, but you still input the outliertable. Aggregate will count the total number of outliers AND nonoutliers in its operation, so it needs the original outlier table made by the <make_outlier_table> function.
count_outliers(groupings, outliertab, aggregate_features = FALSE, feature_delineator = "\\.")
count_outliers(groupings, outliertab, aggregate_features = FALSE, feature_delineator = "\\.")
groupings |
table generated by the comparison_groupings function |
outliertab |
outlier table generated by make_outlier_table |
aggregate_features |
DEFAULT: FALSE; Toggle the Aggregate feature, which will aggregate features in your table based on the given delineator. Aggregation will output counts for the TOTAL number of outliers and non- outliers across ALL sites you aggregate across. |
feature_delineator |
DEFAULT: <"\.">; What character delineates the separation between primary and secondary features. NOTE: to use proper R syntax with escape characters if necessary Ex) Protein1.Phosphosite1 uses "\." to aggregate on Protein1 |
the tabulated information of outliers per group
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab
Plot out a heatmap
create_heatmap(counttab = counttab, colmetatable = NULL, colannotationlist = NULL, colclusterparam = FALSE, rowclusterparam = FALSE, nameparam)
create_heatmap(counttab = counttab, colmetatable = NULL, colannotationlist = NULL, colclusterparam = FALSE, rowclusterparam = FALSE, nameparam)
counttab |
table with counts, samples -x-axis, features -y-axis |
colmetatable |
the metatable containing information for the columns |
colannotationlist |
annotation table for columns, based off colmetatable |
colclusterparam |
cluster the columns? |
rowclusterparam |
cluster the rows? |
nameparam |
the title on the heatmap |
prints a pdf heatmap out to the designated outpath
data("sample_phosphodata") counttab <- sample_phosphodata nameparam <- "testplot" create_heatmap(counttab = counttab, colmetatable = NULL, colannotationlist = NULL,colclusterparam = FALSE, rowclusterparam = FALSE, nameparam)
data("sample_phosphodata") counttab <- sample_phosphodata nameparam <- "testplot" create_heatmap(counttab = counttab, colmetatable = NULL, colannotationlist = NULL,colclusterparam = FALSE, rowclusterparam = FALSE, nameparam)
Run the entire blacksheep Function from Start to finish
deva(se, analyze_negative_outliers = FALSE, aggregate_features = FALSE, feature_delineator = "\\.", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1)
deva(se, analyze_negative_outliers = FALSE, aggregate_features = FALSE, feature_delineator = "\\.", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1)
se |
The SummarizedExperiment object containing the countdata and the associated annotation data with comparisons in the colData object. |
analyze_negative_outliers |
DEFAULT: FALSE; Toggle the analysis of outliers in the negative direction as well. Will lead to the output of the outlier table containing "-1" values, in addition to negative outputs for boundaries and aggregate tables (if applicable) |
aggregate_features |
DEFAULT: FALSE; Toggle the Aggregate feature, which will aggregate features in your table based on the given delineator. Aggregation will output an aggregate table that counts the number of outliers per feature, and also a fraction table that show the number of outliers / number of candidates (which excludes missing values) |
feature_delineator |
DEFAULT: <"\."> What character delineates the separation between primary and secondary features. NOTE: to use proper R syntax with escape characters if necessary Ex) Protein1.Phosphosite1 uses "\." to aggregate on Protein1 |
fraction_samples_cutoff |
DEFAULT: 0.3; Input a fractional cut off for the of samples that need to have an outlier for feature to be considered. ex) 10 samples in ingroup - 3 need to have an outlier for feature to be considered significant |
fdrcutoffvalue |
DEFAULT: 0.1; The FDR value for significance |
outputs the full output of deva, including the analysis tables, the heatmaps for the analyses, the fraction table showing the fraction of outliers per sample, and the median and boundary values that together comprise the outlier boundary
suppressPackageStartupMessages(library(SummarizedExperiment)) data("sample_phosphodata") data("sample_annotationdata") se <- SummarizedExperiment( assays = list(counts = as.matrix(sample_phosphodata[1:1000,])), colData = DataFrame(sample_annotationdata)) deva(se = se, analyze_negative_outliers = FALSE, aggregate_features = FALSE, feature_delineator = "-", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1)
suppressPackageStartupMessages(library(SummarizedExperiment)) data("sample_phosphodata") data("sample_annotationdata") se <- SummarizedExperiment( assays = list(counts = as.matrix(sample_phosphodata[1:1000,])), colData = DataFrame(sample_annotationdata)) deva(se = se, analyze_negative_outliers = FALSE, aggregate_features = FALSE, feature_delineator = "-", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1)
Normalization of data to prepare for deva. Uses a Median of Ratio method followed by a log2 transformation.
deva_normalization(intable, method = "MoR-log")
deva_normalization(intable, method = "MoR-log")
intable |
table with samples along the columns and features along the rows. |
method |
DEFAULT: "MoR-log"; Method by which to normalize data in preparation for deva. Options are <"MoR-log", "MoR", "log">. Where "MoR" refers to the Median of ratio's. The "log" transformation is necessary to compress heavily skewed data and allow for proper detection. "MoR-log" as the default will perform MoR followed by a log2 transform. |
A normalized table for input into deva
library(pasilla) pasCts <- system.file("extdata", "pasilla_gene_counts.tsv", package="pasilla") cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id")) norm_cts <- deva_normalization(cts, method = "MoR-log")
library(pasilla) pasCts <- system.file("extdata", "pasilla_gene_counts.tsv", package="pasilla") cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id")) norm_cts <- deva_normalization(cts, method = "MoR-log")
Utility function that allows easier grabbing of data
deva_results(deva_out, ID = NULL, type = NULL)
deva_results(deva_out, ID = NULL, type = NULL)
deva_out |
output from the deva function |
ID |
The keyword to search through analyses and grab desired output |
type |
<"table" | "heatmap" | "fraction_table" | "median" | "boundary"> to return the desirted analysis type |
desired subset of analysis from deva
suppressPackageStartupMessages(library(SummarizedExperiment)) data("sample_phosphodata") data("sample_annotationdata") se = SummarizedExperiment( assays = list(counts = as.matrix(sample_phosphodata[1:1000,])), colData = DataFrame(sample_annotationdata)) deva_out = deva(se = se, analyze_negative_outliers = FALSE, aggregate_features = TRUE, feature_delineator = "-", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1) deva_results(deva_out, ID = "outlieranalysis", type = "table")
suppressPackageStartupMessages(library(SummarizedExperiment)) data("sample_phosphodata") data("sample_annotationdata") se = SummarizedExperiment( assays = list(counts = as.matrix(sample_phosphodata[1:1000,])), colData = DataFrame(sample_annotationdata)) deva_out = deva(se = se, analyze_negative_outliers = FALSE, aggregate_features = TRUE, feature_delineator = "-", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1) deva_results(deva_out, ID = "outlieranalysis", type = "table")
Utility function that will take in columns with several subcategories, and output several columns each with binary classifications. ex) col1: A,B,C >> colA: A,notA,notA; colB: notB,B,notB; colC: notC,notC,C
make_comparison_columns(intable)
make_comparison_columns(intable)
intable |
table where each column has more than one subcategory, can be multiple columns |
an expanded table with each of the columns as a binary labeling of each subcategory.
data("sample_annotationdata") new_comparisons <- make_comparison_columns( sample_annotationdata[,1,drop=FALSE])
data("sample_annotationdata") new_comparisons <- make_comparison_columns( sample_annotationdata[,1,drop=FALSE])
Separate out the "i"th gene, take the bounds, and then create a column that says whether or not this gene is high, low, or none in a sample with regards to the other samples in the dataset. Repeat this for every gene to create a reference table.
make_outlier_table(intable, analyze_negative_outliers = FALSE)
make_outlier_table(intable, analyze_negative_outliers = FALSE)
intable |
table with all of the inputted information, samples along the x-axis, features along the y-axis |
analyze_negative_outliers |
DEFAULT: FALSE; Toggle the analysis of outliers in the negative direction. Will lead to the output of the outlier table containing "-1" values, in addition to negative outputs for boundaries and aggregate tables (if applicable) |
a list with varied sections depending on parameters: $outliertab - table converted to outlier form with 0s, 1s, and -1s, $upperboundtab - list of upper boundaries for outliers $lowerboundtab - list of lower boundaries of outliers $sampmedtab - list of median value per feature
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,], analyze_negative_outliers = FALSE) outliertab <- reftable_function_out$outliertab upperboundtab <- reftable_function_out$upperboundtab lowerboundtab <- reftable_function_out$lowerboundtab sampmedtab <- reftable_function_out$sampmedtab
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,], analyze_negative_outliers = FALSE) outliertab <- reftable_function_out$outliertab upperboundtab <- reftable_function_out$upperboundtab lowerboundtab <- reftable_function_out$lowerboundtab sampmedtab <- reftable_function_out$sampmedtab
With the grouptablist generated by count_outliers - run through and run a fisher exact test to get the p.value for the difference in outlier count for each feature in each of your comparisons
outlier_analysis(grouptablist, fraction_table = NULL, fraction_samples_cutoff = 0.3, write_out_tables = FALSE, outfilepath = tempdir())
outlier_analysis(grouptablist, fraction_table = NULL, fraction_samples_cutoff = 0.3, write_out_tables = FALSE, outfilepath = tempdir())
grouptablist |
table generated by the count_outliers function. NOTE that the inputted grouptablist will be deciphered to determine its content. This means that user decides to input the outliertab or aggregate tab, and the output will analyze according to what positive and negative information is contained within the table |
fraction_table |
DEFAULT: NULL; Input a fraction table to filter to only include features that have x an outlier. |
fraction_samples_cutoff |
DEFAULT: 0.3; Input a fractional cut off for the of samples that need to have an outlier for feature to be considered. ex) 10 samples in ingroup - 3 need to have an outlier for feature to be considered significant |
write_out_tables |
DEFAULT: FALSE; utility in function to write out each of the analyses to a separate table to whereever <outfilepath> is specfied. |
outfilepath |
the full string path to where the file should output to, DEFAULT is a tempdir() |
the analysis table with p.value, fdr, and raw data per comparison
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab outlier_analysis_out <- outlier_analysis(grouptablist, fraction_table = fractiontab)
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab outlier_analysis_out <- outlier_analysis(grouptablist, fraction_table = fractiontab)
With the grouptablist generated by count_outliers - run through and run a fisher exact test to get the p.value for the difference in outlier count for each feature in each of your comparisons
outlier_heatmap(outlier_analysis_out, analysis_num = NULL, counttab, metatable, fdrcutoffvalue = 0.1)
outlier_heatmap(outlier_analysis_out, analysis_num = NULL, counttab, metatable, fdrcutoffvalue = 0.1)
outlier_analysis_out |
the full outlier_analysis data objet |
analysis_num |
DEFAULT: NULL; if you only want to plot the heatmap for a particular analysis, enter number of that analysis |
counttab |
the raw data before outlier analysis |
metatable |
the complete metatable that was used to generate the comparisons, will be used for annotation of the heatmap |
fdrcutoffvalue |
DEFAULT: 0.1; The FDR value for significance |
outputs a pdf with the heatmap in the current working directory
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab outlier_analysis_out <- outlier_analysis(grouptablist, fraction_table = fractiontab) metatable <- sample_annotationdata counttab <- sample_phosphodata hm1 <- outlier_heatmap(outlier_analysis_out, analysis_num = NULL, fractiontab, metatable, fdrcutoffvalue = 0.1)
data("sample_phosphodata") reftable_function_out <- make_outlier_table(sample_phosphodata[1:1000,]) outliertab <- reftable_function_out$outliertab data("sample_annotationdata") groupings <- comparison_groupings(sample_annotationdata) count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = FALSE) grouptablist <- count_outliers_out$grouptablist fractiontab <- count_outliers_out$fractiontab outlier_analysis_out <- outlier_analysis(grouptablist, fraction_table = fractiontab) metatable <- sample_annotationdata counttab <- sample_phosphodata hm1 <- outlier_heatmap(outlier_analysis_out, analysis_num = NULL, fractiontab, metatable, fdrcutoffvalue = 0.1)
Example annotation data for Outlier analysis. This example data is a subset of the data used in the CPTAC3 Breast Cancer exploration study: (doi: 10.1038/nature18003). Each row corresponds to a sample and each column is an binary annotation for that sample.
sample_annotationdata
sample_annotationdata
A data frame with 76 rows and 6 variables:
The binary PAM50 Her2 classification for each sample
The binary PAM50 Basal classification for each sample
The binary PAM50 LumA classification for each sample
The binary PAM50 LumB classification for each sample
The ER Status classification for each sample
The PR Status classification for each sample
...
https://cptac-data-portal.georgetown.edu/cptac/s/S029
Example phosphoprotein data for Outlier analysis This example data is a subset of the data used in the CPTAC3 Breast Cancer exploration study: (doi: 10.1038/nature18003). Each row corresponds to a phosphoprotein site, and each column is a sample. The values within the table are normalized massspec phosphoprotein values.
sample_phosphodata
sample_phosphodata
A data frame with 15532 rows and 76 variables:
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
phosphoprotein levels for each gene
https://cptac-data-portal.georgetown.edu/cptac/s/S029
Example RNA data for Outlier analysis This example data is a subset of the data used in the CPTAC3 Breast Cancer exploration study: (doi: 10.1038/nature18003). Each row corresponds to a gene, and each column is a sample. The values within the table are normalized transcript counts.
sample_rnadata
sample_rnadata
A data frame with 4317 rows and 76 variables:
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
RNA levels for each gene
https://cptac-data-portal.georgetown.edu/cptac/s/S029