Title: | Outlier Protein and Phosphosite Target Identifier |
---|---|
Description: | The aim of oppti is to analyze protein (and phosphosite) expressions to find outlying markers for each sample in the given cohort(s) for the discovery of personalized actionable targets. |
Authors: | Abdulkadir Elmas |
Maintainer: | Abdulkadir Elmas <[email protected]> |
License: | MIT |
Version: | 1.21.0 |
Built: | 2024-10-30 09:21:19 UTC |
Source: | https://github.com/bioc/oppti |
Infers the normal-state expression of a marker based on its co-expression network, i.e., the weighted average of the marker's nearest neighbors in the data. The returned imputed data will later be used to elucidate dysregulated (protruding) events.
artImpute(dat, ku = 6, marker.proc.list = NULL, miss.pstat = 0.4, verbose = FALSE)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
ku |
an integer in [1,num.markers], upper bound on the number of nearest neighbors of a marker. |
marker.proc.list |
character array, the row names of the data to be processed/imputed. |
miss.pstat |
the score threshold for ignoring potential outliers during imputation. miss.pstat = 1 ignores values outside of the density box (i.e., 1st-3rd quartiles). The algorithm ignores values lying at least (1/miss.pstat)-1 times IQR away from the box; e.g., use miss.pstat=1 to ignore all values lying outside of the box; use miss.pstat=0.4 to ignore values lying at least 1.5 x IQR away from the box; use miss.pstat=0 to employ all data during imputation. |
verbose |
logical, to show progress of the algorithm. |
the imputed data that putatively represents the expressions of the markers in the (matched) normal states.
dat = setNames(as.data.frame(matrix(1:(5*10),5,10), row.names = paste('marker',1:5,sep='')), paste('sample',1:10,sep=''))
imputed = artImpute(dat, ku = 2)
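The relation between miss.pstat and the IQR multiplier described for that argument can be checked with a short base-R sketch. The helpers `iqr_multiplier()` and `flag_ignored()` are hypothetical names introduced here for illustration; they are not part of oppti and only restate the rule "(1/miss.pstat)-1 times IQR away from the box" under the assumption that the box is the 1st-3rd quartile range of a single marker.

```r
# Hypothetical helper (not part of oppti): convert miss.pstat into the
# IQR multiplier described above, i.e. (1 / miss.pstat) - 1.
iqr_multiplier <- function(miss.pstat) {
  stopifnot(miss.pstat >= 0, miss.pstat <= 1)
  1 / miss.pstat - 1   # miss.pstat = 0 gives Inf, so nothing is ignored
}

# Hypothetical helper: flag values lying outside the 1st-3rd quartile box
# by at least that many IQRs.
flag_ignored <- function(x, miss.pstat = 0.4) {
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  # Distance from the box; zero for values inside it
  d <- pmax(q[1] - x, x - q[2], 0)
  d > 0 & d >= iqr_multiplier(miss.pstat) * iqr
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
mult <- iqr_multiplier(0.4)      # multiplier for the default miss.pstat
flags <- flag_ignored(x, 0.4)    # logical vector, TRUE where a value would be ignored
```

With the default miss.pstat = 0.4 the multiplier is 1.5, which is the familiar box-plot whisker rule.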
Displays the hierarchically clustered data by the "pheatmap" package. The numbers of clusters along the markers/samples can be set by the user, then the cluster structures are estimated by pair-wise analysis.
clusterData(data, annotation_row = NULL, annotation_col = NULL, annotation_colors = NULL, main = NA, legend = TRUE, clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", display_numbers = FALSE, number_format = "%.0f", num_clusters_row = NULL, num_clusters_col = NULL, cluster_rows = TRUE, cluster_cols = TRUE, border_color = "gray60", annotate_new_clusters_col = FALSE, zero_white = FALSE, color_low = '#006699', color_mid = 'white', color_high = 'red',color_palette = NULL, show_rownames = FALSE, show_colnames = FALSE, min_data = min(data, na.rm = TRUE), max_data = max(data, na.rm = TRUE), treeheight_row = ifelse(methods::is(cluster_rows, "hclust") || cluster_rows, 50, 0), treeheight_col = ifelse(methods::is(cluster_cols, "hclust") || cluster_cols, 50, 0))
data |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
annotation_row |
data frame that specifies the annotations shown on the left side of the heat map. Each row defines the features for a specific row. The rows in the data and in the annotation are matched using corresponding row names. Note that the color schemes take into account whether a variable is continuous or discrete. |
annotation_col |
similar to annotation_row, but for columns. |
annotation_colors |
list for specifying annotation_row and annotation_col track colors manually. It is possible to define the colors for only some of the features. |
main |
character string, an overall title for the plot. |
legend |
logical, to determine if legend should be drawn or not. |
clustering_distance_rows |
distance measure used in clustering rows. Possible values are "correlation" for Pearson correlation and all the distances supported by dist, such as "euclidean", etc. If the value is none of the above it is assumed that a distance matrix is provided. |
clustering_distance_cols |
distance measure used in clustering columns. Possible values are the same as for clustering_distance_rows. |
display_numbers |
logical, determining if the numeric values are also printed to the cells. If this is a matrix (with same dimensions as original matrix), the contents of the matrix are shown instead of original values. |
number_format |
format strings (C printf style) of the numbers shown in cells. For example "%.2f" shows 2 decimal places and "%.1e" shows exponential notation (see more in sprintf). |
num_clusters_row |
number of clusters the rows are divided into, based on the hierarchical clustering (using cutree). If rows are not clustered, the argument is ignored. |
num_clusters_col |
similar to num_clusters_row, but for columns. |
cluster_rows |
logical, determining if the rows should be clustered; or a hclust object. |
cluster_cols |
similar to cluster_rows, but for columns. |
border_color |
color of cell borders on heatmap, use NA if no border should be drawn. |
annotate_new_clusters_col |
logical, to annotate cluster IDs (column) that will be identified. |
zero_white |
logical, to display 0 values as white in the colormap. |
color_low |
color code for the low intensity values in the colormap. |
color_mid |
color code for the medium intensity values in the colormap. |
color_high |
color code for the high intensity values in the colormap. |
color_palette |
vector of colors used in heatmap. |
show_rownames |
boolean, specifying if row names are to be shown. |
show_colnames |
boolean, specifying if column names are to be shown. |
min_data |
numeric, data value corresponding to minimum intensity in the color_palette |
max_data |
numeric, data value corresponding to maximum intensity in the color_palette |
treeheight_row |
the height of a tree for rows, if these are clustered. Default value is 50 points. |
treeheight_col |
the height of a tree for columns, if these are clustered. Default value is 50 points. |
tree, the hierarchical tree structure.
cluster_IDs_row, the (row) cluster identities of the markers.
cluster_IDs_col, the (column) cluster identities of the samples.
set.seed(1)
dat = setNames(as.data.frame(matrix(runif(10*10),10,10), row.names = paste('marker',1:10,sep='')), paste('sample',1:10,sep=''))
result = clusterData(dat)
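The cluster identities mentioned above come from cutting a hierarchical tree into a fixed number of groups. The following base-R sketch shows that step in isolation, without any plotting. It is an illustration under stated assumptions (Euclidean distance, default linkage of `hclust`), not the internals of clusterData(); the variable names merely mirror the documented return values.

```r
set.seed(1)
mat <- matrix(runif(10 * 10), 10, 10,
              dimnames = list(paste0("marker", 1:10), paste0("sample", 1:10)))

# Cluster rows (markers) on Euclidean distance, then cut into 3 groups
tree_row <- stats::hclust(stats::dist(mat, method = "euclidean"))
cluster_IDs_row <- stats::cutree(tree_row, k = 3)

# Cluster columns (samples) the same way by transposing the matrix
tree_col <- stats::hclust(stats::dist(t(mat), method = "euclidean"))
cluster_IDs_col <- stats::cutree(tree_col, k = 2)
```

Each result is a named integer vector, so cluster membership can be looked up by marker or sample name.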
Filters out markers based on the percentage of missing values, low-expression and low-variability rates.
dropMarkers(dat, percent_NA = 0.2, low_mean_and_std = 0.05, q_low_var = 0.25, force_drop = NULL)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
percent_NA |
a constant in [0,1], the percentage of missing values that will be tolerated in the filtered data. |
low_mean_and_std |
a constant in [0,inf], the lower-bound of the mean or standard deviation of a marker in the filtered data. |
q_low_var |
a constant in [0,1], the quantile of marker variances which serves as a lower-bound of the marker variances in the filtered data. |
force_drop |
character array containing the marker names that user specifically wants to filter out. |
filtered data with the same format as the input data.
the row names (markers) of the data that are filtered out due to low-expression or low-variability.
dat = setNames(as.data.frame(matrix(1:(5*10),5,10), row.names = paste('marker',1:5,sep='')), paste('sample',1:10,sep=''))
dat[1,1:2] = NA # marker1 has 20% missing values
dropMarkers(dat, percent_NA = .2) # marker1 is filtered out
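The missing-value part of this filter can be sketched in base R as follows. This is only an illustration of the idea, not the code of dropMarkers(): it assumes that a marker is dropped when its fraction of NAs is at or above percent_NA, which is consistent with the example above, and it leaves out the low-expression and low-variability criteria.

```r
dat <- setNames(
  as.data.frame(matrix(1:(5 * 10), 5, 10),
                row.names = paste0("marker", 1:5)),
  paste0("sample", 1:10)
)
dat[1, 1:2] <- NA   # marker1 now has 20% missing values

# Fraction of missing values per marker (row)
na_frac <- rowMeans(is.na(dat))

# Assumption: markers at or above the threshold are dropped
percent_NA <- 0.2
filtered <- dat[na_frac < percent_NA, , drop = FALSE]
```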
For each marker processed, draws a scatter plot of matching values of observed vs imputed expressions.
dysReg(dat, dat.imp, marker.proc.list = NULL, verbose = FALSE)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
dat.imp |
the imputed data that putatively represents the expressions of the markers in the (matched) normal states. |
marker.proc.list |
character array, the row names of the data to be processed for dysregulation. |
verbose |
logical, to show progress of the algorithm |
samples' distances to regression line (i.e., dysregulation) on the scatter plots.
the scatter plots.
dat = setNames(as.data.frame(matrix(1:(5*10),5,10), row.names = paste('marker',1:5,sep='')), paste('sample',1:10,sep=''))
dat.imp = artImpute(dat, ku=2)
result = dysReg(dat, dat.imp)
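The notion of a sample's distance to the regression line can be illustrated for a single marker with base R. The sketch assumes that the score is the vertical residual from a least-squares fit of observed on imputed values; the distance actually used by dysReg() may be defined differently. The vectors `observed` and `imputed` are made-up stand-ins for one row of the data.

```r
set.seed(1)
imputed  <- runif(10)                        # stand-in for one marker's imputed values
observed <- imputed + rnorm(10, sd = 0.05)   # observed values close to the imputed ones
observed[3] <- observed[3] + 1               # inject one strongly up-regulated sample

fit <- stats::lm(observed ~ imputed)
# Assumption: vertical residuals stand in for "distance to the regression line"
dys <- stats::residuals(fit)
most_dysregulated <- which.max(dys)          # index of the sample furthest above the line
```

A large positive score marks a sample whose observed expression sits well above what its inferred normal state predicts.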
Mark outlying expressions on the scatter plot of a given marker
markOut(dat, dat.imp, dat.imp.test, dat.dys, dys.sig.thr.upp, marker.proc.list = NULL, dataset = "", num.omit.fit = NULL, draw.sc = TRUE, draw.vi = TRUE, conf.int = 0.95, ylab = "Observed", xlab = "Inferred")
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
dat.imp |
the imputed data that putatively represents the expressions of the markers in the (matched) normal states. |
dat.imp.test |
marker's p-value of the statistical significance between its observed vs imputed values computed by the Kolmogorov-Smirnov test. |
dat.dys |
samples' distances to regression line (i.e., dysregulation) on the scatter plots. |
dys.sig.thr.upp |
the dysregulation score threshold to elucidate/mark significantly dysregulated outlier events. |
marker.proc.list |
character array, the row names of the data to be processed for outlier analyses and for plotting. |
dataset |
the cohort name to be used in the output files. |
num.omit.fit |
number of outlying events to ignore when fitting a marker's observed expressions to the imputed ones. |
draw.sc |
logical, to draw a scatter plot for every marker in marker.proc.list in a separate PDF file. |
draw.vi |
logical, to draw a violin plot for every marker in marker.proc.list in a separate PDF file. |
conf.int |
confidence interval to display around the regression line |
ylab |
a title for the y axis |
xlab |
a title for the x axis |
the scatter plots of the markers, in which the outlier dysregulation events are highlighted with red marks.
set.seed(1)
dat = setNames(as.data.frame(matrix(runif(10*10),10,10), row.names = paste('marker',1:10,sep='')), paste('sample',1:10,sep=''))
dat.imp = artImpute(dat, ku=6)
dat.imp.test = statTest(dat, dat.imp)[[1]]
dat.dys = dysReg(dat, dat.imp)[[1]]
plots = markOut(dat, dat.imp, dat.imp.test, dat.dys, dys.sig.thr.upp = .25)
Find outlying markers and events across cancer types.
oppti(data, mad.norm = FALSE, cohort.names = NULL, panel = "global", panel.markers = NULL, tol.nas = 20, ku = 6, miss.pstat = 0.4, demo.panels = FALSE, save.data = FALSE, draw.sc.plots = FALSE, draw.vi.plots = FALSE, draw.sc.markers = NULL, draw.ou.plots = FALSE, draw.ou.markers = NULL, verbose = FALSE)
data |
a list object where each element contains the proteomics data for a different cohort (markers in the rows, samples in the columns), or a character string defining the path to such data (in .RDS format). |
mad.norm |
logical, to normalize the proteomes to have a unit Median Absolute Deviation. |
cohort.names |
character array, the names of the cohorts. |
panel |
a character string describing marker panel, e.g., 'kinases'. Use 'global' to analyze all markers quantified across cohorts (default). Use 'pancan' to analyze the markers commonly quantified across the cohorts. |
panel.markers |
a character array containing the set of marker names that user wants to analyze, e.g., panel.markers = c("AAK1", "AATK", "ABL1", "ABL2", ...). |
tol.nas |
a constant in [0,100], tolerance for the percentage of NAs in a marker, e.g., tol.nas = 20 will filter out markers containing 20% or more NAs across samples. |
ku |
an integer in [1,num.markers], upper bound on the number of nearest neighbors of a marker. |
miss.pstat |
a constant in [0,1], statistic to estimate potential outliers. See 'artImpute()'. |
demo.panels |
logical, to draw demographics of the panel in each cohort. |
save.data |
logical, to save intermediate data (background inference and dysregulation measures). |
draw.sc.plots |
logical, to draw each marker's qqplot of observed vs inferred (imputed) expressions. |
draw.vi.plots |
logical, to draw each marker's violin plot of observed vs imputed expressions. |
draw.sc.markers |
character array, marker list to draw scatter plots |
draw.ou.plots |
logical, to draw each marker's outlier prevalence (by the percentage of outlying samples) across the cohorts. |
draw.ou.markers |
character array, marker list to draw pan-cancer outlier percentage plots |
verbose |
logical, to show progress of the algorithm. |
dysregulation scores of every marker for each sample.
the imputed data that putatively represents the expressions of the markers in the (matched) normal states.
the results of the Kolmogorov-Smirnov tests that evaluate the statistical significance of each marker's outlier samples.
a data list containing, for each cohort, the percentage of outlier samples for every marker.
a data list containing, for each cohort, the outlier significance threshold.
See artImpute() for how to set 'miss.pstat' and 'ku'.
set.seed(1)
dat = setNames(as.data.frame(matrix(runif(10*10),10,10), row.names = paste('marker',1:10,sep='')), paste('sample',1:10,sep=''))
result = oppti(dat)
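The effect of the mad.norm option, scaling to a unit Median Absolute Deviation, can be sketched in base R. The snippet assumes that each sample (column) is divided by its own MAD as computed by `stats::mad()` with its default scaling constant; whether oppti() normalizes per sample in exactly this way is an assumption, not something the documentation above states.

```r
set.seed(1)
# Four samples with deliberately different spreads
mat <- matrix(rnorm(10 * 4, sd = rep(1:4, each = 10)), 10, 4,
              dimnames = list(paste0("marker", 1:10), paste0("sample", 1:4)))

# Assumption: each sample (column) is scaled by its own MAD
col_mad <- apply(mat, 2, stats::mad, na.rm = TRUE)
mat_norm <- sweep(mat, 2, col_mad, "/")

# MAD of every column after scaling
mad_after <- apply(mat_norm, 2, stats::mad, na.rm = TRUE)
```

After scaling, every entry of `mad_after` equals 1 up to floating-point error, so samples with different spreads become comparable.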
Calculates a statistical measure of each data entry being a putative outlier
outScores(dat)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
outlier p-statistics
dat = setNames(as.data.frame(matrix(1:(5*10),5,10), row.names = paste('marker',1:5,sep='')), paste('sample',1:10,sep=''))
result = outScores(dat)
Draws the column densities of an object over multiple plots using the limma::plotDensities() function.
plotDen(dat, name = "", per.plot = 8, main = NULL, group = NULL, legend = TRUE)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
name |
name tag for the output file. |
per.plot |
number of densities to be drawn on a single plot. If NULL, ncol(dat) will be used. |
main |
character string, an overall title for the plot. |
group |
vector or factor classifying the samples (columns) into groups. Should be the same length as ncol(dat). |
legend |
character string giving position to place legend. See 'legend' for possible values. Can also be logical, with FALSE meaning no legend. |
pdf plot(s).
dat = setNames(as.data.frame(matrix(1:(5*10),5,10), row.names = paste('marker',1:5,sep='')), paste('sample',1:10,sep=''))
plotDen(dat, name = 'myresults')
Ranks markers in the order of decreasing percentage of outlying events.
rankPerOut(dat.dys, marker.proc.list = NULL, dys.sig.thr.upp)
dat.dys |
samples' distances to regression line (i.e., dysregulation) on the scatter plots. |
marker.proc.list |
character array, the row names of the data to be processed for outlier analyses. |
dys.sig.thr.upp |
the dysregulation score threshold to elucidate/mark significantly dysregulated outlier events. |
markers rank-ordered by the percentage of outliers over the samples.
the percentages of outliers corresponding to ranked markers.
set.seed(1)
dat = setNames(as.data.frame(matrix(runif(10*10),10,10), row.names = paste('marker',1:10,sep='')), paste('sample',1:10,sep=''))
dat.imp = artImpute(dat, ku=6)
dat.dys = dysReg(dat, dat.imp)[[1]]
result = rankPerOut(dat.dys, dys.sig.thr.upp = .25)
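The ranking itself amounts to counting, per marker, the share of samples whose dysregulation score exceeds the threshold. The base-R sketch below uses a small hand-made score matrix so the outcome is easy to follow. It assumes a strict "greater than" comparison against dys.sig.thr.upp, which may not match rankPerOut() at the boundary.

```r
# Hand-made dysregulation scores: 5 markers x 10 samples, all zero at first
dat.dys <- matrix(0, 5, 10,
                  dimnames = list(paste0("marker", 1:5), paste0("sample", 1:10)))
dat.dys["marker2", 1:4] <- 1     # marker2 is an outlier in 4 of 10 samples
dat.dys["marker4", 1:2] <- 0.5   # marker4 is an outlier in 2 of 10 samples

dys.sig.thr.upp <- 0.25
# Percentage of samples whose score exceeds the threshold, per marker
pct_out <- 100 * rowMeans(dat.dys > dys.sig.thr.upp, na.rm = TRUE)
ranked <- sort(pct_out, decreasing = TRUE)
```

Here `ranked` starts with marker2 (40% of samples) followed by marker4 (20%); the remaining markers have no outlying samples.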
Rank-order markers by the significance of deviation of the observed expressions from the (matched) imputed expressions based on the Kolmogorov-Smirnov (KS) test.
statTest(dat, dat.imp, marker.proc.list = NULL, pval.insig = 0.2)
dat |
an object of log2-normalized protein (or gene) expressions, containing markers in rows and samples in columns. |
dat.imp |
the imputed data that putatively represents the expressions of the markers in the (matched) normal states. |
marker.proc.list |
character array, the row names of the data to be processed for dysregulation significance. |
pval.insig |
p-value threshold to determine spurious (null) dysregulation events. |
each marker's p-value of the statistical significance between its observed vs imputed values computed by the KS test.
ranked p-values (KS test) of the significant markers, which are lower than pval.insig.
ranked significantly dysregulated markers with p-values lower than pval.insig.
ranked p-values (KS test) of the insignificant markers, which are greater than pval.insig.
ranked insignificantly dysregulated markers (spurious dysregulations) with p-values greater than pval.insig.
set.seed(1)
dat = setNames(as.data.frame(matrix(runif(10*10),10,10), row.names = paste('marker',1:10,sep='')), paste('sample',1:10,sep=''))
dat.imp = artImpute(dat, ku=6)
result = statTest(dat, dat.imp)
# the dysregulation on marker4 is
# statistically significant with p-value 0.05244755.
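The per-marker comparison can be approximated with `stats::ks.test()` from base R. The sketch below runs a two-sample KS test on each row and keeps the markers whose p-value falls below pval.insig. The imputed matrix is a made-up stand-in (observed values plus small noise, with one marker shifted on purpose), and the way statTest() pairs or preprocesses the values before testing is not reproduced here.

```r
set.seed(1)
dat <- matrix(runif(10 * 10), 10, 10,
              dimnames = list(paste0("marker", 1:10), paste0("sample", 1:10)))
# Stand-in for imputed data: the observed data plus small noise
dat.imp <- dat + matrix(rnorm(10 * 10, sd = 0.01), 10, 10)
dat.imp["marker4", ] <- dat["marker4", ] + 2   # shift one marker clearly away

# Two-sample KS test per marker (row): observed vs. imputed values
pvals <- vapply(rownames(dat), function(m) {
  stats::ks.test(dat[m, ], dat.imp[m, ])$p.value
}, numeric(1))

pval.insig <- 0.2
# Markers below the threshold, smallest p-value first
significant <- sort(pvals[pvals < pval.insig])
```

Because marker4's observed and stand-in imputed values do not overlap at all, it receives a very small p-value and appears in `significant`.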