Title: | Identification Of Clinically Relevant Genomic Subtypes Using Outcome Weighted Learning |
---|---|
Description: | survClust is an outcome-weighted integrative clustering algorithm used to classify multi-omic samples based on their available time-to-event information. The resulting clusters are cross-validated to avoid overfitting, and the output is a classification of samples that is molecularly distinct and clinically meaningful. It takes binary (mutation) as well as continuous data (other omic types). |
Authors: | Arshi Arora [aut, cre] |
Maintainer: | Arshi Arora <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-10-31 05:36:16 UTC |
Source: | https://github.com/bioc/survClust |
combineDist integrates the weighted distance matrices from getDist. All data types are collapsed into one NxN matrix.
combineDist(dist.dat)
dist.dat | list of weighted data matrices from |
combineDist integrates the matrices and cleans up missing pairs of samples. If the datasets list had non-overlapping samples, combineDist retains only those samples that have full information after accounting for all data types.
combMatFull: A matrix. Combines normalized information across m genomic data types into an NxN matrix, where N is the union of all samples across the m data types, or the samples with complete pairwise information. The final matrix should not contain any NAs.
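As a rough illustration of the integration and cleaning described above, the following base R sketch averages several distance matrices over the union of their samples and then drops samples until no NA remains. The helper name combine_dist_sketch, the averaging rule, and the order in which samples are dropped are assumptions made for illustration; this is not the package's implementation.

```r
# Hypothetical helper, NOT the survClust implementation.
combine_dist_sketch <- function(dist_list) {
  all_ids <- Reduce(union, lapply(dist_list, rownames))
  n <- length(all_ids)
  total <- matrix(0, n, n, dimnames = list(all_ids, all_ids))
  count <- matrix(0, n, n, dimnames = list(all_ids, all_ids))
  for (d in dist_list) {
    ids <- rownames(d)
    ok <- !is.na(d)
    vals <- d
    vals[!ok] <- 0
    total[ids, ids] <- total[ids, ids] + vals
    count[ids, ids] <- count[ids, ids] + ok
  }
  # Average over the data types that cover each pair (assumed rule)
  combined <- total / count
  combined[count == 0] <- NA
  # Drop the sample with the most missing pairs until no NA remains
  while (anyNA(combined)) {
    worst <- which.max(rowSums(is.na(combined)))
    combined <- combined[-worst, -worst, drop = FALSE]
  }
  combined
}

# Toy input: two data types with partially overlapping samples
pts1 <- matrix(c(0, 1, 3), ncol = 1, dimnames = list(c("a", "b", "c"), NULL))
pts2 <- matrix(c(0, 5), ncol = 1, dimnames = list(c("c", "d"), NULL))
d1 <- as.matrix(dist(pts1))
d2 <- as.matrix(dist(pts2))
cc <- combine_dist_sketch(list(d1, d2))
rownames(cc)
```

In this toy input, sample "d" has no distance to "a" or "b" in any data type, so it is the one removed.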
Arshi Arora
library(survClust)
dd <- getDist(simdat, simsurvdat)  # weighted distance matrices, one per data type
cc <- combineDist(dd)              # single integrated NxN matrix
cv_survclust performs k-fold cross-validation for a particular k: it runs survClust on each training fold and the held-out test fold, and returns cross-validated supervised cluster labels.
cv_survclust(datasets, survdat = NULL, k, fold, cmd.k = NULL, type = NULL)
datasets | A list object containing |
survdat | A matrix, containing two columns - 1st column |
k | integer, choice of |
fold | integer, number of folds to run cross validation |
cmd.k | integer, number of dimensions used by |
type | Specify |
cv.labels: returns cross-validated class labels for k clusters
cv.logrank: logrank test statistic of the cross-validated labels
cv.spwss: standardized pooled within-cluster sum of squares calculated from the cross-validated class labels
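To make the fold structure concrete, here is a minimal base R sketch of how samples might be divided into training and held-out test sets for fold-wise cross-validation. The sample IDs and the random assignment scheme are assumptions for illustration; the splitting logic inside cv_survclust may differ.

```r
set.seed(1)
sample_ids <- paste0("s", 1:10)  # hypothetical sample names
fold <- 3

# Assign every sample to exactly one fold, in random order
fold_id <- sample(rep(seq_len(fold), length.out = length(sample_ids)))

# For each fold, the held-out samples form the test set
splits <- lapply(seq_len(fold), function(f) {
  list(train = sample_ids[fold_id != f],
       test  = sample_ids[fold_id == f])
})

# Fold sizes: each sample appears in exactly one test set
sapply(splits, function(s) length(s$test))
```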
Arshi Arora
library(survClust)
cv.fit <- cv_survclust(datasets = simdat, survdat = simsurvdat, k = 3, fold = 3)
For a survClust fit, return consolidated labels across rounds of cross-validation for a specific k. Note that cv.fit already has consolidated class labels across folds.
cv_voting(cv.fit, dat.dist, pick_k, cmd.k = NULL, pick_k.test = TRUE, minlabel.test = TRUE)
cv.fit | fit objects as returned from |
dat.dist | weighted distance matrices from |
pick_k | choice of k cluster to summarize over rounds of cross validation |
cmd.k | number of dimensions used by |
pick_k.test | logical, only selects cv.fit solutions where the resulting solution after consolidation contains |
minlabel.test | logical, only selects cv.fit solutions where classes have a minimum of 5 samples. Default TRUE. Avoids edge cases, but in some cases FALSE might be desirable |
final.labels: consolidated class labels over rounds of cross-validation
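One simple way to think about consolidating labels over rounds is a per-sample majority vote. The base R sketch below assumes the cluster labels are already aligned across rounds (label 1 means the same cluster in every round); the toy label matrix is made up for illustration, and how cv_voting matches and filters solutions is not described here.

```r
# Rows are rounds of cross-validation, columns are samples (toy values)
labels_by_round <- rbind(
  c(1, 1, 2, 2, 3),
  c(1, 1, 2, 3, 3),
  c(1, 2, 2, 2, 3)
)
colnames(labels_by_round) <- paste0("s", 1:5)

# Majority vote per sample; which.max takes the first label on a tie
final <- apply(labels_by_round, 2, function(v) {
  as.integer(names(which.max(table(v))))
})
table(final)
```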
Arshi Arora
library(survClust)
k4 <- cv_voting(uvm_survClust_cv.fit, getDist(uvm_dat, uvm_survdat), pick_k = 4)
table(k4)
Given multiple genomic data types (e.g., gene expression, copy number, DNA methylation, miRNA expression (continuous) and mutation (binary)) measured across samples, and allowing for missing values (NA) and missing samples, getDist calculates the survival-weighted distance metric among samples. Its output is used as input to combineDist().
getDist(datasets, survdat = NULL, cv = FALSE, train.snames = NULL, type = NULL)
datasets | A list object containing |
survdat | A matrix, containing two columns - 1st column |
cv | logical. If |
train.snames | required if |
type | |
getDist allows for continuous and binary data type(s), each as a matrix, passed as a list. If the list contains only a binary matrix, set type="mut". All data types are standardized internally.
Data types are not required to share samples. Non-overlapping samples within data types are replaced with NA, and the returned weighted matrix consists of the union of all samples.
If cv=FALSE, returns dist.dat, a list of weighted data matrix/matrices.
If cv=TRUE, returns dist.dat=list(train, all), where train is the weighted training data matrix and all is the whole matrix weighted according to the weights computed on the training dataset.
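The idea of a weighted distance on standardized data can be sketched in base R as follows. The helper name weighted_dist_sketch and the per-feature weight vector w are placeholders for illustration: how getDist derives its weights from the survival data is not shown here, and this is not the package's code.

```r
# Hypothetical helper: weighted Euclidean distance on standardized features,
# d(i, j) = sqrt(sum_f w_f * (z_if - z_jf)^2)
weighted_dist_sketch <- function(x, w) {
  z <- scale(x)                     # standardize each feature (column)
  zw <- sweep(z, 2, sqrt(w), `*`)   # scale columns by sqrt(weight)
  as.matrix(dist(zw))               # pairwise Euclidean distance
}

x <- cbind(f1 = c(1, 2, 3), f2 = c(5, 1, 9))
rownames(x) <- c("s1", "s2", "s3")

# Placeholder weights: only f1 contributes to the distance
d <- weighted_dist_sketch(x, w = c(1, 0))
d
```

With a weight of zero on f2, the distances depend only on the standardized f1 values, which shows how features with little weight are effectively ignored.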
Arshi Arora
library(survClust)
dd <- getDist(simdat, simsurvdat)
Compute fit statistics after cross-validation via cv_survclust.
getStats(cv.fit, kk = 8, cvr = 50)
cv.fit | output from |
kk | number of |
cvr | round of cross-validation on which |
getStats calculates the logrank statistic and the standardized pooled within sum of squares (SPWSS) across cross-validated labels. Visualize the result via plotStats.
A list of the following:
lr: logrank statistic
spwss: standardized pooled within sum of squares
bad.sol: number of solutions for each kk that have a cluster class with <5 samples
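For readers unfamiliar with the logrank statistic, the sketch below computes one for a set of cluster labels using survdiff from the survival package. The simulated times, the fully observed events, and the two-cluster labels are toy assumptions; whether getStats computes its statistic in exactly this way is not stated in this documentation.

```r
library(survival)

set.seed(7)
cluster <- rep(1:2, each = 30)                           # toy cluster labels
time <- rexp(60, rate = ifelse(cluster == 1, 0.2, 1))    # cluster 2 fails sooner
status <- rep(1, 60)                                     # all events observed

# Logrank test of survival differences between clusters
lr <- survdiff(Surv(time, status) ~ cluster)
lr$chisq  # larger values indicate stronger separation in survival
```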
Arshi Arora
library(survClust)
ss_stats <- getStats(uvm_survClust_cv.fit, kk = 7, cvr = 10)
Plot the output from getStats.
plotStats(out.getStats, labels = NULL, ...)
out.getStats | list output from |
labels | labels to print on the boxplot. Default is |
... | additional arguments as passed to |
Plots boxplots summarizing the output of cv_survclust, as calculated via getStats. Use this to pick the optimal k. The optimal k maximizes logrank and minimizes SPWSS, similar to the elbow method. Use consensus_summary to pick the best k and arrive at unique consolidated class labels.
A plot with three boxplots summarizing logrank, standardized pooled within sum of squares (SPWSS), and whether any class label has fewer than 5 samples.
Arshi Arora
library(survClust)
ss_stats <- getStats(uvm_survClust_cv.fit, kk = 7, cvr = 10)
plotStats(ss_stats, 2:7)
A list of length 1 containing a simulated matrix of 150 samples x 150 features with a 3-class structure: 15 features are distinct and associated with survival, another 15 features are distinct but not associated with survival, and the remaining 120 are noise. See the vignette for how this dataset was generated.
data(simdat)
An object of class "list"
data(simdat)
class(simdat)
dim(simdat[[1]])
simdat[[1]][1:5, 1:5]
A matrix of simulated time-to-event data accompanying simdat, with 150 samples x 2 columns and a 3-class structure with median survival of 4.5, 3.25 and 2 years respectively. In simdat, 15 features are distinct and associated with survival, another 15 features are distinct but not associated with survival, and the remaining 120 are noise. See the vignette for how this dataset was generated.
data(simsurvdat)
An object of class "matrix"
data(simsurvdat)
dim(simsurvdat)
head(simsurvdat)
The survClust function performs supervised clustering on a combineDist output for a particular k. It uses all n-1 dimensions for clustering.
survClust is an outcome weighted integrative clustering algorithm used to classify multi-omic samples on their available time to event information.
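Clustering samples from a distance matrix can be sketched in base R by embedding the distances with classical multidimensional scaling and then running k-means on the coordinates. The use of cmdscale here is an assumption suggested by the cmd.k argument name, and the simulated two-group data is a toy example; this is not the package's implementation and it omits the survival weighting.

```r
set.seed(42)
# Toy data: 20 samples x 30 features, two well-separated groups
x <- rbind(matrix(rnorm(300, mean = 0), 10, 30),
           matrix(rnorm(300, mean = 4), 10, 30))
d <- as.matrix(dist(x))
n <- nrow(d)

# Embed the distance matrix in n - 1 dimensions (classical MDS)
coords <- cmdscale(d, k = n - 1)

# Cluster the embedded samples into k = 2 groups
fit <- kmeans(coords, centers = 2, nstart = 20)
table(fit$cluster)
```

Passing a smaller k to cmdscale would cluster on fewer dimensions, which is presumably the role of cmd.k.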
survClust(combine.dist, survdat, k, cmd.k = NULL)
combine.dist | integrated weighted distance matrix from |
survdat | A nx2 matrix consisting of survival data with |
k | choice of |
cmd.k | number of dimensions used by |
Returns a list, fit, consisting of all clustering samples as in kmeans, and fit.lr, the computed logrank statistic between the k clusters.
Arshi Arora
Maintainer: Arshi Arora [email protected] (ORCID)
library(survClust)
dd <- getDist(datasets = simdat, survdat = simsurvdat)
cc <- combineDist(dd)
survclust_fit <- survClust(combine.dist = cc, survdat = simsurvdat, k = 3)
A list of length 2: TCGA UVM mutation data with 80 samples and 87 genes, and TCGA UVM copy number data with 80 samples and 749 segments. See the Appendix in the vignette for more details. The data was downloaded from https://gdc.cancer.gov/about-data/publications/pancanatlas
data(uvm_dat)
An object of class "list"
data(uvm_dat)
uvm_dat[[1]][1:5, 1:5]  # mutation data
uvm_dat[[2]][1:5, 1:5]  # copy number data
The output is a list object consisting of 6 sub-lists for k = 2:7, each with 10 cv_survclust outputs (one for each round of cross-validation), each consisting of cv.labels, cv.logrank and cv.spwss for 3 folds.
data(uvm_survClust_cv.fit)
An object of class "list"
data(uvm_survClust_cv.fit)
names(uvm_survClust_cv.fit[[1]][[1]])
A matrix with 2 columns: the first column is OS.time and the second column is OS events. The data was downloaded from https://gdc.cancer.gov/about-data/publications/pancanatlas
data(uvm_survdat)
An object of class "matrix"
data(uvm_survdat)
head(uvm_survdat)