--- title: "An introduction to **survClust package**" author: Arshi Arora
Department of Epidemiology and Biostatistics, MSKCC date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true fig_caption: yes vignette: > %\VignetteIndexEntry{An introduction to survClust package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Install ```{r, echo=TRUE, eval=FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("survClust") ``` # Overview ## The tldr version **survClust**$^1$ is an outcome weighted integrative supervised clustering algorithm, designed to classify patients according to their molecular as well as time-event or end point of interest. Until now, sub-typing in cancer biology has relied heavily upon clustering/mining of molecular data alone. We present classification of samples on molecular data supervised by time-event data like Overall Survival (OS), Progression Free Survival (PFS) etc. Below is the workflow of proposed survClust method: `getDist` - Compute a weighted distance matrix based on outcome across given `m` data types. Standardization across data types to facilitate the integration proces and accounting for non-overlapping samples is also accomplished in this step. `combineDist` - Integrate `m` data types by averaging over `m` weighted distance matrices. `survClust` and `cv_survclust` - Cluster integrated weighted distance matrices via `survClust`. Optimal `k` is estimated via cross-validation using `cv_survclust`. Cross-validated results are assessed over the following performance metrics - the **logrank statistic, standardized pooled within-cluster sum of squares (SPWSS)** and cluster solutions with **class size less than 5 samples**. Note: (1) The input datatypes needs to be carefully pre-processed. See the data pre-processing section. (2) `cv_survclust` is a wrapper function that cross-validates and outputs cluster assignments. If you run without cross validation and just the commands on its own (`getDist`, `combineDist` and `survClust`), you are **over-fitting!** In this document, we use the TCGA UVM data set and a simulation example to demonstrate how to use survClust to perform integrative supervised clustering. All TCGA data has been downloaded from the TCGA pan-cancer paper$^2$ # Data and Pre-processing The data and pre-processing steps are largely followed from iCluster$^3$ manual by - iCluster (Mo Q, Shen R (2022). iClusterPlus: Integrative clustering of multi-type genomic data. R package version 1.32.0.). The pre-processing steps that we used in the manuscript are described here. * Copy Number data was segmented using CBS$^4$ and reduced to non-redundant regions of alterations using the `CNregion` function in `iClusterPlus` with default epsilon of `0.001`, keeping in mind that the total numbers of features don’t exceed 10,000. See below and pre-processed data is provided with the package - `uvm_dat` See [Appendix](#appendix) for how the data attached with the package was processed. * For DNA methylation, mRNA expression and miRNA expression, if a certain feature had more than 20% missing data, that feature was removed and remaining were used for analysis. * For mRNA expression, we further removed genes having a mean expression lower than the threshold of mean expression of lower 10% quantile. * Similarly, methylation probes with mean beta values < 0.1 and > 0.9 were discarded. * For mutation data, we first filtered variants that were classified as `SILENT`. Secondly, genes harboring mutants in less than 1% of the samples were also removed. For our case study of UVM, we just removed singleton mutations. See below and pre-processed data is provided with the package - `uvm_dat` ```{r} library(survClust) library(survival) library(BiocParallel) #mutation data uvm_dat[[1]][1:5,1:5] #copy number data uvm_dat[[2]][1:5,1:5] #TCGA UVM clinical data head(uvm_survdat) ``` # Supervised integrative cluster analysis To run supervised integrative clustering analysis - we will be calling `cv_survclust`. Cross-validation takes time and we will be using the `BiocParallel` package and splitting cross-validation across `k` clusters. We will perform **3-fold** cross-validation over **10** rounds as follows: ```{r, echo=TRUE, eval=TRUE} cv_rounds = 10 #function to do cross validation uvm_all_cvrounds<-function(kk){ this.fold<-3 fit<-list() for (i in seq_len(cv_rounds)){ fit[[i]] <- cv_survclust(uvm_dat,uvm_survdat,kk,this.fold) print(paste0("finished ", i, " rounds for k= ", kk)) } return(fit) } ``` We will be using this code for both UVM data and then use this as a simulation example as discussed in the manuscript$^1$. *Note* that 10 rounds of cross-validation is not enough and we recommend at least 50 rounds of cross-validation to get some stability in results. ## UVM data ```{r, echo=TRUE, eval=FALSE} ptm <- Sys.time() cv.fit<-bplapply(2:7, uvm_all_cvrounds) ptm2 <- Sys.time() #> ptm #[1] "2022-09-05 20:54:21 EDT" #> ptm2 #[1] "2022-09-05 21:01:12 EDT" ``` Supervised integrative clustering was performed on TCGA UVM consisting of about 80 samples and 87 genes in the mutation data and CN data summarized over 749 segments. ```{r} lapply(uvm_dat, function(x) dim(x)) ``` The above process took about ~7 minutes on a *macOS Catalina* with *2.6 GHz Dual-Core Intel Core i5* running on *8Gb RAM*. If you wish to skip the run-time, output is available as `uvm_survClust_cv.fit`. Due to 10 rounds of cross validation, the results from your run might differ from what is provided. The output is a list object consisting of 6 sub-lists for $k = 2:7$, with 10 `cv_survclust` outputs (for each round of cross-validation), each consisting of `cv.labels`, `cv.logrank`, `cv.spwss` for 3 folds. ```{r} #for k=2, 1st round of cross validation names(uvm_survClust_cv.fit[[1]][[1]]) ``` Now, let's use `survClust::getStats` to summarize and `survClust::plotStats` to plot some of the supervised integrative clustering metrics. ```{r, fig.width=8, fig.height=8, fig.cap= "survClust analysis of TCGA UVM mutation and Copy Number data"} ss_stats <- getStats(uvm_survClust_cv.fit, kk=7, cvr=10) plotStats(ss_stats, 2:7) ``` ### Picking k cluster solution In the above plot, the topleft plot is summarizing logrank over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see how logrank peaks at `k=4`. the topright plot is summarizing `SPWSS` or `Standardized Pooled Within Sum of Squares`. Briefly, pooled within-cluster sum of squares were standardized by the total sum of squares similar to methodology used in the gap statistic$^5$ to select the appropriate number of clusters. Here `SPWSS` decreases monotonically as the number of clusters `k` increases. The optimal number of clusters is where `SPWSS` is minimized and creates an “elbow” or a point of inflection, where addition of more clusters does not improve cluster separation. For example, here the plot elbows at `k=4` Another property of `SPWSS` is that it can be used to compare among different datasets as it lies between 0 and 1 after standardization. This is useful for comparing survClust runs between individual data types and when we integrate them. The last plot, on the bottomleft, shows for each `k` how many `k` class labels have `<=5` samples in 10 rounds of cross validation. In our case here, for `k >5` the number of classes with `<=5` samples increases, so we can choose `k=4`. ```{r, message=FALSE, fig.cap= "survClust KM analysis for integrated TCGA UVM Mutation and Copy Number for k=4"} k4 <- cv_voting(uvm_survClust_cv.fit, getDist(uvm_dat, uvm_survdat), pick_k=4) table(k4) plot(survfit(Surv(uvm_survdat[,1], uvm_survdat[,2])~k4), mark.time=TRUE, col=1:4) ``` Let's see some of the differentiating features in mutation data. ```{r} mut_k4_test <- apply(uvm_dat[[1]],2,function(x) fisher.test(x,k4)$p.value) head(sort(p.adjust(mut_k4_test))) ``` All the figures as shown in the manuscript are plotted using `panelmap`. It is available on GitHub over here - https://github.com/arorarshi/panelmap We will use it to see the distribution of these mutations across the 4 clusters from a previous run. An example from previous runs is shown below. *This is only for illustration process and the cluster groups might differ between the latest run.* ```{r, echo = FALSE, fig.cap="TCGA UVM mutation features for k=4"} htmltools::img(src = knitr::image_uri("uvm_mut_example.png"), style = 'margin-left: auto;margin-right: auto') ``` And for Copy Number data as follows - ```{r, fig.width=5, fig.height=6, fig.cap= "TCGA UVM Copy Number k=4"} cn_imagedata <- uvm_dat[[2]] cn_imagedata[cn_imagedata < -1.5] <- -1.5 cn_imagedata[cn_imagedata > 1.5] <- 1.5 oo <- order(k4) cn_imagedata <- cn_imagedata[oo,] cn_imagedata <- cn_imagedata[,ncol(cn_imagedata):1] #image(cn_imagedata,col=gplots::bluered(50),axes=F) #image y labels - chr names cnames <- colnames(cn_imagedata) cnames <- unlist(lapply(strsplit(cnames, "\\."), function(x) x[1])) tt <- table(cnames) nn <- paste0("chr",1:22) chr.labels <- rep(NA, length(cnames)) index <- 1 chr.labels[1] <- "1" for(i in seq_len(length(nn)-1)) { index <- index + tt[nn[i]] chr.labels[index] <- gsub("chr","",nn[i+1]) } idx <- which(!(is.na(chr.labels))) image(cn_imagedata,col=gplots::bluered(50),axes=FALSE) axis(2, at = 1 - (idx/length(cnames)), labels = chr.labels[idx], las=1, cex.axis=0.8) abline(v = c(cumsum(prop.table(table(k4))))) abline(h=c(0,1)) ``` ## simulation example Simulation example is presented in the survClust manuscript $^1$.See Figure **(S1)** and **Supplementary Note**. Below we provide [code](#appendix) on how we generated the simulated dataset and how to run it via survClust. ```{r} #function to do cross validation sim_cvrounds<-function(kk){ this.fold<-3 fit<-list() for (i in seq_len(cv_rounds)){ fit[[i]] <- cv_survclust(simdat, simsurvdat,kk,this.fold) print(paste0("finished ", i, " rounds for k= ", kk)) } return(fit) } ptm <- Sys.time() sim_cv.fit<-bplapply(2:7, sim_cvrounds) ptm2 <- Sys.time() ptm ptm2 ``` The above process took about `r ptm2-ptm` on a *macOS Catalina* with *2.6 GHz Dual-Core Intel Core i5* running on *8Gb RAM*. The output is a list object consisting of 6 sub-lists for $k = 2:7$, with 10 `cv_survclust` output (for each round of cross-validation), each consisting of `cv.labels`, `cv.logrank`, `cv.spwss` for 3 folds. Now, let's use `survClust::getStats` to summarize and `survClust::plotStats` to plot some of the supervised integrative clustering metrics. ```{r, fig.width=8, fig.height=8, fig.cap= "survClust analysis of simulated dataset"} ss_stats <- getStats(sim_cv.fit, kk=7, cvr=10) plotStats(ss_stats, 2:7) ``` ### Picking k cluster solution In the above plot, the topleft plot is summarizing logrank over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see that logrank peaks at `k=3`. The topright plot is summarizing `SPWSS` or `Standardized Pooled Within Sum of Squares`. The plot elbows at `k=4` The last plot, on the bottomleft, shows for each `k` how many `k` class labels have `<=5` samples in 10 rounds of cross validation. In the simulation example, cluster solutions having `<5` samples increases at `k=7` ```{r, message=FALSE, fig.cap= "survClust k=3 class labels KM analysis for simulated dataset "} k3 <- cv_voting(sim_cv.fit, getDist(simdat, simsurvdat), pick_k=3) sim_class_labels <- c(rep(1, 50), rep(2,50), rep(3,50)) table(k3, sim_class_labels) plot(survfit(Surv(simsurvdat[,1], simsurvdat[,2]) ~ k3), mark.time=TRUE, col=1:3) ``` We see, that we are able to get the simulated class labels as survClust solution with good concordance. ## Bonus - UVM mutation data alone survClust allows for integration of one or more data types. The data can be either continuous (RNA, methylation, miRNA or protein expression or copy number segmentation values) or binary (mutation status, `wt=0, mut =1`). One can perform survClust on individual data alone. In this example, we will perform survClust on TCGA UVM mutation data alone. ```{r} #function to do cross validation cvrounds_mut <- function(kk){ this.fold<-3 fit<-list() for (i in seq_len(cv_rounds)){ fit[[i]] <- cv_survclust(uvm_mut_dat, uvm_survdat,kk,this.fold, type="mut") print(paste0("finished ", i, " rounds for k= ", kk)) } return(fit) } #let's create a list object with just the mutation data uvm_mut_dat <- list() uvm_mut_dat[[1]] <- uvm_dat[[1]] ptm <- Sys.time() uvm_mut_cv.fit<-bplapply(2:7, cvrounds_mut) ptm2 <- Sys.time() ``` ```{r, fig.width=8, fig.height=8, fig.cap= "survClust analysis of TCGA UVM mutation data alone"} ss_stats <- getStats(uvm_mut_cv.fit, kk=7, cvr=10) plotStats(ss_stats, 2:7) ``` ### Picking k cluster solution Firstly, since, these are only 10 rounds of cross validation, there is a lot of variability in each rounds of cross-validation, and the results here are for demonstration purpose only. In the above plot, the topleft plot is summarizing logrank over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see how logrank peaks at `k=4`. The topright plot is summarizing `SPWSS` or `Standardized Pooled Within Sum of Squares`. Here the plot elbows at `k=4` The last plot, on the bottomleft, shows for each `k` how many `k` class labels have `<=5` samples in 10 rounds of cross validation. We see that `k=3` has cluster solutions with `<=5` samples a lot more than `k=4`. ```{r, fig.width=4, fig.height=4, message=FALSE, fig.cap= "survClust k=3 class labels KM analysis for TCGA UVM mutation data alone"} k4 <- cv_voting(uvm_mut_cv.fit, getDist(uvm_mut_dat, uvm_survdat), pick_k=4) plot(survfit(Surv(uvm_survdat[,1], uvm_survdat[,2]) ~ k4), mark.time=TRUE, col=2:5) ``` Let' see these discriminant features. ```{r} mut_k4_test <- apply(uvm_mut_dat[[1]],2,function(x) fisher.test(x,k4)$p.value) head(sort(p.adjust(mut_k4_test))) ``` # Appendix {#appendix} ## Process TCGA dataset ```{r, eval=FALSE, echo=TRUE} # DO NOT RUN. Use provided dataset #Process mutation maf data #Download data from - https://gdc.cancer.gov/about-data/publications/pancanatlas maf <- data.table::fread("mc3.v0.2.8.PUBLIC.maf.gz", header = TRUE) maf_filter <- maf %>% filter(FILTER == "PASS", Variant_Classification != "Silent") # few lines of code in tidyR to convert maf to a binary file maf_binary <- maf_filter %>% select(Tumor_Sample_Barcode, Hugo_Symbol) %>% distinct() %>% pivot_wider(names_from = "Hugo_Symbol", values_from = 'Hugo_Symbol', values_fill = 0, values_fn = function(x) 1) maf_binary$tcga_short <- substr(maf_binary$Tumor_Sample_Barcode, 1, 12) # Process clinical file tcga_clin <- readxl::read_excel("TCGA-CDR-SupplementalTableS1.xlsx", sheet=1, col_names = TRUE) uvm_clin <- tcga_clin %>% filter(type == "UVM") uvm_maf_binary <- maf_binary %>% filter(tcga_short %in% uvm_clin$bcr_patient_barcode) %>% select(-Tumor_Sample_Barcode) rnames <- uvm_maf_binary$tcga_short uvm_maf <- uvm_maf_binary %>% select(-tcga_short) %>% apply(., 2, as.numeric) # Remove singletons gene_sum <- apply(uvm_maf,2,sum) idx <- which(gene_sum > 1) uvm_maf <- uvm_maf[,idx] rownames(uvm_maf) <- rnames uvm_survdat <- uvm_clin %>% select(OS.time, OS) %>% apply(., 2, as.numeric) rownames(uvm_survdat) <- uvm_clin$bcr_patient_barcode # process CN library(cluster)#pam function for derive medoid library(GenomicRanges) #interval overlap to remove CNV library(iClusterPlus) seg <- read.delim(file="broad.mit.edu_PANCAN_Genome_Wide_SNP_6_whitelisted.seg", header=TRUE,sep="\t", as.is=TRUE) pp <- substr(seg$Sample,13,16) seg.idx <- c(grep("-01A",pp),grep("-01B",pp),grep("-03A",pp)) #only take tumors seg.idx <- c(grep("-01A",pp),grep("-01B",pp)) seg <- seg[seg.idx,] seg$Sample <- substr(seg[,1],1,12) uvm_seg <- seg[seg$Sample %in% uvm_clin$bcr_patient_barcode,] colnames(uvm_seg) <- c("Sample", "Chromosome", "Start", "End", "Num_Probes", "Segment_Mean") # pass epsilon as 0.001 default or user reducedMseg <- CNregions(ss_seg,epsilon=0.001,adaptive=FALSE,rmCNV=FALSE, cnv=NULL, frac.overlap=0.5, rmSmallseg=TRUE, nProbes=75) uvm_dat <- list(uvm_mut = uvm_maf, uvm_cn = uvm_seg) ``` ## Create simulation dataset ```{r, echo=TRUE, eval=FALSE} set.seed(112) n1 <- 50 #class1 n2 <- 50 #class2 n3 <- 50 #class3 n <- n1+n2+n3 p <- 15 #survival related features (10%) q <- 120 #noise #class1 ~ N(1.5,1), class2 ~ N(0,1), class3 ~ N(-1.5,1) sample_names <- paste0("S",1:n) feature_names <- paste0("features", 1:n) #final matrix x_big <- NULL ################ # sample 15 informant features #simulating class1 x1a <- matrix(rnorm(n1*p, 1.5, 1), ncol=p) #simulating class2 x2a <- matrix(rnorm(n2*p), ncol=p) #simulating class3 x3a <- matrix(rnorm(n3*p, -1.5,1), ncol=p) #this concluded that part shaded in red of the matrix - #corresponding to related to survival and molecularly distinct xa <- rbind(x1a,x2a,x3a) ################ # sample 15 other informant features, but scramble them. permute.idx<-sample(1:length(sample_names),length(sample_names)) x1b <- matrix(rnorm(n1*p, 1.5, 1), ncol=p) x2b <- matrix(rnorm(n2*p), ncol=p) x3b <- matrix(rnorm(n3*p, -1.5,1), ncol=p) #this concluded that part shaded in blue of the matrix - #containing the molecular distinct features but not related to survival xb <- rbind(x1b,x2b,x3b) #this concludes the area shaded area in grey which corresponds to noise xc <- matrix(rnorm(n*q), ncol=q) x_big <- cbind(xa,xb[permute.idx,], xc) rownames(x_big) <- sample_names colnames(x_big) <- feature_names simdat <- list() simdat[[1]] <- x_big #the three classes will have a median survival of 4.5, 3.25 and 2 yrs respectively set.seed(112) med_surv_class1 <- log(2)/4.5 med_surv_class2 <- log(2)/3.25 med_surv_class3 <- log(2)/2 surv_dist_class1 <- rexp(n1,rate=med_surv_class1) censor_events_class1 <- runif(n1,0,10) surv_dist_class2 <- rexp(n2,rate=med_surv_class2) censor_events_class2 <- runif(n2,0,10) surv_dist_class3 <- rexp(n3,rate=med_surv_class3) censor_events_class3 <- runif(n3,0,10) surv_time_class1 <- pmin(surv_dist_class1,censor_events_class1) surv_time_class2 <- pmin(surv_dist_class2,censor_events_class2) surv_time_class3 <- pmin(surv_dist_class3,censor_events_class3) event <- c((surv_time_class1==surv_dist_class1), (surv_time_class2==surv_dist_class2), (surv_time_class3==surv_dist_class3)) time <- c(surv_time_class1, surv_time_class2, surv_time_class3) survdat <- cbind(time, event) simsurvdat <- cbind(time, event) rownames(simsurvdat) <- sample_names ``` # Refrences 1. Arora, Arshi, et al. "Pan-cancer identification of clinically relevant genomic subtypes using outcome-weighted integrative clustering." Genome medicine 12.1 (2020): 1-13. 2. Hoadley, Katherine A., et al. "Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer." Cell 173.2 (2018): 291-304. 3. Shen, Ronglai, Adam B. Olshen, and Marc Ladanyi. "Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis." Bioinformatics 25.22 (2009): 2906-2912. 4. Seshan, Venkatraman E., et al. "Package ‘DNAcopy’." Package “DNAcopy.” (2013). 5. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc B. 2001;63:411–23. https://doi.org/10.1111/1467-9868.00293. # SessionInfo ```{r} sessionInfo() ```