As an example of the utility of the PDATK package, we provide code replicating the analysis published in Sandhu et al. (2019). While the code presented here is run on a subset of the original data to ensure that the PDATK installation size is not too large, the full data from that study is available via the MetaGxPancreas Bioconductor data package.
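Assuming the subset used throughout this vignette ships with PDATK as the sampleCohortList object referenced in the code below, it can be loaded and printed as follows (a sketch, not part of the original analysis code):

library(PDATK)
data(sampleCohortList)
sampleCohortList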
## CohortList of length 11
## names(11): ICGCMICRO ICGCSEQ PCSI TCGA KIRBY UNC CHEN COLLISON ZHANG OUH WINTER
To get started using PDATK, place each of your patient cohorts into SurvivalExperiment objects and then assemble those into a master CohortList, which holds the training and validation data for use with the various SurvivalModels in this package.
commonGenes <- findCommonGenes(sampleCohortList)
# Subsets all list items, with subset for specifying rows and select for
# specifying columns
cohortList <- subset(sampleCohortList, subset=commonGenes)
ICGCcohortList <- cohortList[grepl('ICGC', names(cohortList), ignore.case=TRUE)]
validationCohortList <- cohortList[!grepl('icgc', names(cohortList),
ignore.case=TRUE)]
Since we are interested in predicting survival, it is necessary to remove patients with insufficient data to be useful. In general, we want to remove patients who did not have an event in our period of interest. As such, we remove samples from patients who dropped out of a study but had not passed away before the end of the first year.
validationCohortList <- dropNotCensored(validationCohortList)
ICGCcohortList <- dropNotCensored(ICGCcohortList)
We have now split our patient cohorts into training and validation data. For this analysis we will be training using the ICGC cohorts, which include one cohort with RNA micro-array data and another with RNA sequencing data. When using multiple cohorts to train a model, it is required that those cohorts share samples. As a result, we will take as training data all patients shared between the two cohorts and leave the remainder of patients as part of our validation data.
# find common samples between our training cohorts in a cohort list
commonSamples <- findCommonSamples(ICGCcohortList)
# split into shared samples for training, the rest for testing
ICGCtrainCohorts <- subset(ICGCcohortList, select=commonSamples)
ICGCtestCohorts <- subset(ICGCcohortList, select=commonSamples, invert=TRUE)
# merge our training cohort test data into the rest of the validation data
validationCohortList <- c(ICGCtestCohorts, validationCohortList)
# drop ICGCSEQ from the validation data, because it only has 7 patients
validationCohortList <-
validationCohortList[names(validationCohortList) != 'ICGCSEQ']
PCOSP Model Object

We now have patient molecular data, annotated with the number of days survived since treatment and the survival status, and are ready to apply a SurvivalModel to this data. In this example, we are applying a Pancreatic Cancer Overall Survival Predictor (PCOSP) model, as described in Sandhu et al. (2019). This model uses the switchBox package to create an ensemble of binary classifiers, whose votes are then tallied into a PCOSP score. A PCOSP score is simply the proportion of models predicting good survival out of the total number of models in the ensemble.
set.seed(1987)
PCOSPmodel <- PCOSP(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
# view the model parameters; these make your model run reproducible
metadata(PCOSPmodel)$modelParams
## $randomSeed
## [1] 1987
##
## $RNGkind
## [1] "Mersenne-Twister" "Inversion" "Rejection"
##
## $minDaysSurvived
## [1] 365
To simplify working with different SurvivalModel sub-classes, we have implemented a standard work-flow that is valid for all SurvivalModels. This involves first training the model, then using it to predict risk or risk-classes for a set of validation cohorts, and finally assessing performance on the validation data.
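As a quick orientation, the entire work-flow looks roughly like the sketch below; the object names are illustrative, the valData argument name is an assumption, and each step is shown with real data in the remainder of this section.

# sketch of the standard SurvivalModel work-flow; object names are illustrative
model <- PCOSP(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
trainedModel <- trainModel(model, numModels=15, minAccuracy=0.6)
predictionCohorts <- predictClasses(validationCohortList, model=trainedModel)
validatedModel <- validateModel(trainedModel, valData=predictionCohorts)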
To train a model, the trainModel method is used. This function abstracts away the implementation of model training, allowing end-users to focus on applying the SurvivalModel to make predictions without needing to understand the model internals. We hope this will make the package useful for those unfamiliar with, or uninterested in, the details of survival prediction methods.
For training a PCOSP model there are two parameters. First, numModels is the number of models to train for use in the ensemble classifier that predicts PCOSP scores. To keep computation brief, we are only training 15 models here. However, it is recommended to use a minimum of 1000 for real-world applications. The second parameter is minAccuracy, which is the minimum accuracy a trained model must achieve to be included in the final model ensemble. Paradoxically, setting this too high can actually decrease the overall performance of the PCOSP model. We recommend 0.6 as a sweet spot between random chance and over-fitting, but encourage experimentation to see what works best with your data.
trainedPCOSPmodel <- trainModel(PCOSPmodel, numModels=15, minAccuracy=0.6)
metadata(trainedPCOSPmodel)$modelParams
## $randomSeed
## [1] 1987
##
## $RNGkind
## [1] "Mersenne-Twister" "Inversion" "Rejection"
##
## $minDaysSurvived
## [1] 365
##
## $numModels
## [1] 15
##
## $minAccuracy
## [1] 0.6
We can see that after training, the additional model parameters are added to the modelParams item in the model metadata. The goal is to ensure that your model training, prediction and validation are fully reproducible by capturing the parameters relevant to a specific model.
After training, a model can now be used with new data to make risk predictions and classify samples into 'good' or 'bad' survival groups. To do this, the standard predictClasses method is used. Similar to trainModel, we have abstracted away the implementation details to provide users with a simple, consistent interface for using SurvivalModel sub-classes to make patient risk predictions.
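The prediction chunk itself is not shown above; based on the PCOSPpredValCohorts object used later in this vignette, it would look roughly like the following sketch:

PCOSPpredValCohorts <- predictClasses(validationCohortList, model=trainedPCOSPmodel)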
The returned CohortList object now indicates that each of the cohorts has predictions. This information is available in the elementMetadata slot of the cohort list and can be accessed with the mcols function from S4Vectors.
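The DataFrame below can be retrieved with a call along these lines (again assuming the PCOSPpredValCohorts object name):

mcols(PCOSPpredValCohorts)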
## DataFrame with 10 rows and 2 columns
## mDataType hasPredictions
## <character> <logical>
## ICGCMICRO rna_micro TRUE
## PCSI rna_seq TRUE
## TCGA rna_seq TRUE
## KIRBY rna_seq TRUE
## UNC rna_micro TRUE
## CHEN rna_micro TRUE
## COLLISON rna_micro TRUE
## ZHANG rna_micro TRUE
## OUH rna_micro TRUE
## WINTER rna_micro TRUE
Predicting risk with a specific model adds a corresponding metadata column to the object's colData. In the case of a PCOSP model, the new column is called PCOSP_prob_good and represents the proportion of models in the ensemble which predicted good survival for a given sample.
|  | unique_patient_ID | grade | sex | age | T | N | M | sample_type | sample_name | survival_time | event_occurred | prognosis | PCOSP_prob_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ICGC_0001 | ICGC_0001 | G1 | M | 59 | T2 | N0 | M0 | tumour | ICGC_0001 | 522 | 1 | good | 0.6666667 |
| ICGC_0002 | ICGC_0002 | G2 | F | 77 | T2 | N1a | MX | tumour | ICGC_0002 | 2848 | 1 | good | 0.4666667 |
| ICGC_0003 | ICGC_0003 | G2 | F | 58 | T3 | N1a | MX | tumour | ICGC_0003 | 1778 | 0 | good | 0.9333333 |
| ICGC_0004 | ICGC_0004 | G2 | F | 78 | T2 | N0 | MX | tumour | ICGC_0004 | 1874 | 1 | good | 1.0000000 |
| ICGC_0005 | ICGC_0005 | G3 | M | 63 | T3 | N1b | MX | tumour | ICGC_0005 | 641 | 1 | good | 0.5333333 |
| ICGC_0008 | ICGC_0008 | G2 | M | 72 | T3 | N1b | MX | tumour | ICGC_0008 | 1209 | 1 | good | 0.8000000 |
Additionally, binary predictions of good or bad survival can be found in the PCOSPpredictions item of each SurvivalExperiment's metadata. This contains the raw predictions from the model for each classifier in the ensemble, ordered by classifier accuracy. This data is not important for end users, but is used internally when calculating validation statistics for the model. For users wishing to classify samples rather than estimate risks, we recommend a PCOSP cut-off of >0.5 for good survival prognosis.
|  | ICGC_0001 | ICGC_0002 | ICGC_0003 | ICGC_0004 | ICGC_0005 |
|---|---|---|---|---|---|
| rank1 | bad | bad | good | good | good |
| rank2 | good | good | good | good | bad |
| rank3 | good | good | good | good | good |
| rank4 | bad | good | good | good | good |
| rank5 | good | bad | good | good | bad |
The final step in the standard SurvivalModel work-flow is to compute model performance statistics for the model on the validation data. This can be accomplished using the validateModel method, which will add statistics to the validationStats slot of a SurvivalModel object and the data to the validationData slot.
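The validation call is not shown above; given the validatedPCOSPmodel object used below, it would look roughly like this sketch (the valData argument name is an assumption):

validatedPCOSPmodel <- validateModel(trainedPCOSPmodel, valData=PCOSPpredValCohorts)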
| statistic | estimate | se | lower | upper | p_value | n | cohort | isSummary | mDataType |
|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.7070000 | 0.0470000 | 0.6150000 | 0.7990000 | 0.0000142 | 158 | ICGCMICRO | FALSE | rna_micro |
| log_D_index | 0.5956475 | 0.1592665 | 0.4882238 | 0.7030711 | 0.0001668 | 158 | ICGCMICRO | FALSE | rna_micro |
| concordance_index | 0.6434057 | 0.0321350 | 0.5804223 | 0.7063892 | 0.0000081 | 158 | ICGCMICRO | FALSE | rna_micro |
| AUC | 0.6400000 | 0.0570000 | 0.5280000 | 0.7530000 | 0.0088634 | 107 | PCSI | FALSE | rna_seq |
| log_D_index | 0.4697159 | 0.1838308 | 0.3457239 | 0.5937078 | 0.0105421 | 107 | PCSI | FALSE | rna_seq |
| concordance_index | 0.6212825 | 0.0387126 | 0.5454073 | 0.6971577 | 0.0017309 | 107 | PCSI | FALSE | rna_seq |
Examining the data.table from the validationStats slot, we can see that three model performance statistics have been calculated for all of the validation cohorts. Additionally, aggregate statistics have been calculated by molecular data type and for all cohorts combined. This table can be used to generate model performance plots. We have included several functions for examining model performance.
PCOSPconcIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='concordance_index')
PCOSPconcIndexForestPlot
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the PDATK package.
## Please report the issue at <https://github.com/bhklab/PDATK/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
To compare the performance of a PCOSP model to random chance, we have included two model classes which permute either patient prognosis labels or the feature names. These models can be used to evaluate if a PCOSP model performs better than random chance.
The RLSModel class is a SurvivalModel using the same risk prediction algorithm as a PCOSP model, but randomizing the patient prognosis labels in each of the individual KTSP classifiers used in the classification ensemble. Given this random prognosis label shuffling, we expect the classification results for this model to be no better than random chance.
The work-flow for this model follows the standard SurvivalModel sub-class work-flow, as sketched below.
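The chunk that produced the warnings below is not shown above; under the assumption that the RLSModel constructor mirrors the PCOSP constructor, it would look roughly like this:

# sketch: RLSModel constructor arguments assumed to mirror PCOSP
RLSmodel <- RLSModel(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
trainedRLSmodel <- trainModel(RLSmodel, numModels=15)
RLSpredValCohorts <- predictClasses(validationCohortList, model=trainedRLSmodel)
validatedRLSmodel <- validateModel(trainedRLSmodel, valData=RLSpredValCohorts)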
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
To compare the performance of a permuted model to that of a PCOSP model, we have included the densityPlotModelComparison method, which plots a density of model AUCs per molecular data type and overall. The dotted line represents the mean AUC of the PCOSP model. From this plot we can see that the PCOSP model performs significantly better than the random label shuffling model.
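A sketch of the comparison plot call, using the hypothetical validatedRLSmodel object from the sketch above (argument order assumed: permuted model first, PCOSP model second):

RLSvsPCOSPdensityPlot <- densityPlotModelComparison(validatedRLSmodel, validatedPCOSPmodel)
RLSvsPCOSPdensityPlot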
A RandomGeneAssignmentModel (aliased as RGAModel for user convenience) is similar to an RLSModel, except that the gene labels are randomized for each KTSP classifier in the ensemble. In this case, we also expect this model to perform no better than random chance.
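As with the RLSModel, the chunk producing the warnings below is omitted; a sketch under the same assumptions about the constructor and the valData argument:

RGAmodel <- RGAModel(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
trainedRGAmodel <- trainModel(RGAmodel, numModels=15)
RGApredValCohorts <- predictClasses(validationCohortList, model=trainedRGAmodel)
validatedRGAmodel <- validateModel(trainedRGAmodel, valData=RGApredValCohorts)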
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
## Warning in FUN(X[[i]], ...):
## [PDATK::NULL] Dropped samples with NA survival data!
The density plot for the RGAModel also shows that the PCOSP model does significantly better than random chance.
Confident that the PCOSP model is performing better than random chance at predicting patient survival, we can now begin to look into the features that are most relevant to these predictions. Once we have extracted the features, we will use the runGSEA method to evaluate which pathways these genes are representative of.
We can get the features which are most predictive from a SurvivalModel object using the getTopFeatures method. By default this retrieves the gene names from the top 10% of models in the SurvivalModel. Users can customize the number of top models to extract genes from using the numModels parameter.
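The call producing the gene list below is not shown; it would be along these lines (a sketch, using the default numModels):

topFeatures <- getTopFeatures(validatedPCOSPmodel)
topFeatures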
## [1] "ATOX1" "NAMPT" "KRT6A" "FCGRT" "SERTAD4" "ZNF792"
## [7] "KRT6C" "BIRC3" "PKIB" "SSBP3" "PPFIBP2" "FBLIM1"
## [13] "ZSCAN29" "KRT15" "UBE2C" "C16orf74" "BLNK" "MARK3"
To perform pathway analysis, which is done internally using the piano::runGSAhyper function, we first need pathway data to query against. We recommend using the msigdbr R package to fetch pathways of interest.
Because of the small number of genes included in the sample patient cohorts, this section will not find any enriched pathways and therefore this code is not run.
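For reference, and still not run here, such an analysis might look roughly like the sketch below; the msigdbr arguments and the exact runGSEA signature are assumptions:

# not run: the sample cohorts contain too few genes for enrichment
geneSets <- msigdbr::msigdbr(species='Homo sapiens', category='C2')
gseaResults <- runGSEA(validatedPCOSPmodel, geneSets)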
Equipped with a risk prediction model performing better than random chance, the next pertinent question is: does it outperform simpler models? To answer this question we have included the ClinicalModel class, which leverages the glm function from the stats package to fit a generalized linear model based on a set of clinical variables in the patient metadata. To do this, we need to specify a model formula based on the column names available in the colData slot of our training data SurvivalExperiment objects. All formula parameters must be valid column names in the colData of the training cohorts, and the same patient metadata must be present in any validation cohorts used to make risk predictions.
We can see there are a number of clinical variables related to patient survival in the patient metadata. For our model, we will be using patient sex, age and tumour grade, along with the TNM staging of the tumour.
|  | unique_patient_ID | grade | sex | age | T | N | M | sample_type | sample_name | survival_time | event_occurred | prognosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ICGC_0006 | ICGC_0006 | G2 | F | 49 | T1 | N1b | MX | tumour | ICGC_0006 | 1715 | 0 | good |
| ICGC_0007 | ICGC_0007 | G2 | F | 74 | T3 | N1b | MX | tumour | ICGC_0007 | 56 | 1 | bad |
| ICGC_0009 | ICGC_0009 | G4 | M | 34 | T3 | N0 | MX | tumour | ICGC_0009 | 399 | 1 | good |
| ICGC_0020 | ICGC_0020 | G2 | M | 57 | T3 | N1b | MX | tumour | ICGC_0020 | 1259 | 1 | good |
| ICGC_0021 | ICGC_0021 | G3 | M | 60 | T3 | N1b | MX | tumour | ICGC_0021 | 715 | 1 | good |
| ICGC_0025 | ICGC_0025 | G3 | M | 69 | T3 | N1b | MX | tumour | ICGC_0025 | 348 | 1 | bad |
Clinical annotations are only available for a subset of our patient cohorts, so we subset our cohort list to ensure the model works as expected. Samples with levels of a model variable which were not present in the training data will be dropped automatically when making predictions. To prevent this behaviour, it is necessary to add any missing levels to the colData of the training data before training the model. The easiest way to achieve this is by converting the columns in question into factors and ensuring all levels in the training and validation data are set for those columns.
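The construction and training of the clinical model is not shown above; based on the variables named earlier and the trainedClinicalModel object used below, it would look roughly like this sketch (the formula= argument form is an assumption):

clinicalModel <- ClinicalModel(ICGCtrainCohorts,
    formula='prognosis ~ sex + age + T + N + M + grade', randomSeed=1987)
trainedClinicalModel <- trainModel(clinicalModel)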
hasModelParamsCohortList <-
PCOSPpredValCohorts[c('ICGCMICRO', 'TCGA', 'PCSI', 'OUH')]
clinicalPredCohortList <- predictClasses(hasModelParamsCohortList,
model=trainedClinicalModel)
## Warning in .local(object, model, ...):
## [PDATK::NULL] Rows 120, 155 have levels that are not in the model, skipping these rows...
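Similarly, the validation step producing the validatedClinicalModel object used below is omitted; a sketch, with the valData argument name assumed as before:

validatedClinicalModel <- validateModel(trainedClinicalModel,
    valData=clinicalPredCohortList)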
To get a general idea of how the ClinicalModel performed relative to the PCOSP model, the barPlotModelComparison function can be used. For this example, we will compare the AUC of the models in each of the validation cohorts. Other statistics in the validationStats slot can be used by changing the stat argument to the plotting function.
clinicalVsPCOSPbarPlot <- barPlotModelComparison(validatedClinicalModel,
validatedPCOSPmodel, stat='AUC')
clinicalVsPCOSPbarPlot
To reduce the number of plotting functions and simplify meta-analysis of many models of different types, we have included the ModelComparison object. This object aggregates the validation statistics from two models. Plotting methods can then be dispatched on this class, greatly simplifying the process of comparing several SurvivalModel sub-classes. You can also compare a model to an existing model comparison, and in this way build up a collection of model performance statistics which can be visualized in a forest plot to enable complex comparisons between many models.
Below, we use this feature to compare the PCOSP and ClinicalModel models based on the D index and concordance index, as calculated using the survcomp R package.
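The comparison code itself is not shown; a sketch, using the compareModels and forestPlot calls demonstrated later in this vignette:

PCOSPvsClinical <- compareModels(validatedPCOSPmodel, validatedClinicalModel,
    modelNames=c('PCOSP', 'Clinical'))
PCOSPvsClinicalDindexPlot <- forestPlot(PCOSPvsClinical, stat='log_D_index',
    colourBy='model', groupBy='cohort')
PCOSPvsClinicalConcIndexPlot <- forestPlot(PCOSPvsClinical, stat='concordance_index',
    colourBy='model', groupBy='cohort')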
To get an idea of how our PCOSP model stacks up against other classifiers from the literature, we have included the GeneFuModel class. This SurvivalModel performs risk prediction using a set of genes and corresponding coefficients via the genefu::sig.score function. We have included data for three published classifiers, from Chen et al. (2015), Birnbaum et al. (2017) and Haider et al. (2014).
Since there is no training data for the published classifiers, we create empty GeneFuModel objects, then assign the model predictors to the models slot of each respective object.
chenGeneFuModel <- GeneFuModel(randomSeed=1987)
birnbaumGeneFuModel <- GeneFuModel(randomSeed=1987)
haiderGeneFuModel <- GeneFuModel(randomSeed=1987)
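The assignment of the published gene signatures is not shown above. A minimal sketch follows, assuming a models<- setter on GeneFuModel objects and hypothetical chenGenes/birnbaumGenes objects holding named per-gene coefficients from the respective publications:

# hypothetical objects: named numeric vectors of gene coefficients per signature
models(chenGeneFuModel) <- SimpleList(chen=chenGenes)
models(birnbaumGeneFuModel) <- SimpleList(birnbaum=birnbaumGenes)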
For the Haider model, we were unable to obtain the genes and coefficients; however, the authors provided the risk scores. As a result, we need to do a bit of work to get the Haider GeneFuModel to work with the package functions.
chenClassPredictions <- predictClasses(PCOSPpredValCohorts[names(haiderSigScores)],
model=chenGeneFuModel)
birnClassPredictions <- predictClasses(PCOSPpredValCohorts[names(haiderSigScores)],
model=birnbaumGeneFuModel)
haiderClassPredictions <- PCOSPpredValCohorts[names(haiderSigScores)]
# Manually assign the scores to the prediction cohorts
for (i in seq_along(haiderClassPredictions)) {
colMData <- colData(haiderClassPredictions[[i]])
colMData$genefu_score <- NA_real_
colMData[rownames(colMData) %in% names(haiderSigScores[[i]]), ]$genefu_score <-
haiderSigScores[[i]][names(haiderSigScores[[i]]) %in% rownames(colMData)]
colData(haiderClassPredictions[[i]]) <- colMData
}
# Setup the correct model metadata
mcols(haiderClassPredictions)$hasPredictions <- TRUE
metadata(haiderClassPredictions)$predictionModel <- haiderGeneFuModel
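The validated model objects compared below would then be produced with calls roughly like the following sketch (valData argument name assumed, as before):

validatedChenModel <- validateModel(chenGeneFuModel, valData=chenClassPredictions)
validatedBirnbaumModel <- validateModel(birnbaumGeneFuModel, valData=birnClassPredictions)
validatedHaiderModel <- validateModel(haiderGeneFuModel, valData=haiderClassPredictions)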
genefuModelComparisons <- compareModels(validatedChenModel,
validatedBirnbaumModel, modelNames=c('Chen', 'Birnbaum'))
genefuModelComparisons <- compareModels(genefuModelComparisons,
validatedHaiderModel, model2Name='Haider')
allModelComparisons <- compareModels(genefuModelComparisons, validatedPCOSPmodel,
model2Name='PCOSP')
# We are only interested in comparing the summaries, so we subset our model comparison
allModelComparisons <- subset(allModelComparisons, isSummary == TRUE)
allDindexComparisonForestPlot <- forestPlot(allModelComparisons,
stat='log_D_index', colourBy='model', groupBy='mDataType')
allDindexComparisonForestPlot
allConcIndexComparisonForestPlot <- forestPlot(allModelComparisons,
stat='concordance_index', colourBy='model', groupBy='mDataType')
allConcIndexComparisonForestPlot
From the two forest plots, we can see that the PCOSP model matched or outperformed all of the public classifiers, even when trained with only 15 models. It is likely that using 1000 models, we would see even better separation of the PCOSP model. Indeed, this was the result in Sandhu et al. (2019).
data(cohortSubtypeDFs)
# Add the subtypes to the prediction cohort
subtypedPCOSPValCohorts <- assignSubtypes(PCOSPpredValCohorts, cohortSubtypeDFs)
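The subtype-stratified validation step that produced the messages below, and the subtypeValidatedPCOSPmodel object plotted afterwards, is not shown; a sketch, with the valData argument name assumed as before:

subtypeValidatedPCOSPmodel <- validateModel(trainedPCOSPmodel,
    valData=subtypedPCOSPValCohorts)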
## <simpleError in doTryCatch(return(expr), name, parentenv, handler): report ROC failed>
## <simpleError in wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great"): not enough (non-missing) 'x' observations>
## <simpleError in if (AUC.low > 0.5 & P >= 0.05) { wilc.t = wilcox.test(predictor[table.gold == 1], predictor[table.gold == 0], alternative = "less", exact = exact) P = wilc.t$p.value}: missing value where TRUE/FALSE needed>
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_pointrange()`).
forestPlot(subtypeValidatedPCOSPmodel, stat='concordance_index', groupBy='cohort',
colourBy='subtype')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_pointrange()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
Sandhu V, Labori KJ, Borgida A, et al. Meta-Analysis of 1,200 Transcriptomic Profiles Identifies a Prognostic Model for Pancreatic Ductal Adenocarcinoma. JCO Clin Cancer Inform. 2019;3:1-16. doi:10.1200/CCI.18.00102
Chen DT, Davis-Yadley AH, Huang PY, et al. Prognostic Fifteen-Gene Signature for Early Stage Pancreatic Ductal Adenocarcinoma. PLoS One. 2015;10(8):e0133562. Published 2015 Aug 6. doi:10.1371/journal.pone.0133562
Birnbaum DJ, Finetti P, Lopresti A, et al. A 25-gene classifier predicts overall survival in resectable pancreatic cancer. BMC Med. 2017;15(1):170. Published 2017 Sep 20. doi:10.1186/s12916-017-0936-z
Haider S, Wang J, Nagano A, et al. A multi-gene signature predicts outcome in patients with pancreatic ductal adenocarcinoma. Genome Med. 2014;6(12):105. Published 2014 Dec 3. doi:10.1186/s13073-014-0105-3