1 Pancreatic Cancer Overall Survival Predictor

As an example of the utility of the PDATK package, we provide code replicating the analysis published in Sandhu et al. (2019). While the code presented here is run on a subset of the original data to ensure that the PDATK installation size is not too large, the full data from that study can is available via the MetaGxPancreas Bioconductor data package.

## CohortList of length 11

1.1 Split Training and Validation Data

To get started using PDATK, place each of your patient cohorts into SurvivalExperiment objects and then assemble those into a master CohortList, which holds the training and validation data for use with the various SurvivalModels in this package.

commonGenes <- findCommonGenes(sampleCohortList)
# Subsets all list items, with subset for specifying rows and select for
# specifying columns
cohortList <- subset(sampleCohortList, subset=commonGenes)

ICGCcohortList <- cohortList[grepl('ICGC', names(cohortList),]
validationCohortList <- cohortList[!grepl('icgc', names(cohortList),]

Since we are interested in predicting survival, it is necessary to remove patients with insufficient data to be useful. In general, we want to remove patients who did not have an event in our period of interest. As such we remove samples who dropped out of a study, but did not pass away before the first year.

validationCohortList <- dropNotCensored(validationCohortList)
ICGCcohortList <- dropNotCensored(ICGCcohortList)

We have now split our patient cohorts into training and validation data. For this analysis we will be training using the ICGC cohorts, which includes one cohort with RNA micro-array data and another with RNA sequencing data. When using multiple cohorts to train a model, it is required that those cohorts share samples. As a result we will take as training data all patients shared between the two cohorts and leave the remainder of patients as part of our validationData.

# find common samples between our training cohorts in a cohort list
commonSamples <- findCommonSamples(ICGCcohortList)

# split into shared samples for training, the rest for testing
ICGCtrainCohorts <- subset(ICGCcohortList, select=commonSamples)
ICGCtestCohorts <- subset(ICGCcohortList, select=commonSamples, invert=TRUE)

# merge our training cohort test data into the rest of the validation data
validationCohortList <- c(ICGCtestCohorts, validationCohortList)

# drop ICGCSEQ from the validation data, because it only has 7 patients
validationCohortList <- 
    validationCohortList[names(validationCohortList) != 'ICGCSEQ']

1.2 Setup A PCOSP Model Object

We now have patient molecular data, annotated with the number of days survived since treatment and the survival status and are ready to apply a SurvivalModel to this data. In this example, we are applying a Pancreatic Cancer Overall Survival Model, as described in . This class uses the switchBox package to create an ensemble of binary classifiers, whos votes are then tallied into a PCOSP score. A PCOSP score is simply the proportion of models predicting good survival out of the total number of models in the ensemble.

PCOSPmodel <- PCOSP(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)

# view the model parameters; these make your model run reproducible
## $randomSeed
## [1] 1987
## $RNGkind
## [1] "Mersenne-Twister" "Inversion"        "Rejection"       
## $minDaysSurvived
## [1] 365

1.3 Training a PCOSP Model

To simplify working with different SurvivalModel sub-classes, we have implemented a standard work-flow that is valid for all SurvivalModels. This involves first training the model, then using it to predict risk/risk-classes for a set of validation cohorts and finally assessing performance on the validation data.

To train a model, the trainModel method is used. This function abstracts away the implementation of model training, allowing end-users to focus on applying the SurvivalModel to make predictions without a need to understand the model internals. We hope this will make the package useful for those unfamiliar or uninterested in the details of survival prediction methods.

For training a PCOSP model there are two parameters. First, numModels is the number of models to train for use in the ensemble classifier to predict PCOSP scores. To keep computation brief, we are only training 25 models. However, it is recommended to use a minimum of 1000 for real world applications. The second parameter is minAccuracy, which is the minimum model accuracy for a trained model to included in the final model ensemble. Paradoxically, increasing this too high can actually decrease the overall performance of the PCOSP model. We recommend 0.6 as a sweet spot between random chance and over-fitting but encourage experimentation to see what works best with your data.

trainedPCOSPmodel <- trainModel(PCOSPmodel, numModels=15, minAccuracy=0.6)

## $randomSeed
## [1] 1987
## $RNGkind
## [1] "Mersenne-Twister" "Inversion"        "Rejection"       
## $minDaysSurvived
## [1] 365
## $numModels
## [1] 15
## $minAccuracy
## [1] 0.6

We can see that after training, the additional model parameters are added to the modelParams item in the model metadata. The goal is to ensure that your model training, prediction and validation are fully reproducible by capturing the parameters relevant to a specific model.

1.4 Risk Prediction with a PCOSP Model

After training, a model can now be used with new data to make risk predictions and classify samples into ‘good’ or ‘bad’ survival groups. To do this, the standard predictClasses method is used. Similar to trainData, we have abstracted away the implementation details to provide users with a simple, consistent interface for using SurvivalModel sub-classes to make patient risk predictions.

PCOSPpredValCohorts <- predictClasses(validationCohortList,

The returned CohortList object now indicates that each of the cohorts have predictions. This information is available in the elementMetadata slot of the cohort list and can be accessed with the mcols function from S4Vectors.

## DataFrame with 10 rows and 2 columns
##             mDataType hasPredictions
##           <character>      <logical>
## ICGCMICRO   rna_micro           TRUE
## PCSI          rna_seq           TRUE
## TCGA          rna_seq           TRUE
## KIRBY         rna_seq           TRUE
## UNC         rna_micro           TRUE
## CHEN        rna_micro           TRUE
## COLLISON    rna_micro           TRUE
## ZHANG       rna_micro           TRUE
## OUH         rna_micro           TRUE
## WINTER      rna_micro           TRUE

Predicting risk with a specific model adds a corresponding metadata column to the object colData. In the case of a PCOSP model, the new column is called PCOSP_prob_good and represents the proportion of models in the ensemble which predicted good survival for a given sample.

unique_patient_ID grade sex age T N M sample_type sample_name survival_time event_occurred prognosis PCOSP_prob_good
ICGC_0001 ICGC_0001 G1 M 59 T2 N0 M0 tumour ICGC_0001 522 1 good 0.6666667
ICGC_0002 ICGC_0002 G2 F 77 T2 N1a MX tumour ICGC_0002 2848 1 good 0.4666667
ICGC_0003 ICGC_0003 G2 F 58 T3 N1a MX tumour ICGC_0003 1778 0 good 0.9333333
ICGC_0004 ICGC_0004 G2 F 78 T2 N0 MX tumour ICGC_0004 1874 1 good 1.0000000
ICGC_0005 ICGC_0005 G3 M 63 T3 N1b MX tumour ICGC_0005 641 1 good 0.5333333
ICGC_0008 ICGC_0008 G2 M 72 T3 N1b MX tumour ICGC_0008 1209 1 good 0.8000000

Additionally, binary predictions of good or bad survival can be found in the PCOSPpredictions item of each SurvivalExperiments metadata. This contains the raw predictions from the model for each classifier in the ensemble, ordered by classifier accuracy. This data is not important for end users, but is used internally when calculating validation statistics for the model. For users wishing to classify samples rather than estimate risks, we recommend a PCOSP cut-off of >0.5 for good survival prognosis.

knitr::kable(metadata(PCOSPpredValCohorts[[1]])$PCOSPpredictions[1:5, 1:5])
ICGC_0001 ICGC_0002 ICGC_0003 ICGC_0004 ICGC_0005
rank1 bad bad good good good
rank2 good good good good bad
rank3 good good good good good
rank4 bad good good good good
rank5 good bad good good bad

1.5 Validating A PCOSP Model

The final step in the standard SurvivalModel work-flow is to compute model performance statistics for the model on the validation data. This can be accomplished using the validateModel method, which will add statistics to the validationStats slot of a SurvivalModel object and the data to the validationData slot.

validatedPCOSPmodel <- validateModel(trainedPCOSPmodel,
statistic estimate se lower upper p_value n cohort isSummary mDataType
AUC 0.7070000 0.0470000 0.6150000 0.7990000 0.0000142 158 ICGCMICRO FALSE rna_micro
log_D_index 0.5956475 0.1592665 0.4882238 0.7030711 0.0001668 158 ICGCMICRO FALSE rna_micro
concordance_index 0.6434057 0.0321350 0.5804223 0.7063892 0.0000081 158 ICGCMICRO FALSE rna_micro
AUC 0.6400000 0.0570000 0.5280000 0.7530000 0.0088634 107 PCSI FALSE rna_seq
log_D_index 0.4697159 0.1838308 0.3457239 0.5937078 0.0105421 107 PCSI FALSE rna_seq
concordance_index 0.6212825 0.0387126 0.5454073 0.6971577 0.0017309 107 PCSI FALSE rna_seq

Examining the data.table from the validationStats slot we can see that three model performance statistics have been calculated for all of the validation cohorts. Additionally, aggregate statistics have been calculated by molecular data type and for all cohorts combined. This table can be used to generate model performance plots. We have included several functions for examining model performance.

1.6 Plotting Model Performance

PCOSPdIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='log_D_index')

PCOSPconcIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='concordance_index')

cohortROCplots <- plotROC(validatedPCOSPmodel, alpha=0.05)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the PDATK package.
##   Please report the issue at <>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.