Modeling Metastasis
Predicting cancer outcomes in Merkel Cell Carcinoma
Consulting project for Insight’s Health Data Science program
Merkel cell carcinoma is a rare but aggressive form of skin cancer: it's five times more deadly than more common skin cancers like melanoma. Because the cancer is so aggressive, all patients are recommended to undergo a biopsy procedure after initial diagnosis to determine whether the cancer has metastasized to the lymph nodes.
However, the biopsy procedure has a high complication rate, with about 10% of patients experiencing surgical complications like infection. Only about 30% of patients who receive the lymph node biopsy actually have a positive result of metastasis, which means that the other 70% of patients undergo an unnecessary surgery. Because of this high burden to the patient, physicians want to be able to identify which patients are at low risk for metastasis and can avoid the lymph node biopsy procedure.
For my Insight project, I consulted with physicians at OHSU who specialize in skin cancer research to tackle this problem. My goal was to develop a machine learning model that can predict the probability of metastasis from new patient data, which could then be used to guide clinical decision-making about the biopsy procedure.
Data challenges
I used publicly available data on Merkel Cell Carcinoma from the National Cancer Database (NCDB). Although this database, which began in 2002, has records for almost 15,000 patients, the recording of sentinel lymph node (SLN) biopsies did not start until 2012, which meant about two-thirds of the data was not usable for this analysis. Several important model variables also had substantial missing data. For example, several histologic features of the Merkel cell tumor (e.g. size, growth pattern) were not available for all patients.
After cleaning the data, I decided to use imputation methods to recover more patient cases and maximize the sample size. I used IterativeImputer in scikit-learn, which is a regression-based imputation technique that predicts the missing values based on the other available features in the data. It does this in a round-robin fashion, until all missing values are filled in.
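A minimal sketch of this imputation step, using a tiny toy array in place of the real NCDB features (the values here are illustrative, not actual patient data):

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries (NaN) scattered across columns
X = np.array([
    [64.0, 1.2, np.nan],
    [71.0, np.nan, 0.0],
    [58.0, 2.5, 1.0],
    [np.nan, 0.8, 0.0],
])

# Each feature with missing values is modeled as a regression on the
# others, cycling round-robin until the estimates converge
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```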
After processing the data, I had ~2500 patient records with 7 features. The features included basic demographic information, like age and sex, as well as variables about the Merkel cell tumor: size, location on the body, and histologic features recorded by a dermatopathologist.
Model training
First, I split the data into training (50%), validation (25%), and hold-out test (25%) sets. Because there were about twice as many cases with negative outcomes (i.e., no metastasis), I stratified the samples based on the outcome variable. This ensured that each set of the data had a similar proportion of positive and negative cases of metastasis.
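One way to produce this stratified 50/25/25 split is two chained calls to scikit-learn's train_test_split; the synthetic X and y below stand in for the processed NCDB features and metastasis labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2500, 7))
y = rng.choice([0, 0, 1], size=2500)  # roughly 2x more negative cases

# First carve off the 50% training set, stratifying on the outcome...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# ...then split the remainder in half for validation and hold-out test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

Stratifying both splits keeps the positive/negative ratio nearly identical across all three sets.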
Next, I set up the model training to predict positive vs. negative cases of metastasis. I focused on logistic regression as my primary classification algorithm because it was important for the physicians I consulted with to have a model that clinicians could easily interpret. Due to the imbalance in classes, I used balanced class weights in the logistic regression, which substantially improved the model performance.
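A sketch of the balanced logistic regression, on synthetic data with a similar ~2:1 class imbalance (the feature matrix is a stand-in, not the real patient data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: ~67% negative, ~33% positive
X, y = make_classification(n_samples=1000, n_features=7,
                           weights=[0.67, 0.33], random_state=0)

# class_weight="balanced" reweights each class inversely to its
# frequency, so the rarer metastasis class is not under-penalized
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X)[:, 1]  # predicted probability of the positive class
```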
Probability calibration
I also applied probability calibration so that the predicted probabilities better reflected the relative risk observed in the sample. Many classification algorithms, including my simple logistic regression, don't produce well-calibrated probabilities, because the models are optimizing the ranking of predictions for classification. In a well-calibrated model, cases with a predicted probability of 0.1 should have true positive metastasis outcomes only 10% of the time. Given that we planned to use the model to generate predicted probabilities of metastasis for new patients, having a well-calibrated model was a priority for this project. I used CalibratedClassifierCV in scikit-learn, which allowed me to compare both parametric and non-parametric calibration methods during model training.
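A hedged sketch of the calibration step with CalibratedClassifierCV: method="sigmoid" is the parametric (Platt scaling) option, and "isotonic" is the non-parametric alternative one could compare against. Data are synthetic stand-ins:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=7,
                           weights=[0.67, 0.33], random_state=0)

base = LogisticRegression(class_weight="balanced", max_iter=1000)

# Fits the base model on CV folds and learns a mapping from its raw
# scores to calibrated probabilities; swap method="isotonic" to compare
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X, y)

probs = calibrated.predict_proba(X)[:, 1]
```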
Because I had a relatively modest sample size (only ~1K cases for training), I used 5 resampled iterations of the data with 3-fold cross-validation. This method applies random resampling to the data 5x (similar in spirit to bootstrapping), and then generates 3 cross-validation folds for each resample. By comparing the model training results across a total of 15 iterations, I had more confidence that the training results would generalize to new data.
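One way to implement this scheme is scikit-learn's RepeatedStratifiedKFold, which reshuffles the data on each repeat and yields 5 × 3 = 15 train/test splits (shown here on toy labels with the same ~2:1 imbalance):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data with a ~2:1 negative-to-positive class ratio
y = np.array([0, 0, 1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

# 3 folds per repeat, 5 repeats with fresh random shuffles = 15 splits
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)
n_iterations = sum(1 for _ in cv.split(X, y))
```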
Decision thresholds
After training the model, I performed a separate validation step to tune the decision threshold for the model. By default, classification algorithms assume a decision threshold of 0.5, where predicted probabilities > 0.5 are labeled as 'positive', and probabilities < 0.5 are labeled as 'negative'. However, labeling a patient with a 0.45 predicted probability of metastasis as 'negative' isn't ideal for this clinical application, because they actually have a relatively high chance of metastasis. Instead, we want to have greater confidence that cases labeled as 'negative' have a very low incidence of metastasis (e.g. < 5% metastasis). This is reflected in the precision for the negative class: the proportion of predicted 'negative' cases that are truly negative. Using a separate validation set, I chose a decision threshold that maximized the precision for the negative class.
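The threshold search can be sketched as a simple sweep over candidate cutoffs, scoring each by negative-class precision. The probabilities and labels below are illustrative stand-ins for the validation set, not real results:

```python
import numpy as np

# Hypothetical validation labels (1 = metastasis) and predicted probabilities
y_val = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
probs = np.array([0.05, 0.10, 0.20, 0.35, 0.15, 0.60, 0.08, 0.40, 0.55, 0.30])

best_threshold, best_precision = None, -1.0
for t in np.arange(0.05, 0.55, 0.05):
    pred_neg = probs < t                  # cases the model would label 'negative'
    if pred_neg.sum() == 0:
        continue                          # no negatives called at this cutoff
    # Precision for the negative class: fraction of called negatives
    # that are truly negative
    precision_neg = (y_val[pred_neg] == 0).mean()
    if precision_neg > best_precision:
        best_precision, best_threshold = precision_neg, t
```

In practice one might instead pick the largest threshold that keeps negative-class precision above a clinical target (e.g. ≥ 95%), so that as many patients as possible can safely skip the biopsy.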
Model evaluation
As the final step, I evaluated the model on the hold-out test set, where it achieved a relatively high precision for the negative class. This means we identified a sub-group of patients we are most confident don't have metastasis, who could potentially avoid the lymph node biopsy. However, optimizing for precision came at a cost to recall, because many truly negative cases were now labeled as 'positive' for metastasis. This means that for new patient data, many patients who don't have metastasis would still be recommended for biopsy. However, currently all patients with Merkel Cell Carcinoma are recommended for biopsy, so the model predictions still provide an improvement over the current clinical standard of care.
Deliverables
Statistical inference
The physicians I worked with didn't just want black-box predictions; they also wanted to interpret the model coefficients to understand the relative risk for each feature. After training the model, I used bootstrap resampling (1000 iterations) on the training set to generate confidence intervals for the model coefficients. These results provided valuable information about which features were most important for the model prediction. For example, the most important variable was a histologic feature of the Merkel cell tumor that was not recorded for the majority of patients in the NCDB. Given that Merkel Cell Carcinoma is a relatively rare cancer, it's also under-studied, and these results will provide valuable information for the field to improve data collection standards moving forward.
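A sketch of the bootstrap procedure for coefficient confidence intervals, on synthetic data and with 200 iterations for brevity (the project used 1000):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training set
X, y = make_classification(n_samples=1000, n_features=7, random_state=0)
rng = np.random.default_rng(0)

coefs = []
for _ in range(200):
    # Resample patient records with replacement and refit the model
    idx = rng.integers(0, len(y), size=len(y))
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[idx], y[idx])
    coefs.append(clf.coef_[0])

coefs = np.array(coefs)
# Percentile 95% confidence interval for each coefficient
ci_low, ci_high = np.percentile(coefs, [2.5, 97.5], axis=0)
```

Coefficients whose interval excludes zero are the features the model relies on most consistently across resamples.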
Clinical prototype
Ultimately, the physicians plan to use these model predictions to inform clinical decision-making about whether new Merkel Cell patients should undergo the lymph node biopsy. As a clinical prototype, I created a web application in Streamlit that deploys the final model and generates a predicted probability from new patient data. The physicians enter new data about the patient (e.g. characteristics of their Merkel cell tumor) and obtain the predicted risk of metastasis. This will allow patients to better understand their relative risk level when discussing their care plan with their physician.