While many computer vision and machine learning approaches are transferable across application areas, others are fairly unique to a particular type of data.
In this article, I walk through a collection of articles on computational pathology, covering topics from the challenges of annotation to model evaluation and explainability. Each piece adds to an understanding of the unique complexities of histopathology images and the current state of computer vision for pathology. You’ll gain insights into label-efficient learning, quality control, bias, and reproducibility in the pursuit of advancing medical diagnostics and research.
Just getting started in computational pathology? Here’s a great overview of the field.
Cooper et al. discuss models for different types of clinical tasks and approaches to learning models with weak or unlabeled images. They also review some of the challenges: bias and fairness, heterogeneous outcomes, and interpretability. Finally, they touch on some opportunities that are just beginning to be explored: multimodal models and large generative models.
It can be challenging to obtain a sufficient number of medical images to train a deep learning model – let alone annotate them with enough accuracy and detail.
A number of label-efficient learning approaches have been developed, many originating in the computer vision literature before being applied to medical images.
Jin et al. provided a very thorough review of these label-efficient methods: semi-supervised, self-supervised, multiple instance, active, and annotator-efficient learning.
A great resource if you’re looking for alternative approaches to accommodate the challenge of annotating medical images.
Annotation is a time-consuming but necessary step in preparing to train a machine learning model.
For whole slide images, annotations can be applied at the patient, region, or cell level.
This article by Wahab et al. reviews the steps required for proper annotation and some of the pitfalls in the process.
Their recommendations include:
- Prior to annotating, design the algorithm so that all parties understand the purpose of the annotations.
- Start with a pilot phase to identify annotation issues.
- A large number of region types can be labeled initially and merged later.
- Training an initial ML model will help identify challenging classes that can be prioritized for annotation.
- Inter-observer agreement should be discussed regularly and class definitions refined as needed.
- Keep an ‘unknown’ category to avoid adding noisy annotations to other classes.
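To make the inter-observer agreement recommendation concrete, here is a minimal sketch of Cohen's kappa, a common chance-corrected agreement statistic for two annotators. It is my own illustration rather than a procedure taken from Wahab et al., and the label names and annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical region labels from two pathologists (assumed data).
pathologist_1 = ["tumor", "tumor", "stroma", "stroma", "necrosis", "unknown"]
pathologist_2 = ["tumor", "stroma", "stroma", "stroma", "necrosis", "unknown"]

kappa = cohens_kappa(pathologist_1, pathologist_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Tracking a number like this per class over time is one way to notice when a class definition needs to be refined.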
It generally requires a domain expert – a pathologist – to annotate histopathology images, but computer vision engineers should gain a basic understanding of what they’re annotating and the nuances involved.
Machine learning models for pathology images come with some unique challenges not present in natural imagery: images can only be interpreted by an expert, labeling is expensive, the images are large, patient samples are often limited, and there is potential for spurious correlations, among others.
Javed et al. outlined some guidelines for model evaluation.
Some of their suggestions are:
- Incorporate stratified evaluation metrics as opposed to a single test-set wide global metric.
- Conduct a correlation check on all metadata values present in the dataset to see if any of them correlate highly with label values for the task of interest.
- Test models under known sources of variations and evaluate for consistency and accuracy in performance.
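The first two suggestions above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not code from Javed et al.: the metadata fields (`scanner`, `site`), the labels, and the predictions are all hypothetical, and comparing positive-label rates per group is only one simple way to look for an association between metadata and labels.

```python
from collections import defaultdict


def stratified_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each value of a metadata field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def positive_rate_by_group(y_true, groups, positive=1):
    """Fraction of positive labels within each metadata value.

    Large differences between groups suggest the metadata field is
    associated with the label and could act as a shortcut for a model.
    """
    positives = defaultdict(int)
    total = defaultdict(int)
    for truth, group in zip(y_true, groups):
        total[group] += 1
        positives[group] += int(truth == positive)
    return {group: positives[group] / total[group] for group in total}


# Hypothetical test set: labels, predictions, and two metadata fields.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
scanner = ["A", "A", "A", "A", "B", "B", "B", "B"]
site = ["X", "X", "Y", "Y", "X", "Y", "X", "Y"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_scanner = stratified_accuracy(y_true, y_pred, scanner)
rate_by_site = positive_rate_by_group(y_true, site)

print("overall accuracy:", overall)
print("accuracy per scanner:", per_scanner)
print("positive rate per site:", rate_by_site)
```

In this toy data the single global number hides a large gap between scanners, and every positive case comes from one site, which is exactly the kind of pattern these checks are meant to surface.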
Recommendations on Compiling Test Datasets for Evaluating Artificial Intelligence Solutions in Pathology
Thorough validation of machine learning models is essential to understanding how they will perform in a real setting.
Yet compiling validation datasets can be challenging and, for pathology at least, there are no standard procedures.
Homeyer et al. developed a set of recommendations for evaluating AI solutions in pathology.
There are a number of important points in this article.
- Test datasets must be large and diverse to reduce the risk of spurious correlations. However, they must also be small enough to be collected with reasonable cost and effort. This is a tough balance to strike.
- Test data should represent a realistic performance assessment for routine use.
- Biological, technical, and observer variables must be considered.
- Data must be collected following a consistent protocol – which can be hard to achieve with retrospective datasets.
Selecting an appropriate metric for validating an ML model is relatively straightforward when working with a benchmark dataset. But it can become much more complicated with real world datasets.
Which metric should you assess performance by? Do you need more than one?
Metrics Reloaded provides a framework for categorizing most biomedical image analysis problems and a flowchart for identifying appropriate metrics.
While not specific to pathology images, this article and the next have a lot of pathology examples.
This may not be something you read end-to-end in one sitting but is very valuable for referring back to any time you start a new image analysis task.
Validation metrics are the key to tracking progress for any AI project. But there are many different metrics that can be chosen for a particular application. And the most frequently selected metrics may not be the most helpful – sometimes for rather subtle reasons.
As a companion to the “Metrics Reloaded” paper, Reinke et al. detailed some of the most common pitfalls in selecting metrics for image analysis, specifically for image-level classification, semantic segmentation, object detection, and instance segmentation.
I can’t summarize all the important insights they revealed in this short section – it’s worth checking out yourself. They have a clear figure to demonstrate each pitfall, so just reviewing these would be worthwhile for anyone tackling an image analysis project.
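To give a flavor of how a familiar metric can mislead, here is a small sketch of one widely known issue: plain accuracy on an imbalanced dataset. I chose this example myself, so it is not necessarily one of the figures in the paper, and the class counts are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical image-level task: 95 benign (0) and 5 malignant (1) cases.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts "benign".
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```

The model never finds a malignant case, yet its accuracy looks excellent. Only the class-aware metric reveals the problem.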
The Devil is in the Details: Whole Slide Image Acquisition and Processing for Artifacts Detection, Color Variation, and Data Augmentation: A Review
Automated pipelines applying deep learning to whole slide images have demonstrated a number of powerful capabilities.
But whole slide images come with some additional complexities like artifacts and staining variations. While an expert pathologist can adapt to these challenges, deep learning models are not typically that robust. If an inference image looks different from the training images, the model may produce an unreliable result.
Kanwal et al. outlined the challenges in artifact detection, color variation, and pathology-specific data augmentation.
They reviewed how tissue is acquired, processed, and imaged – and the many sources of variation from each step.
The solutions they covered include methods to detect artifacts, standardize staining variations, and augment images.
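As a taste of what artifact detection can look like, here is a minimal sketch of one common heuristic for flagging out-of-focus tiles: the variance of the Laplacian. It is a generic image-processing technique that I picked for illustration, not a specific method from the review. The tile is a synthetic grayscale checkerboard stored as nested lists, and any decision threshold would need to be tuned on real data.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over a tile's interior pixels.

    Sharp tiles have strong local intensity changes and score high.
    Blurry tiles score low.
    """
    rows, cols = len(gray), len(gray[0])
    if rows < 3 or cols < 3:
        raise ValueError("Tile must be at least 3x3 pixels.")
    responses = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            responses.append(
                gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1] + gray[r][c + 1]
                - 4 * gray[r][c]
            )
    return pvariance(responses)


def box_blur(gray):
    """3x3 mean filter on interior pixels (border pixels are left as-is)."""
    rows, cols = len(gray), len(gray[0])
    blurred = [row[:] for row in gray]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [gray[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            blurred[r][c] = sum(window) / 9
    return blurred


# Synthetic 8x8 checkerboard as a stand-in for a sharp grayscale tile.
sharp_tile = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
blurry_tile = box_blur(sharp_tile)

sharp_score = laplacian_variance(sharp_tile)
blurry_score = laplacian_variance(blurry_tile)
print("sharp:", sharp_score, "blurry:", blurry_score)
```

In a real pipeline, a score like this would be computed per tile and low-scoring tiles excluded or flagged for review before training or inference.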
Batch Effects and Bias
Variations in staining protocols, labs, and scanners are well known to affect the performance of deep learning models on histology images.
But the extent to which other batch effects can impact modeling results is often dismissed.
Howard et al. studied how batch effects impact deep learning models using the TCGA dataset.
They found that “batch effect exists in the histology images in TCGA across multiple cancer types, and inadequately controlling for this batch effect results in biased estimates of accuracy. Although stain normalization can remove some of the perceptible variation and augmentation can mask differences in color, second order image features are unaffected by these methods, and they do not resolve the ability of deep learning models to accurately identify a tissue submitting site.”
Four recommendations from their work:
- Variations in model predictions across sites should be reported.
- If variations across sites are seen, models should not be trained and assessed on patients from the same site.
- External validation should be the gold standard.
- In the absence of external validation, they proposed a quadratic programming technique for optimal stratification.
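The second recommendation, keeping sites separate between training and assessment, can be sketched as a simple group-wise split. To be clear, this is not the quadratic programming technique Howard et al. proposed: it only guarantees that no site appears on both sides and makes no attempt to balance labels across the split. The slide IDs and site names are hypothetical.

```python
import random
from collections import defaultdict


def split_by_site(slides, test_fraction=0.3, seed=0):
    """Assign whole sites to train or test so no site appears in both.

    `slides` is a list of (slide_id, site) pairs.
    """
    by_site = defaultdict(list)
    for slide_id, site in slides:
        by_site[site].append(slide_id)
    sites = sorted(by_site)
    if len(sites) < 2:
        raise ValueError("Need at least two sites to hold one out.")
    random.Random(seed).shuffle(sites)
    # Hold out at least one site, but always leave at least one for training.
    n_test = min(len(sites) - 1, max(1, round(len(sites) * test_fraction)))
    test_sites = set(sites[:n_test])
    train = [s for site in sites[n_test:] for s in by_site[site]]
    test = [s for site in sites[:n_test] for s in by_site[site]]
    return train, test, test_sites


# Hypothetical slides from four submitting sites.
site_names = ["site_A"] * 4 + ["site_B"] * 3 + ["site_C"] * 2 + ["site_D"] * 3
slides = [(f"slide_{i:02d}", site) for i, site in enumerate(site_names)]

train_ids, test_ids, held_out_sites = split_by_site(slides)
print("held-out sites:", held_out_sites)
print("train:", len(train_ids), "test:", len(test_ids))
```

Because whole sites are held out, the test score is closer to what external validation would show, though it is not a substitute for it.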
Hidden Variables in Deep Learning Digital Pathology and Their Potential to Cause Batch Effects: Prediction Model Study
Schmitt et al. studied the impact of batch effects due to patient age, slide preparation date, slide origin (institution), and scanner type.
They trained deep learning models to predict each of these variables.
Not surprisingly, their model was very accurate for slide origin and scanner, but predictions of patient age were also highly accurate. The accuracy of slide preparation date predictions varied widely.
Their recommendations: “balance any known batch effect variables during creation of the training data set, in addition to any normalization and preprocessing standardization. If easy-to-learn variables are equally balanced between classes, separation based on these variables should no longer result in a reduction of the training loss, thus losing its optimization value.”
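One simple way to act on this recommendation is to downsample until every combination of class label and batch variable has the same number of samples. The sketch below is my own illustration of that idea, not code from Schmitt et al. The sample IDs, labels, and scanner values are hypothetical, and downsampling discards data, which may not be acceptable for small cohorts.

```python
import random
from collections import Counter, defaultdict


def balance_by_variable(samples, seed=0):
    """Downsample so each (label, batch_value) cell has the same size.

    `samples` is a list of (sample_id, label, batch_value) tuples.
    """
    cells = defaultdict(list)
    for sample in samples:
        _, label, batch_value = sample
        cells[(label, batch_value)].append(sample)
    labels = {label for label, _ in cells}
    batch_values = {batch_value for _, batch_value in cells}
    missing = [
        (label, batch_value)
        for label in labels
        for batch_value in batch_values
        if (label, batch_value) not in cells
    ]
    if missing:
        raise ValueError(f"Cannot balance, empty combinations: {sorted(missing)}")
    # The smallest cell determines how many samples each cell keeps.
    target = min(len(members) for members in cells.values())
    rng = random.Random(seed)
    balanced = []
    for key in sorted(cells):
        balanced.extend(rng.sample(cells[key], target))
    return balanced


# Hypothetical training set where tumor slides mostly come from scanner A.
counts = {("tumor", "A"): 6, ("tumor", "B"): 2, ("normal", "A"): 3, ("normal", "B"): 5}
samples = [
    (f"{label}_{scanner}_{i}", label, scanner)
    for (label, scanner), n in counts.items()
    for i in range(n)
]

balanced = balance_by_variable(samples)
print(Counter((label, scanner) for _, label, scanner in balanced))
```

After balancing, the scanner no longer carries any information about the class, so a model gains nothing by learning it.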
Algorithm fairness has become an increasing concern across many different applications of AI. The consequences of a biased model are particularly concerning in healthcare.
Chen et al. discussed what it means for an algorithm to be fair in the context of medical applications, particularly pathology.
They discussed how bias arises in clinical applications and ways to mitigate it using federated learning, disentanglement, and model explainability.
Deep learning models for pathology images are particularly susceptible to distribution shifts. And spurious causal structure in whole slide images can correlate with race/ethnicity, providing a hidden shortcut for algorithms to predict variables like outcome.
One of the common challenges with deep learning in the research community – across many different application areas – is the difficulty in reproducing results.
There are many reasons why results for a particular method may differ. It could be as simple as a hyperparameter difference or more complex like a difference in data or annotation procedures.
Fell et al. reproduced three top-performing papers from the Camelyon 16 and 17 challenges on detecting lymph node metastases.
They were not able to reproduce the results in some of these papers due to details that were not provided.
Through this work, they proposed a reproducibility checklist that every researcher should review before publishing a paper. These are details that should be included for every published method for pathology. Some of these details may be relegated to the supplement, but they should be accessible for anyone who wishes to replicate the study.
While the latest deep learning algorithms for pathology have shown great power, understanding the patterns they base their predictions on is more challenging.
Muller et al. reviewed the different stakeholders in a computational pathology project and why explainability is important for each. They also outlined the components of an AI system and what makes an explanation a good explanation.
A variety of explainability approaches have been proposed, but it is not always clear which are the most helpful.
Evans et al. studied the usability of different types of model explanations for pathologists, focusing on saliency maps, concept attribution, prototypes, counterfactuals, and trust scores.
They conducted a survey of pathologists to evaluate example approaches from each category and rank them by how well they are understood, how relevant they are, whether they build trust, and if they provide valuable information.
“On the one hand, our findings demonstrate the preference of pathologists for simple visual explanations that mirror their way of thinking and integrate cleanly with their diagnostic workflow. On the other, they suggest dangers associated with explanations that are overly appealing in their simplicity, or allow for too much ambiguity in their interpretation.”
Confirmation bias is a major concern with many methods.
Together, these articles provide a solid overview of the current state of computer vision for histopathology.
While many of the techniques for modeling pathology images are similar to those for other domains, others were developed to handle the unique complexity of these massive gigapixel images.
But to develop a robust model, you really need to understand the nuances of these images – domain shift and batch effects especially – plus the annotation challenges that lead to inter-rater variability.
Once you’re ready to get your feet wet in developing your first model for pathology, be sure to research further and work with a pathologist who can provide the important domain expertise.