A great deal of focus in the pathology AI world is placed on creating models. Models to segment nuclei and tissue types, count mitoses, assess biomarkers, or predict patient outcomes.

But these powerful models can only reach their full potential if a few other key components are in place: 1) quality control, 2) generalizable models, and 3) proper validation.

I bring up these critical points not to imply that your models will be doomed to failure if you have not tackled these items, but to emphasize that these challenges must be considered during model development. With appropriate planning, your machine learning models can become repeatable, reliable, and robust.

Whole Slide Quality Control

Whole slide images generally come with some processing artifacts: tissue folds, uneven sectioning, out-of-focus regions, bubbles, etc. Some artifacts can be caught during scanning so the slide can be reprocessed, but others make it into the final image.

How do artifacts affect model performance?

Schömig-Markiefka et al. developed a method for stress testing deep learning algorithms by measuring the influence of artifacts [1]. They trained a deep learning diagnostic model on prostate tissue, then applied simulated artifacts of varying severity to the test images and measured the effect on model performance. Every artifact type degraded performance, and the degradation grew with severity. After all, artifacts obscure the relevant biology.
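The basic recipe is easy to emulate. Below is a minimal sketch, not the authors' code: it assumes a hypothetical trained classifier with a `predict` method and a set of held-out tiles with labels, corrupts the tiles at increasing severities (blur only, for brevity), and records how accuracy falls.

```python
# Minimal stress-test sketch: degrade held-out tiles with increasing blur
# and record how accuracy changes. The model, tiles, and labels are
# placeholders; the study in [1] used several synthetic artifact types.
import numpy as np
from scipy.ndimage import gaussian_filter

def accuracy_under_blur(model, tiles, labels, sigmas=(0, 1, 2, 4, 8)):
    """Return {sigma: accuracy} for each blur severity."""
    results = {}
    for sigma in sigmas:
        if sigma == 0:
            degraded = tiles
        else:
            # Blur the spatial dimensions only; leave color channels intact.
            degraded = np.stack(
                [gaussian_filter(t, sigma=(sigma, sigma, 0)) for t in tiles]
            )
        preds = model.predict(degraded)  # assumed to return class labels
        results[sigma] = float(np.mean(preds == labels))
    return results
```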

How frequently do artifacts occur in whole slide images? That depends on where the slides were prepared and where they were scanned.

And, finally, how can image regions with these artifacts be identified and excluded from model training and inference?

Recently announced commercial products from Visiopharm and Proscia tackle quality control. And for those developing their own machine learning solutions, two great tools for identifying artifacts are HistoQC and PathProfiler.

Developed by Andrew Janowczyk at Case Western Reserve University, HistoQC is an open source package that uses traditional machine learning methods to detect defects and batch effects in whole slides [2].

PathProfiler, on the other hand, is a framework for training a deep learning model for quality assessment [3]. Haghighat et al. at the University of Oxford trained their model on artifacts in prostate cancer slides, so it would likely need to be retrained for use on a different type of tissue.
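Whichever tool you reach for, the underlying idea is the same: score each region of a slide and exclude the regions that fail. The snippet below is only an illustration of that idea, not the HistoQC or PathProfiler API; the thresholds are placeholders that would need tuning on your own slides.

```python
# Illustrative tile-level QC heuristics (not the HistoQC or PathProfiler API):
# flag tiles that contain too little tissue or appear out of focus.
import cv2
import numpy as np

def tile_passes_qc(tile_rgb, sat_thresh=0.05, min_tissue_frac=0.2,
                   min_sharpness=50.0):
    """tile_rgb: uint8 RGB array. Returns (keep, diagnostics)."""
    hsv = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2HSV)
    # Background on an H&E slide is nearly unsaturated; tissue is not.
    tissue_frac = float((hsv[..., 1] / 255.0 > sat_thresh).mean())

    gray = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2GRAY)
    # Variance of the Laplacian is a common sharpness proxy;
    # low values suggest an out-of-focus region.
    sharpness = float(cv2.Laplacian(gray, cv2.CV_64F).var())

    keep = tissue_frac >= min_tissue_frac and sharpness >= min_sharpness
    return keep, {"tissue_frac": tissue_frac, "sharpness": sharpness}
```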

Regardless of the tool you choose, it should be validated on your data. Quality control software has come a long way, but it may not be sufficient for every dataset and introduces an opportunity for bias if not implemented appropriately.

Model Generalizability to Different Labs and Scanners

While artifacts should typically be discarded from a dataset, other variations are so ubiquitous that they generally need to be accommodated instead of removed.

One of the largest challenges in histopathology image analysis is creating models that are robust to the variations across different labs and imaging systems. These variations stem from differences in the color response of slide scanners and in the raw materials, manufacturing techniques, and staining protocols used to prepare the slides.

Different setups can produce images with different stain intensities or other changes, creating a domain shift between the source data that the model was trained on and the target data on which a deployed solution would need to operate. When the domain shift is too large, a model trained on one type of data will fail on another, often in unpredictable ways.

So how do you handle the domain shift?

There are a variety of solutions, each tackling the problem from a different perspective.

Perhaps the easiest strategy to apply is modifying the input images. Color augmentation can be used to increase the diversity of training images [4], while stain normalization tries to reduce the variation [5,6]. Instead of transforming the input images, other approaches focus on the representation the model learns – its feature space – and on making it more generalizable [7, 8]. Finally, models can be modified to better fit a target dataset. Test-time adaptation or finetuning transforms the model to suit a target domain [9, 10]. However, it will no longer perform as well on the source dataset and may not generalize to other domains.
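As a concrete example of the first strategy, here is a minimal sketch of stain color augmentation in the HED (hematoxylin-eosin-DAB) color space, in the spirit of the augmentations evaluated in [4]. The jitter ranges are illustrative rather than tuned values, and the function assumes a float RGB tile in [0, 1].

```python
# Sketch of HED-space color jitter: perturb each stain channel independently
# so the model sees a wider range of staining appearances during training.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_color_jitter(tile_rgb, alpha=0.05, beta=0.01, rng=None):
    """tile_rgb: float RGB image in [0, 1]. Returns an augmented copy."""
    rng = rng or np.random.default_rng()
    hed = rgb2hed(tile_rgb)
    for c in range(3):
        scale = 1.0 + rng.uniform(-alpha, alpha)  # multiplicative jitter
        shift = rng.uniform(-beta, beta)          # additive jitter
        hed[..., c] = hed[..., c] * scale + shift
    return np.clip(hed2rgb(hed), 0.0, 1.0)
```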

Different techniques are appropriate for different use cases. Often these approaches are combined to improve model generalizability.

Proper Validation

Evaluating a model on images from different labs and scanners than it was trained on is an essential validation step.

But other batch effects can also impact modeling results.

Howard et al. studied how batch effects impact deep learning models using the TCGA dataset [11]. They found that deep learning could accurately identify the site that submitted a tissue sample. This remained true despite the use of color normalization and augmentation techniques.

They further demonstrated that survival times, gene expression patterns, driver mutations, and other variables vary substantially across institutions and should be accounted for during model training. These batch effects can lead to biased estimates of model performance.
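One practical consequence is that training and evaluation splits should be grouped by site, so that performance is always estimated on institutions the model has never seen. Here is a minimal sketch using scikit-learn's GroupKFold, with hypothetical feature, label, and site arrays.

```python
# Site-aware cross-validation: slides from the same submitting site never
# appear in both the training and test folds, so the estimate reflects
# performance on unseen institutions.
from sklearn.model_selection import GroupKFold

def site_aware_folds(features, labels, site_ids, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    # Each yielded (train_idx, test_idx) pair has no sites in common.
    return gkf.split(features, labels, groups=site_ids)
```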

Javed et al. recently outlined an evaluation framework for these and other challenges in pathology [12]. They recommended approaches for dealing with variability in labels, identifying confounding variables, and stratifying metrics, among others.
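Metric stratification in particular is straightforward to implement. As an illustration (the grouping variable and the metric are just one possible choice, not a recommendation from [12]), you can report AUC per subgroup instead of a single pooled number.

```python
# Report a metric per subgroup (here AUC per site) so that underperforming
# subgroups are visible rather than averaged away.
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc(y_true, y_score, groups):
    """Return {group: AUC} for each group where both classes are present."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            continue  # AUC is undefined with a single class
        per_group[g] = roc_auc_score(y_true[mask], y_score[mask])
    return per_group
```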

A recently released toolbox called REET (Robustness Evaluation and Enhancement Toolbox) from the University of Warwick tackles some types of model failures. Foote et al. simulated image perturbations like noise, blurring, and staining variations and used adversarial training to enhance model robustness [13].
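For readers curious what adversarial training looks like in practice, here is a generic FGSM-style training step in PyTorch. It is not the REET API, only a minimal sketch of the underlying idea, and the epsilon value is an arbitrary placeholder.

```python
# Generic FGSM-style adversarial training step (not the REET API): perturb
# the batch in the direction that increases the loss, then update the model
# on the perturbed images to encourage robustness.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, images, labels, optimizer, epsilon=0.01):
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Fast gradient sign method: step along the sign of the input gradient.
    adv_images = (images + epsilon * images.grad.sign()).clamp(0.0, 1.0).detach()

    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_images), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```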

Understanding the failure modes of a model is essential to uncovering deficiencies in artifact detection or model generalizability. Or it might reveal a subgroup of patients for which your model underperforms.

Awareness of the ways in which your model can fail to produce the expected results is critical in deciding whether it is ready to be deployed. It is also a prerequisite for identifying ways to improve your model.

Lessons Learned

The challenges discussed in this article are present in almost all real-world datasets – a clear distinction from the clean benchmark datasets used in academic research.

Real-world imagery is messy. This observation is not unique to pathology, but certain types of variation are specific to it. Some variations (like artifacts) are best discarded from both training and inference data, while others (like staining differences) typically need to be accommodated.

The most critical lesson I can leave you with is to validate. Artifacts and color variations are expected in whole slide images. What other challenges are present in your dataset?


References

[1] Schömig-Markiefka, B., Pryalukhin, A., Hulla, W., Bychkov, A., Fukuoka, J., Madabhushi, A., Achter, V., Nieroda, L., Büttner, R., Quaas, A. and Tolkach, Y., 2021. Quality control stress test for deep learning-based diagnostic model in digital pathology. Modern Pathology, 34(12), pp.2098-2108.

[2] Janowczyk, A., Zuo, R., Gilmore, H., Feldman, M. and Madabhushi, A., 2019. HistoQC: an open-source quality control tool for digital pathology slides. JCO Clinical Cancer Informatics, 3, pp.1-7.

[3] Haghighat, M., Browning, L., Sirinukunwattana, K., Malacrino, S., Alham, N.K., Colling, R., Cui, Y., Rakha, E., Hamdy, F., Verrill, C. and Rittscher, J., 2021. PathProfiler: Automated Quality Assessment of Retrospective Histopathology Whole-Slide Image Cohorts by Artificial Intelligence, A Case Study for Prostate Cancer Research. medRxiv.

[4] Faryna, K., van der Laak, J. and Litjens, G., 2021, February. Tailoring automated data augmentation to H&E-stained histopathology. In Medical Imaging with Deep Learning.

[5] Salehi, P. and Chalechale, A., 2020, February. Pix2pix-based stain-to-stain translation: a solution for robust stain normalization in histopathology images analysis. In 2020 International Conference on Machine Vision and Image Processing (MVIP) (pp. 1-7). IEEE.

[6] Shaban, M.T., Baur, C., Navab, N. and Albarqouni, S., 2019, April. StainGAN: Stain style transfer for digital histological images. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (pp. 953-956). IEEE.

[7] Marini, N., Atzori, M., Otálora, S., Marchand-Maillet, S. and Müller, H., 2021. H&E-adversarial network: a convolutional neural network to learn stain-invariant features through Hematoxylin & Eosin regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 601-610).

[8] Koohbanani, N.A., Unnikrishnan, B., Khurram, S.A., Krishnaswamy, P. and Rajpoot, N., 2021. Self-path: Self-supervision for classification of pathology images with limited annotations. IEEE Transactions on Medical Imaging, 40(10), pp.2845-2856.

[9] Li, Y., Wang, N., Shi, J., Hou, X. and Liu, J., 2018. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80, pp.109-117.

[10] Aubreville, M., Bertram, C.A., Donovan, T.A., Marzahl, C., Maier, A. and Klopfleisch, R., 2020. A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research. Scientific Data, 7(1), pp.1-10.

[11] Howard, F.M., Dolezal, J., Kochanny, S., Schulte, J., Chen, H., Heij, L., Huo, D., Nanda, R., Olopade, O.I., Kather, J.N. and Cipriani, N., 2020. The impact of digital histopathology batch effect on deep learning model accuracy and bias. bioRxiv.

[12] Javed, S.A., Juyal, D., Shanis, Z., Chakraborty, S., Pokkalla, H. and Prakash, A., 2022. Rethinking Machine Learning Model Evaluation in Pathology. arXiv preprint arXiv:2204.05205.

[13] Foote, A., Asif, A., Rajpoot, N. and Minhas, F., 2022. REET: Robustness Evaluation and Enhancement Toolbox for Computational Pathology. arXiv preprint arXiv:2201.12311.