Image credit: Microsoft Bing Image Creator

Every computer vision project comes with its own set of challenges. These challenges are best revealed by a thorough understanding of the data.

These are some of the pitfalls that I’ve encountered in applying computer vision and machine learning to pathology images.

Yes, there are a lot of them…

1) Gigapixel Whole Slide Images

The most obvious challenge is simply the sheer size of whole slide images (WSI). These are gigapixel images – often 100,000 x 100,000 pixels in size – and they contain multiple levels of magnification.

Parts of each image contain tissue, but there is also a lot of whitespace. The useful portions of the image typically need to be separated out before doing anything with machine learning.

When it comes to applying deep learning, no GPU today has enough memory to process an entire WSI at once. WSIs are generally split into tiles for processing.
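The background filtering and tiling described above can be sketched in a few lines. This is a minimal illustration on a synthetic array, not a production pipeline: the function name `tissue_tiles`, the brightness threshold, and the minimum tissue fraction are all assumptions of mine, and a real slide would first be read at a chosen magnification level with a library such as OpenSlide.

```python
import numpy as np

def tissue_tiles(image, tile_size=256, white_threshold=220, min_tissue_fraction=0.5):
    """Yield (row, col, tile) for tiles that contain enough non-white pixels.

    image: uint8 RGB array of shape (H, W, 3), e.g. one downsampled WSI level.
    The thresholds are illustrative assumptions, not tuned values.
    """
    height, width, _ = image.shape
    for row in range(0, height - tile_size + 1, tile_size):
        for col in range(0, width - tile_size + 1, tile_size):
            tile = image[row:row + tile_size, col:col + tile_size]
            # A pixel counts as background only when all three channels are bright.
            is_tissue = (tile < white_threshold).any(axis=-1)
            if is_tissue.mean() >= min_tissue_fraction:
                yield row, col, tile

# Synthetic stand-in for a slide: white background with one pink "tissue" block.
slide = np.full((1024, 1024, 3), 255, dtype=np.uint8)
slide[256:768, 256:768] = (200, 120, 170)

kept = list(tissue_tiles(slide))
print(f"kept {len(kept)} of {(1024 // 256) ** 2} tiles")
```

A simple brightness threshold like this is only a starting point; pen marks, faint staining, and scanner background can all fool it.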

These tiles can be very diverse. Cancer may be limited to only a small region on the slide.

Regardless, WSIs require substantial data storage, transmission, and computational resources.

Not every application requires whole slide images. Tissue microarrays are used for many research applications in order to study smaller samples from a larger set of patients.

2) Few Patient Samples

WSIs contain billions of pixels. In fact, fewer than fifty WSIs will provide more pixels of tissue than the 1.2 million photos in ImageNet, the ubiquitous dataset of computer vision. However, it is the small number of patient samples that causes challenges.
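The pixel comparison above is easy to sanity check with a back-of-the-envelope calculation. The average ImageNet resolution and the fraction of a slide covered by tissue are rough assumptions I've plugged in for illustration; change them and the answer shifts accordingly.

```python
import math

# All numbers below are rough assumptions used only for an order-of-magnitude check.
imagenet_images = 1_200_000
imagenet_pixels_per_image = 400 * 350   # assumed average ImageNet resolution
wsi_pixels = 100_000 * 100_000          # one gigapixel-scale slide
tissue_fraction = 0.4                   # assumed share of the slide that is tissue

imagenet_total = imagenet_images * imagenet_pixels_per_image
tissue_pixels_per_wsi = wsi_pixels * tissue_fraction
slides_needed = math.ceil(imagenet_total / tissue_pixels_per_wsi)

print(f"ImageNet: {imagenet_total:.2e} pixels")
print(f"Slides needed to match: {slides_needed}")
```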

Unlike computer vision benchmark datasets that get larger each year, medical applications are often limited by the number of patients because of the time and expense of processing patient samples, the limited number of patients with a particular disease, and privacy concerns. Perhaps more than one image is available for each patient, but the biological diversity is still limited by the number of patients.

If privacy concerns and regulations are the limiting factor in creating a larger dataset, federated learning is one option. Federated learning is a paradigm for training on decentralized datasets while preserving privacy, enabling models to be trained on data from multiple medical centers without pooling that data in one place.
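To make the idea concrete, here is a toy sketch of federated averaging, one common way to train across sites. Everything in it is an illustrative assumption: the three "medical centers" are simulated, the model is a plain linear regression, and the helper names `local_update` and `federated_average` are my own. Only model weights leave each center; the data never does.

```python
import numpy as np

def local_update(weights, features, targets, learning_rate=0.1, steps=20):
    """Run a few gradient steps of linear regression on one center's private data."""
    w = weights.copy()
    for _ in range(steps):
        gradient = 2 * features.T @ (features @ w - targets) / len(targets)
        w -= learning_rate * gradient
    return w

def federated_average(local_weights, sample_counts):
    """Combine local models, weighting each center by its number of samples."""
    counts = np.asarray(sample_counts, dtype=float)
    return np.average(np.stack(local_weights), axis=0, weights=counts)

rng = np.random.default_rng(0)
true_weights = np.array([1.5, -2.0, 0.5])

# Three hypothetical medical centers, each holding its own data locally.
centers = []
for n_samples in (40, 60, 100):
    x = rng.normal(size=(n_samples, 3))
    y = x @ true_weights + rng.normal(scale=0.05, size=n_samples)
    centers.append((x, y))

global_weights = np.zeros(3)
for _ in range(10):  # communication rounds
    updates = [local_update(global_weights, x, y) for x, y in centers]
    global_weights = federated_average(updates, [len(y) for _, y in centers])

print(np.round(global_weights, 2))
```

Real deployments add much more on top of this, such as secure aggregation and handling of centers whose data distributions differ.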

Another option might be to pretrain a model on a larger unlabeled dataset (discussed in more detail in the next section). This will be easier for a common modality like hematoxylin and eosin stained slides.

For novel imaging modalities, you may not have enough images to learn a good representation. Transfer learning could be used to extract features or finetune a model, but transferring from a disparate image type like ImageNet can impede model generalizability. Simpler features and a simpler model are another option – for example, hand-crafted features to describe cell and tissue morphology.

A successful solution to a novel imaging modality with a small number of patient samples requires a deep understanding of the images – particularly when it comes to building a model that is robust and generalizable.

3) Scarce Labels

In other cases, training images are numerous, but annotations are scarce. In this situation, unsupervised learning methods like self-supervised learning (SSL) can provide a powerful solution.

A representation can be learned on the large amount of unlabeled data and then transferred or finetuned on the labeled data.

SSL makes use of a pretext task to learn a suitable representation. The most common pretext tasks are contrastive or reconstructive. Contrastive-style models like SimCLR, MoCo, and BYOL encourage differently augmented views of the same image patch to have similar encodings; SimCLR and MoCo also push the encodings of different patches apart. Reconstruction objectives, in contrast, use an encoder-decoder structure and attempt to rebuild image patches or channels that are blacked out or corrupted. Both types of SSL require the model to learn the patterns present in the images in order to be successful. Many of those patterns will also be useful for downstream tasks.
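As a rough illustration of the contrastive objective, here is a NumPy sketch of the NT-Xent loss that SimCLR uses. It only computes the loss on precomputed embeddings; the encoder, the augmentations, and the temperature value are left out or assumed, and the random vectors stand in for real patch embeddings.

```python
import numpy as np

def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings, each of shape (N, D).

    Row i of view_a and row i of view_b are embeddings of two augmentations
    of the same patch (a positive pair); all other rows act as negatives.
    """
    z = np.concatenate([view_a, view_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine similarity
    similarity = z @ z.T / temperature
    np.fill_diagonal(similarity, -np.inf)              # a sample is never its own negative

    n = len(view_a)
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Cross-entropy of each row against the index of its positive partner.
    row_max = similarity.max(axis=1, keepdims=True)
    shifted = similarity - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), positives].mean()

rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 16))
aligned = nt_xent_loss(patches, patches + 0.01 * rng.normal(size=(8, 16)))
unrelated = nt_xent_loss(patches, rng.normal(size=(8, 16)))
print(aligned < unrelated)
```

The loss is lower when the two views of each patch land close together in embedding space, which is exactly the behavior the pretext task rewards.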

But even SSL doesn’t solve the problem in all situations. It may miss subtle discriminative features. SSL also requires a very large amount of data to be successful and cannot always overcome the scarcity of labels. A classical ML model (perhaps even a linear model) using hand-crafted features may be suitable if labels are very scarce.

Regardless, sufficient labeled data will still be needed for validation.

4) Subjective Labels

Medical images require many years of training to properly interpret. While a novice may annotate for some tasks like nuclei segmentation, expert pathologists are needed for more complex labeling like classifying cell types, detecting mitoses, or grading tumors.

Yet labeling can be subjective for even the best trained experts. Medical studies frequently assess the inter-rater agreement. If the agreement is low, this will limit the ability of an ML model to accurately predict a label.

What can be done?

First, you can standardize labeling procedures as much as possible. This is best done by writing up an annotation guide with examples to clearly define each class.

Second, you can have more than one annotator label each image in order to assess the inter-rater agreement on your data. It may vary by class or be lower when a pair of classes have a similar appearance.

You could use the multiple sets of annotations to find a consensus as the ground truth for model training. Or you could use it to learn more about the agreement between annotators.
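Both uses of multiple annotations can be sketched briefly. The example below computes Cohen's kappa, a standard chance-corrected agreement measure for two raters, and a simple majority-vote consensus. The grades are made up for illustration, and `majority_vote` is a hypothetical helper that returns `None` when no label wins a strict majority.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def majority_vote(*annotations):
    """Consensus label per image; None when there is no strict majority."""
    consensus = []
    for votes in zip(*annotations):
        label, count = Counter(votes).most_common(1)[0]
        consensus.append(label if count > len(votes) / 2 else None)
    return consensus

# Hypothetical tumor grades from three pathologists on the same eight images.
rater_1 = ["G1", "G2", "G2", "G3", "G1", "G2", "G3", "G3"]
rater_2 = ["G1", "G2", "G3", "G3", "G1", "G1", "G3", "G2"]
rater_3 = ["G1", "G2", "G2", "G3", "G2", "G2", "G3", "G3"]

print(f"kappa(rater_1, rater_2) = {cohens_kappa(rater_1, rater_2):.2f}")
print(majority_vote(rater_1, rater_2, rater_3))
```

Images where the consensus is `None`, or where raters often split, are good candidates for a review session or for refining the annotation guide.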

Once you know the inter-rater agreement and have standardized annotation procedures as much as possible, you have an upper bound on the performance you might expect from an ML model. If experts only agree on the label 85% of the time, you can’t expect a classifier to do better than this.

A baseline ML model might still have lots of room for improvement, but if, after much iteration, you’re approaching 85% accuracy, you are approaching the limits for that task.

5) Weak Labels

The next challenge to consider is which types of labels are available. The most common types of annotations for pathology are at the level of the patient, image patch, bounding box, or pixel – going from the weakest to the strongest form of annotation.

Patient-level labels could be the presence or absence of tumor, a grade assigned by a pathologist, or a molecular biomarker assessed by some form of molecular analysis such as immunohistochemistry or RNA/DNA sequencing.

A naive way to model this is to assign the patient label to every image patch and train a model. The challenge is that you don’t know which part of the image is associated with the class label, and the naive labeling method will introduce noisy labels.

The most common solution is multiple instance learning. Each image patch is considered an instance with a latent label. We know the patient- or slide-level label, but the model must learn the patch labels. Attention-based and transformer models are frequently used to model this weak label situation.
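Here is a minimal sketch of the attention pooling step at the heart of many attention-based multiple instance learning models. It shows only the forward pass: the patch features and the attention parameters `v` and `w` are random stand-ins for what a trained encoder and a trained attention layer would provide.

```python
import numpy as np

def attention_mil_pool(patch_features, v, w):
    """Attention pooling over one bag (slide) of patch features.

    patch_features: (num_patches, D) embeddings from any patch encoder.
    v: (D, H) and w: (H,) are the attention parameters, learned in practice.
    Returns the slide-level embedding and the per-patch attention weights.
    """
    scores = np.tanh(patch_features @ v) @ w           # one scalar score per patch
    scores = scores - scores.max()                     # numerical stability
    attention = np.exp(scores) / np.exp(scores).sum()  # softmax over patches
    slide_embedding = attention @ patch_features       # weighted average of patches
    return slide_embedding, attention

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 32))   # 500 patches from one hypothetical slide
v = rng.normal(size=(32, 16))           # random stand-ins for learned weights
w = rng.normal(size=16)

embedding, attention = attention_mil_pool(features, v, w)
print(embedding.shape, attention.sum())
```

A classifier on top of the slide embedding is trained with only the slide-level label, and the learned attention weights hint at which patches drove the prediction.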

Weak labels can be even more challenging with heterogeneous tumors. Multiple classes might be present in a single tumor and the labeling method may not reflect this.

There is a lot to consider when dealing with weak labels. This is an ongoing area of research.

6) Batch Effects

One of the most subtle challenges with pathology images is batch effects. Batch effects are technical sources of variation, such as slide preparation techniques, annotation procedures, or patient characteristics that may be confounded with the variables of interest.

Research has shown that features from deep learning models trained on whole slide images from The Cancer Genome Atlas (TCGA) can predict which medical center a sample came from – even when stain normalization was applied. From the hidden information encoding the medical center, patient ethnicity could be inferred, biasing the model.

What can be done about batch effects? First, be aware of the potential for batch effects and do your best to detect them. Second, reduce the chance that they will occur.

For whole slide images, the medical center is the most common and problematic source of batch effects. To ensure that a model cannot benefit from them, you can split your data into training, validation, and test sets with each medical center isolated to a single set. Images from a particular medical center should never be distributed across the sets. This way you can assess whether your model generalizes to other medical centers.
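A center-wise split like this is simple to implement. The sketch below is a plain-Python illustration with made-up slide and center IDs; `split_by_center` is a hypothetical helper, and the fractions apply to the number of centers rather than the number of slides. scikit-learn's `GroupKFold` and `GroupShuffleSplit` offer the same kind of grouped splitting.

```python
import random

def split_by_center(samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole medical centers to train/val/test so no center spans two sets.

    samples: list of (sample_id, center_id) pairs.
    fractions: approximate share of centers (not samples) per split.
    """
    centers = sorted({center for _, center in samples})
    random.Random(seed).shuffle(centers)

    n_train = round(fractions[0] * len(centers))
    n_val = round(fractions[1] * len(centers))
    assignment = {}
    for i, center in enumerate(centers):
        if i < n_train:
            assignment[center] = "train"
        elif i < n_train + n_val:
            assignment[center] = "val"
        else:
            assignment[center] = "test"

    splits = {"train": [], "val": [], "test": []}
    for sample_id, center in samples:
        splits[assignment[center]].append(sample_id)
    return splits, assignment

# Hypothetical slides from ten centers, five slides each.
samples = [(f"slide_{c}_{i}", f"center_{c}") for c in range(10) for i in range(5)]
splits, assignment = split_by_center(samples)
print({name: len(ids) for name, ids in splits.items()})
```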

Other types of batch effects require similar planning and validation.

7) Limited Diversity

Batch effects are not the only potential source of bias and degraded model performance. Patient population is another key factor. Different cohorts of patients may have a different distribution of age, gender, race, or some other subgrouping.

Perhaps your training set contains a smaller proportion of samples for a particular patient subgroup than your inference data will. Your model will likely underperform on this subgroup.

The key here is also detecting bias. The best way to do that is by validating your model thoroughly.

You don’t just want to look at a single metric measuring your overall model performance. Instead, use stratified metrics.

Stratify your test set into different subgroups, perhaps by age, sex, race, or disease subgroup, and calculate your chosen metrics for each one. You likely don’t have enough data to stratify by multiple characteristics at once and still have a meaningful number of patients. But you can stratify by one characteristic at a time and calculate your metrics across each subgroup.
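Stratifying by one characteristic at a time takes very little code. In this sketch the labels, predictions, and age bands are invented for illustration, and accuracy stands in for whatever metric you have chosen. Reporting the sample count next to each score helps you see when a subgroup is too small to trust.

```python
from collections import defaultdict

def stratified_accuracy(y_true, y_pred, groups):
    """Accuracy and sample count for each subgroup value."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {g: (correct[g] / total[g], total[g]) for g in total}

# Hypothetical test-set predictions with one subgroup attribute per patient.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
age_band = ["<50", "<50", "<50", "<50", "50-70", "50-70", "50-70", ">70", ">70", ">70"]

for group, (accuracy, count) in stratified_accuracy(y_true, y_pred, age_band).items():
    print(f"{group:>6}: accuracy={accuracy:.2f} (n={count})")
```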

If you notice that performance is degraded for some subgroups, you have two options: limit the scope of your product to not cover certain types of patients or gather a more diverse training set.

8) Tissue and Imaging Artifacts

Whole slide images generally come with some processing artifacts: tissue folds, uneven sectioning, out-of-focus regions, bubbles, etc. Some artifacts can be caught during scanning so that the slide can be reprocessed, but others cannot. All artifact types have been shown to degrade model performance, depending on the severity.

Artifacts are more prevalent in some datasets than others. Studies have shown that this may be one reason for degraded performance on TCGA. TCGA was set up as a genomics project, so the challenges with whole slide images may not be too much of a surprise.

Some form of quality control is often necessary when applying deep learning to pathology images. The same quality checks should be performed on both training and inference data.
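As one small example of a quality check, out-of-focus tiles can be flagged with the variance of the Laplacian, a widely used focus measure. This sketch is only an illustration: the "blurry" tile is simulated with a box filter, the helper names are mine, and any cutoff for rejecting tiles would need to be calibrated on your own data.

```python
import numpy as np

def focus_score(gray):
    """Variance of a 4-neighbor Laplacian; low values suggest a blurry tile."""
    gray = gray.astype(np.float64)
    laplacian = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
                 - 4.0 * gray[1:-1, 1:-1])
    return float(laplacian.var())

def box_blur(gray):
    """3x3 mean filter, used here only to simulate an out-of-focus tile."""
    gray = gray.astype(np.float64)
    padded = np.pad(gray, 1, mode="edge")
    height, width = gray.shape
    return sum(padded[r:r + height, c:c + width] for r in range(3) for c in range(3)) / 9.0

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(128, 128)).astype(np.float64)
blurry = box_blur(sharp)
print(focus_score(sharp) > focus_score(blurry))
```

Whatever check you use, wrapping it in a single function makes it easy to apply the identical check to training and inference data.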

Developing your own tool for quality control is a large undertaking. Fortunately, there are an increasing number of commercial and open source options.

9) Scanner and Staining Variations

The above sections touch on a couple of sources of domain shift, but here I want to focus specifically on color variations – variations caused by different color responses of slide scanners, raw materials, manufacturing techniques, and protocols for staining.

Different setups can produce images with different stain intensities or other changes, creating a domain shift between the source data that the model was trained on and the target data on which a deployed solution would need to operate. When the domain shift is too large, a model trained on one type of data will fail on another, often in unpredictable ways.

So how do you handle the domain shift?

There are a variety of different solutions. Each tackles the problem from a different perspective.

Perhaps the easiest strategy to apply is modifying the input images. Color augmentation can be used to increase the diversity of images, while stain normalization tries to reduce the variations. Instead of transforming the input images, other approaches focus on the representation learned by the model – its feature space – and improving its generalizability. Finally, models can be modified to better fit a target dataset. Adapting at test time or finetuning transforms the model to be suitable for a target domain.
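Color augmentation is the simplest of these to try. The sketch below randomly rescales and shifts each RGB channel of a tile; the jitter ranges are arbitrary assumptions, and pathology-specific variants that perturb stain channels instead of raw RGB are often preferred in practice.

```python
import numpy as np

def jitter_color(tile, rng, max_gain=0.1, max_shift=10.0):
    """Randomly rescale and shift each RGB channel of a uint8 tile."""
    gain = rng.uniform(1 - max_gain, 1 + max_gain, size=3)
    shift = rng.uniform(-max_shift, max_shift, size=3)
    jittered = tile.astype(np.float32) * gain + shift
    return np.clip(jittered, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
tile = np.full((256, 256, 3), (200, 120, 170), dtype=np.uint8)  # flat pink stand-in tile
augmented = [jitter_color(tile, rng) for _ in range(4)]
print([a[0, 0].tolist() for a in augmented])
```

Applied on the fly during training, each epoch sees a slightly different color version of every tile.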

Different techniques are appropriate for different use cases. Often these approaches are combined to improve model generalizability.

But keep in mind that generalizability to staining and scanner variations may not be needed for every application. For example, if tissue is processed in a central lab with a single scanner type and a consistent staining procedure, then fewer sources of variation must be accommodated. The needs for your application should be defined and the limitations of your model validated and understood.

10) Multimodal

Pathology image datasets also frequently contain multispectral, multiplex, or multimodal information.

This could be from imaging at different wavelengths or with different protein markers. Or just additional types of data like clinical information or any form of omics data.

So this is great – more data! But it’s all complex data and frequently high dimensional.

If you have multiple types of images, are they spatially aligned? If they’re not registered and they need to be, this adds some complexity to your pipeline.

If they’re multispectral and have more than the three standard RGB channels, transfer learning is more complex. Alternatively, dimensionality reduction could lose important information.
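One common workaround for the channel mismatch is to adapt the first convolutional layer of an RGB-pretrained model so that it accepts more input channels. The sketch below shows one such heuristic on a random weight array: every new channel starts from the mean RGB filter, rescaled to keep the overall response magnitude similar. The function name and the scaling choice are assumptions of mine, and the adapted layer would still need finetuning.

```python
import numpy as np

def expand_first_conv(rgb_weights, num_channels):
    """Adapt first-layer conv weights from 3 input channels to num_channels.

    rgb_weights: (out_channels, 3, k, k) filters from an RGB-pretrained model.
    Each new input channel starts from the mean RGB filter, rescaled so the
    summed response to a constant input stays the same as before.
    """
    mean_filter = rgb_weights.mean(axis=1, keepdims=True)      # (out, 1, k, k)
    expanded = np.repeat(mean_filter, num_channels, axis=1)    # (out, C, k, k)
    return expanded * (3.0 / num_channels)

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(64, 3, 7, 7))   # random stand-in for real pretrained filters
multispectral = expand_first_conv(pretrained, num_channels=7)
print(multispectral.shape)
```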

And if the modalities are very different, perhaps images and text, how do you integrate the information? Data fusion approaches can get complex and may not take full advantage of the insights from all modalities.


When I started writing this article, it was intended to be a brief overview of some of the common pitfalls. But I realize now that I’ve barely scratched the surface. And there are other items that could take this list far past ten.

None of these pitfalls is insurmountable. But you need to be aware of them if you’re going to be successful in working with pathology images.

Each has multiple solutions, but your best bet is to first understand your data and problem space. Only then can you select the most suitable data- and model-centric solutions.
