
What is the first thing you do when starting a new machine learning project?

I’ve posed this question to a variety of ML leaders in startups and have received a few different answers. In no particular order:

  1. Try out one of our existing models to see if it works for the new task.
  2. Start exploring and understanding the data.
  3. Dig into the research literature to see what’s been done before.

Notice that none of these first steps is to code and train a new model. And none is to design a data preprocessing pipeline.

Each of the three approaches has its merits. If the new project is quite similar to something that has previously been modeled (both the data and the task), trying out modeling approaches that have already been implemented can be a very quick way to establish a baseline for the task. In doing so, you may also discover new challenges that must be accommodated in data preprocessing or modeling.

This might lead you into #2: exploring and understanding the data. Or you might have started here. Recognizing the unique needs of a new dataset is essential. Perhaps preprocessing or annotation needs to be handled differently. Maybe there are artifacts in the data that need to be cleaned up or the labels aren’t always correct. Understanding the challenges that preprocessing and modeling will need to contend with is essential.

But the step that some teams miss and is the most critical in setting a project up for success is a literature search. Has someone else modeled a similar task on similar data? If the type of data you’re working with is common, then you might be able to apply a very strict definition of “similar.” But if you’re working with a new imaging modality, for example, or tackling a new task, you might need to relax the definition of “similar” to find relevant research.

All three of these first steps are important in the process that I use for planning a new project: a Machine Learning Roadmap.

When I work with clients on a new project, the Roadmap is the first step. The Roadmap clarifies the scope of work for the rest of the project. It decreases the uncertainty on what will need to be implemented. It also reduces the likelihood of going in circles or wasting time on unsuccessful approaches. It saves time and money by identifying existing toolkits before implementing something from scratch. And it increases the likelihood of the project’s success.

What’s involved in an ML Roadmap? Let me walk you through the core components.

1) Define the Problem

Start by clearly defining the problem you want to solve using machine learning. And while you’re at it, take a step back and consider whether ML is even the right tool for your problem. This sets the foundation for the entire project and helps ensure that the project will deliver the desired results.

Defining the problem involves identifying the business problem, the data you need to collect, and the target variable. A clear problem definition and well-defined goals will help you avoid unnecessary experimentation and focus on the most important aspects of the problem.

Establishing success criteria is key. This can include evaluation metrics, but it is more about defining what the model must achieve in its intended use case.

Some things to consider:

  • Is your solution relevant? Will it integrate into current workflows in a way that will solve the current bottleneck or pain points?
  • How accurate does it need to be to improve upon the current process?
  • What scenarios will the model need to be able to generalize to? This could include things like different imaging devices, patient populations, or lighting conditions.
  • How explainable does the model need to be? Understanding how a model works makes it much easier to identify areas for improvement. But it can also be important in building trust or getting regulatory approval.
  • Will there be computational limitations once the model is deployed? Understanding any processing or memory constraints up front can narrow down the possible approaches.

By taking the time to define the problem upfront, you set the stage for a successful machine learning project that delivers the desired results.

2) Research Related Work

Researching related work is a critical step in any ML project. It helps you identify existing solutions to similar problems and understand the state of the art in the field.

You can start by conducting a literature review. This involves reading research papers, conference proceedings, and other relevant literature in the field.

It’s essential to keep track of the sources you have read and the key findings from each source. This can help you organize your thoughts and identify patterns and gaps in the existing solutions. For each relevant study, ask:

  • What type of data were they working with?
  • How many patients, images, etc.?
  • How did they annotate and structure their training data?
  • What model architecture did they use?
  • How did they train their model?
  • What challenges did they encounter?
  • Were there any problems with the quality or quantity of images or labels?
  • How did they collect independent data for validation?

These are all important aspects to understand before starting to build your own solution.

Researching related work can also help identify existing codebases, datasets, or pretrained models that can kickstart your project, saving you time and resources.

3) Understand the Data

Understanding the data is a crucial step in starting any ML project. This is because the quality and relevance of the data significantly impact the performance of the ML model.

For some projects, data may already be collected. For others, the data collection process must first be defined and executed. Your literature review may help guide what type of data you should collect and how much data you might need for your project.

Once data is collected, it will likely need to be annotated – also a task that can be informed by your literature review.

  • What type of annotations are needed? Pixel-, patch-, and image-level annotations are the most common.
  • What tools have been used to assist with annotation? Can annotations come from some other modality? Perhaps from molecular analysis of a biological sample or an existing set of annotations like Open Street Map for satellite imagery.
  • How subjective are your annotations? Researching or running your own experiment to assess interobserver agreement can reveal the extent of this challenge.
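
One common way to quantify annotation subjectivity is an interobserver agreement statistic such as Cohen’s kappa, which corrects raw agreement for chance. A minimal sketch in plain Python; the two annotators’ label lists are hypothetical:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of samples where the annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels from two annotators on the same ten images.
ann_1 = ["tumor", "normal", "tumor", "tumor", "normal",
         "normal", "tumor", "normal", "tumor", "normal"]
ann_2 = ["tumor", "normal", "tumor", "normal", "normal",
         "normal", "tumor", "normal", "tumor", "tumor"]
print(round(cohens_kappa(ann_1, ann_2), 2))  # 0.6
```

A kappa of 1.0 indicates perfect agreement and 0.0 chance-level agreement; values well below 1.0 signal label noise that any model trained on these annotations will inherit.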

You also need to understand the quality of your data. This includes checking for missing values, outliers, and inconsistencies in the data. These could include tissue preparation artifacts, imaging defects like noise or blurriness, or out-of-domain scenarios. By identifying data quality issues, you can preprocess and clean it appropriately and plan for any challenges that you cannot eliminate upfront.

Data preprocessing may include normalization, scaling, or other transformations. For large images, it typically includes tiling into small patches. The data and annotations must be stored in a format that is efficient for model training.
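
For illustration, tiling a large image into patches can be sketched in a few lines of plain Python. A real pipeline would operate on NumPy arrays and decide how to pad or shift patches at the image edges; the `tile_image` function and its defaults here are hypothetical:

```python
def tile_image(image, patch_size, stride=None):
    """Split a 2D image (a list of rows) into square patches.

    Patches that would extend past the image edge are skipped; a real
    pipeline might instead pad the image or shift the final patch.
    """
    stride = stride or patch_size  # non-overlapping patches by default
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(((top, left), patch))  # keep the patch origin
    return patches

# A hypothetical 4x6 "image" tiled into 2x2 patches: 2 rows x 3 cols = 6 patches.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
patches = tile_image(image, patch_size=2)
print(len(patches))  # 6
```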

Understanding the data also helps you identify any biases that can affect the model’s performance and reliability. Biases may be due to a lack of training data for a particular subgroup or to a spurious correlation. They may also arise as batch effects from technical variations, such as processing differences between labs or geographic regions, or even from samples labeled by different annotators.

For most applications, domain experts should be consulted when learning about the data:

  • How was the data collected?
  • What does it represent?
  • What features are looked at in studying the data?
  • What variations are present or might be expected in real world use?
  • What artifacts or quality issues might be present that could confuse a model?

Some of these aspects can be quite nuanced and not obvious to someone untrained in a particular field.

This critical step of understanding the data helps to assess the quality and relevance, identify and address data bias, and determine the appropriate preprocessing techniques.

4) Plan for Validation

Forming a plan for validation early in a project is important to reveal any unexpected challenges. The final model will be expected to perform in some real-world scenario, and testing its ability to do so is essential.

The first validation setup to consider is splitting the training data into training, validation, and test sets. The training set is usually the largest portion of the data and is used to train the model. The validation set is used to tune the model’s hyperparameters, such as the learning rate or regularization strength. The test set is used to evaluate the model’s performance, providing an unbiased estimate of the model’s generalization ability on unseen data. The test set should be kept completely separate from the training and validation sets during the model development process.

The training, validation, and test sets are typically randomly sampled from the available data while maintaining the desired distribution of classes or target variables to avoid any unintentional bias. When the data consists of different groups, such as multiple images from each patient, samples collected from different medical centers, or images from different geographic regions, a more careful stratification by group is necessary to evaluate model generalizability. All examples from the same group should fall into only one of the training, validation, or test sets and never be distributed across them.
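
A grouped split can be sketched in plain Python as below. The patient data and the `group_of` helper are hypothetical; in practice, libraries such as scikit-learn provide `GroupShuffleSplit` and `StratifiedGroupKFold` for exactly this purpose:

```python
import random

def group_split(samples, group_of, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split samples into train/val/test sets so that no group is ever
    split across sets; group_of maps a sample to its group (e.g. patient)."""
    groups = sorted({group_of(s) for s in samples})
    random.Random(seed).shuffle(groups)  # assign whole groups at random
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    train_g = set(groups[:n_train])
    val_g = set(groups[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for s in samples:
        g = group_of(s)
        name = "train" if g in train_g else "val" if g in val_g else "test"
        splits[name].append(s)
    return splits

# Hypothetical dataset: 30 images from 10 patients, 3 images per patient.
samples = [(f"patient_{p}", f"img_{i}") for p in range(10) for i in range(3)]
splits = group_split(samples, group_of=lambda s: s[0])
# Every patient's images land in exactly one of the three sets.
```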

Cross-validation techniques, such as k-fold or leave-n-out cross-validation, can also be employed to obtain more robust performance estimates by systematically rotating which portion of the data is held out for evaluation. This setup is particularly common for small datasets.

Assessing model performance involves calculating one or more metrics on the training, validation, and test sets. Suitable metrics depend on the application but could include accuracy, sensitivity, specificity, F1 score, AUC, Dice coefficient, or many others. Each of these metrics compares the model’s predictions with your ground truth.
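
For a binary classification task, most of these metrics fall out of the four confusion-matrix counts. A minimal sketch in plain Python, with hypothetical labels (1 = positive class):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity, and F1 score from
    binary ground-truth and predicted labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical ground truth and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)  # accuracy 0.7, sensitivity 0.75
```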

In some applications, calculating metrics on your test set may be sufficient validation. In others, this held-out portion of your data may not be sufficiently similar to a real-world scenario. Perhaps your model needs to work on patients from a different geographic region or medical center than it was trained on, and you don’t have annotated training data available. In that case, you still need to validate your model on an external cohort to simulate its real-world performance and ability to generalize.

5) Develop a Baseline Model

After a great deal of planning and research, you are finally ready to start modeling. But I don’t advise starting with the most complicated deep learning model out there. Start simple. Develop a simple baseline first. It will enable you to test out your data processing, annotation, and validation pipelines, revealing any unexpected challenges.

There are many algorithms suitable for a particular project, and selecting the best one can be challenging. The simplest baseline might be a linear classifier or regressor built on simple features. Or it could use transfer learning without fine-tuning to minimize the time spent on learning features. Don’t bother tuning the hyperparameters extensively at this point; the defaults may even be sufficient for this first step.
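
As an illustration of how low the bar can start, here is a majority-class baseline sketched in plain Python. The class and the labels are hypothetical, and this is a floor to beat, not a recommendation for the final model:

```python
from collections import Counter

class MajorityClassBaseline:
    """The simplest possible baseline: always predict the most common
    training label. Any real model must beat this to add value."""

    def fit(self, y_train):
        self.prediction_ = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, n):
        return [self.prediction_] * n

# Hypothetical imbalanced training labels: 80% normal, 20% tumor.
y_train = ["normal"] * 80 + ["tumor"] * 20
baseline = MajorityClassBaseline().fit(y_train)

y_test = ["normal"] * 40 + ["tumor"] * 10
preds = baseline.predict(len(y_test))
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(accuracy)  # 0.8 -- a reminder that accuracy alone can mislead on imbalanced data
```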

Developing a baseline model helps you establish a performance benchmark that can be used to evaluate the effectiveness of future models. It helps set realistic performance expectations for your project and enables you to determine how much improvement is required to achieve a desirable level of performance.

This baseline model should not be considered the final model. Rather, it should be used as a starting point for developing more complex models that can achieve better performance.

6) Iterate and Improve

Iterating is essential to improving the model’s performance until it achieves the desired level of accuracy.

The first step is to analyze the model’s performance. This involves examining a few different aspects:

  • Reviewing training and validation metrics to look for signs of overfitting or problems with model convergence.
  • Stratifying validation metrics into different subgroups to identify areas for improvement or possible biases.
  • Categorizing failure modes to find areas for improvement.
  • Reviewing results with domain experts for feedback on what deficiencies are important to them.
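
The subgroup stratification mentioned above can be sketched simply; the center labels and toy predictions here are hypothetical:

```python
from collections import defaultdict

def accuracy_by_subgroup(y_true, y_pred, subgroups):
    """Accuracy within each subgroup (e.g. medical center, scanner, region)
    to surface performance gaps that the overall metric hides."""
    buckets = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, subgroups):
        buckets[g].append(t == p)
    return {g: sum(hits) / len(hits) for g, hits in buckets.items()}

# Hypothetical: overall accuracy looks passable, but one center lags badly.
y_true  = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 1, 1, 0, 1]
centers = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(accuracy_by_subgroup(y_true, y_pred, centers))  # {'A': 1.0, 'B': 0.25}
```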

Once you have analyzed the model’s performance, you need to hypothesize why it performed poorly and how you might resolve the issues. Solutions may be data-centric, such as gathering more data or changing cleaning procedures, or model-centric, such as changes to the model architecture or the hyperparameters. Review the notes from your literature search for ideas.

The next step is to test your hypotheses by implementing the changes and evaluating the model’s performance on the validation data. Prioritize your work by fixing problems that are the most detrimental to your model or that are easiest to fix first.

Iterating and improving a machine learning model is an ongoing process. You need to continue testing and refining the model until you achieve the desired level of accuracy. Keep these iterations tight so that you can course correct as soon as possible – especially if the fix involves time-consuming changes to data collection or annotation.

7) Deploy, Monitor, and Maintain

Once you have a model that meets your desired level of performance, you can deploy it in a production environment. This involves integrating the model into your application or system and making sure it performs as expected.

The first step is to identify the requirements for deploying the model. This could include factors such as performance, scalability, security, and user interface. You also need to choose a deployment platform; typical options are cloud-based services or on-premises infrastructure.

The next step is to package the model into a format that can be deployed and test the deployment to ensure that it is working correctly. This could involve testing the model’s performance, scalability, and security. After the model is deployed, you need to monitor its performance and make any necessary adjustments.

Deploying a machine learning model is an ongoing process, and you need to continuously improve the model to ensure that it remains effective over time.

Finally, it is important to document any changes that are made to the model or its training process. This ensures that the model remains transparent and reproducible over time.


Machine learning projects are complex and iterative. This roadmap process enables you to plan each aspect of your project. Although the details may change, the overall components will stay the same. From defining the problem to maintaining your model, each step requires careful planning. Wherever possible, you should also consider how your planned approach could fail and how you might address those failures.

Is your team looking to maximize the impact of your images and algorithms?

I’ve worked with a variety of teams to unlock the value in their data, create more accurate models, and generate new powerful insights from pathology and remote sensing images.

Schedule a free Machine Learning Discovery Call and learn how to advance your machine learning algorithms.