Integration of Image and Genomic Features of Breast Cancer

Project Details

Background: Histologic image features and genomics provide two complementary views of tumors. By combining the two, a more complete picture of tumor prognosis and treatment models is possible. Each alone has been shown to inform certain decisions made by doctors, but an integrated model can provide a more predictive analysis. We used these tumor features to predict genomic subtype and grade.

Challenge: The breast cancer data set was small with only around 1,000 samples. This is typical for medical data and brings challenges for many machine learning tasks. In particular, the deep learning models that excel with large training sets tend to overfit. Further, most commonly-used multimodal analysis techniques are unsupervised - they make no use of class labels. Our goal was to create a method to find a shared embedding of the input modalities that is also discriminative.

Solution: We created a new set of deep learning techniques based off a popular data analysis technique called Canonical Correlation Analysis (CCA). CCA projects a pair of modalities into a shared space. While CCA is often used for classification tasks, we found that recent deep learning variants of CCA found many features that were of no use for classification. Our methods resolve this by ensuring that the shared space is also discriminative.

Results: We tested our methods for classifying breast tumors when both modalities are available for training but only one is available at test time, as well as two other applications. Our results showed a large increase in classification accuracy over previous methods and increased robustness to small training set sizes. We also demonstrated utility for semi-supervised learning when labels were only available for some samples.

Additional Applications: The robustness properties of our methods make them especially suitable for medical applications in which patient samples are limited. Parallel modalities of data are increasingly common in many other applications: other imaging modalities, images and text, audio and video, parallel texts of different languages, or even different feature representations for the same data. Beyond classification tasks, CCA-based methods can also be used for clustering or any other task-driven goal that can be used in a deep network.

Preprint