Why models that excel on benchmarks fail a block down the road

Robust AI: Built to Withstand the Wild


Why models that thrive in the lab collapse in the real world—and how to build systems that don't


Robustness is what separates AI that works from AI that lasts. While accuracy measures your ability to hit a target and reliability represents hitting it consistently, robustness is the ability to function when the wind is howling, the inputs are corrupted, and the environment shifts in ways your training set never anticipated.


A primary failure mode for AI systems: models that perform spectacularly on curated laboratory datasets collapse when exposed to real-world complexity. Mahyar Salek, CTO of DeepCell, notes that models with impressive results in academic publications frequently fail when moved even a block down the road, because they can't handle local artifacts and environmental differences. Joe Brew, CEO of Hyfe, observed the same pattern: early cough-detection models were highly accurate in sanitized lab settings but fell apart in the diverse acoustic environments of real-world users—noisy cars, crowded offices, and everything in between.


Reliability vs. Robustness: A Critical Distinction

Reliability involves performing predictably under a diverse but known range of variables—different scanner manufacturers, different patient demographics. Robustness is the ability to resist failure when faced with conditions that fall entirely outside the historical training record.


For example, Marshall Moutenot, CEO of Upstream Tech, demonstrated robustness when their HydroForecast models outperformed traditional conceptual models during a year drier than any in their historical training data. Their model survived an event totally unrepresented in the data because it had learned the fundamental rules of the system rather than merely memorizing past patterns.


Brittleness is a silent killer of ROI and a major safety risk. Vedavyas Panneershelvam of Phaidra warns that in mission-critical facilities like data centers, a lack of robustness can lead to thermal excursions that cause catastrophic failures.


The Four Vectors of Robustness


To achieve real-world impact, AI must be hardened against four specific stress vectors:


Input robustness addresses resilience to noise and corruption. Tim O'Connell, CEO of Emtelligent, notes that medical records often arrive via fax, resulting in crooked pages or white-out lines across text. Systems must be trained to recognize and ignore these noise patterns to extract the underlying medical truth.
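
One common way to build this kind of input robustness is to inject synthetic scan artifacts at training time. The sketch below is purely illustrative—the transformations and parameters are assumptions, not Emtelligent's actual pipeline.

```python
# Minimal sketch: synthesize fax-style artifacts (crooked pages, white-out bars)
# during training so the model learns to ignore them. Illustrative assumptions only.
import random
from PIL import Image, ImageDraw

def add_fax_artifacts(page: Image.Image) -> Image.Image:
    # Slightly skew the page, filling the exposed corners with white.
    page = page.rotate(random.uniform(-3, 3), expand=True, fillcolor="white")
    draw = ImageDraw.Draw(page)
    # Draw a few horizontal "white-out" bars across the text.
    for _ in range(random.randint(0, 3)):
        y = random.randint(0, page.height - 10)
        draw.rectangle([0, y, page.width, y + random.randint(2, 8)], fill="white")
    return page
```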


Distributional robustness ensures a system knows when it's out of its depth. Christian Leibig at Vara utilizes Bayesian deep learning to quantify model uncertainty in mammography. When a scan is too far from the training distribution, the system abstains from prediction and defers to a human expert, prioritizing a robust "I don't know" over a life-threatening guess.
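
A minimal sketch of this abstention pattern, assuming an ensemble (or repeated MC-dropout passes) whose predictive entropy serves as the out-of-distribution signal; the threshold and interfaces are illustrative, not Vara's actual system.

```python
# Minimal sketch: average softmax outputs from several models (or MC-dropout passes)
# and defer to a human reader when predictive entropy is too high.
# Threshold and model interface are assumptions, not Vara's implementation.
import torch

def predict_or_defer(models, x, entropy_threshold=0.5):
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models]).mean(dim=0)
    entropy = -(probs.clamp_min(1e-12).log() * probs).sum(dim=-1)  # predictive uncertainty
    defer_to_expert = entropy > entropy_threshold                  # the "I don't know" path
    return probs.argmax(dim=-1), defer_to_expert
```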


Adversarial robustness defends against manipulation. Leibig also warns that models are prone to making high-confidence predictions based on irrelevant patterns—the specific font used by a medical scanner or local prevalence statistics—rather than actual biology. If a model relies on these spurious correlations, it will fail catastrophically when deployed to a new clinic.


Edge-case robustness addresses the long tail that dominates real-world outcomes. Amanda Marrs of AMP Robotics explains that in recycling, objects are almost never pristine—they're dirty, smashed, torn, or covered in peanut butter. A robust AI must recognize items despite extreme physical deformations.


Engineering for the Wild


Building hardened systems requires integrating interdisciplinary expertise at every stage. Curai builds its AI team by mixing clinical experts with researchers and ML engineers on every project. David Marvin from Planet emphasizes that teams need "all three legs of the stool"—ecology, ML, and remote sensing—to build products that don't fail when moved from lab to field.


Theory-guided architectures prevent models from hallucinating under out-of-distribution conditions. Phaidra and DeepSea use physics-based unit tests—checking for monotonicity, ensuring that lower temperature settings result in higher power consumption—to verify predictions remain consistent with physical laws.
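
As an illustration, a physics-based unit test for a learned facility model might look like the sketch below; the model interface, feature names, and tolerance are assumptions rather than Phaidra's or DeepSea's actual code.

```python
# Minimal sketch of a monotonicity "unit test" for a learned plant model:
# as the temperature setpoint rises, predicted power consumption must not increase.
# Interfaces, feature names, and tolerance are illustrative assumptions.
import numpy as np

def setpoint_monotonicity_holds(predict_power, base_features, setpoints):
    """predict_power(features_dict) -> predicted facility power in kW."""
    preds = [predict_power({**base_features, "temperature_setpoint": sp}) for sp in setpoints]
    # Lower setpoints demand more cooling, so power should fall (or hold) as the setpoint rises.
    return bool(np.all(np.diff(np.asarray(preds)) <= 1e-6))

# Toy surrogate standing in for the learned predictor.
toy_model = lambda f: 500.0 - 4.0 * f["temperature_setpoint"]
assert setpoint_monotonicity_holds(toy_model, {"it_load_kw": 300}, setpoints=range(18, 28))
```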


Data quality trumps quantity. Leaders at Aignostics and Lunit find that models trained on smaller, highly diverse, expert-curated datasets often outperform massive models trained on noisy, undifferentiated data.


The Robustness-Performance Trade-off


Developers must often navigate the cost of resilience, accepting that optimizing for peak benchmark performance frequently produces systems too brittle for the wild. In high-stakes sectors, a robust "I don't know" is infinitely more valuable than a confident but life-threatening error.


Mathieu Bauchy of Concrete.ai operates in construction, where a single failure could collapse a bridge or high-rise. Their models are uncertainty-aware, utilizing an ensemble voting system that only prescribes a concrete formulation if it determines a 99.9% probability of meeting performance targets. As the AI extrapolates into novel chemical territories where data is sparse, it becomes increasingly conservative, favoring proven safety margins over high-risk optimal guesses.
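
A minimal sketch of such an acceptance rule, assuming an ensemble of strength predictors whose vote share stands in for the probability of meeting the target; the numbers and interfaces are illustrative, not Concrete.ai's system.

```python
# Minimal sketch: prescribe a concrete formulation only if nearly all ensemble
# members predict it meets the strength target. Values are illustrative.
import numpy as np

def approve_mix(ensemble_strength_preds_mpa, target_mpa, required_prob=0.999):
    preds = np.asarray(ensemble_strength_preds_mpa)
    prob_meets_target = float((preds >= target_mpa).mean())  # vote share across members
    return prob_meets_target >= required_prob

# In well-sampled regions the members agree and mixes clear the 99.9% bar;
# in sparse, novel chemistries they disagree, so the system stays conservative.
members = np.random.normal(loc=42.0, scale=1.5, size=2000)  # simulated ensemble predictions
print(approve_mix(members, target_mpa=35.0))                # True with high probability
```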


Harro Stokman of Kepler Vision intentionally trades instant accuracy for temporal robustness in fall detection systems. Instead of triggering an alert the moment a fall is detected in a single frame, the system averages evidence over 10-20 subsequent frames, ensuring alarms reflect real events rather than visual artifacts. This maintains roughly one false alarm per room every three months at the minor cost of a few seconds of latency.
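
A minimal sketch of this temporal-smoothing trade-off, with window size and threshold as illustrative assumptions rather than Kepler Vision's settings.

```python
# Minimal sketch: accumulate per-frame fall scores over a short window and fire
# an alarm only when the smoothed evidence stays high, so a single-frame artifact
# (shadow, occlusion) cannot trigger it. Parameters are illustrative assumptions.
from collections import deque

class TemporalFallAlarm:
    def __init__(self, window: int = 15, threshold: float = 0.8):
        self.scores = deque(maxlen=window)  # roughly 10-20 frames of evidence
        self.threshold = threshold

    def update(self, frame_score: float) -> bool:
        self.scores.append(frame_score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) > self.threshold
```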


The Impact Connection


For AI to deliver lasting impact, robustness is non-negotiable across three dimensions:


Safety-critical trust: Junaid Kalia of NeuroCare.AI notes that in stroke care, where 32,000 neurons die every second, there is an absolute ethical imperative to deploy AI safely. The medical community will only adopt AI if they maintain total confidence in its performance.


Operational scalability: Gershom Kutliroff of Taranis highlights that relying on hundreds of human annotators to manually tag images is prohibitively expensive and makes consistent quality impossible at scale. Robust models that handle messy field data without human oversight enable the throughput necessary for global impact.


Broader applicability: Kevin Lang of Agerpoint democratizes agricultural data by engineering models robust enough to extract research-grade plant metrics from consumer iPhone sensors. When a model can look past the noise of a consumer-grade device, it stops being a luxury for the elite and becomes a tool for global sustainability.


Impactful AI must maintain its integrity when the world gets messy. We must stop chasing the ceiling of peak theoretical accuracy and start measuring the floor—ensuring that even in chaotic scenarios, the system remains a trustworthy tool rather than a hazardous liability.


Building for the wild means accepting that data will be noisy, sensors will be cheap, and conditions will be unprecedented. A hardened system looks past these limitations to deliver a consistent, safe result.


Robustness completes the technical foundation for high-impact AI. Without Accuracy, Reliability, and Robustness working together, there's no base stable enough to support mission-critical applications. Next, we transition from asking whether the system can work to whether it is fair, transparent, and accountable—moving from machine capabilities to human responsibility.


- Heather

Vision AI that bridges research and reality

— delivering where it matters


Research: Foundation Models for Pathology


Benchmarking the Foundation Model Paradigm in Pathology


Are massive foundation models truly "plug-and-play" for every task, or do they still require domain-specific evaluation and adaptation?

As the field shifts from task-specific architectures to large-scale generalist models, systematic benchmarking is essential to understand where these models provide a true advantage. Histopathology presents unique challenges—such as gigapixel resolutions, staining variations, and the "needle-in-a-haystack" nature of mitotic figures—that require specialized investigation into how foundation models (FMs) generalize across diverse tasks like image search and dense pixel prediction.

Three recent studies provide critical insights into these domains:
● 𝗖𝗼𝗻𝘁𝗲𝗻𝘁-𝗕𝗮𝘀𝗲𝗱 𝗠𝗲𝗱𝗶𝗰𝗮𝗹 𝗜𝗺𝗮𝗴𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹: 𝗔𝗺𝗶𝗿𝗿𝗲𝘇𝗮 𝗠𝗮𝗵𝗯𝗼𝗱 𝗲𝘁 𝗮𝗹. evaluated FMs as unsupervised feature extractors for searching similar medical images. They found that for 2D datasets, FMs like 𝗨𝗡𝗜 outperform traditional CNNs by a large margin, though CNNs remain competitive for 3D volumetric data.
● 𝗠𝗶𝘁𝗼𝘁𝗶𝗰 𝗙𝗶𝗴𝘂𝗿𝗲 𝗖𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻: 𝗝𝗼𝗻𝗮𝘀 𝗔𝗺𝗺𝗲𝗹𝗶𝗻𝗴 𝗲𝘁 𝗮𝗹. investigated scaling laws for identifying small mitotic structures, which are critical independent prognostic markers. Their work demonstrated that 𝗟𝗼𝗥𝗔-𝗮𝗱𝗮𝗽𝘁𝗲𝗱 models trained on only 10% of the data can come close to the performance achieved with the full training set, effectively closing the gap on unseen tumor domains (a minimal LoRA sketch follows this list).
● 𝗛𝗶𝘀𝘁𝗼𝗽𝗮𝘁𝗵𝗼𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗜𝗺𝗮𝗴𝗲 𝗦𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻: 𝗜𝘁𝘀𝗮𝘀𝗼 𝗩𝗶𝘁𝗼𝗿𝗶𝗮 𝗲𝘁 𝗮𝗹. explored dense prediction tasks by pairing frozen encoders with shared lightweight decoders across H&E and IHC modalities. Surprisingly, they found that compact, modality-aware encoders often outperform larger architectures, suggesting that model size and classification accuracy are poor predictors of segmentation capability.
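
As referenced above, here is a minimal, self-contained sketch of the LoRA idea: the pretrained weight is frozen and only a low-rank update is trained, which is why so little labeled data is needed. This is an illustrative re-implementation in plain PyTorch, not the adaptation code from the paper.

```python
# Minimal LoRA sketch: freeze the pretrained linear layer W and train only a
# rank-r update B @ A, so only a small fraction of parameters is adapted.
# Illustrative only; not the paper's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap a projection layer standing in for one block of a frozen foundation model.
qkv = nn.Linear(768, 3 * 768)
adapted = LoRALinear(qkv, r=8)
print(adapted(torch.randn(4, 768)).shape)  # torch.Size([4, 2304]); only A and B get gradients
```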

These benchmarks remind us that "bigger" is not always better; rather, the diversity of the training data and the specific adaptation strategy—like LoRA or modality-aware pretraining—are the true drivers of clinical performance.

Evaluating pre-trained convolutional neural networks and foundation models as feature extractors for content-based medical image retrieval 

Benchmarking Foundation Models for Mitotic Figure Classification 

A Benchmark of Foundation Model Encoders for Histopathological Image Segmentation

Research: Multimodal


The AI revolution: how multimodal intelligence will reshape the oncology ecosystem


Cancer manifests across multiple biological scales—from molecular alterations to tissue organization to clinical phenotype. Models trained on a single data type consistently miss this complexity.

David Dellamonica et al. examine how multimodal AI (MMAI) is reshaping oncology by integrating imaging, histopathology, genomics, and clinical data into unified frameworks.

Precision oncology faces a combinatorial challenge: numerous molecularly defined subgroups, expanding targeted therapies, and hundreds of variables per patient. MMAI contextualizes molecular features within anatomical and clinical frameworks, enabling more mechanistically plausible inferences.
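
To make the integration concrete, a common pattern is late fusion: each modality gets its own encoder and the embeddings are combined for a shared prediction head. The sketch below is a generic illustration with assumed dimensions and modalities, not the architecture described by the authors.

```python
# Minimal late-fusion sketch: per-modality encoders, concatenated embeddings,
# one shared prediction head. Dimensions and modalities are illustrative.
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, modality_dims: dict, embed_dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, embed_dim), nn.ReLU())
            for name, dim in modality_dims.items()
        })
        self.head = nn.Linear(embed_dim * len(modality_dims), 1)  # e.g. a risk score

    def forward(self, inputs: dict) -> torch.Tensor:
        parts = [self.encoders[name](x) for name, x in inputs.items()]
        return self.head(torch.cat(parts, dim=-1))

dims = {"imaging": 512, "histology": 768, "genomics": 200, "clinical": 32}
model = LateFusionModel(dims)
batch = {name: torch.randn(4, dim) for name, dim in dims.items()}
print(model(batch).shape)  # torch.Size([4, 1])
```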

𝗞𝗲𝘆 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲𝘀:
⬥ 𝘚𝘤𝘳𝘦𝘦𝘯𝘪𝘯𝘨 𝘢𝘯𝘥 𝘳𝘪𝘴𝘬 𝘴𝘵𝘳𝘢𝘵𝘪𝘧𝘪𝘤𝘢𝘵𝘪𝘰𝘯: Combining imaging with clinical metadata for earlier intervention
⬥ 𝘋𝘪𝘢𝘨𝘯𝘰𝘴𝘵𝘪𝘤𝘴: Inferring genomic alterations from histology, reducing sequencing turnaround time and cost
⬥ 𝘛𝘳𝘦𝘢𝘵𝘮𝘦𝘯𝘵 𝘴𝘦𝘭𝘦𝘤𝘵𝘪𝘰𝘯: Identifying patient signatures that predict response to specific therapies
⬥ 𝘊𝘭𝘪𝘯𝘪𝘤𝘢𝘭 𝘵𝘳𝘪𝘢𝘭𝘴: Optimizing eligibility criteria and enabling synthetic control arms

The authors also acknowledge barriers: fragmented data silos, need for harmonized regulatory frameworks, and bias management.

Research: Emissions from AI


Misinformation by Omission: The Need for More Environmental Transparency in AI


You've probably seen the claim that a ChatGPT query uses "10x more energy than a Google search." That figure traces back to an offhand comment by someone with no connection to OpenAI, multiplied by a 16-year-old estimate of Google search costs. It's been repeated in 53% of media coverage on the topic.

Sasha Luccioni et al. examine how the lack of transparency around AI's environmental impacts has created fertile ground for misinformation—and how that misinformation shapes policy and public perception.

The authors analyzed Epoch AI's Notable AI Models dataset and found direct environmental disclosure 𝘱𝘦𝘢𝘬𝘦𝘥 in 2022 at just 10%—then declined sharply. By Q1 2025, most notable models fell back into "no disclosure." Result: 84% of current LLM usage flows through models with zero environmental transparency.

𝗞𝗲𝘆 𝗳𝗶𝗻𝗱𝗶𝗻𝗴𝘀:
⬥ The "training AI = 5 cars' lifetime emissions" figure came from a specific neural architecture search estimate, not representative of typical training—yet it's been generalized across the field
⬥ Only 5% of media articles acknowledged uncertainty in the energy figures they cited
⬥ Claims that "AI can reduce 10% of global emissions" trace to BCG reports with no detailed methodology
⬥ Pre-training energy varies enormously: 0.8 MWh (OLMo 20M) to 3,500 MWh (Llama 4 Scout)

The authors propose comprehensive measurement by developers, integration into corporate sustainability frameworks, standardized verification, and clear regulatory requirements.

We can't make informed decisions about AI's environmental trade-offs when the underlying data doesn't exist.