I’ve seen too many AI projects stop short of impact — and I want to change that.
Take part in the State of Impactful AI Survey to share your experience and learn what others are discovering.
Get early access to the findings and enter a $250 Amazon gift card draw as a thank you.
Contribute now

Beyond Accuracy: The Hidden Drivers of AI That Lasts


Accuracy is the easiest metric to measure — and the least predictive of real-world success.


Every AI team learns this eventually. The model looks brilliant in validation, the metrics sparkle, the dashboards glow green. Yet months later, it’s quietly shelved — not because it failed technically, but because it failed to last.


In medical imaging, one system correctly flagged anomalies in every internal test. But when deployed at a second hospital, its performance collapsed. The culprit wasn’t the model; it was the scanner. Slight differences in calibration produced colors that the algorithm had never seen. Accuracy was perfect on paper — and meaningless in practice.


That’s the danger of chasing metrics that measure prediction but not persistence.

The AI that lasts — the kind that earns trust and scales across environments — isn’t the one that wins the benchmark. It’s the one that survives the messy realities of deployment.


The Hidden Drivers of Lasting AI


When I audit underperforming computer vision systems, three overlooked forces keep reappearing. They’re not flashy, but they determine whether an AI system delivers real-world value or becomes another stalled pilot.


1. Variability Testing


Most validation schemes are designed for speed and accuracy, not stress. Teams randomly split a single dataset into train and test sets, ensuring statistical rigor but not operational realism.


The real challenge is to test variability: new sites, new devices, new time periods, new conditions. That’s where models that look perfect on paper start to falter — and where truly robust ones prove their worth.


Cross-site validation in pathology, cross-sensor validation in agriculture, cross-season validation in earth observation — these are the experiments that reveal if your model generalizes or just memorizes. They’re slower, but they save months of rework later.
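
As a concrete illustration, here is a minimal sketch of what site-aware validation can look like using scikit-learn's GroupKFold; the features, labels, and site identifiers are placeholders, and the same pattern applies to sensors or seasons.

```python
# A minimal sketch of variability testing with GroupKFold: every fold holds
# out entire sites (or sensors, or seasons), so the reported score reflects
# generalization to groups the model never trained on.
# The features, labels, and site IDs below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))        # image-level features (placeholder)
y = rng.integers(0, 2, size=600)      # binary labels (placeholder)
sites = rng.integers(0, 6, size=600)  # which hospital/scanner/field each sample came from

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Each fold trains on some sites and evaluates only on sites held out entirely.
scores = cross_val_score(model, X, y, groups=sites, cv=GroupKFold(n_splits=6))
print("accuracy per held-out site group:", np.round(scores, 3))
print("spread across groups:", round(float(scores.std()), 3))  # large spread = weak cross-site robustness
```

A random split would report one flattering number; the per-group scores above expose exactly where generalization breaks down.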


2. Domain Alignment


Many AI teams treat “data” and “domain” as separate concerns: the ML engineers handle the data and modeling, and the experts review results at the end. But that’s how shortcuts sneak in.


Without continuous domain input, models learn correlations that don’t hold up in practice — staining differences mistaken for cancer, lighting conditions mistaken for stress, noise mistaken for signal.


Keeping domain experts in the loop — not just during annotation, but in model critique and error review — grounds your AI in the real phenomena it’s meant to understand. That’s how data turns from pixels into meaning.
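
One lightweight way to make that review routine is to hand experts the model's most confident mistakes after every evaluation run. The sketch below assumes a simple prediction-record format; the field names and output path are illustrative, not any specific tool's API.

```python
# A sketch of routine expert error review: export the model's most confident
# mistakes, with their metadata, so domain experts can inspect them each cycle.
# The record fields and output path are illustrative assumptions.
import csv

FIELDS = ["image_id", "site", "label", "prediction", "confidence"]

def export_errors_for_review(records, path="errors_for_review.csv", top_k=50):
    """records: an iterable of dicts with the keys listed in FIELDS."""
    mistakes = [r for r in records if r["prediction"] != r["label"]]
    mistakes.sort(key=lambda r: r["confidence"], reverse=True)  # most confident errors first
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(mistakes[:top_k])
    return len(mistakes)

# Example with two fake predictions: only the confident mistake gets exported.
example = [
    {"image_id": "a1", "site": "hospital_A", "label": "benign", "prediction": "tumor", "confidence": 0.97},
    {"image_id": "b2", "site": "hospital_B", "label": "tumor", "prediction": "tumor", "confidence": 0.88},
]
print(export_errors_for_review(example), "error(s) exported for review")
```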


3. Maintainability Over Time


AI doesn’t age gracefully. Environments evolve, data drifts, and instruments are recalibrated. A model that works flawlessly today might degrade silently next quarter.


That’s why the teams that succeed treat models as living systems:

  • They monitor performance in production.

  • They version data and retraining sets.

  • They establish thresholds for when to retrain — not reactively, but proactively.

These processes aren’t glamorous, but they create resilience. They make AI predictable enough to trust and stable enough to scale.
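
As a rough illustration of that kind of monitoring, the sketch below compares a logged production feature against its training-time reference distribution and flags when a retraining threshold is crossed; the Kolmogorov-Smirnov test and the 0.15 threshold are assumptions, not a prescription.

```python
# A sketch of proactive drift monitoring: compare a production feature's
# distribution against a stored training-time reference and trigger a retrain
# flag when the drift statistic crosses a preset threshold.
# The brightness feature, the KS test, and the threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

RETRAIN_THRESHOLD = 0.15  # assumed threshold on the KS statistic

def check_drift(reference: np.ndarray, production: np.ndarray) -> dict:
    """Return the drift statistic and whether retraining should be triggered."""
    statistic, p_value = ks_2samp(reference, production)
    return {
        "ks_statistic": float(statistic),
        "p_value": float(p_value),
        "retrain": bool(statistic > RETRAIN_THRESHOLD),
    }

rng = np.random.default_rng(42)
train_brightness = rng.normal(0.50, 0.10, size=5_000)  # logged when the model was trained
prod_brightness = rng.normal(0.58, 0.12, size=1_000)   # a recalibrated instrument shifts the distribution
print(check_drift(train_brightness, prod_brightness))   # -> 'retrain': True once drift is large enough
```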


From Accuracy to Accountability


Lasting AI isn’t just robust — it’s accountable. It can explain how it works, adapt when the world changes, and justify its place in a workflow.


That’s why I launched the State of Impactful AI Survey: to uncover how real organizations define and sustain success beyond the benchmark.


The survey explores questions like:

  • Which principles — robustness, transparency, sustainability — matter most when AI leaves the lab?

  • Where do teams get stuck: variability, domain alignment, or maintenance?

  • How are leaders balancing speed, cost, and long-term reliability?

It’s less about counting failures — and more about learning how teams build systems that last.


Why This Matters


Accuracy might help you publish a paper or impress a demo day crowd. But impact requires something deeper: durability, trust, and alignment with the real world.


When AI fails quietly — when it’s never adopted, never trusted, never scaled — the world loses out on progress that could have mattered.


If your organization has built AI that technically worked but didn’t deliver outcomes, your experience can help shape the roadmap for what comes next.


👉 Contribute to the State of Impactful AI Survey


Because accuracy might win the leaderboard — but robustness, trust, and actionability win the world.

- Heather

Vision AI that bridges research and reality — delivering where it matters


Research: Foundation Model Robustness


Towards Robust Foundation Models for Digital Pathology


Foundation models in digital pathology have a hidden vulnerability: they're learning to recognize which hospital generated an image rather than focusing purely on disease biology.

New research from Jonah Komen et al. reveals a fundamental challenge in building robust AI for medical imaging. Even when foundation models (FMs) are trained on massive datasets, downstream models built on top of them can latch onto "Clever Hans" features—superficial patterns like staining variations or scanning artifacts that correlate with medical centers rather than actual pathology.

Why this matters: For rare diseases where training data is scarce and must be pooled from multiple hospitals with different imaging protocols, these technical artifacts can severely limit model generalization. A model that performs well at one institution may fail at another, not because the biology is different, but because it learned the shortcuts.

Key innovations and findings:
- The team demonstrated that biological and technical information become entangled in FM representation spaces—they're not encoded in separate directions, making it risky to simply "remove" center-specific signatures
- Three robustification approaches were tested: ComBat (representation adjustment), DANN (training-time correction), and stain normalization (image preprocessing)
- Stain normalization showed consistent improvements, while ComBat proved powerful but risked removing biological signals when data distributions varied across centers
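
To make the stain-normalization idea concrete, here is a minimal Reinhard-style sketch that matches an image's per-channel colour statistics in LAB space to a reference slide. It illustrates the general preprocessing step; the paper's actual pipeline may use a different normalization method.

```python
# A Reinhard-style colour normalization sketch: shift each image's per-channel
# mean and std in LAB space to match a chosen reference slide, so downstream
# features depend less on scanner- or stain-specific colour signatures.
# Illustrative only; not necessarily the method evaluated in the paper.
import numpy as np
from skimage import color

def match_reference_stain(img_rgb: np.ndarray, ref_rgb: np.ndarray) -> np.ndarray:
    """Both inputs are float RGB arrays in [0, 1] with shape (H, W, 3)."""
    img_lab, ref_lab = color.rgb2lab(img_rgb), color.rgb2lab(ref_rgb)
    out = np.empty_like(img_lab)
    for c in range(3):  # L, a, b channels
        mu_i, sd_i = img_lab[..., c].mean(), img_lab[..., c].std() + 1e-8
        mu_r, sd_r = ref_lab[..., c].mean(), ref_lab[..., c].std()
        out[..., c] = (img_lab[..., c] - mu_i) / sd_i * sd_r + mu_r
    return np.clip(color.lab2rgb(out), 0.0, 1.0)
```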

The takeaway: There's a tension between preserving all the information in foundation models and making them robust to technical confounders. The authors argue that in medical settings—especially for rare diseases—robustified FMs are essential for clinical deployment, even if perfect disentanglement remains an open challenge.

This work pushes us to think more carefully about what our models are actually learning and whether we're building systems that will generalize when it matters most: across diverse patient populations and clinical settings.

Research: Generalizability


On the Generalizability of Foundation Models for Crop Type Mapping


The US and EU maintain detailed crop type maps with 80%+ accuracy, updated regularly. But most of the world—especially data-scarce regions in Africa, South America, and Asia—lacks this critical agricultural intelligence. Can foundation models trained on data-rich regions generalize to regions where labeled data is scarce?

Food security depends on accurate crop type mapping for yield prediction, conservation, and disaster assessment. Yet the geographic disparity in data availability creates a potential geospatial bias—models trained on developed nations may fail in developing ones, precisely where they're needed most.

Yi-Chia Chang et al. addressed this challenge by creating the first harmonized global crop type mapping dataset, combining five regional datasets across five continents, all focused on the four major cereal grains: maize, soybean, rice, and wheat.

Key findings:
- SSL4EO-S12 (pre-trained on all 13 Sentinel-2 spectral bands) outperformed both SatlasPretrain and ImageNet weights by 3-27% across all regions
- Only 100 labeled images are sufficient for achieving high overall accuracy—but 900 images are needed to overcome severe class imbalance and improve average accuracy
- Out-of-distribution data helps significantly when in-domain samples are scarce (zero-shot learning), but because of distribution shift it can actually hurt performance once sufficient local data becomes available

The research reveals both promise and pitfalls: while foundation models can bridge data gaps between regions, careful attention to data composition and distribution shift is essential. The takeaway? Prioritize in-domain data when available, but leverage out-of-domain pretraining strategically for data-scarce regions.
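
To give a rough sense of what "fine-tune on a small in-domain labelled set" looks like in practice, here is a generic PyTorch sketch with a frozen placeholder encoder standing in for an SSL4EO-S12-style backbone; it is not the paper's TorchGeo pipeline, and the data and hyperparameters are illustrative.

```python
# A generic few-label fine-tuning sketch: freeze a "pretrained" encoder and
# train only a fresh crop-type head on a small in-domain labelled set.
# The encoder, data, and hyperparameters are placeholders, not the paper's setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_CLASSES = 4  # maize, soybean, rice, wheat

# Placeholder encoder standing in for a backbone pretrained on all 13 Sentinel-2 bands.
encoder = nn.Sequential(
    nn.Conv2d(13, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, NUM_CLASSES)

for p in encoder.parameters():  # keep the pretrained features fixed
    p.requires_grad = False

# Tiny synthetic labelled set standing in for ~100 real in-domain patches.
images = torch.randn(100, 13, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (100,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in loader:
        logits = head(encoder(x))
        loss = criterion(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```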

All datasets and code available via TorchGeo, Hugging Face, and GitHub.