Foundational Deep Learning Models for Global Weather Forecasting: A Review

I recently wrote a review article on some of the latest models in data-driven weather forecasting. It’s not available on arXiv yet, but you can download it here.

This post outlines some of the main conclusions.

Key Takeaways

The review highlights three major tensions shaping the future of ML-based weather forecasting:

Model skill vs operational confidence – While ML models often match or exceed numerical weather prediction (NWP) in average RMSE for 2–10 day forecasts, they are not yet reliable enough for operational use. They systematically underestimate extremes (e.g., peak winds during Storm Ciarán were underestimated by over 40 km/h), even though their training data (ERA5) captures these values. This bias likely comes from excessive spatial smoothing that removes key mesoscale structures.
Physical plausibility vs resolution – ML models can reproduce large-scale, physically consistent patterns (e.g., jet streams, frontal boundaries, Rossby wave propagation), but their effective resolution is much coarser than their grid spacing. For example, Pangu-Weather’s effective resolution is ~4.5°, despite being trained at 0.25°. This coarse effective resolution limits their ability to represent the mesoscale and sub-synoptic features that are critical for extremes.
Interpretability vs benchmarking – “Physical realism” is often cited but rarely defined or measured quantitatively. Promising directions include analysing latent spaces, attention maps, and sensitivity fields to uncover whether models learn physically meaningful concepts. Current benchmarks like WeatherBench only partially address this; an expanded, standardised framework for evaluating physical realism is urgently needed.
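
The gap between grid spacing and effective resolution in the second point can be made concrete with a spectral comparison. The sketch below is only an illustration under stated assumptions, not the method used in the review: it treats effective resolution as the shortest wavelength at which a forecast still retains at least half of the reference spectral power, and it uses a synthetic field with a made-up Gaussian low-pass filter standing in for a model's smoothing. The function name, the 0.5 threshold, and the filter scale are all hypothetical choices.

```python
import numpy as np

def effective_resolution_deg(reference, forecast, grid_spacing_deg, min_power_ratio=0.5):
    """Estimate effective resolution (degrees) along one latitude circle.

    Assumption: the effective resolution is the wavelength of the highest
    zonal wavenumber k such that every wavenumber 1..k keeps at least
    `min_power_ratio` of the reference spectral power.
    """
    n = reference.size
    p_ref = np.abs(np.fft.rfft(reference)) ** 2
    p_fc = np.abs(np.fft.rfft(forecast)) ** 2
    # ratio[i] belongs to wavenumber i + 1 (wavenumber 0, the mean, is skipped).
    ratio = p_fc[1:] / np.maximum(p_ref[1:], np.finfo(float).tiny)
    below = np.nonzero(ratio < min_power_ratio)[0]
    # If wavenumber k_fail = below[0] + 1 is the first to fail,
    # the last resolved wavenumber is below[0].
    k_resolved = int(below[0]) if below.size else n // 2
    if k_resolved == 0:
        return float("inf")  # not even the longest wave is retained
    return n * grid_spacing_deg / k_resolved

# Synthetic example: a 0.25-degree grid around a full latitude circle.
n, spacing = 1440, 0.25
rng = np.random.default_rng(0)
reference = rng.standard_normal(n)

# Hypothetical smoothing: damp each wavenumber k by exp(-(k / 100)^2).
k = np.arange(n // 2 + 1)
response = np.exp(-((k / 100.0) ** 2))
forecast = np.fft.irfft(np.fft.rfft(reference) * response, n=n)

resolution = effective_resolution_deg(reference, forecast, spacing)
print(f"grid spacing: {spacing} deg, effective resolution: {resolution:.2f} deg")
```

With this synthetic filter the estimate comes out many times coarser than the 0.25° grid, which is the qualitative behaviour described above. A real evaluation would need spherical harmonics or latitude-dependent handling rather than a single 1D transform.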

A unified, quantitative definition of physical realism should become part of standard benchmarking, alongside model skill metrics. Expanding benchmarks to capture mesoscale representation and interpretability would build operational trust, improve extreme event forecasts, and accelerate progress in data-driven weather prediction.
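
As a rough illustration of why an average skill score should be reported alongside an extremes diagnostic, the sketch below compares RMSE with the error in the peak value for a smoothed forecast. Everything here is synthetic and assumed for illustration: the idealised 1D wind profile, the 101-point moving average standing in for model smoothing, and the helper names `rmse` and `peak_bias` are not taken from the review or from any benchmark.

```python
import numpy as np

def rmse(forecast, reference):
    """Root-mean-square error over the whole domain."""
    return float(np.sqrt(np.mean((forecast - reference) ** 2)))

def peak_bias(forecast, reference):
    """Difference between forecast and reference maxima (negative = underestimate)."""
    return float(forecast.max() - reference.max())

# Idealised wind speed (km/h) along a 1000 km transect at 1 km spacing:
# a uniform 40 km/h background plus a narrow, intense gust feature.
x = np.linspace(-500.0, 500.0, 1001)
background = 40.0
anomaly = 80.0 * np.exp(-((x / 20.0) ** 2))
reference = background + anomaly

# Hypothetical "model": the same field with the anomaly smoothed by a
# 101-point moving average. The anomaly is ~0 at the edges, so the implicit
# zero padding of mode="same" does not distort the result.
kernel = np.ones(101) / 101
forecast = background + np.convolve(anomaly, kernel, mode="same")

print(f"RMSE:      {rmse(forecast, reference):.1f} km/h")
print(f"Peak bias: {peak_bias(forecast, reference):.1f} km/h")
```

In this toy case the domain-averaged RMSE stays modest while the peak is strongly underestimated, because the error is concentrated in a small region that an average dilutes. That is the kind of behaviour a benchmark limited to mean skill scores would not flag.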
