\addbibresource

fcinterp.bib

Modulated Adaptive Fourier Neural Operators for Temporal Interpolation of Weather Forecasts

Jussi Leinonen
NVIDIA Corporation
Santa Clara, CA 95051
jleinonen@nvidia.com
Boris Bonev
NVIDIA Corporation
Santa Clara, CA 95051
bbonev@nvidia.com
Thorsten Kurth
NVIDIA Corporation
Santa Clara, CA 95051
tkurth@nvidia.com
Yair Cohen
NVIDIA Corporation
Santa Clara, CA 95051
yacohen@nvidia.com
Abstract

Weather and climate data are often available at limited temporal resolution, either due to storage limitations, or in the case of weather forecast models based on deep learning, their inherently long time steps. The coarse temporal resolution makes it difficult to capture rapidly evolving weather events. To address this limitation, we introduce an interpolation model that reconstructs the atmospheric state between two points in time for which the state is known. The model makes use of a novel network layer that modifies the adaptive Fourier neural operator (AFNO), which has been previously used in weather prediction and other applications of machine learning to physics problems. The modulated AFNO (ModAFNO) layer takes an embedding, here computed from the interpolation target time, as an additional input and applies a learned shift–scale operation inside the AFNO layers to adapt them to the target time. Thus, one model can be used to produce all intermediate time steps. Trained to interpolate between two time steps 6666 h apart, the ModAFNO-based interpolation model produces 1111 h resolution intermediate time steps that are visually nearly indistinguishable from the actual corresponding 1111 h resolution data. The model reduces the RMSE loss of reconstructing the intermediate steps by approximately 50%percent5050\%50 % compared to linear interpolation. We also demonstrate its ability to reproduce the statistics of extreme weather events such as hurricanes and heat waves better than 6666 h resolution data. The ModAFNO layer is generic and is expected to be applicable to other problems, including weather forecasting with tunable lead time.

1 Introduction

The urgent need to address climate change is driving an increasing demand for climate and weather data and more accurate predictions. Beyond meteorological services, such data are used by numerous sectors such as insurance, investing, energy, transport, infrastructure management and emergency responders.

The demand for higher resolution, customized data products and probabilistic forecasts (typically achieved with ensemble forecasts, i.e. executing a simulation multiple times from different initial conditions) is rapidly increasing the atmospheric data storage requirements, with the largest data archives approaching exabyte size \citepECMWF2024FactsFigures. Such data volumes are only feasible to manage for the largest data centers, and even then at great cost.

Weather forecast models based on machine learning (ML) have recently achieved skill competitive with state-of-the-art numerical weather prediction (NWP) models while requiring orders of magnitude less time, energy and money to produce a forecast \citepPathak2022FourCastNet,Kurth2023FourCastNet,EbertUphoff2023Outlook,Lam2023GraphCast,Chen2023FengWu,Price2024Gencast,Zhong2024FuxiENS. ML weather models make it plausible to avoid the long-term storage of simulation results, instead executing forecasts on demand and thus drastically requiring data storage costs.

Most ML forecast models execute long time steps, for example 6666 h for GraphCast \citepLam2023GraphCast and FourCastNet \citepPathak2022FourCastNet. This can leave gaps in the data, especially when predicting rapidly evolving extreme weather events. PanguWeather \citepBi2023PanguWeather consists of multiple models with different time steps (1111 h, 3333 h, 6666 h and 24242424 h) to achieve a 1111 h resolution, but this requires the training of multiple models and may introduce inconsistencies in the final simulation: For example, a 23232323 h forecast is executed as 3×6h+3h+2×1h36h3h21h3\times 6\ \mathrm{h}+3\ \mathrm{h}+2\times 1\ \mathrm{h}3 × 6 roman_h + 3 roman_h + 2 × 1 roman_h, which is not guaranteed to be consistent with following prediction, which is obtained from a single time step of the 24242424 h model.

In this work, we assume that our input data has coarse temporal resolution, whether due to archiving constraints or as a result of using a long time step ML weather model, and treat the recovery of the intermediate time steps as an interpolation rather than forecasting problem. Since deep neural networks have proven to be suitable for building powerful weather forecasting models, it can be expected that a network that learns to model the physics of the atmosphere should be able to produce a much better result than simple (e.g. linear) interpolation methods. Based on this premise, we introduce an adaptable architecture to predict arbitrary time steps using a single model. This is based on the adaptive Fourier neural operator (AFNO) used in the original FourCastNet model, with an additional time-dependent modulation in both the spatial and spectral domains to adjust the network to the desired time step.

Contributions: We introduce the modulated adaptive Fourier neural operator (ModAFNO), which applies a scale–shift operation to spatial and spectral domain feed-forward neural networks to condition their output on the time step. The utility of the approach is demonstrated with a weather interpolation model, which uses ModAFNO to reconstruct global weather states at 1111-hourly resolution from 6666-hour resolution input data.

2 Model

The problem considered in this work is formulated as follows: given the global atmospheric states at times t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, as well as additional context variables and a target time t𝑡titalic_t with t0tt1subscript𝑡0𝑡subscript𝑡1t_{0}\leq t\leq t_{1}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≤ italic_t ≤ italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, predict the state at time t𝑡titalic_t.

2.1 Architecture

We base our model on the AFNO neural network architecture, which was used in the original FourCastNet model and has been shown to be adaptable to many physical problems \citepGuibas2022AFNO,NVIDIA2023Darcy,Leinonen2023LDCast. The AFNO implementation in NVIDIA Modulus (https://github.com/NVIDIA/modulus), available under the Apache-2.0 License, was used as the baseline. The ModAFNO architecture is also now available in Modulus.

The AFNO FourCastNet employs a patch encoding layer, a series of AFNO blocks, and a decoding layer. The AFNO blocks learn neural network operations both in the spatial domain and in the Fourier spectral domain. Formally, the k𝑘kitalic_kth AFNO block, operating on the output xk1subscript𝑥𝑘1x_{k-1}italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT of the previous block, can represented as

Filterk(x)subscriptFilter𝑘𝑥\displaystyle\mathrm{Filter}_{k}(x)roman_Filter start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x ) =\displaystyle== IDFT(Sλ(MLPkf(DFT(x))))IDFTsubscript𝑆𝜆subscriptsuperscriptMLPf𝑘DFT𝑥\displaystyle\mathrm{IDFT}(S_{\lambda}(\mathrm{MLP}^{\mathrm{f}}_{k}(\mathrm{% DFT}(x))))roman_IDFT ( italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( roman_MLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_DFT ( italic_x ) ) ) ) (1)
yksubscript𝑦𝑘\displaystyle y_{k}italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =\displaystyle== LayerNormkf(Filterk(xk1))+xk1subscriptsuperscriptLayerNormf𝑘subscriptFilter𝑘subscript𝑥𝑘1subscript𝑥𝑘1\displaystyle\mathrm{LayerNorm}^{\mathrm{f}}_{k}(\mathrm{Filter}_{k}(x_{k-1}))% +x_{k-1}roman_LayerNorm start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_Filter start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ) ) + italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT (2)
xksubscript𝑥𝑘\displaystyle x_{k}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =\displaystyle== LayerNormks(MLPks(yk))+yksubscriptsuperscriptLayerNorms𝑘subscriptsuperscriptMLPs𝑘subscript𝑦𝑘subscript𝑦𝑘\displaystyle\mathrm{LayerNorm}^{\mathrm{s}}_{k}(\mathrm{MLP}^{\mathrm{s}}_{k}% (y_{k}))+y_{k}roman_LayerNorm start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_MLP start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) + italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (3)

In Eq. 1, representing the spectral-domain filter, DFTDFT\mathrm{DFT}roman_DFT and IDFTIDFT\mathrm{IDFT}roman_IDFT are the discrete 2D Fourier transform and the inverse 2D DFT, respectively, Sλ(x)=sign(x)max(|x|λ,0)subscript𝑆𝜆𝑥sign𝑥𝑥𝜆0S_{\lambda}(x)=\mathrm{sign}(x)\max(|x|-\lambda,0)italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( italic_x ) = roman_sign ( italic_x ) roman_max ( | italic_x | - italic_λ , 0 ) is a sparsity-promoting soft-thresholding operation, and MLPkfsubscriptsuperscriptMLPf𝑘\mathrm{MLP}^{\mathrm{f}}_{k}roman_MLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is a feed-forward neural network (i.e. multilayer perceptron, MLP) operating in the spectral domain. Since the output of the DFT is complex-valued, MLPkfsubscriptsuperscriptMLPf𝑘\mathrm{MLP}^{\mathrm{f}}_{k}roman_MLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is a complex-valued network. In Eq. 2, the filter result is normalized with layer normalization (LayerNorm). In Eq. 3, MLPkssubscriptsuperscriptMLPs𝑘\mathrm{MLP}^{\mathrm{s}}_{k}roman_MLP start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT acts in the spatial domain and is a real-valued MLP that performs channel mixing with shared weights for each point; the result of this MLP is then passed through layer normalization. Residual connections are applied in both Eqs. 2 and 3.

Refer to caption
Figure 1: a) Overview of the ModAFNO interpolation network. b) The modulated AFNO layer. A time-dependent embedding is added to the MLPs in the spectral and spatial domains to enable temporal interpolation. c) Detailed view of the modulated MLP. A separate MLP produces the scale and shift activations from the time embedding.

To allow the AFNO to adapt to a desired time step, we introduce an embedding-dependent scale–shift operation to the AFNO layer in both the spatial and spectral stages, as illustrated in Fig. 1. This novel architectural feature is inspired by the use of a similar operation in convolutional networks in MetNet-2 \citepEspeholt2022MetNet2 and MetNet-3 \citepAndrychowicz2023MetNet3. The resulting modulated AFNO (ModAFNO) block can be expressed as

ModFilterk(x,q)subscriptModFilter𝑘𝑥𝑞\displaystyle\mathrm{ModFilter}_{k}(x,q)roman_ModFilter start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x , italic_q ) =\displaystyle== IDFT(Sλ(ModMLPkf(DFT(x),q)))IDFTsubscript𝑆𝜆subscriptsuperscriptModMLPf𝑘DFT𝑥𝑞\displaystyle\mathrm{IDFT}(S_{\lambda}(\mathrm{ModMLP}^{\mathrm{f}}_{k}(% \mathrm{DFT}(x),q)))roman_IDFT ( italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( roman_ModMLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_DFT ( italic_x ) , italic_q ) ) ) (4)
yksubscript𝑦𝑘\displaystyle y_{k}italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =\displaystyle== LayerNormkf(ModFilterk(xk1,q))+xk1subscriptsuperscriptLayerNormf𝑘subscriptModFilter𝑘subscript𝑥𝑘1𝑞subscript𝑥𝑘1\displaystyle\mathrm{LayerNorm}^{\mathrm{f}}_{k}(\mathrm{ModFilter}_{k}(x_{k-1% },q))+x_{k-1}roman_LayerNorm start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_ModFilter start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , italic_q ) ) + italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT (5)
xksubscript𝑥𝑘\displaystyle x_{k}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =\displaystyle== LayerNormks(ModMLPks(yk,q))+yk.subscriptsuperscriptLayerNorms𝑘subscriptsuperscriptModMLPs𝑘subscript𝑦𝑘𝑞subscript𝑦𝑘\displaystyle\mathrm{LayerNorm}^{\mathrm{s}}_{k}(\mathrm{ModMLP}^{\mathrm{s}}_% {k}(y_{k},q))+y_{k}.roman_LayerNorm start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( roman_ModMLP start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_q ) ) + italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT . (6)

The modulated MLPs (ModMLP) in the spectral (Eq. 4) and spatial (Eq. 6) domains are of the form

ModMLPkf(x,q)subscriptsuperscriptModMLPf𝑘𝑥𝑞\displaystyle\mathrm{ModMLP}^{\mathrm{f}}_{k}(x,q)roman_ModMLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x , italic_q ) =\displaystyle== Lineark,2f(ReLU(Lineark,1f(x)wkf(q)+bkf(q)))subscriptsuperscriptLinearf𝑘2ReLUsubscriptsuperscriptLinearf𝑘1𝑥subscriptsuperscript𝑤f𝑘𝑞subscriptsuperscript𝑏f𝑘𝑞\displaystyle\mathrm{Linear}^{\mathrm{f}}_{k,2}(\mathrm{ReLU}(\mathrm{Linear}^% {\mathrm{f}}_{k,1}(x)\cdot w^{\mathrm{f}}_{k}(q)+b^{\mathrm{f}}_{k}(q)))roman_Linear start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , 2 end_POSTSUBSCRIPT ( roman_ReLU ( roman_Linear start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT ( italic_x ) ⋅ italic_w start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) + italic_b start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) ) ) (7)
ModMLPks(x,q)subscriptsuperscriptModMLPs𝑘𝑥𝑞\displaystyle\mathrm{ModMLP}^{\mathrm{s}}_{k}(x,q)roman_ModMLP start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x , italic_q ) =\displaystyle== Lineark,2s(GELU(Lineark,1s(x)wks(q)+bks(q)))subscriptsuperscriptLinears𝑘2GELUsubscriptsuperscriptLinears𝑘1𝑥subscriptsuperscript𝑤s𝑘𝑞subscriptsuperscript𝑏s𝑘𝑞\displaystyle\mathrm{Linear}^{\mathrm{s}}_{k,2}(\mathrm{GELU}(\mathrm{Linear}^% {\mathrm{s}}_{k,1}(x)\cdot w^{\mathrm{s}}_{k}(q)+b^{\mathrm{s}}_{k}(q)))roman_Linear start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , 2 end_POSTSUBSCRIPT ( roman_GELU ( roman_Linear start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT ( italic_x ) ⋅ italic_w start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) + italic_b start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) ) ) (8)

that is, they follow the typical flow of an MLP with linear layers combined with a nonlinear activation function, except a scale–shift operation is used to modulate the MLP in both the spectral and spatial domains. In both cases, the scale factors w𝑤witalic_w and the shifts b𝑏bitalic_b only have channel and batch dimensions and are shared across all positions, adding only a small amount of computation. The spectral domain factors wfsuperscript𝑤fw^{\mathrm{f}}italic_w start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT and bfsuperscript𝑏fb^{\mathrm{f}}italic_b start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT are complex numbers, hence a complex multiplication and addition are performed in ModMLPfsuperscriptModMLPf\mathrm{ModMLP}^{\mathrm{f}}roman_ModMLP start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT, while the corresponding operations in ModMLPssuperscriptModMLPs\mathrm{ModMLP}^{\mathrm{s}}roman_ModMLP start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT are real-valued. The choice of activation functions follows the original AFNO architecture. The encoder and decoder architectures are also inherited unchanged from AFNO.

The embedding q𝑞qitalic_q can, in principle, condition the network to many different types of inputs. In our model, we want the embedding to depend on the interpolation target time t𝑡titalic_t; therefore we compute q𝑞qitalic_q using a sinusoidal time embedding EMBEMB\mathrm{EMB}roman_EMB that is passed through an MLP. Then, each of wkf(q)subscriptsuperscript𝑤f𝑘𝑞w^{\mathrm{f}}_{k}(q)italic_w start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ), bkf(q)subscriptsuperscript𝑏f𝑘𝑞b^{\mathrm{f}}_{k}(q)italic_b start_POSTSUPERSCRIPT roman_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ), wks(q)subscriptsuperscript𝑤s𝑘𝑞w^{\mathrm{s}}_{k}(q)italic_w start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ), bks(q)subscriptsuperscript𝑏s𝑘𝑞b^{\mathrm{s}}_{k}(q)italic_b start_POSTSUPERSCRIPT roman_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) is computed from q𝑞qitalic_q using a learned MLP that is unique to each stage k𝑘kitalic_k.

q𝑞\displaystyle qitalic_q =\displaystyle== MLPq(EMB(tt0t1t0))superscriptMLP𝑞EMB𝑡subscript𝑡0subscript𝑡1subscript𝑡0\displaystyle\mathrm{MLP}^{q}\left(\mathrm{EMB}\left(\frac{t-t_{0}}{t_{1}-t_{0% }}\right)\right)roman_MLP start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ( roman_EMB ( divide start_ARG italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG start_ARG italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ) ) (9)
wk{f,s}(q)subscriptsuperscript𝑤fs𝑘𝑞\displaystyle w^{\{\mathrm{f},\mathrm{s}\}}_{k}(q)italic_w start_POSTSUPERSCRIPT { roman_f , roman_s } end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) =\displaystyle== MLPkw{f,s}(q)subscriptsuperscriptMLP𝑤fs𝑘𝑞\displaystyle\mathrm{MLP}^{w\{\mathrm{f},\mathrm{s}\}}_{k}(q)roman_MLP start_POSTSUPERSCRIPT italic_w { roman_f , roman_s } end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) (10)
bk{f,s}(q)subscriptsuperscript𝑏fs𝑘𝑞\displaystyle b^{\{\mathrm{f},\mathrm{s}\}}_{k}(q)italic_b start_POSTSUPERSCRIPT { roman_f , roman_s } end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) =\displaystyle== MLPkb{f,s}(q).subscriptsuperscriptMLP𝑏fs𝑘𝑞\displaystyle\mathrm{MLP}^{b\{\mathrm{f},\mathrm{s}\}}_{k}(q).roman_MLP start_POSTSUPERSCRIPT italic_b { roman_f , roman_s } end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) . (11)

2.2 Implementation and training

2.2.1 Data

We used data from the global ECMWF Reanalysis v5 \citep[ERA5;][]Hersbach2020ERA5 from the European Centre for Medium-Range Weather Forecasts (ECMWF) to train our model. This dataset gives an estimate of the historical state of the atmosphere at 1111 h temporal resolution on a 0.25°0.25°0.25\degree0.25 ° resolution latitude–longitude grid (with 721×14407211440721\times 1440721 × 1440 grid points per time step and variable). It is available under the open Licence to Use Copernicus Products at the Copernicus Climate Data Store \citepHersbach2023ERA5 or Google Cloud (https://cloud.google.com/storage/docs/public-datasets/era5). ERA5 is also the dataset used to train most global weather forecast models, including FourCastNet, PanguWeather and GraphCast.

The data downloaded from ERA5 for model training comprise pressure level height (z𝑧zitalic_z), temperature (T𝑇Titalic_T), specific humidity (q𝑞qitalic_q) as well as winds in the west–east (u𝑢uitalic_u) and north–south (v𝑣vitalic_v) directions. These are at 13131313 pressure levels each (50505050, 100100100100, 150150150150, 200200200200, 250250250250, 300300300300, 400400400400, 500500500500, 600600600600, 700700700700, 850850850850 and 1000100010001000 hPa). Furthermore, wind u𝑢uitalic_u/v𝑣vitalic_v components at 10101010 m and 100100100100 m heights from the surface, temperature at 2222 m height, surface pressure, mean sea level pressure and total column water vapor are included, giving a total of 73737373 variables. The data were acquired at the full 1111 h temporal resolution and span the years 19801980198019802017201720172017, with a total data volume of approximately 95959595 TB.

To provide context information to the model, we add external forcings that are either constant or can be computed using known equations from time stamps. The constant inputs are the sine and cosine of latitude and longitude (which act as a positional embedding), land-sea mask and surface elevation; the single computed input is the cosine of the solar zenith angle.

2.2.2 Implementation

The model consists of the encoder, 12121212 ModAFNO layers and the decoder. It processes the entire global ERA5 grid, except for the southernmost latitude since the model is limited to even-numbered dimensions. It thus has spatial input and output dimensions of 720×14407201440720\times 1440720 × 1440 (reduced from 721721721721 rows due to the removal of the southernmost latitude). The model has one tensor input, constructed by concatenating the states at t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (normalized to zero mean and unit variance), the constant context variables, and the solar zenith angle at t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, t𝑡titalic_t and t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, giving a total of 155155155155 input channels (2×732732\times 732 × 73 atmospheric state variables and 9999 auxiliary inputs). It also has one scalar input containing the interpolation target time t𝑡titalic_t, and a tensor output with 73737373 channels.

Given the similarity of the AFNO and ModAFNO models, we initially trained followed the training setup of the AFNO FourCastNet. However, we found that this training configuration sometimes produced artifacts at the edges of the 8×8888\times 88 × 8 tiles used by the encoder and decoder. We reduced the tile size to 2×2222\times 22 × 2 and the number of channels to 512512512512, and removed the block-diagonal structure from the spectral part (effectively reducing the number of blocks to 1111) to stabilize training. leading to a model with fewer parameters but higher computational and memory requirements. This configuration produced a model with better skill than the original and resolved the artifact issue. Further reduction of the tile size to 1×1111\times 11 × 1 produced a model with lower skill but much higher memory requirements; thus we did not explore this further. Other than the abovementioned hyperparameters, we used model and training configurations identical to the AFNO FourCastNet.

2.2.3 Training

We trained the model with 6666 million samples of training data, using the Adam optimizer \citepKingma2014Adam starting with a learning rate of 5×1045superscript1045\times 10^{-4}5 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT and decaying to 00 over the training period using a cosine learning rate schedule. Mean square error weighted by the cosine of the latitude (i.e. proportional to the area of the grid points) was used as the training objective. The years 19801980198019802016201620162016 were used for training while 2017201720172017 was held back for testing. The training was performed on an NVIDIA DGX SuperPOD, using 64646464 NVIDIA H100 GPUs and a global batch size of 64646464, requiring approximately 1300130013001300 GPU hours. GPU memory usage was approximately 72727272 GB/GPU during training and 22222222 GB/GPU during inference.

3 Results

To evaluate the effectiveness of our method, we evaluate the trained model using statistics of the reconstruction error, as well as a set of weather scenarios, which include Storm Ophelia, Hurricane Irma and the European heatwave of August 2017201720172017. The method is compared to ground truth observations at 6666-hourly and 1111-hourly resolutions, as well as linear interpolation. Since we restrict our problem definition to reconstructing the weather between two known time steps, we do not consider interpolation methods that use more than two time steps to produce each output.

3.1 Interpolated wind speeds during Storm Ophelia

We demonstrate results from the ModAFNO interpolation model in Fig. 2. This example shows Storm Ophelia on the Atlantic Ocean approaching Ireland on October 16, 2017. The ModAFNO interpolation result is compared to the true 1111 h data and linear interpolation. The interpolation result is visually nearly indistinguishable from the true 1111 h data, correctly capturing the motion of the storm center on the top right and the front on the left side of the image. Meanwhile, the linear interpolation demonstrates the shortcomings of the 6666 h data by duplicating the fast-moving features in the middle frames, producing unrealistic weather states.

Refer to caption
Figure 2: Example of ModAFNO interpolation of wind speeds from Storm Ophelia approaching Ireland on October 16, 2017. Top row: the actual ERA5 wind speed; middle row: the ModAFNO interpolated wind speed; bottom row: wind speed obtained with linear interpolation.

3.2 Interpolation error

In Fig. 3, we show the statistics of the interpolation RMSE of the ModAFNO model against the linear interpolation baseline as a function of the interpolation time. The errors were computed from 1000100010001000 random samples from the test dataset, using the values normalized to zero mean and unit variance to give approximately equal weight to each of the 73737373 variables. The linear interpolation has zero error at the endpoints by definition, while the ModAFNO model has a small amount of error remaining at the endpoints due to the lossy nature of its encoding and decoding. At the other interpolation times, ModAFNO achieves approximately 50%percent5050\%50 % reduction in RMSE compared to the linear interpolation. The errors of both models display highly skewed statistics.

Refer to caption
Figure 3: Normalized RMSE error of the interpolation for the ModAFNO model (blue) and linear interpolation (orange) as a function of the interpolation time. The line with triangles indicates the mean of the pointwise RMSE while the box plots indicate the median (midline of the box), the 25th and 75th percentiles (ends of the box) and the 9th and 91st percentiles (whiskers).

3.3 Maximum wind speeds in Hurricane Irma

To demonstrate the advantages of the 1111 h time interpolation, we reconstructed the wind speeds along the path of Hurricane Irma of September 2017, moving through the Caribbean towards Florida. In Fig. 4, we show the maximum wind speed, which is a good indicator of the level of wind-caused damage, computed from the original 6666 h resolution (left), interpolation of the 6666 h data to 1111 h (middle), and the actual 1111 h data (right). The 6666 h resolution data displays the maximum wind speeds fluctuating between weak and strong along the path, owing to the region of peak winds moving too fast for the 6666 h time resolution to capture it smoothly. The 1111 h interpolation captures the true maximum wind speeds much more accurately, even though some error remains particularly at the southern flank of the path of maximum winds.

Refer to caption
Figure 4: Maximum 10101010 m wind speed during Hurricane Irma in the Caribbean and Florida on September 2–8, 2017. Left: obtained with 6666 h temporal resolution. Middle: with ModAFNO interpolation to 1111 h resolution. Right: the true maximum at 1111 h resolution.

3.4 Extreme temperatures in a European heat wave

To demonstrate that the interpolation model has learned information about the physics of the atmosphere, we demonstrate another case of extreme weather prediction in the form of the temperatures during a European heatwave in August 2017. On August 3–4, the peak temperatures in Milan, Italy, occurred on both days approximately halfway between the 6666 h time steps given at 12 and 18 UTC. Thus, the 6666 h forecast would miss these extremes and underestimate the peak temperatures. In the plot shown in Fig. 5, we show that the 6666 h forecast linearly interpolated between time steps indeed predicts peak temperatures approximately 2°C2°C2\ \degree\mathrm{C}2 ° roman_C lower than the actual temperature. Meanwhile, the 1111 h interpolated forecast correctly infers that the hottest part of the day occurs between the endpoints of the interpolation period, and predicts the correct temperature with an accuracy of 0.5°C0.5°C0.5\ \degree\mathrm{C}0.5 ° roman_C (slightly overestimating on both days) and the timing of the highest temperature to within 1111 h.

Refer to caption
Figure 5: 2222 m temperature at the grid square containing the city of Milan, Italy, during a heat wave on August 3–4, 2017, as indicated by linear interpolation between 6666 h resolution time steps (orange dotted line), interpolation to 1111 h resolution (blue solid line) and the true 1111 h resolution data (green dashed line).

4 Discussion and summary

A major advantage of ML weather forecast models over traditional NWP model is their orders of magnitude higher speed and hence cost efficiency. This advantage partially derives from the long time step used in ML weather models, such as 6666 hours in FourCastNet and GraphCast. However, this temporal resolution is insufficient for capturing quickly moving or rapidly evolving weather phenomena. Similar issues can arise when weather data is archived at a coarse temporal resolution due to storage constraints. To alleviate these issues, we have introduced a weather interpolation model that fills in the gaps between 6666 h time steps at 1111 h temporal resolution using a modulated AFNO layer that adapts to the desired target time step. We show that this model significantly outperforms linear interpolation at predicting the intermediate weather states. It also exhibits better skill than linear interpolation with extreme weather events such as fast-moving hurricanes and temperature maxima in heat waves.

Ideally, a model trained with a continuous input indicating the time step would generalize well enough to predict the weather state at any point between the endpoints of the interpolation spaced 6666 h apart. Thus, it should be possible to inference the model at arbitrarily high temporal resolution. Unfortunately, this turns out not to be the case with the model trained in this work: attempts to inference it at half-hour time intervals (e.g. 3.53.53.53.5 h from the beginning) produced artifacts that render the prediction unusable. Thus, the model appears to have only properly learned the target times found in the training data, which are only available at integer hours. Generalizing the model training better to enable temporally continuous inference remains a challenge for further work.

Beyond the ability to resolve rapidly evolving weather events, we expect that ModAFNO and similar solutions are also applicable to the challenge of containing the ever-increasing data storage requirements of weather forecast and climate data. Availability of fast and accurate interpolation makes it feasible to store only limited amounts of data, using the interpolation to recover the intermediate time steps that were not stored, at a cost of a small amount of error due to the imperfect reconstruction. This capability will be particularly attractive if the model can be trained to generalize to arbitrary target times steps rather than just 1111 h time steps, as in this case it will enable the estimation of atmospheric states at arbitrary temporal resolution, which can be highly valuable in applications such as nowcasting.

The modulation concept in ModAFNO can also be reasonably expected to be adaptable to similar network architectures such as the spherical Fourier neural operator \citepBonev2023SFNO. Furthermore, its use is not limited to interpolation and we expect it can be used to train weather forecast models with adjustable lead times. The embedding used to adjust ModAFNO can also be derived from other contexts besides lead time, allowing other types of conditional networks to be constructed.

\printbibliography