Trip.com Group, Shanghai, China
{yh_liao,zt_wang,weip,qq_nie,zhenhuazhang}@trip.com

TripCast: Pre-training of Masked 2D Transformers for Trip Time Series Forecasting

Yuhua Liao🖂    Zetian Wang    Peng Wei    Qiangqiang Nie    Zhenhua Zhang
Abstract

Deep learning and pre-trained models have shown great success in time series forecasting. However, in the tourism industry, time series data often exhibit a leading time property, presenting a 2D structure. This introduces unique challenges for forecasting in this sector. In this study, we propose a novel modelling paradigm, TripCast, which treats trip time series as 2D data and learns representations through masking and reconstruction processes. Pre-trained on large-scale real-world data, TripCast notably outperforms other state-of-the-art baselines in in-domain forecasting scenarios and demonstrates strong scalability and transferability in out-domain forecasting scenarios.

Keywords: Trip Time Series · Pre-trained Models · Transformer · Tourism

1 Introduction

Figure 1: An illustration of flight booking time series data (left). The vertical axis represents the flight takeoff date, and the horizontal axis represents the booking process. Within each takeoff date (right), the booking process is shown as a 1D time series and the entire data is shown as a 2D matrix. Across different takeoff dates, the unobserved booking process is shown as a triangle.

Time series forecasting is widely used in various real-world fields, such as finance, speech analysis, action recognition, and traffic flow forecasting [21]. Accurate forecasts empower businesses to optimize decision-making, enhance operations, and improve overall efficiency [1]. In the tourism industry, time series forecasting plays a crucial role in revenue management [11], demand planning [15], and dynamic pricing [26].

In recent decades, deep learning methods [2, 23, 39] have achieved significant success in time series forecasting [21]. These methods can flexibly model complex patterns and dependencies in time series and have been widely used across domains. However, training deep learning models from scratch requires large amounts of data and computational resources, which limits their use in practice. In the tourism sector, new routes and flights are scheduled monthly without any historical data, so it is impractical to train a robust and accurate deep learning model for them. More critically, in some domains, the application of deep time series models is hindered by the cold start problem due to the challenges or costs associated with data collection [24]. Meanwhile, large-scale pre-training has become a key element of training large neural networks in the vision [19, 27] and text [3, 7] domains [10]. Large Language Models (LLMs) learn general representations from web-scale text data, and scaling up both model size and data [14] enhances their zero-shot and in-context learning abilities. This inspires us to investigate the potential of pre-training time series models in the context of the tourism industry, especially given the limited research currently available in this field.

However, the time series data of the tourism industry inherently exhibits a dual-axis nature, as illustrated in Figure 1. The vertical axis represents the event time, such as the flight departure date, while the horizontal axis denotes the leading time prior to the event, such as the booking date. Existing forecasting paradigms typically address this problem from either the event time axis or the leading time axis. These dichotomous approaches result in two primary challenges: accuracy and efficiency.

Firstly, observations of time series in the tourism industry are typically influenced by both past event time points and leading time points. For instance, the booking rate of a flight on a specific departure date is influenced by the booking rate of the same flight on previous departure dates as well as the booking rate on previous leading times. Consequently, ignoring the complex dependencies and causality across different event times and leading times, existing models might fail to yield accurate forecasts. Secondly, building multiple models for different leading time steps or event time steps is inefficient and time-consuming. This fragmented approach necessitates significant computational resources and may lead to redundancy and suboptimal use of data.

To address these challenges, we propose a novel modelling paradigm that treats a trip time series as a single 2D array and learns local and global dependencies through masking and reconstruction. Furthermore, to validate the transferability and scalability of TripCast as a zero-shot forecaster in the tourism industry, we conduct extensive zero-shot forecasting experiments in both in-domain and out-domain scenarios.

Our contributions are as follows:

* For the first time, we formulate the problem of trip time series forecasting and introduce a novel modelling paradigm that treats trip time series as 2D data to capture the intrinsic correlations and causality between different event times and leading times.

* To address the challenges of trip time series forecasting, we propose TripCast that learns local and global dependencies through masking and reconstruction processes.

* We perform comprehensive experiments on large-scale datasets from an online travel agency. The results show that our method, as a zero-shot forecaster, outperforms deep learning and pre-trained models in in-domain scenarios and achieves strong scalability and transferability in out-domain scenarios.

2 Problem Statement

Figure 2: Illustration of trip time series data (a) and trip time series forecasting problem (b).

2.1 Trip Time Series

Let a trip time series be denoted as sequential data with two axes, event time and leading time (Figure 2). The event time axis represents when a good or service is consumed, such as a flight takeoff date or a hotel room check-in date. The leading time axis represents the time before consumption, such as the booking date or search date. Formally, a trip time series $X$ is defined as a 2D matrix with dimensions $H \times C$, where $H$ is the length of the event time axis and $C$ is the length of the leading time axis. For simplicity, we ignore the covariates dimension in all definitions.

2.2 Trip Time Series Forecasting

Given a trip time series $X \in \mathbb{R}^{H \times C}$, let $H_{obs}$ and $H_{pred}$ denote the numbers of observed and predicted time steps along the event time axis. Correspondingly, $X$ has at most $H_{pred}$ unobserved steps along the leading time axis, and the number of unobserved leading steps increases with the advance of time. Our goal is to predict the unobserved leading time steps of future event time steps. Formally, the task can be defined as a parameterized function $\mathcal{F}_{\theta}$:

$$\mathcal{F}: X_{H_{obs}:,\,C_{obs}:} = \mathcal{F}_{\theta}\left(X_{:H_{obs},\,:} \cup X_{H_{obs}:,\,:C_{obs}}\right) \tag{1}$$

where $C_{obs}$ denotes the observed leading time steps for each event time step. The problem is illustrated in Figure 2.
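
To make the triangular structure concrete, the following is a minimal sketch of the observation pattern implied by the definition above. It assumes, which the text does not state explicitly, that each later event step loses exactly one more leading step, so that the last event step has $H_{pred}$ unobserved leading steps. The function name `observed_mask` is hypothetical.

```python
import numpy as np

def observed_mask(H: int, C: int, H_pred: int) -> np.ndarray:
    """Return a boolean (H, C) matrix; True marks observed entries.

    Assumption (not from the paper): the k-th future event step has
    exactly k unobserved leading steps, giving a right triangle.
    """
    assert 0 <= H_pred <= min(H, C)
    H_obs = H - H_pred
    mask = np.ones((H, C), dtype=bool)
    for h in range(H_obs, H):
        n_unobserved = h - H_obs + 1          # 1, 2, ..., H_pred
        mask[h, C - n_unobserved:] = False    # trailing leading steps are unknown
    return mask

mask = observed_mask(H=6, C=5, H_pred=3)
print(mask.astype(int))
```

Under this assumption, the forecasting target in Equation (1) is the set of entries where the mask is False, and the model input is the set of entries where it is True.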

2.3 In-domain and Out-domain Forecasting

Figure 3: Hierarchical granularities of trip time series. They can be categorized into three levels of granularities: domain, collection, and time series. Each domain contains multiple collections, and each collection contains multiple time series.

Conceptually, temporal datasets can be categorized into three levels of granularities: domain, collection, and time series [32] as shown in Figure 3. In-domain forecasting involves training and evaluating the model on the same dataset source. Conversely, out-domain forecasting entails training the model on multiple datasets and evaluating it on a dataset from a different domain. In this study, we focus on evaluating the effectiveness and scalability of TripCast in both in-domain and out-domain tasks.

3 Related Work

3.1 Tourism Industry and Time Series Forecasting

Time series forecasting is crucial in the tourism industry for revenue management, demand planning, and dynamic pricing. Existing forecasting methods can be classified into three categories: historical-data-based methods, advanced-data-based methods, and combined methods [31]. Popular traditional methods in the tourism industry include ARIMA [5, 8], Exponential Smoothing [36], and Holt-Winters [12]. With advancements in deep learning, some studies have explored deep learning models for tourism forecasting. For example, the authors of [30] trained forecasting models based on the temporal fusion transformer (TFT) [18] for five different airports and found that TFT outperforms traditional methods.

3.2 Pre-training Modelling for Time Series Analysis

Inspired by advancements in pre-training across various fields, self-supervised learning has been adopted for time series forecasting. TS2Vec [35] and CoST [34] learn representations through contrastive learning. However, due to the limited scale of available datasets, they only consider in-domain scenarios, and their transferability is not well studied. With the explosion of large language models (LLMs) [25, 29], some studies have explored leveraging LLMs for time series forecasting [38]. Time-LLM [13] uses text data to reprogram the time series modality into the language modality. GPT4TS [40] fine-tunes LLMs on time series datasets and achieves state-of-the-art performance in various forecasting scenarios. TEMPO [4] introduces a prompt-based structure to enhance the distribution adaptation of LLMs for time series forecasting. Recently, foundation models pre-trained on time series data have been proposed [6, 28, 33].

4 Methodology

Within this section, we first outline the architecture of TripCast, which is well designed to accommodate the dual-axis properties of trip time series. We then describe the training strategies for both pre-training and downstream tasks.

4.1 Model Structure

Figure 4: The architecture of the proposed TripCast model. Trip time series and covariates are stacked along the event time and leading time axes. The input data is normalized and projected to a higher dimension. Then, token-level masking is applied to the projected input data. The masked input data is patched and fed into multiple transformer layers to learn predictive representations. Finally, the output of the transformer layers is projected and reconstructed to estimate the unobserved values of future time steps.

Input Projection and Masking. Unlike image modeling [9], we cannot directly apply patch masking to trip time series because observed and unobserved values might be mixed within the same patch. To tokenize the unobserved and missing values, we follow TS2Vec [35]: we project each input $x_{h,c}$ to a higher-dimensional latent vector $z_{h,c}$ and apply a token-level mask to the projected data. Notably, we mask the latent vectors rather than the raw input data because the value range of the raw input data is dynamic, making it impractical to use a fixed mask value. In this way, observed and unobserved tokens are separated in the latent representation space. Furthermore, we adopt two masking strategies during the pre-training stage:

* Random masking: This strategy simulates missing data by masking a predetermined proportion of tokens from the projected data at random (Figure 4). It enhances the robustness of TripCast models and ensures stable performance in real-world applications.

$$Mask^{random}_{h,c} \sim \mathrm{Bernoulli}(p), \qquad Mask^{random}_{h,c} \in \{0, 1\}$$

* Progressive masking: In trip time series, unobserved values typically appear in a triangular form, and as time progresses, unobserved values along the diagonal are gradually revealed. To inject this prior knowledge into the training stage and help the model learn causality, we mask triangular regions of the input data in a progressive manner, as shown in Figure 4.

During the inference stage, we mask only the unobserved tokens and feed the masked input data into the model to predict their values.
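
The two masking strategies can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the masking probability `p = 0.1`, the zero-valued `mask_token` (which could instead be a learned vector), and the uniform sampling of the triangle height up to the maximum of 15 are all hypothetical choices.

```python
import numpy as np

def random_mask(H, C, p, rng):
    """Bernoulli(p) mask: 1 means the token at (h, c) is masked."""
    return (rng.random((H, C)) < p).astype(np.int64)

def progressive_mask(H, C, h_pred):
    """Mask a triangle over the last h_pred event steps (1 = masked)."""
    mask = np.zeros((H, C), dtype=np.int64)
    for k in range(1, h_pred + 1):
        # The k-th future event step hides its last k leading steps.
        mask[H - h_pred + k - 1, C - k:] = 1
    return mask

def apply_mask(z, mask, mask_token):
    """Replace masked latent vectors z[h, c] with a shared mask token."""
    return np.where(mask[..., None] == 1, mask_token, z)

rng = np.random.default_rng(0)
H, C, D = 60, 40, 8
z = rng.normal(size=(H, C, D))            # projected latent tokens
mask_token = np.zeros(D)                  # hypothetical fixed mask token
h_pred = int(rng.integers(1, 16))         # assumed: sampled from 1..15
mask = np.maximum(random_mask(H, C, 0.1, rng), progressive_mask(H, C, h_pred))
z_masked = apply_mask(z, mask, mask_token)
```

Because the mask is applied to latent vectors, a single mask token can stand for "unknown" regardless of the scale of the raw series, which is the motivation given above for masking after projection.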

Patching and Positional Encoding. As demonstrated by PatchTST [22] and Vision Transformer [9], patching is an effective way to capture local patterns. In TripCast, we segment the input data into non-overlapping patches and apply a linear projection to each patch. This process reduces input data redundancy and extracts local semantic information. To capture the order of the input sequence, we add sinusoidal positional encoding to the patch embeddings.

$$z^{patch} = \mathrm{PatchEmbed}(z) + \mathrm{SinusoidalPositionalEncoding}(z) \tag{2}$$
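
A minimal sketch of Equation (2) is given below. The patch size `P = 5`, the random projection matrix `W`, and the choice to encode positions along the flattened, row-major patch sequence are assumptions made for illustration; the text does not specify them, and a 2D variant of the encoding would also be compatible with the description.

```python
import numpy as np

def patchify(z, P):
    """Split (H, C, D) tokens into non-overlapping P x P patches.

    Returns an array of shape (H//P * C//P, P*P*D), one row per patch.
    """
    H, C, D = z.shape
    assert H % P == 0 and C % P == 0, "H and C must be divisible by P"
    z = z.reshape(H // P, P, C // P, P, D)
    z = z.transpose(0, 2, 1, 3, 4)                 # (H/P, C/P, P, P, D)
    return z.reshape((H // P) * (C // P), P * P * D)

def sinusoidal_encoding(n_positions, d_model):
    """Standard sinusoidal encoding: sin on even dims, cos on odd dims."""
    assert d_model % 2 == 0
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
H, C, D, P, d_model = 60, 40, 8, 5, 128
z = rng.normal(size=(H, C, D))
patches = patchify(z, P)                            # 12 * 8 = 96 patches
W = rng.normal(size=(P * P * D, d_model)) * 0.02    # stand-in for the linear projection
z_patch = patches @ W + sinusoidal_encoding(len(patches), d_model)
```

With $H = 60$, $C = 40$, and this hypothetical patch size, the encoder would see 96 tokens instead of 2,400, which illustrates how patching reduces redundancy.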

Transformer Encoder. After patching and positional encoding, we use a standard transformer encoder to map the input tokens to latent representations. Each encoder layer consists of a multi-head self-attention mechanism followed by a feed-forward neural network.

$$z_1^{enc} = \mathrm{SelfAttention}(z^{patch}) \tag{3}$$
$$z_2^{enc} = \mathrm{LayerNorm}(z_1^{enc} + z^{patch}) \tag{4}$$
$$z_3^{enc} = \mathrm{FeedForward}(z_2^{enc}) \tag{5}$$
$$z^{enc} = \mathrm{LayerNorm}(z_3^{enc} + z_2^{enc}) \tag{6}$$
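
Equations (3) to (6) describe a post-norm encoder layer, which can be sketched in a few lines. For brevity, this illustration uses a single attention head, a ReLU feed-forward network without biases, and a LayerNorm without learned scale and shift; these simplifications and the parameter names are assumptions, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def encoder_layer(z_patch, params):
    z1 = self_attention(z_patch, params["Wq"], params["Wk"], params["Wv"])  # Eq. (3)
    z2 = layer_norm(z1 + z_patch)                                            # Eq. (4)
    z3 = np.maximum(z2 @ params["W1"], 0.0) @ params["W2"]                   # Eq. (5)
    return layer_norm(z3 + z2)                                               # Eq. (6)

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 16, 32, 10
params = {
    "Wq": rng.normal(size=(d_model, d_model)) * 0.1,
    "Wk": rng.normal(size=(d_model, d_model)) * 0.1,
    "Wv": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
}
z_patch = rng.normal(size=(n_tokens, d_model))
z_enc = encoder_layer(z_patch, params)
```

Since attention runs over all patches jointly, every output token can draw on patches from both the event time and leading time axes.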

Reconstruction. Given the latent representations from the transformer encoder, we project each latent vector to $P \times P \times N$ values, where $P$ is the patch size and $N$ is the number of predicted series. In this work, we focus on the univariate scenario, so $N$ is 1. Then, we reshape the projected latent vectors to $(B, H, C, N)$ as the output of the model.
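
The reshaping step can be sketched as the inverse of patching. The sketch assumes that patches are ordered row-major over the patch grid and that $B$ is the batch size; both are assumptions for illustration, as is the function name `unpatchify`.

```python
import numpy as np

def unpatchify(y, H, C, P, N=1):
    """Rearrange per-patch outputs (B, H//P * C//P, P*P*N) into (B, H, C, N)."""
    B = y.shape[0]
    y = y.reshape(B, H // P, C // P, P, P, N)
    y = y.transpose(0, 1, 3, 2, 4, 5)       # (B, H/P, P, C/P, P, N)
    return y.reshape(B, H, C, N)

# Round trip on a small example: patch a known array, then restore it.
B, H, C, P, N = 2, 6, 4, 2, 1
x = np.arange(B * H * C * N, dtype=float).reshape(B, H, C, N)
patches = (
    x.reshape(B, H // P, P, C // P, P, N)
     .transpose(0, 1, 3, 2, 4, 5)
     .reshape(B, (H // P) * (C // P), P * P * N)
)
restored = unpatchify(patches, H, C, P, N)
```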

Instance Normalization. To mitigate the distribution drift between training and test data, we apply reversible instance normalization [16] in TripCast models. This normalization module scales the input data by the mean and variance, then reverses the scaling for the output predictions. Although our input data is 2D, the mean and variance are calculated in the same manner as in typical instance normalization.
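
A minimal sketch of reversible normalization for 2D inputs follows. It assumes that statistics are computed per instance over both axes and only over observed entries, and it omits the learnable affine parameters of the original method [16]; these choices are assumptions, since the text does not specify them.

```python
import numpy as np

class ReversibleInstanceNorm:
    """Per-instance normalization whose statistics are reused to undo the scaling."""

    def __init__(self, eps=1e-5):
        self.eps = eps

    def normalize(self, x, observed):
        # x: (B, H, C); observed: boolean (B, H, C), True where a value is known.
        count = observed.sum(axis=(1, 2), keepdims=True)
        self.mean = (x * observed).sum(axis=(1, 2), keepdims=True) / count
        sq_dev = ((x - self.mean) ** 2) * observed
        var = sq_dev.sum(axis=(1, 2), keepdims=True) / count
        self.std = np.sqrt(var + self.eps)
        return (x - self.mean) / self.std

    def denormalize(self, y):
        # Map model outputs back to the original scale of each instance.
        return y * self.std + self.mean

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 6, 4))
observed = np.ones_like(x, dtype=bool)
observed[:, -1, -2:] = False                # pretend two entries are unobserved
norm = ReversibleInstanceNorm()
x_norm = norm.normalize(x, observed)
x_back = norm.denormalize(x_norm)
```

Storing the statistics at normalization time is what makes the module reversible: predictions made in the normalized space are mapped back with the same mean and standard deviation.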

4.2 Pre-training and Downstream Tasks

We split each dataset into pre-train and train-test partitions in a roughly 90/10 split. To prevent data leakage, we ensure that all return routes and flights are either in the pre-train or train-test set. For train-test sets, we choose the data from 2019-06-01 to 2019-08-31 as validation set and the data from 2019-09-01 to 2019-12-31 as test set on all datasets. All TripCast models are trained on pre-train datasets and evaluated on train-test datasets. Our aim is to demonstrate the potential of TripCast as a zero-shot forecaster in the tourism industry.
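
One way to realize a leakage-free split is to assign partitions by a direction-independent route key, so that a route and its return route always land in the same partition. The sketch below is an illustration only: the hash-based assignment, the airport codes, and the function name `partition` are hypothetical and are not described in the paper.

```python
import hashlib

def partition(origin: str, destination: str, train_test_share: float = 0.1) -> str:
    """Assign a route to a partition; A->B and B->A always land together."""
    key = "-".join(sorted((origin, destination)))          # direction-independent key
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0             # deterministic value in [0, 1)
    return "train-test" if bucket < train_test_share else "pre-train"

outbound = partition("SHA", "PEK")
inbound = partition("PEK", "SHA")
```

A hash of the sorted pair yields a split of roughly the requested proportion over many routes while remaining reproducible across runs.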

Pre-training. In this work, we focus on supervised pre-training since our main goal is to demonstrate the effectiveness and transferability of this novel modelling paradigm. In all pre-training tasks, we set $H$ to 60, $C$ to 40, and the maximum $H_{pred}$ of progressive masking to 15. We use mean absolute error (MAE) as the loss function during the pre-training stage.

Downstream Tasks. After pre-training, we evaluate TripCast on two downstream tasks: in-domain forecasting and out-domain forecasting. In in-domain forecasting, we pre-train the model within each domain and assess its performance within the same domain. In out-domain forecasting, we pre-train a unified model on all domains, then evaluate its performance on each domain.

5 Experiments

In this work, we collect five extensive, real-world datasets from an online travel agency (OTA) to evaluate the performance of TripCast. These collections encompass flight sales data, flight booking price data, and user search data. First, we pre-train TripCast models of small and base sizes on each dataset and evaluate their performance in in-domain forecasting scenarios. Next, we compare our method with deep learning and pre-trained time series models. Then, to investigate the transferability and scalability of TripCast, we pre-train a large TripCast model on the four datasets other than UserSearch and evaluate its performance on out-domain forecasting tasks. Finally, we conduct ablation studies to examine the impact of various components and masking strategies on the performance of TripCast.

5.1 Datasets

All datasets are preprocessed into univariate time series with date features. Below are the details of the datasets:

| Dataset | Period | Total n_series (Pre-train) | Total n_series (Train-test) | Total n_obs (Pre-train) | Total n_obs (Train-test) | Frequency |
|---|---|---|---|---|---|---|
| FlightSales | 2018-01 ~ 2019-12 | 3,947 | 489 | 110,997,640 | 13,712,640 | Day |
| RouteSales | 2018-01 ~ 2019-12 | 2,572 | 286 | 68,789,440 | 7,626,080 | Day |
| FlightPrice | 2017-08 ~ 2019-12 | 5,395 | 595 | 173,911,800 | 19,237,680 | Day |
| RoutePrice | 2017-01 ~ 2019-12 | 3,996 | 445 | 159,749,040 | 17,685,080 | Day |
| UserSearch | 2017-04 ~ 2019-12 | 3,298 | 367 | 124,884,320 | 13,690,880 | Day |

Table 1: Key details of datasets.

* FlightSales: This dataset contains the daily seat sales rate, i.e., the ratio of the number of seats sold to the capacity of the flight. All time series in this dataset are aggregated by flight.

* RouteSales: This dataset is similar to FlightSales, but the time series are aggregated by route.

* FlightPrice: This dataset contains the cumulative average order price of flights. All time series in this dataset are aggregated by flight.

* RoutePrice: This dataset is similar to FlightPrice, but the time series are aggregated by route.

* UserSearch: This dataset contains the cumulative user search count of routes.

5.2 Training

We pre-train the models in three different sizes ranging from small to large, with detailed hyperparameters shown in Table 2. The smallest model has fewer than 1 million parameters, while the large model has nearly 20 million parameters. All models are trained with a batch size of 256 for 50,000 iterations. We use Adam [17] with an initial learning rate of 3e-4 and cosine learning rate decay. Training is conducted on NVIDIA V100 GPUs with mixed precision.
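
The learning rate schedule described above can be sketched as a plain cosine decay. The absence of a warm-up phase and the final learning rate of zero (`min_lr = 0.0`) are assumptions, since the text only specifies the initial rate and the decay type.

```python
import math

def cosine_lr(step, total_steps=50_000, base_lr=3e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 12_500, 25_000, 37_500, 50_000):
    print(step, cosine_lr(step))
```

Under these assumptions, the rate is halved at the midpoint of training (1.5e-4 at iteration 25,000) and decays smoothly toward zero afterwards.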

| Model | Layers | Dimension | Heads | Params | Iters |
|---|---|---|---|---|---|
| TripCast$_{small}$ | 4 | 128 | 4 | 928k | 50000 |
| TripCast$_{base}$ | 4 | 256 | 8 | 3.4m | 50000 |
| TripCast$_{large}$ | 6 | 512 | 8 | 19.4m | 50000 |

Table 2: Details of the hyperparameters of TripCast models in different sizes.

5.3 Evaluation Metrics

As evaluation criteria, in this study, we employ mean absolute error (MAE) and weighted absolute percentage error (WAPE).

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|; \quad \mathrm{WAPE} = \frac{\sum_{i} \left|y_i - \hat{y}_i\right|}{\sum_{i} \left|y_i\right|} \tag{7}$$
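
Both metrics in Equation (7) translate directly into code. The sample values below are made up for illustration and are not taken from the experiments.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def wape(y, y_hat):
    """Weighted absolute percentage error: total absolute error over total |y|."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y)))

y_true = [100.0, 200.0, 300.0]      # illustrative values only
y_pred = [110.0, 190.0, 330.0]
print(mae(y_true, y_pred))          # (10 + 10 + 30) / 3
print(wape(y_true, y_pred))         # 50 / 600
```

Unlike MAE, WAPE is scale-free, which makes it easier to compare errors across datasets whose values differ by orders of magnitude, such as sales rates and prices.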

5.4 Baselines

For deep learning methods, we compare TripCast with linear family [37], iTransformer [20], and PatchTST [22]. For pre-trained models, we compare TripCast with GPT4TS [40]. The details of the baselines are as follows:

| Baseline | Hyperparameter | Values |
|---|---|---|
| LinearFamily | model type | {Linear, NLinear, DLinear} |
| PatchTST | d_model | {128, 256} |
| | num_layers | {2, 3, 4} |
| iTransformer | d_model | {128, 256} |
| | num_layers | {2, 3, 4} |
| | use_norm | {true, false} |
| GPT4TS | block_size | {1024} |
| | n_head | {12} |
| | d_model | {768} |
| | num_layers | {6} |

Table 3: Hyperparameter search range for baselines.

Because all baselines are single-axis time series models, we simplify their forecasting task to predicting the value at the last leading step. The look-back period and prediction horizon of the baselines are set to 45 and 15, consistent with the TripCast models. This setting ensures that the performance of both TripCast and the baselines is evaluated at the same time points. Deep learning baselines are trained for 10,000 iterations with a batch size of 256. The validation loss is computed every 100 iterations for early stopping, and the checkpoint with the lowest validation loss is selected. For pre-trained models, we use the same training hyperparameters as for TripCast models. In summary, deep learning models are trained from scratch on train-test datasets, while pre-trained models are trained on pre-train datasets and evaluated zero-shot on train-test datasets.
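
The simplified baseline task can be sketched as slicing the last leading step out of the 2D matrix. The exact preprocessing used for the baselines is not given in the text, so the function below, including its name, is an assumption that follows only from the stated look-back of 45 and horizon of 15.

```python
import numpy as np

def to_single_axis_sample(X, lookback=45, horizon=15):
    """Reduce an (H, C) trip matrix to a 1D series at the last leading step."""
    final_values = X[:, -1]                      # value at the last leading step
    assert final_values.shape[0] == lookback + horizon
    return final_values[:lookback], final_values[lookback:]

X = np.arange(60 * 40, dtype=float).reshape(60, 40)   # placeholder matrix
past, future = to_single_axis_sample(X)
```

With $H = 60$, the 45 look-back values come from fully observed event steps and the 15 targets correspond to the future event steps, so both model families would be scored on the same entries.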

6 Results

6.1 In-domain Forecasting

The performance of TripCast models and baselines in in-domain scenarios is reported in Table 4. Both TripCast$_{small}$ and TripCast$_{base}$ outperform all baselines across all datasets. Among deep learning methods, PatchTST outperforms the other methods on three out of five datasets, indicating that patching and transformer-based models effectively capture trip time series patterns. GPT4TS, as an LLM-based model, outperforms deep learning methods on three out of five datasets. We speculate that the strong transferability of GPT-2 and the extensive pre-training data contribute to its superior performance. This also highlights the potential of pre-trained models in trip time series forecasting.

| Model | FlightSales MAE | FlightSales WAPE | RouteSales MAE | RouteSales WAPE | FlightPrice MAE | FlightPrice WAPE | RoutePrice MAE | RoutePrice WAPE | RouteSearch MAE | RouteSearch WAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Linear | 0.064 | 0.193 | 0.048 | 0.153 | 116.8 | 0.151 | 167.4 | 0.192 | 94.3 | 0.127 |
| NLinear | 0.063 | 0.193 | 0.048 | 0.153 | 115.7 | 0.149 | 169.1 | 0.195 | 95.3 | 0.129 |
| DLinear | 0.064 | 0.193 | 0.048 | 0.152 | 113.1 | 0.146 | 166.7 | 0.191 | 92.2 | 0.124 |
| PatchTST | 0.064 | 0.193 | 0.048 | 0.155 | 109.7 | 0.142 | 162.3 | 0.186 | 88.8 | 0.119 |
| iTransformer | 0.064 | 0.193 | 0.048 | 0.152 | 110.8 | 0.143 | 163.1 | 0.187 | 90.3 | 0.121 |
| GPT4TS | 0.063 | 0.193 | 0.047 | 0.152 | 110.0 | 0.142 | 161.2 | 0.185 | 79.9 | 0.108 |
| TripCast$_{small}$ | 0.052 | 0.159 | **0.038** | 0.122 | 94.7 | 0.122 | 106.6 | 0.127 | 47.2 | 0.064 |
| TripCast$_{base}$ | **0.050** | **0.153** | **0.038** | **0.121** | **91.4** | **0.118** | **103.7** | **0.124** | **44.5** | **0.061** |

Table 4: Test set results for deep learning and pre-trained baseline methods. Optimal results are highlighted in bold.

6.2 Towards Foundation Model (Out-domain Forecasting)

The ultimate goal of our research is to develop a foundation model for trip time series forecasting. To this end, we investigate the effectiveness of our model in out-domain forecasting. We pre-train models of different sizes (Figure 5) on all datasets except UserSearch and evaluate their performance on the UserSearch dataset. Our findings indicate that TripCast models perform well on the UserSearch dataset. The accuracy of TripCast$_{small}$ is close to that of PatchTST, while TripCast$_{base}$ and TripCast$_{large}$ outperform GPT4TS even though GPT4TS is pre-trained on the target domain. Furthermore, we observe that the performance of TripCast models scales well with the number of training iterations. This suggests that our method is a promising candidate for a foundation model in trip time series forecasting.

Figure 5: Accuracy versus the number of iterations during pre-training for different model sizes.

6.3 Ablation Study

6.3.1 Masking Strategy.

We conducted ablation studies on masking strategy, with a focus on progressive masking, as the robustness of the model is not our primary concern in this work. Table 5 shows that dynamic progressive masking helps models learn causality and achieve better performance.

6.3.2 Positional Encoding.

The attention mechanism is permutation-invariant, so transformer models rely on positional encoding to capture the order of the input sequence. We compared the performance of TripCast$_{base}$ with sinusoidal (fixed) positional encoding (SPE), learned positional encoding (LPE), and no positional encoding. Our findings, summarized in Table 6, indicate that fixed positional encoding yields better performance than learned positional encoding.

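| Model | FlightSales MAE | FlightSales WAPE | FlightPrice MAE | FlightPrice WAPE | RouteSearch MAE | RouteSearch WAPE |
|---|---|---|---|---|---|---|
| TripCast$_{base}$ | 0.050 | 0.153 | 91.4 | 0.118 | 44.5 | 0.061 |
| w/o Progressive Mask | 0.051 | 0.153 | 92.1 | 0.119 | 45.3 | 0.062 |

Table 5: Ablation study of the masking strategy.

| Date/Time | SPE | LPE | FlightSales MAE | FlightSales WAPE | FlightPrice MAE | FlightPrice WAPE | RouteSearch MAE | RouteSearch WAPE |
|---|---|---|---|---|---|---|---|---|
| | | | 0.062 | 0.186 | 99.2 | 0.127 | 79.5 | 0.107 |
| | | | 0.050 | 0.153 | 91.4 | 0.118 | 44.5 | 0.061 |
| | | | 0.052 | 0.157 | 90.1 | 0.116 | 49.7 | 0.067 |

Table 6: Ablation study of the positional encoding.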

7 Conclusion

In this study, we formulate the trip time series forecasting problem and propose a novel modelling paradigm to tackle its challenges. We pre-train transformer-based models on five large-scale real-world datasets and evaluate their performance in in-domain forecasting. Our findings demonstrate the effectiveness of our approach against other deep learning and pre-trained models. Additionally, we show that our method scales well with model size and training iterations in out-domain forecasting. Our work opens new possibilities for time series forecasting in tourism, and we hope that it will inspire further research in this area.

References

  • [1] Bączek, J., Zhylko, D., Titericz, G., Darabi, S., Puget, J.F., Putterman, I., Majchrowski, D., Gupta, A., Kranen, K., Morkisz, P.: Tspp: A unified benchmarking tool for time-series forecasting. arXiv preprint arXiv:2312.17100 (2023)
  • [2] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
  • [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [4] Cao, D., Jia, F., Arik, S.O., Pfister, T., Zheng, Y., Ye, W., Liu, Y.: Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948 (2023)
  • [5] Carmona-Benítez, R.B., Nieto, M.R.: Sarima damp trend grey forecasting model for airline industry. Journal of air transport management 82, 101736 (2020)
  • [6] Das, A., Kong, W., Sen, R., Zhou, Y.: A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688 (2023)
  • [7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [8] Do, Q.H., Lo, S., Chen, J., Le, C., Anh, L.H.: Forecasting air passenger demand: a comparison of lstm and sarima. Journal of Computer Science 16(7), 1063–1084 (2020)
  • [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [10] Gruver, N., Finzi, M., Qiu, S., Wilson, A.G.: Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems 36 (2024)
  • [11] Hayes, D.K., Hayes, J.D., Hayes, P.A.: Revenue management for the hospitality industry. John Wiley & Sons (2021)
  • [12] Huang, L., Zheng, W.: Hotel demand forecasting: a comprehensive literature review. Tourism Review 78(1), 218–244 (2023)
  • [13] Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J.Y., Shi, X., Chen, P.Y., Liang, Y., Li, Y.F., Pan, S., et al.: Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728 (2023)
  • [14] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
  • [15] Kim, S., et al.: Forecasting short-term air passenger demand using big data from search engine queries. Automation in Construction 70, 98–108 (2016)
  • [16] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.H., Choo, J.: Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International Conference on Learning Representations (2021)
  • [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [18] Lim, B., Arık, S.Ö., Loeff, N., Pfister, T.: Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37(4), 1748–1764 (2021)
  • [19] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36 (2024)
  • [20] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M.: iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 (2023)
  • [21] Ma, Q., Liu, Z., Zheng, Z., Huang, Z., Zhu, S., Yu, Z., Kwok, J.T.: A survey on time-series pre-trained models. arXiv preprint arXiv:2305.10716 (2023)
  • [22] Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022)
  • [23] Oreshkin, B.N., Carpov, D., Chapados, N., Bengio, Y.: N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019)
  • [24] Oreshkin, B.N., Carpov, D., Chapados, N., Bengio, Y.: Meta-learning framework with applications to zero-shot time-series forecasting. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 9242–9250 (2021)
  • [25] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)
  • [26] Pereira, L.N.: An introduction to helpful forecasting methods for hotel revenue management. International Journal of Hospitality Management 58, 13–23 (2016)
  • [27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [28] Rasul, K., Ashok, A., Williams, A.R., Khorasani, A., Adamopoulos, G., Bhagwatkar, R., Biloš, M., Ghonia, H., Hassen, N.V., Schneider, A., et al.: Lag-Llama: Towards foundation models for time series forecasting. arXiv preprint arXiv:2310.08278 (2023)
  • [29] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [30] Wang, L., Mykityshyn, A., Johnson, C., Cheng, J.: Flight demand forecasting with transformers. In: AIAA AVIATION 2022 Forum. p. 3708 (2022)
  • [31] Weatherford, L.R., Kimes, S.E.: A comparison of forecasting methods for hotel revenue management. International Journal of Forecasting 19(3), 401–415 (2003)
  • [32] Woo, G., Liu, C., Kumar, A., Sahoo, D.: Pushing the limits of pre-training for time series forecasting in the cloudops domain. arXiv preprint arXiv:2310.05063 (2023)
  • [33] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., Sahoo, D.: Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592 (2024)
  • [34] Woo, G., Liu, C., Sahoo, D., Kumar, A., Hoi, S.: CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. arXiv preprint arXiv:2202.01575 (2022)
  • [35] Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., Xu, B.: TS2Vec: Towards universal representation of time series. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 8980–8987 (2022)
  • [36] Yüksel, S.: An integrated forecasting approach to hotel demand. Mathematical and Computer Modelling 46(7-8), 1063–1070 (2007)
  • [37] Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 11121–11128 (2023)
  • [38] Zhang, X., Chowdhury, R.R., Gupta, R.K., Shang, J.: Large language models for time series: A survey. arXiv preprint arXiv:2402.01801 (2024)
  • [39] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 11106–11115 (2021)
  • [40] Zhou, T., Niu, P., Sun, L., Jin, R., et al.: One fits all: Power general time series analysis by pretrained LM. Advances in neural information processing systems 36, 43322–43355 (2023)