1 Introduction

Time series are now inseparable from everyday life, and the quantity of time series data is growing rapidly. As an important and common form of data, time series are widely used in fields such as finance, scientific research, industry, and the military. Time series classification is a problem of significant research value and has been a long-standing concern. Mainstream approaches have evolved from distance-based and feature-based methods to ensemble-based methods.

Later, progress in neural networks led to the emergence of deep learning-based classifiers. Currently, most deep learning classifiers for time series classification are grounded in Convolutional Neural Networks (CNNs). With the outstanding performance of Transformers on various tasks in natural language processing and computer vision, increasing attention has been paid to applying Transformers to time series. However, Transformer-based models often require a large amount of data to perform well. To the best of our knowledge, there is currently no Transformer-based model that can be trained directly on small time series classification datasets, without pre-training on other large datasets, and still perform well. To address this issue, we explore the possibility of using a Transformer architecture for univariate time series classification in this paper. Through experiments on the University of California, Riverside (UCR) datasets, we find that although a pure Transformer-based deep learning model performs well on some datasets, it performs worse than CNNs on most. Therefore, we propose our CTCTime model, which combines CNNs and Transformers. We conduct comparative experiments with 13 non-deep-learning-based time series classification methods on 44 UCR datasets and with 7 advanced methods, 6 of them deep learning-based, on 85 datasets. Extensive experimental evidence confirms the competitiveness and effectiveness of our model. The main contributions of this paper are summarized as follows:

  1. We introduce deep learning models based on the Transformer architecture into one-dimensional time series classification and explore the feasibility of training the models directly on small datasets without additional data processing.

  2. We propose a novel time series classification model called CTCTime, which combines the Transformer architecture and the CNN architecture, integrating the advantages of both models.

  3. Comparative experiments with a large number of time series classification methods demonstrate that CTCTime performs well on one-dimensional time series classification problems.

The rest of this paper is organized as follows: Sect. 2 reviews related work on time series classification, covering both traditional and deep learning-based methods, with further elaboration on the traditional methods. Section 3 briefly introduces Transformers and then describes the structure of our CTCTime model. Section 4 presents the experimental results. Section 5 summarizes our findings and discusses future work.

2 Related Work

Time series classification occupies a vital place in the fields of data mining and machine learning. In this section, we first briefly introduce the goal of the time series classification task, and then survey existing methods.

The goal of time series classification is to design an effective classifier that, given a set of sequences, predicts the label of each sequence and thereby determines its category.

There are many methods for time series classification, which can be divided into traditional classification methods and deep learning-based classification methods. In terms of specific technical approaches, traditional time series classification methods can be further classified into distance-based methods, feature-based methods, and ensemble-based approaches.

As the mainstream approach to time series classification for many years, distance-based methods select a distance measure to quantify the similarity between two time series, such as the Euclidean distance, dynamic time warping (DTW) [1], or the longest common subsequence (LCS) [2]. Among these methods, the nearest neighbor algorithm combined with DTW (NN-DTW) [3] is considered the most effective and has long served as the default time series classifier. However, because computing the DTW distance takes \(O({n}^{2})\) time, much higher than the \(O(n)\) required for the Euclidean distance, its computation time is often long. Several DTW-based studies [4,5,6] have therefore improved its speed, though without improving classification accuracy. To improve accuracy on the basis of DTW, methods such as Time-weight-based DTW [7] and Adaptive Constrained DTW (ACDTW) [8] have been proposed.
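To make the complexity contrast concrete, the following is a minimal Python sketch of the classic \(O({n}^{2})\) DTW dynamic program (the function name and NumPy usage are our own illustration, not code from the cited works):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between 1-D series a and b.

    Fills an (n+1) x (m+1) cost table, hence O(n*m) time, compared with
    the single O(n) pass needed for the Euclidean distance of
    equal-length series.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between points
            # extend the cheapest of the three admissible warping steps
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]
```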

Feature-based classification methods extract relevant features from the series according to some measurement relationship. The extracted features are usually quantized into a Bag-of-Words (BoW) representation, which is then input into a classifier. Time Series Bag of Features (TSBF) [9] is a typical method of this kind: it selects random sub-sequences from the time series, extracts features for each sub-sequence, combines them into a feature bag, and inputs the bag into a random forest classifier. Time Series Forest (TSF) [10] reduces computational complexity by using a random feature sampling strategy. Another feature-based method, Bag-of-SFA-Symbols (BOSS) [11], uses the symbolic Fourier approximation (SFA) transform to extract word features from the series. Some researchers have extended BOSS, for example BOSS in Vector Space (BOSSVS) [12]. Highly Comparative Feature (HCF) [13] uses a highly comparative method to extract features and construct feature representations. A sketch of the bag-of-features idea appears below.
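As a rough illustration of this bag-of-features idea, the sketch below summarizes random sub-sequences by simple statistics and feeds them to a random forest; the chosen statistics and parameters are our own simplification, not the published TSBF pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def bag_of_features(series, n_subseq=10, sub_len=20):
    """Describe each random sub-sequence by (mean, std, slope) and
    flatten the statistics into one fixed-length feature vector."""
    feats = []
    for _ in range(n_subseq):
        start = rng.integers(0, len(series) - sub_len)
        s = series[start:start + sub_len]
        slope = np.polyfit(np.arange(sub_len), s, 1)[0]
        feats.extend([s.mean(), s.std(), slope])
    return np.array(feats)

# X_train: (n_series, series_len) array, y_train: class labels (assumed inputs)
# clf = RandomForestClassifier().fit(
#     np.stack([bag_of_features(x) for x in X_train]), y_train)
```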

Ensemble-based approaches combine a variety of different classifiers to improve classification accuracy. The Elastic Ensemble (PROP) [14] combines 11 classifiers with a weighted ensemble method; its sub-classifiers use elastic distance measures together with nearest neighbor algorithms. The Flat Collective of Transformation Ensembles (COTE) [15] adopts 35 classifiers and extracts time series features at two levels: the time domain and the frequency domain. The Hierarchical Vote Collective of Transformation-based Ensembles (Hive-COTE) [16] improved on COTE and for a time was the most accurate classification algorithm [17]. However, Hive-COTE has a large computational cost and may encounter problems on big data mining problems. TS-CHIEF [18] is an ensemble-based, scalable, and highly accurate time series algorithm. It uses tree-structured classifiers to integrate some of the previously most effective time series embeddings and achieves higher accuracy than Hive-COTE.

With the continuous development of neural networks, researchers have begun to apply them to time series classification tasks [19], and several deep learning-based time series classification models have been designed. Cui et al. proposed Multi-scale Convolutional Neural Networks (MCNN) [20] for time series classification. Wang et al. proposed end-to-end deep learning baselines that use no domain-specific preprocessing, adopting a Multilayer Perceptron (MLP), a Fully Convolutional Network (FCN), and ResNet as models [21], and demonstrated that FCN performed best on the UCR time series repository. Yang et al. [22] improved accuracy by applying data augmentation to these baselines. CNNs have proven to be the most widely used backbone among deep learning-based time series classification methods, as in FCN, 3DACN [23], and others. These models were initially applied to domain-agnostic time series classification problems and achieved some success. ROCKET [24] uses simple linear classifiers on top of random convolutional kernels and achieves good results on the UCR time series repository. InceptionTime, proposed by Fawaz et al. [25], is an ensemble of deep Convolutional Neural Network models inspired by the Inception-v4 architecture; it has been shown to match the efficacy of Hive-COTE with higher scalability. MACNN [26] improved on MCNN by introducing attention mechanisms to raise the model's accuracy. Recently, Liu et al. [32] proposed a novel time series feature extraction block named Convolutional Gated Linear Units (CGLU), a combination of convolutional operations and Gated Linear Units for adaptively extracting local temporal features. However, compared to CNN-based models, there are few effective Transformer-based time series classification models. Chen et al. [33] studied multivariate time series classification based on the Transformer architecture. This paper focuses on the one-dimensional time series problem, investigates the Transformer architecture, and proposes a new time series classification model called CTCTime.

3 Methodology

In this section, we first introduce the Transformer, since it is the foundation of the proposed method. Then we describe the proposed CTCTime model in detail.

3.1 Structure of Transformer

Transformers initially showed excellent performance in natural language processing tasks [27], and in recent years, they have also demonstrated impressive performance in computer vision [28], competing strongly in many tasks.

The overall architecture of the Transformer can be partitioned into four parts: the input part, the output part, the encoder, and the decoder. The input part includes data embedding layers and positional encoding, while the output part contains a softmax layer and a linear layer. The encoder and decoder are composed of multiple encoder and decoder layers, respectively. However, the model may vary depending on the task: the Transformer does not always require a decoder. In generative tasks such as machine translation, the encoder encodes the input sequence into a series of hidden vectors, from which the decoder generates the output sequence. For tasks that only involve encoding, such as text classification, using only the encoder can still achieve good performance.

For time series classification problems, a schematic diagram of a Transformer-based model is shown in Fig. 1.

Fig. 1 Schematic diagram of the Transformer architecture model

3.2 Multi-Head Attention Mechanism

The self-attention mechanism is the foundation of the multi-head attention mechanism. Self-attention plays an important role in the Transformer because it can extract the correlations between different parts of the input. In our model, we encode the entire time series through a fully connected layer to a specified length, split the encoded sequence into multiple sub-sequences, and generate the final embedding by combining them with positional information. The self-attention mechanism then measures the degree of correlation between different parts of the embedding. The calculation proceeds as follows.

Each input vector \({E}_{i}\) is mapped by three different matrix transformations into query, key, and value vectors, which form the matrices Q, K, and V. Each q vector then attends to every k vector via a dot product, and the resulting score α, representing the correlation between the two vectors, forms an entry of the matrix A. A softmax over A yields A'. Using A' and V, the output vector \(b_i\) of the self-attention layer for each input vector \({E}_{i}\) is computed as: \( b_{i} = \sum\nolimits_{j = 1}^{n} \alpha_{i,j}^{\prime } \, v_{j} \)

The multi-head attention mechanism is an improved version of self-attention. It computes multiple sets of Q, K, and V by mapping the input into different subspaces, obtains one output per head, concatenates these outputs, and produces the final output through a linear transformation.
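A minimal PyTorch sketch of the computation just described (dimensions are illustrative; the sketch includes the standard \(1/\sqrt{d_k}\) dot-product scaling, which the formula above omits):

```python
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention with several heads, following
    the computation described above."""
    def __init__(self, dim=32, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3)  # maps each E_i to q, k, v at once
        self.proj = nn.Linear(dim, dim)        # final linear transformation

    def forward(self, x):                      # x: [batch, n_tokens, dim]
        b, n, d = x.shape
        q, k, v = (t.view(b, n, self.heads, self.dk).transpose(1, 2)
                   for t in self.to_qkv(x).chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1)) / self.dk ** 0.5  # matrix A of scores
        attn = attn.softmax(dim=-1)                        # A' after softmax
        out = attn @ v                         # b_i = sum_j a'_ij * v_j
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)                  # concatenate heads + linear map
```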

3.3 The Overall Structure of the Proposed CTCTime

The proposed model, CTCTime, is a new deep learning model that combines transformer and CNN structures. Its overall structure is shown in Fig. 2.

Fig. 2 The overall architecture of CTCTime

The inputs of the model are \({b}_{size}\) one-dimensional time series of length \({L}_{input}\). These input sequences are processed simultaneously along two paths, which we name the Transformer path and the CNN path. The two paths extract different feature sequences from the time series; their outputs have the same length as the input but different channel counts. We concatenate these two sets of feature sequences with the initial time series along the channel dimension and input the result into Convolution Block 4.

The structure of the fourth convolution block is shown in Fig. 3. It includes two one-dimensional convolution layers with a kernel size of 3, a stride of 1, and padding of 1, one BatchNorm layer, and one activation function layer. We stack the outputs of the CNN path and the Transformer path with the initial time series to generate a sequence group with 98 channels as the input to the first convolution layer of this block. We set the output channel numbers of both convolution layers to 64. In this convolutional block, ELU is chosen as the activation function:

$$ \mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases} $$
(1)
Fig. 3 Schematic diagram of Convolution Block 4

Then, we use an adaptive global average pooling layer to reduce the dimensionality to 1, followed by a fully connected layer to obtain a sequence of the same length as the predicted number of classes. Finally, the feature sequences are processed by Softmax to obtain the final output result.
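Putting this together, the following is a plausible PyTorch sketch of Convolution Block 4 and the classification head; the ordering of BatchNorm and ELU relative to the two convolutions, and the names used, are our reconstruction from the description rather than the authors' released code:

```python
import torch.nn as nn

num_classes = 2  # example only; set per dataset

class ConvBlock4(nn.Module):
    """Two 1-D convolutions (kernel 3, stride 1, padding 1), BatchNorm,
    and ELU, taking the 98-channel stacked sequence group as input."""
    def __init__(self, in_ch=98, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.ELU(),
        )

    def forward(self, x):           # x: [b_size, 98, L_input]
        return self.net(x)          # [b_size, 64, L_input]

head = nn.Sequential(
    nn.AdaptiveAvgPool1d(1),        # reduce the length dimension to 1
    nn.Flatten(),                   # [b_size, 64]
    nn.Linear(64, num_classes),     # one logit per predicted class
    nn.Softmax(dim=-1),             # final class probabilities
)
```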

3.4 The Transformer Path in CTCTime

As shown in Fig. 1, the Transformer path mainly includes the generation of embeddings, the addition of positional encoding, a Transformer encoder composed of multiple stacked Transformer blocks, and two fully connected layers.

In the Transformer path, the input time series group first passes through a fully connected layer that extracts preliminary features and transforms each series into a sequence of length \({L}_{a}\). This ensures that time series with different initial lengths can be processed uniformly. The 1×\({L}_{a}\) sequence is then divided into \(\frac{{L}_{a}}{{L}_{p}}\) sub-sequences of length \({L}_{p}\), generating an embedding with dimensions \([{b}_{size}, \frac{{L}_{a}}{{L}_{p}},{L}_{p}]\). At the same time, the model generates a set of trainable embedding vectors of length \({L}_{p}\) as positional encoding, which is added to the embedding to form the input of the Transformer block. Adding positional encoding reflects the position information of the sequence, which greatly improves the effectiveness of the Transformer. By default we set \({L}_{a}\) to 1024 and \({L}_{p}\) to 32, so the data input to the first Transformer block has dimensions [\({b}_{size}\), 33, 32]. The schematic diagram of the Transformer block is shown in Fig. 4, and a sketch of the embedding step follows.
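A minimal sketch of this embedding step is given below. Note that \({L}_{a}/{L}_{p} = 32\) patches plus one extra learnable token would yield the stated 33 positions; the extra token (e.g., a class token) is our assumption, since the text does not state it explicitly:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Encode a series of length L_input to L_a = 1024, split it into
    L_a / L_p = 32 patches of length L_p = 32, prepend one learnable
    token (our assumption, matching the stated [b_size, 33, 32] shape),
    and add trainable positional encodings."""
    def __init__(self, l_input, l_a=1024, l_p=32):
        super().__init__()
        self.l_p = l_p
        n_tokens = l_a // l_p + 1                 # 32 patches + 1 extra token
        self.encode = nn.Linear(l_input, l_a)
        self.extra_token = nn.Parameter(torch.zeros(1, 1, l_p))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, l_p))

    def forward(self, x):                         # x: [b_size, l_input]
        x = self.encode(x)                        # [b_size, 1024]
        x = x.view(x.size(0), -1, self.l_p)       # [b_size, 32, 32] patches
        tok = self.extra_token.expand(x.size(0), -1, -1)
        x = torch.cat([tok, x], dim=1)            # [b_size, 33, 32]
        return x + self.pos                       # add positional encoding
```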

The Transformer block in CTCTime is mainly composed of a normalization layer, a multi-head attention mechanism, and a feed-forward layer, and internally contains multiple residual connections. Unlike the normalization layers in the CNN modules, the Transformer block uses LayerNorm rather than BatchNorm. If \({X}_{Ti}\) denotes the input of a Transformer block, its output is computed as:

$${X}_{Tt}=At[{Norm}_{layer}\left({X}_{Ti}\right)]+{X}_{Ti}$$
(2)
$${X}_{To}=Fd\left[{Norm}_{layer}\left({X}_{Tt}\right)\right]+{X}_{Tt}$$
(3)

where At(X) represents the multi-head attention mechanism processing of X, and Fd(X) represents the feed-forward processing of X.

The data processed by LayerNorm will undergo multi-head attention mechanism processing. In CTCTime’s multi-head attention mechanism, we set the number of heads to 8 in our experiments.

The feed-forward operation is essentially an MLP module. In CTCTime, the output dimension of the feed-forward layer is the same as its input dimension, and it consists of two fully connected layers. The first fully connected layer is followed by a GELU activation function and a dropout operation with probability \({d}_{f}\); the second is also followed by a dropout operation with probability \({d}_{f}\). The GELU activation function is:

$$ \mathrm{GELU}(x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left( 1 + \mathrm{erf}\left( x/\sqrt{2} \right) \right) $$
(4)

where \(\Phi (\text{x})\) represents the cumulative distribution function of the Gaussian distribution, which is the definite integral of the Gaussian distribution over the interval \((-\infty ,x]\).

The Transformer encoder in CTCTime stacks the Transformer block depth times. Considering the limited number of training samples, depth cannot be too large, so we set it to 3 in the model. A sketch of one block and of the stacking follows.
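The following PyTorch sketch assembles one pre-norm block implementing Eqs. (2)-(4) and stacks it depth = 3 times; the feed-forward hidden width and the dropout probability \({d}_{f}\) are illustrative assumptions, as the paper does not state them:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block: LayerNorm -> multi-head attention -> residual add
    (Eq. 2), then LayerNorm -> feed-forward -> residual add (Eq. 3)."""
    def __init__(self, dim=32, heads=8, hidden=64, d_f=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(                  # MLP with GELU (Eq. 4)
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(d_f),
            nn.Linear(hidden, dim), nn.Dropout(d_f),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = self.attn(h, h, h)[0] + x             # Eq. (2)
        x = self.ff(self.norm2(x)) + x            # Eq. (3)
        return x

encoder = nn.Sequential(*[TransformerBlock() for _ in range(3)])  # depth = 3
```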

The data output by the Transformer encoder passes through a linear layer to obtain a sequence of length \({L}_{input}\). At the end of the Transformer path, the sequence groups are normalized and output.

3.5 The CNN Path in CTCTime

The structure of the CNN path is also shown in Fig. 1; it consists of three convolution blocks, a pooling layer, and a fully connected layer stacked together. In the CNN path, we do not encode the time series to change its length; instead, we input the series into the convolution blocks at the initial length \({L}_{input}\) (Fig. 4). The convolution blocks in our model do not all share the same structure. The schematic diagram of the first to third convolution blocks is shown in Fig. 5.

Fig. 4 Schematic diagram of the Transformer block

Fig. 5 Schematic diagram of Convolution Blocks 1 to 3

The first three convolution blocks each contain three parallel one-dimensional convolutional layers, one BatchNorm layer, and one activation function layer. We use different convolution kernel sizes for the three parallel layers in order to capture different receptive fields and extract more feature information. In the first convolutional block, the time series are input into three one-dimensional convolutional layers with kernel sizes of [5, 9, 13], paddings of [2, 4, 6], and a stride of 1.

The three resulting sequences and the original time series are then stacked together to generate a sequence group with 49 channels and a sequence length of \({L}_{input}\). Notably, in the second and third convolutional blocks we do not stack the initial time series; only the three sequence groups obtained from the convolution layers are stacked. The outputs of the second and third convolutional blocks are sequence groups with 72 and 96 channels, respectively. We use BatchNorm layers to normalize the data, which improves the training speed and generalization ability of the model and helps prevent overfitting. We use ReLU as the activation function for the first three convolutional blocks: \(ReLU(x)=max(0,x)\). A sketch of these blocks follows.
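A plausible PyTorch sketch of Convolution Blocks 1 to 3 follows; the per-branch channel counts are our inference from the stated totals (49 = 3 × 16 + 1 raw series for block 1, 72 = 3 × 24 for block 2, 96 = 3 × 32 for block 3), not published values:

```python
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    """One of Convolution Blocks 1-3: three parallel 1-D convolutions with
    kernel sizes 5/9/13 (paddings 2/4/6, stride 1), whose outputs are
    concatenated along the channel dimension, then BatchNorm and ReLU."""
    def __init__(self, in_ch, branch_ch, keep_input=False):
        super().__init__()
        self.keep_input = keep_input
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, branch_ch, k, stride=1, padding=k // 2)
            for k in (5, 9, 13)
        ])
        out_ch = 3 * branch_ch + (in_ch if keep_input else 0)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        if self.keep_input:          # block 1 also stacks the raw series
            outs.append(x)
        return self.act(self.bn(torch.cat(outs, dim=1)))

block1 = ParallelConvBlock(in_ch=1, branch_ch=16, keep_input=True)  # 49 channels
block2 = ParallelConvBlock(in_ch=49, branch_ch=24)                  # 72 channels
block3 = ParallelConvBlock(in_ch=72, branch_ch=32)                  # 96 channels
```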

4 Experimental Results and Analysis

In this section, we first explore the feasibility of a pure Transformer architecture, without pretraining, for one-dimensional time series classification, comparing it with a pure FCN architecture and with our proposed CTCTime on several datasets from the UCR archive. Then, drawing on the experimental practice of FCN [21], InceptionTime [25], ROCKET [24], MACNN [26], and other time series classification models, we test the proposed CTCTime as a classifier on datasets from the UCR archive and compare it with a large number of existing time series classification algorithms. Following the experimental setting of FCN, we select traditional algorithms such as BOSS [11], COTE [15], and Hive-COTE [16] for comparison. In addition, we conduct comparative experiments with advanced time series classification methods such as InceptionTime (ITime) and MACNN on 85 datasets from the UCR archive to demonstrate the accuracy and superiority of CTCTime.

4.1 Experiment Settings

The UCR archive [3] holds an important position in the field of time series classification and is a standard benchmark collection for one-dimensional time series algorithms. In 2018, the archive was expanded from the original 85 datasets to 128. However, relatively few publicly available time series classification algorithms have been tested on the newly added datasets, so we test our method on the original 85. In addition, because some comparative methods did not publish results on all 85 datasets, when comparing with those methods we follow experiments such as MCNN and the deep learning baselines [21] and compare results on the same 44 datasets. For advanced methods with published results, such as ROCKET and MACNN, we compare on all 85 datasets.

In our experiments, we tested batch sizes of 16, 64, and 128 and ultimately chose 128 for its better performance. We tested learning rates of 0.01, 0.002, and 0.001 and ultimately chose 0.002. We use the commonly used stochastic gradient descent algorithm as the optimizer, and the number of training epochs is set according to the size of the dataset. A sketch of this setup follows.
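A minimal sketch of this training setup, assuming a hypothetical CTCTime constructor and dataset objects (neither is published with the paper):

```python
import torch
from torch.utils.data import DataLoader

model = CTCTime()                      # hypothetical constructor (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)     # from {0.01, 0.002, 0.001}
loader = DataLoader(train_set, batch_size=128, shuffle=True)  # from {16, 64, 128}
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(n_epochs):          # n_epochs scaled to the dataset size (assumed)
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # standard cross-entropy classification loss
        loss.backward()
        optimizer.step()
```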

4.2 The Experimental Results of the Transformer Architecture Model

We attempt to use the transformer architecture, which has shown excellent performance in both natural language processing and computer vision, for time series classification tasks. Currently, there is a lack of large publicly available datasets for time series classification tasks, such as ImageNet for computer vision tasks, making it difficult to conduct large-scale pretraining. Therefore, we conduct tests directly on the UCR dataset without any preprocessing.

The schematic diagram of the Transformer model used for testing is shown in Sect. 3.1. We compared this pure Transformer architecture model with FCN, a typical CNN architecture, as well as with our proposed CTCTime. The experiments show that although the Transformer architecture achieved higher accuracy on certain datasets, it performed worse than the CNN architecture on most. We selected representative comparative results and show them in Table 1; the values in the table are the error rates of the different classification methods on each dataset. CTCTime, which combines the two architectures, performs better on most datasets.

Table 1 Representative comparative results of Transformer, FCN and CTCTime

From the table above, it can be observed that the performance of models with a single Transformer architecture typically does not surpass that of FCN models or the CTCTime model proposed in this paper. This may be attributed to the influence of the number of training samples, which results in suboptimal training outcomes for single Transformer architecture models when directly applied to these datasets. However, Transformer models can learn general feature representations from large-scale pre-training data and then adapt to specific time series tasks through transfer learning. This method of pre-training and fine-tuning has demonstrated excellent performance in the field of natural language processing. Therefore, to construct a Transformer architecture model that performs well across various time series classification datasets, one could consider first building a dataset with a substantial number of samples for model training, and then fine-tuning the trained model on the target dataset for testing. In this paper, to handle time series data, the CTCTime model is constructed by integrating the Transformer structure with the CNN architecture. This not only retains the advantage of easy training of the CNN architecture but also introduces positional encoding, enabling the model to comprehend the sequential order of elements in the data. Additionally, the self-attention mechanism prevents issues of vanishing and exploding gradients when processing long sequence data. Consequently, CTCTime outperforms both single Transformer architecture models and FCN models.

4.3 Comparison between CTCTime and Traditional Algorithms

There are many traditional time series algorithms; we select 13 non-deep-learning-based methods to compare with our proposed CTCTime, testing on 44 UCR datasets following the experiments of Wang et al. [21]. The thirteen methods are: 1-NN DTW, DTW with a warping window constraint set through cross-validation (DTW CV) [4], Fast Shapelet (FS) [29], Bag-of-SFA-Symbols (BOSS) [11], Shotgun Classifier (SC) [30], Time Series Bag-of-Features (TSBF) [9], Time Series Forest (TSF) [10], 1-NN Bag-of-SFA-Symbols in Vector Space (BOSSVS) [12], Elastic Ensemble (PROP) [14], the Learn Shapelets Model (LS) [31], the Shapelet Ensemble (SE) model [15], flat-COTE (COTE) [15], and the Hierarchical Vote Collective of Transformation-based Ensembles (Hive-COTE) [16]. Hive-COTE performs best among these 13 traditional algorithms.

To facilitate the display of the comparative results, we divide the 13 methods into two groups and present them in two tables. Tables 2 and 3 show the error rates of our proposed CTCTime and these traditional time series classification methods on the 44 datasets; the error rate of each algorithm on a dataset equals 1 minus its accuracy. In addition, we record the winning times (the number of datasets with the smallest error) and the average rank of each method within its group, as well as across all 14 methods.

Table 2 Comparison of error rates between CTCTime and the first group of traditional algorithms
Table 3 Comparison of Error Rates between CTCTime and the second group of traditional algorithms

In the comparative experiments with the first group of seven time series classification algorithms, CTCTime is the clear winner. As can be seen in Table 2, CTCTime's winning times within the group account for more than half of the datasets and far exceed those of the other algorithms. The average rank of CTCTime within the group is 1.886, while the second-best algorithm, BOSS, has an average rank of 3.045. In terms of both winning times and average rank, CTCTime has an outstanding advantage.

In addition, to confirm the statistical significance of CTCTime's performance relative to the other methods, we performed a Nemenyi test on these algorithms. The Nemenyi test determines whether significant differences exist by comparing rank differences among groups; it relies on no parametric assumptions and has good statistical power. We implemented the test with the Orange3 package in Python (see the sketch below); the resulting graph is shown in Fig. 6. Algorithms connected by a horizontal line can be considered to have no significant difference in performance.
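A sketch of how such a diagram can be produced with Orange3's evaluation utilities (the method list and the two trailing rank values are illustrative placeholders):

```python
import matplotlib.pyplot as plt
from Orange.evaluation import compute_CD, graph_ranks

# average ranks over the 44 datasets (CTCTime 1.886 and BOSS 3.045 from
# Table 2; the remaining two values are placeholders for illustration)
names = ["CTCTime", "BOSS", "DTW CV", "TSF"]
avg_ranks = [1.886, 3.045, 4.1, 4.5]
cd = compute_CD(avg_ranks, 44)        # critical difference at alpha = 0.05
graph_ranks(avg_ranks, names, cd=cd, width=6, textspace=1.5)
plt.show()
```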

Fig. 6 Critical difference diagram over the average rankings of CTCTime and the first group of traditional algorithms

The results of the comparative experiments with the second group of algorithms are presented in Table 3. Hive-COTE (abbreviated HCTE in Table 3) and CTCTime perform similarly well in terms of winning times and average rank, and both outperform the other algorithms. Across all 13 non-deep-learning-based methods, CTCTime and Hive-COTE stand out from the rest. Although CTCTime is slightly inferior to Hive-COTE in average rank, it wins on more datasets than Hive-COTE. The average rank graph obtained through the Nemenyi test is shown in Fig. 7.

Fig. 7 Critical difference diagram over the average rankings of CTCTime and the second group of traditional algorithms

Compared to traditional algorithms, CTCTime demonstrates the superiority of deep learning algorithms, particularly in feature extraction. Analyzing all the datasets involved in the testing, CTCTime achieves higher average accuracy than all individual traditional time series classification algorithms and shows no significant difference in performance compared to the complex Hive-COTE.

4.4 Comparison between CTCTime and Other Deep Learning Algorithms

For deep learning-based time series classification methods, we selected FCN, ResNet, ROCKET, ITime, MCNN, and MACNN to compare with our CTCTime. In addition, we included the traditional algorithm TS-CHIEF [18], which has been shown to outperform Hive-COTE. We record the error rate of each method on each dataset and calculate their winning times and average rank as measures of performance. The results are shown in Table 4.

Table 4 Comparison of Error Rates between CTCTime and advanced algorithms

From the table, it can be seen that CTCTime has the highest Winning times. The performance of MACNN is comparable to that of CTCTime and both algorithms outperform other methods with outstanding performance in many datasets. Therefore, from the table we can deduce that the proposed CTCTime performs as well as, or even better than, state-of-the-art time series classification models on some datasets. Overall, CTCTime, which combines Transformer and CNN architectures, demonstrates good accuracy and scalability in solving one-dimensional time series classification problems. The critical difference diagram over the average rankings of CTCTime and advanced algorithms is shown in Fig. 8.

Fig. 8 Critical difference diagram over the average rankings of CTCTime and advanced algorithms

Owing to its Transformer architecture, CTCTime uses self-attention to capture long-range dependencies in input sequences, whereas CNN architectures are limited by local receptive fields and may lose or blur information when handling such dependencies. Thanks to the Transformer's strong sequence-modeling capacity, CTCTime also generalizes well to inputs of different lengths and structures, while the CNN architecture compensates for the Transformer's imprecise handling of positional information. CTCTime can thus leverage both the local perception capability of CNNs and the global association modeling capability of Transformers, better capturing the features of the input data. However, CTCTime does not perform as well as MACNN on the one-dimensional UCR datasets derived from multidimensional time series (such as UWaveX and UWaveY); excluding these datasets, CTCTime outperforms MACNN more frequently. CTCTime is therefore better suited to purely one-dimensional time series classification tasks.

Additionally, this paper also compares CTCTime with the latest one-dimensional time series classification network, Conv-GLU [32], on the 44 UCR datasets listed in Table 2. The results show that CTCTime outperforms Conv-GLU on 21 datasets, underperforms on 18 datasets, and has the same results on 5 datasets. The average rank of CTCTime is 1.41, while the average rank of Conv-GLU is 1.48. This further demonstrates the superiority of CTCTime. We believe that the reason CTCTime is able to achieve good results is because it not only retains the structure of CNN, which facilitates model training, but also incorporates structures such as positional encoding and self-attention mechanisms, making it more suitable for one-dimensional time series classification tasks.

4.5 Further Analysis of the Effectiveness of CTCTime

During the experiments on the UCR datasets, CTCTime's per-epoch training time is short on the vast majority of datasets, demonstrating good scalability. Typically, one training epoch on a UCR dataset takes under 1 s on our equipment. For instance, the Herring dataset contains 64 samples in both the training and test sets, with time series of length 512; training and testing the model for one epoch takes less than 0.05 s.

Training a model involves numerous hyperparameters: structural parameters of the model itself, such as the number of Transformer blocks and the number of heads, and training-process parameters, such as the number of training epochs and the batch size. Larger training epochs and batch sizes often lead to better results, but they typically must be adjusted to the experimental environment and computational resources. Tuning the structural parameters must account for factors such as the length of the time series and the number of samples in the dataset. On the UCR datasets, we usually set the number of heads to 8; when the time series is longer than 1000 we set the number of Transformer blocks to 3, and for shorter sequences we typically set it to 2. These settings derive from our experimental results and handle most datasets; we believe the length of the time series is an important factor. Additionally, when the training set is small, the model's performance is more sensitive to hyperparameter selection, making reasonable hyperparameters harder to set. Since CTCTime combines the Transformer architecture with the CNN architecture, it has a large number of parameters, which can make its decisions difficult to interpret: every parameter contributes to the output, and tracing the influence of individual parameters is complex.

5 Conclusions

In this paper, we applied the Transformer architecture, which has been widely used and successful in natural language processing and computer vision, to the much-studied time series classification problem. In our experiments, we find that although pure Transformer models perform better on some datasets, in most cases they are not as effective as CNN architectures. We therefore propose a new time series classification model, CTCTime, which combines Transformer and CNN architectures. CTCTime retains the CNN structure, which makes the model easy to train, and introduces positional encoding and self-attention mechanisms, making it more suitable for one-dimensional time series classification tasks. We compare CTCTime with 13 traditional time series classification algorithms on 44 datasets and with 7 advanced classification methods on 85 UCR datasets. Extensive experiments demonstrate the feasibility of CTCTime for one-dimensional time series classification, and CTCTime achieves performance that is as good as, or better than, state-of-the-art time series classification models.