Late-Breaking / Demo Session Extended Abstract, ISMIR 2024 Conference

Exploring Tokenization Methods for Multitrack Sheet Music Generation

Abstract

This study explores the tokenization of multitrack sheet music in ABC notation and introduces two methods: bar-stream patching and line-stream patching. We compare these methods against existing techniques, including bar patching, byte patching, and Byte Pair Encoding (BPE). Experimental results show that bar-stream patching offers the best overall balance between computational efficiency and the musicality of the generated compositions, which makes it a promising tokenization strategy for sheet music generation.

1 Introduction

Sheet music generation, particularly using ABC notation (a compact, text-based format), has gained prominence in symbolic music generation [1, 2, 3, 4, 5]. Tokenizing multitrack ABC notation for language models presents unique challenges due to inter-track dependencies. FolkRNN [1] represented musical elements such as pitch and duration as multi-character tokens. In contrast, CLaMP [6] and bGPT [7] introduced bar patching and byte patching, respectively, which tokenize score text into patches and then decode them with a character-level decoder. MuPT [4] used the Byte Pair Encoding (BPE) method [8] from NLP. Nevertheless, challenges regarding musicality and computational efficiency remain.

In this work, we investigate tokenization as a critical initial step in training a sheet music generation model, aiming to minimize computational costs while maintaining the quality of the generated music. Building on bar patching and byte patching methods, we introduce two new techniques—bar-stream patching and line-stream patching. We evaluate all patching methods, including BPE, within a pre-training and fine-tuning framework.

2 Methodology

2.1 Model Architecture

We adopted Tunesformer [3], a hierarchical GPT-2 [9] decoder architecture, for our patching methods. In this framework, patch-level decoders embed and process patches to generate features for a character-level decoder, which performs auto-regressive character prediction. The context lengths are determined by the patch length for the patch-level decoder and the patch size for the character-level decoder. For BPE, we use a standard GPT-2 decoder.

2.2 Data Tokenization

To ensure multitrack score voice alignment, we use interleaved ABC notation [4] for multiple musical parts. Then, we tokenize score text with four patching methods and BPE, as shown in Fig. 1. Existing methods include:

Bar patching: Divide score text into bar patches, where each bar corresponds to a single voice, and truncate/pad bars based on patch size.

Byte patching: Divide score text into fixed-length patches regardless of musical score semantics.

BPE: A 50,000-token vocabulary was created through an iterative tokenization approach that merges frequent character or sub-word pairs in the score text.
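As a rough illustration of the two existing patching schemes, the sketch below splits a toy score string into patches. The function names, the pad character, and the simplified rule for locating bar boundaries (cut after every `|`) are assumptions made for this example; they are not taken from the CLaMP or bGPT implementations.

```python
import re

PAD = "_"  # hypothetical padding symbol, chosen only for this example


def byte_patching(text: str, patch_size: int) -> list[str]:
    """Cut the text into fixed-length patches, ignoring musical structure."""
    patches = [text[i:i + patch_size] for i in range(0, len(text), patch_size)]
    # Pad the final patch so that every patch has the same length.
    if patches:
        patches[-1] = patches[-1].ljust(patch_size, PAD)
    return patches


def split_bars(text: str) -> list[str]:
    """Split after each barline (simplified: only a single '|' is handled)."""
    return [bar for bar in re.findall(r"[^|]*\|?", text) if bar]


def bar_patching(text: str, patch_size: int) -> list[str]:
    """One patch per bar: long bars are truncated, short bars are padded."""
    return [bar[:patch_size].ljust(patch_size, PAD) for bar in split_bars(text)]


score = "CDEF|GABcdefg|"
print(byte_patching(score, 4))  # ['CDEF', '|GAB', 'cdef', 'g|__']
print(bar_patching(score, 6))   # ['CDEF|_', 'GABcde']
```

In the second call the bar `GABcdefg|` loses its last three characters, which is the truncation problem that motivates the stream variants.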

To avoid the truncation in bar patching and ensure division according to semantic units of musical scores, two patching methods are proposed:

Bar-stream patching: An improvement on bar patching. First, the score text is divided into bars. Then, each bar is split into fixed-length patches as per the patch size; if a bar’s final patch is shorter than the patch size, it is padded.

Line-stream patching: Like bar-stream patching, but this method divides the score by line breaks. In interleaved ABC, each line represents a bar with all voices.
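The two proposed methods can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the pad character, the single-`|` bar-splitting rule, and the toy inputs are all assumptions.

```python
import re

PAD = "_"  # hypothetical padding symbol, chosen only for this example


def stream_patches(unit: str, patch_size: int) -> list[str]:
    """Split one unit (a bar or a line) into fixed-length patches.

    Only the final patch of the unit can be short, and it is padded.
    """
    chunks = [unit[i:i + patch_size] for i in range(0, len(unit), patch_size)]
    return [chunk.ljust(patch_size, PAD) for chunk in chunks]


def bar_stream_patching(text: str, patch_size: int) -> list[str]:
    """Divide the text into bars, then stream each bar into patches."""
    # Simplified bar splitting: cut after every '|' character.
    bars = [bar for bar in re.findall(r"[^|]*\|?", text) if bar]
    return [patch for bar in bars for patch in stream_patches(bar, patch_size)]


def line_stream_patching(text: str, patch_size: int) -> list[str]:
    """Divide the text at line breaks, then stream each line into patches."""
    lines = [line for line in text.splitlines() if line]
    return [patch for line in lines for patch in stream_patches(line, patch_size)]


print(bar_stream_patching("CDEFGAB|c2|", 4))
# ['CDEF', 'GAB|', 'c2|_']
print(line_stream_patching("CDEF|C,4|\nGABc|G,4|", 8))
# ['CDEF|C,4', '|_______', 'GABc|G,4', '|_______']
```

No characters are discarded in either case. A unit longer than the patch size simply occupies several consecutive patches, at the cost of some padding at the end of each unit.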

2.3 Dataset

Pre-training used an in-house dataset of 160K scores in ABC notation. To evaluate how well models generalize under different tokenization methods, we fine-tuned on three classical music datasets with different instrumentation: 398 Bach chorales [10], 103 Haydn string quartets [11], and 54 Mozart piano sonatas [12]. Additionally, data augmentation across 15 key signatures was applied in both pre-training and fine-tuning.

Figure 1: An example of score text and various tokenization implementations with colors marking token boundaries.
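| Tokenization | Parameters | Sec/Epoch | Inference Speed (Bach) | Inference Speed (Haydn) | Inference Speed (Mozart) | BPB (Bach) | BPB (Haydn) | BPB (Mozart) | CLaMP 2 Score (Bach) | CLaMP 2 Score (Haydn) | CLaMP 2 Score (Mozart) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Byte patching | 65,872,896 | 963 | 597.1 | 630.4 | 623.8 | 0.2795 | 0.3682 | 0.3900 | 0.9767 | 0.9071 | 0.8068 |
| Line-stream patching | 65,872,896 | 1107 | 549.7 | 564.8 | 569.0 | 0.2772 | 0.3797 | 0.3958 | 0.9734 | 0.8916 | 0.8213 |
| Bar-stream patching | 65,872,896 | 1063 | 446.3 | 465.6 | 449.6 | 0.2539 | 0.3526 | 0.3879 | 0.9781 | 0.9228 | 0.8225 |
| Bar patching | 70,628,352 | 2848 | 226.1 | 210.9 | 204.3 | 0.2479 | 0.3515 | 0.3920 | 0.9813 | 0.9045 | 0.7531 |
| BPE | 84,074,496 | 4071 | 91.0 | 80.2 | 71.1 | 0.2591 | 0.3340 | 0.3542 | 0.9687 | 0.9050 | 0.7005 |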
Table 1: Comparison of evaluation results among different tokenization methods.

3 Experiments

3.1 Settings

For patching methods, we used a 6-layer patch-level decoder and a 3-layer character-level decoder. For BPE, a 6-layer decoder was directly applied. To balance bar truncation and efficiency, the patch size was set to 64 for bar patching (covering 97.7% of all bars) and 16 for other patching methods where truncation is not an issue. The patch length was 512 for all patching methods and 4096 for BPE, ensuring comparable score lengths across attention spans. All pre-training was carried out using 2 H800 GPUs with the batch size maximized.
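To make the relation between patch size and patch length concrete, the short sketch below computes an upper bound on the number of characters one patch-level context can cover, namely patch length times patch size. The helper name and the dictionary layout are assumptions for illustration. The real coverage is lower because of padding, and the figure for BPE is omitted because it depends on the average number of characters per token, which is not reported here.

```python
def max_chars_per_context(patch_length: int, patch_size: int) -> int:
    """Upper bound on the characters covered by one patch-level context window."""
    return patch_length * patch_size


# (patch length, patch size) pairs taken from the settings described above.
settings = {
    "bar patching": (512, 64),
    "byte / bar-stream / line-stream patching": (512, 16),
}

capacity = {name: max_chars_per_context(*cfg) for name, cfg in settings.items()}
for name, chars in capacity.items():
    print(f"{name}: at most {chars} characters per context")
```

For bar patching each patch holds exactly one bar, so the same window also corresponds to at most 512 bars, while the character-level decoder has to handle sequences four times longer than in the other patching methods.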

3.2 Evaluation Metrics

We evaluated models’ efficiency and musicality across different tokenization strategies using these metrics:

Sec/Epoch: This represents the average duration of each pre-training epoch, measured in seconds.

Bits-per-byte (BPB): The average number of bits required to predict the next token on the validation set, normalized per byte. Lower values are better.

Inference Speed: Average characters generated per second during inference.

CLaMP 2 Score: Calculated by extracting semantic features with CLaMP 2 [13] and computing the cosine similarity between the validation set and the generated data. A higher score means the generated data is more similar to the real data.
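The two quality metrics above can be sketched as follows. This is an illustrative approximation under stated assumptions: the function names are hypothetical, log-probabilities are assumed to be natural logarithms, and the similarity score assumes that features are averaged over each set before the cosine similarity is taken, a detail the text does not spell out. Feature extraction with CLaMP 2 itself is not shown.

```python
import math


def bits_per_byte(log_probs: list[float], num_bytes: int) -> float:
    """Total negative log-likelihood, converted to bits, divided by the byte count.

    log_probs holds the natural-log probability of each predicted unit.
    """
    total_bits = -sum(log_probs) / math.log(2)
    return total_bits / num_bytes


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equally sized feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_score(real: list[list[float]], generated: list[list[float]]) -> float:
    """Cosine similarity between the mean feature vectors of two sets (assumed)."""
    return cosine_similarity(mean_vector(real), mean_vector(generated))


# Four predictions, each with probability 0.5, over four bytes: one bit per byte.
print(bits_per_byte([math.log(0.5)] * 4, num_bytes=4))
```

A lower BPB indicates better prediction of the validation text, while a similarity score closer to 1 indicates that generated pieces sit closer to real pieces in the feature space.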

3.3 Results

Regarding efficiency, byte patching, line-stream patching, and bar-stream patching require shorter training times and have faster inference speeds, with byte patching performing the best. Bar patching and BPE are less computationally efficient because bar patching has a larger patch size and BPE has a longer context length.

For BPB, BPE generally performs best, achieving the lowest values on Haydn and Mozart, while bar patching is lowest on Bach. This is likely because BPE tokenizes score text into high-frequency combinations, thus providing the model with more prior knowledge compared to the character-level decoding in patching methods.

However, BPE underperforms in CLaMP 2 Scores, with the lowest values on Bach and Mozart, suggesting a semantic gap between the generated results and real music. In contrast, bar-stream patching achieves the highest CLaMP 2 Scores on Haydn and Mozart and the second highest on Bach. It not only avoids bar truncation issues but also incorporates prior knowledge of bar units during patching, leading to better musicality.

Overall, our experiments show that bar-stream patching is the top-performing method, presenting a balanced performance across all metrics. It combines high training and inference efficiency with generated results that closely resemble real classical compositions.
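As a quick sanity check on this summary, the CLaMP 2 Scores from Table 1 can be averaged across the three fine-tuning datasets. The unweighted mean used here is an assumption made for illustration and is not a metric reported in the evaluation itself.

```python
# CLaMP 2 Scores from Table 1, ordered as (Bach, Haydn, Mozart).
clamp2_scores = {
    "Byte patching": (0.9767, 0.9071, 0.8068),
    "Line-stream patching": (0.9734, 0.8916, 0.8213),
    "Bar-stream patching": (0.9781, 0.9228, 0.8225),
    "Bar patching": (0.9813, 0.9045, 0.7531),
    "BPE": (0.9687, 0.9050, 0.7005),
}

# Unweighted mean over the three datasets for each tokenization method.
mean_scores = {name: sum(vals) / len(vals) for name, vals in clamp2_scores.items()}
best = max(mean_scores, key=mean_scores.get)

for name, score in sorted(mean_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.4f}")
print("Highest mean CLaMP 2 Score:", best)
```

Under this simple average, bar-stream patching ranks first and BPE last, consistent with the per-dataset comparison.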

4 Conclusion

In this study, we explored tokenization methods for sheet music generation based on ABC notation. We introduced bar-stream and line-stream patching and compared them with bar patching, byte patching, and BPE. Focusing on the balance between computational efficiency and musicality, the results demonstrated that bar-stream patching outperformed the others in general.

For future work, we will scale up the model size and dataset while employing bar-stream patching and a hierarchical decoder. Additionally, we will establish a classical-music-centered dataset for fine-tuning to enhance the musicality of the generated results.

References

  • [1] B. L. Sturm, J. F. Santos, O. Ben-Tal, and I. Korshunova, “Music transcription modelling and composition using deep learning,” arXiv preprint arXiv:1604.08723, 2016.
  • [2] S. Wu, X. Li, and M. Sun, “Chord-conditioned melody harmonization with controllable harmonicity,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [3] S. Wu, X. Li, F. Yu, and M. Sun, “Tunesformer: Forming irish tunes with control codes by bar patching,” arXiv preprint arXiv:2301.02884, 2023.
  • [4] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan, L. Min, X. Liu, T. Zhang et al., “Mupt: A generative symbolic music pretrained transformer,” arXiv preprint arXiv:2404.06393, 2024.
  • [5] L. Casini, N. Jonason, and B. L. Sturm, “Investigating the viability of masked language modeling for symbolic music generation in abc-notation,” in International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar).   Springer, 2024, pp. 84–96.
  • [6] S. Wu, D. Yu, X. Tan, and M. Sun, “Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval,” arXiv preprint arXiv:2304.11029, 2023.
  • [7] S. Wu, X. Tan, Z. Wang, R. Wang, X. Li, and M. Sun, “Beyond language models: Byte models are digital world simulators,” arXiv preprint arXiv:2402.19155, 2024.
  • [8] P. Gage, “A new algorithm for data compression,” C Users J., vol. 12, no. 2, pp. 23–38, Feb. 1994.
  • [9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [10] M. S. Cuthbert and C. Ariza, “music21: A toolkit for computer-aided musicology and symbolic music data,” 2010.
  • [11] M. Gotham, M. Redbond, B. Bower, and P. Jonas, “The “openscore string quartet” corpus,” in Proceedings of the 10th International Conference on Digital Libraries for Musicology, 2023, pp. 49–57.
  • [12] J. Hentschel, M. Neuwirth, and M. Rohrmeier, “The annotated mozart sonatas: Score, harmony, and cadence,” Transactions of the International Society for Music Information Retrieval, vol. 4, no. 1, pp. 67–80, 2021.
  • [13] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang, M. Zhou, J. Chen, X. Mu, Y. Gao, Y. Dong, J. Liu, X. Li, F. Yu, and M. Sun, “Clamp 2: Multimodal music information retrieval across 101 languages using large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2410.13267