-
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
Authors:
Huanru Henry Mao
Abstract:
Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
Submitted 9 October, 2022;
originally announced October 2022.
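To illustrate the O(1) per-token claim in the abstract above, here is a minimal sketch of a generic recurrent fast-weight update with a scalar decay. It is an assumption for illustration only, not the paper's exact update rule or feature map: the names `fast_weight_step` and `decay`, and the outer-product form of the update, are hypothetical. The point is that the state has a fixed size, so each generation step costs the same regardless of how many tokens came before.

```python
def fast_weight_step(state, k, v, q, decay):
    """One recurrent step: decay the old state, write k v^T, then read with q.

    state is a len(k) x len(v) matrix stored as a list of lists.
    The scalar `decay` and the outer-product write are illustrative assumptions.
    """
    new_state = [
        [decay * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]
    # Read-out: out[j] = sum_i q[i] * new_state[i][j]
    out = [
        sum(q[i] * new_state[i][j] for i in range(len(k)))
        for j in range(len(v))
    ]
    return new_state, out


# Two tokens processed with a fixed-size 2x2 state.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = fast_weight_step(state, [1.0, 0.0], [2.0, 3.0], [1.0, 0.0], 0.5)
state, out2 = fast_weight_step(state, [0.0, 1.0], [1.0, 1.0], [1.0, 1.0], 0.5)
```

The second read-out mixes the newly written value with the first value scaled by the decay, which is how older tokens fade without being stored individually.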
-
Sampling Through the Lens of Sequential Decision Making
Authors:
Jason Xiaotian Dou,
Alvin Qingkai Pan,
Runxue Bao,
Haiyi Harry Mao,
Lei Luo,
Zhi-Hong Mao
Abstract:
Sampling is ubiquitous in machine learning methodologies. Due to the growth of large datasets and model complexity, we want to learn and adapt the sampling process while training a representation. Towards achieving this grand goal, a variety of sampling techniques have been proposed. However, most of them either use a fixed sampling scheme or adjust the sampling scheme based on simple heuristics. They cannot choose the best sample for model training in different stages. Inspired by "Thinking, Fast and Slow" (System 1 and System 2) in cognitive science, we propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR) to tackle this challenge. To the best of our knowledge, this is the first work utilizing reinforcement learning (RL) to address the sampling problem in representation learning. Our approach optimally adjusts the sampling process to achieve optimal performance. We explore geographical relationships among samples by distance-based sampling to maximize overall cumulative reward. We apply ASR to the long-standing sampling problems in similarity-based loss functions. Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets. We also discuss an engrossing phenomenon observed in experiments, which we name the "ASR gravity well".
Submitted 13 December, 2022; v1 submitted 17 August, 2022;
originally announced August 2022.
-
A Survey on Self-supervised Pre-training for Sequential Transfer Learning in Neural Networks
Authors:
Huanru Henry Mao
Abstract:
Deep neural networks are typically trained under a supervised learning framework where a model learns a single task using labeled data. Instead of relying solely on labeled data, practitioners can harness unlabeled or related data to improve model performance, which is often more accessible and ubiquitous. Self-supervised pre-training for transfer learning is becoming an increasingly popular technique to improve state-of-the-art results using unlabeled data. It involves first pre-training a model on a large amount of unlabeled data, then adapting the model to target tasks of interest. In this review, we survey self-supervised learning methods and their applications within the sequential transfer learning framework. We provide an overview of the taxonomy for self-supervised learning and transfer learning, and highlight some prominent methods for designing pre-training tasks across different domains. Finally, we discuss recent trends and suggest areas for future investigation.
Submitted 1 July, 2020;
originally announced July 2020.
-
Speech Recognition and Multi-Speaker Diarization of Long Conversations
Authors:
Huanru Henry Mao,
Shuyang Li,
Julian McAuley,
Garrison Cottrell
Abstract:
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when applied to extended multi-speaker conversations. We find that training separate ASR and SD models performs better when utterance boundaries are known but otherwise joint models can perform better. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improve ASR and SD.
Submitted 4 November, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
ReZero is All You Need: Fast Convergence at Large Depth
Authors:
Thomas Bachlechner,
Bodhisattwa Prasad Majumder,
Huanru Henry Mao,
Garrison W. Cottrell,
Julian McAuley
Abstract:
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12-layer Transformers, it converges 56% faster on enwiki8.
Submitted 24 June, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
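The gating idea in the abstract above (each residual connection scaled by a single zero-initialized parameter) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, and `alpha` is kept as a plain float rather than a trainable parameter.

```python
import math


def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [math.tanh(2.0 * xi + 1.0) for xi in x]


def rezero_block(x, sublayer, alpha):
    """Gated residual connection: x + alpha * sublayer(x), elementwise."""
    fx = sublayer(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def rezero_stack(x, alphas):
    # One block per gate; each gate would be a learned scalar in practice.
    for alpha in alphas:
        x = rezero_block(x, toy_sublayer, alpha)
    return x


x0 = [0.5, -1.0, 2.0]
# With every gate at its initial value of zero, a 1000-block stack
# passes the input through unchanged.
out = rezero_stack(x0, [0.0] * 1000)
```

Because every gate starts at zero, the stack is the identity map at initialization no matter how deep it is; during training each gate would move away from zero as its sublayer becomes useful.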
-
Growth, saturation and collapse of laser-driven plasma density gratings
Authors:
H. H. Ma,
S. M. Weng,
P. Li,
X. F. Li,
Y. X. Wang,
S. H. Yew,
M. Chen,
P. McKenna,
Z. M. Sheng
Abstract:
The plasma density grating induced by intersecting intense laser pulses can be utilized as optical compressors, polarizers, waveplates and photonic crystals for the manipulation of ultra-high-power laser pulses. However, the formation and evolution of the plasma density grating are still not fully understood, as linear models are usually adopted to describe them. In this paper, two nonlinear theoretical models are presented to study the formation process of the plasma density grating. In the first model, a nonlinear analytical solution based on the fluid equations is presented, while in the second model a particle-mesh method is adopted to investigate the kinetic effects. It is found that both models can describe the plasma density grating formation at different stages, well beyond the linear growth stage. More importantly, the second model can reproduce the phenomenon of "ion wave-breaking" of the plasma density grating, which eventually induces the saturation of the plasma density grating. Using the second model, the saturation time of the plasma density grating is obtained as a function of laser intensity and plasma density, which can be applied to estimate the lifetime of the plasma density grating in experiments. The results from these two nonlinear models are verified using particle-in-cell simulations.
Submitted 28 July, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
-
Improving Neural Story Generation by Targeted Common Sense Grounding
Authors:
Huanru Henry Mao,
Bodhisattwa Prasad Majumder,
Julian McAuley,
Garrison W. Cottrell
Abstract:
Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary training signals from datasets designed to provide common sense grounding. When combined with our two-stage fine-tuning pipeline, our method achieves improved common sense reasoning and state-of-the-art perplexity on the Writing Prompts (Fan et al., 2018) story generation dataset.
Submitted 27 February, 2020; v1 submitted 25 August, 2019;
originally announced August 2019.
-
LakhNES: Improving multi-instrumental music generation with cross-domain pre-training
Authors:
Chris Donahue,
Huanru Henry Mao,
Yiting Ethan Li,
Garrison W. Cottrell,
Julian McAuley
Abstract:
We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.
Submitted 10 July, 2019;
originally announced July 2019.
-
The NES Music Database: A multi-instrumental dataset with expressive performance attributes
Authors:
Chris Donahue,
Huanru Henry Mao,
Julian McAuley
Abstract:
Existing research on music generation focuses on composition, but often ignores the expressive performance characteristics required for plausible renditions of resultant pieces. In this paper, we introduce the Nintendo Entertainment System Music Database (NES-MDB), a large corpus allowing for separate examination of the tasks of composition and performance. NES-MDB contains thousands of multi-instrumental songs composed for playback by the compositionally-constrained NES audio synthesizer. For each song, the dataset contains a musical score for four instrument voices as well as expressive attributes for the dynamics and timbre of each voice. Unlike datasets comprised of General MIDI files, NES-MDB includes all of the information needed to render exact acoustic performances of the original compositions. Alongside the dataset, we provide a tool that renders generated compositions as NES-style audio by emulating the device's audio processor. Additionally, we establish baselines for the tasks of composition, which consists of learning the semantics of composing for the NES synthesizer, and performance, which involves finding a mapping between a composition and realistic expressive attributes.
Submitted 11 June, 2018;
originally announced June 2018.
-
DeepJ: Style-Specific Music Generation
Authors:
Huanru Henry Mao,
Taylor Shin,
Garrison W. Cottrell
Abstract:
Recent advances in deep neural networks have enabled algorithms to compose music that is comparable to music composed by humans. However, few algorithms allow the user to generate music with tunable parameters. The ability to tune properties of generated music will yield more practical benefits for aiding artists, filmmakers, and composers in their creative tasks. In this paper, we introduce DeepJ - an end-to-end generative model that is capable of composing music conditioned on a specific mixture of composer styles. Our innovations include methods to learn musical style and music dynamics. We use our model to demonstrate a simple technique for controlling the style of generated music as a proof of concept. Evaluation of our model using human raters shows that we have improved over the Biaxial LSTM approach.
Submitted 2 January, 2018;
originally announced January 2018.
-
Liquid-Gated High Mobility and Quantum Oscillation of the Two-Dimensional Electron Gas at an Oxide Interface
Authors:
Shengwei Zeng,
Weiming Lü,
Zhen Huang,
Zhiqi Liu,
Kun Han,
Kalon Gopinadhan,
Changjian Li,
Rui Guo,
Wenxiong Zhou,
Haijiao Harsan Ma,
Linke Jian,
T Venkatesan,
Ariando
Abstract:
The electric field effect in the electronic double layer transistor (EDLT) configuration, with ionic liquids as the dielectric materials, is a powerful means of exploring various properties in different materials. Here we demonstrate the modulation of electrical transport properties and extremely high mobility of the two-dimensional electron gas at the LaAlO$_3$/SrTiO$_3$ (LAO/STO) interface through the ionic liquid-assisted electric field effect. By changing the gate voltages, the depletion of charge carriers and the resultant enhancement of electron mobility up to 19380 cm$^2$/Vs are realized, leading to quantum oscillations of the conductivity at the LAO/STO interface. The present results suggest that high-mobility oxide interfaces which exhibit quantum phenomena could be obtained by the ionic liquid-assisted field effect.
Submitted 28 March, 2016;
originally announced March 2016.