
Energy-based models with applications to speech and language processing. (English) Zbl 1536.68017

Summary: Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). In recent years, EBMs have attracted increasing interest not only from core machine learning but also from application domains such as speech, vision, and natural language processing (NLP), with significant theoretical and algorithmic progress. To the best of our knowledge, there are no review papers about EBMs with applications to speech and language processing. The sequential nature of speech and language also presents special challenges and requires treatment different from that of fixed-dimensional data (e.g., images).
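To fix notation for what follows (a generic formulation, not a formula quoted from the monograph itself): an EBM defines a distribution through an energy function \(E_\theta\) only up to a normalizing constant,
\[
p_\theta(x) \;=\; \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z(\theta)}, \qquad Z(\theta) \;=\; \sum_{x'} \exp\bigl(-E_\theta(x')\bigr),
\]
and for sequence-valued \(x\) the partition function \(Z(\theta)\) is generally intractable, which is why the sampling and estimation techniques covered in the monograph (e.g., MCMC and noise-contrastive estimation) are designed to avoid computing it explicitly.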
The purpose of this monograph is to present a systematic introduction to energy-based models, covering both algorithmic progress and applications in speech and language processing, organized into four main sections. First, we will introduce the basics of EBMs, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods, from classic learning algorithms to the most advanced ones. The next three sections will present how to apply EBMs in three different scenarios, i.e., for modeling marginal, conditional and joint distributions, respectively: 1) EBMs for sequential data with applications in language modeling, where we are mainly concerned with the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, with applications in semi-supervised learning and calibrated natural language understanding. In addition, we will introduce some open-source toolkits to help readers become familiar with the techniques for developing and applying energy-based models.
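As a minimal illustration of the first scenario (modeling the marginal distribution of a sequence with an energy-based language model), the sketch below shows how an unnormalized sequence energy can be parameterized by a neural network and used to rerank candidate sequences without ever computing the partition function. The architecture, dimensions, and names are illustrative assumptions, not the monograph's implementation.

```python
# A minimal sketch (assuming PyTorch) of an energy-based sequence model:
# E_theta(x) maps a whole token sequence to a scalar energy, so that
# exp(-E_theta(x)) is an unnormalized probability. Candidate sequences can
# be compared (e.g., reranking ASR hypotheses) without computing Z(theta).
import torch
import torch.nn as nn

class SequenceEnergy(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):                      # tokens: (batch, length) int64
        hidden, _ = self.encoder(self.embed(tokens))
        return self.head(hidden.mean(dim=1)).squeeze(-1)  # (batch,) energies

model = SequenceEnergy()
hypotheses = torch.randint(0, 1000, (4, 12))        # four toy candidate sequences
energies = model(hypotheses)                        # lower energy = higher unnormalized probability
best = hypotheses[energies.argmin()]                # rerank without knowing Z(theta)
```

Fitting such a model is where the learning methods of the first section enter, since the gradient of \(\log Z(\theta)\) cannot be computed exactly; approaches such as MCMC-based maximum likelihood or noise-contrastive estimation sidestep this term.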

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68-02 Research exposition (monographs, survey articles) pertaining to computer science
68T07 Artificial neural networks and deep learning
94A12 Signal theory (characterization, reconstruction, filtering, etc.)
65C05 Monte Carlo methods
62H30 Classification and discrimination; cluster analysis (statistical aspects)
60J22 Computational methods in Markov chains
68R10 Graph theory (including graph drawing) in computer science
65K10 Numerical optimization and variational techniques
65C40 Numerical analysis or methods applied to Markov chains

References:

[1] F. Amaya and J. M. Benedi, “Improvement of a whole sentence maximum entropy language model using grammatical features,” in Proc. Ann. Meeting of the Association for Computational Linguistics (ACL), 2001.
[2] K. An, H. Xiang, and Z. Ou, “CAT: A CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end approaches towards data efficiency and low latency,” in INTERSPEECH, 2020.
[3] K. An, H. Zheng, Z. Ou, H. Xiang, K. Ding, and G. Wan, “Cuside: Chunking, simulating future context and decoding for streaming ASR,” in INTERSPEECH, 2022.
[4] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins, “Globally normalized transition-based neural networks,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
[5] C. Andrieu, É. Moulines, and P. Priouret, “Stability of stochastic approximation under verifiable conditions,” SIAM Journal on control and optimization, vol. 44, no. 1, 2005, pp. 283-312. · Zbl 1083.62073
[6] C. Andrieu and J. Thoms, “A tutorial on adaptive MCMC,” Statistics and Computing, vol. 18, no. 4, 2008, pp. 343-373.
[7] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in NIPS, 2007.
[8] T. Artieres et al., “Neural conditional random fields,” in AISTATS, 2010.
[9] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, and A. Szlam, “Real or fake? learning to discriminate machine from human generated text,” arXiv preprint arXiv:1906.03351, 2019.
[10] D. Belanger and A. McCallum, “Structured Prediction Energy Networks,” in ICML, 2016.
[11] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” Advances in neural information processing systems, 2015.
[12] A. Benveniste, M. Métivier, and P. Priouret, Adaptive algorithms and stochastic approximations. New York: Springer, 1990. · Zbl 0752.93073
[13] J. E. Besag, “Comments on ‘Representations of knowledge in complex systems’ by U. Grenander and M. I. Miller,” Journal of the Royal Statistical Society: Series B, vol. 56, 1994, pp. 549-581. · Zbl 0814.62009
[14] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006. · Zbl 1107.68072
[15] G. Bouchard, “Bias-variance tradeoff in hybrid generative-discriminative models,” in International Conference on Machine Learning and Applications (ICMLA), 2007.
[16] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017.
[17] S. P. Chatzis and Y. Demiris, “The Infinite-Order Conditional Random Field Model for Sequential Data Modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[18] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” in INTERSPEECH, 2014.
[19] H. Chen, Stochastic approximation and its applications. Springer Science & Business Media, 2002. · Zbl 1008.62071
[20] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, 1999, pp. 359-394.
[21] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient Hamiltonian Monte Carlo,” in ICML, 2014.
[22] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv:2002.05709, 2020.
[23] X. Chen, X. Liu, Y. Wang, A. Ragni, J. H. Wong, and M. J. Gales, “Exploiting future word contexts in neural network language models for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, 2019, pp. 1444-1454.
[24] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018.
[25] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” arXiv preprint arXiv:1412.1602, 2014.
[26] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” in International Conference on Learning Representations (ICLR), 2020.
[27] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Pre-training transformers as energy-based cloze models,” Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[28] K. Clark, M.-T. Luong, C. D. Manning, and Q. Le, “Semi-supervised sequence modeling with cross-view training,” in EMNLP, 2018.
[29] M. Collins and B. Roark, “Incremental Parsing with the Perceptron Algorithm,” in ACL, 2004.
[30] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of machine learning research, vol. 12, no. Aug, 2011, pp. 2493-2537. · Zbl 1280.68161
[31] T. M. Cover, Elements of information theory. John Wiley & Sons, 1999.
[32] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems. Springer-Verlag, 1999. · Zbl 0937.68121
[33] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment: Practical automated data augmentation with a reduced search space,” in CVPR, 2020.
[34] X. Cui, B. Kingsbury, G. Saon, D. Haws, and Z. Tuske, “Reducing exposure bias in training recurrent neural network transducers,” in INTERSPEECH, 2021.
[35] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, 2012, pp. 30-42.
[36] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad GAN,” in NIPS, 2017.
[37] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, “The helmholtz machine,” Neural computation, vol. 7, no. 5, 1995, pp. 889-904.
[38] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, 1977. · Zbl 0364.62022
[39] Y. Deng, A. Bakhtin, M. Ott, A. Szlam, and M. Ranzato, “Residual energy-based models for text generation,” in ICLR, 2020.
[40] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018, pp. 4171-4186.
[41] G. Durrett and D. Klein, “Neural CRF Parsing,” in ACL, 2015.
[42] B. J. Frey and N. Jojic, “A comparison of algorithms for inference and learning in probabilistic graphical models,” IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 27, no. 9, 2005, pp. 1392-1416.
[43] S. Gao, Z. Ou, W. Yang, and H. Xu, “Integrating discrete and neural features via mixed-feature trans-dimensional random field language models,” in ICASSP, 2020.
[44] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, “Mask-predict: Parallel decoding of conditional masked language models,” arXiv preprint arXiv:1904.09324, 2019.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[46] J. Goodman, “A bit of progress in language modeling,” Computer Speech & Language, vol. 15, 2001, pp. 403-434.
[47] K. Goyal, C. Dyer, and T. Berg-Kirkpatrick, “Exposing the implicit energy networks behind masked language models via metropolis-hastings,” in International conference on learning representations, 2022.
[48] W. Grathwohl, K. Swersky, M. Hashemi, D. Duvenaud, and C. Maddison, “Oops I took a gradient: Scalable sampling for discrete distributions,” in International Conference on Machine Learning, 2021.
[49] W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky, “Your classifier is secretly an energy based model and you should treat it like one,” in ICLR, 2020.
[50] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[51] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
[52] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden conditional random fields for phone classification,” in Ninth European Conference on Speech Communication and Technology (EUROSPEECH), 2005.
[53] C. E. Guo, S. C. Zhu, and Y. N. Wu, “Modeling visual patterns by integrating descriptive and generative methods.,” International Journal of Computer Vision, vol. 53, no. 1, 2003, pp. 5-29. · Zbl 1477.68361
[54] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proceedings of the 34th International Conference on Machine Learning, 2017.
[55] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS, 2010.
[56] M. U. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.,” Journal of machine learning research, vol. 13, no. 2, 2012. · Zbl 1283.62064
[57] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, 2018, pp. 1949-1961.
[58] T. Han, E. Nijkamp, X. Fang, M. Hill, S.-C. Zhu, and Y. N. Wu, “Divergence triangle for joint training of generator model, energy-based model, and inferential model,” in CVPR, 2019.
[59] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer, 2009. · Zbl 1273.62005
[60] T. He, B. McCann, C. Xiong, and E. Hosseini-Asl, “Joint energy-based model training for better calibrated natural language understanding models,” preprint arXiv:2101.06829, 2021.
[61] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, 2002, pp. 1771-1800. · Zbl 1010.68111
[62] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The wake-sleep algorithm for unsupervised neural networks.,” Science, vol. 268, no. 5214, 1995, pp. 1158-1161.
[63] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, 2006, pp. 1527-1554. · Zbl 1106.68094
[64] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in International Conference on Learning Representations (ICLR), 2019.
[65] K. Hu, Z. Ou, M. Hu, and J. Feng, “Neural CRF transducers for sequence labeling,” in ICASSP, 2019.
[66] Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv:1508.01991, 2015.
[67] P. Huembeli, J. M. Arrazola, N. Killoran, M. Mohseni, and P. Wittek, “The physics of energy-based models,” Quantum Machine Intelligence, vol. 4, no. 1, 2022, p. 1.
[68] A. Hyvärinen and P. Dayan, “Estimation of non-normalized statistical models by score matching,” Journal of Machine Learning Research, vol. 6, no. 4, 2005. · Zbl 1222.62051
[69] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of the IEEE, vol. 64, no. 4, 1976, pp. 532-556.
[70] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol. 37, 1999, pp. 183-233. · Zbl 0945.68164
[71] M. I. Jordan, “Graphical models,” Statistical science, vol. 19, no. 1, 2004, pp. 140-155. · Zbl 1057.62001
[72] M. Khalifa, H. Elsahar, and M. Dymetman, “A distributional approach to controlled text generation,” in International conference on learning representations, 2021.
[73] K. Kim, J. Oh, J. Gardner, A. B. Dieng, and H. Kim, “Markov chain score ascent: A unifying framework of variational inference with markovian gradients,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[74] T. Kim and Y. Bengio, “Deep directed generative models with energy-based probability estimation,” in ICLR Workshop, 2016.
[75] D. P. Kingma, M. Welling, et al., “An introduction to variational autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, 2019, pp. 307-392.
[76] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in NIPS, 2014.
[77] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. MIT press, 2009. · Zbl 1183.68483
[78] V. Kuleshov and S. Ermon, “Neural variational inference and learning in undirected graphical models,” in NIPS, 2017.
[79] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in International conference on Machine learning (ICML), 2001.
[80] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in ICLR, 2017.
[81] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural Architectures for Named Entity Recognition,” in NAACL-HLT, 2016.
[82] H. Larochelle, M. I. Mandel, R. Pascanu, and Y. Bengio, “Learning algorithms for the classification restricted Boltzmann machine,” Journal of Machine Learning Research, vol. 13, no. 1, 2012, pp. 643-669. · Zbl 1283.68293
[83] F. Liang, C. Liu, and R. J. Carroll, “Stochastic approximation in Monte Carlo computation,” Journal of the American Statistical Association, vol. 102, no. 477, 2007, pp. 305-320. · Zbl 1226.65002
[84] P. Liang and M. I. Jordan, “An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators,” in International conference on Machine learning (ICML), pp. 584-591, 2008.
[85] W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, and T. Luis, “Finding function in form: Compositional character models for open vocabulary word representation,” in EMNLP, 2015.
[86] H. Liu and Z. Ou, “Exploring energy-based language models with different architectures and training methods for speech recognition,” in INTERSPEECH, 2023.
[87] J. S. Liu, Monte Carlo strategies in scientific computing, vol. 10. Springer, 2001. · Zbl 0991.65001
[88] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” ArXiv, vol. abs/1907.11692, 2019.
[89] L. Lu, L. Kong, C. Dyer, N. A. Smith, and S. Renals, “Segmental recurrent neural networks for end-to-end speech recognition,” in INTERSPEECH, 2016.
[90] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “RWTH ASR systems for librispeech: Hybrid vs attention,” in INTERSPEECH, 2019.
[91] Y.-A. Ma, T. Chen, and E. Fox, “A complete recipe for stochastic gradient MCMC,” in NIPS, 2015.
[92] X. Ma and E. Hovy, “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF,” in ACL, 2016.
[93] Z. Ma and M. Collins, “Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency,” EMNLP, 2018.
[94] D. J. MacKay, Information theory, inference and learning algorithms. Cambridge University Press, 2003. · Zbl 1055.94001
[95] M. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of English: The Penn Treebank,” 1993.
[96] S. Martin, J. Liermann, and H. Ney, “Algorithms for bigram and trigram word clustering,” Speech Communication, vol. 24, 1998, pp. 19-37.
[97] A. McCallum, D. Freitag, and F. Pereira, “Maximum entropy markov models for information extraction and segmentation.,” in ICML, 2000.
[98] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Z. Hakkani-Tür, X. He, L. P. Heck, G. Tür, D. Yu, and G. Zweig, “Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 23, 2015, pp. 530-539.
[99] N. Miao, H. Zhou, L. Mou, R. Yan, and L. Li, “CGMH: Constrained sentence generation by Metropolis-Hastings sampling,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[100] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in ASRU, 2015.
[101] T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[102] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, pp. 3111-3119, 2013.
[103] B. Millidge, Y. Song, T. Salvatori, T. Lukasiewicz, and R. Bogacz, “Backpropagation at the infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium propagation, and contrastive Hebbian learning,” in International Conference on Machine Learning, 2023.
[104] T. Minka, “Divergence measures and message passing,” Microsoft Research Technical Report, 2005.
[105] F. Mireshghallah, K. Goyal, and T. Berg-Kirkpatrick, “Mix and match: Learning-free controllable text generation using energy language models,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
[106] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: A regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, 2018, pp. 1979-1993.
[107] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” in Springer Handbook of Speech Processing, Springer, 2008, pp. 559-584.
[108] L.-P. Morency, A. Quattoni, and T. Darrell, “Latent-Dynamic Discriminative Models for Continuous Gesture Recognition,” in CVPR, 2007.
[109] Y. Mroueh, C.-L. Li, T. Sercu, A. Raj, and Y. Cheng, “Sobolev GAN,” in ICLR, 2018.
[110] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press, 2012. · Zbl 1295.68003
[111] C. Naesseth, F. Lindsten, and D. Blei, “Markovian score climbing: Variational inference with KL(p||q),” Advances in Neural Information Processing Systems (NeurIPS), 2020.
[112] R. M. Neal, Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Canada, 1993.
[113] R. M. Neal, “MCMC using Hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, 2011. · Zbl 1229.65018
[114] R. M. Neal and G. E. Hinton, “A view of the em algorithm that justifies incremental, sparse, and other variants,” in Learning in graphical models, Springer, 1998, pp. 355-368. · Zbl 0916.62019
[115] R. M. Neal, “Connectionist learning of belief networks,” Artificial Intelligence, vol. 56, 1992, pp. 71-113. · Zbl 0761.68081
[116] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” Advances in neural information processing systems, vol. 14, 2001.
[117] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, “Learning deep energy models,” in International conference on machine learning (ICML), 2011.
[118] S. Nowozin, “Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference,” in International conference on learning representations, 2018.
[119] A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow, “Realistic evaluation of semi-supervised learning algorithms,” in ICLR, 2018.
[120] M. Ostendorf, “Continuous-space language processing: Beyond word embeddings,” in International Conference on Statistical Language and Speech Processing, 2016.
[121] Z. Ou, “A review of learning with deep generative models from perspective of graphical modeling,” arXiv preprint arXiv:1808.01630, 2018.
[122] Z. Ou and Y. Song, “Joint stochastic approximation and its application to learning discrete latent variable models,” in Conference on Uncertainty in Artificial Intelligence, PMLR, pp. 929-938, 2020.
[123] Z. Ou and J. Xiao, “A study of large vocabulary speech recognition decoding using finite-state graphs,” in The 7th International Symposium on Chinese Spoken Language Processing, 2010.
[124] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE international conference on acoustics, speech and signal processing (ICASSP), 2015.
[125] T. Parshakova, J.-M. Andreoli, and M. Dymetman, “Global autoregressive models for data-efficient sequence learning,” arXiv preprint arXiv:1909.07063, 2019.
[126] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
[127] J. Peng, L. Bo, and J. Xu, “Conditional Neural Fields,” in NIPS, 2009.
[128] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014.
[129] S. D. Pietra, V. D. Pietra, and J. Lafferty, “Inducing features of random fields,” IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 19, 1997, pp. 380-393.
[130] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A diffusion probabilistic model for text-to-speech,” in International Conference on Machine Learning, PMLR, pp. 8599-8608, 2021.
[131] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in INTERSPEECH, 2016.
[132] L. Qin, S. Welleck, D. Khashabi, and Y. Choi, “Cold decoding: Energy-based constrained text generation with Langevin dynamics,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[133] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, 1989, pp. 257-286.
[134] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[135] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, 2019, p. 9.
[136] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, 2020, pp. 1-67.
[137] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” in EMNLP, 2016.
[138] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
[139] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning with ladder networks,” in NIPS, 2015.
[140] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
[141] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, 1951, pp. 400-407. · Zbl 0054.05901
[142] G. O. Roberts and J. S. Rosenthal, “Examples of adaptive MCMC,” Journal of Computational and Graphical Statistics, vol. 18, no. 2, 2009, pp. 349-367.
[143] G. O. Roberts and R. L. Tweedie, “Exponential convergence of Langevin distributions and their discrete approximations,” Bernoulli, vol. 2, 1996, pp. 341-363. · Zbl 0870.60027
[144] R. Rosenfeld, S. F. Chen, and X. Zhu, “Whole-sentence exponential language models: A vehicle for linguistic-statistical integration,” Computer Speech & Language, vol. 15, 2001, pp. 55-73.
[145] T. Ruokolainen, T. Alumae, and M. Dobrinkat, “Using dependency grammar features in whole sentence maximum entropy language model for speech recognition,” in Baltic HLT, 2010.
[146] S. Russell and P. Norvig, Artificial intelligence: a modern approach (3rd). Upper Saddle River, Prentice-Hall, 2010.
[147] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” Journal of Machine Learning Research, vol. 5, no. 2, 2009, pp. 1967-2006.
[148] R. Salakhutdinov, “Learning deep generative models,” Ph.D. thesis, University of Toronto, 2009.
[149] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in NIPS, 2016.
[150] S. Sarawagi and W. W. Cohen, “Semi-Markov Conditional Random Fields for Information Extraction,” in NIPS, 2004.
[151] R. Sarikaya, S. F. Chen, A. Sethy, and B. Ramabhadran, “Impact of word classing on shrinkage-based language models,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[152] I. Sato and H. Nakagawa, “Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process,” in ICML, 2014.
[153] K. Sato and Y. Sakakibara, “RNA secondary structural alignment with conditional random fields,” Bioinformatics, vol. 21, 2005, pp. 237-42.
[154] L. K. Saul, T. Jaakkola, and M. I. Jordan, “Mean field theory for sigmoid belief networks,” Journal of artificial intelligence research, vol. 4, no. 1, 1996, pp. 61-76. · Zbl 0900.68379
[155] B. Scellier and Y. Bengio, “Equilibrium propagation: Bridging the gap between energy-based models and backpropagation,” Frontiers in computational neuroscience, vol. 11, 2017, p. 24.
[156] H. Schwenk, “Continuous space language models,” Computer Speech & Language, vol. 21, 2007, pp. 492-518.
[157] H. Scudder, “Probability of error of some adaptive pattern-recognition machines,” IEEE Transactions on Information Theory, vol. 11, no. 3, 1965, pp. 363-371. · Zbl 0133.12704
[158] N. Shazeer, J. Pelemans, and C. Chelba, “Sparse non-negative matrix language modeling for skip-grams,” in INTERSPEECH, 2015.
[159] A. Søgaard and Y. Goldberg, “Deep multi-task learning with low level tasks supervised at lower layers,” in ACL, pp. 231-235, 2016.
[160] K. Sohn, D. Berthelot, C.-L. Li, et al., “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv:2001.07685, 2020.
[161] Q. Song, M. Wu, and F. Liang, “Weak convergence rates of population versus single-chain stochastic approximation MCMC algorithms,” Advances in Applied Probability, vol. 46, no. 4, 2014, pp. 1059-1083. · Zbl 1305.60065
[162] Y. Song and Z. Ou, “Learning neural random fields with inclusive auxiliary generators,” arXiv preprint arXiv:1806.00271, 2018.
[163] Y. Song, Z. Ou, Z. Liu, and S. Yang, “Upgrading CRFs to JRFs and its benefits to sequence modeling and labeling,” in ICASSP, 2020.
[164] Y. Song, H. Zheng, and Z. Ou, “An empirical comparison of joint-training and pre-training for domain-agnostic semi-supervised learning via energy-based models,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2021.
[165] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” in ICML, 2016.
[166] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, 2014. · Zbl 1318.68153
[167] W. Sun, Z. Tu, and A. Ragni, “Energy-based models for speech synthesis,” arXiv preprint arXiv:2310.12765, 2023.
[168] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in INTERSPEECH, pp. 194-197, 2012.
[169] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
[170] C. Sutton, A. McCallum, et al., “An introduction to conditional random fields,” Foundations and Trends® in Machine Learning, vol. 4, no. 4, 2012, pp. 267-373. · Zbl 1253.68001
[171] Z. Tan, “Optimally adjusted mixture sampling and locally weighted histogram analysis,” Journal of Computational and Graphical Statistics, vol. 26, 2017, pp. 54-65.
[172] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NIPS, 2017.
[173] L. Theis, A. V. Den Oord, and M. Bethge, “A note on the evaluation of generative models,” in ICLR, 2016.
[174] T. Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient,” in ICML, 2008.
[175] S. Toshniwal, A. Kannan, et al., “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in SLT, 2018.
[176] L. Tu and K. Gimpel, “Learning Approximate Inference Networks for Structured Prediction,” in ICLR, 2018.
[177] Z. Tüske, K. Audhkhasi, and G. Saon, “Advancing sequence-to-sequence based speech recognition,” in INTERSPEECH, 2019.
[178] E. Variani, K. Wu, M. D. Riley, D. Rybach, M. Shannon, and C. Allauzen, “Global normalization for streaming speech recognition in a modular framework,” Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 4257-4269.
[179] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, 2017.
[180] M. J. Wainwright, M. I. Jordan, et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1-2, 2008, pp. 1-305. · Zbl 1193.62107
[181] A. Wang and K. Cho, “BERT has a mouth, and it must speak: BERT as a Markov random field language model,” in Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, 2019.
[182] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations (ICLR), 2019.
[183] B. Wang, “Statistical language models based on trans-dimensional random fields,” Ph.D. thesis, Tsinghua University, 2018.
[184] B. Wang and Z. Ou, “Language modeling with neural trans-dimensional random fields,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[185] B. Wang and Z. Ou, “Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation,” in IEEE Spoken Language Technology Workshop (SLT), 2018.
[186] B. Wang and Z. Ou, “Learning neural trans-dimensional random field language models with noise-contrastive estimation,” in ICASSP, 2018.
[187] B. Wang, Z. Ou, Y. He, and A. Kawamura, “Model interpolation with trans-dimensional random field language models for speech recognition,” arXiv preprint arXiv:1603.09170, 2016.
[188] B. Wang, Z. Ou, and Z. Tan, “Trans-dimensional random fields for language modeling,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 785-794, 2015.
[189] B. Wang, Z. Ou, and Z. Tan, “Learning trans-dimensional random fields with applications to language modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, 2018, pp. 876-890.
[190] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in ICML, 2011.
[191] R. J. Williams and D. Zipser, “A learning algorithm for continu-ally running fully recurrent neural networks,” Neural computa-tion, vol. 1, no. 2, 1989, pp. 270-280.
[192] S. Wiseman and A. M. Rush, “Sequence-to-sequence learning as beam-search optimization,” in EMNLP, 2016.
[193] H. Xiang and Z. Ou, “CRF-based single-stage acoustic modeling with CTC topology,” in ICASSP, pp. 5676-5680, 2019.
[194] J. Xie, Y. Lu, R. Gao, S.-C. Zhu, and Y. N. Wu, “Cooperative training of descriptor and generator networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, 2018, pp. 27-45.
[195] J. Xie, Y. Lu, S.-C. Zhu, and Y. Wu, “A theory of generative convnet,” in ICML, 2016.
[196] H. Xu and Z. Ou, “Joint stochastic approximation learning of helmholtz machines,” in ICLR Workshop Track, 2016.
[197] L. Younes, “Parametric inference for imperfectly observed Gibbsian fields,” Probability Theory and Related Fields, vol. 82, 1989, pp. 625-645. · Zbl 0659.62115
[198] F. Yu, Z. Yao, X. Wang, K. An, L. Xie, Z. Ou, B. Liu, X. Li, and G. Miao, “The slt 2021 children speech recognition challenge: Open datasets, rules and baselines,” in IEEE Spoken Language Technology Workshop (SLT), 2021.
[199] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv:1409.2329, 2014.
[200] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of generalized full-sum HMM training,” in INTERSPEECH, 2017.
[201] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng, D. Wu, and Z. Peng, “Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022.
[202] L. Zhang, D. M. Blei, and C. A. Naesseth, “Transport score climbing: Variational inference using forward kl and adaptive neural transport,” arXiv preprint arXiv:2202.01841, 2022.
[203] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” in Interna-tional Conference on Learning Representations, 2020.
[204] X. Zhang, Z. Tan, and Z. Ou, “Persistently trained, diffusion-assisted energy-based models,” Stat, 2023. · Zbl 07858749
[205] Y. Zhang, X. Sun, S. Ma, Y. Yang, and X. Ren, “Does Higher Order LSTM Have Better Accuracy for Segmenting and Labeling Sequence Data?” In COLING, 2018.
[206] Y. Zhang, Z. Ou, M. Hu, and J. Feng, “A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning,” in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[207] S. Zhao, J.-H. Jacobsen, and W. Grathwohl, “Joint energy-based models for semi-supervised classification,” in ICML Workshop on Uncertainty and Robustness in Deep Learning, 2020.
[208] H. Zheng, K. An, and Z. Ou, “Efficient neural architecture search for end-to-end speech recognition via straight-through gradients,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.
[209] H. Zheng, K. An, Z. Ou, C. Huang, K. Ding, and G. Wan, “An empirical study of language model integration for transducer based speech recognition,” in INTERSPEECH, 2022.
[210] H. Zheng, W. Peng, Z. Ou, and J. Zhang, “Advancing CTC-CRF based end-to-end speech recognition with wordpieces and conformers,” arXiv preprint arXiv:2107.03007, 2021.
[211] C. Zhu, K. An, H. Zheng, and Z. Ou, “Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
[212] X. Zhu, “Semi-supervised learning literature survey,” Technical report, University of Wisconsin-Madison, 2006.
[213] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in Proceedings of the IEEE international conference on computer vision, pp. 19-27, 2015.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.