Survey: finite-state technology in natural language processing. (English) Zbl 1373.68420
This is a survey on the use of finite-state information and technology for natural language processing, in problems such as tokenization, part-of-speech tagging, parsing. All the involved concepts are thoroughly presented and accompanied by examples. It is overall a theoretical material suitable for computer scientists with a background in automata theory.
Reviewer: Ruxandra Stoean (Craiova)
MSC:
68T50 | Natural language processing |
68Q42 | Grammars and rewriting systems |
68Q45 | Formal languages and automata |
Keywords:
natural language processing; finite-state automaton; tokenization; part-of-speech tagging; parsing; tree automaton; context-free grammar; machine translationReferences:
[1] | Manning, C.; Schütze, H., Foundations of Statistical Natural Language Processing (1999), MIT Press · Zbl 0951.68158 |
[2] | Jurafsky, D.; Martin, J. H., Speech and Language Processing (2008), Prentice Hall |
[3] | Hebisch, U.; Weinert, H. J., Semirings — Algebraic Theory and Applications in Computer Science (1998), World Scientific · Zbl 0934.16046 |
[4] | Golan, J. S., Semirings and Their Applications (1999), Kluwer Academic: Kluwer Academic Dordrecht · Zbl 0947.16034 |
[5] | Yu, S., Regular languages, (Rozenberg, G.; Salomaa, A., Handbook of Formal Languages, vol. 1, Ch. II (1997), Springer), 41-110 · Zbl 0866.68057 |
[6] | Schützenberger, M. P., On the definition of a family of automata, Inf. Control, 4, 2-3, 245-270 (1961) · Zbl 0104.00702 |
[7] | (Droste, M.; Kuich, W.; Vogler, H., Handbook of Weighted Automata. Handbook of Weighted Automata, EATCS Monographs on Theoret. Comput. Sci. (2009), Springer) · Zbl 1200.68001 |
[10] | Wirth, N., Compiler Construction (1996), Addison Wesley · Zbl 0851.68013 |
[11] | Xue, N., Chinese word segmentation as character tagging, Comput. Linguist. Chin. Lang. Process., 8, 1, 29-48 (2003) |
[12] | ISO Technical Committee 46, Romanization of Chinese (1991), International Organization for Standardization, Tech. Rep. ISO 7098:1991 |
[13] | Xia, F., The segmentation guidelines for the Penn Chinese treebank (3.0) (2000), University of Pennsylvania, Tech. Rep. IRCS 00-06 |
[14] | Xue, N.; Xia, F.; Chiou, F.-D.; Palmer, M., The Penn Chinese treebank: phrase structure annotation of a large corpus, J. Natur. Lang. Engrg., 11, 2, 207-238 (2002) |
[15] | Sproat, R.; Shih, C.; Gale, W.; Chang, N., A stochastic finite-state word-segmentation algorithm for Chinese, Comput. Linguist., 22, 3, 377-404 (1996) |
[16] | Sproat, R., Lexical analysis, (Dale, R.; Moisl, H.; Somers, H., Handbook of Natural Language Processing, Ch. 3 (2000), Marcel Dekker Inc.), 37-57 |
[17] | Dijkstra, E. W., A note on two problems in connexion with graphs, Numer. Math., 1, 1, 269-271 (1959) · Zbl 0092.16002 |
[18] | Zhao, H.; Kit, C., Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition, (Proc. SIGHAN Workshop on Chinese Language Processing (2008), Association for Computational Linguistics), 106-111 |
[19] | Navigli, R.; Velardi, P., Learning domain ontologies from document warehouses and dedicated web sites, Comput. Linguist., 30, 2, 151-179 (2004) · Zbl 1234.68373 |
[20] | Pang, B.; Lee, L.; Vaithyanathan, S., Thumbs up? Sentiment classification using machine learning techniques, (Proc. EMNLP (2002), Association for Computational Linguistics), 79-86 |
[21] | DeRose, S. J., Grammatical category disambiguation by statistical optimization, Comput. Linguist., 14, 1, 31-39 (1988) |
[22] | Borgmann, D. A., Beyond Language — Adventures in Word and Thought (1967), Charles Scribner’s Sons |
[23] | Marcus, M. P.; Santorini, B.; Marcinkiewicz, M. A., Building a large annotated corpus of English — the Penn treebank, Comput. Linguist., 19, 2, 313-330 (1993) |
[24] | Markov, A. A., An example of statistical investigation of the text of ‘Eugene Onegin’ concerning the connection of samples in chains, (Proc. Royal Academy of Sciences, St. Petersburg, Vol. VI of Ser. 7 (1913)), 153-162 |
[25] | Myung, I. J., Tutorial on maximum likelihood estimation, J. Math. Psych., 47, 90-100 (2003) · Zbl 1023.62112 |
[26] | Viterbi, A. J., Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inform. Theory, 13, 2, 260-269 (1967) · Zbl 0148.40501 |
[27] | Baum, L. E.; Petrie, T., Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Stat., 37, 6, 1554-1563 (1966) · Zbl 0144.40902 |
[28] | Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P., Optimization by simulated annealing, Science, 220, 4598, 671-680 (1983) · Zbl 1225.90162 |
[29] | Černý, V., Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm, J. Optim. Theory Appl., 45, 1, 41-51 (1985) · Zbl 0534.90091 |
[30] | Ravi, S.; Knight, K., Minimized models for unsupervised part-of-speech tagging, (Proc. ACL (2009), Association for Computational Linguistics), 504-512 |
[31] | Garrette, D.; Baldridge, J., Learning a part-of-speech tagger from two hours of annotation, (Proc. NAACL (2013), Association for Computational Linguistics), 138-147 |
[32] | Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B. Stat. Methodol., 39, 1, 1-38 (1977) · Zbl 0364.62022 |
[33] | Baum, L. E.; Petrie, T.; Soules, G.; Weiss, N., A maximization technique occurring in the statictical analysis of probabilistic functions of Markov chains, Ann. Math. Stat., 41, 1, 164-171 (1970) · Zbl 0188.49603 |
[34] | Baker, J., The DRAGON system — an overview, IEEE Trans. Acoust. Speech Signal Process., 23, 1, 24-29 (1975) |
[35] | Nicolai, C., Solving ion channel kinetics with the QuB software, Biophys. Rev. Lett., 8, 03n04, 191-211 (2013) |
[36] | Bishop, M. J.; Thompson, E. A., Maximum likelihood alignment of DNA sequences, J. Mol. Biol., 190, 2, 159-165 (1986) |
[37] | Stigler, J.; Ziegler, F.; Gieseke, A.; Gebhardt, J. C.M.; Rief, M., The complex folding network of single calmodulin molecules, Science, 334, 6055, 512-516 (2011) |
[38] | Lafferty, J.; McCallum, A.; Pereira, F. C.N., Conditional random fields — probabilistic models for segmenting and labeling sequence data, (Proc. ICML (2001)), 282-289 |
[39] | Smith, N. A.; Johnson, M., Weighted and probabilistic context-free grammars are equally expressive, Comput. Linguist., 33, 4, 477-491 (2007) · Zbl 1234.68186 |
[40] | Salomaa, A., Probabilistic and weighted grammars, Inf. Control, 15, 6, 529-544 (1969) · Zbl 0188.03201 |
[41] | Berstel, J.; Reutenauer, C., Recognizable formal power series on trees, Theoret. Comput. Sci., 18, 115-148 (1982) · Zbl 0485.68077 |
[43] | Wilks, Y.; Charniak, E., Computational Semantics — An Introduction to Artificial Intelligence and Natural Language Understanding (1976), North-Holland · Zbl 0353.68075 |
[44] | Blackburn, P.; Bos, J., Representation and Inference for Natural Language — A First Course in Computational Semantics (2005), CSLI Publications |
[45] | Haghighi, A.; Klein, D., Coreference resolution in a modular, entity-centered model, (Proc. NAACL (2010), Association for Computational Linguistics), 385-393 |
[46] | Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; Soricut, R.; Specia, L.; Tamchyna, A., Findings of the 2014 workshop on statistical machine translation, (Proc. WMT (2014), Association for Computational Linguistics), 12-58 |
[47] | Carnie, A., Constituent Structure (2010), Oxford University |
[48] | Tesnière, L., Elements of Structural Syntax (2015), John Benjamins |
[49] | Black, E.; Abney, S.; Flickenger, D.; Gdaniec, C.; Grishman, R.; Harrison, P.; Hindle, D.; Ingria, R.; Jelinek, F.; Klavans, J.; Liberman, M.; Marcus, M.; Roukos, S.; Santorini, B.; Strzalkowski, T., A procedure for quantitatively comparing the syntactic coverage of English grammars, (Proc. HLT (1991), Morgan-Kaufmann), 306-311 |
[50] | Carroll, J.; Briscoe, E.; Sanfilippo, A., Parser evaluation: a survey and a new proposal, (Proc. LREC (1998)), 447-454 |
[51] | Sampson, G.; Babarczy, A., A test of the leaf-ancestor metric for parse accuracy, J. Natur. Lang. Engrg., 9, 4, 365-380 (2003) |
[52] | Schmid, H., Efficient parsing of highly ambiguous context-free grammars with bit vectors, (Proc. CoLing (2004), Association for Computational Linguistics), 162-168 |
[53] | Schmid, H., Trace prediction and recovery with unlexicalized PCFGs and slash features, (Proc. ACL (2006), Association for Computational Linguistics), 177-184 |
[54] | Shindo, H.; Miyao, Y.; Fujino, A.; Nagata, M., Bayesian symbol-refined tree substitution grammars for syntactic parsing, (Proc. ACL (2012), Association for Computational Linguistics), 440-448 |
[55] | Klein, D.; Manning, C. D., Accurate unlexicalized parsing, (Proc. ACL (2003), Association for Computational Linguistics), 423-430 |
[56] | Sekine, S., Evalb — bracket scoring program (2013), online at |
[57] | Bohnet, B.; Nivre, J., A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing, (Proc. EMNLP (2012), Association for Computational Linguistics), 1455-1465 |
[58] | Henderson, J., Discriminative training of a neural network statistical parser, (Proc. ACL (2004), Association for Computational Linguistics), 95-102 |
[59] | Matsuzaki, T.; Miyao, Y.; Tsujii, J., Probabilistic CFG with latent annotations, (Proc. ACL (2005), Association for Computational Linguistics), 75-82 |
[60] | Prescher, D., Inducing head-driven PCFGs with latent heads — refining a tree-bank grammar for parsing, (Proc. ECML (2005), Association for Computational Linguistics), 292-304 |
[61] | Petrov, S.; Barrett, L.; Thibaux, R.; Klein, D., Learning accurate, compact, and interpretable tree annotation, (Proc. ACL (2006), Association for Computational Linguistics), 433-440 |
[62] | Petrov, S., Coarse-to-fine natural language processing (2009), University of California at Bekeley: University of California at Bekeley Berkeley, CA, USA, PhD thesis |
[63] | Allauzen, C.; Riley, M.; Schalkwyk, J.; Skut, W.; Mohri, M., Openfst: a general and efficient weighted finite-state transducer library, (Proc. CIAA. Proc. CIAA, LNCS, vol. 4783 (2007), Springer), 11-23 · Zbl 1139.68352 |
[64] | May, J.; Knight, K., Tiburon: a weighted tree automata toolkit, (Proc. CIAA. Proc. CIAA, LNCS, vol. 4094 (2006), Springer), 102-113 · Zbl 1160.68423 |
[65] | Mohri, M., Finite-state transducers in language and speech processing, Comput. Linguist., 23, 2, 269-311 (1997) |
[66] | Koehn, P., Statistical Machine Translation (2010), Cambridge University Press · Zbl 1202.68446 |
[67] | Brown, P. F.; Della Pietra, S. A.; Della Pietra, V. J.; Mercer, R. L., The mathematics of statistical machine translation — parameter estimation, Comput. Linguist., 19, 2, 263-311 (1993) |
[68] | Chiang, D., An introduction to synchronous grammars (2006), ISI, University of Southern California, Tech. Rep., part of a tutorial given with Kevin Knight at ACL |
[69] | Mi, H.; Huang, L.; Liu, Q., Forest-based translation, (Proc. ACL (2008), Association for Computational Linguistics), 192-199 |
[70] | Mi, H.; Huang, L., Forest-based translation rule extraction, (Proc. EMNLP (2008), Association for Computational Linguistics), 206-214 |
[71] | Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y., Learning phrase representations using RNN encoder-decoder for statistical machine translation, (Proc. EMNLP (2014), Association for Computational Linguistics), 1724-1734 |
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.