Chinese word segmentation and named entity recognition: a pragmatic approach. (English) Zbl 1234.68409
Summary: This article presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words using various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e., morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard that is application-independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different natural language processing applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
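The unified treatment described in the summary can be illustrated with a minimal sketch. It is not the MSRSeg implementation: the toy lexicon, the probabilities, the mixture weights, the digit-only factoid pattern, and the names `candidates`, `segment`, `adapt`, and `FINE_GRAINED` are all assumptions made for illustration. Each candidate word is scored by a class-specific weight times a log-probability, a dynamic-programming search picks the best-scoring segmentation over all word classes at once, and a lookup table stands in for an output adaptor that maps the generic output to a finer-grained standard.

```python
import math
import re

# Hypothetical toy lexicon: word -> log-probability (illustrative values only).
LEXICON = {
    "北京": math.log(0.02),
    "大学": math.log(0.02),
    "北京大学": math.log(0.01),
    "学生": math.log(0.02),
    "生": math.log(0.005),
    "有": math.log(0.03),
}

# Hypothetical mixture weights, one per word class.
WEIGHTS = {"lexicon": 1.0, "factoid": 1.2, "unknown": 1.0}

FACTOID = re.compile(r"[0-9]+")    # assumption: factoids are digit strings only
FACTOID_LOGP = math.log(0.01)      # assumed score for any factoid
UNKNOWN_LOGP = math.log(1e-6)      # assumed penalty for an unlisted character
MAX_LEN = 4                        # longest lexicon word considered


def candidates(text, start):
    """Yield (end, word_class, weighted_score) for words starting at `start`."""
    match = FACTOID.match(text, start)
    if match:
        yield match.end(), "factoid", WEIGHTS["factoid"] * FACTOID_LOGP
    for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
        word = text[start:end]
        if word in LEXICON:
            yield end, "lexicon", WEIGHTS["lexicon"] * LEXICON[word]
    # Fallback so that every position can always be covered.
    yield start + 1, "unknown", WEIGHTS["unknown"] * UNKNOWN_LOGP


def segment(text):
    """Return the best-scoring list of (word, word_class) pairs."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score of a segmentation of text[:i]
    back = [None] * (n + 1)        # back[i]: (start, word_class) of the last word
    best[0] = 0.0
    for i in range(n):
        for end, word_class, score in candidates(text, i):
            if best[i] + score > best[end]:
                best[end] = best[i] + score
                back[end] = (i, word_class)
    words, pos = [], n
    while pos > 0:
        start, word_class = back[pos]
        words.append((text[start:pos], word_class))
        pos = start
    return words[::-1]


# Hypothetical adaptor table for a finer-grained segmentation standard.
FINE_GRAINED = {"北京大学": ["北京", "大学"]}


def adapt(words, rules):
    """Rewrite generic output according to an application-specific standard."""
    return [(part, word_class)
            for word, word_class in words
            for part in rules.get(word, [word])]
```

Under these assumed values, `segment("北京大学有2005学生")` yields `北京大学 / 有 / 2005 / 学生`, with `2005` labelled as a factoid, and `adapt(..., FINE_GRAINED)` splits `北京大学` into `北京 / 大学` while leaving the other units unchanged. Keeping the adaptor separate from the search reflects the summary's point that the same generic output can serve several segmentation standards.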
MSC:
68T50 Natural language processing