
A new indexing method based on word proximity for Chinese text retrieval. (English) Zbl 0969.68626

Summary: This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, suffer from a high rate of false drops caused by mismatches between queries and documents. Word-based indexing systems, on the other hand, find it difficult to identify efficiently all the proper nouns, domain-specific terminology, and phrases. The new indexing method uses both the proximity and the mutual information of word pairs to represent the text content, so as to overcome the high false drop, new word, and phrase problems of character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
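The summary names the ingredients of the method (character bi-grams as the baseline, and word pairs scored by proximity and mutual information) but gives no formulas. The following is only a minimal sketch of those general ideas, not the paper's scheme: it assumes the text is already segmented into words, takes "proximity" to mean co-occurrence within a fixed window, and uses standard pointwise mutual information. The names `bigrams`, `proximity_pairs`, `pair_pmi`, and the `window` parameter are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Overlapping character bi-grams (the character-based baseline)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def proximity_pairs(tokens, window=3):
    """Yield unordered word pairs that co-occur within `window` positions.

    Assumption: proximity is modelled as a fixed-size window; the paper
    may define it differently.
    """
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                yield tuple(sorted((left, right)))


def pair_pmi(docs, window=3):
    """Pointwise mutual information of each proximity pair.

    `docs` is a list of token lists (pre-segmented text). Probabilities
    are plain relative frequencies with no smoothing.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        word_counts.update(tokens)
        pair_counts.update(proximity_pairs(tokens, window))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_pair = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores


if __name__ == "__main__":
    print(bigrams("中文检索"))
    docs = [["中文", "文本", "检索"], ["中文", "信息", "检索"]]
    for pair, score in sorted(pair_pmi(docs).items()):
        print(pair, round(score, 3))
```

In an index built along these lines, the pairs with high scores would serve as indexing terms in place of isolated bi-grams or dictionary words; how the paper combines proximity and mutual information into a matching score is not stated in the summary.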

MSC:

68U99 Computing methodologies and applications
68P20 Information storage and retrieval of data
