×

Fast approximate search in large dictionaries. (English) Zbl 1234.68424

Summary: The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input \(P\) is the set of all words in the dictionary for which the Levenshtein distance to \(P\) does not exceed a given (small) bound \(k\). In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a “universal Levenshtein automaton”, we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly the choice of an optimal filtering method depends on the length of the input words.

MSC:

68T50 Natural language processing
68Q45 Formal languages and automata

References:

[1] DOI: 10.1016/0306-4573(83)90022-5 · doi:10.1016/0306-4573(83)90022-5
[2] DOI: 10.1142/S0218001495000389 · doi:10.1142/S0218001495000389
[3] DOI: 10.1007/PL00009253 · Zbl 0913.68050 · doi:10.1007/PL00009253
[4] DOI: 10.1016/S0019-9958(60)90272-2 · Zbl 0087.32606 · doi:10.1016/S0019-9958(60)90272-2
[5] DOI: 10.1162/089120100561601 · Zbl 1232.68081 · doi:10.1162/089120100561601
[6] DOI: 10.1145/366862.366913 · doi:10.1145/366862.366913
[7] DOI: 10.1142/S0218001495000080 · doi:10.1142/S0218001495000080
[8] DOI: 10.1142/9789812830968_0008 · doi:10.1142/9789812830968_0008
[9] DOI: 10.1016/0304-3975(92)90138-6 · Zbl 0747.68022 · doi:10.1016/0304-3975(92)90138-6
[10] DOI: 10.1002/spe.4380240105 · doi:10.1002/spe.4380240105
[11] DOI: 10.1145/146370.146380 · doi:10.1145/146370.146380
[12] Levenshtein Vladimir I, Soviet Physics-Doklady 10 pp 707– (1966)
[13] DOI: 10.1007/BF01185432 · Zbl 0941.68560 · doi:10.1007/BF01185432
[14] DOI: 10.1145/375360.375365 · doi:10.1145/375360.375365
[15] DOI: 10.1016/S0020-0190(99)00121-0 · Zbl 1338.68305 · doi:10.1016/S0020-0190(99)00121-0
[16] Odell Margaret K, Patent Number 1 pp 261– (1918)
[17] Odell Margaret K, Patent Number 1 pp 435– (1992)
[18] Oflazer Kemal, Computational Linguistics 22 (1) pp 73– (1996)
[19] DOI: 10.1016/S0031-3203(96)00101-X · doi:10.1016/S0031-3203(96)00101-X
[20] DOI: 10.1002/spe.4380180407 · doi:10.1002/spe.4380180407
[21] Riseman Edward M, IEEE Transactions on Computers, C-20(4):397-403. (1971)
[22] DOI: 10.1007/s10032-002-0082-8 · Zbl 1039.68137 · doi:10.1007/s10032-002-0082-8
[23] DOI: 10.1016/0031-3203(95)00102-6 · doi:10.1016/0031-3203(95)00102-6
[24] DOI: 10.1016/0031-3203(90)90070-2 · doi:10.1016/0031-3203(90)90070-2
[25] DOI: 10.1145/357423.357428 · doi:10.1145/357423.357428
[26] DOI: 10.1016/0031-3203(90)90023-E · doi:10.1016/0031-3203(90)90023-E
[27] DOI: 10.1016/S0019-9958(85)80046-2 · Zbl 0575.68090 · doi:10.1016/S0019-9958(85)80046-2
[28] DOI: 10.1016/0196-6774(85)90023-9 · Zbl 0566.68072 · doi:10.1016/0196-6774(85)90023-9
[29] DOI: 10.1016/0304-3975(92)90143-4 · Zbl 0747.68026 · doi:10.1016/0304-3975(92)90143-4
[30] DOI: 10.1093/comjnl/20.2.141 · Zbl 0359.68121 · doi:10.1093/comjnl/20.2.141
[31] DOI: 10.1145/321796.321811 · Zbl 0278.68032 · doi:10.1145/321796.321811
[32] DOI: 10.1016/0031-3203(90)90071-R · doi:10.1016/0031-3203(90)90071-R
[33] DOI: 10.1145/135239.135244 · doi:10.1145/135239.135244
[34] DOI: 10.1002/spe.4380250307 · doi:10.1002/spe.4380250307
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.