×

Approximate string processing. (English) Zbl 1218.68075

Summary: One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This survey focuses specifically on the problem of approximate string matching, where, given a set of strings \(S\) and a query string \(v\), the goal is to find all strings \(s \in S\) that have a user specified degree of similarity to \(v\). Set \(S\) could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-\(k\) flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.

MSC:

68P05 Data structures
68P10 Searching and sorting
68W32 Algorithms on strings