×

Compressed multiple pattern matching. (English) Zbl 07559181

Pisanti, Nadia (ed.) et al., 30th annual symposium on combinatorial pattern matching, CPM 2019, Pisa, Italy, June 18–20, 2019. Proceedings. Wadern: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 128, Article 13, 14 p. (2019).
Summary: Given \(d\) strings over the alphabet \(\{0,1,\dots,\sigma-1\}\), the classical Aho-Corasick data structure allows us to find all occ occurrences of the strings in any text \(T\) in \(O(|T|+\text{occ})\) time using \(O(m\log m)\) bits of space, where \(m\) is the number of edges in the trie containing the strings. Fix any constant \(\varepsilon\in(0,2)\). We describe a compressed solution for the problem that, provided \(\sigma\le m^\delta\) for a constant \(\delta <1\), works in \(O(|T|\frac{1}{\varepsilon}\log\frac{1}{\varepsilon})+\text{occ}\) time, which is \(O(|T|+\text{occ})\) since \(\varepsilon\) is constant, and occupies \(mH_k+ 1.443m+ \varepsilon m+ O(d\log\frac{m}{d})\) bits of space, for all \(0\le k\le\max\{0,\alpha\log_\sigma m-2\}\) simultaneously, where \(\alpha\in(0,1)\) is an arbitrary constant and \(H_k\) is the \(k\)th-order empirical entropy of the trie. Hence, we reduce the \(3.443m\) term in the space bounds of previously best succinct solutions to \((1.443+\varepsilon)m\), thus solving an open problem posed by D. Belazzougui [Lect. Notes Comput. Sci. 6129, 88–100 (2010; Zbl 1286.68521)]. Further, we notice that \(L= \log\binom{\sigma(m+1)}{m} - O(\log(\sigma m))\) is a worst-case space lower bound for any solution of the problem and, for \(d= o(m)\) and constant \(\varepsilon\), our approach allows to achieve \(L+ \varepsilon m\) bits of space, which gives an evidence that, for \(d= o(m)\), the space of our data structure is theoretically optimal up to the \(\varepsilon m\) additive term and it is hardly possible to eliminate the term \(1.443m\). In addition, we refine the space analysis of previous works by proposing a more appropriate definition for \(H_k\). We also simplify the construction for practice adapting the fixed block compression boosting technique, then implement our data structure, and conduct a number of experiments showing that it is comparable to the state of the art in terms of time and is superior in space.
For the entire collection see [Zbl 1414.68007].

MSC:

68W32 Algorithms on strings

Citations:

Zbl 1286.68521

References:

[1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, 1975. doi:10.1145/360825.360855. · Zbl 0301.68048 · doi:10.1145/360825.360855
[2] Jarno Alanko and Tuukka Norri. Greedy shortest common superstring approximation in compact space. In Proc. SPIRE, volume 10508 of LNCS, pages 1-13. Springer, 2017. doi: 10.1007/978-3-319-67428-5_1. · Zbl 1454.68195 · doi:10.1007/978-3-319-67428-5_1
[3] Djamal Belazzougui. Succinct dictionary matching with no slowdown. In Proc. CPM, volume 6129 of LNCS, pages 88-100. Springer, 2010. doi:10.1007/978-3-642-13509-5_9. · Zbl 1286.68521 · doi:10.1007/978-3-642-13509-5_9
[4] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
[5] Ho-Leung Chan, Wing-Kai Hon, Tak-Wah Lam, and Kunihiko Sadakane. Dynamic dictionary matching and compressed suffix trees. In Proc. SODA, pages 13-22. SIAM, 2005. · Zbl 1297.68063
[6] David Clark. Compact pat trees. PhD thesis, University of Waterloo, 1997.
[7] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In Proc. ESA, volume 9294 of LNCS, pages 361-372. · Zbl 1443.68218
[8] Springer, 2015. doi:10.1007/978-3-662-48350-3_31. · Zbl 1443.68218 · doi:10.1007/978-3-662-48350-3_31
[9] Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific Publishing Co. Pte. Ltd., 2002. · Zbl 1078.68151
[10] Vassilis Dimopoulos, Ioannis Papaefstathiou, and Dionisios Pnevmatikatos. A memory-efficient reconfigurable Aho-Corasick FSM implementation for intrusion detection systems. In Proc. IC-SAMOS, pages 186-193. IEEE, 2007. doi:10.1109/ICSAMOS.2007.4285750. · doi:10.1109/ICSAMOS.2007.4285750
[11] Guy Feigenblat, Ely Porat, and Ariel Shiftan. Linear time succinct indexable dictionary construction with applications. In Proc. DCC, 2016, pages 13-22. IEEE, 2016. doi:10.1109/ DCC.2016.70. · doi:10.1109/DCC.2016.70
[12] Guy Feigenblat, Ely Porat, and Ariel Shiftan. A grouping approach for succinct dynamic dictionary matching. Algorithmica, 77(1):134-150, 2017. doi:10.1007/s00453-015-0056-0. · Zbl 1359.68335 · doi:10.1007/s00453-015-0056-0
[13] Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, and Marinella Sciortino. Boosting textual compression in optimal linear time. Journal of the ACM, 52(4):688-713, 2005. doi: 10.1145/1082036.1082043. · Zbl 1323.68260 · doi:10.1145/1082036.1082043
[14] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. FOCS, pages 184-193. IEEE, 2005. doi:10.1109/SFCS.2005.69. · doi:10.1109/SFCS.2005.69
[15] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. FOCS, pages 390-398. IEEE, 2000. doi:10.1109/SFCS.2000.892127. · doi:10.1109/SFCS.2000.892127
[16] Travis Gagie. Large alphabets and incompressibility. Information Processing Letters, 99(6):246-251, 2006. doi:10.1016/j.ipl.2006.04.008. · Zbl 1185.68367 · doi:10.1016/j.ipl.2006.04.008
[17] Simon Gog. SDSL -succinct data structure library, 2015. URL: https://github.com/ simongog/sdsl-lite.
[18] Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Faster, minuter. In Proc. DCC, pages 53-62. IEEE, 2016. doi:10.1109/DCC.2016.94. · doi:10.1109/DCC.2016.94
[19] Shay Golan and Ely Porat. Real-time streaming multi-pattern search for constant alphabet. In Proc. ESA, volume 87 of LIPIcs, pages 41:1-41:15. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. doi:10.4230/LIPIcs.ESA.2017.41. · Zbl 1442.68280 · doi:10.4230/LIPIcs.ESA.2017.41
[20] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Massachu-setts: Addison-Wesley, 1989. · Zbl 0668.00003
[21] Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997. doi:10.1017/CBO9780511574931. · Zbl 0934.68103 · doi:10.1017/CBO9780511574931
[22] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey S. Vitter. Faster compressed dictionary matching. Theoretical Computer Science, 475:113-119, 2013. doi:10.1016/j.tcs.2012.10.050. · Zbl 1259.68259 · doi:10.1016/j.tcs.2012.10.050
[23] Wing-Kai Hon, Tak-Wah Lam, Rahul Shah, Siu-Lung Tam, and Jeffrey S. Vitter. Compressed index for dictionary matching. In Proc. DCC, pages 23-32. IEEE, 2008. doi:10.1109/DCC. 2008.62. · doi:10.1109/DCC.2008.62
[24] Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Ultra-succinct representation of ordered trees with applications. Journal of Computer and System Sciences, 78(2):619-631, 2012. doi:10.1016/j.jcss.2011.09.002. · Zbl 1242.68083 · doi:10.1016/j.jcss.2011.09.002
[25] Juha Kärkkäinen and Simon J. Puglisi. Fixed block compression boosting in FM-indexes. In Proc. SPIRE, volume 7024 of LNCS, pages 174-184. Springer, 2011. doi:10.1007/ 978-3-642-24583-1_18. · doi:10.1007/978-3-642-24583-1_18
[26] Tsvi Kopelowitz, Ely Porat, and Yaron Rozen. Succinct Online Dictionary Matching with Improved Worst-Case Guarantees. In Proc. CPM, volume 54 of LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. doi:10.4230/LIPIcs.CPM.2016.6. · Zbl 1380.68154 · doi:10.4230/LIPIcs.CPM.2016.6
[27] S. Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999. doi:10.1137/ S0097539797331105. · Zbl 0941.68055 · doi:10.1137/S0097539797331105
[28] Hung-Jen Liao, Chun-Hung R. Lin, Ying-Chih Lin, and Kuang-Yuan Tung. Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications, 36(1):16-24, 2013. doi:10.1016/j.jnca.2012.09.004. · doi:10.1016/j.jnca.2012.09.004
[29] Veli Mäkinen and Gonzalo Navarro. Compressed full-text indexes. ACM Computing Surveys, 39(1):2, 2007. doi:10.1145/1216370.1216372. · Zbl 1321.68263 · doi:10.1145/1216370.1216372
[30] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407-430, 2001. doi:10.1145/382780.382782. · Zbl 1323.68262 · doi:10.1145/382780.382782
[31] Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016. doi:10.1017/CBO9781316588284. · doi:10.1017/CBO9781316588284
[32] Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms, 10(3):16, 2014. doi:10.1145/2601073. · Zbl 1333.68084 · doi:10.1145/2601073
[33] Derek Pao, Xing Wang, Xiaoran Wang, Cong Cao, and Yuesheng Zhu. String searching engine for virus scanning. IEEE Transactions on Computers, 60(11):1596-1609, 2011. doi: 10.1109/TC.2010.250. · doi:10.1109/TC.2010.250
[34] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. ACM Transactions on Algorithms, 3(4):43, 2007. doi:10.1145/1290672.1290680. · Zbl 1446.68046 · doi:10.1145/1290672.1290680
[35] Dina Sokol and Marcus Shoshana. Engineering small space dictionary matching. arXiv preprint, 2013. arXiv:1301.6428. · Zbl 1270.68099
[36] Alan Tam, Edward Wu, Tak-Wah Lam, and Siu-Ming Yiu. Succinct text indexing with wildcards. In Proc. SPIRE, pages 39-50. Springer, 2009. doi:10.1007/978-3-642-03784-9_5. · doi:10.1007/978-3-642-03784-9_5
[37] Sun Wu and Udi Manber. Fast text searching: allowing errors. Communications of the ACM, 35(10):83-91, 1992. doi:10.1145/135239.135244. A Implementation and Experiments We implemented the data structure described in this paper in C++ and compared its runtime and memory consumption with the Belazzougui’s solution [3] and a naive algorithm. We could not find implementations of the Belazzougui’s data structure and implemented it too. In [34] we found a different data structure based on compressed suffix trees but it showed a very poor performance, so we decided to exclude it from the tests. · doi:10.1145/135239.135244
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.