×

Elastic founder graphs improved and enhanced. (English) Zbl 07809107

Summary: Indexing labeled graphs for pattern matching is a central challenge of pangenomics. Equi et al. (2022) [14] developed the Elastic Founder Graph (EFG) representing an alignment of \(m\) sequences of length \(n\), drawn from alphabet \(\Sigma\) plus the special gap character: the paths spell the original sequences or their recombination. By enforcing the semi-repeat-free property, the EFG admits a polynomial-space index for linear-time pattern matching, breaking through the conditional lower bounds on indexing labeled graphs (Equi et al. [13]). In this work, we improve the space of the EFG index answering pattern matching queries in linear time, from linear in the length of all strings spelled by three consecutive node labels, to linear in the size of the edge labels. Then, we develop linear-time construction algorithms optimizing for different metrics: we improve the existing linearithmic construction algorithms to \(O(mn)\), by solving the novel exclusive ancestor set problem on trees; we propose, for the simplified gapless setting, an \(O(mn)\)-time solution minimizing the maximum block height, that we generalize by substituting block height with prefix-aware height. Finally, to show the versatility of the framework, we develop a BWT-based EFG index and study how to encode and perform document listing queries on a set of paths of the graphs, reporting which paths present a given pattern as a substring. We propose the EFG framework as an improved and enhanced version of the framework for the gapless setting, along with construction methods that are valid in any setting concerned with the segmentation of aligned sequences.

MSC:

68Qxx Theory of computing

Software:

vg; HISAT

References:

[1] Alzamel, Mai; Ayad, Lorraine A. K.; Bernardini, Giulia; Grossi, Roberto; Iliopoulos, Costas S.; Pisanti, Nadia; Pissis, Solon P.; Rosone, Giovanna, Comparing degenerate strings. Fundam. Inform., 1-4, 41-58 (2020) · Zbl 1497.68587
[2] Belazzougui, Djamal; Cunial, Fabio; Kärkkäinen, Juha; Mäkinen, Veli, Linear-time string indexing and analysis in small space. ACM Trans. Algorithms, 2 (March 2020) · Zbl 1485.68079
[3] Belazzougui, Djamal; Kosolobov, Dmitry; Puglisi, Simon J.; Raman, Rajeev, Weighted ancestors in suffix trees revisited · Zbl 07695994
[4] Bernardini, Giulia; Gawrychowski, Pawel; Pisanti, Nadia; Pissis, Solon P.; Even, Giovanna Rosone, Faster elastic-degenerate string matching via fast matrix multiplication
[5] Burrows, M.; Wheeler, D., A block-sorting lossless data compression algorithm (1994), Digital Equipment Corporation, Technical Report 124
[6] Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli; Tuukka, Norri, Linear time maximum segmentation problems in column stream model, 322-336 · Zbl 1539.68105
[7] Cobas, Dustin; Mäkinen, Veli; Rossi, Massimiliano, Tailoring r-index for document listing towards metagenomics applications, 291-306 · Zbl 1511.68099
[8] Cobas, Dustin; Navarro, Gonzalo, Fast, small, and simple document listing on repetitive text collections, 482-498
[9] De La Briandais, Rene, File searching using variable length keys, 295-298
[10] Eggertsson, Hannes P.; Kristmundsdottir, Snaedis; Beyter, Doruk; Jonsson, Hakon; Skuladottir, Astros; Hardarson, Marteinn T.; Gudbjartsson, Daniel F.; Stefansson, Kari; Halldorsson, Bjarni V.; Melsted, Pall, Graphtyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat. Commun., 1, 5402 (Nov 2019)
[11] Equi, Massimo, Lower and Upper Bounds for String Matching in Labelled Graphs (2022), University of Helsinki, PhD thesis
[12] Equi, Massimo; Grossi, Roberto; Mäkinen, Veli; Tomescu, Alexandru I., On the complexity of string matching for graphs · Zbl 07561548
[13] Equi, Massimo; Mäkinen, Veli; Tomescu, Alexandru I., Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless seth fails, 608-622 · Zbl 1490.68151
[14] Equi, Massimo; Norri, Tuukka; Alanko, Jarno; Cazaux, Bastien; Tomescu, Alexandru I.; Mäkinen, Veli, Algorithms and complexity on indexing founder graphs. Algorithmica, 6, 1586-1623 (2023) · Zbl 07691814
[15] Farach, Martin, Optimal suffix tree construction with large alphabets, 137-143
[16] Fredman, Michael L.; Willard, Dan E., BLASTING through the information theoretic barrier with FUSION TREES, 1-7
[17] Fujishige, Yuta; Tsujimaru, Yuki; Inenaga, Shunsuke; Bannai, Hideo; Takeda, Masayuki, Linear-time computation of DAWGs, symmetric indexing structures, and MAWs for integer alphabets. Theor. Comput. Sci. (2023) · Zbl 1520.68229
[18] Garrison, Erik; Sirén, Jouni; Novak, Adam; Hickey, Glenn; Eizenga, Jordan; Dawson, Eric; Jones, William; Garg, Shilpa; Markello, Charles; Lin, Michael; Paten, Benedict, Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. (2018)
[19] Gusfield, Dan, Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology (1997), Cambridge University Press · Zbl 0934.68103
[20] Jacobson, G., Space-efficient static trees and graphs, 549-554
[21] Kim, Daehwan; Paggi, Joseph; Park, Chanhee; Bennett, Christopher; Salzberg, Steven, Graph-based genome alignment and genotyping with hisat2 and hisat-genotype. Nat. Biotechnol., 1 (2019)
[22] Moritz, Maaß G., Linear bidirectional on-line construction of affix trees. Algorithmica, 1, 43-74 (2003) · Zbl 1060.68145
[23] Mäkinen, Veli; Cazaux, Bastien; Equi, Massimo; Norri, Tuukka; Tomescu, Alexandru I., Linear time construction of indexable founder block graphs · Zbl 1515.92048
[24] Mäkinen, Veli; Navarro, Gonzalo; Sirén, Jouni; Välimäki, Niko, Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 3, 281-308 (2010)
[25] Mäkinen, Veli; Tomescu, Alexandru I.; Kuosmanen Topi Paavilainen, Anna; Gagie, Travis; Chikhi, Rayan, Sparse dynamic programming on dags with small width. ACM Trans. Algorithms, 2 (2019) · Zbl 1454.68112
[26] Manber, Udi; Myers, Eugene W., Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 5, 935-948 (1993) · Zbl 0784.68027
[27] Muthukrishnan, S., Efficient algorithms for document retrieval problems, 657-666
[28] Norri, Tuukka, On Founder Segmentations of Aligned Texts (2022), University of Helsinki, PhD thesis
[29] Norri, Tuukka; Cazaux, Bastien; Dönges, Saska; Valenzuela, Daniel; Mäkinen, Veli, Founder reconstruction enables scalable and seamless pangenomic analysis. Bioinformatics, 24, 4611-4619 (2021)
[30] Norri, Tuukka; Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli, Linear time minimum segmentation enables scalable founder reconstruction. Algorithms Mol. Biol., 1 (2019)
[31] Rizzo, Nicola; Mäkinen, Veli, Indexable elastic founder graphs of minimum height · Zbl 07842480
[32] Rizzo, Nicola; Mäkinen, Veli, Linear time construction of indexable elastic founder graphs, 480-493 · Zbl 07577720
[33] Schneeberger, Korbinian; Hagmann, Jörg; Ossowski, Stephan; Warthmann, Norman; Gesing, Sandra; Kohlbacher, Oliver; Weigel, Detlef, Simultaneous alignment of short reads against multiple genomes. Genome Biol. (2009)
[34] Seward, Harold Herbert, Information sorting in the application of electronic digital computers to business operations (1954), Massachusetts Institute of Technology, Digital Computer Laboratory, Thesis
[35] Sirén, Jouni; Välimäki, Niko; Mäkinen, Veli, Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform., 2, 375-388 (2014)
[36] Strothmann, Dirk, The affix array data structure and its applications to RNA secondary structure analysis. Theor. Comput. Sci., 1-2, 278-294 (2007) · Zbl 1146.68358
[37] Computational pan-genomics: status, promises and challenges. Brief. Bioinform., 1, 118-135 (2018)
[38] Ukkonen, Esko, On-line construction of suffix trees. Algorithmica, 3, 249-260 (1995) · Zbl 0831.68027
[39] Ukkonen, Esko, Finding founder sequences from a set of recombinants, 277-286 · Zbl 1016.68565
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.