×

Eulertigs: minimum plain text representation of \(k\)-mer sets without repetitions in linear time. (English) Zbl 07896498

Boucher, Christina (ed.) et al., 22nd international workshop on algorithms in bioinformatics, WABI 2022, Potsdam, Germany, September 5–7, 2022. Wadern: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 242, Article 2, 21 p. (2022).
Summary: A fundamental operation in computational genomics is to reduce the input sequences to their constituent \(k\)-mers. For maximum performance of downstream applications it is important to store the \(k\)-mers in small space, while keeping the representation easy and efficient to use (i.e. without \(k\)-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. For that, we present a formalisation of arc-centric bidirected de Bruijn graphs and carefully prove that it accurately models the \(k\)-mer spectrum of the input. Our algorithm first constructs the de Bruijn graph in linear time in the length of the input strings (for a fixed-size alphabet). Then it uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
For the entire collection see [Zbl 1496.92003].

MSC:

92D10 Genetics and epigenetics
92-08 Computational methods for problems pertaining to biology
05C20 Directed graphs (digraphs), tournaments

Software:

Eulertigs
Full Text: DOI