×

Euler circuits and DNA sequencing by hybridization. (English) Zbl 0997.92014

Summary: Sequencing by hybridization is a method of reconstructing a long DNA string – that is, figuring out its nucleotide sequence – from knowledge of its short substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pattern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length \(2n\) in which each symbol occurs twice. A pairing induces a 2-in, 2-out graph, whose directed edges are defined by successive symbols of the pairing – for example the pairing ABBCAC induces the graph with edges \(AB\), \(BB\), \(BC\), and so forth – and the number of reconstructions is simply the number of Euler circuits in this 2-in, 2-out graph. The original problem is thus transformed into one about pairings: to find the number \(f_k(n)\) of \(n\)-symbol pairings having \(k\) Euler circuits.
We show how to compute this function, in closed form, for any fixed \(k\), and we present the functions explicitly for \(k= 1,\dots, 9\). The key is a decomposition theorem: the Euler “circuit number” of a pairing is the product of the circuit numbers of “component” sub-pairings. These components come from connected components of the “interlace graph”, which has the pairing’s symbols as vertices, and edges when symbols are “interlaced”. (\(A\) and \(B\) are interlaced if the pairing has the form \(\cdots A\cdots B\cdots A\cdots B\cdots\) or \(\cdots B\cdots A\cdots B\cdots A\cdots\).) We carry these results back to the original question about DNA strings, and provide a total variation distance upper bound for the approximation error.
We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random \(n\)-pairing, the number of Euler circuits is of order no smaller than \(2^n/n\), and the expected number is asymptotically at least \(e^{-1/2} 2^{n-1}/n\). Since any \(n\)-pairing has at most \(2^{n-1}\) Euler circuits, this pinpoints the exponential growth rate.

MSC:

92C40 Biochemistry, molecular biology
05A16 Asymptotic enumeration
05C90 Applications of graph theory
92D20 Protein sequences, DNA sequences

Software:

OEIS
Full Text: DOI

References:

[1] R. Arratia, B. Bollobás, G.B. Sorkin, The interlace polynomial: a new graph polynomial, Proceedings of the Eleventh Annual ACM–SIAM Symposium on Discrete Algorithms, January 2000. · Zbl 0955.05066
[2] Arratia, R.; Goldstein, L.; Gordon, L.: Two moments suffice for Poisson approximations: the Chen–Stein method. Ann. probab. 17, 9-25 (1989) · Zbl 0675.60017
[3] Arratia, R.; Martin, D.; Reinert, G.; Waterman, M. S.: Poisson process approximation for sequence repeats, and sequencing by hybridization. J. comput. Biol. 3, 425-463 (1996)
[4] Arratia, R.; Reinert, G.: Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization. Springer lecture notes in computer science 1075, 209-219 (1996)
[5] Barbour, A. D.; Holst, L.; Janson, S.: Poisson approximation. Oxford studies in probability 2 (1992) · Zbl 0746.60002
[6] Bogart, K. P.; Doyle, P.: Non-sexist solution of the menage problem. Amer. math. Mon. 93, No. 7, 514-518 (1986) · Zbl 0617.05006
[7] Bollobás, B.: A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European J. Combin. 1, 311-316 (1980) · Zbl 0457.05038
[8] Bollobás, B.: Random graphs. (1985) · Zbl 0567.05042
[9] B. Bollobás, Modern Graph Theory, Graduate Texts in Mathematics, Vol. 184, Springer, New York, 1998.
[10] De Bruijn, N. G.; Van Aardenne-Ehrenfest, T.: Circuits and trees in oriented graphs. Simon stevin 28, 203-217 (1951) · Zbl 0044.38201
[11] Dehn, M.: Über kombinatorische topologie. Acta math. 67, 123-168 (1936) · JFM 62.0656.04
[12] De Fraysseix, H.: A characterization of circle graphs. European J. Combin. 5, 223-238 (1984) · Zbl 0551.05056
[13] Dyer, M.; Frieze, A.; Suen, S.: The probability of unique solutions of sequencing by hybridization. J. comput. Biol. 1, 105-110 (1994)
[14] Even, S.; Itai, A.: Queues, stacks and graphs. Theory of machines and computations, 71-86 (1971)
[15] Gabor, C. P.; Supowit, K. J.; Hsu, W. -L.: Recognizing circle graphs in polynomial time. J. ACM 36, 435-473 (1989) · Zbl 0825.68417
[16] C.F. Gauss, Werke, Vol. 8, Teubner, Stuttgart, 1900, pp. 272, 282–286.
[17] Golumbic, M. C.: Algorithmic graph theory and perfect graphs. (1980) · Zbl 0541.05054
[18] Gusfield, D.; Karp, R.; Wang, L.; Stelling, P.: Graph traversals, genes and matroids: an efficient case of the travelling salesman problem. Discrete appl. Math. 88, 167-180 (1998) · Zbl 0927.68061
[19] E. Hubbell, Multiplex sequencing by hybridization, J. Comput. Biol. (1999) to appear.
[20] Kandel, D.; Matias, Y.; Unger, R.; Winkler, P.: Shuffling biological sequences. Discrete appl. Math. 71, 171-185 (1996) · Zbl 0873.92012
[21] Kirchoff, G.: Über die auflösung der gleichungen, auf welche man bei der untersuchung der linearen verteilung galvanischer ströme geführt wird. Ann. phys. Chem. 72, 497-508 (1847)
[22] Pevzner, P.: DNA physical mapping and alternating Eulerian circuits in colored graphs. Algorithmica 13, 77-105 (1995) · Zbl 0840.92011
[23] Read, R. C.; Rosenstiehl, P.: On the Gauss crossing problem. Colloq. math. Soc. jános bolyai 18, 843-876 (1976)
[24] Sloane, N. J. A.; Plouffe, S.: The encyclopedia of integer sequences. (1995) · Zbl 0845.11001
[25] Smith, C. A. B.; Tutte, W. T.: On unicursal paths in a network of degree 4. Amer. math. Mon. 48, 233-237 (1941) · Zbl 0025.09103
[26] Spinrad, J.: Recognition of circle graphs. J. algorithms 16, 264-282 (1994) · Zbl 0797.68130
[27] Sz.-Nagy, J. V.: Über ein topologisches problem von Gauss. Math. zeitschrift 26, 579-592 (1927) · JFM 53.0550.01
[28] Tutte, W. T.: The dissection of equilateral triangles into equilateral triangles. Proc. Cambridge philos. Soc. 44, 463-482 (1948) · Zbl 0030.40903
[29] Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theoret. comput. Sci. 92, 191-211 (1992) · Zbl 0747.68026
[30] Waterman, M. S.: Introduction to computational biology: maps, sequences and genomes. (1995) · Zbl 0831.92011
[31] Wilf, H. S.: Generatingfunctionology. (1990) · Zbl 0689.05001
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.