Quantifying local randomness in human DNA and RNA sequences using Erdös motifs. (English) Zbl 1406.92463

Summary: In 1932, Paul Erdös asked whether a random walk constructed from a binary sequence can achieve the lowest possible deviation (lowest discrepancy), for the sequence itself and for all its subsequences formed by homogeneous arithmetic progressions. Although achieving low discrepancy is impossible for infinite sequences, as recently proven by Terence Tao, attempts have been made to construct such sequences with finite lengths. We recognize that such constructed sequences (we call these “Erdös sequences”) exhibit certain hallmarks of randomness at the local level: they show roughly equal frequencies of short subsequences, and at the same time exclude trivial periodic patterns. For human DNA, we examine the frequency of a set of length-10 Erdös motifs using three nucleotide-to-binary mappings. The particular length-10 Erdös sequence is derived from the length-11 Mathias sequence and is identical to the first 10 digits of the Thue-Morse sequence, underscoring the fact that both are deficient in periodicities. Our calculations indicate that: (1) the purine (A and G)/pyrimidine (C and T) based Erdös motifs are greatly underrepresented in the human genome, (2) the strong (G and C)/weak (A and T) based Erdös motifs are slightly overrepresented, (3) the densities of the two are negatively correlated, (4) the Erdös motifs based on all three mappings combined are slightly underrepresented, and (5) the strong/weak based Erdös motifs are greatly overrepresented in human messenger RNA sequences.
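
The quantities named in the summary can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that discrepancy is the largest absolute partial sum over homogeneous arithmetic progressions d, 2d, 3d, ..., that the third mapping is amino (A and C)/keto (G and T), which the summary does not name, and that motifs are counted in overlapping windows on one strand. The function names and the 0/1 labelling of each mapping are hypothetical choices made for illustration.

```python
def thue_morse_prefix(n):
    """First n digits of the Thue-Morse sequence (parity of the 1-bits of i)."""
    return [bin(i).count("1") % 2 for i in range(n)]


def discrepancy(bits):
    """Largest |partial sum| over all homogeneous arithmetic progressions
    d, 2d, 3d, ... (1-indexed positions), with 0 -> +1 and 1 -> -1."""
    signs = [1 - 2 * b for b in bits]
    n = len(signs)
    worst = 0
    for d in range(1, n + 1):
        total = 0
        for pos in range(d, n + 1, d):
            total += signs[pos - 1]
            worst = max(worst, abs(total))
    return worst


# Nucleotide-to-binary mappings. Which class is labelled 0 or 1 is arbitrary.
MAPPINGS = {
    "RY": {"A": 0, "G": 0, "C": 1, "T": 1},  # purine / pyrimidine
    "SW": {"G": 0, "C": 0, "A": 1, "T": 1},  # strong / weak
    "MK": {"A": 0, "C": 0, "G": 1, "T": 1},  # amino / keto (assumed third mapping)
}


def count_motif(dna, motif, mapping):
    """Count overlapping windows of dna whose binary encoding equals motif.
    Bases outside the mapping (for example N) encode to None and never match."""
    encoded = [mapping.get(base) for base in dna.upper()]
    k = len(motif)
    return sum(
        1 for i in range(len(encoded) - k + 1) if encoded[i:i + k] == motif
    )


motif = thue_morse_prefix(10)          # [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
motif_discrepancy = discrepancy(motif)  # 1 for this length-10 prefix

# Toy sequence whose purine/pyrimidine encoding is exactly the motif.
toy_dna = "ACTGCAGTCA"
hits_ry = count_motif(toy_dna, motif, MAPPINGS["RY"])  # 1
```

Because the 0/1 labelling is arbitrary, a fuller count would probably also include the bitwise complement of the motif. The exact set of motifs used in the paper is not specified in the summary.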

MSC:

92D20 Protein sequences, DNA sequences

Software:

BEDTools

References:

[1] Allouche, J. P., Surveying some notions of complexity for finite and infinite sequences, (Matsumoto, K.; Akiyama, S.; Fukuyama, K.; Nakada, H.; Sugita, H.; Tamagawa, A., Functions in Number Theory and Their Probabilistic Aspects, RIMS Kokyuroku Bessatsu Series B34 (2012), RIMS, Kyoto University), 27-28 · Zbl 1271.68202
[2] Allouche, J. P.; Shallit, J., The ubiquitous Prouhet-Thue-Morse sequence, (Ding, C.; Helleseth, T.; Niederreiter, H., Proceedings of Sequences and Their Applications, SETA’98 (1999), Springer), 1-16 · Zbl 1005.11005
[3] Almirantis, Y., A standard deviation based quantification differentiates coding from non-coding DNA sequences and gives insight to their evolutionary history, J. Theo. Biol., 196, 297-308 (1999)
[4] Almirantis, Y.; Provata, A., The clustered structure of the purines/pyrimidines distribution in DNA distinguishes systematically between coding and non-coding sequences, Bull. Math. Biol., 59, 975-992 (1997) · Zbl 0880.92014
[5] Arnott, S.; Chandrasekaran, R.; Hukins, D.; Smith, P.; Watts, L., Structural details of a double-helix observed for DNAs containing alternating purine and pyrimidine sequences, J. Mol. Biol., 88, 523-524 (1974)
[6] Bailey, J. A.; Gu, Z.; Clark, R. A.; Reinert, K.; Samonte, R. V.; Schwartz, S.; Adams, M. D.; Myers, E. W.; Li, P. W.; Eichler, E. E., Recent segmental duplications in the human genome, Science, 297, 1003-1007 (2002)
[7] Behe, M. J., An overabundance of long oligopurine tracts occurs in the genome of simple and complex eukaryotes, Nucl. Acids Res., 23, 689-695 (1995)
[8] Benedetto, D.; Caglioti, E.; Loreto, V., Language trees and zipping, Phys. Rev. Lett., 88, 048702 (2002)
[9] Bernardi, G., Misunderstandings about isochores. Part 1, Gene, 276, 3-13 (2001)
[10] Bernardi, G.; Olofsson, B.; Filipski, J.; Zerial, M.; Salinas, J.; Cuny, G.; Meunier-Rotival, M.; Rodier, F., The mosaic genome of warm-blooded vertebrates, Science, 228, 953-958 (1985)
[11] Berthelsen, C. L.; Glazier, J. A.; Skolnick, M. H., Global fractal dimension of human DNA sequences treated as pseudorandom walks, Phys. Rev. A, 45, 8902-8913 (1992)
[12] Bustamante, C. D.; Fledel-Alon, A.; Williamson, S.; Nielsen, R.; Hubisz, M. T.; Glanowski, S.; Tanenbaum, D. M.; White, T. J.; Sninsky, J. J.; Hernandez, R. D.; Civello, D.; Adams, M. D.; Cargill, M.; Clark, A. G., Natural selection on protein-coding genes in the human genome, Nature, 437, 1153-1157 (2005)
[13] Calladine, C. R.; Drew, H.; Luisi, B.; Travers, A., Understanding DNA: the Molecule and How it Works (2004), Academic Press
[14] Carlson, E. A., Mutation: The History of an Idea from Darwin to Genomics (2011), Cold Spring Harbor Laboratory Press
[15] Christophe, D.; Cabrer, B.; Bacolla, A.; Targovnik, H.; Pohl, V.; Vassart, G., An unusually long poly(purine)-poly(pyrimidine) sequence is located upstream from the human thyroglobulin gene, Nucl. Acids Res., 13, 5127-5144 (1985)
[16] Clay, O.; Bernardi, G., Compositional heterogeneity within and among isochores in mammalian genomes: II. Some general comments, Gene, 276, 25-31 (2001)
[17] Cocho, G.; Miramontes, P.; Mansilla, R.; Li, W., Bacterial genomes lacking long-range correlations may not be modeled by low-order Markov chains: the role of mixing statistics and frame shift of neighboring genes, Comp. Biol. Chem., 53, 15-25 (2014)
[18] Colquhoun, D., An investigation of the false discovery rate and the misinterpretation of p-values, Royal Soc. Open Sci., 1, 140216 (2014)
[19] International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome, Nature, 409, 860-921 (2001)
[20] Cooper, D. N.; Krawczak, M., Mechanisms of insertional mutagenesis in human genes causing genetic disease, Human Genet., 87, 409-415 (1991)
[21] Cordaux, R.; Batzer, M. A., The impact of retrotransposons on human genome evolution, Nature Rev. Genet., 10, 691-703 (2009)
[22] De Bruijn, N. G., A combinatorial problem, Proc. Koninklijke Nederlandse Akademie van Wetenschappen, 49, 758-764 (1946) · Zbl 0060.02701
[23] De Bruijn, N. G., Acknowledgement of priority to C. Flye Sainte-Marie on the counting of circular arrangements of \(2^n\) zeros and ones that show each n-letter word exactly once (1975), Technische Hogeschool Eindhoven, Technical report · Zbl 0323.05119
[24] Drew, H. R.; Travers, A. A., DNA bending and its relation to nucleosome positioning, J. Mol. Biol., 186, 773-790 (1985)
[25] Duret, L.; Arndt, P. F., The impact of recombination on nucleotide substitutions in the human genome, PLoS Genet., 4, e1000071 (2008)
[26] Erdös, P., Some unsolved problems, Michigan Math. J., 4, 291-300 (1957) · Zbl 0081.00102
[27] Estevez-Rams, E.; Serrano, R. L.; Fernández, B. A.; Reyes, I. B., On the non-randomness of maximum Lempel Ziv complexity sequences of finite size, Chaos, 23, 023118 (2013) · Zbl 1331.37024
[28] Fickett, J. W.; Torney, D. C.; Wolf, D. R., Base compositional structure of genomes, Genomics, 13, 1056-1064 (1992)
[29] Flores, M.; Morales, L.; Gonzaga-Jauregui, C.; Domínguez-Vidaña, R.; Zepeda, C.; Yañez, O.; Gutiérrez, M.; Lemus, T.; Valle, D.; Avila, M. C.; Blanco, D.; Medina-Ruiz, S.; Meza, K.; Ayala, E.; García, D.; Bustos, P.; González, V.; Girard, L.; Tusie-Luna, T.; Dávila, G.; Palacios, R., Recurrent DNA inversion rearrangements in the human genome, Proc. Natl. Acad. Sci., 104, 6099-6106 (2007)
[30] Forsdyke, D. R.; Mortimer, J. R., Chargaff’s legacy, Gene, 261, 127-137 (2000)
[31] Hackenberg, M.; Rueda, A.; Carpena, P.; Bernaola-Galván, P.; Barturen, G.; Oliver, J. L., Clustering of DNA words and biological function: a proof of principle, J. Theo. Biol., 297, 127-136 (2012)
[32] Hartl, D. L.; Clark, A. G., Principles of Population Genetics (1997), Sinauer Associates, Inc.
[33] Hollander, M.; Wolfe, D. A., Nonparametric Statistical Methods (1999), Wiley · Zbl 0997.62511
[34] Jiang, C.; Pugh, B. F., Nucleosome positioning and gene regulation: advances through genomics, Nature Rev. Genet., 10, 161-172 (2009)
[35] Kimura, M., The Neutral Theory of Molecular Evolution (1983), Cambridge University Press
[36] Knuth, D. E., Art of Computer Programming, Volume 2: Seminumerical Algorithms (1997), Addison-Wesley · Zbl 0883.68015
[37] Konev, B.; Lisitsa, A., A SAT attack on the Erdös discrepancy conjecture, arXiv preprint, arXiv:1402.2184 (2014) · Zbl 1343.68217
[38] Koroteev, M. V.; Miller, J., Scale-free duplication dynamics: a model for ultraduplication, Phys. Rev. E, 84, 061919 (2011)
[39] Leong, A., Variations on the Erdös discrepancy problem, Master’s Thesis (2011), Computer Science, University of Waterloo, Waterloo, Ontario, Canada
[40] Leong, A.; Shallit, J., Counting sequences with small discrepancies, Exp. Math., 22, 74-84 (2013) · Zbl 1325.11076
[41] Li, M.; Vitányi, P., Statistical properties of finite sequences with high Kolmogorov complexity, Math. Sys. Theo., 27 (1994) · Zbl 0830.68073
[42] Li, M.; Vitányi, P., An Introduction to Kolmogorov Complexity and Its Applications (2008), Springer · Zbl 1185.68369
[43] Li, W., On the relationship between complexity and entropy for Markov chains and regular languages, Complex Sys., 5, 381-399 (1991) · Zbl 0799.68116
[44] Li, W., Generating nontrivial long-range correlations and 1/f spectra by replication and mutation, Int. J. Bifur. and Chaos, 2, 137-154 (1992) · Zbl 0900.92108
[45] Li, W., Study of correlation structure in DNA sequences: a critical review, Comput. Chem., 21, 257-272 (1997)
[46] Li, W., On parameters of the human genome, J. Theo. Biol., 288, 92-104 (2011) · Zbl 1397.92458
[47] Li, W., G+C content evolution in the human genome, eLS (2013)
[48] Li, W.; Bernaola-Galvan, P.; Carpena, P.; Oliver, J. L., Isochores merit the prefix ‘iso’, Comp. Biol. Chem., 27, 5-10 (2002)
[49] Li, W.; Holste, D., Spectral analysis of guanine and cytosine fluctuations of mouse genomic DNA, Fluct. Noise Lett., 4 (2004)
[50] Li, W.; Holste, D., Universal 1/f noise, crossovers of scaling exponents, and chromosome-specific patterns of guanine-cytosine content in DNA sequences of the human genome, Phys. Rev. E, 71, 041910 (2005)
[51] Li, W.; Kaneko, K., Long-range correlation and partial \(1/f^α\) spectrum in a noncoding DNA sequence, Europhys. Lett., 17, 655-660 (1992)
[52] Li, W.; Sosa, D.; Jose, M. V., Human repetitive sequence densities are mostly negatively correlated with R/Y-based nucleosome-positioning motifs and positively correlated with W/S-based motifs, Genomics, 101, 125-133 (2013)
[53] Mani, G. S., Long-range doublet correlations in DNA and the coding regions, J. Theo. Biol., 158, 447-464 (1992)
[54] Mathias, A., On a conjecture of Erdös and Čudakov, (Bollobás, B.; Thomason, A., Combinatorics, Geometry and Probability: A Tribute to Paul Erdös (1993), Cambridge University Press), 487-488 · Zbl 0874.11010
[55] Mills, R. E.; Luttig, C. T.; Larkins, C. E.; Beauchamp, A.; Tsui, C.; Pittard, W. S.; Devine, S. E., An initial map of insertion and deletion (INDEL) variation in the human genome, Genome Res., 16, 1182-1190 (2006)
[56] Nikolaou, C.; Almirantis, Y., A study of the middle-scale nucleotide clustering in DNA sequences of various origin and functionality, by means of a method based on a modified standard deviation, J. Theo. Biol., 217, 479-492 (2002)
[57] Ohno, S.; Wolf, U.; Atkin, N. B., Evolution from fish to mammals by gene duplication, Hereditas, 59, 169-187 (1968)
[58] Orenstein, Y.; Shamir, R., Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers, Bioinformatics, 29 (2013)
[59] Park, C. Y.; Halevy, T.; Lee, D. R.; Sung, J. J.; Lee, J. S.; Yanuka, O.; Benvenisty, N.; Kim, D. W., Reversion of FMR1 methylation and silencing by editing the triplet repeats in fragile X iPSC-derived neurons, Cell Rep., 13, 234-241 (2015)
[60] Payseur, B. A.; Jing, P.; Haasl, R. J., A genomic portrait of human microsatellite variation, Mol. Biol. & Evol., 28, 303-312 (2010)
[61] Peckham, H. E.; Thurman, R. E.; Fu, Y.; Stamatoyannopoulos, J. A.; Noble, W. S.; Struhl, K.; Weng, Z., Nucleosome positioning signals in genomic DNA, Genome Res., 17, 1170-1177 (2007)
[62] Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simon, M.; Stanley, H. E., Long-range correlations in nucleotide sequences, Nature, 356, 168-170 (1992)
[63] Quinlan, A. R.; Hall, I. M., BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, 26, 841-842 (2010)
[64] Richly, E.; Leister, D., NUMTs in sequenced eukaryotic genomes, Mol. Biol. Evol., 21, 1081-1084 (2004)
[65] Riklund, R.; Severin, M.; Liu, Y., The Thue-Morse aperiodic crystal, a link between the Fibonacci quasicrystal and the periodic crystal, Int. J. Mod. Phys. B, 1, 121-132 (1987) · Zbl 1165.82326
[66] Schildkraut, C. L.; Marmur, J.; Doty, P., Determination of the base composition of deoxyribonucleic acid from its buoyant density in CsCl, J. Mol. Biol., 4, 430-443 (1962)
[67] Segal, E.; Fondufe-Mittendorf, Y.; Chen, L.; Thåström, A.; Field, Y.; Moore, I. K.; Wang, J. P.; Widom, J., A genomic code for nucleosome positioning, Nature, 442, 772-778 (2006)
[68] Shepherd, J. C., Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification, Proc. Natl. Acad. Sci., 78, 1596-1600 (1981)
[69] Smit, A., The origin of interspersed repeats in the human genome, Curr. Opin. Genet. Devel., 6, 743-748 (1996)
[70] Soler-Toscano, F.; Zenil, H.; Delahaye, J. P.; Gauvrit, N., Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines, PLoS ONE, 9, e96223 (2014)
[71] Soundararajan, K., Tao’s resolution of the Erdös discrepancy problem, Bull. Am. Math. Soc., 55, 81-92 (2018) · Zbl 1440.11144
[72] Swartz, M. N.; Trautner, T. A.; Kornberg, A., Enzymatic synthesis of deoxyribonucleic acid, J. Biol. Chem., 237, 1961-1967 (1962)
[73] Tao, T., The Erdös discrepancy problem, Discrete Analysis, 2016, 1 (2016) · Zbl 1353.11087
[74] Thanos, D.; Li, W.; Provata, A., Entropic fluctuations in DNA sequences, Physica A, 493, 444-457 (2018) · Zbl 1503.92052
[75] Timmis, J. N.; Ayliffe, M. A.; Huang, C. Y.; Martin, W., Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes, Nature Rev. Genet., 5, 123-135 (2004)
[76] Trifonov, E. N., Nucleosome positioning by sequence, state of the art and apparent finale, J. Biomol. Stru. Dyn., 27, 741-746 (2010)
[77] Trifonov, E. N.; Sussman, J. L., The pitch of chromatin DNA is reflected in its nucleotide sequence, Proc. Natl. Acad. Sci., 77, 3816-3820 (1980)
[78] Vinson, C.; Chatterjee, R., CG methylation, Epigenomics, 4, 655-663 (2012)
[79] Vitányi, P., Randomness, arXiv:math/0110086 (2001)
[80] Voss, R. F., Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett., 68, 3805-3808 (1992)
[81] Wolfe, K. H., Yesterday’s polyploids and the mystery of diploidization, Nature Rev. Genet., 2, 333-341 (2001)
[82] Zhang, R.; Zhang, C. T., Z curves, an intuitive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struc. & Dyn., 11, 767-782 (1994)