Entropic fluctuations in DNA sequences. (English) Zbl 1503.92052

Summary: The local Shannon entropy (LSE) in blocks is used as a complexity measure to study the information fluctuations along DNA sequences. The LSE of a DNA block maps the local base arrangement information to a single numerical value. It is shown that, despite this reduction of information, LSE makes it possible to extract meaningful information for detecting repetitive sequences in whole chromosomes and is useful in finding evolutionary differences between organisms. More specifically, large regions of tandem repeats, such as centromeres, can be detected based on their low LSE fluctuations along the chromosome. Furthermore, an empirical investigation of the appropriate block sizes is provided, and the relationship between LSE properties and the structure of the underlying repetitive units is revealed using both computational and mathematical methods. Sequence similarity between the genomic DNA of closely related species also leads to similar LSE values in the orthologous regions. As an application, the LSE covariance function is used to measure the evolutionary distance between several primate genomes.
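
The summary does not give the exact LSE definition, so the following is only a minimal sketch of the general idea, under two assumptions that may differ from the paper: the entropy is computed from single-base frequencies, and the blocks are non-overlapping and of fixed size. The function names `block_entropy`, `local_shannon_entropy` and `variance`, as well as the toy repeat unit `AATGG`, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def block_entropy(block):
    """Shannon entropy (in bits) of the base frequencies in one block.

    Assumption: single-base frequencies; the paper may use another alphabet.
    """
    counts = Counter(block)
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def local_shannon_entropy(seq, block_size):
    """Return one entropy value per non-overlapping block of `block_size` bases.

    A trailing partial block, if any, is dropped.
    """
    n_blocks = len(seq) // block_size
    return [
        block_entropy(seq[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def variance(values):
    """Population variance, used here as a simple measure of LSE fluctuation."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Toy tandem repeat (hypothetical unit "AATGG", 24 copies, 120 bases).
tandem = "AATGG" * 24
tandem_profile = local_shannon_entropy(tandem, block_size=20)

# Toy non-repetitive contrast: blocks with clearly different compositions.
mixed = "A" * 20 + "ACGT" * 5 + "AC" * 10
mixed_profile = local_shannon_entropy(mixed, block_size=20)

print("tandem LSE variance:", variance(tandem_profile))
print("mixed LSE variance:", variance(mixed_profile))
```

In this toy setting the block size is a multiple of the repeat unit length, so every block of the tandem repeat has the same base composition and the LSE profile is flat, whereas the mixed sequence yields a profile that varies from block to block. This is consistent with the statement that tandem-repeat regions show low LSE fluctuations, but it is not a reproduction of the paper's analysis.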

MSC:

92D20 Protein sequences, DNA sequences
94A17 Measures of information, entropy
92D10 Genetics and epigenetics
Full Text: DOI