Complexity of symbol sequences. (English) Zbl 0651.92014
Summary: The statistical properties of various one-dimensional strings are investigated using Shannon entropies and Hamming distances between substrings. Special attention is devoted to representative biosequences: a DNA string and two protein sequences. A rather high degree of randomness is found in the biosequences, whereas all computer languages considered exhibit long-range correlations.
Shannon entropies of longer “words” are significantly influenced by the finite length of any real sequence. It is shown that straightforward calculations lead to systematic underestimation of the entropies. To compensate for this effect, a length correction formula is proposed.
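The finite-length effect described in the summary can be illustrated with a minimal sketch. It is not the authors' procedure and does not reproduce their length correction formula, which the summary does not state. The function name `block_entropy`, the test string, and the sequence length of 2000 are assumptions chosen for illustration. The sketch computes the naive (plug-in) Shannon entropy of overlapping n-letter words and compares it with the known value for a uniformly random four-letter string.

```python
import math
import random
from collections import Counter


def block_entropy(seq, n):
    """Naive (plug-in) Shannon entropy, in bits, of overlapping length-n words."""
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(words)
    total = len(words)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical test data (an assumption, not data from the paper): a uniformly
# random 4-letter string, whose true n-word entropy is exactly 2 * n bits.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(2000))

for n in (1, 2, 4, 6, 8):
    estimate = block_entropy(seq, n)
    # Columns: word length, plug-in estimate, true value for this source.
    print(n, round(estimate, 3), 2 * n)
```

The plug-in estimate can never exceed the logarithm of the number of words observed, here about 11 bits, so for n = 6 and n = 8 it necessarily falls below the true values of 12 and 16 bits. This is the kind of systematic underestimation a length correction is meant to offset.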
MSC:
92Cxx | Physiological, cellular and medical topics |
94A17 | Measures of information, entropy |
68Q99 | Theory of computing |