Range-limited Heaps’ law for functional DNA words in the human genome. (English) Zbl 07902692

Summary: Heaps’ law (also called Herdan-Heaps’ law) is a linguistic law stating that the vocabulary or dictionary size (the number of types) grows as a power-law function of the word count (the number of tokens). Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that of a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps’ law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf’s law, are well known, their translation to Heaps’ law for DNA words is not automatic. Several other animal genomes are also shown herein to exhibit a range-limited Heaps’ law under our definition of DNA words, though with differing exponents. When tokens were randomly sampled and the sample size reached its maximum level, a deviation from Heaps’ law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigating the type-token plot and its regression coefficients could provide an alternative account, from a linguistic perspective, of the reuse and redundancy of protein domains as well as the creation of new protein domains.
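
The type-token analysis described in the summary can be illustrated with a minimal sketch. It is not the authors' code or data: the token stream below is synthetic, the names `type_token_curve`, `polyfit_loglog` and the `domain_…` labels are hypothetical, and the notation is an assumption. Heaps’ law is written as V = K·N^β, so log V = log K + β·log N, and the quadratic alternative as log V = c0 + c1·log N + c2·(log N)^2, where N is the number of tokens read and V the number of distinct types seen.

```python
import math
import random


def type_token_curve(tokens):
    """Return (N, V) pairs: tokens read so far and distinct types seen so far."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def polyfit_loglog(curve, degree):
    """Least-squares polynomial fit of log V on log N via the normal equations.

    Returns coefficients [c0, c1, ...] in increasing order of power.
    degree=1 is the Heaps' law fit (c1 is the exponent); degree=2 adds curvature.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = degree + 1
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


if __name__ == "__main__":
    # Synthetic stand-in for a list of protein-domain tokens (assumption):
    # a finite vocabulary of 300 types drawn with Zipf-like weights 1/rank.
    rng = random.Random(0)
    vocab = [f"domain_{r}" for r in range(1, 301)]
    weights = [1.0 / r for r in range(1, 301)]
    tokens = rng.choices(vocab, weights=weights, k=5000)

    curve = type_token_curve(tokens)
    log_k, beta = polyfit_loglog(curve, 1)
    c0, c1, c2 = polyfit_loglog(curve, 2)
    print(f"Heaps fit:     log V = {log_k:.3f} + {beta:.3f} log N")
    print(f"Quadratic fit: log V = {c0:.3f} + {c1:.3f} log N + {c2:.3f} (log N)^2")
```

With a finite vocabulary the curve has to flatten as the sample grows, which is the kind of deviation a quadratic term in the log-log fit can absorb; the fitted values here depend entirely on the synthetic weights and say nothing about real genomes.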

MSC:

92D10 Genetics and epigenetics
91F20 Linguistics

Software:

BERT; UniProt; Canu

References:

[1] Akaike, H., A new look at the statistical model identification, IEEE Trans. Autom. Control, 19, 716-723, 1974 · Zbl 0314.62039
[2] Altmann, EG.; Gerlach, M., Statistical laws in linguistics, (Degli Esposti, M.; Altmann, E.; Pachet, F., Creativity and Universality in Language, 2016, Springer: Springer Switzerland), 7-26
[3] Andreeva, A.; Kulesha, E.; Gough, J.; Murzin, AG., The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures, Nucleic Acids Res., 48, D376-D382, 2020
[4] Apostolico, A.; Bock, ME.; Lonardi, S., Monotony of surprise and large-scale quest for unusual words, J. Comput. Biol., 10, 283-311, 2003
[5] Baeza-Yates, RA.; Navarro, G., Block addressing indices for approximate text retrieval, J. Am. Soc. Inf. Sci., 51, 69-82, 2000
[6] Bernhardsson, S.; da Rocha, LEC.; Minnhagen, P., The meta book and size-dependent properties of written language, New J. Phys., 11, Article 123015 pp., 2009
[7] Boytsov, L., A simple derivation of the Heap’s law from the generalized Zipf’s law, 2017, arXiv, preprint, 1711.03066
[8] Brants, T.; Popat, AC.; Xu, P.; Och, FJ.; Dean, J., Large language models in machine translation, (Eisner, J., Proc. 2007 Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, Asso. Comp. Linguistics), 858-867, URL: https://aclanthology.org/D07-1/
[9] Brendel, V.; Beckmann, JS.; Trifonov, EN., Linguistics of nucleotide sequences: morphology and comparison of vocabularies, J. Biomol. Struct. Dyn., 4, 11-21, 1986
[10] Buchan, DW.; Jones, DT., Learning a functional grammar of protein domains using natural language word embedding techniques, Proteins, 88, 616-624, 2020
[11] Bussemaker, HJ.; Li, H.; Siggia, ED., Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis, Proc. Natl. Acad. Sci., 97, 10096-10100, 2000
[12] Caetano-Anollés, G., The compressed vocabulary of microbial life, Front. Microbiol., 12, Article 655990 pp., 2021
[13] Caetano-Anollés, G.; Minhas, BF.; Aziz, F.; Mughal, F.; Shahzad, K.; Tal, G.; Mittenthal, JE.; Caetano-Anollés, D.; Koç, I.; Nasir, A.; Caetano-Anollés, K.; Kim, KM., The Compressed Vocabulary of the Proteins of Archaea, Biocommunication of Archaea, 147-174, 2017, Springer: Springer Switzerland
[14] Castresana, J., Genes on human chromosome 19 show extreme divergence from the mouse orthologs and a high GC content, Nucleic Acids Res., 30, 1751-1756, 2002
[15] Chikhi, R.; Medvedev, P., Informed and automated k-mer size selection for genome assembly, Bioinformatics, 30, 31-37, 2013
[16] Devlin, J.; Chang, MW.; Lee, K.; Toutanova, K., BERT: pre-training of deep bidirectional transformers for language understanding, 2018, arXiv, preprint
[17] Dong, S.; Searls, DB., Gene structure prediction by linguistic methods, Genomics, 23, 540-551, 1994
[18] Dotan, E.; Jaschek, G.; Pupko, T.; Belinkov, Y., Effect of tokenization on transformers for biological sequences, Bioinformatics, 40, Article btae196 pp., 2024
[19] Egghe, L., Untangling Herdan’s law and Heaps’ law: Mathematical and informetric arguments, J. Am. Soc. Inf. Sci. Technol., 58, 702-709, 2007
[20] Eliazar, I., The growth statistics of Zipfian ensembles: beyond Heaps’ law, Phys. A, 390, 3189-3203, 2011
[21] Ferruz, N.; Schmidt, S.; Höcker, B., ProtGPT2 is a deep unsupervised language model for protein design, Nat. Commun., 13, 4348, 2022
[22] Font-Clos, F.; Corral, Á., Log-Log convexity of type-token growth in Zipf’s systems, Phys. Rev. Lett., 114, Article 238701 pp., 2015
[23] Frappat, L.; Minichini, C.; Sciarrino, A.; Sorba, P., Universality and Shannon entropy of codon usage, Phys. Rev. E, 68, Article 061910 pp., 2003
[24] Frontali, C.; Pizzi, E., Similarity in oligonucleotide usage in introns and intergenic regions contributes to long-range correlation in the Caenorhabditis elegans genome, Gene, 232, 87-95, 1999
[25] Gao, K.; Miller, J., Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments, PLoS One, 6, Article e18464 pp., 2011
[26] Gatherer, D., Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences, Bioinf. Biol. Insights, 1, 101-126, 2007
[27] Gerlach, M.; Altmann, EG., Stochastic model for the vocabulary growth in natural languages, Phys. Rev. X, 3, Article 021006 pp., 2013
[28] Gimona, M., Protein linguistics — a grammar for modular protein assembly?, Nat. Rev. Mol. Cell Biol., 7, 68-73, 2006
[29] Grimwood, J.; Gordon, LA.; Olsen, A.; Terry, A.; Schmutz, J.; Lamerdin, J.; Hellsten, U.; Goodstein, D.; Couronne, O.; Tran-Gyamfi, M., The DNA sequence and biology of human chromosome 19, Nature, 428, 529-535, 2004
[30] Harris, RA.; Raveendran, M.; Worley, KC.; Rogers, J., Unusual sequence characteristics of human chromosome 19 are conserved across 11 nonhuman primates, BMC Evol. Biol., 20, 33, 2020
[31] Heaps, HS., Information Retrieval: Computational and Theoretical Aspects, 1978, Academic Press: Academic Press New York, USA · Zbl 0471.68075
[32] Herdan, G., Type-Token Mathematics: A Textbook of Mathematical Linguistics, 1960, Mouton: Mouton The Hague, Netherlands · Zbl 0163.40904
[33] Hernández-Fernández, A.; Torre, IG.; Garrido, JM.; Lacasa, L., Linguistic laws in speech: the case of Catalan and Spanish, Entropy, 21, 1153, 2019
[34] Hoopes, J. (ed.), Peirce on Signs: Writings on Semiotic by Charles Sanders Peirce, 1991, University of North Carolina Press: University of North Carolina Press Chapel Hill, NC, USA
[35] Ionita-Laza, I.; Lange, C.; Laird, NM., Estimating the number of unseen variants in the human genome, Proc. Natl. Acad. Sci., 106, 5008-5013, 2009 · Zbl 1202.92059
[36] Ispolatov, I.; Krapivsky, PL.; Yuryev, A., Duplication-divergence model of protein interaction network, Phys. Rev. E, 71, Article 061911 pp., 2005
[37] Kay, LE., Who Wrote the Book of Life? A History of the Genetic Code, 2000, Stanford University Press: Stanford University Press Stanford, CA, USA
[38] Konopka, AK.; Martindale, C., Noncoding DNA, Zipf’s law, and language (letter), Science, 268, 5212, 1995
[39] Koonin, EV.; Wolf, YI.; Karev, GP., The structure of the protein universe and genome evolution, Nature, 420, 218-223, 2002
[40] Koren, S.; Walenz, BP.; Berlin, K.; Miller, JR.; Bergman, NH.; Phillippy, AM., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation, Genome Res., 27, 722-736, 2017
[41] Li, W., Expansion-modification systems: A model for spatial 1/f spectra, Phys. Rev. A, 43, 5240-5260, 1991
[42] Li, W., Zipf’s law everywhere, Glottometrics, 5, 14-21, 2002
[43] Li, W., Menzerath’s law at the gene-exon level in the human genome, Complexity, 17, 49-53, 2011
[44] Li, W., On parameters of the human genome, J. Theoret. Biol., 288, 92-104, 2011 · Zbl 1397.92458
[45] Li, W.; Fontanelli, O.; Miramontes, P., Size distribution of function-based human gene sets and the split-merge model, Royal Soc. Open Sci., 3, Article 160275 pp., 2016
[46] Li, W.; Freudenberg, J.; Freudenberg, J., Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome, Gene, 691, 141-152, 2019
[47] Li, W.; Freudenberg, J.; Miramontes, P., Diminishing return for increased mappability with longer sequencing reads: implications of the k-mer distributions in the human genome, BMC Bioinf., 15, 2, 2014
[48] Li, W.; Miramontes, P., Fitting ranked English and Spanish letter frequency distribution in US and Mexican presidential speeches, J. Quant. Linguist., 18, 359-380, 2011
[49] Li, W.; Miramontes, P.; Cocho, G., Fitting ranked linguistic data with two-parameter functions, Entropy, 12, 1743-1764, 2010
[50] Li, W.; Nyholt, DR., Marker selection by AIC and BIC, Genet. Epid., 21, suppl 1, S272-S277, 2001
[51] Lü, L.; Zhang, ZK.; Zhou, T., Deviation of Zipf’s and Heaps’ Laws in human languages with limited dictionary sizes, Sci. Rep., 3, 1082, 2013
[52] Luscombe, NM.; Qian, J.; Zhang, Z.; Johnson, T.; Gerstein, M., The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties, Genome Biol., 3, Article research0040 pp., 2002
[53] Lynch, M.; Conery, JS., The origins of genome complexity, Science, 302, 1401-1404, 2003
[54] Madani, A.; Krause, B.; Greene, E. R.; Subramanian, S.; Mohr, B. P.; Holton, J. M.; Olmos, J. L.; Xiong, C.; Sun, Z. Z.; Socher, R.; Fraser, J. S.; Naik, N., Large language models generate functional protein sequences across diverse families, Nat. Biotech., 41, 1099-1106, 2023
[55] Mantegna, RN.; Buldyrev, SV.; Goldberger, AL.; Havlin, S.; Peng, CK.; Simons, M.; Stanley, HE., Linguistic features of noncoding DNA sequences, Phys. Rev. Lett., 73, 3169-3172, 1994
[56] Medini, D.; Donati, C.; Rappuoli, R.; Tettelin, H., (Tettelin, H.; Medini, D., The Pangenome, 2020, Springer: Springer Switzerland), 3-20
[57] Menzerath, P., Über einige phonetische Probleme, (Actes du premier Congrès International de Linguistes, 1928, Sijthoff: Sijthoff Leiden, Netherlands), 104-105
[58] Miller, J.; McLachlan, AD.; Klug, A., Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes, EMBO J., 4, 1609-1614, 1985
[59] Mistry, J.; Chuguransky, S.; Williams, L.; Qureshi, M.; Salazar, GA.; Sonnhammer, ELL.; Tosatto, SCE.; Paladin, L.; Raj, S.; Richardson, LJ.; Finn, RD.; Bateman, A., Pfam: The protein families database in 2021, Nucl. Acids Res., 49, D412-D419, 2021
[60] Moghaddasi, H.; Khalifeh, K.; Darooneh, AH., Distinguishing functional DNA words; a method for measuring clustering levels, Sci. Rep., 7, 41543, 2017
[61] Mukhopadhyay, I.; Som, A.; Sahoo, S., Word organization in coding DNA: A mathematical model, Theor. Biosci., 125, 1-17, 2006
[62] Müller, A.; MacCallum, RM.; Sternberg, MJE., Structural characterization of the human proteome, Genome Res., 12, 1625-1641, 2002
[63] Murzin, AG.; Brenner, SE.; Hubbard, T.; Chothia, C., SCOP: A structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol., 247, 536-540, 1995
[64] Nasir, A.; Kim, KM.; Caetano-Anollés, G., Phylogenetic tracings of proteome size support the gradual accretion of protein structural domains and the early origin of viruses from primordial cells, Front. Microbiol., 8, 1178, 2017
[65] Nelson, SC.; Yum, JH.; Ceccarelli, L., How metaphors about the genome constrain CRISPR metaphors: separating the Text from its Editor, Am. J. Bioeth., 15, 60-62, 2015
[66] Newman, MEJ., Power laws, Pareto distributions and Zipf’s law, Contemp. Phys., 46, 323-351, 2005
[67] Nijkamp, E.; Ruffolo, JA.; Weinstein, EN.; Naik, N.; Madani, A., ProGen2: Exploring the boundaries of protein language models, Cell Syst., 14, P968-P978, 2023
[68] Nikolaou, C., Menzerath-Altmann law in mammalian exons reflects the dynamics of gene structure evolution, Comput. Biol. Chem., 53, A, 134-143, 2014
[69] Ofer, D.; Brandes, N.; Linial, M., The language of proteins: NLP, machine learning & protein sequences, Comput. Struct. Biotech. J., 19, 1750-1758, 2021
[70] Paysan-Lafosse, T.; Blum, M.; Chuguransky, S.; Grego, T.; Pinto, BL.; Salazar, GA.; Bileschi, ML.; Bork, P.; Bridge, A.; Colwell, L.; Gough, J.; Haft, DH.; Letunić, I.; Marchler-Bauer, A.; Mi, H.; Natale, DA.; Orengo, CA.; Pandurangan, AP.; Rivoire, C.; Sigrist, CJA.; Sillitoe, I.; Thanki, N.; Thomas, PD.; Tosatto, SCE.; Wu, CH.; Bateman, A., InterPro in 2022, Nucleic Acids Res., 51, D418-D427, 2022
[71] Petersen, AM.; Tenenbaum, JN.; Havlin, S.; Stanley, HE.; Perc, M., Languages cool as they expand: Allometric scaling and the decreasing need for new words, Sci. Rep., 2, 943, 2012
[72] Phillips, GJ.; Arnold, J.; Ivarie, R., The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis, Nucleic Acids Res., 15, 2627-2638, 1987
[73] Qian, J.; Luscombe, NM.; Gerstein, M., Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model, J. Mol. Biol., 313, 673-681, 2001
[74] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I., Language models are unsupervised multitask learners, 2019, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf, preprint
[75] Rahman, A.; Hallgrimsdottir, I.; Eisen, M.; Pachter, L., Association mapping from sequencing reads using k-mers, eLife, 7, Article e32920 pp., 2018
[76] Rao, R.; Meier, J.; Sercu, T.; Ovchinnikov, S.; Rives, A., Transformer protein language models are unsupervised structure learners, BioRxiv, 2020, preprint
[77] Scaiewicz, A.; Levitt, M., The language of the protein universe, Curr. Opin. Genet. Dev., 35, 50-56, 2015
[78] Searls, DB., The language of genes, Nature, 420, 211-217, 2002
[79] Semple, S.; Ferrer-i-Cancho, R.; Gustison, ML., Linguistic laws in biology, Trends Ecol. Evol., 37, 53-66, 2022
[80] Sheinman, M.; Ramisch, A.; Massip, F.; Arndt, PF., Evolutionary dynamics of selfish DNA explains the abundance distribution of genomic subsequences, Sci. Rep., 6, 30851, 2016
[81] Stephens, ZD.; Iyer, RK., Measuring the mappability spectrum of reference genome assemblies, (BCB’18: Proc. 2018 ACM Int. Conf. on Bioinformatics, Comp. Biol. and Health Informatics, 2018, ACM: ACM New York, NY, USA), URL: https://doi.org/10.1145/3233547.3233582
[82] Tettelin, H.; Riley, D.; Cattuto, C.; Medini, D., Comparative genomics: the bacterial pan-genome, Curr. Opin. Microbiol., 11, 472-477, 2008
[83] The UniProt Consortium, Reorganizing the protein space at the Universal Protein Resource (UniProt), Nucleic Acids Res., 40, D71-D75, 2012
[84] Tunnicliffe, M.; Hunter, G., Random sampling of the Zipf-Mandelbrot distribution as a representation of vocabulary growth, Phys. A, 608, Article 128259 pp., 2022 · Zbl 07639854
[85] van Leijenhorst, DC.; van der Weide, TP., A formal derivation of Heaps’ Law, Inf. Sci., 170, 263-272, 2005 · Zbl 1070.60009
[86] Vilo, J., Pattern Discovery from Biosequences, 2002, Department of Computer Science, University of Helsinki, (Ph.D Thesis)
[87] Wagner, A., Life Finds a Way: Mapping the Origins of Creativity, 2019, Basic Books: Basic Books New York, NY, USA
[88] Wang, Y.; Zhang, H.; Zhong, H.; Xue, Z., Protein domain identification methods and online resources, Comput. Struct. Biotechnol. J., 19, 1145-1153, 2021
[89] Webster, JJ.; Kit, C., Tokenization as the initial phase in NLP, (Proc. 14th Conf. Comp. Linguistics, Vol. 4, 1992, Asso. Comp. Linguistics), 1107-1110
[90] Wetzel, L., Types and Tokens: On Abstract Objects, 2009, MIT Press: MIT Press Cambridge, MA, USA
[91] Yu, L.; Tanwar, DK.; Wolf, ED.; Penha, YI.; Koonin, EV.; Basu, MK., Grammar of protein domain architectures, Proc. Natl. Acad. Sci., 116, 3636-3645, 2019
[92] Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, WM., Alignment-free sequence comparison: benefits, applications, and tools, Genome Biol., 18, 186, 2017
[93] Zipf, GK., The Psycho-Biology of Language, 1935, Houghton Mifflin: Houghton Mifflin Boston, MA
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.