Protein sequence comparison method based on 3-ary Huffman coding. (English) Zbl 1519.92170

Summary: Based on the 3-ary Huffman coding algorithm, we propose a digital mapping method for protein sequences. First, a 3-ary Huffman tree is defined from the frequencies of the 20 amino acids in the given protein sequences. The 0-2 codes that this tree assigns to the 20 amino acids convert long protein sequences one-to-one into 0-2 digital sequences. Using the frequency and distribution information of the 0-2 codes of the 20 amino acids in these digital sequences, we design 40-dimensional vectors to characterize the protein sequences. Next, the proposed digital mapping method is applied to three separate problems: similarity comparison of nine ND6 proteins, evolutionary trend analysis of the 2009 pandemic human influenza A (H1N1) viruses from January 2020 to June 2022, and evolution analysis of 95 coronavirus genes. The results illustrate the utility of the proposed method.
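
The coding step described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tie-breaking rule, and the two toy sequences are assumptions made for illustration. It builds a standard 3-ary Huffman tree from amino-acid counts, padding the 20 symbols with one zero-weight dummy leaf so that every merge combines exactly three nodes, and then maps a protein sequence to a string over the digits 0, 1, 2. The summary does not specify how the 40-dimensional vector is computed, so the sketch stops at the digit sequence.

```python
import heapq
from collections import Counter
from itertools import count

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ternary_huffman_codes(sequences, alphabet=AMINO_ACIDS):
    """Return {residue: code over '0','1','2'} from residue counts."""
    counts = Counter(ch for seq in sequences for ch in seq if ch in alphabet)
    tiebreak = count()  # unique counter so heap entries never compare nodes
    # Heap entry: (weight, tiebreak, node); a node is a residue, None, or a list.
    heap = [(counts[a], next(tiebreak), a) for a in alphabet]
    # A full 3-ary tree needs (leaves - 1) divisible by 2; pad with dummies.
    while (len(heap) - 1) % 2 != 0:
        heap.append((0, next(tiebreak), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(3)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tiebreak), [c[2] for c in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):  # internal node: branch digits 0, 1, 2
            for digit, child in enumerate(node):
                walk(child, prefix + str(digit))
        elif node is not None:  # real residue; dummy leaves are skipped
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(sequence, codes):
    """Map a protein sequence to its 0-2 digit string (KeyError on unknown residues)."""
    return "".join(codes[ch] for ch in sequence)


def decode(digits, codes):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {code: residue for residue, code in codes.items()}
    out, buf = [], ""
    for d in digits:
        buf += d
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


# Made-up toy strings, used only to exercise the functions.
toy = ["MKTAYIAKQR", "MSLLTEVETY"]
codes = ternary_huffman_codes(toy)
digits = encode(toy[0], codes)
```

Because the code is prefix-free, decode(encode(s, codes), codes) returns s, which is the one-to-one property mentioned in the summary. Residues that occur more often never receive a longer code than residues that occur less often.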

MSC:

92D20 Protein sequences, DNA sequences

Software:

2D-MH; MEGA
Full Text: DOI

References:

[1] T. D. Pham, J. Zuegg, A probabilistic measure for alignment-free sequence comparison, Bioinformatics 20 (2004) 3455-3461.
[2] S. Vinga, J. Almeida, Alignment-free sequence comparison - a review, Bioinformatics 19 (2003) 513-523.
[3] E. Hamori, J. Ruskin, H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem. 258 (1983) 1318-1327.
[4] E. Hamori, Graphic representation of long DNA sequences by the method of H curves - current results and future aspects, BioTechniques 7 (1989) 710-720.
[5] H. J. Jeffrey, Chaos game representation of gene structure, Nucleic Acids Res. 18 (1990) 2163-2170.
[6] A. Nandy, A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes, Curr. Sci. 66 (1994) 309-314.
[7] D. Bielińska-Wąż, Four-component spectral representation of DNA sequences, J. Math. Chem. 47 (2010) 41-51. · Zbl 1194.92024
[8] D. Bielińska-Wąż, W. Nowak, P. Wąż, A. Nandy, T. Clark, Distribution moments of 2D-graphs as descriptors of DNA sequences, Chem. Phys. Lett. 443 (2007) 408-413.
[9] B. Liao, C. Zeng, F. Li, Y. Tang, Analysis of similarity/dissimilarity of DNA sequences based on dual nucleotides, MATCH Commun. Math. Comput. Chem. 59 (2008) 647-652. · Zbl 1270.92015
[10] B. Liao, Q. Xiang, L. Cai, Z. Cao, A new graphical coding of DNA sequence and its similarity calculation, Phys. A 392 (2013) 4663-4667. · Zbl 1395.92105
[11] M. Randić, M. Vračko, N. Lerš, D. Plavšić, Novel 2-D graphical representation of DNA sequences and their numerical characterization, Chem. Phys. Lett. 368 (2003) 1-6.
[12] M. Randić, Another look at the chaos-game representation of DNA, Chem. Phys. Lett. 456 (2008) 84-88.
[13] G. Jaklič, T. Pisanski, M. Randić, Characterization of complex biological systems by matrix invariants, J. Comput. Biol. 13 (2006) 1558-1564.
[14] Z. Qi, L. Li, X. Qi, Using Huffman coding method to visualize and analyze DNA sequences, J. Comput. Chem. 32 (2011) 3233-3240.
[15] M. Randić, 2-D graphical representation of proteins based on virtual genetic code, SAR QSAR Environ. Res. 15 (2004) 147-157.
[16] M. Randić, J. Zupan, A. T. Balaban, Unique graphical representation of protein sequences based on nucleotide triplet codons, Chem. Phys. Lett. 397 (2004) 247-252.
[17] M. Randić, D. Butina, J. Zupan, Novel 2-D graphical representation of proteins, Chem. Phys. Lett. 419 (2006) 528-532.
[18] F. Bai, T. Wang, A 2-D graphical representation of protein sequences based on nucleotide triplet codons, Chem. Phys. Lett. 413 (2005) 458-462.
[19] P. He, D. Li, Y. Zhang, X. Wang, Y. Yao, A 3D graphical representation of protein sequences based on the Gray code, J. Theor. Biol. 304 (2012) 81-87. · Zbl 1397.92528
[20] M. Randić, 2-D graphical representation of proteins based on physico-chemical properties of amino acids, Chem. Phys. Lett. 444 (2007) 176-180.
[21] Y. Yao, Q. Dai, C. Li, P. He, X. Nan, Y. Zhang, Analysis of similarity/dissimilarity of protein sequences, Proteins: Struct. Func. Bioinf. 73 (2008) 864-871.
[22] M. I. A. el Maaty, M. M. Abo-Elkhier, M. A. Abd Elwahaab, 3D graphical representation of protein sequences and their statistical characterization, Phys. A 389 (2010) 4668-4676.
[23] Z. Wu, X. Xiao, K. Chou, 2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids, J. Theor. Biol. 267 (2010) 29-34. · Zbl 1410.92089
[24] B. E. Blaisdell, A measure of the similarity of sets of sequences not requiring sequence alignment, Proc. Natl. Acad. Sci. 83 (1986) 5155-5159. · Zbl 0592.92011
[25] G. W. Stuart, K. Moffett, S. Baker, Integrated gene and species phylogenies from unaligned whole genome protein sequences, Bioinformatics 18 (2002) 100-108.
[26] T. J. Wu, J. P. Burke, D. B. Davison, A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words, Biometrics 53 (1997) 1431-1439. · Zbl 0931.62100
[27] V. Afreixo, C. A. Bastos, A. J. Pinho, S. P. Garcia, P. J. Ferreira, Genome analysis with inter-nucleotide distances, Bioinformatics 25 (2009) 3064-3070.
[28] Y. Gao, L. Luo, Genome-based phylogeny of dsDNA viruses by a novel alignment-free method, Gene 492 (2012) 309-314.
[29] S. Ding, Y. Li, X. Yang, T. Wang, A simple k-word interval method for phylogenetic analysis of DNA sequences, J. Theor. Biol. 317 (2013) 192-199.
[30] Q. Dai, X. Liu, Y. Yao, F. Zhao, Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison, J. Theor. Biol. 276 (2011) 174-180. · Zbl 1405.92213
[31] L. Yang, X. Zhang, H. Zhu, Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word, J. Theor. Biol. 295 (2012) 125-131. · Zbl 1336.92030
[32] D. Huang, H. Yu, Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids, IEEE/ACM Trans. Comput. Biol. Bioinf. 10 (2013) 457-467.
[33] F. Bai, J. Xu, L. Liu, Weighted relative entropy for phylogenetic tree based on 2-step Markov model, Math. Biosci. 246 (2013) 8-13. · Zbl 1309.92058
[34] Z. Qi, M. Jin, J. Wang, S. Li, Novel DNA sequence comparison method based on Markov chain and information entropy, Mini Rev. Org. Chem. 12 (2015) 524-533.
[35] D. A. Huffman, A method for the construction of minimum-redundancy codes, Proc. IRE. 40 (1952) 1098-1102. · Zbl 0137.13605
[36] Z. Qi, T. Fan, PN-curve: A 3D graphical representation of DNA sequences and their numerical characterization, Chem. Phys. Lett. 442 (2007) 434-440.
[37] J. Yu, X. Sun, J. Wang, TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications, J. Theor. Biol. 261 (2009) 459-468. · Zbl 1403.92226
[38] J. Yu, X. Sun, Reannotation of protein-coding genes based on an improved graphical representation of DNA sequence, J. Comput. Chem. 31 (2010) 2126-2135.
[39] D. Panas, P. Wąż, D. Bielińska-Wąż, A. Nandy, S. C. Basak, An application of the 2D-dynamic representation of DNA/RNA sequences to the prediction of influenza A virus subtypes, MATCH Commun. Math. Comput. Chem. 80 (2018) 295-310. · Zbl 1468.92051
[40] D. J. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge Univ. Press, Cambridge, 2003. · Zbl 1055.94001
[41] Z. Qi, X. Qi, Novel 2D graphical representation of DNA sequence based on dual nucleotides, Chem. Phys. Lett. 440 (2007) 139-144.
[42] M. Randić, K. Mehulić, D. Vukičević, T. Pisanski, D. Vikić-Topić, D. Plavšić, Graphical representation of proteins as four-color maps and their numerical characterization, J. Mol. Graphics Modell. 27 (2009) 637-641.
[43] B. Liao, B. Liao, X. Sun, Q. Zeng, A novel method for similarity analysis and protein sub-cellular localization prediction, Bioinformatics 26 (2010) 2678-2683.
[44] Z. Qi, M. Jin, S. Li, F. Jun, A protein mapping method based on physicochemical properties and dimension reduction, Comput. Biol. Med. 57 (2015) 1-7.
[45] Z. Qi, X. Wen, Novel protein sequence comparison method based on transition probability graph and information entropy, Chem. High Throughput Screen. 25 (2022) 392-400.
[46] Y. Yao, S. Yan, J. Han, Q. Dai, P. A. He, A novel descriptor of protein sequences and its application, J. Theor. Biol. 347 (2014) 109-117. · Zbl 1412.92251
[47] C. Li, Q. Dai, P. A. He, A time series representation of protein sequences for similarity comparison, J. Theor. Biol. 538 (2022) #111039. · Zbl 1483.92105
[48] Z. Qi, J. Feng, C. Liu, Evolution trends of the 2009 pandemic influenza A (H1N1) viruses in different continents from March 2009 to April 2012, Biologia 69 (2014) 407-418.
[49] J. F. W. Chan, S. Yuan, K. H. Kok, K. K. W. To, H. Chu, J. Yang, F. Xing, J. Liu, C. C. Y. Yip, R. W. S. Poon, H. W. Tsoi, S. K. Lo, K. H. Chan, V. K. Poon, W. M. Chan, J. D. Ip, J. P. Cai, V. C. Cheng, H. Chen, C. K. Hui, K. Y. Yuen, A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster, Lancet 395 (2020) 514-523.
[50] S. Kumar, G. Stecher, M. Li, C. Knyaz, K. Tamura, MEGA X: molecular evolutionary genetics analysis across computing platforms, Mol. Biol. Evol. 35 (2018) 1547-1549.
[51] X. Li, J. Zai, Q. Zhao, Q. Nie, Y. Li, B. T. Foley, A. Chaillon, Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2, J. Med. Virol. 92 (2020) 602-611.