A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences. (English) Zbl 1307.92309

Summary: Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in the design of stable proteins. Predicting whether a DNA sequence is thermophilic is a long-standing problem that has not been satisfactorily resolved. Chaos game representation (CGR) can expose patterns hidden in DNA sequences and visually reveal previously unknown structure. Fractal dimensions are good tools for measuring the size of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high-dimensional vector using the CGR algorithm and fractal dimension, and then predict the thermostability of the sequence from these fractal features with a support vector machine (SVM). We conducted experiments on three groups of features: 17-dimensional, 65-dimensional, and 257-dimensional vectors. Each group was evaluated by 10-fold cross-validation. The 257-dimensional group gives the best results, with an average accuracy of 0.9456 and an average Matthews correlation coefficient (MCC) of 0.8878. The results are also compared with previous work that used CGR features alone. The comparison shows the high effectiveness of the new hybrid fractal algorithm.
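The feature construction described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the corner layout, the grid-frequency features, and the box-counting estimator are assumptions, as are all function names. In particular, reading the 17-, 65-, and 257-dimensional vectors as 4^k CGR cell frequencies (k = 2, 3, 4) plus one fractal dimension is an inference from the vector sizes, not something the summary states. The SVM step is omitted to keep the sketch free of third-party dependencies.

```python
import math

# Assumed corner layout for the chaos game; the paper may use a different one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return the chaos game trajectory of a DNA sequence in the unit square."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

def grid_counts(points, k):
    """Count points in each cell of a 2^k x 2^k grid (4^k cells, row-major)."""
    n = 2 ** k
    counts = [0] * (n * n)
    for x, y in points:
        col = min(int(x * n), n - 1)  # clamp in case rounding reaches 1.0
        row = min(int(y * n), n - 1)
        counts[row * n + col] += 1
    return counts

def box_counting_dimension(points, max_k=6):
    """Least-squares slope of log N(eps) against log(1/eps) for eps = 2^-k."""
    if not points:
        return 0.0
    xs, ys = [], []
    for k in range(1, max_k + 1):
        occupied = sum(1 for c in grid_counts(points, k) if c > 0)
        xs.append(k * math.log(2.0))
        ys.append(math.log(occupied))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx

def hybrid_features(seq, k=2):
    """Hypothetical hybrid vector: 4^k cell frequencies plus one fractal dimension."""
    points = cgr_points(seq)
    total = max(len(points), 1)
    freqs = [c / total for c in grid_counts(points, k)]
    return freqs + [box_counting_dimension(points)]
```

Under these assumptions, `hybrid_features(seq, k=2)`, `k=3`, and `k=4` yield vectors of length 17, 65, and 257, which could then be passed to any SVM implementation and scored by 10-fold cross-validation.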

MSC:

92D20 Protein sequences, DNA sequences

References:

[1] Baish, J. W.; Jain, R. K., Cancer, angiogenesis and fractals, Nat. Med., 4, 984 (1998)
[2] Baish, J. W.; Jain, R. K., Fractals and cancer, Cancer Res., 60, 3683-3688 (2000)
[3] Barabote, R. D.; Xie, G.; Leu, D. H., Complete genome of the cellulolytic thermophile Acidothermus cellulolyticus 11B provides insights into its ecophysiological and evolutionary adaptations, Genome Res., 19, 6, 1033-1043 (2009)
[4] Cai, Y. D.; Liu, X. J.; Xu, X. B.; Chou, K. C., Prediction of protein structural classes by support vector machines, Comput. Chem., 26, 293-296 (2002)
[5] Cai, Y. D.; Liu, X. J.; Xu, X. B.; Chou, K. C., Support vector machines for predicting HIV protease cleavage sites in protein, J. Comput. Chem., 23, 267-274 (2002)
[6] Cai, Y. D.; Liu, X. J.; Xu, X. B.; Chou, K. C., Support vector machines for the classification and prediction of beta-turn types, J. Pept. Sci., 8, 297-301 (2002)
[7] Cai, Y. D.; Lin, S.; Chou, K. C., Support vector machines for prediction of protein signal sequences and their cleavage sites, Peptides, 24, 159-161 (2003)
[8] Cai, Y. D.; Feng, K. Y.; Li, Y. X.; Chou, K. C., Support vector machine for predicting alpha-turn types, Peptides, 24, 629-630 (2003)
[9] Cai, Y. D.; Zhou, G. P.; Chou, K. C., Support vector machines for predicting membrane protein types by using functional domain composition, Biophys. J., 84, 3257-3263 (2003)
[10] Cai, Y. D.; Zhou, G. P.; Jen, C. H.; Lin, S. L.; Chou, K. C., Identify catalytic triads of serine hydrolases by support vector machines, J. Theor. Biol., 228, 551-557 (2004) · Zbl 1439.92141
[11] Cai, Y. D.; Pong-Wong, R.; Feng, K.; Jen, J. C.H.; Chou, K. C., Application of SVM to predict membrane protein types, J. Theor. Biol., 226, 373-376 (2004)
[12] Chen, C.; Chen, L.; Zou, X.; Cai, P., Prediction of protein secondary structure content using the concept of Chou’s pseudo amino acid composition and support vector machine, Protein Pept. Lett., 16, 27-31 (2009)
[13] Chen, J.; Liu, H.; Yang, J.; Chou, K. C., Prediction of linear B-cell epitopes using amino acid pair antigenicity scale, Amino Acids, 33, 423-428 (2007)
[14] Chou, K. C.; Zhou, G. P., Role of the protein outside active site on the diffusion-controlled reaction of enzyme, J. Am. Chem. Soc., 104, 1409-1413 (1982)
[15] Chou, K. C., Review: low-frequency collective motion in biomacromolecules and its biological functions, Biophys. Chem., 30, 3-48 (1988)
[16] Chou, K. C., Graphic rules in steady and non-steady enzyme kinetics, J. Biol. Chem., 264, 12074-12079 (1989)
[17] Chou, K. C., Review: applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems, Biophys. Chem., 35, 1-24 (1990)
[18] Chou, K. C., Energy-optimized structure of antifreeze protein and its binding mechanism, J. Mol. Biol., 223, 509-517 (1992)
[19] Chou, K. C., A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins, J. Biol. Chem., 268, 16938-16948 (1993)
[20] Chou, K. C.; Kezdy, F. J.; Reusser, F., Review: steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases, Anal. Biochem., 221, 217-230 (1994)
[21] Chou, K. C.; Zhang, C. T., Review: prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol., 30, 275-349 (1995)
[22] Chou, K. C., Review: prediction of HIV protease cleavage sites in proteins, Anal. Biochem., 233, 1-14 (1996)
[23] Chou, K. C.; Cai, Y. D., Using functional domain composition and support vector machines for prediction of protein subcellular location, J. Biol. Chem., 277, 45765-45769 (2002)
[24] Chou, K. C.; Shen, H. B., Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms, Natural Science, 2010, 2, 1090-1103), Nat. Protocols, 3, 153-162 (2008)
[25] Chou, K. C.; Shen, H. B., Review: recent advances in developing web-servers for predicting protein attributes, Nat. Sci., 2, 63-92 (2009), (openly accessible at)
[26] Chou, K. C.; Shen, H. B., Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization, PLoS ONE, 5, e11335 (2010)
[27] Chou, K. C., Graphic rule for drug metabolism systems, Curr. Drug Metab., 11, 369-378 (2010)
[28] Chou, K. C., Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review), J. Theor. Biol., 273, 236-247 (2011) · Zbl 1405.92212
[29] Chou, K. C.; Wu, Z. C.; Xiao, X., iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins, PLoS ONE, 6, e18258 (2011)
[30] Ding, H.; Luo, L.; Lin, H., Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition, Protein Pept. Lett., 16, 351-355 (2009)
[31] Ding, H.; Liu, L.; Guo, F. B.; Huang, J.; Lin, H., Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition, Protein Pept. Lett., 18, 58-63 (2011)
[32] Esmaeili, M.; Mohabatkar, H.; Mohsenzadeh, S., Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses, J. Theor. Biol., 263, 203-209 (2010) · Zbl 1406.92455
[33] Falconer, K. J., Techniques in Fractal Geometry (1997), Wiley · Zbl 0869.28003
[34] Foroutan, P. K.; Dutilleul, P.; Smith, D. L., Advances in the implementation of the box-counting method of fractal dimension estimation, Appl. Math. Comput., 105, 195-210 (1999) · Zbl 1025.28004
[35] Forterre, P., A hot story from comparative genomics: reverse gyrase is the only hyperthermophile-specific protein, Trends Genet., 18, 236-237 (2002)
[36] Georgiou, D. N.; Karakasidis, T. E.; Nieto, J. J.; Torres, A., Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition, J. Theor. Biol., 257, 17-26 (2009) · Zbl 1400.92393
[37] Gu, Q.; Ding, Y. S.; Zhang, T. L., Prediction of G-protein-coupled receptor classes in low homology using Chou’s pseudo amino acid composition with approximate entropy and hydrophobicity patterns, Protein Pept. Lett., 17, 559-567 (2010)
[38] Gouy, M.; Gautier, C., Codon usage in bacteria: correlation with gene expressivity, Nucleic Acids Res., 10, 7055-7074 (1982)
[39] Grantham, R.; Gautier, C.; Gouy, M., Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type, Nucleic Acids Res., 8, 1893-1912 (1980)
[40] Grizzi, F.; Russo, C.; Colombo, P.; Franceschini, B.; Frezza, E. E.; Cobos, E.; Chiriva-Internati, M., Quantitative evaluation and modeling of two-dimensional neovascular network complexity: the surface fractal dimension, BMC Cancer, 5, 14 (2005)
[41] Gromiha, M. M.; Suresh, M. X., Discrimination of mesophilic and thermophilic proteins using machine learning algorithms, Proteins, 70, 1274-1279 (2008)
[42] Hao, B. L.; Lee, H. C.; Zhang, S. Y., Fractals related to long DNA sequences and complete genomes, Chaos Solitons Fractals, 11, 825-836 (2000) · Zbl 0959.92019
[43] Hu, L.; Zheng, L.; Wang, Z.; Li, B.; Liu, L., Using pseudo amino acid composition to predict protease families by incorporating a series of protein biological features, Protein Pept. Lett., 18, 552-558 (2011)
[44] Huang, Y.; Niu, B. F.; Gao, Y.; Fu, L. M.; Li, W. Z., CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics, 26, 5, 680-682 (2010)
[45] Jaenicke, R.; Bohm, G., The stability of proteins in extreme environments, Curr. Opinion Struct. Biol., 8, 6, 738-748 (1998)
[46] Jeffrey, H. J., Chaos game representation of gene structure, Nucleic Acids Res., 18, 2163-2170 (1990)
[47] Joshi, R. R.; Sekharan, S., Characteristic peptides of protein secondary structural motifs, Protein Pept. Lett., 17, 1198-1206 (2010)
[48] Kanaya, S.; Kinouchi, M., Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome, Gene, 276, 89-99 (2001)
[49] Kandaswamy, K. K.; Pugalenthi, G.; Moller, S.; Hartmann, E.; Kalies, K. U.; Suganthan, P. N.; Martinetz, T., Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition, Protein Pept. Lett., 17, 1473-1479 (2010)
[50] Kumar, S.; Tsai, C. J.; Nussinov, R., Factors enhancing protein thermostability, Protein Eng., 13, 179-191 (2000)
[51] Lawyer, F. C.; Stoffel, S.; Saiki, R. K.; Myambo, K.; Drummond, R.; Gelfand, D. H., Isolation, characterization, and expression in Escherichia coli of the DNA polymerase gene from Thermus aquaticus, J. Biol. Chem., 264, 6427-6437 (1989)
[52] Lin, W. Z.; Xiao, X.; Chou, K. C., GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis, Protein Eng. Des. Sel., 22, 699-705 (2009)
[53] Lin, H.; Ding, H., Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition, J. Theor. Biol., 269, 64-69 (2011) · Zbl 1307.92080
[54] Lynn, D. J.; Singer, G. A.C.; Hickey, D. A., Synonymous codon usage is subject to selection in thermophilic bacteria, Nucleic Acids Res., 30, 4272-4277 (2002)
[55] Mandelbrot, B. B., The Fractal Geometry of Nature (1982), Freeman: Freeman San Francisco · Zbl 0504.28001
[56] Masso, M.; Vaisman, I. I., Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms, J. Theor. Biol., 266, 560-568 (2010) · Zbl 1407.92082
[57] Mohabatkar, H., Prediction of cyclin proteins using Chou’s pseudo amino acid composition, Protein Pept. Lett., 17, 1207-1214 (2010)
[58] Mohabatkar, H.; Mohammad Beigi, M.; Esmaeili, A., Prediction of GABA(A) receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine, J. Theor. Biol., 281, 18-23 (2011) · Zbl 1397.92215
[59] Montanucci, L.; Martelli, P. L.; Fariselli, P.; Casadio, R., Robust determinants of thermostability highlighted by a codon frequency index capable of discriminating thermophilic from mesophilic genomes, J. Proteome Res., 6, 2502-2508 (2007)
[60] Montanucci, L.; Martelli, P. L.; Fariselli, P.; Casadio, R., Predicting protein thermostability changes from sequence upon multiple mutations, Bioinformatics, 24, i190-i195 (2008)
[61] Nanni, L.; Lumini, A., A further step toward an optimal ensemble of classifiers for peptide classification, a case study: HIV protease, Protein Pept. Lett., 16, 163-167 (2009)
[62] Qiu, J. D.; Huang, J. H.; Shi, S. P.; Liang, R. P., Using the concept of Chou’s pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform, Protein Pept. Lett., 17, 715-722 (2010)
[63] Singer, G. A.C.; Hickey, D. A., Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content, Gene, 317, 39-47 (2003)
[64] Soddell, J.; Seviour, R., A comparison of methods for determining the fractal dimensions of colonies of filamentous bacteria, Binary, 6, 21-31 (1994)
[65] Spasic, S.; Kalauzi, A.; Grbic, G.; Martac, L.; Culic, M., Fractal analysis of rat brain activity after injury, Med. Biol. Eng. Comput., 43, 345-348 (2005)
[66] Vapnik, V., Statistical Learning Theory (1998), Wiley Interscience: Wiley Interscience New York · Zbl 0935.62007
[67] Wu, J.; Lu, J.; Wang, J. Q., Application of chaos and fractal models to water quality time series prediction, Environ. Modelling Software, 24, 632-636 (2009)
[68] Xiao, X.; Wang, P.; Chou, K. C., GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes, J. Comput. Chem., 30, 1414-1423 (2009)
[69] Xiao, X.; Wang, P.; Chou, K. C., GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions, Mol. Biosyst., 7, 911-919 (2011)
[70] Xiao, X.; Wang, P.; Chou, K. C., Quat-2L: a web-server for predicting protein quaternary structural attributes, Mol. Diversity, 15, 1, 149-155 (2011)
[71] Xiao, X.; Wu, Z. C.; Chou, K. C., iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites, J. Theor. Biol., 284, 42-51 (2011) · Zbl 1397.92238
[72] Xiao, X.; Wu, Z. C.; Chou, K. C., A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites, PLoS ONE, 6, 6, e20592 (2011)
[73] Xiao, X.; Chou, K. C., Using pseudo amino acid composition to predict protein attributes via cellular automata and other approaches, Curr. Bioinf., 6, 251-260 (2011)
[74] Yang, J. Y.; Peng, Z. L.; Yu, Z. G.; Zhang, R. J.; Anh, V.; Wang, D., Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol., 257, 618-626 (2009) · Zbl 1400.92417
[75] Yu, L.; Guo, Y.; Li, Y.; Li, G.; Li, M.; Luo, J.; Xiong, W.; Qin, W., SecretP: identifying bacterial secreted proteins by fusing new features into Chou’s pseudo-amino acid composition, J. Theor. Biol., 267, 1-6 (2010) · Zbl 1410.92040
[76] Yu, Z. G.; Anh, V.; Lau, K. S., Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses, J. Theor. Biol., 226, 341-348 (2004) · Zbl 1439.92148
[77] Zakeri, P.; Moshiri, B.; Sadeghi, M., Prediction of protein submitochondria locations based on data fusion of various features of sequences, J. Theor. Biol., 269, 208-216 (2011) · Zbl 1307.92094
[78] Zeng, Y. H.; Guo, Y. Z.; Xiao, R. Q.; Yang, L.; Yu, L. Z.; Li, M. L., Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach, J. Theor. Biol., 259, 366-372 (2009) · Zbl 1402.92193
[79] Zhang, G. Y.; Fang, B. S., Study on the discrimination of thermophilic and mesophilic proteins based on dipeptide composition, Chin. J. Biotech., 22, 2, 293-298 (2006)
[80] Zhang, G. Y.; Fang, B. S., Support vector machine for discrimination of thermophilic and mesophilic proteins based on amino acid composition, Protein Pept. Lett., 13, 965-970 (2006)
[81] Zhou, G. P., The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism, J. Theor. Biol., 284, 142-148 (2011) · Zbl 1397.92245
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.