Predicting DNA binding proteins using support vector machine with hybrid fractal features. (English) Zbl 1411.92108

Summary: DNA-binding proteins play a vital role in many biological processes. Predicting DNA-binding proteins from amino acid sequence is an important problem that has not yet been satisfactorily resolved. Chaos game representation (CGR) exposes patterns hidden in protein sequences and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools for measuring the size of complex, highly irregular geometric objects. To extract from protein sequences the intrinsic correlation with the DNA-binding property, this paper applies the CGR algorithm, fractal dimension, and amino acid composition to formulate numerical features of protein samples. Seven groups of features, all computable directly from the primary sequence, are extracted, and each group is evaluated by the 10-fold cross-validation test and the jackknife test. In the numerical experiments, the group combining amino acid composition and fractal dimension (a 21-dimensional vector) performs best, with an average accuracy of 81.82% and an average Matthews correlation coefficient (MCC) of 0.6017. The resulting predictor is also compared with the existing method DNA-Prot and shows better performance.
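
As a rough illustration of the kind of 21-dimensional feature vector the summary describes (20 amino acid composition values plus one fractal dimension), the following Python sketch may help. It is not the authors' implementation. The summary does not specify the CGR construction or the fractal dimension estimator, so two assumptions are made here: the chaos game is played on a regular 20-gon with one vertex per amino acid, and the fractal dimension is estimated by box counting. All function names are hypothetical. An MCC helper using the standard confusion-matrix formula is included for reference.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the 20 relative frequencies of the standard amino acids."""
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def cgr_points(seq):
    """Chaos game on a regular 20-gon (an assumed layout, not the paper's):
    move halfway from the current point toward the vertex of each residue."""
    vertices = {
        a: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
        for i, a in enumerate(AMINO_ACIDS)
    }
    x, y = 0.0, 0.0
    points = []
    for ch in seq:
        if ch not in vertices:
            continue  # skip non-standard symbols
        vx, vy = vertices[ch]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


def box_counting_dimension(points, levels=(2, 4, 8, 16, 32)):
    """Least-squares slope of log N(k) against log k, where N(k) is the
    number of occupied cells in a k-by-k grid over [-1, 1] x [-1, 1]."""
    if not points:
        raise ValueError("no points to measure")
    xs, ys = [], []
    for k in levels:
        occupied = {
            (min(int((px + 1) / 2 * k), k - 1), min(int((py + 1) / 2 * k), k - 1))
            for px, py in points
        }
        xs.append(math.log(k))
        ys.append(math.log(len(occupied)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


def hybrid_features(seq):
    """Concatenate composition (20 values) and fractal dimension (1 value)."""
    return amino_acid_composition(seq) + [box_counting_dimension(cgr_points(seq))]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Vectors produced by `hybrid_features` could then be passed to any support vector machine implementation and scored by cross-validation. Short sequences yield few CGR points, so the box-counting estimate is only meaningful for reasonably long proteins.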

MSC:

92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
28A80 Fractals
92-08 Computational methods for problems pertaining to biology
Full Text: DOI

References:

[1] Ahmad, S.; Gromiha, M. M.; Sarai, A., Analysis and prediction of DNA binding proteins and their binding residues based on composition, sequence and structural information, Bioinformatics, 20, 477-486 (2004)
[2] Ahmad, S.; Sarai, A., Moment-based prediction of DNA-binding proteins, J. Mol. Biol., 341, 65-71 (2004)
[3] Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389-3402 (1997)
[4] Baish, J. W.; Jain, R. K., Cancer, angiogenesis and fractals, Nat. Med., 4, 984 (1998)
[5] Baish, J. W.; Jain, R. K., Fractals and cancer, Cancer Res., 60, 3683-3688 (2000)
[6] Basu, S.; Pan, A.; Dutta, C.; Das, J., Chaos game representation of proteins, J. Mol. Graphics Modell., 15, 279-289 (1997)
[7] Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E., The Protein Data Bank, Nucleic Acids Res., 28, 235-242 (2000)
[8] Bhardwaj, N.; Langlois, R. E.; Zhao, G.; Lu, H., Kernel-based machine learning protocol for predicting DNA-binding proteins, Nucleic Acids Res., 33, 6486-6493 (2005)
[9] Cai, Y. D.; Lin, S. L., Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence, Biochim. Biophys. Acta, 1648, 127-133 (2003)
[10] Chen, W.; Feng, P. M.; Lin, H.; Chou, K. C., iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res., 41, e68 (2013)
[11] Chou, K. C.; Zhang, C. T., Review: prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol., 30, 275-349 (1995)
[12] Chou, K. C.; Shen, H. B., Review: recent advances in developing web-servers for predicting protein attributes, Nat. Sci., 2, 63-92 (2009), (openly accessible at)
[13] Chou, K. C., Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol., 273, 236-247 (2011) · Zbl 1405.92212
[14] Falconer, K. J., Techniques in Fractal Geometry (1997), Wiley · Zbl 0869.28003
[15] Fang, Y.; Guo, Y.; Feng, Y.; Li, M., Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features, Amino Acids, 34, 103-109 (2008)
[16] Foroutan, P. K.; Dutilleul, P.; Smith, D. L., Advances in the implementation of the box-counting method of fractal dimension estimation, Appl. Math. Comput., 105, 195-210 (1999) · Zbl 1025.28004
[17] Fujishima, K.; Komasa, M.; Kitamura, S.; Suzuki, H.; Tomita, M.; Kanai, A., Proteome-wide prediction of novel DNA/RNA-binding proteins using amino acid composition and periodicity in the hyperthermophilic archaeon Pyrococcus furiosus, DNA Res., 14, 91-102 (2007)
[18] Gasteiger, E.; Jung, E.; Bairoch, A., SWISS-PROT: connecting biomolecular knowledge via a protein database, Curr. Issues Mol. Biol., 3, 47-55 (2001)
[19] Grizzi, F.; Russo, C.; Colombo, P.; Franceschini, B.; Frezza, E. E.; Cobos, E.; Chiriva-Internati, M., Quantitative evaluation and modeling of two-dimensional neovascular network complexity: the surface fractal dimension, BMC Cancer, 5, 14 (2005)
[20] Hao, B. L.; Lee, H. C.; Zhang, S. Y., Fractals related to long DNA sequences and complete genomes, Chaos, Solitons Fractals, 11, 825-836 (2000) · Zbl 0959.92019
[21] Hayat, M.; Khan, A., Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition, J. Theor. Biol., 271, 10-17 (2011) · Zbl 1405.92217
[22] Hu, L.; Huang, T.; Shi, X.; Lu, W. C.; Cai, Y. D.; Chou, K. C., Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties, PLoS One, 6, e14556 (2011)
[23] Huang, Y.; Niu, B. F.; Gao, Y.; Fu, L. M.; Li, W. Z., CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics, 26, 5, 680-682 (2010)
[24] Ji, G.; Wu, X.; Shen, Y.; Huang, J.; Quinn, L. Q., A classification-based prediction model of messenger RNA polyadenylation sites, J. Theor. Biol., 265, 287-296 (2010) · Zbl 1460.92157
[25] Jeffrey, H. J., Chaos game representation of gene structure, Nucleic Acids Res., 18, 2163-2170 (1990)
[26] Kumar, K. K.; Pugalenthi, G.; Suganthan, P. N., DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest, J. Biomol. Struct. Dyn., 26, 679-686 (2009)
[27] Kumar, M.; Gromiha, M. M.; Raghava, G. P., Identification of DNA-binding proteins using support vector machines and evolutionary profiles, BMC Bioinf., 8, 463 (2007)
[28] Langlois, R. E.; Lu, H., Boosting the prediction and understanding of DNA binding domains from sequence, Nucleic Acids Res., 38, 3149-3158 (2010)
[29] Lin, W. Z.; Fang, J. A.; Xiao, X.; Chou, K. C., iDNA-Prot: identification of DNA binding proteins using random forest with grey model, PLoS One, 6, e24756 (2011)
[30] Liu, X. L.; Lu, J. L.; Hu, X. H., Predicting thermophilic proteins with pseudo amino acid composition: approached from chaos game representation and principal component analysis, Protein Pept. Lett., 18, 1244-1250 (2011)
[31] Lu, J. L.; Hu, X. H.; Hu, D. G., A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences, J. Theor. Biol., 293, 74-81 (2012) · Zbl 1307.92309
[32] Luscombe, N. M.; Austin, S. E.; Berman, H. M.; Thornton, J. M., An overview of the structures of protein-DNA complexes, Genome Biol., 15, 1 (2000), (REVIEWS001)
[33] Mandelbrot, B. B., The Fractal Geometry of Nature (1982), Freeman: Freeman San Francisco · Zbl 0504.28001
[34] Masso, M.; Vaisman, I. I., Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms, J. Theor. Biol., 266, 560-568 (2010) · Zbl 1407.92082
[35] Nanni, L.; Lumini, A., Combing ontologies and dipeptide composition for predicting DNA-binding proteins, Amino Acids, 34, 635-641 (2008)
[36] Nanni, L.; Lumini, A., An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins, Amino Acids, 36, 167-175 (2009)
[37] Nimrod, G.; Szilagyi, A.; Leslie, C.; Ben-Tal, N., Identification of DNA-binding proteins using structural, electrostatic and evolutionary features, J. Mol. Biol., 387, 1040-1053 (2009)
[38] Niu, X. H.; Hu, X. H.; Shi, F.; Xia, J. B., Predicting protein solubility by the general form of Chou’s pseudo amino acid composition: approached from chaos game representation and fractal dimension, Protein Pept. Lett., 19, 940-948 (2012)
[39] Nordhoff, E.; Krogsdam, A. M.; Jorgensen, H. F.; Kallipolitis, B. H.; Clark, B. F.; Roepstorff, P.; Kristiansen, K., Rapid identification of DNA-binding proteins by mass spectrometry, Nat. Biotechnol., 17, 884-888 (1999)
[40] Pellegrini-Calace, M.; Thornton, J. M., Detecting DNA-binding helix-turn-helix structural motifs using sequence and structure information, Nucleic Acids Res., 33, 2129-2140 (2005)
[41] Shanahan, H. P.; Garcia, M. A.; Jones, S.; Thornton, J. M., Identifying DNA-binding proteins using structural motifs and the electrostatic potential, Nucleic Acids Res., 32, 4732-4741 (2004)
[42] Shao, X.; Tian, Y.; Wu, L.; Wang, Y.; Jing, L.; Deng, N., Predicting DNA- and RNA-binding proteins from sequences with kernel methods, J. Theor. Biol., 258, 289-293 (2009) · Zbl 1402.92332
[43] Soddell, J.; Seviour, R., A comparison of methods for determining the fractal dimensions of colonies of filamentous bacteria, Binary, 6, 21-31 (1994)
[44] Sonnhammer, E. L.; Eddy, S. R.; Durbin, R., Pfam: a comprehensive database of protein domain families based on seed alignments, Proteins, 28, 405-420 (1997)
[45] Spasic, S.; Kalauzi, A.; Grbic, G.; Martac, L.; Culic, M., Fractal analysis of rat brain activity after injury, Med. Biol. Eng. Comput., 43, 345-348 (2005)
[46] Stawiski, E. W.; Gregoret, L. M.; Mandel-Gutfreund, Y., Annotating nucleic acid-binding function based on protein structure, J. Mol. Biol., 326, 4, 1065-1079 (2003)
[47] Vapnik, V., Statistical Learning Theory (1998), Wiley Interscience: Wiley Interscience New York · Zbl 0935.62007
[48] Wang, L. J.; Brown, S. J., BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences, Nucleic Acids Res., 34, W243-W248 (2006)
[49] Xia, J. B.; Zhang, S. L.; Shi, F.; Xiong, H. J.; Hu, X. H.; Niu, X. H., Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: an approach from chaos games representation, J. Theor. Biol., 284, 16-23 (2011)
[50] Xiao, X.; Wang, P.; Chou, K. C., GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions, Mol. Biosyst., 7, 911-919 (2011)
[51] Xu, Y.; Ding, J.; Wu, L. Y.; Chou, K. C., iSNO-PseAAC: predict cysteine \(S\)-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One, 8, e55844 (2013)
[52] Yang, J. Y.; Peng, Z. L.; Yu, Z. G.; Zhang, R. J.; Anh, V.; Wang, D., Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol., 257, 618-626 (2009) · Zbl 1400.92417
[53] Yu, X.; Cao, J.; Cai, Y.; Shi, T.; Li, Y., Predicting rRNA-, RNA-, and DNA binding proteins from primary structure with support vector machines, J. Theor. Biol., 240, 175-184 (2006) · Zbl 1447.92318
[54] Yu, Z. G.; Anh, V.; Lau, K. S., Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses, J. Theor. Biol., 226, 341-348 (2004) · Zbl 1439.92148
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.