RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest. (English) Zbl 1397.92530

Summary: Producing high-quality diffracting crystals is a critical step in determining the 3D structure of a protein by X-ray crystallography, yet only 2%–10% of crystallization projects result in high-resolution protein structures. Several computational methods for predicting protein crystallizability have been developed previously. In this work, we introduce RFCRYS, a random forest based method for predicting the crystallizability of proteins. RFCRYS uses features derived from the primary sequence (mono-, di-, and tri-peptide amino acid compositions, frequencies of amino acids in different physicochemical groups, isoelectric point, molecular weight, and sequence length) to predict crystallizability, using two different databases. RFCRYS was compared with previous methods, and the results show that the proposed method with this feature set outperforms existing predictors in accuracy, MCC, and specificity. In particular, the method achieves a high specificity of 0.95, which means that RFCRYS rarely mispredicts a non-crystallizable protein chain as crystallizable and can therefore help save time and resources. In conclusion, RFCRYS provides accurate crystallizability prediction for a protein chain and can be applied to support crystallization projects in achieving a higher success rate in obtaining diffraction-quality crystals.
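
The sequence-derived features and evaluation measures named in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`kmer_composition`, `sequence_features`, `specificity`, `mcc`) are hypothetical, the 20-letter amino acid alphabet is assumed, and the physicochemical-group frequencies, isoelectric point, and molecular weight are left out because the summary does not specify how they are computed.

```python
from itertools import product
from math import sqrt

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Relative frequency of every possible peptide of length k.

    k=1, 2, 3 give mono-, di-, and tri-peptide compositions
    (20, 400, and 8000 values respectively).
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of length k; skip windows with non-standard residues.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def sequence_features(sequence):
    """Concatenate mono-, di-, tri-peptide compositions and sequence length."""
    seq = sequence.upper()
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_composition(seq, k))
    features.append(float(len(seq)))
    return features


def specificity(tn, fp):
    """TN / (TN + FP): share of non-crystallizable chains predicted as such."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A feature vector of this kind could then be passed to any random forest implementation; the choice of library and its hyperparameters are not stated in the summary. The specificity definition shows why a value of 0.95 corresponds to few false "crystallizable" predictions: only 5% of non-crystallizable chains would be misclassified.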

MSC:

92D20 Protein sequences, DNA sequences
62P10 Applications of statistics to biology and medical sciences; meta analysis
68T05 Learning and adaptive systems in artificial intelligence