Prediction of protein subcellular localization with oversampling approach and Chou’s general PseAAC. (English) Zbl 1394.92047

Summary: Predicting protein subcellular location with support vector machines has recently become a popular research area because of the explosive growth of biological information. Although substantial progress has been made, few researchers have considered the problem of data imbalance before classification, which leads to low accuracy for some categories. In this work, we therefore combine an oversampling method with SVM to handle protein subcellular localization on unbalanced data sets.
To capture valuable information about a protein, a PseAAC (pseudo amino acid composition) is extracted from the PSSM (position-specific scoring matrix) as a feature vector, and features are then selected by principal component analysis (PCA). Next, the samples, after being oversampled to eliminate the imbalance in sample numbers among classes, are fed into a support vector machine to predict the protein subcellular location. To evaluate the performance of the proposed method, jackknife tests are performed on three benchmark datasets (ZD98, CL317 and ZW225).
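The oversampling and jackknife steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the summary does not say which oversampling variant was used, so plain random duplication is assumed here; a nearest-centroid rule stands in for the SVM; the PseAAC/PSSM extraction and PCA steps are omitted; and oversampling is applied only to the training portion of each jackknife round, which is also an assumption. The class labels and feature values are hypothetical toy data.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, rng):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier (NOT an SVM): pick the class whose mean vector is closest to x."""
    sums, counts = {}, Counter(y_train)
    for features, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)


def jackknife_predictions(X, y, seed=0):
    """Leave each sample out once, balance the remaining samples, then predict the held-out one."""
    rng = random.Random(seed)
    predictions = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_bal, y_bal = random_oversample(X_train, y_train, rng)
        predictions.append(nearest_centroid_predict(X_bal, y_bal, X[i]))
    return predictions


# Hypothetical toy data: six samples of one class, two of another.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [0.1, 0.1], [0.2, 0.2],
     [5.0, 5.1], [5.2, 4.9]]
y = ["cyto"] * 6 + ["mito"] * 2

balanced_X, balanced_y = random_oversample(X, y, random.Random(0))
print(Counter(balanced_y))
print(jackknife_predictions(X, y))
```

Balancing inside each round keeps copies of the held-out sample out of its own training set; balancing the whole data set first would let duplicates leak into training and could inflate the jackknife accuracy.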
Results of SVM experiments with and without oversampling, obtained by jackknife tests, show that the oversampling methods successfully reduce the imbalance of the data sets, and the prediction accuracy of each class in each dataset is higher than 88.9%. Compared with other protein subcellular localization methods, the method in this work achieves the best performance. The overall accuracies on ZD98, CL317 and ZW225 are 93.2%, 96.00% and 92.15%, respectively, which are 2.4%, 8.0% and 8.2% higher than the best methods in the comparison. The high overall accuracy obtained by the proposed method indicates that the feature representation and selection capture useful information from the protein sequence and that the oversampling methods successfully address the imbalance of sample numbers in SVM classification.
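The distinction between per-class and overall accuracy drawn above matters most on imbalanced data, where a high overall figure can hide a class that is never predicted correctly. The following sketch shows the two measures side by side; the labels and predictions are hypothetical and unrelated to the ZD98, CL317 or ZW225 results.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in total.items()}


def overall_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples over the whole data set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced case: a classifier that always predicts the majority class.
y_true = ["A"] * 9 + ["B"] * 1
y_pred = ["A"] * 10

print(per_class_accuracy(y_true, y_pred))  # {'A': 1.0, 'B': 0.0}
print(overall_accuracy(y_true, y_pred))    # 0.9
```

Here the overall accuracy is 0.9 although class B is never recognized, which is why a per-class lower bound is reported alongside the overall accuracy.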

MSC:

92C40 Biochemistry, molecular biology
68T05 Learning and adaptive systems in artificial intelligence
