Abstract
Scientists working in proteomics are seeking integrated, customised and validated research solutions to expedite their work in proteomic analysis and drug discovery. Some drugs and most of their cellular targets are proteins, because proteins dictate biological phenotype. In this context, the automated analysis of protein localisation is more complex than the automated analysis of DNA sequences; nevertheless, the benefits to be derived are of equal or greater importance. To realise these benefits, choosing appropriate methods is crucial, especially when the data set is severely imbalanced. In this paper we investigate the performance of some commonly used classifiers, such as K nearest neighbours and feed-forward neural networks, with and without cross-validation, on a class of imbalanced problems from the bioinformatics domain. Furthermore, we construct ensemble-based schemes using the notion of diversity, and we empirically test their performance on the same problems. The experimental results favour neural network ensembles, which generalise well and offer a significant improvement over the single-classifier methods.
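To make the idea of combining diverse classifiers on imbalanced data concrete, the following is a minimal sketch and not the authors' method. It swaps in K-nearest-neighbour members that differ only in k in place of the neural network ensembles studied in the paper, and combines them by plurality vote. The toy data, the two-feature layout, the use of the labels "cp" and "om", and all function names are assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical, imbalanced toy data: six majority-class points ("cp")
# and two minority-class points ("om"). Not taken from the paper.
TRAIN = [
    ((0.10, 0.20), "cp"), ((0.20, 0.10), "cp"), ((0.15, 0.25), "cp"),
    ((0.30, 0.20), "cp"), ((0.25, 0.30), "cp"), ((0.20, 0.30), "cp"),
    ((0.90, 0.80), "om"), ((0.80, 0.90), "om"),
]

def knn_predict(train, query, k):
    """Classify `query` by plurality vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def ensemble_predict(train, query, ks=(1, 3, 5)):
    """Combine one kNN member per value of k by plurality vote.

    Using different k values is a simple stand-in for ensemble diversity.
    """
    member_votes = [knn_predict(train, query, k) for k in ks]
    return Counter(member_votes).most_common(1)[0][0]

if __name__ == "__main__":
    minority_query = (0.85, 0.85)
    # With k=5 the two minority neighbours are outvoted by three majority points.
    print(knn_predict(TRAIN, minority_query, 5))   # cp
    # The members with k=1 and k=3 vote "om", so the ensemble recovers the minority class.
    print(ensemble_predict(TRAIN, minority_query))  # om
```

In this toy case a single member with a large neighbourhood is swamped by the majority class, while the vote across members with different k still recovers the minority label. This is only meant to illustrate why member diversity can matter when classes are imbalanced.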
Acknowledgements
We would like to thank Dr Maria Roubelakis of Oxford University for assistance in biological aspects of this work.
About this article
Cite this article
Anastasiadis, A.D., Magoulas, G.D. Analysing the localisation sites of proteins through neural networks ensembles. Neural Comput & Applic 15, 277–288 (2006). https://doi.org/10.1007/s00521-006-0029-y
DOI: https://doi.org/10.1007/s00521-006-0029-y