Abstract
User-friendly data mining tools have made it faster to generate predictive models with machine learning and model ensemble techniques. This progress raises model management issues: once developed, many classifiers, for example, become accessible in collections of models. Choosing a relevant model from such a collection can reduce the cost of generating new predictive models. Calculating the similarity of predictive models is the key to ranking them, which may improve model selection or combination. To this end, we introduce a methodology that measures the similarity of classifiers by comparing their datasets, transfer functions and confusion matrices. We propose the Dataset Similarity Coefficient to calculate the similarity of datasets, and the Similarity of Models measure to calculate the similarity between predictive models. In this paper we focus on toxicology applications of binary classification models. The results show that our methodology performs well in measuring the similarity of models drawn from a collection of classifiers.
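The abstract does not define the Dataset Similarity Coefficient or the Similarity of Models measure, so the sketch below does not reproduce them. It only illustrates the general idea of comparing two binary classifiers through their datasets and confusion matrices, under stated assumptions: datasets are compared with the Jaccard coefficient over instance identifiers, and confusion matrices are compared as one minus half the L1 distance between their normalised cells. The function names `confusion_matrix`, `confusion_similarity` and `jaccard_similarity` are hypothetical and chosen for illustration.

```python
from typing import Hashable, Iterable, Sequence, Tuple

# Cell order used throughout this sketch: (TP, FP, FN, TN).
Confusion = Tuple[int, int, int, int]


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Confusion:
    """Count the four cells of a binary confusion matrix (labels are 0 or 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def confusion_similarity(a: Confusion, b: Confusion) -> float:
    """Assumed measure: 1 - half the L1 distance between normalised cells.

    Each matrix is divided by its total so that classifiers evaluated on
    test sets of different sizes remain comparable. The result lies in [0, 1].
    """
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("confusion matrices must not be empty")
    l1 = sum(abs(x / total_a - y / total_b) for x, y in zip(a, b))
    return 1.0 - l1 / 2.0


def jaccard_similarity(ids_a: Iterable[Hashable], ids_b: Iterable[Hashable]) -> float:
    """Assumed dataset measure: shared instance identifiers over all identifiers."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty datasets are treated as identical
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0]
    cm_a = confusion_matrix(y_true, [1, 0, 0, 0])
    cm_b = confusion_matrix(y_true, [1, 1, 1, 0])
    print(cm_a, cm_b, confusion_similarity(cm_a, cm_b))
    print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))
```

Ranking the models in a collection against a query model would then amount to sorting by some combination of these two scores; how the paper actually combines them is not stated in the abstract.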
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Makhtar, M., Neagu, D.C., Ridley, M.J. (2011). Binary Classification Models Comparison: On the Similarity of Datasets and Confusion Matrix for Predictive Toxicology Applications. In: Böhm, C., Khuri, S., Lhotská, L., Pisanti, N. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2011. Lecture Notes in Computer Science, vol 6865. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23208-4_11
DOI: https://doi.org/10.1007/978-3-642-23208-4_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23207-7
Online ISBN: 978-3-642-23208-4
eBook Packages: Computer Science, Computer Science (R0)