Abstract
Manually annotated metadata often contain typing errors, but correcting them by hand is costly and time consuming. This paper proposes a framework that eases the metadata correction process with a system that uses OCR and NLP techniques to automatically extract metadata from document images. The system first converts the images into text using OCR and then extracts metadata from the OCR results. The extracted metadata are then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them with the help of supporting information. Although a human decision is still required to correct each error, this step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
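The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the record layout (a record id mapped to field/value pairs), and the use of Python's standard `difflib` similarity score as "supporting information" are all assumptions made for illustration. OCR and metadata extraction are assumed to have already produced the `extracted` dictionary.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_error_entries(extracted, repository):
    """Flag repository fields that disagree with the extracted metadata.

    Both arguments are hypothetical structures: record id -> {field: value}.
    Returns tuples of (record_id, field, repository_value, extracted_value,
    similarity), where similarity is a ratio in [0, 1] that a reviewer could
    use as a hint when deciding which value is correct.
    """
    suspects = []
    for record_id, fields in extracted.items():
        stored = repository.get(record_id, {})
        for field, extracted_value in fields.items():
            stored_value = stored.get(field, "")
            a, b = normalize(stored_value), normalize(extracted_value)
            if a != b:
                score = SequenceMatcher(None, a, b).ratio()
                suspects.append(
                    (record_id, field, stored_value, extracted_value, score)
                )
    return suspects


# Illustrative data only: one record whose stored title contains typos.
extracted = {"t001": {"title": "Thai Character Recognition", "year": "2000"}}
repository = {"t001": {"title": "Thai Charcter Recogniton", "year": "2000"}}

for entry in find_error_entries(extracted, repository):
    print(entry)
```

Only the mismatching `title` field is reported here, so a reviewer would see one entry for this record rather than every field, which mirrors the abstract's point that human effort is spent only on the erroneous entries.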
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kawtrakul, A. (2005). Data Cleansing and Preparation for Moving Toward Electronic Library Repository. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_69
DOI: https://doi.org/10.1007/11599517_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)