Data Cleansing and Preparation for Moving Toward Electronic Library Repository

Asanee Kawtrakul

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3815)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Manually annotated metadata usually contain typing errors, but correcting them by hand is costly and time-consuming. This paper proposes a framework that eases the metadata correction process with a system that uses OCR and NLP techniques to extract metadata automatically from document images. The system first converts the images into text using OCR and then extracts metadata from the OCR results. The extracted metadata are then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them using supporting information. Although a human decision is still required to correct each error, this manual step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
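The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the record layout, the field names, the sample values, and the function `find_error_entries` are all hypothetical, and the OCR and extraction stages are assumed to have already produced the `extracted` dictionary. A field is flagged when the stored and extracted values differ after simple normalization, and a similarity score from Python's standard `difflib` module is attached as supporting information for the reviewer.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_error_entries(repository: dict, extracted: dict) -> list:
    """Compare stored metadata with OCR-extracted metadata.

    Returns one tuple (record_id, field, stored, extracted, score) for every
    field whose normalized values differ. The score is a 0..1 similarity
    ratio meant only to help a human reviewer judge the mismatch.
    """
    flagged = []
    for record_id, stored_fields in repository.items():
        ocr_fields = extracted.get(record_id, {})
        for field, stored_value in stored_fields.items():
            if field not in ocr_fields:
                continue  # nothing was extracted for this field; cannot compare
            a, b = normalize(stored_value), normalize(ocr_fields[field])
            if a != b:
                score = SequenceMatcher(None, a, b).ratio()
                flagged.append((record_id, field, stored_value, ocr_fields[field], score))
    return flagged


# Made-up sample data: T001 has a mistyped title in the repository.
repository = {
    "T001": {"title": "Rice Yield Forcasting with Neural Networks", "year": "2004"},
    "T002": {"title": "Printed Character Recognition", "year": "2003"},
}
extracted = {
    "T001": {"title": "Rice Yield Forecasting with Neural Networks", "year": "2004"},
    "T002": {"title": "Printed  character recognition", "year": "2003"},
}

flagged = find_error_entries(repository, extracted)
for record_id, field, stored, ocr_value, score in flagged:
    print(f"{record_id}.{field}: stored={stored!r} extracted={ocr_value!r} similarity={score:.3f}")
```

Only the flagged entries would be shown to a reviewer, which mirrors the abstract's point that manual effort is limited to entries where the repository and the extracted metadata disagree. A mismatch can come from either side (a typing error in the repository or an OCR error), so the sketch leaves the final decision to the user.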

Author information

Authors and Affiliations

Department of Computer Engineering, Kasetsart University, Bangkok, 10900, Thailand
Asanee Kawtrakul

Editor information

Editors and Affiliations

Virginia Tech, 24061, Blacksburg, VA
Edward A. Fox
University of Vienna, Vienna, Austria
Erich J. Neuhold
Department of Library Science, Chulalongkorn University, 10330, Bangkok, Thailand
Pimrumpai Premsmit
School of Engineering and Technology, Asian Institute of Technology, P.O. Box 4, 12120, Klong Luang, Pathum Thani, Thailand
Vilas Wuwongse

Cite this paper

Kawtrakul, A. (2005). Data Cleansing and Preparation for Moving Toward Electronic Library Repository. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_69

DOI: https://doi.org/10.1007/11599517_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)