Data Cleansing and Preparation for Moving Toward Electronic Library Repository

  • Conference paper
Digital Libraries: Implementing Strategies and Sharing Experiences (ICADL 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3815)

Abstract

Manually annotated metadata usually contains typing errors; however, correcting that metadata by hand can be costly and time-consuming. This paper proposes a framework to ease the metadata correction process: a system that uses OCR and NLP techniques to automatically extract metadata from document images. The system first converts the images into text using OCR and then extracts metadata from the OCR results. The extracted metadata is then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them with the help of supporting information. Although a human decision is still required to correct each error, this manual step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
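The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the field names, the sample records, and the use of Python's standard `difflib` for string similarity are all assumptions made for illustration. The sketch takes metadata already extracted from OCR text and flags the repository fields that disagree with it.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_error_fields(extracted: dict, stored: dict, threshold: float = 1.0) -> dict:
    """Return the fields whose stored value disagrees with the extracted one.

    A field is flagged when the similarity ratio of the two normalized
    strings falls below `threshold` (1.0 means they must match exactly).
    Both the function and the threshold are hypothetical, not from the paper.
    """
    flagged = {}
    for field, extracted_value in extracted.items():
        stored_value = stored.get(field, "")
        ratio = SequenceMatcher(
            None, normalize(extracted_value), normalize(stored_value)
        ).ratio()
        if ratio < threshold:
            flagged[field] = {
                "stored": stored_value,
                "extracted": extracted_value,
                "similarity": round(ratio, 2),
            }
    return flagged


# Hypothetical records for illustration only.
extracted = {"title": "Data Cleansing for Library Repositories", "year": "2005"}
stored = {"title": "Data Cleansing for Libary Repositories", "year": "2005"}
errors = find_error_fields(extracted, stored)
```

Here only the `title` field is flagged, because the stored value contains a typing error. In keeping with the abstract, a flagged entry would be shown to a user together with both values rather than overwritten automatically, since the OCR output may itself be the incorrect side.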

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kawtrakul, A. (2005). Data Cleansing and Preparation for Moving Toward Electronic Library Repository. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_69

  • DOI: https://doi.org/10.1007/11599517_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30850-8

  • Online ISBN: 978-3-540-32291-7

  • eBook Packages: Computer Science, Computer Science (R0)
