research-article

Improving optical character recognition through efficient multiple system alignment

Authors: William B. Lund, Eric K. Ringger

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

Pages 231 - 240

https://doi.org/10.1145/1555400.1555437

Published: 15 June 2009

Abstract

Individual optical character recognition (OCR) engines vary in the types of errors they commit in recognizing text, particularly poor-quality text. Aligning the output of multiple OCR engines and taking advantage of the differences between them yields an aligned lattice of recognized words whose error rate is significantly lower than the word error rates of the individual engines. This lattice error rate constitutes a lower bound among aligned alternatives from the OCR output. Results from a collection of poor-quality mid-twentieth-century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average, 0.0079% of the state space is explored to identify all optimal alignments of the documents.
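
The relationship between the lattice error rate and a realized, dictionary-based selection can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the alignment has already been done, so each column holds one word hypothesis per engine, and it counts only substitution errors against a reference with one word per column. The sample words, the dictionary, the "first dictionary hit, else engine 0" rule, and all function names are hypothetical.

```python
from typing import List, Sequence, Set

# Hypothetical toy data: one reference word per aligned column,
# and one hypothesis per OCR engine in each column.
REFERENCE: List[str] = ["the", "enemy", "patrols", "were", "active", "at", "night"]
COLUMNS: List[Sequence[str]] = [
    ("tbe", "tlie", "thc"),
    ("enemy", "enerny", "enemy"),
    ("patro1s", "patrols", "patrols"),
    ("were", "were", "vvere"),
    ("actlve", "active", "actlve"),
    ("at", "at", "at"),
    ("might", "nlght", "night"),
]
DICTIONARY: Set[str] = {"the", "enemy", "patrols", "were", "active", "at", "night", "might"}


def engine_errors(columns, reference, engine: int) -> int:
    """Substitution errors made by a single engine."""
    return sum(1 for col, ref in zip(columns, reference) if col[engine] != ref)


def lattice_errors(columns, reference) -> int:
    """Columns where no engine produced the reference word.

    No selection rule that picks one hypothesis per column can do better,
    which is why this count acts as a lower bound.
    """
    return sum(1 for col, ref in zip(columns, reference) if ref not in col)


def select_with_dictionary(columns, dictionary) -> List[str]:
    """Assumed rule: take the first in-dictionary hypothesis, else engine 0's word."""
    chosen = []
    for col in columns:
        hits = [w for w in col if w.lower() in dictionary]
        chosen.append(hits[0] if hits else col[0])
    return chosen


def selection_errors(selected, reference) -> int:
    """Substitution errors in the selected word sequence."""
    return sum(1 for word, ref in zip(selected, reference) if word != ref)


if __name__ == "__main__":
    selected = select_with_dictionary(COLUMNS, DICTIONARY)
    print("per-engine errors:", [engine_errors(COLUMNS, REFERENCE, e) for e in range(3)])
    print("lattice errors:   ", lattice_errors(COLUMNS, REFERENCE))
    print("selection errors: ", selection_errors(selected, REFERENCE))
```

On this toy data the lattice misses 1 of 7 words, the dictionary selection misses 2, and the best single engine misses 3. That ordering mirrors the abstract's pattern of a larger reduction for the lattice than for the realized selection. A full word error rate would also count insertions and deletions, which this sketch leaves out.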

Published In

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

June 2009

502 pages

ISBN: 9781605583228

DOI: 10.1145/1555400

General Chairs: Fred Heath (University of Texas Libraries, USA), Mary Lynn Rice-Lively (University of Texas at Austin, USA)

Program Chair: Richard Furuta (Texas A&M University, USA)

Copyright © 2009 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

JCDL '09: Joint Conference on Digital Libraries

June 15 - 19, 2009

Austin, TX, USA
