Improving optical character recognition through efficient multiple system alignment

Published: 15 June 2009

Abstract

Individual optical character recognition (OCR) engines vary in the types of errors they commit in recognizing text, particularly poor quality text. By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the individual OCR word error rates. This lattice error rate constitutes a lower bound among aligned alternatives from the OCR output. Results from a collection of poor quality mid-twentieth century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average 0.0079% of the state space is explored to identify all optimal alignments of the documents.
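The pipeline outlined in the abstract (align the engine outputs, build a lattice of alternatives, then select with a dictionary) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's method: it aligns only two engines, uses unit edit costs, returns one optimal alignment rather than all of them, and uses a simple length-difference bound as the admissible heuristic instead of the heuristic the paper develops. The names `astar_align`, `lattice_stats`, and the sample word lists are hypothetical, and the lattice columns are assumed to line up one-to-one with the reference words.

```python
import heapq

GAP = "-"  # placeholder meaning "this engine produced no word here"


def astar_align(a, b):
    """Align token lists a and b with unit edit costs using A*.

    Returns (cost, columns, expanded); columns is one optimal alignment.
    """
    n, m = len(a), len(b)

    def h(i, j):
        # Admissible bound (illustrative): the unread suffixes differ in
        # length by this many tokens, so at least that many gaps remain.
        return abs((n - i) - (m - j))

    start, goal = (0, 0), (n, m)
    best_g = {start: 0}
    parent = {}
    frontier = [(h(0, 0), 0, start)]  # entries are (f, g, state)
    expanded = 0
    while frontier:
        _, g, (i, j) = heapq.heappop(frontier)
        if g > best_g[(i, j)]:
            continue  # stale entry; a cheaper path was found later
        expanded += 1
        if (i, j) == goal:
            break
        moves = []
        if i < n and j < m:  # match or substitution
            moves.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1, (a[i], b[j])))
        if i < n:  # word only in engine A
            moves.append(((i + 1, j), 1, (a[i], GAP)))
        if j < m:  # word only in engine B
            moves.append(((i, j + 1), 1, (GAP, b[j])))
        for nxt, step, col in moves:
            ng = g + step
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                parent[nxt] = ((i, j), col)
                heapq.heappush(frontier, (ng + h(*nxt), ng, nxt))

    # Walk parent links back from the goal to recover the aligned columns.
    columns, state = [], goal
    while state != start:
        state, col = parent[state]
        columns.append(col)
    columns.reverse()
    return best_g[goal], columns, expanded


def lattice_stats(columns, reference, dictionary):
    """Count lattice errors (no alternative is correct) and errors after a
    naive dictionary pick (first in-dictionary alternative, else the first)."""
    lattice_errors = sum(ref not in col for col, ref in zip(columns, reference))
    picked = []
    for col in columns:
        in_dict = [w for w in col if w in dictionary]
        picked.append(in_dict[0] if in_dict else col[0])
    selected_errors = sum(p != r for p, r in zip(picked, reference))
    return lattice_errors, selected_errors, picked


# Made-up sample data: two noisy readings of the same seven-word line.
reference = ["enemy", "patrols", "were", "active", "near", "the", "river"]
engine_a = ["enemy", "patro1s", "were", "active", "near", "tile", "rlver"]
engine_b = ["enerny", "patrols", "were", "actlve", "near", "the", "rivcr"]
dictionary = set(reference) | {"tile"}

cost, columns, expanded = astar_align(engine_a, engine_b)
lattice_errors, selected_errors, picked = lattice_stats(columns, reference, dictionary)
total_states = (len(engine_a) + 1) * (len(engine_b) + 1)
print(f"alignment cost {cost}, expanded {expanded} of {total_states} states")
print(f"lattice errors {lattice_errors}, errors after dictionary pick {selected_errors}")
```

In this sample each engine alone misreads three of the seven words. The lattice misses only one word ("river", which neither engine recognized), while the naive dictionary pick still makes two errors because "tile" is a valid dictionary word. That gap is the difference the abstract draws between the lattice error rate as a lower bound and the realized word error rate after selection.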

Published In

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
502 pages
ISBN:9781605583228
DOI:10.1145/1555400
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. A* algorithm
  2. OCR error rate reduction
  3. text alignment

Qualifiers

  • Research-article

Conference

JCDL '09: Joint Conference on Digital Libraries
June 15 - 19, 2009
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%

Cited By

  • (2024) ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2038-2048. DOI: 10.1145/3626772.3657891. Online publication date: 10-Jul-2024
  • (2023) Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods. Document Analysis and Recognition - ICDAR 2023, pages 88-103. DOI: 10.1007/978-3-031-41734-4_6. Online publication date: 19-Aug-2023
  • (2019) Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), pages 116-121. DOI: 10.1109/ICDARW.2019.40091. Online publication date: Sep-2019
  • (2018) Enhanced Ensemble Technique for Optical Character Recognition. New Trends in Information and Communications Technology Applications, pages 213-225. DOI: 10.1007/978-3-030-01653-1_13. Online publication date: 26-Sep-2018
  • (2016) Recognizing text in historical maps using maps from multiple time periods. 2016 23rd International Conference on Pattern Recognition (ICPR), pages 3993-3998. DOI: 10.1109/ICPR.2016.7900258. Online publication date: Dec-2016
  • (2015) Using multiple sequence alignment and statistical language model to integrate multiple Chinese address recognition outputs. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 151-155. DOI: 10.1109/ICDAR.2015.7333742. Online publication date: 23-Aug-2015
  • (2014) How well does multiple OCR error correction generalize? Document Recognition and Retrieval XXI, article 90210A. DOI: 10.1117/12.2042502. Online publication date: 3-Feb-2014
  • (2013) On handling textual errors in latent document modeling. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2089-2098. DOI: 10.1145/2505515.2505555. Online publication date: 27-Oct-2013
  • (2013) Why multiple document image binarizations improve OCR. Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, pages 86-93. DOI: 10.1145/2501115.2501126. Online publication date: 24-Aug-2013
  • (2011) Progressive Alignment and Discriminative Error Correction for Multiple OCR Engines. Proceedings of the 2011 International Conference on Document Analysis and Recognition, pages 764-768. DOI: 10.1109/ICDAR.2011.303. Online publication date: 18-Sep-2011
