Abstract
Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and the practical needs of automating B2B document communication. We review the landscape of Document IE problems, datasets, and benchmarks, highlight the practical aspects missing from the common definitions, and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. Relevant datasets and benchmarks for Document IE on semi-structured business documents are scarce, as their content is typically legally protected or sensitive. We discuss potential sources of available documents, including synthetic data.
Notes
- 1.
- 2. For some only the annotations are available, without the original PDFs/images.
- 3.
- 4. A large proportion of the UCSF Industry Documents Library are old documents, often written on a typewriter, which presents a domain shift w.r.t. today’s documents.
- 5. Automated crawling of the site is not allowed: https://www.sec.gov/os/accessing-edgar-data.
- 6. CC-MAIN-2022-05 contains almost 3 billion documents, out of which 0.84% are PDFs [89]; however, most of them are not semi-structured business documents.
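As a back-of-the-envelope check, the PDF share quoted in note 6 translates to tens of millions of candidate documents. The sketch below assumes a round total of 3 billion documents (note 6 says "almost 3 billion", so the result is only an order-of-magnitude estimate):

```python
# Rough estimate of PDFs in the CC-MAIN-2022-05 Common Crawl archive,
# using the figures quoted in note 6. The exact document count is an
# assumption; note 6 only says "almost 3 billion".
total_documents = 3_000_000_000   # approximate archive size (assumed)
pdf_fraction = 0.0084             # 0.84% of documents reported as PDFs [89]

estimated_pdfs = int(total_documents * pdf_fraction)
print(f"Estimated PDFs: {estimated_pdfs:,}")  # roughly 25 million
```

Even if only a small fraction of these PDFs were semi-structured business documents, Common Crawl would remain a sizeable raw source; the hard part is filtering them out.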
References
Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.: A realistic dataset for performance evaluation of document layout analysis. In: Proceedings of ICDAR, pp. 296–300. IEEE (2009)
Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: Proceedings of the IEEE/CVF CVPR, pp. 9365–9374 (2019)
Baviskar, D., Ahirrao, S., Kotecha, K.: Multi-layout invoice document dataset (MIDD): a dataset for named entity recognition. Data (2021). https://doi.org/10.3390/data6070078
Bensch, O., Popa, M., Spille, C.: Key information extraction from documents: evaluation and generator. In: Abbès, S.B., et al. (eds.) Proceedings of DeepOntoNLP and X-SENTIMENT. CEUR Workshop Proceedings, vol. 2918, pp. 47–53. CEUR-WS.org (2021)
Berge, J.: The EDIFACT Standards. Blackwell Publishers, Inc. (1994)
Borchmann, Ł., et al.: DUE: End-to-end document understanding benchmark. In: Proceedings of NeurIPS (2021)
Bosak, J., McGrath, T., Holman, G.K.: Universal business language v2.0. Organization for the Advancement of Structured Information Standards (OASIS), Standard (2006)
Cesarini, F., Francesconi, E., Gori, M., Soda, G.: Analysis and understanding of multi-class invoices. Doc. Anal. Recogn. 6(2), 102–114 (2003)
Chaudhry, R., Shekhar, S., Gupta, U., Maneriker, P., Bansal, P., Joshi, A.: LEAF-QA: locate, encode & attend for figure question answering. In: Proceedings of WACV, pp. 3501–3510. IEEE (2020). https://doi.org/10.1109/WACV45572.2020.9093269
Chen, L., et al.: WebSRC: a dataset for web-based structural reading comprehension. CoRR (2021)
Chen, W., Chang, M., Schlinger, E., Wang, W.Y., Cohen, W.W.: Open question answering over tables and text. In: Proceedings of ICLR (2021)
Chen, W., et al.: TabFact: a large-scale dataset for table-based fact verification. In: Proceedings of ICLR (2020)
Chen, W., Zha, H., Chen, Z., Xiong, W., Wang, H., Wang, W.Y.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP. Findings of ACL, vol. EMNLP 2020, pp. 1026–1036. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.91
Cho, M., Amplayo, R.K., Hwang, S., Park, J.: Adversarial TableQA: attention supervision for question answering on tables. In: Zhu, J., Takeuchi, I. (eds.) Proceedings of ACML. Proceedings of Machine Learning Research, vol. 95, pp. 391–406 (2018)
Clausner, C., Antonacopoulos, A., Pletschacher, S.: ICDAR 2019 competition on recognition of documents with complex layouts – RDCL2019. In: Proceedings of ICDAR, pp. 1521–1526. IEEE (2019)
Cristani, M., Bertolaso, A., Scannapieco, S., Tomazzoli, C.: Future paradigms of automated processing of business documents. Int. J. Inf. Manag. 40, 67–75 (2018)
d’Andecy, V.P., Hartmann, E., Rusinol, M.: Field extraction by hybrid incremental and a-priori structural templates. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 251–256. IEEE (2018)
Deng, Y., Rosenberg, D.S., Mann, G.: Challenges in end-to-end neural scientific table recognition. In: Proceedings of ICDAR, pp. 894–901. IEEE (2019). https://doi.org/10.1109/ICDAR.2019.00148
Denk, T.I., Reisswig, C.: BERTgrid: contextualized embedding for 2D document representation and understanding. arXiv preprint arXiv:1909.04948 (2019)
Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic document data capture. In: Proceedings of Artificial Intelligence for Transforming Business and Society (AITB), vol. 1, pp. 1–6. IEEE (2019)
Directive 2014/55/EU of the European parliament and of the council on electronic invoicing in public procurement, April 2014. https://eur-lex.europa.eu/eli/dir/2014/55/oj
Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: Blumenstein, M., Pal, U., Uchida, S. (eds.) Proceedings of IAPR International Workshop on Document Analysis Systems, DAS, pp. 445–449. IEEE (2012). https://doi.org/10.1109/DAS.2012.29
Ford, G., Thoma, G.R.: Ground truth data for document image analysis. In: Symposium on Document Image Understanding and Technology, pp. 199–205. Citeseer (2003)
Gao, L., Yi, X., Jiang, Z., Hao, L., Tang, Z.: ICDAR2017 competition on page object detection. In: Proceedings of ICDAR, pp. 1417–1422 (2017). https://doi.org/10.1109/ICDAR.2017.231
Garncarek, Ł., et al.: LAMBERT: layout-aware language modeling for information extraction. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 532–547. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_34
Göbel, M.C., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: Proceedings of ICDAR, pp. 1449–1453. IEEE Computer Society (2013). https://doi.org/10.1109/ICDAR.2013.292
Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: a dataset for form understanding in noisy scanned documents. In: ICDAR-OST (2019, accepted)
Hamad, K.A., Mehmet, K.: A detailed analysis of optical character recognition technology. Int. J. Appl. Math. Electron. Comput. 1(Special Issue-1), 244–249 (2016)
Hamza, H., Belaïd, Y., Belaïd, A.: Case-based reasoning for invoice analysis and recognition. In: Weber, R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 404–418. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74141-1_28
Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: International Conference on Document Analysis and Recognition (ICDAR) (2015)
He, S., Schomaker, L.: Beyond OCR: multi-faceted understanding of handwritten document characteristics. Pattern Recogn. 63, 321–333 (2017)
Holeček, M.: Learning from similarity and information extraction from structured documents. Int. J. Doc. Anal. Recogn. (IJDAR) 1–17 (2021)
Holeček, M., Hoskovec, A., Baudiš, P., Klinger, P.: Table understanding in structured documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 158–164. IEEE (2019)
Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings of the Australasian Language Technology Association Workshop 2018, pp. 53–59 (2018)
Huang, Z., et al.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: Proceedings of ICDAR, pp. 1516–1520. IEEE (2019). https://doi.org/10.1109/ICDAR.2019.00244
Islam, N., Islam, Z., Noor, N.: A survey on optical character recognition system. arXiv preprint arXiv:1710.05703 (2017)
Jiang, J.: Information extraction from text. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 11–41. Springer, Cham (2012). https://doi.org/10.1007/978-1-4614-3223-4_2
Jobin, K.V., Mondal, A., Jawahar, C.V.: DocFigure: a dataset for scientific document figure classification. In: 13th IAPR International Workshop on Graphics Recognition, GREC@ICDAR, pp. 74–79. IEEE (2019). https://doi.org/10.1109/ICDARW.2019.00018
Kardas, M., et al.: AxCell: automatic extraction of results from machine learning papers. arXiv preprint arXiv:2004.14356 (2020)
Katti, A.R., et al.: Chargrid: towards understanding 2D documents. arXiv preprint arXiv:1809.08799 (2018)
Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices: a graph neural network approach for datasets with high layout variety. In: Ahlemann, F., Schütte, R., Stieglitz, S. (eds.) WI 2021. LNISO, vol. 47, pp. 5–20. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86797-3_1
Kumar, A., et al.: Ask me anything: dynamic memory networks for natural language processing. In: Balcan, M., Weinberger, K.Q. (eds.) Proceedings of ICML, vol. 48, pp. 1378–1387. JMLR.org (2016)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Knight, K., Nenkova, A., Rambow, O. (eds.) Proceedings of NAACL HLT, pp. 260–270 (2016). https://doi.org/10.18653/v1/n16-1030
Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666 (2006)
Li, J., Wang, S., Wang, Y., Tang, Z.: Synthesizing data for text recognition with style transfer. Multimed. Tools Appl. 78(20), 29183–29196 (2019)
Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: table benchmark for image-based table detection and recognition. In: Calzolari, N., et al. (eds.) Proceedings of The 12th Language Resources and Evaluation Conference, LREC, pp. 1918–1925 (2020)
Liu, W., Zhang, Y., Wan, B.: Unstructured document recognition on business invoice. Machine Learning, Stanford iTunes University, Stanford, CA, USA, Technical report (2016)
Majumder, B.P., Potti, N., Tata, S., Wendt, J.B., Zhao, Q., Najork, M.: Representation learning for information extraction from form-like documents. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, pp. 6495–6504 (2020). https://doi.org/10.18653/v1/2020.acl-main.580
Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: InfographicVQA. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706 (2022)
Mathew, M., Karatzas, D., Jawahar, C.V.: DocVQA: a dataset for VQA on document images. In: Proceedings of WACV, pp. 2199–2208. IEEE (2021). https://doi.org/10.1109/WACV48630.2021.00225
McCann, B., Keskar, N.S., Xiong, C., Socher, R.: The natural language decathlon: multitask learning as question answering. CoRR (2018)
Meadows, B., Seaburg, L.: Universal business language 1.0. Organization for the Advancement of Structured Information Standards (OASIS) (2004)
Medvet, E., Bartoli, A., Davanzo, G.: A probabilistic approach to printed document understanding. Int. J. Doc. Anal. Recogn. 14(4), 335–347 (2011). https://doi.org/10.1007/s10032-010-0137-1
Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character recognition (OCR): a comprehensive systematic literature review (SLR). IEEE Access 8, 142642–142668 (2020)
Methani, N., Ganguly, P., Khapra, M.M., Kumar, P.: PlotQA: reasoning over scientific plots. In: Proceedings of WACV, pp. 1516–1525 (2020). https://doi.org/10.1109/WACV45572.2020.9093523
Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticæ Investigationes, pp. 3–26 (2007). https://doi.org/10.1075/li.30.1.03nad
Nassar, A., Livathinos, N., Lysak, M., Staar, P.W.J.: TableFormer: table structure understanding with transformers. CoRR abs/2203.01017 (2022). https://doi.org/10.48550/arXiv.2203.01017
Nayef, N., et al.: ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition – RRC-MLT-2019. In: Proceedings of ICDAR, pp. 1582–1587. IEEE (2019)
Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336. IEEE (2019)
Palm, R.B., Winther, O., Laws, F.: CloudScan - a configuration-free invoice analysis system using recurrent neural networks. In: Proceedings of ICDAR, pp. 406–413. IEEE (2017). https://doi.org/10.1109/ICDAR.2017.74
Park, S., et al.: CORD: a consolidated receipt dataset for post-OCR parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)
Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. In: Proceedings of CVPRw, pp. 2439–2447 (2020). https://doi.org/10.1109/CVPRW50498.2020.00294
Qasim, S.R., Mahmood, H., Shafait, F.: Rethinking table recognition using graph neural networks. In: Proceedings of ICDAR, pp. 142–147. IEEE (2019). https://doi.org/10.1109/ICDAR.2019.00031
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
Rastogi, M., et al.: Information extraction from document images via FCA based template detection and knowledge graph rule induction. In: Proceedings of CVPRw, pp. 2377–2385 (2020). https://doi.org/10.1109/CVPRW50498.2020.00287
Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 122–127. IEEE (2019)
Rusinol, M., Benkhelfallah, T., Poulain d’Andecy, V.: Field extraction from administrative documents by incremental structural templates. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1100–1104. IEEE (2013)
Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: Proceedings of ICDAR, pp. 1162–1167 (2017). https://doi.org/10.1109/ICDAR.2017.192
Schuster, D., et al.: Intellix – end-user trained information extraction for document archiving. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 101–105. IEEE (2013)
Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach towards the benchmarking of table structure recognition systems. In: Doermann, D.S., Govindaraju, V., Lopresti, D.P., Natarajan, P. (eds.) The Ninth IAPR International Workshop on Document Analysis Systems, DAS, pp. 113–120 (2010). https://doi.org/10.1145/1815330.1815345
Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: Chen, J., Gonçalves, M.A., Allen, J.M., Fox, E.A., Kan, M., Petras, V. (eds.) Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL, pp. 223–232 (2018). https://doi.org/10.1145/3197026.3197040
Smith, R.: An overview of the tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633. IEEE (2007)
Stanisławek, T., et al.: Kleister: key information extraction datasets involving long documents with complex layouts. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 564–579. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_36
Stockerl, M., Ringlstetter, C., Schubert, M., Ntoutsi, E., Kriegel, H.P.: Online template matching over a stream of digitized documents. In: Proceedings of the 27th International Conference on Scientific and Statistical Database Management, pp. 1–12 (2015)
Stray, J., Svetlichnaya, S.: DeepForm: extract information from documents (2020). https://wandb.ai/deepform/political-ad-extraction, benchmark
Sun, H., Kuang, Z., Yue, X., Lin, C., Zhang, W.: Spatial dual-modality graph reasoning for key information extraction. arXiv preprint arXiv:2103.14470 (2021)
Sunder, V., Srinivasan, A., Vig, L., Shroff, G., Rahul, R.: One-shot information extraction from document images using neuro-deductive program synthesis. arXiv preprint arXiv:1906.02427 (2019)
Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 114–121. IEEE (2019)
Wang, J., et al.: Towards robust visual information extraction in real world: new dataset and novel solution. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Web: Annual reports. https://www.annualreports.com/. Accessed 28 Apr 2022
Web: Charity Commission for England and Wales. https://apps.charitycommission.gov.uk/showcharity/registerofcharities/RegisterHomePage.aspx. Accessed 22 Apr 2022
Web: EDGAR. https://www.sec.gov/edgar.shtml. Accessed 22 Apr 2022
Web: Industry Documents Library. https://www.industrydocuments.ucsf.edu/. Accessed 22 Apr 2022
Web: NIST Special Database 2. https://www.nist.gov/srd/nist-special-database-2. Accessed 25 Apr 2022
Web: Open Government Data (OGD) Platform India. https://visualize.data.gov.in/. Accessed 22 Apr 2022
Web: Public Inspection Files. https://publicfiles.fcc.gov/. Accessed 22 Apr 2022
Web: SciTSR. https://github.com/Academic-Hammer/SciTSR. Accessed 26 Apr 2022
Web: S&P 500 Companies with Financial Information. https://www.spglobal.com/spdji/en/indices/equity/sp-500/#data. Accessed 25 Apr 2022
Web: Statistics of Common Crawl Monthly Archives – MIME Types. https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes. Accessed 22 Apr 2022
Web: TableBank. https://github.com/doc-analysis/TableBank. Accessed 26 Apr 2022
Web: World Bank Open Data. https://data.worldbank.org/. Accessed 22 Apr 2022
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: Gupta, R., Liu, Y., Tang, J., Prakash, B.A. (eds.) Proceedings on KDD, pp. 1192–1200 (2020). https://doi.org/10.1145/3394486.3403172
Xu, Y., et al.: LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding. CoRR (2021)
Yi, J., Sundaresan, N.: A classifier for semi-structured documents. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 340–344 (2000)
Yu, D., et al.: Towards accurate scene text recognition with semantic reasoning networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12113–12122 (2020)
Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: processing key information extraction from documents using improved graph learning-convolutional networks. In: Proceedings of ICPR, pp. 4363–4370. IEEE (2020). https://doi.org/10.1109/ICPR48806.2021.9412927
Zhao, X., Wu, Z., Wang, X.: CUTIE: learning to understand documents with convolutional universal text information extractor. CoRR abs/1903.12363 (2019). http://arxiv.org/abs/1903.12363
Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. In: Proceedings of WACV, pp. 697–706. IEEE (2021). https://doi.org/10.1109/WACV48630.2021.00074
Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 564–580. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1_34
Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: Proceedings of ICDAR, pp. 1015–1022. IEEE, September 2019. https://doi.org/10.1109/ICDAR.2019.00166
Zhu, F., et al.: TAT-QA: a question answering benchmark on a hybrid of tabular and textual content in finance. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.) Proceedings International Joint Conference on Natural Language Processing, pp. 3277–3287 (2021). https://doi.org/10.18653/v1/2021.acl-long.254
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Skalický, M., Šimsa, Š., Uřičář, M., Šulc, M. (2022). Business Document Information Extraction: Towards Practical Benchmarks. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_8