
Assigning ICD-O-3 codes to pathology reports using neural multi-task training with hierarchical regularization

Authors:

Eric B. Durbin,

Ramakanth Kavuluru

BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Article No.: 32, Pages 1 - 10

https://doi.org/10.1145/3459930.3469541

Published: 01 August 2021

Abstract

Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of this information is stored as unstructured text in pathology reports. To process it, we therefore need either automated extraction techniques or manual curation. Automated extraction is difficult, however, because many cancer-related concepts appear infrequently in real-world training datasets, leaving little data to learn from. This study introduces a technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.
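The abstract reports gains in both micro and macro F1, and the difference between the two matters most when some codes are rare. The following is a minimal sketch of how the two averages are commonly computed for single-label classification. It is not the authors' evaluation code: the function names and the toy labels (`C50`, `C34`) are assumptions chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro F1, macro F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one frequent code and one rare code.
gold = ["C50"] * 8 + ["C34", "C34"]
pred = ["C50"] * 8 + ["C50", "C34"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because macro F1 weights every class equally, a single error on the rare code lowers it more than it lowers micro F1, which is why improvements on infrequent codes tend to show up most clearly in the macro score.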



Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Health informatics
2. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Regularization
    2. Machine learning approaches
      1. Neural networks



Published In


BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

August 2021

603 pages

ISBN:9781450384506

DOI:10.1145/3459930

General Chairs:
Hongmei Jiang, Northwestern University
Xiuzhen Huang, Arkansas State University
Jiajie Zhang, The University of Texas Health Science Center at Houston

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBIOM: ACM Special Interest Group on Biomedical Computing

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Funding Sources

Shared Resource Facilities of the University of Kentucky Markey Cancer Center

Conference

BCB '21

Sponsor:

SIGBIOM

BCB '21: 12th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

August 1 - 4, 2021

Gainesville, Florida

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%


Bibliometrics

Article Metrics

Total Citations: 5
Total Downloads: 180
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 4

Reflects downloads up to 21 Oct 2024

Cited By

Ji S, Li X, Sun W, Dong H, Taalas A, Zhang Y, Wu H, Pitkänen E, Marttinen P. (2024). A Unified Review of Deep Learning for Automated Medical Coding. ACM Computing Surveys. DOI: 10.1145/3664615. Online publication date: 17-May-2024.
https://dl.acm.org/doi/10.1145/3664615
Hochheiser H, Finan S, Yuan Z, Durbin E, Jeong J, Hands I, Rust D, Kavuluru R, Wu X, Warner J, Savova G. (2023). DeepPhe-CR: Natural Language Processing Software Services for Cancer Registrar Case Abstraction. JCO Clinical Cancer Informatics. DOI: 10.1200/CCI.23.00156. Online publication date: Sep-2023.
https://doi.org/10.1200/CCI.23.00156
Cai L, Li J, Lv H, Liu W, Niu H, Wang Z. (2023). Integrating domain knowledge for biomedical text analysis into deep learning. Journal of Biomedical Informatics 143:C. DOI: 10.1016/j.jbi.2023.104418. Online publication date: 1-Jul-2023.
https://dl.acm.org/doi/10.1016/j.jbi.2023.104418
Shaprynskyi V, Babii Y. (2022). Diagnostic and Therapeutic Management for Leiomyoma of the Upper Gastrointestinal Tract. Kharkiv Surgical School, 46-54. DOI: 10.37699/2308-7005.4-5.2022.10. Online publication date: 26-Oct-2022.
https://doi.org/10.37699/2308-7005.4-5.2022.10
Shaprynsky V, Kaminsky O, Babii Y. (2021). Clinical and Morphological Features of Gastrointestinal Leiomyomas Which Are Complicated by Bleeding. Клінічна та профілактична медицина, 32-37. DOI: 10.31612/2616-4868.4(18).2021.05. Online publication date: 4-Nov-2021.
https://doi.org/10.31612/2616-4868.4(18).2021.05
