
Assigning ICD-O-3 codes to pathology reports using neural multi-task training with hierarchical regularization

Published: 01 August 2021

Abstract

Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of this information is stored as unstructured text in pathology reports. Thus, to process the information, we require either automated extraction techniques or manual curation. Moreover, many cancer-related concepts appear infrequently in real-world training datasets, and this limited data makes automated extraction difficult. This study introduces a technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.
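The distinction between micro and macro F1 matters here because rare codes contribute little to micro F1 but count equally toward macro F1. The following is a minimal illustrative sketch, not the evaluation code used in the study: the function name `f1_scores` and the toy labels are assumptions chosen for illustration, and the setting is simplified to one label per report.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro F1 pools counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): one frequent code and two rare codes,
# with a classifier that always predicts the frequent one.
gold = ["C50"] * 8 + ["C34", "C18"]
pred = ["C50"] * 10
micro, macro = f1_scores(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the classifier never predicts the two rare codes, so micro F1 stays high while macro F1 drops sharply. This is why improvements on infrequent classes show up mainly in macro F1.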


        Published In

        BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
        August 2021
        603 pages
        ISBN:9781450384506
        DOI:10.1145/3459930
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. biomedical informatics
        2. natural language processing
        3. neural networks

        Qualifiers

        • Research-article

        Funding Sources

        • Shared Resource Facilities of the University of Kentucky Markey Cancer Center

        Conference

        BCB '21
        Acceptance Rates

        Overall Acceptance Rate 254 of 885 submissions, 29%
