Abstract
Biomedical question answering (BioQA) is the task of automatically extracting answers to questions from the biomedical literature, and as the number of accessible biomedical papers grows rapidly, BioQA is attracting increasing attention. To improve the performance of BioQA systems, we designed strategies for the sub-tasks of BioQA and assessed their effectiveness on the BioASQ dataset. For each sub-task, we chose a data-centric or model-centric strategy based on its potential for improvement. For example, model design for factoid-type questions has been explored intensively, but the potential of increased label consistency has not been investigated (a data-centric approach). For list-type questions, on the other hand, we apply a sequence tagging model, which is more natural for the multi-answer (i.e., multi-label) task (a model-centric approach).
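To illustrate why sequence tagging fits multi-answer extraction, the sketch below decodes BIO tags over a snippet into a set of answer spans; a span-prediction model must instead pick one start/end pair per pass. The tag decoding is standard BIO convention, not the authors' exact model, and the example tokens are invented.

```python
def decode_bio(tokens, tags):
    """Collect every contiguous B/I-tagged span as one answer."""
    answers, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":  # a new answer span begins
            if current:
                answers.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:  # continue the open span
            current.append(token)
        else:  # "O" closes any open span
            if current:
                answers.append(" ".join(current))
                current = []
    if current:
        answers.append(" ".join(current))
    return answers

tokens = ["EGFR", ",", "KRAS", "and", "BRAF", "are", "mutated"]
tags   = ["B",    "O", "B",    "O",   "B",    "O",   "O"]
print(decode_bio(tokens, tags))  # ['EGFR', 'KRAS', 'BRAF']
```

A single tagging pass thus yields all three answers at once, which is why the multi-label formulation is a natural fit for list-type questions.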
Our experimental results suggest two main points: scarce resources such as BioQA datasets can benefit from data-centric approaches with relatively little effort, and a model design that reflects the characteristics of the data can improve system performance.
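One simple check in the spirit of a data-centric approach is shown below: it flags QA examples whose gold answer never appears in the accompanying snippet, a common source of label inconsistency in extractive QA data. The field names and the check itself are illustrative assumptions, not the authors' actual cleaning pipeline.

```python
def inconsistent_examples(dataset):
    """Return ids of examples whose gold answers are absent from the snippet."""
    flagged = []
    for ex in dataset:
        context = ex["snippet"].lower()
        # Flag the example if no gold answer string occurs in the context.
        if not any(ans.lower() in context for ans in ex["answers"]):
            flagged.append(ex["id"])
    return flagged

data = [
    {"id": "q1", "snippet": "Imatinib targets BCR-ABL.", "answers": ["BCR-ABL"]},
    {"id": "q2", "snippet": "The drug inhibits EGFR.", "answers": ["KRAS"]},
]
print(inconsistent_examples(data))  # ['q2']
```

Even such lightweight audits can surface annotation problems cheaply, which is the sense in which small datasets benefit from data-centric effort.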
This paper focuses primarily on the application of our strategies to the BioASQ 8b dataset and on our participating systems in the 9th BioASQ challenge. Our submissions achieved competitive results, with top or near-top performance in the 9th challenge (Task b, Phase B).
Notes
- 1.
Resources for our data cleaning operations (our annotations) are available at https://github.com/dmis-lab/bioasq9b-dmis.
- 2.
- 3.
- 4.
Last checked in May 2022.
- 5.
The official result (human evaluation) is on: http://participants-area.bioasq.org/results/9b/phaseB/.
References
Medline PubMed Production Statistics. https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html. Accessed 19 June 2022
Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/W19-1909, https://www.aclweb.org/anthology/W19-1909
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620 (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
Dror, R., Peled-Cohen, L., Shlomov, S., Reichart, R.: Statistical significance testing for natural language processing. Synthesis Lect. Hum. Lang. Technol. 13(2), 1–116 (2020)
Falke, T., Ribeiro, L.F., Utama, P.A., Dagan, I., Gurevych, I.: Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2214–2220 (2019)
Jeong, M., et al.: Transferability of natural language inference to biomedical question answering. arXiv preprint arXiv:2007.00217 (2020)
Jin, Q., Dhingra, B., Cohen, W.W., Lu, X.: Probing biomedical embeddings from language models. arXiv preprint (2019)
Kim, D., et al.: A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729–73740 (2019). https://doi.org/10.1109/ACCESS.2019.2920708
Kim, N., et al.: Probing what different NLP tasks teach machines about function word comprehension. In: Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pp. 235–249. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/S19-1026, https://www.aclweb.org/anthology/S19-1026
Krithara, A., Nentidis, A., Paliouras, G., Krallinger, M., Miranda, A.: BioASQ at CLEF2021: large-scale biomedical semantic indexing and question answering. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds.) ECIR 2021. LNCS, vol. 12657, pp. 624–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72240-1_73
Kryściński, W., McCann, B., Xiong, C., Socher, R.: Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840 (2019)
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension (2019)
Mollá, D., Khanna, U., Galat, D., Nguyen, V., Rybinski, M.: Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions. arXiv preprint arXiv:2108.12189 (2021)
Ng, A.Y.: A Chat with Andrew on MLOps: from model-centric to data-centric AI (2021). https://www.youtube.com/06-AZXmwHjo
Ozyurt, I.B.: End-to-end biomedical question answering via bio-answerfinder and discriminative language representation models. In: CLEF (Working Notes) (2021)
Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint (2019)
Peters, M.E., et al.: Deep contextualized word representations (2018)
Phang, J., Févry, T., Bowman, S.R.: Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks (2019)
Tsatsaronis, G., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 1–28 (2015)
Wiese, G., Weissenborn, D., Neves, M.: Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 281–289. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/K17-1029, https://www.aclweb.org/anthology/K17-1029
Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, New Orleans, June 2018. https://doi.org/10.18653/v1/N18-1101, https://www.aclweb.org/anthology/N18-1101
Yoon, W., Jackson, R., Lagerberg, A., Kang, J.: Sequence tagging for biomedical extractive question answering. Bioinformatics (2022). https://doi.org/10.1093/bioinformatics/btac397
Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019. CCIS, vol. 1168, pp. 727–740. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43887-6_64
Yoon, W., et al.: KU-DMIS at BioASQ 9: data-centric and model-centric approaches for biomedical question answering. In: CLEF (Working Notes), pp. 351–359 (2021)
Zhang, Y., Han, J.C., Tsai, R.T.H.: NCU-IISR/AS-GIS: results of various pre-trained biomedical language models and linear regression model in BioASQ task 9b phase B. In: CEUR Workshop Proceedings (2021)
Zhu, C., et al.: Enhancing factual consistency of abstractive summarization. arXiv preprint arXiv:2003.08612 (2020)
Acknowledgements
We are grateful to Dr. Jihye Kim and Dr. Sungjoon Park of Korea University for their invaluable insight into our systems’ output. This research was supported by the National Research Foundation of Korea (NRF-2020R1A2C3010638) and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).
Ethics declarations
Author Note
This work was submitted to the 2022 CLEF - Best of 2021 Labs track. It originates from our participation in the 9th BioASQ challenge (2021 CLEF Labs), presented under the title KU-DMIS at BioASQ 9: Data-centric and model-centric approaches for biomedical question answering (Yoon et al. 2021 [26]).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yoon, W. et al. (2022). Data-Centric and Model-Centric Approaches for Biomedical Question Answering. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13642-9
Online ISBN: 978-3-031-13643-6
eBook Packages: Computer Science, Computer Science (R0)