Knowledge is power: open-world knowledge representation learning for knowledge-based visual reasoning. (English) Zbl 1543.68361

Summary: Knowledge-based visual reasoning requires associating outside knowledge that is not present in a given image with that image for cross-modal visual understanding. Existing approaches have two deficiencies: (1) they only employ or construct elementary, explicit, yet superficial knowledge graphs and lack the complex, implicit, yet indispensable cross-modal knowledge needed for visual reasoning, and (2) they cannot reason over new/unseen images or questions in open environments, so their assumptions are often violated in real-world applications. How to represent and leverage tacit multimodal knowledge in open-world visual reasoning scenarios has received little attention. In this paper, we propose a novel open-world knowledge representation learning method that not only constructs implicit knowledge representations from the given images and their questions but also enables knowledge transfer from known scenes to unknown scenes for answer prediction. Extensive experiments on six benchmarks demonstrate the superiority of our approach over other state-of-the-art methods. We further apply our approach to other visual reasoning tasks, and the experimental results show that its strong performance can support related reasoning applications.
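To make the data flow described in the summary concrete, below is a minimal, hypothetical sketch (in PyTorch) of the two ingredients it names: fusing image and question features into an implicit cross-modal knowledge embedding, and transferring cached knowledge from known scenes to an unseen scene by similarity-based retrieval. This is not the authors' architecture; the class names (ImplicitKnowledgeEncoder, ExemplarMemory), the feature dimensions, and the cosine-similarity retrieval rule are all illustrative assumptions.

```python
# Hypothetical illustration only: NOT the method of the reviewed paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitKnowledgeEncoder(nn.Module):
    """Fuses image and question features into one implicit cross-modal knowledge vector."""

    def __init__(self, img_dim=2048, txt_dim=768, know_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, know_dim)
        self.txt_proj = nn.Linear(txt_dim, know_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * know_dim, know_dim), nn.ReLU(), nn.Linear(know_dim, know_dim)
        )

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return F.normalize(self.fuse(z), dim=-1)  # unit-norm knowledge embedding


class ExemplarMemory:
    """Caches knowledge embeddings (and answer scores) of known scenes; an unseen scene
    reuses them through cosine-similarity retrieval, i.e. knowledge transfer."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(key.detach())
        self.values.append(value.detach())

    def read(self, query, top_k=5):
        keys = torch.stack(self.keys)                 # (N, D)
        sims = query @ keys.T                         # cosine similarity (inputs are unit-norm)
        w, idx = sims.topk(min(top_k, keys.size(0)), dim=-1)
        vals = torch.stack(self.values)[idx]          # (B, k, num_answers)
        return (F.softmax(w, dim=-1).unsqueeze(-1) * vals).sum(dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    enc, mem, num_answers = ImplicitKnowledgeEncoder(), ExemplarMemory(), 10
    # Known scenes: cache their knowledge embeddings with (here random) answer scores.
    for _ in range(20):
        k = enc(torch.randn(1, 2048), torch.randn(1, 768)).squeeze(0)
        mem.write(k, torch.randn(num_answers))
    # Unseen scene: predict by transferring knowledge from the nearest known exemplars.
    q = enc(torch.randn(1, 2048), torch.randn(1, 768))
    print(mem.read(q, top_k=5).argmax(dim=-1))        # predicted answer index
```

A real system would replace the random features with outputs of pretrained vision and language encoders and train the fusion and answer components on the benchmarks cited below; the sketch only conveys how constructed implicit knowledge can be reused for scenes never seen during training.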

MSC:

68T30 Knowledge representation
68T05 Learning and adaptive systems in artificial intelligence
68T45 Machine vision and scene understanding
68U10 Computing methodologies for image processing
Full Text: DOI

References:

[1] Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y., Knowledge is power: hierarchical-knowledge embedded meta-learning for visual reasoning in artistic domains, (Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, 2021, Association for Computing Machinery: Association for Computing Machinery New York, NY, USA), 2360-2368
[2] Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R., OK-VQA: a visual question answering benchmark requiring external knowledge, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[3] Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y., KM^4: visual reasoning via knowledge embedding memory model with mutual modulation, Inf. Fusion, 2021
[4] Wu, Q.; Wang, P.; Wang, X.; He, X.; Zhu, W., Visual Question Answering-from Theory to Application, 2022, Springer
[5] Suchan, J.; Bhatt, M.; Varadarajan, S., Commonsense visual sensemaking for autonomous driving – on generalised neurosymbolic online abduction integrating vision and semantics, Artif. Intell., 299, Article 103522 pp., 2021
[6] Ceylan, İ. İ.; Darwiche, A.; Van den Broeck, G., Open-world probabilistic databases: semantics, algorithms, complexity, Artif. Intell., 295, Article 103474 pp., 2021 · Zbl 1519.68068
[7] Li, G.; Wang, X.; Zhu, W., Boosting visual question answering with context-aware knowledge aggregation, (Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, 2020, Association for Computing Machinery: Association for Computing Machinery New York, NY, USA), 1227-1235
[8] Weston, J.; Chopra, S.; Bordes, A., Memory networks, 2014
[9] Singh, A. K.; Mishra, A.; Shekhar, S.; Chakraborty, A., From strings to things: knowledge-enabled VQA model that can read and reason, (Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019)
[10] Zhu, Z.; Yu, J.; Wang, Y.; Sun, Y.; Hu, Y.; Wu, Q., Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering, (Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, 2021)
[11] Ben-younes, H.; Cadene, R.; Cord, M.; Thome, N., MUTAN: multimodal Tucker fusion for visual question answering, (ICCV, 2017)
[12] Shevchenko, V.; Teney, D.; Dick, A.; van den Hengel, A., Reasoning over vision and language: exploring the benefits of supplemental knowledge, (Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-World KNowledge (LANTERN), 2021, Association for Computational Linguistics: Association for Computational Linguistics Kyiv, Ukraine), 1-18
[13] Gao, F.; Ping, Q.; Thattai, G.; Reganti, A.; Wu, Y. N.; Natarajan, P., Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 5067-5077
[14] Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z., DBpedia: a nucleus for a web of open data, (Proceedings of the 6th International the Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, ISWC’07/ASWC’07, 2007, Springer-Verlag: Springer-Verlag Berlin, Heidelberg), 722-735
[15] Speer, R.; Chin, J.; Havasi, C., ConceptNet 5.5: an open multilingual graph of general knowledge, (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, 2017, AAAI Press), 4444-4451
[16] Singh, H.; West, R.; Colavizza, G., Wikipedia citations: a comprehensive data set of citations with identifiers extracted from English Wikipedia, Quant. Sci. Stud., 2, 1, 1-19, 2021
[17] Gardères, F.; Ziaeefard, M.; Abeloos, B.; Lecue, F., ConceptBert: concept-aware representation for visual question answering, (Findings of EMNLP, 2020)
[18] Ravi, S.; Chinchure, A.; Sigal, L.; Liao, R.; Shwartz, V., VLC-BERT: visual question answering with contextualized commonsense knowledge, 2022
[19] Salaberria, A.; Azkune, G.; Lopez de Lacalle, O.; Soroa, A.; Agirre, E., Image captioning for effective use of language models in knowledge-based visual question answering, Expert Syst. Appl., 212, Article 118669 pp., 2023
[20] Reiter, R., On closed world data bases, (Webber, B. L.; Nilsson, N. J., Readings in Artificial Intelligence, 1981, Morgan Kaufmann), 119-140
[21] Zhou, Z.-H., Open-environment machine learning, Nat. Sci. Rev., 9, 8, Article nwac123 pp., 07 2022
[22] Heo, Y.-J.; Kim, E.-S.; Choi, W. S.; Zhang, B.-T., Hypergraph transformer: weakly-supervised multi-hop reasoning for knowledge-based visual question answering, (Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, Association for Computational Linguistics: Association for Computational Linguistics Dublin, Ireland), 373-390
[23] Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; Kembhavi, A., Unified-IO: a unified model for vision, language, and multi-modal tasks, 2022
[24] Guo, Y.; Nie, L.; Wong, Y.; Liu, Y.; Cheng, Z.; Kankanhalli, M., A unified end-to-end retriever-reader framework for knowledge-based VQA, (Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, 2022, Association for Computing Machinery: Association for Computing Machinery New York, NY, USA), 2061-2069
[25] Chen, Z.; Huang, Y.; Chen, J.; Geng, Y.; Fang, Y.; Pan, J.; Zhang, N.; Zhang, W., LaKo: knowledge-driven visual question answering via late knowledge-to-text injection, 2022
[26] Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; Amodei, D., Language models are few-shot learners, (Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; Lin, H., Advances in Neural Information Processing Systems, vol. 33, 2020, Curran Associates, Inc.), 1877-1901
[27] Aditya, S.; Yang, Y.; Baral, C., Integrating knowledge and reasoning in image understanding, (Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019), 6252-6259
[28] Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J., Vision-language pre-training: basics, recent advances, and future trends, Found. Trends® Comput. Graph. Vis., 14, 3-4, 163-352, 2022
[29] Chen, F.; Zhang, D.; Han, M.; Chen, X.; Shi, J.; Xu, S.; Xu, B., VLP: a survey on vision-language pre-training, 2022, preprint
[30] Du, Y.; Liu, Z.; Li, J.; Zhao, W. X., A survey of vision-language pre-trained models, 2022, preprint
[31] Zhu, X.; Li, Z.; Wang, X.; Jiang, X.; Sun, P.; Wang, X.; Xiao, Y.; Yuan, N. J., Multi-modal knowledge graph construction and application: a survey, IEEE Trans. Knowl. Data Eng., 1-20, 2022
[32] Khan, S.; Naseer, M.; Hayat, M.; Zamir, S. W.; Khan, F. S.; Shah, M., Transformers in vision: a survey, ACM Comput. Surv., 54, 10s, sep 2022
[33] Uppal, S.; Bhagat, S.; Hazarika, D.; Majumder, N.; Poria, S.; Zimmermann, R.; Zadeh, A., Multimodal research in vision and language: a review of current and emerging trends, Inf. Fusion, 77, 149-171, 2022
[34] Yusuf, A. A.; Chong, F.; Xianling, M., An analysis of graph convolutional networks and recent datasets for visual question answering, Artif. Intell. Rev., 55, 8, 6277-6300, 2022
[35] Liu, Y.; Wei, Y.-S.; Yan, H.; Li, G.-B.; Lin, L., Causal reasoning meets visual representation learning: a prospective study, Mach. Intell. Res., 19, 6, 485-511, 2022
[36] Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R. R.; Cheng, M.-M.; Hu, S.-M., Attention mechanisms in computer vision: a survey, Comput. Vis. Media, 8, 3, 331-368, 2022
[37] de Santana Correia, A.; Colombini, E. L., Attention, please! A survey of neural attention models in deep learning, Artif. Intell. Rev., 55, 8, 6037-6124, 2022
[38] Lu, J.; Batra, D.; Parikh, D.; Lee, S., ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, (Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Advances in Neural Information Processing Systems, vol. 32, 2019, Curran Associates, Inc.)
[39] Tan, H.; Bansal, M., LXMERT: learning cross-modality encoder representations from transformers, (Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, Association for Computational Linguistics: Association for Computational Linguistics Hong Kong, China), 5100-5111
[40] Aditya, S.; Yang, Y.; Baral, C., Explicit reasoning over end-to-end neural architectures for visual question answering, (Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18, 2018, AAAI Press)
[41] Parmar, J.; Chouhan, S. S.; Raychoudhury, V.; Rathore, S. S., Open-world machine learning: applications, challenges, and opportunities, ACM Comput. Surv., 55, 10, 1-37, 2023
[42] Scheirer, W. J.; de Rezende Rocha, A.; Sapkota, A.; Boult, T. E., Toward open set recognition, IEEE Trans. Pattern Anal. Mach. Intell., 35, 7, 1757-1772, 2013
[43] Jain, L. P.; Scheirer, W. J.; Boult, T. E., Multi-class open set recognition using probability of inclusion, (Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T., Computer Vision - ECCV 2014, 2014, Springer International Publishing: Springer International Publishing Cham), 393-409
[44] Joseph, K. J.; Khan, S.; Khan, F. S.; Balasubramanian, V. N., Towards open world object detection, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021), 5830-5840
[45] Gupta, A.; Narayan, S.; Joseph, K. J.; Khan, S.; Khan, F. S.; Shah, M., OW-DETR: open-world detection transformer, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 9235-9244
[46] Cen, J.; Yun, P.; Cai, J.; Wang, M. Y.; Liu, M., Deep metric learning for open world semantic segmentation, (Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021), 15333-15342
[47] Xie, J.; Hou, X.; Ye, K.; Shen, L., CLIMS: cross language image matching for weakly supervised semantic segmentation, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 4483-4492
[48] Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R., OK-VQA: a visual question answering benchmark requiring external knowledge, (Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[49] Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; Mottaghi, R., A-OKVQA: a benchmark for visual question answering using world knowledge, (Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; Hassner, T., Computer Vision - ECCV 2022, 2022, Springer Nature: Springer Nature Switzerland, Cham), 146-162
[50] Lu, J.; Batra, D.; Parikh, D.; Lee, S., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, Curran Associates Inc.: Curran Associates Inc. Red Hook, NY, USA
[51] Marino, K.; Chen, X.; Parikh, D.; Gupta, A.; Rohrbach, M., KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021), 14111-14121
[52] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N., An image is worth 16x16 words: transformers for image recognition at scale, (International Conference on Learning Representations, 2021)
[53] He, K.; Zhang, X.; Ren, S.; Sun, J., Deep residual learning for image recognition, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016)
[54] Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y., Two heads are better than one: hypergraph-enhanced graph reasoning for visual event ratiocination, (Meila, M.; Zhang, T., Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, 2021, PMLR), 12747-12760
[55] Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K., BERT: pre-training of deep bidirectional transformers for language understanding, (Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, Association for Computational Linguistics: Association for Computational Linguistics Minneapolis, Minnesota), 4171-4186
[56] Chen, Y.; Rohrbach, M.; Yan, Z.; Shuicheng, Y.; Feng, J.; Kalantidis, Y., Graph-based global reasoning networks, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[57] Liang, X.; Hu, Z.; Zhang, H.; Lin, L.; Xing, E. P., Symbolic graph reasoning meets convolutions, (Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R., Advances in Neural Information Processing Systems, vol. 31, 2018, Curran Associates, Inc.)
[58] Kipf, T. N.; Welling, M., Semi-supervised classification with graph convolutional networks, (International Conference on Learning Representations, 2017)
[59] Li, Q.; Han, Z.; Wu, X.-m., Deeper insights into graph convolutional networks for semi-supervised learning, Proc. AAAI Conf. Artif. Intell., 32, 1, Apr. 2018
[60] Goodfellow, I.; Bengio, Y.; Courville, A., Deep Learning, 2016, MIT Press · Zbl 1373.68009
[61] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.u.; Polosukhin, I., Attention is all you need, (Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Advances in Neural Information Processing Systems, vol. 30, 2017, Curran Associates, Inc.)
[62] Wang, X.; Girshick, R.; Gupta, A.; He, K., Non-local neural networks, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018)
[63] Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X., Asymmetric non-local neural networks for semantic segmentation, (Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019)
[64] Kim, W.; Son, B.; Kim, I., ViLT: vision-and-language transformer without convolution or region supervision, (Meila, M.; Zhang, T., Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, 2021, PMLR), 5583-5594
[65] Ding, Y.; Yu, J.; Liu, B.; Hu, Y.; Cui, M.; Wu, Q., MuKEA: multimodal knowledge extraction and accumulation for knowledge-based visual question answering, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 5089-5098
[66] Hudson, D. A.; Manning, C. D., GQA: a new dataset for real-world visual reasoning and compositional question answering, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[67] Chang, Y.; Narang, M.; Suzuki, H.; Cao, G.; Gao, J.; Bisk, Y., WebQA: multihop and multimodal QA, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 16495-16504
[68] Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D., Making the V in VQA matter: elevating the role of image understanding in visual question answering, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017)
[69] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. L., Microsoft COCO: common objects in context, (Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T., Computer Vision - ECCV 2014, 2014, Springer International Publishing: Springer International Publishing Cham), 740-755
[70] Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; Bernstein, M. S.; Fei-Fei, L., Visual genome: connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vis., 123, 1, 32-73, 2017
[71] Ordonez, V.; Kulkarni, G.; Berg, T., Im2Text: describing images using 1 million captioned photographs, (Shawe-Taylor, J.; Zemel, R.; Bartlett, P.; Pereira, F.; Weinberger, K., Advances in Neural Information Processing Systems, vol. 24, 2011, Curran Associates, Inc.)
[72] Sharma, P.; Ding, N.; Goodman, S.; Soricut, R., Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning, (Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, Association for Computational Linguistics: Association for Computational Linguistics Melbourne, Australia), 2556-2565
[73] Zhang, Z.; Sabuncu, M., Generalized cross entropy loss for training deep neural networks with noisy labels, (Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R., Advances in Neural Information Processing Systems, vol. 31, 2018, Curran Associates, Inc.)
[74] MacQueen, J., Some methods for classification and analysis of multivariate observations, (Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967)
[75] Ashby, F. G.; Rosedahl, L., A neural interpretation of exemplar theory, Psychol. Rev., 124, 4, 472, 2017
[76] Hwang, J.; Oh, S. W.; Lee, J.-Y.; Han, B., Exemplar-based open-set panoptic segmentation network, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021), 1175-1184
[77] Zhou, Z.-H., Machine Learning, 2021, Springer Nature · Zbl 1479.68001
[78] Wu, Q.; Yang, C.; Yan, J., Towards open-world feature extrapolation: an inductive graph learning approach, (Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; Vaughan, J. W., Advances in Neural Information Processing Systems, vol. 34, 2021, Curran Associates, Inc.), 19435-19447
[79] Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J., A strong baseline and batch normalization neck for deep person re-identification, IEEE Trans. Multimed., 22, 10, 2597-2609, 2020
[80] Xie, R.; Liu, Z.; Luan, H.; Sun, M., Image-embodied knowledge representation learning, (Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017), 3140-3146
[81] Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O., Translating embeddings for modeling multi-relational data, (Burges, C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K., Advances in Neural Information Processing Systems, vol. 26, 2013, Curran Associates, Inc.)
[82] Kamigaito, H.; Hayashi, K., Unified interpretation of softmax cross-entropy and negative sampling: with case study for knowledge graph embedding, (Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, Association for Computational Linguistics), 5517-5531, Online
[83] Hamilton, W. L., Graph representation learning, Synth. Lect. Artif. Intell. Mach. Learn., 14, 3, 1-159, 2020
[84] Microsoft, Microsoft/NNI: an open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning, GitHub
[85] Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A., FVQA: fact-based visual question answering, IEEE Trans. Pattern Anal. Mach. Intell., 40, 10, 2413-2427, 2018
[86] Shah, S.; Mishra, A.; Yadati, N.; Talukdar, P. P., KVQA: knowledge-aware visual question answering, Proc. AAAI Conf. Artif. Intell., 33, 01, 8876-8884, 2019
[87] Cao, Q.; Li, B.; Liang, X.; Wang, K.; Lin, L., Knowledge-routed visual question reasoning: challenges for deep representation embedding, IEEE Trans. Neural Netw. Learn. Syst., 33, 7, 2758-2767, 2022
[88] Gupta, A.; Narayan, S.; Joseph, K. J.; Khan, S.; Khan, F. S.; Shah, M., OW-DETR: open-world detection transformer, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 9235-9244
[89] Ma, S.; Wang, Y.; Wei, Y.; Fan, J.; Li, T. H.; Liu, H.; Lv, F., CAT: localization and identification cascade detection transformer for open-world object detection, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023), 19681-19690
[90] Loshchilov, I.; Hutter, F., Decoupled weight decay regularization, (International Conference on Learning Representations, 2019)
[91] Kim, J.-H.; Jun, J.; Zhang, B.-T., Bilinear attention networks, (Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R., Advances in Neural Information Processing Systems, vol. 31, 2018, Curran Associates, Inc.)
[92] Lu, J.; Yang, J.; Batra, D.; Parikh, D., Hierarchical question-image co-attention for visual question answering, (Lee, D.; Sugiyama, M.; Luxburg, U.; Guyon, I.; Garnett, R., Advances in Neural Information Processing Systems, vol. 29, 2016, Curran Associates, Inc.)
[93] Han, Y.; Yin, J.; Wu, J.; Wei, Y.; Nie, L., Semantic-aware modular capsule routing for visual question answering, 2022
[94] Narasimhan, M.; Schwing, A. G., Straight to the facts: learning knowledge base retrieval for factual visual question answering, (Proceedings of the European Conference on Computer Vision (ECCV), 2018)
[95] Liu, L.; Wang, M.; He, X.; Qing, L.; Chen, H., Fact-based visual question answering via dual-process system, Knowl.-Based Syst., 237, Article 107650 pp., 2022
[96] Zhang, L.; Liu, S.; Liu, D.; Zeng, P.; Li, X.; Song, J.; Gao, L., Rich visual knowledge-based augmentation network for visual question answering, IEEE Trans. Neural Netw. Learn. Syst., 32, 10, 4362-4373, 2021
[97] Zhang, Y.; Jiang, M.; Zhao, Q., Query and attention augmentation for knowledge-based explainable reasoning, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022), 15576-15585
[98] Narasimhan, M.; Lazebnik, S.; Schwing, A., Out of the box: reasoning with graph convolution nets for factual visual question answering, (Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R., Advances in Neural Information Processing Systems, vol. 31, 2018, Curran Associates, Inc.)
[99] Song, L.; Li, J.; Liu, J.; Yang, Y.; Shang, X.; Sun, M., Answering knowledge-based visual questions via the exploration of question purpose, Pattern Recognit., 133, Article 109015 pp., 2023
[100] Yu, J.; Zhu, Z.; Wang, Y.; Zhang, W.; Hu, Y.; Tan, J., Cross-modal knowledge reasoning for knowledge-based visual question answering, Pattern Recognit., 108, Article 107563 pp., 2020
[101] Li, M.; Moens, M.-F., Dynamic Key-Value Memory Enhanced Multi-Step Graph Reasoning for Knowledge-Based Visual Question Answering, 2021, Association for the Advancement of Artificial Intelligence
[102] Graves, A.; Fernández, S.; Schmidhuber, J., Bidirectional LSTM networks for improved phoneme classification and recognition, (Duch, W.; Kacprzyk, J.; Oja, E.; Zadrożny, S., Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005, 2005, Springer Berlin Heidelberg: Springer Berlin Heidelberg Berlin, Heidelberg), 799-804
[103] Sukhbaatar, S.; Szlam, A.; Weston, J.; Fergus, R., End-to-end memory networks, (Cortes, C.; Lawrence, N.; Lee, D.; Sugiyama, M.; Garnett, R., Advances in Neural Information Processing Systems, vol. 28, 2015, Curran Associates, Inc.)
[104] Li, Y.; Zemel, R.; Brockschmidt, M.; Tarlow, D., Gated graph sequence neural networks, (Proceedings of ICLR’16, Proceedings of iclr’16 Edition, 2016)
[105] Garcia-Olano, D.; Onoe, Y.; Ghosh, J., Improving and diagnosing knowledge-based visual question answering via entity enhanced knowledge injection, (Companion Proceedings of the Web Conference 2022, WWW ’22, 2022, Association for Computing Machinery: Association for Computing Machinery New York, NY, USA), 705-715
[106] Kim, E.-S.; Kang, W. Y.; On, K.-W.; Heo, Y.-J.; Zhang, B.-T., Hypergraph attention networks for multimodal learning, (2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020), 14569-14578
[107] LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D., Backpropagation applied to handwritten zip code recognition, Neural Comput., 1, 4, 541-551, 1989
[108] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; Parikh, D., Vqa: visual question answering, (Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015)
[109] Marino, K.; Chen, X.; Parikh, D.; Gupta, A.; Rohrbach, M., KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021), 14111-14121
[110] Wu, J.; Lu, J.; Sabharwal, A.; Mottaghi, R., Multi-modal answer validation for knowledge-based VQA, Proc. AAAI Conf. Artif. Intell., 36, 3, 2712-2721, 2022
[111] Luo, M.; Zeng, Y.; Banerjee, P.; Baral, C., Weakly-supervised visual-retriever-reader for knowledge-based question answering, (Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, Association for Computational Linguistics: Association for Computational Linguistics Online and Punta Cana, Dominican Republic), 6417-6431
[112] Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; Kolesnikov, A.; Puigcerver, J.; Ding, N.; Rong, K.; Akbari, H.; Mishra, G.; Xue, L.; Thapliyal, A.; Bradbury, J.; Kuo, W.; Seyedhosseini, M.; Jia, C.; Karagol Ayan, B.; Riquelme, C.; Steiner, A.; Angelova, A.; Zhai, X.; Houlsby, N.; Soricut, R., PaLI: a jointly-scaled multilingual language-image model, 2022
[113] Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Lu, Y.; Liu, Z.; Wang, L., An empirical study of GPT-3 for few-shot knowledge-based VQA, Proc. AAAI Conf. Artif. Intell., 36, 3, 3081-3089, 2022
[114] Hao, Y.; Song, H.; Dong, L.; Huang, S.; Chi, Z.; Wang, W.; Ma, S.; Wei, F., Language models are general-purpose interfaces, 2022
[115] Hu, Y.; Hua, H.; Yang, Z.; Shi, W.; Smith, N. A.; Luo, J., PromptCap: prompt-guided task-aware image captioning, 2022
[116] Jiang, Y.; Natarajan, V.; Chen, X.; Rohrbach, M.; Batra, D.; Parikh, D., Pythia v0.1: the winning entry to the VQA challenge 2018, 2018
[117] Kamath, A.; Clark, C.; Gupta, T.; Kolve, E.; Hoiem, D.; Kembhavi, A., Webly supervised concept expansion for general purpose vision models, 2022
[118] Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H., OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, (Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvari, C.; Niu, G.; Sabato, S., Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 162, 2022, PMLR), 23318-23340
[119] Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A., FiLM: visual reasoning with a general conditioning layer, Proc. AAAI Conf. Artif. Intell., 32, 1, Apr. 2018
[120] Yu, Z.; Yu, J.; Xiang, C.; Fan, J.; Tao, D., Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering, IEEE Trans. Neural Netw. Learn. Syst., 29, 12, 5947-5959, 2018
[121] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L., Bottom-up and top-down attention for image captioning and visual question answering, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018)
[122] Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q., Deep modular co-attention networks for visual question answering, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[123] Cao, Q.; Li, B.; Liang, X.; Lin, L., Explainable high-order visual question reasoning: a new benchmark and knowledge-routed network, 2019
[124] Shen, S.; Li, L. H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.-W.; Yao, Z.; Keutzer, K., How much can CLIP benefit vision-and-language tasks?, (International Conference on Learning Representations, 2022)
[125] Li, J.; Li, D.; Savarese, S.; Hoi, S., BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models, (Proceedings of the 40th International Conference on Machine Learning, ICML’23, 2023, JMLR.org)
[126] Achiam, O. J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; Avila, R.; Babuschkin, I.; Balaji, S.; Balcom, V.; Baltescu, P.; Bao, H.; Bavarian, M.; Belgum, J.; Bello, I.; Berdine, J.; Bernadett-Shapiro, G.; Berner, C.; Bogdonoff, L.; Boiko, O.; Boyd, M.; Brakman, A.-L.; Brockman, G.; Brooks, T.; Brundage, M.; Button, K.; Cai, T.; Campbell, R.; Cann, A.; Carey, B.; Carlson, C.; Carmichael, R.; Chan, B.; Chang, C.; Chantzis, F.; Chen, D.; Chen, S.; Chen, R.; Chen, J.; Chen, M.; Chess, B.; Cho, C.; Chu, C.; Chung, H. W.; Cummings, D.; Currier, J.; Dai, Y.; Decareaux, C.; Degry, T.; Deutsch, N.; Deville, D.; Dhar, A.; Dohan, D.; Dowling, S.; Dunning, S.; Ecoffet, A.; Eleti, A.; Eloundou, T.; Farhi, D.; Fedus, L.; Felix, N.; Fishman, S. P.; Forte, J.; Fulford, I.; Gao, L.; Georges, E.; Gibson, C.; Goel, V.; Gogineni, T.; Goh, G.; Gontijo-Lopes, R.; Gordon, J.; Grafstein, M.; Gray, S.; Greene, R.; Gross, J.; Gu, S. S.; Guo, Y.; Hallacy, C.; Han, J.; Harris, J.; He, Y.; Heaton, M.; Heidecke, J.; Hesse, C.; Hickey, A.; Hickey, W.; Hoeschele, P.; Houghton, B.; Hsu, K.; Hu, S.; Hu, X.; Huizinga, J.; Jain, S.; Jain, S.; Jang, J.; Jiang, A.; Jiang, R.; Jin, H.; Jin, D.; Jomoto, S.; Jonn, B.; Jun, H.; Kaftan, T.; Kaiser, L.; Kamali, A.; Kanitscheider, I.; Keskar, N. S.; Khan, T.; Kilpatrick, L.; Kim, J. W.; Kim, C.; Kim, Y.; Kirchner, H.; Kiros, J. R.; Knight, M.; Kokotajlo, D.; Kondraciuk, L.; Kondrich, A.; Konstantinidis, A.; Kosic, K.; Krueger, G.; Kuo, V.; Lampe, M.; Lan, I.; Lee, T.; Leike, J.; Leung, J.; Levy, D.; Li, C. M.; Lim, R.; Lin, M.; Lin, S.; Litwin, M.; Lopez, T.; Lowe, R.; Lue, P.; Makanju, A. A.; Malfacini, K.; Manning, S.; Markov, T.; Markovski, Y.; Martin, B.; Mayer, K.; Mayne, A.; McGrew, B.; McKinney, S. M.; McLeavey, C.; McMillan, P.; McNeil, J.; Medina, D.; Mehta, A.; Menick, J.; Metz, L.; Mishchenko, A.; Mishkin, P.; Monaco, V.; Morikawa, E.; Mossing, D. P.; Mu, T.; Murati, M.; Murk, O.; M’ely, D.; Nair, A.; Nakano, R.; Nayak, R.; Neelakantan, A.; Ngo, R.; Noh, H.; Long, O.; O’Keefe, C.; Pachocki, J. W.; Paino, A.; Palermo, J.; Pantuliano, A.; Parascandolo, G.; Parish, J.; Parparita, E.; Passos, A.; Pavlov, M.; Peng, A.; Perelman, A.; de Avila Belbute Peres, F.; Petrov, M.; de Oliveira Pinto, H. P.; Pokorny, M.; Pokrass, M.; Pong, V. H.; Powell, T.; Power, A.; Power, B.; Proehl, E.; Puri, R.; Radford, A.; Rae, J.; Ramesh, A.; Raymond, C.; Real, F.; Rimbach, K.; Ross, C.; Rotsted, B.; Roussez, H.; Ryder, N.; Saltarelli, M. D.; Sanders, T.; Santurkar, S.; Sastry, G.; Schmidt, H.; Schnurr, D.; Schulman, J.; Selsam, D.; Sheppard, K.; Sherbakov, T.; Shieh, J.; Shoker, S.; Shyam, P.; Sidor, S.; Sigler, E.; Simens, M.; Sitkin, J.; Slama, K.; Sohl, I.; Sokolowsky, B. D.; Song, Y.; Staudacher, N.; Such, F. P.; Summers, N.; Sutskever, I.; Tang, J.; Tezak, N. A.; Thompson, M.; Tillet, P.; Tootoonchian, A.; Tseng, E.; Tuggle, P.; Turley, N.; Tworek, J.; Uribe, J. F.C.; Vallone, A.; Vijayvergiya, A.; Voss, C.; Wainwright, C.; Wang, J. J.; Wang, A.; Wang, B.; Ward, J.; Wei, J.; Weinmann, C.; Welihinda, A.; Welinder, P.; Weng, J.; Weng, L.; Wiethoff, M.; Willner, D.; Winter, C.; Wolrich, S.; Wong, H.; Workman, L.; Wu, S.; Wu, J.; Wu, M.; Xiao, K.; Xu, T.; Yoo, S.; Yu, K.; Yuan, Q.; Zaremba, W.; Zellers, R.; Zhang, C.; Zhang, M.; Zhao, S.; Zheng, T.; Zhuang, J.; Zhuk, W.; Zoph, B., 2023, Gpt-4 technical report
[127] Balepur, N.; Ravichander, A.; Rudinger, R., Artifacts or abduction: How do LLMs answer multiple-choice questions without the question?, 2024, preprint
[128] Smith, B., Stop talking about tomorrow’s AI doomsday when AI poses risks today, Nature, 618, 885-886, 2023
[129] Samuelson, P., Generative AI meets copyright, Science, 381, 6654, 158-161, 2023
[130] Radhakrishnan, A.; Beaglehole, D.; Pandit, P.; Belkin, M., Mechanism for feature learning in neural networks and backpropagation-free machine learning models, Science, 383, 6690, 1461-1467, 2024
[131] Mottaghi, R.; Rastegari, M.; Gupta, A.; Farhadi, A., “what happens if...” learning to predict the effect of forces in images, (Leibe, B.; Matas, J.; Sebe, N.; Welling, M., Computer Vision - ECCV 2016, 2016, Springer International Publishing: Springer International Publishing Cham), 269-285
[132] Gu, H., The Discourses and Sayings of Confucius: A New Special Translation, Illustrated with Quotations from Goethe and Other Writers, 1898, Kelly and Walsh, limited
[133] Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; Kalyan, A., Learn to explain: multimodal reasoning via thought chains for science question answering, (Oh, A. H.; Agarwal, A.; Belgrave, D.; Cho, K., Advances in Neural Information Processing Systems, 2022)
[134] Gao, P.; Jiang, Z.; You, H.; Lu, P.; Hoi, S. C.H.; Wang, X.; Li, H., Dynamic fusion with intra- and inter-modality attention flow for visual question answering, (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019)
[135] Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; Zhu, S.-C., IconQA: a new benchmark for abstract diagram understanding and visual language reasoning, (The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2021)
[136] Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W., What does BERT with vision look at?, (Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, Association for Computational Linguistics), 5265-5275, Online
[137] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J., Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res., 21, 140, 1-67, 2020
[138] Ben Abacha, A.; Sarrouti, M.; Demner-Fushman, D.; Hasan, S. A.; Müller, H., Overview of the VQA-Med task at ImageCLEF 2021: visual question answering and generation in the medical domain, (CLEF 2021 Working Notes, CEUR Workshop Proceedings, 2021, CEUR-WS.org: CEUR-WS.org Bucharest, Romania)
[139] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J., BLEU: a method for automatic evaluation of machine translation, (Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 2002, Association for Computational Linguistics: Association for Computational Linguistics USA), 311-318
[140] Gong, H.; Huang, R.; Chen, G.; Li, G., SYSU-HCP at VQA-Med 2021: a data-centric model with efficient training methodology for medical visual question answering, (CLEF 2021 - Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania, CEUR Workshop Proceedings, 2021)
[141] Xiao, Q.; Zhou, X., Yunnan University at VQA-Med 2021: pretrained BioBERT for medical domain visual question answering, (CLEF 2021 - Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania, CEUR Workshop Proceedings, 2021)
[142] Eslami, S.; de Melo, G.; Meinel, C., TeamS at VQA-Med 2021: BBN-Orchestra for long-tailed medical visual question answering, (CLEF 2021 - Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania, CEUR Workshop Proceedings, 2021)
[143] Li, J.; Liu, S., Lijie at ImageCLEFmed VQA-Med 2021: attention model-based efficient interaction between multimodality, (CLEF (Working Notes), 2021), 1275-1284
[144] Schilling, R.; Messina, P.; Parra, D.; Löbel, H., PUC Chile team at VQA-Med 2021: approaching VQA as a classification task via fine-tuning a pretrained CNN, (CLEF (Working Notes), 2021), 1346-1351
[145] Li, Y.; Yang, Z.; Hao, T., TAM at VQA-Med 2021: a hybrid model with feature extraction and fusion for medical visual question answering, (CLEF (Working Notes), 2021), 1295-1304
[146] Sitara, N. M. S.; Srinivasan, K., SSN MLRG at VQA-Med 2021: an approach for VQA to solve abnormality related queries using improved datasets, (CLEF (Working Notes), 2021), 1329-1335