Computer Science > Computer Vision and Pattern Recognition

arXiv:2308.16890 (cs)

[Submitted on 31 Aug 2023 (v1), last revised 4 Sep 2023 (this version, v2)]

Title:TouchStone: Evaluating Vision-Language Models by Language Models

Authors:Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

View PDF

Abstract:Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at this https URL.

Comments:	this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2308.16890 [cs.CV]
	(or arXiv:2308.16890v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.16890

Submission history

From: Shuai Bai [view email]
[v1] Thu, 31 Aug 2023 17:52:04 UTC (11,964 KB)
[v2] Mon, 4 Sep 2023 15:06:15 UTC (11,966 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:TouchStone: Evaluating Vision-Language Models by Language Models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:TouchStone: Evaluating Vision-Language Models by Language Models

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators