\newlistof

casestudyfigurescsfList of Case Study Figures

LIME: Less Is More for MLLM Evaluation

King Zhu^1,2, Qianbo Zang¹⁰¹¹footnotemark: 1, Shian Jia⁴¹¹footnotemark: 1, Siwei Wu^1,3¹¹footnotemark: 1, Feiteng Fang^6,9¹¹footnotemark: 1, Yizhi Li^1,3
Shawn Gavin¹, Tuney Zheng^1,2, Jiawei Guo^1,2, Bo Li⁵, Haoning Wu², Xingwei Qu¹
Jian Yang¹, Zachary Liu⁸, Xiang Yue, J.H. Liu¹, Chenghua Lin^1,3, Min Yang⁹,Hamid Alinejad-Rokny
Shiwen Ni⁹, Wenhao Huang^1,2²²footnotemark: 2, Ge Zhang^1,2²²footnotemark: 2
¹M-A-P, ²01.ai,³University of Manchester,⁴Zhejiang University, ⁵NTU, ⁶The University of New South Wales
⁷USTC,⁸Dartmouth College,⁹Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
¹⁰Interdisciplinary Centre for Security, Reliability and Trust (SnT), Université du Luxembourg
Equal Contribution.Corresonding authors.

Abstract

Multimodal Large Language Models (MLLMs) are measured on numerous benchmarks like image captioning, visual question answer, and reasoning. However, these benchmarks often include overly simple or uninformative samples, making it difficult to effectively distinguish the performance of different MLLMs. Additionally, evaluating models across many benchmarks creates a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated using a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that require image-based understanding. Our experiments show that LIME reduces the number of samples by 76% and evaluation time by 77%, while it can more effectively distinguish different models’ abilities. Notably, we find that traditional automatic metrics like CIDEr are insufficient for evaluating MLLMs’ captioning performance, and excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://github.com/kangreen0210/LIME.

1 Introduction

In order to better understand the model’s capabilities and guide addressing the shortcomings of MLLMs, researchers develop numerous benchmarks for various tasks (Antol et al., 2015; Wei et al., 2023; Fu et al., 2023; Yue et al., 2024; Wu et al., 2024a). These benchmarks thoroughly explore the capabilities of MLLMs in various tasks such as image captioning, image question answering, and multimodal retrieving.

However, existing MLLM benchmarks and unified evaluation frameworks cannot effectively and efficiently reflect the ability of MLLMs. Current benchmarks include numerous relatively simple samples (i.e., how many chairs are in the picture) and some incorrect questions caused by annotation issues. Most MLLMs consistently perform on these samples (i.e., all correct or all wrong). Therefore, those benchmarks cannot fully reflect the gap between different MLLMs and across various tasks. Besides, the current unified multimodal evaluation frameworks require significant computational resources, necessitating integrating much evaluation data from various benchmarks. The selection of effective evaluation data is largely overlooked by current researchers.

Refer to caption — Figure 1: Pipeline of the Data Curation. The left half part is the Open-Source Models as Judges module, which uses several Multimodal LLMs to answer questions for each sample and assess their difficulty. The upper right part is the Semi-Automated Screening Process module filtering some samples that are too simple or difficult. As for the Eliminating Answer Leakage, we filter the sample that can be answered without the image.

As shown in Figure 1, to address the aforementioned issues, we propose to use a general data process pipeline and curate a LIME, which contains 9403 samples and is refined across 10 tasks within 6 domains. We select six major tasks in the multimodal domain and use 9 MLLMs to refine those 10 benchmarks within the corresponding domain. To eliminate bias introduced by individual models, we choose 9 models as judges and filter samples based on their performance. On the one hand, we remove samples that most models answer correctly due to the fact that they cannot distinguish the capabilities among different models. On the other hand, we use a method that combines humans and MLLMs to filter out some abnormally difficult samples. Meanwhile, we use LLMs to filter out samples that can be answered directly from the context of the question. After that, we obtain a smaller yet higher-quality unified bench (i.e., LIME). We conduct various experiments on LIME using both MLLMs and LLMs on different input settings, such as QA + image inputs, QA input (text-only input), and the QA + image description experiment. We make several valuable findings:

•

LIME can better reflect the performance differences of MLLMs. On our LIME benchmark, under consistent conditions (same model series, same model size), different MLLMs demonstrate a wider score range, indicating that LIME is more effective at reflecting performance differences between models with a smaller amount of data.
•

MLLMs exhibit varying capabilities across different subtasks. Specifically, they excel in the Visual Question Answering (VQA) subtasks, showcasing relatively high performance when answering questions directly related to factual information depicted in images. However, their performance is comparatively lower in tasks that necessitate the application of additional commonsense knowledge or complex reasoning. This highlights the significant image content recognition capabilities of current MLLMs. Additionally, most MLLMs perform exceptionally well in POPE and Caption subtasks, while demonstrating relatively poor performance in OCR subtasks. This underscores the requirement for MLLMs to perceive deeper content within an image for OCR tasks.
•

Through the correlation analysis of scores across different tasks, we find that using traditional automatic metrics for the captioning task makes it difficult to reasonably evaluate the model’s performance. Different tasks have varying requirements for factual perception and the application of additional commonsense knowledge in images.

2 Method

Most benchmarks contain low-quality, noisy data.

Figure 2 shows the statistics of different subtasks within our LIME benchmark. It is worth mentioning that the proportion of easy and wrong samples exceeds 30Out of the 10 subtasks, 6 have proportions exceeding 50%. Notably, for the POPE dataset, 95% of the data can be classified as noisy or erroneous. This indicates that existing benchmarks are filled with a large amount of low-quality data, which does not accurately reflect the true capabilities of MLLMs.

Inspired by MMStar (Chen et al., 2024a), we utilize open-source MLLMs and LLMs as the judges for filtering, specifically, we remove the existing annotation errors. The overall pipeline consists of three main stages: (1) Using open-source models as judges, (2) A semi-automated screening process, and (3) Eliminating answer leakage. Our approach aims to improve existing benchmarks by removing inaccurate and oversimplified data.

2.1 open-source models as judges

To avoid potential biases that may exist in individual MLLMs, we select ten different types of open-source models as judges. To categorize the difficulty of each sample, we analyze the performance of all judge models on each question and label the difficulty based on the number of models that answer correctly. We define $N$ as the number of models that correctly answer the sample. If $N\geq 6$ , the question is classified as the easy set. If $3\leq N\leq 5$ , it is classified as the middle set. Conversely, if $N\leq 2$ , it is classified as the hard set.

2.2 Semi-automated screening process

Easy samples do not effectively differentiate the capabilities of various models, as most models can answer them correctly. Therefore, we remove the easy samples to better assess model performance.

Furthermore, we find that some questions are not correctly answered by any model, which can be due to potential errors in the question design. To mitigate these potential errors and filter out totally incorrect questions, we implement a semi-automated screening process, which consists of two stages. In the first stage, all questions with zero passes are reviewed by GPT-4V to assess their correctness in terms of logic and meaning. In the second stage, questions deemed correct by GPT-4V are then manually screened. This strategy helps us eliminate meaningless or erroneous data from the dataset, thereby reducing its size and improving its quality.

2.3 Eliminating Answer Leakage

Although the previous two stages have filtered out potential errors and assessed the quality of the questions, we still need to address the potential issue of ANSWER LEAKAGE. Multimodal Answer Leakage can be summarized into two main categories: 1.Text Answerable Questions: The textual information contains all the necessary details to answer the question, making the corresponding visual information redundant. 2.Seen Questions: The model has encountered a specific question during training and has memorized the question along with its corresponding ground truth. As a result, some questions can be answered by pure text-based language models (LLMs) without relying on visual information. Therefore, we conduct a text-only check using pure text LLMs. Based on LLMs’ responses, we remove the samples that can be directly answered without using the image. After that, we proportionally sample 1200 samples from these categories based on their difficulty levels. For benchmarks with fewer than 1,200 entries, we adapt all samples.

3 LIME: A comprehensive MLLMs benchmark

Table 1: Data statics: Full Size: the size of the original dataset, Lite Size: the final size of the LIME. For the COCO-Caption dataset, we selected the 2017 subset, and for the ScienceQA dataset, we chose the ScienceQA-IMG subset.

Task Domain	Dataset	Split	Full Size	Lite Size
Captioning	TextCaps COCO-Caption	val val	3166 5000	1200 1200
T/F reasoning	POPE	val	9000	443
Normal VQA	OK-VQA TextVQA	val val	5046 5000	1200 1200
Infographic QA	infoVQA ChartQA	val val	2801 2500	1200 1200
Science QA	ScienceQA AI2D	val val	2097 3088	300 1000
OCR	OCRBench	val	1000	460

In this section, we propose LIME, a comprehensive benchmark for Multimodal Large Language Models (MLLMs). LIME streamlines existing mainstream benchmarks. Tab 1 shows the main datasets included in our benchmark, as well as the data scale after careful pruning. For each sub-dataset, we aim to keep the size around 1k samples.

3.1 Task definition

We have categorized the existing mainstream tasks into six domains: Captioning, T/F Reasoning, Normal VQA, Infographic Understanding QA, Science QA, and OCR. Below are the task definitions for each domain

Image understanding and captioning: The Captioning task focuses on the fundamental image-text understanding ability, requiring MLLMs to accurately describe and understand the content of images. This ability is commonly learned by most multimodal models during the pre-training stage. For example, the CLIP model aligns image and text features through contrastive learning, making Captioning a measure of the basic capabilities of MLLMs.

T/F reasoning: T/F Reasoning requires the model to judge the truthfulness of textual statements based on the image content. This not only demands basic image understanding from the MLLMs but also requires a certain level of reasoning ability.

Normal VQA: Normal VQA, or Visual Question Answering, comprehensively evaluates the model’s ability to answer questions based on visual input. MLLMs are required to select the most appropriate answer from specific options.

Infographic Understanding QA: This task differs from Normal VQA as it tests the model’s ability to retrieve details from images. MLLMs need to find the most relevant information in the image related to the question and then provide a reasoned answer.

Science QA: Science QA includes questions and answers related to natural science knowledge. This requires the model to have domain-specific knowledge in natural sciences, mainly assessing the MLLMs’ mastery of knowledge within a specific domain.

OCR: The OCR task requires the precise extraction of textual content from images.

3.2 data statistics

LIME is composed of 10 open-source multimodal evaluation benchmarks, with scales ranging from 1,000 to 9,000. After our three-stage data curation, the data scale of each benchmark is significantly reduced. Figure 1 shows the number of samples removed at each stage compared to the original dataset. The amount of data removed varies at each stage, with the most being removed in the first stage, reflecting a large number of low-difficulty or data-leakage samples in the existing 9 MLLMs. Comparing the data volumes before and after the second stage of semi-automated screening, we can see that many datasets, such as OK-VQA and TextVQA, have a high rate of low-quality data leading to MLLMs’ incorrect answers. Additionally, some datasets, such as ScienceQA and AI2D, have a significant amount of data removed after the third stage, indicating that many questions in these datasets may contain potential answer leakage. The statistics of the curated data are shown in Tab 1.

4 Experiment

4.1 Experiment Setting

To evaluate the quality of LIME, we conduct a series of experiments across various open-source and closed-source models. These experiments primarily encompass the following three settings:

Main experiment: To demonstrate the performance of LIME, we evaluate mainstream open-source and closed-source models using a standardized process to reflect their overall performance differences.

Text-only set: To prevent potential data leakage issues, we conduct validation experiments using text-only QA pairs. This verifies whether LLMs can correctly answer questions based on text-only information.

Text-only question with Image Description set: Image Description (ID) refers to simple descriptions of images that represent superficial information contained within them. For most MLLMs, questions containing only superficial information are easy to answer; however, questions requiring complex visual inference are significantly more challenging. To further validate whether LIME can reflect the capabilities of MLLMs, we input text-only QA pairs combined with ID into LLMs and test their ability.

4.2 Baselines

We select LLaVA-1.5 (Liu et al., 2023a; b), LLaVA-1.6 (Liu et al., 2024), Tinny-LLaVA (Zhou et al., 2024), MiniCPM (Hu et al., 2024), Idefics-2 ¹¹1https://huggingface.co/blog/idefics2, Deepseek-VL(Lu et al., 2024), CogVLM (Wang et al., 2023; Hong et al., 2023), XComposer-4KHD (Zhang et al., 2023), Mantis (Jiang et al., 2024), InternVL-1.5 and InternVL-2 (Chen et al., 2023; 2024b) as our MLLMs baseline, and LLaMA3, Yi, Yi-1.5 (AI et al., 2024), Qwen-1.5 (Bai et al., 2023a) and Qwen2 (Yang et al., 2024) as LLMs baseline. To ensure fairness in the evaluations, we use the unified evaluation framework provided by lmms-eval (Zhang et al., 2024b) to conduct evaluation experiments on LIME. For models not supported by lmms-eval, we refine the inference code provided by the model developers to make it compatible with the new models for the sake of aligning the results of different models.

Metrics

For most tasks included in LIME, we reference the metrics computation methods used in lmms-eval. Specifically, for tasks such as AI2D, ScienceQA, OCRBench, and POPE, we calculate the accuracy of the extracted responses. For tasks such as OK-VQA and TextVQA, we calculate the metric scores based on the overlap between the response and the candidate answers. For tasks like TextCaps and COCO-Caption2017, we use CIDEr as the score. The ANLS metric is used for the infoVQA task, and the Relaxed Overall metric is used for the ChartQA task.

We calculate the sub-scores for each task category by taking a weighted average of the subtask scores, and then compute the overall score by weighted averaging the scores of all tasks except for the caption tasks. The details of the metrics calculation are provided in Tab 7.

5 Results

5.1 Main Result

Table 2: Left half of the table:Comparing overall scores of LIME and Original. The arrow next to the LIME score indicates the change in ranking on LIME compared to the original dataset.

\uparrow

: upward shift,

\downarrow

: downward shift, and -: no change. Right half of the table: performance on six domains

Model	Size	LIME	Original	Reasoning	VQA	InfoQA	SciQA	OCR	Caption
GPT-4O	-	52.63	-	47.18	42.95	57.63	56.15	72.39	47.84
claude-3-5-sonnet	-	51.99	-	35.89	50.33	56.38	44.69	73.91	28.00
Gemini-1.5-Pro-Vision	-	49.46	-	54.63	37.71	55.33	50.15	73.26	41.38
GPT-4-Vision-Preview	-	42.23	-	42.44	33.86	48.00	42.39	55.22	29.14
InternVL-2 2023	40B	66.85 ( - )	80.31	51.69	48.72	81.12	75.92	75.87	56.02
Qwen2-VL 2023b	7B	65.28 ( $\uparrow$ 1)	79.14	53.05	51.37	80.83	62.08	77.61	89.67
InternVL-1.5 2024b	26B	64.12 ( $\downarrow$ 1)	79.49	51.69	52.68	78.96	63.32	60.65	90.93
InternVL-2 2023	26B	63.98 ( - )	78.82	54.63	45.64	79.12	70.54	71.09	66.54
InternVL-2 2023	8B	62.00 ( $\uparrow$ 1 )	77.84	49.21	45.15	76.00	68.54	70.65	34.00
LLaVA-OneVision 2024	7B	61.95 ( $\downarrow$ 1 )	78.71	52.37	51.27	74.50	66.77	47.83	106.46
XComposer2-4KHD 2023	7B	57.52 ( $\uparrow$ 4)	71.93	46.28	44.22	73.29	58.38	53.04	87.57
InternVL-2 2023	4B	57.22 ( $\downarrow$ 1)	73.97	47.18	39.89	71.21	63.31	67.17	28.83
CogVLM-2 2024	19B	54.44 ( $\uparrow$ 6)	69.93	51.02	37.19	69.92	54.00	68.26	28.84
Qwen2-VL 2023b	2B	54.00 ( $\uparrow$ 5)	70.86	50.79	43.78	66.25	46.38	68.04	88.39
InternVL-2 2023	2B	53.64 ( $\downarrow$ 2)	73.00	50.79	40.71	62.88	56.54	67.39	47.27
CogVLM-1 2023	17B	51.03 ( $\uparrow$ 1)	71.34	55.10	51.45	59.46	36.54	41.96	33.92
Cambrian 2024	34B	50.17 ( $\downarrow$ 5)	73.26	49.44	39.66	57.50	60.23	39.13	4.62
Cambrian 2024	13B	48.57 ( $\downarrow$ 4)	72.39	50.79	41.53	56.04	49.23	42.39	6.96
InternVL-2 2023	1B	48.21 ( $\uparrow$ 3)	68.46	52.82	36.46	56.04	47.92	65.00	14.19
Cambrian 2024	8B	47.95 ( $\downarrow$ 4)	71.84	49.89	42.12	53.55	49.46	43.04	6.13
LLaVA-1.6 2024	34B	44.06 ( $\uparrow$ 3)	67.22	47.00	30.80	53.21	53.08	37.17	66.25
MiniCPM-LLaMA3-2.5 2024	8B	42.61 ( $\downarrow$ 3)	71.22	43.10	43.55	58.55	6.60	55.87	35.89
LLaVA-OneVision 2024	0.5B	41.40 ( $\uparrow$ 4)	65.65	48.98	35.87	48.04	36.23	42.83	93.34
LLaVA-LLaMA3 2023	8B	40.90 ( $\downarrow$ 3)	69.74	44.24	37.36	43.33	45.56	30.22	74.03
Mantis-Idefics-2 2024	8B	39.25 ( - )	66.91	44.24	36.79	39.75	43.69	32.17	82.44
Deepseek-VL 2024	7B	38.10 ( $\uparrow$ 2)	65.62	48.50	34.90	38.50	44.23	25.43	68.72
LLaVA-1.6-vicuna 2024	13B	37.08 ( $\downarrow$ 4)	67.29	43.10	30.00	41.63	41.54	31.96	62.23
Idefics-2 2024	8B	36.39 ( $\downarrow$ 2)	66.73	42.00	46.05	18.50	47.46	42.61	77.87
LLaVA-1.6-vicuna 2024	7B	30.15 ( - )	64.80	41.10	25.75	32.88	31.77	23.70	62.20
Mantis-SigLIP 2024	8B	29.13 ( $\uparrow$ 1)	58.96	45.60	29.39	25.79	35.77	10.65	74.69
MiniCPM 2024	1.0	26.15 ( $\uparrow$ 2)	56.18	44.00	21.60	24.58	35.46	14.57	72.80
LLaVA-1.5 2023a	13B	20.38 ( $\downarrow$ 2)	59.58	36.60	25.80	8.96	31.08	5.87	74.81
LLaVA-1.5 2023a	7B	17.20 ( $\downarrow$ 1)	57.27	32.51	19.97	7.17	29.81	4.78	72.47
InstructBLIP-vicuna 2023	7B	15.55 ( - )	47.87	45.10	16.75	6.04	24.77	4.35	77.61
Tiny-LLaVA-1 2024	1.4B	13.95 ( - )	34.30	37.00	9.80	8.33	27.85	3.48	61.05

As shown in Tab 2, we evaluate both open-source and closed-source MLLMs using our LIME benchmark. Overall, for closed-source models, GPT-4O achieves the best performance with a score of 52%, while for open-source models, models with larger parameter sizes and newer versions tend to have higher overall scores. InternVL-1.5, InternVL-2-Large (26B, 40B), and LLaVA-OneVision-7B achieve the best overall performance, with their overall scores all surpassing 60%. The performance of InternVL-2-Small (1B-8B), the CogVLM series, and the Cambrian series follows, with their overall scores ranging from 45% to 60%.

Comparing the overall scores of LIME and Origin benchmarks, we observe that certain model series, such as Cambrian and LLaVA-1.5, experience a decline in overall scores. Conversely, the CogVLM and LLaVA-OneVision series show an improvement, with CogVLM2 and XComposer-4KHD experiencing significant increases of 4% and 6%, respectively.

Tab 6 provides more detailed experimental results. Regarding caption subtasks, most models demonstrate good performance. These tasks involve generating or assessing descriptions of the content in images, which indicates that current MLLMs possess strong image content recognition capabilities. As for the VQA task, current MLLMs perform relatively well on TextVQA, ChatQA, and ScienceQA, where the questions directly ask about facts in the picture. However, their performance is relatively lower on OK-VQA, infoVQA, and AI2D, which require additional commonsense knowledge or complex reasoning to answer the questions. This demonstrates that current MLLMs exhibit significant image content recognition capabilities but are limited in their ability to perform complex reasoning using additional knowledge. We believe this limitation may be due to constraints in the language model component of MLLMs.

5.2 Correlation Analysis

Figure 4 illustrates the correlation between the various sub-tasks in LIME and WildVision Bench. Most tasks in LIME exhibit a strong positive correlation with WildVision Bench. Six subtasks have correlation scores exceeding 80%. Additionally, the overall score of LIME correlates at 91% with WV-Elo, which is higher than any individual sub-task and the original bench’s correlations, demonstrating that the overall score of LIME provides a more comprehensive reflection of MLLMs’ capabilities.

Automated evaluation metrics (e.g., CIDEr) cannot effectively assess the performance of LVMs in captioning tasks.

As an early foundational problem, the captioning task is extensively studied, and MLLMs demonstrate exceptional ability in this task. For instance, earlier models like InstructBlip perform exceptionally well on captioning tasks, and there is a broad presence of training data for image captioning in MLLMs’ training processes. However, the captioning task shows a negative correlation with all other sub-tasks. This indicates that previous metrics (e.g., BLEU, CIDEr) only focus on the overlap between the model-generated responses and the ground truth, but do not consider that MLLMs might generate content that is semantically close to the ground truth (i.e., the model-generated response may be semantically similar to the ground truth but expressed differently, or the model may generate more detailed content about the image). Consequently, we exclude it from the overall score calculation.

There is a certain degree of correlation between the sub-tasks in LIME.

On the one hand, the relevance of TextVQA, InfographicVQA, and OCRBench is relatively high. As shown in Fig. 4, the correlation of these tasks all surpasses 85%, and these two VQA tasks require MLLMs to understand fine-grained content in images to answer questions. This demonstrates that OCR tasks also rely on the ability of MLLMs to perceive fine-grained objective facts in images. On the other hand, POPE, ChartQA, and InfographicVQA all require reasoning abilities using extra commonsense knowledge. The correlation scores of these tasks are all above 70%, and POPE requires the model to use extra knowledge to solve the hallucination of MLLMs. We assume that ChartQA and infoVQA may also necessitate the use of additional common knowledge by the models to solve problems.

5.3 Effectiveness of LIME

Table 3: Statistics on the score distributions across different model series.

Model series	Dataset	GiNi	stdev
InternVL-2	LIME Original	0.061 0.030	6.972 4.421
Cambrian	LIME Original	0.006 0.002	1.227 0.715
LLaVA-1.6	LIME Original	0.042 0.004	6.730 1.418

Table 4: Statistics on the score distributions across different model sizes.

Model size	Dataset	GiNi	stdev
7B	LIME Original	0.271 0.086	19.041 10.836
8B	LIME Original	0.128 0.046	10.685 6.270
13B	LIME Original	0.174 0.043	13.536 6.446

LIME provides a more challenging evaluation for MLLMs.

As shown in Tab 2, the MLLMs’ performances on LIME are less than those on the Original Bench for most tasks. Compared to the Origin benchmark, different MLLMs show a larger score range on our LIME, indicating that our LIME can better reflect the performance differences between models with a smaller amount of data.

Furthermore, we compare the score variations across different model series and model sizes. Figure 5 illustrates a clear positive correlation between model performance and model size within the same model series. Notably, LIME exhibits a more dispersed score distribution, effectively highlighting the differences in model performance. In Tab 5.3 and 5.3, the Gini coefficient and standard deviation are used to measure the differences in overall score distribution across the same model series and model sizes. The larger the Gini coefficient and standard deviation, the greater the disparity in data distribution. It can be observed that, whether within the same model series or the same model size, LIME achieves higher Gini and standard deviation values compared to the original bench. This indicates that LIME can better differentiate the performance differences between various models.

LIME eliminates potential data leakage.

For multimodal question answering tasks, visual information input is essential, and LLMs are unable to provide correct answers due to they cannot perceive the content within the image. However, as shown in Figure 6 (right), there are severe data leakage issues in the original Bench for the AI2D and ScienceQA tasks. The average score for AI2D is close to 55%, and for ScienceQA, it exceeds 60%, which shows that data from AI2D and ScienceQA in Original are highly likely to have been exposed to the training data of LLMs. In contrast, the LIME has eliminated this potential threat, achieving scores below 25% in AI2D and close to 40% in ScienceQA.

5.4 The Impact of Detail Image Perception

Table 5: Text-only with VD results: With the condition of providing only text QA information and VD information, the performance comparison between vlms-bench and origin bench.

Setting	Models	AI2D	ChQA	COCO	IVQA	OCRBen	OK VQA	POPE	SciQA	TCaps	TVQA
	LLaMA3-8B	23.5	6.4	2.8	12.9	9.2	17.4	32.1	16.4	5.3	17.9
	LLaMA3-70B	24.0	7.7	3.3	12.3	9.3	21.8	38.1	39.4	6.0	22.0
	Qwen1.5-32B	28.8	6.7	6.5	9.4	8.7	4.7	39.3	46.6	9.2	13.7
	Qwen1.5-72B	25.4	2.5	3.2	10.1	8.9	7.8	42.7	44.2	6.0	15.2
	Qwen2-7B	27.6	6.7	6.9	11.2	8.9	15.0	44.2	45.5	12.5	19.0
	Qwen2-72B	26.3	6.9	2.7	10.8	9.6	10.6	36.3	45.2	5.2	16.8
LIME	Yi-1.5-9B-Chat	22.1	2.3	0.3	3.1	0.0	7.8	40.0	0.0	0.2	5.8
	LLaMA3-8B	49.0	11.4	3.1	18.6	19.3	32.5	46.9	59.5	6.5	26.4
	LLaMA3-70B	52.0	12.4	3.6	17.6	19.5	36.4	5.2	64.6	7.8	36.2
	Qwen1.5-32B	60.5	10.7	8.1	15.0	20.2	15.8	47.4	68.8	10.6	22.1
	Qwen1.5-72B	58.8	6.4	3.8	16.6	20.2	21.1	35.1	68.4	7.1	27.4
	Qwen2-7B	59.2	12.7	7.4	19.7	19.6	30.5	44.6	69.0	15.4	33.3
	Qwen2-72B	60.4	10.5	3.5	15.7	20.5	24.2	34.3	67.9	6.8	28.7
Original	Yi-1.5-9B-Chat	24.7	3.1	0.0	5.9	0.5	7.8	32.7	31.7	0.2	5.8

In our data cleaning process, we remove many questions that most models can answer, as well as a small number of questions that are difficult for both humans and GPT-4V to answer, in order to make the benchmark better highlight the differences in model capabilities. As shown in Tab 5, to investigate whether the remaining samples need to be answered by using textual and image information, we conduct experiments using LLMs to generate answers on both the Original Benchmark and MLLMs Benchmark under QID (question + image description) setting.

LIME requires MLLMs to perceive deeper levels of image information.

Especially in tasks such as AI2D, OCRBench, and TCaps, the scores of LLMs on LIME are significantly lower than on the Original Benchmark when provided with only the questions and simple image descriptions. This indicates that, after removing some of the simpler questions, LIME is better at testing the models’ ability to perceive image details.

5.5 Existing benchmark still differs from real-world query.

To further investigate the gap between LIME and real-world users’ queries, we construct a similarity search system that compares them. MixEval (Ni et al., 2024) uses SentenceTransformers(Reimers, 2019) as the retrieval model, while Uniir (Wei et al., 2023) employs multimodal models like CLIP and BLIP. We use WildVision-Chat as the query data source, which contains 45.2k high-quality user questions, and employ SentenceTransformers to retrieve the top 10 most similar samples from LIME. To fully incorporate image information, we combine the question and image description as the query input. Additionally, we utilize Qwen2-72B to ensure a high level of relevance in the final results. As a result, we obtain a LIME-fit dataset containing 1.1k relevant samples. Existing benchmark can’t cover all types of real-world query.

In Figure 9, we compare the category distribution differences between LIME-fit and the WildVision Bench. It is evident that LIME-fit concentrates in a few specific categories (e.g., data analysis, general description, object recognition). However, it does not include instructions for solving real-world problems, such as Face Recognition, Problem Solving, and Scene Description. Furthermore, Figure 10 shows the frequency distribution of each subcategory in LIME-fit, which follows a long-tail distribution. This indicates that the current benchmark does not fully cover the instruction requirements of real-world scenarios.

6 Related work

In recent years, there has been increasing attention on establishing evaluation benchmarks to assess the performance of MLLMs in different scenarios to guide the development of MLLMs. Early multimodal evaluation benchmarks primarily focused on single tasks, such as Visual Question Answering (VQA)(Antol et al., 2015; Goyal et al., 2017; Kafle & Kanan, 2017; Singh et al., 2019; Marino et al., 2019), Image Captioning(Agrawal et al., 2019), and Information Retrieval (Wei et al., 2023). As MLLMs develop, simple benchmarks are no longer sufficient to evaluate the versatile capabilities of these models comprehensively, since most MLLMs demonstrate exceptional ability on those benchmarks. Consequently, numerous more difficult and diverse benchmarks have emerged in recent years to assess the capabilities of MLLMs comprehensively. For instance, MMMU (Yue et al., 2024) and CMMMU (Zhang et al., 2024a) are comprehensive benchmark tests for university-level multidisciplinary multimodal understanding and reasoning. MMBench (Liu et al., 2023c) has developed a comprehensive evaluation pipeline that offers fine-grained capability assessment and robust evaluation metrics. MMRA (Wu et al., 2024b) systematically establishes an association relation system among images to assess the multi-image relation mining ability of MLLMs.

However, those benchmarks cannot distinguish the performance gaps among different models excellently, as they still contain some too simple or difficult samples that most models yield the same results on. Furthermore, training datasets across different models may contain the samples of those benchmarks, which results in data leakage issues (Fu et al., 2023). Mmstar (Chen et al., 2024a) and MMLU_Redux (Gema et al., 2024) have identified several issues within current benchmarks. Mmstar (Chen et al., 2024a) proposes an automated pipeline to filter benchmark data, aiming to detect potential data leakage, while MMLU_Redux (Gema et al., 2024) focuses on correcting annotation errors. However, there is still a pressing need for a comprehensive pipeline that fully addresses the challenges posed by multimodal datasets. In response to this, we introduce LIME: LESS IS MORE FOR MLLM EVALUATION. We have carefully selected six task types from existing mainstream benchmarks and scaled them down according to clear guidelines. This streamlined version retains the core elements of mainstream MLLM benchmarks, providing a more efficient and focused evaluation.

7 Conclusion

As MLLMs continue to advance, a notable absence of convenient and high-quality multimodal benchmarks has emerged. In response to this, we propose a pipeline aimed at semi-automatically refining existing benchmarks to enhance their quality, culminating in the development of LIME, which comprises 9,403 evaluation samples across 6 types of tasks and 10 different benchmark datasets. By refining the original benchmarks to filter question difficulty and eliminate potentially problematic items, LIME offers a more rigorous evaluation for MLLMs, necessitating a deeper understanding of image information. The outcomes of our evaluation experiments demonstrate the heightened challenge posed by LIME for MLLMs. We anticipate that our approach will contribute to the advancement of MLLM evaluation systems, and we are committed to continually enriching LIME with an expanded array of datasets through regular updates and expansions. Our ultimate goal is to provide the community with a simpler, more efficient, and more accurate evaluation method and suite for MLLMs.

References

Agrawal et al. (2019) Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8948–8957, 2019.
AI et al. (2024) 01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai, 2024.
Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
Chen et al. (2024a) Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024a.
Chen et al. (2023) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
Chen et al. (2024b) Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024b.
Contributors (2023) XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305.06500.
Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
Gema et al. (2024) Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024.
Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.
Hong et al. (2023) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023.
Hong et al. (2024) Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.
Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
Jiang et al. (2024) Dongfu Jiang, Xuan He, Huaye Zeng, Con Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.
Kafle & Kanan (2017) Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In Proceedings of the IEEE international conference on computer vision, pp. 1965–1973, 2017.
Laurençon et al. (2024) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
Li et al. (2024) Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023b.
Liu et al. (2024) Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
Liu et al. (2023c) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
Lu et al. (2024) Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
Marino et al. (2019) Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 3195–3204, 2019.
Ni et al. (2024) Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, and Yang You. Mixeval: Deriving wisdom of the crowd from llm benchmark mixtures. arXiv preprint arXiv:2406.06565, 2024.
Reimers (2019) N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
Singh et al. (2019) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.
Tong et al. (2024) Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024.
Wang et al. (2023) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023.
Wei et al. (2023) Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. arXiv preprint arXiv:2311.17136, 2023.
Wu et al. (2024a) Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang, Bohao Yang, Wenhu Chen, Wenhao Huang, Noura Al Moubayed, Jie Fu, and Chenghua Lin. SciMMIR: Benchmarking scientific multi-modal information retrieval. In Findings of the Association for Computational Linguistics ACL 2024, pp. 12560–12574, 2024a.
Wu et al. (2024b) Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, et al. Mmra: A benchmark for multi-granularity multi-image relational association. arXiv preprint arXiv:2407.17379, 2024b.
Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
Yao et al. (2024) Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint 2408.01800, 2024.
Yue et al. (2024) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567, 2024.
Zhang et al. (2024a) Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, et al. Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2401.11944, 2024a.
Zhang et al. (2024b) Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024b.
Zhang et al. (2023) Pan Zhang, Xiaoyi Dong Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
Zhou et al. (2024) Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.

Appendix A Appendix

A.1 overall Data statics

Figure 7shows the overall data distribution in LIME, and figure 8 shows an example for each category title

A.2 more experiment result

Table 6: Comparing overall scores of LIME and Original. Top: results on LIME, Bottom: results on the original dataset. The arrow next to the model name indicates the change in ranking on LIME compared to the original dataset.

\uparrow

: upward shift,

\downarrow

: downward shift, and -: no change.

			T/F	Common VQA		InfoVQA		ScienceQA		OCR	Captioning
Model	Size	Overall	POPE $\uparrow$	TVQA $\uparrow$	OK VQA $\uparrow$	ChQA $\uparrow$	IVQA $\uparrow$	AI2D $\uparrow$	SciQA $\uparrow$	OCRBen $\uparrow$	COCO $\uparrow$	TCaps $\uparrow$
InternVL-2 2023 ( - )	40B	66.85	51.69	77.98	19.45	88.33	73.92	69.20	98.33	75.87	63.10	48.94
Qwen2-VL 2023b ( $\uparrow$ 1)	7B	65.28	53.05	74.56	28.18	83.17	78.50	58.20	75.00	77.61	68.74	110.60
InternVL-1.5 2024b ( $\downarrow$ 1)	26B	64.12	51.69	69.88	35.47	87.00	70.92	54.81	91.67	60.65	69.24	112.63
InternVL-2 2023 ( - )	26B	63.98	54.63	75.20	16.08	87.67	70.58	62.80	96.33	71.09	76.18	56.91
InternVL-2 2023 ( $\uparrow$ 1 )	8B	62.00	49.21	66.10	24.20	83.75	68.25	60.40	95.67	70.65	42.55	25.44
LLaVA-OneVision 2024 ( $\downarrow$ 1 )	7B	61.95	52.37	65.22	37.32	80.83	68.17	59.20	92.00	47.83	104.74	108.18
XComposer2-4KHD 2023 ( $\uparrow$ 4)	7B	57.52	46.28	60.30	28.13	80.42	66.17	54.90	70.00	53.04	97.07	78.07
InternVL-2 2023( $\downarrow$ 1)	4B	57.22	47.18	62.29	17.48	81.92	60.50	54.00	94.33	67.17	35.99	21.67
CogVLM-2 2024 ( $\uparrow$ 6)	19B	54.44	51.02	69.46	4.92	80.33	59.50	45.00	84.00	68.26	23.67	34.01
Qwen2-VL 2023b ( $\uparrow$ 5)	2B	54.00	50.79	70.70	16.87	67.50	65.00	42.90	58.00	68.04	75.06	101.72
InternVL-2 2023 ( $\downarrow$ 2)	2B	53.64	50.79	59.56	21.87	71.75	54.00	45.80	92.33	67.39	51.95	42.59
CogVLM-1 2023 ( $\uparrow$ 1)	17B	51.03	55.10	71.20	31.70	61.67	57.25	31.40	53.67	41.96	29.28	38.56
Cambrian 2024 ( $\downarrow$ 5)	34B	50.17	49.44	57.28	22.03	71.83	43.17	54.80	78.33	39.13	4.27	4.97
Cambrian 2024 ( $\downarrow$ 4)	13B	48.57	50.79	58.93	24.13	69.25	42.83	45.20	62.67	42.39	7.40	6.52
InternVL-2 2023 ( $\uparrow$ 3)	1B	48.21	52.82	57.55	15.37	65.83	46.25	37.00	84.33	65.00	15.19	13.19
Cambrian 2024 ( $\downarrow$ 4)	8B	47.95	49.89	59.00	25.23	69.42	37.67	43.60	69.00	43.04	5.85	6.41
LLaVA-1.6 2024 ( $\uparrow$ 3)	34B	44.06	47.00	51.20	10.40	64.33	42.08	49.60	64.67	37.17	84.25	48.25
MiniCPM-LLaMA3-2.5 2024 ( $\downarrow$ 3)	8B	42.61	43.10	61.80	25.30	69.00	48.10	6.60	6.60	55.87	31.91	39.86
LLaVA-OneVision 2024 ( $\uparrow$ 4)	0.5B	41.40	48.98	48.61	23.13	55.42	40.67	31.20	53.00	42.83	96.40	90.28
LLaVA-LLaMA3 2023 ( $\downarrow$ 3)	8B	40.90	44.24	40.01	34.72	64.42	22.25	40.72	61.67	30.22	99.35	48.71
Mantis-Idefics-2 2024 ( - )	8B	39.25	44.24	44.51	29.07	59.00	20.50	35.80	70.00	32.17	61.82	103.07
Deepseek-VL 2024 ( $\uparrow$ 2)	7B	38.10	48.50	44.80	25.00	54.67	22.33	37.00	68.33	25.43	54.22	83.21
LLaVA-1.6-vicuna 2024 ( $\downarrow$ 4)	13B	37.08	43.10	43.90	16.10	54.50	28.75	38.10	53.00	31.96	76.62	47.83
Idefics-2 2024 ( $\downarrow$ 2)	8B	36.39	42.00	56.50	35.60	13.08	23.92	38.10	78.67	42.61	61.23	94.51
LLaVA-1.6-vicuna 2024 ( - )	7B	30.15	41.10	39.00	12.50	43.08	22.67	27.10	47.33	23.70	76.05	48.35
Mantis-SigLIP 2024 ( $\uparrow$ 1)	8B	29.13	45.60	26.34	32.43	35.33	16.25	27.70	62.67	10.65	68.16	81.21
MiniCPM 2024 ( $\uparrow$ 2)	1.0	26.15	44.00	37.00	6.20	35.75	13.42	27.90	60.67	14.57	68.96	76.65
LLaVA-1.5 2023a ( $\downarrow$ 2)	13B	20.38	36.60	19.50	32.10	5.50	12.42	25.90	48.33	5.87	80.89	68.73
LLaVA-1.5 2023a ( $\downarrow$ 1)	7B	17.20	32.51	16.50	23.43	5.25	9.08	24.05	49.00	4.78	79.20	65.73
InstructBLIP-vicuna 2023 ( - )	7B	15.55	45.10	11.40	22.10	3.00	9.08	21.90	34.33	4.35	102.08	53.14
Tiny-LLaVA-1 2024 ( - )	1.4B	13.95	37.00	18.70	0.90	4.50	12.17	22.80	44.67	3.48	63.19	58.91
InternVL-2	40B	80.31	89.23	82.59	50.98	85.52	76.08	85.88	98.56	79.90	99.15	62.03
InternVL-1.5	26B	79.49	88.90	79.00	60.70	83.70	72.50	78.90	94.50	71.40	95.80	148.10
Qwen2-VL	7B	79.14	88.17	80.92	55.68	83.32	78.86	80.73	85.57	81.20	92.13	144.36
InternVL-2	26B	78.82	88.64	82.06	48.50	84.44	72.72	83.16	97.47	77.60	110.30	80.10
LLaVA-OneVision	7B	78.71	89.17	76.02	60.98	80.12	70.69	81.38	95.88	62.10	140.45	136.97
InternVL-2	8B	77.84	87.90	77.00	52.02	82.48	70.65	82.25	97.03	76.50	89.77	36.70
InternVL-2	4B	73.97	87.71	74.51	38.43	81.04	65.19	78.08	96.03	75.00	54.08	30.17
Cambrian	34B	73.26	88.46	72.11	52.07	74.60	51.48	80.41	85.52	59.00	8.18	6.08
InternVL-2	2B	73.00	88.90	72.39	43.74	74.72	57.69	72.70	94.25	75.50	79.52	59.81
Cambrian	13B	72.39	88.53	73.07	53.28	72.60	50.73	73.93	79.08	61.40	14.33	9.44
XComposer-4KHD	7B	71.93	87.00	74.30	51.90	80.60	72.80	34.40	96.00	66.90	134.00	111.40
Cambrian	8B	71.84	88.24	72.47	52.17	73.44	48.05	72.99	80.32	61.60	9.13	7.97
CogVLM-1	17B	71.34	88.90	79.70	46.90	67.00	63.30	61.90	70.50	59.10	28.40	44.70
MiniCPM-LLaMA3-2.5	8B	71.22	88.00	75.00	52.30	72.90	56.90	71.70	53.00	69.40	35.50	52.90
Qwen2-VL	2B	70.86	87.78	78.70	40.59	73.12	67.15	70.21	77.89	75.30	103.52	131.92
CogVLM-2	19B	69.93	87.56	77.59	18.51	79.84	62.62	72.41	90.93	76.60	24.10	42.23
LLaVA-LLaMA3	8B	69.74	87.80	65.40	60.20	69.30	37.60	71.60	73.30	55.00	135.00	69.60
InternVL-2	1B	68.46	87.94	69.67	33.84	71.40	52.02	62.56	89.59	74.20	49.34	18.03
LLaVA-1.6-vicuna	13B	67.29	87.50	67.00	46.30	62.20	41.50	70.40	73.50	55.00	101.90	67.30
LLaVA-1.6	34B	67.22	85.60	68.90	31.00	67.40	51.90	76.10	82.70	58.60	114.40	69.10
Mantis-Idefics-2	8B	66.91	86.90	63.51	52.50	63.56	31.17	66.81	81.80	54.20	79.42	134.08
Idefics-2	8B	66.73	86.80	71.30	53.90	26.40	37.00	69.20	87.20	61.60	71.90	119.10
LLaVA-OneVision	0.5B	65.65	88.33	65.85	44.17	61.36	46.23	57.09	67.03	57.60	131.90	120.81
Deepseek-VL	7B	65.62	87.10	63.20	48.70	60.60	34.30	63.40	81.70	43.30	67.60	110.10
LLaVA-1.6-vicuna	7B	64.80	87.60	64.90	44.20	55.00	37.00	65.30	70.20	52.40	100.00	72.00
LLaVA-1.5	13B	59.58	87.10	48.70	58.30	18.10	29.50	59.40	72.80	33.60	115.40	104.00
Mantis-SigLIP	8B	58.96	81.47	49.59	52.90	42.56	26.56	57.84	75.36	34.50	91.37	111.43
LLaVA-1.5	7B	57.27	87.00	46.10	53.40	18.20	25.80	55.20	69.50	31.50	109.00	98.00
MiniCPM	1.0	56.18	85.10	55.30	47.30	15.40	20.10	56.90	43.00	60.00	25.90	41.60
InstructBLIP-vicuna	7B	47.87	85.00	33.20	45.20	12.50	22.90	34.00	36.40	25.90	141.40	74.00
Tiny-LLaVA-1	1.4B	34.30	56.30	38.50	3.80	11.10	22.20	32.30	58.20	17.20	80.90	83.10

A.3 Pipeline details

A.3.1 prompt template details

Semi-Automated Screening Process Prompt

We selected GPT-4V as the basis for automatic judgment and interacted with the GPT-4V API using specific prompt templates for different subtasks.

Exact Vision Description Prompt

For the QVD experiment, we use LLaVA-NEXT-110B to extract information from the images, with the following prompt:

A.3.2 metrics

Subtask metrics: As shown in the Tab 7, different metrics are used for different subtasks. It is important to note that, except for the CIDEr metric, all other metrics have a range between 0 and 1. The final score for each subtask is calculated by taking the average of these metrics.

Table 7: Metrics for different subtask

Metric	Subtask	Formula
Accuracy	AI2D, ScienceQA-IMG, OCRBench, POPE	$\text{Accuracy}=\begin{cases}1,&\text{if the prediction is correct}\\ 0,&\text{if the prediction is incorrect}\end{cases}$
CIDEr	TextCaps,COCO-Caption	$\text{CIDEr}=\frac{1}{m}\sum_{i=1}^{m}\sum_{n=1}^{N}w_{n}\cdot\frac{g_{i}^{(n)% }\cdot r_{i}^{(n)}}{\\|g_{i}^{(n)}\\|\\|r_{i}^{(n)}\\|}$
Match score	OK-VQA,TextVQA	$\text{SCORE}=\min\left(1,\frac{\text{match\_nums}}{3}\right)$
ANLS	InfoVQA	$\text{ANLS}(X,Y)=1-\frac{\text{Lev}(X,Y)}{\max(\|X\|,\|Y\|)}$
Relaxed Overall	ChartQA	$\text{SCORE}=\frac{\left\|\text{prediction}-\text{SCORE}\right\|}{\left\|\text{% target}\right\|}$

Overall metric: For the overall metric, we explored two mainstream calculation methods: arithmetic mean 1 and weighted mean 2.

\text{Arithmetic Mean}=\frac{1}{n}\sum_{i=1}^{n}x_{i}

(1)

\text{Weighted Mean}=\frac{\sum_{i=1}^{n}w_{i}x_{i}}{\sum_{i=1}^{n}w_{i}}

(2)

The arithmetic mean directly calculates the average of each subtask’s scores, while the weighted mean takes into account the number of samples in each subtask. We compare the results of these two calculation methods, as shown in the Tab 8. weighted average method achieves a higher correlation with WV-ELO. This suggests that the weighted average method is slightly superior to the arithmetic mean, as it considers the impact of the number of data points on the overall score, thereby avoiding potential errors caused by uneven data distribution. Therefore, in our work, we ultimately chose the weighted average as the method for calculating the overall score.

Table 8: Comparison of different overall metrics method

model	overall_weighted	overall_sum	overall_cider	WV_bench
LLaVA-1.6-vicuna-7B	30.15	30.46	36.07	992
LLaVA-1.6-vicuna-13B	37.08	36.52	41.04	956
LLaVA-1.6-34B	44.06	43.30	47.12	1059
CogVLM	51.03	47.66	44.03	1016
Deepseek-VL	38.1	39.04	43.31	979
Idefics2	36.39	38.43	43.83	965
MiniCPM-v-1.0	26.15	28.95	35.79	910
Tinny-LLaVA-1-hf	13.95	17.79	24.15	879
LLaVA-1.5-13B	20.38	22.88	32.3	891
InstructBLIP-vicuna-7B	15.55	18.61	29.56	862
correlation score	0.91	0.90	0.87	1

A.3.3 difficulty classification details

For subtasks using the accuracy (acc) metric, where the scores are binary, with only 1 or 0, other tasks may have various possible score distributions (e.g., COCO-Caption, OK-VQA). Therefore, we determine the threshold score based on the overall distribution of subtask scores, and choose the cutoff value that offers the greatest distinction, as shown in Tab 9, for the metrics ANLS, Relaxed Overall and Accuracy (Acc), the threshold is set to 1.0, for BLEU-4 (for the captioning task, we use the BLEU-4 metric to represent the score for each question), the threshold is set to 0.2, while for Match Score, it is set to 0.6. When the score is greater than the threshold, it is marked as correct; otherwise, it is marked as incorrect.

Metrics	bleu4	Match score	ANLS	Relaxed Overall	Acc
Threshold	0.2	0.6	1.0	1.0	1.0

Table 9: Thresholds for Different Metrics

A.3.4 Retrieve from real world queryd

Appendix B case study

The original dataset contains noise data. In the following figure, we categorize the problematic data into three types and present specific examples from different datasets.

Text Answerable Questions:

Some questions can be answered without the need for visual information, mainly focusing on the AI2D and ScienceQA datasets. As shown in figs. 11 and 12, AI2D and ScienceQA emphasize knowledge in the field of science while overlooking the importance of visual information. Given the background of domain knowledge, some LLMs are able to provide answers even without requiring visual input.

Annotation Error Questions:

Most benchmarks are manually curated, which inevitably leads to annotation errors. Problematic questions exist in almost all benchmarks. It can be found in figs. 13, 14, 20, 21, 22, 23, 24 and 25.

Repeated Question:

Some benchmarks also contain a significant amount of duplicate data, where the question content and image content are completely identical. This issue is mainly found in the POPE dataset, as shown in the figs. 15, 16, 17, 18 and 19.

\listofcasestudyfigures

	AI2D		ScienceQA
Model	LIME	Original	LIME	Original
LLaMA3-8B	18.10	46.76	33.33	59.35
LLaMA3-70B	25.70	62.05	56.00	69.91
Qwen1.5-32B	24.10	61.14	43.67	67.97
Qwen1.5-72B	19.80	57.45	35.00	61.13
Qwen2-7B	21.00	57.09	43.00	67.38
Qwen2-72B	20.60	69.95	38.67	63.36
Yi-1.5-9B-Chat	20.10	23.22	17.33	23.60
Yi-1.5-34B-Chat	23.60	54.15	42.00	65.20
Yi-1.5-34B-Chat	25.20	60.69	46.00	70.55