Table 1: MEGA-Bench full results. The number in parentheses is the number of tasks for each keyword.
The Core set contains $N_{\text{core}} = 440$ tasks evaluated by rule-based metrics, and the Open-ended set contains $N_{\text{open}} = 65$ tasks evaluated by a VLM judge (we use GPT-4o-0806).
Unlike in our paper, we report only the Core results obtained with CoT prompting here, for clarity and compatibility with the released data.
$\text{Overall} \ = \ \frac{\text{Core} \ \cdot \ N_{\text{core}} \ + \ \text{Open-ended} \ \cdot \ N_{\text{open}}}{N_{\text{core}} \ + \ N_{\text{open}}}$
* indicates self-reported results from the model authors.
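
As a quick sanity check, the Overall formula above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `overall_score` and the constants are illustrative and are not taken from the MEGA-Bench codebase. The inputs are the Core and Open-ended scores of Gemini-exp-1206 from the table.

```python
# Task counts stated in the caption above.
N_CORE = 440  # Core tasks, scored by rule-based metrics
N_OPEN = 65   # Open-ended tasks, scored by a VLM judge


def overall_score(core: float, open_ended: float,
                  n_core: int = N_CORE, n_open: int = N_OPEN) -> float:
    """Task-count-weighted average of the Core and Open-ended scores.

    Hypothetical helper for illustration; not part of the official pipeline.
    """
    return (core * n_core + open_ended * n_open) / (n_core + n_open)


# Gemini-exp-1206: Core 57.12, Open-ended 64.56
print(round(overall_score(57.12, 64.56), 2))  # 58.08
```

Because the Core set has far more tasks than the Open-ended set, the Overall score stays close to the Core score.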

Rank	Models	Overall (505)	Core (440)	Open-ended (65)	Perception (145)	Knowledge (97)	Planning (78)	Info Extraction (72)	Math (33)	Coding (31)	Science (29)	Metrics (20)
1	Gemini-exp-1206	58.08	57.12	64.56	58.82	64.68	40.44	66.94	53.27	57.89	58.84	64.63
2	Gemini-2.0-Flash-exp	55.26	54.07	63.3	57.19	62.53	36.04	63.97	49.66	49.48	56.79	65.66
3	Claude-3.5-Sonnet (1022)	54.27	52.59	65.63	55.62	56.58	39.9	65.9	47.63	51.67	55.12	61.24
4	GPT-4o (0513)	54.21	52.65	64.78	55.1	61.36	33.2	70.56	44.05	50.33	52.84	60.96
5	Gemini-2.0-Flash-thinking	53.18	51.78	62.67	52.72	57.48	37.41	64.23	47.8	52.12	57.38	61.89
6	Claude-3.5-Sonnet (0620)	52.13	50.41	63.74	53.23	55.07	33.82	66.63	47.46	51.94	51.35	58.09
7	Qwen2.5-VL-72B	51.30*	-	-	-	-	-	-	-	-	-	-
8	Gemini-1.5-Pro-002	49.56	48.22	58.58	52.51	57.22	33.35	54.21	41.21	43.47	51.24	58.21
9	MiniMax-VL-01	47.40*	-	-	-	-	-	-	-	-	-	-
10	Qwen2-VL-72B	46.84	45.42	56.4	51.77	50.48	30.68	56.64	35.54	41.22	45.62	50.3
11	InternVL2.5-78B	45.58	44.13	55.38	51.68	50.35	26.4	54.02	37.74	39.33	43.89	47.73
12	Gemini-1.5-Flash-002	43.83	41.89	56.91	46.4	51.26	26.43	48.22	38.48	36.78	48.9	53.48
13	GPT-4o mini	43.07	40.77	58.65	43.59	53.99	24.2	56.67	32.92	34.58	35.54	51.77
14	InternVL2-Llama3-76B	37.73	35.63	51.93	42.2	46.26	21.32	43.07	28.69	29.53	30.01	47.41
15	Qwen2.5-VL-7B	36.80*	-	-	-	-	-	-	-	-	-	-
16	Qwen2-VL-7B	34.35	32.93	43.96	39.26	40.16	18.1	39.72	24.18	30.89	28.8	44.92
17	Pixtral 12B	33.20	31.36	45.66	37.61	38.58	12.1	42.65	25.5	25.68	34.65	45.76
18	Llava-OneVision-72B	31.84	29.74	45.99	37.17	41.43	16.18	27.55	30.62	24.54	29.03	40.49
19	Aria-MoE-25B	31.76	28.91	51.04	34.95	39.54	14.04	39.29	26.1	25.37	28.62	36.63
20	InternVL2.5-8B	30.39	28.34	44.27	33.27	34.78	15.97	35.1	25.86	25.45	28.83	44.96
21	Qwen2.5-VL-3B	28.90*	-	-	-	-	-	-	-	-	-	-
22	Mammoth-VL-8B	27.90	26.41	37.99	33.35	35.17	15.33	23.6	26.07	20.67	23.89	37.59
23	InternVL2-8B	27.74	25.96	39.79	32.15	33.94	12.17	29.13	22.08	24.7	24.61	39.96
24	MiniCPM-V2.6	25.37	22.96	41.73	29.24	33.19	11.69	26.67	16.49	15.34	25.71	37.78
25	Phi-3.5-Vision	25.12	23	39.48	30.98	33.57	8.6	20.75	21.14	20.14	25.86	34.95
26	NVLM-D-72B	23.29	21.59	34.78	26.89	36.57	6.69	15.7	26.87	22.93	23.68	18.86
27	Llava-OneVision-7B	22.99	21.36	33.98	27.64	31.37	9.16	17.07	22.11	13.9	24.38	37.31
28	Qwen2-VL-2B	22.25	20.88	31.54	27.6	26.57	6.97	25.22	16.36	17	21.06	31
29	Ivy-VL-3B	20.28	19.19	27.72	24.95	24.8	7.49	14.1	21.63	13.67	22.06	42.15
30	InternVL2.5-2B	19.04	17.81	27.38	22.45	25.06	5.32	18.13	15.6	12.42	19.22	37.62
31	Llama-3.2-11B	18.02	16	31.73	19.9	28.06	8.1	17.29	13.94	5.75	16.28	25.43
32	Aquila-VL-2B-llava-qwen	17.10	16	24.57	20.25	23.81	5.57	9.81	17.87	13.51	22.53	29.33
33	InternVL2-2B	14.52	13.14	23.86	17	21.17	4.1	8.72	10.98	11.26	16.86	33.33
34	deepseek-vl2-tiny	12.91	11.08	25.27	14.91	21.58	3.51	5.63	8.06	10.77	15.13	27.22
35	Idefics3-8B-Llama3	11.94	8.96	32.11	13.28	16.06	4.67	11.45	9.84	9.79	18.4	9.7

Select table to display. Default: all MEGA-Bench tasks; Single Image: single-image tasks only.

MEGA-Bench Leaderboard

🚀 Introduction

MEGA-Bench is a comprehensive benchmark scaling multimodal evaluation to 500+ real-world tasks!

We aim to provide cost-effective and accurate evaluation for multimodal models, covering a wide range of real-world tasks. You don't have to run models on dozens of benchmarks -- MEGA-Bench delivers a comprehensive performance report in a single benchmark.

🧐 Highlights of MEGA-Bench

505 diverse tasks evaluating multimodal models across 8 grand application types, 7 input visual formats, 6 output formats, and 10 general multimodal skills, covering single-image, multi-image, and video tasks
Moves beyond multiple-choice questions, offering diverse output formats such as numbers, code, LaTeX, phrases, free-form responses, and more. We developed 45 customized metrics to accurately evaluate these diverse outputs
Focuses on task diversity rather than repetitive examples, ensuring cost-efficient evaluation
Provides fine-grained capability reports across application type, input/output formats, and required skills

🔨 Systematic Annotation Process

Guided by an initial application-driven taxonomy tree
16 expert annotators contributed to a two-round process to develop 505 tasks
Utilizes advanced tools for task design, review, and quality control
Ensures high-quality data through continuous refinement and balanced task distribution

📊🔍 Results & Takeaways from Evaluating Top Models

🔥📝 2025.01

Gemini 2.0 Experimental (1206) and Gemini 2.0 Flash Experimental outperform GPT-4o and Claude 3.5 Sonnet.
We add Grok-2-vision-1212 to the single-image leaderboard. The model appears to use many tokens per image and cannot run many of our multi-image and video tasks.
We will evaluate the o1-series models when budget allows.

📝 2024.11

GPT-4o (0513) and Claude 3.5 Sonnet (1022) lead the benchmark. Claude 3.5 Sonnet (1022) improves clearly over Claude 3.5 Sonnet (0620) on planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
Qwen2-VL stands out among open-source models, and its flagship model gets close to some proprietary flagship models
Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
Gemini 1.5 Flash performs the best among all the evaluated efficiency models, but struggles with UI and document tasks
Many open-source models face challenges in adhering to output format instructions

🎯 Interactive Visualization

Visit our project page to explore the interactive task taxonomy and radar maps, offering deep insights into model capabilities across multiple dimensions. Discover a comprehensive breakdown far beyond single-score evaluations.

📚 More Information

Our evaluation pipeline is available on our GitHub repo.
Full details are in our paper: https://arxiv.org/abs/2410.10563
Hugging Face Datasets: https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench
AaronCWacker Fork: https://github.com/AaronCWacker/MEGA-Bench