Table 1: MEGA-Bench full results. The number in parentheses is the number of tasks under each keyword.
The Core set contains $N_{\text{core}} = 440$ tasks evaluated by rule-based metrics, and the Open-ended set contains $N_{\text{open}} = 65$ tasks evaluated by a VLM judge (we use GPT-4o-0806).
Unlike the results in our paper, here we report only the CoT-prompting results for the Core set, for clarity and compatibility with the released data.
$$\text{Overall} = \frac{\text{Core} \cdot N_{\text{core}} + \text{Open-ended} \cdot N_{\text{open}}}{N_{\text{core}} + N_{\text{open}}}$$
* indicates self-reported results from the model authors.
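The weighting can be reproduced with a minimal Python sketch; the task counts follow the definition above, while the function name and example scores are illustrative assumptions, not values from the leaderboard data.

```python
# Sketch of the Overall score as a task-count-weighted average of the
# Core and Open-ended scores (counts taken from the definition above).
N_CORE = 440  # Core tasks, scored with rule-based metrics
N_OPEN = 65   # Open-ended tasks, scored by the VLM judge (GPT-4o-0806)

def overall_score(core: float, open_ended: float) -> float:
    """Weight each set's mean score by its number of tasks."""
    return (core * N_CORE + open_ended * N_OPEN) / (N_CORE + N_OPEN)

# Hypothetical example: a model scoring 60.0 on Core and 50.0 on Open-ended.
print(round(overall_score(60.0, 50.0), 2))  # 58.71
```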

| Rank | Models | Overall (505) | Core (440) | Open-ended (65) | Perception (145) | Knowledge (97) | Planning (78) | Info Extraction (72) | Math (33) | Coding (31) | Science (29) | Metrics (20) |
|------|--------|---------------|------------|-----------------|------------------|----------------|---------------|----------------------|-----------|-------------|--------------|--------------|
| 10   |        | 51.30*        | 57.12      | 64.56           | 58.82            | 64.68          | 40.44         | 66.94                | 53.27     | 57.89       | 58.84        | 64.63        |