New LLM releases come with fancy graphs comparing the averaged performance of the latest release against previous ones. For small/open models, the x-axis is often parameter count or tokens/$ (Mistral invented the "Upper Left Triangle" graph). But there's a problem with these graphs: **averaged benchmark scores are misleading**.
Let's look at the specific case of multimodal models (which have been on my mind a lot lately, as you might guess). Reported benchmarks include DocVQA, ChartQA, and OCRBench.
Here's the catch: OCRBench is a dataset composed of instances **pulled from DocVQA, ChartQA, and others**. Average it in alongside those benchmarks and you're counting the shared instances twice.
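To make the double-counting concrete, here's a toy sketch (every instance ID and membership entry below is made up for illustration) that computes the effective per-instance weight when you take a plain average over the three benchmarks:

```python
from collections import Counter

# Hypothetical instance-level membership: which benchmarks each instance appears in.
# Suppose OCRBench re-uses some DocVQA and ChartQA instances verbatim.
instance_sources = {
    "doc_001": ["DocVQA", "OCRBench"],    # shared instance
    "doc_002": ["DocVQA"],
    "chart_001": ["ChartQA", "OCRBench"],  # shared instance
    "chart_002": ["ChartQA"],
    "ocr_001": ["OCRBench"],               # OCRBench-only instance
}

# In a macro-average over the three benchmarks, each benchmark contributes 1/3,
# split evenly across its instances. An instance's effective weight is the sum
# of its shares across every benchmark it appears in.
benchmark_sizes = Counter(b for sources in instance_sources.values() for b in sources)
weights = {
    inst: sum(1 / (3 * benchmark_sizes[b]) for b in sources)
    for inst, sources in instance_sources.items()
}

for inst, w in weights.items():
    print(f"{inst}: effective weight {w:.3f}")
# Shared instances carry noticeably more weight than unshared ones,
# so getting them right moves the headline average faster.
```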
So if you want to juice your model performance, just train it to be really good at DocVQA. *Actually, just get really good at DocVQA regardless, it'd make my life easier.*