New LLM releases come with fancy graphs comparing the averaged performance of the latest release against previous ones. For small/open models, the x-axis is often parameter count or tokens/$ (Mistral invented the "Upper Left Triangle" graph). But there's a problem with these graphs: **averaged benchmark scores are misleading**.
Let's look at the specific case of multimodal models (which have been on my mind a lot lately, as you might guess). Reported benchmarks include DocVQA, ChartQA, and OCRBench.
Here's the catch: OCRBench is a dataset composed of instances **pulled from DocVQA, ChartQA, and others**. Average it in alongside those benchmarks and you're counting the shared instances twice.
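To make the double-counting concrete, here's a toy sketch (every instance ID and membership entry below is made up for illustration) that computes the effective per-instance weight when you take a plain average over the three benchmarks:

```python
from collections import Counter

# Hypothetical instance-level membership: which benchmarks each instance appears in.
# Suppose OCRBench re-uses some DocVQA and ChartQA instances verbatim.
instance_sources = {
    "doc_001": ["DocVQA", "OCRBench"],    # shared instance
    "doc_002": ["DocVQA"],
    "chart_001": ["ChartQA", "OCRBench"],  # shared instance
    "chart_002": ["ChartQA"],
    "ocr_001": ["OCRBench"],               # OCRBench-only instance
}

# In a macro-average over the three benchmarks, each benchmark contributes 1/3,
# split evenly across its instances. An instance's effective weight is the sum
# of its shares across every benchmark it appears in.
benchmark_sizes = Counter(b for sources in instance_sources.values() for b in sources)
weights = {
    inst: sum(1 / (3 * benchmark_sizes[b]) for b in sources)
    for inst, sources in instance_sources.items()
}

for inst, w in weights.items():
    print(f"{inst}: effective weight {w:.3f}")
# Shared instances carry noticeably more weight than unshared ones,
# so getting them right moves the headline average faster.
```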
So if you want to juice your model performance, just train it to be really good at DocVQA. *Actually, just get really good at DocVQA regardless, it'd make my life easier.*