Since the GPT-4 announcement, it has become standard practice for commercial and research labs releasing multimodal models to report results across a litany of document tasks. I'm personally very interested in document AI (and not only because I worked at the Adobe Document Intelligence Lab and wrote my dissertation on document collections): I think it's a good litmus test for "how economically useful is this model?", because a lot of the low-hanging useful tasks happen to live in documents.

Multimodal model releases are typically evaluated against a common set of benchmarks for comparison, including:

- MMMU
- VQA
- TextVQA
- DocVQA (my related post: [[2. What can we learn about LMMs from Document Question Answering?]])
- ChartQA (my related post: [[3. What can we learn about LMMs from Charts?]])
- InfographicsVQA
- AI2D

| Company | Model Name | Release Date | Open Weight | Model Size | MMMU | VQAv2 | TextVQA | DocVQA | ChartQA | InfographicsQA / InfoVQA | AI2D | MathVista | Math-Vision | DUDE | DocCVQA | TVQA | LSMDC | SlideVQA | OCRBench | MTVQA | MM MT-Bench | CountBenchQA | Flickr Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AllenAI | [Molmo](https://molmo.allenai.org/blog) | 10/25/2024 | Y | 72B | 54.1 | 86.5 | 83.1 | 93.5 | 87.3 | 81.9 | 96.3 | 58.6 | | | | | | | | | | 91.2 | 85.2 |
| | | | Y | 7B-D | 45.3 | 85.6 | 81.7 | 92.2 | 84.1 | 72.6 | 93.2 | 51.6 | | | | | | | | | | 84.8 | 84.8 |
| | | | Y | 7B-O | 39.3 | 85.3 | 80.4 | 90.8 | 80.4 | 70.0 | 90.7 | 44.5 | | | | | | | | | | 83.3 | 83.3 |
| | | | Y | 1B | 34.9 | 83.9 | 78.8 | 77.7 | 78.0 | 53.9 | 86.4 | 34.0 | | | | | | | | | | 79.6 | 87.2 |
| Alibaba | [Qwen-2-VL](https://qwenlm.github.io/blog/qwen2-vl/) | 08/29/2024 | Y | 72B | 64.5 | | 85.5 | 96.5 | 88.3 | 84.5 | | 70.5 | 25.9 | | | | | | 855 | 32.6 | | | |
| | | | Y | 7B | 54.1 | | 84.3 | 94.5 | 83.0 | 76.5 | | 58.2 | 16.3 | | | | | | 845 | 26.3 | | | |
| | | | Y | 2B | 41.1 | | 79.7 | 90.1 | 73.5 | 65.5 | | 43.0 | 12.4 | | | | | | 794 | 20.0 | | | |
| Anthropic | [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | 06/20/2024 | N | - | 68.3 | | | 95.2 (ANLS) | 90.8 | | 94.7 | 67.7 | | | | | | | | | | | |
| | Claude 3.5 Sonnet (new) | | N | - | | | | | | | | | | | | | | | | | | | |
| | Claude 3 Opus | | N | - | | | | | | | | | | | | | | | | | | | |
| | Claude 3 Sonnet | | N | - | | | | | | | | | | | | | | | | | | | |
| | Claude 3 Haiku | | N | - | | | | | | | | | | | | | | | | | | | |
| Google | Gemini-1.5 Flash | | N | - | | | | | | | | | | | | | | | | | | | |
| | Gemini-1.5 Flash 8B | | Y | 8B | | | | | | | | | | | | | | | | | | | |
| | Gemini-1.5 Pro | | N | - | | | | | | | | | | | | | | | | | | | |
| | Gemini-2.0 Flash | | N | - | | | | | | | | | | | | | | | | | | | |
| | PaliGemma | | Y | | | | | | | | | | | | | | | | | | | | |
| Meta | Llama-3.2 90B | | N | | | | | | | | | | | | | | | | | | | | |
| Microsoft | Phi-3V | | Y | | | | | | | | | | | | | | | | | | | | |
| | Phi-3.5V | | Y | | | | | | | | | | | | | | | | | | | | |
| Mistral | [Pixtral 12B](https://mistral.ai/news/pixtral-12b/) | 10/17/2024 | Y (Apache) | 12B | 52.5 | 78.6 | | 90.7 | 81.8 | | | 58.0 | | | | | | | | | 6.05 | | |
| | [Pixtral Large](https://mistral.ai/news/pixtral-large/) | 11/18/2024 | Y* (NC) | 124B | 64.0 | 80.9 | | 93.3 | 88.1 | | 93.8 | 69.4 | | | | | | | | | 7.40 | | |
| OpenAI | [GPT-4V](https://openai.com/index/gpt-4-research/) | | N | - | | | | | | | | | | | | | | | | | | | |
| | GPT-4o | | N | - | | | | | | | | | | | | | | | | | | | |
| | GPT-4o-mini | | N | - | | | | | | | | | | | | | | | | | | | |
| | o1 | | N | - | | | | | | | | | | | | | | | | | | | |

I'm really interested in the document-centric subset of these (DocVQA, ChartQA, InfographicsVQA, AI2D) and am working on a series exploring these datasets in depth. The goal of this series is twofold:

1. To impart a familiarity with these datasets
2. To be able to assess how good foundation models are at handling documents
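A quick note on scoring, since the table above reports Claude 3.5 Sonnet's DocVQA number explicitly as ANLS: DocVQA (and InfographicsVQA) are scored with ANLS (Average Normalized Levenshtein Similarity) rather than exact-match accuracy. Below is a minimal sketch of the metric in plain Python; the 0.5 threshold follows the standard DocVQA evaluation protocol, and the function names are my own rather than from any official evaluation kit.

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity),
# the metric DocVQA reports. Function names are illustrative, not from
# an official evaluation kit; the 0.5 threshold follows the DocVQA protocol.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity score.

    Each question may have several acceptable ground-truth answers; the
    prediction is scored against all of them and the best score is kept.
    Similarities below the threshold tau are zeroed out, so near-misses
    earn partial credit but unrelated answers earn nothing.
    """
    total = 0.0
    for pred, answers in zip(predictions, references):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            if not p and not a:
                best = max(best, 1.0)
                continue
            nl = levenshtein(p, a) / max(len(p), len(a))
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# Example: a near-miss ("$1,600" vs "$1600") still scores well under ANLS.
print(anls(["$1,600"], [["$1600", "1600"]]))  # ~0.83
```

Two things are worth noticing: the max over reference answers means a prediction only needs to match one acceptable phrasing, and the cutoff means answers that are far off contribute zero rather than accumulating small amounts of partial credit.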
## The Evolution of Document AI

Tracing the history of document AI, we see a field that progressively moves toward more and more abstract tasks, getting closer to "do the useful thing with the document." For example, OCR and key-value pair (KVP) detection are useful intermediate tasks (they let us build useful downstream systems), whereas abstractive QA is itself a downstream task. This trajectory isn't unique to document AI; NLP followed a similar path (how many people are doing part-of-speech tagging or dependency parsing to build systems nowadays?).