I built a small demo object detector on top of an LLM here: [[Google Gemini 101 - Object Detection with Vision and Structured Outputs]] ([Code](gist.github.com)), and discussed issues with localization in document images here: [[Horseshoes (and Hand Grenades) - LLM Localization is not Close, but not Close Enough]].
In this post, I want to compare two approaches for localization:
1. set-of-mark prompting
2. object detection (and the bounding box format that supports it)
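To make approach (2) concrete: Gemini's documented convention is to return boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 grid, so a small conversion step is needed before drawing or scoring them. A minimal sketch, assuming that convention (the helper name is mine, not from any library):

```python
def denormalize_box(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0-1000 grid
    to (x0, y0, x1, y1) pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * width),
        int(ymin / 1000 * height),
        int(xmax / 1000 * width),
        int(ymax / 1000 * height),
    )

# Example: a box covering the center of a 640x480 image
print(denormalize_box([250, 250, 750, 750], 640, 480))
# → (160, 120, 480, 360)
```

Other models use different conventions (pixel coordinates, 0-1 floats, `[x, y, w, h]`), which is part of why the bounding box format itself is worth comparing.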
I'll compare how well a few state-of-the-art models perform:
- Gemini vs. GPT-4o vs. Llama 3.2 vs. Pixtral

and run some simple experiments using small object detection benchmarks, such as:
- does few-shot prompting help with object detection?
- does fine-tuning help with object detection?
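For the set-of-mark side of the comparison, the core idea is to number candidate regions, overlay those numbers on the image, and let the model answer by mark ID instead of emitting coordinates. A hedged sketch of the mark-assignment step, with hypothetical hand-written region proposals (real pipelines typically get them from a segmenter like SAM):

```python
# Hypothetical region proposals as (x0, y0, x1, y1) pixel boxes.
regions = [
    {"box": (40, 60, 120, 140)},
    {"box": (300, 80, 420, 200)},
]

def assign_marks(regions):
    """Attach a numeric mark ID and a label anchor point (the box's
    top-left corner) to each region, ready to be drawn on the image."""
    marked = []
    for i, r in enumerate(regions, start=1):
        x0, y0, _, _ = r["box"]
        marked.append({"mark": i, "anchor": (x0, y0), **r})
    return marked

for m in assign_marks(regions):
    print(m["mark"], m["anchor"])
# → 1 (40, 60)
# → 2 (300, 80)
```

The drawing and prompting steps are omitted here; the point is that the model's output space collapses from coordinates to a small set of integers, which is exactly what the experiments below probe.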