I built a small demo object detector on top of an LLM here: [[Google Gemini 101 - Object Detection with Vision and Structured Outputs]] ([Code](gist.github.com)), and discussed issues with localization in document images here: [[Horseshoes (and Hand Grenades) - LLM Localization is not Close, but not Close Enough]].
In this post, I want to compare two approaches for localization:
1. set-of-mark prompting
2. object detection (and the bounding box format that supports it)
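To make approach (2) concrete: Gemini's documented convention is to return boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 grid, so a small conversion step is needed before drawing or scoring them. A minimal sketch, assuming that convention (the helper name is mine, not from any library):

```python
def denormalize_box(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0-1000 grid
    to (x0, y0, x1, y1) pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * width),
        int(ymin / 1000 * height),
        int(xmax / 1000 * width),
        int(ymax / 1000 * height),
    )

# Example: a box covering the center of a 640x480 image
print(denormalize_box([250, 250, 750, 750], 640, 480))
# → (160, 120, 480, 360)
```

Other models use different conventions (pixel coordinates, 0-1 floats, `[x, y, w, h]`), which is part of why the bounding box format itself is worth comparing.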
I'll compare how well a few state-of-the-art models perform:
- Gemini vs. GPT-4o vs. Llama 3.2 vs. Pixtral

and run some simple experiments using small object detection benchmarks, such as:
- does few-shot prompting help with object detection?
- does fine-tuning help with object detection?
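For the set-of-mark side of the comparison, the core idea is to number candidate regions, overlay those numbers on the image, and let the model answer by mark ID instead of emitting coordinates. A hedged sketch of the mark-assignment step, with hypothetical hand-written region proposals (real pipelines typically get them from a segmenter like SAM):

```python
# Hypothetical region proposals as (x0, y0, x1, y1) pixel boxes.
regions = [
    {"box": (40, 60, 120, 140)},
    {"box": (300, 80, 420, 200)},
]

def assign_marks(regions):
    """Attach a numeric mark ID and a label anchor point (the box's
    top-left corner) to each region, ready to be drawn on the image."""
    marked = []
    for i, r in enumerate(regions, start=1):
        x0, y0, _, _ = r["box"]
        marked.append({"mark": i, "anchor": (x0, y0), **r})
    return marked

for m in assign_marks(regions):
    print(m["mark"], m["anchor"])
# → 1 (40, 60)
# → 2 (300, 80)
```

The drawing and prompting steps are omitted here; the point is that the model's output space collapses from coordinates to a small set of integers, which is exactly what the experiments below probe.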