> [!NOTE] **TL;DR**
> I walk you through how to build an object detector with Gemini 2.0 Flash as a first project: how to submit images in your calls, get back structured objects, and get and visualize bounding boxes.

This is a _missing manual_ for how to get a simple working prototype up and running with Gemini's vision mode and structured outputs. I'm confident that manual exists elsewhere, but I haven't been able to find it. And if it exists within the confusing labyrinth of Google's documentation, I'm certain that I will _never_ find it.

My first experience with the Google Gemini docs resulted in far more questions than answers: should you use `vertexai` or `google-generativeai`? What's the difference? Are there differences in what you can _do_ with the two APIs? What the hell is a billing profile in Google Cloud?

I just want to:

- put in a credit card
- get an API key
- write some code to solve my problem

Which is largely how it works with any sane model provider (OpenAI, Mistral, etc.). This guide is aimed at helping you do that, by showing you the happy path.

As a motivating example, we're going to build a silly little object detector for tea parties. We want to find the teapots and the teacups in images, like so:

![[gemini_out.jpg]]

I don't think this is the _best_ use case of Gemini, but I do think it's a fun and illustrative one. 😀

If you just want the code, I've created a [Github gist](https://gist.github.com/jbarrow/c35279cbed578eeba4e1253dc6907c8c) with it.

# Getting Set Up

We're going to do everything through Google AI Studio, which includes:

- getting a key
- setting up a billing account
- preparing our dev environment (exporting the key and installing the `google-generativeai` python package)

## Get an API Key

First, go to Google AI Studio: https://aistudio.google.com/app/apikey

If this is your first time, you'll be greeted with this screen:

![[gemini_start.png]]

Click "Get API key" and copy that key.

If you're on the paid tier, don't forget to follow through on "Go To Billing" and **create and link a billing account** on Google Cloud! Otherwise, you'll **only have the free tier limits** on requests per minute and total tokens. This can be a fun one to debug if your calls are failing or taking too long to return.

> [!WARNING]
> As of writing this, Gemini 2.0 Flash has a pretty low RPM allowance. You can use Gemini 1.5 Flash/Pro instead for higher throughput, or wait for the full API release.

## Export your API Key

You'll want to add your key to your `.bashrc` or export it as `GOOGLE_API_KEY`. This is the default name that the Gemini package looks for, so if you use anything else you'll need to load it explicitly.

```sh
export GOOGLE_API_KEY="<PASTE_YOUR_KEY>"
```
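If you do store the key under a different name, you can pass it to `genai.configure` yourself. Here's a minimal sketch, where `MY_GEMINI_KEY` is a made-up variable name for illustration:

```python
import os

import google.generativeai as genai

# Assumption: the key was exported under a custom name (MY_GEMINI_KEY) rather
# than GOOGLE_API_KEY, so we pass it to configure() explicitly.
genai.configure(api_key=os.environ["MY_GEMINI_KEY"])
```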
## Install Google GenAI

Since we're going through the Google AI Studio process and not the VertexAI one, we'll want to install the `google-generativeai` package.

```sh
pip install -U google-generativeai
```

And there we go, we're now ready to write some code!

# Step 1: Making an API Call

The first thing to do is just make an API call. We want to ensure that our key is working, that we have access to the `gemini-2.0-flash-exp` model, and that we can work with what comes back.

The code for making an API call is pretty simple; it just takes 3 steps:

1. add your key
2. make the call to a specific model (in our case, `gemini-2.0-flash-exp`, which is the latest release as of my writing this)
3. parse the results

```python
import os
import google.generativeai as genai


def main() -> None:
    # step 1: add your key
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    # step 2: make the call to a specific model
    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    result = model.generate_content(["Say hi, Gemini!"])

    # step 3: parse the outputs
    print(result.candidates[0].content.parts[0].text)


if __name__ == "__main__":
    main()
```

And here's the result I get running that:

```
$ python gemini_structured.py
Hi there! How can I help you today?
```

# Step 2: Adding Images to the Call

This is a great start, but for the purposes of this post we're interested in:

1. using images with Gemini's vision mode
2. getting structured outputs

Let's start with vision mode. Passing an image to the call from Python is pretty easy: you can use a standard `PIL.Image` and pass it in as part of the `generate_content` call (you just pass in a list of content parts, which can be text or images).

For the first step, let's install `Pillow` (for `PIL`):

```sh
pip install -U Pillow
```

We now have access to `PIL.Image`, which lets us load images with:

```python
from PIL import Image

image = Image.open(image_path)
```

Plugging that into our script, we get:

```python
from PIL import Image
import google.generativeai as genai
import argparse
import os


def main(image_path: str) -> None:
    image = Image.open(image_path)

    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    result = model.generate_content(
        [
            """How many teacups and teapots are in this image?""",
            image,
        ]
    )

    print(result.candidates[0].content.parts[0].text)


if __name__ == "__main__":
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    parser = argparse.ArgumentParser()
    parser.add_argument("image_path", help="Path to image file to process")
    args = parser.parse_args()

    main(args.image_path)
```

Note that I also cleaned up the script to parse arguments, so we can pass an image in from the command line. I'll be using the following image (Creative Commons, from Pexels), saved as `image.jpeg` in my directory:

![[gemini_in.jpeg]]

And after running, I get:

```
$ python gemini_structured.py image.jpeg
Based on the image, there is **1 teacup** and **1 teapot**.
```

So we know that Gemini can see the picture, and can at least count to 1. 🎉
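As an aside, if you'd rather not depend on Pillow, I believe the SDK also accepts inline image bytes as a dict with a `mime_type` and `data`. Here's a minimal sketch of that, going from memory, so double-check it against the current docs:

```python
import os
import pathlib

import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("gemini-2.0-flash-exp")
result = model.generate_content(
    [
        "How many teacups and teapots are in this image?",
        # inline image bytes instead of a PIL.Image (assumed dict format)
        {"mime_type": "image/jpeg", "data": pathlib.Path("image.jpeg").read_bytes()},
    ]
)
print(result.candidates[0].content.parts[0].text)
```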
# Step 3: Getting Structured Results

To get structured results back, we describe the shape of each detection with a `TypedDict` and pass that schema to Gemini through the `GenerationConfig`, asking for a JSON response:

```python
from typing_extensions import TypedDict, Literal
from PIL import Image
import google.generativeai as genai
import argparse
import json
import os


class TeaSet(TypedDict):
    type: Literal["Teacup", "Teapot"]


def main(image_path: str, size: int = 1024) -> None:
    image = Image.open(image_path)
    # I've found that localization works better when the image is smaller
    image.thumbnail((size, size))

    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    result = model.generate_content(
        [
            """Find all the teacups and teapots in the image.
Return your answer as a list of JSON objects.""",
            image,
        ],
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=list[TeaSet],
        ),
    )

    objects = json.loads(str(result.candidates[0].content.parts[0].text))
    print(json.dumps(objects, indent=4))


if __name__ == "__main__":
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    parser = argparse.ArgumentParser()
    parser.add_argument("image_path", help="Path to image file to process")
    args = parser.parse_args()

    main(args.image_path)
```

Running it on our image gives:

```
[
    {
        "type": "Teacup"
    },
    {
        "type": "Teapot"
    }
]
```

# Step 4: Bounding Boxes

One neat trick that Gemini can do quite well (better than GPT-4o, though not perfectly) is localizing objects within an image by returning bounding boxes. You can do this by prompting Gemini to return a bounding box for each object in the format:

`[ymin, xmin, ymax, xmax]`

Getting the bounding boxes takes 2 simple modifications to the code above. First, we have to add the `bounding_box` key to our `TeaSet` typed dict:

```python
class TeaSet(TypedDict):
    type: Literal["Teacup", "Teapot"]
    bounding_box: list[int]
```

(N.B. I would have preferred to use a `tuple[int, int, int, int]`, but that gets converted to a schema using `maxItems`, which Gemini doesn't support, so instead I'm sticking with the above.)

And second, we update our prompt to tell Gemini what format to return the box in:

```python
"""Find all the teacups and teapots in the image.
Return your answer as a list of JSON objects with the type and bounding box.
Return the bounding box in [ymin, xmin, ymax, xmax] format."""
```

And that's it, we're now getting bounding box info!

```
$ python gemini_structured.py image.jpeg
[
    {
        "bounding_box": [
            165,
            177,
            696,
            601
        ],
        "type": "Teacup"
    },
    {
        "bounding_box": [
            23,
            596,
            556,
            984
        ],
        "type": "Teapot"
    }
]
```

But how good are those bounding boxes? And what is the coordinate space? We'll want to plot them on top of our image to verify.

## Drawing the Bounding Boxes

Each value in the returned bounding box will be **between 0 and 1000**, so to get image coordinates you have to do some post-processing. Luckily, that post-processing is simple: divide each coordinate by 1000, then multiply by the image dimension (height for the `y` values, width for the `x` values).
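As a quick sanity check on that arithmetic, here's one box converted by hand. The image dimensions below are made up for illustration; use whatever size your image is after the thumbnail step:

```python
# Teacup box returned above, in Gemini's [ymin, xmin, ymax, xmax] format.
bbox = [165, 177, 696, 601]

# Made-up post-thumbnail image size, purely for illustration.
width, height = 1024, 683

ymin, xmin, ymax, xmax = [coord / 1000 for coord in bbox]
left, top = int(xmin * width), int(ymin * height)
right, bottom = int(xmax * width), int(ymax * height)

print((left, top, right, bottom))  # (181, 112, 615, 475)
```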
We can write a quick little function that normalizes the bounding box and plots it on top of the image, like so:

```python
from PIL import ImageDraw


def draw_bounding_box(
    image: Image.Image, type: str, bbox: list[int]
) -> Image.Image:
    width, height = image.size
    draw = ImageDraw.Draw(image)

    # convert from Gemini's 0-1000 coordinate space to pixel coordinates
    ymin, xmin, ymax, xmax = [coord / 1000 for coord in bbox]
    box_xmin = int(xmin * width)
    box_ymin = int(ymin * height)
    box_xmax = int(xmax * width)
    box_ymax = int(ymax * height)

    # draw the box and label it with the object type
    draw.rectangle(
        [(box_xmin, box_ymin), (box_xmax, box_ymax)],
        outline="blue",
        width=2,
    )
    draw.text(
        (box_xmin, box_ymin),
        type,
        fill="white",
    )

    return image
```

## Full Code

If we take all of the above modifications and put them into a full script:

```python
from typing_extensions import TypedDict, Literal
from PIL import Image, ImageDraw
import google.generativeai as genai
import argparse
import json
import os


def draw_bounding_box(
    image: Image.Image, type: str, bbox: list[int]
) -> Image.Image:
    width, height = image.size
    draw = ImageDraw.Draw(image)

    # convert from Gemini's 0-1000 coordinate space to pixel coordinates
    ymin, xmin, ymax, xmax = [coord / 1000 for coord in bbox]
    box_xmin = int(xmin * width)
    box_ymin = int(ymin * height)
    box_xmax = int(xmax * width)
    box_ymax = int(ymax * height)

    # draw the box and label it with the object type
    draw.rectangle(
        [(box_xmin, box_ymin), (box_xmax, box_ymax)],
        outline="blue",
        width=2,
    )
    draw.text(
        (box_xmin, box_ymin),
        type,
        fill="white",
    )

    return image


class TeaSet(TypedDict):
    type: Literal["Teacup", "Teapot"]
    bounding_box: list[int]


def main(image_path: str, size: int = 1024) -> None:
    image = Image.open(image_path)
    # I've found that localization works better when the image is smaller
    image.thumbnail((size, size))

    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    result = model.generate_content(
        [
            """Find all the teacups and teapots in the image.
Return your answer as a list of JSON objects with the type and bounding box.
Return the bounding box in [ymin, xmin, ymax, xmax] format.""",
            image,
        ],
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=list[TeaSet],
        ),
    )

    objects = json.loads(str(result.candidates[0].content.parts[0].text))
    for obj in objects:
        details = TeaSet(**obj)
        if "bounding_box" in details:
            image = draw_bounding_box(
                image, details["type"], details["bounding_box"]
            )

    image.show()


if __name__ == "__main__":
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    parser = argparse.ArgumentParser()
    parser.add_argument("image_path", help="Path to image file to process")
    args = parser.parse_args()

    main(args.image_path)
```

Now when we run our code, we get something like the following result:

![[gemini_out.jpg]]

Pretty neat! Now, there are some caveats and gotchas with structured outputs from Gemini, but those are beyond the scope of this post. In the future, we'll look at a few, including:

- use of the OpenAI API with Gemini
- the funky subset of OpenAPI that Gemini uses
- setting some fields to be required (all fields are optional by default with the above methods)
- improving localization with few-shot examples