Step 2: Add Object Detection + Counting (aka "teach the model to count, finally")

Right now your captions come from BLIP alone, which is great at vibes but not great at facts.
It can describe a scene nicely, but when it comes to counting or being precise... it starts guessing like it's in a multiple-choice exam.

What's the problem?

BLIP is weak at:
- counting objects
- being precise ("a fruit" vs "2 apples")
So you might get something like:
"a group of people sitting at a table"
when in reality it's clearly 5 people and 3 pizzas and someone is judging your life choices.

What are we adding?

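We introduce YOLOv8 nano (yolov8n), the smallest model in the YOLOv8 family, for object detection.
It is fast and lightweight, and it returns one bounding box per detected object, so counting is just a matter of tallying boxes per label.
One limit to keep in mind: the pretrained weights only know the 80 object classes of the COCO dataset, so anything outside that list will not be counted.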
Then we combine its output with BLIP.
Now we have:
BLIP = storyteller
YOLO = accountant

Install Additional Dependency

We need the Ultralytics package:
pip install ultralytics
Add it to your requirements.txt so Docker doesn't forget it exists:
ultralytics

Loading YOLOv8

This is surprisingly simple. Almost suspiciously simple.
from ultralytics import YOLO

yolo_model = YOLO("yolov8n.pt")
Yes, that's it. It auto-downloads the model if needed.
No drama. No ceremony.

Running Object Detection

We feed the same image into YOLO and extract detected objects.
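def detect_objects(image_path):
    """Return a dict mapping each detected label to how many times it appears."""
    # One image in, so the list of results has a single entry.
    results = yolo_model(image_path)

    detections = results[0].boxes  # one box per detected object
    names = results[0].names       # class id -> label, e.g. 0 -> "person"

    object_counts = {}

    for box in detections:
        cls_id = int(box.cls[0])
        label = names[cls_id]
        object_counts[label] = object_counts.get(label, 0) + 1

    return object_counts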
Now we get something like:
{
  "person": 5,
  "pizza": 3,
  "chair": 6
}
Look at that. Numbers. Actual numbers. Civilization.
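One caveat on those numbers: YOLO attaches a confidence score to every box, and a shaky detection gets counted just like a solid one. Here is a minimal sketch of a stricter tally, assuming each detection has already been reduced to a (label, confidence) pair (in Ultralytics these come from box.cls and box.conf). The name count_confident and the 0.5 cutoff are illustrative choices, not part of the pipeline above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_confident(
    detections: Iterable[Tuple[str, float]],
    min_conf: float = 0.5,  # assumed cutoff; tune it on your own images
) -> Dict[str, int]:
    """Count labels, ignoring detections below the confidence cutoff."""
    counts = Counter(
        label for label, conf in detections if conf >= min_conf
    )
    return dict(counts)


# Hypothetical detections: two confident people, one doubtful one, one pizza.
sample = [("person", 0.91), ("person", 0.88), ("person", 0.31), ("pizza", 0.77)]
print(count_confident(sample))  # {'person': 2, 'pizza': 1}
```

Ultralytics also accepts a conf argument at prediction time, as in yolo_model(image_path, conf=0.5), which filters weak boxes before they ever reach your loop.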

Merging YOLO with BLIP

Now comes the fun part: combining both outputs into a better caption.
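def generate_enhanced_caption(image_path):
    caption = generate_caption(image_path)
    objects = detect_objects(image_path)

    # Nothing detected: return the BLIP caption instead of a dangling "Detected: ."
    if not objects:
        return f"{caption}."

    # Naive pluralization: add "s" only when the count is not 1.
    object_summary = ", ".join(
        f"{count} {name}" if count == 1 else f"{count} {name}s"
        for name, count in objects.items()
    )

    final_caption = f"{caption}. Detected: {object_summary}."

    return final_caption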
Example output:
"a group of people sitting at a table. Detected: 5 persons, 3 pizzas, 6 chairs."
Is it grammatically perfect? Not always.
Is it way more useful? Absolutely.
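If "5 persons" keeps you up at night, a small helper can handle the worst offenders. This is a sketch, not a real pluralization library: pluralize, summarize_counts, and the IRREGULAR table are hypothetical names, and the table only covers a handful of COCO labels.

```python
from typing import Dict

# Hand-picked irregular plurals for a few COCO labels (assumed, not exhaustive).
IRREGULAR = {
    "person": "people",
    "knife": "knives",
    "mouse": "mice",
    "sheep": "sheep",
    "skis": "skis",
    "scissors": "scissors",
}


def pluralize(label: str, count: int) -> str:
    """Return the label in singular or plural form to match the count."""
    if count == 1:
        return label
    if label in IRREGULAR:
        return IRREGULAR[label]
    # "bus" -> "buses", "bench" -> "benches", "wine glass" -> "wine glasses"
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def summarize_counts(counts: Dict[str, int]) -> str:
    """Turn {"person": 5, "pizza": 3} into "5 people, 3 pizzas"."""
    return ", ".join(
        f"{count} {pluralize(label, count)}" for label, count in counts.items()
    )


print(summarize_counts({"person": 5, "pizza": 3, "chair": 6}))
# 5 people, 3 pizzas, 6 chairs
```

To use it, build object_summary with summarize_counts(objects) inside generate_enhanced_caption.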

Update Your API

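Swap out the old function for the enhanced one. While we're here, save the upload under a generated temp name instead of trusting the client's filename, and delete the file when we're done:
import os
import tempfile

@app.post("/caption")
async def caption_image(file: UploadFile = File(...)):
    # Keep only the extension from the client's filename; YOLO uses it to
    # recognize the file as an image. The rest of the name is never used.
    suffix = os.path.splitext(file.filename or "")[1]

    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as buffer:
        shutil.copyfileobj(file.file, buffer)
        file_path = buffer.name

    try:
        caption = generate_enhanced_caption(file_path)
    finally:
        os.remove(file_path)  # clean up even if captioning fails

    return {"caption": caption}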
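The endpoint returns a single string, so a client that wants the numbers has to parse them back out of the sentence. Here is a minimal sketch of a richer payload, assuming you call generate_caption and detect_objects separately and pass their results in. The function build_caption_response and its field names are hypothetical, not something the original API defines.

```python
from typing import Any, Dict


def build_caption_response(
    caption: str, object_counts: Dict[str, int]
) -> Dict[str, Any]:
    """Bundle the caption with the raw counts so clients need no parsing."""
    # Most frequent objects first; ties are broken alphabetically.
    ordered = dict(
        sorted(object_counts.items(), key=lambda item: (-item[1], item[0]))
    )
    return {
        "caption": caption,
        "objects": ordered,
        "total_objects": sum(object_counts.values()),
    }
```

Returning this dict from the /caption handler in place of {"caption": caption} would give clients both the sentence and the structured counts in one response.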

Actual Results

For this photo:



We got the following result, comparing step 1 and step 2:

Performance Considerations

You just added another model, so yes, things get heavier: every request now runs two models instead of one, so latency and memory use both go up.
But YOLOv8 nano is optimized for speed, so it's still quite reasonable.
If needed, you can later:
- run YOLO on GPU
- batch requests
- cache results
But for now, we keep it simple and working.
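Of those three options, caching is the easiest to try without touching the models. Here is a minimal sketch, assuming identical uploads should get identical captions. CaptionCache is a hypothetical helper that keys on a SHA-256 hash of the image bytes; it grows without bound, so a real service would want an eviction policy on top.

```python
import hashlib
from typing import Callable, Dict


class CaptionCache:
    """In-memory cache keyed on the SHA-256 hash of the image bytes."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_compute(
        self, image_bytes: bytes, compute: Callable[[], str]
    ) -> str:
        """Return the cached caption, calling compute() only on a miss."""
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


cache = CaptionCache()
image_bytes = b"pretend these are JPEG bytes"

# The first call runs the (stand-in) captioner; the second is served from cache.
first = cache.get_or_compute(image_bytes, lambda: "a table. Detected: 3 pizzas.")
second = cache.get_or_compute(image_bytes, lambda: "never called")
print(first == second)  # True
```

In the endpoint, compute would be a small function that runs generate_enhanced_caption on the saved file, so both models are skipped entirely when the same image comes in twice.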

What You Just Achieved

Technically:
- Integrated object detection into your pipeline
- Extracted structured data (object counts)
- Combined multimodal outputs into a richer caption

Emotionally:
- Your model stopped guessing and started counting
- You upgraded from "vibes AI" to "data-driven AI"

Why This Step Matters

This is where your system becomes more than just a demo.
You're no longer relying on a single model.
You're orchestrating multiple models to complement each other.
And this is the core idea behind real-world AI systems:
no single model does everything well, but together they look very smart.
Step 2 is done.
Your model can now see, describe, and count.
It's basically becoming a very observant human.
