Step 3: Train Florence-2 Base Model On The MS COCO Dataset Using PEFT

The Great AI Makeover: Ditching the Frankenstein Setup

Once upon a time, my app relied on BLIP for reading images and YOLOv8 for drawing boxes. It was like hiring two separate guys to do a job that one smart intern could do. It was heavy, it ate hard drive space, and it moved like molasses in winter.

Enter the Intern: Florence-2-base

I fired BLIP and YOLO and hired Microsoft's Florence-2-base as the base model. But why Florence-2 specifically? Because it is a unified "Vision-Language" wizard. Instead of piping images through two massive independent networks, Florence-2 accepts a text prompt like "<OD>" (Object Detection) or "<CAPTION>" and does everything internally. It outputs bounding boxes and text in a single pass!

  • Less Baggage: I completely removed the massive Ultralytics library and separate BLIP checkpoints from my Docker image to keep things extremely lean.
  • Unified Pipeline: One model, one processor, zero headaches.
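To make that "single pass" concrete, here is a simplified sketch of what parsing Florence-2's raw "<OD>" output could look like. The `<loc_N>` token format (coordinates quantized into 1,000 bins) matches the model's documented behavior, but this parser is purely illustrative; in practice the bundled processor's `post_process_generation` handles it for you.

```python
import re

def parse_od_output(text, image_size):
    """Parse Florence-2-style '<OD>' output such as
    'car<loc_100><loc_200><loc_500><loc_600>' into (label, box) pairs.
    Coordinates are quantized into 1,000 bins relative to the image size."""
    w, h = image_size
    # Each detection is a label followed by exactly four <loc_N> tokens.
    pattern = re.compile(r"([^<]+)((?:<loc_\d+>){4})")
    detections = []
    for label, locs in pattern.findall(text):
        x1, y1, x2, y2 = [int(v) for v in re.findall(r"<loc_(\d+)>", locs)]
        detections.append((label.strip(),
                           (x1 / 1000 * w, y1 / 1000 * h,
                            x2 / 1000 * w, y2 / 1000 * h)))
    return detections

print(parse_od_output("car<loc_100><loc_200><loc_500><loc_600>", (1000, 1000)))
# → [('car', (100.0, 200.0, 500.0, 600.0))]
```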

Kaggle Bootcamp: Why PEFT (Parameter-Efficient Fine-Tuning)?

You might ask: "Isn't Florence-2-base smart enough out of the box?" Yes, it is brilliant at zero-shot generalization. But like a brilliant intern, it tends to format answers unpredictably. I needed it to spit out strict, reliable captions aligned exactly with the MS COCO Dataset style.

Fully fine-tuning all 230 million parameters of the model would require serious hardware. Instead, I used PEFT (Parameter-Efficient Fine-Tuning) on Kaggle. By mathematically attaching tiny "LoRA" adapters to the model's Attention layers, I only actually trained about 1% of the network. The base intelligence remained untouched, but it learned my exact style requirements!

Wait, how does LoRA actually work?
Imagine the base model as a gigantic encyclopedia. Instead of using an eraser and painfully rewriting every single page (which would completely melt my GPU), LoRA just slides a small, highly intelligent sticky note over the text.

Mathematically, neural network layers are just massive grids of numbers, like a 10,000 by 10,000 matrix. Updating that giant original grid means touching all 100 million numbers in it! LoRA cheats the system by splitting the update into two much smaller bottleneck grids: a 10,000 by 8 matrix and an 8 by 10,000 matrix. When you multiply those two tiny grids together, they expand back into a full 10,000 by 10,000 "ghost overlay" on top of the original!

The magic here is that I only had to train 160,000 numbers instead of 100 million. By dynamically "taping" these tiny LoRA matrices directly onto the model's underlying Attention layers, the core intelligence of Florence-2 remained completely untouched, but it perfectly learned my exact MS COCO formatting rules!
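The arithmetic above can be sketched in a few lines. This is a toy illustration with NumPy, not the actual PEFT internals:

```python
import numpy as np

d, r = 10_000, 8  # layer width and LoRA rank

# Two small "sticky note" matrices instead of one giant d x d update.
A = np.random.randn(d, r) * 0.01  # down-projection, trained
B = np.zeros((r, d))              # up-projection, zero-init so training starts as a no-op

x = np.random.randn(1, d)
# Applied as two cheap matmuls: (x @ A) @ B equals x @ (A @ B),
# without ever materializing the full 10,000 x 10,000 overlay.
overlay_out = (x @ A) @ B

print(f"trainable numbers: {A.size + B.size:,}")  # 160,000
print(f"full-update numbers: {d * d:,}")          # 100,000,000
```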

# Initializing the LoRA adapters
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # confirms only a tiny fraction is trainable

I also had to dodge some nasty HuggingFace bugs! The standard saving tools crashed on Florence-2's tied weights, so I had to extract the raw PyTorch adapter dictionaries manually to survive.

# Bypassing the crash by grabbing raw dicts directly from memory
from peft import get_peft_model_state_dict
from safetensors.torch import save_file

lora_state_dict = get_peft_model_state_dict(model)
save_file(lora_state_dict, "adapter_model.safetensors")

The CPU Speed Demon: 8-Bit Quantization

When I successfully glued the LoRA brains onto Florence-2 locally, my Docker container's CPU was still struggling, taking nearly 9 seconds per caption. I needed to slash the inference overhead.

  • The Memory Diet: I squished all incoming images down to 384x384. Halving each side deletes 75% of the visual patches fed to the DaViT vision encoder!
  • Greedy Mode: I turned off beam search by setting num_beams=1. I told the model to just guess the first valid word instead of calculating complex alternate realities.
  • LoRA Fusion: I permanently baked the LoRA weights into the base layers so the computer didn't have to calculate two matrices every time.
  • 8-Bit Quantization: I magically shrank the massive 32-bit floating point layers down to 8-bit integers. Since CPU inference is bottlenecked by memory bandwidth, this doubled the data my CPU could swallow per second!

# Fusing the LoRA weights permanently into the base layers
model = model.merge_and_unload()

# Turbocharging CPU inference: dynamic 8-bit quantization of the Linear layers
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
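The bandwidth claim is easy to sanity-check on a toy layer: dynamically quantizing a single `nn.Linear` shrinks its serialized weights roughly 4x (int8 vs. float32), which is exactly what relieves a memory-bound CPU. A minimal sketch:

```python
import io
import torch

def serialized_size(module):
    """Bytes needed to serialize a module's state dict."""
    buf = io.BytesIO()
    torch.save(module.state_dict(), buf)
    return buf.getbuffer().nbytes

fp32 = torch.nn.Sequential(torch.nn.Linear(1024, 1024))
int8 = torch.quantization.quantize_dynamic(
    torch.nn.Sequential(torch.nn.Linear(1024, 1024)),
    {torch.nn.Linear},
    dtype=torch.qint8,
)

print(f"fp32: {serialized_size(fp32):,} bytes")
print(f"int8: {serialized_size(int8):,} bytes")  # roughly 4x smaller
```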

Final Report Card

The transformation from the clunky BLIP+YOLO setup to an optimized Florence-2 microservice has been staggering.

  • Base Model Footprint: ~920 MB loaded in memory
  • LoRA Adapter Weights: ~15 MB physically stored on disk
  • Total Docker Image Size: 1.94 GB (Including lightweight Python, PyTorch CPU Core, and pure Transformers)
  • Original Latency: ~8.53 seconds per caption generation
  • Final Optimized Latency: ~3.85 seconds per caption on a virtualized Docker CPU!
My container is now lean, mean, and generates captions faster than I can say "Parameter-Efficient Fine-Tuning"!

Any comments? Feel free to participate below in the Facebook comment section.