Step 3: Train Florence-2 Base Model On The MS COCO Dataset Using PEFT
The Great AI Makeover: Ditching the Frankenstein Setup
Once upon a time, my app relied on BLIP for reading images and YOLOv8 for drawing boxes. It was like hiring two separate guys to do a job that one smart intern could do. It was heavy, it ate hard drive space, and it moved like molasses in winter.
Enter the Intern: Florence-2-base
I fired BLIP and YOLO and hired Microsoft's Florence-2-base as the base model. But why Florence-2 specifically? Because it is a unified "Vision-Language" wizard. Instead of piping images through two massive independent networks, Florence-2 accepts a text prompt like "<OD>" (Object Detection) or "<CAPTION>" and does everything internally. It outputs bounding boxes and text in a single pass!
- Less Baggage: I completely removed the massive Ultralytics library and separate BLIP checkpoints from my Docker image to keep things extremely lean.
- Unified Pipeline: One model, one processor, zero headaches.
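To make that concrete, here is a minimal sketch of the single-pass prompt interface, roughly following the public model card for microsoft/Florence-2-base (the example.jpg path and the token budget are just placeholders, not my app's exact code):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the unified model and its processor (Florence-2 ships its own custom code)
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

image = Image.open("example.jpg")   # placeholder image
task = "<CAPTION>"                  # swap in "<OD>" to get bounding boxes instead

# One pass: the task token steers the same network toward captioning or detection
inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(raw_text, task=task, image_size=(image.width, image.height))
print(result)   # a caption for <CAPTION>, labels plus boxes for <OD>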
Kaggle Bootcamp: Why PEFT, or Parameter-Efficient Fine-Tuning?
You might ask: "Isn't Florence-2-base smart enough out of the box?" Yes, it is brilliant at zero-shot generalization. But like any bright intern, it tends to format answers unpredictably. I needed it to spit out strict, reliable captions aligned exactly with the MS COCO Dataset style.
Fully fine-tuning all 230 million parameters of a model like this is a heavyweight job, far more than I wanted to throw at a free Kaggle GPU. Instead, I used PEFT (Parameter-Efficient Fine-Tuning) on Kaggle. By attaching tiny "LoRA" adapters to the model's Attention layers, I only actually trained about 1% of the network. The base intelligence remained untouched, but it learned my exact style requirements!
Wait, how does LoRA actually work?
Imagine the base model as a gigantic encyclopedia. Instead of using an eraser and painfully rewriting every single page (which would completely melt my GPU), LoRA just slides a small, highly intelligent sticky note over the text.
Mathematically, a neural network layer is just a massive grid of numbers, say a 10,000 by 10,000 matrix. Updating that giant original grid means touching 100 million parameters! LoRA cheats the system by splitting the required update into two much smaller bottleneck grids: a 10,000 by 8 matrix and an 8 by 10,000 matrix. When you multiply those two tiny grids together, they expand back into a full 10,000 by 10,000 ghost overlay!
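If you want to sanity-check that arithmetic, here is a tiny PyTorch sketch (the 10,000 and 8 come straight from the analogy above; the 16 by 2 toy matrices are purely for illustration):

import torch

# Counting the numbers in the analogy above (no training, just arithmetic)
d, r = 10_000, 8
print(d * d)           # 100,000,000 numbers in the original grid
print(d * r + r * d)   # 160,000 numbers in the two LoRA grids

# A toy version of the "ghost overlay": two skinny matrices expand into a full one
B = torch.randn(16, 2)   # the tall (d x r) matrix
A = torch.randn(2, 16)   # the wide (r x d) matrix
delta_W = B @ A          # a full 16 x 16 overlay built from only 64 trained numbers
print(delta_W.shape)     # torch.Size([16, 16])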
The magic here is that I only had to train 160,000 numbers instead of 100 million. By "taping" these tiny LoRA matrices directly onto the model's underlying Attention layers, I left the core intelligence of Florence-2 completely untouched while it learned my exact MS COCO formatting rules!
# Initializing the LoRA adapters
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank "sticky note" matrices
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
    task_type="CAUSAL_LM",
)

# Wrap the already-loaded Florence-2 base model with the adapters
model = get_peft_model(model, lora_config)
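PEFT can also back up that "1%" claim directly; after wrapping the model, you can print the trainable fraction:

# Prints trainable params, total params, and the trainable percentage
model.print_trainable_parameters()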
I also had to dodge some nasty HuggingFace bugs! The standard saving tools crashed on Florence-2's tied weights, so I had to extract the raw PyTorch adapter dictionaries manually to survive.
# Bypassing the crash by grabbing the raw adapter dicts directly from memory
from peft import get_peft_model_state_dict
from safetensors.torch import save_file

lora_state_dict = get_peft_model_state_dict(model)
save_file(lora_state_dict, "adapter_model.safetensors")
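Loading those adapters back later is the mirror image of the save. A small sketch, assuming the same Florence-2 base has already been wrapped with the identical LoraConfig:

from peft import set_peft_model_state_dict
from safetensors.torch import load_file

# Re-attach the trained LoRA weights to the freshly wrapped PEFT model
lora_state_dict = load_file("adapter_model.safetensors")
set_peft_model_state_dict(model, lora_state_dict)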
The CPU Speed Demon: 8-Bit Quantization
When I successfully glued the LoRA brains onto Florence-2 locally, my Docker container's CPU was still struggling, taking nearly 9 seconds per caption. I needed to chop that inference overhead down to size.
- The Memory Diet: I squished all incoming images down to 384x384. This mathematically deleted 75% of the visual patches fed to the DaViT vision encoder!
- Greedy Mode: I turned off beam search by setting num_beams=1, so the model just takes the single best next word instead of scoring complex alternate realities (see the sketch after the fusion code below).
- LoRA Fusion: I permanently baked the LoRA weights into the base layers so the computer didn't have to calculate two matrices every time.
- 8-Bit Quantization: I magically shrank the massive 32-bit floating point layers down to 8-bit integers. Since CPU inference is bottlenecked by memory bandwidth, this doubled the data my CPU could swallow per second!
# Fusing the LoRA matrices permanently into the base layers
import torch

model = model.merge_and_unload()

# Turbocharging CPU inference: dynamic 8-bit quantization of the Linear layers
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
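The first two tricks on that list, the 384x384 resize and greedy decoding, live in the generation call rather than in the model itself. Here is a rough sketch of how they fit together; the plain PIL resize and token budget are illustrative, and it assumes the processor is configured to keep 384x384 inputs instead of scaling them back up to its default size:

from PIL import Image

image = Image.open("example.jpg").resize((384, 384))   # a quarter of the pixels of a 768x768 input

inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=64,
        num_beams=1,       # greedy: take the single best token at each step
        do_sample=False,
    )
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)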
Final Report Card
The transformation from the clunky BLIP+YOLO setup to an optimized Florence-2 microservice has been staggering.
- Base Model Footprint: ~920 MB loaded in memory
- LoRA Adapter Weights: ~15 MB physically stored on disk
- Total Docker Image Size: 1.94 GB (including a slim Python base image, CPU-only PyTorch, and the plain Transformers library)
- Original Latency: ~8.53 seconds per caption generation
- Final Optimized Latency: ~3.85 seconds per caption on a virtualized Docker CPU!
My container is now lean, mean, and generates captions faster than I can say "Parameter-Efficient Fine-Tuning"!
Any comments? Feel free to participate below in the Facebook comment section.