How Low-Rank Adaptation Works - Take 3
Question
Are you saying both "Option 1: Keep LoRA Separate" and "Option 2: Merge LoRA into W" referenced
in this post are used in practice?
Answer
Yes, both are used in practice, but they serve slightly different purposes, and which one you see depends on the scenario.
When "Keep LoRA Separate" is Used
This is very common during development and in flexible systems.
Typical situations:
- Serving multiple LoRA adapters on the same base model
- Switching tasks dynamically (e.g., different customers, styles, domains)
- Experimentation and evaluation
Why keep them separate?
- You can load/unload LoRA weights quickly
- Storage is very small (only A and B)
- No need to duplicate the large base model
Example:
One base LLM:
- LoRA #1 → legal domain
- LoRA #2 → medical domain
- LoRA #3 → customer support tone
You simply swap LoRA weights depending on the request.
Trade-off:
- Slightly more computation during inference: the extra (A×B)x term is computed on every forward pass (see the sketch below)
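To make this concrete, here is a minimal NumPy sketch (toy dimensions and random matrices standing in for trained weights) of one frozen base weight W shared across several swappable (A, B) adapter pairs. Swapping adapters is just picking a different key, and the extra (A×B)x work shows up inside the forward call.

```python
import numpy as np

# Toy dimensions for a single linear layer of the base model.
d_out, d_in, rank = 8, 16, 2

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen base weight, shared by all tasks

# One small (A, B) pair per adapter; only these are stored per task.
adapters = {
    "legal":   (rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))),
    "medical": (rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))),
    "support": (rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))),
}

def forward(x, domain):
    """Base output plus the low-rank correction for the requested domain."""
    A, B = adapters[domain]                  # swap adapters by picking a key
    return W @ x + A @ (B @ x)               # extra (A×B)x work on every call

x = rng.normal(size=d_in)
y_legal = forward(x, "legal")
y_medical = forward(x, "medical")            # same base W, different adapter
```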
When "Merge LoRA into W" is Used
This is common in production deployments where performance matters.
Typical situations:
- Single fixed task model
- Latency-sensitive applications
- Edge deployment or optimized inference pipelines
Why merge?
- No extra computation during inference
- Simpler model graph
- Better compatibility with optimized runtimes
What happens technically:
W' = W + A×B
After this:
- A and B are no longer needed
- The model behaves like a fully fine-tuned model
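A minimal NumPy sketch of the merge itself (again with toy, random matrices in place of trained weights): once W' = W + A×B has been formed, inference is a single matrix multiply, the outputs are identical to the separate-adapter formulation, and A and B can be discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2

W = rng.normal(size=(d_out, d_in))           # frozen base weight
A = rng.normal(size=(d_out, rank))           # trained LoRA factors
B = rng.normal(size=(rank, d_in))

# Fold the low-rank update into the base weight: W' = W + A×B.
W_merged = W + A @ B

x = rng.normal(size=d_in)
y_separate = W @ x + A @ (B @ x)             # adapter kept separate
y_merged = W_merged @ x                      # single matmul, A and B discarded

assert np.allclose(y_separate, y_merged)     # same output, less work per call
```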
How Practitioners Usually Work
In real workflows, people often do both:
- Train LoRA → keep adapters separate (flexibility)
- Evaluate multiple LoRAs
- Choose the best one
- Merge it into W for final deployment (a rough library-level sketch follows below)
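As a rough sketch of that workflow using the Hugging Face peft library (the model name and adapter paths below are placeholders, not real checkpoints, and the evaluation loop is omitted), the flexibility-then-merge pattern might look like this:

```python
# Hypothetical workflow with Transformers + PEFT; names/paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")

# Flexibility phase: load a candidate adapter on top of the frozen base model
# and evaluate it; repeat with other adapters to pick the best one.
candidate = PeftModel.from_pretrained(base, "path/to/best-lora-adapter")

# Deployment phase: fold the winning adapter into the base weights so the
# model runs as a plain network with no extra LoRA computation at inference.
merged = candidate.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```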
Key Takeaway
- Keeping LoRA separate = flexibility and modularity
- Merging LoRA = speed and simplicity
So your understanding is right, and the important nuance is this: LoRA is not just about reducing trainable parameters; it also enables a modular way to adapt models.