How Low-Rank Adaptation Works
LoRA (Low-Rank Adaptation) – Detailed Explanation
1. What is LoRA?
LoRA is a method to fine-tune large neural networks without changing most of their original weights.
Instead of updating a huge weight matrix directly, LoRA adds a small, trainable "adapter" that learns the change.
Key idea:
Keep original weights frozen, and only learn a small correction.
Why this matters:
- Much faster training
- Much less memory usage
- Easy to switch between different tasks
2. The Core Mathematical Idea
In a normal neural network layer, we have a weight matrix:
W (size: d x k)
Normally during training, we update W directly.
With LoRA, we do NOT change W. Instead, we approximate the update like this:
W' = W + ΔW
But instead of learning ΔW as a full matrix, we break it into two smaller matrices:
ΔW = A × B
Where:
- A is size (d x r)
- B is size (r x k)
- r is a small number (rank), like 4, 8, or 16
So instead of learning d×k parameters, we only learn r×(d+k).
This is the "low-rank" idea.
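To make the shapes concrete, here is a minimal NumPy sketch (d, k, and r are illustrative sizes, not values from any particular model):

import numpy as np

d, k, r = 512, 512, 8

W = np.random.randn(d, k)   # frozen pretrained weight
A = np.random.randn(d, r)   # trainable low-rank factor
B = np.random.randn(r, k)   # trainable low-rank factor

delta_W = A @ B             # (d x r) @ (r x k) -> (d x k)
assert delta_W.shape == W.shape   # the update matches W's shape exactly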
3. Why Low-Rank Works
In practice, the weight changes needed to adapt a large pretrained model tend to be highly redundant: empirically, they have low "intrinsic rank".
This means the important changes can be represented in a much lower-dimensional space.
Think of it like:
- Full matrix = full detail image
- Low-rank = compressed version with key features
Example:
If W is 4096×4096, that is ~16 million parameters.
If r = 8:
- A = 4096×8 → ~32K params
- B = 8×4096 → ~32K params
- Total = ~64K params
That is roughly 256× fewer trainable parameters than full fine-tuning.
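You can verify the arithmetic directly:

d = k = 4096
r = 8

full_params = d * k                 # 16,777,216 (~16.8M)
lora_params = r * (d + k)           # 65,536 (~64K)
print(full_params // lora_params)   # 256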
4. How LoRA is Applied in Transformers
LoRA is usually applied to the linear projections inside attention layers, especially:
- Query (Q)
- Key (K)
- Value (V)
- Output projection
Example for a linear layer:
Original:
y = W x
With LoRA:
y = W x + (A × B × x)
Or equivalently:
y = (W + A×B) x
During training:
- W is frozen
- A and B are trained
During inference:
- You can merge A×B into W
- Or keep them separate
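A quick NumPy check that the two forms are equivalent, and that merging removes the extra inference cost (sizes are illustrative):

import numpy as np

d, k, r = 64, 64, 4
W = np.random.randn(d, k)
A = np.random.randn(d, r)
B = np.random.randn(r, k)
x = np.random.randn(k)

y_separate = W @ x + A @ (B @ x)   # adapter kept separate (as during training)
W_merged = W + A @ B               # fold the update into W once
y_merged = W_merged @ x            # plain matmul at inference

assert np.allclose(y_separate, y_merged)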
5. Step-by-Step Training Flow
Step 1: Load pretrained model
Step 2: Freeze all original weights
Step 3: Insert LoRA layers (A and B)
Step 4: Train only A and B
Step 5: Save LoRA weights
Example (conceptual):
for each batch:
    output = model_with_lora(input)
    loss = compute_loss(output, target)
    backpropagate(loss)   # gradients flow only into A and B
    update(A, B)          # W stays frozen
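In practice you rarely wire this up by hand. Here is a hedged sketch using the Hugging Face PEFT library, assuming it is installed; the base model ("gpt2") and target module name ("c_attn") are illustrative and depend on the architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

config = LoraConfig(
    r=8,                          # rank of the update
    lora_alpha=16,                # scaling factor (alpha)
    target_modules=["c_attn"],    # GPT-2's combined Q/K/V projection; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)   # freezes base weights, injects A and B
model.print_trainable_parameters()     # only the low-rank factors are trainable

# ... train as usual, then save just the adapter:
model.save_pretrained("my-lora-adapter")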
6. Intuition with a Simple Example
Imagine a model trained to describe images.
Original model output:
"A fruit on a table"
After LoRA fine-tuning on a dataset of apples:
"Two red apples on a wooden table"
LoRA didn't relearn everything. It only learned:
- How to count
- How to be more specific
7. Real Use Case: Chatbot Personalization
Base model:
User: Hello
Model: Hello! How can I help you?
After LoRA trained on company tone:
User: Hello
Model: Welcome! How may we assist you today?
Only small style changes were learned.
8. Choosing Rank (r)
Rank controls capacity.
- Small r (e.g. 4): very efficient, less expressive
- Medium r (e.g. 8–16): good balance
- Large r (e.g. 64+): closer to full fine-tuning
Rule of thumb:
- Simple task → small r
- Complex domain → larger r
9. Scaling Factor (Alpha)
LoRA often uses a scaling factor:
W' = W + (α / r) × (A × B)
Where:
- α controls how strong the LoRA update is
Example:
- α = 16, r = 8 → scale = 2
This keeps the effective update strength consistent when you change r, which stabilizes training.
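As a small self-contained sketch of the scaled update (sizes illustrative):

import numpy as np

d, k, r, alpha = 64, 64, 8, 16
W = np.random.randn(d, k)
A = np.random.randn(d, r)
B = np.random.randn(r, k)
x = np.random.randn(k)

scale = alpha / r                   # 16 / 8 = 2
y = W @ x + scale * (A @ (B @ x))   # scaled LoRA forward pass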
10. Advantages of LoRA
- Very memory efficient
- Fast training
- Can store many task-specific adapters
- No need to retrain full model
11. Limitations
- May not capture very complex changes if rank is too small
- Requires careful tuning of r and α
- Not always as powerful as full fine-tuning
12. LoRA vs Full Fine-Tuning
Full fine-tuning:
- Updates all weights
- Very expensive
- Best performance (sometimes)
LoRA:
- Updates small matrices
- Cheap and fast
- Slightly less flexible
13. Practical Example (PyTorch Sketch)
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, W, r, alpha=1.0):
        super().__init__()
        d, k = W.shape
        self.W = torch.nn.Parameter(W, requires_grad=False)    # frozen pretrained weight
        self.A = torch.nn.Parameter(torch.zeros(d, r))         # zero init, so delta-W starts at 0
        self.B = torch.nn.Parameter(torch.randn(r, k) * 0.01)  # small random init
        self.scale = alpha / r

    def forward(self, x):
        return self.W @ x + self.scale * (self.A @ (self.B @ x))
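A quick usage check, continuing from the class above (shapes illustrative):

W = torch.randn(16, 32)              # stand-in for a pretrained weight
layer = LoRALinear(W, r=4, alpha=8)

x = torch.randn(32)
y = layer(x)                         # equals W @ x at initialization, since A is zero

trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)                     # ['A', 'B'] -- W stays frozen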
14. Where LoRA is Used
- Large language models (LLMs)
- Image models (diffusion, captioning)
- Speech models
- Recommendation systems
15. Final Intuition
Think of LoRA as:
- Original model = knowledge base
- LoRA = small "patch" or "skill"
Instead of rewriting the whole brain, you just attach a small module that adjusts behavior.
This is why LoRA is powerful: it adds new abilities without rebuilding everything.