How Low-Rank Adaptation Works - Take 2
Question
Using your example in "3. Why Low-Rank Works"
in this post, my understanding is this:
Without LoRA:
# of params during training = 4096 × 4096
# of params used during inference = 4096 × 4096
With LoRA:
# of params during training = 4096 × 8 + 8 × 4096
# of params used during inference = 4096 × 4096 + 4096 × 8 + 8 × 4096
Am I correct?
Answer
Parameter Count: Training vs Inference
This is a very important point, and your understanding is almost correct. Let’s clarify it carefully.
Without LoRA (Full Fine-Tuning):
- Training parameters = 4096 × 4096 = 16,777,216 (≈ 16.8 million)
- Inference parameters = 4096 × 4096 = 16,777,216 (≈ 16.8 million)
In this case:
- All parameters are updated during training
- The same full matrix is used during inference
With LoRA:
We split into two parts:
- Original matrix W (frozen)
- LoRA matrices A and B (trainable)
During Training:
- Trainable parameters = 4096 × 8 + 8 × 4096 = 65,536
- Frozen parameters = 4096 × 4096 (not updated)
So:
- Total parameters in model = still ~16.8 million (the frozen W plus the LoRA factors)
- But only ~65K parameters are actually trained (about 0.4% of the total)
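The counts above can be sanity-checked with a few lines of arithmetic (a minimal sketch; `d` and `r` are just shorthand for the 4096 hidden size and rank 8 used in this example):

```python
# Sanity-check the LoRA parameter counts from the example above.
d, r = 4096, 8  # hidden size and LoRA rank from the post

frozen = d * d               # original weight W, not updated
trainable = d * r + r * d    # LoRA factors A (d x r) and B (r x d)

print(f"frozen:    {frozen:,}")       # 16,777,216
print(f"trainable: {trainable:,}")    # 65,536
print(f"trainable fraction: {trainable / frozen:.2%}")  # 0.39%
```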
During Inference (Two Options):
Option 1: Keep LoRA Separate
- Parameters used = W + A + B
- = 4096×4096 + 4096×8 + 8×4096 = 16,842,752
This matches your understanding exactly.
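Here is a minimal numpy sketch of the unmerged forward pass. The dimensions are scaled down (16 and 2 standing in for 4096 and 8) so it runs instantly; initializing B to zero follows the common LoRA convention, so the adapted layer starts out identical to the frozen one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # scaled-down stand-ins for 4096 and 8

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01  # LoRA down-projection (trainable)
B = np.zeros((r, d))                    # LoRA up-projection, initialized to zero

x = rng.standard_normal((1, d))         # a single input row

# Option 1: keep LoRA separate -- the input goes through W
# and through the low-rank path A, B, and the results are summed.
y = x @ W + (x @ A) @ B
print(y.shape)  # (1, 16)
```

Because B starts at zero, the LoRA path contributes nothing until training updates it.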
Option 2: Merge LoRA into W (Common in practice)
- Compute: W' = W + A×B
- Use only W' during inference
After merging:
- Parameters used = 4096 × 4096
- No extra computation from A and B
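The merge can be verified numerically: the separate-path output and the merged-weight output agree, since x·W + (x·A)·B = x·(W + A·B). A small numpy sketch (scaled-down dimensions, random weights for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # scaled-down stand-ins for 4096 and 8
W = rng.standard_normal((d, d))
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
x = rng.standard_normal((3, d))

# Option 2: fold the low-rank update into W once, before serving.
W_merged = W + A @ B

y_separate = x @ W + (x @ A) @ B   # option 1 at inference time
y_merged = x @ W_merged            # option 2: a single matmul
print(np.allclose(y_separate, y_merged))  # True
```

After the merge, inference uses one 4096×4096 matrix, exactly as in the unadapted model.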
Key Insight:
- LoRA reduces training cost, not necessarily total model size
- Inference cost can stay the same (after merging)
- The main savings come from not updating the full matrix
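One way to make the training savings concrete: optimizers like Adam keep extra state (roughly two values, momentum and variance) per trainable parameter, so shrinking the trainable set shrinks optimizer memory too. A rough back-of-the-envelope sketch (the factor of 2 is an assumption about Adam-style optimizers, not something from the post):

```python
# Rough optimizer-memory comparison, assuming an Adam-style optimizer
# that stores two extra values per trainable parameter.
d, r = 4096, 8

full_ft_states = 2 * d * d             # full fine-tuning: states for all of W
lora_states = 2 * (d * r + r * d)      # LoRA: states only for A and B

print(f"full fine-tuning: {full_ft_states:,} extra values")  # 33,554,432
print(f"LoRA:             {lora_states:,} extra values")     # 131,072
print(f"ratio: {full_ft_states // lora_states}x")            # 256x
```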
Simple Analogy:
Think of W as a large book.
- Full fine-tuning = rewriting the entire book
- LoRA = writing a small note that modifies parts of the book
During inference:
- You can either read the book + note together
- Or rewrite the book once with the note applied
Both give the same result, but LoRA made training much cheaper.