Learning Rate Schedule With Cosine Annealing
I wrote an app that classifies objects using transfer learning, and I trained the model on the CIFAR-10 dataset myself.
CosineAnnealingLR: A Smooth and Principled Learning Rate Schedule
Why Learning Rate Scheduling Matters
In transfer learning, especially with a pretrained backbone like
ResNet-18, the learning rate plays a disproportionate role in
final validation accuracy. Early in training, we want sufficiently large
updates to adapt pretrained features to the new task. Later on, however,
large learning rates become harmful—they cause the optimizer to overshoot
good minima and introduce unnecessary noise into fine-grained weight
adjustments.
A learning rate scheduler exists precisely to manage this trade-off. Among
many options,
CosineAnnealingLR is one of the most effective and
theoretically grounded schedules for transfer learning.
How CosineAnnealingLR Works
CosineAnnealingLR gradually decreases the learning rate following a cosine
curve rather than a linear or step-wise decay. The learning rate at epoch
t (or training step, if the scheduler is stepped per batch) is defined as:
LR(t) = LR_min + 0.5 × (LR_max − LR_min) × (1 + cos(π × t / T))
Where:
LR_max is the initial learning rate
LR_min is the minimum learning rate
T is the total number of epochs (or steps)
t is the current epoch
At the beginning of training, the cosine term is close to 1, so the learning
rate stays near
LR_max. As training progresses, the cosine value
smoothly decreases toward −1, causing the learning rate to decay gently and
continuously until it reaches
LR_min at the final epoch.
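To make the formula concrete, here is a minimal sketch that evaluates the closed form by hand and compares it with PyTorch's built-in CosineAnnealingLR. The learning rates, epoch count, and dummy parameter are illustrative assumptions, not values from this project.

import math
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

lr_max, lr_min, T = 1e-3, 1e-5, 50   # assumed initial LR, floor LR, total epochs

def cosine_lr(t):
    # LR(t) = LR_min + 0.5 * (LR_max - LR_min) * (1 + cos(pi * t / T))
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# PyTorch's scheduler traces the same curve when stepped once per epoch.
params = [torch.nn.Parameter(torch.zeros(1))]      # dummy parameter
optimizer = torch.optim.SGD(params, lr=lr_max)
scheduler = CosineAnnealingLR(optimizer, T_max=T, eta_min=lr_min)

for epoch in range(T + 1):
    print(f"epoch {epoch:2d}  manual {cosine_lr(epoch):.6f}  "
          f"torch {scheduler.get_last_lr()[0]:.6f}")
    optimizer.step()       # the real training loop would run here
    scheduler.step()       # advance the schedule once per epoch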
Why Cosine Decay Is Better Than Step Decay
Traditional step schedulers abruptly reduce the learning rate at predefined
epochs (e.g., divide by 10 at epoch 30). These sudden drops can disrupt
optimization, especially when fine-tuning pretrained models where stability
is critical.
CosineAnnealingLR avoids this problem entirely:
- No sudden learning rate jumps
- No hand-tuned decay milestones
- Continuous, smooth reduction of update magnitude
This smooth decay aligns well with gradient-based optimization, allowing the
model to transition naturally from coarse adaptation to fine-grained
refinement.
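For a rough side-by-side in PyTorch (the stand-in model, milestones, and learning rates below are illustrative assumptions; in practice you would attach exactly one of the two schedulers to the optimizer):

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR

model = nn.Linear(512, 10)     # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Step decay: the LR is divided by 10 at hand-picked epochs (30 and 60 here).
step_schedule = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

# Cosine annealing: smooth, continuous decay toward eta_min over 90 epochs,
# with no decay milestones to hand-tune.
cosine_schedule = CosineAnnealingLR(optimizer, T_max=90, eta_min=1e-5)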
What It Does to the Loss Landscape
Early in training, higher learning rates encourage exploration of the loss
surface and help the model escape sharp or suboptimal minima inherited from
ImageNet pretraining. As the learning rate decays, the optimizer’s trajectory
becomes more conservative, allowing it to settle into flatter, more stable
minima.
Flatter minima are strongly correlated with better generalization. In practice,
this means better validation accuracy and less sensitivity to small dataset
shifts.
Why CosineAnnealingLR Works Especially Well for ResNet-18
ResNet-18 has relatively limited capacity compared to deeper architectures,
which makes optimization stability particularly important. Cosine annealing
helps in several ways:
- It prevents over-updating early convolutional layers when they are partially unfrozen
- It allows the classifier head to converge quickly, then refine gradually
- It complements regularizers like data augmentation and label smoothing
Because ResNet-18 converges quickly, a smooth decay schedule ensures that later
epochs are still productive rather than noisy.
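A minimal sketch of this setup follows, assuming torchvision >= 0.13 for the weights argument and a hypothetical train_one_epoch helper; the class count, epoch budget, and learning rates are illustrative, not values from this project.

import torch
import torch.nn as nn
from torchvision import models
from torch.optim.lr_scheduler import CosineAnnealingLR

NUM_CLASSES, EPOCHS = 10, 30                       # illustrative values

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone and attach a fresh classifier head.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new head stays trainable

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-4)

for epoch in range(EPOCHS):
    # train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()                      # one scheduler step per epoch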
Expected Validation Accuracy Gains
In practical transfer learning setups with ResNet-18 (small to medium-sized
datasets), CosineAnnealingLR typically yields:
- +0.5% to +1.5% improvement in validation accuracy compared to a constant
learning rate
- More stable validation curves with fewer late-epoch regressions
- Faster convergence to peak validation accuracy
While the exact gain depends on dataset size, augmentation strength, and
fine-tuning depth, CosineAnnealingLR is rarely neutral—it almost always improves
either peak accuracy, stability, or both.
Interaction with Two-Stage Fine-Tuning
Cosine annealing pairs particularly well with two-stage fine-tuning:
Stage 1: Higher effective learning rates help quickly adapt the classifier
and upper layers.
Stage 2: As deeper layers are unfrozen, the naturally lower learning rate
prevents catastrophic forgetting of pretrained features.
This makes CosineAnnealingLR an excellent default choice when progressively
unfreezing ResNet blocks.
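One simple way to wire this up is to give each stage its own cosine schedule, with a lower peak learning rate in the second stage. The sketch below follows that variant; the stage lengths, learning rates, and train_one_epoch helper are illustrative assumptions, not the exact recipe used in this project.

import torch
import torch.nn as nn
from torchvision import models
from torch.optim.lr_scheduler import CosineAnnealingLR

STAGE1_EPOCHS, STAGE2_EPOCHS, NUM_CLASSES = 10, 20, 10   # illustrative

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Stage 1: adapt the classifier head with a higher peak LR and its own cosine decay.
opt1 = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
sched1 = CosineAnnealingLR(opt1, T_max=STAGE1_EPOCHS, eta_min=1e-4)
for epoch in range(STAGE1_EPOCHS):
    # train_one_epoch(model, opt1)   # hypothetical training helper
    sched1.step()

# Stage 2: unfreeze the deepest block and fine-tune everything trainable
# at a lower peak LR, again annealed toward a small floor.
for p in model.layer4.parameters():
    p.requires_grad = True
opt2 = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                       lr=1e-3, momentum=0.9)
sched2 = CosineAnnealingLR(opt2, T_max=STAGE2_EPOCHS, eta_min=1e-5)
for epoch in range(STAGE2_EPOCHS):
    # train_one_epoch(model, opt2)   # hypothetical training helper
    sched2.step()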
Key Takeaway
CosineAnnealingLR is more than a cosmetic learning rate decay—it encodes a
strong optimization prior: explore first, refine later. For transfer learning
with ResNet-18, it provides smoother optimization, more stable convergence, and
consistently higher validation accuracy with minimal tuning effort.
In short, it is a
low-risk, high-reward scheduler that should be part of
any serious transfer learning baseline.