Optimizers
A cheat sheet to help you choose the right optimizer based on problem characteristics such as dataset size, gradient noise, and required convergence speed.
🛠 Optimizer Cheat Sheet
SGD (Stochastic Gradient Descent)
Best for: large datasets, online learning
Pros: efficient on large data; tends to generalize well
Cons: requires careful learning-rate tuning; slower convergence

Momentum SGD
Best for: high-dimensional problems
Pros: faster convergence than vanilla SGD
Cons: still needs learning-rate tuning

Nesterov Accelerated Gradient (NAG)
Best for: smooth, convex problems
Pros: improves on momentum by evaluating the gradient at the look-ahead position
Cons: slightly more complex to implement

AdaGrad
Best for: sparse data (e.g., NLP, recommendation systems)
Pros: adapts the learning rate per parameter, so infrequent features get larger updates
Cons: the accumulated squared gradients shrink the effective learning rate too much over time

RMSProp
Best for: non-stationary problems (e.g., deep learning, RNNs)
Pros: adapts the learning rate with an exponential moving average, avoiding AdaGrad's decay issue
Cons: the decay parameter needs tuning

Adam (Adaptive Moment Estimation)
Best for: most deep learning tasks
Pros: combines momentum with RMSProp-style adaptive learning rates
Cons: can generalize worse than well-tuned SGD and sometimes overfits

AdamW
Best for: deep learning with weight decay (regularization)
Pros: decoupled weight decay gives better generalization than Adam
Cons: the weight-decay strength must be tuned

Adadelta
Best for: limited-memory situations
Pros: no initial learning rate to set
Cons: less commonly used

Nadam (Adam + NAG)
Best for: when faster convergence is needed
Pros: adds Nesterov look-ahead momentum to Adam
Cons: slightly more computation per step than Adam

L-BFGS (Limited-memory BFGS)
Best for: small datasets, convex problems
Pros: approximate second-order optimization without storing the full Hessian
Cons: does not scale to large deep learning workloads
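As a concrete reference, the sketch below shows how each optimizer in the cheat sheet is typically instantiated with PyTorch's torch.optim. The toy model and every hyperparameter value are placeholders to tune, not recommendations.

```python
import torch
import torch.nn as nn

# Toy model standing in for your real network.
model = nn.Linear(10, 1)
params = list(model.parameters())  # materialize once so the list can be reused below

# Hyperparameter values are illustrative starting points only.
sgd      = torch.optim.SGD(params, lr=0.01)                               # vanilla SGD
momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)                 # Momentum SGD
nag      = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)  # Nesterov (NAG)
adagrad  = torch.optim.Adagrad(params, lr=0.01)                           # AdaGrad: per-parameter adaptive lr
rmsprop  = torch.optim.RMSprop(params, lr=1e-3, alpha=0.99)               # RMSProp: alpha is the decay to tune
adam     = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))          # Adam: momentum + adaptive scaling
adamw    = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)          # AdamW: decoupled weight decay
adadelta = torch.optim.Adadelta(params)                                   # Adadelta: no initial lr to hand-tune
nadam    = torch.optim.NAdam(params, lr=2e-3)                             # Nadam: Adam + Nesterov look-ahead

# L-BFGS is different: step() must be given a closure that recomputes the loss,
# because the optimizer may evaluate the objective several times per step.
lbfgs = torch.optim.LBFGS(params, max_iter=20)
```

All of the first-order optimizers above share the same zero_grad / backward / step training loop; only L-BFGS requires the closure form optimizer.step(closure).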
🔍 How to Choose?
🔢 Based on Dataset Size
Small datasets → L-BFGS, Adam
Medium datasets → Adam, RMSProp
Large datasets → SGD, Momentum SGD, AdamW
⚡️ Based on Convergence Speed
Need fast convergence? → Adam, Nadam
Can tolerate slower but more stable training? → SGD with Momentum, RMSProp
🌊 Based on Data Type
Sparse data (e.g., NLP, embeddings) → AdaGrad, RMSProp (see the sparse-data sketch after this list)
Dense data (e.g., vision, tabular) → Adam, SGD with Momentum
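For the sparse-data case, here is a minimal sketch assuming PyTorch: an nn.Embedding layer with sparse=True produces sparse gradients, which torch.optim.Adagrad accepts directly (plain Adam does not; PyTorch offers SparseAdam for that). Sizes and the learning rate are placeholders.

```python
import torch
import torch.nn as nn

# Sparse setting: a large embedding table where each batch touches only a few rows.
vocab_size, dim = 100_000, 64
embedding = nn.Embedding(vocab_size, dim, sparse=True)  # emits sparse gradients

# AdaGrad handles sparse gradients and gives rarely-seen features larger updates.
optimizer = torch.optim.Adagrad(embedding.parameters(), lr=0.05)

token_ids = torch.randint(0, vocab_size, (32,))  # toy batch of token indices
loss = embedding(token_ids).pow(2).mean()        # placeholder loss

optimizer.zero_grad()
loss.backward()      # embedding.weight.grad is a sparse tensor here
optimizer.step()
```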
🔀 Based on Noise in Gradients
Noisy gradients (e.g., reinforcement learning, RNNs, online learning) → RMSProp, Adam
Stable gradients (e.g., batch training, tabular data) → SGD with Momentum
🎯 General Recommendation
If unsure, start with Adam. If it overfits or behaves poorly, switch to SGD with Momentum or AdamW.
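The heuristics above can be condensed into a simple selector. This is a hypothetical helper (the name choose_optimizer, its thresholds, and its hyperparameters are illustrative, not standard), assuming PyTorch optimizers:

```python
import torch

def choose_optimizer(params, n_samples: int, sparse: bool = False,
                     noisy_gradients: bool = False, lr: float = 1e-3):
    """Hypothetical helper mirroring the cheat-sheet heuristics above.
    Thresholds and hyperparameters are illustrative, not prescriptive."""
    if sparse:
        return torch.optim.Adagrad(params, lr=1e-2)          # sparse data -> AdaGrad
    if noisy_gradients:
        return torch.optim.RMSprop(params, lr=lr)            # noisy gradients -> RMSProp
    if n_samples < 1_000_000:                                 # arbitrary "large dataset" cutoff
        return torch.optim.Adam(params, lr=lr)                # small/medium data -> Adam
    return torch.optim.SGD(params, lr=0.1, momentum=0.9)      # large data -> SGD with momentum
```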
When Adam is a Good Choice
Works well for most deep learning tasks (CNNs, RNNs, transformers, etc.).
Converges faster than plain SGD, especially in noisy environments.
Doesn’t require much hyperparameter tuning (default settings often work).
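To illustrate the "defaults usually work" point, here is a minimal Adam training-loop sketch in PyTorch using the stock defaults (lr=1e-3, betas=(0.9, 0.999)); the model, data, and loss are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Adam's stock defaults: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
optimizer = torch.optim.Adam(model.parameters())

x, y = torch.randn(256, 20), torch.randn(256, 1)  # placeholder data

for epoch in range(10):
    optimizer.zero_grad()        # clear accumulated gradients
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backward pass
    optimizer.step()             # adaptive per-parameter update
```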
❌ When Adam Might Not Be the Best Choice
Overfitting happens: Adam adapts too aggressively, leading to poor generalization → AdamW (Adam with weight decay)
You need the best generalization (e.g., image classification): Adam can converge too quickly to a suboptimal solution → SGD with Momentum
Training very large datasets (billion+ samples): Adam uses more memory due to per-parameter adaptive learning rates → SGD with Momentum
Training on reinforcement learning (RL) tasks: Adam struggles with high-variance gradients → RMSProp
Training deep networks (e.g., Transformers, LSTMs): Adam works but may lead to instability → AdamW (better stability)
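The AdamW recommendation hinges on one detail: in PyTorch's Adam, the weight_decay argument is folded into the gradient as L2 regularization (and then rescaled by the adaptive terms), whereas AdamW applies the decay to the weights directly. A sketch of the calls, with placeholder hyperparameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)

# Adam: weight_decay acts as L2 regularization added to the gradient,
# so it gets rescaled by the adaptive learning rates (often weaker than intended).
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight decay is decoupled from the gradient and applied to the weights
# directly, which is why it tends to generalize better. 0.01 is a common
# starting point, not a universal setting.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# SGD with momentum as the alternative when generalization matters most;
# it typically needs a larger (and usually scheduled) learning rate.
sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
```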
🚀 Practical Rule of Thumb
Start with Adam if you’re unsure.
If you see overfitting → try AdamW or SGD with Momentum.
If you need better generalization → use SGD with Momentum.
If you work with reinforcement learning → use RMSProp.
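For the reinforcement-learning / noisy-gradient case, a minimal sketch assuming PyTorch: RMSProp paired with gradient-norm clipping, a common way to tame high-variance gradients. The policy network, loss, and every hyperparameter here are placeholders.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

# RMSProp: the moving-average decay (alpha) and eps are the knobs to tune.
optimizer = torch.optim.RMSprop(policy.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)

def update(loss: torch.Tensor, max_grad_norm: float = 0.5) -> None:
    """One optimizer step with gradient clipping against noisy gradients."""
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before stepping; common in RL training loops.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()

# Placeholder "loss": in a real RL setup this would come from policy-gradient terms.
update(policy(torch.randn(16, 8)).pow(2).mean())
```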