Optimizers
A cheat sheet to help you choose the right optimizer based on problem characteristics such as dataset size, gradient noise, and required convergence speed.
🛠 Optimizer Cheat Sheet
SGD (Stochastic Gradient Descent)
Best for: large datasets, online learning
Pros: efficient on large data; tends to generalize well
Cons: requires careful learning-rate tuning; slower convergence

Momentum SGD
Best for: high-dimensional problems
Pros: faster convergence than vanilla SGD
Cons: still needs learning-rate tuning

Nesterov Accelerated Gradient (NAG)
Best for: smooth, convex problems
Pros: improves on momentum by evaluating the gradient at the look-ahead position
Cons: slightly more complex to implement

AdaGrad
Best for: sparse data (e.g., NLP, recommendation systems)
Pros: adapts the learning rate per parameter, so infrequent features get larger updates
Cons: the accumulated squared gradients shrink the effective learning rate too much over time

RMSProp
Best for: non-stationary problems (e.g., deep learning, RNNs)
Pros: adapts the learning rate with an exponential moving average, avoiding AdaGrad's decay issue
Cons: the decay parameter needs tuning

Adam (Adaptive Moment Estimation)
Best for: most deep learning tasks
Pros: combines momentum with RMSProp-style adaptive learning rates
Cons: can generalize worse than well-tuned SGD and sometimes overfits

AdamW
Best for: deep learning with weight decay (regularization)
Pros: decoupled weight decay gives better generalization than Adam
Cons: the weight-decay strength must be tuned

Adadelta
Best for: limited-memory situations
Pros: no initial learning rate to set
Cons: less commonly used

Nadam (Adam + NAG)
Best for: when faster convergence is needed
Pros: adds Nesterov look-ahead momentum to Adam
Cons: slightly more computation per step than Adam

L-BFGS (Limited-memory BFGS)
Best for: small datasets, convex problems
Pros: approximate second-order optimization without storing the full Hessian
Cons: does not scale to large deep learning workloads
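As a concrete reference, the sketch below shows how each optimizer in the cheat sheet is typically instantiated with PyTorch's torch.optim. The toy model and every hyperparameter value are placeholders to tune, not recommendations.

```python
import torch
import torch.nn as nn

# Toy model standing in for your real network.
model = nn.Linear(10, 1)
params = list(model.parameters())  # materialize once so the list can be reused below

# Hyperparameter values are illustrative starting points only.
sgd      = torch.optim.SGD(params, lr=0.01)                               # vanilla SGD
momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)                 # Momentum SGD
nag      = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)  # Nesterov (NAG)
adagrad  = torch.optim.Adagrad(params, lr=0.01)                           # AdaGrad: per-parameter adaptive lr
rmsprop  = torch.optim.RMSprop(params, lr=1e-3, alpha=0.99)               # RMSProp: alpha is the decay to tune
adam     = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))          # Adam: momentum + adaptive scaling
adamw    = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)          # AdamW: decoupled weight decay
adadelta = torch.optim.Adadelta(params)                                   # Adadelta: no initial lr to hand-tune
nadam    = torch.optim.NAdam(params, lr=2e-3)                             # Nadam: Adam + Nesterov look-ahead

# L-BFGS is different: step() must be given a closure that recomputes the loss,
# because the optimizer may evaluate the objective several times per step.
lbfgs = torch.optim.LBFGS(params, max_iter=20)
```

All of the first-order optimizers above share the same zero_grad / backward / step training loop; only L-BFGS requires the closure form optimizer.step(closure).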
🔍 How to Choose?
🔢 Based on Dataset Size
Small datasets → L-BFGS, Adam
Medium datasets → Adam, RMSProp
Large datasets → SGD, Momentum SGD, AdamW
⚡️ Based on Convergence Speed
Need fast convergence? → Adam, Nadam
Can tolerate slower but more stable training? → SGD with Momentum, RMSProp
🌊 Based on Data Type
Sparse data (e.g., NLP, embeddings) → AdaGrad, RMSProp (see the sparse-data sketch after this list)
Dense data (e.g., vision, tabular) → Adam, SGD with Momentum
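For the sparse-data case, here is a minimal sketch assuming PyTorch: an nn.Embedding layer with sparse=True produces sparse gradients, which torch.optim.Adagrad accepts directly (plain Adam does not; PyTorch offers SparseAdam for that). Sizes and the learning rate are placeholders.

```python
import torch
import torch.nn as nn

# Sparse setting: a large embedding table where each batch touches only a few rows.
vocab_size, dim = 100_000, 64
embedding = nn.Embedding(vocab_size, dim, sparse=True)  # emits sparse gradients

# AdaGrad handles sparse gradients and gives rarely-seen features larger updates.
optimizer = torch.optim.Adagrad(embedding.parameters(), lr=0.05)

token_ids = torch.randint(0, vocab_size, (32,))  # toy batch of token indices
loss = embedding(token_ids).pow(2).mean()        # placeholder loss

optimizer.zero_grad()
loss.backward()      # embedding.weight.grad is a sparse tensor here
optimizer.step()
```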
🔀 Based on Noise in Gradients
Noisy gradients (e.g., reinforcement learning, RNNs, online learning) → RMSProp, Adam
Stable gradients (e.g., batch training, tabular data) → SGD with Momentum
🎯 General Recommendation
If unsure, start with Adam. If it overfits or behaves poorly, switch to SGD with Momentum or AdamW.
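The heuristics above can be condensed into a simple selector. This is a hypothetical helper (the name choose_optimizer, its thresholds, and its hyperparameters are illustrative, not standard), assuming PyTorch optimizers:

```python
import torch

def choose_optimizer(params, n_samples: int, sparse: bool = False,
                     noisy_gradients: bool = False, lr: float = 1e-3):
    """Hypothetical helper mirroring the cheat-sheet heuristics above.
    Thresholds and hyperparameters are illustrative, not prescriptive."""
    if sparse:
        return torch.optim.Adagrad(params, lr=1e-2)          # sparse data -> AdaGrad
    if noisy_gradients:
        return torch.optim.RMSprop(params, lr=lr)            # noisy gradients -> RMSProp
    if n_samples < 1_000_000:                                 # arbitrary "large dataset" cutoff
        return torch.optim.Adam(params, lr=lr)                # small/medium data -> Adam
    return torch.optim.SGD(params, lr=0.1, momentum=0.9)      # large data -> SGD with momentum
```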
When Adam is a Good Choice
Works well for most deep learning tasks (CNNs, RNNs, transformers, etc.).
Converges faster than plain SGD, especially in noisy environments.
Doesn’t require much hyperparameter tuning (default settings often work).
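To illustrate the "defaults usually work" point, here is a minimal Adam training-loop sketch in PyTorch using the stock defaults (lr=1e-3, betas=(0.9, 0.999)); the model, data, and loss are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Adam's stock defaults: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
optimizer = torch.optim.Adam(model.parameters())

x, y = torch.randn(256, 20), torch.randn(256, 1)  # placeholder data

for epoch in range(10):
    optimizer.zero_grad()        # clear accumulated gradients
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backward pass
    optimizer.step()             # adaptive per-parameter update
```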
❌ When Adam Might Not Be the Best Choice
Overfitting happens: Adam adapts too aggressively, leading to poor generalization → AdamW (Adam with weight decay)
You need the best generalization (e.g., image classification): Adam can converge too quickly to a suboptimal solution → SGD with Momentum
Training very large datasets (billion+ samples): Adam uses more memory due to per-parameter adaptive learning rates → SGD with Momentum
Training on reinforcement learning (RL) tasks: Adam struggles with high-variance gradients → RMSProp
Training deep networks (e.g., Transformers, LSTMs): Adam works but may lead to instability → AdamW (better stability)
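The AdamW recommendation hinges on one detail: in PyTorch's Adam, the weight_decay argument is folded into the gradient as L2 regularization (and then rescaled by the adaptive terms), whereas AdamW applies the decay to the weights directly. A sketch of the calls, with placeholder hyperparameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)

# Adam: weight_decay acts as L2 regularization added to the gradient,
# so it gets rescaled by the adaptive learning rates (often weaker than intended).
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight decay is decoupled from the gradient and applied to the weights
# directly, which is why it tends to generalize better. 0.01 is a common
# starting point, not a universal setting.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# SGD with momentum as the alternative when generalization matters most;
# it typically needs a larger (and usually scheduled) learning rate.
sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
```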
🚀 Practical Rule of Thumb
Start with Adam if you’re unsure.
If you see overfitting → try AdamW or SGD with Momentum.
If you need better generalization → use SGD with Momentum.
If you work with reinforcement learning → use RMSProp.
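For the reinforcement-learning / noisy-gradient case, a minimal sketch assuming PyTorch: RMSProp paired with gradient-norm clipping, a common way to tame high-variance gradients. The policy network, loss, and every hyperparameter here are placeholders.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

# RMSProp: the moving-average decay (alpha) and eps are the knobs to tune.
optimizer = torch.optim.RMSprop(policy.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)

def update(loss: torch.Tensor, max_grad_norm: float = 0.5) -> None:
    """One optimizer step with gradient clipping against noisy gradients."""
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before stepping; common in RL training loops.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()

# Placeholder "loss": in a real RL setup this would come from policy-gradient terms.
update(policy(torch.randn(16, 8)).pow(2).mean())
```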