Optimizers

A cheat sheet to help you choose the right optimizer based on problem characteristics such as dataset size, gradient noise, and required convergence speed:


🛠 Optimizer Cheat Sheet

| Optimizer | Best for | Pros ✅ | Cons ❌ |
| --- | --- | --- | --- |
| SGD (Stochastic Gradient Descent) | Large datasets, online learning | Efficient for large data, generalizes well | Requires careful tuning of learning rate, slower convergence |
| Momentum SGD | High-dimensional problems | Faster convergence than vanilla SGD | Still needs learning rate tuning |
| Nesterov Accelerated Gradient (NAG) | Smooth, convex problems | Improves momentum by looking ahead | Slightly more complex implementation |
| AdaGrad | Sparse data (e.g., NLP, recommendation systems) | No need to tune learning rate per feature | Learning rate decreases too much over time |
| RMSProp | Non-stationary problems (e.g., deep learning, RNNs) | Adapts learning rate, prevents AdaGrad's decay issue | Requires tuning decay parameter |
| Adam (Adaptive Moment Estimation) | Most deep learning tasks | Combines momentum + RMSProp, adaptive learning rate | Can lead to poor generalization (overfits sometimes) |
| AdamW | Deep learning with weight decay | Better generalization than Adam | Requires tuning weight decay |
| Adadelta | Limited memory situations | No need to set learning rate | Less commonly used |
| Nadam (Adam + NAG) | Faster convergence | Improves Adam with look-ahead momentum | More computation than Adam |
| L-BFGS (Limited-memory BFGS) | Small datasets, convex problems | Second-order optimization without full Hessian | Not scalable to deep learning |
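All of the optimizers in the table ship with PyTorch's `torch.optim` module. A minimal construction sketch, assuming PyTorch; the toy model and the hyperparameter values are illustrative defaults, not tuned recommendations:

```python
import torch
import torch.nn as nn

# Toy model whose parameters the optimizers will update.
model = nn.Linear(10, 1)

# Learning rates below are common defaults, shown only for illustration.
sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                               # vanilla SGD
mom_sgd  = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                 # Momentum SGD
nag      = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # NAG
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)                           # sparse features
rmsprop  = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)               # alpha = decay term
adam     = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
adamw    = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)          # decoupled weight decay
adadelta = torch.optim.Adadelta(model.parameters())                                   # no learning rate to set
nadam    = torch.optim.NAdam(model.parameters(), lr=2e-3)                             # Adam + Nesterov momentum
lbfgs    = torch.optim.LBFGS(model.parameters(), lr=1.0)                              # step() needs a closure
```

Note that `LBFGS.step()` must be passed a closure that re-evaluates the loss, so it does not drop into a standard training loop unchanged.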


🔍 How to Choose?

🔢 Based on Dataset Size

  • Small datasets → L-BFGS, Adam

  • Medium datasets → Adam, RMSProp

  • Large datasets → SGD, Momentum SGD, AdamW

⚡️ Based on Convergence Speed

  • Fast convergence required? → Adam, Nadam

  • Slower but stable? → SGD with Momentum, RMSProp

🌊 Based on Data Type

  • Sparse data (e.g., NLP, embeddings) → AdaGrad, RMSProp

  • Dense data (e.g., vision, tabular) → Adam, SGD with Momentum

🔀 Based on Noise in Gradients

  • Noisy gradients (e.g., reinforcement learning, RNNs, online learning) → RMSProp, Adam

  • Stable gradients (e.g., batch training, tabular data) → SGD with Momentum

🎯 General Recommendation

  • If unsure, start with Adam. If it overfits or behaves poorly, switch to SGD with Momentum or AdamW.
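One way to keep these heuristics in one place is a small factory function. A hypothetical sketch assuming PyTorch; `pick_optimizer`, its flags, and its defaults are illustrative names of my own, not part of any library:

```python
import torch

def pick_optimizer(params, *, sparse_data=False, rl_task=False, large_dataset=False, lr=1e-3):
    """Hypothetical helper encoding the heuristics above; tune lr per task."""
    if sparse_data:            # e.g. NLP embeddings, recommenders
        return torch.optim.Adagrad(params, lr=lr)
    if rl_task:                # noisy, high-variance gradients
        return torch.optim.RMSprop(params, lr=lr)
    if large_dataset:          # memory-light, generalizes well
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    return torch.optim.Adam(params, lr=lr)  # sensible default when unsure

# Usage on a toy model:
model = torch.nn.Linear(10, 1)
optimizer = pick_optimizer(model.parameters(), large_dataset=True)
```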

✅ When Adam is a Good Choice

  • Works well for most deep learning tasks (CNNs, RNNs, transformers, etc.).

  • Converges faster than plain SGD, especially in noisy environments.

  • Doesn’t require much hyperparameter tuning (default settings often work).
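As a concrete illustration of "default settings often work", here is a minimal sketch of a PyTorch training loop using Adam at its defaults; the model, data, and loss are placeholders of my own, not from these notes:

```python
import torch
import torch.nn as nn

# Placeholder model and data; swap in your own.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(128, 20), torch.randn(128, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())  # lr=1e-3, betas=(0.9, 0.999) by default

for epoch in range(10):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # backpropagate
    optimizer.step()               # Adam update: momentum + per-parameter adaptive lr
```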


❌ When Adam Might Not Be the Best Choice

| Problem Scenario | Why Adam Might Not Work | Better Alternative |
| --- | --- | --- |
| Overfitting happens | Adam adapts too aggressively, leading to poor generalization | AdamW (Adam with decoupled weight decay) |
| You need the best generalization (e.g., image classification) | Adam can converge too quickly to a suboptimal solution | SGD with Momentum |
| Training on very large datasets (e.g., billion+ samples) | Adam uses more memory due to per-parameter adaptive learning rates | SGD with Momentum |
| Training on reinforcement learning (RL) tasks | Adam struggles with high-variance gradients | RMSProp |
| Training deep networks (e.g., Transformers, LSTMs) | Adam works but may lead to instability | AdamW (better stability) |
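Both alternatives named in this table are a one-line swap in PyTorch. A hedged sketch; the weight-decay, learning-rate, and momentum values are common starting points, not tuned recommendations:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your network

# AdamW: Adam with decoupled weight decay, usually the first fix for overfitting.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# SGD with momentum: slower to converge, but often the best generalization
# (common for image classification); usually paired with a learning-rate schedule.
sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(sgd_m, T_max=100)
```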


🚀 Practical Rule of Thumb

  • Start with Adam if you’re unsure.

  • If you see overfitting → Try AdamW or SGD with Momentum.

  • If you need better generalization → Use SGD with Momentum.

  • If you work with reinforcement learning → Use RMSProp.
