Paper Reading: Benchmarking Optimizers for Large Language Model Pretraining

Original Paper: [2509.01440] Benchmarking Optimizers for Large Language Model Pretraining


Introduction

Chinchilla Scaling Law: For a fixed computational budget, there is an optimal amount of training data for a given model size; concretely, roughly 20 text tokens per parameter (see arXiv:2203.15556).

Overview: We discuss the algorithms according to their logical grouping:

  • Adam-like methods: AdamW, ADOPT, AdEMAMix
  • Sign-based methods: Lion, Signum
  • Approximate second-order optimizers: Muon, SOAP, Sophia
  • Learning-rate- and schedule-free methods: Schedule-Free AdamW, Prodigy
  • Variance-reduction methods: MARS

ADOPT: Removes the current gradient $g_t$ from the second-moment estimate $v_t$ and swaps the order of the momentum update $m_t$ and the normalization.

image
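
A minimal single-tensor sketch of this reordering in PyTorch-style code; the decoupled weight decay term `wd` and the default hyperparameters are illustrative additions, not taken from the figure:

```python
import torch

def adopt_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6, wd=0.0):
    """One ADOPT update for a single tensor p (sketch).

    Two differences from Adam: the normalization uses v from the *previous*
    step (so g_t never normalizes itself), and the normalization happens
    before the gradient enters the momentum m.
    """
    if step == 0:
        v.copy_(g * g)                                 # v_0 = g_0^2; no parameter update yet
        return
    normed = g / torch.clamp(v.sqrt(), min=eps)        # normalize with v_{t-1}
    m.mul_(beta1).add_(normed, alpha=1 - beta1)        # momentum of the normalized gradient
    p.mul_(1 - lr * wd).add_(m, alpha=-lr)             # decoupled weight decay (assumed)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)      # v_t updated only after the step
```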

AdEMAMix (Dual EMA): This work argues that a single EMA accumulating past gradients in the first-moment estimate $m$ can be suboptimal, since one EMA cannot simultaneously give high weight to both the immediate past and much older gradients; AdEMAMix therefore mixes a fast and a slow EMA.

image
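
A simplified single-tensor sketch of the dual-EMA update; the schedulers AdEMAMix applies to $\alpha$ and $\beta_3$ early in training are omitted, and the defaults are illustrative:

```python
import torch

def ademamix_step(p, g, m1, m2, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, wd=0.0):
    """One AdEMAMix update for a single tensor p (simplified sketch).

    m1 is a fast EMA (like Adam's first moment), m2 is a slow EMA with beta3
    close to 1; the slow EMA is mixed into the numerator with weight alpha.
    """
    t = step + 1
    m1.mul_(beta1).add_(g, alpha=1 - beta1)            # fast EMA
    m2.mul_(beta3).add_(g, alpha=1 - beta3)            # slow EMA, remembers old gradients
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m1_hat = m1 / (1 - beta1 ** t)                     # bias correction (fast EMA and v only)
    v_hat = v / (1 - beta2 ** t)
    update = (m1_hat + alpha * m2) / (v_hat.sqrt() + eps)
    p.mul_(1 - lr * wd).add_(update, alpha=-lr)        # decoupled weight decay
```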

Lion: A sign-based method that determines the update direction by taking the sign of an interpolation between the previous momentum and the current gradient.

image
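
A minimal sketch of this update; hyperparameter defaults are illustrative:

```python
import torch

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a single tensor p (sketch).

    The direction is the sign of an interpolation between the stored momentum
    and the current gradient (coefficient beta1); the momentum itself is then
    updated with a different coefficient beta2.
    """
    update = torch.sign(beta1 * m + (1 - beta1) * g)   # sign of the interpolation
    p.mul_(1 - lr * wd).add_(update, alpha=-lr)        # decoupled weight decay
    m.mul_(beta2).add_(g, alpha=1 - beta2)             # momentum EMA with beta2
```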

Signum: Signum differs from Lion in the interpolation term: the sign is taken of the gradient EMA (the momentum) itself, so the update direction and the momentum update share a single coefficient instead of using two separate ones.

image
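
For contrast, a minimal Signum sketch; the decoupled weight decay term is an assumption, since plain Signum is often stated without it:

```python
import torch

def signum_step(p, g, m, lr=1e-4, beta=0.9, wd=0.0):
    """One Signum update for a single tensor p (sketch).

    The sign is taken of the gradient EMA itself, so the update direction and
    the stored momentum share the single coefficient beta (unlike Lion).
    """
    m.mul_(beta).add_(g, alpha=1 - beta)               # single EMA of gradients
    p.mul_(1 - lr * wd).add_(torch.sign(m), alpha=-lr)
```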

Muon and D-Muon: In Muon's original code, weight decay is not applied to the matrix (non-1D) parameters that Muon updates. This issue is addressed in [2502.16982] Muon is Scalable for LLM Training, where the authors share the learning rate and weight decay between the matrix and non-matrix parameters of the model.

image

image
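
A hedged sketch of Muon's core for a single 2D weight: heavy-ball momentum followed by approximate orthogonalization via a quintic Newton-Schulz iteration (coefficients from the public Muon implementation). Setting `wd > 0` on the matrix parameters reflects the D-Muon fix discussed above; the shape-dependent scaling and hyperparameters are illustrative, and the non-matrix parameters (handled by AdamW) are omitted:

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315           # coefficients from the public Muon code
    X = G.float() / (G.norm() + eps)            # normalize so the iteration converges
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, g, m, lr=2e-2, beta=0.95, wd=0.1):
    """One Muon update for a 2D weight W (sketch). wd > 0 on matrix parameters
    is the D-Muon-style choice; original Muon left them without weight decay."""
    m.mul_(beta).add_(g)                                    # heavy-ball momentum
    O = newton_schulz5(m)                                   # orthogonalized direction
    scale = max(1.0, W.size(0) / W.size(1)) ** 0.5          # one common shape-based scaling
    W.mul_(1 - lr * wd).add_(O, alpha=-lr * scale)
```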

SOAP: SOAP improves Shampoo by running an Adam-style update in the eigenbasis of Shampoo's preconditioner; it reduces the computational overhead by preconditioning only two-dimensional layers while running plain AdamW for 1D parameters.

image
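
A heavily simplified sketch of this idea for one 2D layer, assuming a `state` dict of zero-initialized tensors (`L`, `R`, `QL`, `QR`, `m`, `v`) with matching shapes; bias correction, the 1D/AdamW path, and the exact refresh bookkeeping of real SOAP are omitted, and the defaults are illustrative:

```python
import torch

def soap_step_2d(W, G, state, step, lr=3e-3, beta1=0.9, beta2=0.95,
                 eps=1e-8, wd=0.0, precond_freq=10):
    """One (simplified) SOAP-style update for a single 2D weight W.

    L and R accumulate Shampoo's statistics G G^T and G^T G; their eigenvectors
    QL, QR are refreshed every `precond_freq` steps, and an Adam-style update
    runs on the gradient rotated into that eigenbasis.
    """
    L, R, QL, QR, m, v = (state[k] for k in ("L", "R", "QL", "QR", "m", "v"))
    L.mul_(beta2).add_(G @ G.T, alpha=1 - beta2)
    R.mul_(beta2).add_(G.T @ G, alpha=1 - beta2)
    if step % precond_freq == 0:                      # periodic eigenbasis refresh
        QL.copy_(torch.linalg.eigh(L).eigenvectors)
        QR.copy_(torch.linalg.eigh(R).eigenvectors)
    Gr = QL.T @ G @ QR                                # rotate gradient into the eigenbasis
    m.mul_(beta1).add_(Gr, alpha=1 - beta1)           # Adam moments live in rotated space
    v.mul_(beta2).addcmul_(Gr, Gr, value=1 - beta2)
    update = QL @ (m / (v.sqrt() + eps)) @ QR.T       # rotate the Adam step back
    W.mul_(1 - lr * wd).add_(update, alpha=-lr)
```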

Sophia: A second-order method that preconditions the gradient momentum with a lightweight, periodically refreshed diagonal Hessian estimate and clips the resulting per-coordinate update.

image
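
A minimal sketch of the clipped, Hessian-preconditioned step, assuming the caller supplies a cheap diagonal Hessian estimate (`hess_estimate`, e.g. from a Gauss-Newton-Bartlett-type estimator) every few steps; names and defaults are illustrative:

```python
import torch

def sophia_step(p, g, m, h, step, hess_estimate=None, lr=3e-4, beta1=0.965,
                beta2=0.99, rho=0.04, eps=1e-12, wd=0.1, hess_freq=10):
    """One Sophia-style update for a single tensor p (sketch).

    h is an EMA of a diagonal Hessian estimate, refreshed only every
    `hess_freq` steps. The per-coordinate step is clipped to [-1, 1], so
    coordinates with tiny estimated curvature cannot take huge steps.
    """
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    if hess_estimate is not None and step % hess_freq == 0:
        h.mul_(beta2).add_(hess_estimate, alpha=1 - beta2)
    update = torch.clamp(m / torch.clamp(rho * h, min=eps), -1.0, 1.0)
    p.mul_(1 - lr * wd).add_(update, alpha=-lr)        # decoupled weight decay
```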

Schedule-Free AdamW: The idea of Schedule-Free AdamW is to eliminate learning-rate schedulers by replacing them with iterate averaging.

image
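
A simplified sketch of the iterate-averaging idea, assuming the caller evaluates the gradient at the interpolated point $y$; the warmup-dependent averaging weights and other details of the actual optimizer are omitted:

```python
import torch

def sf_adamw_step(x, z, v, grad_at_y, step, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, wd=0.0):
    """One Schedule-Free AdamW update (simplified sketch).

    z is the base iterate, x the running average used for evaluation.
    Gradients are taken at y = (1 - beta1) * z + beta1 * x, supplied by the
    caller as `grad_at_y`; no learning-rate schedule is needed.
    """
    t = step + 1
    y = (1 - beta1) * z + beta1 * x
    v.mul_(beta2).addcmul_(grad_at_y, grad_at_y, value=1 - beta2)
    v_hat = v / (1 - beta2 ** t)
    z.add_(grad_at_y / (v_hat.sqrt() + eps) + wd * y, alpha=-lr)   # step on z
    c = 1.0 / t
    x.mul_(1 - c).add_(z, alpha=c)                     # online average of the z iterates
```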

Prodigy: Prodigy removes the need for hand-tuned learning rates through an intrinsic, adaptive step-size scheme.

image
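
A rough sketch of the core distance estimation in a Prodigy/Adam-style step, assuming a `state` dict holding the moments, scalar accumulators `r` (float, init 0) and `d` (float, init e.g. 1e-6), and a copy of the initial point `x0`; bias correction, weight decay, and numerical safeguards are left out, so treat this as a paraphrase of the idea rather than the exact published algorithm:

```python
import torch

def prodigy_step(x, g, state, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Prodigy-style update for a single tensor x (rough sketch).

    Instead of a hand-tuned learning rate, an estimate d of the distance to
    the solution is maintained and lr * d is used as the effective step size;
    d can only grow, driven by the correlation between gradients and the
    displacement from the initial point x0.
    """
    m, v, s, x0, d = state["m"], state["v"], state["s"], state["x0"], state["d"]
    m.mul_(beta1).add_(g, alpha=(1 - beta1) * d)
    v.mul_(beta2).addcmul_(g, g, value=(1 - beta2) * d * d)
    sqrt_b2 = beta2 ** 0.5
    corr = torch.sum(g * (x0 - x)).item()              # correlation with displacement
    state["r"] = sqrt_b2 * state["r"] + (1 - sqrt_b2) * lr * d * d * corr
    s.mul_(sqrt_b2).add_(g, alpha=(1 - sqrt_b2) * lr * d * d)
    denom = s.abs().sum().item()
    if denom > 0:
        state["d"] = max(d, state["r"] / denom)        # the step-size estimate never shrinks
    x.add_(m / (v.sqrt() + d * eps), alpha=-lr * d)
```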

MARS: MARS combines modern adaptive and approximate second-order methods with a variance-reduction-style gradient correction.

image
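
A sketch of an AdamW-flavored MARS step; the correction coefficient `gamma` and other defaults are illustrative, and reusing the previous step's gradient `g_prev` corresponds to the cheaper approximate variant rather than re-evaluating the old iterate on the current batch:

```python
import torch

def mars_adamw_step(p, g, g_prev, m, v, step, lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, wd=0.0):
    """One MARS-style (AdamW-flavored) update for a single tensor p (sketch).

    The raw gradient is replaced by a variance-reduced surrogate c that adds a
    scaled difference between the current and previous gradients; the surrogate
    is norm-clipped and then fed into a standard AdamW-style update.
    """
    t = step + 1
    c = g + gamma * (beta1 / (1 - beta1)) * (g - g_prev)   # variance-reduction correction
    c_norm = c.norm()
    if c_norm > 1:
        c = c / c_norm                                     # clip the surrogate to unit norm
    m.mul_(beta1).add_(c, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(c, c, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p.mul_(1 - lr * wd).add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```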


Results at Small Scale: 124M Models

Notation: Hereafter, "$A \times B$ tokens" means a batch of $A$ sequences with $B$ tokens each (i.e., $A \cdot B$ tokens per batch).

Results with Small and Large Batches and Stability across Training Horizons

image
Comparing optimizers for training a 124M parameter LLM: (a) "small" batch size (b) "large" batch size.
image
Ranking of optimizers for 124M models with "small" and "large" batch sizes.

Takeaway (Batch Size)

  • AdEMAMix consistently achieves state-of-the-art performance and scales robustly with training duration.
  • Sign-based methods (Signum, Lion) and MARS benefit greatly from the increased batch size.
  • Sophia diverges in the small-batch setting when trained beyond the Chinchilla-optimal horizon, even with a sufficiently small learning rate.
  • SOAP shows consistent performance in both settings.

Takeaway (Stability): Once optimizers are properly re-tuned for the maximal training length considered, doubling the number of iterations does not change the ranking of methods.


Increasing the Batch Size Further:

image
Scaling batch size

Takeaway: Many methods, especially MARS, Prodigy, and the sign-based ones, can outperform AdamW when trained with sufficiently large batches.


Weight Decay Ablation:

Here the baseline AdamW uses a weight decay of $\lambda = 0.1$.

image
Larger weight decay achieves significantly better results when training on fewer tokens: (a) AdamW, Signum, and Lion with large weight decay outperform the baseline AdamW with weight decay of 0.1 for short training durations. (b) The setting without weight decay is suboptimal. (c) Smaller weight decay leads to a larger L2 norm of the model parameters.
image
Importance of weight decay for Muon. (1) D-Muon applies weight decay to all parameters, (2) Muon applies weight decay only to embeddings, scalar parameters, and the final layer. D-Muon greatly outperforms the basic Muon.

Takeaway:

  • The use of weight decay (particularly a large weight decay of 0.5 and above) can significantly affect the final loss and optimizer behavior.
  • Setting the weight decay to $0$ is suboptimal.
  • For extended training horizons, a non-zero weight decay of $0.1$ proves to be a robust option.

Learning Rate Sensitivity:

image
Optimal learning rate stability across optimizers. The optimal learning rate determined during tuning on 2.1B tokens remains consistent after a learning rate sweep on 16.8B tokens for most optimizers.

Takeaway:

  • For most optimizers, the learning rate $\gamma_{\max}$ selected near the Chinchilla-optimal horizon transfers smoothly to an $8\times$ longer run.
  • Sign-based methods and Sophia diverge with $\gamma_{\max} = 2 \times 10^{-3}$.
  • MARS demonstrates very consistent performance across the $\gamma$ sweep.

Warmup Ablation:

image
Warmup ablation: sign-based optimizers, Sophia and SF-AdamW benefit from the increased warmup.

Takeaway: The warmup duration is optimizer-dependent and should be tuned: for SF-AdamW, Sophia, and Signum, a longer warmup results in improved final performance.


Scheduler Ablation: WSD (Warmup-Stable-Decay), Cosine, and Linear $\gamma$-Schedulers:

image
Comparison between the cosine, WSD, and linear schedulers.
image
Gradient-norm patterns for different schedulers: (b) for the majority of optimizers, the gradient-norm evolution resembles the SF-AdamW pattern; (a, c) the exceptions are the sign-based methods Signum and Lion.

Takeaway: The choice of learning-rate scheduler is also optimizer-dependent (the three schedules are sketched below):

  • For most methods, the cosine scheduler dominates.
  • The linear scheduler outperforms or matches cosine and WSD for sign-based methods, SOAP, and MARS.
  • WSD appears to be the best option for Muon.
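
For reference, a small sketch of the three schedules being compared, all with a shared linear warmup; the final-LR fraction, the WSD decay fraction, and the function name are illustrative, not the paper's exact settings:

```python
import math

def lr_at(step, total, warmup, lr_max, lr_final_frac=0.01, kind="cosine",
          wsd_decay_frac=0.2):
    """Learning rate at `step` for cosine, linear, and WSD schedules (sketch)."""
    if step < warmup:
        return lr_max * (step + 1) / warmup                    # linear warmup
    lr_min = lr_max * lr_final_frac
    progress = (step - warmup) / max(1, total - warmup)        # in [0, 1]
    if kind == "cosine":
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
    if kind == "linear":
        return lr_max - (lr_max - lr_min) * progress
    # WSD: hold lr_max, then decay linearly over the last wsd_decay_frac of training
    decay_start = 1.0 - wsd_decay_frac
    if progress < decay_start:
        return lr_max
    return lr_max - (lr_max - lr_min) * (progress - decay_start) / wsd_decay_frac
```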

Results at Medium Scale: 210M Models

Results

image
Ranking of optimizers for 210M models with a batch size of 256×512 tokens.
image
Comparing optimizers for training a 210M parameter LLM.

Takeaway:

  • We do not observe much change in the ranking of optimizers for the 210M model compared to the 124M benchmark.
  • Almost identical hyperparameters carry over for all optimizers, except the learning rate of the sign-based methods (which become more sensitive to the learning rate as the model size scales).

Decay the learning rate sufficiently

image
Decaying the learning rate down to 1% of its maximum and beyond, instead of only to 10%.

Takeaway: Decaying the learning rate below $10\%$ of its maximum value significantly improves the results. However, the best final learning rate differs between schedulers.


Results at Large Scale: 583M and 720M Parameters

Results

image
Ranking of optimizers for 720M Llama-based models.
image
Comparing optimizers for training a 720M parameter LLM.

Takeaway:

  • At larger model and batch-size scales, AdEMAMix and MARS dominate.
  • Despite training with large batches, Signum and Lion scale poorly.
  • D-Muon is consistent across all our benchmarking setups.

Wall-clock time comparison

image
Wall-clock time comparison.

Takeaway: Most optimizers exhibit similar wall-clock time, with sign-based methods being slightly faster; SOAP is the main exception.


Extension to MoEs

MoE: Mixture-of-Experts architecture: a router activates only a small subset of expert feed-forward networks per token, so the parameter count grows while the per-token compute stays roughly constant.

image

Results:

image
Ranking of optimizers for 520M MoE models with a batch size of 256×512 tokens.
image
Comparing optimizers for training a 520M parameter MoE.

Takeaway: Benchmarking results obtained for dense models transfer to corresponding MoEs.