Steer Before You Shrink

What separates training methods that scale from those that don’t?

[Figure: Test accuracy by method and strength, CIFAR-10]

Across 61,000 configurations on CIFAR, batch normalization, orthogonalized updates, and ASAM help at every data scale with no sign of diminishing returns. Dropout and random weight perturbation are neutral to mildly helpful only at low strength, and the strength at which they turn harmful drops as task complexity rises.

BN orthogonalizes representations, and Muon orthogonalizes parameter updates. Both steer the optimization without restricting what the network can learn. In the tested range, their effect does not reverse, and it grows with task complexity.
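To make "orthogonalizing parameter updates" concrete, here is a minimal NumPy sketch of the Newton-Schulz iteration Muon popularized for approximately orthogonalizing an update matrix. The quintic coefficients are the ones from the public Muon implementation; the function name, matrix shapes, and step count are illustrative, and this is a sketch of the mechanism, not the optimizer itself.

```python
import numpy as np

def orthogonalize_update(G, steps=5):
    """Approximately orthogonalize an update matrix via a quintic
    Newton-Schulz iteration (the mechanism behind Muon's updates).
    Pushes every singular value of G toward 1, keeping the update's
    directions while equalizing their scales."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon codebase
    tall = G.shape[0] > G.shape[1]
    X = G.T if tall else G                      # work in the wide orientation
    X = X / (np.linalg.norm(X) + 1e-7)          # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X   # apply p(x) = ax + bx^3 + cx^5 to each singular value
    return X.T if tall else X

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 128))   # a raw "gradient" with spread-out singular values
O = orthogonalize_update(G)
s = np.linalg.svd(O, compute_uv=False)
# After a few steps, all singular values of O cluster near 1.
```

The point of the sketch is the contrast with dropout: nothing here zeroes or constrains the weights themselves; the update direction is reshaped while the hypothesis class is left intact.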

Dropout restricts the network during training. At low rates, the restriction is cheap. At high rates, the cost scales with task complexity.
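The restriction is easy to see in the standard inverted-dropout formulation: during training, a random fraction of activations is zeroed outright, and only the survivors (rescaled to preserve the expected activation) carry signal. A minimal NumPy sketch, with illustrative names and shapes:

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: during training, zero a fraction `rate` of
    activations and rescale survivors by 1/(1-rate), so the expected
    activation is unchanged and inference applies no scaling."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, rate=0.5, training=True, rng=rng)
# About half the units are zeroed each step, yet the mean stays near 1.0.
```

Each training step, the network must route the task through a randomly thinned subnetwork; at high rates and on harder tasks, that thinning removes capacity the task actually needs.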

[Figure: Test accuracy by method and strength, CIFAR-100]

The same pattern holds on CIFAR-100, where dropout is at best neutral at the lowest tested scales and negative after that. Methods that steer do not reverse in the tested range. Methods that shrink do, and the reversal moves earlier as complexity rises.

Large-scale training moved away from dropout, and this sweep suggests why.

What separates methods that scale is whether they steer or shrink.


Benchmark: 68,000 sampled configurations, 61,000 completed.