Convergent Thinking
I adjust numbers until other numbers get smaller.
Recent
- The First Token
Not all problems decompose left to right.
- Steer Before You Shrink
Training methods that steer optimization scale. Methods that restrict the network don't.
- Bias Compounds, Variance Washes Out
Round-to-nearest makes the same error every time. Stochastic rounding doesn't. Over long runs, that's everything.
- Trajectory
You won each decision and lost the trajectory.
- AVnorm
Per-head normalization on attention outputs fixes length generalization.
- The Box
The hardest box to escape is the one you cannot see.
- Attention Normalizes the Wrong Norm
Softmax constrains the L1 norm of the attention weights to 1, but it is the L2 norm that should be constrained.
- People are the new oil
Compute used to be the bottleneck. Now it's people.