The First Token
Some problems get easier as you go, and others can’t start until they’re already solved.
A sentence is the first kind. Each word tightens what can follow, and the ending arrives as if it were always going to.
Sudoku is the second. Place a number and it forecloses options in cells you haven’t looked at; erase it and everything built on it collapses. The grid resolves only once you see it whole.
An autoregressive model writes sentences well because sentences are forgiving; its trap is writing sudokus the same way. Each step arrives as if it were the only possible next one, and the question of whether the first step was right never forms.
Language modeling has its own attempts at a fix: a model thinks aloud, rereads what it wrote, or writes many drafts and keeps one. All of it moves forward, so every correction inherits the mistake it follows.
Diffusion does not reason longer. Each step refines the whole sequence, and every position conditions every other. The first token carries no special weight, and problems that foreclose become tractable because nothing is decided alone.
A sentence is written one word at a time, but a sudoku must be seen whole.