The low-rank softmax bottleneck, explained
This is a starter draft seeded from my undergraduate thesis. Rewrite it in your own voice before publishing.
Every autoregressive language model ends the same way: a hidden state gets mapped to a distribution over the vocabulary through a softmax,
Stack the log-probabilities for every context and every word into a matrix , and the model is effectively trying to match with , where holds the hidden states and the word embeddings. Here’s the catch: if the hidden size is , then has rank at most .
But the true log-probability matrix of natural language appears to be high-rank. So a model with a small hidden dimension is structurally incapable of expressing the real distribution, no matter how much data you throw at it. This is the softmax bottleneck (Yang et al., 2018).
Why it matters
It reframes a hyperparameter you might set casually — the hidden dimension — as a hard ceiling on expressiveness. You can be bottlenecked even with a perfectly trained model.
Ways around it
- Mixture of Softmaxes (MoS): mix several softmaxes so the result is no longer a single low-rank product.
- Higher-rank output maps: break the strict rank constraint with a non-linear final map.
There’s a real trade-off with compute, and that’s where most of the interesting engineering lives.