The low-rank softmax bottleneck, explained

This is a starter draft seeded from my undergraduate thesis. Rewrite it in your own voice before publishing.

Every autoregressive language model ends the same way: a hidden state $h_t$ gets mapped to a distribution over the vocabulary through a softmax,

P(x \mid c_t) = \frac{\exp(h_t^\top w_x)}{\sum_{x'} \exp(h_t^\top w_{x'})}.

Stack the log-probabilities for every context and every word into a matrix $A$ , and the model is effectively trying to match $A$ with $H W^\top$ , where $H$ holds the hidden states and $W$ the word embeddings. Here’s the catch: if the hidden size is $d$ , then $H W^\top$ has rank at most $d$ .

\operatorname{rank}(H W^\top) \le d.

But the true log-probability matrix of natural language appears to be high-rank. So a model with a small hidden dimension is structurally incapable of expressing the real distribution, no matter how much data you throw at it. This is the softmax bottleneck (Yang et al., 2018).

Why it matters

It reframes a hyperparameter you might set casually — the hidden dimension — as a hard ceiling on expressiveness. You can be bottlenecked even with a perfectly trained model.

Ways around it

Mixture of Softmaxes (MoS): mix several softmaxes so the result is no longer a single low-rank product.
Higher-rank output maps: break the strict $\le d$ rank constraint with a non-linear final map.

There’s a real trade-off with compute, and that’s where most of the interesting engineering lives.