← Writing

The low-rank softmax bottleneck, explained

This is a starter draft seeded from my undergraduate thesis. Rewrite it in your own voice before publishing.

Every autoregressive language model ends the same way: a hidden state hth_t gets mapped to a distribution over the vocabulary through a softmax,

P(xct)=exp(htwx)xexp(htwx).P(x \mid c_t) = \frac{\exp(h_t^\top w_x)}{\sum_{x'} \exp(h_t^\top w_{x'})}.

Stack the log-probabilities for every context and every word into a matrix AA, and the model is effectively trying to match AA with HWH W^\top, where HH holds the hidden states and WW the word embeddings. Here’s the catch: if the hidden size is dd, then HWH W^\top has rank at most dd.

rank(HW)d.\operatorname{rank}(H W^\top) \le d.

But the true log-probability matrix of natural language appears to be high-rank. So a model with a small hidden dimension is structurally incapable of expressing the real distribution, no matter how much data you throw at it. This is the softmax bottleneck (Yang et al., 2018).

Why it matters

It reframes a hyperparameter you might set casually — the hidden dimension — as a hard ceiling on expressiveness. You can be bottlenecked even with a perfectly trained model.

Ways around it

  • Mixture of Softmaxes (MoS): mix several softmaxes so the result is no longer a single low-rank product.
  • Higher-rank output maps: break the strict d\le d rank constraint with a non-linear final map.

There’s a real trade-off with compute, and that’s where most of the interesting engineering lives.