Ternary Search: Linear Representations and Superposition

As LLMs become larger, more capable, and more ubiquitous, the field of mechanistic interpretability – that is, understanding the inner workings of these models – becomes increasingly interesting and important. In the same way that software engineers benefit from having good mental models of file systems and networking, AI researchers and engineers should strive to build a theoretical foundation for understanding the “intelligence” that emerges from LLMs. A strong mental model improves our ability to use the technology well. In this post, I want to cover two fundamental and related concepts in the field (each with its own paper) that I find fascinating from a mathematical perspective: the linear representation hypothesis (Park et al.) and superposition (Anthropic).

The Linear Representation Hypothesis (LRH) has been around for quite some time, ever since people noticed that the word embeddings produced by Word2Vec satisfy some interesting properties. If we take $E(x)$ to be the embedding vector of a word, we see the approximate equivalence

$E(\text{“king”}) - E(\text{“man”}) + E(\text{“woman”}) \approx E(\text{“queen”})$.

Observations of this form suggest that concepts (gender, in this example) are represented linearly in the geometry of the embedding space, which is a simple but non-obvious claim.
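
If you want to reproduce this yourself, a few lines with an off-the-shelf embedding library suffice. The sketch below uses gensim and one of its downloadable GloVe models (the specific model name is just an example, and the exact neighbors depend on which embeddings you load):

```python
# A minimal sketch: any pretrained word-embedding model with a similar API would do.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # ~100-dimensional word vectors

# E("king") - E("man") + E("woman") -> nearest neighbors in the embedding space
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically appears at or near the top of the list.
```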

Simplified model of an LLM in terms of embedding and unembedding.

Modern LLMs are rapidly moving forward, and the LRH remains a popular way of explaining what is going on inside these models. The paper by Park et al. presents a mathematical framework that tries to formalize the hypothesis. It uses a simplified model of the LLM in which most of the internal machinery (multilayer perceptrons, attention, etc.) is treated as a black box, and the LRH is interpreted in two different representation spaces with the same dimensionality as the model:

  • The “embedding space” is where the final hidden states of the network reside ($E(x)$ for an input context $x$). The name comes by analogy with word embeddings, and this is the space where you make interventions that affect the behavior of the model.
  • The “unembedding space” is where the rows of the unembedding matrix reside ($U(y)$ for each output token $y$). The concept direction measured by a linear probe applied to the hidden state (to detect the presence of the concept) corresponds to a vector in this space.

The LRH then has a formulation in each of these two spaces. Let $C$ denote the directional concept of gender, i.e. male => female. Then any pair of input contexts that differ only in that concept should satisfy a relation of the form

$E(\text{“Hail the Queen”}) - E(\text{“Hail the King”}) = \alpha \cdot E_C$

where $\alpha \ge 0$ and $E_C$ is a constant vector in the embedding space called the embedding representation of the concept. Similarly, any pair of output tokens that differ only in that concept should satisfy, for example,

$U(\text{“queen”}) - U(\text{“king”}) = \beta \cdot U_C$

where $\beta \ge 0$ and $U_C$ is a constant vector in the unembedding space called the unembedding representation. In essence, applying the concept has a linear effect in both spaces.
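
To make these two formulations concrete, here is a hypothetical numpy sketch of how one might estimate an embedding representation $E_C$ from counterfactual context pairs and then use it to intervene on a hidden state. The array and function names are illustrative, not taken from the paper's code:

```python
import numpy as np

# Hypothetical data: final hidden states E(x) for pairs of contexts that differ
# only in the male => female concept, shape (num_pairs, d_model).
hidden_states_male = np.random.randn(64, 4096)    # e.g. "Hail the King", ...
hidden_states_female = np.random.randn(64, 4096)  # e.g. "Hail the Queen", ...

# Each pair difference should be a non-negative multiple of a shared direction,
# so averaging the differences and normalizing gives an estimate of E_C.
diffs = hidden_states_female - hidden_states_male
E_C = diffs.mean(axis=0)
E_C /= np.linalg.norm(E_C)

def intervene(hidden_state: np.ndarray, alpha: float = 5.0) -> np.ndarray:
    """Push a hidden state along the concept direction before unembedding,
    nudging the model toward the 'female' side of the concept."""
    return hidden_state + alpha * E_C
```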

The paper goes into more detail that I will skip here, but the authors show that these two representations are isomorphic, unifying the intervention and linear-probing views of a concept. They then verify empirically on Llama 2 that they can find embedding and unembedding representations for various concepts (e.g. present => past tense, noun => plural, English => French) that approximately fit their theoretical framework – great!

Approximate orthogonality of concept representations in Llama 2. Source: Park et al.

OK, so let’s assume that concepts really do have linear representations. It would then be natural for unrelated concepts to have orthogonal directions. Otherwise, intervening on the male => female concept could also affect the English => French concept, which makes no sense. One of the key results of Park et al. is that this orthogonality holds not under the standard Euclidean inner product, but under a “causal inner product” derived from the unembedding matrix. Only by looking at concept representations through that lens do we get the orthogonality we expect.
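
As a rough sketch of what that lens can look like: one natural choice discussed in the paper is the inner product induced by the inverse covariance of the unembedding vectors, $\langle x, y \rangle_C = x^\top \mathrm{Cov}(\gamma)^{-1} y$, where $\gamma$ ranges over the rows of the unembedding matrix. The toy numpy snippet below assumes that particular form; treat it as an illustration rather than the paper's exact recipe:

```python
import numpy as np

# U: unembedding matrix, one row per vocabulary token, shape (vocab, d_model).
# Small random stand-in here; in practice you would take it from the model.
U = np.random.randn(10_000, 512)

# Assumed form of the causal inner product: <x, y>_C = x^T Cov(U)^{-1} y.
cov = np.cov(U, rowvar=False)                        # (d_model, d_model)
A = np.linalg.inv(cov + 1e-6 * np.eye(U.shape[1]))   # regularized inverse

def causal_inner_product(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ A @ y)

# Under this inner product, directions for unrelated concepts can be
# near-orthogonal even when their Euclidean inner product is not small.
```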

But in these models, the representation space is relatively small (roughly 2K to 16K dimensions). So how do these spaces “fit” a number of linguistic features that far exceeds their dimensionality? It is impossible for all such features to be mutually orthogonal, regardless of the geometry.

Interference effects of non-orthogonal features. Source: Anthropic.

This is where superposition comes into play. The low-dimensional intuition is that when you have $N$ vectors in a $d$-dimensional space with $N > d$, they start to interfere significantly (the magnitudes of their inner products are large). However, this is one of those cases where low-dimensional intuition does not extend to higher dimensions, as evidenced by the Johnson–Lindenstrauss lemma. One implication of the lemma is that you can choose exponentially many (in the number of dimensions) vectors that are nearly orthogonal – that is, the inner product between any pair is bounded by a small constant. You can think of this as the flip side of the curse of dimensionality.
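
A quick numerical illustration (not specific to either paper): pack several times more random unit vectors than dimensions and look at the worst-case pairwise overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors: int, dim: int) -> float:
    """Largest |cosine similarity| among random unit vectors, excluding self-pairs."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    gram = np.abs(v @ v.T)
    np.fill_diagonal(gram, 0.0)
    return float(gram.max())

# Twice as many vectors as dimensions: the worst-case interference shrinks
# sharply as the dimension grows, from near 1 in 3 dimensions to well below
# 0.2 at a few thousand dimensions.
for dim in (3, 64, 512, 2048):
    print(dim, round(max_abs_cosine(n_vectors=2 * dim, dim=dim), 3))
```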

The Anthropic paper demonstrates superposition phenomena in toy models trained on small, synthetic datasets. A particularly interesting observation is that superposition does not occur without an activation function (purely linear computation), but it does occur with a nonlinear one (ReLU in their case). The idea is that the nonlinearity allows the model to manage interference productively. But this only works well because of the natural sparsity of the features in the data – models learn to superimpose features that are unlikely to occur together.
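
For concreteness, here is a minimal PyTorch sketch of that toy setup as I understand it, ignoring the per-feature importance weighting the paper also uses: sparse synthetic features, a tied-weight linear map down to a much smaller hidden space, and a ReLU reconstruction. The hyperparameters are illustrative.

```python
import torch

n_features, d_hidden, sparsity = 20, 5, 0.95  # each feature is zero 95% of the time

W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(20_000):
    # Synthetic sparse data: uniform feature values, most of them zeroed out.
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()

    h = x @ W.T                      # project into the small hidden space
    x_hat = torch.relu(h @ W + b)    # reconstruct with a tied-weight ReLU layer
    loss = ((x - x_hat) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

# With high sparsity, the columns of W typically end up representing more
# features than there are hidden dimensions (superposition); with dense
# features (sparsity near 0), only about d_hidden features get represented.
```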

Visualization of a square antiprism, an energy-minimizing arrangement of 8 points on the unit sphere in 3D.

In experiments where the features in the synthetic data have equal importance and sparsity, they observe that the embedding vectors learned by the model form regular structures in the embedding space, such as tetrahedra, pentagons, and square antiprisms. Coincidentally, these are the same kinds of structures I worked with in some earlier research on circular codes. There, the structures emerged from using gradient descent-like algorithms to minimize the energy of arrangements of points on the unit hypersphere (analogous to the Thomson problem). It’s fun to see so many areas overlap!
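
As a small taste of that connection, here is a PyTorch sketch (not tied to either paper) that minimizes a Thomson-style $1/r$ pairwise energy for 8 points constrained to the unit sphere in 3D; from a random initialization it typically converges to the square antiprism.

```python
import torch

n_points, dim = 8, 3
torch.manual_seed(0)
x = torch.nn.Parameter(torch.randn(n_points, dim))
opt = torch.optim.Adam([x], lr=0.05)
i, j = torch.triu_indices(n_points, n_points, offset=1)  # all distinct pairs

for step in range(5_000):
    p = x / x.norm(dim=1, keepdim=True)                   # constrain to the unit sphere
    energy = (1.0 / (p[i] - p[j]).norm(dim=1)).sum()      # Thomson-style 1/r repulsion
    opt.zero_grad()
    energy.backward()
    opt.step()

points = (x / x.norm(dim=1, keepdim=True)).detach()
# For 8 points with this potential the minimizer is the square antiprism
# (not the cube); inspect the pairwise distances/angles of `points` to see it.
```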

To conclude, features as linear representations, even if not the whole story, are a valuable framework for interpreting and intervening in LLMs. The hypothesis has a solid theoretical foundation that is empirically supported. Sparsity, superposition, and the non-intuitive nature of high-dimensional spaces give us a window into how the complexity (and intelligence?) of language is captured by these models.
