Thinking About a Bias for Better Semantic Coherence and Reuse

The Problem of Reuse, Coherency, and Discovery.

For a general intelligence, good generalisation requires good reuse of semantic components. Good reuse of semantics requires that there exists a consistent internal handle for the same semantic purpose across contexts (e.g., a stable readout for classification from a linear layer, or a stable causal role in the computation) so that downstream computations can depend on “the same thing” reliably even if the internal basis is distributed or not uniquely identifiable. This typically requires some notion of (conditional) invariance or equivariance, which could mean that the concept’s identity is stable under semantics-preserving transformations, and that when a transformation should matter (e.g., location), the representation changes in a predictable, structured way. Thus good reuse of semantic components requires that the components themselves have a notion of semantic invariance and are likely to have relationships that reflect the underlying reality the model observes. I’m going to refer to all of this as having good internal coherency.

I’m not claiming that concepts must live in a single neuron or a uniquely identifiable subspace. What I care about is functional consistency, in that there exists a stable readout/causal role for a concept across contexts, and that concepts can be reused compositionally without being overwritten by irrelevant variations. Further, where readout stability is relevant, I’m mostly concerned about stability for a low-complexity family of readouts (e.g., linears) that downstream learning can realistically find and rely on.

When I say “semantic invariance,” I don’t mean “the entire representation must ignore a factor.” Often we want a representation where some readout is invariant (e.g., is there a cat?) while other readouts remain sensitive or equivariant (e.g., where is the cat? how many cats?). So the goal is not global invariance everywhere; it’s structured representations where invariant/equivariant semantic components exist and can be accessed reliably.

To illustrate why we want semantic invariance, consider learning a representation of “cat.” If the model lacks a stable handle for “cat” and over-relies on context-specific cues (e.g., a particular fur texture or camera artifact,) it may bind cat-typical relations (like “chases mice”) to those proxies and fail to transfer when those cues change.1 If I learn to become invariant to different fur types, either by associating many fur types with other features that together indicate “cat”, or by simply ignoring the regularities that look like fur, I can then better generalise “cat chases mice” across different fur types.

The intent of the term “internal coherency” is to put more emphasis on the importance of semantic components when considering generalisation, instead of just making a statement about the overall behavior of a model. It is supposed to be a blanket term for what intuitively makes a good semantic component (e.g., components that should be independent don’t systematically interfere.) Good reuse specifically emphasises knowing how to use and compose good representations, while internal coherency is about all the representations themselves being good and thus encompasses good reuse.

A core problem in the field of artificial intelligence right now is the lack of a scalable objective which directly biases a model’s internal representations to have internal semantic coherence and reusability.

For example, a symptom of this lack of good reuse is that the model may not reuse the same concepts after some input perturbation/reformulation which preserves the concepts that need to be reused for the desired output (e.g., asking for the inverse relation2), potentially causing undesired behavior in the model. I also hope that you see that when people say they desire good generalisation, they very likely mean that they desire generalisation that occurs over specific chunks of abstraction. An SFTed LLM may learn the exact token sequence associated with a relation, but not the higher level abstractions needed to understand the inverse relation. When humans learn over text, we are able to better infer the chunks of abstraction that the producer of the text intends we generalise over. Doing this in part requires discovering more regularities from the data that may contradict a given explanation of the data.

Thus, I also claim that we lack an objective which further enables the continuous discovery of good (higher level) regularities that improve a model’s predictive ability over its own internal world and reality. In practice this will likely require a bias to be internally coherent, as you can only form regularities with semantic invariances that reflect reality if they are composed of components with semantic invariances too. Further, it will require a bias towards good semantic reuse: if we cannot construct a concept from its components reliably, it’s difficult to learn said concept well.

If you are familiar with the UFR/FER paper, one general way to view the undesired behavior I’m describing is as the effects on semantics that FER representations would produce.1

In summary:

  • Why invariance helps coherence and reuse: If a “cat” representation isn’t stable under nuisance changes (bad coherence,) downstream “cat chases mice” relations are likely to end up bound to unstable proxies (fur texture, camera artifacts) rather than to a stable representation of “cat”. If multiple disconnected “cat-ish” fragments exist, different inputs/contexts will activate different fragments (bad reuse and therefore bad coherence,) and may not convey all the learned regularities associated with “cat”.

  • Why coherence helps discovery: Discovering/forming more complex regularities depends on having stable intermediate variables; fractured/entangled variables make this formation difficult.

I don’t want to consider everything humans do to enable generalisation as it’s a bit distracting. But at the very least, I believe that we have the ability to infer intermediate semantics from a given attempt at predicting something about the world, and can then check (actively or passively) if the intermediate semantics are useful for predicting more broadly across our understanding of reality, a process which yields better generalisation than machines currently achieve.

This process is a bias towards internal coherence and reuse. If you lack such a bias, you are bottlenecked in your ability to discover more regularities in data. Learning more complicated regularities requires that the concepts used to compose them be coherent.

Machines do not have a well crafted bias towards internal coherence and reuse, an issue we are currently hammering through with scaling. I will present two components from existing learning methods that I believe are likely to create this bias.

An Introduction.

We have identified the lack of a bias for internal semantic coherency (and thus also for reuse.)

This blog post is going to give my perspective of the most important insights from the past few years in AI, how they relate to this needed bias, and how we have two insights that address this issue in a few settings (LLMs and SSL.)

This is a bunch of position papers condensed into a blog post, and encapsulates a bit of my personal taste in research. It’s also not deeply technical so that it stays accessible; as such, there are places where I’ve intentionally thrown away a lot of nuance for the sake of brevity. In my mind, I’m giving quite an obvious framing of quite a few obvious observations, but nonetheless I hope there’s some interesting takeaway for everyone.

I’m going to claim that there are two related insights from learning objectives we currently use that underlie their effectiveness, in particular for coherency. I’ll keep returning to these:

  • Step‑wise semantic construction supervision – breaking complex predictions into simpler semantic steps and supervising them. This type of supervision incentivises semantic reuse and thus also internal coherence.

  • Explicit higher‑level supervision – explicitly supervising interactions at higher levels of abstraction (rather than only training to predict in the data space, as next token prediction does.) Note that when you explicitly supervise step-wise semantic construction, you are naturally supervising higher levels of abstraction.

I’m going to pose a few questions which aim to argue that these two factors underlie the most effective ideas we currently have in learning, in part because they produce our desired coherency bias. Further, I will argue that where our current learning methods fall short, they could take better insight from these factors.

Section 1: Observe that our most effective generally-intelligent (and scalable) AIs today are trained with generative objectives; why is this? (Alternatively, why supervising semantic construction underlies the effectiveness of LLMs.)

A big part of the explanation for why we’ve ended up in a place where our most intelligent models are trained generatively may lie in our ability to factorise the data space.

When intermediate prediction steps correlate with intermediate latent variables, they heavily bias the model towards representing said intermediate variables. I believe what underlies the effectiveness of current data-generative objectives in capturing underlying semantics is that factoring the data space can be designed to correlate with intermediate latent variables, and so predicting well-designed data factors biases the model towards learning how the underlying semantics are constructed, which in turn yields a better causal understanding of the data.

We don’t get a clean “latent space of true concepts” handed to us by the world for us to directly learn how semantics are constructed, but predicting the next token is a reasonable proxy for learning how semantics are constructed (even if there is no strict alignment guarantee, this alignment is quite strong for text.) Transformer LLMs factorise the data likelihood autoregressively over tokens; diffusion models factorise it via noising.
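To make this factorisation concrete, here is a minimal sketch of the per-token training signal it produces, assuming a generic causal LM that outputs logits (the function name and shapes are illustrative, not any particular library’s API):

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes illustrative): the autoregressive factorisation turns one
# sequence-level prediction into one cross-entropy constraint per position.
def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab), tokens: (batch, seq_len)
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predict position t+1 from positions <= t
    target = tokens[:, 1:].reshape(-1)
    # every position contributes its own intermediate constraint to the loss
    return F.cross_entropy(pred, target)
```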

There are two related reasons why supervising semantic construction is so effective. One is the lower likelihood of learning a spurious circuit, and the other is that it creates a natural bias towards reusing simpler concepts.

A given (potentially unfactored) prediction step on natural data can be designed to incentivise that the model tracks changes in higher-level semantics which are useful to us (i.e., intelligent.) If this prediction step is too large, many of the possible semantics to explain it are “wrong but loss-minimising”. Factorisation is quite an easy way to create many intermediate constraints on how semantics are constructed. It reduces objective underspecification. Many internal decompositions can solve an unfactored prediction, and by adding intermediate predictions we eliminate large classes of “wrong but loss-minimising” decompositions by forcing the model to be correct in more places along the way.

Autoregressive factorisation doesn’t guarantee semantic stepwise supervision; it guarantees dense intermediate constraints. In text, those constraints are unusually aligned with latent semantic state changes, which is part of why it works so well.

When the alignment is weak (as it is in many vision settings,) factorisation can still help optimisation, but it’s a much cruder proxy for semantic construction. Poorer alignment in vision could also explain why diffusion video models (which factorise via noising) aren’t great world models. Even in text, token factorisation only biases semantic factorisation and does not guarantee it, because many internal decompositions can still satisfy next-token loss.

(A subtle implication here is that this issue is potentially inherent to traditional feed forward neural nets with one input and one output, and no special latent supervision. In practice, learning higher level semantics requires a deeper net. Supervising higher level semantics without factoring prediction steps in a deeper model is inherently prone to many spurious explanations inside the model, as you have more circuitry to optimise and more semantics you can possibly screw up learning. A smaller net is less able to learn complex semantics altogether.)

Further, as we supervise the construction of semantics more, there is more training signal for how lower level semantics can compose, a process which naturally learns better reuse of lower level semantics.

Predicting small semantic factors is useful for learning complex joint distributions in part because it has these two qualities (lower spuriousness + semantic reuse signal) we want, and predicting small data-space elements is a proxy for predicting small semantic factors. Thus, models which become intelligent by exploiting this natural alignment between data factoring and semantic factoring are more amenable to data generation.

This is to say, what underlies the effectiveness of data-generative objectives is their exploitation of the usefulness of supervising the construction of semantics.

Section 2: LLMs seem to do this, so what’s missing? (Alternatively, why just supervising semantic construction isn’t sufficient.)

You can consider general neural net training as consolidating fractured representations to become some notion of semantically invariant. Scaling under this framing seems to create a bias towards consolidation, but an expensive and inefficient one: scaling obviously creates better models, and it’s very likely that better models have better representations of higher level semantics, something which requires that said representations have some notion of semantic consolidation/invariance. Current LLM training does not have a sufficient bias for this form of consolidation.

Next token prediction supervises broadly at a low level of abstraction. By definition we are not supervising higher level concepts explicitly. Supervising the invariance of higher level concepts and their interactions only for the purpose of predicting the next token lacks a significant bias for the entire higher level concept to be invariant. The loss can be satisfied with only having to learn fractured parts of a given higher level concept, instead of associating those fractured parts together to form the entire higher level concept (which would definitionally give you semantic invariance at the higher level.)

Currently, our most notable method for supervising the interaction between higher level concepts in LLMs is RLVR, which ideally attempts to shape the model’s invariant representations (which dictate its behavior) for a given task to be useful for getting a reward. RLVR supplies sequence-level constraints, which can pressure more coherency and reuse if the model already has some useful (for the reward) abstractions that can be extracted by rolling the model out a bunch and attempting to incentivise the invariance of useful semantics found in the sequences. A less nuanced way to view this is that RLVR attempts to narrowly supervise semantic invariances at a higher level of abstraction by weighting many lower level sequences together in order to capture said semantic invariance, while next token prediction has less of a bias towards the specific semantics being captured and instead attempts to implicitly capture everything in the data useful for predicting some low level of abstraction, which often leads it to not capture higher level semantic invariances.

RLVR empirically works better when the base model is better3,4,5, as attempting to learn to compose together and shape abstractions by weighting a bunch of rollouts is much easier when those abstractions exist in a form closer to their desired form inside the model (otherwise there’s more work for RLVR.) However, as RLVR supervision works off of existing abstractions found in the base LLM, and said abstractions are often fractured parts of concepts from the next token objective (i.e., not actually semantically invariant at higher levels,) it is hard for RLVR to begin supervising broad semantic invariances of entire concepts.

If a concept is implemented as multiple shards (as they were never consolidated during next token pretraining,) any given rollout distribution will reliably activate only a subset of them. Reward gradients then update (and potentially unify) only the activated shards. It struggles to force a global consolidation of a concept across all shards found in the data, unless reward coverage is broad and diverse enough to repeatedly co-activate and jointly shape all relevant shards. Thus it is also unsurprising that RLVR currently has limited generalisation.

Yes, in principle RL could unify shards given broad enough coverage and optimisation dynamics that encourage merging; however, in practice reward signals are weak/sparse and tend to overfit to local behavioral hacks.

Useful intelligence requires that we have coherency at varying levels of abstraction. Generalising from a narrow set of experiences (either because the task itself is narrow, or because the training set is small,) requires at the very least associating higher level semantic invariances (and not some fractured part of them) from the experience to other tasks, something that is difficult if you don’t have sufficient semantic invariances at those higher levels.

Just supervising semantic construction is not a sufficiently strong bias for semantic invariances at the higher levels of abstraction you construct. Lower level abstractions all the way down to the data space allow for semantics-preserving variations, and so without enforcing some notion of semantic invariance at higher levels of abstraction directly (e.g., by allowing representations of higher level concepts to interact directly instead of through tokens,) there is insufficient pressure for there to be semantic invariance at the higher levels, giving you poor coherence and thus also poor reuse.

Note that there are two separable hypotheses here that I will also continue to think about in the next section:

  • (H1) Missing constraints: prediction tasks at lower level abstractions (e.g., next token prediction) do not explicitly constrain many high-level invariances, so high-level decompositions remain underdetermined and can stay fractured even at scale. Said low level prediction tasks do not necessarily prohibit further specification of high-level semantics to address this fracturing.
  • (H2) Active pressure: optimising a low-level objective (e.g., next token prediction) can actively encourage shortcuts that implement “shards” of a concept rather than a unified variable, because shards are often sufficient to reduce loss locally and are easier to discover.

I suspect both are happening, and that RLVR isn’t sufficient to address H1. Further, I also suspect that H2 does not prohibit the further specification of higher level semantics to address the fracturing.

Section 3: If LLMs having fractured higher level representations helps explain their deficiencies, why would explicit higher level supervision help fix this fracturing? (Alternatively, why is explicitly supervising higher level abstractions important?)

The short answer is that we know latent SSL somewhat works, but I’m going to try and take you through the reasoning as to why.

Learning semantic invariance at higher levels of abstraction is a task of providing more supervision in a way such that it isn’t satisfied by learning a fractured representation. Doing so will give us a greater bias towards consolidating fractured representations, and thus coherency and reuse. Supervising interactions between higher levels of abstraction explicitly is a way of doing this.

In practice (i.e., in latent SSL which will be talked about more extensively in the next section,) as we do not have higher levels of abstraction given to us as numerical targets, we learn to shape a compression-biased representation by predicting targets also biased towards compression, and design the prediction task so that we are biased to compress away lower levels of abstraction so that the learned representation lives at some higher level. Note that this compression bias may come from neural nets and gradient descent itself, and isn’t necessarily from some information bottleneck. There are many different notions of compression here.

As representations of higher level concepts become invariant to some lower level variations, and these lower level variations are either generally ignored or absorbed as parts of the higher level concept they’re associated with, we reduce the chances of a lower level variation interfering with a higher level concept, creating less fractured representations for the regularities we do capture.

Once we have a representation biased towards information that lives at a higher level, we can directly shape it to be predictively useful of other representations that live at a higher level, giving it gradients that are more reflective of what useful intelligence is often for (learning interactions between higher levels of abstraction.)

This obviously provides more learning signal for higher level abstractions than implicitly supervising higher level semantics via supervising the relationships of lower level semantics. Thinking back to our two separable hypotheses on learning by constructing low level factors, explicit higher level supervision addresses H1. I think it could be reasonable to believe that if we do this supervision smart enough, H2 won’t be a hard roadblock.

For current neural nets, explicit higher level supervision could mean using the entire latent representation (which is what current latent SSL does) or some conditioned readout (e.g., via an MLP) of the latent representation to predict other latent representations, rather than just the data space. I am not saying that you should or shouldn’t be predicting lower level abstractions, but rather that there should be a loss signal that biases internal representations towards containing invariant representations of higher level concepts, and that this loss signal could likely come from explicit higher level supervision.
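As a rough sketch of what a conditioned readout predicting another latent could look like (a JEPA-flavoured toy of my own, not any specific method’s recipe; the 128-d latent size, the predictor shape, and the stop-gradient on the target are all assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal sketch of "a conditioned readout predicting another latent representation":
# a small MLP predictor maps one view's latent towards another view's latent, so the
# loss lives in representation space rather than the data space.
predictor = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))

def latent_prediction_loss(encoder: nn.Module, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    z_a = encoder(view_a)          # latent we condition on (assumes a 128-d encoder output)
    with torch.no_grad():
        z_b = encoder(view_b)      # target latent; no gradient flows through the target
    return F.mse_loss(predictor(z_a), z_b)
```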

Latent self-supervised learning having any efficacy is reasonable proof that explicitly supervising higher level concepts, in this case by having a target that is biased towards higher level concepts, works to some degree.

I don’t have much to offer about how to create a prediction task for higher level semantic invariances for LLMs, mostly because I work on vision.

Section 4: What is lacking about latent space self-supervised learning, an act of explicitly supervising higher levels of abstraction? (Alternatively, why explicit higher level supervision somewhat works and why just learning over higher level abstractions isn’t sufficient.)

I’m going to take the position here that current purely bootstrapped latent space self-supervised learning is unlikely to be how general intelligence would emerge. The short answer to the question here is that current latent space SSL has a bunch of quirks to get training to work that greatly restricts what it can learn and discover, quirks that are needed because we are not explicitly supervising the construction of semantics.

I will be considering what I believe are two related surface level issues (downstream coverage + discovery bottleneck) with current SSL methods, and then argue that they are addressable by better supervising semantic construction.

For those who are unfamiliar with (visual) self-supervised learning, here is a grossly simplified illustrative setup for pretraining visual representations via a pretext task (inspired by DINO6 and other joint embedding style setups, but not totally reflective of SOTA setups, and it will break if you actually try training it): 1) Instantiate a vision encoder (e.g., a ViT) and a prediction head. 2) Take an image and produce two cropped views of it. Put the two views through the same vision encoder (it’s important to note that DINO doesn’t actually do this) to produce a latent for each view, and then run something like a KL divergence loss over the latent views (you explicitly optimise for the same global latent representation across the two views, which in my opinion is not ideal for reasons I’ll explain.)
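Here is a minimal sketch of that toy setup in plain PyTorch, with a tiny stand-in encoder instead of a real ViT and crude crops instead of proper augmentations (again, trained as-is it will likely collapse):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in encoder (a real setup would use a ViT); purely illustrative.
class TinyEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

encoder = TinyEncoder()
head = nn.Linear(128, 256)  # prediction head producing logits we treat as a distribution

def random_view(x: torch.Tensor, crop: int = 24) -> torch.Tensor:
    # crude random crop + resize back, standing in for proper augmentations
    _, _, h, w = x.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    patch = x[:, :, top:top + crop, left:left + crop]
    return F.interpolate(patch, size=(h, w), mode="bilinear", align_corners=False)

def two_crop_kl_loss(images: torch.Tensor) -> torch.Tensor:
    view_a, view_b = random_view(images), random_view(images)
    # same encoder and head for both views (unlike DINO's teacher/student split)
    log_p_a = F.log_softmax(head(encoder(view_a)), dim=-1)
    p_b = F.softmax(head(encoder(view_b)), dim=-1)
    # explicitly match the two views' output distributions
    return F.kl_div(log_p_a, p_b, reduction="batchmean")

# e.g. two_crop_kl_loss(torch.randn(8, 3, 32, 32)); without anti-collapse tricks this
# objective can be minimised by outputting the same distribution for every input.
```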

The “pretext task” here is the matching of distributions (we’re using KL divergence) across multiple cropped views. Note that while current SSL mostly focuses on learning visual representations for downstream tasks (i.e., is mostly concerned with pretext tasks,) I will be broadly considering self supervised learning from a bootstrapped latent space in general, which may not necessarily be just for visual pretraining. I will still be starting off grounded in the illustrative toy setup, as it’s less nebulous.

Let’s start by considering why we are optimising in the latent space (rather than reconstructing the pixel space,) in the setting where we are both bootstrapping the model’s representation of the world and having the model learn by predicting its own outputs.

For most downstream tasks, the abstractions useful to us from the data live at a higher level. These higher level abstractions are often invariant to small variations at lower levels of abstraction. What latent self-supervised learning has figured out is that emphasising these higher levels of abstraction, in a way such that the targets for learning them are biased to become invariant to these lower level variations (and thus better reflect the actual semantic nature of said higher level abstractions not being affected by said lower level variations), often yields more semantically aligned features for some downstream probes (e.g., linear probes) than attempting to learn them implicitly by predicting a lower level target (see MAE vs DINO7,8), which likely indicates that they are often better representations of higher level abstractions out of the box.

So explicit higher level supervision somewhat works because we can have a target which lifts the level of abstraction, allowing us to get a better learning signal for higher (more often useful) levels of abstraction.

Having identifiable semantic invariances in a representation is obviously helpful for problems like reuse and coherency, as the alternative of having significant numerical variations for what’s semantically the same concept makes everything harder.

Now let’s consider how this “emphasising of higher levels of abstraction” is currently done in latent SSL. Generally, bootstrapped latent SSL setups (like our toy) use a view/partial observation of the data (in our toy, some crop) to predict/match a representation produced from another view/partial observation (in our toy, some other crop) of the data.

Primarily, only what is reliably conditionally predictable from one view to another produces consistent gradients, so the representation is biased towards shared predictable structure, and away from view-specific or conditionally high-entropy details (which tend to contribute noisy, inconsistent gradients.) As the same encoder being trained is used to produce all representations used for this prediction objective, throughout training the target will become dominated by factors of the data with this type of conditional regularity, and become increasingly invariant to factors which produce inconsistent gradients. In short, optimising in latent space lets you define supervision on a learned target representation that emphasises “factors predictable under the chosen relation” (in our toy case, multi-crop distribution matching) and model architecture. As we often care about learning higher level concepts, we pick our pretext task (which helps define the chosen relation) to reflect this.

The effectiveness of current visual SSL indicates to me that a part of creating a learning bias for semantic invariance for a representation is by using said representation to be predictive of other representations (also biased towards semantic invariance) at varying levels of abstraction. Note that this is a more general statement than having some single semantically invariant target, it merely requires the existence of some invariance in some target, rather than only having a single target which is invariant.

First, let’s note that “factors predictable under the chosen relation”, as induced by the pretext task and model architecture, has some notion of learning some subset of all existing semantics, and that this subset will live at some levels of abstraction. Let’s also note that a model learns by attempting to construct semantics (e.g., to predict a class or token, or to create some representation of the data) that fit the loss. It is obviously easier to learn fewer associations between semantics consistent with the data than it is to learn more. Given the choice to satisfy the loss while learning fewer associations and representing fewer regularities, a model will do so.

Thus, in many (bootstrapped single-objective) latent SSL objectives, having a bias for dispersing away regularities that are too hard to represent and use for prediction allows for the formation of semantic invariances in the target more easily, as dispersion is often the easiest solution. Indeed this is what (I would argue) current latent SSL methods are designed to do, and they are in a sense learning via a subtractive process with respect to what information from the data they learn to represent. As the model learns through this subtractive process, the representations produced can become increasingly invariant to abstractions outside our “factors predictable under the chosen relation” induced subset. If this subset is small enough and has a stable enough signal, it is more likely to provide sufficient learning signal to create good representations with good semantic invariances. If this subset is too large (we don’t disperse enough regularities under current SSL,) it becomes harder to construct semantic invariances with preserved regularities in the target (again, learning to associate semantics is hard,) and we slowly lose the desired properties of the target that latent SSL is able to produce.

Note that models can still retain detail in intermediate layers even if the final embedding(s) is invariant. This critique is about what the primary SSL loss makes easy to extract and reliably reuse, especially from the representation level we treat as “the” learned features.

A question you may have now is “if current latent SSL is inherently subtractive, why does it learn anything at all?” The answer is that in practice latent SSL has some anti-collapse tricks (in an attempt to stop the model from outputting trivial constants for all inputs), e.g., stop-grad with an EMA teacher6,9, or dimension-contrastive regularisers like SIGReg10, that along with the general task design (heavily) slow the dispersal of regularities which are predictable under our chosen relation (so that it is easier for the representation to capture what’s predictable rather than also dispersing it and collapsing.) Our anti-collapse tricks help prevent a trivial constant solution and enforce representational diversity, but do not remove the fundamental pressure toward encoding primarily what’s stable under the chosen relation.
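For concreteness, here is a minimal sketch of the stop-grad + EMA-teacher trick layered on the toy setup above (reusing `encoder`, `head`, and `random_view` from that sketch; the momentum value and structure are illustrative, not any specific paper’s recipe):

```python
import copy
import torch
import torch.nn.functional as F

# Reuses `encoder`, `head`, and `random_view` from the toy sketch above.
teacher_encoder, teacher_head = copy.deepcopy(encoder), copy.deepcopy(head)
for p in list(teacher_encoder.parameters()) + list(teacher_head.parameters()):
    p.requires_grad_(False)  # stop-grad: gradients never flow into the teacher

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996) -> None:
    # the teacher trails the student as an exponential moving average of its weights
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1.0 - momentum)

def teacher_student_loss(images: torch.Tensor) -> torch.Tensor:
    view_s, view_t = random_view(images), random_view(images)
    log_p_student = F.log_softmax(head(encoder(view_s)), dim=-1)
    with torch.no_grad():
        p_teacher = F.softmax(teacher_head(teacher_encoder(view_t)), dim=-1)
    # the slowly-moving teacher provides a more stable target than the student itself
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# after each optimiser step:
#   ema_update(encoder, teacher_encoder); ema_update(head, teacher_head)
```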

Let’s consider our toy example: for the model to actually represent lower level abstractions (e.g., fur texture on a cat) in the final representations, it has to both associate them to higher level abstractions (e.g., cat) as well as then propagate said lower level abstractions to the representation (e.g., the representation now conveys cat with some exact fur texture.) This association itself, as well as then representing the lower level abstraction along with the higher levels of abstraction in the final representation, is in practice very hard with a traditional single prediction task. If you think about our toy setup of attempting to match distributions between two different crops of an image, it’s obviously harder to learn to use the exact higher frequency texture (e.g., the low level abstraction of exact pixels and short lines that convey fur texture) in a predictively useful way.

There is actually no hard reason why the loss going down in latent SSL must mean that you are learning more semantics from the data, as opposed to reconstructive methods where you know that you are satisfying a nontrivial objective. Anti-collapse removes the constant solution, but it does not remove the multiplicity of non-semantic solutions, so better predictions need not indicate more semantic capture. Anything that you do learn is thanks to the mercy of good natural biases.

Thus, one way to view the current approach is that we produce stable targets for higher levels of abstraction by defining one prediction task whose entire learning signal is supposed to be for the higher level abstractions, a target we produce by no longer representing (and dispersing a lot of) the lower levels of abstraction via our subtractive bias.

For some evidence of this subtractive bias, latent self-supervised learning will often collapse into a trivial solution (e.g., outputting some constant value) without anti-collapse precautions (note that if you code up the toy example, it’ll likely collapse like this.) Yeah sure, higher level abstractions are often semantically invariant to lower level variations, and thus learning them benefits from the target reflecting this semantic association. However it’s very important to note that invariant signals for the higher level concepts existing in both our representation and learning signal does not imply that the entire representation or learning signal should be invariant to lower level abstractions. It could simply be that the mere existence of stable higher abstraction learning signals (and not complete representational invariance) is sufficient for learning good representations of higher level abstractions. Again, it is just that this stable signal is more easily produced if your one prediction target is produced by being biased to disperse lower levels of abstraction throughout training.

Because of this representation-wise invariance to certain (lower level) abstractions, said representations also do not support varying the level of abstraction we want to use from the data depending on the task. Given some information, humans are able to focus across varying levels of abstraction emergent from some observation depending on the task, and it’s hard to see why this capability isn’t desirable.

I hope you can see that there are two issues with this setup.

Picking a few higher levels of abstraction to be all that we represent about the world does not support all the task classes we’d want a general intelligence to be able to do (e.g., count marbles vs identify the presence of a pile of marbles.) If there is some meaningful factor of the data that is not biased towards being represented under a model (and we become invariant to it,) it becomes difficult to observe changes in that data feature in the model’s representations, which makes it hard to reason over such features of the data, creating a sort of malign invariance.

Further, more subtly, the decision to have a single prediction target which becomes invariant to many abstractions which exist in the data also means that we are incentivised against discovering all possible regularities in the data, as more regularity may exist in the data features that our target becomes invariant to. Intuitively, we can become invariant to the specific writing on a blackboard and just understand “writing on blackboard”, or we are sufficiently sensitive to the specific writing to understand that the writing is some specific math proof on a blackboard after we’ve learnt enough about the world to understand the underlying mathematics behind the proof. There are instances where fine-grained details (e.g., specific symbols for mathematics) convey information about complex regularities, regularities which will be hard to learn if we become invariant to said low level details in our objective. This is specifically an issue of invariance softly limiting our search space, as we would be unable to discover the latter regularity if during our training we become invariant to the lower level variations of chalk patterns on the blackboard.

Early invariance to detail can prevent later discovery of higher-level regularities that depend on those details. The discovery of regularities from the data is limited by what the model encodes in a stable form early on in training (which would be the regularities that are easily predictable) for two somewhat distinct reasons. It is much easier to use/compose stable regularities to form stable predictions, and further, only said regularities will be represented in the target to incentivise the model to compose together regularities from the input in an attempt to predict it.

So the current way we explicitly supervise higher level abstractions is insufficient as it disperses regularities via the subtractive bias, a bias which is useful for producing stable targets. I’m not saying that current latent SSL is bad at representation learning for downstream tasks; I’m saying its current success mode (stable invariances via subtraction + anti-collapse tricks) is counterproductive for its further progress.

Reconstructive SSL (e.g., masked reconstruction7) partially avoids this dispersion issue by forcing retention of low-level information, but it often provides weaker direct pressure for high-level invariant variables as there isn’t an objective that explicitly shapes higher-level interactions.

In my view, the goal is not to make the entire representation invariant. The goal is to learn a rich representation that can preserve fine detail when needed, while also containing stable, reusable invariant/equivariant variables that are easy to extract and compose for downstream tasks.

More concretely, a better representation of an image of a cat should convey lower level abstractions and what they semantically construct: the pixel values which make up the image, the cat’s fur texture and outline, and then that the entity itself is some specific type of cat. During training, we should not assume that some details are unpredictable and thus eliminate them completely, but instead give said details, as well as found invariances, a consistent place to live.

The goal is not “make representations invariant.” The goal is “make invariant semantics exist, while preserving lower level detail in a structured way inside our representation so invariances and details can both be reused.”

I do believe that the lesson latent SSL teaches us is not that the entire representation should become invariant to lower level variations, but that there should exist invariant signals in the target for higher level abstractions. It also just so happens that making the entire representation invariant to lower level variations via our current subtractive learning process is an easy way of creating this invariant signal, and so we currently perceive it as necessary to make the entire learned representation only the higher level abstraction.

Language modeling teaches us that semantically constructive methods are extremely powerful, however language is unique as semantic construction can be easily defined as a task of predicting data factors. Thus I believe that the task should be to find a way to supervise semantic construction for more domains of natural data, so we have stronger intermediate constructive signals that will allow us to move away from relying on a purely subtractive bias for creating strong learning signals for varying levels of abstraction.

The key effectiveness of the subtractive bias is that it addresses the difficulty of learning intermediate semantic associations; explicitly supervising semantic construction makes learning intermediate associations easier.

To be more concrete, I think an interesting question is “How could I take a ViT and define local losses on the intermediate outputs, such that I can supervise the construction of semantics to learn good level-varied abstractions of the data, and then learn a final representation which preserves all levels of abstraction (including the data space) so that I can find explanations for the data that are also consistent across all the abstractions I’ve learnt while not dispersing away regularities?” As not dispersing away information from the data is quite easy (i.e., make the objective in part data-reconstructive,) the more important task revolves around figuring out how to give different levels of abstraction a place to live in a representation such that regularities that are likelier to be noise don’t interfere with the shaping of higher level invariances.

It’s also interesting to note here that reconstructing the data space is inherently a bias against collapse. It could be that getting intermediate supervision working (i.e., supervising semantic construction explicitly for creating a representation) to give you sufficient intermediate semantic invariances allows you to use data prediction as your “anti-collapse method” while providing stable enough signals (more intermediate stability via invariances is likely to give more stability with the higher levels they construct) such that you can shape higher level abstractions with high signal too.
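Purely to illustrate the shape of this question, here is a hypothetical sketch (the modules and targets, blocks, local_heads, decoder, and targets_per_level, are all placeholders of my own, and defining the local targets well is exactly the open problem):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: each intermediate block gets a local latent loss (supervising the
# construction of semantics level by level), while the final representation must also
# reconstruct the data space, which doubles as an anti-collapse pressure.
def level_varied_loss(patch_tokens, blocks, local_heads, decoder, targets_per_level):
    h = patch_tokens
    total = torch.zeros((), device=patch_tokens.device)
    for block, local_head, target in zip(blocks, local_heads, targets_per_level):
        h = block(h)
        # local constraint on what this level of abstraction should be predictive of
        total = total + F.mse_loss(local_head(h), target)
    # the final representation must still explain the data space (no dispersal of detail)
    total = total + F.mse_loss(decoder(h), patch_tokens)
    return total
```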

Closing remarks.

The most important lesson I’ve learnt from observing the history of deep learning and the scaling era is the importance of supervising semantic construction. Before the scaling era, we did this implicitly through architectural biases and regularisation. The scaling era was brought about by the ability to design data factoring as a sufficient proxy for semantic factoring, with semantic factoring being a very powerful objective for learning good representations.

Instead of further pursuing this idea of supervising the construction of semantic factors, we only skimmed the surface of data factoring and decided to scale it to death.

We’ve also learnt through SSL that explicitly supervising higher levels of abstraction is quite data efficient, and very important for learning good higher level representations.

I think we’ll see these two lines of thought eventually converge.

References

  1. A. Kumar, J. Clune, J. Lehman, et al., “Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis,” arXiv:2505.11581, 2025.

  2. L. Berglund, M. Tong, M. Kaufmann, et al., “The Reversal Curse: LLMs trained on ‘A is B’ fail to learn ‘B is A’,” arXiv:2309.12288, 2023.

  3. Y. Yue, Z. Chen, R. Lu, et al., “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?,” arXiv:2504.13837, 2025.

  4. J. Hu, Y. Zhang, Q. Han, et al., “Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model,” arXiv:2503.24290, 2025.

  5. DeepSeek-AI, D. Guo, D. Yang, et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv:2501.12948, 2025.

  6. M. Caron, H. Touvron, I. Misra, et al., “Emerging Properties in Self-Supervised Vision Transformers,” arXiv:2104.14294, 2021.

  7. K. He, X. Chen, S. Xie, et al., “Masked Autoencoders Are Scalable Vision Learners,” arXiv:2111.06377, 2021.

  8. O. Simeoni, H. V. Vo, M. Seitzer, et al., “DINOv3,” arXiv:2508.10104, 2025.

  9. M. Assran, Q. Duval, I. Misra, et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,” arXiv:2301.08243, 2023.

  10. R. Balestriero, Y. LeCun, “LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics,” arXiv:2511.08544, 2025.