Scientists are tackling the persistent problem of catastrophic forgetting in continual learning, often assessing it solely through performance metrics rather than investigating the underlying neural mechanisms. Sergi Masip from KU Leuven, alongside Gido M. van de Ven from the University of Groningen, Javier Ferrando from Amazon, and Tinne Tuytelaars, present a novel mechanistic framework that interprets forgetting geometrically, linking it to how individual features are encoded and subsequently disrupted. This research is significant because it moves beyond simply measuring forgetting to understanding why it happens, formalising the view that limited feature capacity and disrupted readout contribute to the problem, and identifying scenarios where it is most pronounced. By analysing a tractable model and applying their framework to practical networks trained on CIFAR-10, the authors offer a new, feature-centric vocabulary for advancing continual learning research.
Feature forgetting as geometric representation transformation
The framework casts forgetting as geometric transformations of a network's learned features: these transformations diminish the allocated capacity of features, leading to poorer representations, and disrupt their subsequent readout by downstream computations. The study thereby offers a feature-centric vocabulary for continual learning, providing a granular, mechanistic account of how knowledge is lost when a neural network learns sequential tasks. The researchers argue that examining these internal representational transformations is crucial for a deeper understanding of forgetting, drawing on tools from mechanistic interpretability to reverse-engineer the algorithms learned by neural networks. The approach rests on the premise that knowledge is encoded through hierarchically represented individual features, each corresponding to a meaningful concept or pattern in the data and encoded as a linear direction in activation space.
The strength and direction of these features are captured by feature vectors, which, when combined, form the basis of a layer’s representation. Analysis reveals that feature vectors can be altered through rotations and scaling, impacting their allocated capacity and readability. Rotations can increase overlap with other features, while scaling can cause fading, both compromising the ability to accurately extract information. Even without reducing allocated capacity, these transformations can cause readout misalignment, where downstream computations fail to correctly interpret the feature. To illustrate the framework’s value, the scientists employed the tractable model to pinpoint optimal and suboptimal conditions for feature forgetting, subsequently validating their analytical predictions with empirical results.
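As a toy two-dimensional illustration of these failure modes (our own sketch, not the paper's experimental setup), consider a fixed readout vector tuned to a feature's original direction: rotating the feature vector misaligns it with the readout, while scaling it down fades the signal the readout can recover.

```python
import numpy as np

# Toy sketch (our own): a feature encoded as direction `phi` in activation space,
# read out by a frozen vector `r` that was tuned to the original encoding.
phi = np.array([1.0, 0.0])                # original feature vector
r = phi / np.linalg.norm(phi) ** 2        # readout aligned with the original feature

def readout(phi_now, activation=1.0):
    """Signal the frozen readout extracts from the (possibly transformed) feature."""
    return float(r @ (activation * phi_now))

print(readout(phi))            # ~1.0: readout matches the original encoding

# Rotation: the feature direction drifts while the readout stays fixed.
theta = np.deg2rad(60)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(readout(R @ phi))        # ~0.5: readout misalignment

# Scaling towards zero: the feature fades and the recoverable signal fades with it.
print(readout(0.1 * phi))      # ~0.1: feature fading
```

In a real network, the rotated component can additionally land on other features' directions, which is how rotation also increases overlap and erodes allocated capacity.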
This work offers a novel explanation for previously observed phenomena related to task similarity and architectural choices in neural networks. Furthermore, the research extends beyond theoretical modelling, demonstrating the framework’s applicability to practical models. This case study opens new avenues for mechanistic research in the field, providing a powerful tool for understanding and mitigating catastrophic forgetting.
Geometric Feature Capacity and Catastrophic Forgetting
The research team developed a geometric approach to understand how the capacity allocated to features diminishes and how their readout by downstream computations is disrupted, ultimately leading to forgetting. This work moves beyond performance-level measurements, probing the underlying mechanisms of forgetting. Experiments employed a feature-reader model, allowing researchers to systematically manipulate feature capacity and readout mechanisms and observe their effects on learning and forgetting. This technique enabled the team to analyse representations within the trained network, identifying feature fading and readout misalignment as dominant correlates of forgetting, particularly in the final layer.
The approach achieves a semantic understanding of forgetting, distinguishing between feature rotation and fading, which traditional metrics like Centered Kernel Alignment often fail to do. Researchers used linear probing to assess the quality of learned features, while carefully controlling for confounding factors within the feature-reader model. The framework delivers precise quantification of feature capacity and readout alignment, providing a granular view of the forgetting process. This methodology allows practitioners to diagnose the root causes of continual learning failures and determine which knowledge components are most susceptible to loss. The team also characterised the impact of depth, demonstrating that deeper networks exhibit greater capacity reduction during continual learning. This work provides a new, feature-centric perspective, bridging the gap between continual learning and mechanistic interpretability, and ultimately deepening our understanding of the forgetting phenomenon.
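As a rough sketch of what such a linear probe looks like in practice (the protocol, layer choice, and helper names here are our own assumptions, not the authors' exact setup), one fits only a linear classifier on frozen intermediate features before and after training on a new task and compares accuracies:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test):
    """Fit a linear classifier on frozen features and report test accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats_train, y_train)
    return accuracy_score(y_test, probe.predict(feats_test))

# `extract_features` is a hypothetical helper that runs inputs through the frozen
# network up to the layer of interest (e.g. the penultimate layer of a CIFAR-10 model).
# acc_before = linear_probe_accuracy(extract_features(model_before_task2, X_tr), y_tr,
#                                    extract_features(model_before_task2, X_te), y_te)
# acc_after  = linear_probe_accuracy(extract_features(model_after_task2, X_tr), y_tr,
#                                    extract_features(model_after_task2, X_te), y_te)
# A drop from acc_before to acc_after indicates feature-level forgetting at that layer.
```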
Feature encoding geometry explains catastrophic forgetting in neural networks
The team measured feature representation using feature vectors, capturing both direction and strength within a layer’s activation space. These vectors form the basis of the layer’s representation, where the activation in response to an input is a linear combination of feature vectors weighted by their activations. Deeper layers were found to encode increasingly abstract features, demonstrating a hierarchical progression of pattern recognition. Results from the tractable model implementation support the analysis, providing a fresh explanation for previously observed phenomena related to task similarity and architectural choices.
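In symbols (our own notation for the quantities just described), a layer's activation for an input $x$ can be written as $h(x) = \sum_i a_i(x)\,\phi_i$, where $a_i(x)$ is the activation of feature $i$ on that input and $\phi_i$ is its feature vector.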
Researchers quantified allocated capacity, a metric that captures the exclusivity of a feature’s encoding, using the formula $C_i = (\phi_i^\top \phi_i)^2 / \sum_j (\phi_i^\top \phi_j)^2$ if $\|\phi_i\| > 0$, and $C_i = 0$ otherwise. When a feature vector is orthogonal to all others, its allocated capacity reaches one, indicating exclusive access to its representational dimension. Conversely, overlapping feature vectors exhibit decreased $C_i$ values, reflecting shared capacity and reduced readability. The study demonstrated that when a “yellow” feature is represented in a network layer, downstream computations rely on readout vectors to extract information, with readout quality fundamentally limited by the feature’s allocated capacity. Furthermore, the work introduces a conceptualisation of forgetting arising from two transformations: rotations and scaling of feature vectors. These transformations lead to capacity degradation through feature overlap or fading, and to readout misalignment, where downstream computations fail to adapt to changes in feature representation.
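A minimal sketch of how this allocated-capacity metric can be computed from a stack of feature vectors (our own illustration; the example vectors are arbitrary):

```python
import numpy as np

def allocated_capacity(Phi):
    """Allocated capacity C_i for each feature vector (the rows of Phi):
    C_i = (phi_i . phi_i)^2 / sum_j (phi_i . phi_j)^2, and C_i = 0 for a zero vector."""
    G = Phi @ Phi.T                         # Gram matrix of pairwise dot products
    diag = np.diag(G)
    denom = (G ** 2).sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, diag ** 2 / denom, 0.0)

# Mutually orthogonal features: each has exclusive access to its dimension (C_i = 1).
print(allocated_capacity(np.eye(3)))                  # [1. 1. 1.]

# Overlapping features: the shared direction drags both C values below 1.
Phi_overlap = np.array([[1.0, 0.0, 0.0],
                        [0.9, 0.1, 0.0],
                        [0.0, 0.0, 1.0]])
print(allocated_capacity(Phi_overlap))                # first two entries < 1

# A fully faded feature (zero vector) retains no capacity at all.
Phi_faded = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])
print(allocated_capacity(Phi_faded))                  # [1. 0. 1.]
```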
Forgetting as Geometric Feature Vector Evolution
Through analysis of a simplified model, researchers identified scenarios that either minimise or exacerbate forgetting, demonstrating that orthogonal task probes and minimal overlap in feature activation can mitigate interference. The study formalised this view by deriving how feature vectors evolve during training, revealing that updates are influenced not only by the feature itself but also by correlated feature activations and the alignment with the new task’s probe. Key findings indicate that features rarely activated in a new task experience minimal change, effectively limiting forgetting to the active feature subspace. Furthermore, the research highlights the importance of probe orthogonality, as gradients from orthogonal tasks do not interfere with previously learned features. Acknowledging the model’s simplifications, the authors note that assumptions such as fixed task probes were made to enable closed-form derivations, though they suggest the qualitative conclusions likely hold even with probe adaptation. They also build upon previous linear model work, incorporating a superposition lens absent in earlier SVD-based analyses.
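To make these two findings concrete, here is a small numerical sketch under our own simplifying assumptions (a linear feature model, fixed task probes, squared-error loss, plain SGD), in the spirit of the tractable model described above: features that stay silent in the new task do not move at all, and updates directed along an orthogonal probe leave the old task's readout of each feature unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear feature model (our own sketch): a layer encodes n_feat features as the
# rows of Phi in a d-dimensional activation space, h(x) = sum_i a_i(x) * phi_i, and a
# task reads the layer out with a fixed probe p as y_hat = p . h(x).
n_feat, d, n_samples, lr = 4, 8, 200, 0.05
Phi = rng.normal(size=(n_feat, d)) / np.sqrt(d)
Phi_before = Phi.copy()

p_old = rng.normal(size=d)                            # probe of the previously learned task
p_new = rng.normal(size=d)
p_new -= (p_new @ p_old) / (p_old @ p_old) * p_old    # make the new task's probe orthogonal
p_new /= np.linalg.norm(p_new)

# The new task activates only features 0 and 1; features 2 and 3 stay silent (a_i = 0).
A = np.zeros((n_samples, n_feat))
A[:, :2] = rng.normal(size=(n_samples, 2))
targets = rng.normal(size=n_samples)

for a, y in zip(A, targets):                          # plain SGD on the new task's MSE loss
    h = a @ Phi                                       # h(x) = sum_i a_i * phi_i
    err = p_new @ h - y
    Phi -= lr * err * np.outer(a, p_new)              # dL/dphi_i = err * a_i * p_new

delta = Phi - Phi_before
print(np.linalg.norm(delta, axis=1))   # rows 2 and 3 are exactly zero: silent features don't move
print(delta @ p_old)                   # ~0: updates along p_new leave the old probe's readout intact
```

Because each update to $\phi_i$ in this toy model is the new task's probe scaled by that feature's activation and by an error term that mixes in correlated features, both qualitative effects fall directly out of the gradient.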