Google’s ‘Nested Learning’ paradigm could solve AI's memory and continual learning problem


Researchers at Google have developed a new AI paradigm that aims to solve one of the biggest limitations in today’s large language models: the inability to learn or update their knowledge after training. The paradigm, called Nested Learning, redefines a model and its training not as a single process, but as a system of nested, multi-scale optimization problems. The researchers argue that this approach could unlock more expressive learning algorithms, leading to better in-context learning and memory.

To prove their concept, the researchers used Nested Learning to develop a new model called Hope. Initial experiments show that it delivers superior performance on language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient AI systems that can adapt to real-world environments.

The memory problem of large language models

Deep learning algorithms helped reduce the need for the careful feature engineering and domain expertise that traditional machine learning required. By feeding models large amounts of data, they can learn the necessary representations on their own. However, this approach presented its own set of challenges that could not be solved simply by stacking more layers or building larger networks, such as generalizing to new data, continually learning new tasks, and avoiding sub-optimal solutions during training.

Efforts to overcome these challenges led to innovations such as the Transformer, the foundation of today’s large language models (LLMs). These models showed that "scaling up the ‘right’ architecture results in a paradigm shift from task-specific models to more general-purpose systems with a variety of emergent capabilities," the researchers write. Nevertheless, a fundamental limitation remains: LLMs are largely static after training and cannot update their core knowledge or acquire new skills from new interactions.

The only adaptable component of an LLM is its in-context learning capability, which allows it to act on the information provided in the immediate prompt. This makes current LLMs comparable to a person who cannot form new long-term memories: their knowledge is limited to what they learned during pre-training (the distant past) and what is in their current context window (the immediate present). Once the conversation grows beyond the context window, that information is lost forever.

The problem is that today’s Transformer-based LLMs have no mechanism for “online” consolidation. The information in the context window never updates the model’s long-term parameters, the weights stored in its feed-forward layers. As a result, the model cannot permanently acquire new knowledge or skills from interactions; whatever it learns disappears as soon as the context window rolls over.

A nested approach to learning

Nested Learning (NL) is designed to allow computational models, like the brain, to learn from data at different levels of abstraction and time scales. It treats a machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized together at different speeds. This is a departure from the classic approach, which treats a model’s architecture and its optimization algorithm as two separate components.

Under this paradigm, the training process is viewed as developing "associative memory": the ability to connect and recall related pieces of information. The model learns to map a data point to its local error, which measures how "surprising" that data point was. Even key architectural components, like the attention mechanism in Transformers, can be viewed as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be ordered into different "levels," forming the core of the NL paradigm.
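The multi-speed idea can be illustrated with a toy sketch (our own illustration, not the paper's algorithm): several parameter groups, or "levels," are trained on the same error signal, but each level updates on its own clock, from fast to slow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "levels," from fast (updates every step) to slow (every 16 steps).
levels = [
    {"w": rng.normal(size=4), "period": 1,  "lr": 0.10, "updates": 0},  # fast
    {"w": rng.normal(size=4), "period": 4,  "lr": 0.05, "updates": 0},  # medium
    {"w": rng.normal(size=4), "period": 16, "lr": 0.01, "updates": 0},  # slow
]

def predict(x):
    # The model's output combines the parameters of all levels.
    return sum(lvl["w"] @ x for lvl in levels)

for step in range(64):
    x = rng.normal(size=4)
    target = x.sum()                 # a simple synthetic target to learn
    error = predict(x) - target      # local error: the "surprise" signal
    for lvl in levels:
        if step % lvl["period"] == 0:              # each level's own clock
            lvl["w"] -= lvl["lr"] * error * x      # gradient of 0.5*error**2
            lvl["updates"] += 1

# The fast level has taken many small steps; the slow level only a few.
print([lvl["updates"] for lvl in levels])  # [64, 16, 4]
```

In this sketch the fast level tracks recent data while the slow level drifts gradually, a crude stand-in for the paper's ordering of nested optimization problems by update frequency.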

Hope for continual learning

The researchers put these principles into practice with Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture Google introduced in January to address the memory limitations of the Transformer model. While Titans had a powerful memory system, its parameters were updated at only two different speeds: a long-term memory module and a short-term memory mechanism.

Hope is a self-modifying architecture augmented with a "Continuum Memory System" (CMS) that enables unlimited levels of in-context learning and scales to larger context windows. The CMS functions like a series of memory banks, each updated at a different frequency. Fast-updating banks handle immediate information, while slow-updating banks consolidate more abstract knowledge over longer periods. This allows the model to adapt its own memory in a self-referential loop, creating an architecture with theoretically infinite learning levels.
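The chain-of-memory-banks idea can be sketched with a small toy (our own simplification, not Google's implementation): a fast bank records every token, while each slower bank periodically stores a consolidated summary of the bank below it.

```python
from collections import deque

class ContinuumMemory:
    """Toy chain of memory banks, each refreshed at a different frequency."""

    def __init__(self, periods=(1, 4, 16), capacity=8):
        self.periods = periods
        self.banks = [deque(maxlen=capacity) for _ in periods]
        self.step = 0

    def write(self, token):
        self.step += 1
        self.banks[0].append(token)  # fast bank: stores every token
        for i, period in enumerate(self.periods[1:], start=1):
            if self.step % period == 0 and self.banks[i - 1]:
                # Slower bank: consolidate a snapshot of the faster bank.
                self.banks[i].append(tuple(self.banks[i - 1]))

    def read(self):
        # Reads combine recent detail with older, consolidated context.
        return [list(bank) for bank in self.banks]

mem = ContinuumMemory()
for t in range(32):
    mem.write(t)
```

After 32 writes, the fast bank holds only the most recent tokens, while the slowest bank holds a couple of coarse snapshots: recent detail decays quickly, abstract context persists.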

On a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well a model predicts the next word in a sequence and maintains coherence in the text it generates) and higher accuracy than both the standard Transformer and other modern recurrent models. Hope also excelled on long-context "needle in a haystack" tasks, where a model must find and use a specific piece of information hidden within a large amount of text. This suggests that its CMS offers a more efficient way of handling long information sequences.

It is one of many efforts to create AI systems that process information at different levels. The Hierarchical Reasoning Model (HRM) from Sapient Intelligence used a hierarchical architecture to make models more efficient at learning reasoning tasks. The Tiny Recursive Model (TRM), a model from Samsung, improved on HRM with architectural changes that make it more efficient and boost its performance.

While promising, Nested Learning faces some of the same obstacles as these other paradigms in realizing its full potential. Current AI hardware and software stacks are heavily optimized for classic deep learning architectures and Transformer models, so large-scale adoption of nested learning may require fundamental changes. However, if it gains momentum, it could lead to far more efficient LLMs that can learn continuously, a critical capability for real-world enterprise applications where environments, data, and user needs are in constant flux.
