Google PM open-sources Always On Memory Agent, ditching vector databases for LLM-driven persistent memory

Shubham Sabu, senior AI product manager at Google, has turned one of the toughest problems in agent design into an open-source engineering exercise: persistent memory.

This week, he published an open-source “Always On Memory Agent” on the official Google Cloud Platform GitHub page under a permissive MIT license, which allows commercial use.

The agent is built with Google’s Agent Development Kit, or ADK, introduced in spring 2025, and Gemini 3.1 Flash-Lite, a lower-cost model Google released on March 3, 2026 as the fastest and most cost-effective model in its Gemini 3 series.

This project serves as a practical reference implementation for something that many AI teams want but few have produced in a clean way: an agent system that can continuously receive information, consolidate it in the background, and retrieve it later without relying on traditional vector databases.

For enterprise developers, the release matters less as a product launch than as an indication of where the agent infrastructure is headed.

The repo offers a vision of long-term autonomy that is increasingly attractive for support systems, research assistants, internal copilots, and workflow automation. As memory ceases to be session-bound, governance questions also come into sharper focus.

What the repo appears to do – and what it apparently doesn’t claim

The repo appears to use a multi-agent internal architecture, with specialist components handling ingestion, consolidation and queries.

But the supplied material does not clearly establish a blanket claim that it is a shared memory framework for multiple independent agents.

That distinction matters. ADK supports multi-agent systems as a framework, but this specific repo is consistently described as a standalone memory agent, or memory layer, built from specialist subagents and persistent storage.

Even at this narrow level, it solves a core infrastructure problem that multiple teams are actively working on.

Architecture prioritizes simplicity over traditional retrieval stacks

According to the repository, the agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default.

A native HTTP API and Streamlit dashboard are included, and the system supports text, image, audio, video, and PDF ingestion. The repo frames the design with a deliberately provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes to structured memory.”
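The described design can be sketched in a few dozen lines. This is a minimal illustration of the pattern the repo claims, not its actual code: the table layout, column names, and the `summarize` callable standing in for the LLM are all assumptions.

```python
import sqlite3
import json
import time

# Illustrative sketch of the described design: structured memories in SQLite,
# no embeddings, with a consolidation pass on a fixed schedule (the repo's
# default is every 30 minutes). Schema and names are hypothetical.
CONSOLIDATION_INTERVAL_S = 30 * 60

def init_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS memories (
            id INTEGER PRIMARY KEY,
            created_at REAL,
            kind TEXT,          -- e.g. 'fact', 'preference', 'event'
            content TEXT,       -- structured memory written by the model (JSON)
            consolidated INTEGER DEFAULT 0
        )""")
    return conn

def ingest(conn, kind, payload):
    """Store a raw observation; in the real system the LLM would structure it first."""
    conn.execute(
        "INSERT INTO memories (created_at, kind, content) VALUES (?, ?, ?)",
        (time.time(), kind, json.dumps(payload)),
    )
    conn.commit()

def consolidate(conn, summarize):
    """Background pass: hand unconsolidated rows to the model (here, any
    callable) and write back a merged memory -- the 'reads, thinks, writes' loop."""
    rows = conn.execute(
        "SELECT id, content FROM memories WHERE consolidated = 0").fetchall()
    if not rows:
        return 0
    merged = summarize([json.loads(c) for _, c in rows])
    conn.execute(
        "INSERT INTO memories (created_at, kind, content, consolidated) "
        "VALUES (?, 'summary', ?, 1)",
        (time.time(), json.dumps(merged)),
    )
    conn.executemany(
        "UPDATE memories SET consolidated = 1 WHERE id = ?",
        [(rid,) for rid, _ in rows],
    )
    conn.commit()
    return len(rows)
```

A scheduler would simply call `consolidate` every `CONSOLIDATION_INTERVAL_S` seconds with the model wired in as `summarize`.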

That design option is likely to attract the attention of developers looking to manage cost and operational complexity. Traditional retrieval stacks often require separate embedding pipelines, vector storage, indexing logic, and synchronization tasks.

Sabu’s example instead relies directly on the model to organize and update memory. In practice, this can simplify prototyping and reduce infrastructure sprawl, especially for agents with small or medium memory stores. It also shifts the performance question from vector search overhead to model latency, memory compaction logic, and long-run behavioral stability.
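Without embeddings, retrieval can be as plain as SQL filtering plus letting the model read the matching rows in context. A standalone sketch under assumed table and query shapes; the repo's actual query path may differ:

```python
import sqlite3

def recall(conn, terms, limit=20):
    # Instead of vector similarity, filter structured rows with plain SQL
    # and hand the matches to the model's context window.
    clause = " OR ".join("content LIKE ?" for _ in terms)
    rows = conn.execute(
        f"SELECT content FROM memories WHERE {clause} "
        "ORDER BY rowid DESC LIMIT ?",
        [f"%{t}%" for t in terms] + [limit],
    ).fetchall()
    # The returned block would be prepended to the model's prompt.
    return "\n".join(r[0] for r in rows)

# Minimal demo store (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (content TEXT)")
conn.executemany("INSERT INTO memories VALUES (?)",
                 [("user prefers dark mode",), ("meeting moved to Friday",)])
```

This is exactly the point of Effie's later criticism: the approach is simple while the store is small, but substring filters and full-context reads do not scale the way indexed retrieval does.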

With Flash-Lite, the always-on model makes economic sense

This is where Gemini 3.1 Flash-Lite enters the story.

Google says this model is built largely for high-volume developer workloads and is priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens.

The company also says Flash-Lite is 2.5 times faster than Gemini 2.5 Flash at time to first token and offers a 45% increase in output speed while maintaining the same or better quality.

On Google’s published benchmarks, the model posted an Elo score of 1432 on Arena.ai, 86.9% on GPQA Diamond, and 76.8% on MMMU Pro. Google positions the model as suitable for high-frequency tasks like translation, moderation, UI generation, and simulation.

Those numbers help explain why Flash-Lite has been paired with a background-memory agent. A 24/7 service that periodically re-reads, consolidates, and rewrites memory needs predictable latency and predictably low costs to avoid making “always on” prohibitively expensive.
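The published prices make the back-of-envelope math easy. The per-pass token counts below are illustrative assumptions, not measurements from the repo:

```python
# Flash-Lite published pricing: $0.25 per 1M input tokens, $1.50 per 1M output.
INPUT_PER_M = 0.25
OUTPUT_PER_M = 1.50

def daily_cost(input_tokens_per_pass, output_tokens_per_pass,
               passes_per_day=48):  # one consolidation pass every 30 minutes
    cost_in = input_tokens_per_pass * passes_per_day * INPUT_PER_M / 1e6
    cost_out = output_tokens_per_pass * passes_per_day * OUTPUT_PER_M / 1e6
    return cost_in + cost_out

# Hypothetical load: re-reading 50k tokens and writing 5k tokens per pass.
# 50_000*48*0.25/1e6 + 5_000*48*1.50/1e6 = 0.60 + 0.36 = $0.96/day
```

At roughly a dollar a day under these assumptions, an always-on consolidation loop is cheap enough to leave running; the same loop on a frontier-priced model would be an order of magnitude more.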

Google’s ADK documentation reinforces the broader story. The framework is presented as model-agnostic and deployment-agnostic, with support for workflow agents, multi-agent systems, tools, evaluation, and deployment targets including Cloud Run and Vertex AI Agent Engine. This combination makes the memory agent feel less like a one-off demo and more like a reference point for a broader agent runtime strategy.

The enterprise debate is not just about capability, but about governance

The public reaction shows why continued enterprise adoption of persistent memory will not depend solely on speed or token pricing.

Many of the responses on X highlighted concerns that enterprise architects may raise. Frank Abe called Google ADK and 24/7 memory consolidation a “fantastic leap forward for continuous agent autonomy,” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.”

ELED made a related point, arguing that the main cost of always-on agents is not tokens but “drift and loop”.

Those criticisms go directly to the operational burden of persistent systems: who can write to memory, what gets merged, how does retention work, when are memories deleted, and how do we audit what the agent learned over time?
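None of those controls are described in the supplied materials, but the shape of an answer is well understood. A hypothetical sketch of what an audited, retention-aware write path could look like; nothing here is from the repo:

```python
import time
import json
import hashlib

# Hypothetical governance wrapper around memory writes: an append-only audit
# log records who wrote what and when memories were expired. Illustrative
# only -- the actual repo describes no such controls.
class AuditedMemory:
    def __init__(self, retention_days=30):
        self.entries = []        # list of (expires_at, record)
        self.audit_log = []      # append-only trail
        self.retention_s = retention_days * 86400

    def write(self, actor, content, now=None):
        now = time.time() if now is None else now
        record = {"actor": actor, "content": content, "written_at": now}
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append((now + self.retention_s, record))
        self.audit_log.append(
            {"op": "write", "actor": actor, "sha256": digest, "at": now})
        return digest

    def expire(self, now=None):
        """Delete memories past their retention window, logging each deletion."""
        now = time.time() if now is None else now
        kept, dropped = [], 0
        for expires_at, record in self.entries:
            if expires_at <= now:
                self.audit_log.append(
                    {"op": "expire", "actor": "system", "at": now})
                dropped += 1
            else:
                kept.append((expires_at, record))
        self.entries = kept
        return dropped
```

Merging and cross-pollination (the “dreaming” Abe worries about) would need the same treatment: every consolidation pass recorded with its inputs and outputs so the agent’s learned state stays auditable.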

Another response, from Effie, challenged the repo’s “no embeddings” framing, arguing that the system still has to chunk, index, and retrieve structured memory, and that this may work well for small-context agents but breaks down when the memory store grows too large.

That criticism is technically important. Removing the vector database does not remove retrieval design; it changes where the complexity resides.

For developers, the tradeoff is less about ideology and more about fit. A lightweight stack may be attractive for low-cost, bounded-memory agents, while large-scale deployments may still require tighter retrieval controls, more explicit indexing strategies, and stronger lifecycle tooling.

ADK expands the story beyond a demo

Other commenters focused on developer workflow. One asked for the ADK repo and documentation and wanted to know if the runtime is serverless or long-running, and whether tool-calling and evaluation hooks are available out of the box.

Based on the materials supplied, the answer is effectively both: the memory-agent instance itself is structured like a long-running service, while the ADK more broadly supports multiple deployment patterns and includes tools and evaluation capabilities.

The always-on memory agent is interesting in itself, but the bigger message is that Sabu is trying to make agents feel like deployable software systems rather than isolated demos. In that framing, memory becomes part of the runtime layer, not just an add-on feature.

What Sabu has shown – and what he hasn’t shown

What Sabu hasn’t shown yet is just as important as what he has published.

The material provided does not include direct benchmarks of Flash-Lite against Anthropic’s Claude Haiku for the agent loop in production use.

The materials also do not describe typical enterprise-grade compliance controls for this memory agent, such as deterministic policy boundaries, retention guarantees, separation rules, or formal audit workflows.

And while the repo appears to use multiple specialist agents internally, the material does not clearly support a broader claim of persistent memory shared among multiple independent agents.

For now, the repo reads as a compelling engineering template rather than a complete enterprise memory platform.

Why does it matter now?

Still, the release comes at the right time. Enterprise AI teams are moving beyond single-turn assistants and toward systems that are expected to remember priorities, preserve project context, and work over longer horizons.

Sabu’s open-source memory agent provides a solid starting point for the next layer of infrastructure, and lends some credibility to Flash-Lite’s economics.

But the strongest takeaway from the reaction around the launch is that persistent memory will be judged on governance as much as capability.

This is the real enterprise question behind Sabu’s demo: not whether an agent can remember, but whether it can remember in ways that remain limited, observable, and safe to rely on in production.
