
On-device AI models remain small because the entire weight set has to reside in DRAM, capping practical parameter counts well below those of server-side deployments. Enterprise architects evaluating agentic workloads have had to choose between a capable but cloud-dependent model and a limited on-device one. Apple’s third-generation Foundation Models, announced at WWDC26, break that barrier by moving the weight set out of DRAM entirely.
The AFM3 family was developed in collaboration with Google and consists of five models: two on-device and three server-based, with the server models all running within Apple’s Private Cloud Compute. Server-side models, including AFM3 Cloud Pro for agentic tool use and complex logic, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple’s own. AFM3 Core Advanced is a 20-billion-parameter model that stores its weights in NAND flash instead of DRAM.
"Instead of putting the entire model in DRAM, the entire model is stored in flash memory," Apple’s research team wrote. "Since the NAND-to-DRAM bandwidth is too slow to swap weights token by token, as the standard MoE model requires, AFM 3 Core Advanced makes routing decisions per prompt."
How the architecture actually works
Every local AI developer runs into the same memory wall that Apple is working around.
"You can’t put 20B parameters in RAM at any reasonable precision," Awni Hannun, an Anthropic researcher and former Apple research scientist, posted on X. "To implement this they are using a pretty exotic architecture by today’s standards. A small model predicts from the query (or prompt) which experts to load from NAND into RAM."
That predict-and-load mechanism consists of three separate components, each of which is governed by the hardware constraints of consumer silicon.
The entire 20B weight set resides in flash, not DRAM. AFM3 Core Advanced stores its entire parameter set in NAND flash instead of active memory. Standard on-device deployments require the whole model to fit in DRAM, which caps their parameter counts. Apple’s approach, which it calls Instruction-Following Pruning (IFP) and developed with its own researchers, treats flash as the permanent home of the model and DRAM as a working buffer for the experts needed for a given prompt.
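The size of the problem is easy to check with arithmetic. The sketch below multiplies the 20-billion-parameter count from the article by a few weight precisions; the precisions are illustrative assumptions, since Apple has not said how AFM3 Core Advanced is quantized.

```python
# Back-of-the-envelope weight footprint: parameters x bits per weight.
# The precisions are illustrative assumptions, not Apple-published figures.

def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 20e9  # full parameter pool held in flash, per the article

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_footprint_gb(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit: 40.0 GB
#  8-bit: 20.0 GB
#  4-bit: 10.0 GB
```

Even at an assumed 4 bits per weight, the full pool would take 10 GB before the operating system, apps, and the model's own working memory are counted.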
Expert routing happens once per prompt, not per token. In a traditional mixture-of-experts model, a router selects different experts for each generated token, which would require continuous weight movement between flash and DRAM at inference speed. NAND-to-DRAM bandwidth cannot support that. AFM3 Core Advanced routes the prompt once, selects a fixed expert set, loads it into DRAM, with shared experts always active, and generates all tokens from the same configuration.
"The main difference from a normal MoE is that you do it once per query and then generate all tokens with the same experts," Hannun wrote.
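The difference in flash traffic can be illustrated with a toy simulation. This is a minimal sketch, not Apple's router: the expert count, the top-k value, and the random scores standing in for a learned router are all hypothetical. Real per-token routers reuse experts more often than random scores do, so the gap shown here is an upper-end illustration.

```python
import random

NUM_EXPERTS = 16  # hypothetical pool of routed experts held in flash
TOP_K = 4         # hypothetical number of routed experts kept in DRAM


def top_k_experts(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def count_flash_loads(expert_sets):
    """Count flash-to-DRAM expert loads across a generation.

    An expert already resident from the previous step costs nothing;
    each newly selected expert counts as one load.
    """
    resident, loads = frozenset(), 0
    for selected in expert_sets:
        loads += len(selected - resident)
        resident = selected
    return loads


def simulate(num_tokens=100, seed=0):
    """Return (per_prompt_loads, per_token_loads) for one generation."""
    rng = random.Random(seed)

    def stand_in_scores():
        # Random numbers replace a learned router in this sketch.
        return [rng.random() for _ in range(NUM_EXPERTS)]

    # Per-prompt routing: one decision, reused for every generated token.
    prompt_set = top_k_experts(stand_in_scores())
    per_prompt = count_flash_loads([prompt_set] * num_tokens)

    # Per-token routing: a fresh decision at every token.
    token_sets = [top_k_experts(stand_in_scores()) for _ in range(num_tokens)]
    per_token = count_flash_loads(token_sets)
    return per_prompt, per_token


per_prompt_loads, per_token_loads = simulate()
print(f"per-prompt loads: {per_prompt_loads}, per-token loads: {per_token_loads}")
```

With per-prompt routing the load count always equals `TOP_K`, however long the generation runs. With per-token routing it grows with the number of tokens, which is the traffic the NAND-to-DRAM link cannot sustain.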
Active parameter count ranges from 1B to 4B depending on task complexity. Instead of running a fixed model size for each request, AFM3 Core Advanced adjusts how many parameters it activates to what the task requires: 1 billion for simple operations and up to 4 billion for harder ones, all drawn from the 20-billion-parameter pool in flash.
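One way to picture a variable active size is as a budget that decides how many routed experts get loaded. The sketch below is purely illustrative: the 1B to 4B range comes from the article, but the shared-expert size, the per-expert size, the complexity score, and the linear mapping are hypothetical, since Apple has not published the expert layout or the selection policy.

```python
# Hypothetical sizes chosen only to make the arithmetic visible;
# Apple has not published the expert layout of AFM3 Core Advanced.
SHARED_PARAMS = 0.5e9   # assumed always-active shared experts
EXPERT_PARAMS = 0.25e9  # assumed size of one routed expert
MIN_ACTIVE = 1e9        # lower bound of the active range stated in the article
MAX_ACTIVE = 4e9        # upper bound of the active range stated in the article


def active_budget(complexity: float) -> float:
    """Map a complexity score in [0, 1] to an active-parameter budget.

    Linear interpolation is an assumption; the real policy is unpublished.
    """
    clamped = min(max(complexity, 0.0), 1.0)
    return MIN_ACTIVE + clamped * (MAX_ACTIVE - MIN_ACTIVE)


def routed_expert_count(complexity: float) -> int:
    """Number of routed experts that fit in the budget after shared experts."""
    return int((active_budget(complexity) - SHARED_PARAMS) // EXPERT_PARAMS)


for score in (0.0, 0.5, 1.0):
    print(score, routed_expert_count(score))
# 0.0 2
# 0.5 8
# 1.0 14
```

Under these assumed sizes, a simple request loads two routed experts and the hardest request loads fourteen, so the DRAM working set scales with the task rather than with the size of the pool in flash.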
What Apple has revealed and what it hasn’t
The architecture paper details the memory design and sparse activation mechanisms. It is less forthcoming about practical deployment barriers.
Apple’s profiling tools expose timing, but not the metrics that determine production feasibility. "Energy, memory bandwidth, thermal? Not in the documents," Marco Abis, who is building the profiler Giraffe for native AI on Apple Silicon, posted on X. "A notable gap, considering that is what dictates most on-device performance."
Abis found no statement in Apple’s documentation (the Core AI docs, the Foundation Model docs, or the Private Cloud Compute security post) about when an on-device request is transparently offloaded, or whether that routing is visible to the developer or user. For enterprises that must document where inference runs, this is a direct compliance issue.
Not all of that information is available yet. Apple has indicated that a full technical report with benchmarks is due later this summer.
What this means for the enterprise architect
Regulated industries evaluating agentic AI deployments now have to make a concrete architectural decision.
- The DRAM wall for on-device agents has just moved. Enterprises evaluating agents that need to run without a cloud round-trip now have a 20-billion-parameter local option to consider. The constraint shifts from model capability to device hardware.
- The private/cloud boundary is now an architectural decision, not a default. Simple requests remain on the device; complex agentic tasks route to AFM3 Cloud Pro on Private Cloud Compute. Apple hasn’t publicly specified when a request is offloaded or whether that routing is visible to the developer, a gap that complicates policy decisions for organizations that need to document where inference runs.
- The agent server tier relies on Google Cloud. AFM3 Cloud Pro runs on Nvidia GPUs in Google Cloud. The Private Cloud Compute guarantee covers data privacy. It does not eliminate the Google Cloud dependency for server-side inference.
AFM3 Core Advanced gives enterprises a 20-billion-parameter on-device option that did not exist before WWDC26. Whether it is deployable at scale depends on answers Apple has not yet published. Those details are expected in the summer technical report.