For five years, every frontier AI lab has been locked in the same race: make the context window bigger. 4,000 tokens. 32,000. 200,000. 1 million. 2 million. The implicit assumption — that more memory means more intelligence — has driven billions of dollars of inference-cost optimisation, hardware investment, and benchmark wins. A new paper from MIT CSAIL by Alex L. Zhang, Tim Kraska, and Omar Khattab argues that assumption is wrong. The future of AI memory, they say, is not a bigger brain. It is teaching the model where to look. The replacement architecture — Recursive Language Models — is also exactly the answer to the question every sovereign-AI advocate has been asking for the last three years.
🔍 THE BOTTOM LINE
The RLM paper makes three claims worth taking seriously. First, an LLM that navigates an external document outperforms the same LLM trying to memorise the document — by margins of +26% over compaction methods, +130% over CodeAct with sub-calls, and +13% over Claude Code, on GPT-5 across four long-context benchmarks. Second, a model with a 200,000-token context window can effectively process roughly 20 million tokens using the RLM pattern — two orders of magnitude beyond the native window. Third, a small post-trained RLM (RLM-Qwen3-8B) beats its vanilla base by 28.3% on average and approaches GPT-5 quality on three long-context tasks. The architectural shift is the interesting part. Stop trying to make the model remember everything; treat the document as an environment and let the model query it the way a human expert uses a library.
What the Paper Actually Does
The RLM paradigm, as described in the arXiv preprint and the open-source release on GitHub, treats a long prompt as part of an external environment rather than something to be loaded into the model’s context. The LLM is given programmatic access to the document — it can examine it, decompose it, and recursively call itself over snippets. Instead of stuffing a 500-page contract or a million-line codebase into a single prompt and hoping the model remembers everything, the model writes a small program that navigates the document. It searches, it scans structure, it pulls only the information it needs, and it composes the answer.
The model is also allowed to launch smaller sub-agents in parallel to analyse different sections of the document at the same time, and the architecture combines their findings into a final answer. There is no fixed up-front chunking and no lossy summarisation of the source, so the information loss and context decay that come with compaction are avoided. The entire document stays intact and searchable, but only the relevant parts are loaded into the model at any one time.
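The navigate-and-recurse loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `navigate`, `keyword_model`, the character-based `window`, and the `fanout` parameter are all hypothetical names, and the "model" is a keyword filter standing in for a real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical model interface: (question, snippet) -> findings as text.
ModelFn = Callable[[str, str], str]


def _words(text: str) -> set[str]:
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.lower().strip("?.,") for w in text.split()}


def keyword_model(question: str, snippet: str) -> str:
    """Toy stand-in for an LLM call: keep snippet lines sharing a long word with the question."""
    terms = {w for w in _words(question) if len(w) > 3}
    return "\n".join(ln for ln in snippet.splitlines() if terms & _words(ln))


def navigate(question: str, document: str, model: ModelFn,
             window: int = 200, fanout: int = 4) -> str:
    """Answer `question` while passing only window-sized slices to `model`."""
    fanout = max(2, fanout)  # at least two parts, so every split shrinks the input
    lines = document.splitlines()
    if len(document) <= window or len(lines) <= 1:
        # Base case: the slice fits (a single over-long line is truncated in this toy).
        return model(question, document[:window])
    # Decompose: split into at most `fanout` contiguous parts on line boundaries.
    step = -(-len(lines) // fanout)  # ceiling division
    parts = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    # Recurse over the parts in parallel, one sub-call per part.
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        findings = pool.map(lambda p: navigate(question, p, model, window, fanout), parts)
        # Combine: keep non-empty findings in document order.
        return "\n".join(f for f in findings if f)


# Usage: the document is far larger than the window, yet the answer is found.
doc = "\n".join(f"filler line {i}" for i in range(300))
doc += "\nThe termination clause requires 90 days notice."
print(navigate("What does the termination clause require?", doc, keyword_model))
# prints: The termination clause requires 90 days notice.
```

In a real system the model itself would decide how to split and what to search for, and would compose the final answer from the sub-call findings rather than simply concatenating them.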
The numbers are the part that has the industry paying attention. On the same GPT-5 backbone, at comparable cost, the paper reports the following gains for RLM:
- +26% over compaction (the standard “summarise then query” approach)
- +130% over CodeAct with sub-calls (a code-executing agent pattern)
- +13% over Claude Code (Anthropic’s agentic coding tool)
The result is consistent across four diverse long-context benchmarks — multi-document reasoning, code understanding, long-context QA, and information aggregation — and the cost is comparable to running the long-context baseline. The lab has also open-sourced the full implementation on GitHub under a permissive license.
Why This Is the Third Paper in Six Months Saying It
The RLM paper is the third major technical contribution in the last six months that has, in effect, declared the long-context arms race structurally misguided. The Google DeepMind “From AGI to ASI” paper treated retrieval-augmented architectures as the durable primitive for the post-AGI era. Anthropic’s production Skills pattern is essentially RLM in commercial clothing: instead of stuffing every instruction into every prompt, store the instruction set externally and navigate it. And now MIT CSAIL has shown that a model navigating an external document beats the same model trying to remember the document, with a two-orders-of-magnitude context extension at comparable cost.
The convergence is the signal. Three independent research groups, three different framings, the same conclusion: the multi-year, multi-billion-dollar race to expand context windows has been solving the wrong problem. The real architecture is navigate, don’t memorise.
The Sovereign-AI Angle
This is where the paper lands hardest for a New Zealand reader. The persistent hurdle for Kiwi organisations (government agencies, banks, healthcare providers, the entire public sector) has been data residency. Most want to use high-capability AI but cannot legally or ethically move sensitive citizen data to servers in the United States, Singapore, or anywhere outside New Zealand jurisdiction. Until now the sovereign-AI question has been “do we build our own ChatGPT?”, a path that is expensive, slow, and almost certainly worse than the frontier.
The RLM architecture provides a different answer. Because RLMs navigate data as an external environment, the model does not need to “know” the whole dataset at once. It only needs to know how to find the right piece of information within a local data store. This unlocks a “Local Data, Global Intelligence” architecture:
- Sensitive data — citizen records, health data, financial transactions — stays on servers in Auckland, Wellington, or Christchurch.
- A local RLM acts as a navigator that queries the local store.
- The reasoning model can be any model: open-weights (Qwen, Llama, Mistral) running on NZ-based infrastructure, or even a frontier API call that sees only the retrieved snippet, not the entire corpus.
This is the architecture sovereign-AI advocates have been describing for three years without a working technical primitive. RLM is the primitive. The “Sovereign AI” question shifts from “we need our own ChatGPT” to “we need our own RLM, and any model can plug into it.” That is a smaller, cheaper, and faster ambition — and a more realistic path to data-sovereign AI for organisations that cannot or will not send their data offshore.
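The data boundary in that architecture can be made concrete with a small sketch. Everything here is an illustrative assumption rather than anything taken from the paper: `LocalStore`, `AuditedModel`, and `answer_locally` are hypothetical names, the records are made up, and the search term is passed in by hand where a real local navigator would choose it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LocalStore:
    """Hypothetical on-premises record store; the full corpus never leaves it."""
    records: dict[str, str]

    def search(self, term: str) -> list[str]:
        """Return IDs of records containing `term` (case-insensitive)."""
        term = term.lower()
        return [rid for rid, text in self.records.items() if term in text.lower()]

    def read(self, record_id: str) -> str:
        return self.records[record_id]


@dataclass
class AuditedModel:
    """Wraps any model callable and logs exactly what text is sent to it."""
    model: Callable[[str], str]
    outbound: list[str] = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        self.outbound.append(prompt)
        return self.model(prompt)


def answer_locally(question: str, search_term: str, store: LocalStore,
                   model: AuditedModel, max_snippets: int = 2) -> str:
    """Navigate the local store, then send only the matching snippets to the model."""
    ids = store.search(search_term)[:max_snippets]
    snippets = "\n".join(store.read(i) for i in ids)
    return model(f"Question: {question}\nContext:\n{snippets}")


# Usage with made-up records and a stub model that echoes the last prompt line.
store = LocalStore({
    "r1": "Patient A: penicillin allergy recorded 2021.",
    "r2": "Patient B: asthma, uses inhaler.",
    "r3": "Patient C: fractured wrist.",
})
model = AuditedModel(lambda prompt: prompt.splitlines()[-1])
print(answer_locally("Does patient A have an allergy?", "penicillin", store, model))
# prints: Patient A: penicillin allergy recorded 2021.
print(len(model.outbound))  # prints: 1
```

The audit log is the design point: whichever model sits behind `AuditedModel`, local or remote, an operator can inspect `outbound` to confirm that only retrieved snippets crossed the boundary.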
For the Australian market the same logic applies. A Brisbane-based healthcare provider, a Perth-based mining company, a Canberra-based defence contractor — all of them have the same data-residency constraint, and all of them have been waiting for a working technical answer. RLM is the first credible one.
What It Means for Frontier Labs
The pricing implications are non-trivial. Frontier labs have been selling 1-million-token context windows as a feature. If RLM can match or exceed that capability at comparable cost using a 200K context window, the “context window size” stops being a defensible differentiator. The competitive moat shifts to the model itself, the data flywheel, and the integrations — not the context size. This is bad news for labs that have been leaning on context-window size as a product story.
It is also a quiet vindication of Anthropic’s Skills architecture and OpenAI’s tool-use work. Both are variations on the “navigate, don’t memorise” theme. The labs that already shipped this pattern are positioned better than the labs that have been racing on raw context size.
❓ FAQ
Is the 100x context extension real? Yes, per the paper. A 200K-token model running the RLM pattern can effectively process approximately 20 million tokens — two orders of magnitude beyond the native window. The extension comes from the model being able to inspect, decompose, and recursively call itself over snippets of the document, rather than loading the whole document into context.
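The arithmetic behind the "100x" and "two orders of magnitude" figures is easy to check. The sketch below uses only the two token counts quoted above; the `min_window_fills` line is a back-of-envelope assumption added for illustration, not a figure from the paper.

```python
import math

native_window = 200_000        # tokens the model can hold at once
effective_tokens = 20_000_000  # tokens the RLM pattern is reported to handle

extension = effective_tokens / native_window
orders_of_magnitude = math.log10(extension)

# Illustrative lower bound only: if every token had to pass through the
# window once, this many full windows would be needed. A navigating model
# can read far less, because it loads only the parts it judges relevant.
min_window_fills = math.ceil(effective_tokens / native_window)

print(extension)                   # prints: 100.0
print(round(orders_of_magnitude))  # prints: 2
print(min_window_fills)            # prints: 100
```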
How is this different from RAG? Traditional RAG retrieves chunks of a document into the prompt, then asks the model to reason over them. RLM goes further: the model programmatically decomposes the document, can launch sub-agents in parallel, can recursively re-query itself, and can do all of this with full control over the navigation. RAG is a pre-processing step; RLM is an inference paradigm.
What about latency? The paper does not make claims about latency, and this is a real concern. A 20-million-token navigation will be slower than a single 200K-token call, even with parallel sub-agents. The trade-off is quality vs. latency, and the paper optimises for quality. For latency-sensitive applications (real-time chat, voice, code completion) the long-context baseline may still be the right choice.
Is the open-source implementation production-ready? No. The GitHub release is research code, not production code. The pattern is well-described and reproducible, but the engineering needed to deploy it reliably at scale (error handling, sub-agent reliability, document indexing, security boundaries) goes beyond what the lab release provides. Expect production-grade versions to ship from the major labs (or open-source competitors) over the next 6-12 months.
What does this mean for a Kiwi government agency considering sovereign AI? The RLM pattern makes sovereign AI practical. An NZ agency can run a frontier open-weights model (Qwen 3, Llama 4, Mistral) on local infrastructure, point it at a local data store of sensitive records, and use the RLM navigation pattern to answer questions without ever sending data offshore. The model does not need to be the largest or the newest — it needs to be good at navigation. This is a smaller, cheaper, and more defensible sovereign-AI architecture than trying to build a New Zealand GPT.
🔍 THE BOTTOM LINE
For five years the context window was the AI industry’s favourite form of progress. Bigger meant smarter. The RLM paper says otherwise: a model that navigates beats the same model trying to memorise, and a small post-trained navigator can approach frontier quality. The numbers are credible, the architecture is sound, and the implications for sovereign AI, frontier-lab pricing, and enterprise deployment are all significant. The context-window wars may not have been won by the lab with the biggest window. They may have been won by the team that stopped trying to remember everything.
📰 Sources
- Zhang, Kraska, Khattab — Recursive Language Models (arXiv 2512.24601)
- RLM open-source implementation on GitHub
- MIT CSAIL — Tim Kraska research group
- Google DeepMind — From AGI to ASI paper
- Anthropic — Claude Code agent architecture
- OpenAI — Long context benchmarks methodology
- Alibaba — Qwen 3 release notes