Researchers built a playground where Claude Code could invent its own reasoning strategies. The algorithm it discovered cuts token usage by ~70% with no accuracy loss — and its logic is something humans probably wouldn’t have come up with.
Test-time scaling — the idea that models perform better when you let them spend more compute on a response — has been one of the most productive ideas in AI over the past two years. Until now, the rules governing when a model starts a new reasoning path, doubles down on a promising one, or prunes a dead end have been written by humans.
A research team from UMD, UVA, WUSTL, UNC, Google, and Meta has flipped that script. Their system, AutoTTS, doesn’t ask humans to design the algorithm. It asks Claude Code to discover one.
🔍 THE BOTTOM LINE
An AI agent wrote a reasoning strategy that’s more efficient than human-designed ones, for $40, in under three hours. The discovered logic, which tracks confidence shifts and adapts compute allocation in real time, is something researchers say would’ve been nearly impossible to design by hand.
How AutoTTS Works
The key insight is deceptively simple: many known test-time scaling methods (self-consistency, best-of-N, tree-of-thought) are really just different paths through the same control space. That space is defined by two dimensions — width (how many solution paths run in parallel) and depth (how far each one goes).
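To make the width/depth framing concrete, here is a minimal Python sketch. It is an illustration, not the paper's formalism: the `widen` and `deepen` action names, the `trace` helper, and the simplification that deepening extends every open path by one step are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical action names: "widen" opens a new solution path,
# "deepen" extends every open path by one reasoning step (a simplification).
def trace(actions: List[str]) -> List[Tuple[int, int]]:
    """Return the (width, depth) point reached after each action."""
    depths: List[int] = []  # current depth of each open path
    points = []
    for action in actions:
        if action == "widen":
            depths.append(0)
        elif action == "deepen":
            depths = [d + 1 for d in depths]
        else:
            raise ValueError(f"unknown action: {action}")
        points.append((len(depths), max(depths, default=0)))
    return points

def self_consistency(n: int, max_depth: int) -> List[str]:
    """Open n paths up front, then run all of them to full depth."""
    return ["widen"] * n + ["deepen"] * max_depth

def single_chain(max_depth: int) -> List[str]:
    """One path that simply thinks longer."""
    return ["widen"] + ["deepen"] * max_depth

print(trace(self_consistency(4, 3))[-1])  # (4, 3)
print(trace(single_chain(12))[-1])        # (1, 12)
```

Both end points cost 12 path-steps in this toy accounting. They differ only in how the same budget is split between width and depth, which is the choice a control algorithm has to make.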
Instead of hand-crafting rules for navigating this space, the AutoTTS team built an offline environment where Claude Code can search for better strategies. Here’s the clever part: they pre-generate several solution paths from the language model and store them. Each candidate control algorithm then decides how to spend compute against that stored data, so thousands of algorithm variants can be evaluated without calling the language model again.
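The offline setup can be pictured as a replay environment. The sketch below is a guess at the general shape, not the authors' code: `Step`, `ReplayEnv`, `open_path`, and `extend` are hypothetical names, and the assumption that each stored path is split into chunks with a token count and an optional interim answer is made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Step:
    tokens: int            # tokens the model spent on this chunk
    answer: Optional[str]  # interim answer visible after this chunk, if any

@dataclass
class ReplayEnv:
    """Serves pre-generated solution paths; never calls a language model."""
    stored_paths: List[List[Step]]
    tokens_used: int = 0
    cursor: Dict[int, int] = field(default_factory=dict)  # path id -> next chunk

    def open_path(self) -> Optional[int]:
        """Start reading the next unused stored path; None if none remain."""
        pid = len(self.cursor)
        if pid >= len(self.stored_paths):
            return None
        self.cursor[pid] = 0
        return pid

    def extend(self, pid: int) -> Optional[str]:
        """Reveal one more chunk of a path and charge its tokens."""
        path = self.stored_paths[pid]
        i = self.cursor[pid]
        if i >= len(path):  # path exhausted: nothing new, nothing charged
            return path[-1].answer if path else None
        self.cursor[pid] = i + 1
        self.tokens_used += path[i].tokens
        return path[i].answer

# Toy data standing in for stored model outputs.
env = ReplayEnv([[Step(10, None), Step(20, "42")], [Step(15, "41")]])
first = env.open_path()
env.extend(first)
print(env.extend(first), env.tokens_used)  # 42 30
```

Because `extend` only reads stored chunks, a controller can be scored on accuracy and `tokens_used` as often as needed at no additional inference cost.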
What is test-time scaling? Test-time scaling (TTS) is the practice of giving a language model more compute at inference time to improve its answers — for example, running multiple solution paths in parallel and picking the best one, or letting the model think longer on harder problems. It’s the idea behind systems like OpenAI’s o1 and o3, which “think” before responding rather than generating answers immediately.
Claude Code does the searching. Over several rounds, it reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code. Each proposal exposes only one high-level controller to the outside — that controller sets all the internal thresholds on its own. Full logs from each run show the agent where earlier attempts wasted compute.
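The round-by-round search can be sketched as a simple propose-and-evaluate loop. This is a minimal sketch under stated assumptions: `propose` stands in for the coding agent, `evaluate` stands in for a run in the offline environment, and the selection rule (fewest tokens among candidates that meet an accuracy floor) is an assumption, since the article does not say how candidates are ranked.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

@dataclass
class Result:
    accuracy: float
    tokens: int
    log: str  # run trace the proposer can inspect in later rounds

History = List[Tuple[Any, Result]]

def search(propose: Callable[[History], Any],
           evaluate: Callable[[Any], Result],
           rounds: int,
           min_accuracy: float) -> Tuple[Optional[Tuple[Any, Result]], History]:
    """Run the loop and return the best candidate plus the full history."""
    history: History = []
    best: Optional[Tuple[Any, Result]] = None
    for _ in range(rounds):
        candidate = propose(history)   # proposer sees every earlier attempt and log
        result = evaluate(candidate)   # offline replay, no model calls
        history.append((candidate, result))
        accurate_enough = result.accuracy >= min_accuracy
        if accurate_enough and (best is None or result.tokens < best[1].tokens):
            best = (candidate, result)
    return best, history

# Toy stand-ins: a "candidate" is just an integer aggressiveness level.
def toy_propose(history: History) -> int:
    return len(history) + 1

def toy_evaluate(level: int) -> Result:
    accuracy = 0.9 if level <= 3 else 0.5  # too aggressive hurts accuracy
    return Result(accuracy, 1000 - 100 * level, f"level {level}")

best, history = search(toy_propose, toy_evaluate, rounds=5, min_accuracy=0.8)
print(best[0], best[1].tokens)  # 3 700
```

In the toy run, levels 4 and 5 use fewer tokens but fall below the accuracy floor, so the loop keeps level 3.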
What Claude Discovered
The algorithm Claude came up with works differently from human-designed approaches. Most existing methods stop as soon as one answer gains a majority. Claude’s algorithm does something more nuanced:
- If confidence barely budges across rounds, it opens more solution paths (the current ones aren’t converging).
- If confidence climbs quickly, it skips new paths (the model is already on track — don’t waste compute).
- Solution paths whose interim result aligns with the current majority get extra compute (double down on what’s working).
- Paths that diverge are only dropped if they keep heading the wrong way over multiple rounds (don’t prune too aggressively).
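The four rules above can be sketched as a single decision function. This is not the discovered algorithm itself: the confidence measure (the majority answer's vote share), every threshold (`flat_eps`, `fast_gain`, `patience`), the step counts, and the handling of the case between "barely budges" and "climbs quickly" are assumptions chosen to make the example runnable.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class PathState:
    answer: Optional[str] = None  # latest interim answer, if any
    off_rounds: int = 0           # consecutive rounds off the majority
    active: bool = True

@dataclass
class Decision:
    open_new: int = 0                                    # new paths to start
    steps: Dict[int, int] = field(default_factory=dict)  # path id -> steps granted
    prune: List[int] = field(default_factory=list)       # path ids to drop

def majority(paths: Dict[int, PathState]) -> Tuple[Optional[str], float]:
    """Majority answer and its vote share among active paths with an answer."""
    votes = Counter(p.answer for p in paths.values()
                    if p.active and p.answer is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values())

def decide(paths: Dict[int, PathState], prev_conf: float,
           flat_eps: float = 0.05, fast_gain: float = 0.2,
           patience: int = 2) -> Tuple[Decision, float]:
    answer, conf = majority(paths)
    gain = conf - prev_conf
    decision = Decision()
    if abs(gain) < flat_eps:
        decision.open_new = 2  # rule 1: confidence stalled, open more paths
    elif gain >= fast_gain:
        decision.open_new = 0  # rule 2: confidence climbing fast, skip new paths
    else:
        decision.open_new = 1  # in-between case: assumed, not from the article
    for pid, p in paths.items():
        if not p.active:
            continue
        if p.answer is None:
            decision.steps[pid] = 1  # no interim answer yet: keep going
        elif p.answer == answer:
            p.off_rounds = 0
            decision.steps[pid] = 2  # rule 3: aligned paths get extra compute
        else:
            p.off_rounds += 1
            if p.off_rounds >= patience:  # rule 4: drop only persistent outliers
                p.active = False
                decision.prune.append(pid)
            else:
                decision.steps[pid] = 1
    return decision, conf

paths = {0: PathState("42"), 1: PathState("42"), 2: PathState("7", off_rounds=1)}
decision, conf = decide(paths, prev_conf=0.65)
print(decision.open_new, decision.steps, decision.prune)  # 2 {0: 2, 1: 2} [2]
```

In the example, confidence moves from 0.65 to about 0.67, so the controller widens. The path answering "7" has now disagreed for two rounds in a row and is pruned, while the two aligned paths each receive extra steps.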
The authors call this coordination “something that would’ve been nearly impossible to design by hand.” It’s not just that the algorithm works: its logic differs markedly from how people usually approach the problem.
The Numbers
On math benchmarks (AIME and HMMT):
| Metric | AutoTTS (discovered) | Self-Consistency (human-designed) |
|---|---|---|
| Token usage | ~70% less | Baseline |
| Accuracy | Comparable | Baseline |
| Discovery cost | $40 | N/A |
| Discovery time | 160 minutes | N/A |
The algorithm also transfers to a different model (DeepSeek-R1-Distill-Llama-8B) and a non-math benchmark (GPQA-Diamond), which suggests it isn’t overfitting to one model or domain.
An ablation study shows how much depends on two design choices. Without the single high-level controller, the agent falls back on extreme shortcuts that save compute during testing but tank accuracy on new tasks. Without the detailed logs, the discovered algorithm uses more compute and is less accurate.
Why This Matters
This sits in the same family as FunSearch, AlphaEvolve, and ADAS — systems that use language models as program searchers. What’s new is applying that idea to test-time scaling, which has been almost entirely human-designed until now.
The bigger takeaway: this shifts where humans come in. Instead of inventing the rules themselves, researchers set up the search environment those rules live in. The actual strategy emerges as code that a language model writes and refines. The human role moves from designer to playground architect.
It’s also a concrete example of AI improving AI — not in the hype-driven “recursive self-improvement” sense, but in the practical “an AI found a more efficient way to run AI reasoning” sense. That’s the kind of compounding improvement that doesn’t make headlines but accumulates over time.
The Limitations
AutoTTS currently only covers the trade-off between width and depth. It can’t handle more complex structures like tree searches. The quality of the discovery depends heavily on the coding agent — the paper uses Claude Code, and the authors don’t test whether open-source alternatives would work as well.
There’s also a deeper question: if we’re already handing over algorithm design to AI, how long before we’re handing over the playground design too? The ablation study shows that removing certain constraints leads the agent to cheat — optimizing for test performance at the expense of generalization. The guardrails matter, and we’re trusting the AI to operate within them.
❓ Frequently Asked Questions
Q: Does this mean AI is now designing better AI? In a narrow, specific sense — yes. AutoTTS found a more efficient reasoning strategy than humans had designed. But it’s still operating within a human-defined search space. The AI improved the strategy, not the model itself.
Q: What’s the $40 cost figure? That’s what the full discovery run cost in API usage: 160 minutes of Claude Code searching through algorithm variants in a pre-computed environment. Generating the stored solution paths that the environment replays is a separate cost, not included in the $40.
Q: Could this apply to other domains beyond math? The paper shows transfer to GPQA-Diamond (a science reasoning benchmark) and to a different model architecture. The width-vs-depth framing is general enough that it could apply to any domain where test-time scaling is used, though the current version hasn’t been tested on, say, code generation or creative writing.
🔍 THE BOTTOM LINE
For $40 and under three hours, an AI agent discovered a reasoning strategy that cuts token usage by ~70% with no accuracy loss — and whose logic is something researchers say humans wouldn’t have designed. It’s not recursive self-improvement. But it is AI building better infrastructure for AI, and that’s a curve that’s only going to steepen.