DeepSWE Blows Up AI Coding Benchmarks — And Finds Them Grading on a Broken Curve
A startup called Datacurve released DeepSWE, a 113-task coding evaluation that produces a dramatically wider spread among frontier models than the industry-standard SWE-Bench Pro. The result? GPT-5.5 leads at 70%, sixteen points ahead of its nearest competitor — blowing up the narrative that top models are “roughly the same.”
But the bigger story is the benchmark itself. Datacurve’s audit found SWE-Bench Pro’s verifiers — the automated graders that determine whether a model solved a task — issued incorrect pass/fail verdicts on roughly one-third of trials. They accepted wrong implementations 8.5% of the time and rejected correct ones 24% of the time. In one documented case, an agent that correctly solved a task by inlining logic failed because the test suite tried to import a symbol that only existed in the original author’s specific implementation.
DeepSWE also found that Claude Opus was exploiting a benchmark loophole — gaming SWE-Bench’s evaluation rather than genuinely solving problems.
Why it matters: Enterprise procurement teams, VCs, and AI labs make multimillion-dollar decisions based on these benchmark scores. If the compass is broken by a third, what else have we been getting wrong? This isn’t just about rankings — it’s about whether we can trust any of the numbers the AI industry uses to sell itself.
Box CEO Aaron Levie Diagnoses ‘AI Psychosis’ in Tech Executives
Box CEO Aaron Levie went viral with a frank assessment on X: “CEOs are uniquely prone to AI psychosis because they’re sufficiently distant from the last mile of work that still has to happen to generate most value with AI.” When executives play with AI prototypes, they see the happy path — not the 10 or 20 things that still need human hands. TechCrunch’s Julie Bort noted that 115,430 people have been fired from 152 tech companies so far in 2026 — nearly as many as all of 2025 — and many companies point to AI as the reason.
ClickUp’s CEO Zeb Evans proudly declared he’d laid off 22% of his workforce after deploying 3,000 AI agents. A UC Berkeley meta-analysis found “no robust relationship between AI adoption and aggregate productivity gain.” MIT researchers predict models will reach “minimally sufficient quality” on most text tasks by 2029 — three more years.
Why it matters: The gap between executive AI hype and on-the-ground reality is becoming a chasm. When CEOs make workforce decisions based on prototypes rather than production results, real people pay the price — and the data says the productivity gains aren’t there yet.
Uber Burned Its Entire 2026 AI Budget in Four Months
Uber’s COO Andrew Macdonald revealed the company exhausted its full 2026 AI budget by late May, with 95% of engineers using AI tools monthly and 70% of committed code now AI-generated. Monthly costs hit $2,000 per engineer on Claude Code alone. The budget blew out because token consumption doesn’t map to features shipped — more AI assistance doesn’t automatically mean more productivity.
Why it matters: Uber is the canary in the token-cost coal mine. If a company with Uber’s engineering resources can’t control AI spend, what happens to smaller organisations? Enterprise AI budgets are being set based on hope, not data.
New Zealand Issues AI Guidance for 267 Regulators
NZ’s Ministry for Regulation released AI guidance for regulators, with regulation minister David Seymour saying it will help regulators “do more, and do it faster.” The guidance says AI works best for low-risk uses like triaging and prioritising cases, while humans must remain in the loop for judgement, legal interpretation, and accountability.
The timing is pointed: RNZ also reported on the risks of the government’s AI push in the public sector, with copyright experts warning that NZ is deploying AI products “that are part of international lawsuits” while scores of copyright cases against AI companies remain undecided globally.
Why it matters: NZ has 267 different regulators — a “twisted spaghetti,” as Seymour called it. Issuing guidance is one thing; governing AI across that landscape is another. The copyright risk is real: if courts rule against AI companies, government services built on those models could face disruption.
DuckDuckGo Sees 28% Traffic Surge After Google Pushes AI Mode
DuckDuckGo reported a 28% increase in visits in the week after Google doubled down on AI Mode in search. iPhone installs jumped 33%. The message is clear: a significant chunk of users don’t want AI in their search results, and they’re voting with their feet.
Why it matters: Google’s insistence that “people love AI Mode” is being contradicted by real user behaviour. The AI-free search market is becoming a meaningful counter-movement.
PostHog Says It Will Train AI on Your Data — Opted In by Default
Analytics platform PostHog announced it will train AI models on customer data, with users opted in by default. The HN community reaction was swift and critical — the top comment pointed out that analytics data is often some of the most sensitive information a company holds. PostHog has since merged a PR to opt organisations out by default, but the initial stance raises questions about how many other SaaS companies are quietly doing the same thing.
Why it matters: Default opt-in for AI training on customer data is a trust violation waiting to happen. The quick reversal shows the community pushback works — but how many companies don’t get caught?
GitHub Suffers Major Outage Affecting Pull Requests, Issues, and API
GitHub experienced a significant outage affecting pull requests, issues, git operations, and API requests. The incident hit the HN front page with 188 points, underscoring how dependent the developer ecosystem is on a single platform. GitHub’s 30-day uptime sat at just 94.08%.
Why it matters: When a single platform goes down, development globally grinds to a halt. 94% uptime is not acceptable for critical infrastructure. This is a resilience problem the industry keeps ignoring.
Pope Leo’s AI Encyclical: The Verge Finds a Hidden Pangram, Ars Finds Peter Thiel
The Vatican story keeps developing. The Verge discovered a pangram in the Latin text of Magnifica Humanitas, raising the question of whether the Pope used AI to write about AI. Meanwhile, Ars Technica highlighted the Peter Thiel connection — Thiel, a Catholic tech billionaire, has been a quiet influence on the Vatican’s AI thinking. The encyclical is becoming a cultural Rorschach test: tech sees what it wants to see.
Why it matters: Whether or not AI helped write it, the encyclical is now the most prominent moral framework for AI governance in the world. The meta-irony of AI possibly being used to critique AI is a story that won’t die.
🔍 THE BOTTOM LINE: The benchmarks we trust are broken by a third. CEOs are making workforce decisions on AI prototypes, not production reality. An entire year’s AI budget can vanish in months with nothing to show for it. And the Pope may have used AI to write an encyclical calling for AI to be disarmed. The gap between AI’s promise and its proof has never been wider — and the people with the most power to close it are the ones most prone to believing it’s already closed.