GLM 5.2 on AMD MI355X Hits 2,626 tok/s — at Half the Cost of Blackwell

Demand for AI inference is outpacing supply, and there aren’t enough NVIDIA Blackwells to go around. Wafer.ai just showed you might not need them: GLM 5.2 on AMD’s MI355X hits 2,626 tok/s per node, about 82% of a B200’s throughput at roughly half the cost per GPU.

🔍 THE BOTTOM LINE

AMD’s Instinct MI355X competes with NVIDIA’s Blackwell at the silicon level. The software gap — the CUDA moat everyone said would take years to close — is narrowing fast enough that the question is no longer “can AMD run frontier models?” but “how many weeks of engineering does it take?” For GLM 5.2, the answer was roughly one sprint.

The Numbers That Matter

Wafer ran a 20k input / 1k output workload at a 60% cache hit rate on a single MI355X node. At saturation (2.4 requests per second), the node delivered 2,626 aggregate tokens per second with a p50 time-to-first-token of 0.81 seconds and a 100% success rate. NVIDIA’s B200 on the same workload hit 3,192 tok/s/node at 3.0 RPS, which means AMD captured 82% of the throughput at roughly half the cost per GPU.
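The headline ratio is easy to check by hand. The sketch below reproduces it from the two throughput figures quoted above. The 0.5 price ratio is an assumption standing in for “roughly half the cost”, not a quoted price, so the per-dollar result is only as good as that input.

```python
# Throughput figures quoted in the article (aggregate tokens/s per node).
mi355x_tok_s = 2626   # MI355X node at 2.4 RPS
b200_tok_s = 3192     # B200 node at 3.0 RPS

# Assumption: "roughly half the cost" modelled as a flat 0.5 price ratio.
cost_ratio = 0.5      # MI355X price / B200 price (illustrative, not a quote)

throughput_ratio = mi355x_tok_s / b200_tok_s
perf_per_dollar = throughput_ratio / cost_ratio

print(f"throughput vs B200: {throughput_ratio:.1%}")            # 82.3%
print(f"throughput per dollar vs B200: {perf_per_dollar:.2f}x")  # 1.65x
```

Changing cost_ratio shows how sensitive the per-dollar advantage is to the actual price you are quoted.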

Single-stream decode was 213 tok/s on 10k input / 1.5k output. That does not top the Artificial Analysis leaderboard, but it wins on performance per dollar. Against the B300, the MI355X is on average about 2.75x cheaper per GPU. That math is brutal for NVIDIA if the software gap keeps closing.

What It Took to Get There

The path wasn’t clean; it rarely is on AMD’s ROCm stack. Wafer quantized GLM 5.2 from bf16 to MXFP4 using AMD’s Quark tool and reported near-lossless results against Zhipu AI’s official FP8 quantization across the GPQA-Diamond, tau2, and GSM8K benchmarks. The data shows a 1.9% drop on GPQA-Diamond and a 1.0% drop on GSM8K, with tau2 macro actually improving by 1.5%. Wafer calls this “lossless”; HN commenters pushed back, arguing that a 2-4% drop on reasoning benchmarks is the difference between frontier and not-frontier. One commenter noted that AMD’s MI355X can run FP6 operations at the same speed as FP4, a path that would be genuinely lossless at a modest performance cost.

Wafer chose sglang as the inference framework because vLLM had no working MXFP4 path for GLM’s MoE architecture and ATOM degraded at long context.
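For readers who haven’t met MXFP4: in the OCP Microscaling format, each block of 32 values shares one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The sketch below shows that idea in plain Python. It is an illustration of the format, not Quark’s implementation; the function names are made up, ties are broken toward the smaller magnitude rather than round-to-nearest-even, and the scale exponent is not clamped to the 8-bit range a real encoder would enforce.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32   # values that share one scale in the MX formats
EMAX = 2     # exponent of the largest E2M1 binade (4.0 == 2**2)


def quantize_block(values):
    """Return (shared_exponent, fp4_values) for one block of up to 32 floats."""
    assert len(values) <= BLOCK
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Pick a power-of-two scale so the largest value lands in the top binade.
    shared_exp = math.floor(math.log2(amax)) - EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1[-1])        # saturate at 6.0
        nearest = min(FP4_E2M1, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Rebuild approximate floats from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


weights = [0.1 * i for i in range(-16, 16)]   # one block of 32 toy weights
exp, codes = quantize_block(weights)
restored = dequantize_block(exp, codes)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"shared exponent {exp}, worst absolute error {worst:.3f}")
```

Only eight magnitudes per sign are available inside each block, which is why benchmark deltas like the ones above are worth measuring rather than assuming away.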

Speculative decode required two fixes that sound trivial but weren’t documented anywhere: the multi-token prediction head’s shared expert needed to be registered under sglang’s decoder module prefix (Quark named it under model.layers.78.mlp.shared_experts, but sglang’s MTP layer lives under model.decoder.*), and a #include <cuda_runtime.h> in the fused multi-step metadata kernel needed an #ifdef USE_ROCM guard for draft depth ≥4.
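The first fix is at heart a weight-name remapping. A minimal sketch of that kind of remap follows, assuming a checkpoint is a plain dict from parameter name to tensor. The helper name, the toy keys ending in up_proj.weight, and the exact destination name are all hypothetical: the article only says the MTP layer lives under model.decoder.*, and this is not sglang’s actual loader code.

```python
def remap_prefix(state_dict, src_prefix, dst_prefix):
    """Return a copy of state_dict with src_prefix swapped for dst_prefix."""
    remapped = {}
    for name, tensor in state_dict.items():
        if name.startswith(src_prefix):
            name = dst_prefix + name[len(src_prefix):]
        remapped[name] = tensor
    return remapped


# Toy checkpoint: strings stand in for tensors, parameter suffixes are made up.
ckpt = {
    "model.layers.78.mlp.shared_experts.up_proj.weight": "mtp-shared-expert",
    "model.layers.0.mlp.up_proj.weight": "ordinary-layer",
}

# Hypothetical destination: only the "model.decoder." prefix comes from the article.
fixed = remap_prefix(ckpt, "model.layers.78.", "model.decoder.")
for name in fixed:
    print(name)
```

Keys that don’t match the source prefix pass through untouched, so the remap is safe to run over a whole checkpoint.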

The bigger throughput win came from MoE kernel tuning. GLM 5.2’s fp4 mixture-of-experts was silently falling back to a slow FlyDSL heuristic on the sglang ROCm image, because AMD’s aiter library only shipped tuned configs for the a8w8/fp8 path, not fp4. Wafer tuned the kernel selection themselves for GLM’s specific shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), which lifted aggregate throughput from 1,944 tok/s to 2,626 tok/s, a gain of about 35%. Switching from TP8 to TP4×DP2 also helped: data parallelism beat tensor parallelism for this prefill-bound workload.
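The failure mode here, a lookup that quietly falls back when no tuned entry exists, is easy to picture in code. The sketch below is a toy model of it and not aiter’s API: MoEShape, TUNED, select_kernel, and the kernel names are all invented for illustration, and whether a tuned fp8 entry existed for this exact shape is an assumption. The one deliberate difference is that this version warns when it falls back, which is what makes the problem findable.

```python
import warnings
from typing import NamedTuple


class MoEShape(NamedTuple):
    model_dim: int
    moe_inter: int
    num_experts: int
    topk: int
    dtype: str


# Hypothetical tuned-config table: only an fp8 entry exists, mirroring the
# article's description of tuned configs shipping for fp8 but not fp4.
TUNED = {
    MoEShape(6144, 2048, 256, 8, "fp8"): "tuned_fp8_kernel",
}


def select_kernel(shape, tuned=TUNED):
    """Pick a tuned kernel if one is registered, else fall back loudly."""
    kernel = tuned.get(shape)
    if kernel is None:
        warnings.warn(f"no tuned MoE config for {shape}; using heuristic fallback")
        return "heuristic_fallback"
    return kernel


glm_fp4 = MoEShape(6144, 2048, 256, 8, "fp4")
print(select_kernel(glm_fp4))   # warns, then falls back

# Registering a tuned entry for the shape removes the fallback.
TUNED[glm_fp4] = "tuned_fp4_kernel"
print(select_kernel(glm_fp4))
```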

The CUDA Moat Is a Support Problem Now

Wafer’s own framing is the sharpest line in the piece: “SOTA on AMD is becoming more a matter of support, not software. The CUDA moat is eroding in real time.” They didn’t write any custom kernels for this work — unlike their earlier Qwen 3.5 397B deployment, where they had to. The bugs they hit were configuration mismatches and missing ROCm guards, not fundamental capability gaps.

This matters because the inference market is the bottleneck. Frontier models are shipping every few weeks — Claude Fable, GLM 5.2, Minimax M3 — and NVIDIA GPU prices are climbing accordingly. If AMD can deliver 80% of the performance at half the cost with weeks of engineering rather than months, the economics flip. The parallel push into open-weight models means the model side is already commoditizing. The hardware side is next.

What This Means for NZ

New Zealand’s AI infrastructure conversation — see our earlier piece on the Super Fund and sovereign AI infrastructure — has been framed around buying NVIDIA GPUs or renting cloud tokens. AMD’s MI355X at half the cost per GPU changes the build-vs-buy math. A sovereign inference cluster built on AMD silicon is now plausible at a price point that doesn’t require a national security exception. The catch is the engineering talent to do what Wafer did — tune kernels, fix framework bugs, configure ROCm. That’s the real scarce resource, not the chips.

The Other Side

AMD’s software ecosystem is still meaningfully behind NVIDIA’s. Wafer needed weeks of engineering to get GLM 5.2 to SOTA on MI355X. NVIDIA delivers day-0 support for every frontier model. For a company that needs inference running Monday morning, the CUDA moat is still real — it’s just no longer permanent. The question for procurement teams is whether the 2x cost saving is worth the 2-4 week integration tax. For high-volume inference workloads, the answer is increasingly yes.

The cost comparison also needs an asterisk. The MI355X pulls 1,400W per GPU versus the B200’s 1,200W, roughly 17% more power. HN commenters estimated AMD is 20-60% worse on tokens-per-second-per-watt, though the MI355X’s 50% more memory (288GB vs 192GB) complicates the comparison. “Half the cost” is a capex figure. For data centres with constrained power hookups, increasingly the binding constraint, the higher opex eats into that saving.
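A rough efficiency check is possible from the numbers already in this piece. The sketch below divides each node’s throughput by its GPU power draw. It assumes 8 GPUs per node on both sides (the MI355X run used TP8 and TP4×DP2; the B200 node size is an assumption) and counts only the per-GPU wattage quoted above, ignoring host, networking, and cooling power.

```python
GPUS_PER_NODE = 8  # assumption for both nodes; it cancels out of the ratio

# Throughput and per-GPU power figures as quoted in the article.
mi355x = {"tok_s": 2626, "watts_per_gpu": 1400}
b200 = {"tok_s": 3192, "watts_per_gpu": 1200}


def tok_per_s_per_watt(node):
    """Aggregate tokens/s divided by total GPU power for the node."""
    return node["tok_s"] / (GPUS_PER_NODE * node["watts_per_gpu"])


amd = tok_per_s_per_watt(mi355x)
nv = tok_per_s_per_watt(b200)
deficit = 1 - amd / nv

print(f"MI355X: {amd:.3f} tok/s/W")      # 0.234
print(f"B200:   {nv:.3f} tok/s/W")       # 0.333
print(f"MI355X deficit: {deficit:.0%}")  # 29%
```

On these inputs the MI355X comes out about 29% behind on tokens per second per watt, inside the 20-60% range the commenters estimated.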

❓ FAQ

Is AMD MI355X actually competitive with NVIDIA B200 for production inference? At 2,626 tok/s/node vs 3,192 tok/s/node, you get 82% of the performance at roughly half the cost. For cost-sensitive workloads where you control the engineering timeline, yes. For workloads where you need day-0 support for every new model, not yet.

What is MXFP4 and why does it matter? MXFP4 is a 4-bit floating-point format that compresses model weights from bf16 (16-bit) with minimal quality loss. Wafer’s MXFP4 quantization of GLM 5.2 was near-lossless against Zhipu AI’s official FP8 baseline: 1.9% lower on GPQA-Diamond, 1.0% lower on GSM8K, and 1.5% higher on tau2. Lower precision means less memory bandwidth consumed, which means faster inference.

Why doesn’t AMD just ship better ROCm support? They’re trying. The gap Wafer hit wasn’t a fundamental capability issue — it was missing #ifdef USE_ROCM guards and untuned kernel configs. AMD’s aiter library is actively adding tuned configs for newer quantization formats. The question is whether they can keep up with the pace of frontier model releases.

Does this affect NVIDIA’s pricing power? Not yet — demand still outstrips supply for Blackwells. But if 2-3 more groups replicate Wafer’s results on different models, AMD becomes a credible threat to NVIDIA’s inference pricing. The training market is different and still NVIDIA-dominated.

🔍 THE BOTTOM LINE

The inference cost curve is bending. AMD’s MI355X at half the price of a Blackwell, delivering 80% of the throughput on a frontier model, with no custom kernels required — just configuration fixes and kernel tuning — means the CUDA moat is now measured in weeks, not years. For the AI inference market, that’s the most important number in this story: not 2,626 tok/s, but the shrinking gap between “NVIDIA works out of the box” and “AMD works after a sprint.”