For the last year, the frontier-AI business model has leaned on one assumption: if you want the best long-running coding agent, you pay for a closed model. GLM-5.2 weakens that assumption. It does not clearly beat Claude Opus 4.8, and it does not universally replace GPT-5.5 — but it is close enough on important software-engineering benchmarks, and cheap enough, to change the decision.
The cleanest comparison: FrontierSWE
The fairest single benchmark here is FrontierSWE, because all three models sit on the same leaderboard under the same harness. Claude Opus 4.8 ranks first at 75% dominance, GLM-5.2 follows at 74%, and GPT-5.5 at 73%.
The honest reading is not “GLM wins.” It is that GLM-5.2 is now within touching distance of the best closed coding-agent models on a difficult long-horizon software-engineering benchmark. For an open-weights model, that is the real news. Z.ai itself frames GLM as trailing Opus 4.8 by roughly one point on FrontierSWE — a claim that happens to line up with the independent leaderboard.
Opus 4.8 still looks like the quality leader
Claude Opus 4.8 should not be dismissed. On the current public data it looks like the strongest of the three for serious agentic coding and long-running software tasks. It leads on FrontierSWE, and Artificial Analysis places it top of its broader Intelligence Index as well.
The less obvious part is honesty. Anthropic says Opus 4.8 is more likely to flag uncertainty, push back, and catch flaws in its own work — and claims it is roughly four times less likely than Opus 4.7 to let flaws in its own code pass unremarked. Those are vendor claims, so treat them as positioning rather than proof. But for coding agents the underlying point matters: a model that writes impressive code while quietly leaving broken assumptions behind can be expensive to use, even when the token bill looks reasonable.
GPT-5.5: strong, but a mixed benchmark story
GPT-5.5 is not weak. It scores very strongly across OpenAI's own suite — Terminal-Bench, GDPval, OSWorld-Verified, BrowseComp and tool-use evaluations. The complication is that the GPT-5.5-versus-Opus-4.8 comparison is not always clean: OpenAI's launch materials benchmark heavily against Claude Opus 4.7, because 4.8 had not yet shipped. Later third-party data (FrontierSWE, Artificial Analysis) then makes Opus 4.8 look stronger in some agentic areas.
That does not make GPT-5.5 bad — it means it should be described precisely. GPT-5.5 looks especially strong for broad tool use, professional knowledge work, terminal workflows and OpenAI-ecosystem integration, and it may be more token-efficient in some workflows. But on current public long-horizon coding-agent comparisons, both Opus 4.8 and GLM-5.2 put real pressure on it.
GLM-5.2's real advantage: cost plus openness
GLM-5.2's biggest strength is not that it beats every frontier model; it does not. Its strength is that it gets close while being open and far cheaper to run. According to Artificial Analysis, it is an MIT-licensed mixture-of-experts model with 744B total and 40B active parameters and a one-million-token context window, priced on Z.ai's first-party API at about $1.40 input and $4.40 output per million tokens. Opus 4.8 is priced at $5 input / $25 output per million tokens, and GPT-5.5 at $5 / $30.
That gap matters because coding agents are token factories. They plan, inspect files, write code, run tests, read errors, revise and repeat, so output-token price drives real cost. At list prices, GLM-5.2's output tokens cost roughly 5.7 times less than Opus 4.8's and 6.8 times less than GPT-5.5's ($4.40 versus $25 and $30 per million). If a model is slightly weaker but that much cheaper on output, the economics can flip fast. GLM-5.2 does not need to be the best model in the world; it only needs to be good enough on enough coding-agent tasks that teams start routing a large share of work to it. Our AI development & programming directory lists the agentic tools where that routing decision actually plays out.
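The price gap is easy to check with a few lines of arithmetic. The sketch below uses the list prices quoted in this article. The session size (2M input tokens, 0.5M output tokens) is an invented workload for illustration, not a measured figure, and the helper names `session_cost` and `output_price_ratio` are hypothetical.

```python
# Per-million-token list prices in USD as quoted in this article: (input, output).
PRICES = {
    "GLM-5.2": (1.40, 4.40),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def session_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a session whose size is given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out


def output_price_ratio(model: str, baseline: str = "GLM-5.2") -> float:
    """How many times more `model` charges per output token than `baseline`."""
    return PRICES[model][1] / PRICES[baseline][1]


if __name__ == "__main__":
    # Hypothetical workload (an assumption): 2M input tokens, 0.5M output tokens.
    for name in PRICES:
        cost = session_cost(name, input_mtok=2.0, output_mtok=0.5)
        ratio = output_price_ratio(name)
        print(f"{name}: ${cost:.2f} per session, {ratio:.1f}x GLM-5.2 output price")
```

On that assumed mix, the session comes to about $5.00 on GLM-5.2 against $22.50 on Opus 4.8 and $25.00 on GPT-5.5. The real ratio depends on how output-heavy a given agent loop is, so it is worth rerunning the numbers with token counts from your own logs.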
The benchmark trap: don't compare every number directly
The most important transparency point: not every number should be treated as apples-to-apples. The table below keeps the version and harness details visible on purpose.
| Benchmark | Reported figures | Why caution is needed |
|---|---|---|
| SWE-Bench Pro | GLM-5.2 62.1 (Z.ai) · GPT-5.5 58.6 (OpenAI) | Vendor-reported on each side; OpenAI itself notes memorisation concerns around this benchmark. |
| Terminal-Bench | GLM-5.2 81.0 on v2.1 (Z.ai) · GPT-5.5 82.7 on v2.0 (OpenAI) | Different benchmark versions and harnesses — not directly comparable. |
| PostTrainBench | GLM-5.2 #1; Opus 4.8 Max at 34.1% after the 17 June 2026 update | A specialised AI-R&D automation benchmark (improving a small model on one H100 in 10 hours), not a general coding score. |
The safe conclusion is to look for repeated signals across benchmarks rather than crowning one leaderboard. A one-line “GPT beats GLM” or “GLM beats GPT” claim is easy to make and easy to get wrong.
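One way to enforce that discipline in your own tracking sheet is to store the version and reporter next to every score and refuse to rank numbers that do not share them. This is a minimal sketch, assuming a hypothetical `Score` record and `comparable` check. The figures are the ones quoted in this article, and the `"leaderboard"` version label for FrontierSWE is a placeholder, not an official version string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    version: str  # benchmark version or leaderboard snapshot
    source: str   # who ran and reported the number
    value: float


def comparable(a: Score, b: Score) -> bool:
    """Treat two scores as apples-to-apples only if benchmark, version and source match."""
    return (a.benchmark, a.version, a.source) == (b.benchmark, b.version, b.source)


# Vendor-reported Terminal-Bench figures on different versions.
glm_tb = Score("GLM-5.2", "Terminal-Bench", "2.1", "Z.ai", 81.0)
gpt_tb = Score("GPT-5.5", "Terminal-Bench", "2.0", "OpenAI", 82.7)

# FrontierSWE figures from one shared leaderboard ("leaderboard" is a placeholder label).
glm_fs = Score("GLM-5.2", "FrontierSWE", "leaderboard", "FrontierSWE", 74.0)
gpt_fs = Score("GPT-5.5", "FrontierSWE", "leaderboard", "FrontierSWE", 73.0)

if __name__ == "__main__":
    print(comparable(glm_tb, gpt_tb))  # False: versions and reporters differ
    print(comparable(glm_fs, gpt_fs))  # True: same leaderboard and harness
```

The check is deliberately strict. A looser rule, for example allowing different reporters on the same version, is a judgment call that should be made explicitly rather than by accident.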
What the evidence actually supports
| Claim | What the data supports | Confidence |
|---|---|---|
| GLM-5.2 is close to Opus 4.8 on long-horizon coding | FrontierSWE: Opus 75% vs GLM 74%; Z.ai also reports a ~1-point gap. | High |
| GLM-5.2 edges GPT-5.5 on some coding-agent benchmarks | FrontierSWE: GLM 74% vs GPT 73%; SWE-Bench Pro favours GLM (62.1 vs 58.6), but both figures are vendor-reported and memorisation concerns apply. | Medium-high |
| Opus 4.8 is the strongest of the three for serious agentic coding | Tops FrontierSWE and the Artificial Analysis Intelligence Index (61 vs 60 vs 51). | High |
| GLM-5.2 has the best cost/openness story | MIT-licensed, 1M context, ~$1.40/$4.40 per 1M tokens vs $5/$25 and $5/$30. | High |
| PostTrainBench is favourable to GLM-5.2 | GLM ranked #1 after the 17 June update; specialised AI-R&D benchmark, not general coding. | Medium |
| Opus 4.8 is marketed around honesty / less bluffing | Anthropic says it flags uncertainty more and lets fewer of its own code flaws pass. | Medium-high (vendor) |
Verdict
Claude Opus 4.8 looks like the strongest quality choice for difficult, long-running coding-agent work. GPT-5.5 remains a very strong closed model, especially for OpenAI-ecosystem workflows, broad professional tasks and tool-heavy work. GLM-5.2 is the disruptor: not clearly better than Opus 4.8, not universally better than GPT-5.5, but close enough on several important coding-agent benchmarks, open enough to deploy freely, and cheap enough to force a rethink.
The frontier-model market is no longer simply “pay more to get the only thing that works.” It is becoming a routing problem: use Opus 4.8 when quality matters most, use GPT-5.5 where OpenAI's tool ecosystem and general reliability win, and test GLM-5.2 aggressively wherever cost, openness and long-context coding matter. GLM-5.2 does not end the closed-model business model — it makes that model harder to defend. If you want to understand why long sessions still drift regardless of which model you pick, our piece on context rot in AI agents is a useful companion.
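The routing idea can be written down as a simple lookup. This is a minimal sketch, assuming a hypothetical `Priority` label attached to each task. The mapping mirrors the verdict above and is not a tested routing policy.

```python
from enum import Enum


class Priority(Enum):
    QUALITY = "quality"      # correctness and long-horizon reliability matter most
    ECOSYSTEM = "ecosystem"  # OpenAI tooling and broad tool use
    COST = "cost"            # cost, openness and long-context coding


# Mapping taken from this article's verdict; the Priority labels are illustrative.
ROUTES = {
    Priority.QUALITY: "Claude Opus 4.8",
    Priority.ECOSYSTEM: "GPT-5.5",
    Priority.COST: "GLM-5.2",
}


def route(priority: Priority) -> str:
    """Return the model this article suggests for a given task priority."""
    return ROUTES[priority]


if __name__ == "__main__":
    for priority in Priority:
        print(f"{priority.value}: {route(priority)}")
```

In practice the interesting work is deciding which priority a task falls under, and that decision is better driven by your own evaluation runs than by a public leaderboard.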
Common questions
Is GLM-5.2 better than Claude Opus 4.8?
Not clearly. On FrontierSWE — the one leaderboard ranking all three on the same harness — Opus 4.8 leads at 75% with GLM-5.2 at 74%. Independent composite scores also place Opus highest. GLM is best called the strongest open-weights challenger, not a clear winner.
Why is GLM-5.2 considered disruptive?
Cost and openness. It is MIT-licensed with a one-million-token context, and Z.ai's first-party pricing (~$1.40 input / $4.40 output per million tokens) is far below Opus 4.8 and GPT-5.5. Because agentic coding burns output tokens, “good enough” at a fraction of the cost changes the buying decision.
Can these benchmark numbers be compared directly?
Not all of them. Several are vendor-reported and use different benchmark versions or harnesses (for example Terminal-Bench 2.0 vs 2.1). FrontierSWE is the cleanest because all three appear on the same leaderboard.
Which should I use?
Treat it as routing rather than a single winner: Opus 4.8 where correctness and long-horizon reliability matter most, GPT-5.5 for OpenAI-ecosystem and broad tool use, and GLM-5.2 where cost, openness and long-context coding dominate.
Sources
Independent and primary sources behind the figures above. Vendor-reported numbers are labelled as such throughout the article.
- FrontierSWE leaderboard — three-way comparison: Opus 4.8 75%, GLM-5.2 74%, GPT-5.5 73%.
- Artificial Analysis — independent Intelligence Index, pricing, parameters and openness for all three models.
- PostTrainBench — AI-R&D automation benchmark; 17 June 2026 update placing GLM-5.2 first.
- Anthropic — Claude Opus 4.8 — positioning, pricing and the honesty / self-review claims.
- Z.ai — GLM-5.2 docs — context window, licensing and vendor-reported benchmark figures.
- OpenAI — Introducing GPT-5.5 — benchmark table, pricing and context window.