For the last year, the frontier-AI business model has leaned on one assumption: if you want the best long-running coding agent, you pay for a closed model. GLM-5.2 weakens that assumption. It does not clearly beat Claude Opus 4.8, and it does not universally replace GPT-5.5 — but it is close enough on important software-engineering benchmarks, and cheap enough, to change the decision.

How to read this comparison. The figures below are a snapshot from mid-June 2026. Some come from independent evaluators (FrontierSWE, Artificial Analysis); others are vendor-reported by Z.ai, OpenAI or Anthropic and have not been independently audited. Crucially, benchmark versions and harnesses differ between vendors, so not every number is apples-to-apples. Where that matters, we flag it rather than smoothing it over.

The cleanest comparison: FrontierSWE

The fairest single benchmark here is FrontierSWE, because all three models sit on the same leaderboard under the same harness. Claude Opus 4.8 ranks first at 75% dominance, GLM-5.2 follows at 74%, and GPT-5.5 at 73%.

Chart: FrontierSWE dominance (%), same leaderboard, 0–100 scale, higher is better. Claude Opus 4.8: 75%; GLM-5.2: 74%; GPT-5.5: 73%.
Plotted on a true 0–100 axis so the gap is not exaggerated: the whole field is separated by two percentage points. Source: FrontierSWE leaderboard.

The honest reading is not “GLM wins.” It is that GLM-5.2 is now within touching distance of the best closed coding-agent models on a difficult long-horizon software-engineering benchmark. For an open-weights model, that is the real news. Z.ai itself frames GLM as trailing Opus 4.8 by roughly one point on FrontierSWE — a claim that happens to line up with the independent leaderboard.

Opus 4.8 still looks like the quality leader

Claude Opus 4.8 should not be dismissed. On the current public data it looks like the strongest of the three for serious agentic coding and long-running software tasks. It leads on FrontierSWE, and Artificial Analysis places it top of its broader Intelligence Index as well.

Chart: Artificial Analysis Intelligence Index (independent composite, 0–100). Claude Opus 4.8: 61; GPT-5.5: 60; GLM-5.2: 51.
A broad composite rather than a coding-only score, so it weights different tasks than FrontierSWE — useful as a cross-check, not a tie-breaker. Source: Artificial Analysis.

The less obvious part is honesty. Anthropic says Opus 4.8 is more likely to flag uncertainty, push back, and catch flaws in its own work — and claims it is roughly four times less likely than Opus 4.7 to let flaws in its own code pass unremarked. Those are vendor claims, so treat them as positioning rather than proof. But for coding agents the underlying point matters: a model that writes impressive code while quietly leaving broken assumptions behind can be expensive to use, even when the token bill looks reasonable.

The strongest version of the Opus case: reach for Opus 4.8 when correctness, judgement and long-running agent reliability matter more than price.

GPT-5.5: strong, but a mixed benchmark story

GPT-5.5 is not weak. It scores very strongly across OpenAI's own suite — Terminal-Bench, GDPval, OSWorld-Verified, BrowseComp and tool-use evaluations. The complication is that the GPT-5.5-versus-Opus-4.8 comparison is not always clean: OpenAI's launch materials benchmark heavily against Claude Opus 4.7, because 4.8 had not yet shipped. Later third-party data (FrontierSWE, Artificial Analysis) then makes Opus 4.8 look stronger in some agentic areas.

That does not make GPT-5.5 bad — it means it should be described precisely. GPT-5.5 looks especially strong for broad tool use, professional knowledge work, terminal workflows and OpenAI-ecosystem integration, and it may be more token-efficient in some workflows. But on current public long-horizon coding-agent comparisons, both Opus 4.8 and GLM-5.2 put real pressure on it.

GLM-5.2's real advantage: cost plus openness

GLM-5.2's biggest strength is not that it beats every frontier model, because it does not. Its strength is that it gets close while being open and far cheaper to run. According to Artificial Analysis, it is an MIT-licensed mixture-of-experts model with 744B total and 40B active parameters and a one-million-token context window, priced on Z.ai's first-party API at about $1.40 input and $4.40 output per million tokens. On the same input/output basis, Opus 4.8 is $5 / $25 and GPT-5.5 is $5 / $30.

Chart: First-party API list price, US$ per million tokens (lower is cheaper). GLM-5.2: $1.40 input, $4.40 output; Opus 4.8: $5 input, $25 output; GPT-5.5: $5 input, $30 output.
Vendor-published list prices (GLM-5.2 on Z.ai's first-party API). Output tokens dominate the bill for agentic work; actual spend depends on workload, caching and provider. Source: Artificial Analysis; vendor pricing pages.

That gap matters because coding agents are token factories. They plan, inspect files, write code, run tests, read errors, revise and repeat, so output-token price drives real cost. If a model is slightly weaker but roughly 5.7 to 6.8 times cheaper on output at list price ($4.40 versus $25 or $30 per million tokens), the economics can flip fast. GLM-5.2 does not need to be the best model in the world; it only needs to be good enough on enough coding-agent tasks that teams start routing a large share of work to it. Our AI development & programming directory lists the agentic tools where that routing decision actually plays out.
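The arithmetic behind that gap can be sketched in a few lines. The prices are the list prices quoted above; the workload (two million input tokens and half a million output tokens per session) is purely an assumption for illustration, and the function names are hypothetical. Caching, discounts and third-party provider pricing are ignored.

```python
# List prices in US$ per million tokens, as quoted in this article.
PRICES = {
    "GLM-5.2": {"input": 1.40, "output": 4.40},
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """US$ cost of one session at list price (no caching, no discounts)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def output_price_ratio(model: str, baseline: str = "GLM-5.2") -> float:
    """How many times more expensive `model` is than `baseline` per output token."""
    return PRICES[model]["output"] / PRICES[baseline]["output"]


if __name__ == "__main__":
    # ASSUMPTION: a hypothetical agent session, not a measured workload.
    input_tokens, output_tokens = 2_000_000, 500_000
    for model in PRICES:
        cost = session_cost(model, input_tokens, output_tokens)
        ratio = output_price_ratio(model)
        print(f"{model}: ${cost:.2f} per session, {ratio:.1f}x GLM-5.2 output price")
```

Changing the token mix changes the overall ratio, which is why a real decision should plug in token counts measured from your own agent runs rather than the placeholder figures here.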

Best for: cost-sensitive, high-volume agentic workloads, on-premise or self-hosted deployments, and long-context coding where the output bill is the binding constraint.

The benchmark trap: don't compare every number directly

The most important transparency point: not every number should be treated as apples-to-apples. The table below keeps the version and harness details visible on purpose.

| Benchmark | Reported figures | Why caution is needed |
| --- | --- | --- |
| SWE-Bench Pro | GLM-5.2 62.1 (Z.ai) · GPT-5.5 58.6 (OpenAI) | Vendor-reported on each side; OpenAI itself notes memorisation concerns around this benchmark. |
| Terminal-Bench | GLM-5.2 81.0 on v2.1 (Z.ai) · GPT-5.5 82.7 on v2.0 (OpenAI) | Different benchmark versions and harnesses; not directly comparable. |
| PostTrainBench | GLM-5.2 #1; Opus 4.8 Max at 34.1% after the 17 June 2026 update | A specialised AI-R&D automation benchmark (improving a small model on one H100 in 10 hours), not a general coding score. |

The safe conclusion is to look for repeated signals across benchmarks rather than crowning one leaderboard. A one-line “GPT beats GLM” or “GLM beats GPT” claim is easy to make and easy to get wrong.
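One way to make that discipline mechanical is to refuse to compare two scores unless they share a benchmark, a version and a reporter. The sketch below encodes that rule over the figures quoted in this article. The rule itself is a deliberately strict assumption, not an established standard, and the `Result` type and function names are hypothetical. An empty version string means the article gives no version.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Result:
    benchmark: str
    version: str      # "" when no version is stated in the article
    model: str
    score: float
    reported_by: str  # who published the number


def directly_comparable(a: Result, b: Result) -> bool:
    """Strict rule (an assumption): same benchmark, same version, same reporter."""
    return (a.benchmark, a.version, a.reported_by) == (
        b.benchmark, b.version, b.reported_by)


# Figures as quoted in this article.
RESULTS = [
    Result("FrontierSWE", "", "Claude Opus 4.8", 75.0, "FrontierSWE"),
    Result("FrontierSWE", "", "GLM-5.2", 74.0, "FrontierSWE"),
    Result("FrontierSWE", "", "GPT-5.5", 73.0, "FrontierSWE"),
    Result("Terminal-Bench", "2.1", "GLM-5.2", 81.0, "Z.ai"),
    Result("Terminal-Bench", "2.0", "GPT-5.5", 82.7, "OpenAI"),
    Result("SWE-Bench Pro", "", "GLM-5.2", 62.1, "Z.ai"),
    Result("SWE-Bench Pro", "", "GPT-5.5", 58.6, "OpenAI"),
]


def comparable_pairs(results):
    """Return every pair of results that passes the strict comparability rule."""
    return [(a, b) for a, b in combinations(results, 2)
            if directly_comparable(a, b)]


if __name__ == "__main__":
    for a, b in comparable_pairs(RESULTS):
        print(f"{a.benchmark}: {a.model} {a.score} vs {b.model} {b.score}")
```

Under this rule only the FrontierSWE pairs survive, which matches the article's reading that it is the cleanest comparison; the vendor-reported pairs are left as signals rather than head-to-head results.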

What the evidence actually supports

| Claim | What the data supports | Confidence |
| --- | --- | --- |
| GLM-5.2 is close to Opus 4.8 on long-horizon coding | FrontierSWE: Opus 75% vs GLM 74%; Z.ai also reports a ~1-point gap. | High |
| GLM-5.2 edges GPT-5.5 on some coding-agent benchmarks | FrontierSWE: GLM 74% vs GPT 73%; SWE-Bench Pro favours GLM, but version/harness caveats apply. | Medium-high |
| Opus 4.8 is the strongest of the three for serious agentic coding | Tops FrontierSWE and the Artificial Analysis Intelligence Index (61 vs 60 vs 51). | High |
| GLM-5.2 has the best cost/openness story | MIT-licensed, 1M context, ~$1.40/$4.40 per 1M tokens vs $5/$25 and $5/$30. | High |
| PostTrainBench is favourable to GLM-5.2 | GLM ranked #1 after the 17 June update; specialised AI-R&D benchmark, not general coding. | Medium |
| Opus 4.8 is marketed around honesty / less bluffing | Anthropic says it flags uncertainty more and lets fewer of its own code flaws pass. | Medium-high (vendor) |

Verdict

Claude Opus 4.8 looks like the strongest quality choice for difficult, long-running coding-agent work. GPT-5.5 remains a very strong closed model, especially for OpenAI-ecosystem workflows, broad professional tasks and tool-heavy work. GLM-5.2 is the disruptor: not clearly better than Opus 4.8, not universally better than GPT-5.5, but close enough on several important coding-agent benchmarks, open enough to deploy freely, and cheap enough to force a rethink.

The frontier-model market is no longer simply “pay more to get the only thing that works.” It is becoming a routing problem: use Opus 4.8 when quality matters most, use GPT-5.5 where OpenAI's tool ecosystem and general reliability win, and test GLM-5.2 aggressively wherever cost, openness and long-context coding matter. GLM-5.2 does not end the closed-model business model — it makes that model harder to defend. If you want to understand why long sessions still drift regardless of which model you pick, our piece on context rot in AI agents is a useful companion.
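The routing idea can be sketched as a small rule-based picker. Everything here is illustrative: the `Task` fields, the precedence order and the cheap default are assumptions of this sketch, and the returned strings are display names from this article, not real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task attributes, chosen to mirror the routing advice above.
    correctness_critical: bool = False
    needs_openai_ecosystem: bool = False
    must_self_host: bool = False


def pick_model(task: Task) -> str:
    """Pick a model name for a task using a fixed precedence order (assumed)."""
    if task.must_self_host:
        # Of the three, only GLM-5.2 is described as open-weights.
        return "GLM-5.2"
    if task.correctness_critical:
        return "Claude Opus 4.8"
    if task.needs_openai_ecosystem:
        return "GPT-5.5"
    # ASSUMPTION: default to the cheapest option for everything else.
    return "GLM-5.2"


if __name__ == "__main__":
    print(pick_model(Task(correctness_critical=True)))
    print(pick_model(Task(needs_openai_ecosystem=True)))
    print(pick_model(Task()))
```

A production router would more likely score tasks on measured quality and cost per workload than rely on boolean flags; the point of the sketch is only that the choice is per task, not per organisation.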

Common questions

Is GLM-5.2 better than Claude Opus 4.8?

Not clearly. On FrontierSWE — the one leaderboard ranking all three on the same harness — Opus 4.8 leads at 75% with GLM-5.2 at 74%. Independent composite scores also place Opus highest. GLM is best called the strongest open-weights challenger, not a clear winner.

Why is GLM-5.2 considered disruptive?

Cost and openness. It is MIT-licensed with a one-million-token context, and Z.ai's first-party pricing (~$1.40 input / $4.40 output per million tokens) is far below Opus 4.8 and GPT-5.5. Because agentic coding burns output tokens, “good enough” at a fraction of the cost changes the buying decision.

Can these benchmark numbers be compared directly?

Not all of them. Several are vendor-reported and use different benchmark versions or harnesses (for example Terminal-Bench 2.0 vs 2.1). FrontierSWE is the cleanest because all three appear on the same leaderboard.

Which should I use?

Treat it as routing rather than a single winner: Opus 4.8 where correctness and long-horizon reliability matter most, GPT-5.5 for OpenAI-ecosystem and broad tool use, and GLM-5.2 where cost, openness and long-context coding dominate.

Sources

Independent and primary sources behind the figures above. Vendor-reported numbers are labelled as such throughout the article.