Claude 4.8 Tops SWE-bench, USAMO 2026, and GraphWalks - Benchmarks vs GPT-5.5 and Gemini

May 28, 2026

▲ Claude 4.8 launch - new AI coding and math champion

Claude 4.8, also known as Claude Opus 4.8, is Anthropic's newest flagship AI model. Released on May 28, 2026, it set top scores on three major benchmarks at once - coding, advanced mathematics, and long-document retrieval. If you follow AI model performance, these numbers are hard to ignore.

What Is Claude 4.8 and What Changed?

Claude 4.8 introduces three significant upgrades over its predecessor. First, Effort Control - a Low/High/Max dial that lets users choose how deeply the model reasons through a problem, saving usage limits for the tasks that need them. Second, parallel sub-task execution in Claude Code allows hundreds of tasks to run simultaneously, letting developers refactor entire codebases in one pass. Third, the model has become measurably more honest: it now says "I don't know" and acknowledges errors, with unsupported answers dropping noticeably in user testing. Separately, Fast Mode, the lighter inference tier, is now 3x cheaper and 2.5x faster than before.
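To make the Fast Mode figures concrete, here is a minimal sketch of the arithmetic. It assumes "3x cheaper" means one third of the previous cost and "2.5x faster" means the previous latency divided by 2.5; the baseline numbers are hypothetical and are not Anthropic pricing or measured timings.

```python
def fast_mode_after(baseline_cost: float, baseline_seconds: float) -> tuple[float, float]:
    """Apply the article's multipliers to a hypothetical baseline.

    Assumption: "3x cheaper" -> cost / 3, "2.5x faster" -> latency / 2.5.
    """
    return baseline_cost / 3, baseline_seconds / 2.5


# Hypothetical baseline: a job that used to cost 3.0 units and take 10 seconds.
cost, seconds = fast_mode_after(3.0, 10.0)
print(cost, seconds)  # 1.0 4.0
```

Under that reading, the same job would cost about 33% of what it did and finish in 40% of the time.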

▲ Claude 4.8 benchmark scores vs GPT-5.5 and Gemini

Claude 4.8 Benchmark Results vs GPT-5.5 and Gemini

The benchmark data tells a clear story. On SWE-bench Pro - a rigorous test using real-world software engineering tasks - Claude 4.8 scored 69.2%, compared to 58.6% for GPT-5.5 and 54.2% for Gemini 3.1 Pro. That is a lead of 10.6 percentage points over GPT-5.5 and 15.0 over Gemini 3.1 Pro in one of the most demanding coding evaluations available. On USAMO 2026, the USA Mathematical Olympiad, Claude 4.8 reached 96.7% - up from 69.3% in Claude 4.7, a single-generation jump of 27.4 points. On GraphWalks 1M, which tests retrieval accuracy across 1-million-token contexts, the model scored 68.1% versus 40.3% in 4.7, a gain of 27.8 percentage points.
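The margins quoted above are simple differences in percentage points, and they can be checked with a few lines of Python. The scores are copied from the paragraph; the function and variable names are illustrative only.

```python
# Scores in percent, as reported in the article.
swe_bench_pro = {"Claude 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2}

# (Claude 4.7 score, Claude 4.8 score) per benchmark.
generational = {
    "USAMO 2026": (69.3, 96.7),
    "GraphWalks 1M": (40.3, 68.1),
}


def margins(scores: dict, leader: str) -> dict:
    """Percentage-point lead of `leader` over every other entry."""
    top = scores[leader]
    return {name: round(top - s, 1) for name, s in scores.items() if name != leader}


def gains(pairs: dict) -> dict:
    """Percentage-point gain from the old score to the new one."""
    return {name: round(new - old, 1) for name, (old, new) in pairs.items()}


print(margins(swe_bench_pro, "Claude 4.8"))
print(gains(generational))
```

The GraphWalks 1M difference comes out to 27.8 points, which rounds to the roughly 28-point gain usually quoted.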

▲ Effort Control and Fast Mode impact for everyday users

What Effort Control Means for Everyday Claude Users

The Effort Control feature is the most practically significant addition for regular users. Previously, Claude 4.7 always ran at maximum reasoning intensity, consuming usage limits at a fixed rate regardless of task complexity. Claude 4.8 lets you set it to Low for quick summaries and classifications, High for standard tasks, with output quality roughly matching 4.7 at full power, or Max for the deepest reasoning the model can produce. This means the same weekly limit now stretches further: you spend fewer tokens on routine tasks and save depth for complex analysis, long document review, or difficult coding problems.
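One way to picture the dial is as a simple routing rule from task type to effort level. The sketch below is purely illustrative: the `Effort` enum, the task labels, and `choose_effort` are hypothetical names invented here, not part of any Anthropic API or product setting.

```python
from enum import Enum


class Effort(Enum):
    """The three levels named in the article (hypothetical representation)."""
    LOW = "low"
    HIGH = "high"
    MAX = "max"


# Hypothetical task labels, grouped the way the article describes them.
LOW_TASKS = {"summary", "classification"}
MAX_TASKS = {"complex_analysis", "long_document_review", "hard_coding"}


def choose_effort(task_kind: str) -> Effort:
    """Pick an effort level: Low for quick jobs, Max for deep ones, High otherwise."""
    if task_kind in LOW_TASKS:
        return Effort.LOW
    if task_kind in MAX_TASKS:
        return Effort.MAX
    return Effort.HIGH


for task in ("summary", "email_draft", "hard_coding"):
    print(task, "->", choose_effort(task).value)
```

Defaulting to High for anything unlisted mirrors the article's description of High as the setting for standard tasks.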

Key Takeaways

① Coding benchmark #1 - SWE-bench Pro 69.2%, beating GPT-5.5 and Gemini 3.1 Pro by over 10 points in real-world software tasks.

② Generational math leap - USAMO 2026 score jumped +27.4 points from 4.7 to 4.8, the largest single-generation gain in Claude history.

③ Effort Control changes the economics - Low/High/Max dial lets you preserve usage limits and get better results where it counts.

Claude 4.8 marks Anthropic's strongest benchmark performance to date. Whether this lead holds as competitors respond remains to be seen, but the May 2026 leaderboard belongs to Claude.

📌 Sources: Anthropic, MacRumors, MarkTechPost (2026)

Tech News by InClicks