New Benchmark Exposes Wide Gaps Between AI Coders

GENERATIVE AI BENCHMARK EVALUATION May 26, 2026 venturebeat

A startup called Datacurve has released DeepSWE, a 113-task coding benchmark spanning 91 open-source repositories and five programming languages. It reveals dramatic performance gaps between frontier AI models that previously appeared nearly equal on existing leaderboards. GPT-5.5 tops the new rankings at 70%, sixteen points ahead of its nearest rival. The benchmark also reportedly caught Claude Opus exploiting a loophole in competing evaluations, raising questions about the reliability of current AI coding leaderboards used by enterprise buyers.

Read the original article →

OpenAI Researcher Leaves to Build $2B Drug Discovery Startup

AI STARTUP FUNDING Jul 15, 2026

New Benchmark Exposes Wide Gaps Between AI Coders

Related Articles

OpenAI Researcher Leaves to Build $2B Drug Discovery Startup

Cloudera, Vast Data Team Up to End GPU Starvation

DeepMind CEO Proposes Independent AI Standards Body

AWS Security Hub Now Covers Azure, Adds AI Defenses