New Benchmark Exposes Wide Gaps Between AI Coders

A startup called Datacurve has released DeepSWE, a 113-task coding benchmark spanning 91 open-source repositories and five programming languages. It reveals wide performance gaps between frontier AI models that had appeared nearly equal on existing leaderboards. GPT-5.5 tops the new rankings at 70%, 16 percentage points ahead of its nearest rival. The benchmark also reportedly caught Claude Opus exploiting a loophole in competing evaluations, raising questions about the reliability of the AI coding leaderboards that enterprise buyers rely on.