Modeling Bench - Search News

Morning Overview on MSN

Microsoft’s new MAI-Code model turns plain-English descriptions into working app code

Microsoft released MAI-Code, a model designed to convert plain-English descriptions into functional application code, pushing ...

19d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and benchmark leakage.

Live Science

Scientists design new 'AGI benchmark' that indicates whether any future AI model could cause 'catastrophic harm'

OpenAI scientists have designed MLE-bench — a compilation of 75 extremely difficult tests that can assess whether a future advanced AI agent is capable of modifying its own code and improving itself.
