
Google DeepMind’s Gemini 2.5 Pro Tops MMLU at 95.2% — But Can It Ship?

Google DeepMind announces Gemini 2.5 Pro with record-setting benchmark scores, but the model remains in limited preview. Developers question whether Google can match OpenAI's deployment velocity.


Google DeepMind has unveiled Gemini 2.5 Pro, which scores 95.2% on MMLU, the highest score ever recorded on the widely cited benchmark. But in a pattern that has become frustratingly familiar for Google, the model is available only to a small group of testers through a waitlist.

The model also scores 92.1% on MATH and 84.3% on GPQA Diamond (graduate-level science questions), and it demonstrates new capabilities in long-form video understanding, processing up to one hour of video in a single request.

The Deployment Problem

This is the third time in 18 months that Google has announced a frontier model with top-tier benchmarks only to face criticism over slow or limited rollout. Gemini 2.0 Flash, announced in December, took four months to reach general API availability. Gemini 2.5 Pro has no announced GA date.

“We’re being deliberate about safety evaluation,” said Google DeepMind CEO Demis Hassabis. “These models are powerful, and we want to get the deployment right.”

Critics counter that the cautious approach has cost Google ground to OpenAI and Anthropic, both of which have shipped multiple model generations in the same period.

Technical Highlights

Gemini 2.5 Pro uses a new “mixture of experts” architecture with 128 experts and 400B total parameters, of which approximately 32B are activated per token. It supports a 2M-token context window and native tool use across Google’s ecosystem (Search, Maps, YouTube, Docs).
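
To make the sparse-activation figures concrete, here is a minimal Python sketch. It uses only the numbers quoted above (128 experts, 400B total parameters, roughly 32B active per token). The router itself is a generic top-k softmax gate written for illustration: the function names, the value of k, and the example logits are all assumptions, not details of Google's implementation, which has not been published.

```python
import math

# Figures quoted in the article.
TOTAL_PARAMS = 400e9
ACTIVE_PARAMS = 32e9
NUM_EXPERTS = 128


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Generic top-k gate (hypothetical, for illustration only).

    Picks the k experts with the highest router logits and returns
    (expert_index, weight) pairs, with weights softmax-normalized
    over the selected experts so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active share per token: {frac:.0%}")

    # Made-up router scores for one token; k=2 is an assumed value.
    logits = [0.0] * NUM_EXPERTS
    logits[5], logits[77] = 3.0, 2.0
    print(route_top_k(logits, k=2))
```

On the quoted figures, each token touches about 8% of the model's weights, which is why a mixture-of-experts model can carry a large total parameter count while keeping per-token compute closer to that of a much smaller dense model.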

The model integrates with Google’s new “AI Sandbox,” which lets enterprises test frontier models against their own data before committing to production deployment.

