OpenAI Releases GPT-5: First Model to Surpass Human Experts on SWE-Bench

OpenAI's latest flagship model achieves 87.3% on SWE-Bench, marking the first time an AI system has consistently outperformed senior software engineers on real-world GitHub issues. The model is available immediately in ChatGPT and via API.

mujeeburehman0000@gmail.com

Contributor

2 min read Jul 14, 2025

Twitter LinkedIn

OpenAI today released GPT-5, its most capable language model to date, achieving a groundbreaking 87.3% on the SWE-Bench benchmark — a suite of real-world software engineering tasks drawn from popular GitHub repositories.

The result is significant because SWE-Bench tests are not toy problems. They require understanding complex codebases, identifying bugs, and producing working patches that pass existing test suites. GPT-5’s score surpasses the typical performance of mid-to-senior engineers on the same benchmark.

Key Technical Improvements

GPT-5 introduces what OpenAI calls “extended reasoning chains,” a new training technique that allows the model to maintain coherent multi-step reasoning across contexts of up to 1 million tokens. The model also features improved tool use, native code execution, and a new “verification loop” that catches its own errors before outputting.

On MMLU, GPT-5 scores 94.1% (up from 86.4% for GPT-4o). On MATH, it achieves 89.7%. On HumanEval, it reaches 96.2%. But it’s the SWE-Bench result that has the industry buzzing — previous state-of-the-art was Anthropic’s Claude 4 Opus at 81.2%.

Pricing and Availability

GPT-5 is available immediately in ChatGPT Plus and Team plans. API pricing starts at $15 per million input tokens and $60 per million output tokens for the flagship model, with a smaller “GPT-5 Mini” variant priced at $3/$12.

Industry Reaction

Anthropic CEO Dario Amodei called the results “impressive” but noted that “benchmark saturation doesn’t tell the full story about real-world reliability.” Google DeepMind’s Jeff Dean highlighted that Gemini’s upcoming update would address “different dimensions of capability.”

The release comes amid intensifying competition in the frontier model space, with Anthropic, Google, Meta, and xAI all racing to release their next-generation models in the coming months.

Benchmarks GPT-5 LLM OpenAI

Model Releases · 11mo ago

Claude 4 Opus Ships with Native Multimodal Reasoning and 2M Context

Anthropic's Claude 4 Opus is now generally available with a 2-million-token context window, native image understanding that matches GPT-5, and a new constitutional AI framework for safer responses.

mujeeburehman0000@gmail.com

2 min

Model Releases · 11mo ago

Google DeepMind’s Gemini 2.5 Pro Tops MMLU at 95.2% — But Can It Ship?

Google DeepMind announces Gemini 2.5 Pro with record-setting benchmark scores, but the model remains in limited preview. Developers question whether Google can match OpenAI's deployment velocity.

mujeeburehman0000@gmail.com

2 min

Model Releases · 11mo ago

Meta Releases Llama 4: 405B Open-Source Model Rivals Proprietary Alternatives

Meta's Llama 4 405B matches GPT-4o performance across most benchmarks while being fully open-source. The model also introduces a 70B variant that runs on consumer hardware.

mujeeburehman0000@gmail.com

1 min

OpenAI Releases GPT-5: First Model to Surpass Human Experts on SWE-Bench

Related Articles

Claude 4 Opus Ships with Native Multimodal Reasoning and 2M Context

Google DeepMind’s Gemini 2.5 Pro Tops MMLU at 95.2% — But Can It Ship?

Meta Releases Llama 4: 405B Open-Source Model Rivals Proprietary Alternatives