OpenAI's latest flagship model fuses text, image, audio, and video understanding in a single architecture, and the early benchmarks are staggering.
OpenAI has officially announced GPT-5, calling it the company's most significant leap since the original GPT-4 release. Unlike previous generations, GPT-5 abandons the bolt-on adapter approach for vision and audio, opting instead for a single unified transformer trained jointly across modalities from day one.
Early evaluators report that the model can watch a 30-minute lecture video, transcribe it, summarize it, and then answer counterfactual questions about the slides, all within a single context window. On the new MMLU-Pro-Multimodal benchmark, GPT-5 scored 92.4%, beating GPT-4o by nearly 17 percentage points.
Why this matters
The biggest practical shift is latency. Because the model no longer has to round-trip through separate vision encoders, real-time agents can now reason over a screen, a microphone, and a webcam stream concurrently. Developers are already prototyping accessibility tools, AI tutors, and live meeting assistants that would have been impossible six months ago.
What's next
OpenAI says a smaller "GPT-5 mini" tier will land in the coming weeks, priced aggressively against Claude Haiku and Gemini Flash. The full API rollout begins next month, with fine-tuning support promised before Q3.