August 2025 was marked by the release of several significant updates and completely new models in the field of artificial intelligence that promise to substantially change the AI landscape. Anthropic, Google DeepMind, and OpenAI presented their latest achievements, demonstrating progress in agentic tasks, world generation, and open language models. Let’s examine these releases.
Open GPT-OSS Models from OpenAI
OpenAI has finally opened their models by releasing GPT-OSS – a family of open models designed for powerful reasoning, agentic tasks, and universal use cases for developers. This series includes two models:
- gpt-oss-120b: A large model with 117 billion total parameters (and 5.1 billion active), designed for production, general, and high-reasoning use cases that fits on a single H100 GPU (80 GB).
- gpt-oss-20b: A smaller model with 21 billion total parameters (and 3.6 billion active), designed for latency reduction, local or specialized use, operating within 16 GB of memory, perfectly suited for consumer hardware.
Both models are released under the permissive Apache 2.0 license, allowing free use for experiments, fine-tuning, and commercial deployment without restrictions.
Key features of GPT-OSS include:
- Customizable reasoning level (low, medium, high) for balancing between speed and detail.
- Full Chain-of-Thought (CoT) providing access to the model’s reasoning process for debugging and trust enhancement (not intended for end users).
- Agentic capabilities including built-in tools for web browsing, function calling with defined schemas, and Python code execution.
- Native MXFP4 quantization for the MoE layer, ensuring efficient memory usage.
In terms of performance, gpt-oss-120b nearly matches OpenAI o4-mini on core reasoning benchmarks and even surpasses o4-mini on health-related queries (HealthBench) and competitive mathematics (AIME 2024 & 2025). gpt-oss-20b matches or exceeds OpenAI o3-mini despite its smaller size. The models also show excellent results in tool usage and CoT reasoning.

Safety is a fundamental aspect of OpenAI’s approach. The models underwent comprehensive safety training and evaluations, including testing a version of gpt-oss-120b specifically fine-tuned for adversarial purposes. Results show that even with malicious fine-tuning, the models couldn’t achieve high capability levels according to the internal Preparedness Framework. OpenAI also announced a Red Teaming Challenge with a $500,000 prize pool to encourage researchers to identify new safety issues.
GPT-OSS models are easily deployable and available for download on Hugging Face, supported by leading platforms such as Azure, Hugging Face, vLLM, Ollama, llama.cpp, LM Studio, AWS, and many others. Microsoft is also releasing GPU-optimized versions of gpt-oss-20b for Windows devices.
Source 1
Source 2
Claude Opus 4.1 from Anthropic: Enhanced Coding and Data Analysis
Anthropic has released Claude Opus 4.1, representing a significant update to Claude Opus 4. This model demonstrates improvements in:
- Agentic tasks, real-world coding, and reasoning.
- Coding performance, achieving 74.5% on the SWE-bench Verified benchmark.
- Deep research skills and data analysis, particularly in detail tracking and agentic search.

According to feedback, Claude Opus 4.1 notably improves multi-file code refactoring (GitHub) and bug-fixing accuracy in large codebases without unnecessary changes (Rakuten Group). Windsurf notes a one standard deviation improvement over Opus 4 on their junior developer benchmark, similar to the jump from Sonnet 3.7 to Sonnet 4.
Anthropic recommends all users upgrade from Opus 4 to Opus 4.1, which is already available for paid Claude users, in Claude Code, via API, as well as on Amazon Bedrock and Google Cloud’s Vertex AI at the same price as Opus 4. Claude Opus 4.1 is a hybrid reasoning model and uses various benchmarking methodologies, including tool usage and extended thinking for some tasks.
Source
Genie 3 from Google DeepMind: Advanced Real-Time World Generation Model
Google DeepMind introduced Genie 3 – a universal world model capable of generating an unprecedented diversity of interactive environments. This is their first world model enabling real-time interaction while significantly improving consistency and realism compared to Genie 2.
Key capabilities of Genie 3 include:
- Generation of dynamic worlds that can be navigated in real-time at 24 frames per second, maintaining consistency for several minutes at 720p resolution. The model can maintain environment consistency for up to several minutes, with visual memory extending up to one minute back.
- Modeling of physical world properties, including water, lighting, and complex environmental interactions.
- Natural world simulation, creating vibrant ecosystems with animal behavior and complex vegetation.
- Animation and fantasy modeling, allowing creation of fantastical scenarios and expressive animated characters.
- Location exploration and historical settings, transcending geographical and temporal boundaries.
- Prompt-driven world events, allowing modification of the generated world, such as weather conditions or introducing new objects and characters.
Genie 3 is viewed as a key step toward AGI (Artificial General Intelligence) as it enables training AI agents in an unlimited number of rich simulation environments. It’s already being used to research embodied agents, such as the SIMA agent, for achieving complex goals in generated worlds.
Despite breakthroughs, Genie 3 has limitations, including limited action space for agents, difficulties modeling interactions between multiple independent agents, inaccurate simulation of real geographical locations, text rendering issues, and limited interaction duration (minutes rather than hours). Google DeepMind emphasizes deep commitment to responsible development, releasing Genie 3 as a limited research preview to gather feedback and minimize risks.
Source
Conclusion
The presented releases reflect three key directions in modern AI development. OpenAI opened their models for the first time under Apache 2.0, providing developers with powerful tools for commercial use without restrictions. GPT-OSS demonstrates technology maturity – models run on consumer hardware and show performance comparable to closed systems.
Anthropic focused on practical improvements, bringing Claude Opus 4.1 to 74.5% on SWE-bench Verified. This represents concrete progress in solving real engineering tasks where accuracy and reliability are crucial.
Google DeepMind chose a fundamentally different path with Genie 3, creating the first world model with real-time interactivity. Despite current limitations, the technology opens new possibilities for agent training and content creation.
The overall trend is clear: transition from closed research projects to ready products. The openness of GPT-OSS, practicality of Claude Opus 4.1, and innovation of Genie 3 show that the industry is moving toward more accessible and specialized solutions while maintaining high safety standards.