Highlights
They don't just see images; they think with them!
OpenAI's newly released o3 and o4-mini models are a real step toward more practical, more autonomous agentic AI. These models don't just converse: they are trained to think for longer, and to judge on their own when and how to combine tools like web search, Python analysis, and image generation to solve complex, multi-step tasks that AI previously struggled with. It feels like AI is one step closer to genuinely doing work for you. And while getting stronger, they have generally also gotten cheaper.
How did OpenAI pull this off? The key seems to be large-scale reinforcement learning (RL), applied not only to improve reasoning quality but, more importantly, to teach the models to use tools strategically. At the same time, the models truly think with images, folding visual information into the reasoning process rather than merely describing what they see, which helps on problems that mix vision and text. Safety capabilities have improved alongside performance, too.
Concretely, o3 is the strongest model yet, standing out in complex domains like coding, math, and visual analysis; it makes 20 percent fewer major errors than o1 on difficult tasks and sets new records on benchmarks such as Codeforces and MMMU. o4-mini is faster and more economical, yet its math (a new AIME SOTA) and coding performance remains impressive, and its usage limits are more generous. Both models can call every tool inside ChatGPT; they are available now to Plus/Pro/Team users and to developers via the API, and free users can try o4-mini as well. OpenAI also open-sourced Codex CLI, a tool that runs in the terminal.
Today, we’re releasing OpenAI o3 and o4-mini, the latest in our o-series of models trained to think for longer before responding. These are the smartest models we’ve released to date, representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images. Critically, these models are trained to reason about when and how to use tools to produce detailed and thoughtful answers in the right output formats, typically in under a minute, to solve more complex problems. This allows them to tackle multi-faceted questions more effectively, a step toward a more agentic ChatGPT that can independently execute tasks on your behalf. The combined power of state-of-the-art reasoning with full tool access translates into significantly stronger performance across academic benchmarks and real-world tasks, setting a new standard in both intelligence and usefulness.
What’s changed
OpenAI o3 is our most powerful reasoning model that pushes the frontier across coding, math, science, visual perception, and more. It sets a new SOTA on benchmarks including Codeforces, SWE-bench (without building a custom model-specific scaffold), and MMMU. It’s ideal for complex queries requiring multi-faceted analysis and whose answers may not be immediately obvious. It performs especially strongly at visual tasks like analyzing images, charts, and graphics. In evaluations by external experts, o3 makes 20 percent fewer major errors than OpenAI o1 on difficult, real-world tasks—especially excelling in areas like programming, business/consulting, and creative ideation. Early testers highlighted its analytical rigor as a thought partner and emphasized its ability to generate and critically evaluate novel hypotheses—particularly within biology, math, and engineering contexts.
OpenAI o4-mini is a smaller model optimized for fast, cost-efficient reasoning—it achieves remarkable performance for its size and cost, particularly in math, coding, and visual tasks. On AIME 2025, o4-mini scores 99.5 percent when given access to a Python interpreter, effectively saturating this benchmark. In expert evaluations, it also outperforms its predecessor, o3‑mini, on non-STEM tasks as well as domains like data science. Thanks to its efficiency, o4-mini supports significantly higher usage limits than o3, making it a strong high-volume, high-throughput option for questions that benefit from reasoning.
External expert evaluators rated both models as demonstrating improved instruction following and more useful, verifiable responses than their predecessors, thanks to improved intelligence and the inclusion of web sources. Compared to previous iterations of our reasoning models, these two models should also feel more natural and conversational, especially as they reference memory and past conversations to make responses more personalized and relevant.
Multimodal
Coding
Instruction following and agentic tool use
All models are evaluated at high ‘reasoning effort’ settings—similar to variants like ‘o4-mini-high’ in ChatGPT.
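Outside ChatGPT, the same knob is exposed in the API as a reasoning-effort parameter on o-series models. A minimal sketch with the official openai Python SDK (the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "high" asks the model to think longer before answering, trading latency
# for quality; it mirrors the "o4-mini-high" variant in ChatGPT.
response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",
    messages=[
        {"role": "user", "content": "Prove that the sum of two odd integers is even."}
    ],
)

print(response.choices[0].message.content)
```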
Continuing to scale reinforcement learning
Throughout the development of OpenAI o3, we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining. By retracing the scaling path—this time in RL—we’ve pushed an additional order of magnitude in both training compute and inference-time reasoning, yet still see clear performance gains, validating that the models’ performance continues to improve the more they’re allowed to think. At equal latency and cost with OpenAI o1, o3 delivers higher performance in ChatGPT—and we’ve validated that if we let it think longer, its performance keeps climbing.
We also trained both models to use tools through reinforcement learning—teaching them not just how to use tools, but to reason about when to use them. Their ability to deploy tools based on desired outcomes makes them more capable in open-ended situations—particularly those involving visual reasoning and multi-step workflows. This improvement is reflected both in academic benchmarks and real-world tasks, as reported by early testers.
Thinking with images
For the first time, these models can integrate images directly into their chain of thought. They don’t just see an image—they think with it. This unlocks a new class of problem-solving that blends visual and textual reasoning, reflected in their state-of-the-art performance across multimodal benchmarks.
People can upload a photo of a whiteboard, a textbook diagram, or a hand-drawn sketch, and the model can interpret it—even if the image is blurry, reversed, or low quality. With tool use, the models can manipulate images on the fly—rotating, zooming, or transforming them as part of their reasoning process.
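The rotating and zooming happen inside the model's own tool loop, but the operations themselves are ordinary image transforms. For intuition, here is a rough sketch of that kind of manipulation using Pillow; it illustrates the operations, not OpenAI's internal pipeline, and the file path and crop box are placeholders:

```python
from PIL import Image

# A user-uploaded photo (placeholder path).
img = Image.open("whiteboard.jpg")

# "Rotate": fix an upside-down or sideways capture.
upright = img.rotate(180)

# "Zoom": crop a region of interest, then upscale it so that
# small handwriting becomes legible.
box = (100, 50, 400, 300)  # (left, top, right, bottom), placeholder values
detail = upright.crop(box)
detail = detail.resize((detail.width * 2, detail.height * 2),
                       Image.Resampling.LANCZOS)

detail.save("whiteboard_detail.jpg")
```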
These models deliver best-in-class accuracy on visual perception tasks, enabling them to solve questions that were previously out of reach. Check out the visual reasoning research blog to learn more.
Toward agentic tool use
OpenAI o3 and o4-mini have full access to tools within ChatGPT, as well as your own custom tools via function calling in the API. These models are trained to reason about how to solve problems, choosing when and how to use tools to produce detailed and thoughtful answers in the right output formats quickly—typically in under a minute.
For example, a user might ask: “How will summer energy usage in California compare to last year?” The model can search the web for public utility data, write Python code to build a forecast, generate a graph or image, and explain the key factors behind the prediction, chaining together multiple tool calls. Reasoning allows the models to react and pivot as needed to the information they encounter. For instance, they can search the web multiple times with the help of search providers, look at results, and try new searches if they need more info.
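Through the API, that behavior surfaces as a function-calling loop: the model requests a tool call, your code executes it, and the result is fed back so the model can search again or answer. A minimal sketch with the openai Python SDK; search_web is a hypothetical stand-in for a real search backend:

```python
import json
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    """Hypothetical search backend; swap in a real provider."""
    return f"(stub results for: {query})"

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content":
             "How will summer energy usage in California compare to last year?"}]

# The model alternates between tool calls and reasoning until it has an answer.
while True:
    resp = client.chat.completions.create(model="o3", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no more tools requested: this is the final answer
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_web(**args),  # feed the result back to the model
        })
```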
This flexible, strategic approach allows the models to tackle tasks that require access to up-to-date information beyond the model’s built-in knowledge, extended reasoning, synthesis, and output generation across modalities.
All examples were completed with OpenAI o3.
Advancing cost-efficient reasoning
Cost vs performance: o3-mini and o4-mini
Cost vs performance: o1 and o3
OpenAI o3 and o4-mini are the most intelligent models we have ever released, and they’re also often more efficient than their predecessors, OpenAI o1 and o3‑mini. For example, on the 2025 AIME math competition, the cost-performance frontier for o3 strictly improves over o1, and similarly, o4-mini’s frontier strictly improves over o3‑mini. More generally, we expect that for most real-world usage, o3 and o4-mini will also be both smarter and cheaper than o1 and o3‑mini, respectively.
Safety
Each improvement in model capabilities warrants commensurate improvements to safety. For OpenAI o3 and o4-mini, we completely rebuilt our safety training data, adding new refusal prompts in areas such as biological threats (biorisk), malware generation, and jailbreaks. This refreshed data has led o3 and o4-mini to achieve strong performance on our internal refusal benchmarks (e.g., instruction hierarchy, jailbreaks). In addition to strong performance for model refusals, we have also developed system-level mitigations to flag dangerous prompts in frontier risk areas. Similar to our earlier work in image generation, we trained a reasoning LLM monitor which works from human-written and interpretable safety specifications. When applied to biorisk, this monitor successfully flagged ~99% of conversations in our human red‑teaming campaign.
We stress tested both models with our most rigorous safety program to date. In accordance with our updated Preparedness Framework, we evaluated o3 and o4-mini across the three tracked capability areas covered by the Framework: biological and chemical, cybersecurity, and AI self-improvement. Based on the results of these evaluations, we have determined that both o3 and o4‑mini remain below the Framework’s “High” threshold in all three categories. We have published the detailed results from these evaluations in the accompanying system card.
Codex CLI: frontier reasoning in the terminal
We’re also sharing a new experiment: Codex CLI, a lightweight coding agent you can run from your terminal. It works directly on your computer and is designed to maximize the reasoning capabilities of models like o3 and o4-mini, with upcoming support for additional API models like GPT‑4.1.
You can get the benefits of multimodal reasoning from the command line by passing screenshots or low fidelity sketches to the model, combined with access to your code locally. We think of it as a minimal interface to connect our models to users and their computers. Codex CLI is fully open-source at github.com/openai/codex today.
Alongside, we are launching a $1 million initiative to support projects using Codex CLI and OpenAI models. We will evaluate and accept applications for grants in increments of $25,000 USD in the form of API credits. Proposals can be submitted here.
Access
ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high. ChatGPT Enterprise and Edu users will gain access in one week. Free users can try o4-mini by selecting ‘Think’ in the composer before submitting their query. Rate limits across all plans remain unchanged from the prior set of models.
We expect to release OpenAI o3‑pro in a few weeks with full tool support. For now, Pro users can still access o1‑pro.
Both o3 and o4-mini are also available to developers today via the Chat Completions API and Responses API (some developers will need to verify their organizations to access these models). The Responses API supports reasoning summaries, the ability to preserve reasoning tokens around function calls for better performance, and will soon support built-in tools like web search, file search, and code interpreter within the model’s reasoning. To get started, explore our docs and stay tuned for more updates.
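As a hedged sketch of what requesting a reasoning summary looks like with the openai Python SDK (the model choice and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "high", "summary": "auto"},  # ask for a reasoning summary
    input="Which is larger, 9.11 or 9.9? Explain briefly.",
)

# The output list interleaves reasoning items (carrying summaries)
# with the final assistant message.
for item in response.output:
    if item.type == "reasoning":
        for part in item.summary:
            print("reasoning summary:", part.text)
    elif item.type == "message":
        print("answer:", item.content[0].text)
```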
What’s next
Today’s updates reflect the direction our models are heading in: we’re converging the specialized reasoning capabilities of the o-series with more of the natural conversational abilities and tool use of the GPT‑series. By unifying these strengths, our future models will support seamless, natural conversations alongside proactive tool use and advanced problem-solving.