Finally, Google released Gemini 3 — beats Sonnet 4.5 and GPT-5.1
PLUS: Agents With Persistent Workspace Context
In today’s newsletter:
Dimension - Agents With Persistent Workspace Context
Google released Gemini 3: Key breakthroughs and benchmark highlights
Reading time: 3 minutes.
Most agents can retrieve info, but they don’t understand your files, tools, or workflow. They restart from zero every session and rely on manual instructions.
Dimension adds persistent workspace context. The agent knows your documents, tasks, collaborators, and tools, and carries this context across sessions.
Most frameworks only support retrieval. Dimension supports both retrieval and actions like drafting emails, updating docs, or creating calendar events, with all actions reviewable before execution.
Workflows let it run multi-step tasks in the background.
To measure these capabilities, the team built Task Arena, a benchmark that evaluates agents on real tools such as Gmail, Drive, Calendar, and Notion. Many agents fail the action tests because they lack usable connectors. Dimension performs well on both retrieval and action benchmarks.
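Dimension's SDK isn't public, so the snippet below is only a minimal sketch of the reviewable-action flow described above: the agent drafts an action, a human edits and approves it, and only then does it run. All names here (`ActionDraft`, `ReviewableAgent`, the connector keys) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ActionDraft:
    """A proposed action held for human review before execution."""
    tool: str        # e.g. "gmail.send" or "calendar.create_event"
    payload: dict    # editable arguments for the tool call
    approved: bool = False

class ReviewableAgent:
    """Actions are drafted first and executed only after approval."""

    def __init__(self, connectors):
        self.connectors = connectors   # tool name -> callable
        self.pending = []

    def propose(self, tool, payload):
        draft = ActionDraft(tool=tool, payload=payload)
        self.pending.append(draft)
        return draft

    def execute_approved(self):
        """Run only the drafts a human approved, possibly after edits."""
        for draft in [d for d in self.pending if d.approved]:
            self.connectors[draft.tool](**draft.payload)
            self.pending.remove(draft)

# Usage: the agent drafts an email; the user edits it, then approves.
agent = ReviewableAgent({"gmail.send": lambda **kw: print("sent:", kw)})
draft = agent.propose("gmail.send", {"to": "alice@example.com", "body": "Hi"})
draft.payload["body"] = "Hi Alice, notes from today's sync attached."
draft.approved = True
agent.execute_approved()
```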
Key Features
Workspace-level context across email, files, calendars, and repos
Sub-200 ms retrieval with dense indexing (see the sketch after this list)
Reviewable actions with editable drafts
Background workflows for multi-step tasks
Strong results on Task Arena retrieval and action tests
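The sub-200 ms figure is plausible because dense indexing pushes the expensive work (embedding documents) to index time; at query time, scoring is a single matrix multiply. A minimal sketch of that pattern, with a toy hash-based embedder standing in for a real embedding model:

```python
import numpy as np

class DenseIndex:
    """Minimal dense index: embed once, score queries with one matmul."""

    def __init__(self, embed):
        self.embed = embed        # callable: list[str] -> (n, d) unit vectors
        self.docs, self.vecs = [], None

    def add(self, docs):
        self.docs.extend(docs)
        vecs = self.embed(docs)
        self.vecs = vecs if self.vecs is None else np.vstack([self.vecs, vecs])

    def search(self, query, k=3):
        scores = self.vecs @ self.embed([query])[0]   # cosine sim (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]

# Toy embedder: hash words into a unit vector (stand-in for a real model).
def toy_embed(texts, d=64):
    out = np.zeros((len(texts), d))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            out[i, hash(w) % d] += 1.0
    return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-9)

index = DenseIndex(toy_embed)
index.add(["Q3 planning doc", "standup notes Tuesday", "offsite calendar invite"])
print(index.search("when is the offsite?"))
```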
Gemini 3 - A Major Step Forward in Reasoning and Multimodal Intelligence
Google released Gemini 3 with two variants: Gemini 3 Pro, the default model, and Gemini 3 Deep Think, an extended-reasoning mode for complex, multi-step problems. The update focuses on deeper reasoning, long-context multimodal understanding, and more reliable agentic execution.
Benchmark Performance
Gemini 3 Pro reaches 1501 on LMArena, up from 1451 for Gemini 2.5 Pro. It scores 37.5% on Humanity’s Last Exam (no tools) and 91.9% on GPQA Diamond, placing it among the strongest academic-reasoning models.
In mathematics, the model hits 23.4% on MathArena Apex, 95% on AIME 2025 without tools, and 100% with code execution. For multimodal tasks, Gemini 3 Pro scores 81% on MMMU-Pro and 87.6% on Video-MMMU, indicating major gains in cross-modal reasoning.

Gemini 3 Deep Think
Deep Think shows further improvements:
41.0% on Humanity’s Last Exam (no tools)
93.8% on GPQA Diamond
45.1% on ARC-AGI-2 with code execution
ARC-AGI-2 is specifically designed to test generalization and hypothesis-testing. At 45.1%, Deep Think scores roughly two to three times higher than other frontier models, which typically land in the mid-teens to low twenties.

Architecture and Core Capabilities
Gemini 3 Pro uses a sparse mixture-of-experts transformer with native multimodal inputs across text, images, audio, video, and code. The MoE setup activates only a small subset of expert parameters per token, improving compute efficiency.
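Google hasn't published Gemini 3's internals beyond this description, but top-k gating is the standard way sparse MoE layers pick which experts run per token. A toy NumPy sketch of that routing step (the expert count, dimensions, and linear-map experts are all arbitrary):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        (tokens, d_model) activations
    router_w: (d_model, n_experts) router projection
    experts:  list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ router_w                          # (tokens, n_experts)
    outputs = []
    for i, token in enumerate(x):
        top = np.argsort(logits[i])[-k:]           # indices of the top-k experts
        gate = np.exp(logits[i][top] - logits[i][top].max())
        gate /= gate.sum()                         # softmax over the selected k
        # Only k of n experts run per token: the "sparse" in sparse MoE.
        outputs.append(sum(g * experts[e](token) for g, e in zip(gate, top)))
    return np.stack(outputs)

# Toy usage: 8 random linear-map experts, 2 active per token.
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [(lambda W: (lambda t: t @ W))(rng.normal(size=(d, d)) * 0.1) for _ in range(n)]
x = rng.normal(size=(4, d))
print(moe_layer(x, rng.normal(size=(d, n)), experts).shape)  # (4, 16)
```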
The model supports up to 1M tokens of context, enabling ingestion of full repositories, books, long research documents, multimodal PDFs, and extended video transcripts. Its unified architecture handles interleaved data such as diagrams inside documents, annotated audio, and code mixed with text without switching models.
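In practice, the long window means you can hand the model an entire artifact in one call instead of chunking it for retrieval. A sketch using the google-genai Python SDK; the model ID is an assumption, so check Google's current model list:

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Load a large artifact (a full repo dump, book, or long report)
# and send it in a single request instead of chunking it for RAG.
with open("repo_dump.txt") as f:
    big_doc = f.read()

response = client.models.generate_content(
    model="gemini-3-pro-preview",   # assumed model ID; verify against the docs
    contents=[
        big_doc,
        "Summarize the architecture and list the main entry points.",
    ],
)
print(response.text)
```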
Spatial and Multimodal Improvements
On ScreenSpot-Pro, a benchmark for computer-use and spatial reasoning, Gemini 3 Pro jumps from 11.4% to 72.7%, reflecting large gains in interface understanding. Video reasoning also improves, with higher frame-rate comprehension and stable long-range temporal reasoning.
Agentic Coding and Tool Use
Gemini 3 Pro scores 54.2% on Terminal-Bench 2.0 and 76.2% on SWE-bench Verified, demonstrating more consistent tool use and structured task execution. The model can plan workflows, chain tool calls, validate outputs, and adjust steps through iterative reasoning — a major shift from single-shot generation.
Google highlights architectural work to improve long-horizon reasoning loops, hypothesis branching, and evaluation of intermediate results.
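That plan, call, validate, adjust loop maps onto the Gemini API's function-calling support, where the SDK can invoke a plain Python function on the model's behalf. A minimal sketch with the google-genai SDK; the model ID and the `run_tests` tool are illustrative:

```python
from google import genai
from google.genai import types

def run_tests(package: str) -> dict:
    """Illustrative tool: pretend to run a test suite and report results."""
    return {"package": package, "passed": 41, "failed": 1,
            "failing": ["test_parser_unicode"]}

client = genai.Client()

# With a Python function passed as a tool, the SDK handles the
# call-tool -> return-result -> continue-reasoning loop automatically.
response = client.models.generate_content(
    model="gemini-3-pro-preview",   # assumed model ID; verify against the docs
    contents="Run the tests for the 'parser' package and propose a fix plan.",
    config=types.GenerateContentConfig(tools=[run_tests]),
)
print(response.text)
```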
Generative Interfaces
Gemini 3 supports generative UIs that dynamically build layouts, simulations, web tools, or visualizations based on user queries. Rather than returning text only, it constructs the interface that best fits the task, combining code, graphics, data views, and interactive components.
Key Takeaways
SOTA reasoning, topping LMArena at 1501
Deep Think achieves 45.1% on ARC-AGI-2, a major jump in generalization
Leading multimodal performance across video, imagery, and interleaved documents
Agentic execution via strong coding and terminal benchmarks
Generative interfaces that construct dynamic tools and layouts
1M-token context for large-scale ingestion and planning
Sparse MoE architecture for efficient, targeted compute
That’s a Wrap
That’s all for today. Thank you for reading this edition. See you in the next issue with more AI Engineering insights.
PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.
Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
WORK WITH US
Looking to promote your company, product, or service to 160K+ AI developers? Get in touch today by replying to this email.

