Finally, Google released Gemini 3 — beats Sonnet 4.5 and GPT-5.1

PLUS: Agents With Persistent Workspace Context

In today’s newsletter:

  • Dimension - Agents With Persistent Workspace Context

  • Google released Gemini 3: Key breakthroughs and benchmark highlights

Reading time: 3 minutes.

Most agents can retrieve info, but they don’t understand your files, tools, or workflow. They restart from zero every session and rely on manual instructions.

Dimension adds persistent workspace context. The agent knows your documents, tasks, collaborators, and tools, and carries this context across sessions.

Most agent frameworks support only retrieval. Dimension supports both retrieval and actions, such as drafting emails, updating docs, or creating calendar events, with every action reviewable before execution.
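
The review-before-execute pattern is simple to picture in code. Here is a minimal sketch of the idea; the `DraftAction` shape and connector names are illustrative assumptions, not Dimension’s actual API:

```python
from dataclasses import dataclass

@dataclass
class DraftAction:
    """A proposed side effect, held for human review before execution."""
    kind: str               # e.g. "send_email", "update_doc", "create_event"
    payload: dict           # arguments the connector would receive
    approved: bool = False

def review(drafts: list[DraftAction]) -> list[DraftAction]:
    """Show each draft to the user; only approved drafts get executed."""
    for d in drafts:
        print(f"[{d.kind}] {d.payload}")
        d.approved = input("Execute? (y/n) ").strip().lower() == "y"
    return [d for d in drafts if d.approved]

def execute(draft: DraftAction) -> None:
    # A real system would dispatch to the Gmail/Docs/Calendar connector here.
    print(f"executing {draft.kind} with {draft.payload}")

# The agent proposes; the user disposes.
proposed = [DraftAction("send_email", {"to": "alex@example.com", "subject": "Q3 notes"})]
for action in review(proposed):
    execute(action)
```

The key design choice: the agent itself never causes side effects; everything routes through the approved list.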

Workflows let the agent run multi-step tasks in the background.

To measure these capabilities, the team built Task Arena, a benchmark that evaluates agents on real tools such as Gmail, Drive, Calendar, and Notion. Many agents fail the action tests because they lack usable connectors. Dimension performs well on both retrieval and action benchmarks.
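
The internals of Task Arena aren’t spelled out here, but a benchmark of this shape reduces to: give the agent an instruction against a real tool, then inspect the tool’s state (or the answer) to grade the run. A hypothetical harness sketch, with all names invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ArenaTask:
    prompt: str                   # instruction given to the agent
    kind: str                     # "retrieval" or "action"
    check: Callable[[], bool]     # inspects real tool state to grade the run

def run_benchmark(agent: Callable[[str], None],
                  tasks: list[ArenaTask]) -> dict[str, float]:
    """Per-kind pass rate: did the expected answer or state change occur?"""
    results: dict[str, list[bool]] = {}
    for task in tasks:
        agent(task.prompt)        # the agent acts through its connectors
        results.setdefault(task.kind, []).append(task.check())
    return {kind: sum(oks) / len(oks) for kind, oks in results.items()}

# Toy demo: a fake inbox and an agent that "archives" on request.
inbox = {"promo-123": "unarchived"}
def toy_agent(prompt: str) -> None:
    if "archive" in prompt.lower():
        inbox["promo-123"] = "archived"

tasks = [ArenaTask("Archive the promo email", "action",
                   lambda: inbox["promo-123"] == "archived")]
print(run_benchmark(toy_agent, tasks))   # {'action': 1.0}
```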

Key Features

  • Workspace-level context across email, files, calendars, and repos

  • Sub-200 ms retrieval with dense indexing (see the sketch after this list)

  • Reviewable actions with editable drafts

  • Background workflows for multi-step tasks

  • Strong results on Task Arena retrieval and action tests
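
“Dense indexing” means embedding workspace documents once and answering queries with a vector similarity search. A minimal sketch in numpy; the embedder is a random stand-in, since Dimension’s actual model and index are not described here:

```python
import numpy as np

def embed(texts: list[str], dim: int = 384) -> np.ndarray:
    """Random stand-in: a real system calls an embedding model here."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Index once: embed every workspace document and keep the matrix in memory.
docs = ["Q3 planning notes", "Invoice from Acme", "Standup summary 11/18"]
index = embed(docs)

def search(query: str, k: int = 2) -> list[str]:
    """Dot product of unit vectors is cosine similarity; argsort gives top-k."""
    q = embed([query])[0]
    top = np.argsort(index @ q)[::-1][:k]
    return [docs[i] for i in top]

print(search("planning"))   # with real embeddings, the planning notes rank first
```

With the index held in memory, a query is one matrix-vector product, which is why sub-200 ms lookups over a whole workspace are plausible.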

Gemini 3 - A Major Step Forward in Reasoning and Multimodal Intelligence

Google released Gemini 3 with two variants: Gemini 3 Pro, the default model, and Gemini 3 Deep Think, an extended-reasoning mode for complex, multi-step problems. The update focuses on deeper reasoning, long-context multimodal understanding, and more reliable agentic execution.

Benchmark Performance

Gemini 3 Pro reaches 1501 on LMArena, up from Gemini 2.5 Pro’s 1451. It scores 37.5% on Humanity’s Last Exam (no tools) and 91.9% on GPQA Diamond, placing it among the strongest academic-reasoning models.

In mathematics, the model hits 23.4% on MathArena Apex, 95% on AIME 2025 without tools, and 100% with code execution. For multimodal tasks, Gemini 3 Pro scores 81% on MMMU-Pro and 87.6% on Video-MMMU, indicating major gains in cross-modal reasoning.

Gemini 3 Deep Think

Deep Think shows further improvements:

  • 41.0% on Humanity’s Last Exam (no tools)

  • 93.8% on GPQA Diamond

  • 45.1% on ARC-AGI-2 with code execution

ARC-AGI-2 is specifically designed to test generalization and hypothesis-testing. The 45.1% score represents roughly a 3× jump over other frontier models, which typically land in the mid-teens to low-twenties.

Architecture and Core Capabilities

Gemini 3 Pro uses a sparse mixture-of-experts transformer with native multimodal inputs across text, images, audio, video, and code. The MoE setup activates only a small subset of expert parameters per token, improving compute efficiency.
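
The “subset of experts per token” idea is easiest to see in the router. Here is a toy top-k MoE layer in numpy; expert count, routing, and scale are invented for illustration, since Gemini’s internals are not public:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each expert is its own feed-forward weight; the router decides which ones run.
experts = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(scale=0.02, size=(d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only those experts do any work."""
    logits = x @ router_w                        # (tokens, n_experts) routing scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]     # indices of the k best experts
        gates = np.exp(logits[t][top])
        gates /= gates.sum()                     # softmax over the chosen experts
        for g, e in zip(gates, top):
            out[t] += g * (x[t] @ experts[e])    # k matmuls instead of n_experts
    return out

tokens = rng.normal(size=(4, d_model))           # a tiny batch of token embeddings
print(moe_layer(tokens).shape)                   # (4, 64): same shape, sparser compute
```

Total parameters grow with the number of experts, but per-token compute grows only with k, which is the efficiency argument for sparse MoE.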

The model supports up to 1M tokens of context, enabling ingestion of full repositories, books, long research documents, multimodal PDFs, and extended video transcripts. Its unified architecture handles interleaved data such as diagrams inside documents, annotated audio, and code mixed with text without switching models.
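
For developers, the practical upshot is that very large inputs fit in a single call. A minimal sketch using the google-genai Python SDK; the model id below is an assumption, so check the official model list before relying on it:

```python
# pip install google-genai
from pathlib import Path
from google import genai

client = genai.Client()   # reads GEMINI_API_KEY from the environment

# Concatenate a small repo into one prompt; a 1M-token window fits many
# full codebases without chunking or retrieval.
repo_text = "\n\n".join(
    f"### {path}\n{path.read_text(errors='ignore')}"
    for path in Path("my_repo").rglob("*.py")
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",   # assumed model id; check the official list
    contents="Summarize the architecture of this codebase:\n\n" + repo_text,
)
print(response.text)
```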

Spatial and Multimodal Improvements

On ScreenSpot-Pro, a benchmark for computer use and spatial reasoning, Gemini 3 Pro jumps from 11.4% to 72.7%, reflecting large gains in interface understanding. Video reasoning also improves, with higher frame-rate comprehension and stable long-range temporal reasoning.

Agentic Coding and Tool Use

Gemini 3 Pro scores 54.2% on Terminal-Bench 2.0 and 76.2% on SWE-bench Verified, demonstrating more consistent tool use and structured task execution. The model can plan workflows, chain tool calls, validate outputs, and adjust steps through iterative reasoning — a major shift from single-shot generation.
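
That plan, call, validate, adjust loop is the core agentic pattern. A framework-agnostic toy sketch; the tool registry, stopping rule, and stubbed model are assumptions for illustration, not Gemini’s actual tool-calling API:

```python
import json
from typing import Callable

def run_tests() -> str:
    return "2 passed, 1 failed: test_parse"      # stub tool for the demo

TOOLS: dict[str, Callable[[], str]] = {"run_tests": run_tests}

def call_model(history: list[dict]) -> dict:
    """Stub for a model call that returns either a tool request or an answer."""
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "run_tests"}             # first step: gather evidence
    return {"answer": "test_parse fails; fix the parser's date handling."}

history = [{"role": "user", "content": "Why is CI red?"}]
for _ in range(5):                               # bound the loop: agents must halt
    step = call_model(history)
    if "answer" in step:                         # the model decided it knows enough
        print(step["answer"])
        break
    result = TOOLS[step["tool"]]()               # execute the requested tool
    history.append({"role": "tool", "content": json.dumps({step["tool"]: result})})
```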

Google highlights architectural work to improve long-horizon reasoning loops, hypothesis branching, and evaluation of intermediate results.

Generative Interfaces

Gemini 3 supports generative UIs that dynamically build layouts, simulations, web tools, or visualizations based on user queries. Rather than returning text only, it constructs the interface that best fits the task, combining code, graphics, data views, and interactive components.
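
One way to picture this: the model emits a structured UI spec instead of prose, and the client renders it. The spec format below is purely illustrative; the source doesn’t detail Google’s actual output format:

```python
import json

# What a generative-UI response might look like: components, not prose.
spec = json.loads("""
{
  "layout": "two-column",
  "components": [
    {"type": "chart", "title": "AIME 2025", "series": [95, 100]},
    {"type": "table", "title": "Benchmarks",
     "rows": [["LMArena", 1501], ["GPQA Diamond", "91.9%"]]}
  ]
}
""")

def render(spec: dict) -> str:
    """Toy renderer: turn the spec into HTML the way a client app would."""
    parts = [f'<div class="{spec["layout"]}">']
    for c in spec["components"]:
        parts.append(f"<section><h3>{c['title']}</h3><!-- {c['type']} here --></section>")
    parts.append("</div>")
    return "\n".join(parts)

print(render(spec))
```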

Key Takeaways

  • SOTA reasoning, topping LMArena at 1501

  • Deep Think achieves 45.1% on ARC-AGI-2, a major jump in generalization

  • Leading multimodal performance across video, imagery, and interleaved documents

  • Agentic execution via strong coding and terminal benchmarks

  • Generative interfaces that construct dynamic tools and layouts

  • 1M-token context for large-scale ingestion and planning

  • Sparse MoE architecture for efficient, targeted compute

That’s a Wrap

That’s all for today. Thanks for reading. See you in the next issue with more AI Engineering insights.

PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.

Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.

WORK WITH US

Looking to promote your company, product, or service to 160K+ AI developers? Get in touch today by replying to this email.