Train LLMs 25% Faster Without Touching the Kernels

... PLUS: Glean: Centralized Context Made Claude Cowork 10x Better

In today’s newsletter:

  • Glean: Centralized Context Made Claude Cowork 10x Better

  • Unsloth: Train LLMs 25% Faster Without Touching the Kernels

Reading time: 5 minutes.

Glean published one of the most interesting MCP evaluations we’ve seen so far.

They benchmarked two different context architectures inside Claude Cowork: federated MCP versus centralized indexing with a unified knowledge layer.

Same harness. Same model. Same queries.

The only thing that changed was the context layer.

The results were not close.

Centralized indexing was preferred roughly 2.5x more often, while federated MCP consumed about 30% more tokens on average.

The Problem With Federated MCP

In the federated setup, every application had its own MCP server. Claude had to independently query systems like Gmail, Slack, Drive, Salesforce, GitHub, and Atlassian, then combine everything itself.

That meant multiple tool calls per query, inconsistent ranking quality between systems, repeated retrieval loops, and significant over-fetching to compensate for weak search.

The model then had to filter, normalize, and synthesize all of that context during reasoning. When retrieval failed, the system compensated with even more tool calls and retry loops.

In some cases, federated MCP burned roughly 83k tokens just to produce a correct answer, compared to about 43k for the centralized setup.

Why Centralized Context Performed Better

The centralized approach worked differently.

Instead of querying every source independently, all enterprise data was indexed into one unified layer connected through a knowledge graph.

Claude made a single MCP call and received cleaner, better-ranked context upfront, including relationships between documents, people, conversations, and systems across the organization.

That dramatically reduced the need for excessive retrieval and filtering. Token usage stayed remarkably stable around 42k–44k tokens even as tasks became more complex.
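The shape of the difference is easy to sketch. Below is an illustrative Python outline only: `client.call_tool`, the server names, and the tool names are hypothetical stand-ins, not Glean's or Anthropic's actual MCP interfaces.

```python
# Illustrative only: client.call_tool, server names, and tool names are
# hypothetical stand-ins, not Glean's or Anthropic's actual MCP interfaces.

FEDERATED_SERVERS = ["gmail", "slack", "drive", "salesforce", "github", "atlassian"]

def federated_retrieval(client, query: str) -> list:
    """One search call per system; the model must merge, rank, and filter."""
    results = []
    for server in FEDERATED_SERVERS:
        # Each system has its own ranking quality. Weak results trigger
        # retries and over-fetching, and all of it lands in the context window.
        results.extend(client.call_tool(server, "search", {"query": query}))
    return results  # large, noisy, inconsistently ranked

def centralized_retrieval(client, query: str) -> list:
    """One call against a unified, graph-backed index."""
    # Cross-source relationships are resolved at index time, so the model
    # receives well-ranked context instead of raw per-system search output.
    return client.call_tool("unified_index", "search", {"query": query})
```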

The Gap Gets Worse on Complex Tasks

The most interesting part of the benchmark was what happened as task difficulty increased.

On simpler tasks, centralized indexing won about 66% of the time. On more complex multi-step tasks, that increased to 73%.

That makes sense. In long-running agent workflows, retrieval quality compounds. A missed document or weak ranking early in the chain affects everything downstream: planning, reasoning, synthesis, and the final output itself.

More tool calls do not necessarily fix weak retrieval. In many cases, they just inject more noise into the context window. Once the model is forced to reason over partially relevant or contradictory information, performance starts degrading further.

The Real Insight

MCP standardizes tool connectivity, but it does not standardize context quality.

Two systems can expose the exact same tools to Claude while producing very different outcomes depending on retrieval architecture, ranking quality, indexing strategy, and cross-source relationships.

The difference here was not the model.

It was the context architecture underneath the model.

Why This Matters

Token costs are rising fast. Reasoning models are getting more expensive, while enterprise agents are becoming increasingly retrieval-heavy and multi-step.

You cannot brute-force your way around weak context with more retrieval, more reasoning loops, or more tool calls.

At some point, better search architecture beats more compute.

No new hardware. No accuracy tradeoffs. Just an update.

Unsloth collaborated with NVIDIA to make training another 25% faster on top of Unsloth's existing 2-5x speedup. The improvements turn on automatically on RTX laptops, data center GPUs, and DGX Spark machines.

But the more interesting story is where the gains came from.

The Gains Weren't Where You'd Expect

When people think about faster training, they think about better hardware or better core kernels. The three optimizations here came from somewhere else: the work happening between the fast parts.

The pattern across all three is the same. The training process was quietly doing unnecessary extra work:

  • Recalculating information it already had

  • Letting two tasks block each other instead of running at the same time

  • Repeating a slow step once per expert when once total was enough

Once the core training code is fast, that surrounding overhead stops being invisible. It starts eating up real training time.

The Three Optimizations

1. Stop recalculating the same thing at every layer (+14.3% per batch)

To save memory, short training examples get packed into one long sequence rather than being padded with empty tokens. The model still needs to know where each original example starts and ends inside that packed sequence.

The problem: it was recalculating that boundary information at every transformer layer. For a 28-layer model, that's the same work done 28 times in a row.

The fix is to calculate it once and reuse it across all layers. The forward pass saw the biggest gain because that's where the repeated recalculation hits hardest: +43.3% on Qwen3-14B QLoRA SFT, +14.3% overall per batch.
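In code, the fix is simple. Here's a minimal PyTorch sketch, assuming the boundary metadata is a cu_seqlens-style offsets tensor of the kind variable-length attention kernels consume; the names are illustrative, not Unsloth's actual implementation:

```python
import torch

def packed_boundaries(seq_lens: torch.Tensor) -> torch.Tensor:
    """Cumulative offsets marking where each packed example starts and ends."""
    zero = torch.zeros(1, dtype=torch.int32, device=seq_lens.device)
    return torch.cat([zero, seq_lens.cumsum(0, dtype=torch.int32)])

# Before: every transformer layer rebuilt these offsets from the lengths.
# After: compute once per batch and pass the same tensor to all 28 layers.
seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)
cu_seqlens = packed_boundaries(seq_lens)  # tensor([0, 5, 8, 15])
# for layer in model.layers:
#     layer(hidden_states, cu_seqlens=cu_seqlens)  # reuse, don't recompute
```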

2. Stop letting copying and computing wait on each other (+8%)

Gradient checkpointing saves GPU memory by not holding every intermediate activation in memory during training. Instead, activations get offloaded to CPU RAM and copied back to the GPU when the backward pass needs them.

With a single buffer, this is a waiting game: copy the activation over, wait for it to arrive, run the backward pass, then start the next copy. Copy and compute take turns.

With two buffers, while the backward pass runs on one activation, the next one is already being copied over in the background. Copy hides behind compute.
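Here's a minimal PyTorch sketch of that double-buffering pattern using a dedicated copy stream. It illustrates the technique rather than Unsloth's actual code: `run_backward_segment` is a placeholder, and the CPU tensors are assumed to live in pinned memory so the copies can run asynchronously.

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream for host-to-device copies

def prefetch(cpu_buf: torch.Tensor, gpu_buf: torch.Tensor) -> None:
    # Wait until prior compute using this buffer has finished, then start
    # the copy. non_blocking=True requires pinned (page-locked) CPU memory.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        gpu_buf.copy_(cpu_buf, non_blocking=True)

def run_backward_segment(activation: torch.Tensor) -> None:
    """Placeholder for the real backward computation over one segment."""

def backward_with_double_buffer(cpu_activations, gpu_bufs) -> None:
    """Compute on one buffer while the next activation copies into the other."""
    prefetch(cpu_activations[0], gpu_bufs[0])
    for i in range(len(cpu_activations)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # buffer ready
        if i + 1 < len(cpu_activations):
            prefetch(cpu_activations[i + 1], gpu_bufs[(i + 1) % 2])
        run_backward_segment(gpu_bufs[i % 2])  # copy hides behind this compute
```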

Results on NVIDIA B200 GPUs:

Model    Speedup    Extra VRAM
8B       +8.4%      +0.37 GB
14B      +6.7%      +0.47 GB
32B      +4.6%      +0.23 GB

Training loss was unchanged across all runs.

3. Ask once instead of once per expert (+10-15%)

In mixture-of-experts (MoE) models, a router decides which tokens go to which expert. A naive implementation asks one question per expert: "which tokens are assigned to you?" With many experts, that's many questions, and each one forces a CPU-GPU sync. More experts, more syncs, more slowdown.

The fix groups all token assignments at once using argsort, then counts tokens per expert in a single bincount call. Same routing result, but the CPU-GPU sync happens once instead of once per expert.
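A minimal PyTorch sketch of both approaches. The naive version calls .nonzero() once per expert, and on GPU each call synchronizes because its output size is data-dependent; the fast version does one argsort and one bincount. Function names are illustrative:

```python
import torch

def group_tokens_naive(expert_ids: torch.Tensor, num_experts: int) -> list:
    # One mask + nonzero per expert. On GPU, each .nonzero() forces a
    # CPU-GPU sync, so the cost grows with the number of experts.
    return [(expert_ids == e).nonzero(as_tuple=True)[0]
            for e in range(num_experts)]

def group_tokens_fast(expert_ids: torch.Tensor, num_experts: int):
    # Sort all token->expert assignments at once, then count per expert:
    # one sync instead of one per expert, same routing result.
    order = torch.argsort(expert_ids)  # token indices grouped by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return order, counts  # split `order` by `counts` for per-expert batches

expert_ids = torch.tensor([2, 0, 1, 0, 2, 2])  # router output for 6 tokens
order, counts = group_tokens_fast(expert_ids, num_experts=3)
# counts -> tensor([2, 1, 3]); order lists tokens for expert 0, then 1, then 2
```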

This gave 10-15% speedup in team validation, with +23% on the forward pass in targeted benchmarks.

The Bigger Lesson

These three fixes touch different parts of training, but they share the same root cause: the code between the fast kernels was doing work it didn't need to do.

Packed-sequence metadata rebuilt at every layer. Activations copied one at a time while the backward pass waited. Token routing queried once per expert instead of once total.

None of these were flaws in the math. They were overhead in the plumbing around it.

That's where the 25% came from.

To get all three improvements, just update Unsloth.

That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.

PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.

Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.

WORK WITH US

Looking to promote your company, product, or service to 200K+ AI developers? Get in touch today by replying to this email.