Gemma 4 Just Got Up to 3x Faster

... PLUS: Stop Parsing Documents You Don't Need

In today’s newsletter:

  • ADE Classify: Stop Parsing Documents You Don't Need

  • Gemma 4 Just Got Up to 3x Faster

Reading time: 5 minutes.

ADE Classify: Stop Parsing Documents You Don't Need

Document pipelines have a blind spot. They receive a mixed bundle, parse everything in it, and let the extraction model figure out what's relevant.

LandingAI just released the ADE Classify API to fix this. It sits at the front of your pipeline, before parsing begins, and labels every page so you only process what actually matters.

The Problem

A 50-page mortgage PDF contains two invoices, three bank statements, and 45 pages of cover sheets and policy disclosures.

Parsing everything wastes compute on noise. But the bigger issue is accuracy. Feed a 50-page dump to an extraction model and ask it to pull invoice totals, and it will try to extract financial data from cover sheets and policy disclosures too. The output looks plausible. The data is wrong.

Same root cause both times: the pipeline had no idea what it was looking at before it started.

What It Does

ADE Classify evaluates a document page by page before any parsing happens. You define the classes you care about, the API assigns every page a label and a reasoning explanation, and your pipeline uses those labels to decide what to do next.

Parse this page. Skip that one. Send these to the invoice workflow. Flag those for human review.

How It Works

You pass a list of custom document classes. Classes can be simple labels or include descriptions to remove ambiguity. The difference between a receipt and a formal invoice is something you define directly, no custom model training needed.

The API evaluates every page concurrently. Each page comes back with a class and a reason. Pages that don't fit any class get flagged as unknown with a suggested class rather than silently dropped into the wrong bucket. Unknowns go to human review. The reason gets logged for auditing.
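As a rough illustration, here is how the classes and per-page results could be modeled on the client side. The `DocClass` and `PageResult` types and their field names (`page`, `label`, `reason`, `suggested`) are hypothetical stand-ins, not the actual ADE Classify request or response schema, and the results are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocClass:
    """A class you define: a label plus an optional disambiguating description."""
    name: str
    description: str = ""


@dataclass(frozen=True)
class PageResult:
    """Hypothetical per-page result; not the real API schema."""
    page: int
    label: str                       # one of the class names, or "unknown"
    reason: str                      # explanation, kept for the audit log
    suggested: Optional[str] = None  # suggested class when label == "unknown"


CLASSES = [
    DocClass("invoice", "Formal bill with invoice number, line items, and amount due"),
    DocClass("receipt", "Proof of a completed payment, usually a single total"),
    DocClass("bank_statement"),
]


def split_pages(results, classes):
    """Separate pages to parse from pages that need human review."""
    known = {c.name for c in classes}
    to_parse = [r for r in results if r.label in known]
    to_review = [r for r in results if r.label not in known]
    return to_parse, to_review


# Made-up results for a three-page bundle.
results = [
    PageResult(1, "invoice", "Has invoice number and amount due"),
    PageResult(2, "unknown", "Handwritten note, no matching class", suggested="receipt"),
    PageResult(3, "bank_statement", "Lists account transactions by date"),
]

to_parse, to_review = split_pages(results, CLASSES)
for r in to_review:
    print(f"page {r.page}: review (suggested: {r.suggested}) because {r.reason}")
```

The descriptions on `invoice` and `receipt` are where the ambiguity gets resolved; `bank_statement` is left as a bare label to show that both forms can coexist.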

What Changes

Cost. An insurance claim packet has 100 pages. Classify identifies the 10 medical records and 2 invoices. Discard the remaining 88 before expensive parsing.

Accuracy. Extraction models focused on the right pages produce better results than models trying to reason over a full mixed dump.

Routing. Different document types need different workflows. Send bank statements to verification, invoices to financial extraction, unknowns to human review.

Explainability. Every classification includes reasoning. Unknown pages get flagged with suggested classes for human review.
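The routing and cost points can be sketched in a few lines. The workflow names in `ROUTES` are hypothetical, and the example assumes each medical record and invoice in the 100-page claim packet is a single page, as the numbers above imply.

```python
from collections import defaultdict

# Hypothetical workflow names keyed by page label.
ROUTES = {
    "bank_statement": "verification",
    "invoice": "financial_extraction",
    "medical_record": "medical_extraction",
    "unknown": "human_review",
}


def route_pages(labels):
    """Group 1-based page numbers by workflow; unmapped labels go to 'skip'."""
    queues = defaultdict(list)
    for page, label in enumerate(labels, start=1):
        workflow = ROUTES.get(label, "skip")
        queues[workflow].append(page)
    return dict(queues)


# The claim packet from the text: 10 medical records, 2 invoices, 88 other pages.
labels = ["medical_record"] * 10 + ["invoice"] * 2 + ["cover_sheet"] * 88
queues = route_pages(labels)

kept = sum(len(pages) for name, pages in queues.items() if name != "skip")
skipped_fraction = len(queues["skip"]) / len(labels)
print(f"keep {kept} pages, skip {skipped_fraction:.0%}")
```

Only the 12 kept pages ever reach the parser; the other 88 are dropped on the label alone.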

Gemma 4 Just Got Up to 3x Faster

Google recently released Multi-Token Prediction (MTP) drafters for the Gemma 4 family. Same output quality, up to 3x faster inference. No retraining, no fine-tuning, no quality tradeoff.

To understand why this matters, you first need to understand where the slowness was coming from.

The Real Bottleneck in LLM Inference

Standard LLM inference is often slow because it is limited by memory bandwidth, not by compute. For every single token generated, the GPU has to move billions of model parameters from VRAM into the compute units, do the calculation, produce one token, then move all those parameters again for the next one.

The compute itself is fast. The constant movement of data to feed it is the bottleneck. This leaves the GPU underutilized for most of the inference process, generating one token at a time regardless of how obvious that next token is.

Predicting "words" after "Actions speak louder than..." costs the same amount of compute as solving a complex reasoning step. That imbalance is where speculative decoding comes in.
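A back-of-the-envelope calculation shows how tight this ceiling is. The numbers below are illustrative assumptions, not measurements: a 31B dense model, 16-bit weights, and a round 2000 GB/s of memory bandwidth. The estimate also ignores KV cache reads and compute time.

```python
def decode_ceiling_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when every token must
    stream all weights from memory once (ignores KV cache and compute)."""
    model_gb = params_billion * bytes_per_param   # weight size in GB
    seconds_per_token = model_gb / bandwidth_gb_s
    return 1.0 / seconds_per_token


# Illustrative assumptions: 31B parameters, 2 bytes each, 2000 GB/s bandwidth.
ceiling = decode_ceiling_tokens_per_s(31, 2, 2000)
print(f"~{ceiling:.0f} tokens/s ceiling")
```

Under these assumptions the weights alone cap a single stream at roughly 32 tokens per second, no matter how easy the next token is to predict.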

What MTP Drafters Do

Instead of generating one token at a time with the full model, MTP drafters pair the large target model with a small, fast drafter model.

The drafter runs first and predicts several tokens ahead in the time it would take the target model to generate just one. The target model then verifies all of those predicted tokens in a single forward pass.

If the target model agrees with the draft, the full sequence gets accepted and one additional token is generated on top. That means in the time it normally takes to produce one token, you get the entire drafted sequence plus one more.

If the target model disagrees at some point, the drafted tokens before that point are kept, the target's own token replaces the first rejected one, and the rest of the draft is discarded. Quality is never compromised because the target model always has the final say.
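The draft-then-verify loop can be sketched with toy models. This is a minimal greedy-matching variant, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and sampling-based decoding would use a probabilistic acceptance rule instead of exact matching.

```python
def target_next(tokens):
    """Toy 'large' model: deterministic next token from the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy 'small' drafter: agrees with the target except when the
    context length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 50 if len(tokens) % 4 == 0 else guess


def speculative_decode(prompt, num_new, k=4):
    """Greedy speculative decoding: draft k tokens, keep the longest prefix
    the target agrees with, then append the target's own next token."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. Drafter proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target checks every drafted position. A real system does this
        #    in one batched forward pass; the toy model is called per position.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)   # fall back to the target's token
                break
            accepted.append(draft[i])
        else:
            # All k drafts accepted: the same pass yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def plain_decode(prompt, num_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


prompt = [3, 7, 1]
fast, passes = speculative_decode(prompt, num_new=20)
assert fast == plain_decode(prompt, num_new=20)   # identical output
print(f"{passes} target passes instead of 20")
```

The assertion is the point of the exercise: the speculative path produces exactly the tokens the target would have produced alone, just with fewer target passes.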

The Engineering Behind It

Two decisions make the drafters fast without adding cost.

The draft model shares the target model's KV cache. This means it doesn't recalculate the context the larger model has already processed. It picks up where the target left off, rather than starting from scratch.

For the smaller edge models (E2B and E4B), where the final logit calculation becomes its own bottleneck, Google implemented a clustering technique in the embedder to accelerate that step specifically.

On Apple Silicon with the 26B MoE model, running multiple requests together (batch sizes of 4 to 8) unlocks up to 2.2x speedup locally. Similar gains appear on NVIDIA A100s at higher batch sizes.

Before and After

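Setup                                       | Standard inference | With MTP drafter
--------------------------------------------|--------------------|-----------------------------
Gemma 4 31B Dense                           | Baseline           | Up to 3x faster
Gemma 4 26B MoE (Apple Silicon, batch 4-8)  | Baseline           | ~2.2x faster
E2B / E4B edge models                       | Baseline           | Faster + lower battery drain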

Output quality is identical across all configurations. The target model's verification step ensures that.
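Why "up to" rather than a fixed multiple? The gain depends on how often the target accepts the drafter's guesses. A simplified model, assuming each drafted token is accepted independently with the same probability, gives the expected number of tokens per target pass. The acceptance rates below are made-up values for illustration, not figures from Google.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens produced per target pass when each drafted token is
    accepted independently with probability accept_rate (a simplification)."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Illustrative acceptance rates with a draft length of 4 tokens.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=4), 2))
```

This counts tokens per pass, not wall-clock speedup: the time spent running the drafter and verifying the draft eats into the gain, so real speedups land below these figures.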

How to Use It

MTP drafters for the full Gemma 4 family are available now under the Apache 2.0 license. Weights are on Hugging Face and Kaggle. They work with Transformers, MLX, vLLM, SGLang, and Ollama out of the box. For mobile, they're available through Google AI Edge Gallery on Android and iOS.

That’s all for today. Thank you for reading. See you in the next issue with more AI Engineering insights.

PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.

Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
