Run 70B LLMs on a 4GB GPU
... PLUS: Evals for AI Agents
In today's newsletter:
Firecrawl Spark 1 Pro & Mini: New Models for Web Data Extraction
AirLLM: Run Massive LLMs on Consumer Hardware
Evals for AI Agents: Anthropic's Testing Guide
Reading time: 4 minutes.
Firecrawl released Spark 1 Mini and Spark 1 Pro. These specialized models turn a prompt into structured JSON by autonomously navigating and searching the web.
The Spark 1 Lineup
Spark 1 Mini (Efficiency Default)
Priced 60% lower than previous versions
Designed for high-volume tasks like profile scraping or product extraction
~40% recall
Surpasses several higher-cost extraction tools
Spark 1 Pro (Reasoning Tier)
Built for complex, multi-domain research
Handles authentication flows and nested menus
~50% recall
Highest accuracy tier currently available in the Spark lineup
Technical Capabilities
Spark models operate directly on the DOM structure. They identify elements based on intent rather than brittle CSS selectors, enabling stable extraction even when layouts change.
Mini fits bulk extraction on predictable sites.
Pro fits deeper research across multiple domains.
Both models are live on the Playground, API, and SDKs.
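Structured extraction of this kind typically boils down to a prompt plus a target schema. As a rough illustration, here is what building such a request could look like; the field names (`urls`, `prompt`, `schema`) follow the general shape of Firecrawl's extract-style API but are assumptions here, so check the official docs before relying on them:

```python
import json

# Hypothetical request payload for an extract-style endpoint; exact field
# names for Spark 1 may differ -- consult Firecrawl's API reference.
def build_extract_request(prompt: str, urls: list, schema: dict) -> str:
    payload = {
        "urls": urls,      # pages the model may navigate from
        "prompt": prompt,  # intent in natural language, not CSS selectors
        "schema": schema,  # desired JSON shape of the extracted output
    }
    return json.dumps(payload)

request_body = build_extract_request(
    prompt="Extract each product's name and price",
    urls=["https://example.com/catalog"],
    schema={"type": "array", "items": {"name": "string", "price": "string"}},
)
```

The key idea is that you describe *what* you want, and the model resolves *where* it lives in the DOM.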
AirLLM is a Python library that runs 70B+ parameter models on consumer-grade GPUs without quantization, distillation, or pruning.
A single 4GB GPU can run Llama 3 70B. An 8GB GPU handles Llama 3.1 405B. Models that typically require enterprise compute now run on hardware most developers already have.
How it works:
AirLLM uses a layer-by-layer loading strategy. Instead of loading the entire model into VRAM, it loads and processes model layers sequentially from disk. Each layer runs its computation, then the next layer loads while the previous one unloads.
This approach trades inference speed for memory efficiency. Inference is slower than keeping the full model in VRAM, but it makes models accessible that otherwise wouldn't run at all on consumer hardware.
The library supports prefetching to overlap model loading with computation, which provides about 10% speed improvement. It works on Linux, macOS (including Apple Silicon M1/M2/M3/M4), and Windows.
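The loading strategy above can be illustrated with a toy simulation. This is not AirLLM's actual code; each "layer" is just a function and the dict stands in for disk, but the memory discipline (at most one layer resident at a time) is the same:

```python
# Toy simulation of layer-by-layer inference: only one "layer" is resident
# at any moment. Real AirLLM streams transformer layers from disk and can
# prefetch the next layer while the current one computes.
LAYERS_ON_DISK = {
    0: lambda x: x + 1,   # stand-ins for transformer blocks
    1: lambda x: x * 3,
    2: lambda x: x - 2,
}

def run_layerwise(x):
    resident = None  # the single slot of "VRAM" we allow ourselves
    for i in sorted(LAYERS_ON_DISK):
        resident = LAYERS_ON_DISK[i]   # "load" layer i from disk
        x = resident(x)                # run its computation
        resident = None                # "unload" before loading the next
    return x

result = run_layerwise(2)  # (2 + 1) * 3 - 2 = 7
```

Peak memory is one layer instead of the whole stack, which is exactly the trade the library makes: far less VRAM, more disk I/O per token.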
Supported models:
AirLLM works with Llama, Qwen, Mistral, Mixtral, Baichuan, InternLM, ChatGLM, and other architectures. Models are downloaded from HuggingFace and automatically split into layers for efficient loading.
For disk space management, you can set delete_original=True when initializing to remove the original HuggingFace download after transformation, saving about half the disk space.
What this enables:
Running large models locally without cloud API costs. A 70B model that would cost hundreds of dollars monthly through API providers now runs on a $200 GPU.
Privacy-sensitive applications can process data entirely offline. Medical records, legal documents, or proprietary code never leave your infrastructure.
Researchers and startups can prototype with state-of-the-art models without enterprise budgets. Test Qwen2.5, Mistral, or Mixtral locally before committing to cloud infrastructure.
The trade-off is speed. Inference is slower than cloud APIs or dedicated inference servers with models fully loaded in VRAM. But for use cases where latency isn't critical, or where privacy and cost matter more than speed, AirLLM makes previously inaccessible models practical.
Evals for AI Agents: Anthropic's Testing Guide
Anthropic published a comprehensive guide on building evaluations for AI agents.

Agent evaluations differ from traditional LLM testing. Agents use tools across many turns, modify state in the environment, and adapt as they go. Mistakes can propagate and compound, making evaluation more complex.
Multi-step workflows require evaluation of both final outcomes and intermediate steps.
A typical agent evaluation harness looks like this:

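A minimal sketch of such a harness in Python (all names are illustrative, not Anthropic's code): run the agent in a sandboxed environment, record every intermediate step, then grade both the transcript and the final state.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalResult:
    passed: bool
    steps: list = field(default_factory=list)

def run_eval(task: str,
             agent_step: Callable[[str, dict], str],
             grade: Callable[[list, dict], bool],
             max_turns: int = 5) -> EvalResult:
    env: dict = {}    # mutable environment state the agent modifies
    steps: list = []  # transcript of intermediate actions
    for _ in range(max_turns):
        action = agent_step(task, env)  # agent picks its next action
        steps.append(action)
        if action == "done":
            break
        env[action] = True              # apply the action's side effect
    # grade sees both intermediate steps and final environment state
    return EvalResult(passed=grade(steps, env), steps=steps)

# Stub agent and grader to exercise the harness end to end.
def stub_agent(task, env):
    return "write_file" if "write_file" not in env else "done"

result = run_eval("create a file",
                  stub_agent,
                  grade=lambda steps, env: env.get("write_file", False))
```

Because state mutates across turns, the grader gets the whole trajectory, not just the final answer.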
The grading layer inside that harness can be implemented using three distinct approaches.
Code-based graders
These rely on deterministic checks such as string matching, binary tests, and static analysis. They are fast, inexpensive, and easy to automate at scale. Their limitation is rigidity. Valid but unexpected variations in agent behavior can fail even when the outcome is acceptable.
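A deterministic grader can be as small as a normalized string match plus a static check. The sketch below shows both the speed and the rigidity: a correct answer phrased differently than expected ("four" instead of "4") fails.

```python
import re

def grade_answer(answer: str, expected: str) -> bool:
    # binary string-match check after light normalization
    return answer.strip().lower() == expected.strip().lower()

def grade_contains_number(answer: str) -> bool:
    # static check: did the agent produce any numeric value at all?
    return re.search(r"\d+", answer) is not None
```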
Model-based graders
These use rubric scoring, natural language assertions, or pairwise comparisons powered by another model. They handle nuance and open-ended outputs more effectively than strict code checks. Calibration against human judgment is necessary to prevent drift or scoring bias.
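A rubric-scoring grader might look like the sketch below. The rubric items are invented for illustration, and the judge is a keyword stub standing in for what would normally be a call to another LLM, so the flow stays runnable:

```python
from typing import Callable

# Illustrative rubric; in practice these come from your task spec.
RUBRIC = ["cites a source", "answers the question", "no contradictions"]

def rubric_score(output: str, judge: Callable[[str, str], bool]) -> float:
    # fraction of rubric criteria the judge says the output satisfies
    hits = sum(judge(output, criterion) for criterion in RUBRIC)
    return hits / len(RUBRIC)

def stub_judge(output: str, criterion: str) -> bool:
    # placeholder for an LLM judge call: treat keyword presence as a pass
    return criterion.split()[0] in output

score = rubric_score("cites no sources but answers clearly", stub_judge)
```

Swapping `stub_judge` for a real model call is where calibration against human judgment becomes essential.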
Human graders
These involve expert review, crowdsourced evaluation, or structured A/B testing. They provide the highest quality signal, particularly for subjective or complex tasks. The trade-off is cost and slower iteration speed.
Evaluation Depends on Agent Type
Different agents require different validation setups:
Coding agents → SWE-bench style pass/fail tests
Conversational agents → Multi-turn simulations with state validation
Research agents → Groundedness and coverage verification
Computer-use agents → Screenshot-based UI checks
Agent behavior varies between runs. The same task may succeed once and fail on the next attempt.
Two metrics help quantify this variability:
pass@k
Measures whether an agent succeeds at least once across k attempts.
Example: A 75% per-run success rate across 10 attempts gives well over 99% probability of at least one success.
pass^k
Measures whether an agent succeeds on every attempt.
Example: A 75% per-run success rate across 3 required passes drops to ~42%.
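Both metrics fall out of basic probability, assuming independent attempts with a fixed per-run success rate p:

```python
# pass@k = 1 - (1 - p)**k   -> succeeds at least once in k tries
# pass^k = p**k             -> succeeds on every one of k tries
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    return p ** k

# The examples above, with a 75% per-run success rate:
at_least_once = pass_at_k(0.75, 10)  # well over 0.99
every_time = pass_hat_k(0.75, 3)     # 0.421875, i.e. ~42%
```

The gap between the two is the point: the same agent looks near-perfect under pass@k and unreliable under pass^k, so pick the metric that matches how the agent will actually be used.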
Designing a meaningful evaluation suite starts with task selection. Begin with 20–50 real failure cases pulled from production or internal testing. Write unambiguous specifications with reference solutions. Separate capability evaluations from regression evaluations so improvements do not mask new failures.
That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.
PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.
Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
WORK WITH US
Looking to promote your company, product, or service to 180K+ AI developers? Get in touch today by replying to this email.