ml-intern: An open-source ML engineer from Hugging Face
PLUS: Convert Any Document into Clean Data for AI Agents
In today’s newsletter:
Firecrawl Parse: Convert Any Document into Clean Data for AI Agents
ML Intern: Hugging Face's Open-Source ML Engineer
Reading time: 5 minutes.
Most agent pipelines handle web content well. Point an agent at a URL, it gets clean Markdown, reasons over it, and moves on.
The real world runs on PDFs, Word files, and spreadsheets. Getting clean, structured data out of those has always been the messy part of building agent pipelines.
Firecrawl's /parse endpoint is built specifically for this. Upload a PDF, DOCX, or XLSX file directly and get back clean Markdown or JSON, with reading order and tables intact.
The Problem with Documents
PDF contents are not stored in reading order. Text from two columns, for example, can be stored interleaved, with no signal indicating which runs belong to which column.
This leads to structural errors during extraction. Tables get flattened into plain text.
Multi-column layouts are read across instead of top to bottom.
The model receives something that appears to be content, but the structure is already broken.
DOCX and XLSX make this worse. Word files contain nested tables and tracked changes. Spreadsheets have merged cells and multi-row headers. Simple extraction methods flatten everything and lose that structure.
The /parse Endpoint
Post any document file directly to the API and get back clean, structured output ready for your agent pipeline.
Converts PDF, DOCX, and XLSX into Markdown or JSON
Preserves reading order and tables
Zero Data Retention support for sensitive documents
Three parsing modes: Auto (default, handles mixed documents), Fast (Rust-only, for text-based PDFs), and OCR (for scanned documents)
Here's how a basic call might look. This is a minimal sketch using Python's requests library; the endpoint path, field names, and parameter values are assumptions based on the description above, so check Firecrawl's docs for the exact request shape:
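```python
import requests

API_KEY = "fc-..."  # your Firecrawl API key

# Upload a document and ask for Markdown back. The "mode" values mirror
# the three parsing modes listed above (auto / fast / ocr).
with open("quarterly_report.pdf", "rb") as f:
    resp = requests.post(
        "https://api.firecrawl.dev/v2/parse",  # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"mode": "auto", "format": "markdown"},
    )

resp.raise_for_status()
print(resp.json())  # clean Markdown with reading order and tables intact
```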

Your agent pipeline is only as good as the data going into it. If that data lives in a PDF, a Word file, or a spreadsheet, the parse is where it either works or quietly doesn't.
ML Intern: Hugging Face's Open-Source ML Engineer
Most AI coding agents can write Python. That's not the hard part of ML work.
The hard part is:
Knowing which dataset to pull
Finding the right paper to reference
Knowing if a fine-tuning approach will work before spending compute on it
Knowing what to do when training stalls at iteration 40
That's where generalist agents break down. They write plausible code, but they don't actually know the ML ecosystem.
Hugging Face just open-sourced ML Intern, an agent built specifically for this: an autonomous ML engineer with direct access to the full Hugging Face ecosystem.
What It Actually Does
The interface is a single command from your terminal. You give it a goal, and from that prompt, the agent researches, writes code, runs experiments, launches jobs on Hugging Face infrastructure, and pushes the final model. It streams every step back to you as it works.
What makes this different from a general coding agent is the tool routing. ML Intern doesn't search the web generically. It has purpose-built access to:
Hugging Face docs and research papers
HF datasets, model repos, and job infrastructure
GitHub code search
Local and sandboxed code execution
When it needs a dataset, it searches HF datasets. When it needs to understand an architecture, it reads the actual paper. When it's ready to train, it launches a real HF job. The tools match the domain.
The Engineering Decisions Worth Noticing
Two things in the architecture stand out:
The Doom Loop Detector. Long-running agents get stuck: same tool call, same error, spinning without progress. ML Intern detects repeated patterns and injects corrective prompts before they waste your compute budget.
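A minimal sketch of how that kind of detector might work (illustrative only; the names and thresholds here are assumptions, not ML Intern's actual implementation):

```python
from collections import deque

class DoomLoopDetector:
    """Flags when an agent keeps issuing the same tool call."""

    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent = deque(maxlen=window)  # last N (tool, args) signatures
        self.threshold = threshold

    def observe(self, tool_name: str, args: str) -> bool:
        """Record a tool call; return True if the agent looks stuck."""
        signature = (tool_name, args)
        self.recent.append(signature)
        return self.recent.count(signature) >= self.threshold

# Usage: check every tool call before executing it.
history: list[str] = []
detector = DoomLoopDetector()
if detector.observe("run_code", "python train.py --epochs 3"):
    # Inject a corrective prompt instead of burning more compute.
    history.append("You have repeated the same failing step. Try a different approach.")
```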
Approval gates before risky actions. Before launching jobs, running sandbox code, or any destructive operation, the agent pauses for confirmation. The agent can push models and launch compute jobs. Those actions have real costs, and the approval gate is what makes that level of access safe.
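The gate itself is a simple pattern. A rough sketch (the tool names below are hypothetical; ML Intern's real gate is wired into its agent loop):

```python
# Hypothetical tool registry; names are illustrative, not ML Intern's API.
TOOLS = {
    "launch_job": lambda **kw: f"launched training job with {kw}",
    "push_model": lambda **kw: f"pushed model {kw.get('repo_id')}",
}
RISKY = {"launch_job", "push_model"}  # actions with real cost

def call_tool(name: str, **kwargs) -> str:
    """Run a tool, pausing for confirmation before risky actions."""
    if name in RISKY:
        answer = input(f"About to run {name}({kwargs}). Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return "Action declined by user."
    return TOOLS[name](**kwargs)

print(call_tool("launch_job", gpu="a10g", epochs=3))
```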
What This Looks Like in Practice
It runs in two modes: interactive, where you chat with it through a session, and headless, where a single prompt runs end to end without intervention.
The default model is Claude, swappable via a flag. Max iterations default to 300 but can be tuned down for faster, cheaper runs.
Why Long Sessions Are the Hard Problem for ML Agents
Most coding agents are built for short tasks. Write a function, fix a bug, generate a script. The session is measured in seconds or minutes.
ML work doesn't fit that shape. A fine-tuning run takes hours. Debugging why a training curve flatlines might take dozens of tool calls, reading papers, checking dataset distributions, and adjusting configs across multiple iterations. An agent that loses its state mid-experiment is useless.
ML Intern handles this with automatic context compaction at 170k tokens. When the session grows too large, it compresses history without losing the thread of what's been tried and what hasn't.
The session state is uploaded to Hugging Face, so a 300-step workflow survives without collapsing under its own history. This is the unglamorous part that makes long-horizon ML tasks actually possible.
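A minimal sketch of what threshold-triggered compaction might look like (the token estimate and keep-last-N split are assumptions for illustration):

```python
TOKEN_LIMIT = 170_000  # compaction threshold described above

def estimate_tokens(messages: list[str]) -> int:
    # Rough heuristic: roughly 4 characters per token.
    return sum(len(m) for m in messages) // 4

def compact(messages: list[str], summarize) -> list[str]:
    """Replace older history with a summary once the limit is crossed."""
    if estimate_tokens(messages) < TOKEN_LIMIT:
        return messages
    head, tail = messages[:-10], messages[-10:]
    # Keep the most recent turns verbatim; compress everything earlier
    # so the agent keeps the thread of what has been tried.
    return [summarize(head)] + tail
```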
Built on smolagents
ML Intern is built on smolagents, Hugging Face's own lightweight agent framework. That's not incidental. It means the entire stack sits within the same ecosystem: the tools call HF APIs, the session state lives on HF, and the framework that coordinates everything is maintained by the same team.
For extensibility, this matters. smolagents is designed around a simple tool interface. Adding a new tool is a function with a docstring. The agent discovers it automatically.
If you have proprietary datasets, internal experiment trackers, or custom compute infrastructure, you can wire them in without fighting the framework. That's what the "editing a single file" claim in the README actually means in practice.
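For illustration, a custom tool in smolagents might look like this. The @tool decorator is smolagents' actual interface; the experiment-tracker endpoint below is hypothetical:

```python
import requests
from smolagents import tool

@tool
def get_run_metrics(run_id: str) -> str:
    """Fetch the latest metrics for a training run from an internal tracker.

    Args:
        run_id: Identifier of the training run to look up.
    """
    # Hypothetical internal endpoint; point this at your own tracker.
    resp = requests.get(f"https://tracker.example.internal/api/runs/{run_id}")
    resp.raise_for_status()
    return resp.text
```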
Why This Matters
Generalist agents are useful for writing code. Domain-specific agents are useful for doing work.
ML Intern's tools, approval gates, loop detection, and context management are all built around ML workflows specifically.
That's what makes "fine-tune llama on my dataset" a viable single command rather than a prompt that produces a script you still have to fix and run yourself.
It's fully open source. The architecture is clean and extensible: adding tools means editing a single file, adding MCP servers means editing a config.
That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.
PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.
Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
WORK WITH US
Looking to promote your company, product, or service to 200K+ AI developers? Get in touch today by replying to this email.
