Turn Any Document into a Synthetic Dataset and Fine-Tune with Llama
PLUS: Turn any ML paper into a code repository, and an agentic AI terminal.
In today’s newsletter:
Notebook: Generate and fine-tune synthetic data with Llama.
AI Terminal: I built a FastAPI app and set it up as an MCP server directly in my terminal.
PaperCoder: Turn any ML paper into a code repository.
Reading time: 3 minutes.
Unsloth AI and Meta have released a free notebook that transforms your documents into high-quality synthetic datasets using Llama, then fine-tunes them with Unsloth AI.
Preparing training data for LLM fine-tuning is one of the most time-consuming parts of any machine learning pipeline.
What if you could skip the manual work and turn any document, including a PDF, a blog post, or even a YouTube video, into a structured, fine-tuning-ready dataset in minutes?
Meta's Synthetic Data Kit, an open-source tool, simplifies the generation of fine-tuning datasets from various types of documents.
This setup combines Meta’s Synthetic Data Kit with Unsloth AI’s lightweight fine-tuning framework.
This notebook:
Ingests content from PDFs, websites, YouTube videos, and more
Uses Meta’s Synthetic Data Kit and Llama 3.2 to generate question-answer pairs
Cleans and filters outputs to ensure high-quality data
Fine-tunes a model with Unsloth’s lightweight training setup
Runs entirely on your local machine with no cloud setup or API keys required
You can take internal documents, research articles, or even educational videos and turn them into useful training data in just a few steps, then fine-tune a model on that data.
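The clean-and-filter step in the pipeline above can be sketched in plain Python. This is a minimal, hypothetical illustration: the field names (`question`, `answer`, `score`), the chat-message layout, and the 0.7 quality threshold are assumptions for demonstration, not the Synthetic Data Kit's actual schema.

```python
# Sketch of curating generated QA pairs and converting them into
# chat-format training examples. All field names and the threshold
# are illustrative assumptions, not the Synthetic Data Kit's schema.

def curate_qa_pairs(pairs, min_score=0.7):
    """Keep only well-formed, sufficiently rated QA pairs."""
    curated = []
    for p in pairs:
        q = p.get("question", "").strip()
        a = p.get("answer", "").strip()
        if q and a and p.get("score", 0.0) >= min_score:
            curated.append({"question": q, "answer": a})
    return curated

def to_chat_format(pairs):
    """Convert QA pairs into the chat-style messages most trainers expect."""
    return [
        {"messages": [
            {"role": "user", "content": p["question"]},
            {"role": "assistant", "content": p["answer"]},
        ]}
        for p in pairs
    ]

raw = [
    {"question": "What is LoRA?",
     "answer": "A parameter-efficient fine-tuning method.", "score": 0.9},
    {"question": "", "answer": "orphan answer", "score": 0.95},
    {"question": "Low quality?", "answer": "yes", "score": 0.3},
]
dataset = to_chat_format(curate_qa_pairs(raw))
print(len(dataset))  # only the first pair survives the filter
```

The resulting list of `messages` dicts is the shape most fine-tuning frameworks, Unsloth included, can consume after tokenization with a chat template.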

I built a FastAPI app and set it up as an MCP server using FastAPI MCP in just a few minutes, right inside my terminal!
I didn’t write a single line of code.
Warp’s AI terminal handled the entire workflow:
Cloned the GitHub repo
Parsed the README and folder structure
Summarized the repo’s purpose
Surfaced key files and components
Created a FastAPI endpoint and exposed it as an MCP server
All I did was type a natural language prompt: “Clone this repo, analyze it, create a FastAPI /predict endpoint, and expose it as an MCP server.”
Key Features:
AI-Driven Command Generation: Convert plain English descriptions into the right commands for automation.
Reusable Workflows: Save and share workflows for seamless collaboration.
Error Handling and Learning: AI learns from your habits to troubleshoot and offer contextual completions.

Turn any ML paper into a code repository!
PaperCoder is a multi-agent LLM system that reads machine learning papers and automatically turns them into complete code repositories.
How It Works
It follows a three-stage pipeline:
Planning: Understands the paper's structure and objectives
Analysis: Extracts core methods, models, and equations
Code Generation: Produces clean, runnable code
Each stage is handled by a specialized AI agent, working together to ensure accuracy and completeness.
This method outperforms strong baselines on both the Paper2Code and PaperBench benchmarks and produces faithful, high-quality implementations.
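The three-stage pipeline can be pictured as a simple chain of stages, each owned by one agent. This is an illustrative structure only: the stage functions below are stubs, whereas the real system drives each stage with a specialized LLM agent.

```python
# Illustrative sketch of PaperCoder's three-stage pipeline.
# Each function stands in for a specialized LLM agent; the stub
# logic here only demonstrates how the stages chain together.

def plan(paper_text: str) -> dict:
    """Planning agent: outline the repo structure and objectives."""
    return {"files": ["model.py", "train.py"], "goal": paper_text[:40]}

def analyze(paper_text: str, plan_out: dict) -> dict:
    """Analysis agent: extract methods, models, and equations per file."""
    return {f: f"spec for {f}" for f in plan_out["files"]}

def generate(analysis_out: dict) -> dict:
    """Code-generation agent: emit code for each planned file."""
    return {f: f"# generated from {spec}\n" for f, spec in analysis_out.items()}

def paper_to_repo(paper_text: str) -> dict:
    """Chain the three agents: plan -> analyze -> generate."""
    plan_out = plan(paper_text)
    analysis_out = analyze(paper_text, plan_out)
    return generate(analysis_out)

repo = paper_to_repo("We propose a new attention mechanism ...")
print(sorted(repo))  # ['model.py', 'train.py']
```

The key design choice is that each stage's structured output becomes the next stage's input, which is what lets the agents cross-check each other for accuracy and completeness.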
It’s 100% Open Source

That’s a Wrap
That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.
PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.
Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
WORK WITH US
Looking to promote your company, product, or service to 100K+ AI developers? Get in touch today by replying to this email.