Andrew Ng Open-Sourced Context Hub for AI Coding Agents
... PLUS: Alibaba's SQLite for Vector Search
In today’s newsletter:
Agentic Document Extraction: Turn complex PDFs into structured data with vision-first parsing
Context Hub: Self-learning API docs for coding agents
Zvec: Alibaba's SQLite for vectors, embedded database for on-device RAG
Reading time: 5 minutes.
Traditional OCR extracts text but loses what matters: structure, order, context. That breaks when you're processing documents at scale.
Documents come in different formats. Invoices, contracts, bills, receipts. Each with unique layouts. Manual extraction doesn't scale.
LandingAI's Agentic Document Extraction (ADE) uses a vision-first approach that preserves layout and context.
Take utility bills as an example. They're common proof-of-address documents for KYC and onboarding. But they come from hundreds of providers, each with different formats.
Here's how ADE's parse-to-extract workflow handles this:
The system runs a two-step process:
Parse - Converts PDFs into structured markdown with chunk metadata and bounding boxes
Extract - Applies your schema to pull specific fields with grounding coordinates
You define the fields you need (provider info, account details, billing summary, charges), process batches from any provider, and get structured JSON + CSV showing exactly where each field came from.
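To make the output concrete, here is a plain-Python sketch of that last step. The field names and response shape below are invented placeholders, not ADE's actual schema; the point is how per-field grounding (page plus bounding box) travels from the extraction JSON into a reviewable CSV:

```python
import csv
import io

# Hypothetical extraction result: each field carries its value plus
# grounding (page number and bounding box) pointing back to the source chunk.
extraction = {
    "provider_name": {"value": "Acme Utilities", "page": 1, "bbox": [0.08, 0.05, 0.45, 0.09]},
    "account_number": {"value": "9918-223", "page": 1, "bbox": [0.60, 0.12, 0.92, 0.15]},
    "amount_due": {"value": "84.17", "page": 2, "bbox": [0.55, 0.40, 0.90, 0.44]},
}

def to_csv(extraction):
    """Flatten field -> value/grounding into CSV rows for batch review."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["field", "value", "page", "bbox"])
    for field, info in extraction.items():
        writer.writerow([field, info["value"], info["page"], info["bbox"]])
    return buf.getvalue()

print(to_csv(extraction))
```

Keeping the grounding columns alongside each value is what lets a reviewer jump straight to the region of the source PDF a field came from.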
This same approach works across any document type: invoices, insurance forms, tax documents, contracts, medical records.
What you get:
Markdown extraction with grounding coordinates
Structured JSON matching your schema
Field-level metadata linking to source chunks
Context Hub is a CLI tool that solves two problems with coding agents: they hallucinate APIs and they forget what they learn between sessions.
Here's how it works. Instead of agents searching the web for API docs and getting noisy results, they fetch curated, versioned documentation directly through a CLI command.
chub search openai
chub get openai/chat --lang py

The agent reads the doc and writes correct code. If the code works, you're done.
This is where it gets interesting. When the agent discovers a gap or workaround, it can annotate the doc locally:
chub annotate stripe/api "Needs raw body for webhook verification"

That annotation persists. Next session, when the agent fetches the same doc, the annotation appears automatically. The agent learns from past experience instead of starting from scratch every time.
Feedback flows back to doc authors. Agents can upvote or downvote docs:
chub feedback stripe/api up

Authors use that feedback to improve the content. Better docs for everyone, not just your local annotations.
Key features:
Language-specific docs - Python and JavaScript variants of the same API
Incremental fetch - grab specific reference files instead of everything, saves tokens
Open markdown repo - inspect exactly what your agent reads
Agent skills included - built-in SKILL.md files for Claude Code and other agents
All content is maintained as markdown in the repo, and anyone can contribute docs through pull requests: API providers, framework authors, or the community.
It's 100% open source.
Alibaba open-sourced Zvec, an in-process vector database built on Proxima—their production vector search engine powering Taobao, Alipay, Youku, and Alibaba's advertising infrastructure. Now available as an embeddable library under Apache 2.0.
Zvec packages this engine so you can import it like any library. Vector search runs in-process, executing directly inside your application.
Designed for edge and local deployment:
Zvec processes data in 64MB streaming chunks with memory-mapped paging instead of loading everything into RAM, and configurable concurrency controls cap CPU usage across threads, so vector search keeps running on laptops and edge devices.
On the Cohere 10M benchmark via VectorDBBench, Zvec sustains over 8,000 QPS with competitive index build times.
What it supports:
Dense and sparse vectors with native multi-vector queries
Hybrid search combining semantic similarity with structured filters
Full CRUD operations with schema evolution
Built-in reranking (weighted fusion and RRF)
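The RRF option mentioned above refers to reciprocal rank fusion, a standard way to merge rankings from different retrievers: each document's fused score is the sum of 1/(k + rank) over the lists being combined. A minimal sketch (the document IDs are invented, and k=60 is the common default in the literature, not necessarily Zvec's):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["d3", "d1", "d7"]   # ranking from vector similarity
sparse = ["d1", "d9", "d3"]  # ranking from keyword/sparse retrieval
fused = rrf([dense, sparse])
print(fused)
```

Note how d1, which ranks near the top of both lists, beats d3, which tops only one list; that robustness to any single retriever's quirks is why RRF is a popular fusion default.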
Here's what a basic vector search looks like:
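As a plain-Python stand-in (this is not Zvec's API; the tiny in-memory index and the query vector are invented for illustration), a brute-force cosine search produces the same shape of result:

```python
import math

# Toy in-memory index: id -> embedding (3-dim for brevity).
docs = {
    "doc-1": [0.1, 0.9, 0.2],
    "doc-2": [0.8, 0.1, 0.3],
    "doc-3": [0.2, 0.8, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def search(query, top_k=2):
    """Brute-force nearest neighbors: (id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]

print(search([0.15, 0.85, 0.15]))
```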

Results return as document IDs with similarity scores, sorted by relevance.
That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.
PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.
Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.
WORK WITH US
Looking to promote your company, product, or service to 160K+ AI developers? Get in touch today by replying to this email.


