Run and Deploy LLMs on your Phone

.. PLUS: Agent-to-User-Interface from Google

In today’s newsletter:

  • Google A2UI: Agent-to-User-Interface

  • Run and Deploy LLMs on your Phone

Reading time: 3 minutes.

Google A2UI: Agent-to-User-Interface

Text-based chat interfaces break down quickly once agents start doing real work.

Ask an agent to plan a trip and you get long, unstructured text listing dozens of options. That doesn’t scale. What you actually want are tables, filters, buttons, and interactive components.

Hardcoding UIs for every possible agent response isn’t practical either.

Google’s A2UI (Agent-to-User Interface) takes a different approach. Instead of returning text, the agent returns declarative JSON that describes the UI it wants to render. The client application interprets that JSON and maps it to trusted, native components.

The key idea is separation of concerns. The agent decides what to show. The application controls how it is rendered.
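To make that concrete, here is a hypothetical payload in the A2UI spirit, sketched as a Python dict. The component names and schema are illustrative assumptions, not the actual A2UI spec:

    # Illustrative only: the real A2UI schema may differ.
    # The agent emits data describing the UI, never executable code.
    ui_response = {
        "component": "Card",
        "children": [
            {"component": "Text", "text": "Flights to Tokyo"},
            {
                "component": "Table",
                "columns": ["Airline", "Price", "Stops"],
                "rows": [["ANA", "$890", 0], ["United", "$760", 1]],
            },
            {"component": "Button", "label": "Book cheapest", "action": "book_flight"},
        ],
    }

    # The client walks this tree and maps each component name to a trusted
    # native widget; unknown component types are rejected, not executed.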

Core design principles

  • Declarative JSON only, no executable code

  • Client-side control over what components are allowed

  • Framework-agnostic output that works across React, Flutter, and SwiftUI

  • Sandboxed custom components with strict safety policies

This keeps UI generation flexible while leaving security and rendering decisions firmly in the developer’s hands, not the model’s.

Run and Deploy LLMs on your Phone

LLMs can now be trained, optimized, and executed directly on mobile devices.

In this example, we break down how to adapt a model for mobile devices, convert it into a mobile-executable format, and run it 100% locally on Android, without relying on servers or cloud inference.

We’ll use:

  • Unsloth to fine-tune models under mobile compute and precision constraints

  • TorchAO to generate low-precision, phone-efficient weights

  • ExecuTorch to execute the model natively on Android, fully offline

Let’s get started!

1️⃣ Load model

Let’s start by loading Qwen3-0.6B with the phone_deployment configuration enabled.

This configuration turns on quantization-aware training (QAT) at load time, so the model is optimized under mobile execution constraints rather than full-precision server settings.

By simulating low-precision arithmetic throughout training, QAT forces weights, activations, and value ranges to adjust to quantization effects early, ensuring the trained model behaves correctly when exported via TorchAO and deployed with ExecuTorch.
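As a rough sketch, the loading step with Unsloth might look like this. The qat_scheme argument mirrors the phone_deployment configuration named above; treat the exact parameter names as assumptions that may vary across Unsloth versions:

    from unsloth import FastLanguageModel

    # Load Qwen3-0.6B with quantization-aware training enabled at load time.
    # Argument names are assumptions based on the configuration described above.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-0.6B",
        max_seq_length=2048,
        load_in_4bit=False,             # QAT simulates low precision itself
        full_finetuning=True,           # QAT applies during full fine-tuning
        qat_scheme="phone_deployment",  # phone-oriented QAT configuration
    )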

2️⃣ Load datasets

With the model initialized for mobile execution, let’s define the behaviors it should learn.

In this setup, two datasets are loaded:

  • A reasoning dataset for step-by-step problem solving

  • A chat dataset for natural conversational responses

Both datasets feed the same quantized, phone-first training setup established during model initialization.
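A minimal sketch of the loading step. The dataset names below are placeholders borrowed from common Unsloth examples; swap in whichever reasoning and chat datasets you prefer:

    from datasets import load_dataset

    # Placeholder datasets (assumptions): any reasoning dataset with
    # problem/solution fields and any ShareGPT-style chat dataset works.
    reasoning_ds = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
    chat_ds = load_dataset("mlabonne/FineTome-100k", split="train")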

3️⃣ Convert reasoning data

Reasoning datasets are often stored as structured fields such as problem, solution, and explanation.

Since the model is trained as a conversational system, each reasoning example is converted into:

  • a user prompt containing the question

  • an assistant response containing the full step-by-step reasoning

This ensures that reasoning behavior is learned in the same interaction format the model will use at inference time.
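In code, that conversion is a simple map. The field names follow the ones mentioned above ("problem", "solution"); adjust them to your dataset’s actual columns:

    # Turn structured reasoning rows into user -> assistant conversations.
    # Field names are assumptions taken from the text above.
    def to_conversation(example):
        return {
            "conversations": [
                {"role": "user", "content": example["problem"]},
                {"role": "assistant", "content": example["solution"]},
            ]
        }

    reasoning_ds = reasoning_ds.map(to_conversation)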

4️⃣ Standardize chat data

To keep training signals consistent, the chat dataset is reformatted into the same user → assistant schema.

Using a single conversational format:

  • Stabilizes training

  • Simplifies batching and optimization

  • Allows reasoning and chat behaviors to reinforce each other
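Assuming the chat data arrives in ShareGPT format, Unsloth ships a helper for exactly this normalization:

    from unsloth.chat_templates import standardize_sharegpt

    # Rewrites ShareGPT-style "from"/"value" turns into the same
    # role/content schema used by the reasoning conversations.
    chat_ds = standardize_sharegpt(chat_ds)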

5️⃣ Mix reasoning and chat data

With both datasets standardized, let’s combine them into a single training corpus:

  • 75% reasoning data

  • 25% chat data

This ratio biases the model toward structured thinking while preserving conversational fluency.
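A sketch of the mixing step: size the chat subset so it ends up as a quarter of the combined corpus, then concatenate and shuffle:

    from datasets import concatenate_datasets

    # For a 75/25 split, the chat subset must be one third the size of
    # the reasoning set: 0.25 / 0.75 = 1/3.
    n_chat = int(len(reasoning_ds) * 0.25 / 0.75)
    chat_subset = chat_ds.shuffle(seed=3407).select(range(n_chat))

    # Keep only the shared column so the two datasets concatenate cleanly.
    reasoning_ds = reasoning_ds.select_columns(["conversations"])
    chat_subset = chat_subset.select_columns(["conversations"])

    train_ds = concatenate_datasets([reasoning_ds, chat_subset]).shuffle(seed=3407)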

6️⃣ Fine-tune model

With the data prepared, Qwen3-0.6B is fine-tuned using Unsloth’s trainer.

Throughout training:

  • Quantization-aware training stays enabled

  • Memory usage stays low

  • Training runs are short but effective

This setup ensures the model learns under quantized execution constraints from the start, producing weights that are stable and deployment-ready for low-precision runtimes.
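A hedged sketch of the training step using TRL’s SFTTrainer, which Unsloth builds on. The hyperparameters are illustrative, not the exact settings used here:

    from trl import SFTTrainer, SFTConfig

    # Render conversations to plain text with the model's chat template.
    train_ds = train_ds.map(lambda ex: {
        "text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)
    })

    # QAT stays active here because it was enabled when the model was loaded.
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_ds,
        args=SFTConfig(
            dataset_text_field="text",
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            max_steps=60,              # short but effective run
            learning_rate=2e-4,
            output_dir="outputs",
        ),
    )
    trainer.train()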

7️⃣ Save the model

After training completes, let’s export the model in a TorchAO-compatible quantized format.

At this point, the weights are finalized and optimized for low-precision execution, but remain independent of any specific mobile runtime.

This output serves as the handoff point to mobile compilation.
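One way this can look with TorchAO directly. The conversion call for a QAT-trained model varies across TorchAO versions, so treat these names as one possible spelling and check your installed version’s docs:

    import torch
    from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

    # Materialize real low-precision weights from the QAT-trained model.
    # Config name and arguments are assumptions; TorchAO's API has moved
    # between functional configs and config classes across releases.
    quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))

    # Save the quantized weights; still independent of any mobile runtime.
    torch.save(model.state_dict(), "qwen3_0_6b_quantized.pt")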

8️⃣ Export to .pte

Next, the quantized model is converted into a single .pte file using ExecuTorch.

The .pte format:

  • is optimized for on-device inference

  • does not require a Python runtime

  • is designed for mobile CPUs and NPUs

The model configuration and tokenizer are bundled alongside the weights so the Android app has everything required to run locally.

The resulting artifact is approximately 470 MB, which is typical for on-device LLMs.
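Here is a simplified sketch of the generic ExecuTorch export path, lowering to mobile CPUs via XNNPACK. For LLMs specifically, ExecuTorch also provides dedicated export scripts that additionally bundle the tokenizer and generation config:

    import torch
    from executorch.exir import to_edge_transform_and_lower
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )

    # Trace the quantized model with dummy token ids, then lower it.
    example_inputs = (torch.randint(0, 1000, (1, 64)),)
    exported = torch.export.export(model, example_inputs)

    program = to_edge_transform_and_lower(
        exported, partitioner=[XnnpackPartitioner()]
    ).to_executorch()

    # Write the single deployable artifact.
    with open("qwen3_0_6b.pte", "wb") as f:
        f.write(program.buffer)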

9️⃣ Run on Android 

Finally, the .pte model and tokenizer are loaded into the ExecuTorch Android demo app.

Once loaded:

  • inference runs entirely on-device

  • no server calls or network access are required

  • execution uses the same ExecuTorch runtime deployed in production mobile apps

Qwen3-0.6B now runs locally on an Android phone, fully offline.
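Before copying the .pte into the demo app, a quick desktop sanity check with ExecuTorch’s Python runtime bindings can confirm the artifact loads and runs (module layout assumed from recent ExecuTorch releases):

    import torch
    from executorch.runtime import Runtime

    # Load the exported program and run one forward pass with dummy ids.
    runtime = Runtime.get()
    program = runtime.load_program("qwen3_0_6b.pte")
    method = program.load_method("forward")
    outputs = method.execute([torch.randint(0, 1000, (1, 64))])
    print("ok, got", len(outputs), "output tensor(s)")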

That’s a Wrap

That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.

PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.

Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.

WORK WITH US

Looking to promote your company, product, or service to 160K+ AI developers? Get in touch today by replying to this email.