- Google just promoted LiteRT's NPU acceleration to GA.
- Real apps already shipping on it: Google Meet runs a 25× bigger Ultra-HD segmentation model, Epic Games hits 30 FPS MetaHuman facial capture on Android, and Argmax's speech SDK gets a 2× speedup.
- Here's what changed and why it matters.
TL;DR
On April 23, 2026, Google announced that NPU acceleration in LiteRT (the runtime that succeeded TensorFlow Lite) has graduated from preview to production-ready. Through a single API, developers can now target the dedicated AI silicon in Qualcomm Snapdragon and MediaTek Dimensity SoCs (Google Tensor is still experimental) and see speedups of up to 100× over CPU and 10× over GPU, while the chip stays cool. Google Meet, Epic Games, and Argmax are already running it in shipping apps.

What's new
LiteRT first showed up at Google I/O '25 as a preview — a high-performance runtime designed specifically for advanced hardware acceleration. As of last week it's GA, and the framing has clearly shifted: LiteRT is being positioned as the universal on-device inference framework for the AI era, not just a TFLite refresh.
The headline upgrades over TFLite:
- Faster: GPU inference averages 1.4× faster than TFLite via the new ML Drift engine, plus state-of-the-art NPU acceleration.
- Simpler: one streamlined workflow for GPU and NPU across edge platforms — no more juggling vendor SDKs.
- Powerful: built for cross-platform GenAI, including open models like Gemma running through LiteRT-LM.
- Flexible: first-class PyTorch and JAX conversion to the same .tflite format you already trust.
The new CompiledModel API is the entry point for NPU/GPU acceleration; the classic Interpreter API still ships, so existing models keep running.
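For a sense of what that looks like in practice, here's a minimal Kotlin sketch of the CompiledModel flow, following the shape of Google's published LiteRT samples. Treat the package, class, and method names as approximate and the asset name as a placeholder; check the current LiteRT Kotlin API for exact signatures.

```kotlin
import android.content.Context
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Minimal sketch of NPU inference via the CompiledModel API.
// Names follow Google's samples; exact signatures may differ by LiteRT version.
fun runOnNpu(context: Context, input: FloatArray): FloatArray {
    // Compile (or JIT-load) the model for the NPU. "segmenter.tflite" is a placeholder asset.
    val model = CompiledModel.create(
        context.assets,
        "segmenter.tflite",
        CompiledModel.Options(Accelerator.NPU),
    )

    // Buffers are allocated by the runtime, so they can live in accelerator-visible memory.
    val inputBuffers = model.createInputBuffers()
    val outputBuffers = model.createOutputBuffers()

    inputBuffers[0].writeFloat(input)       // fill the input tensor
    model.run(inputBuffers, outputBuffers)  // single inference call
    return outputBuffers[0].readFloat()     // read the output tensor
}
```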
Why it matters
NPUs aren't new — they've shipped in flagship phones for years. What was missing was a clean, vendor-neutral way to actually use them. Historically, integrating with the Hexagon Tensor Processor or NeuroPilot meant per-vendor SDKs, per-SoC compilers, and a fragile build pipeline. Most teams never bothered and shipped CPU/GPU paths that drained battery, throttled the device, and dropped frames after a few minutes of real use.
LiteRT collapses that mess into a three-step flow: optionally AOT-compile your .tflite model for target SoCs, ship it through Google Play for On-device AI (PODAI), and call it through the LiteRT runtime — which automatically falls back to GPU or CPU on devices without a supported NPU. The unified API is what makes shipping NPU features actually viable for non-Google teams.
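The runtime's built-in fallback already covers devices without a supported NPU; if you want the order explicit in your own code anyway, a probing helper like the sketch below is one way to do it. It assumes the same Kotlin CompiledModel/Accelerator API as the snippet above, and the helper itself is illustrative rather than a documented pattern.

```kotlin
import android.content.Context
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Illustrative helper (not from the LiteRT docs): try NPU, then GPU, then CPU.
fun compileWithFallback(context: Context, asset: String): CompiledModel {
    val preference = listOf(Accelerator.NPU, Accelerator.GPU, Accelerator.CPU)
    for (accelerator in preference) {
        try {
            return CompiledModel.create(context.assets, asset, CompiledModel.Options(accelerator))
        } catch (e: Exception) {
            // This SoC (or its driver) can't compile for the accelerator; try the next one.
        }
    }
    error("No accelerator could compile $asset")
}
```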
Technical facts
Numbers from the announcement and the LiteRT framework post:
| Metric | Result |
|---|---|
| NPU vs CPU inference | up to 100× faster |
| NPU vs GPU inference | up to 10× faster |
| Gemma 3 1B prefill on Galaxy S25 Ultra (NPU vs GPU, both LiteRT) | 3× additional gain |
| Async + zero-copy buffers (segmentation sample) | up to 2× end-to-end |
| LiteRT GPU vs legacy TFLite GPU (avg) | 1.4× faster |
| JIT init w/ caching — ResNet152 | init time 7,465 ms → 198 ms; memory 1,525 MB → 355 MB |
| JIT init w/ caching — MobileNet v3 LRASPP | 1,592 ms → 166 ms |
Two architectural details do most of the heavy lifting:
- Zero-copy hardware buffers: the NPU reads tensors directly from AHardwareBuffer memory, killing the CPU round-trip that otherwise dominates real-time pipelines (see the sketch after this list).
- On-device compilation caching: JIT artifacts get cached and only recompile when the compiler plugin, build fingerprint, model, or compile options change. That's how you get the ResNet152 number above.
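To picture the zero-copy path from the app side: a camera or GPU stage writes frames into an AHardwareBuffer, and that same buffer is handed to LiteRT instead of being copied through a CPU-side array. The allocation below uses the standard Android HardwareBuffer API; the commented-out interop call is a hypothetical stand-in for whatever buffer-wrapping constructor your LiteRT version exposes.

```kotlin
import android.hardware.HardwareBuffer

// Allocate a buffer the camera/GPU stage can write into directly.
// Standard Android API; the 256x256 RGBA shape and usage flags are placeholders.
val frameBuffer = HardwareBuffer.create(
    /* width  = */ 256,
    /* height = */ 256,
    /* format = */ HardwareBuffer.RGBA_8888,
    /* layers = */ 1,
    /* usage  = */ HardwareBuffer.USAGE_GPU_SAMPLED_IMAGE or HardwareBuffer.USAGE_GPU_DATA_BUFFER,
)

// Hypothetical interop call: hand the same buffer to LiteRT so the NPU reads it in place.
// The exact constructor depends on your LiteRT version; this line is illustrative only.
// val inputTensor = TensorBuffer.createFromHardwareBuffer(frameBuffer)
```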
Comparison
vs TensorFlow Lite
TFLite was built around classical ML and bolted on NPU support vendor by vendor. LiteRT is the opposite: NPU and GenAI are first-class. The .tflite file format and the classic Interpreter API still work, but the new CompiledModel path unlocks AOT/JIT compilation, ML Drift GPU, and the unified NPU runtime.
vs Llama.cpp
On Samsung Galaxy S25 Ultra running Gemma 3 1B, LiteRT outperforms Llama.cpp on both CPU and GPU for prefill and decode. Switching from LiteRT GPU to LiteRT NPU adds another 3× on prefill on top of that.
Silicon coverage
| Vendor | SoCs supported now |
|---|---|
| Qualcomm (AI Engine Direct / Hexagon) | Snapdragon 8 Gen 1, 8+ Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5 |
| MediaTek (NeuroPilot) | Dimensity 7300, 8300, 9000, 9200, 9300, 9400, 9500 |
| Google Tensor | Experimental access (sign-up) |
| Industrial edge | Qualcomm Dragonwing IQ8 (e.g., Arduino VENTUNO Q) |
| AI PCs | Intel Core Ultra Series 2 & 3 via OpenVINO (in preparation) |
Use cases
Google Meet — Ultra-HD background segmentation
By moving segmentation to the mobile NPU, Meet shipped a model 25× larger than its previous version with no inference-speed regression. The bigger story is the power footprint: thermals stay flat across a typical 20–30 minute call, so the higher-quality background replacement actually survives a real meeting instead of throttling 7 minutes in.
Epic Games — Live Link Face on Android
Epic's Live Link Face (Beta) for Android uses LiteRT NPU acceleration to run the real-time facial solver. Result: up to 30 FPS MetaHuman facial animation captured from a single phone camera and streamed straight into Unreal Engine. That kind of latency budget was previously a desktop-class workload.
Argmax — on-device speech recognition
The new Argmax Pro SDK for Android ships frontier speech models like NVIDIA Parakeet TDT 0.6B v2 directly on-device. In tests across Google Tensor, MediaTek, and Qualcomm SoCs, switching from GPU to NPU delivered more than a 2× speedup. Argmax uses LiteRT's AOT compilation plus Google Play AI Packs to skip on-device compilation entirely. Enterprise customer Heidi Health uses it for extended live medical transcription without draining the battery.
Google AI Edge Portal — benchmarking at scale
The companion announcement: AI Edge Portal (private preview) now benchmarks LiteRT models with NPU support across 100+ Android devices, including a dedicated fleet of 30+ Qualcomm devices. You pick chipsets, accelerators, and AOT vs JIT, and get latency, memory, and per-device hardware utilization back — without owning a device lab.
Limitations & pricing
- Free and open source (Apache 2.0). LiteRT is GA; AI Edge Portal is private preview, also free during the preview window.
- Hardware coverage is the catch. Only listed Snapdragon and Dimensity SoCs are supported — older mid-range and low-end chips fall back to GPU/CPU.
- Google Tensor is still experimental and gated behind a sign-up form.
- Build environment is opinionated: Ubuntu 22.04 LTS, Bazel 7.4.1, Android SDK API 34, NDK API 28. IoT and Windows toolchains are marked “coming soon” in the Qualcomm docs.
- JIT vs AOT trade-off: JIT means a slow first launch (mitigated by the on-device cache); AOT means precompiling per target SoC.
- AI Edge Portal is Android-only for now; the Accelerator Allocation table for NPU is still listed as “coming soon.”
What's next
The roadmap is mostly a question of how fast Google can paint in the rest of the device matrix. Expect Google Tensor to leave experimental access, more silicon partners on top of Qualcomm and MediaTek, fuller iOS / macOS / Windows / Web GPU coverage via ML Drift, and the OpenVINO + Intel Core Ultra path landing for AI PCs. On the model side, LiteRT-LM — the same orchestration layer powering Gemini Nano in Chrome and Pixel Watch — is being positioned as the default way to ship Gemma 4 and other open models to NPUs. AI Edge Portal will pick up bulk inference, dedicated LLM benchmarking, and quantization tooling.
If you're shipping any real-time on-device AI on Android — segmentation, ASR, motion capture, on-device LLM — this is the moment the “just use the NPU” advice becomes practical instead of aspirational.
Sources: Google Developers Blog — Building real-world on-device AI with LiteRT and NPU, LiteRT: The Universal Framework for On-Device AI, NPU acceleration with LiteRT, Google AI Edge Portal.


