Google LiteRT is a unified framework that runs AI models on mobile devices, IoT hardware, and AI PCs by intelligently routing workloads across NPU, GPU, and CPU processors. You get 2x to 25x performance improvements compared to traditional CPU-only inference, with significantly lower battery drain. This guide walks you through implementing LiteRT in your Android projects, choosing the right hardware accelerator for your use case, and deploying production-ready on-device AI applications using the same techniques Google Meet and Epic Games rely on.
What Is the LiteRT Framework for Mobile AI
LiteRT is Google's evolution of TensorFlow Lite, designed to accelerate AI inference on edge devices by automatically selecting the best hardware accelerator available. Instead of manually configuring separate code paths for NPU, GPU, or CPU execution, you get a single API that routes your model to the optimal processor.
The framework supports over 120 operations commonly used in computer vision, natural language processing, and audio processing models. It works across Android devices with Qualcomm, MediaTek, and Samsung NPUs, iOS devices with Apple's Neural Engine, industrial IoT systems using Qualcomm Dragonwing, and AI PCs with Intel NPUs (via OpenVINO) or AMD XDNA accelerators.
LiteRT differs from its predecessor by introducing Ahead-of-Time (AOT) compilation, which pre-compiles models for specific hardware during your build process rather than at runtime. This eliminates the 200-400ms startup lag typical of JIT compilation, which is critical for applications like real-time video filters or voice assistants where first-inference latency matters.
NPU vs GPU for On-Device AI Performance
NPUs (Neural Processing Units) deliver 2.1x to 3.5x faster inference than GPU acceleration for most AI workloads, while consuming 40-60% less power. This performance gap exists because NPUs are purpose-built for the matrix multiplication and convolution operations that dominate neural network inference.
Google Meet's HD background blur model demonstrates this advantage clearly. The team deployed a model 25 times larger than their previous version using NPU acceleration through LiteRT, maintaining the same 30 FPS inference speed while dramatically improving blur quality. Running the same model on GPU would have required throttling to 15 FPS or draining the battery 2.3x faster.
However, GPUs still win for certain workloads. If your model uses operations not yet optimized for NPU (like certain attention mechanisms or dynamic control flow), GPU execution might actually be faster. LiteRT's automatic hardware selection handles this transparently, but understanding when to force a specific backend helps you optimize edge cases.
CPU inference remains relevant for ultra-low-power scenarios or devices without dedicated AI accelerators. A small keyword spotting model running on CPU might use just 5mW, compared to 80mW on NPU. For always-on detection that triggers heavier models, CPU-first makes sense.
How to Deploy AI Models on Android Devices with LiteRT
Setting up LiteRT in your Android project requires adding the appropriate dependencies and converting your trained model to the .tflite format. Here's the complete implementation process.
Step 1: Add LiteRT Dependencies
Add these lines to your app's build.gradle file:
dependencies {
    implementation 'com.google.ai.edge.litert:litert-android:1.0.1'
    implementation 'com.google.ai.edge.litert:litert-gpu:1.0.1'
    implementation 'com.google.ai.edge.litert:litert-support:1.0.1'
}
The base litert-android package provides CPU and NPU acceleration. The litert-gpu package adds GPU delegate support. If you're targeting devices with Qualcomm Hexagon DSPs, also include litert-hexagon-delegate.
Step 2: Convert Your Model to TFLite Format
If you're starting with a PyTorch or ONNX model, you'll need to convert it first. For TensorFlow models, use the TFLite converter with optimization enabled:
import tensorflow as tf

# Convert a SavedModel to .tflite with float16 quantization
converter = tf.lite.TFLiteConverter.from_saved_model('your_model_directory')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
The float16 quantization typically reduces model size by 50% with minimal accuracy loss (usually under 1% for vision models). For even smaller models, try int8 quantization, though you'll need a representative dataset for calibration.
Step 3: Initialize the Interpreter with Hardware Acceleration
Create an interpreter instance that automatically selects the best available hardware:
import com.google.ai.edge.litert.Interpreter
import com.google.ai.edge.litert.gpu.GpuDelegate
val options = Interpreter.Options().apply {
    // LiteRT automatically tries NPU first, then GPU, then CPU
    setUseNNAPI(true)           // Enables NPU acceleration via Android NNAPI
    addDelegate(GpuDelegate())  // Fallback to GPU if NPU unavailable
    setNumThreads(4)            // CPU threads if neither NPU nor GPU available
}
val interpreter = Interpreter(loadModelFile("model.tflite"), options)
The setUseNNAPI(true) flag enables NPU acceleration through Android's Neural Networks API. LiteRT handles the complexity of querying available hardware and selecting the fastest option automatically.
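Note that loadModelFile isn't part of the LiteRT API; it's a helper you write yourself. Here's a minimal sketch, written as a Context extension so the one-argument call above works inside an Activity, and assuming model.tflite ships in your app's assets folder:

import android.content.Context
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Hypothetical helper: memory-maps a .tflite model bundled in the app's assets
fun Context.loadModelFile(assetName: String): MappedByteBuffer =
    assets.openFd(assetName).use { fd ->
        FileInputStream(fd.fileDescriptor).use { input ->
            input.channel.map(
                FileChannel.MapMode.READ_ONLY,
                fd.startOffset,
                fd.declaredLength
            )
        }
    }

Memory-mapping keeps the model out of the Java heap, which matters once models grow past a few megabytes. One caveat: openFd only works if the asset isn't compressed at build time, so if it throws, adding 'tflite' to your Gradle noCompress list is the usual fix.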
Step 4: Run Inference
Prepare your input tensor and run inference:
val inputShape = intArrayOf(1, 224, 224, 3) // Batch, Height, Width, Channels
val inputBuffer = ByteBuffer.allocateDirect(1 * 224 * 224 * 3 * 4)
    .order(ByteOrder.nativeOrder())
// Fill inputBuffer with your preprocessed image data

val outputShape = intArrayOf(1, 1000) // Example for ImageNet classification
val outputBuffer = ByteBuffer.allocateDirect(1 * 1000 * 4)
    .order(ByteOrder.nativeOrder())

interpreter.run(inputBuffer, outputBuffer)
// Parse outputBuffer for results
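What "preprocessed image data" means depends on your model. As a reference point, here's a hedged sketch for a float32 classifier that expects 224x224 RGB input normalized to [0, 1]; the resolution, channel order, and normalization are assumptions you should check against your model's training pipeline:

import android.graphics.Bitmap
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical preprocessing: scales a Bitmap to 224x224 and writes
// normalized RGB floats into a direct buffer the interpreter can consume
fun bitmapToInputBuffer(bitmap: Bitmap): ByteBuffer {
    val resized = Bitmap.createScaledBitmap(bitmap, 224, 224, true)
    val buffer = ByteBuffer.allocateDirect(1 * 224 * 224 * 3 * 4)
        .order(ByteOrder.nativeOrder())
    val pixels = IntArray(224 * 224)
    resized.getPixels(pixels, 0, 224, 0, 0, 224, 224)
    for (pixel in pixels) {
        buffer.putFloat(((pixel shr 16) and 0xFF) / 255.0f) // R
        buffer.putFloat(((pixel shr 8) and 0xFF) / 255.0f)  // G
        buffer.putFloat((pixel and 0xFF) / 255.0f)          // B
    }
    buffer.rewind()
    return buffer
}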
First-inference latency will be higher (typically 50-150ms depending on model size) as LiteRT compiles the model for your specific hardware. Subsequent inferences run at full speed, usually 5-30ms for common vision models on modern Android devices.
Google LiteRT NPU Acceleration Tutorial: Real-World Performance
Epic Games' implementation of MetaHuman facial animation on Android provides a detailed case study of NPU acceleration benefits. Their model performs real-time facial tracking and animation at 30 FPS on mid-range Android devices, something previously impossible without cloud processing.
The team started with a 47MB model that ran at 12 FPS on GPU. After converting to LiteRT with NPU acceleration and applying int8 quantization, they achieved 30 FPS with a 23MB model. Power consumption dropped from 2.8W to 1.1W, extending gameplay sessions by roughly 90 minutes on typical device batteries.
Google's own Pixel camera uses LiteRT for real-time HDR+ processing. The Night Sight feature runs a multi-frame fusion model that processes 15 frames in under 2 seconds on NPU, compared to 6+ seconds on CPU. This 3x speedup makes the feature practical for handheld photography instead of requiring a tripod.
NVIDIA's Parakeet TDT 0.6B speech recognition model demonstrates AOT compilation benefits. Standard runtime compilation added 380ms startup time, unacceptable for voice assistant applications. With LiteRT's AOT compilation, the model loads in 45ms and begins transcribing immediately. Honestly, that startup difference is what separates "technically works" from "actually ships to users."
On-Device AI Inference Optimization Guide
Choosing between NPU, GPU, and CPU execution depends on your specific model architecture and performance requirements. Here's how to make that decision systematically.
When to Force NPU Acceleration
NPU excels at convolutional neural networks, fully connected layers, and standard recurrent operations. If your model is primarily MobileNetV3, EfficientNet, BERT, or similar architectures, NPU will deliver the best performance-per-watt ratio.
Force NPU execution when battery life matters more than absolute peak performance. A background object detection model running continuously should target NPU to avoid draining 15-20% of the battery per hour, as it would on GPU.
When to Force GPU Acceleration
GPU performs better for models with custom operations, complex branching, or operations not yet optimized for NPU hardware. Transformer models with dynamic attention patterns often run faster on GPU despite higher power consumption.
If you need to process multiple models simultaneously, GPU's parallel execution capabilities can outweigh NPU's efficiency advantages. Running three separate models on NPU requires sequential execution, while GPU can batch them together.
When CPU Makes Sense
Tiny models under 1MB often run faster on CPU due to the overhead of data transfer to NPU or GPU. A 200KB keyword spotting model that triggers larger models should run on CPU to minimize latency and power.
Models requiring frequent host-device data transfers also favor CPU. If you're processing results from one model and feeding them to another in a tight loop, keeping everything on CPU avoids the 2-5ms transfer penalty per iteration.
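Pulling the three scenarios above together, here's a rough sketch of how backend selection looks with the same Interpreter.Options API used earlier. Treat it as a starting point rather than a guarantee: setUseNNAPI requests NNAPI routing (which reaches the NPU only where a vendor driver exists), and exact delegate behavior varies by device and LiteRT version.

// Prefer the NPU: request NNAPI and skip the GPU delegate
val npuOptions = Interpreter.Options().apply {
    setUseNNAPI(true)
}

// Force the GPU: attach only the GPU delegate, leave NNAPI off
val gpuOptions = Interpreter.Options().apply {
    addDelegate(GpuDelegate())
}

// Stay on the CPU: no delegates, just tune the thread count
val cpuOptions = Interpreter.Options().apply {
    setNumThreads(4)
}

// Swap in whichever option set matches your workload
val interpreter = Interpreter(loadModelFile("model.tflite"), npuOptions)

Whichever configuration you pick, verify it on real devices rather than assuming; the next section shows how.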
Measuring Real Performance
LiteRT includes built-in benchmarking tools, but the quickest check is timing inference directly in your app. Add this to your implementation to measure actual hardware performance:
val options = Interpreter.Options().apply {
    setUseNNAPI(true)
}
val interpreter = Interpreter(loadModelFile("model.tflite"), options)

interpreter.run(inputBuffer, outputBuffer) // Warm-up: first inference includes compilation

val timingsMs = LongArray(100) {
    val startTime = System.nanoTime()
    interpreter.run(inputBuffer, outputBuffer)
    (System.nanoTime() - startTime) / 1_000_000 // Convert to ms
}
Log.d("LiteRT", "Median inference time: ${timingsMs.sorted()[timingsMs.size / 2]}ms")
Collect 100+ timed inferences to get stable measurements. The first inference will be 3-10x slower due to on-device compilation, which is why the snippet warms up once before timing and reports the median of the remaining runs.
Cross-Platform Deployment Beyond Android
LiteRT's cross-platform support extends beyond mobile devices into industrial IoT and AI PCs. Qualcomm's Dragonwing platform uses LiteRT to run computer vision models on factory floor cameras, processing quality control inspections at 60 FPS while consuming under 5W.
Intel's OpenVINO toolkit integrates with LiteRT to accelerate models on AI PCs with dedicated NPU hardware. The same model file you deploy on Android can run on Windows laptops with Intel Core Ultra processors, automatically utilizing the integrated NPU for 40-50% lower power consumption compared to GPU execution.
For developers building products across mobile, edge, and desktop platforms, this unified approach means you write conversion and optimization code once. The alternative? Maintaining separate TensorFlow Lite, ONNX Runtime, and platform-specific implementations typically requires 2-3x more engineering time.
Look, if you're building AI applications that process data locally rather than sending it to cloud APIs, understanding frameworks like LiteRT is as fundamental as knowing how to structure AI coding agents for production use. The performance differences between properly optimized on-device inference and naive CPU execution often determine whether your product is viable at all. Start with the Android implementation above, measure your actual inference times across different hardware backends, and optimize based on your specific latency and power requirements rather than assumptions about what "should" be faster.