AI Keeps Running Even Without the Cloud — Implementing an Edge AI On-Device Deployment Pipeline
I started out thinking, "Can't we just run inference on a server?" But after running into a latency issue with an AR prototype at work, my thinking changed completely. The hundreds of milliseconds spent on a cloud round-trip can cause motion sickness on AR devices and lead to accidents in autonomous vehicles. On-device inference is the architecture that lets Siri respond in airplane mode, lets the Meta Quest 3 track hand gestures without a network, and lets factory sensors detect anomalies in 0.1 seconds without an internet connection.
This article walks through the architecture behind Edge AI on-device inference and shows, with code examples, how to build an actual deployment pipeline, from an Android app to cross-platform Python code. By the end, you'll be able to build your own inference pipeline that runs an INT8 quantized model in an Android app or via ONNX Runtime.
As of 2025, the global edge AI market is valued at $24.9 billion and is projected to reach approximately $118.7 billion by 2033 (CAGR 21.7%) (Grand View Research). Inference workloads already account for more than 50% of total AI computing (CEVA 2025 Edge AI Report), and as sub-billion models like Llama 3.2 1B and Gemma 3 270M have reached the point where they can handle real tasks, on-device inference has moved out of "research territory" and into the realm of "production engineering."
Core Concepts
What Exactly Is On-Device Inference?
A cloud-based AI pipeline is simple: send data to a server, the server runs inference, and you get the result back. On-device inference completes this entire process on local hardware. The data never leaves the device.
The scope of Edge varies by context. It encompasses everything from user devices like smartphones and tablets to factory gateways, autonomous vehicle onboard computers, and wearables. The common thread is that "inference is completed at the same location where data is generated."
Three components work together:
| Component | Role | Representative Technologies |
|---|---|---|
| Lightweight model | Reduces large cloud-scale models to a size runnable on edge devices | Quantization, pruning, knowledge distillation |
| Inference runtime | Device-optimized execution engine | LiteRT, ONNX Runtime, ExecuTorch, Core ML |
| Hardware accelerator | Dedicated AI compute chips for speed and power efficiency | NPU, GPU, DSP |
Three Pillars of Model Compression
Running a cloud model as-is on a device is not feasible — RAM is only a few hundred MB to a few GB. The model must go through a compression pipeline.
Quantization is usually the most effective single step. Reducing weight precision from FP32 to INT8 shrinks the model by roughly 4x, since each weight drops from 32 bits to 8; going down to INT4 can reduce it by up to 8x. Computation speed increases and memory bandwidth requirements decrease.
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")

# Apply Post-Training Quantization (PTQ)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.int8]


# Calibrate range with a representative dataset (INT8 static quantization).
# calibration_data is assumed to be an iterable of NumPy arrays shaped like the model input.
def representative_dataset():
    for data in calibration_data:
        yield [data.astype("float32")]


converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

PTQ vs QAT: Post-Training Quantization (PTQ) is applied after training and can be used quickly. Quantization-Aware Training (QAT) simulates quantization during training to minimize accuracy loss. QAT is recommended for accuracy-sensitive tasks.
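To make the calibration step less abstract, here is a minimal pure-Python sketch of affine INT8 quantization: a scale and zero point are derived from observed values, floats are mapped to the range -128..127, and then mapped back. The helper names (`compute_qparams`, `quantize`, `dequantize`) and the sample values are illustrative assumptions, not the converter's internal implementation.

```python
from typing import List, Tuple


def compute_qparams(values: List[float], qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Derive an affine scale and zero point from observed float values."""
    # The range is widened to include 0.0 so that zero stays exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Map floats to integers, clamping to the INT8 range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical calibration samples standing in for real activations or weights.
calibration = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zp = compute_qparams(calibration)
q = quantize(calibration, scale, zp)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(calibration, restored))
print(f"scale={scale:.5f} zero_point={zp} q={q} max_err={max_err:.5f}")
```

The round-trip error stays within one quantization step (`scale`), which is why a calibration set that reflects real input ranges matters: a few extreme outliers widen the range and make every step coarser.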
Pruning removes unnecessary weights. Structured pruning removes entire neurons or layers, so it translates to actual speedups on general-purpose hardware. Unstructured pruning only makes the matrix sparse, so the perceptible benefit is limited without dedicated hardware support.
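The difference between the two pruning styles can be sketched on a tiny weight matrix. This is a simplified illustration using magnitude-based criteria (an assumption; real toolkits offer several criteria), and the function names and the sample `layer` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def unstructured_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Zero the smallest-magnitude weights; the matrix keeps its shape."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    # Ties at the threshold are also zeroed, so sparsity can slightly exceed the target.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights: Matrix, keep_rows: int) -> Matrix:
    """Drop whole rows (output neurons) with the smallest L1 norm; the matrix shrinks."""
    ranked = sorted(range(len(weights)),
                    key=lambda i: sum(abs(w) for w in weights[i]),
                    reverse=True)
    kept = sorted(ranked[:keep_rows])  # preserve the original row order
    return [weights[i][:] for i in kept]


layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.03],
    [0.5, -0.05, 0.6],
    [-0.04, 0.3, 0.06],
]
sparse = unstructured_prune(layer, sparsity=0.5)
small = structured_prune(layer, keep_rows=2)
print(len(sparse), "rows after unstructured pruning")  # still 4 rows, half the entries are zero
print(len(small), "rows after structured pruning")     # 2 rows, a genuinely smaller matmul
```

The unstructured result still performs a 4x3 multiplication unless the hardware or kernel can skip zeros, while the structured result is a smaller dense matrix that any backend runs faster.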
Knowledge Distillation trains a small student model to mimic the output distribution of a large teacher model. It can achieve higher accuracy in the small model compared to training on labels alone.
```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    # KL divergence: measures how far one probability distribution is from another.
    # Guides the student to imitate the teacher model's soft probability distribution.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard target loss: train on actual labels
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Sequentially applying a composite pipeline (pruning → quantization → distillation) enables significant compression. According to Promwad's analysis, a representative image classification benchmark achieved a 74% reduction in parameters with less than 3% accuracy loss (AI Model Compression, Promwad). It's important to define acceptable loss thresholds per task in advance.
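The `temperature` argument in the distillation loss above is easier to reason about with a small numeric sketch. The snippet below, written in plain Python with hypothetical teacher logits, shows how dividing logits by a temperature greater than 1 flattens the distribution so the student also sees how the teacher ranks the wrong classes.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature; higher values give softer outputs."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [8.0, 2.0, 1.0]
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
```

At T=1 nearly all probability mass sits on the top class, so there is little to learn beyond the label. At T=4 the secondary classes receive visible probability while the ranking is unchanged. Multiplying the soft loss by `temperature ** 2` is the usual way to keep its gradient magnitude comparable to the hard loss as the temperature changes.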
Choosing an Inference Runtime
Honestly, this was the most confusing part for me. There are so many runtimes.
| Runtime | Best For | Key Features |
|---|---|---|
| LiteRT (formerly TFLite) | Android, embedded Linux | NPU utilization via NNAPI and GPU delegates |
| Core ML | iOS, macOS | Full Apple Neural Engine utilization, native Swift |
| ONNX Runtime | Cross-platform, browser | Framework-neutral, browser inference possible via WASM |
| ExecuTorch | PyTorch ecosystem | Already deployed by Meta in Instagram and WhatsApp |
| llama.cpp | Desktop, embedded LLMs | CPU-optimized, GGUF format-based |
When the platform is fixed, the choice becomes easier. If it's iOS, Core ML; if you need broad Android support, LiteRT; if you trained with PyTorch and need to support multiple backends, ExecuTorch is the realistic choice. The examples below cover Android (Example 1) and cross-platform Python (Examples 2 and 3) respectively.
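The selection rules in this section can be written down as a small lookup. This is only a sketch of the heuristics stated above, with the ONNX Runtime fallback taken from the comparison table; the function name, parameter names, and the order in which the rules are checked are assumptions.

```python
from typing import Optional


def suggest_runtime(platform: str,
                    training_framework: Optional[str] = None,
                    multi_backend: bool = False) -> str:
    """Return a starting-point runtime for a target platform (rule of thumb only)."""
    platform = platform.lower()
    framework = (training_framework or "").lower()

    if platform in {"ios", "macos"}:
        return "Core ML"
    if framework == "pytorch" and multi_backend:
        return "ExecuTorch"
    if platform == "android":
        return "LiteRT"
    # Cross-platform and browser targets fall back to the framework-neutral option.
    return "ONNX Runtime"


print(suggest_runtime("iOS"))
print(suggest_runtime("android"))
print(suggest_runtime("android", training_framework="pytorch", multi_backend=True))
print(suggest_runtime("browser"))
```

Treat the result as a first candidate to benchmark, not a final decision: operator coverage and delegate support on the actual target devices can still change the answer.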
Practical Application
Example 1: Image Classification on Android with LiteRT
If you're developing an Android app, this example is a good place to start. iOS developers will find the Core ML API structure similar, so you can refer to the concepts and skip to Example 2.
This is a structure that classifies a live camera feed in real time while leveraging the device NPU via NNAPI. I remember running it without NnApiDelegate at first and getting inference times more than 3x my target — you really have to experience it firsthand to appreciate how much difference that single delegate line makes.
```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import android.content.res.AssetManager
import android.graphics.Bitmap
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

class EdgeInferenceEngine(private val assetManager: AssetManager, modelFileName: String) {
    private val nnApiDelegate = NnApiDelegate()
    private val interpreter: Interpreter
    private val inputSize = 224
    private val numClasses = 1000

    init {
        val options = Interpreter.Options().apply {
            addDelegate(nnApiDelegate)
            // Handle NPU fallback with CPU multi-threading
            setNumThreads(4)
        }
        interpreter = Interpreter(loadModelFile(modelFileName), options)
    }

    fun classify(bitmap: Bitmap): FloatArray {
        val inputBuffer = preprocessImage(bitmap)
        val outputBuffer = Array(1) { FloatArray(numClasses) }
        interpreter.run(inputBuffer, outputBuffer)
        return outputBuffer[0]
    }

    // NnApiDelegate implements AutoCloseable, so it must be released.
    // Call this from the Activity or Fragment's onDestroy().
    fun close() {
        interpreter.close()
        nnApiDelegate.close()
    }

    private fun loadModelFile(fileName: String): MappedByteBuffer {
        val fileDescriptor = assetManager.openFd(fileName)
        val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
        return inputStream.channel.map(
            FileChannel.MapMode.READ_ONLY,
            fileDescriptor.startOffset,
            fileDescriptor.declaredLength
        )
    }

    private fun preprocessImage(bitmap: Bitmap): ByteBuffer {
        val resized = Bitmap.createScaledBitmap(bitmap, inputSize, inputSize, true)
        // allocateDirect: allocates memory outside the JVM heap to eliminate JNI copy overhead.
        // nativeOrder(): aligns with the device CPU's byte order (endianness).
        val buffer = ByteBuffer.allocateDirect(4 * 3 * inputSize * inputSize)
        buffer.order(ByteOrder.nativeOrder())

        val pixels = IntArray(inputSize * inputSize)
        resized.getPixels(pixels, 0, inputSize, 0, 0, inputSize, inputSize)

        for (pixel in pixels) {
            // Must match the ImageNet normalization parameters used during training for accurate results.
            buffer.putFloat(((pixel shr 16 and 0xFF) / 255.0f - 0.485f) / 0.229f)
            buffer.putFloat(((pixel shr 8 and 0xFF) / 255.0f - 0.456f) / 0.224f)
            buffer.putFloat(((pixel and 0xFF) / 255.0f - 0.406f) / 0.225f)
        }
        return buffer
    }
}
```

| Code Point | Description |
|---|---|
| `NnApiDelegate()` | Delegates computation to the device NPU/DSP via Android NNAPI; falls back to CPU if unavailable |
| `loadModelFile()` | Loads the model file inside the APK as a memory map via AssetManager |
| `allocateDirect` + `nativeOrder` | Passes the buffer to native code without copy overhead at the JNI boundary |
| `close()` | NnApiDelegate holds native resources and must be explicitly released |
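The bit arithmetic in `preprocessImage` is easy to get subtly wrong, so it can help to check it off-device. The following Python sketch mirrors the same per-pixel math (unpack packed ARGB, scale to 0..1, apply the ImageNet mean and standard deviation). The helper names and the sample color are illustrative assumptions.

```python
from typing import List, Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def unpack_argb(pixel: int) -> Tuple[int, int, int]:
    """Split a packed 32-bit ARGB value into 8-bit R, G, B channels."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


def normalize_pixel(pixel: int) -> List[float]:
    """Return normalized [R, G, B] floats in the order the Kotlin loop writes them."""
    channels = unpack_argb(pixel)
    return [(c / 255.0 - m) / s for c, m, s in zip(channels, IMAGENET_MEAN, IMAGENET_STD)]


opaque_orange = 0xFFFF8000          # A=255, R=255, G=128, B=0
as_signed_int = opaque_orange - (1 << 32)  # how a 32-bit signed Int sees the same bits

print(unpack_argb(opaque_orange))   # (255, 128, 0)
print(unpack_argb(as_signed_int))   # (255, 128, 0): the 0xFF mask makes the sign irrelevant
print([round(v, 4) for v in normalize_pixel(opaque_orange)])
```

Note that the Kotlin loop writes R, G, B for each pixel in turn, which produces an interleaved HWC layout. That is only correct if the converted model expects that layout, so it is worth confirming the input tensor shape before relying on it.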
Example 2: Cross-Platform Inference with ONNX Runtime in Python
If you're an ML engineer or server-side developer, this example is the most familiar starting point. The same code runs on everything from a Raspberry Pi to a Windows desktop.
ONNX is an intermediate format that lets you convert PyTorch or TensorFlow models and run them anywhere.
```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# Keep the normalization constants in float32: subtracting a plain Python list
# would silently promote the input to float64, which a float32 model rejects.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def create_session(model_path: str) -> ort.InferenceSession:
    # Auto-select available providers (CUDA > CoreML > CPU in priority)
    providers = ort.get_available_providers()
    print(f"Available execution providers: {providers}")

    session_options = ort.SessionOptions()
    session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    session_options.intra_op_num_threads = 4

    return ort.InferenceSession(model_path, session_options, providers=providers)


def run_inference(session: ort.InferenceSession, image_path: str) -> np.ndarray:
    # Omitting convert("RGB") causes shape mismatches with RGBA PNGs or grayscale images.
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    input_data = np.array(img, dtype=np.float32) / 255.0
    input_data = (input_data - IMAGENET_MEAN) / IMAGENET_STD
    input_data = np.transpose(input_data, (2, 0, 1))  # HWC → CHW
    input_data = np.expand_dims(input_data, axis=0)   # Add batch dimension

    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})
    return outputs[0]
```

When exporting a PyTorch model to ONNX, the dynamic_axes setting is often overlooked. If the batch size is fixed at export time, you will have to re-export later.
```python
import torch


def export_to_onnx(model: torch.nn.Module, save_path: str) -> None:
    model.eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        dummy_input,
        save_path,
        export_params=True,
        opset_version=17,
        input_names=["input"],
        output_names=["output"],
        # Handle batch size dynamically on both the input and the output
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    )
    print(f"ONNX model saved: {save_path}")
```

Example 3: Hybrid Edge-Cloud Routing
If you're wondering how to split traffic between on-device and cloud, this example will help.
This is a situation you encounter frequently in practice — running all inference on-device isn't always the right answer. A structure that dynamically routes simple requests to the device and complex ones to the cloud creates a sensible balance between cost and performance.
complexity_score is the key variable in the routing decision. It can be computed by combining features such as input text length, model output entropy, previous inference failure rate, and query type. Starting with simple rule-based logic and gradually evolving it into a learned classifier is a realistic approach in production.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class InferenceTarget(Enum):
    DEVICE = "device"
    CLOUD = "cloud"


@dataclass
class InferenceRequest:
    input_data: Any
    # 0.0~1.0: computed by combining input length, output entropy, prior failure rate, etc.
    complexity_score: float
    requires_privacy: bool
    latency_budget_ms: float


class HybridInferenceRouter:
    COMPLEXITY_THRESHOLD = 0.7
    LATENCY_THRESHOLD_MS = 50.0

    def __init__(self, local_model: Any, cloud_client: Any) -> None:
        # Both objects are assumed to expose an async predict(input_data) method.
        self.local_model = local_model
        self.cloud_client = cloud_client

    def route(self, request: InferenceRequest) -> InferenceTarget:
        # Always route locally when privacy is required
        if request.requires_privacy:
            return InferenceTarget.DEVICE
        # Route locally when latency budget is tight
        if request.latency_budget_ms < self.LATENCY_THRESHOLD_MS:
            return InferenceTarget.DEVICE
        # Route to cloud when complexity is high
        if request.complexity_score > self.COMPLEXITY_THRESHOLD:
            return InferenceTarget.CLOUD
        return InferenceTarget.DEVICE

    async def infer(self, request: InferenceRequest) -> Any:
        target = self.route(request)
        if target == InferenceTarget.DEVICE:
            return await self.local_model.predict(request.input_data)
        else:
            return await self.cloud_client.predict(request.input_data)
```

N-iX's edge AI trend analysis points in the same direction: hybrid configurations significantly reduce energy and cost compared to pure cloud (Key edge AI trends, N-iX). You don't have to process everything on-device.
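As a starting point for the rule-based `complexity_score` mentioned earlier, the features can be combined with a simple weighted sum. The sketch below is one possible heuristic, not a recommended formula: the weights, the `max_chars` cap, and the idea of taking the entropy from a cheap first-pass prediction are all assumptions to be tuned against real traffic.

```python
import math
from typing import Sequence


def output_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy of a distribution, normalized to 0.0-1.0 by log(n)."""
    n = len(probabilities)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    return entropy / math.log(n)


def complexity_score(text: str,
                     probabilities: Sequence[float],
                     failure_rate: float,
                     max_chars: int = 2000,
                     weights: Sequence[float] = (0.4, 0.4, 0.2)) -> float:
    """Combine input length, output uncertainty, and past failures into a 0.0-1.0 score."""
    length_term = min(len(text) / max_chars, 1.0)
    entropy_term = output_entropy(probabilities)
    failure_term = min(max(failure_rate, 0.0), 1.0)
    score = (weights[0] * length_term
             + weights[1] * entropy_term
             + weights[2] * failure_term)
    return min(max(score, 0.0), 1.0)


# A short, confidently classified request versus a long, uncertain one.
easy = complexity_score("turn on the light", [0.97, 0.02, 0.01], failure_rate=0.0)
hard = complexity_score("x" * 3000, [0.34, 0.33, 0.33], failure_rate=0.5)
print(f"easy={easy:.3f} hard={hard:.3f}")
```

With the router's threshold of 0.7, the first request would stay on-device and the second would go to the cloud. Logging these scores next to actual outcomes gives you the labeled data needed to replace the weighted sum with a learned classifier later.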
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Ultra-low latency | Eliminates network round-trips, cutting response time by up to 70% — the only viable option in scenarios like AR and autonomous driving where sub-20ms is mandatory |
| Offline operation | Full functionality without an internet connection — works on subways, in airplanes, and at remote sites |
| Data privacy | Sensitive data never leaves the device — advantageous for regulatory compliance in medical and financial sectors |
| Cost reduction | Eliminates cloud API call costs and bandwidth costs |
| Reliability | Unaffected by server downtime or network failures |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Model accuracy degradation | A 2–4% accuracy loss is typical when combining quantization, pruning, and distillation | Define acceptable loss thresholds per task in advance; minimize loss with QAT |
| Hardware fragmentation | Android alone has thousands of chipset variants — NNAPI, Qualcomm QNN, and MediaTek APU each require separate testing | Narrow down the list of target devices and include real-device tests in CI |
| Memory constraints | Model file size, runtime memory, and peak memory must be measured separately | Verify peak memory with a Profiler before deployment |
| Update complexity | Cloud updates take effect immediately via server redeployment, but on-device requires app updates or a separate OTA pipeline | Separate model files from the app binary to enable OTA updates |
| Security threats | Storing model files on a device exposes them to reverse engineering and model extraction attacks | Consider model encryption and Secure Enclave utilization |
NPU (Neural Processing Unit): A dedicated chip specialized for AI matrix operations. Unlike a GPU, which handles general-purpose parallel computation, an NPU is optimized narrowly for the operations needed in AI inference, which gives it far better performance-per-watt. Qualcomm Hexagon, Apple Neural Engine, and Samsung Exynos NPU are prime examples.
The Most Common Mistakes in Practice
- Looking only at file size without measuring memory: A 100 MB model can use 400 MB at runtime. It is strongly recommended to always measure model file size, runtime memory, and peak memory separately.
- Training and inference preprocessing getting out of sync: Even a small difference in normalization parameters, channel order (HWC vs CHW), or pixel value range (0–1 vs 0–255) causes accuracy to plummet. It is recommended to include preprocessing code in the model conversion pipeline and manage its version alongside the model.
- Judging performance based on a development machine: Inference that takes 50 ms on an M4 MacBook can take 800 ms on a three-year-old budget Android phone. Benchmarking on the actual target device is essential.
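For the third point, a small timing harness that you can run unchanged on each target device makes the comparison concrete. This is a minimal sketch: `benchmark` and `fake_inference` are hypothetical names, and the stand-in workload should be replaced with a call into your real session or interpreter.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object],
              warmup: int = 5,
              iterations: int = 50) -> Dict[str, float]:
    """Time a callable and report mean, median, p95, and max latency in milliseconds."""
    # Warm-up runs absorb one-time costs such as lazy initialization and cache fills.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(round(0.95 * (len(samples_ms) - 1))))
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference() -> int:
    """Stand-in workload; replace with a real inference call."""
    return sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print({name: round(value, 3) for name, value in stats.items()})
```

Tail latency (p95 and max) usually matters more than the mean on mobile hardware, because thermal throttling and background load show up there first.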
Closing Thoughts
The reason on-device inference is not simply "shrinking a cloud model and putting it on-device" is that the design philosophy itself is different. Cloud AI is designed to elastically scale compute resources. On-device, by contrast, treats memory limits, power budgets, and hardware fragmentation as constraints from the outset, and focuses on extracting the best possible accuracy within them. Model architecture selection, compression order, runtime choice, hardware testing strategy, model update pipeline — there are layers of considerations distinct from cloud deployment. Whether these constraints are embraced at the start of design or encountered right before deployment often determines the success or failure of a project.
Three steps you can try right now:
- Try model conversion: Pick a familiar TensorFlow or PyTorch model and convert it with `tf.lite.TFLiteConverter` or `torch.onnx.export`. Simply seeing the file size difference before and after conversion gives you an intuitive feel for it quickly.
- Measure benchmarks: Run the converted model locally with ONNX Runtime or LiteRT and time the latency. Attaching only a CPU provider to `ort.InferenceSession` is enough to build a complete inference pipeline in 10 lines.
- Apply quantization and compare accuracy: Apply INT8 quantization with PTQ and compare accuracy against the original model to develop an intuition for "how much loss you can accept." Qualcomm AI Hub lets you try Snapdragon-targeted profiling on its free tier, and Edge Impulse lets you go all the way to IoT-targeted deployment on a free plan. If you want to start without a cloud platform, you can also enable the profiling option in `ort.SessionOptions` and measure directly on your local machine.
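For the accuracy comparison in the last step, a simple gate that compares top-1 accuracy before and after quantization is often enough to start with. The sketch below uses made-up scores and labels; the function names and the 2-point default budget are assumptions, so substitute the threshold you defined for your task.

```python
from typing import Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def top1_accuracy(outputs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(1 for scores, label in zip(outputs, labels) if argmax(scores) == label)
    return correct / len(labels)


def accuracy_drop_ok(baseline_outputs: Sequence[Sequence[float]],
                     quantized_outputs: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     max_drop: float = 0.02) -> bool:
    """True when the quantized model stays within the allowed accuracy budget."""
    drop = top1_accuracy(baseline_outputs, labels) - top1_accuracy(quantized_outputs, labels)
    return drop <= max_drop


# Made-up outputs for four samples and three classes.
labels = [0, 1, 2, 1]
fp32_outputs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]
int8_outputs = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]

print(top1_accuracy(fp32_outputs, labels))  # 1.0
print(top1_accuracy(int8_outputs, labels))  # 0.75
print(accuracy_drop_ok(fp32_outputs, int8_outputs, labels))
```

On a real validation set, run both models over the same inputs and feed their outputs into the same functions. If the gate fails with PTQ, that is the signal to try QAT or a less aggressive compression setting.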
References
- Edge AI: The future of AI inference is smarter local compute | InfoWorld
- 2026 AI story: Inference at the edge, not just scale in the cloud | RD World Online
- On-Device LLMs in 2026: What Changed, What Matters, What's Next | Edge AI and Vision Alliance
- Key edge AI trends transforming enterprise tech in 2026 | N-iX
- Edge AI Market Size, Share & Trends | Grand View Research
- 2025 Edge AI Technology Report | CEVA
- AI Model Compression: Pruning and Quantization Strategies | Promwad
- Optimizing Your AI Model for the Edge | Qualcomm Developer Blog
- Efficient Inference at the Edge: Quantization, Pruning, and Knowledge Distillation | Uplatz
- Top 10 Edge AI Frameworks for 2025 | Huebits Blog
- ExecuTorch vs ONNX Runtime | Cactus Compute
- Optimizing Edge AI: A Comprehensive Survey | arXiv
- Empowering Edge Intelligence: A Comprehensive Survey on On-Device AI Models | arXiv
- AI Inference Moving to Local Computing | CIO Korea
- Meta Releases ExecuTorch 1.0, an AI Inference Framework for Edge Devices | CIO Korea