AI Keeps Running Even Without the Cloud — Implementing an Edge AI On-Device Deployment Pipeline

I started out thinking, "Can't we just run inference on a server?" But after running into a latency issue with an AR prototype at work, my thinking changed completely. The hundreds of milliseconds spent on a cloud round-trip can cause motion sickness on AR devices and contribute to accidents in autonomous vehicles. On-device inference removes that round-trip: it is the architecture that lets Siri respond in airplane mode, lets the Meta Quest 3 track hand gestures without a network, and lets factory sensors detect anomalies in 0.1 seconds without an internet connection.

This article walks through the architecture behind Edge AI on-device inference and then, with code examples, shows how to build an actual deployment pipeline, from an Android app to cross-platform Python code. By the end, you'll be able to build your own inference pipeline that runs an INT8 quantized model in an Android app or via ONNX Runtime.

As of 2025, the global edge AI market is valued at $24.9 billion and is projected to reach approximately $118.7 billion by 2033 (CAGR 21.7%) (Grand View Research). Inference workloads already account for more than 50% of total AI computing (CEVA 2025 Edge AI Report), and as sub-billion models like Llama 3.2 1B and Gemma 3 270M have reached the point where they can handle real tasks, on-device inference has moved out of "research territory" and into the realm of "production engineering."

Core Concepts

What Exactly Is On-Device Inference?

A cloud-based AI pipeline is simple: send data to a server, the server runs inference, and you get the result back. On-device inference completes this entire process on local hardware. The data never leaves the device.

The scope of Edge varies by context. It encompasses everything from user devices like smartphones and tablets to factory gateways, autonomous vehicle onboard computers, and wearables. The common thread is that "inference is completed at the same location where data is generated."

Three components work together:

Component	Role	Representative Technologies
Lightweight model	Reduces large cloud-scale models to a size runnable on edge devices	Quantization, pruning, knowledge distillation
Inference runtime	Device-optimized execution engine	LiteRT, ONNX Runtime, ExecuTorch, Core ML
Hardware accelerator	Dedicated AI compute chips for speed and power efficiency	NPU, GPU, DSP

Three Pillars of Model Compression

Running a cloud model as-is on a device is not feasible — RAM is only a few hundred MB to a few GB. The model must go through a compression pipeline.

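Quantization is usually the most effective single step. Reducing model weight precision from FP32 to INT8 shrinks the model size by roughly 4x (32 bits down to 8 bits per weight); going down to INT4 can reduce it by up to 8x. Inference typically gets faster as well, because smaller weights need less memory bandwidth and integer kernels are cheaper on hardware that supports them.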

python

import tensorflow as tf
 
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
 
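# Apply Post-Training Quantization (PTQ)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Restrict the converter to INT8 builtin kernels (full-integer quantization).
# Input/output tensors stay float32 unless inference_input_type/inference_output_type are set.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# Calibrate range with a representative dataset (INT8 static quantization).
# calibration_data must be defined beforehand: an iterable of NumPy arrays
# shaped like the model input, batch dimension included.
def representative_dataset():
    for data in calibration_data:
        yield [data.astype("float32")]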
 
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
 
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
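
To make the 4x figure and the role of the calibration range more concrete, here is a minimal pure-Python sketch of asymmetric (affine) INT8 quantization. It is an illustration under stated assumptions, not what the converter does internally: the function names and the example range of -1.0 to 2.0 are hypothetical, and real converters pick ranges per tensor or per channel.

```python
from typing import Tuple

QMIN, QMAX = -128, 127  # signed 8-bit integer range


def quantization_params(min_val: float, max_val: float) -> Tuple[float, int]:
    """Derive (scale, zero_point) from a calibrated float range."""
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (QMAX - QMIN)
    if scale == 0.0:  # degenerate all-zero tensor
        return 1.0, 0
    zero_point = int(round(QMIN - min_val / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to INT8, clamping values outside the calibrated range."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the approximate float value from its INT8 code."""
    return (q - zero_point) * scale


def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, ignoring file metadata and scales."""
    return num_params * bits_per_weight / 8


# Hypothetical range observed over a calibration set.
scale, zero_point = quantization_params(-1.0, 2.0)
for value in (-1.0, 0.0, 0.5, 2.0):
    q = quantize(value, scale, zero_point)
    print(value, "->", q, "->", round(dequantize(q, scale, zero_point), 4))

# Ratio of FP32 to INT8 weight storage for the same parameter count.
print(weight_bytes(1_000_000, 32) / weight_bytes(1_000_000, 8))
```

The round-trip error for any value inside the calibrated range is bounded by about half of `scale`, which is why a representative calibration set matters: a range that is too wide wastes precision, and one that is too narrow clips real activations.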

PTQ vs QAT: Post-Training Quantization (PTQ) is applied after training and can be used quickly. Quantization-Aware Training (QAT) simulates quantization during training to minimize accuracy loss. QAT is recommended for accuracy-sensitive tasks.

Pruning removes unnecessary weights. Structured pruning removes entire neurons or layers, so it translates to actual speedups on general-purpose hardware. Unstructured pruning only makes the matrix sparse, so the perceptible benefit is limited without dedicated hardware support.
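
The difference between the two pruning styles is easier to see on a tiny weight matrix. The following is a rough sketch using plain Python lists: magnitude-based selection is one common criterion, assumed here for illustration, and the function names and example weights are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def unstructured_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Zero out the smallest-magnitude weights; the matrix shape is unchanged."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]
    # Ties at the threshold are pruned too, so slightly more than k may be zeroed.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights: Matrix, keep_rows: int) -> Matrix:
    """Drop whole rows (output neurons) with the smallest L1 norm; the matrix shrinks."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep_rows])  # preserve the original row order
    return [weights[i][:] for i in kept]


W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
    [0.1, -0.2, 0.05],
]

sparse_w = unstructured_prune(W, sparsity=0.5)
dense_w = structured_prune(W, keep_rows=2)

print(len(sparse_w), "x", len(sparse_w[0]), "matrix, still full size but with zeros")
print(len(dense_w), "x", len(dense_w[0]), "matrix, genuinely smaller")
```

The unstructured result still needs the same number of multiply-adds on ordinary dense kernels, while the structured result is a smaller dense matrix that any hardware can run faster.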

Knowledge Distillation trains a small student model to mimic the output distribution of a large teacher model. It can achieve higher accuracy in the small model compared to training on labels alone.

python

import torch
import torch.nn.functional as F
 
def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    # KL divergence measures how far the student's distribution is from the teacher's
    # (it is asymmetric, so not a true distance).
    # Guides the student to imitate the teacher model's soft probability distribution.
    # detach() keeps gradients from flowing back into the teacher.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits.detach() / temperature, dim=1),
        reduction="batchmean"
    ) * (temperature ** 2)
 
    # Hard target loss: train on actual labels
    hard_loss = F.cross_entropy(student_logits, labels)
 
    return alpha * soft_loss + (1 - alpha) * hard_loss
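
To see what the `temperature` argument does to the teacher's output, here is a small pure-Python sketch of a temperature-scaled softmax. The logits are made-up values for illustration and the helper name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 1.0]  # hypothetical teacher output for one sample

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print([round(p, 4) for p in hard])
print([round(p, 4) for p in soft])
```

At temperature 1 almost all probability mass sits on the top class, so the student learns little beyond the label. At temperature 4 the ranking is unchanged but the runner-up classes receive visible probability, which is the extra signal distillation relies on. The `temperature ** 2` factor in the loss is commonly used to keep the soft-loss gradients on a comparable scale, since dividing logits by the temperature shrinks them.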

Sequentially applying a composite pipeline (pruning → quantization → distillation) enables significant compression. According to Promwad's analysis, a representative image classification benchmark achieved a 74% reduction in parameters with less than 3% accuracy loss (AI Model Compression — Promwad). It's important to define acceptable loss thresholds per task in advance.

Choosing an Inference Runtime

Honestly, this was the most confusing part for me. There are so many runtimes.

Runtime	Best For	Key Features
LiteRT (formerly TFLite)	Android, embedded Linux	NPU utilization via NNAPI and GPU delegates
Core ML	iOS, macOS	Full Apple Neural Engine utilization, native Swift
ONNX Runtime	Cross-platform, browser	Framework-neutral, browser inference possible via WASM
ExecuTorch	PyTorch ecosystem	Already deployed by Meta in Instagram and WhatsApp
llama.cpp	Desktop, embedded LLMs	CPU-optimized, GGUF format-based

When the platform is fixed, the choice becomes easier. If it's iOS, Core ML; if you need broad Android support, LiteRT; if you trained with PyTorch and need to support multiple backends, ExecuTorch is the realistic choice. The examples below cover Android (Example 1) and cross-platform Python (Examples 2 and 3) respectively.

Practical Application

Example 1: Image Classification on Android with LiteRT

If you're developing an Android app, this example is a good place to start. iOS developers will find the Core ML API structure similar, so you can refer to the concepts and skip to Example 2.

This structure classifies a live camera feed in real time while leveraging the device NPU via NNAPI. I remember running it without NnApiDelegate at first and getting inference times more than 3x my target; you really have to experience it firsthand to appreciate how much difference that single delegate line makes. One caveat: NNAPI is deprecated as of Android 15, so treat the delegate choice as device- and OS-dependent and check the current LiteRT guidance for newer targets.

kotlin

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import android.content.res.AssetManager
import android.graphics.Bitmap
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
 
class EdgeInferenceEngine(private val assetManager: AssetManager, modelFileName: String) {
 
    private val nnApiDelegate = NnApiDelegate()
    private val interpreter: Interpreter
    private val inputSize = 224
    private val numClasses = 1000
 
    init {
        val options = Interpreter.Options().apply {
            addDelegate(nnApiDelegate)
            // Handle NPU fallback with CPU multi-threading
            setNumThreads(4)
        }
        interpreter = Interpreter(loadModelFile(modelFileName), options)
    }
 
    fun classify(bitmap: Bitmap): FloatArray {
        val inputBuffer = preprocessImage(bitmap)
        val outputBuffer = Array(1) { FloatArray(numClasses) }
        interpreter.run(inputBuffer, outputBuffer)
        return outputBuffer[0]
    }
 
    // NnApiDelegate implements AutoCloseable, so it must be released.
    // Call this from the Activity or Fragment's onDestroy().
    fun close() {
        interpreter.close()
        nnApiDelegate.close()
    }
 
    private fun loadModelFile(fileName: String): MappedByteBuffer {
        // openFd() only works for assets stored uncompressed in the APK.
        // The mapping stays valid after the descriptor and stream are closed.
        return assetManager.openFd(fileName).use { fileDescriptor ->
            FileInputStream(fileDescriptor.fileDescriptor).use { inputStream ->
                inputStream.channel.map(
                    FileChannel.MapMode.READ_ONLY,
                    fileDescriptor.startOffset,
                    fileDescriptor.declaredLength
                )
            }
        }
    }
 
    private fun preprocessImage(bitmap: Bitmap): ByteBuffer {
        val resized = Bitmap.createScaledBitmap(bitmap, inputSize, inputSize, true)
        // allocateDirect: allocates memory outside the JVM heap to eliminate JNI copy overhead.
        // nativeOrder(): aligns with the device CPU's byte order (endianness).
        val buffer = ByteBuffer.allocateDirect(4 * 3 * inputSize * inputSize)
        buffer.order(ByteOrder.nativeOrder())
 
        val pixels = IntArray(inputSize * inputSize)
        resized.getPixels(pixels, 0, inputSize, 0, 0, inputSize, inputSize)
 
        for (pixel in pixels) {
            // Must match the ImageNet normalization parameters used during training for accurate results.
            buffer.putFloat(((pixel shr 16 and 0xFF) / 255.0f - 0.485f) / 0.229f)
            buffer.putFloat(((pixel shr 8 and 0xFF) / 255.0f - 0.456f) / 0.224f)
            buffer.putFloat(((pixel and 0xFF) / 255.0f - 0.406f) / 0.225f)
        }
        // Reset the position so the buffer is read from the start.
        buffer.rewind()
        return buffer
    }
}

Code Point	Description
`NnApiDelegate()`	Delegates computation to the device NPU/DSP via Android NNAPI; falls back to CPU if unavailable
`loadModelFile()`	Loads the model file inside the APK as a memory map via `AssetManager`
`allocateDirect` + `nativeOrder`	Passes the buffer to native code without copy overhead at the JNI boundary
`close()`	`NnApiDelegate` holds native resources and must be explicitly released
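
Because the per-pixel arithmetic in `preprocessImage` has to match training exactly, it can help to mirror it in Python and compare a few pixels against the training pipeline. The sketch below reproduces the same unpacking and normalization; the function names are hypothetical, and it assumes pixels arrive as packed ARGB integers the way `Bitmap.getPixels` returns them.

```python
from typing import List, Sequence, Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_argb_pixel(pixel: int) -> Tuple[float, float, float]:
    """Unpack one packed ARGB int into normalized (R, G, B) floats."""
    # Masking with 0xFF also works for negative ints, which is how
    # opaque pixels appear in a signed 32-bit Kotlin Int.
    r = (pixel >> 16) & 0xFF
    g = (pixel >> 8) & 0xFF
    b = pixel & 0xFF
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip((r, g, b), IMAGENET_MEAN, IMAGENET_STD)
    )


def pixels_to_interleaved(pixels: Sequence[int]) -> List[float]:
    """Flatten pixels to R, G, B, R, G, B, ... (HWC order, as the Kotlin loop writes)."""
    out: List[float] = []
    for pixel in pixels:
        out.extend(normalize_argb_pixel(pixel))
    return out


opaque_orange = 0xFFFF8000  # A=255, R=255, G=128, B=0
print(normalize_argb_pixel(opaque_orange))
print(len(pixels_to_interleaved([opaque_orange] * 4)))
```

Note that the Kotlin loop writes channels interleaved per pixel (HWC), whereas the ONNX Runtime example in the next section transposes to CHW. Kotlin computes in 32-bit floats, so expect agreement only to roughly six or seven significant digits when comparing values.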

Example 2: Cross-Platform Inference with ONNX Runtime in Python

If you're an ML engineer or server-side developer, this example is the most familiar starting point. The same code runs on everything from a Raspberry Pi to a Windows desktop.

ONNX is an intermediate format that lets you convert PyTorch or TensorFlow models and run them anywhere.

python

from typing import Any
import onnxruntime as ort
import numpy as np
from PIL import Image
 
def create_session(model_path: str) -> ort.InferenceSession:
    # Auto-select available providers (CUDA > CoreML > CPU in priority)
    providers = ort.get_available_providers()
    print(f"Available execution providers: {providers}")
 
    session_options = ort.SessionOptions()
    session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    session_options.intra_op_num_threads = 4
 
    return ort.InferenceSession(model_path, session_options, providers=providers)
 
def run_inference(session: ort.InferenceSession, image_path: str) -> np.ndarray:
    # Omitting convert("RGB") causes shape mismatches with RGBA PNGs or grayscale images.
    img = Image.open(image_path).convert("RGB").resize((224, 224))
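    input_data = np.asarray(img, dtype=np.float32) / 255.0
    # Use float32 arrays for mean/std: plain Python lists would promote the result
    # to float64, which ONNX Runtime rejects for a float32 model input.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    input_data = (input_data - mean) / std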
    input_data = np.transpose(input_data, (2, 0, 1))   # HWC → CHW
    input_data = np.expand_dims(input_data, axis=0)     # Add batch dimension
 
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})
    return outputs[0]
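
Once inference runs, the next question is how fast it is on the target device. The following is a minimal timing harness using only the standard library. It is a sketch, not a replacement for a proper profiler: the `benchmark` name, the warm-up count, and the percentile choice are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to fn and report latency statistics in milliseconds."""
    # Warm-up runs absorb one-time costs such as lazy initialization and caches.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(round(0.95 * (len(samples_ms) - 1))))
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


# Stand-in workload; replace the lambda with a real inference call.
stats = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=2, runs=20)
print({name: round(value, 3) for name, value in stats.items()})
```

With the functions above, a call such as `benchmark(lambda: run_inference(session, "sample.jpg"))` would time the whole path, including image decoding and preprocessing (the file name is a placeholder). Tail latency (p95, max) is usually more informative than the mean on thermally constrained devices.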

When exporting a PyTorch model to ONNX, the dynamic_axes setting is often overlooked. Without it, the batch size is fixed at export time and you will have to re-export later to run a different batch size.

python

import torch
 
def export_to_onnx(model: torch.nn.Module, save_path: str) -> None:
    model.eval()
    dummy_input = torch.randn(1, 3, 224, 224)
 
    torch.onnx.export(
        model,
        dummy_input,
        save_path,
        export_params=True,
        opset_version=17,
        input_names=["input"],
        output_names=["output"],
        # Keep the batch dimension dynamic on both the input and the output
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
    )
    print(f"ONNX model saved: {save_path}")

Example 3: Hybrid Edge-Cloud Routing

If you're wondering how to split traffic between on-device and cloud, this example will help.

This is a situation you encounter frequently in practice — running all inference on-device isn't always the right answer. A structure that dynamically routes simple requests to the device and complex ones to the cloud creates a sensible balance between cost and performance.

complexity_score is the key variable in the routing decision. It can be computed by combining features such as input text length, model output entropy, previous inference failure rate, and query type. Starting with simple rule-based logic and gradually evolving it into a learned classifier is a realistic approach in production.
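
As a starting point for the rule-based stage, the features listed above could be combined as a clamped weighted sum. Everything specific in this sketch is an assumption made for illustration: the `ComplexityFeatures` fields, the normalization caps, and the weights would all need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class ComplexityFeatures:
    input_length: int           # e.g. number of input tokens or characters
    output_entropy: float       # entropy of the on-device model's last output
    recent_failure_rate: float  # fraction of recent on-device results rejected
    is_open_ended: bool         # query-type flag (hypothetical classification)


def complexity_score(
    features: ComplexityFeatures,
    max_length: int = 512,      # assumed cap for length normalization
    max_entropy: float = 3.0,   # assumed cap for entropy normalization
) -> float:
    """Combine the features into a single score in the range 0.0 to 1.0."""
    length_term = min(max(features.input_length, 0) / max_length, 1.0)
    entropy_term = min(max(features.output_entropy, 0.0) / max_entropy, 1.0)
    failure_term = min(max(features.recent_failure_rate, 0.0), 1.0)
    type_term = 1.0 if features.is_open_ended else 0.0

    # Hypothetical weights; they sum to 1.0 so the score stays in [0, 1].
    score = (
        0.3 * length_term
        + 0.3 * entropy_term
        + 0.2 * failure_term
        + 0.2 * type_term
    )
    return min(max(score, 0.0), 1.0)


simple = ComplexityFeatures(20, 0.2, 0.0, False)
hard = ComplexityFeatures(600, 2.8, 0.5, True)
print(round(complexity_score(simple), 3), round(complexity_score(hard), 3))
```

The resulting value is what would be stored in the `complexity_score` field of the request. Keeping the scoring in one pure function makes it straightforward to swap in a learned classifier later without touching the routing logic.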

python

from typing import Any
import asyncio
from dataclasses import dataclass
from enum import Enum
 
class InferenceTarget(Enum):
    DEVICE = "device"
    CLOUD = "cloud"
 
@dataclass
class InferenceRequest:
    input_data: Any
    # 0.0~1.0: computed by combining input length, output entropy, prior failure rate, etc.
    complexity_score: float
    requires_privacy: bool
    latency_budget_ms: float
 
class HybridInferenceRouter:
    COMPLEXITY_THRESHOLD = 0.7
    LATENCY_THRESHOLD_MS = 50.0
 
    def __init__(self, local_model: Any, cloud_client: Any) -> None:
        self.local_model = local_model
        self.cloud_client = cloud_client
 
    def route(self, request: InferenceRequest) -> InferenceTarget:
        # Always route locally when privacy is required
        if request.requires_privacy:
            return InferenceTarget.DEVICE
 
        # Route locally when latency budget is tight
        if request.latency_budget_ms < self.LATENCY_THRESHOLD_MS:
            return InferenceTarget.DEVICE
 
        # Route to cloud when complexity is high
        if request.complexity_score > self.COMPLEXITY_THRESHOLD:
            return InferenceTarget.CLOUD
 
        return InferenceTarget.DEVICE
 
    async def infer(self, request: InferenceRequest) -> Any:
        # local_model.predict and cloud_client.predict are assumed to be coroutines (async def).
        target = self.route(request)
 
        if target == InferenceTarget.DEVICE:
            return await self.local_model.predict(request.input_data)
        else:
            return await self.cloud_client.predict(request.input_data)

N-iX's edge AI trend analysis points the same way: hybrid configurations significantly reduce energy use and cost compared with a pure-cloud setup (Key edge AI trends, N-iX). You don't have to process everything on-device.

Pros and Cons Analysis

Advantages

Item	Details
Ultra-low latency	Eliminates network round-trips, cutting response time by up to 70% — the only viable option in scenarios like AR and autonomous driving where sub-20ms is mandatory
Offline operation	Full functionality without an internet connection — works on subways, in airplanes, and at remote sites
Data privacy	Sensitive data never leaves the device — advantageous for regulatory compliance in medical and financial sectors
Cost reduction	Eliminates cloud API call costs and bandwidth costs
Reliability	Unaffected by server downtime or network failures

Disadvantages and Caveats

Item	Details	Mitigation
Model accuracy degradation	A 2–4% accuracy loss is typical when combining quantization, pruning, and distillation	Define acceptable loss thresholds per task in advance; minimize loss with QAT
Hardware fragmentation	Android alone has thousands of chipset variants — NNAPI, Qualcomm QNN, and MediaTek APU each require separate testing	Narrow down the list of target devices and include real-device tests in CI
Memory constraints	Model file size, runtime memory, and peak memory must be measured separately	Verify peak memory with a Profiler before deployment
Update complexity	Cloud updates take effect immediately via server redeployment, but on-device requires app updates or a separate OTA pipeline	Separate model files from the app binary to enable OTA updates
Security threats	Storing model files on a device exposes them to reverse engineering and model extraction attacks	Consider model encryption and Secure Enclave utilization
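
For the OTA mitigation in the table above, one basic safeguard is to verify a downloaded model file before swapping it in. The sketch below assumes a hypothetical flow in which the server publishes a SHA-256 digest alongside each model version. It covers integrity only, not encryption or signing, and the function and file names are illustrative.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large models are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_model_update(downloaded_path: str, expected_sha256: str, active_path: str) -> bool:
    """Replace the active model only if the download matches the expected digest."""
    if sha256_of_file(downloaded_path) != expected_sha256.lower():
        os.remove(downloaded_path)  # discard the corrupt or tampered file
        return False
    # os.replace swaps the file in one step, so readers never see a partial file.
    os.replace(downloaded_path, active_path)
    return True


# Demo with a temporary directory standing in for app storage.
with tempfile.TemporaryDirectory() as tmp:
    active = os.path.join(tmp, "model.tflite")
    with open(active, "wb") as f:
        f.write(b"model-v1")

    good = os.path.join(tmp, "download_good.bin")
    with open(good, "wb") as f:
        f.write(b"model-v2")
    installed = install_model_update(good, hashlib.sha256(b"model-v2").hexdigest(), active)

    bad = os.path.join(tmp, "download_bad.bin")
    with open(bad, "wb") as f:
        f.write(b"corrupted")
    rejected = not install_model_update(bad, hashlib.sha256(b"model-v3").hexdigest(), active)

    with open(active, "rb") as f:
        active_bytes = f.read()

print(installed, rejected, active_bytes)
```

The same check-then-swap pattern applies on Android or iOS with the platform's own file and crypto APIs. Downloading to a temporary path first means a failed or interrupted update never leaves the app without a usable model.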

NPU (Neural Processing Unit): A dedicated chip specialized for AI matrix operations. Unlike a GPU, which handles general-purpose parallel computation, an NPU is extremely optimized for only the operations needed in AI inference, resulting in far better performance-per-watt. Qualcomm Hexagon, Apple Neural Engine, and Samsung Exynos NPU are prime examples.

The Most Common Mistakes in Practice

Looking only at file size without measuring memory — A 100 MB model can use 400 MB at runtime. It is strongly recommended to always measure model file size, runtime memory, and peak memory separately.
Training and inference preprocessing getting out of sync — Even a small difference in normalization parameters, channel order (HWC vs CHW), or pixel value range (0–1 vs 0–255) causes accuracy to plummet. It is recommended to include preprocessing code in the model conversion pipeline and manage its version alongside the model.
Judging performance based on a development machine — Inference that takes 50 ms on an M4 MacBook can take 800 ms on a three-year-old budget Android phone. Benchmarking on the actual target device is essential.
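
One way to act on the second point is to describe preprocessing as data that ships with the model and to compare fingerprints at load time. This is a sketch under assumptions: the `PreprocessConfig` fields and the idea of storing the fingerprint next to the model file are illustrative choices, not an established convention.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class PreprocessConfig:
    input_size: int = 224
    channel_order: str = "CHW"   # "CHW" or "HWC"
    pixel_range: str = "0-1"     # "0-1" or "0-255"
    mean: Tuple[float, float, float] = (0.485, 0.456, 0.406)
    std: Tuple[float, float, float] = (0.229, 0.224, 0.225)

    def fingerprint(self) -> str:
        """Stable short hash of the settings, suitable for storing with the model."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def assert_same_preprocessing(training: PreprocessConfig, inference: PreprocessConfig) -> None:
    """Raise with a field-level diff if the two pipelines disagree."""
    if training.fingerprint() == inference.fingerprint():
        return
    train_dict, infer_dict = asdict(training), asdict(inference)
    diffs = {
        key: (value, infer_dict[key])
        for key, value in train_dict.items()
        if value != infer_dict[key]
    }
    raise ValueError(f"Preprocessing mismatch: {diffs}")


training_cfg = PreprocessConfig()
assert_same_preprocessing(training_cfg, PreprocessConfig())  # identical, passes silently

try:
    assert_same_preprocessing(training_cfg, PreprocessConfig(channel_order="HWC"))
except ValueError as err:
    print(err)
```

Failing loudly at startup is much cheaper than discovering a silent accuracy drop in production, and the fingerprint gives the model and its preprocessing a single version to track together.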

Closing Thoughts

On-device inference is not simply "shrinking a cloud model and putting it on-device," because the design philosophy itself is different. Cloud AI is designed to elastically scale compute resources. On-device, by contrast, treats memory limits, power budgets, and hardware fragmentation as constraints from the outset, and focuses on extracting the best possible accuracy within them. Model architecture selection, compression order, runtime choice, hardware testing strategy, and the model update pipeline all add layers of consideration that cloud deployment does not have. Whether these constraints are embraced at the start of design or encountered right before deployment often determines the success or failure of a project.

Three steps you can try right now:

Try model conversion: Pick a familiar TensorFlow or PyTorch model and convert it with tf.lite.TFLiteConverter or torch.onnx.export. Simply seeing the file size difference before and after conversion gives you an intuitive feel for it quickly.
Measure benchmarks: Run the converted model locally with ONNX Runtime or LiteRT and time the latency. Attaching only a CPU provider to ort.InferenceSession is enough to build a complete inference pipeline in 10 lines.
Apply quantization and compare accuracy: Apply INT8 quantization with PTQ and compare accuracy against the original model to develop an intuition for "how much loss you can accept." Qualcomm AI Hub lets you try Snapdragon-targeted profiling on its free tier, and Edge Impulse lets you go all the way to IoT-targeted deployment on a free plan. If you want to start without a cloud platform, you can also enable the profiling option in ort.SessionOptions and measure directly on your local machine.

#EdgeAI#OnDeviceInference#ModelCompression#Quantization#ONNXRuntime#LiteRT#KnowledgeDistillation#Pruning#NPU#HybridEdgeCloud
