Deploying LLM Streaming API with Hono + Cloudflare Workers — How to Run a Type-Safe AI Layer at the Edge
There was a time when I wondered, "Is this right?" as I watched an LLM proxy built with Express take 2 seconds per request on Vercel due to cold starts. After that, I tried to force SSE into NestJS using @nestjs/event-emitter, but the structure became too complex, so I eventually gave up. It was right at that moment that I first encountered Hono, and now I reach for it almost reflexively when designing AI API layers.
The core message of this article is simple: thanks to a Web Standards-based design, a single set of code runs anywhere from the edge to Lambda, enabling LLM streaming and type safety with minimal code. There is a reason the combination of a React frontend, a Hono API on Cloudflare Workers, and a model provider has established itself as the standard stack for AI startups as of 2026. Companies like Cloudflare, Deno, Clerk, and Unkey are already using it in production, and both Vercel and Cloudflare provide official documentation.
Basic concepts of TypeScript and REST APIs are sufficient to follow along with this article. It is okay if you do not have experience with NestJS or Vercel.
Key Concepts
Why Hono is a good fit for AI APIs
Hono is a TypeScript web framework measuring approximately 14KB. Looking at the numbers alone, you might think, "Isn't it just a smaller version of Express?", but its design philosophy is different.
| Feature | Express/Fastify | Hono |
|---|---|---|
| Underlying API | Node.js `http` module | Web Standards (`Request`/`Response`) |
| Runtime | Node.js only | Workers, Bun, Deno, Lambda, Node.js |
| Streaming | Separate setup required | `streamSSE()` helper built-in |
| Type-safe RPC | None | End-to-end typed client via `hc` |
| Cold start | Runtime-dependent | Near zero (on Workers) |
AI APIs are inherently I/O-bound workloads. While the LLM generates tokens, the server mostly just waits for the result and relays it to the client. For such workloads, the startup and memory overhead of a full Node.js server process is wasted cost. Edge environments like Cloudflare Workers, by contrast, run code in lightweight V8 isolates to minimize startup latency, and geographically distributed edge nodes shorten the physical distance to the user. This is why an ultra-lightweight framework is a perfect fit for this workload.
SSE (Server-Sent Events) — This is a method where the server keeps an HTTP connection open and sends a unidirectional stream of events to the client. A typical example is the ChatGPT-style response, which is delivered to the client in real-time whenever an LLM generates a token.
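To make the data flow concrete, here is a minimal client-side sketch of consuming such a stream with the standard fetch API. The `/api/chat` path, the `{ text }` event payload, and the `[DONE]` terminator are assumptions borrowed from Example 1 later in this article; adjust them to your own endpoint.

```ts
// Minimal SSE consumption sketch (assumes the /api/chat endpoint and event
// format from Example 1 below; a production parser should also buffer
// partial lines that get split across chunks).
async function readChatStream(prompt: string): Promise<string> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader()
  let full = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // Each SSE event arrives as one or more "data: ..." lines
    for (const line of value.split('\n')) {
      if (!line.startsWith('data: ')) continue
      const data = line.slice('data: '.length)
      if (data === '[DONE]') return full
      full += JSON.parse(data).text
    }
  }
  return full
}
```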
Benefits of a Web Standards-based System
The Request a Hono route handler receives and the Response it returns are the same standard objects used by the browser's fetch(). This matters because code written for Cloudflare Workers works exactly as-is on Bun or Vercel Edge. Express, which sits on the Node.js-specific http module, needs a separate adapter for each runtime; Hono does not.
// This code runs identically on Cloudflare Workers, Bun, and Vercel Edge
import { Hono } from 'hono'
const app = new Hono()
app.get('/health', (c) => {
  return c.json({ status: 'ok', runtime: 'any' })
})
export default app

Not having to relearn the framework API when you switch AI model providers or move deployment platforms is a bigger advantage than it sounds. It simply means you can "develop with Bun first and move to Workers later" early in a project.
Practical Application
Example 1: LLM Token Streaming Endpoint
This is the most fundamental pattern. It is a structure used when you need to proxy external LLM APIs (OpenAI, Anthropic, etc.) and deliver tokens to the client in real time. When I first introduced this pattern to the team, the most common reaction was, "Is it really okay for this to be this short?"
import { Hono } from 'hono'
import { streamSSE } from 'hono/streaming'
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
const app = new Hono()
const chatSchema = z.object({
  messages: z.array(
    z.object({
      role: z.enum(['user', 'assistant']),
      content: z.string(),
    })
  ),
})
app.post('/api/chat', zValidator('json', chatSchema), async (c) => {
  const { messages } = c.req.valid('json')
  return streamSSE(c, async (stream) => {
    try {
      const { textStream } = await streamText({
        model: openai('gpt-4o'),
        messages,
      })
      for await (const text of textStream) {
        await stream.writeSSE({ data: JSON.stringify({ text }) })
      }
      await stream.writeSSE({ data: '[DONE]' })
    } catch (err) {
      // If an error occurs mid-stream, the 200 status and headers have already
      // been sent, so the error must be delivered as an event on the stream
      await stream.writeSSE({
        event: 'error',
        data: JSON.stringify({ message: 'An LLM API error occurred' }),
      })
    }
  })
})
export default app

| Code Point | Role |
|---|---|
| `zValidator('json', chatSchema)` | Validates the request body against the Zod schema and automatically returns 400 on failure |
| `c.req.valid('json')` | Retrieves the validated data in a type-safe way |
| `streamSSE(c, ...)` | Hono handles SSE header configuration and stream management |
| `stream.writeSSE({ event: 'error', ... })` | Pattern for delivering errors to the client mid-stream |
zValidator Middleware — Hono-specific middleware provided by the @hono/zod-validator package. It automatically returns a 400 response on validation failure, and downstream handlers receive the request data already narrowed to the validated type.
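If you want the 400 body to follow your own error format instead of the default, zValidator also accepts a hook as its third argument. A minimal sketch, with the caveat that the error shape below is my own convention and the hook's result type can differ slightly between package versions:

```ts
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'

const chatSchema = z.object({
  messages: z.array(
    z.object({ role: z.enum(['user', 'assistant']), content: z.string() })
  ),
})

// The third argument runs after validation; returning a Response here
// overrides the default 400 body with a consistent error envelope
export const validateChat = zValidator('json', chatSchema, (result, c) => {
  if (!result.success) {
    return c.json({ error: 'VALIDATION_ERROR', issues: result.error.issues }, 400)
  }
})
```

Reusing a validator like validateChat across routes keeps every endpoint failing in the same shape, which pairs well with the global error handler discussed later.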
Example 2: Direct integration with Cloudflare Workers AI
This is a useful pattern when external API costs are burdensome or when you want to quickly create prototypes. Cloudflare Workers has built-in AI models accessible via the c.env.AI binding, allowing you to utilize Cloudflare GPUs without a separate API key.
import { Hono } from 'hono'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
type Bindings = {
  AI: Ai
}
const app = new Hono<{ Bindings: Bindings }>()
const generateSchema = z.object({ prompt: z.string().min(1) })
app.post('/api/generate', zValidator('json', generateSchema), async (c) => {
  const { prompt } = c.req.valid('json')
  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt,
    stream: true,
  })
  // The Workers AI type definitions don't narrow the return value to a
  // ReadableStream when stream: true, so a type assertion is needed
  return new Response(response as ReadableStream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  })
})
export default app

You can use it immediately by adding the AI binding to wrangler.toml:
[ai]
binding = "AI"

Model Availability Note — @cf/meta/llama-3.1-8b-instruct is current as of the time of writing. The list of supported models and the cost policies for Cloudflare Workers AI are subject to change, so we recommend checking the official documentation for the current status.
Example 3: Communication between microservices with type-safe RPC
As AI applications grow, you will want to separate APIs by role, such as embedding services and summarization services. This is where Hono's RPC feature comes in handy.
If you are already using tRPC, the question "Why Hono RPC?" will naturally arise. tRPC is more specialized for frontend-backend integration, and OpenAPI codegen requires a separate build step. Since Hono RPC allows clients to directly import server types, it can maintain type safety between services without the need for code generation tools. If your microservices are already built using Hono, it is the choice with the least friction.
Honestly, when I first saw it, I thought, "Is this actually possible?"
// embed-service/src/index.ts
import { Hono } from 'hono'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
const embedSchema = z.object({ text: z.string().min(1) })
const app = new Hono()
const route = app.post(
  '/api/embed',
  zValidator('json', embedSchema),
  async (c) => {
    const { text } = c.req.valid('json')
    // Swap in real embedding logic for whichever model provider you use
    const embedding: number[] = Array(1536).fill(0) // dummy implementation
    return c.json({ embedding, dimensions: embedding.length })
  }
)
// Export the server type; the client imports this type
export type AppType = typeof route
export default app

// another-service/src/client.ts
import { hc } from 'hono/client'
import type { AppType } from '../../embed-service/src/index'
const client = hc<AppType>('https://embed-service.workers.dev')
async function getEmbedding(text: string) {
  // Autocomplete and type checking work in the IDE
  const res = await client.api.embed.$post({ json: { text } })
  const { embedding } = await res.json()
  return embedding
}

Hono RPC — A method that reuses server route types directly on the client without an OpenAPI specification or separate code generation. You can think of it as bringing gRPC-style type safety to REST.
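Two details worth knowing when you build on this pattern: the res returned by the RPC client is a standard Response, so non-2xx results still need an explicit check, and Hono ships InferRequestType/InferResponseType helpers for deriving types from the same route. A small sketch based on the example above:

```ts
import { hc } from 'hono/client'
import type { InferRequestType, InferResponseType } from 'hono/client'
import type { AppType } from '../../embed-service/src/index'

const client = hc<AppType>('https://embed-service.workers.dev')

// Derive the request and response shapes from the route type, no hand-written DTOs
type EmbedRequest = InferRequestType<typeof client.api.embed.$post>['json']
type EmbedResponse = InferResponseType<typeof client.api.embed.$post>

async function safeGetEmbedding(body: EmbedRequest): Promise<EmbedResponse | null> {
  const res = await client.api.embed.$post({ json: body })
  // res is a plain Response, so failures still need an explicit check
  if (!res.ok) return null
  return await res.json()
}
```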
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Ultra-lightweight | ~14KB with minimal dependencies; near-zero cold start in Workers environments |
| Runtime-independent | Deploy to Workers, Bun, Lambda, and Vercel Edge from a single codebase |
| Built-in streaming | Simple LLM streaming with the `streamSSE()` and `streamText()` helpers |
| Type-safe RPC | End-to-end type safety via the `hc` client without code generation |
| Middleware ecosystem | Robust official/third-party middleware: Bearer Auth, CORS, rate limiting, Zod Validator, and more |
| Official integrations | Official docs and templates for Cloudflare Workers and the Vercel AI SDK |
Establishing team conventions without DI can feel daunting at first, but simply splitting routers into files by domain solves the problem quite well. For services with clear roles, like the AI API layer, I actually found it better to have no unnecessary clutter.
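For a concrete picture of that convention, here is a minimal sketch of the domain-split layout using app.route() to compose routers; the route names are illustrative, and in a real project each sub-router would live in its own file under src/routes/.

```ts
import { Hono } from 'hono'

// One router per domain (normally one file per domain, e.g. src/routes/chat.ts)
const chatRoutes = new Hono()
  .post('/', (c) => c.json({ reply: 'chat handler lives here' }))

const embedRoutes = new Hono()
  .post('/', (c) => c.json({ embedding: [] as number[] }))

// Compose the domain routers in one place; this plays the role a DI container
// or module system would play in NestJS
const app = new Hono()
  .route('/api/chat', chatRoutes)
  .route('/api/embed', embedRoutes)

export default app
```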
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| No DI or decorators | Lacks NestJS-style structuring features; larger teams need to establish their own conventions | Split routers into domain-specific files and document team conventions |
| WebSocket limitations | WebSocket support is limited in Cloudflare Workers environments | Cover unidirectional cases with SSE; consider Durable Objects if bidirectional communication is essential |
| No state management | State cannot be kept between requests in serverless/edge environments | Combine with external stores such as Cloudflare KV, D1, or Upstash Redis |
| Rate limiting is yours to implement | Developers must add AI API cost-control logic themselves | Use the hono-rate-limiter or workers-hono-rate-limit middleware |
Durable Objects — Stateful serverless computing units provided by Cloudflare Workers. While standard Workers are stateless, Durable Objects can maintain memory and storage between requests and are used for WebSocket session management.
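If bidirectional communication really is required, the following is a rough sketch of the Durable Object WebSocket pattern; the ChatSession name and the echo behavior are placeholders, and the class still has to be registered as a binding (plus a migration) in wrangler.toml.

```ts
// Sketch of a Durable Object that accepts a WebSocket and echoes messages back.
// Requires @cloudflare/workers-types; ChatSession is a placeholder name.
export class ChatSession {
  constructor(private state: DurableObjectState, private env: unknown) {}

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get('Upgrade') !== 'websocket') {
      return new Response('Expected a WebSocket upgrade', { status: 426 })
    }

    const pair = new WebSocketPair()
    const [client, server] = Object.values(pair)

    // Unlike a stateless Worker, this object keeps the connection and any
    // in-memory session state alive between messages
    server.accept()
    server.addEventListener('message', (event) => {
      server.send(`echo: ${event.data}`)
    })

    return new Response(null, { status: 101, webSocket: client })
  }
}
```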
The Most Common Mistakes in Practice
- Omitting Rate Limiting — AI APIs incur costs per token. If you leave LLM endpoints open without authentication and rate limiting, you may face a massive bill. It is best to include `hono-rate-limiter` from the start.
- Inconsistent Error Response Formats — If validation errors, LLM API errors, and internal errors go out in different formats, they become difficult for the client to handle. Define a global error handler with the `app.onError()` hook (see the sketch after this list).
- Missing Exception Handling During Streaming — As in the code in Example 1, if the LLM API call fails inside the `streamSSE` callback, you cannot signal the error with a status code because the 200 response header has already been sent. Debugging becomes much easier later if you pre-define a pattern of sending a special event such as `event: 'error'` on the stream and having the client handle it.
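A minimal sketch that combines the first two items, a shared rate limiter plus a global error handler. The hono-rate-limiter option names follow its README and may differ by version, and the key-generation strategy and error envelope here are assumptions you should adapt:

```ts
import { Hono } from 'hono'
import { HTTPException } from 'hono/http-exception'
import { rateLimiter } from 'hono-rate-limiter'

const app = new Hono()

// Rate-limit every AI route before any token is ever generated
app.use(
  '/api/*',
  rateLimiter({
    windowMs: 60_000,
    limit: 20,
    keyGenerator: (c) => c.req.header('x-forwarded-for') ?? 'anonymous',
  })
)

// One global error handler so every non-streaming error leaves in the same shape
app.onError((err, c) => {
  if (err instanceof HTTPException) {
    return err.getResponse()
  }
  console.error(err)
  return c.json({ error: 'INTERNAL_ERROR', message: 'Something went wrong' }, 500)
})

export default app
```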
In Conclusion
Hono addresses the three requirements for the AI API layer—streaming, type safety, and edge execution—within a single framework, making it the most practical choice at present. It is difficult to find another option that can handle these three things simultaneously without cold starts, code generation tools, or runtime replacements.
3 Steps to Start Right Now:
- Project Initialization: Running `pnpm create hono@latest my-ai-api` lets you select a runtime via the interactive CLI. If you select Cloudflare Workers, `wrangler` is configured as well.
- AI SDK Connection: Add the Vercel AI SDK with `pnpm add ai @ai-sdk/openai`, paste the streaming endpoint code from Example 1 into `src/index.ts`, and run `pnpm dev`. It works immediately with just `OPENAI_API_KEY` in `.env`.
- Edge Deployment Verification: `pnpm run deploy` (on Cloudflare Workers) deploys to the global edge in a single step. You can feel the difference by comparing response times against a local Node.js server using load-testing tools such as `wrk` or `hey`.
Reference Materials
If you are just starting out, you can begin by looking at these three things:
- Hono Official Documentation | hono.dev — The first place to read when starting out
- Hono RPC Guide | hono.dev — If you want to learn more about type-safe communication between services
- Cloudflare Workers AI — Vercel AI SDK Integration | cloudflare.com — Workers AI Official Integration Setup