Deploying LLM Streaming API with Hono + Cloudflare Workers — How to Run a Type-Safe AI Layer at the Edge
There was a time when I wondered, "Is this right?" as I watched an LLM proxy built with Express take 2 seconds per request on Vercel due to cold starts. After that, I tried to force SSE into NestJS using @nestjs/event-emitter, but the structure became too complex, so I eventually gave up. It was right at that moment that I first encountered Hono, and now I reach for it almost reflexively when designing AI API layers.
The core message of this article is simple: thanks to a Web Standards-based design, a single set of code runs anywhere from the edge to Lambda, enabling LLM streaming and type safety with minimal code. There is a reason the combination of a React frontend, a Hono API on Cloudflare Workers, and a model provider has established itself as the standard stack for AI startups as of 2026. Companies like Cloudflare, Deno, Clerk, and Unkey are already using it in production, and both Vercel and Cloudflare provide official documentation.
Basic concepts of TypeScript and REST APIs are sufficient to follow along with this article. It is okay if you do not have experience with NestJS or Vercel.
Key Concepts
Why Hono is a good fit for AI APIs
Hono is a TypeScript web framework measuring approximately 14KB. Looking at the numbers alone, you might think, "Isn't it just a smaller version of Express?", but its design philosophy is different.
| Feature | Express/Fastify | Hono |
|---|---|---|
| Underlying API | Node.js `http` module | Web Standards (`Request`/`Response`) |
| Runtime | Node.js only | Workers, Bun, Deno, Lambda, Node.js |
| Streaming | Separate setup required | `streamSSE()` helper built-in |
| Type-safe RPC | None | End-to-end typed client via `hc` |
| Cold start | Runtime-dependent | Near zero (on Workers) |
AI APIs are inherently I/O-bound workloads. While the LLM generates tokens, the server mostly just waits for the result and relays it to the client. For such workloads, the startup and memory overhead of a full Node.js server process is wasted cost. Edge environments like Cloudflare Workers, by contrast, run code in lightweight V8 isolates to minimize startup latency, and geographically distributed edge nodes shorten the physical distance to the user. This is why an ultra-lightweight framework is a perfect fit for this workload.
SSE (Server-Sent Events) — This is a method where the server keeps an HTTP connection open and sends a unidirectional stream of events to the client. A typical example is the ChatGPT-style response, which is delivered to the client in real-time whenever an LLM generates a token.
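To make the data flow concrete, here is a minimal client-side sketch of consuming such a stream with the standard fetch API. The `/api/chat` path, the `{ text }` event payload, and the `[DONE]` terminator are assumptions borrowed from Example 1 later in this article; adjust them to your own endpoint.

```ts
// Minimal SSE consumption sketch (assumes the /api/chat endpoint and event
// format from Example 1 below; a production parser should also buffer
// partial lines that get split across chunks).
async function readChatStream(prompt: string): Promise<string> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader()
  let full = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // Each SSE event arrives as one or more "data: ..." lines
    for (const line of value.split('\n')) {
      if (!line.startsWith('data: ')) continue
      const data = line.slice('data: '.length)
      if (data === '[DONE]') return full
      full += JSON.parse(data).text
    }
  }
  return full
}
```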
Benefits of a Web Standards-based System
The Request a Hono route handler receives and the Response it returns are the same standard objects used by the browser's fetch(). This matters because code written for Cloudflare Workers works exactly as-is on Bun or Vercel Edge. Express, which sits on the Node.js-specific http module, needs a separate adapter for each runtime; Hono does not.
// This code runs identically on Cloudflare Workers, Bun, and Vercel Edge
import { Hono } from 'hono'
const app = new Hono()
app.get('/health', (c) => {
  return c.json({ status: 'ok', runtime: 'any' })
})
export default app

Not having to relearn the framework API when you switch AI model providers or move deployment platforms is a bigger advantage than it sounds. It simply means you can "develop with Bun first and move to Workers later" early in a project.
Practical Application
Example 1: LLM Token Streaming Endpoint
This is the most fundamental pattern. It is a structure used when you need to proxy external LLM APIs (OpenAI, Anthropic, etc.) and deliver tokens to the client in real time. When I first introduced this pattern to the team, the most common reaction was, "Is it really okay for this to be this short?"
import { Hono } from 'hono'
import { streamSSE } from 'hono/streaming'
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
const app = new Hono()
const chatSchema = z.object({
  messages: z.array(
    z.object({
      role: z.enum(['user', 'assistant']),
      content: z.string(),
    })
  ),
})
app.post('/api/chat', zValidator('json', chatSchema), async (c) => {
  const { messages } = c.req.valid('json')
  return streamSSE(c, async (stream) => {
    try {
      const { textStream } = await streamText({
        model: openai('gpt-4o'),
        messages,
      })
      for await (const text of textStream) {
        await stream.writeSSE({ data: JSON.stringify({ text }) })
      }
      await stream.writeSSE({ data: '[DONE]' })
    } catch (err) {
      // If an error occurs mid-stream, the 200 status and headers have already
      // been sent, so the error must be delivered as an event on the stream
      await stream.writeSSE({
        event: 'error',
        data: JSON.stringify({ message: 'An LLM API error occurred' }),
      })
    }
  })
})
export default app

| Code Point | Role |
|---|---|
| `zValidator('json', chatSchema)` | Validates the request body against the Zod schema and automatically returns 400 on failure |
| `c.req.valid('json')` | Retrieves the validated data in a type-safe way |
| `streamSSE(c, ...)` | Hono handles SSE header configuration and stream management |
| `stream.writeSSE({ event: 'error', ... })` | Pattern for delivering errors to the client mid-stream |
zValidator Middleware — Hono-specific middleware provided by the @hono/zod-validator package. It automatically returns a 400 response on validation failure, and downstream handlers receive the request data already narrowed to the validated type.
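If you want the 400 body to follow your own error format instead of the default, zValidator also accepts a hook as its third argument. A minimal sketch, with the caveat that the error shape below is my own convention and the hook's result type can differ slightly between package versions:

```ts
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'

const chatSchema = z.object({
  messages: z.array(
    z.object({ role: z.enum(['user', 'assistant']), content: z.string() })
  ),
})

// The third argument runs after validation; returning a Response here
// overrides the default 400 body with a consistent error envelope
export const validateChat = zValidator('json', chatSchema, (result, c) => {
  if (!result.success) {
    return c.json({ error: 'VALIDATION_ERROR', issues: result.error.issues }, 400)
  }
})
```

Reusing a validator like validateChat across routes keeps every endpoint failing in the same shape, which pairs well with the global error handler discussed later.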
Example 2: Direct integration with Cloudflare Workers AI
This is a useful pattern when external API costs are burdensome or when you want to quickly create prototypes. Cloudflare Workers has built-in AI models accessible via the c.env.AI binding, allowing you to utilize Cloudflare GPUs without a separate API key.
import { Hono } from 'hono'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
type Bindings = {
  AI: Ai
}
const app = new Hono<{ Bindings: Bindings }>()
const generateSchema = z.object({ prompt: z.string().min(1) })
app.post('/api/generate', zValidator('json', generateSchema), async (c) => {
  const { prompt } = c.req.valid('json')
  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt,
    stream: true,
  })
  // The Workers AI type definitions don't narrow the return value to a
  // ReadableStream when stream: true, so a type assertion is needed
  return new Response(response as ReadableStream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  })
})
export default app

You can use it immediately by adding the AI binding to wrangler.toml:
[ai]
binding = "AI"

Model Availability Note — @cf/meta/llama-3.1-8b-instruct is current as of the time of writing. The list of supported models and the cost policies for Cloudflare Workers AI are subject to change, so we recommend checking the official documentation for the current status.
Example 3: Communication between microservices with type-safe RPC
As AI applications grow, you will want to separate APIs by role, such as embedding services and summarization services. This is where Hono's RPC feature comes in handy.
If you are already using tRPC, the question "Why Hono RPC?" will naturally arise. tRPC is more specialized for frontend-backend integration, and OpenAPI codegen requires a separate build step. Since Hono RPC allows clients to directly import server types, it can maintain type safety between services without the need for code generation tools. If your microservices are already built using Hono, it is the choice with the least friction.
Honestly, when I first saw it, I thought, "Is this actually possible?"
// embed-service/src/index.ts
import { Hono } from 'hono'
import { zValidator } from '@hono/zod-validator'
import { z } from 'zod'
const embedSchema = z.object({ text: z.string().min(1) })
const app = new Hono()
const route = app.post(
  '/api/embed',
  zValidator('json', embedSchema),
  async (c) => {
    const { text } = c.req.valid('json')
    // Swap in real embedding logic for whichever model provider you use
    const embedding: number[] = Array(1536).fill(0) // dummy implementation
    return c.json({ embedding, dimensions: embedding.length })
  }
)
// Export the server type; the client imports this type
export type AppType = typeof route
export default app

// another-service/src/client.ts
import { hc } from 'hono/client'
import type { AppType } from '../../embed-service/src/index'
const client = hc<AppType>('https://embed-service.workers.dev')
async function getEmbedding(text: string) {
  // Autocomplete and type checking work in the IDE
  const res = await client.api.embed.$post({ json: { text } })
  const { embedding } = await res.json()
  return embedding
}

Hono RPC — A method that reuses server route types directly on the client without an OpenAPI specification or separate code generation. You can think of it as bringing gRPC-style type safety to REST.
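Two details worth knowing when you build on this pattern: the res returned by the RPC client is a standard Response, so non-2xx results still need an explicit check, and Hono ships InferRequestType/InferResponseType helpers for deriving types from the same route. A small sketch based on the example above:

```ts
import { hc } from 'hono/client'
import type { InferRequestType, InferResponseType } from 'hono/client'
import type { AppType } from '../../embed-service/src/index'

const client = hc<AppType>('https://embed-service.workers.dev')

// Derive the request and response shapes from the route type, no hand-written DTOs
type EmbedRequest = InferRequestType<typeof client.api.embed.$post>['json']
type EmbedResponse = InferResponseType<typeof client.api.embed.$post>

async function safeGetEmbedding(body: EmbedRequest): Promise<EmbedResponse | null> {
  const res = await client.api.embed.$post({ json: body })
  // res is a plain Response, so failures still need an explicit check
  if (!res.ok) return null
  return await res.json()
}
```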
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Ultra-lightweight | ~14KB with minimal dependencies; near-zero cold start in Workers environments |
| Runtime-independent | Deploy to Workers, Bun, Lambda, and Vercel Edge from a single codebase |
| Built-in streaming | Simple LLM streaming with the `streamSSE()` and `streamText()` helpers |
| Type-safe RPC | End-to-end type safety via the `hc` client without code generation |
| Middleware ecosystem | Robust official/third-party middleware: Bearer Auth, CORS, rate limiting, Zod Validator, and more |
| Official integrations | Official docs and templates for Cloudflare Workers and the Vercel AI SDK |
Establishing team conventions without DI can feel daunting at first, but simply splitting routers into files by domain solves the problem quite well. For services with clear roles, like the AI API layer, I actually found it better to have no unnecessary clutter.
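For a concrete picture of that convention, here is a minimal sketch of the domain-split layout using app.route() to compose routers; the route names are illustrative, and in a real project each sub-router would live in its own file under src/routes/.

```ts
import { Hono } from 'hono'

// One router per domain (normally one file per domain, e.g. src/routes/chat.ts)
const chatRoutes = new Hono()
  .post('/', (c) => c.json({ reply: 'chat handler lives here' }))

const embedRoutes = new Hono()
  .post('/', (c) => c.json({ embedding: [] as number[] }))

// Compose the domain routers in one place; this plays the role a DI container
// or module system would play in NestJS
const app = new Hono()
  .route('/api/chat', chatRoutes)
  .route('/api/embed', embedRoutes)

export default app
```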
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| No DI or decorators | Lacks NestJS-style structuring features; larger teams need to establish their own conventions | Split routers into domain-specific files and document team conventions |
| WebSocket limitations | WebSocket support is limited in Cloudflare Workers environments | Cover unidirectional cases with SSE; consider Durable Objects if bidirectional communication is essential |
| No state management | State cannot be kept between requests in serverless/edge environments | Combine with external stores such as Cloudflare KV, D1, or Upstash Redis |
| Rate limiting is yours to implement | Developers must add AI API cost-control logic themselves | Use the hono-rate-limiter or workers-hono-rate-limit middleware |
Durable Objects — Stateful serverless computing units provided by Cloudflare Workers. While standard Workers are stateless, Durable Objects can maintain memory and storage between requests and are used for WebSocket session management.
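If bidirectional communication really is required, the following is a rough sketch of the Durable Object WebSocket pattern; the ChatSession name and the echo behavior are placeholders, and the class still has to be registered as a binding (plus a migration) in wrangler.toml.

```ts
// Sketch of a Durable Object that accepts a WebSocket and echoes messages back.
// Requires @cloudflare/workers-types; ChatSession is a placeholder name.
export class ChatSession {
  constructor(private state: DurableObjectState, private env: unknown) {}

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get('Upgrade') !== 'websocket') {
      return new Response('Expected a WebSocket upgrade', { status: 426 })
    }

    const pair = new WebSocketPair()
    const [client, server] = Object.values(pair)

    // Unlike a stateless Worker, this object keeps the connection and any
    // in-memory session state alive between messages
    server.accept()
    server.addEventListener('message', (event) => {
      server.send(`echo: ${event.data}`)
    })

    return new Response(null, { status: 101, webSocket: client })
  }
}
```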
The Most Common Mistakes in Practice
- Omitting Rate Limiting — AI APIs incur costs per token. If you leave LLM endpoints open without authentication and rate limiting, you may face a massive bill. It is best to include `hono-rate-limiter` from the start.
- Inconsistent Error Response Formats — If validation errors, LLM API errors, and internal errors go out in different formats, they become difficult for the client to handle. Define a global error handler with the `app.onError()` hook (see the sketch after this list).
- Missing Exception Handling During Streaming — As in the code in Example 1, if the LLM API call fails inside the `streamSSE` callback, you cannot signal the error with a status code because the 200 response header has already been sent. Debugging becomes much easier later if you pre-define a pattern of sending a special event such as `event: 'error'` on the stream and having the client handle it.
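A minimal sketch that combines the first two items, a shared rate limiter plus a global error handler. The hono-rate-limiter option names follow its README and may differ by version, and the key-generation strategy and error envelope here are assumptions you should adapt:

```ts
import { Hono } from 'hono'
import { HTTPException } from 'hono/http-exception'
import { rateLimiter } from 'hono-rate-limiter'

const app = new Hono()

// Rate-limit every AI route before any token is ever generated
app.use(
  '/api/*',
  rateLimiter({
    windowMs: 60_000,
    limit: 20,
    keyGenerator: (c) => c.req.header('x-forwarded-for') ?? 'anonymous',
  })
)

// One global error handler so every non-streaming error leaves in the same shape
app.onError((err, c) => {
  if (err instanceof HTTPException) {
    return err.getResponse()
  }
  console.error(err)
  return c.json({ error: 'INTERNAL_ERROR', message: 'Something went wrong' }, 500)
})

export default app
```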
In Conclusion
Hono addresses the three requirements for the AI API layer—streaming, type safety, and edge execution—within a single framework, making it the most practical choice at present. It is difficult to find another option that can handle these three things simultaneously without cold starts, code generation tools, or runtime replacements.
3 Steps to Start Right Now:
- Project Initialization: Running `pnpm create hono@latest my-ai-api` lets you select a runtime via the interactive CLI. If you select Cloudflare Workers, `wrangler` is configured as well.
- AI SDK Connection: Add the Vercel AI SDK with `pnpm add ai @ai-sdk/openai`, paste the streaming endpoint code from Example 1 into `src/index.ts`, and run `pnpm dev`. It works immediately with just `OPENAI_API_KEY` in `.env`.
- Edge Deployment Verification: `pnpm run deploy` (on Cloudflare Workers) deploys to the global edge in a single step. You can feel the difference by comparing response times against a local Node.js server using load-testing tools such as `wrk` or `hey`.
Reference Materials
If you are just starting out, you can begin by looking at these three things:
- Hono Official Documentation | hono.dev — The first place to read when starting out
- Hono RPC Guide | hono.dev — If you want to learn more about type-safe communication between services
- Cloudflare Workers AI — Vercel AI SDK Integration | cloudflare.com — Workers AI Official Integration Setup