A Practical Guide to Apache Kafka and Event-Driven Architecture for Breaking Microservice Coupling
When I first adopted microservices, I made the classic mistake. Service A called B directly over HTTP, B called C, C called D... and I ended up with the worst possible outcome: a "distributed monolith." The cost was steep. A botched deployment order for the order service caused the inventory service to throw 503 errors for over four hours, and a single API spec change in the notification service kept breaking the payments team's code. That's when I finally realized something was fundamentally wrong.
Event-Driven Architecture (EDA) approaches this problem differently. It's a model where you fire an event — "an order has been placed, react however you see fit" — and you're done. Whether the inventory service is up or down, whether the payment service is slow or fast — the order service doesn't need to care. And at the center of it all is Apache Kafka. If you're nodding along thinking "I'm suffering under a structure where services know too much about each other," this article is exactly for you.
This guide covers everything in one place: the core principles of EDA, what's changed in Kafka 4.0, code patterns you can use in production right away, and the pitfalls you really only learn by running into them yourself. By the end of this article, you'll understand what to avoid first when introducing EDA to your team — and why Netflix and Uber place Kafka at the center of their company-wide infrastructure.
Core Concepts
The Three Players of EDA: Producer, Broker, Consumer
The fastest way to understand EDA is through a newspaper analogy. A journalist (Producer) writes an article. The newspaper company (Broker — Kafka in this case) stores and distributes it. Subscribers (Consumers) pick only the sections they care about. Even if a subscriber is away on a trip, the papers stack up, and they can read them when they return.
[Order Service] → publishes "order.placed" event
↓
[ Kafka Broker ]
↓ ↓ ↓
[Inventory Svc] [Payment Svc] [Notification Svc]
  (deduct stock)   (process pay)   (email/SMS)

The key here is that the Order Service has absolutely no knowledge of the other three services. Want to add a points service later? Not a single line of Order Service code needs to change. The new service simply subscribes to the order.placed topic, and that's it.
Understanding Kafka's Core Structure
When you first encounter Kafka, the terminology can feel a bit unfamiliar — but break it down piece by piece and it's quite intuitive.
| Concept | One-line description | Practical intuition |
|---|---|---|
| Topic | A logical channel where events accumulate | Separate by event type, like order.placed or payment.completed |
| Partition | A physical subdivision of a topic | Number of partitions = maximum number of Consumers that can process in parallel |
| Consumer Group | Consumers in the same group share partitions among themselves | The key to horizontal scaling. No duplicate processing within a group |
| Offset | The position of a message within a partition | This number enables precise reprocessing from the exact point after a failure |
| Log Compaction | A cleanup strategy that retains only the latest value for a given key | Used in event sourcing as a "current state" snapshot |
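The Log Compaction row deserves a concrete picture. Here's a minimal sketch (illustrative only, not Kafka's actual log cleaner): given a keyed log, compaction keeps just the latest record for each key.

```typescript
interface KeyedRecord { key: string; value: number; offset: number }

// Retain only the most recent record per key, preserving log order by offset
function compact(log: KeyedRecord[]): KeyedRecord[] {
  const latest = new Map<string, KeyedRecord>();
  for (const rec of log) latest.set(rec.key, rec); // later records overwrite earlier ones
  return [...latest.values()].sort((a, b) => a.offset - b.offset);
}

const keyedLog: KeyedRecord[] = [
  { key: "user-1", value: 10, offset: 0 },
  { key: "user-2", value: 20, offset: 1 },
  { key: "user-1", value: 15, offset: 2 }, // supersedes the record at offset 0
];
const compacted = compact(keyedLog);
```

This is exactly why a compacted topic works as a "current state" snapshot for event sourcing: replaying it gives you the latest value for every key without the full history.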
What is a Consumer Group? Consumers sharing the same group ID act as a single logical consumer. If a topic has 4 partitions and there are 4 Consumers, each handles one — perfect parallel processing. Meanwhile, different groups (an inventory service group, a notification service group) each independently consume the same messages in full.
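The partition-sharing behavior described above can be sketched in a few lines (a pure simulation, not the broker's real assignor):

```typescript
// Within one Consumer Group, each partition is owned by exactly one Consumer.
// Round-robin here is a stand-in for the broker's actual assignment strategy.
function assignPartitions(partitions: number[], consumers: string[]): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const c of consumers) assignment.set(c, []);
  partitions.forEach((p, i) => {
    assignment.get(consumers[i % consumers.length])!.push(p);
  });
  return assignment;
}

// 4 partitions, 4 consumers: one partition each (perfect parallelism)
const even = assignPartitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4"]);
// 4 partitions, 2 consumers: two partitions each
const halved = assignPartitions([0, 1, 2, 3], ["c1", "c2"]);
```

A second group (say, the notification service) would get its own independent copy of this assignment and consume the same messages in full.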
Consumer Group Rebalancing is a tricky issue you'll encounter often in production. Rebalancing occurs when a Consumer is added or removed, or when a message isn't processed within max.poll.interval.ms. During this time, message processing for that group is temporarily suspended. I spent a long time puzzling over sudden processing delays early on — it turned out a heavy batch processing routine was exceeding max.poll.interval.ms (default: 5 minutes). You can either tune this value to match actual processing time, or use the CooperativeStickyAssignor (available since Kafka 2.4), which lets partitions that aren't being reassigned continue processing during a rebalance for a much more stable experience.
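For reference, the two knobs discussed here map to consumer settings like the following (shown as Java-client property names; kafkajs exposes equivalents under different names, such as `rebalanceTimeout`):

```properties
# Raise the processing deadline to match a slow batch routine (10 minutes here)
max.poll.interval.ms=600000
# Cooperative rebalancing: only the partitions actually being moved pause
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```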
Kafka 4.0: A Complete Break from ZooKeeper
In March 2025, Kafka 4.0 was released, finally completing the long-overdue removal of the ZooKeeper dependency. KRaft (Kafka Raft) mode became the default — and honestly, you can only truly appreciate how big a deal this is if you've ever been paged at 3 AM because of ZooKeeper.
Previously, running a Kafka cluster meant separately configuring a Kafka broker ensemble alongside a ZooKeeper ensemble. Two metadata management systems meant double the failure points. With KRaft, Kafka manages metadata directly using its own Raft consensus algorithm (an internal consensus protocol that elects a leader by vote). A single system, scalable to millions of partitions, with dramatically faster cluster startup times.
Also included is Queues for Kafka (KIP-932). The existing Kafka Consumer Group is a "broadcast" model — Consumers in the same group divide up the partitions, while different groups each independently receive all the same messages. KIP-932 adds "competing consumer" semantics: multiple Consumers competitively pull messages from a single queue, which is the pattern familiar to anyone who has used RabbitMQ. Teams migrating from RabbitMQ to Kafka now have one more reason to do so.
Three Essential EDA Patterns
These are three patterns that come up regularly when designing EDA in production. Each could fill an entire article on its own, so here I'll focus on when to use each pattern.
CQRS (Command Query Responsibility Segregation)
When to use: When read and write loads are drastically different
A pattern that separates the responsibilities of writes (Commands) and reads (Queries). Events are published to update the write model, while a separately optimized read model (e.g., Elasticsearch) is maintained. A typical example is an e-commerce store where orders are created relatively rarely but order lists are looked up thousands of times.
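To make the read/write split concrete, here is a minimal in-memory sketch (the stores and names are illustrative stand-ins; in production the read model would live in something like Elasticsearch, fed by Kafka events):

```typescript
interface OrderCreated { orderId: string; userId: string; total: number }

const writeStore: OrderCreated[] = [];          // command side: the source of truth
const readModel = new Map<string, string[]>();  // query side: denormalized order IDs per user

// Projector: in production this consumer would read "order.placed" from Kafka
function project(event: OrderCreated): void {
  const list = readModel.get(event.userId) ?? [];
  list.push(event.orderId);
  readModel.set(event.userId, list);
}

function handleCreateOrderCommand(cmd: OrderCreated): void {
  writeStore.push(cmd); // 1) persist on the write side
  project(cmd);         // 2) apply the resulting event to the read side
}

handleCreateOrderCommand({ orderId: "o-1", userId: "u-1", total: 50 });
handleCreateOrderCommand({ orderId: "o-2", userId: "u-1", total: 30 });
// Queries hit the denormalized read model, never the write store
const userOrders = readModel.get("u-1") ?? [];
```

The point is that the list-lookup path never touches (or locks) the write-side store, so each side can be scaled and indexed for its own load.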
Event Sourcing
When to use: When an audit trail is mandatory, or when you need to reconstruct past state
Instead of storing the "current state" in the DB, you store the events that caused state changes. For an inventory service, you'd accumulate events like StockReceived(100), OrderConsumed(30), Returned(5) and aggregate them to calculate the current stock (75). This enables time-travel — answering questions like "What was the stock count at 3 PM yesterday?"
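The inventory example can be written as a simple fold over the event log (a sketch; the event types and names are illustrative):

```typescript
type StockEvent =
  | { type: "StockReceived"; qty: number }
  | { type: "OrderConsumed"; qty: number }
  | { type: "Returned"; qty: number };

// Current state is never stored; it is derived by folding over the event log
function currentStock(events: StockEvent[]): number {
  return events.reduce(
    (stock, e) => (e.type === "OrderConsumed" ? stock - e.qty : stock + e.qty),
    0,
  );
}

const history: StockEvent[] = [
  { type: "StockReceived", qty: 100 },
  { type: "OrderConsumed", qty: 30 },
  { type: "Returned", qty: 5 },
];
const stockNow = currentStock(history);                      // 75
// "Time travel": replay only the events up to a point in time
const stockBeforeReturn = currentStock(history.slice(0, 2)); // 70
```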
Saga Pattern
When to use: When you need a transaction spanning multiple services but don't want to use 2PC (two-phase commit)
A pattern for handling transactions in distributed environments. Operations spanning multiple services — like "place order → process payment → deduct inventory" — are chained through events, and if something fails midway, a compensation event triggers a rollback.
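A compensation flow can be simulated in a few lines (a pure in-memory sketch; in a real system each `compensate` call would be a compensation event published to Kafka and handled by the owning service):

```typescript
interface SagaStep {
  name: string;
  run: () => void;        // forward action
  compensate: () => void; // undo action, triggered by a compensation event
}

// Run steps in order; on failure, compensate completed steps in reverse order
function runSaga(steps: SagaStep[], trace: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      trace.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      for (const undo of completed.reverse()) {
        undo.compensate();
        trace.push(`compensated:${undo.name}`);
      }
      return false;
    }
  }
  return true;
}

const trace: string[] = [];
const ok = runSaga(
  [
    { name: "placeOrder", run: () => {}, compensate: () => {} },
    { name: "processPayment", run: () => {}, compensate: () => {} },
    { name: "deductInventory", run: () => { throw new Error("out of stock"); }, compensate: () => {} },
  ],
  trace,
);
```

Note the reverse order: payment is refunded before the order is cancelled, mirroring how the forward chain was built up.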
Practical Application
Example 1: E-commerce Order Processing Pipeline (NestJS + kafkajs)
This is the scenario you'll encounter most in production. Let's look at code for when an incoming order must trigger inventory, payment, and notification processing independently. The following code is based on the NestJS microservices module.
Let's start by pinning down the core type definitions. Handling events without types in TypeScript can cost you hours of debugging over a single typo in a field name.
// types/events.ts
export interface CreateOrderDto {
userId: string;
items: Array<{ productId: string; quantity: number; price: number }>;
}
export interface OrderPlacedEvent {
orderId: string;
userId: string;
items: Array<{ productId: string; quantity: number; price: number }>;
totalAmount: number;
correlationId: string;
timestamp: string;
}
export interface DebeziumEvent<T = Record<string, unknown>> {
op: 'c' | 'u' | 'd' | 'r'; // create, update, delete, snapshot read
before: T | null;
after: T | null;
source: { db: string; table: string };
}

The folder structure is as follows:
src/
  order/
    order.module.ts
    order.service.ts      ← Producer role
  inventory/
    inventory.module.ts
    inventory.consumer.ts ← Consumer role
  types/
    events.ts

// order/order.service.ts — Producer role
import { Injectable, Inject } from '@nestjs/common';
import { ClientKafka } from '@nestjs/microservices';
// Order / OrderRepository are assumed domain types defined elsewhere in the project
import { Order, OrderRepository } from './order.repository';
// @Inject('KAFKA_SERVICE'): injects the ClientKafka registered under the
// 'KAFKA_SERVICE' token in the NestJS DI container
import { CreateOrderDto, OrderPlacedEvent } from '../types/events';
@Injectable()
export class OrderService {
constructor(
@Inject('KAFKA_SERVICE') private readonly kafkaClient: ClientKafka,
private readonly orderRepository: OrderRepository,
) {}
async placeOrder(dto: CreateOrderDto): Promise<Order> {
const order = await this.orderRepository.save(dto);
const event: OrderPlacedEvent = {
orderId: order.id,
userId: order.userId,
items: order.items,
totalAmount: order.totalAmount,
      correlationId: crypto.randomUUID(), // unique ID for distributed tracing
timestamp: new Date().toISOString(),
};
await this.kafkaClient.emit('order.placed', event);
return order;
}
}

// inventory/inventory.consumer.ts — Consumer role
import { Controller, Inject } from '@nestjs/common';
import {
  MessagePattern,
  Payload,
  Ctx,
  KafkaContext,
  ClientKafka,
} from '@nestjs/microservices';
// InventoryService is an assumed service implementing stock reservation
import { InventoryService } from './inventory.service';
// @Ctx() context: KafkaContext — NestJS injects Kafka metadata (offset, partition, headers, ...)
import { OrderPlacedEvent } from '../types/events';
const MAX_RETRY = 3;
@Controller()
export class InventoryConsumer {
constructor(
private readonly inventoryService: InventoryService,
@Inject('KAFKA_SERVICE') private readonly kafkaClient: ClientKafka,
) {}
@MessagePattern('order.placed')
async handleOrderPlaced(
@Payload() event: OrderPlacedEvent,
@Ctx() context: KafkaContext,
) {
const { correlationId, orderId, items } = event;
    const retryCount = Number(
      // kafkajs delivers header values as Buffers; convert before parsing
      context.getMessage().headers?.['x-retry-count']?.toString() ?? 0,
    );
try {
await this.inventoryService.reserveItems(items);
await this.kafkaClient.emit('inventory.reserved', {
orderId,
correlationId,
status: 'SUCCESS',
});
} catch (error) {
if (retryCount < MAX_RETRY) {
        // Retry: re-publish to the same topic with the count in a Kafka header.
        // NestJS forwards key/value/headers when the payload uses this record shape.
        await this.kafkaClient.emit('order.placed', {
          key: orderId, // keep retries on the same partition as the original event
          value: event,
          headers: { 'x-retry-count': String(retryCount + 1) },
        });
} else {
        // Retries exhausted → quarantine the event in the DLQ
await this.kafkaClient.emit('inventory.reserve.dlq', {
originalEvent: event,
reason: error.message,
failedAt: new Date().toISOString(),
});
}
}
}
}

| Code point | Description |
|---|---|
| `correlationId` | A unique ID for tracing the entire event chain. Essential for distributed tracing |
| `emit()` vs `send()` | `emit` is fire-and-forget (no response needed); `send` waits for a response |
| `x-retry-count` header | Tracks the retry count. Routes to the DLQ topic when `MAX_RETRY` is exceeded |
| Compensation event | The rollback signal in the Saga pattern. Consumed by upstream services for handling |
Example 2: Converting DB Changes to Kafka Events with Debezium (CDC Pattern)
This pattern is incredibly useful when migrating from a legacy system to EDA. When it's difficult to embed event publishing code directly into an existing DB, Debezium reads the DB's change log (binlog/WAL) and automatically converts it into Kafka events.
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
    "table.include.list": "shop.orders",
    "topic.prefix": "cdc"
  }
}

Note: `database.password` is an example value. In production, always inject secrets from a secret management tool such as AWS Secrets Manager or Vault. The ExtractNewRecordState ("unwrap") transform is omitted here on purpose: the consumer below reads the full Debezium envelope (op/before/after), which unwrapping would flatten away.
// CDC event consumption example — topic names are auto-generated in the format "cdc.shop.orders"
import { DebeziumEvent } from '../types/events';
interface OrderRecord {
id: string;
status: string;
userId: string;
}
// (a method inside a @Controller class, like the InventoryConsumer above)
@MessagePattern('cdc.shop.orders')
async handleOrderCDC(
@Payload() cdcEvent: DebeziumEvent<OrderRecord>,
) {
const { op, after, before } = cdcEvent;
// op: 'c'(create), 'u'(update), 'd'(delete), 'r'(snapshot read)
if (op === 'u' && after?.status === 'SHIPPED') {
await this.notificationService.sendShippingAlert(after);
}
}

What is CDC (Change Data Capture)? A technique that detects database changes in real time and converts them into events. Since it can turn an existing DB into an event source without modifying application code, it's commonly used as the first step in migrating legacy systems to EDA.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Loose coupling | Producers and Consumers don't need to know anything about each other. Adding a new service doesn't touch existing code, reducing deployment risk and cutting cross-team dependencies |
| High throughput | Processes millions of messages per second. Netflix processes trillions of events per day through Kafka |
| Durability & reprocessing | Logs are persisted to disk. After a failure, you can reprocess precisely from the stored offset — and replay past events after a bug fix |
| Elastic scaling | Horizontal scaling is as simple as adding partitions. Consumer Groups automatically distribute the load, so you just spin up more Consumer instances as traffic spikes |
| Fault isolation | Even if the notification service goes down, the order service keeps publishing events. When the notification service recovers, it processes the accumulated events. Failures don't cascade |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Operational complexity | Running and monitoring a Kafka cluster requires specialized expertise | In the early stages, managed services like Confluent Cloud or Amazon MSK significantly reduce the operational burden |
| Eventual consistency | Unlike synchronous processing, immediate consistency is not guaranteed | Display a "processing" state in the UI and design the user experience from the start with this in mind |
| Message ordering | Ordering is guaranteed only at the partition level — not across the entire topic | For events where order matters, use the same partition key (e.g., userId, orderId) |
| Exactly-once processing | The default is at-least-once. Duplicate processing can occur | Enable transactional publishing on the Producer side with enable.idempotence=true + transactional.id, and set isolation.level=read_committed on the Consumer side to only read committed messages. Ideally, design the Consumer itself to be idempotent |
| Debugging difficulty | Tracing through an event chain has its limits when relying solely on log analysis | Embed a correlationId in every event and introduce distributed tracing tools like Jaeger or Zipkin |
| Schema management | Adding a field to the order.placed event impacts all Consumers | Manage schema versions with Confluent Schema Registry and include backward-compatibility checks in CI |
What is an Idempotent Consumer? A Consumer designed so that processing the same message multiple times produces the same result. For example, when a duplicate "deduct 5 from inventory" event arrives, you can implement this by recording the event ID in the DB and ignoring any event that has already been processed.
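That idempotency check can be sketched like this (an in-memory `Set` standing in for a DB table with a unique constraint on the event ID):

```typescript
// In production, the processed-ID check would be a DB insert guarded by a unique
// constraint, committed in the same transaction as the stock update.
const processedEventIds = new Set<string>();
let stock = 100;

function handleDeductEvent(event: { eventId: string; qty: number }): void {
  if (processedEventIds.has(event.eventId)) return; // already handled → no-op
  processedEventIds.add(event.eventId);
  stock -= event.qty;
}

const deduct = { eventId: "evt-1", qty: 5 };
handleDeductEvent(deduct);
handleDeductEvent(deduct); // at-least-once redelivery of the same event
// stock is 95, not 90: processing twice produced the same result as once
```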
The Most Common Mistakes in Production
- Making topics too coarse-grained — Funneling all order events into a single `order` topic forces Consumers to filter out events they don't need. Breaking them down by event type — `order.placed`, `order.shipped`, `order.cancelled` — makes things far more manageable down the line.
- Not setting a partition key — Using the default round-robin distribution means events from the same user can be scattered across multiple partitions. I missed partition keys early on and spent a long time debugging "I processed these in order, so why is the state getting corrupted?" If ordering matters in your domain, set `userId` or `orderId` as the partition key.
- Omitting a Dead Letter Queue (DLQ) — If a Consumer repeatedly fails to process a specific event, it can block the entire partition. As shown in the code example above, it's ideal to build the structure of isolating failed events into a DLQ topic and monitoring them separately right from the initial design. Adding it later turns out to be surprisingly costly.
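The partition-key point comes down to deterministic hashing: the same key always maps to the same partition. A toy hash below (standing in for Kafka's murmur2-based default partitioner, which this is not) shows the mechanism:

```typescript
// Deterministic toy hash: same key in, same partition out, every time
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) | 0;
  return Math.abs(hash) % numPartitions;
}

// Every event keyed by the same userId lands on the same partition,
// so that user's events stay ordered relative to each other
const p1 = partitionFor("user-42", 4);
const p2 = partitionFor("user-42", 4);
```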
Closing Thoughts
Event-Driven Architecture is not simply a matter of swapping out your tech stack. It fundamentally changes how teams depend on and communicate with each other. The moment "we need to wait for Team A to change their API spec to add this feature" becomes "we can deploy independently from Team A by subscribing to the events we care about," team autonomy changes entirely.
Of course, at first the learning curve — from ZooKeeper configuration to Consumer Group rebalancing issues — can feel daunting. But since Kafka 4.0, the barrier to entry has dropped considerably, and managed services have substantially reduced the operational burden. Now is a great time to start.
Three steps you can take right now:
1. Spin up Kafka 4.0 locally — You can start KRaft-mode Kafka immediately with `docker run -p 9092:9092 apache/kafka:4.0.0` (or an equivalent one-service Docker Compose file). Running Redpanda Console or Conduktor alongside it lets you see topic lists, per-partition offsets, and Consumer Group lag at a glance. Just watching Consumer Group lag converge to zero in real time will make the concepts click much faster.
2. Write a simple Producer/Consumer — Use the client for your current language (`kafkajs` for Node.js, `spring-kafka` for Java) to write code that publishes and consumes a "hello world" event. Open the "Consumer Groups" tab in the Kafka UI and watch what happens to partition reassignment as you add or remove Consumers — it's the fastest way to build intuition.
3. Replace one synchronous call in an existing service with an event — Rather than converting the entire system at once, try replacing one low-stakes, fire-and-forget call in your current project — something like "send a notification" where a failure just means a retry. Small wins build team confidence.
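For step 1, a minimal Compose file can look like this (a sketch; the `apache/kafka` image ships with a single-node KRaft default configuration, so for quick local experiments no extra environment variables are strictly needed):

```yaml
services:
  kafka:
    image: apache/kafka:4.0.0
    ports:
      - "9092:9092"
```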
Next article: Saga Pattern in Practice — When a distributed transaction requires a rollback, how do you design compensation events? We'll cover that in code, along with the differences between the Choreography and Orchestration approaches.
References
- Apache Kafka Official Documentation | apache.org
- Confluent Developer — Kafka Concepts and Tutorials | confluent.io
- KRaft — Apache Kafka Without ZooKeeper | Confluent Developer
- Kafka 4.0 KRaft Architecture | InfoQ
- Confluent Schema Registry Official Documentation | confluent.io
- Top Trends for Data Streaming with Apache Kafka and Flink in 2026 | Kai Waehner
- The Rise of Durable Execution Engine with Apache Kafka | Kai Waehner
- Uber uForwarder: Scalable Kafka Consumer Proxy | InfoQ
- Event-Driven Architecture and Kafka Explained: Pros and Cons | PRODYNA
- Event Driven Architecture and Kafka — F-Lab