A Practical Guide to Apache Kafka and Event-Driven Architecture for Breaking Microservice Coupling
When I first adopted microservices, I made the classic mistake. Service A called B directly over HTTP, B called C, C called D... and I ended up with the worst possible outcome: a "distributed monolith." The cost was steep. A botched deployment order for the order service caused the inventory service to throw 503 errors for over four hours, and a single API spec change in the notification service kept breaking the payments team's code. That's when I finally realized something was fundamentally wrong.
Event-Driven Architecture (EDA) approaches this problem differently. It's a model where you fire an event — "an order has been placed, react however you see fit" — and you're done. Whether the inventory service is up or down, whether the payment service is slow or fast — the order service doesn't need to care. And at the center of it all is Apache Kafka. If you're nodding along thinking "I'm suffering under a structure where services know too much about each other," this article is exactly for you.
This guide covers everything in one place: the core principles of EDA, what's changed in Kafka 4.0, code patterns you can use in production right away, and the pitfalls you really only learn by running into them yourself. By the end of this article, you'll understand what to avoid first when introducing EDA to your team — and why Netflix and Uber place Kafka at the center of their company-wide infrastructure.
Core Concepts
The Three Players of EDA: Producer, Broker, Consumer
The fastest way to understand EDA is through a newspaper analogy. A journalist (Producer) writes an article. The newspaper company (Broker — Kafka in this case) stores and distributes it. Subscribers (Consumers) pick only the sections they care about. Even if a subscriber is away on a trip, the papers stack up, and they can read them when they return.
[Order Service] → publishes "order.placed" event
↓
[ Kafka Broker ]
↓ ↓ ↓
[Inventory Svc] [Payment Svc] [Notification Svc]
  (deduct stock)   (process pay)   (email/SMS)

The key here is that the Order Service has absolutely no knowledge of the other three services. Want to add a points service later? Not a single line of Order Service code needs to change. The new service simply subscribes to the order.placed topic, and that's it.
Understanding Kafka's Core Structure
When you first encounter Kafka, the terminology can feel a bit unfamiliar — but break it down piece by piece and it's quite intuitive.
| Concept | One-line description | Practical intuition |
|---|---|---|
| Topic | A logical channel where events accumulate | Separate by event type, like order.placed or payment.completed |
| Partition | A physical subdivision of a topic | Number of partitions = maximum number of Consumers that can process in parallel |
| Consumer Group | Consumers in the same group share partitions among themselves | The key to horizontal scaling. No duplicate processing within a group |
| Offset | The position of a message within a partition | This number enables precise reprocessing from the exact point after a failure |
| Log Compaction | A cleanup strategy that retains only the latest value for a given key | Used in event sourcing as a "current state" snapshot |
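The Log Compaction row deserves a concrete picture. Here's a minimal sketch (illustrative only, not Kafka's actual log cleaner): given a keyed log, compaction keeps just the latest record for each key.

```typescript
interface KeyedRecord { key: string; value: number; offset: number }

// Retain only the most recent record per key, preserving log order by offset
function compact(log: KeyedRecord[]): KeyedRecord[] {
  const latest = new Map<string, KeyedRecord>();
  for (const rec of log) latest.set(rec.key, rec); // later records overwrite earlier ones
  return [...latest.values()].sort((a, b) => a.offset - b.offset);
}

const keyedLog: KeyedRecord[] = [
  { key: "user-1", value: 10, offset: 0 },
  { key: "user-2", value: 20, offset: 1 },
  { key: "user-1", value: 15, offset: 2 }, // supersedes the record at offset 0
];
const compacted = compact(keyedLog);
```

This is exactly why a compacted topic works as a "current state" snapshot for event sourcing: replaying it gives you the latest value for every key without the full history.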
What is a Consumer Group? Consumers sharing the same group ID act as a single logical consumer. If a topic has 4 partitions and there are 4 Consumers, each handles one — perfect parallel processing. Meanwhile, different groups (an inventory service group, a notification service group) each independently consume the same messages in full.
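The partition-sharing behavior described above can be sketched in a few lines (a pure simulation, not the broker's real assignor):

```typescript
// Within one Consumer Group, each partition is owned by exactly one Consumer.
// Round-robin here is a stand-in for the broker's actual assignment strategy.
function assignPartitions(partitions: number[], consumers: string[]): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const c of consumers) assignment.set(c, []);
  partitions.forEach((p, i) => {
    assignment.get(consumers[i % consumers.length])!.push(p);
  });
  return assignment;
}

// 4 partitions, 4 consumers: one partition each (perfect parallelism)
const even = assignPartitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4"]);
// 4 partitions, 2 consumers: two partitions each
const halved = assignPartitions([0, 1, 2, 3], ["c1", "c2"]);
```

A second group (say, the notification service) would get its own independent copy of this assignment and consume the same messages in full.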
Consumer Group Rebalancing is a tricky issue you'll encounter often in production. Rebalancing occurs when a Consumer is added or removed, or when a message isn't processed within max.poll.interval.ms. During this time, message processing for that group is temporarily suspended. I spent a long time puzzling over sudden processing delays early on — it turned out a heavy batch processing routine was exceeding max.poll.interval.ms (default: 5 minutes). You can either tune this value to match actual processing time, or use the CooperativeStickyAssignor (available since Kafka 2.4), which lets partitions that aren't being reassigned continue processing during a rebalance for a much more stable experience.
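For reference, the two knobs discussed here map to consumer settings like the following (shown as Java-client property names; kafkajs exposes equivalents under different names, such as `rebalanceTimeout`):

```properties
# Raise the processing deadline to match a slow batch routine (10 minutes here)
max.poll.interval.ms=600000
# Cooperative rebalancing: only the partitions actually being moved pause
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```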
Kafka 4.0: A Complete Break from ZooKeeper
In March 2025, Kafka 4.0 was released, finally completing the long-overdue removal of the ZooKeeper dependency. KRaft (Kafka Raft) mode became the default — and honestly, you can only truly appreciate how big a deal this is if you've ever been paged at 3 AM because of ZooKeeper.
Previously, running a Kafka cluster meant separately configuring a Kafka broker ensemble alongside a ZooKeeper ensemble. Two metadata management systems meant double the failure points. With KRaft, Kafka manages metadata directly using its own Raft consensus algorithm (an internal consensus protocol that elects a leader by vote). A single system, scalable to millions of partitions, with dramatically faster cluster startup times.
Also included is Queues for Kafka (KIP-932). The existing Kafka Consumer Group is a "broadcast" model — Consumers in the same group divide up the partitions, while different groups each independently receive all the same messages. KIP-932 adds "competing consumer" semantics: multiple Consumers competitively pull messages from a single queue, which is the pattern familiar to anyone who has used RabbitMQ. Teams migrating from RabbitMQ to Kafka now have one more reason to do so.
Three Essential EDA Patterns
These are three patterns that come up regularly when designing EDA in production. Each could fill an entire article on its own, so here I'll focus on when to use each pattern.
CQRS (Command Query Responsibility Segregation)
When to use: When read and write loads are drastically different
A pattern that separates the responsibilities of writes (Commands) and reads (Queries). Events are published to update the write model, while a separately optimized read model (e.g., Elasticsearch) is maintained. A typical example is an e-commerce store where orders are created relatively rarely but order lists are looked up thousands of times.
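To make the read/write split concrete, here is a minimal in-memory sketch (the stores and names are illustrative stand-ins; in production the read model would live in something like Elasticsearch, fed by Kafka events):

```typescript
interface OrderCreated { orderId: string; userId: string; total: number }

const writeStore: OrderCreated[] = [];          // command side: the source of truth
const readModel = new Map<string, string[]>();  // query side: denormalized order IDs per user

// Projector: in production this consumer would read "order.placed" from Kafka
function project(event: OrderCreated): void {
  const list = readModel.get(event.userId) ?? [];
  list.push(event.orderId);
  readModel.set(event.userId, list);
}

function handleCreateOrderCommand(cmd: OrderCreated): void {
  writeStore.push(cmd); // 1) persist on the write side
  project(cmd);         // 2) apply the resulting event to the read side
}

handleCreateOrderCommand({ orderId: "o-1", userId: "u-1", total: 50 });
handleCreateOrderCommand({ orderId: "o-2", userId: "u-1", total: 30 });
// Queries hit the denormalized read model, never the write store
const userOrders = readModel.get("u-1") ?? [];
```

The point is that the list-lookup path never touches (or locks) the write-side store, so each side can be scaled and indexed for its own load.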
Event Sourcing
When to use: When an audit trail is mandatory, or when you need to reconstruct past state
Instead of storing the "current state" in the DB, you store the events that caused state changes. For an inventory service, you'd accumulate events like StockReceived(100), OrderConsumed(30), Returned(5) and aggregate them to calculate the current stock (75). This enables time-travel — answering questions like "What was the stock count at 3 PM yesterday?"
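The inventory example can be written as a simple fold over the event log (a sketch; the event types and names are illustrative):

```typescript
type StockEvent =
  | { type: "StockReceived"; qty: number }
  | { type: "OrderConsumed"; qty: number }
  | { type: "Returned"; qty: number };

// Current state is never stored; it is derived by folding over the event log
function currentStock(events: StockEvent[]): number {
  return events.reduce(
    (stock, e) => (e.type === "OrderConsumed" ? stock - e.qty : stock + e.qty),
    0,
  );
}

const history: StockEvent[] = [
  { type: "StockReceived", qty: 100 },
  { type: "OrderConsumed", qty: 30 },
  { type: "Returned", qty: 5 },
];
const stockNow = currentStock(history);                      // 75
// "Time travel": replay only the events up to a point in time
const stockBeforeReturn = currentStock(history.slice(0, 2)); // 70
```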
Saga Pattern
When to use: When you need a transaction spanning multiple services but don't want to use 2PC (two-phase commit)
A pattern for handling transactions in distributed environments. Operations spanning multiple services — like "place order → process payment → deduct inventory" — are chained through events, and if something fails midway, a compensation event triggers a rollback.
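A compensation flow can be simulated in a few lines (a pure in-memory sketch; in a real system each `compensate` call would be a compensation event published to Kafka and handled by the owning service):

```typescript
interface SagaStep {
  name: string;
  run: () => void;        // forward action
  compensate: () => void; // undo action, triggered by a compensation event
}

// Run steps in order; on failure, compensate completed steps in reverse order
function runSaga(steps: SagaStep[], trace: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      trace.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      for (const undo of completed.reverse()) {
        undo.compensate();
        trace.push(`compensated:${undo.name}`);
      }
      return false;
    }
  }
  return true;
}

const trace: string[] = [];
const ok = runSaga(
  [
    { name: "placeOrder", run: () => {}, compensate: () => {} },
    { name: "processPayment", run: () => {}, compensate: () => {} },
    { name: "deductInventory", run: () => { throw new Error("out of stock"); }, compensate: () => {} },
  ],
  trace,
);
```

Note the reverse order: payment is refunded before the order is cancelled, mirroring how the forward chain was built up.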
Practical Application
Example 1: E-commerce Order Processing Pipeline (NestJS + kafkajs)
This is the scenario you'll encounter most in production. Let's look at code for when an incoming order must trigger inventory, payment, and notification processing independently. The following code is based on the NestJS microservices module.
Let's start by pinning down the core type definitions. Handling events without types in TypeScript can cost you hours of debugging over a single typo in a field name.
// types/events.ts
export interface CreateOrderDto {
userId: string;
items: Array<{ productId: string; quantity: number; price: number }>;
}
export interface OrderPlacedEvent {
orderId: string;
userId: string;
items: Array<{ productId: string; quantity: number; price: number }>;
totalAmount: number;
correlationId: string;
timestamp: string;
}
export interface DebeziumEvent<T = Record<string, unknown>> {
op: 'c' | 'u' | 'd' | 'r'; // create, update, delete, snapshot read
before: T | null;
after: T | null;
source: { db: string; table: string };
}

The folder structure is as follows:
src/
  order/
    order.module.ts
    order.service.ts      ← Producer role
  inventory/
    inventory.module.ts
    inventory.consumer.ts ← Consumer role
  types/
    events.ts

// order/order.service.ts — Producer role
import { Injectable, Inject } from '@nestjs/common';
import { ClientKafka } from '@nestjs/microservices';
// Order / OrderRepository are assumed domain types defined elsewhere in the project
import { Order, OrderRepository } from './order.repository';
// @Inject('KAFKA_SERVICE'): injects the ClientKafka registered under the
// 'KAFKA_SERVICE' token in the NestJS DI container
import { CreateOrderDto, OrderPlacedEvent } from '../types/events';
@Injectable()
export class OrderService {
constructor(
@Inject('KAFKA_SERVICE') private readonly kafkaClient: ClientKafka,
private readonly orderRepository: OrderRepository,
) {}
async placeOrder(dto: CreateOrderDto): Promise<Order> {
const order = await this.orderRepository.save(dto);
const event: OrderPlacedEvent = {
orderId: order.id,
userId: order.userId,
items: order.items,
totalAmount: order.totalAmount,
      correlationId: crypto.randomUUID(), // unique ID for distributed tracing
timestamp: new Date().toISOString(),
};
await this.kafkaClient.emit('order.placed', event);
return order;
}
}

// inventory/inventory.consumer.ts — Consumer role
import { Controller, Inject } from '@nestjs/common';
import {
  MessagePattern,
  Payload,
  Ctx,
  KafkaContext,
  ClientKafka,
} from '@nestjs/microservices';
// InventoryService is an assumed service implementing stock reservation
import { InventoryService } from './inventory.service';
// @Ctx() context: KafkaContext — NestJS injects Kafka metadata (offset, partition, headers, ...)
import { OrderPlacedEvent } from '../types/events';
const MAX_RETRY = 3;
@Controller()
export class InventoryConsumer {
constructor(
private readonly inventoryService: InventoryService,
@Inject('KAFKA_SERVICE') private readonly kafkaClient: ClientKafka,
) {}
@MessagePattern('order.placed')
async handleOrderPlaced(
@Payload() event: OrderPlacedEvent,
@Ctx() context: KafkaContext,
) {
const { correlationId, orderId, items } = event;
    const retryCount = Number(
      // kafkajs delivers header values as Buffers; convert before parsing
      context.getMessage().headers?.['x-retry-count']?.toString() ?? 0,
    );
try {
await this.inventoryService.reserveItems(items);
await this.kafkaClient.emit('inventory.reserved', {
orderId,
correlationId,
status: 'SUCCESS',
});
} catch (error) {
if (retryCount < MAX_RETRY) {
        // Retry: re-publish to the same topic with the count in a Kafka header.
        // NestJS forwards key/value/headers when the payload uses this record shape.
        await this.kafkaClient.emit('order.placed', {
          key: orderId, // keep retries on the same partition as the original event
          value: event,
          headers: { 'x-retry-count': String(retryCount + 1) },
        });
} else {
        // Retries exhausted → quarantine the event in the DLQ
await this.kafkaClient.emit('inventory.reserve.dlq', {
originalEvent: event,
reason: error.message,
failedAt: new Date().toISOString(),
});
}
}
}
}

| Code point | Description |
|---|---|
| `correlationId` | A unique ID for tracing the entire event chain. Essential for distributed tracing |
| `emit()` vs `send()` | `emit` is fire-and-forget (no response needed); `send` waits for a response |
| `x-retry-count` header | Tracks the retry count. Routes to the DLQ topic when `MAX_RETRY` is exceeded |
| Compensation event | The rollback signal in the Saga pattern. Consumed by upstream services for handling |
Example 2: Converting DB Changes to Kafka Events with Debezium (CDC Pattern)
This pattern is incredibly useful when migrating from a legacy system to EDA. When it's difficult to embed event publishing code directly into an existing DB, Debezium reads the DB's change log (binlog/WAL) and automatically converts it into Kafka events.
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
    "table.include.list": "shop.orders",
    "topic.prefix": "cdc"
  }
}

Note: `database.password` is an example value. In production, always inject secrets from a secret management tool such as AWS Secrets Manager or Vault. The ExtractNewRecordState ("unwrap") transform is omitted here on purpose: the consumer below reads the full Debezium envelope (op/before/after), which unwrapping would flatten away.
// CDC event consumption example — topic names are auto-generated in the format "cdc.shop.orders"
import { DebeziumEvent } from '../types/events';
interface OrderRecord {
id: string;
status: string;
userId: string;
}
// (a method inside a @Controller class, like the InventoryConsumer above)
@MessagePattern('cdc.shop.orders')
async handleOrderCDC(
@Payload() cdcEvent: DebeziumEvent<OrderRecord>,
) {
const { op, after, before } = cdcEvent;
// op: 'c'(create), 'u'(update), 'd'(delete), 'r'(snapshot read)
if (op === 'u' && after?.status === 'SHIPPED') {
await this.notificationService.sendShippingAlert(after);
}
}

What is CDC (Change Data Capture)? A technique that detects database changes in real time and converts them into events. Since it can turn an existing DB into an event source without modifying application code, it's commonly used as the first step in migrating legacy systems to EDA.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Loose coupling | Producers and Consumers don't need to know anything about each other. Adding a new service doesn't touch existing code, reducing deployment risk and cutting cross-team dependencies |
| High throughput | Processes millions of messages per second. Netflix processes trillions of events per day through Kafka |
| Durability & reprocessing | Logs are persisted to disk. After a failure, you can reprocess precisely from the stored offset — and replay past events after a bug fix |
| Elastic scaling | Horizontal scaling is as simple as adding partitions. Consumer Groups automatically distribute the load, so you just spin up more Consumer instances as traffic spikes |
| Fault isolation | Even if the notification service goes down, the order service keeps publishing events. When the notification service recovers, it processes the accumulated events. Failures don't cascade |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Operational complexity | Running and monitoring a Kafka cluster requires specialized expertise | In the early stages, managed services like Confluent Cloud or Amazon MSK significantly reduce the operational burden |
| Eventual consistency | Unlike synchronous processing, immediate consistency is not guaranteed | Display a "processing" state in the UI and design the user experience from the start with this in mind |
| Message ordering | Ordering is guaranteed only at the partition level — not across the entire topic | For events where order matters, use the same partition key (e.g., userId, orderId) |
| Exactly-once processing | The default is at-least-once. Duplicate processing can occur | Enable transactional publishing on the Producer side with enable.idempotence=true + transactional.id, and set isolation.level=read_committed on the Consumer side to only read committed messages. Ideally, design the Consumer itself to be idempotent |
| Debugging difficulty | Tracing through an event chain has its limits when relying solely on log analysis | Embed a correlationId in every event and introduce distributed tracing tools like Jaeger or Zipkin |
| Schema management | Adding a field to the order.placed event impacts all Consumers | Manage schema versions with Confluent Schema Registry and include backward-compatibility checks in CI |
What is an Idempotent Consumer? A Consumer designed so that processing the same message multiple times produces the same result. For example, when a duplicate "deduct 5 from inventory" event arrives, you can implement this by recording the event ID in the DB and ignoring any event that has already been processed.
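That idempotency check can be sketched like this (an in-memory `Set` standing in for a DB table with a unique constraint on the event ID):

```typescript
// In production, the processed-ID check would be a DB insert guarded by a unique
// constraint, committed in the same transaction as the stock update.
const processedEventIds = new Set<string>();
let stock = 100;

function handleDeductEvent(event: { eventId: string; qty: number }): void {
  if (processedEventIds.has(event.eventId)) return; // already handled → no-op
  processedEventIds.add(event.eventId);
  stock -= event.qty;
}

const deduct = { eventId: "evt-1", qty: 5 };
handleDeductEvent(deduct);
handleDeductEvent(deduct); // at-least-once redelivery of the same event
// stock is 95, not 90: processing twice produced the same result as once
```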
The Most Common Mistakes in Production
- Making topics too coarse-grained — Funneling all order events into a single `order` topic forces Consumers to filter out events they don't need. Breaking them down by event type — `order.placed`, `order.shipped`, `order.cancelled` — makes things far more manageable down the line.
- Not setting a partition key — Using the default round-robin distribution means events from the same user can be scattered across multiple partitions. I missed partition keys early on and spent a long time debugging "I processed these in order, so why is the state getting corrupted?" If ordering matters in your domain, set `userId` or `orderId` as the partition key.
- Omitting a Dead Letter Queue (DLQ) — If a Consumer repeatedly fails to process a specific event, it can block the entire partition. As shown in the code example above, it's ideal to build the structure of isolating failed events into a DLQ topic and monitoring them separately right from the initial design. Adding it later turns out to be surprisingly costly.
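The partition-key point comes down to deterministic hashing: the same key always maps to the same partition. A toy hash below (standing in for Kafka's murmur2-based default partitioner, which this is not) shows the mechanism:

```typescript
// Deterministic toy hash: same key in, same partition out, every time
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) | 0;
  return Math.abs(hash) % numPartitions;
}

// Every event keyed by the same userId lands on the same partition,
// so that user's events stay ordered relative to each other
const p1 = partitionFor("user-42", 4);
const p2 = partitionFor("user-42", 4);
```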
Closing Thoughts
Event-Driven Architecture is not simply a matter of swapping out your tech stack. It fundamentally changes how teams depend on and communicate with each other. The moment "we need to wait for Team A to change their API spec to add this feature" becomes "we can deploy independently from Team A by subscribing to the events we care about," team autonomy changes entirely.
Of course, at first the learning curve — from ZooKeeper configuration to Consumer Group rebalancing issues — can feel daunting. But since Kafka 4.0, the barrier to entry has dropped considerably, and managed services have substantially reduced the operational burden. Now is a great time to start.
Three steps you can take right now:
1. Spin up Kafka 4.0 locally — You can start KRaft-mode Kafka immediately with `docker run -p 9092:9092 apache/kafka:4.0.0` (or an equivalent one-service Docker Compose file). Running Redpanda Console or Conduktor alongside it lets you see topic lists, per-partition offsets, and Consumer Group lag at a glance. Just watching Consumer Group lag converge to zero in real time will make the concepts click much faster.
2. Write a simple Producer/Consumer — Use the client for your current language (`kafkajs` for Node.js, `spring-kafka` for Java) to write code that publishes and consumes a "hello world" event. Open the "Consumer Groups" tab in the Kafka UI and watch what happens to partition reassignment as you add or remove Consumers — it's the fastest way to build intuition.
3. Replace one synchronous call in an existing service with an event — Rather than converting the entire system at once, try replacing one low-stakes, fire-and-forget call in your current project — something like "send a notification" where a failure just means a retry. Small wins build team confidence.
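For step 1, a minimal Compose file can look like this (a sketch; the `apache/kafka` image ships with a single-node KRaft default configuration, so for quick local experiments no extra environment variables are strictly needed):

```yaml
services:
  kafka:
    image: apache/kafka:4.0.0
    ports:
      - "9092:9092"
```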
Next article: Saga Pattern in Practice — When a distributed transaction requires a rollback, how do you design compensation events? We'll cover that in code, along with the differences between the Choreography and Orchestration approaches.
References
- Apache Kafka Official Documentation | apache.org
- Confluent Developer — Kafka Concepts and Tutorials | confluent.io
- KRaft — Apache Kafka Without ZooKeeper | Confluent Developer
- Kafka 4.0 KRaft Architecture | InfoQ
- Confluent Schema Registry Official Documentation | confluent.io
- Top Trends for Data Streaming with Apache Kafka and Flink in 2026 | Kai Waehner
- The Rise of Durable Execution Engine with Apache Kafka | Kai Waehner
- Uber uForwarder: Scalable Kafka Consumer Proxy | InfoQ
- Event-Driven Architecture and Kafka Explained: Pros and Cons | PRODYNA
- Event Driven Architecture and Kafka — F-Lab