Flow Engineering: How to Design Flow, from LLM Workflows to Organizational Architecture
There are moments when you look at a codebase and think, "Why is this logic so tangled?" A few years ago it took me two weeks to ship a single new feature, and at first I assumed messy code was to blame. It turned out that a single PR had to pass reviews from three teams, and because there was only one shared QA environment, several days were lost just sequencing deployments. Only much later did I realize the root cause: the workflow itself was blocked.
Flow Engineering is an approach that treats "where work gets stuck" as a design problem rather than a code-quality problem. In AI agent workflows, microservices architectures, and team deployment pipelines alike, the key is to visualize the flow and eliminate bottlenecks. The effect can be dramatic: in the Weights & Biases case study, a task stuck at 17% accuracy with a simple prompt rose to 91% after the flow was designed.
In this article, we examine how Flow Engineering works in three contexts (AI/LLM workflows, software architecture, and organizational processes) and apply it hands-on with LangGraph code.
Key Concepts
What Is "Flow"?
When you first encounter Flow Engineering, the concepts may feel a bit abstract. However, if you place the three contexts side by side, the common philosophy becomes clear:
Core Philosophy of Flow Engineering: Work must flow toward value without bottlenecks. The starting point of design is to visualize where that flow gets blocked.
| Context | Definition | Key Target |
|---|---|---|
| AI/LLM Engineering | Designs workflow graphs that break complex tasks into steps and let the LLM validate its own output | Prompt → agent nodes/edges |
| Software Architecture | Builds a system optimized for the flow of change by combining Wardley Mapping, DDD, and Team Topologies | Organizational structure ↔ technology structure |
| Organization/Process | Measures how efficiently the four Flow Items (Feature, Defect, Risk, Debt) move through the deployment pipeline | Cycle time (work start to completion), throughput |
If you look at the three contexts, you realize that they are ultimately asking the same question: "Where is the current work getting stuck?"
AI/LLM Context: Beyond the Limitations of a Single Prompt
Everyone has probably, at least once, thought "I just need to write a better prompt," tried to handle a complex task with a single LLM call, and received unstable results. The fix is simple: break the task into steps and design the flow so that the LLM performs self-refinement at each one.
LangGraph is a graph-based state machine that represents this flow using nodes and edges. Since the release of v1.0, it has established itself as the de facto standard framework supporting cycles, memory, and tool calls.
from langgraph.graph import StateGraph, END
from typing import TypedDict

# Initialization required before running:
# from langchain_openai import ChatOpenAI
# from langchain_community.tools import DuckDuckGoSearchRun
# llm = ChatOpenAI(model="gpt-4o")
# search_web = DuckDuckGoSearchRun().invoke

class ResearchState(TypedDict):
    query: str
    search_results: list[str]
    draft: str
    review_feedback: str
    final_answer: str

def search_node(state: ResearchState) -> ResearchState:
    results = search_web(state["query"])
    return {**state, "search_results": results}

def draft_node(state: ResearchState) -> ResearchState:
    draft = llm.invoke(f"Write an answer based on the following material: {state['search_results']}")
    return {**state, "draft": draft.content}

def review_node(state: ResearchState) -> ResearchState:
    # Self-validation: critically review the draft
    feedback = llm.invoke(f"Find the problems with this answer: {state['draft']}")
    return {**state, "review_feedback": feedback.content}

def revise_node(state: ResearchState) -> ResearchState:
    final = llm.invoke(
        f"Improve the draft by incorporating the feedback.\nDraft: {state['draft']}\nFeedback: {state['review_feedback']}"
    )
    return {**state, "final_answer": final.content}

graph = StateGraph(ResearchState)
graph.add_node("search", search_node)
graph.add_node("draft", draft_node)
graph.add_node("review", review_node)
graph.add_node("revise", revise_node)
graph.set_entry_point("search")
graph.add_edge("search", "draft")
graph.add_edge("draft", "review")
graph.add_edge("review", "revise")
graph.add_edge("revise", END)
app = graph.compile()

Self-Refine: A pattern in which an LLM critiques and improves its own output. It is implemented either with a separate validation agent or by re-calling the same model with a different prompt.
Initially I removed the review node and connected draft directly to revise, and without a verification step the modification request led the LLM to ruin a perfectly good draft. It amounted to requesting improvements without knowing what needed improving. Explicitly enforcing the order search → draft → review → revise matters far more than you might think.
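Taking this one step further, the review step can also decide whether the draft is good enough to ship or needs another pass. The following is a minimal sketch, not part of the original example, that reuses the nodes above and uses LangGraph's add_conditional_edges to loop revisions back through review; the quality_check routing function and the "PASS" convention are assumptions for illustration.

from langgraph.graph import StateGraph, END

def quality_check(state: ResearchState) -> str:
    # Assumption: the review prompt is adjusted so the model answers "PASS"
    # when it finds no remaining problems with the draft.
    return "done" if "PASS" in state["review_feedback"] else "needs_revision"

def revise_loop_node(state: ResearchState) -> ResearchState:
    # Loop variant of revise_node: writes the improved text back to "draft"
    # so that review always re-checks the latest version.
    improved = llm.invoke(
        f"Improve the draft by incorporating the feedback.\n"
        f"Draft: {state['draft']}\nFeedback: {state['review_feedback']}"
    )
    return {**state, "draft": improved.content, "final_answer": improved.content}

loop_graph = StateGraph(ResearchState)
loop_graph.add_node("search", search_node)  # reused from the example above
loop_graph.add_node("draft", draft_node)    # reused from the example above
loop_graph.add_node("review", review_node)  # reused from the example above
loop_graph.add_node("revise", revise_loop_node)
loop_graph.set_entry_point("search")
loop_graph.add_edge("search", "draft")
loop_graph.add_edge("draft", "review")
# review either finishes the flow or routes the draft back through revise.
loop_graph.add_conditional_edges(
    "review", quality_check, {"needs_revision": "revise", "done": END}
)
loop_graph.add_edge("revise", "review")
loop_app = loop_graph.compile()

Because revise loops back to review, you would also want to cap iterations in practice; by default LangGraph's recursion limit stops runaway loops with an error rather than letting them spin forever.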
Software Architecture Context: A System Optimized for the Flow of Change
If the principle of "explicitly designing steps" worked in LLM workflows, the same question applies to software architecture: "Where does change get stuck?" Susanne Kaiser's Architecture for Flow answers this question by combining three methodologies:
| Methodology | Role | Questions to Answer |
|---|---|---|
| Wardley Mapping | Understand the strategic landscape | Which components should we build ourselves, and which should we buy? |
| Domain-Driven Design | Decompose the problem space | Where should the boundaries be drawn? |
| Team Topologies | Design team interactions | How will teams collaborate? |
Honestly, at first I found it hard to see why three methodologies had to be used together, but a Wardley Mapping example made it click immediately. Place the question "Should we build the authentication module ourselves or buy a SaaS like Auth0?" on a Wardley Map, and authentication already sits in the commodity region: building it ourselves offers no differentiation and only adds maintenance burden. Our service's core recommendation algorithm, on the other hand, sits closer to genesis (the region of novel innovation), so building it in-house is the right call. The role of Wardley Mapping is to make this kind of decision consistently across the entire system.
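To make the idea concrete, here is a small illustrative sketch; the component names, evolution scores, and thresholds are all hypothetical, not part of Wardley's method. Each component gets a position on the evolution axis from genesis (0.0) to commodity (1.0), and that position drives the build-versus-buy default.

from dataclasses import dataclass

@dataclass
class Component:
    name: str
    evolution: float  # 0.0 = genesis (novel), 1.0 = commodity (utility)

# Hypothetical positions on the evolution axis, for illustration only.
components = [
    Component("recommendation algorithm", 0.15),
    Component("order management", 0.55),
    Component("authentication", 0.90),
]

def build_or_buy(c: Component) -> str:
    """Rule of thumb: commoditized components are bought, novel ones built."""
    if c.evolution >= 0.7:
        return "buy / use SaaS (commodity: building it adds no differentiation)"
    if c.evolution <= 0.3:
        return "build in-house (genesis: this is where we differentiate)"
    return "evaluate case by case (product/custom transition zone)"

for c in components:
    print(f"{c.name}: {build_or_buy(c)}")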
When these three elements work together, a structure is created where code changes can be deployed independently without the need for approval from other teams.
Organizational/Process Context: Flow Framework and Four Flow Items
Even if the architecture is well-designed, you cannot know where bottlenecks occur without measuring how actual work flows through the pipeline. Mik Kersten's Flow Framework is a methodology that measures how efficiently four Flow Items flow through the deployment pipeline:
Flow Items:
├── Feature → creates new business value
├── Defect  → fixes quality problems
├── Risk    → security, compliance, architectural debt
└── Debt    → pays down technical debt

Flow Metrics: Consists of Flow Velocity (number of items completed per unit of time), Flow Time (time from request to deployment for an item), Flow Efficiency (ratio of active work time to total elapsed time), and Flow Distribution (the mix of the four item types).
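As a quick illustration of Flow Distribution (the sample data and helper below are hypothetical), simply counting completed items by type shows at a glance whether a team is quietly spending all of its capacity on defects:

from collections import Counter

# Hypothetical list of items completed in one month.
completed = ["feature", "feature", "defect", "defect", "defect", "debt", "risk", "feature"]

def flow_distribution(items: list[str]) -> dict[str, float]:
    """Share of each Flow Item type among completed work, in percent."""
    counts = Counter(items)
    total = len(items)
    return {item_type: round(count / total * 100, 1) for item_type, count in counts.items()}

print(flow_distribution(completed))
# e.g. {'feature': 37.5, 'defect': 37.5, 'debt': 12.5, 'risk': 12.5}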
Practical Application
Example 1: Multi-Agent Research Workflow
This is an orchestrator-subagent structure borrowed from IBM's LangGraph + Granite example. The orchestrator analyzes the task, specialized subagents (search, code generation) execute in parallel, and the results are integrated in the validation and editing stages.
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

# Example initialization:
# from langchain_openai import ChatOpenAI
# from langchain_community.tools import DuckDuckGoSearchRun
# llm = ChatOpenAI(model="gpt-4o")
# search_tool = DuckDuckGoSearchRun()

class MultiAgentState(TypedDict):
    task: str
    task_analysis: str  # task analysis produced by the orchestrator
    search_result: str
    code_result: str
    validation_result: str
    final_output: str
    errors: Annotated[list[str], operator.add]  # safely accumulated across parallel nodes

def orchestrator(state: MultiAgentState) -> dict:
    """Analyze the task and produce shared context for the subagents."""
    analysis = llm.invoke(
        f"Analyze the following task and summarize its key requirements: {state['task']}"
    )
    return {"task_analysis": analysis.content}

def search_agent(state: MultiAgentState) -> dict:
    """Search specialist: uses the orchestrator's analysis as context."""
    result = search_tool.invoke(
        f"{state['task_analysis']}\n\nSearch query: {state['task']}"
    )
    return {"search_result": result}

def code_agent(state: MultiAgentState) -> dict:
    """Code-generation specialist."""
    result = llm.invoke(
        f"Analysis: {state['task_analysis']}\n\nWrite code that meets the requirements: {state['task']}"
    )
    return {"code_result": result.content}

def validation_agent(state: MultiAgentState) -> dict:
    """Runs after both the search and code results arrive: both branches must finish first."""
    combined = f"Search: {state['search_result']}\nCode: {state['code_result']}"
    validation = llm.invoke(f"Validate the following results: {combined}")
    return {"validation_result": validation.content}

def editor_agent(state: MultiAgentState) -> dict:
    """Final integration."""
    final = llm.invoke(
        f"""Integrate the following results into a final answer:
- Task analysis: {state['task_analysis']}
- Search result: {state['search_result']}
- Code result: {state['code_result']}
- Validation result: {state['validation_result']}"""
    )
    return {"final_output": final.content}

builder = StateGraph(MultiAgentState)
builder.add_node("orchestrator", orchestrator)
builder.add_node("search", search_agent)
builder.add_node("code", code_agent)
builder.add_node("validation", validation_agent)
builder.add_node("editor", editor_agent)
builder.set_entry_point("orchestrator")
# Fan-out: orchestrator → search and code run in the same superstep
builder.add_edge("orchestrator", "search")
builder.add_edge("orchestrator", "code")
# Fan-in: validation starts only after the superstep in which both search and code finish
builder.add_edge("search", "validation")
builder.add_edge("code", "validation")
builder.add_edge("validation", "editor")
builder.add_edge("editor", END)
multi_agent_app = builder.compile()

LangGraph executes internally in supersteps. Once the orchestrator completes, search and code run concurrently in the same superstep, and the validation superstep begins only after both nodes have finished. Thanks to this, you do not need to implement any synchronization yourself, as long as you attach a state-merge annotation such as Annotated[list, operator.add] wherever parallel nodes write to the same key.
One caveat: the orchestrator is only meaningful if task_analysis is actually consumed by the subagents. The first time I used this pattern, the orchestrator was effectively useless: I put the analysis into the state and nothing ever read it.
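For completeness, here is a minimal usage sketch; the task string is made up, and llm and search_tool must be initialized as in the comments above.

# Run the graph once; LangGraph handles the fan-out/fan-in ordering internally.
result = multi_agent_app.invoke({
    "task": "Compare Python retry libraries and show a usage example",
    "errors": [],
})
print(result["final_output"])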
Example 2: Finding Development Bottlenecks with Flow Metrics
As the BDC (Business Development Bank of Canada) team discovered when measuring flow with Axify, most delays occur before and after development (waiting on planning, waiting on QA) rather than during development itself. A simple Flow Metrics calculation looks like this:
interface FlowItem {
  id: string;
  type: "feature" | "defect" | "risk" | "debt";
  requestedAt: Date;
  startedAt: Date | null;
  completedAt: Date | null;
}

interface FlowMetrics {
  flowTime: number;       // request to completion (days)
  waitTime: number;       // request to start (days)
  activeTime: number;     // actual working time (days)
  flowEfficiency: number; // activeTime / flowTime × 100 (%)
}

function calculateFlowMetrics(item: FlowItem): FlowMetrics | null {
  if (!item.startedAt || !item.completedAt) return null;
  const msPerDay = 1000 * 60 * 60 * 24;
  const flowTime = (item.completedAt.getTime() - item.requestedAt.getTime()) / msPerDay;
  const waitTime = (item.startedAt.getTime() - item.requestedAt.getTime()) / msPerDay;
  const activeTime = (item.completedAt.getTime() - item.startedAt.getTime()) / msPerDay;
  const flowEfficiency = (activeTime / flowTime) * 100;
  return { flowTime, waitTime, activeTime, flowEfficiency };
}

function analyzeBottleneck(items: FlowItem[]): void {
  const metrics = items
    .map((item) => ({ item, metrics: calculateFlowMetrics(item) }))
    .filter((r) => r.metrics !== null);
  const avgWaitTime =
    metrics.reduce((sum, r) => sum + r.metrics!.waitTime, 0) / metrics.length;
  const avgEfficiency =
    metrics.reduce((sum, r) => sum + r.metrics!.flowEfficiency, 0) / metrics.length;
  console.log(`Average wait time: ${avgWaitTime.toFixed(1)} days`);
  console.log(`Average flow efficiency: ${avgEfficiency.toFixed(1)}%`);
  // In Mik Kersten's research, Flow Efficiency for knowledge work typically
  // falls in the 15-40% range; below 15% signals overwhelming wait time.
  if (avgEfficiency < 15) {
    console.log("⚠️ Bottleneck detected: wait time vastly exceeds actual work time.");
    console.log("→ Worth reviewing the planning, QA, and review processes.");
  }
}

It was quite a shock to realize that what had looked like "developers being slow" was, once I saw these numbers, really just waiting on process. A Flow Efficiency in the 10% range means that out of an 8-hour day, only about 48 minutes go into actually writing code, with the rest spent waiting for approvals or for the next sprint.
Pros and Cons Analysis
Advantages
| Item | Content | Precautions |
|---|---|---|
| Bottleneck Visualization | Break down workflows into measurable units to discover hidden latency points | Incorrectly set measurement points can point to the wrong bottlenecks |
| LLM Quality Improvement | Step decomposition plus self-verification raises accuracy far above a single prompt (the 17% → 91% case) | As the number of nodes grows, debugging effort and costs grow with it |
| Team Autonomy | Flow-centric design reduces inter-team dependencies, increasing the chance of independent deployment | Adopting Team Topologies entails organizational change, so executive support is required |
| Business Linkage | Flow Metrics let you tie engineering performance directly to business impact | Verify the connection to business impact when designing metrics to prevent distortion |
| Scalability | Horizontal scaling is easy with parallel agent execution and an event-driven pattern | Horizontal scaling without Observability infrastructure makes it difficult to trace the cause of failures |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| The Paradox of Measurement | Poorly designed Flow Metrics lead to distortions where meaningless metrics, such as commit counts, are optimized | It is recommended to first verify the connection to business impact when designing metrics |
| Increased complexity | Multi-agent workflows are much more difficult to debug than a single LLM call | It is recommended to build the Observability infrastructure first and then add agents |
| Increased Costs | Token costs can grow faster than linearly when chaining LLM calls | Consider tiered caching and mixing in low-cost models (see the sketch below the table) |
| Organizational Change | The adoption of Team Topologies and Architecture for Flow entails a change in organizational culture, not just a technical issue | Driving this solely through the technology team without management support has a high probability of failure |
| Requirements Variability | Even with a stabilized flow, frequently changing requirements incur workflow redesign costs | Separate the interfaces of frequently changing nodes so they can be swapped out flexibly |
Observability: The degree to which the internal state of a system can be inferred from external outputs (logs, traces, metrics). In agent workflows, tracking which node made which decision based on which input is key. LangSmith is widely used in the LangChain ecosystem, while OpenTelemetry is commonly used for general purposes.
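On the cost row above, one mitigation is mixing model tiers. A minimal sketch, with the caveat that the model names and the routing split are assumptions rather than a recommendation from the sources: run the high-volume drafting step on a cheap model and reserve the stronger model for verification, where judgment matters most. It reuses ResearchState from the first example.

from langchain_openai import ChatOpenAI

# Assumption for illustration: a cheap model for drafting, a stronger one for review.
cheap_llm = ChatOpenAI(model="gpt-4o-mini")
strong_llm = ChatOpenAI(model="gpt-4o")

def draft_cheap(state: ResearchState) -> ResearchState:
    # High-volume generation step: cost grows with every loop iteration,
    # so this is where the low-cost model pays off.
    draft = cheap_llm.invoke(f"Write an answer based on: {state['search_results']}")
    return {**state, "draft": draft.content}

def review_strong(state: ResearchState) -> ResearchState:
    # Verification gates quality for the whole flow, so spend tokens here.
    feedback = strong_llm.invoke(f"Find the problems with this answer: {state['draft']}")
    return {**state, "review_feedback": feedback.content}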
The Most Common Mistakes in Practice
- Starting optimization without measurement: It is common to overhaul processes based solely on the feeling that "our team's flow is slow," often improving areas that are not the actual bottleneck. Collect Flow Metrics first and decide what to improve after reviewing the data.
- Expecting too much from a single LLM call: Trying to handle a complex task in one call produces unstable results. A structure that breaks the task into 3 to 5 steps and verifies each one is far more stable. As seen above, adding a single review node can drastically change output quality.
- Deploying agents to production without Observability: When a workflow that ran fine locally misbehaves in production, debugging is virtually impossible if you cannot see which node received what input. The figure from the LangChain report that 89% of companies have already implemented Observability is not without reason.
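Getting baseline tracing in place is cheap. In the LangChain ecosystem, for example, LangSmith tracing is enabled through environment variables (the project name below is made up); after that, every node execution in the compiled graph is recorded without code changes:

import os

# Enable LangSmith tracing for every LangGraph/LangChain call in this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "flow-engineering-demo"  # hypothetical project name

# From here on, app.invoke(...) runs are traced node by node in LangSmith,
# so you can see which node received which input when production misbehaves.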
In Conclusion
Designing flow is ultimately the process of looking honestly at "what we are currently waiting for." The essence of Flow Engineering is eliminating that waiting through code, processes, and organizational structure. Whether in an AI agent workflow, a microservices architecture, or a team's development process, it helps to first visualize the points where the flow gets stuck and start designing from there.
Here are three steps you can start right now, depending on your situation:
- Measure your team's Flow Time. Pull the last three months of issues from GitHub, Jira, or Linear and calculate the interval from request date to deployment date. You will immediately see the average Flow Time for Features and Defects, and which stage (planning wait, review, QA) takes the longest. If your team's process feels slow, start here (a small sketch follows this list).
- If you use LLMs, break one complex prompt into three steps. Divide it into "Search → Draft → Review" or "Analysis → Generation → Validation," define each step as a LangGraph node, and connect them; the change in accuracy is immediately visible. If your LLM output quality is unstable, start here.
- To approach this from the architecture side, read Susanne Kaiser's Architecture for Flow or Mik Kersten's Project to Product. Both explain the concepts with concrete examples and make excellent starting points for team discussion. If deployment dependencies between teams are complex, start here.
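As a starter for the first step, here is a minimal sketch; the issue records are hypothetical, and in practice they would come from your tracker's CSV export or API.

from datetime import date
from statistics import median

# Hypothetical issue export: (type, requested, deployed)
issues = [
    ("feature", date(2024, 1, 3), date(2024, 1, 24)),
    ("feature", date(2024, 1, 10), date(2024, 2, 2)),
    ("defect",  date(2024, 1, 15), date(2024, 1, 19)),
    ("defect",  date(2024, 2, 1),  date(2024, 2, 12)),
]

def median_flow_time(records, item_type: str) -> float:
    """Median days from request to deployment for one Flow Item type."""
    durations = [(done - requested).days for t, requested, done in records if t == item_type]
    return median(durations)

for t in ("feature", "defect"):
    print(f"{t}: median Flow Time {median_flow_time(issues, t)} days")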
Reference Materials
- Flow Engineering is All You Need | Medium
- What is Architecture for Flow? | Susanne Kaiser
- Architecture for Flow Official Site
- Architecture for Flow | O'Reilly
- AFLOW: ICLR 2025 Paper | arXiv
- FLOW: Modularized Agentic Workflow Automation | OpenReview ICLR 2025
- LLM Workflows: Patterns, Tools & Production Architecture | Morph
- State of Agent Engineering | LangChain
- LangGraph Official Documentation
- Flow Framework Official Site
- What are Flow Metrics? | DX
- Flow Architectures | O'Reilly
- Understanding flow in software development | Swarmia
- Dissecting 'architecting for fast, sustainable flow' | microservices.io