Scaling AI Workflows using LangGraph
A deep dive into scaling LangGraph workflows for production, covering Redis persistence, tool coordination, and horizontal scaling strategies.
AI Documentation Note: This article was generated by AI to document the architecture and implementation patterns for scaling LangGraph in production environments. It serves as a comprehensive guide based on real-world deployment considerations.
GoVisually's AI agents are orchestrated using LangGraph. We use it for compliance checking, document analysis, and a growing number of AI-powered features across the platform. Over the past several months, we've learned a lot about what it takes to run LangGraph reliably in production. This post captures those learnings.
The development experience with LangGraph is smooth. You spin up a graph, chain some nodes together, and watch your agents collaborate. But production introduces harder questions: "What happens when my server restarts mid-workflow?" or "How do I prevent three parallel agents from all calling the same expensive tool?"
Why Redis Persistence is Critical
Redis persistence isn't just about fault tolerance; it's the foundation that enables everything else. Here's what it unlocks:
Horizontal scaling becomes possible. Without shared state, you're stuck on a single instance. With Redis checkpointing, any worker can pick up any workflow. Add capacity by spinning up more workers; no code changes needed.
Workflows become resumable. A compliance check might take 30-60 seconds. If something fails halfway through (a network blip, a deployment, an OOM error), you don't lose that work. Resume from the last checkpoint and continue.
Long-running operations work reliably. Human-in-the-loop workflows that span hours or days? No problem. The state persists across sessions, restarts, and even infrastructure changes.
The Architecture
The solution is straightforward: move all state to Redis. Here's what that looks like:
┌───────────────────────────────────────────────────┐
│        Load Balancer (Render/Cloud Provider)      │
└─────────────────────────┬─────────────────────────┘
                          │
           ┌──────────────┼──────────────┐
           │              │              │
      ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
      │ Worker1 │    │ Worker2 │    │ Worker3 │
      └────┬────┘    └────┬────┘    └────┬────┘
           │              │              │
           └──────────────┼──────────────┘
                          │
                          ▼
                 ┌────────────────┐
                 │  Redis Cache   │
                 │                │
                 │ • Checkpoints  │
                 │ • Tool Cache   │
                 │ • Long-term    │
                 │   Memory       │
                 └────────────────┘
Each worker is stateless. All workflow state lives in Redis. This means:
- Any worker can handle any request - no sticky sessions needed
- Workflows survive restarts - just resume from the last checkpoint
- Scaling is trivial - add more workers as needed
Setting Up Redis Checkpointing
LangGraph has first-class support for Redis persistence through langgraph-checkpoint-redis:
pip install langgraph-checkpoint-redis redis
Wiring it up is straightforward:
from langgraph.graph import StateGraph
from langgraph.checkpoint.redis import AsyncRedisSaver
# Create checkpointer with TTL (workflows expire after 60 min)
checkpointer = AsyncRedisSaver.from_conn_string(
    "redis://localhost:6379",
    ttl={"default_ttl": 60, "refresh_on_read": True},
)
workflow = StateGraph(ComplianceState)
workflow.add_node("pdf_parser", pdf_parser_node)
workflow.add_node("fda_agent", fda_compliance_agent)
workflow.add_node("eu_agent", eu_compliance_agent)
workflow.add_edge("pdf_parser", "fda_agent")
workflow.add_edge("fda_agent", "eu_agent")
# Compile with persistence
graph = workflow.compile(checkpointer=checkpointer)
Now every node execution automatically creates a checkpoint. If the workflow crashes at eu_agent, you resume from exactly where it left off, with no wasted LLM calls.
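Resumption is keyed by the thread ID in the run config. A minimal sketch of the pattern (the thread ID and input payload here are illustrative; passing `None` as the input with the same `thread_id` is LangGraph's convention for continuing a thread from its latest checkpoint):

```python
# Every invocation is keyed by a thread_id; checkpoints accumulate under it.
config = {"configurable": {"thread_id": "compliance-run-42"}}

# First attempt -- may crash partway through:
#   result = await graph.ainvoke({"document_id": "doc-123"}, config=config)

# After a restart, any worker can resume from the last checkpoint by
# invoking with input=None and the same thread_id:
#   result = await graph.ainvoke(None, config=config)
```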
The Data Access Problem
When you run multiple agents in parallel, each one needs different slices of the same data. Our FDA compliance agent needs ingredients and health claims. The EU agent needs regulatory-relevant sections. The image analyzer needs high-resolution images with specific metadata.
The naive approach is to load the full dataset for every agent. But when you're dealing with parsed PDFs, images, and metadata, that's a lot of wasted tokens, and tokens cost money.
Instead, we start agents with minimal context and let them request additional data as needed. This keeps each agent focused on exactly what it needs, nothing more.
Solution: Centralized Data Access with jq Filtering
We implemented a data access tool that acts as a gateway to workflow state. Agents query it using jq syntax to retrieve only the data they need:
# FDA Agent - Get only ingredients sections
.pdf_data.sections[] | select(.type == "ingredients")
# EU Agent - Get regulatory-relevant claims with metadata
.pdf_data.claims[] | select(.regulatory_relevant == true) | {text, page}
# Image Agent - Get high-res images only
.images[] | select(.dimensions.width > 1000) | {url, format}
The result? 70-90% reduction in tokens passed to each agent. That translates directly to cost savings and faster responses.
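To make the idea concrete, here's a rough sketch of such a data-access tool. The state shape, field names, and function are illustrative; in production the agent submits a real jq program, but a plain Python predicate stands in for it here:

```python
# Sketch of a data-access tool: agents ask for a slice of workflow state
# instead of receiving the whole payload.

def query_workflow_data(state: dict, section_type: str) -> list[dict]:
    """Return only the PDF sections of a given type, not the full document."""
    return [
        s for s in state.get("pdf_data", {}).get("sections", [])
        if s.get("type") == section_type
    ]

workflow_state = {
    "pdf_data": {
        "sections": [
            {"type": "ingredients", "text": "Sugar, cocoa butter"},
            {"type": "marketing", "text": "The world's best chocolate"},
        ]
    }
}

# The FDA agent requests just the ingredients slice:
ingredients = query_workflow_data(workflow_state, "ingredients")
```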
Tool Coordination: Preventing Duplicate Work
When three agents run in parallel and all need the same PDF-to-JSON conversion, you don't want to run that conversion three times. The first agent should execute it, and the others should wait for the result.
We use a hybrid approach that combines in-memory coordination with Redis caching. The pattern is:
- Check if result already exists (instant return)
- Check if same process is already computing (wait for it)
- Try to acquire a distributed lock (first one wins)
- If we get the lock, compute and cache
- If someone else has the lock, wait for their result
This prevents both in-process and cross-process duplicate work.
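The cross-process half of that sequence can be sketched with redis-py-style `get`/`set` calls. This is a simplified version under stated assumptions: key names and TTLs are illustrative, and the in-process "wait for the same process's computation" step is replaced by simple polling for brevity:

```python
import json
import time

def get_or_compute(client, key, compute, lock_ttl=30, wait_timeout=60):
    """Run `compute` once across all workers; everyone else reuses the result.

    `client` is any redis-py-compatible object (get / set / delete).
    """
    # 1. Fast path: someone already cached the result.
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)

    # 2. Try to take the distributed lock: SET NX EX -- first caller wins.
    lock_key = f"{key}:lock"
    if client.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            result = compute()
            client.set(key, json.dumps(result), ex=3600)
            return result
        finally:
            client.delete(lock_key)

    # 3. Someone else holds the lock: poll until their result appears.
    deadline = time.monotonic() + wait_timeout
    while time.monotonic() < deadline:
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
        time.sleep(0.1)
    raise TimeoutError(f"gave up waiting for {key}")
```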
The Agent-Tool Loop
A core pattern in LangGraph is the agent-tool loop. An AI node decides what to do, calls a tool, gets results, and decides again. This loop continues until the agent determines it has enough information to complete its task.
┌─────────────┐
│   AI Node   │◄─────────────────┐
│  (decides)  │                  │
└──────┬──────┘                  │
       │                         │
       │ tool_calls              │ tool_result
       ▼                         │
┌─────────────┐                  │
│  Tool Node  │──────────────────┘
│ (executes)  │
└─────────────┘
In LangGraph, you wire this up with conditional edges:
from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
def should_continue(state):
    """Route based on whether the AI wants to call more tools"""
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return "end"
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", ToolNode(tools))
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": "__end__"},
)
workflow.add_edge("tools", "agent") # Loop back after tool execution
This loop is where checkpointing becomes critical. Each iteration creates a checkpoint, so if something fails mid-loop, you resume from the last successful tool call, not from the beginning.
Scaling Strategy
Our recommended architecture uses stateless FastAPI workers with Redis for all shared state. Here's the setup:
# render.yaml
services:
  - type: web
    name: ai-core
    scaling:
      minInstances: 2
      maxInstances: 10
      targetCPUPercent: 70
    envVars:
      - key: REDIS_URL
        fromDatabase:
          name: redis
          property: connectionString
This gives you:
- Automatic horizontal scaling - Workers spin up and down based on CPU usage
- No sticky sessions - Any worker can handle any request since all state lives in Redis
- Fault tolerance - If a worker dies, another picks up the workflow from its last checkpoint
- Simple deployments - Roll out new code without worrying about in-flight workflows
The key insight is that workers should be completely stateless. All coordination happens through Redis: checkpoints, tool caching, distributed locks. This means scaling is just a matter of adding more workers.
Performance Tips
A few things I've learned along the way:
Minimize State Size
Don't store large data in state. Store references instead. Keep your 10MB PDF in S3 and store just the key in workflow state. This keeps checkpoints fast and Redis memory usage low.
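For instance, the state schema might look like this (a hypothetical shape; the field names are illustrative, not our actual schema):

```python
from typing import TypedDict

class ComplianceState(TypedDict):
    # Store a reference to the document, never the document itself.
    pdf_s3_key: str        # e.g. a key like "uploads/acme/label-v3.pdf" in S3
    findings: list[str]    # small, serializable results only

state: ComplianceState = {
    "pdf_s3_key": "uploads/acme/label-v3.pdf",
    "findings": [],
}
```

Every checkpoint now serializes a few hundred bytes instead of megabytes.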
Run Agents in Parallel
LangGraph supports fan-out patterns:
from langgraph.constants import Send

def route_to_agents(state):
    """Fan-out to multiple agents in parallel"""
    return [
        Send("fda_agent", state),
        Send("eu_agent", state),
        Send("image_agent", state),
    ]
Three agents that each take 10 seconds run in 10 seconds total, not 30.
What About Costs?
Token usage is the biggest cost driver. Here's what moves the needle:
- Data filtering - Only send relevant data to each agent (70-90% reduction)
- Response caching - Cache LLM responses for identical prompts
- Prompt caching - Claude supports caching system prompts, which significantly reduces costs for repeated operations
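Response caching can be as simple as keying on a hash of the model and prompt. A minimal sketch (the key scheme, TTL, and function names are illustrative, and `client` is any Redis-like cache):

```python
import hashlib
import json

def llm_cache_key(model: str, prompt: str) -> str:
    """Deterministic key: identical model + prompt -> identical key."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm:{model}:{digest}"

def cached_completion(client, model, prompt, call_llm, ttl=3600):
    """Reuse a cached response when the exact prompt was seen before."""
    key = llm_cache_key(model, prompt)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    response = call_llm(model, prompt)
    client.set(key, json.dumps(response), ex=ttl)
    return response
```

Note this only helps for byte-identical prompts, which is why data filtering (stable, minimal inputs) and caching compound well together.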
With these optimizations, we've seen meaningful reductions in per-workflow costs.
Monitoring
You'll want visibility into what's happening. Key metrics to track:
- Workflow success rate - Target 99%+
- Checkpoint write latency - Should be under 50ms
- Cache hit rate - Higher is better (80%+)
- LLM response time - Watch for degradation
- Token usage - Track per workflow and per agent
We use LangSmith for all our telemetry and debugging. It gives you full visibility into every LLM call, token usage, and workflow execution. The ability to replay and debug failed runs has been invaluable for identifying issues in production.
Looking Ahead
LangGraph is evolving quickly. The patterns I've described work today, but the framework is adding more built-in support for production concerns. The Redis checkpointer, for example, is relatively new.
What won't change is the fundamental architecture: stateless workers with shared state storage. That pattern scales, and it's how most distributed systems work.
If you're building AI workflows that need to survive the chaos of production, start with persistence. Everything else builds on that foundation.
This post covers patterns used in production at GoVisually for AI-powered compliance checking workflows. The specific implementation details may vary based on your requirements and infrastructure.