Scaling AI Workflows using LangGraph
A deep dive into scaling LangGraph workflows for production, covering Redis persistence, tool coordination, and horizontal scaling strategies.
AI Documentation Note: This article was generated by AI to document the architecture and implementation patterns for scaling LangGraph in production environments. It serves as a comprehensive guide based on real-world deployment considerations.
GoVisually's AI agents are orchestrated using LangGraph. We use it for compliance checking, document analysis, and a growing number of AI-powered features across the platform. Over the past several months, we've learned a lot about what it takes to run LangGraph reliably in production. This post captures those learnings.
The development experience with LangGraph is smooth. You spin up a graph, chain some nodes together, and watch your agents collaborate. But production introduces harder questions: "What happens when my server restarts mid-workflow?" or "How do I prevent three parallel agents from all calling the same expensive tool?"
Why Redis Persistence is Critical
Redis persistence isn't just about fault tolerance; it's the foundation that enables everything else. Here's what it unlocks:
Horizontal scaling becomes possible. Without shared state, you're stuck on a single instance. With Redis checkpointing, any worker can pick up any workflow. Add capacity by spinning up more workers; no code changes needed.
Workflows become resumable. A compliance check might take 30-60 seconds. If something fails halfway through (a network blip, a deployment, an OOM error), you don't lose that work. Resume from the last checkpoint and continue.
Long-running operations work reliably. Human-in-the-loop workflows that span hours or days? No problem. The state persists across sessions, restarts, and even infrastructure changes.
The Architecture
The solution is straightforward: move all state to Redis. Here's what that looks like:
┌───────────────────────────────────────────────────┐
│        Load Balancer (Render/Cloud Provider)      │
└─────────────────────────┬─────────────────────────┘
                          │
           ┌──────────────┼──────────────┐
           │              │              │
      ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
      │ Worker1 │    │ Worker2 │    │ Worker3 │
      └────┬────┘    └────┬────┘    └────┬────┘
           │              │              │
           └──────────────┼──────────────┘
                          │
                          ▼
                 ┌────────────────┐
                 │  Redis Cache   │
                 │                │
                 │ • Checkpoints  │
                 │ • Tool Cache   │
                 │ • Long-term    │
                 │   Memory       │
                 └────────────────┘
Each worker is stateless. All workflow state lives in Redis. This means:
- Any worker can handle any request - no sticky sessions needed
- Workflows survive restarts - just resume from the last checkpoint
- Scaling is trivial - add more workers as needed
Setting Up Redis Checkpointing
LangGraph has first-class support for Redis persistence through langgraph-checkpoint-redis:
pip install langgraph-checkpoint-redis redis
Wiring it up is straightforward:
from langgraph.graph import StateGraph
from langgraph.checkpoint.redis import AsyncRedisSaver
# Create checkpointer with TTL (workflows expire after 60 min)
checkpointer = AsyncRedisSaver.from_conn_string(
    "redis://localhost:6379",
    ttl={"default_ttl": 60, "refresh_on_read": True},
)
workflow = StateGraph(ComplianceState)
workflow.add_node("pdf_parser", pdf_parser_node)
workflow.add_node("fda_agent", fda_compliance_agent)
workflow.add_node("eu_agent", eu_compliance_agent)
workflow.add_edge("pdf_parser", "fda_agent")
workflow.add_edge("fda_agent", "eu_agent")
# Compile with persistence
graph = workflow.compile(checkpointer=checkpointer)
Now every node execution automatically creates a checkpoint. If the workflow crashes at eu_agent, you resume from exactly where it left off, with no wasted LLM calls.
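Resumption is keyed by the thread ID in the run config. A minimal sketch of the pattern (the thread ID and input payload here are illustrative; passing `None` as the input with the same `thread_id` is LangGraph's convention for continuing a thread from its latest checkpoint):

```python
# Every invocation is keyed by a thread_id; checkpoints accumulate under it.
config = {"configurable": {"thread_id": "compliance-run-42"}}

# First attempt -- may crash partway through:
#   result = await graph.ainvoke({"document_id": "doc-123"}, config=config)

# After a restart, any worker can resume from the last checkpoint by
# invoking with input=None and the same thread_id:
#   result = await graph.ainvoke(None, config=config)
```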
The Data Access Problem
When you run multiple agents in parallel, each one needs different slices of the same data. Our FDA compliance agent needs ingredients and health claims. The EU agent needs regulatory-relevant sections. The image analyzer needs high-resolution images with specific metadata.
The naive approach is to load the full dataset for every agent. But when you're dealing with parsed PDFs, images, and metadata, that's a lot of wasted tokens, and tokens cost money.
Instead, we start agents with minimal context and let them request additional data as needed. This keeps each agent focused on exactly what it needs, nothing more.
Solution: Centralized Data Access with jq Filtering
We implemented a data access tool that acts as a gateway to workflow state. Agents query it using jq syntax to retrieve only the data they need:
# FDA Agent - Get only ingredients sections
.pdf_data.sections[] | select(.type == "ingredients")
# EU Agent - Get regulatory-relevant claims with metadata
.pdf_data.claims[] | select(.regulatory_relevant == true) | {text, page}
# Image Agent - Get high-res images only
.images[] | select(.dimensions.width > 1000) | {url, format}
The result? 70-90% reduction in tokens passed to each agent. That translates directly to cost savings and faster responses.
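To make the idea concrete, here's a rough sketch of such a data-access tool. The state shape, field names, and function are illustrative; in production the agent submits a real jq program, but a plain Python predicate stands in for it here:

```python
# Sketch of a data-access tool: agents ask for a slice of workflow state
# instead of receiving the whole payload.

def query_workflow_data(state: dict, section_type: str) -> list[dict]:
    """Return only the PDF sections of a given type, not the full document."""
    return [
        s for s in state.get("pdf_data", {}).get("sections", [])
        if s.get("type") == section_type
    ]

workflow_state = {
    "pdf_data": {
        "sections": [
            {"type": "ingredients", "text": "Sugar, cocoa butter"},
            {"type": "marketing", "text": "The world's best chocolate"},
        ]
    }
}

# The FDA agent requests just the ingredients slice:
ingredients = query_workflow_data(workflow_state, "ingredients")
```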
Tool Coordination: Preventing Duplicate Work
When three agents run in parallel and all need the same PDF-to-JSON conversion, you don't want to run that conversion three times. The first agent should execute it, and the others should wait for the result.
We use a hybrid approach that combines in-memory coordination with Redis caching. The pattern is:
- Check if result already exists (instant return)
- Check if same process is already computing (wait for it)
- Try to acquire a distributed lock (first one wins)
- If we get the lock, compute and cache
- If someone else has the lock, wait for their result
This prevents both in-process and cross-process duplicate work.
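The cross-process half of that sequence can be sketched with redis-py-style `get`/`set` calls. This is a simplified version under stated assumptions: key names and TTLs are illustrative, and the in-process "wait for the same process's computation" step is replaced by simple polling for brevity:

```python
import json
import time

def get_or_compute(client, key, compute, lock_ttl=30, wait_timeout=60):
    """Run `compute` once across all workers; everyone else reuses the result.

    `client` is any redis-py-compatible object (get / set / delete).
    """
    # 1. Fast path: someone already cached the result.
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)

    # 2. Try to take the distributed lock: SET NX EX -- first caller wins.
    lock_key = f"{key}:lock"
    if client.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            result = compute()
            client.set(key, json.dumps(result), ex=3600)
            return result
        finally:
            client.delete(lock_key)

    # 3. Someone else holds the lock: poll until their result appears.
    deadline = time.monotonic() + wait_timeout
    while time.monotonic() < deadline:
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
        time.sleep(0.1)
    raise TimeoutError(f"gave up waiting for {key}")
```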
The Agent-Tool Loop
A core pattern in LangGraph is the agent-tool loop. An AI node decides what to do, calls a tool, gets results, and decides again. This loop continues until the agent determines it has enough information to complete its task.
┌─────────────┐
│   AI Node   │◄─────────────────┐
│  (decides)  │                  │
└──────┬──────┘                  │
       │                         │
       │ tool_calls              │ tool_result
       ▼                         │
┌─────────────┐                  │
│  Tool Node  │──────────────────┘
│ (executes)  │
└─────────────┘
In LangGraph, you wire this up with conditional edges:
from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
def should_continue(state):
    """Route based on whether the AI wants to call more tools"""
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return "end"
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", ToolNode(tools))
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": "__end__"},
)
workflow.add_edge("tools", "agent") # Loop back after tool execution
This loop is where checkpointing becomes critical. Each iteration creates a checkpoint, so if something fails mid-loop, you resume from the last successful tool call, not from the beginning.
Scaling Strategy
Our recommended architecture uses stateless FastAPI workers with Redis for all shared state. Here's the setup:
# render.yaml
services:
  - type: web
    name: ai-core
    scaling:
      minInstances: 2
      maxInstances: 10
      targetCPUPercent: 70
    envVars:
      - key: REDIS_URL
        fromDatabase:
          name: redis
          property: connectionString
This gives you:
- Automatic horizontal scaling - Workers spin up and down based on CPU usage
- No sticky sessions - Any worker can handle any request since all state lives in Redis
- Fault tolerance - If a worker dies, another picks up the workflow from its last checkpoint
- Simple deployments - Roll out new code without worrying about in-flight workflows
The key insight is that workers should be completely stateless. All coordination happens through Redis: checkpoints, tool caching, distributed locks. This means scaling is just a matter of adding more workers.
Performance Tips
A few things I've learned along the way:
Minimize State Size
Don't store large data in state. Store references instead. Keep your 10MB PDF in S3 and store just the key in workflow state. This keeps checkpoints fast and Redis memory usage low.
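For instance, the state schema might look like this (a hypothetical shape; the field names are illustrative, not our actual schema):

```python
from typing import TypedDict

class ComplianceState(TypedDict):
    # Store a reference to the document, never the document itself.
    pdf_s3_key: str        # e.g. a key like "uploads/acme/label-v3.pdf" in S3
    findings: list[str]    # small, serializable results only

state: ComplianceState = {
    "pdf_s3_key": "uploads/acme/label-v3.pdf",
    "findings": [],
}
```

Every checkpoint now serializes a few hundred bytes instead of megabytes.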
Run Agents in Parallel
LangGraph supports fan-out patterns:
from langgraph.constants import Send

def route_to_agents(state):
    """Fan-out to multiple agents in parallel"""
    return [
        Send("fda_agent", state),
        Send("eu_agent", state),
        Send("image_agent", state),
    ]
Three agents that each take 10 seconds run in 10 seconds total, not 30.
What About Costs?
Token usage is the biggest cost driver. Here's what moves the needle:
- Data filtering - Only send relevant data to each agent (70-90% reduction)
- Response caching - Cache LLM responses for identical prompts
- Prompt caching - Claude supports caching system prompts, which significantly reduces costs for repeated operations
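Response caching can be as simple as keying on a hash of the model and prompt. A minimal sketch (the key scheme, TTL, and function names are illustrative, and `client` is any Redis-like cache):

```python
import hashlib
import json

def llm_cache_key(model: str, prompt: str) -> str:
    """Deterministic key: identical model + prompt -> identical key."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm:{model}:{digest}"

def cached_completion(client, model, prompt, call_llm, ttl=3600):
    """Reuse a cached response when the exact prompt was seen before."""
    key = llm_cache_key(model, prompt)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    response = call_llm(model, prompt)
    client.set(key, json.dumps(response), ex=ttl)
    return response
```

Note this only helps for byte-identical prompts, which is why data filtering (stable, minimal inputs) and caching compound well together.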
With these optimizations, we've seen meaningful reductions in per-workflow costs.
Monitoring
You'll want visibility into what's happening. Key metrics to track:
- Workflow success rate - Target 99%+
- Checkpoint write latency - Should be under 50ms
- Cache hit rate - Higher is better (80%+)
- LLM response time - Watch for degradation
- Token usage - Track per workflow and per agent
We use LangSmith for all our telemetry and debugging. It gives you full visibility into every LLM call, token usage, and workflow execution. The ability to replay and debug failed runs has been invaluable for identifying issues in production.
Looking Ahead
LangGraph is evolving quickly. The patterns I've described work today, but the framework is adding more built-in support for production concerns. The Redis checkpointer, for example, is relatively new.
What won't change is the fundamental architecture: stateless workers with shared state storage. That pattern scales, and it's how most distributed systems work.
If you're building AI workflows that need to survive the chaos of production, start with persistence. Everything else builds on that foundation.
This post covers patterns used in production at GoVisually for AI-powered compliance checking workflows. The specific implementation details may vary based on your requirements and infrastructure.