Why We Chose MongoDB for LangGraph Checkpointing

AI Written, LangGraph, MongoDB, Redis, Scaling

When we needed to scale our AI workflows horizontally, we had to choose between MongoDB and Redis for LangGraph checkpointing. Here's why MongoDB made more sense for us.

📝 AI Documentation Note: This article was generated by AI to document our checkpoint persistence decisions for LangGraph workflows. It captures the technical considerations and trade-offs we evaluated.

We're building LangGraph workflows for document analysis with multiple AI agents. During prototyping, things worked fine with a single server. But we knew we'd eventually need to scale horizontally with multiple workers.

That's when we hit the persistence question. LangGraph workflows need shared state. When you run multiple API servers, they need to share workflow checkpoints. Worker A might start a workflow, but Worker B needs to retrieve its status. Without shared persistence, that doesn't work.

The Architecture Challenge

Our API pattern is pretty straightforward:

  • POST request starts a workflow, returns a job ID
  • Poll for status while the workflow runs
  • Retrieve results when it completes

Simple enough with one server, but add horizontal scaling and you need persistent checkpoints.
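
To make that concrete, here is a rough sketch of the pattern. FastAPI is assumed, and the endpoint names and the my_workflows module are illustrative rather than our actual code; the important part is that the job ID doubles as the checkpoint thread_id.

import asyncio
import uuid

from fastapi import FastAPI

from my_workflows import graph  # hypothetical module: a LangGraph graph compiled with a shared checkpointer

app = FastAPI()

@app.post("/workflows")
async def start_workflow(payload: dict):
    thread_id = str(uuid.uuid4())  # the job ID doubles as the checkpoint thread_id
    config = {"configurable": {"thread_id": thread_id}}
    # Run in the background; every step is checkpointed to shared storage
    asyncio.create_task(graph.ainvoke(payload, config))
    return {"job_id": thread_id}

@app.get("/workflows/{job_id}")
async def get_status(job_id: str):
    # Any worker can answer this, because the checkpoint lives in shared storage
    config = {"configurable": {"thread_id": job_id}}
    state = await graph.aget_state(config)
    return {"values": state.values, "pending_nodes": state.next}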

LangGraph offers several checkpointer backends; for our stack, the two realistic candidates were Redis and MongoDB. We had to pick one.

Why We Considered Redis

Redis seemed like the obvious choice at first:

  • Fast (sub-millisecond reads/writes)
  • Designed for caching and session state
  • Solid LangGraph integration with langgraph-checkpoint-redis
  • Built-in TTL support for automatic cleanup

from langgraph.checkpoint.redis import RedisSaver
import redis

checkpointer = RedisSaver(
    redis_client=redis.Redis(host="localhost", port=6379),
    ttl=3600  # 1 hour TTL
)

But we ran into questions. Redis stores checkpoints as key-value pairs. That's fine for retrieving a specific checkpoint, but what about:

  • Querying workflows by status
  • Finding all workflows from the last hour
  • Debugging patterns across failed workflows

To answer those, you'd need to build and maintain your own secondary indexes alongside the checkpoint data. That felt like added complexity.
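
Supporting those access patterns on top of Redis would mean hand-rolling something like the sketch below (redis-py assumed; the key names are made up for illustration):

import time

import redis

r = redis.Redis(host="localhost", port=6379)

def index_checkpoint(thread_id: str, status: str, created_at: float) -> None:
    # Alongside every checkpoint write, update our own lookup structures
    r.sadd(f"workflows:status:{status}", thread_id)       # query by status
    r.zadd("workflows:by_time", {thread_id: created_at})  # query by time range

# "All workflows from the last hour" becomes a sorted-set range query
recent = r.zrangebyscore("workflows:by_time", time.time() - 3600, "+inf")
failed = r.smembers("workflows:status:failed")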

Why MongoDB Made More Sense

The honest answer is we already had MongoDB running. We use it for our main application data, so adding checkpoint storage meant no new infrastructure to manage.

But beyond the "it's already there" reason, MongoDB turned out to be a better fit for how we actually use checkpoints.

Query Flexibility

With MongoDB, you can query checkpoints like any other document:

// Find all workflows from the last hour
db.checkpoints.find({
  created_at: { $gte: new Date(Date.now() - 3600000) }
})
 
// Find workflows by status
db.checkpoints.find({
  "checkpoint.channel_values.workflow_state.status": "failed"
})
 
// List checkpoints for a specific workflow
db.checkpoints.find({
  thread_id: "workflow-uuid"
}).sort({ created_at: 1 })

This makes debugging so much easier. When something goes wrong, you can query for patterns. Find all failed workflows, see what they have in common, fix the issue, and resume them from their last checkpoint.
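
Resuming is just another invocation against the same thread: pass the failed workflow's thread_id with no new input, and LangGraph continues from the stored checkpoint. A minimal sketch, assuming graph is the workflow compiled with a checkpointer as shown later:

# Resume a failed workflow from its last checkpoint.
# graph is the compiled workflow from the Implementation section below.
config = {"configurable": {"thread_id": "workflow-uuid"}}
result = graph.invoke(None, config)  # None as input means "continue from the saved state"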

Automatic Cleanup with TTL Indexes

MongoDB's TTL indexes handle cleanup at the database level:

db.checkpoints.createIndex(
  { created_at: 1 },
  { expireAfterSeconds: 3600 }
)

MongoDB scans for expired documents every 60 seconds and deletes them automatically. No manual cleanup jobs, no orphaned data.

The LangGraph MongoDB integration sets this up for you automatically when you configure a TTL, so there's nothing extra to manage.

Schema Evolution

Checkpoints are documents, so adding fields is natural:

  • Started with basic workflow state
  • Added metadata for analytics later
  • Old checkpoints don't break, new ones include extra fields
  • Queries handle both gracefully
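
For example, a query on a metadata field we added later simply doesn't match the older checkpoints that never had it. A pymongo sketch (the field name is illustrative):

from pymongo import MongoClient
import os

checkpoints = MongoClient(os.getenv("MONGODB_URI"))["ai_workflows"]["checkpoints"]

# Newer checkpoints carry the added field; older ones just don't match.
# Nothing breaks on either side of the schema change.
for doc in checkpoints.find({"metadata.source": "api"}):
    print(doc["thread_id"])

# Pre-change checkpoints can still be found explicitly if needed
legacy_count = checkpoints.count_documents({"metadata.source": {"$exists": False}})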

Write Performance is Fine

We tested MongoDB checkpoint writes and they came in under 20ms. Our workflows aren't high-frequency—just a few checkpoints per second during peak times. For our use case, that's fast enough.

If you're doing thousands of checkpoints per second, Redis would probably make more sense. We're not.

The Implementation

Setting up MongoDB checkpointing is straightforward:

from langgraph.checkpoint.mongodb import MongoDBSaver
from pymongo import MongoClient
import os

# Initialize MongoDB client (MongoDBSaver expects a synchronous pymongo client;
# the async variant, AsyncMongoDBSaver, pairs with motor's AsyncIOMotorClient)
mongo_client = MongoClient(
    os.getenv("MONGODB_URI"),
    maxPoolSize=50
)

# Create checkpointer with TTL
checkpointer = MongoDBSaver(
    client=mongo_client,
    db_name="ai_workflows",
    ttl=3600  # 1 hour
)

# Use it in your graph
graph = workflow.compile(checkpointer=checkpointer)

MongoDBSaver creates all necessary indexes automatically—including the TTL index for expiration. Connection pooling keeps things fast under load.

What We Learned

A few things became clear during testing:

  • Connection pooling - Without it, writes spiked to 200ms+. With maxPoolSize=50, we're at 10-20ms.
  • Keep state size small - Store references to S3, not large files (see the sketch after this list). Keeps checkpoints fast.
  • TTL timing isn't exact - MongoDB cleanup runs every 60 seconds, so documents might stick around briefly after expiration.
  • Querying is surprisingly useful - We thought we'd only retrieve specific workflows, but querying by time range and status has been invaluable for debugging.
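
In practice, "keep state size small" means uploading the heavy payload to object storage and checkpointing only a pointer to it. A sketch with boto3 (the bucket name and helper are made up):

import boto3

s3 = boto3.client("s3")
BUCKET = "workflow-artifacts"  # illustrative bucket name

def stash_document(workflow_id: str, doc_bytes: bytes) -> dict:
    # Upload the heavy payload to object storage...
    key = f"workflows/{workflow_id}/input.pdf"
    s3.put_object(Bucket=BUCKET, Key=key, Body=doc_bytes)
    # ...and keep only this small reference in the LangGraph state,
    # so each checkpoint stays a few hundred bytes instead of megabytes
    return {"document_ref": {"bucket": BUCKET, "key": key}}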

When Redis Would Be Better

If your workflows are extremely high-throughput (thousands per second), Redis's sub-millisecond latency would matter more. Or if you're already running Redis and don't have MongoDB, adding a new database just for checkpoints might not make sense.

But for most AI workflow use cases—document processing, agent orchestration, business automation—MongoDB's query flexibility and built-in persistence tools make it a solid choice.

We're still prototyping this setup. There's probably optimization work ahead as we move to production and scale further. But in testing so far, MongoDB checkpointing has handled everything we've thrown at it.

The combination of familiar infrastructure, automatic cleanup, and query flexibility means less time managing checkpoints and more time building features. That's the kind of trade-off I'll take.
