Taming AI Agents By Making Them Queue

Integrating AI agents via direct API calls works fine for prototypes and one-off automations. When you want agents to be observable, replaceable components of a production system — where you can swap implementations without cascading changes, compare outputs across versions, and build replay into the operational model — a direct API call stops being sufficient.

The queue is an architectural contract that makes agents first-class operational components. Not because the direct approach is broken, but because it wasn’t designed for the requirements that emerge once agents are core to how your system works.

The Problem Without a Queue

When agents are integrated via direct API calls, several tensions emerge as usage scales:

Cascading changes on agent updates. A new agent version ships. It has a slightly different response schema. Every caller that destructured that schema now breaks, and the blast radius is invisible until runtime.

No visibility into what happened. Your application sees “user profile identified → quiz curated.” You have no way to understand what the agent did between those states. LangGraph-style trace-level observability exists inside the agent framework, but it’s not visible to the system orchestrating it.

Eval is ad hoc. You want to know if the new agent is better than the old one. You run it manually, maybe against a few examples, and make a judgment call based on feel. That’s not eval — that’s intuition.

Latency coupling. If the agent is called synchronously in the request path, the user’s request waits for the agent to finish. Even if the agent call is moved to a background job, the system still needs to know when the agent is done to give the user feedback — creating a polling or callback dependency that couples the response latency to agent processing time. The queue decouples not just the code, but the latency: the system queues the event and responds immediately; the agent processes asynchronously.
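To make the latency point concrete, here is a minimal sketch of a request path that enqueues and returns while the agent works in the background. It assumes an in-process `queue.Queue` as a stand-in for a real broker topic, and `handle_request`, `agent_worker`, and `results` are hypothetical names used only for illustration.

```python
import queue
import threading
import uuid

# Hypothetical in-process stand-in for a broker topic (assumption, not the real broker).
agent_topic = queue.Queue()
results = {}  # event_id -> agent output, filled in asynchronously


def handle_request(user_id: str) -> dict:
    """Request path: enqueue the event and return without waiting for the agent."""
    event_id = str(uuid.uuid4())
    agent_topic.put({"id": event_id, "user_id": user_id})
    return {"status": "accepted", "event_id": event_id}


def agent_worker() -> None:
    """Agent path: consumes events on its own schedule until it sees None."""
    while True:
        event = agent_topic.get()
        try:
            if event is None:  # sentinel used to stop the worker
                return
            # Placeholder for the real agent call, which may take seconds.
            results[event["id"]] = f"quiz for {event['user_id']}"
        finally:
            agent_topic.task_done()


if __name__ == "__main__":
    worker = threading.Thread(target=agent_worker, daemon=True)
    worker.start()
    print(handle_request("u1"))  # returns immediately with status "accepted"
    agent_topic.join()           # only the demo waits; the request path never does
    agent_topic.put(None)
```

The caller gets an `event_id` back straight away. How the user later learns the result (a push, a projection read) is a separate design decision this sketch leaves open.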

The Architecture

The alternative is deceptively simple: treat the queue as the only integration point between your system and the agent.

An agent receives input as an event from a topic it subscribes to. It produces output as an event on an output topic. If it needs to mutate system state, it calls a mutation API — nothing else. The agent knows nothing about internal system IDs, routing logic, or the structure of other agents’ outputs.

[Event Broker] → [Agent Topic] → [Agent Service]
                                    ↓
                              [Mutation API]
                                    ↓
                              [Output Topic] → [Downstream Consumers]
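The diagram above could be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `Event`, `MutationAPI`, `InMemoryMutations`, and `Agent` are hypothetical names, and the key format `"<agent>:<event id>"` is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Event:
    id: str
    type: str
    data: dict


class MutationAPI(Protocol):
    """The only way an agent is allowed to change system state."""

    def apply(self, key: str, payload: dict) -> None: ...


class InMemoryMutations:
    """Test double for the mutation API; records writes in a dict."""

    def __init__(self) -> None:
        self.state: dict = {}

    def apply(self, key: str, payload: dict) -> None:
        self.state[key] = payload


class Agent:
    """Consumes an input event, calls the mutation API, emits an output event."""

    def __init__(
        self,
        name: str,
        compute: Callable[[dict], dict],
        mutations: MutationAPI,
        publish: Callable[[Event], None],
    ) -> None:
        self.name = name
        self.compute = compute
        self.mutations = mutations
        self.publish = publish

    def on_event(self, event: Event) -> Event:
        result = self.compute(event.data)  # the agent's actual work
        # Hypothetical key format: scoped by agent name and input event id.
        self.mutations.apply(f"{self.name}:{event.id}", result)
        output = Event(
            id=f"{event.id}:{self.name}",
            type=f"{self.name}.completed",
            data=result,
        )
        self.publish(output)  # lands on the output topic
        return output
```

The agent only ever touches three things: the event it was handed, the mutation API, and the publish callback. Swapping the implementation means swapping `compute`.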

The event schema is the contract. CloudEvents gives you a well-defined, industry-standard envelope: the required attributes id, source, specversion, and type, optional ones such as datacontenttype, and the data payload itself. Agents and downstream systems therefore speak a common language. In my system, all events are being migrated to this format. If you eventually want to open the system to third-party agents, CloudEvents compatibility signals interoperability from day one.
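As a rough illustration of the envelope, the sketch below builds and checks a CloudEvents 1.0 style event as a plain dictionary. The attribute names come from the CloudEvents specification. The event type, source path, and helper names (`make_cloudevent`, `validate_envelope`, `serialize`) are assumptions made up for this example, and the check covers only the required attributes, not full spec validation.

```python
import json
import uuid
from datetime import datetime, timezone

# Attributes the CloudEvents 1.0 spec marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_cloudevent(event_type: str, source: str, data: dict) -> dict:
    """Wrap a payload in a CloudEvents 1.0 style envelope (JSON format)."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


def validate_envelope(event: dict) -> None:
    """Minimal check: required attributes are present and non-empty."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    if event["specversion"] != "1.0":
        raise ValueError(f"unsupported specversion: {event['specversion']}")


def serialize(event: dict) -> str:
    validate_envelope(event)
    return json.dumps(event)


if __name__ == "__main__":
    # Hypothetical type and source names.
    event = make_cloudevent(
        "com.example.quiz.curated", "/agents/quiz-curator", {"quiz_id": "q1"}
    )
    print(serialize(event))
```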

Fan-Out: The Canonical Pattern

The first pattern is fan-out: a single input event is processed independently by multiple agents, each updating only the state it owns. That state is a partitioned slice of the shared system state, scoped per agent via key-based isolation in the mutation API.

Consider knowledge tracing. Multiple subscribers process the same input content — each running their own algorithms, updating their own derived state. Subscriber A runs a spaced-repetition model. Subscriber B runs a concept-dependency graph. Neither knows about the other. Neither affects the other’s state.

This is powerful because agents are genuinely isolated. You can add a new subscriber without changing anything else. You can swap an agent’s implementation without touching the input pipeline. You can eval each subscriber independently against empirical downstream data.
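A toy version of the fan-out pattern might look like the sketch below. Everything here is an assumption for illustration: `Topic`, `StateStore`, and `ScopedView` are hypothetical names, the "spaced repetition" rule (double the interval on a correct answer, reset on a wrong one) is a deliberately naive placeholder, and the "concept tracker" is a stand-in for a real concept-dependency graph.

```python
from typing import Callable


class ScopedView:
    """A per-agent window onto shared state; keys are namespaced by agent."""

    def __init__(self, data: dict, agent: str) -> None:
        self._data = data
        self._agent = agent

    def get(self, key: str, default=None):
        return self._data.get((self._agent, key), default)

    def put(self, key: str, value) -> None:
        self._data[(self._agent, key)] = value


class StateStore:
    """Stand-in for the mutation API's key-based isolation."""

    def __init__(self) -> None:
        self._data: dict = {}

    def scoped(self, agent: str) -> ScopedView:
        return ScopedView(self._data, agent)


class Topic:
    """Delivers every published event to every subscriber independently."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: dict) -> None:
        for handler in self._subscribers:
            handler(dict(event))  # copy, so one subscriber cannot alter another's input


def make_spaced_repetition(view: ScopedView) -> Callable[[dict], None]:
    def handle(event: dict) -> None:
        interval = view.get(event["concept"], 1)
        # Naive placeholder rule, not a real spaced-repetition model.
        view.put(event["concept"], interval * 2 if event["correct"] else 1)
    return handle


def make_concept_tracker(view: ScopedView) -> Callable[[dict], None]:
    def handle(event: dict) -> None:
        seen = view.get(event["learner"], [])
        view.put(event["learner"], seen + [event["concept"]])
    return handle
```

Adding a third subscriber is one more `subscribe` call with its own scoped view; neither existing handler changes.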

The Mutation API Tradeoff

When an agent produces an output that requires a system mutation, it calls an API. In my current implementation at idhesive, these calls are synchronous — the agent waits for the mutation to complete before acknowledging the event.

This is an intentional tradeoff. Synchronous mutation calls keep behavior predictable and make error handling straightforward: if the mutation fails, the agent can retry or dead-letter the event. Because agents operate outside the trust boundary, every event is validated against the schema before the mutation API acts on it — no trust is assumed. The cost is that the agent is aware of the mutation API and coupled to it. An alternative is transparent event-driven mutation — output events trigger mutation handlers automatically — but that adds routing complexity and makes the system’s behavior harder to reason about from the agent’s perspective.

For now, explicit is better than clever. The agent knows it called an API; the observability story is cleaner.
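One way to sketch the synchronous call with retry and dead-lettering is shown below. The schema (a `learner_id` string and a `score` between 0 and 1), the two exception types, and the function names are all hypothetical. A real agent would likely add backoff between attempts, which is omitted here to keep the example short.

```python
class ValidationError(Exception):
    """Payload does not match the schema; retrying cannot help."""


class TransientError(Exception):
    """Temporary failure such as a timeout; safe to retry."""


def validate(payload: dict) -> None:
    """Hypothetical schema check performed on the mutation API side."""
    learner_id = payload.get("learner_id")
    if not isinstance(learner_id, str) or not learner_id:
        raise ValidationError("learner_id must be a non-empty string")
    score = payload.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValidationError("score must be a number")
    if not 0 <= score <= 1:
        raise ValidationError("score must be between 0 and 1")


def make_mutation_api(state: dict):
    """Build a mutation endpoint that validates before it writes."""
    def apply(payload: dict) -> None:
        validate(payload)  # no trust assumed: every request is checked
        state[payload["learner_id"]] = payload["score"]
    return apply


def handle_event(event: dict, call_mutation, dead_letters: list,
                 max_attempts: int = 3) -> bool:
    """Return True to acknowledge the event, False if it was dead-lettered."""
    last_error = None
    for _ in range(max_attempts):
        try:
            call_mutation(event["data"])  # blocks until the mutation completes
            return True                   # acknowledge only after success
        except ValidationError as exc:
            last_error = exc
            break                         # not retryable
        except TransientError as exc:
            last_error = exc              # retry on the next loop iteration
    dead_letters.append({"event": event, "error": str(last_error)})
    return False
```

Because the call is synchronous, the acknowledgement and the mutation outcome can never disagree: the event is acked only after the write is known to have succeeded.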

Retrospective Replay

Once agents are decoupled from your system via events, retrospective replay becomes tractable.

When you update an agent — a bug fix, a new model, a revised prompt — you replay historical input events through the new agent version and write outputs to a separate stream. Both streams coexist. Downstream consumers continue reading from their existing stream without disruption. You can compare old and new outputs side by side.

The replay source is the Postgres outbox table (event_archive), not the broker. Events are written to the outbox first, then published to Redpanda, so the outbox is the durable record. This makes replay a first-class operation: you can re-emit any historical event straight from the outbox without depending on the broker to still hold it. It also means the broker can be treated as ephemeral for replay purposes: if a consumer falls behind or needs to rebuild its projection, it reads from the outbox rather than from the broker.
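The outbox flow can be sketched as below. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for Postgres, and a plain callable as a stand-in for the Redpanda producer. The table name `event_archive` comes from the text above; the column names and the functions `record_event`, `relay_unpublished`, and `replay` are assumptions.

```python
import json
import sqlite3


def init_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE event_archive (
               seq        INTEGER PRIMARY KEY AUTOINCREMENT,
               event_id   TEXT UNIQUE NOT NULL,
               event_type TEXT NOT NULL,
               payload    TEXT NOT NULL,
               published  INTEGER NOT NULL DEFAULT 0
           )"""
    )


def record_event(conn: sqlite3.Connection, event: dict) -> None:
    """Step 1: the outbox write is the durable record."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO event_archive (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event["id"], event["type"], json.dumps(event["data"])),
        )


def relay_unpublished(conn: sqlite3.Connection, publish) -> int:
    """Step 2: push not-yet-published rows to the broker, in order."""
    rows = conn.execute(
        "SELECT seq, event_id, event_type, payload FROM event_archive "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, event_type, payload in rows:
        publish({"id": event_id, "type": event_type, "data": json.loads(payload)})
        with conn:
            conn.execute("UPDATE event_archive SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


def replay(conn: sqlite3.Connection, handler, event_type: str) -> int:
    """Re-emit historical events of one type straight from the outbox."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM event_archive "
        "WHERE event_type = ? ORDER BY seq",
        (event_type,),
    ).fetchall()
    for event_id, etype, payload in rows:
        handler({"id": event_id, "type": etype, "data": json.loads(payload)})
    return len(rows)
```

Note that `replay` ignores the `published` flag entirely: it reads the archive, not the broker, which is what lets a new agent version be fed history on demand.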

This only works because agents are stateless and idempotent: the same input event produces the same output regardless of when or how many times it’s processed. The actual safety mechanism for replay, though, is the idempotency key on the mutation API call. The agent derives the key from the event and the attempt version. A redelivery of the same event within the same attempt carries the same key, so the mutation API treats it as a duplicate. A deliberate re-run, such as rescoring a failed attempt, carries a new attempt version and therefore a different key, so the mutation API handles it as a distinct operation, and downstream consumers deduplicate by key.

If an agent carries state between invocations, for example a self-learning agent that updates its own weights based on input, replay is no longer a clean comparison. That’s a genuinely hard problem worth addressing separately.
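The idempotency-key behaviour described above can be illustrated with a small sketch. The exact key derivation is not specified in this post, so hashing the event id together with an attempt version is an assumption, as are the names `idempotency_key` and `MutationAPI`.

```python
import hashlib


def idempotency_key(event_id: str, attempt_version: str) -> str:
    """Derive a stable key from the event and the attempt version (assumed scheme)."""
    raw = f"{event_id}|{attempt_version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class MutationAPI:
    """In-memory stand-in: applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.applied: dict[str, dict] = {}

    def apply(self, key: str, payload: dict) -> bool:
        """Return True if the mutation was applied, False if it was a duplicate."""
        if key in self.applied:
            return False
        self.applied[key] = payload
        return True


if __name__ == "__main__":
    api = MutationAPI()
    first = idempotency_key("e1", "attempt-1")
    print(api.apply(first, {"score": 0.4}))   # True: first delivery
    print(api.apply(first, {"score": 0.4}))   # False: redelivery, same key
    rerun = idempotency_key("e1", "attempt-2")
    print(api.apply(rerun, {"score": 0.9}))   # True: new attempt, new key
```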

Eval Without Golden Datasets

You could build a golden dataset and compare new outputs against expected results. But that requires maintaining labeled examples, and for many agentic tasks, the “right” answer is context-dependent.

A more empirical approach: use downstream behavioral data as the signal. If the agent’s output feeds into a downstream system — a quiz platform, a recommendation engine, a classification pipeline — the health of that downstream system tells you whether the new agent is better or worse. Error rates, accuracy metrics, latency distributions. New agent ships → observe downstream metrics → validate.

This is not a perfect eval. But it’s grounded in what actually matters: is the system working better than before?
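A minimal sketch of that comparison, assuming downstream outcomes are tagged with the agent version that produced the input: the `Outcome` record, its fields, and the two helper functions are hypothetical, and the summary deliberately stops at raw deltas rather than claiming statistical significance.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One downstream observation attributed to an agent version (assumed shape)."""
    agent_version: str
    ok: bool
    latency_ms: float


def summarize(outcomes: list[Outcome], version: str) -> dict:
    rows = [o for o in outcomes if o.agent_version == version]
    if not rows:
        raise ValueError(f"no outcomes recorded for version {version!r}")
    errors = sum(1 for o in rows if not o.ok)
    return {
        "n": len(rows),
        "error_rate": errors / len(rows),
        "median_latency_ms": statistics.median(o.latency_ms for o in rows),
    }


def compare(outcomes: list[Outcome], baseline: str, candidate: str) -> dict:
    """Negative deltas mean the candidate did better on that metric."""
    base = summarize(outcomes, baseline)
    cand = summarize(outcomes, candidate)
    return {
        "baseline": base,
        "candidate": cand,
        "error_rate_delta": cand["error_rate"] - base["error_rate"],
        "median_latency_delta_ms": cand["median_latency_ms"] - base["median_latency_ms"],
    }
```

With small samples the deltas are noisy, so in practice you would want a minimum sample size or a proper significance test before acting on them.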

The Observability Gap

What I have today in production: high-level state transitions. I can see that a user profile was identified, and that a quiz was curated from it. I cannot see what happened inside the agent between those steps.

What I want: trace-level visibility across the entire pipeline — not just “this state transitioned to that state” but “this event entered the agent, this mutation API call was made, this output event was produced.” The queue-based architecture gives you that pipeline-level trace: events are the observable spine from input through agent to output. What the system doesn’t see is the agent’s internal reasoning — the agent is a black box to the pipeline. That’s intentional and appropriate for now. The system sees what goes in and what comes out; LangGraph-style reasoning traces stay inside the agent where they belong.

Closing this gap is where the real leverage is. Once you can observe the pipeline end-to-end — events in, mutations called, events out — agent swaps stop being opaque and start being routine. The system doesn’t need to see inside the agent; it needs to see the contract the agent honours.
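As a sketch of what that pipeline-level trace could look like, the wrapper below records the three observable steps (event in, mutation called, event out) keyed by the input event id, while the agent's own computation stays opaque. `PipelineTrace`, `run_traced`, the stage names, and the `causation_id` field are assumptions for illustration; a production system would more likely emit these through a tracing library.

```python
import time
from typing import Callable


class PipelineTrace:
    """Collects ordered stage records per trace id (here: the input event id)."""

    def __init__(self) -> None:
        self._records: dict[str, list[dict]] = {}

    def record(self, trace_id: str, stage: str, **attrs) -> None:
        entry = {"stage": stage, "at": time.time(), **attrs}
        self._records.setdefault(trace_id, []).append(entry)

    def stages(self, trace_id: str) -> list[str]:
        return [r["stage"] for r in self._records.get(trace_id, [])]


def run_traced(
    event: dict,
    compute: Callable[[dict], dict],
    mutate: Callable[[dict], None],
    emit: Callable[[dict], None],
    trace: PipelineTrace,
) -> dict:
    """Run one agent invocation and record the contract-level steps around it."""
    trace_id = event["id"]
    trace.record(trace_id, "event.received", type=event["type"])

    result = compute(event["data"])  # black box: no internal reasoning is traced

    mutate(result)
    trace.record(trace_id, "mutation.called")

    output = {
        "id": f"{event['id']}:out",
        "type": f"{event['type']}.processed",
        "data": result,
        "causation_id": event["id"],  # links the output back to its input
    }
    emit(output)
    trace.record(trace_id, "event.emitted", output_id=output["id"])
    return output
```

If `mutate` raises, the trace simply stops at `event.received`, which is itself a useful signal when comparing agent versions.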

What You Could Have

The difference between where most teams are and where this architecture gets you:

Concern	Without Queue	With Queue
Agent swap	Coordinated code changes across callers	Swap the topic subscription
Visibility	State-level transitions	End-to-end traces
Eval	Ad hoc, intuition-based	Behavioral diff with empirical data
Replay	Not feasible	Retrospective replay for failed or missing records; a new scoped key per agent version, with old and new outputs compared side by side
Third-party agents	Full API coupling	Event schema contract

The queue isn’t overhead. It’s the thing that makes everything else tractable.

What’s Next

Fan-out is the first pattern. Pipeline and voting/consensus patterns follow naturally — agents chained via queues, or multiple agents independently computing on the same input with results compared algorithmically. A future post will cover both.

The self-learning agent question remains open: how do you do retrospective replay when the agent’s state changes between invocations? That requires versioning the agent’s learned state alongside the event replay, which is a more complex system. A problem for a future post.

What the queue gives you today is operational confidence. Your agents become genuinely replaceable components — not because you’ve abstracted them behind a clever interface, but because you’ve made the contract explicit and observable. That’s the difference between an agent integration that scales and one that just works for now.