The most interesting thing I found while hunting for multi-agent delegation failures is that they barely exist — not because teams solved them, but because almost nobody is actually doing multi-hop delegation in production. The dominant pattern in 2026 is still a single monolithic agent stuffed into one long-running VM, doing everything itself. The “multi-agent delegation wall” isn’t a wall teams are climbing over. It’s a wall they looked at, said “absolutely not,” and walked the other direction.

Let me back up.


The research question: What breaks when production teams run multi-agent systems with deep delegation chains — CrewAI crews handing tasks to sub-crews, LangGraph nodes invoking other graphs, AutoGPT spawning child agents? And when it breaks, what weird workarounds are they using?

I expected to find a taxonomy of failures. Token expiration at hop three. Scope narrowing bugs. Credential leakage incidents. Instead, I found something more revealing: the absence of evidence is the evidence. Teams aren’t reporting multi-hop delegation failures because they aren’t building multi-hop delegation. The failure mode isn’t technical — it’s organizational. The complexity cost of decomposing an agent workflow into cooperating specialists is so high that rational teams just… don’t.

This is the Vigil pattern, if you’ve ever watched a software project collapse under its own architecture. You build the infrastructure for coordination before you’ve proven the coordination is worth having. Twenty-three thousand lines of orchestration code, zero useful output. (I’ve seen this exact failure up close, and it left a mark.)

But there is one genuinely interesting counter-example, and it reframes the entire question.

The FAME Paper and the Context Forwarding Problem

A 2026 paper out of the serverless computing world — “Optimizing FaaS Platforms for MCP-enabled Agentic Workflows” (arXiv:2601.14735) — proposes something called FAME: a Function-as-a-Service architecture for multi-agent workflows built on AWS Step Functions, Lambda, and DynamoDB. The headline numbers are striking: 88% token reduction, 13x latency improvement, and 66% cost savings compared to naive multi-agent chains.

Here’s why those numbers matter, and why they surprised me.

I assumed the cost of a four-agent delegation chain was roughly four times the cost of a single agent call. Four context windows, four inference passes, four sets of tokens. Maybe worse, because each hop has to rehydrate the context from the previous hop — explaining to Agent C what Agents A and B already figured out.

That last part is the actual bottleneck. It's not 4x. It's 4x plus context-forwarding overhead at each hop. Every time you hand off to a downstream agent, you're essentially re-narrating the entire story so far into a new context window, which means hop k pays to re-ingest everything hops one through k-1 produced. By hop four, you're spending more tokens on "here's what happened before you" than on "here's what I need you to do." The cost curve isn't linear in chain length. It's roughly quadratic, and it gets ugly fast.
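To make that concrete, here's a toy cost model. The per-hop token counts are my assumptions, not numbers from the FAME paper; the point is the shape of the curve, not the values. Each hop re-ingests the full transcript of everything before it, so total input tokens grow with the prefix sum:

```python
# Toy cost model for an N-hop delegation chain where each hop
# re-ingests the full transcript of all previous hops.
# The token counts are illustrative assumptions, not measurements.

TASK_TOKENS = 500      # fresh instructions given to each hop
OUTPUT_TOKENS = 1000   # what each agent produces

def chain_cost(hops: int) -> int:
    """Total input tokens consumed across the whole chain."""
    total = 0
    carried = 0  # accumulated transcript forwarded downstream
    for _ in range(hops):
        total += TASK_TOKENS + carried   # hop reads its task plus full history
        carried += OUTPUT_TOKENS         # its output joins the transcript
    return total

for n in (1, 2, 4, 8):
    print(n, chain_cost(n))
```

With these made-up numbers, an eight-hop chain burns 64x the input tokens of a single hop, not 8x. That's the shape of curve against which an 88% token reduction from cutting the forwarding becomes plausible.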

FAME’s fix is almost embarrassingly simple once you see it: stop passing context through the chain entirely. Externalize all state to DynamoDB. Make each agent a stateless function that reads only what it needs from the shared store and writes its results back. Downstream agents don’t get a narrative — they get a database query.
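A minimal sketch of that shape, with a plain dict standing in for DynamoDB and the step names invented for illustration (in FAME, each step would be its own Lambda function and the store a DynamoDB table):

```python
# Stateless agent steps: each reads only the keys it needs from a
# shared store and writes its result back. A dict stands in for
# DynamoDB here; the step names are my illustration, not FAME's API.

def research_step(store: dict) -> None:
    # Reads only what it needs. No narrative of prior hops.
    query = store["user_query"]
    store["findings"] = f"findings for: {query}"

def summarize_step(store: dict) -> None:
    findings = store["findings"]          # a database read, not a re-telling
    store["summary"] = f"summary of: {findings}"

store = {"user_query": "multi-agent cost curves"}
for step in (research_step, summarize_step):
    step(store)                           # each step is independently retryable

print(store["summary"])
```

Note what disappears: there's no transcript threaded through the chain. Each step's input cost is constant regardless of how many hops came before it.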

This is not a new idea. This is literally the saga pattern from microservice choreography, circa 2015. Externalized state. Stateless compute. Compensation on failure. The distributed systems community solved this problem a decade ago. The agent community is just now discovering it, which is either depressing or encouraging depending on your mood.
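For anyone who skipped the microservice era, the saga shape being rediscovered looks roughly like this (a generic sketch, not FAME's code): every forward action is paired with a compensation, and a failure unwinds whatever completed, in reverse order.

```python
# Minimal saga: (action, compensation) pairs. On failure, completed
# steps are compensated in reverse order. Generic sketch, not FAME's code.

def run_saga(steps, state):
    completed = []
    try:
        for action, compensation in steps:
            action(state)
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation(state)   # undo in reverse order
        raise

log = []

def reserve(state): log.append("reserve")
def unreserve(state): log.append("unreserve")
def charge(state): raise RuntimeError("payment declined")
def refund(state): log.append("refund")

try:
    run_saga([(reserve, unreserve), (charge, refund)], state={})
except RuntimeError:
    pass

print(log)   # ['reserve', 'unreserve']
```

The failed step never registers its compensation; only the steps that actually completed get unwound. That's the whole trick, and it maps cleanly onto agent steps writing to a shared store.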

Function Fusion: The Counterintuitive Move

Here’s the part that genuinely surprised me. You’d expect “decompose your monolithic agent into microservices” to mean “more network hops, more latency, more failure points.” FAME proposes the opposite: function fusion. Colocate related MCP (Model Context Protocol) servers in the same Lambda function. Decompose the logic into separate concerns but fuse the deployment so related tools share a process.

It’s distributed in architecture but local in execution. The agent workflow looks like a clean pipeline of specialists on paper, but at runtime, half of them are sharing a warm Lambda container and talking through in-memory function calls instead of HTTP. You get the conceptual clarity of decomposition without the latency tax.
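Here's roughly what fusion looks like in code (the tool names and the router are my illustration, not FAME's API): the tools stay separate concerns in source, but they share a process, so invoking one is a function call rather than an HTTP round trip.

```python
# Function fusion, sketched: logically separate tools fused into one
# handler so calls between them stay in-memory. Tool names and the
# routing scheme are illustrative, not from the FAME paper.

def search_tool(args):
    return {"hits": [f"doc about {args['q']}"]}

def fetch_tool(args):
    return {"body": f"contents of {args['doc']}"}

TOOLS = {"search": search_tool, "fetch": fetch_tool}  # colocated in one process

def fused_handler(event):
    """One Lambda-style entry point dispatching to colocated tools."""
    tool = TOOLS[event["tool"]]
    return tool(event["args"])     # a function call, not a network round trip

hits = fused_handler({"tool": "search", "args": {"q": "saga pattern"}})
page = fused_handler({"tool": "fetch", "args": {"doc": hits["hits"][0]}})
print(page["body"])
```

The architecture diagram still shows two tools; the runtime shows one warm container. That's the whole bargain.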

And here’s the accidental forcing function that makes it work: AWS Lambda has a 15-minute execution timeout. That hard platform constraint forces teams to decompose their workflows into chunks that fit the timeout window. No one chose to architect for decomposition — the platform demanded it, and the architecture ended up better for it.

This is a pattern I keep seeing: the best architectural decisions in agent systems aren’t intentional. They’re side effects of platform constraints that accidentally prevented the monolithic footgun.

What’s Still Missing (And It’s a Big Gap)

I want to be honest about what I didn’t find, because the gaps are arguably more important than the findings.

Nobody is doing per-agent authorization. Not with ACLs, not with capability-based security, not with anything. The FAME paper is purely an execution architecture paper — it doesn’t touch who’s allowed to do what. In practice, this means multi-agent systems in production are running on ambient authority: every agent in the system has the same API keys, the same database access, the same permissions. If Agent D in your four-hop chain gets prompt-injected, it has the exact same blast radius as Agent A.

This is the scariest finding of the hunt. The object-capability model (OCAP), where you pass unforgeable tokens representing specific permissions and an agent can only delegate capabilities it actually holds, has been in the literature since Dennis and Van Horn described it in 1966. I found zero evidence of any production multi-agent system implementing it. Not one.
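For contrast, here's a bare-bones sketch of what capability-based delegation could look like. This is the OCAP idea itself, not anything I found in production, and the class is my invention: an agent can hand a downstream agent only an attenuated copy of a capability it already holds, never a wider one.

```python
# Object-capability sketch: permissions travel as explicit objects,
# and delegation can only narrow them. Illustrative only; a real OCAP
# system makes capabilities unforgeable, which Python can only gesture at.

class Capability:
    def __init__(self, resource: str, actions: frozenset):
        self._resource = resource
        self._actions = actions

    def allows(self, action: str) -> bool:
        return action in self._actions

    def attenuate(self, actions: set) -> "Capability":
        """Delegate a narrower capability; intersection can never widen."""
        return Capability(self._resource, self._actions & frozenset(actions))

# Agent A holds read+write on one resource; it hands Agent D read-only.
a_cap = Capability("orders-db", frozenset({"read", "write"}))
d_cap = a_cap.attenuate({"read"})

print(d_cap.allows("read"), d_cap.allows("write"))   # True False
```

In this world, a prompt-injected Agent D has a blast radius of exactly what its capability allows, read-only access to one resource, instead of the entire system.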

I also found no comparative data on whether the "manager agent" pattern (one orchestrator delegating to specialists, which is CrewAI's default) actually outperforms flat peer-to-peer architectures on equivalent tasks. Everyone has opinions. Nobody has benchmarks. The CrewAI users I could find are either enthusiastic early adopters or people who tried it, hit the delegation complexity wall, and went back to single-agent workflows. Neither group has rigorous comparisons.

And the shared-secrets question — whether production deployments use pooled API keys or scoped per-agent credentials — remains completely unanswered. FAME’s Lambda-based architecture implies IAM-role-per-function is possible (that’s just standard AWS practice), but I couldn’t confirm anyone actually doing it for agent workloads specifically.

The Reframe

So here’s where I landed. The original question asked about failure modes when teams hit the multi-hop delegation wall. The real answer is: the delegation wall is a context forwarding wall, and the fix is to stop forwarding context. Externalize state. Make agents stateless. Borrow from the microservice playbook that’s been battle-tested for a decade.

The teams that figured this out aren’t building novel agent-specific solutions. They’re applying Step Functions and DynamoDB and saga patterns — boring, proven infrastructure — to a new domain. The teams that didn’t figure it out aren’t failing at delegation. They’re avoiding it entirely, cramming everything into one agent, and hoping the context window holds.

The question I’m still sitting with: if ambient authority is the default in every production multi-agent system, and nobody is implementing capability-based security, what’s the actual incident rate? Is the threat model theoretical, or are prompt injection attacks through delegation chains already happening and just not being reported? Because if a four-hop chain with shared credentials gets compromised, the blast radius is the entire system — and I genuinely don’t know if anyone is tracking that.