Technical · 18 min read

The 12-hour bug. Debugging agent identity in a multi-agent streaming system.

How a 'simple' UI feature led us through five layers of the stack, a library patch, and a fundamental misunderstanding of how streaming actually works.

Algorise·The team

There's a particular flavour of debugging session that every engineer knows but rarely talks about. It starts with a feature request that seems almost trivially simple, and ends twelve hours later with you questioning your career choices while staring at a streaming response that still isn't doing what you want.

This is the story of one of those sessions.

At Algorise, we built Algora — an AI digital employee that handles everything from querying internal knowledge bases to generating analytics reports. Under the hood, it's a multi-agent system orchestrated by LangGraph. A supervisor agent receives each user query and routes it to the most appropriate specialist: the RAG agent for document searches, the Data Analyst for SQL queries and visualisations, the HR agent for employee lookups, and so on. Each agent has its own tools, its own system prompts, its own personality quirks.

The architecture is elegant when you diagram it on a whiteboard. In production, it's a symphony of async generators, state machines, and streaming protocols all trying to stay synchronised across a Python backend and React frontend.

The feature request seemed innocent enough: show users which agent is currently responding to their question.

Simple, right? The supervisor picks an agent, the agent starts streaming its response, we display "HR Agent is responding..." in the UI. Maybe an afternoon of work. Maybe less.

Twelve hours later, I had touched five different layers of the stack. I'd discovered that our streaming events were being emitted in an order that made perfect sense to LangGraph and absolutely no sense to our frontend. I'd learned more than I ever wanted to know about Server-Sent Events buffering. I'd written and deleted approximately 400 lines of code. I'd consumed enough coffee to concern my future cardiologist.

But I'd also uncovered something genuinely interesting: the gap between how we think streaming multi-agent systems work and how they actually work is vast, underdocumented, and full of subtle edge cases that only reveal themselves in production.

This is a deep dive into that debugging journey — the wrong turns, the "aha" moments, and the surprisingly elegant solution we landed on. If you're building anything with LangGraph, streaming AI responses, or multi-agent architectures, I hope our pain saves you some of yours.

The architecture

Building a production AI chat system with multiple agents and real-time streaming requires a carefully orchestrated pipeline. Each layer serves a specific purpose, and data must flow correctly through every single one — or the user sees nothing.

[Figure: the streaming pipeline. Stages: 01 LangGraph execution (astream_events), 02 Callback handler (v5), 03 SSE transport (StreamingResponse), 04 Frontend consumption (useChat), 05 UI rendering (React 19). Labels: User query, Knowledge base, LLM events, Streaming tokens, Tool calls, Rendered chunks.]
Fig. 1. The pipeline a single token traverses, from a user's query down to a rendered chunk in the chat window.

The streaming pipeline

The journey of a single token from LLM to user interface traverses five distinct layers:

  1. LangGraph execution (astream_events with subgraphs=True) — the AI graph runs, potentially spawning nested subagents via the DeepAgents library's task tool. Each node emits events: on_chat_model_stream, on_tool_start, on_tool_end.
  2. Callback handler (langgraph_callback_v5.py) — transforms raw LangGraph events into Vercel AI SDK protocol format. Text chunks become 0:"content", tool invocations become 9:{"toolCallId":...}, and completion signals become d:{"finishReason":"stop"}.
  3. SSE transport — FastAPI's StreamingResponse pushes formatted chunks over HTTP. The connection stays open, delivering incremental updates without polling.
  4. Frontend consumption — the Vercel AI SDK's useChat hook parses the AI SDK protocol, and @assistant-ui/react-ai-sdk bridges the resulting messages into assistant-ui, updating React state as chunks arrive.
  5. UI rendering — React 19 components render messages, tool calls, and streaming indicators in real-time.
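To make layer 2 concrete, here is a minimal sketch of the kind of formatting the callback handler performs. The function names are hypothetical, and the toolName and args keys, the compact JSON encoding, and the newline terminator are assumptions about the wire format rather than a copy of our handler.

```python
import json


def _encode(value) -> str:
    """Compact JSON, matching the spacing of the examples in the list above."""
    return json.dumps(value, separators=(",", ":"))


def format_text_part(text: str) -> str:
    """Text chunk: type prefix 0, then a JSON-encoded string."""
    return f"0:{_encode(text)}\n"


def format_tool_call_part(tool_call_id: str, tool_name: str, args: dict) -> str:
    """Tool invocation: type prefix 9, then a JSON object (key names assumed)."""
    payload = {"toolCallId": tool_call_id, "toolName": tool_name, "args": args}
    return f"9:{_encode(payload)}\n"


def format_finish_part(finish_reason: str = "stop") -> str:
    """Completion signal: type prefix d, then a JSON object."""
    return f"d:{_encode({'finishReason': finish_reason})}\n"


print(format_text_part("Hello"), end="")
print(format_finish_part(), end="")
```

Each part is a single line, so the frontend can split the stream on newlines and dispatch on the prefix before the first colon.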

Why this complexity exists

This architecture isn't over-engineering — it's the minimum viable solution for three requirements:

Real-time streaming. Users expect character-by-character responses. Batch responses feel broken in 2024. LangGraph's astream_events API provides the granularity; the callback handler bridges the format gap.

Multi-agent orchestration. A single agent can't handle RAG search, SQL queries, web lookups, and report generation equally well. The supervisor pattern routes queries to specialists. Subagents handle complex multi-step tasks. Each needs its events properly namespaced and surfaced.

Proper UI rendering. Tool calls need structured display — collapsible panels, loading states, result previews. The AI SDK protocol provides semantic meaning (tool_call vs text), enabling the frontend to render appropriately rather than dumping raw JSON.

Data must flow correctly through every layer. A bug in the callback handler's tool serialisation breaks the frontend's tool rendering. A missing subgraphs=True flag silences all subagent output. An incorrect content type header prevents SSE parsing. The system is only as reliable as its weakest transformation.

The problem: "subagent" everywhere

It started with a user complaint that seemed trivial: "The agent indicator always says 'subagent' instead of showing which agent is actually working."

The Algorise platform uses a supervisor architecture where specialised AI agents handle different types of queries. An HR question routes to the hr-agent. A document lookup goes to the rag-agent. Data analysis queries land with the data-analyst-agent. The UI was designed to show users exactly which agent was handling their request — a small but important piece of transparency in an AI-powered system.

Except it wasn't working. Every single query, regardless of type, displayed the same unhelpful label: "subagent".

First steps: the obvious suspect

My first instinct was to blame the frontend. The DeepAgentTaskToolUI component was rendering correctly; it just had the wrong data. Classic prop drilling issue, I assumed. Someone probably forgot to pass the value down from a parent component.

// DeepAgentTaskToolUI.tsx
interface DeepAgentTaskToolUIProps {
  subagent_type?: string;
  status: 'running' | 'completed' | 'error';
}
 
export function DeepAgentTaskToolUI({
  subagent_type = 'subagent',  // <- The default that kept appearing
  status
}: DeepAgentTaskToolUIProps) {
  return (
    <div className="flex items-center gap-2">
      <AgentIcon type={subagent_type} />
      <span className="text-sm text-muted-foreground">
        {formatAgentName(subagent_type)} {/* Always showed "Subagent" */}
      </span>
    </div>
  );
}

I traced the props through the component tree. The data was being passed. The streaming infrastructure was delivering messages. The tool call was arriving at the component with all its fields intact.

But when I logged the incoming props, I saw it: subagent_type was genuinely coming in as undefined, triggering the default value. The frontend was innocent.

The plot thickens

This shifted the investigation upstream. If the frontend was receiving undefined, then the backend was sending undefined. But why? The supervisor architecture clearly knew which agent it was invoking. The routing logic was working — queries were going to the right agents and returning correct answers.

Somewhere between "supervisor decides to call hr-agent" and "frontend receives the tool call," the agent type was getting lost. The question that would consume the next several hours crystallised:

Where exactly does subagent_type get set, and why isn't it being set correctly?

The answer, as I would discover, lay buried in the streaming infrastructure — in the gap between what LangGraph emitted and what our API transformed for the frontend. But I didn't know that yet. All I knew was that data was disappearing somewhere in the pipeline, and I needed to find where.

Phase 1: The tool name mismatch

The first step in any debugging journey is understanding what's actually happening versus what you think is happening. Armed with a failing test case and zero visibility into the streaming pipeline, I did what any self-respecting debugger does: I added logging. Lots of it.

logger.info(f"Processing tool: {tool_name}")
logger.info(f"Raw tool args: {tool_args}")
logger.info(f"Is deep agent task check: {tool_name == 'mcp__algorise__task'}")

The logs started flowing, and almost immediately something looked wrong:

Processing tool: task
Is deep agent task check: False

Wait. task? Not mcp__algorise__task?

I dove into the backend streaming handler. There it was, the culprit hiding in plain sight:

is_deep_agent_task = tool_name == "mcp__algorise__task"
 
if is_deep_agent_task:
    # Special handling for subagent tasks
    # Parse nested agent info, extract title, etc.
    pass

This check was always returning False. DeepAgents doesn't send the fully-qualified MCP tool name — it just sends "task". The special handling code? Dead code. It had probably never executed successfully in production.

But here's where it got interesting. If the special handling never ran, why did the subagent icon appear in the UI at all?

More logging revealed the answer. When a tool didn't match any special cases, it fell through to generic tool handling, which included this little gem:

# Auto-prefix unprefixed tool names
if not tool_name.startswith("mcp__"):
    tool_name = f"mcp__algorise__{tool_name}"

So "task" became "mcp__algorise__task" after the special handling check. The frontend received a tool call with the correct name and dutifully rendered the subagent icon. The icon worked entirely by accident.
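The ordering problem is easier to see when the two snippets are put side by side. This is a simplified reconstruction, not the real handler: handle_tool_buggy is a hypothetical name, and everything except the check and the prefixing step is left out.

```python
def handle_tool_buggy(tool_name: str) -> tuple[bool, str]:
    """Return (special_handling_ran, name_sent_to_frontend)."""
    # The check runs against the raw name that DeepAgents sends ("task")...
    is_deep_agent_task = tool_name == "mcp__algorise__task"

    # ...and only afterwards does the generic path add the prefix.
    if not tool_name.startswith("mcp__"):
        tool_name = f"mcp__algorise__{tool_name}"

    return is_deep_agent_task, tool_name


print(handle_tool_buggy("task"))  # (False, 'mcp__algorise__task')
```

The frontend sees the prefixed name and renders the icon, while the backend has already skipped the branch that was supposed to extract the agent details.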

The fix seemed obvious:

is_deep_agent_task = tool_name in ("task", "mcp__algorise__task")

I deployed the change, ran the test case again, and... still broken. Different broken, but broken.

The icon appeared. The tool was recognised. But the subagent title was "undefined" and the status showed "working on unknown task." The args object was essentially empty — or rather, it contained default values instead of the actual task information.

{
  "title": "undefined",
  "agent_type": "unknown",
  "instructions": ""
}

The logging confirmed it. Even though we now entered the special handling block, tool_args contained nothing useful. The nested agent information that should have been there — the task title, the agent type, the instructions — wasn't being captured.

One problem down. Another one staring me in the face.

Phase 2: The library patch

With the tracking mechanism identified, I needed a way to pass the agent name from the backend to the frontend. The Vercel AI SDK provides a providerMetadata field specifically for passing provider-specific data alongside messages. This seemed perfect.

First attempt: the naive approach

I started by having the backend include the agent name in the stream response:

{ "providerMetadata": { "agentName": "deep_agent" } }

Simple enough, right? The AI SDK immediately rejected this with a validation error. After digging into the SDK's type definitions, I discovered that providerMetadata expects a nested object structure where the top-level keys represent provider namespaces. You cannot pass primitive values directly.

The fix was straightforward once I understood the schema:

{ "providerMetadata": { "algorise": { "agentName": "deep_agent" } } }

Now the backend was sending valid providerMetadata. I verified this by inspecting the raw SSE events in the browser's network tab. The data was there, properly structured, flowing through the stream.
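A small guard on the backend would have caught the shape error before it reached the SDK. The sketch below only mirrors the shape rule described above (every top-level value must itself be an object). It is not the SDK's own validator, and both helper names are hypothetical.

```python
import json
from typing import Any


def is_namespaced_metadata(metadata: Any) -> bool:
    """True when every top-level value is an object, i.e. a provider namespace."""
    return isinstance(metadata, dict) and all(
        isinstance(key, str) and isinstance(value, dict)
        for key, value in metadata.items()
    )


def agent_metadata(agent_name: str, namespace: str = "algorise") -> dict:
    """Wrap the agent name under a provider namespace key."""
    return {namespace: {"agentName": agent_name}}


rejected = {"agentName": "deep_agent"}   # primitive at the top level
accepted = agent_metadata("deep_agent")  # nested under "algorise"

print(is_namespaced_metadata(rejected))  # False
print(is_namespaced_metadata(accepted))  # True
print(json.dumps({"providerMetadata": accepted}))
```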

The vanishing metadata

Here is where things got interesting. Despite the backend correctly sending providerMetadata, when I logged the message parts on the frontend, the metadata was nowhere to be found. The text content arrived perfectly, but the providerMetadata had simply vanished.

I traced the data flow through the frontend code. The SSE stream came in correctly. The Vercel AI SDK parsed it correctly. But somewhere between the AI SDK and our assistant-ui components, the metadata disappeared.

The culprit was @assistant-ui/react-ai-sdk, the library that bridges Vercel AI SDK messages to assistant-ui's format. Examining its convertMessage.js file revealed the problem:

return {
  type: "text",
  text: part.text,
  // providerMetadata? Never heard of it.
};

The library was constructing new text part objects and simply not including the providerMetadata field. Our agent tracking data was being stripped out during conversion.

The surgical fix

I could have forked the library, but that creates maintenance overhead. Instead, I reached for patch-package, a tool that lets you make targeted patches to node_modules that persist across installs.

The patch was minimal:

// convertMessage.js
return {
  type: "text",
  text: part.text,
+ ...(part.providerMetadata && { providerMetadata: part.providerMetadata }),
};

A single line, conditionally spreading providerMetadata onto the returned object when it exists. I ran npx patch-package @assistant-ui/react-ai-sdk to generate the patch file, then added a postinstall script to package.json to ensure the patch applies automatically:

{
  "scripts": {
    "postinstall": "patch-package"
  }
}

Validation

After running npm install to trigger the postinstall hook, I tested the full flow again. This time, logging the message parts showed the complete picture:

{
  type: "text",
  text: "Here's the analysis...",
  providerMetadata: {
    algorise: {
      agentName: "deep_agent"
    }
  }
}

The metadata was flowing through. Now I just needed to wire it up to the UI components to display the agent name. We were finally getting somewhere.

Phase 3: The streaming chunk mystery

With the tool name detection fixed, I ran the code again expecting victory. Instead, I got this:

args_dict = {}
subagent_type = args_dict.get("subagent_type", "subagent")  # Always "subagent"

The arguments dictionary was empty. Every. Single. Time.

I added comprehensive logging to trace exactly what was flowing through the stream:

logger.debug(f"Tool call chunk: name={chunk.name}, id={chunk.id}, args={chunk.args}, index={chunk.index}")

What I saw next fundamentally changed my understanding of how LangGraph streaming works.

The revelation

The console output told a story I hadn't anticipated:

DEBUG: Tool call chunk: name='task', id='toolu_01ABC123', args='', index=1
DEBUG: Tool call chunk: name=None, id=None, args='{"descrip', index=1
DEBUG: Tool call chunk: name=None, id=None, args='tion": "An', index=1
DEBUG: Tool call chunk: name=None, id=None, args='alyzing sa', index=1
DEBUG: Tool call chunk: name=None, id=None, args='les data",', index=1
DEBUG: Tool call chunk: name=None, id=None, args=' "subagent', index=1
DEBUG: Tool call chunk: name=None, id=None, args='_type": "d', index=1
DEBUG: Tool call chunk: name=None, id=None, args='ata_analys', index=1
DEBUG: Tool call chunk: name=None, id=None, args='t"}', index=1

I stared at this output for a solid minute before the implications sank in. The tool call wasn't arriving as a single unit. Its arguments were being streamed in fragments of about ten characters each, spread across a series of chunks that mostly carried no name and no id.

The chunk structure

Each tool_call_chunk object has four fields:

  • name: the tool name — only present in the first chunk
  • id: the unique tool call identifier — only present in the first chunk
  • args: a fragment of the JSON arguments string
  • index: the position identifier that links related chunks together

The first chunk is special. It announces the tool call with its name and ID, but the args field is an empty string. This makes sense from a streaming perspective: the LLM has decided to call a tool and knows which one, but hasn't yet generated the arguments.

Subsequent chunks carry the actual argument data, streaming in piece by piece as the model generates them. But here's the critical detail: these continuation chunks have name=None and id=None. The only way to associate them with the original tool call is through the index field.

Why the original approach was doomed

The original code's logic was essentially:

if chunk.name == "task":
    args_dict = json.loads(chunk.args)  # chunk.args is ""
    subagent_type = args_dict.get("subagent_type", "subagent")
    # Start emitting task events immediately

This approach was fundamentally broken for three reasons:

  1. Timing. The detection happened on chunk 1, but the arguments didn't exist yet. Calling json.loads on an empty string raises json.JSONDecodeError, so there was nothing to extract subagent_type from.
  2. Fragmentation. Even if we waited for more chunks, chunk.args only contains a fragment like '{"descrip'. That's not valid JSON. You can't parse it.
  3. Identity loss. After chunk 1, subsequent chunks have id=None. The code had no mechanism to say "this chunk with partial args belongs to that tool call I detected earlier."

The original implementation was racing against the stream and losing. It was like trying to read a book by only looking at the first page of each chapter.
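The first two failure modes can be reproduced with nothing but the standard library. The fragments below are copied from the debug log above; try_parse is a hypothetical helper written for this illustration.

```python
import json
from typing import Optional


def try_parse(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if raw is not a complete JSON object."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# The args values from the logged chunks, in arrival order.
fragments = ['', '{"descrip', 'tion": "An', 'alyzing sa', 'les data",',
             ' "subagent', '_type": "d', 'ata_analys', 't"}']

# No single chunk is parseable on its own, including the first one.
print([try_parse(fragment) for fragment in fragments])

# Only the concatenation of every fragment is valid JSON.
print(try_parse("".join(fragments)))
```

Every per-chunk attempt comes back as None, and only the joined string yields a dictionary containing subagent_type.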

The path forward

The solution crystallised. Instead of acting on the first chunk, I needed to:

  1. Detect the tool call initiation (chunk 1 with name and id)
  2. Accumulate all argument fragments using the index field as a correlation key
  3. Concatenate the fragments into complete JSON
  4. Parse and extract subagent_type only when the JSON was complete
  5. Then — and only then — emit the task-started event

This meant maintaining state across the stream. I would need to track which tool calls were in flight, buffer their arguments, and detect when they were complete. The simple stateless chunk-by-chunk processing had to become a stateful accumulator.
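As a first approximation, the five steps fit into a small class. This is a sketch under simplifying assumptions: chunks are modelled as plain dicts with the four fields described earlier, buffers are keyed by index alone, and TaskArgsAccumulator is a hypothetical name rather than anything from LangGraph.

```python
import json
from typing import Optional


class TaskArgsAccumulator:
    """Buffers streamed tool-call argument fragments, keyed by chunk index."""

    def __init__(self) -> None:
        self._pending: dict[int, dict] = {}

    def feed(self, chunk: dict) -> Optional[dict]:
        """Feed one chunk; return the full call once its JSON args are complete."""
        index = chunk["index"]
        if chunk.get("id"):
            # First chunk of a call: remember its identity and start a buffer.
            self._pending[index] = {"id": chunk["id"], "name": chunk["name"], "buffer": ""}
        entry = self._pending.get(index)
        if entry is None:
            return None  # Continuation of a call we never saw start; ignore it.
        entry["buffer"] += chunk.get("args") or ""
        try:
            args = json.loads(entry["buffer"])
        except json.JSONDecodeError:
            return None  # Still incomplete; keep accumulating.
        del self._pending[index]  # Complete: stop tracking this call.
        return {"id": entry["id"], "name": entry["name"], "args": args}


chunks = [
    {"name": "task", "id": "toolu_01ABC123", "args": "", "index": 1},
    {"name": None, "id": None, "args": '{"subagent_type": ', "index": 1},
    {"name": None, "id": None, "args": '"hr-agent"}', "index": 1},
]
accumulator = TaskArgsAccumulator()
results = [accumulator.feed(chunk) for chunk in chunks]
print(results[-1])
```

Using a successful json.loads as the completion signal works here because a JSON object only becomes valid once its closing brace has arrived.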

The solution: buffering with index and namespace

After hours of debugging, the solution crystallised into a clear pattern: we needed to buffer pending task tools while tracking both their chunk index and namespace. Neither alone was sufficient — together, they provided the unique key we needed to match streaming argument chunks to their originating tool calls.

The buffering strategy

The core insight was that when a task tool first appears in a chunk, it has an id — but all subsequent argument chunks arrive with id=None. We needed to hold onto the initial tool's metadata and accumulate arguments until we had enough to parse:

pending_task_tools: dict[str, dict] = {}
 
# When we first see a task tool with an id:
if tool_id and tool_name == "task":
    pending_task_tools[tool_id] = {
        "tool_name": tool_name,
        "accumulated_args": tool_args,
        "index": chunk_index,
        "namespace": namespace_tuple,
    }

The accumulated_args field acts as our buffer. Each time we receive a chunk with id=None, we append its argument fragment to the appropriate buffer. But how do we know which buffer?

Why index alone wasn't enough

This is where we hit a subtle but critical issue. Initially, I tried matching chunks purely by their index field — the position of the tool call within the message's tool array. A chunk at index=1 should match a pending tool at index=1, right?

Wrong.

When the supervisor agent delegates to a subagent like hr-agent, LangGraph creates a nested execution context. The supervisor's tools might have namespace=() (the root), while the subagent's tools arrive with namespace=('tools:uuid-here',). Both agents might have a tool at index=1 — the supervisor calling task, the subagent calling some internal tool.

Without namespace discrimination, we'd accidentally concatenate argument fragments from different agents, corrupting both tool calls and producing garbage JSON. The fix was straightforward once understood:

# Match by BOTH index AND namespace
for tool_id, pending_data in pending_task_tools.items():
    if (pending_data.get("index") == chunk_index and
        pending_data.get("namespace") == namespace_tuple):
        pending_data["accumulated_args"] += tool_args
        break
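The collision is easy to reproduce offline. The interleaving below is contrived for illustration, and the fragment contents and the accumulate helper are made up, but it shows what happens when two calls share index=1 and the buffer key ignores the namespace.

```python
import json
from collections import defaultdict

# (namespace, index, args fragment) for two interleaved calls sharing index=1.
interleaved = [
    ((), 1, '{"subagent_type": '),
    (("tools:uuid-here",), 1, '{"query": '),
    ((), 1, '"hr-agent"}'),
    (("tools:uuid-here",), 1, '"leave policy"}'),
]


def accumulate(chunks, key_fn) -> dict:
    """Concatenate fragments into buffers chosen by key_fn(namespace, index)."""
    buffers = defaultdict(str)
    for namespace, index, fragment in chunks:
        buffers[key_fn(namespace, index)] += fragment
    return dict(buffers)


def is_valid_json(raw: str) -> bool:
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


by_index = accumulate(interleaved, lambda namespace, index: index)
by_both = accumulate(interleaved, lambda namespace, index: (namespace, index))

print(is_valid_json(by_index[1]))                 # False: fragments were mixed
print(json.loads(by_both[((), 1)]))               # {'subagent_type': 'hr-agent'}
print(json.loads(by_both[(("tools:uuid-here",), 1)]))  # {'query': 'leave policy'}
```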

The complete flow

The full algorithm flows like this:

  1. Initial tool chunk arrives with id="call_abc123", name="task", args='', index=1, namespace=()
  2. Buffer it. Store the id as key, capture index/namespace, start accumulating args
  3. Subsequent chunks arrive with id=None, args='{"sub', then args='agent_', and so on, all with the same index/namespace
  4. Match and append. Find the pending tool by index+namespace, concatenate args
  5. Attempt parse. After each append, try json.loads(accumulated_args)
  6. Success. When parsing succeeds and we find subagent_type, emit the tracking event

try:
    args_dict = json.loads(pending_data["accumulated_args"])
    subagent_type = args_dict.get("subagent_type")
    if subagent_type:
        yield format_agent_tracking_event(
            agent_name=subagent_type,  # "hr-agent"
            status="running"
        )
        del pending_task_tools[tool_id]
except json.JSONDecodeError:
    # Not complete yet, keep accumulating
    pass

The moment of truth

Running the test again, watching the logs scroll by — and there it was:

[STREAMING] Emitting agent tracking: hr-agent (running)

The frontend finally received the event. The UI updated. "HR Agent" appeared in the active agents list while the query processed. No more phantom agents, no more missed delegations.

What actually generalises

What we changed

  • Patched @assistant-ui/react-ai-sdk to preserve providerMetadata on text parts
  • Fixed tool name detection to accept both task and mcp__algorise__task
  • Implemented buffering with index+namespace matching for streaming tool args
  • Added providerMetadata to pass agent identity through the SSE pipeline

What held up

  • Streaming is not a request/response. Design for incompleteness.
  • Third-party libraries encode their authors' assumptions
  • Accidental correctness delays discovery until the workaround fails
  • Nested execution needs identity carried through the pipeline

Streaming changes everything

Perhaps the most fundamental lesson: streaming data behaves nothing like complete request/response cycles. When you send a full JSON payload and receive a full response, you can inspect, validate, and debug at clear boundaries. With streaming, those boundaries dissolve. The first chunk might contain metadata, or it might not. The tool name might arrive before its arguments, or after. You cannot assume temporal ordering guarantees semantic completeness.

If you are building streaming systems, design for incompleteness from day one. Buffer aggressively. Validate defensively. Never assume the current chunk contains everything you need.

Libraries encode assumptions

The @assistant-ui/react-ai-sdk library stripped providerMetadata during its internal transformations. This was not malicious or even a bug from their perspective — their use cases simply did not require that field. But our multi-agent architecture absolutely depended on it.

This is a broader truth about third-party code: every library embeds its authors' assumptions about how it will be used. When your use case diverges from theirs, you have three options: work around it, fork it, or patch it. We chose strategic patching, which preserved our ability to receive upstream updates while fixing our immediate need.

Coincidences mask real bugs

The UI appeared to work correctly for months because of an unrelated auto-prefixing behaviour. The actual data path was broken, but a different code path produced visually identical results. This kind of accidental correctness is insidious — it delays discovery until the workaround fails.

When something works but you cannot explain exactly why, investigate. That mystery is a bug waiting to surface at the worst possible moment.

Context becomes critical in nested execution

Multi-agent systems introduce a problem that single-agent systems never face: when Agent A spawns Agent B, and both emit tool calls, which output belongs to which context? Our breakthrough came from implementing proper namespace tracking with index-based buffering. Each agent's emissions now carry their identity through the entire pipeline.

Logging reveals truth

Every significant breakthrough in this debugging journey came immediately after adding strategic logging. Not generic logging everywhere, but targeted instrumentation at the boundaries where data transformed. The logs told us what the code actually did, rather than what we assumed it did.
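One way to get that kind of targeted instrumentation is a pass-through wrapper placed at each boundary between stages. This is a sketch, not our production code: tap, the stage label, and fake_events are hypothetical names, and the fake stream stands in for real LangGraph events.

```python
import asyncio
import logging
from typing import AsyncIterator, TypeVar

T = TypeVar("T")
logger = logging.getLogger("pipeline")


async def tap(stage: str, stream: AsyncIterator[T]) -> AsyncIterator[T]:
    """Log every item crossing a stage boundary, then pass it on unchanged."""
    async for item in stream:
        logger.debug("[%s] %r", stage, item)
        yield item


async def fake_events() -> AsyncIterator[str]:
    """Stand-in for an upstream stage emitting argument fragments."""
    for fragment in ('{"subagent_type": ', '"hr-agent"}'):
        yield fragment


async def collect() -> list:
    return [item async for item in tap("langgraph->callback", fake_events())]


logging.basicConfig(level=logging.DEBUG)
result = asyncio.run(collect())
print(result)
```

Because the wrapper yields exactly what it receives, it can be added or removed at any boundary without changing what the downstream stage sees.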

The problems you solve today become the intuition you rely on tomorrow. Write them down.