Skills Development Multi-Agent System Architecture Design

Multi-Agent System Architecture Design

v20260509
multi-agent-architect
This skill guides the design and optimization of production-grade multi-agent systems. It specializes in using LangGraph, LangChain, and DeepAgents to create structured, complex AI workflows. It covers state management, agent routing, memory integration, and tool calling, making it ideal for building robust, scalable agents like supervisors, planners, and coders.
Get Skill
346 downloads
Overview

Multi-Agent Architect & Updater Skill

Overview

This skill turns Claude into a Senior AI Multi-Agent Architect specialized in LangGraph, LangChain, and DeepAgents. It provides structured workflows for creating and updating production-grade multi-agent systems — including supervisor agents, planners, researchers, coders, and memory-backed autonomous pipelines. Use it whenever you need to design, build, debug, or scale any multi-agent AI system.

If this skill adapts material from an external GitHub repository, declare both:

  • source_repo: owner/repo
  • source_type: official or source_type: community

When to Use This Skill

  • Use when you need to create a new agent or multi-agent workflow from scratch
  • Use when working with LangGraph state graphs, nodes, edges, or conditional routing
  • Use when the user asks about agent communication, memory systems, or tool-calling pipelines
  • Use when debugging or optimizing an existing LangChain/LangGraph agent system
  • Use when architecting supervisor, planner, research, coding, or validation agent roles
  • Use when integrating DeepAgents with hierarchical planning and delegation

How It Works

Step 1: Understand the Goal

Before writing any code, clarify:

  • What is the business objective this agent system must achieve?
  • What agent roles are needed (supervisor, planner, researcher, coder, validator)?
  • What tools does each agent require?
  • What memory strategy is needed (Redis, Vector DB, LangChain Memory)?
  • What communication protocol connects agents (shared state, message passing)?

Step 2: Define the State Schema

All agents share a typed state object passed through the graph:

from typing import TypedDict

class AgentState(TypedDict):
    user_goal: str
    tasks: list[str]
    completed_tasks: list[str]
    next_agent: str
    context: dict
    step_count: int          # guards against infinite loops
    error: str | None

Step 3: Define Agent Nodes

Each agent is an async function that reads from state and returns an updated state:

import logging
from langchain_openai import ChatOpenAI

logger = logging.getLogger(__name__)

async def research_node(state: AgentState) -> AgentState:
    logger.info("research_node: starting")
    llm = ChatOpenAI(model="gpt-4o")
    result = await llm.bind_tools(research_tools).ainvoke(state["user_goal"])
    state["context"]["research"] = result.content
    state["next_agent"] = "coder"
    return state

Step 4: Build the LangGraph

Wire nodes together with edges and conditional routing:

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode

def build_graph() -> StateGraph:
    graph = StateGraph(AgentState)

    graph.add_node("supervisor", supervisor_node)
    graph.add_node("research",   research_node)
    graph.add_node("coder",      coding_node)
    graph.add_node("validator",  validation_node)
    graph.add_node("tools",      ToolNode(all_tools))

    graph.set_entry_point("supervisor")

    graph.add_conditional_edges(
        "supervisor",
        route_next,
        {"research": "research", "coder": "coder", "end": END}
    )

    graph.add_edge("research",  "supervisor")
    graph.add_edge("coder",     "validator")
    graph.add_edge("validator", "supervisor")

    return graph.compile()

def route_next(state: AgentState) -> str:
    if state["step_count"] > 20:
        return "end"
    return state["next_agent"]

Step 5: Add Memory

from langchain_community.chat_message_histories import RedisChatMessageHistory

def get_memory(session_id: str):
    return RedisChatMessageHistory(
        session_id=session_id,
        url=os.getenv("REDIS_URL"),
        ttl=3600
    )

Step 6: Run the Graph

async def run(user_goal: str, session_id: str):
    graph = build_graph()
    initial_state = AgentState(
        user_goal=user_goal,
        tasks=[],
        completed_tasks=[],
        next_agent="supervisor",
        context={},
        step_count=0,
        error=None,
    )
    return await graph.ainvoke(initial_state)

Step 7: Expose via FastAPI (optional)

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RunRequest(BaseModel):
    goal: str
    session_id: str

@app.post("/run")
async def run_agent(req: RunRequest):
    result = await run(req.goal, req.session_id)
    return {"result": result}

Updating an Existing Agent

When the user wants to update or debug an existing agent, structure the response as:

## Existing Issue
[Describe the current problem]

## Root Cause
[Identify why it's happening in the architecture]

## Proposed Update
[Outline the changes at architecture level]

## Updated Code
[Generate only the changed modules]

## Migration Notes
[What breaks, what's backward-compatible]

## Performance Impact
[Latency / token / memory delta]

Standard Folder Structure

Always generate code in this layout:

multi_agent_system/
├── agents/          # One file per agent role
├── tools/           # Tool definitions and wrappers
├── memory/          # Redis, VectorDB, LangChain memory helpers
├── prompts/         # Prompt templates (one per agent)
├── workflows/       # High-level orchestration logic
├── graphs/          # LangGraph state + compiled graph definitions
├── api/             # FastAPI routes (optional)
├── configs/         # Config loader — no secrets in code
├── tests/           # Unit + integration tests per agent
└── main.py

Examples

Example 1: Research + Coding Multi-Agent Workflow

# agents/research_agent.py
async def research_node(state: AgentState) -> AgentState:
    llm = ChatOpenAI(model="gpt-4o").bind_tools([web_search, rag_search])
    response = await llm.ainvoke(
        f"Research the following and return structured findings:\n{state['user_goal']}"
    )
    state["context"]["research"] = response.content
    state["next_agent"] = "coder"
    return state

# agents/coding_agent.py
async def coding_node(state: AgentState) -> AgentState:
    llm = ChatOpenAI(model="gpt-4o").bind_tools([python_repl, github_tool])
    response = await llm.ainvoke(
        f"Given this research:\n{state['context']['research']}\n\nWrite production Python code."
    )
    state["context"]["code"] = response.content
    state["next_agent"] = "validator"
    return state

Example 2: Supervisor with Dynamic Delegation

# agents/supervisor_agent.py
DELEGATION_PROMPT = """
You are a supervisor. Given the current state, decide the next agent.
Available agents: research, coder, validator, end.
Respond with ONLY the agent name.

Goal: {goal}
Completed: {completed}
Context keys available: {context}
"""

async def supervisor_node(state: AgentState) -> AgentState:
    state["step_count"] += 1
    llm = ChatOpenAI(model="gpt-4o")
    decision = await llm.ainvoke(
        DELEGATION_PROMPT.format(
            goal=state["user_goal"],
            completed=state["completed_tasks"],
            context=list(state["context"].keys()),
        )
    )
    next_agent = decision.content.strip().lower()
    # Validate against allowlist before setting
    allowed = {"research", "coder", "validator", "end"}
    state["next_agent"] = next_agent if next_agent in allowed else "end"
    return state

Example 3: DeepAgents Reflection Loop

async def reflection_node(state: AgentState) -> AgentState:
    llm = ChatOpenAI(model="gpt-4o")
    critique = await llm.ainvoke(
        f"Evaluate this output critically:\n{state['context'].get('code', '')}\n"
        "List any bugs, gaps, or improvements. Be concise."
    )
    state["context"]["critique"] = critique.content
    state["next_agent"] = "coder" if "bug" in critique.content.lower() else "end"
    return state

Best Practices

  • ✅ One agent = one responsibility — never combine planning + coding + testing in one node
  • ✅ Use TypedDict for all state schemas — enables type checking and graph validation
  • ✅ Bind only the tools each agent needs — reduces hallucinated tool calls
  • ✅ Always add a step_count guard to prevent infinite routing loops
  • ✅ Use async/await throughout — LangGraph supports async natively
  • ✅ Store all secrets in environment variables loaded via os.getenv()
  • ✅ Set TTLs on all Redis keys scoped to session_id
  • ✅ Log at every node entry and tool call for observability
  • ✅ Validate supervisor routing output against an allowlist of agent names
  • ❌ Don't hardcode API keys, model names, or Redis URLs
  • ❌ Don't share tool lists across agents that don't need them
  • ❌ Don't skip error handling — tool failures and empty LLM responses are common
  • ❌ Don't trust unvalidated LLM routing decisions — always check against an allowlist

Limitations

  • This skill does not replace environment-specific testing, load testing, or security review before production deployment.
  • Generated LangGraph code targets the current stable API — always verify method signatures against your installed version (pip show langgraph).
  • Stop and ask for clarification if the agent's goal, tool permissions, or routing logic is ambiguous before generating a full architecture.
  • DeepAgents integration patterns assume the library is installed and configured in the target environment.

Security & Safety Notes

  • Never expose API keys in generated code. All secrets must use environment variables:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")   # ✅ correct
    OPENAI_API_KEY = "sk-..."                        # ❌ never do this
    
  • Always validate and sanitize user inputs before injecting them into agent prompts — treat all user input as untrusted.
  • Add a permission layer before allowing agents to execute shell commands or write to filesystems.
  • If generating a Python REPL tool node, document that it must only run in a sandboxed, isolated environment.
  • For production deployments, add rate-limit handling and exponential backoff on all LLM and external API calls.
  • Scope all Redis session keys to session_id and set a TTL to prevent memory leaks across sessions.

Common Pitfalls

  • Problem: Agent loops indefinitely between supervisor and sub-agents
    Solution: Add step_count: int to state; return "end" in route_next() when step_count > N

  • Problem: Supervisor routes to a non-existent agent name
    Solution: Validate the LLM's routing output against a hardcoded allowlist before setting next_agent

  • Problem: Memory leaks across user sessions
    Solution: Scope Redis keys to session_id and always set a TTL (ttl=3600)

  • Problem: Tool results are ignored by the next agent
    Solution: Always write tool output into state["context"] and confirm the next node reads it

  • Problem: Agents share too many tools and hallucinate wrong tool calls
    Solution: Use .bind_tools([only_relevant_tools]) per agent instead of a global tool list

  • Problem: Graph fails silently on API rate limits
    Solution: Wrap LLM calls in retry logic with exponential backoff using tenacity


Related Skills

  • @langchain-rag - When you need retrieval-augmented generation pipelines specifically
  • @fastapi-backend - When deploying agent systems as production REST APIs
  • @python-async - When deepening async/await patterns used throughout agent nodes
Info
Category Development
Name multi-agent-architect
Version v20260509
Size 12.01KB
Updated At 2026-05-10
Language