In this blog, we will cover the core concepts and practical techniques for building a multi AI-agent system.

What qualifies as an AI agent?

An AI agent is a system where artificial intelligence actively controls program flow and decision-making. The hallmark of an AI agent is its ability to determine "what happens next" rather than following pre-programmed logic. This is typically implemented through tool calling - where the Large Language Model (LLM) is given access to a set of tools that it can employ to accomplish tasks. These tools can range from simple information retrieval functions to complex AI-powered capabilities that can both access and modify data.

This is distinct from systems that merely use LLMs within a predetermined workflow - even if they leverage AI, they don't qualify as true AI agents if their logic flow is hardcoded rather than dynamically determined by the AI itself.

Single vs Multi AI-Agent systems

A multi-agent system emerges when we integrate AI-powered tools into an LLM's toolkit, enabling multiple AI components to collaborate seamlessly. This classification also applies to systems where multiple AI prompts are fed each other's outputs, enabling them to "chat" with each other. In both scenarios, the result is a sophisticated network of AI entities working in concert to achieve complex objectives.

Single AI-agent flow

Single AI-agent diagram (Image source)

Multi AI-agent flow

Multi AI-agent diagram (Image source)

In the multi-agent flow, the "manager" agent can interact with the "worker" agents to complete the task. In this case, we have three worker agents - one for Spanish, one for French and one for Italian translations.

Tool / Function Calling

Tool calling enables LLMs to execute functions you program and process their results. For example, a math function that takes "1+2" and returns "3" allows the LLM to do accurate calculations instead of hallucinating results.
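As a minimal sketch of what this looks like with OpenAI's chat-completions tool calling (the calculate function and its schema are illustrative, not part of the original example):

import json
from openai import OpenAI

client = OpenAI()

def calculate(expression: str) -> str:
    # Toy evaluator with builtins disabled; a real app would use a proper math parser.
    return str(eval(expression, {"__builtins__": {}}, {}))

tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression such as '1+2'.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 1+2?"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
tool_call = response.choices[0].message.tool_calls[0]

# Run the tool, then feed its result back so the model can produce the final answer.
result = calculate(**json.loads(tool_call.function.arguments))
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)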

Serial vs Parallel tool calling

LLMs can call multiple functions simultaneously or sequentially. In a recipe application with getMealIdea and getIngredientsForMeal tools, the LLM might call getMealIdea in parallel to generate multiple meals, then use those results to call getIngredientsForMeal. This is illustrated in the diagram below:

Tool calling example
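In code, parallel calls show up as multiple entries in a single tool_calls array, while serial calls arrive across successive model turns. Continuing the earlier snippet, a rough agent loop might look like this (getMealIdea and getIngredientsForMeal are stand-ins for your own functions):

# Hypothetical registry mapping the tool names the LLM sees to Python functions.
TOOLS = {
    "getMealIdea": get_meal_idea,
    "getIngredientsForMeal": get_ingredients_for_meal,
}

def run_agent(messages, tools_schema):
    while True:  # serial chaining: keep looping until the model stops requesting tools
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools_schema
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no more tool requests: this is the final answer
        messages.append(msg)
        # Parallel calls arrive as multiple entries in one tool_calls list.
        for call in msg.tool_calls:
            result = TOOLS[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})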

Error handling

If a tool throws an error (for any reason), instead of stopping the agentic flow, convert the error to a string and pass that error state back to the LLM as the tool result. For example, in the diagram above, if the getMealIdea("breakfast") call failed, we can tell the LLM: "Failed execution. Please try again." The LLM will usually retry the call, so the tokens already spent earlier in the flow are not wasted.
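Continuing the loop sketched above, the tool-execution step can be hardened so a failure is surfaced to the LLM instead of crashing the run:

for call in msg.tool_calls:
    try:
        result = str(TOOLS[call.function.name](**json.loads(call.function.arguments)))
    except Exception as exc:
        # Return the failure as the tool result; the model can retry or pick another tool.
        result = f"Failed execution of {call.function.name}: {exc}. Please try again."
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})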

Implementing tool calling

Here are two useful resources for implementing tool calling:

Cost considerations

Running AI agents can be very expensive: each tool response requires sending the entire conversation context back to the LLM, so token charges accumulate with every round trip.

Consider our meal recommendation example with three tool-call iterations:

- Initial request
- After getMealIdea results
- After getIngredientsForMeal results

Total cost: 4,202 input tokens + 276 output tokens

As tool complexity and call frequency increase, costs scale non-linearly due to this context accumulation pattern.
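To see the accumulation concretely, here is a rough sketch using the tiktoken tokenizer (message contents are placeholders and per-message formatting overhead is ignored) of how input tokens grow when the full history is re-sent on every round:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(messages) -> int:
    # Rough estimate: counts only message content, not formatting overhead.
    return sum(len(enc.encode(m["content"])) for m in messages)

history = [{"role": "user", "content": "Suggest a breakfast, a lunch and a dinner."}]
tool_outputs = ["<getMealIdea results>", "<getIngredientsForMeal results>"]

total_input_tokens = 0
for output in tool_outputs:
    total_input_tokens += count_tokens(history)   # the full history is re-sent this round
    history.append({"role": "assistant", "content": output})

total_input_tokens += count_tokens(history)        # final call after all tool results
print(total_input_tokens)                          # grows super-linearly with the number of rounds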

Cost reduction strategies (each covered in more detail later in this post): choose the cheapest model that can handle the task, use batch APIs when latency is not critical, and keep the context small by summarising results and avoiding token-heavy identifiers like UUIDs.

Structured outputs

Structured outputs enforce JSON schema compliance, enabling reliable control flow based on LLM responses. Instead of parsing unpredictable text, you get guaranteed JSON structure.

Example use case: Content moderation

You can use structured outputs to implement a content moderation tool. Here's an example of the JSON schema you can use:

{
    "offensive": true,
    "reason": "Contains profanity and hate speech"
}

Your code can then reliably branch logic based on the offensive boolean value.
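A minimal sketch of that branch, assuming the OpenAI SDK's Pydantic parsing helper (the model name and prompts are placeholders):

from openai import OpenAI
from pydantic import BaseModel

class ModerationResult(BaseModel):
    offensive: bool
    reason: str

client = OpenAI()

def moderate(text: str) -> ModerationResult:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify whether the text is offensive."},
            {"role": "user", "content": text},
        ],
        response_format=ModerationResult,
    )
    return completion.choices[0].message.parsed

result = moderate("some user comment")
if result.offensive:
    # Reliable branch: the schema guarantees the field exists and is a boolean.
    print("Blocked:", result.reason)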

Key applications in agent systems include control-flow branching (as above), planner task graphs, and reflection-agent critiques, both of which appear later in this post.

OpenAI's structured outputs guide provides implementation details.

Prompting techniques

Below are some of the most important prompting techniques that I have come across while building multi-agent systems.

Using XML

XML tags provide clear structure and organization for your prompts. Here's an effective system prompt template:

1"""
2You are a helpful assistant.
3
4<Task>
5....
6</Task>
7
8<Planning method>
9....
10</Planning method>
11
12<Examples>
13<Example 1>
14....
15</Example 1>
16
17<Example 2>
18....
19</Example 2>
20</Examples>
21
22<Important>
23....
24</Important>
25
26<Output format>
27....
28</Output format>
29"""

Benefits: the model can clearly distinguish instructions, examples, and output requirements; individual sections are easy to reference elsewhere in the prompt; and you can update one section without disturbing the rest.

Adding examples

Few-shot examples are the single most effective lever for steering LLM behaviour because they demonstrate—rather than describe—the pattern you want the model to follow.

Pro tip: When costs matter, distil your original few-shot prompt into a synthetic one-shot example using a cheaper model. You keep most of the performance while cutting tokens.

Effective tool descriptions

Poor tool metadata is the #1 reason agents hallucinate arguments. A good description has four parts: what the tool does, when not to call it, per-parameter descriptions with types and constraints, and an explicit list of required fields. For example:

{
  "name": "generate_itinerary",
  "description": "Use this to build a day-by-day travel plan. Do NOT call if the user only asks for flight prices.",
  "parameters": {
    "type": "object",
    "properties": {
      "city": {"type": "string", "description": "Destination city"},
      "days": {"type": "integer", "minimum": 1, "maximum": 14}
    },
    "required": ["city", "days"]
  }
}

You can even extend the description field to include examples of inputs and outputs if needed.

Adding escape options for the LLM

LLMs tend to be overly accommodating and will attempt to fulfill any request, even when they should refuse or acknowledge limitations. Explicitly providing "escape routes" prevents hallucination and inappropriate responses.

Example 1: Rejecting off-topic requests

For a meal recommendation agent, clearly define boundaries in your system prompt:

1"""
2<Important>
3If the user asks for something unrelated to meal recommendations, respond with: "Sorry, I cannot help with that."
4</Important>
5"""

Example 2: Handling impossible tasks

When building agents that query documents, prevent hallucination by providing a structured format for "not found" scenarios:

1"""
2<Output format>
3When information is found:
4{
5    "found": true,
6    "result": "..."
7}
8
9When information is not found:
10{
11    "found": false,
12    "result": null
13}
14</Output format>
15"""

This structured approach lets your application handle missing information gracefully rather than displaying hallucinated results.

Avoid UUIDs

Universally Unique Identifiers are great for databases but terrible for LLM contexts: they are long, they tokenize into many tokens, and models frequently copy them incorrectly, which silently breaks downstream lookups.

This means that if you are injecting objects from your database into the LLM context, you will need a mapper function that converts the UUIDs in those objects to shorter IDs (and back again when the LLM refers to them).
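A minimal sketch of such a mapper, assuming each record stores its UUID under an id key:

def shorten_ids(records: list[dict]) -> tuple[list[dict], dict[str, str]]:
    """Swap UUID primary keys for short IDs before building the prompt,
    returning a mapping so tool calls can be translated back to real UUIDs."""
    short_to_uuid: dict[str, str] = {}
    compact_records = []
    for i, record in enumerate(records, start=1):
        short_id = f"id{i}"
        short_to_uuid[short_id] = record["id"]
        compact_records.append({**record, "id": short_id})
    return compact_records, short_to_uuid

# Example: inject compact_records into the prompt, and when the LLM asks to
# update "id2", look up short_to_uuid["id2"] to get the real database key.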

Adding time

LLMs do not know what the current time is, so user queries that require time-based logic will not work. To solve this, you can add the current time at the end of the system prompt:

1"""
2...rest of the prompt
3
4<Current time>
5User timezone: 18:01:45, Wednesday, 2nd July 2025
6UTC timezone: 12:01:45, Wednesday, 2nd July 2025
7</Current time>
8"""

It is important to provide both the user's timezone and the UTC timezone, because other context data in your prompt may be in UTC, while the user's query will refer to time in their local timezone.
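A small helper along these lines (user_tz and BASE_PROMPT are placeholders for wherever your application stores them):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def current_time_block(user_tz: str) -> str:
    fmt = "%H:%M:%S, %A, %d %B %Y"
    now_utc = datetime.now(timezone.utc)
    now_local = now_utc.astimezone(ZoneInfo(user_tz))
    return (
        "<Current time>\n"
        f"User timezone: {now_local.strftime(fmt)}\n"
        f"UTC timezone: {now_utc.strftime(fmt)}\n"
        "</Current time>"
    )

# Appended at prompt-build time so every request carries a fresh timestamp.
system_prompt = BASE_PROMPT + "\n\n" + current_time_block(user_tz="Europe/Madrid")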

How to think about orchestration and delegation of tasks

Going from broad to narrow

Start with a planner agent that receives the user's goal and explodes it into a task graph.

All agents read from and write to a shared context store that tracks the original user query, the task graph (with rationales), and the results of completed tasks.

Because the context can balloon, apply size-control tactics early: summarise completed task outputs, use short deterministic IDs instead of UUIDs, and cap the length of rationales and intermediate notes.

Here is a rough structure of a system prompt that can be used for the planner agent. You will most likely need to adapt it to your application's data, but it is a good starting point:

1"""
2You are **PlannerAgent**.
3Your responsibility is to turn a single user goal into a DAG (directed acyclic graph) of executable tasks.
4
5<Definitions>
6• *Task* - The smallest unit of work that can be handled by one executor.  
7• *Executor* - Either a deterministic tool (e.g. get_weather), a specialist agent
8  (e.g. TranslatorFR), or "self" if you will do the step directly.
9</Definitions>
10
11<Objective>
121. Decompose the user's goal into at most **7** tasks.  
132. Add dependency edges so that independent tasks can run in parallel.  
143. For each task provide a one-sentence **rationale** explaining:  
15   - why the task is needed  
16   - why that executor is appropriate.  
174. **Specify the exact arguments** (if any) the executor should receive.  
185. Enforce a **maximum recursion depth of 3** when splitting tasks.  
196. Use short, deterministic IDs (`t1`, `t2`, …) — **never** UUIDs.
20</Objective>
21
22<Available executors>
23{tools_and_agents_json}
24</Available executors>
25
26<Shared context schema>
27You can read (but not delete) these top-level keys:
28{
29  "user_query": "...",
30  "task_graph": { /* previous runs if any */ },
31  "results": { /* completed task outputs */ }
32}
33</Shared context schema>
34
35<Output format>
36Return one raw JSON object with this exact shape:
37
38{
39  "tasks": [
40    {
41      "id": "t1",
42      "description": "...",
43      "executor": "<tool | agent | self>",
44      "args": {                       // literal values or references
45        "city": "Paris",
46        "days": 3
47      },
48      "rationale": "...",
49      "depends_on": []
50    },
51    {
52      "id": "t2",
53      "description": "...",
54      "executor": "<tool | agent | self>",
55      "args": {                       // pull 'itinerary' from t1's output
56        "text": "<<t1.itinerary>>"
57      },
58      "rationale": "...",
59      "depends_on": ["t1"]
60    }
61    // … up to t7
62  ],
63  "graph_depth": <int>,
64  "estimated_parallel_groups": <int>
65}
66
67Reference rules:
68- Any `args` value wrapped in `<< >>` is a reference to an earlier task's output.  
69- The planner must add that earlier task's ID to `depends_on`.  
70- If the referenced task returns a single primitive, use `<<task_id.result>>`.  
71- Nested references (t3 → t2 → t1) are allowed but depth ≤ 3 overall.
72</Output format>
73
74<Style rules>
75- Keep each rationale ≤ 25 tokens.  
76- Output *raw JSON* (no markdown).  
77- Be deterministic: given the same goal and executors, always produce the same plan.
78
79<Current time>
80User timezone: {user_local_time}
81UTC timezone: {utc_time}
82</Current time>
83"""

Reflection agent

After each major step, or just before returning to the user, a reflection agent reviews the current state and judges whether the draft output actually satisfies the user's goal.

It receives only three inputs:

{
    "user_query": "...",
    "task_graph": "...",        // with rationales
    "draft_output": "..."
}

and returns a critique object:

{
    "pass": false,
    "issues": [
        "Breakfast calories missing"
    ],
    "suggestions": [
        "Call getNutritionFacts for breakfast"
    ]
}

Coupling a planner (for construction) with a reflection agent (for critique) yields a modular, self-correcting and cost-aware multi-agent framework.
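As a rough sketch of how the two fit together (planner_agent, execute_task_graph and reflection_agent are hypothetical functions wrapping the prompts above):

MAX_REVISIONS = 2  # cap reflection loops so cost stays bounded

def run_with_reflection(user_query: str) -> str:
    plan = planner_agent(user_query)           # returns the task-graph JSON
    draft = execute_task_graph(plan)           # runs executors and collects results
    for _ in range(MAX_REVISIONS):
        critique = reflection_agent({
            "user_query": user_query,
            "task_graph": plan,
            "draft_output": draft,
        })
        if critique["pass"]:
            break
        # Feed the issues and suggestions back so the planner can revise the graph.
        plan = planner_agent(user_query, feedback=critique)
        draft = execute_task_graph(plan)
    return draft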

Choosing the right model

Here is a table that can help you think about model selection:

| Decision lever | When to choose cheaper / smaller | When to choose larger / multimodal |
| --- | --- | --- |
| Prompt length | ≤ 4k tokens context | > 8k tokens or heavy cites |
| Task complexity | Pattern-matching, CRUD-style tool glue | Open-ended reasoning, novel synthesis |
| Latency target | < 300 ms end-to-end | User is willing to wait > 1 s |
| Cost ceiling | Penny-sensitive use cases, high queries per second | Low-throughput, high-value workflows |

Practical playbook: start with the cheapest model that passes your evals, escalate to a larger model only for the tasks that fail, and re-run the comparison periodically as providers release cheaper tiers.

Batching vs real-time

LLM providers offer batch APIs that reduce costs by 50% with identical quality, but come with trade-offs:

Batch processing: requests are queued and results are returned asynchronously, typically within a 24-hour window, so you trade latency for the discount.

Ideal for non-time-critical tasks: bulk classification or tagging, nightly evaluation runs, generating summaries for large document sets, and other offline pipelines.

Implementation resources: see your provider's batch API documentation for the exact request format.
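As a rough sketch, submitting a job through the OpenAI Batch API looks like this (the JSONL request format follows OpenAI's documentation; the file name and custom_id are illustrative):

from openai import OpenAI

client = OpenAI()

# Each line of the JSONL file is one self-contained chat-completions request.
with open("requests.jsonl", "w") as f:
    f.write(
        '{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", '
        '"body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}}\n'
    )

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive asynchronously at the discounted rate
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until completed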

Rate limits

OpenAI and most providers enforce per-minute caps on requests, tokens, and sometimes concurrent generations.

| Limit type | Typical default | Mitigation |
| --- | --- | --- |
| RPM (req / min) | 3k-20k | HTTP 429 back-off |
| TPM (tokens / min) | 100k-300k | Summarise context, HTTP 429 back-off |

Mitigation: For queries below TPM limits, implement HTTP 429 exponential back-off. For queries exceeding TPM, reduce context size through summarization techniques.
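A minimal back-off sketch (retry counts and sleep times are arbitrary starting points):

import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_backoff(messages, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        except RateLimitError:
            # Exponential back-off with jitter: ~1s, 2s, 4s, ... plus random noise.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Rate limit retries exhausted")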

Note: The actual rate limits vary by provider and usage tier. For example, when you first sign up for OpenAI, you may have a TPM limit of 30k, but once you spend a certain amount on the platform, this value is increased to more than 100k.

Observability tools, debugging and evals

Here are some recommended tools for building multi-agent systems, including AgentGraph and LLMBatching (see the disclosure below).

Note: While many other excellent tools exist, these recommendations are based on personal experience. Disclosure: I am the creator of AgentGraph and LLMBatching.