Introduction

Welcome, future Applied AI Engineer! By now, you’ve built some incredible agentic AI systems, watched them reason, use tools, and tackle complex tasks. But how do you know if your agent is truly performing well? How do you diagnose problems when it misbehaves? This is where the crucial practices of evaluation, observability, and debugging come into play.

In this chapter, we’re diving deep into the art and science of understanding your AI agents. We’ll learn how to measure their effectiveness, monitor their behavior in real-time, and systematically troubleshoot issues. Think of it as giving your agent a health check-up, a set of X-ray goggles, and a sophisticated diagnostic kit. Without these skills, deploying reliable and robust AI agents in production would be like flying blind!

We’ll build upon your knowledge of agent orchestration, tool use, and multi-agent systems from previous chapters. Get ready to equip yourself with the tools and mindset to confidently build, deploy, and maintain high-performing agentic AI applications.

Core Concepts: Seeing Inside Your Agent’s Mind

Building an AI agent is one thing; understanding why it makes certain decisions or where it fails is another. This section lays the foundation for truly mastering your agents.

Why Evaluate AI Agents?

Imagine you’ve built an agent that helps users book flights. How do you know it’s better than the old manual system? Or even better than a previous version of your agent? Evaluation provides the answers.

Evaluation is the process of systematically assessing an agent’s performance, reliability, and safety. Unlike traditional software, where a function either works or doesn’t, AI agents operate in a probabilistic world. They might hallucinate, misuse tools, or get stuck in loops. We need robust methods to catch these issues.

Key Reasons for Evaluation:

  • Performance Measurement: Quantify how well the agent achieves its goals (e.g., success rate, accuracy, task completion).
  • Reliability & Robustness: Ensure the agent performs consistently across various inputs and edge cases.
  • Safety & Alignment: Verify the agent avoids harmful outputs, adheres to ethical guidelines, and stays within its intended scope.
  • Regression Testing: Prevent new changes from degrading existing performance.
  • Cost & Latency Optimization: Evaluate the efficiency of the agent’s operations.
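Offline evaluation can start very small. The sketch below is a minimal harness; `run_agent` is a hypothetical stub standing in for a real `agent_executor.invoke` call, and the "correctness" check is a simple containment test on the expected answer:

```python
# Minimal offline-evaluation sketch. `run_agent` is a stand-in for your
# real agent call (e.g. agent_executor.invoke); swap it in as needed.

def run_agent(query: str) -> str:
    # Hypothetical agent stub: a real version would invoke your AgentExecutor.
    canned = {"capital of France?": "Paris", "2 + 2?": "4"}
    return canned.get(query, "I don't know")

def evaluate(test_cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the expected answer
    appears in the agent's response (a simple containment check)."""
    passed = 0
    for query, expected in test_cases:
        answer = run_agent(query)
        if expected.lower() in answer.lower():
            passed += 1
    return passed / len(test_cases)

cases = [("capital of France?", "Paris"), ("2 + 2?", "4"), ("capital of Peru?", "Lima")]
print(f"Success rate: {evaluate(cases):.2f}")  # 2 of 3 cases pass
```

Real benchmarks replace the containment check with task-specific graders (exact match, LLM-as-judge, tool-call correctness), but the shape of the loop stays the same.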

Types of Evaluation

Evaluating AI agents often involves a blend of methods.

1. Offline Evaluation

This happens before deploying the agent to users, using predefined datasets.

  • Benchmark Datasets: Curated sets of inputs with expected correct outputs. You run your agent against these and compare its responses.
  • Synthetic Data Generation: Creating diverse test cases programmatically to cover a wide range of scenarios, especially edge cases.
  • RAG Metrics (for RAG-enabled agents):
    • Context Relevance: Is the retrieved information actually relevant to the query?
    • Faithfulness: Does the agent’s answer use only the retrieved context, avoiding hallucinations?
    • Answer Relevance: Is the final answer relevant and helpful to the user’s original query?
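To make one of these metrics concrete, here is a deliberately crude lexical proxy for context relevance: the fraction of query terms that appear in the retrieved context. Production evaluations typically use an LLM-as-judge or a framework such as RAGAS instead; this sketch only illustrates the idea.

```python
# Crude lexical proxy for "context relevance": what fraction of the query's
# terms show up in the retrieved context? Real setups use an LLM-as-judge.

def context_relevance(query: str, context: str) -> float:
    query_terms = {w.lower().strip("?.,") for w in query.split()}
    context_terms = {w.lower().strip("?.,") for w in context.split()}
    if not query_terms:
        return 0.0
    return len(query_terms & context_terms) / len(query_terms)

score = context_relevance(
    "What is the weather in London?",
    "It's cloudy with a temperature of 10°C in London.",
)
print(f"{score:.2f}")
```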

2. Online Evaluation

This happens after deployment, with real users interacting with the agent.

  • A/B Testing: Deploying multiple versions of an agent (e.g., A and B) to different user groups and comparing their performance based on user interactions and predefined metrics.
  • User Feedback: Directly collecting ratings, comments, or explicit “thumbs up/down” from users.
  • Proxy Metrics: Indirect indicators of success, such as time spent on task, conversion rates, or error rates.
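A/B testing needs stable assignment: the same user should always land in the same variant. A common sketch hashes the user id deterministically (md5 here, because Python's built-in `hash()` is salted per process and would reshuffle users on every restart):

```python
# Deterministic A/B bucketing: hash each user id so assignment is stable
# across processes and restarts.
import hashlib

def assign_variant(user_id: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

for uid in ("user-1", "user-2", "user-3"):
    print(uid, "->", assign_variant(uid))
```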

The Power of Observability for AI Agents

If evaluation tells you how well your agent is doing, observability tells you what it’s doing at every step. It’s about gaining deep insight into the internal workings of your agentic system.

Why is this so critical for AI agents? They are often complex, non-deterministic, and involve multiple interacting components (LLMs, tools, memory, other agents). Without observability, debugging becomes guesswork.

Three Pillars of Observability:

  1. Logs: Detailed, timestamped records of events that happen within your agent. This includes tool calls, LLM inputs/outputs, errors, and state changes.
  2. Metrics: Numerical measurements collected over time, like latency of LLM calls, token usage, tool success rates, or agent step counts. These help you spot trends and performance issues.
  3. Traces: A complete, end-to-end view of a single request or agent execution, showing the sequence of operations, their dependencies, and their duration. This is invaluable for understanding the flow of complex agentic workflows.
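To make the three pillars concrete, here is a toy, standard-library-only sketch: structured log lines, a counter-style metric, and a trace id threading one request together. A real system would delegate all of this to a platform like OpenTelemetry or LangSmith.

```python
# Toy illustration of the three pillars: logs, metrics, and a trace id.
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
metrics = Counter()  # metrics pillar: e.g. how often the pipeline runs

def handle_request(query: str) -> str:
    trace_id = uuid.uuid4().hex[:8]          # trace pillar: ties this run's events together
    logging.info("[%s] received query: %s", trace_id, query)
    start = time.perf_counter()
    metrics["tool_calls"] += 1
    result = f"echo: {query}"                # stand-in for an LLM/tool pipeline
    logging.info("[%s] finished in %.4fs", trace_id, time.perf_counter() - start)
    return result

handle_request("What's the weather in London?")
print(metrics["tool_calls"])
```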

Common Failure Modes & Debugging Strategies

Even the best agents can stumble. Understanding common failure modes and having a systematic approach to debugging is essential.

Common Failure Modes:

  • Hallucinations: The agent generates factually incorrect information, often confidently.
  • Tool Misuse/Non-use: The agent calls the wrong tool, provides incorrect arguments, or fails to use a necessary tool.
  • Infinite Loops: The agent gets stuck in a repetitive cycle of thoughts or actions.
  • Context Window Overflow: The agent tries to process too much information, exceeding the LLM’s context limit, leading to truncated responses or errors.
  • Prompt Injection/Jailbreaking: Malicious inputs cause the agent to deviate from its intended behavior or reveal sensitive information.
  • Poor Planning/Reasoning: The agent’s internal thought process leads to suboptimal or illogical steps.
  • Race Conditions (in multi-agent systems): Agents interfere with each other’s actions or state due to uncoordinated access.

Debugging Strategies:

  1. Inspect Logs: Look for error messages, unexpected tool calls, or unusual LLM inputs/outputs.
  2. Analyze Traces: Follow the entire execution path. Where did the agent deviate from the expected flow? Which LLM call led to a bad decision? Which tool returned an unexpected result?
  3. Review Prompts: Are the system prompts clear, concise, and unambiguous? Are tool descriptions accurate? Could the prompt be leading the agent astray?
  4. Examine Tool Outputs: Did the tools return the expected data? Were there external API errors?
  5. Check Memory & State: Is the agent retaining the correct information across turns? Is its internal state consistent?
  6. Simplify and Isolate: Break down complex problems. Test individual tools or LLM calls in isolation to pinpoint the source of the error.
  7. Iterative Prompt Engineering: Adjust prompts, add more examples, or refine instructions based on observed failures.
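Strategy 6 ("Simplify and Isolate") is worth showing in code. Below, a weather lookup is written as a plain function so its behavior can be pinned down without an LLM in the loop; with LangChain's `@tool` decorator, the equivalent direct call is `get_current_weather.invoke({"location": "London"})`.

```python
# "Simplify and isolate" in practice: exercise the tool's logic directly,
# outside the agent loop, so failures can't hide behind LLM nondeterminism.

def get_current_weather(location: str) -> str:
    if "London" in location:
        return "It's cloudy with a temperature of 10°C in London."
    elif "New York" in location:
        return "It's sunny with a temperature of 25°C in New York."
    return "Weather data not available for this location."

# Pin down expected behaviour for known and unknown locations:
assert "cloudy" in get_current_weather("London, UK")
assert "not available" in get_current_weather("Paris")
print("tool behaves as expected in isolation")
```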

Let’s visualize the observability flow for an agent:

flowchart TD
    A[User Query] --> B{Agent Orchestration}
    B --> C[LLM Call 1]
    C --> D[Tool Call]
    D --> E[External Service]
    E --> F[Tool Result]
    F --> G[LLM Call 2]
    G --> H[Agent Response]

    subgraph Observability Layer
        L[Logs]
        M[Metrics]
        T[Traces]
    end

    B --> T
    C --> L
    C --> T
    D --> L
    D --> M
    D --> T
    E --> L
    G --> L
    G --> T

This diagram illustrates how logs, metrics, and traces are collected at various points throughout an agent’s execution, providing a comprehensive view of its behavior.

Step-by-Step Implementation: Adding Observability to Your Agent

Let’s put these concepts into practice. We’ll take a simple agent and integrate basic logging and then introduce a dedicated tracing solution like LangSmith.

We’ll assume you have a basic AgentExecutor set up from a previous chapter. If not, here’s a minimal example you can use:

# First, ensure you have the necessary libraries installed:
# pip install langchain==0.1.13 langchain-openai==0.1.13 langsmith==0.1.17 python-dotenv==1.0.1
# Note: versions are pinned for reproducibility; check the official docs for the latest releases.

Let’s create a foundational agent:

import os
from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Load environment variables from .env file (for API keys)
load_dotenv()

# --- Mock Tool ---
@tool
def get_current_weather(location: str) -> str:
    """Gets the current weather for a specified location.
    Use this tool to get real-time weather information."""
    print(f"\n--- Tool Call: get_current_weather for {location} ---")
    if "London" in location:
        return "It's cloudy with a temperature of 10°C in London."
    elif "New York" in location:
        return "It's sunny with a temperature of 25°C in New York."
    else:
        return "Weather data not available for this location."

tools = [get_current_weather]

# --- LLM Setup ---
# Using a pinned GPT-4o snapshot for reproducibility.
# Ensure OPENAI_API_KEY is set in your .env file.
llm = ChatOpenAI(model="gpt-4o-2024-05-13", temperature=0)

# --- Agent Prompt ---
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant that can answer questions using tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# --- Create Agent ---
# Note: this chat-style prompt (with an agent_scratchpad placeholder) matches
# the OpenAI tools agent. create_react_agent would instead require a text
# ReAct prompt containing {tools} and {tool_names} variables.
agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print("Agent setup complete. Ready to invoke.")

Step 1: Basic Logging with verbose=True

LangChain’s AgentExecutor comes with a built-in verbose flag that provides excellent immediate feedback. This is your first line of defense for understanding agent behavior.

Action: Run the agent with verbose=True (which we already set).

# In the same script, add this at the end:
if __name__ == "__main__":
    print("\n--- Invoking agent with verbose logging ---")
    result = agent_executor.invoke({"input": "What's the weather like in London?"})
    print("\n--- Agent Response ---")
    print(result["output"])

    print("\n--- Invoking agent for an unknown location ---")
    result_unknown = agent_executor.invoke({"input": "What's the weather like in Paris?"})
    print("\n--- Agent Response (Unknown Location) ---")
    print(result_unknown["output"])

What to Observe/Learn: When you run this, you’ll see a detailed log of the agent’s “thoughts.” This includes:

  • The tool the agent decided to invoke.
  • The arguments passed to that tool.
  • The tool’s output (the observation).
  • The LLM’s final reasoning and answer.

This verbose output is a form of logging and tracing, giving you a textual trace of the agent’s decision-making process. It’s incredibly useful for initial debugging! Notice how the print statement inside our get_current_weather tool also adds to the log, demonstrating custom logging within tools.

Step 2: Advanced Tracing with LangSmith

While verbose=True is great for local development, for production-grade observability, you need a dedicated tracing platform. LangSmith, developed by LangChain, is a powerful tool specifically designed for tracing, monitoring, and evaluating LLM applications and agents.

Why LangSmith?

  • Visual Traces: See the entire execution flow of your agent, including all LLM calls, tool uses, and intermediate steps, in a friendly UI.
  • Detailed Metrics: Track token usage, latency, and cost for each component.
  • Debugging: Easily pinpoint where an agent failed or deviated.
  • Evaluation: Create datasets and run evaluations to systematically test your agent.

Action: Set up LangSmith.

  1. Get a LangSmith API Key: Sign up at https://smith.langchain.com and create an API key from your account settings.
  2. Add Environment Variables: Add these to your .env file:
    LANGCHAIN_TRACING_V2=true
    LANGCHAIN_API_KEY="your_langsmith_api_key_here"
    LANGCHAIN_PROJECT="My First Agent Chapter 10" # Give your project a descriptive name
    
    Replace "your_langsmith_api_key_here" with your actual key.
  3. Run Your Agent Again:
    • Ensure your dotenv is loaded at the top of your script.
    • No code changes are needed in your agent’s invoke call! LangChain automatically integrates with LangSmith when these environment variables are set.
# No code changes needed for the agent invocation part.
# Just ensure your .env file is updated and loaded.

# Example run (from the __main__ block earlier):
# result = agent_executor.invoke({"input": "What's the weather like in London?"})
# result_unknown = agent_executor.invoke({"input": "What's the weather like in Paris?"})

What to Observe/Learn:

  1. After running the agent, navigate to your LangSmith project dashboard (e.g., https://smith.langchain.com/projects/your-project-id).
  2. You will see “Runs” corresponding to each agent_executor.invoke call.
  3. Click on a run to see a beautiful, interactive trace. You’ll see:
    • The overall AgentExecutor run.
    • The nested LLM calls.
    • Tool calls.
    • The exact prompts sent to the LLM and the responses received.
    • Latency and token usage for each step.

This visual trace is incredibly powerful for understanding the agent’s flow and debugging. It’s like a debugger for your agent’s thoughts!

Step 3: Capturing Custom Metrics

While LangSmith captures many metrics automatically, you might want to track application-specific metrics. For example, how long a specific tool takes, or how many times a certain decision path is taken.

You can integrate with general-purpose monitoring solutions like Prometheus, Grafana, or simply log to a file and analyze later. For this example, we’ll just add a simple timer to our tool.

Action: Modify the get_current_weather tool to include timing.

import time # Add this import at the top

@tool
def get_current_weather(location: str) -> str:
    """Gets the current weather for a specified location.
    Use this tool to get real-time weather information."""
    start_time = time.time() # Start timer
    print(f"\n--- Tool Call: get_current_weather for {location} ---")
    if "London" in location:
        result = "It's cloudy with a temperature of 10°C in London."
    elif "New York" in location:
        result = "It's sunny with a temperature of 25°C in New York."
    else:
        result = "Weather data not available for this location."
    
    end_time = time.time() # End timer
    duration = end_time - start_time
    print(f"--- Tool Execution Time: {duration:.2f} seconds ---")
    # In a real application, you'd send this 'duration' to a metrics system.
    return result

What to Observe/Learn: When you run the agent again, you’ll see the tool execution time printed in the console. This demonstrates how you can capture custom metrics directly within your code. For production, you’d integrate with a monitoring system (e.g., sending these durations to a time-series database).
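Rather than timing each tool by hand, you can factor the timer into a decorator. In this sketch, `record_metric` is a hypothetical placeholder for whatever sink you actually use (a StatsD client, the Prometheus client library, or just a structured log line):

```python
# Reusable timing decorator: every decorated tool reports its duration
# without duplicating timer code in each function body.
import time
from functools import wraps

def record_metric(name: str, value: float) -> None:
    # Hypothetical sink: replace with your real metrics backend.
    print(f"metric {name}={value:.4f}s")

def timed(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record duration even if the tool raises an exception.
            record_metric(f"{func.__name__}.duration", time.perf_counter() - start)
    return wrapper

@timed
def get_current_weather(location: str) -> str:
    return f"Weather for {location}: sunny"

print(get_current_weather("London"))
```

Using `functools.wraps` preserves the function’s name and docstring, which matters because LangChain’s `@tool` decorator derives the tool description from them.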

Mini-Challenge: Debugging a Deliberate Failure

Now it’s your turn to play detective! We’ll introduce a subtle issue into our agent.

Challenge: Modify the agent’s system prompt to intentionally confuse it about tool usage. For example, make the system prompt emphasize “only answer questions about historical events,” but keep the get_current_weather tool available. Then, ask the agent about the weather.

  1. Modify the prompt variable:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an AI assistant that specializes in **historical events and facts**. You should only answer questions related to history. Do not use tools for anything else."),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])
    
  2. Run the agent with the original weather query:
    if __name__ == "__main__":
        print("\n--- Invoking agent with verbose logging (deliberate failure) ---")
        result = agent_executor.invoke({"input": "What's the weather like in London?"})
        print("\n--- Agent Response ---")
        print(result["output"])
    
  3. Analyze the verbose output and the LangSmith trace.

Hint: Pay close attention to the agent’s initial thought process and whether it decides to use the tool despite the prompt’s new instruction. What happens if the agent decides not to use the tool but still tries to answer?

What to Observe/Learn: You should see the agent struggling. It might refuse to answer, try to answer without the tool, or even hallucinate an answer, because the new system prompt conflicts with the tool’s availability and the user’s intent. The LangSmith trace will clearly show the LLM’s thought process and why it chose (or didn’t choose) to use the tool. This highlights the importance of prompt clarity and alignment with available tools.

Common Pitfalls & Troubleshooting

Let’s review common issues you’ll encounter and how to debug them using the techniques we just learned.

Pitfall 1: Agent Gets Stuck in an Infinite Loop

Scenario: Your agent repeatedly calls the same tool with slightly different arguments, or cycles through “thoughts” without making progress.

How to Spot:

  • Verbose Logs: You’ll see repetitive tool invocations with near-identical inputs and observations.
  • LangSmith Trace: The trace will show a long, repeating sequence of steps. You’ll easily identify the loop.

Troubleshooting:

  • Review Prompt: Is the prompt clear about when to stop? Does it provide explicit instructions on how to reach a final answer? Add instructions like “Once you have found the answer, respond directly without further tool use.”
  • Tool Output: Is a tool returning inconsistent or ambiguous results, leading the agent to re-query?
  • Memory Management: Is the agent’s memory growing uncontrollably, leading to redundant queries or context window issues?
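One practical guard against runaway loops is a hard step budget. LangChain’s AgentExecutor exposes a built-in cap via its max_iterations and max_execution_time arguments; the repeated-action check in the sketch below is our own addition, shown as a generic loop in plain Python:

```python
# Generic loop guard: cap the number of agent steps and bail out if the
# exact same (tool, input) action repeats.

def run_with_guard(step_fn, max_steps: int = 5):
    """step_fn() returns either ("final", answer) or ("tool", name, args)."""
    seen = set()
    for _ in range(max_steps):
        step = step_fn()
        if step[0] == "final":
            return step[1]
        if step in seen:                      # same tool + args as before: a loop
            return "Stopped: repeated action detected."
        seen.add(step)
    return "Stopped: step budget exhausted."

# Simulate an agent stuck calling the same tool forever:
stuck = lambda: ("tool", "get_current_weather", "London")
print(run_with_guard(stuck))
```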

Pitfall 2: Tool Misuse or Hallucination

Scenario: The agent calls the wrong tool, passes incorrect arguments to a tool, or simply makes up facts instead of using an available tool.

How to Spot:

  • Verbose Logs: Look for unexpected tool choices or arguments. If the agent hallucinates, the intermediate steps won’t show a tool call, and the answer will be incorrect.
  • LangSmith Trace: The trace will visually confirm if the wrong tool was called, or if the LLM generated an answer directly without tool interaction when it should have.

Troubleshooting:

  • Refine Tool Descriptions: Make tool descriptions very specific, including clear examples of expected inputs and outputs. The LLM relies heavily on these descriptions.
  • Prompt Engineering: Reinforce the agent’s reliance on tools for specific types of information. “Always use the get_current_weather tool for weather inquiries.”
  • Function Calling: Ensure your LLM model supports robust function calling and that you’re using it effectively. This makes tool use more structured and less prone to hallucination.

Pitfall 3: Context Window Limits Hit

Scenario: Your agent’s conversations or tool outputs become too long, exceeding the LLM’s maximum input token limit, leading to errors or truncated responses.

How to Spot:

  • Logs/Errors: You might see explicit context_length_exceeded (or similar) errors from the LLM API.
  • LangSmith Trace: LangSmith shows token counts for each LLM call. You can easily spot runs where token usage is consistently high or hits a limit.

Troubleshooting:

  • Summarization: Implement a summarization step for long tool outputs or conversational history before feeding it back to the LLM.
  • Retrieval-Augmented Generation (RAG): Instead of putting all information in the context, retrieve only the most relevant chunks.
  • Memory Strategies: Use more sophisticated memory systems that summarize or prioritize information (e.g., “summarize memory,” “entity memory”).
  • Chunking: Break down large documents or tool results into smaller, manageable chunks.
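A minimal sketch of the trimming idea, assuming a rough 4-characters-per-token heuristic; real code should count tokens with the model’s actual tokenizer (e.g. tiktoken) and often summarize dropped turns rather than discard them:

```python
# Naive context trimming: always keep the system message, then keep only
# the most recent turns that fit a rough token budget.

def trim_history(messages: list[str], budget_tokens: int = 50) -> list[str]:
    system, turns = messages[0], messages[1:]
    kept: list[str] = []
    used = len(system) // 4                   # ~4 chars per token heuristic
    for msg in reversed(turns):               # walk newest-first
        cost = len(msg) // 4
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))    # restore chronological order

history = ["SYSTEM: be helpful"] + [f"turn {i}: " + "x" * 120 for i in range(10)]
trimmed = trim_history(history)
print(len(trimmed), "messages kept")          # prints: 2 messages kept
```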

By actively using logs, traces, and metrics, you’ll gain invaluable insights into your agent’s behavior, allowing you to quickly identify and resolve these common issues. This iterative process of building, observing, and refining is the hallmark of an effective Applied AI Engineer.

Summary

Phew! We’ve covered a lot in this chapter, transforming you from an agent builder into an agent diagnostician.

Here are the key takeaways:

  • Evaluation is Essential: It’s how we measure an agent’s performance, reliability, and safety, using both offline (benchmarks, RAG metrics) and online (A/B testing, user feedback) methods.
  • Observability is Your X-Ray Vision: Comprising logs, metrics, and traces, it gives you deep insight into your agent’s internal workings and decision-making process.
  • LangSmith is a Powerful Tool: For production-grade LLM application observability, LangSmith provides visual traces, detailed metrics, and debugging capabilities.
  • Debugging is a Skill: Understanding common failure modes like infinite loops, tool misuse, and context window overflows, combined with systematic inspection of logs and traces, enables effective troubleshooting.
  • Iterate and Refine: The process of building, observing, evaluating, and debugging is continuous for robust AI agent development.

By mastering evaluation, observability, and debugging, you’re not just building AI agents; you’re building reliable AI agents that can confidently be deployed and maintained in real-world applications.

What’s Next?

With your agents now under your diagnostic control, the next step is to make them efficient and cost-effective. In Chapter 11: Cost & Latency Optimization, we’ll explore strategies to make your AI agents run faster and cheaper, a critical aspect for production deployments. Get ready to optimize!
