How to Debug and Monitor AI Agents with LangSmith

If you're building AI agents that make multiple LLM calls, you've hit the black box problem: you can't see why an agent failed on step three, which prompt burned through $47 in tokens overnight, or what changed between the version that worked yesterday and the one failing today. LangSmith solves this by automatically logging every LLM call with full input/output traces, tracking token costs per request, versioning your prompts, and running automated evaluations before production. This guide walks you through setting up LangSmith tracing, monitoring costs in real time, and catching errors before users see them.

Why AI Agents Are Black Boxes (And Why That's a Problem)

Traditional software fails predictably. You get a stack trace, line numbers, and error messages. AI agents fail opaquely: a multi-step workflow returns nonsense, and you don't know if the retrieval step grabbed wrong documents, the LLM misinterpreted context, or your prompt template had a typo.

The debugging challenges multiply fast. Your agent might succeed 90% of the time but fail on edge cases you can't reproduce. You can't inspect intermediate reasoning steps without manually adding print statements everywhere. Token costs spike from $12 to $340 per day, and you've got no idea which component is responsible.

According to recent developer surveys, roughly 68% of teams building production AI agents report spending more time debugging than writing new features. That ratio flips when you add proper observability.

What LangSmith Is and How It Provides AI Agent Observability

LangSmith is an observability platform built specifically for LLM applications and AI agents. It automatically captures every interaction with language models: the full prompt sent, the response received, token counts, latency, and any errors thrown. Think of it as distributed tracing for AI systems.

The platform integrates with LangChain, but you can also use it standalone with direct OpenAI, Anthropic, or other LLM API calls. Every trace becomes a tree structure showing how your agent broke down a task, which tools it called, and where things went wrong.
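To make the tree idea concrete, here is a minimal sketch in plain Python of a trace as nested runs, with a helper that walks the tree to locate the failing step. The Run dataclass and its fields are illustrative assumptions, not LangSmith's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Run:
    """Hypothetical stand-in for one node in a trace tree."""
    name: str
    run_type: str                      # e.g. "chain", "tool", "llm"
    error: Optional[str] = None
    children: List["Run"] = field(default_factory=list)


def first_failure(run: Run, path: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Return the path of run names to the deepest failing run, searching children first."""
    here = path + (run.name,)
    for child in run.children:
        found = first_failure(child, here)
        if found:
            return found
    return here if run.error else None


# A made-up trace: the parent fails because one of its tool calls failed
trace = Run("support_agent", "chain", error="tool failed", children=[
    Run("classify_intent", "llm"),
    Run("search_documents", "tool", error="timeout after 8s"),
    Run("draft_reply", "llm"),
])

print(" > ".join(first_failure(trace)))
```

Searching children before checking the parent means the helper reports the step where the error originated, not the top-level run that merely inherited it.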

LangSmith stores these traces with metadata you define: user IDs, session identifiers, environment tags, whatever you need. You can filter to "show me all production failures from last Tuesday" or "compare token usage between prompt version 2.3 and 2.4." This beats grepping through application logs by a mile.

LangSmith Tutorial for AI Agent Debugging: Step-by-Step Setup

Setting up LangSmith takes about 10 minutes. You'll need a LangSmith account (free tier supports 5,000 traces per month) and your application code that makes LLM calls.

Install the LangSmith SDK

For Python projects, install the client library, plus the openai package that the direct API example in this guide imports:

pip install langsmith langchain openai

Set your environment variables with your API key from the LangSmith dashboard:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-api-key-here"
export LANGCHAIN_PROJECT="your-project-name"

Instrument Your AI Agent Code

If you're using LangChain, tracing happens automatically once the environment variables are set. For direct API calls, wrap the client with wrap_openai so each request is logged with its token usage, and add the traceable decorator to the functions you want to appear as steps in the trace:

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# wrap_openai logs every OpenAI call, including token usage, as a child run
client = wrap_openai(OpenAI())

@traceable(name="summarize_document")
def summarize(document_text):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize the following document concisely."},
            {"role": "user", "content": document_text}
        ]
    )
    return response.choices[0].message.content

Now every call to summarize() appears in your LangSmith dashboard with the exact input text, the model's response, token counts, and execution time. You can click into any trace to see the full prompt and response.

Add Metadata for Better Filtering

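Tag traces with context that helps debugging. Metadata passed to the decorator is the same for every call, so use it for static values such as the environment, and pass per-request values such as the user ID at call time through the langsmith_extra argument:

from langsmith import traceable

@traceable(
    run_type="chain",
    name="customer_support_agent",
    metadata={"environment": "production"}
)
def handle_support_request(user_message):
    # Your agent logic here
    pass

# Per-request metadata is attached when the function is called
handle_support_request(
    "I can't log in",
    langsmith_extra={"metadata": {"user_id": "user_12345"}}
)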

This lets you filter the dashboard by user, environment, or any custom dimension. When a specific customer reports an issue, you can pull up their exact trace history in seconds.

How to Trace LLM Calls and Monitor Token Costs in Real-Time

Token costs sneak up on you. A single poorly designed prompt that includes 8,000 tokens of context when 500 would suffice can blow through your budget in hours if it runs frequently.
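A quick back-of-the-envelope calculation shows how fast the gap grows. The price and traffic figures below are assumptions chosen for illustration, not real pricing; substitute your provider's current rates.

```python
# Hypothetical figures for illustration only; check your provider's current pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # USD, assumed
CALLS_PER_DAY = 20_000             # assumed traffic


def daily_input_cost(tokens_per_call: int) -> float:
    """Daily spend on input tokens for a prompt of the given size."""
    return tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS * CALLS_PER_DAY


bloated = daily_input_cost(8_000)   # prompt stuffed with unneeded context
lean = daily_input_cost(500)        # prompt trimmed to what the task needs
print(f"bloated: ${bloated:,.2f}/day, lean: ${lean:,.2f}/day, ratio: {bloated / lean:.0f}x")
```

Because input cost scales linearly with prompt size, the 16x ratio between the two prompts holds whatever the actual price per token is.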

LangSmith automatically calculates costs based on the model you're using and current pricing. The dashboard shows you cost per trace, aggregated daily costs, and which components are most expensive. You can sort traces by cost to find outliers immediately.

For example, you might discover that 3% of your traces account for 47% of your token spend because they're hitting a fallback path that uses GPT-4 instead of GPT-3.5. That's actionable data you can't get from your OpenAI bill alone.
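If you export per-trace costs, the same kind of concentration check takes a few lines of plain Python. This is a sketch with made-up cost figures; the function name and data are hypothetical.

```python
def spend_share_of_top(costs, top_fraction):
    """Share of total spend that comes from the most expensive `top_fraction` of traces."""
    total = sum(costs)
    if not costs or total == 0:
        return 0.0
    ranked = sorted(costs, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total


# 97 cheap traces plus 3 expensive ones that hit a pricier fallback model (made-up numbers)
trace_costs = [0.01] * 97 + [0.30, 0.28, 0.25]
share = spend_share_of_top(trace_costs, 0.03)
print(f"Top 3% of traces: {share:.0%} of spend")
```

A high share for a small fraction of traces is the signal to open those specific traces and see which code path they took.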

Set Up Cost Alerts

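One way to surface expensive runs is to flag them with LangSmith's feedback API so they are easy to filter and chart. The snippet below scores a run 1.0 when its cost crosses a threshold. It records a marker on the run and does not send a notification by itself, so hook the same check into your own alerting (email, Slack, pager) if you need to be notified. The run ID and cost are placeholder values:

from langsmith import Client

COST_THRESHOLD_USD = 10.0

client = Client()

run_id = "your-run-id"  # ID of the run to flag
cost = 12.50            # cost of that run in USD (placeholder value)

client.create_feedback(
    run_id=run_id,
    key="cost_exceeded",
    score=1.0 if cost > COST_THRESHOLD_USD else 0.0
)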

You can build custom dashboards that track cost per user, per feature, per prompt version. This visibility alone typically reduces token spend by 30 to 40% once teams see where money goes.

Using Prompt Versioning to Track What Works and Prevent Regressions

You tweak a prompt to fix one edge case and accidentally break three others. Without versioning, you're flying blind. LangSmith's prompt management feature stores every version of your prompts with a commit-style history.

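Create a prompt in the LangSmith UI or via API, then reference it in code. This example assumes a prompt named "summarization_v3" with a {document} variable, and uses ChatOpenAI from the langchain-openai package as the model:

from langsmith import Client
from langchain_openai import ChatOpenAI

client = Client()
llm = ChatOpenAI(model="gpt-4")

# Pull the latest version of the prompt (returned as a LangChain prompt template)
prompt = client.pull_prompt("summarization_v3")

# Fill in the template variables, then send the result to the model
document_text = "Text of the document to summarize..."
formatted = prompt.invoke({"document": document_text})
response = llm.invoke(formatted)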

When you update the prompt in the dashboard, you can compare performance metrics between versions. LangSmith shows you side by side: version 2.3 had 12% better accuracy but cost 8% more in tokens than version 2.2. You make the tradeoff decision with data, not guesses.

This becomes critical when you're testing AI models before deploying to production, since prompt changes can have subtle downstream effects.

Setting Up Automated Evaluations to Catch Errors Before Production

Manual testing doesn't scale when your agent handles 10,000 requests per day. You need automated evaluations that run against every prompt change or model update.

LangSmith lets you create datasets of example inputs with expected outputs, then evaluate your agent against them:

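from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Create a dataset once
examples = [
    ("What's our refund policy?", "We offer 30-day refunds..."),
    ("How do I reset my password?", "Click the 'Forgot Password' link..."),
]

dataset = client.create_dataset(dataset_name="support_qa_v1")
client.create_examples(
    inputs=[{"question": q} for q, _ in examples],
    outputs=[{"expected": a} for _, a in examples],
    dataset_id=dataset.id,
)

# Wrap your agent so it returns a dict the evaluator can read.
# your_agent is your own function: it takes a question and returns a string.
def target(inputs):
    return {"response": your_agent(inputs["question"])}

# Define an evaluator
def check_accuracy(run, example):
    prediction = run.outputs["response"]
    expected = example.outputs["expected"]
    return {"key": "accuracy", "score": 1.0 if expected in prediction else 0.0}

# Run evaluation
results = evaluate(
    target,
    data="support_qa_v1",
    evaluators=[check_accuracy]
)

# Average the per-example scores
scores = [r["evaluation_results"]["results"][0].score for r in results]
print(f"Accuracy: {sum(scores) / len(scores):.0%}")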

You can run this in CI/CD pipelines. If accuracy drops below 85%, the deployment fails automatically. This catches regressions before users see them, which is the entire point of having observability.
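The pass/fail gate itself can be a small script. Below is a minimal sketch in plain Python, assuming you have already collected one score per evaluation example; the gate function and the sample scores are hypothetical.

```python
ACCURACY_THRESHOLD = 0.85  # the 85% bar used in this guide


def gate(scores, threshold=ACCURACY_THRESHOLD):
    """Return a process exit code: 0 if mean accuracy meets the threshold, else 1."""
    if not scores:
        print("No evaluation scores found; failing the build.")
        return 1
    accuracy = sum(scores) / len(scores)
    print(f"Accuracy: {accuracy:.1%} (threshold {threshold:.0%})")
    return 0 if accuracy >= threshold else 1


# Hypothetical per-example scores; in practice, collect them from your evaluation run.
exit_code = gate([1.0, 1.0, 0.0, 1.0, 1.0])
# In a CI script, end with `raise SystemExit(exit_code)` so a non-zero code fails the job.
```

Treating an empty score list as a failure is a deliberate choice here: a broken evaluation run should block the deployment, not wave it through.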

Teams using automated evaluations report catching roughly 73% of breaking changes before production, compared to 22% with manual spot-checking alone. That's a huge difference.

Real-World Debugging Scenarios with LangSmith

Here's how LangSmith helps with common debugging situations you'll actually face.

Scenario 1: Agent Loops Forever

Your agent enters an infinite loop, calling the same tool repeatedly. In the LangSmith trace view, you see the tree structure showing 47 consecutive calls to the "search_documents" tool. You click into one and notice the tool is returning empty results, so the agent keeps retrying. The fix: improve the tool's error handling to signal "no results found" explicitly.
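The same pattern can be caught automatically. Here is a minimal sketch that scans an ordered list of tool-call names, such as you might extract from a trace, for long runs of the same tool. The function name and the limit of 5 are assumptions for illustration.

```python
from itertools import groupby


def longest_repeat(tool_calls):
    """Return (tool_name, run_length) for the longest streak of identical consecutive calls."""
    best = (None, 0)
    for name, group in groupby(tool_calls):
        length = sum(1 for _ in group)
        if length > best[1]:
            best = (name, length)
    return best


MAX_CONSECUTIVE_CALLS = 5  # assumed limit; tune for your agent

calls = ["classify_intent"] + ["search_documents"] * 47 + ["draft_reply"]
tool, streak = longest_repeat(calls)
if streak > MAX_CONSECUTIVE_CALLS:
    print(f"Possible loop: {tool} called {streak} times in a row")
```

The same check can run inside the agent loop as a guard that stops execution once the limit is exceeded.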

Scenario 2: Sudden Cost Spike

Your daily spend jumps from $15 to $180 overnight. LangSmith's cost dashboard shows the spike correlates with a specific user's session. You filter traces by that user and see they're uploading 50-page PDFs that get embedded in every prompt. The fix: implement chunking and summarization before sending context to the LLM.
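A minimal chunking sketch, assuming word counts are an acceptable rough proxy for tokens (a real implementation would count tokens with the model's tokenizer). The function name and default sizes are illustrative.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks, each sharing `overlap` words with the previous one."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk reached the end of the text
    return chunks


# A synthetic 450-word document
document = " ".join(f"word{i}" for i in range(450))
chunks = chunk_text(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be summarized or retrieved selectively, so only the relevant pieces reach the prompt and not the whole PDF.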

Scenario 3: Inconsistent Outputs

The same input sometimes works, sometimes fails. Traces reveal that failures correlate with high-latency responses (over 8 seconds). The LLM is timing out mid-generation. The fix: lower the maximum number of output tokens and add retry logic with exponential backoff.
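Retry with exponential backoff can be sketched in a few lines. This version catches the built-in TimeoutError as a stand-in; in real code, catch whichever timeout exception your LLM client raises. The helper name and defaults are assumptions.

```python
import random
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on TimeoutError, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


attempts = {"count": 0}


def flaky_llm_call():
    """Hypothetical stand-in for an LLM request that times out twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "summary text"


delays = []
# Record the delays instead of actually sleeping, to keep the demo fast
result = call_with_backoff(flaky_llm_call, sleep=delays.append)
print(result, len(delays))
```

The small random jitter keeps many clients from retrying in lockstep after a shared outage.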

These scenarios are impossible to debug efficiently without observability. You'd be adding logging statements, redeploying, hoping to catch the issue in the act. Not fun.

LangSmith vs Other AI Monitoring Tools: What to Choose

LangSmith isn't the only option. Here's how it compares to alternatives.

Langfuse: Open-source alternative with similar tracing capabilities. Better if you need full data control and self-hosting. Requires more setup work. Langfuse is strong for teams with strict data residency requirements.

Weights & Biases: Broader ML experiment tracking, not LLM-specific. Better for model training workflows than production agent monitoring. Use W&B if you're also tracking fine-tuning experiments and want one platform for everything.

Helicone: Focuses on cost tracking and caching. Simpler than LangSmith but lacks evaluation frameworks. Good for straightforward API call monitoring when you don't need complex agent tracing.

Arize AI: Enterprise-focused with drift detection and model performance monitoring. Overkill for small teams, valuable when you're monitoring dozens of models at scale with compliance requirements.

LangSmith hits the sweet spot for teams building AI agents: it's purpose-built for LLM workflows, integrates easily, and includes both debugging and evaluation features. Most teams building agentic workflows will find it's the fastest path to production-ready observability.

Best Practices for AI Agent Monitoring and When Observability Becomes Critical

Don't wait until production to add observability. Instrument your agents from day one, even in development. The traces help you understand what your agent is actually doing, which accelerates development.

Tag everything. Add user IDs, session IDs, feature flags, and environment labels to every trace. You'll thank yourself when debugging a customer-reported issue at 11 PM.

Create evaluation datasets incrementally. Every time you fix a bug, add that case to your evaluation set. Over time, you build a regression test suite that prevents old bugs from resurfacing.

Monitor cost per feature, not just total cost. Knowing you spent $240 this month is useless. Knowing that the document summarization feature costs $0.08 per request while Q&A costs $0.02 lets you optimize intelligently.
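If every trace carries a feature tag in its metadata, the per-feature breakdown is a simple group-by. This sketch uses a made-up list of trace records; the field names (metadata, feature, cost_usd) are assumptions, not a LangSmith export format.

```python
from collections import defaultdict


def cost_per_request_by_feature(traces):
    """Average cost per request, grouped by each trace's `feature` metadata tag."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trace in traces:
        feature = trace["metadata"].get("feature", "untagged")
        totals[feature] += trace["cost_usd"]
        counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Made-up trace records for illustration
traces = [
    {"metadata": {"feature": "summarization"}, "cost_usd": 0.09},
    {"metadata": {"feature": "summarization"}, "cost_usd": 0.07},
    {"metadata": {"feature": "qa"}, "cost_usd": 0.02},
    {"metadata": {}, "cost_usd": 0.01},
]
for feature, avg in sorted(cost_per_request_by_feature(traces).items()):
    print(f"{feature}: ${avg:.2f} per request")
```

Grouping untagged traces under their own label makes gaps in your tagging visible instead of silently dropping that spend.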

Observability becomes critical the moment you move beyond prototyping. If you're serving real users, handling production traffic, or spending more than $100 per month on LLM APIs, you need observability yesterday. The cost of not having it (both in debugging time and wasted tokens) exceeds the cost of the tool within weeks.

For teams deciding whether to build or buy AI tooling, observability is almost always a "buy" decision. Building your own tracing infrastructure takes months and distracts from your core product.

Debugging AI Agents Step by Step: A Complete Workflow

Here's your end-to-end workflow for debugging production AI agents with LangSmith.

Step 1: A user reports an issue or you notice anomalous behavior in metrics. Go to the LangSmith dashboard and filter traces by user ID, time range, or error status.

Step 2: Find the relevant trace and expand it to see the full tree. Identify which step in the agent's workflow failed or produced unexpected output.

Step 3: Click into that specific LLM call to see the exact prompt sent and response received. Check if the prompt included the right context or if the model's response was cut off.

Step 4: Reproduce the issue locally using the exact input from the trace. LangSmith lets you export traces as JSON, so you can replay them in your development environment.

Step 5: Fix the issue: update prompt, add error handling, adjust tool logic. Create an evaluation example from this case so it never breaks again.

Step 6: Deploy the fix and monitor the same trace filters to confirm the issue is resolved. Check that your fix didn't increase costs or latency unacceptably.
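The local replay in Step 4 can be sketched as follows. The JSON shape (an object with "inputs" and "outputs" keys) and the support_agent function are assumptions for illustration; adapt them to the structure of the trace data you actually export.

```python
import json


def replay(exported_trace_json, agent_fn):
    """Re-run agent_fn on the inputs recorded in an exported trace and compare outputs."""
    trace = json.loads(exported_trace_json)
    new_output = agent_fn(**trace["inputs"])
    return {
        "inputs": trace["inputs"],
        "recorded_output": trace.get("outputs"),
        "new_output": new_output,
        "matches": new_output == trace.get("outputs"),
    }


def support_agent(user_message):
    """Hypothetical agent under test."""
    return {"response": f"Echo: {user_message}"}


# Stand-in for a trace exported from production
exported = json.dumps({
    "inputs": {"user_message": "How do I reset my password?"},
    "outputs": {"response": "Click the 'Forgot Password' link..."},
})
report = replay(exported, support_agent)
print(report["matches"])
```

An exact-match comparison is only a starting point: LLM outputs vary between runs, so a real replay harness would usually compare with a looser check such as the evaluator used in your evaluation dataset.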

This workflow turns debugging from a multi-hour archaeology expedition into a 15-minute focused investigation. The time savings compound rapidly as your agent complexity grows.

Look, AI agents are powerful but opaque. LangSmith gives you X-ray vision into what's happening at every step, from individual token costs to multi-step reasoning chains. Set it up before you need it, instrument everything, build evaluation datasets as you go. The teams shipping reliable AI agents to production aren't smarter or luckier than you are. They just have better observability, and now you know how to get it too.