AI Agent Observability: Tracing Multi-Step LLM Workflows
AI agent observability in 2026 is the engineering discipline that separates production AI products from prototype demos. Multi-step agents fail in ways that single-call LLM features don’t — tool calls go wrong silently, prompt context drifts across turns, costs spike without warning, and debugging without traces is essentially impossible. This guide is the practical playbook for instrumenting LLM workflows you actually run at scale.
Cost & latency snapshot: a properly instrumented agent typically adds 50–150ms of latency per call for tracing and roughly $0.0001 per call in observability cost. A non-instrumented agent costs zero to run and infinite engineering hours to debug when it breaks at 3am.
What makes agent observability different
Standard application observability (Datadog, Honeycomb, Sentry) is built around requests whose health shows up in status codes, latency, and a few metadata fields. LLM agents don’t fit that model:
- One user turn = many LLM calls. A planning step, multiple tool calls, a synthesis step. The trace is a tree, not a line.
- Prompt is data, not config. The input string includes retrieved context, prior turn history, tool definitions, and the user message — sometimes 50K+ tokens. Logging it matters.
- Failure modes are semantic. The model returns valid JSON that’s factually wrong, or calls the wrong tool with valid arguments. HTTP 200 means nothing.
- Cost depends on every byte. Input tokens, output tokens, cached tokens, and model tier all factor into per-call cost. Generic APM doesn’t capture this.
- Evals are part of monitoring. Did the agent answer correctly? Was the tool call appropriate? These need automated evaluation, not just up/down checks.
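The tree shape described in the first bullet can be sketched with a plain data structure. This is only an illustration, not any vendor's schema: the `Span` class, its fields, and the `kind` labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One node in an agent trace: an LLM call, a tool call, or a retrieval."""
    name: str
    kind: str  # hypothetical labels such as "llm", "tool", "root"
    children: List["Span"] = field(default_factory=list)

    def add(self, child: "Span") -> "Span":
        """Attach a child span and return it so calls can be chained."""
        self.children.append(child)
        return child

    def count(self, kind: str) -> int:
        """Count spans of the given kind in this subtree, including this node."""
        own = 1 if self.kind == kind else 0
        return own + sum(child.count(kind) for child in self.children)


# One user turn: a planning call that triggers two tools, then a synthesis call.
turn = Span("user_turn", "root")
plan = turn.add(Span("plan", "llm"))
plan.add(Span("search_docs", "tool"))
plan.add(Span("fetch_invoice", "tool"))
turn.add(Span("synthesize", "llm"))
```

A flat log would list these five events in order; the tree keeps the fact that both tool calls belong to the planning step.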
Build vs buy decision tree
- Solo founder or 2-person team shipping a single LLM feature: Use a hosted observability tool. Helicone, Langfuse Cloud, or LangSmith’s free tier covers you.
- Team shipping a multi-step agent product: LangSmith or Braintrust if you also need evals; Langfuse if you want self-hosting + open source.
- Enterprise with compliance constraints: Self-hosted Langfuse or Phoenix from Arize. Both ship Helm charts and have BAA-eligible deployment patterns.
- You’re already deep in Datadog / New Relic / Honeycomb: Add OpenTelemetry GenAI semantic conventions on top. Lighter touch, less specialized features.
What you need to instrument
Every LLM call
- Model name + version + provider
- Input tokens (broken out by system, user, tool definitions, cached vs uncached)
- Output tokens (broken out by content vs tool_use vs reasoning)
- Latency (TTFT — time to first token — and total time)
- Cost (computed from token counts × model pricing)
- Full prompt text (with sensitive-data redaction)
- Full response text
- Stop reason (max_tokens, end_turn, tool_use, etc.)
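The "token counts × model pricing" line above is simple arithmetic, sketched here under stated assumptions. The `Pricing` class is hypothetical, and the rates in the example are placeholders chosen to make the math easy to follow, not real prices for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Callers supply the rates; none are built in."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


def call_cost_usd(pricing: Pricing, uncached_input: int, cached_input: int, output: int) -> float:
    """Cost of one call: each token class multiplied by its own rate."""
    total = (
        uncached_input * pricing.input_per_mtok
        + cached_input * pricing.cached_input_per_mtok
        + output * pricing.output_per_mtok
    )
    return total / 1_000_000


# Placeholder rates, for illustration only.
example = Pricing(input_per_mtok=3.0, output_per_mtok=15.0, cached_input_per_mtok=0.3)
cost = call_cost_usd(example, uncached_input=2_000, cached_input=10_000, output=500)
```

Keeping cached and uncached input as separate arguments is the point: a single "input tokens" number hides the difference between a cache hit and a miss.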
Every tool call
- Tool name + version
- Arguments the model passed
- Tool execution time
- Tool response (truncated for very large outputs)
- Tool error if any
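The tool-call fields above can be captured with a small wrapper. This is a minimal sketch that returns a plain dict rather than emitting to a real tracing backend; `traced_tool_call`, `MAX_LOGGED_CHARS`, and the `lookup_order` stand-in tool are all hypothetical.

```python
import time
from typing import Any, Callable, Dict

MAX_LOGGED_CHARS = 2_000  # hypothetical truncation limit for logged output


def traced_tool_call(name: str, fn: Callable[..., Any], /, **arguments: Any) -> Dict[str, Any]:
    """Run a tool and return a record with arguments, timing, output, and error."""
    record: Dict[str, Any] = {"tool": name, "arguments": arguments, "error": None}
    start = time.perf_counter()
    try:
        text = str(fn(**arguments))
        record["response"] = text[:MAX_LOGGED_CHARS]
        record["truncated"] = len(text) > MAX_LOGGED_CHARS
    except Exception as exc:  # keep the failure in the trace instead of losing it
        record["response"] = None
        record["truncated"] = False
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["duration_ms"] = (time.perf_counter() - start) * 1000
    return record


def lookup_order(order_id: str) -> str:
    """Stand-in tool used only for this example."""
    if not order_id:
        raise ValueError("order_id is required")
    return "x" * 5_000


ok = traced_tool_call("lookup_order", lookup_order, order_id="A-17")
failed = traced_tool_call("lookup_order", lookup_order, order_id="")
```

In a real agent loop the error would usually also be passed back to the model or re-raised; the sketch only shows what gets logged.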
Every multi-turn context
- Conversation ID + user ID
- Turn count in this conversation
- Total tokens accumulated across turns
- Memory layer reads (RAG retrievals with their similarity scores)
Business context
- User tier (free, pro, enterprise) — for cost analysis
- Feature surface (chat, email writer, code agent, etc.)
- Environment (production, staging, development)
- Experiment / A/B variant if applicable
The OpenTelemetry GenAI prompt-tracing pattern
2026’s emerging standard is OpenTelemetry’s GenAI semantic conventions. The pattern that works across vendors:
```python
# Sketch: instrumenting a single LLM call (arguments to messages.create are elided)
from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
claude = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# system_prompt_redacted and user_msg_redacted are assumed to be redacted upstream.
with tracer.start_as_current_span("llm.completion") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    span.set_attribute("gen_ai.request.max_tokens", 4096)

    # Add prompt as a span event (better for searchability than attributes)
    span.add_event("gen_ai.content.prompt", {
        "gen_ai.prompt.0.role": "system",
        "gen_ai.prompt.0.content": system_prompt_redacted,
        "gen_ai.prompt.1.role": "user",
        "gen_ai.prompt.1.content": user_msg_redacted,
    })

    response = claude.messages.create(...)

    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
    span.set_attribute("gen_ai.usage.cache_read_tokens",
                       response.usage.cache_read_input_tokens or 0)
    # The GenAI conventions define finish_reasons as a list of strings.
    span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason])

    # Assumes the first content block is text; tool_use blocks have no .text.
    span.add_event("gen_ai.content.completion", {
        "gen_ai.completion.0.role": "assistant",
        "gen_ai.completion.0.content": response.content[0].text,
    })
```
Run this in every code path that touches an LLM and you have queryable, trace-tree data that any OTel-compatible backend can ingest.
The tools landscape in 2026
| Tool | Strengths | Pricing | Best for |
|---|---|---|---|
| LangSmith | Traces + evals + dataset management. LangChain-native. | Free tier; $39+/user/mo | Teams using LangChain heavily |
| Helicone | Lightweight proxy-based logging, easy setup, good cost tracking. | Free 10K logs/mo; $20+/mo | Easy onboarding, simple workflows |
| Langfuse | Open-source, self-hostable, full traces + evals + prompt management. | Free OSS; Cloud $59+/mo | Compliance-sensitive deployments |
| Braintrust | Best eval workflows + dataset comparison. | Free tier; usage-based | Teams running serious eval discipline |
| Arize Phoenix | Open-source, OTel-native, ML-ops heritage. | Free OSS | Teams with existing ML observability |
| OpenLLMetry (Traceloop) | SDK-based instrumentation; sends to any OTel backend. | Free OSS | Teams on Datadog / Honeycomb / Grafana |
Evaluation: monitoring vs offline evals
Observability without evaluation is just logging. The 2026 production pattern has two distinct eval pipelines:
Online monitoring evals
Run on a sample of production traces (1–5% typically). Triggered evals include: was the JSON output schema-valid? Did the model refuse a benign query? Did the tool call complete without error? Use cheap LLM-as-judge calls (GPT-5 Mini or Claude Haiku) for soft-judgement questions, structured assertions for hard ones.
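The sampling step and a structured assertion can be sketched with the standard library. This is one possible approach, not a description of how any listed tool samples: hashing the trace ID makes the decision deterministic, so the same trace is always in or out of the sample. The function names are hypothetical.

```python
import hashlib
import json
from typing import Iterable


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def has_required_keys(raw_output: str, required: Iterable[str]) -> bool:
    """Hard assertion: output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required)
```

Checks like `has_required_keys` are cheap enough to run on every sampled trace; the LLM-as-judge calls are the part worth rate-limiting.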
Offline eval suites
Run against a fixed dataset on every prompt change or model swap. 50–500 examples with known-correct outputs, scored automatically or with human review. The dataset is your regression suite — you don’t ship a prompt change that drops eval score.
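The "don't ship a change that drops eval score" rule can be written as a gate. A minimal sketch, assuming exact-match scoring (real suites often use fuzzier metrics or a judge model); `eval_score`, `passes_gate`, and the `predict` callable are hypothetical names.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


def eval_score(examples: List[Example], predict: Callable[[str], str]) -> float:
    """Fraction of examples where the prediction matches the expected output exactly."""
    if not examples:
        raise ValueError("eval dataset is empty")
    hits = sum(1 for prompt, expected in examples if predict(prompt).strip() == expected)
    return hits / len(examples)


def passes_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Allow the change only if the score did not drop by more than `tolerance`."""
    return candidate >= baseline - tolerance
```

In CI this would run once with the current prompt to get the baseline and once with the proposed prompt, failing the build when `passes_gate` returns `False`.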
Tools like Braintrust and LangSmith collapse both into a single workflow. The dataset, the metric definitions, and the production trace data all live in the same surface.
If you’re shipping an LLM product to production and want help building the eval + observability stack, our LLM & AI Engineering guides cover the architecture patterns for production deployments.
Cost monitoring — the non-obvious trap
Three cost lines surprise teams in production:
Runaway multi-turn conversations
Each turn appends to the context. A 50-turn conversation might be feeding 100K tokens to the model on the final turn. A single user can rack up a $5 bill in one session if you’re not capping context.
Mitigation: hard turn limits, sliding-window context with summarization, alert on per-user daily cost thresholds.
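The sliding-window part of that mitigation might look like the sketch below. It leaves out summarization of the dropped turns, and `estimate_tokens` is a crude stand-in (about four characters per token) for a real tokenizer; both function names are hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def sliding_window(history: List[Message], max_tokens: int) -> List[Message]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: List[Message] = []
    used = 0
    for message in reversed(history):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Walking the history from newest to oldest means the budget is always spent on the most recent turns first.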
Cache misses on long prompts
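Anthropic’s prompt caching cuts the price of cached input tokens by about 90%. Caching matches on the prompt prefix, so a change near the top of the template (a timestamp, a user name) invalidates everything after it. If that happens on each call, every request misses the cache and you pay up to 10× more than necessary for the portion that could have been cached. Monitor cache hit rate explicitly.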
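Cache hit rate can be computed from the usage block of each response. This sketch assumes usage is available as a dict with `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens` keys, and that `input_tokens` counts only the uncached remainder; check both assumptions against your provider's response format. `cache_hit_rate` is a hypothetical helper.

```python
from typing import Dict, Iterable


def cache_hit_rate(usages: Iterable[Dict[str, int]]) -> float:
    """Share of all input tokens that were served from cache across a batch of calls."""
    read = created = uncached = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens", 0)
        created += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0
```

A rate that drops toward zero right after a deploy is a strong hint that something volatile crept into the start of the prompt.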
Tool call loops
An agent calling the same tool 10 times in a row because it doesn’t like the response burns tokens and time. Monitor for repeated tool calls with identical or near-identical arguments.
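A check for the identical-arguments case could look like this. It is a minimal sketch: it only catches exact repeats (after normalizing key order), so near-identical arguments would need a similarity measure on top. `longest_repeat_run` is a hypothetical name.

```python
import json
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def longest_repeat_run(calls: List[ToolCall]) -> int:
    """Length of the longest run of consecutive calls with the same name and arguments."""
    longest = current = 0
    previous = None
    for name, arguments in calls:
        # Serialize with sorted keys so argument order does not matter.
        key = (name, json.dumps(arguments, sort_keys=True))
        current = current + 1 if key == previous else 1
        longest = max(longest, current)
        previous = key
    return longest
```

An alert or a hard stop can then trigger when the run length for a trace crosses a threshold you choose.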
Alerting: what to wake people up for
- Error rate spike — 5xx from the model provider, tool execution failures.
- P95 latency above threshold — provider degradation usually shows up here first.
- Cost spike per user or per feature — runaway conversations, prompt cache misses.
- Refusal rate spike — model started refusing benign queries (often a sign of prompt template drift).
- Eval score regression — quality drop in production sample even when no obvious errors.
- Tool call failure rate — external API the agent depends on is degraded.
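As one example of turning these into rules, the P95 latency alert can be sketched as a threshold check over a window of samples. The nearest-rank percentile used here is one of several common definitions, and the function names and threshold are hypothetical; most backends compute this for you.

```python
from typing import Sequence


def p95_nearest_rank(values: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method, using integer math for the rank."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without float rounding
    return ordered[rank - 1]


def should_page(latencies_ms: Sequence[float], threshold_ms: float) -> bool:
    """True when the window's P95 latency exceeds the alert threshold."""
    return p95_nearest_rank(latencies_ms) > threshold_ms
```

Alerting on P95 rather than the maximum keeps a single slow outlier from paging anyone, while a broad slowdown still trips the rule.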
Debugging workflow with traces
When a user reports “the agent did something weird”:
- Find the conversation by user ID + timestamp.
- Pull the trace tree — see every LLM call, every tool call, every retrieval.
- Inspect the prompt at the failing step. Often the bug is in retrieved context, not in the model output.
- Check token counts — runaway context usually visible immediately.
- Re-run the same prompt in a playground (LangSmith, Langfuse, or your own) to confirm reproduction.
- Fix the prompt or the upstream tool. Add to eval suite.
The whole workflow takes 10–30 minutes with good observability. Without it, the same investigation is hours of guessing.
For the broader build playbook on shipping LLM features to production, see our production LLM engineering guides — observability is one piece of a bigger stack.
Privacy considerations
Logging prompts means logging user data. Three patterns that hold up:
- Redaction at the SDK layer. Strip PII before the prompt or response reaches your observability backend. Vendors like Helicone and Langfuse have built-in redaction hooks.
- Sampling instead of full logging. 1–5% of traces fully logged, the rest just metadata. Costs less, satisfies most debug needs.
- Self-hosting for regulated industries. Healthcare, finance, legal — keep observability data on-prem with self-hosted Langfuse, Phoenix, or your own Postgres-backed setup.
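The first pattern, redacting before anything leaves your process, can be illustrated with a generic preprocessing function. This does not use any vendor's hook API, and the two regexes are deliberately simple examples (email addresses and US-style SSNs), not complete PII detection.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and US-style SSNs with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)
```

Calling `redact` on both the prompt and the response before they are attached to a span keeps the raw values out of the observability backend entirely.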
Production gotchas
Streaming complicates tracing
If you stream responses to users, the “completion event” happens incrementally. Your tracing needs to accumulate the streamed chunks into a single span and capture both TTFT and total time as separate metrics.
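Accumulating a stream while timing it can be sketched as follows. The example treats the stream as an iterable of text chunks, which is a simplification of real SDK stream objects, and it consumes the stream rather than forwarding chunks to the user; `consume_stream` is a hypothetical helper.

```python
import time
from typing import Dict, Iterable, List, Optional


def consume_stream(chunks: Iterable[str]) -> Dict[str, object]:
    """Accumulate streamed text and record time to first token plus total time."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts: List[str] = []
    for chunk in chunks:
        if first_token_at is None and chunk:
            first_token_at = time.perf_counter()  # first non-empty chunk
        parts.append(chunk)
    end = time.perf_counter()
    return {
        "text": "".join(parts),
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "total_ms": (end - start) * 1000,
    }
```

The returned dict maps onto a single span: the joined text as the completion, and the two timings as separate attributes.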
Provider fallback noise
If you fail over from Anthropic to OpenAI when one is down, your traces should record both the failed attempt AND the successful one. Treat them as two spans, not one.
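Recording each attempt separately might look like the sketch below. Providers are modeled as plain callables and attempts as dicts, standing in for real SDK clients and spans; `call_with_fallback` is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (provider name, call function)
Attempt = Dict[str, Optional[str]]


def call_with_fallback(providers: List[Provider], prompt: str) -> Tuple[str, List[Attempt]]:
    """Try providers in order, keeping one record per attempt, failed or not."""
    attempts: List[Attempt] = []
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception as exc:
            attempts.append({"provider": name, "status": "error", "error": repr(exc)})
            continue
        attempts.append({"provider": name, "status": "ok", "error": None})
        return text, attempts
    raise RuntimeError(f"all providers failed: {attempts}")
```

Because the failed attempt keeps its own record, provider error rates stay accurate even when users never see the failure.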
Prompt template drift
Teams version-control their code but not their prompts. The 2026 best practice is treating prompts as deployable artifacts — version them, evaluate them, and track which prompt version produced each trace.
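One lightweight way to tie a trace to a prompt version is a content hash of the template, attached to every span. This is a sketch of that idea rather than a feature of any particular tool; `prompt_version` is a hypothetical helper, and a registry with human-readable version names works just as well.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short, stable identifier derived from the template text itself."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
```

Hash the template before variables are filled in; hashing the rendered prompt would produce a different ID for every request.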
Per-feature, not per-call cost
One agent “feature” (chat, summarization, etc.) might use 3–10 LLM calls. Roll up costs by feature to make pricing + product decisions, not just per-call.
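The rollup itself is a group-by over per-call records. A minimal sketch, assuming each record is a dict with a `cost_usd` field plus whatever dimension you group on; the record shape and the `rollup_cost` name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def rollup_cost(calls: Iterable[Mapping[str, object]], key: str) -> Dict[str, float]:
    """Sum per-call cost by the given dimension, e.g. "feature" or "user_id"."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[str(call[key])] += float(call["cost_usd"])
    return dict(totals)
```

The same function covers the per-user and per-day views by changing `key`, provided those fields were recorded on each call.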
The 2026 observability checklist
- Every LLM call traced with OTel-compatible attributes (model, tokens, cost, latency).
- Every tool call traced with arguments + response (truncated if large).
- Trace IDs propagate across multi-step agent calls.
- Prompts logged with sensitive data redacted.
- Cost computed per call and rolled up per user / per feature / per day.
- Online eval running on 1–5% sample, scored automatically.
- Offline eval suite gates prompt changes.
- Alerts on error rate, p95 latency, cost spikes, refusal rate, eval score drops.
- Cache hit rate monitored if using prompt caching.
- Trace UI accessible to engineers in <30 seconds when debugging.
Frequently asked questions
What’s the best LLM observability tool in 2026?
Depends on your stack. LangSmith if you’re LangChain-heavy and want integrated evals. Helicone for the easiest onboarding and proxy-based logging. Langfuse for self-hosting + open source. Braintrust if eval workflow quality is your top priority. All four are production-ready in 2026.
Do I really need agent observability for a small project?
Once you have multi-step agents in production with real users, yes — debugging without traces is nearly impossible. For single-call LLM features (one prompt in, one response out), basic logging via Helicone or even just Stripe-like dashboards is enough.
How much does observability cost in 2026?
Helicone: free up to 10K logs/mo, $20+/mo above. LangSmith: free tier, $39+/user/mo paid. Langfuse Cloud: $59+/mo. Self-hosted Langfuse or Phoenix: free + your hosting costs. Most teams spend $20–$200/mo total at MVP scale.
Should I use OpenTelemetry GenAI conventions?
Yes if you have existing observability infrastructure (Datadog, Honeycomb, Grafana). OTel-based instrumentation lets you route LLM traces alongside your application traces. If you’re greenfield, a specialized LLM observability tool (LangSmith, Langfuse) gives you better out-of-the-box LLM-specific UX.
What’s the difference between observability and evaluation?
Observability tells you what the agent did. Evaluation tells you whether it was correct. Both are needed in production. Online evals (run on samples of traces) catch quality regressions; offline evals (run on fixed datasets) gate prompt changes before deploy.
How do I handle PII in logged prompts?
Redact at the SDK layer before data reaches the observability backend — both Helicone and Langfuse have redaction hooks. Or sample 1–5% of traces with full logging and the rest as metadata only. For regulated industries (HIPAA, finance), self-host the observability stack.
How do I trace across multiple LLM providers?
Use OpenTelemetry GenAI conventions — the attribute names (gen_ai.system, gen_ai.request.model, etc.) work the same across providers. Your trace tree shows Anthropic and OpenAI calls side by side, and you can query both with the same observability backend.
Founder of MakeAnAppLike. I write about clone apps, AI-powered SaaS, and the playbooks behind getting a product to its first thousand users. Background in software engineering and product. Previously shipped consumer marketplaces and B2B tools. Today my focus is on practical, founder-friendly guides — what to build, what to skip, and how to rank for it. If something I wrote helped you, say hi on LinkedIn.
