Skills Artificial Intelligence LangChain OpenTelemetry Observability Tracing

LangChain OpenTelemetry Observability Tracing

v20260423
langchain-otel-observability
This skill integrates LangChain 1.0 and LangGraph 1.0 into standard OpenTelemetry (OTEL) backends (e.g., Jaeger, Honeycomb, Tempo). It solves critical observability gaps by ensuring correct prompt content capture, propagating spans across complex subgraphs, and providing five LLM-specific Service Level Objectives (SLOs) like latency and cost. Ideal for multi-tenant, compliance-sensitive, or existing OTEL stack environments where LangSmith is unsuitable.
Get Skill
72 downloads
Overview

LangChain OTEL Observability (Python)

Overview

An engineer wires OpenTelemetry expecting to see prompts and responses in Honeycomb. The traces land — but only timing, model name, and token counts appear. The prompt body is blank. This is not a bug: it's the OTEL GenAI semantic-conventions privacy-safe default (P27), where OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT is off. The instinct is to flip it on and move on. On a multi-tenant workload that flip is a leak — the next engineer to search traces for Tenant A sees Tenant B's PII in the results, because redaction was supposed to happen upstream and never did.

A second trap lives inside LangGraph. A BaseCallbackHandler attached to the parent runnable never fires on inner agent tool calls, because LangGraph creates a child runtime per subgraph and callbacks do not inherit (P28). Spans inside subgraphs appear orphaned in the waterfall — or they do not appear at all — and SLO dashboards under-count latency on the exact calls that matter most: the nested agent loops.

This skill wires LangChain 1.0 / LangGraph 1.0 into an OTEL-native backend (Jaeger, Honeycomb, Grafana Tempo, Datadog) with a correct content-capture policy, subgraph-aware span propagation, and five LLM-specific SLOs (p95 / p99 latency, error rate, cost-per-request, TTFT) with burn-rate alerts. Pin: langchain-core 1.0.x, langgraph 1.0.x, opentelemetry-instrumentation-langchain >= 0.33, OTEL GenAI semconv as of 2026-04. Pain-catalog anchors: P27, P28 (and cross-references P04, P34, P37).

Prerequisites

  • Python 3.10+
  • langchain-core >= 1.0, < 2.0, langgraph >= 1.0, < 2.0
  • An OTEL-native backend picked: Jaeger (dev), Honeycomb / Tempo / Datadog (prod)
  • For multi-tenant: upstream redaction middleware already in place (see langchain-security-basics and langchain-middleware-patterns)
  • Access to set env vars at deploy time (OTLP_ENDPOINT, API keys)

Instructions

Step 1 — Install the SDK and instrumentor, configure the exporter

pip install \
  opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-http \
  "opentelemetry-instrumentation-langchain>=0.33"
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

resource = Resource.create({
    "service.name": "my-langchain-app",
    "service.version": "1.0.0",
    "deployment.environment": os.getenv("ENV", "dev"),
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(
        endpoint=os.environ["OTLP_ENDPOINT"],       # per-backend; see matrix
        headers=_parse_headers(os.getenv("OTLP_HEADERS", "")),
    ),
    max_queue_size=2048,        # spans buffered before drop; raise for high volume
    max_export_batch_size=512,  # batched export keeps per-span overhead under 1ms
))
trace.set_tracer_provider(provider)

LangchainInstrumentor().instrument()   # emits gen_ai.* attrs on every run

BatchSpanProcessor keeps per-span overhead well under 1 ms. Use SimpleSpanProcessor only in local dev — it blocks the call path per span.

Per-backend OTLP_ENDPOINT and header config lives in Backend Setup Matrix — Jaeger, Honeycomb, Grafana Tempo, Datadog.

Step 2 — Verify the GenAI attribute schema

Trigger one call and inspect what landed in the backend. LangChain 1.0 emits these gen_ai.* attributes natively on every chat-model span:

Attribute Example
gen_ai.system anthropic
gen_ai.request.model claude-sonnet-4-6
gen_ai.request.temperature 0.0
gen_ai.usage.input_tokens 1234
gen_ai.usage.output_tokens 567
gen_ai.response.finish_reasons ["stop"]

Missing anything? Likely a stale instrumentor version or an outdated provider package. The full emitted-vs-custom matrix plus LangGraph's span taxonomy (LangGraph.invokeLangGraph.node.*LangGraph.subgraph.*) is in GenAI Semantic Conventions.

Step 3 — Decide on prompt-content capture (critical — do not skip)

The engineer's instinct is to flip the capture flag to see prompts. Before flipping it, classify the workload into one of these buckets:

Workload Flag Notes
Dev / staging with synthetic inputs true Fine. Do not copy these traces to prod.
Single-tenant internal tool true Fine if RBAC on backend is tight.
Single-tenant product, signed compliance artifacts true BAA / DPIA in place; retention policy matches log retention.
Multi-tenant SaaS, no upstream redaction false Hard no. Fix redaction first.
Multi-tenant SaaS, with upstream redaction true Safe — the span sees the already-redacted text.
Healthcare / finance / legal without legal sign-off false Hard no.
# trusted single-tenant ONLY
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
export TRACELOOP_TRACE_CONTENT=true   # OpenLLMetry alias; set both to be safe

Leave unset (default) anywhere else. To capture bodies in a multi-tenant system, wire redaction middleware upstream of the model call first — see Prompt Content Policy and cross-reference pack siblings langchain-security-basics (PII redaction middleware pattern, P34) and langchain-middleware-patterns (middleware order: redact → cache → model, P24). Failure pattern P27 — prompts missing from traces because capture was never opted in — is the #1 first-day OTEL complaint; make the decision explicit instead of surprise-flipping the flag in prod.

Step 4 — Propagate callbacks through subgraphs (P28)

LangGraph creates a child runtime per subgraph. Callbacks bound at the parent definition time do not inherit:

# WRONG — subagent spans orphaned or missing (P28)
agent = create_react_agent(model=llm, tools=tools).with_config(
    callbacks=[my_handler]  # bound at definition time; children do not see it
)
agent.invoke({"messages": [...]})

# RIGHT — pass callbacks at invocation via config; they propagate down
agent.invoke(
    {"messages": [...]},
    config={"callbacks": [my_handler]}  # invocation-time; inherited by children
)

The same rule applies to custom attribute handlers (e.g. the CostAttributeHandler in the semantic-conventions reference that stamps gen_ai.usage.cost_usd on each model span). Attach via config["callbacks"], never via .with_config(). Failure pattern P28 symptom: SLO dashboards show low latency because the slow nested spans are missing entirely, not because the nested calls are fast.

Step 5 — Define LLM SLOs and dashboards

Five SLIs matter from day one. All five derive from gen_ai.* span attributes — no second pipeline required:

SLI Target example Why
p95 latency (top-level chat) < 5 s for chat UI Provider variance dominates
p99 latency < 15 s Tail matters on chat; agents with tools live here
Error rate < 0.5% Includes 429s + finish_reason IN ("length","content_filter")
Cost per request (p95) < $0.05 Catches haikuopus regressions
TTFT p95 (streaming) < 2 s Perceived latency, not total duration

Concrete Honeycomb / PromQL / Datadog queries for each SLI, plus multi-window multi-burn-rate alerts (14.4× / 1h fast burn, 6× / 6h slow burn), are in LLM SLO Dashboards.

Step 6 — Tune sampling

Defaults are wrong for two ends of the volume spectrum:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Low/medium volume — keep everything for debuggability
# (< ~100 req/s) — SDK default 100% is fine

# High volume — head-sample, but carve out errors + slow spans via tail sampling
# at the OTEL Collector (see references/llm-slo-dashboards.md)
provider = TracerProvider(
    resource=resource,
    sampler=TraceIdRatioBased(0.10),  # 10% head sample
)

Watch out: head sampling at 10% means 90% of p99 outliers are discarded before they reach the backend — p99 metrics become noisy and biased toward the median. For tail-latency SLOs, move sampling to a Collector with tailsamplingprocessor so errors and slow spans (latency > 5000ms) are always kept while the rest is probabilistically sampled at 10%. Typical trace overhead with BatchSpanProcessor at the 512-span batch size: under 1 ms per span; recommended sampling rate for high-volume production is 1-10%.

Output

  • OTEL exporter wired to a chosen backend (Jaeger / Honeycomb / Tempo / Datadog)
  • opentelemetry-instrumentation-langchain emitting gen_ai.* attrs on every LangChain and LangGraph span
  • Explicit prompt-content capture decision recorded against a workload bucket, with the multi-tenant guardrail enforced upstream
  • Callbacks propagated via config["callbacks"] at invocation time so subgraph spans nest correctly under their parent node
  • Five LLM SLOs (p95 / p99 latency, error rate, cost-per-request, TTFT) with dashboards and MWMBR burn-rate alerts
  • Sampling strategy matched to workload volume and SLO precision needs

Error Handling

Symptom Cause Fix
Traces land but prompt and completion bodies are empty OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT unset (P27 — privacy-safe default) Set to true only for the workload buckets in Step 3; for multi-tenant, wire upstream redaction first
Subgraph / tool-call spans orphaned or missing Callbacks bound via .with_config() at definition time (P28) Pass via config["callbacks"] at invocation time so children inherit
gen_ai.usage.cache_read_input_tokens resets every call Per-call usage, aggregation is your job (P04) Custom callback summing across calls keyed by session.id; see langchain-model-inference
p99 dashboard looks noisy and median-biased 10% head sampling drops outliers before backend Move to Collector tailsamplingprocessor — always keep errors and latency > 5000ms
Traces never appear OTLPSpanExporter endpoint wrong protocol (gRPC on 4317 vs HTTP on 4318) Verify with curl -v $OTLP_ENDPOINT; swap to the proto-grpc exporter package if your backend expects gRPC
Cost attribute missing from spans LangChain 1.0 does not emit gen_ai.usage.cost_usd natively Add a BaseCallbackHandler that computes from tokens × pricing; see semantic-conventions reference
PR review flags sk-... in trace attributes Secrets in prompts captured via gen_ai.prompt.content (P37-adjacent) Upstream redactor must strip API-key patterns before model call; audit via 0.1% sampler
Exporter dropping spans silently Queue overflow at high volume Increase max_queue_size to 4096+; add Collector between SDK and backend

Examples

Running Jaeger locally for dev-loop tracing

Spin up Jaeger in Docker, point the SDK at http://localhost:4318/v1/traces, leave content capture on (it's dev, inputs are synthetic). You get a generic span waterfall — no LLM-specific UX, but good for verifying the instrumentor emits what you expect before paying for a SaaS backend.

See Backend Setup Matrix for the docker run command and SDK config.

Investigating an agent latency incident in Honeycomb

Honeycomb's BubbleUp over gen_ai.request.model, gen_ai.usage.input_tokens, and tool call count is the fastest path from "p95 spiked at 14:00" to "one specific tool took 20 s because the vectorstore was slow." Requires content-capture-off by default so you can turn the team loose on search without PII-leak worries.

See LLM SLO Dashboards for the exact Honeycomb query shape.

Dual-exporting during a LangSmith → Tempo migration

Register two BatchSpanProcessors — one to LangSmith's OTLP endpoint, one to Tempo. Run both for two weeks, compare waterfalls, cut over. LangSmith handles LLM-specific analytics; Tempo handles unified trace search across LLM and non-LLM services in your Grafana stack.

See Backend Setup Matrix dual-export section.

Resources

Info
Name langchain-otel-observability
Version v20260423
Size 20.3KB
Updated At 2026-04-28
Language