Datadog Log Analysis and Incident Investigation

v20260416
datadog
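This guide provides a complete workflow for troubleshooting production incidents on the Nexu platform with the Datadog Logs API. It covers how to query crash events, analyze OpenClaw's stderr output, check gateway startup status, and review API request logs. It includes authentication requirements, best practices for filtering by pod and time range, and steps for parsing raw logs with Python, so you can locate production problems quickly.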

Datadog Log Investigation

Query Datadog Logs API to investigate production issues for the Nexu platform.

Authentication

Before making any Datadog API call, you MUST ask the user for these two keys:

  • DD_API_KEY — Datadog API Key (Organization Settings → API Keys)
  • DD_APP_KEY — Datadog Application Key (Organization Settings → Application Keys, requires logs_read_data scope)

Store them in shell variables for the session. Never hardcode or commit them.

Site: datadoghq.com (US)

API Base

All requests go to https://api.datadoghq.com/api/v2/logs/events/search.

Headers:

DD-API-KEY: <api_key>
DD-APPLICATION-KEY: <app_key>
Content-Type: application/json
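
As a minimal sketch in Python, the headers above could be assembled from environment variables so the keys never appear in code. The helper name build_headers is hypothetical; the variable names DD_API_KEY and DD_APP_KEY are the ones used in the Authentication section.

```python
import os

def build_headers(env=None):
    """Build Datadog request headers from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [k for k in ("DD_API_KEY", "DD_APP_KEY") if not env.get(k)]
    if missing:
        # Fail early rather than sending an unauthenticated request
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "DD-API-KEY": env["DD_API_KEY"],
        "DD-APPLICATION-KEY": env["DD_APP_KEY"],
        "Content-Type": "application/json",
    }
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.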

Common Queries

OpenClaw Crash Events

curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway @event:openclaw_crash",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'

Key fields in results:

  • attributes.attributes.exitCode — process exit code (1 = fatal error, null = signal)
  • attributes.attributes.signal — kill signal (SIGKILL, SIGTERM, etc.)
  • attributes.tags (pod_name, image_tag) — which pod and which version
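
The key fields listed above can be pulled out of a single event with a small helper. This is a sketch only: summarize_crash is a hypothetical name, the event shape is inferred from the field paths in this section, and the sample values are illustrative rather than real log data.

```python
def summarize_crash(event):
    """Extract the key crash fields from one entry of the response's 'data' list."""
    top = event.get("attributes", {})
    attrs = top.get("attributes", {})
    # Tags arrive as 'key:value' strings; split once so values containing ':' survive
    tags = dict(t.split(":", 1) for t in top.get("tags", []) if ":" in t)
    exit_code = attrs.get("exitCode")
    return {
        "exit_code": exit_code,
        "signal": attrs.get("signal"),
        # Per the field notes, a null exit code means the process died from a signal
        "killed_by_signal": exit_code is None,
        "pod_name": tags.get("pod_name", "?"),
        "image_tag": tags.get("image_tag", "?"),
    }

# Illustrative event only; field values are made up for the example
sample = {
    "attributes": {
        "attributes": {"exitCode": None, "signal": "SIGKILL"},
        "tags": ["pod_name:nexu-gateway-1", "image_tag:sha-example"],
    }
}
print(summarize_crash(sample))
```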

OpenClaw stderr Output (Crash Details)

curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway @stream:stderr",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 50}
  }'

This shows the actual error output from the OpenClaw process (e.g., invalid_auth, EADDRINUSE, config validation failures).
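
When many stderr lines come back, a rough tally by error type can help. The sketch below counts messages by the error substrings named above; classify_stderr and KNOWN_ERRORS are hypothetical names, and the list deliberately contains only the two literal tokens this guide mentions, since the exact wording of other failures is not specified here.

```python
from collections import Counter

# Substrings taken from the examples above; extend with patterns seen in your own logs
KNOWN_ERRORS = ("invalid_auth", "EADDRINUSE")

def classify_stderr(messages):
    """Count stderr messages by the first known error substring they contain."""
    counts = Counter()
    for msg in messages:
        label = next((k for k in KNOWN_ERRORS if k in msg), "other")
        counts[label] += 1
    return counts
```

Anything that matches no known substring lands in "other", which is a cue to read those messages directly.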

Gateway Startup / Recovery Events

curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway (\"starting gateway\" OR \"gateway is ready\" OR \"spawned openclaw\")",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "timestamp",
    "page": {"limit": 30}
  }'

Slack Token Health Check

curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-api slack_token_health*",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'

API HTTP Request Logs

curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-api http_request @attributes.status:>=500",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'

Filter by Pod

Add pod_name:<name> to the query:

service:nexu-gateway pod_name:nexu-gateway-1 @event:openclaw_crash
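
If queries are built programmatically, the pod filter can be an optional argument. A minimal sketch, assuming a hypothetical helper build_query that only concatenates the search terms shown in this guide:

```python
def build_query(service, event=None, pod_name=None):
    """Compose a Datadog log query string from optional filters (hypothetical helper)."""
    parts = [f"service:{service}"]
    if pod_name:
        parts.append(f"pod_name:{pod_name}")
    if event:
        parts.append(f"@event:{event}")
    return " ".join(parts)

print(build_query("nexu-gateway", event="openclaw_crash", pod_name="nexu-gateway-1"))
```

With these arguments the helper reproduces the query line shown above.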

Filter by Time Window

Use ISO 8601 timestamps:

{
  "from": "2026-03-10T05:00:00Z",
  "to": "2026-03-10T06:00:00Z"
}

Or relative: "now-30m", "now-1h", "now-24h".
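
For an absolute window around a known incident time, the two timestamps can be computed rather than typed by hand. A sketch using only the standard library; absolute_window is a hypothetical helper name, and it expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

def absolute_window(end, minutes):
    """Return a from/to fragment covering `minutes` before `end`, formatted in UTC."""
    end = end.astimezone(timezone.utc)  # a naive datetime would be treated as local time
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {"from": start.strftime(fmt), "to": end.strftime(fmt)}

window = absolute_window(datetime(2026, 3, 10, 6, 0, tzinfo=timezone.utc), 60)
print(window)
```

The returned dict can be merged into the "filter" object of any request body in this guide.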

Parsing Results

Use python3 inline to extract key fields:

curl -s ... | python3 -c "
import json, sys
data = json.load(sys.stdin)
events = data.get('data', [])
print(f'Total events: {len(events)}')
for e in events:
    top = e.get('attributes', {})
    attrs = top.get('attributes', {})  # custom log attributes (may be absent)
    tags = top.get('tags', [])
    pod = next((t.split(':',1)[1] for t in tags if t.startswith('pod_name:')), '?')
    # Prefer the app-level time; fall back to the Datadog event timestamp
    ts = attrs.get('time') or top.get('timestamp', '?')
    msg = (top.get('message') or '')[:120]
    print(f'{ts} | pod={pod} | {msg}')
"
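
The inline snippet reads a single page of results. When a window holds more events than the page limit, the search can be repeated with the cursor returned in meta.page.after. The sketch below assumes that cursor convention of the v2 search endpoint; search_all and http_post are hypothetical names, and the post parameter exists so the paging logic can be tried without network access.

```python
import json
import urllib.request

SEARCH_URL = "https://api.datadoghq.com/api/v2/logs/events/search"

def http_post(body, headers):
    """POST a JSON body to the search endpoint and return the decoded response."""
    req = urllib.request.Request(
        SEARCH_URL, data=json.dumps(body).encode("utf-8"),
        headers=headers, method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def search_all(query, headers, time_from="now-1h", time_to="now",
               limit=50, max_pages=5, post=http_post):
    """Collect events across pages, following meta.page.after cursors."""
    body = {
        "filter": {"query": query, "from": time_from, "to": time_to},
        "sort": "-timestamp",
        "page": {"limit": limit},
    }
    events = []
    for _ in range(max_pages):  # hard cap so a broad query cannot loop indefinitely
        result = post(body, headers)
        events.extend(result.get("data", []))
        cursor = (result.get("meta", {}).get("page") or {}).get("after")
        if not cursor:
            break
        body["page"]["cursor"] = cursor
    return events
```

The max_pages cap is a design choice for interactive investigation: it bounds both the wait time and the number of API calls.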

Services and Events Reference

Service        Description
nexu-gateway   Gateway sidecar (manages OpenClaw process)
nexu-api       API server

Event                                  Meaning
openclaw_crash                         OpenClaw process exited unexpectedly
openclaw_restart_scheduled             Sidecar scheduling a restart
openclaw_restart_limit                 Max restart attempts exceeded
openclaw_orphan_killed                 Killed zombie OpenClaw process
slack_token_health_check_invalidated   Invalid Slack tokens detected and marked

Tag Reference

Tag                 Example
pod_name            nexu-gateway-1, nexu-gateway-2
image_tag           sha-55f13372bb72abc7db1538cca3db2bcda0d35eba
kube_stateful_set   nexu-gateway

Investigation Playbook

When investigating a crash:

  1. Check crash events — get exit codes, signals, timestamps, affected pods
  2. Check stderr — get the actual error message from OpenClaw
  3. Check startup events — correlate crash with deploy times (image_tag changes)
  4. Check token health — if invalid_auth, look for slack_token_health_check_invalidated
  5. Check API logs — if API errors are contributing
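
The five steps above can be sketched as an ordered list of request bodies. The query strings are copied from the Common Queries section; PLAYBOOK and playbook_bodies are hypothetical names, and the uniform page limit is a simplification of the per-query limits shown earlier.

```python
import json

# (step name, query, sort order), in playbook order
PLAYBOOK = [
    ("crash events", "service:nexu-gateway @event:openclaw_crash", "-timestamp"),
    ("stderr", "service:nexu-gateway @stream:stderr", "-timestamp"),
    ("startup events",
     'service:nexu-gateway ("starting gateway" OR "gateway is ready" OR "spawned openclaw")',
     "timestamp"),
    ("token health", "service:nexu-api slack_token_health*", "-timestamp"),
    ("API errors", "service:nexu-api http_request @attributes.status:>=500", "-timestamp"),
]

def playbook_bodies(time_from="now-1h", time_to="now", limit=20):
    """Yield (step name, request body) pairs in playbook order."""
    for name, query, sort in PLAYBOOK:
        yield name, {
            "filter": {"query": query, "from": time_from, "to": time_to},
            "sort": sort,
            "page": {"limit": limit},
        }

for name, body in playbook_bodies():
    print(name, "->", json.dumps(body["filter"]))
```

Using the same time window for every step keeps the results comparable when correlating crashes with startup and API events.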

Rules

  1. Never hardcode API keys in skill files or logs — always use variables
  2. Default time window — start with now-1h, expand to now-24h if needed
  3. Always parse and summarize — don't dump raw JSON to the user
  4. Correlate across services — crashes often involve both gateway and API logs
  5. Check image_tag to determine if crashes are related to a specific deployment
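
For rule 5, a simple count of crash events per image_tag shows whether crashes cluster on one deployment. A minimal sketch, assuming events is the 'data' list from a crash-event query; crashes_by_image_tag is a hypothetical helper name.

```python
from collections import Counter

def crashes_by_image_tag(events):
    """Count crash events per image_tag tag; events without one count as 'unknown'."""
    counts = Counter()
    for e in events:
        tags = e.get("attributes", {}).get("tags", [])
        tag = next(
            (t.split(":", 1)[1] for t in tags if t.startswith("image_tag:")),
            "unknown",
        )
        counts[tag] += 1
    return counts
```

If one tag dominates the counts, comparing its first crash timestamp with the startup events for that tag is a reasonable next step.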

Info
Category: Hardware Engineering
Name: datadog
Version: v20260416
Size: 5.83KB
Updated: 2026-04-28