Service Monitoring Metrics and Alert Design

v20260618

monitoring-setup-guide

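This skill creates a complete observability setup guide for any service. It systematically covers the four golden signals (latency, traffic, errors, saturation) and guides users through defining core business metrics, setting service level objectives (SLOs), configuring actionable alert rules, and building comprehensive monitoring dashboards, so that operations teams can see system health at a glance. It is a key document for SRE practice.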

Monitoring Observability SLO Metrics Alerting DevOps SRE Grafana

Overview

Monitoring Setup Guide Skill

Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.

Required Inputs

Ask for these if not already provided:

Service name and description — what the service does and its role in the system
Tech stack — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
Current monitoring tooling — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
Key user journeys — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
Existing alerts — paste any existing alert configurations or describe what's currently monitored

Output Format

Monitoring Setup Guide: [Service Name]

Team: [Team name] | Tech lead: [Name]
Stack: [Language/Framework] on [Infrastructure]
Monitoring platform: [Datadog / Prometheus+Grafana / CloudWatch / etc.]
Date: [Date] | Review cycle: Quarterly

1. Monitoring Philosophy

Good monitoring answers three questions:

Is the service healthy right now? (alerting)
Was it healthy in the past, and is it trending worse? (dashboards + SLO tracking)
Why did something fail? (logs + traces)

This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.

Key user journeys monitored:

Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
Journey 2: [e.g. "User views transaction history — GET /transactions"]
Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]

2. The Four Golden Signals

Apply the four golden signals specifically to [Service Name]:

Latency

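Latency measures how long requests take to complete. Track it separately for successful and failed requests: fast failures pull the aggregate down and make the service look healthier than it is, while slow failures (such as timeouts) are a distinct problem that aggregate latency obscures.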

Metric	Description	Source	Dimensions
`[service].request.duration_ms`	End-to-end request latency	Application instrumentation	`endpoint`, `method`, `status_code`
`[service].db.query_duration_ms`	Database query latency	ORM / query instrumentation	`query_name`, `table`
`[service].external.request_duration_ms`	Outbound call latency to dependencies	HTTP client instrumentation	`target_service`, `endpoint`
`[service].queue.processing_duration_ms`	Time to process one message (if applicable)	Consumer instrumentation	`queue_name`, `message_type`

Latency SLO targets:

Endpoint / operation	p50 target	p95 target	p99 target
`GET /api/v1/[resource]`	< [50] ms	< [200] ms	< [500] ms
`POST /api/v1/[resource]`	< [100] ms	< [400] ms	< [1000] ms
`GET /health`	< [10] ms	< [20] ms	< [50] ms
[Background job name]	< [5] sec	< [15] sec	< [60] sec
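
To make the success/failure split concrete, the sketch below computes nearest-rank percentiles per outcome and compares them with targets. The function names, the sample data, and the target values are illustrative assumptions; in practice the monitoring platform computes percentiles from histograms.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_report(requests, targets_ms):
    """requests: iterable of (duration_ms, status_code); targets_ms: {pct: target}."""
    groups = {
        "success": [d for d, status in requests if status < 500],
        "failure": [d for d, status in requests if status >= 500],
    }
    report = {}
    for name, samples in groups.items():
        if not samples:
            continue
        report[name] = {}
        for pct, target in targets_ms.items():
            value = percentile(samples, pct)
            report[name][pct] = {"value_ms": value, "within_target": value < target}
    return report


# Hypothetical sample: 98 fast successes and 2 slow failures (timeouts)
requests = [(40, 200)] * 98 + [(3000, 503)] * 2
report = latency_report(requests, {50: 50, 95: 200, 99: 500})
aggregate_p95 = percentile([d for d, _ in requests], 95)
```

In this sample the aggregate p95 is still 40 ms, so the slow failures only become visible when failures are reported as their own series.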

Traffic

Traffic measures demand on the system. Use it to detect unexpected spikes, traffic drops (which can indicate upstream failures), and to capacity-plan.

Metric	Description	Source
`[service].request.count`	Requests per second	Application / load balancer
`[service].request.count_by_endpoint`	RPS broken down by endpoint	Application
`[service].queue.messages_consumed_per_second`	Consumer throughput	Queue consumer
`[service].queue.depth`	Messages waiting in queue	Queue metrics

Traffic baselines (update after observing production for 2+ weeks):

Time period	Expected RPS	Low-traffic floor	Spike ceiling
Peak (weekday business hours)	[N] RPS	[N × 0.5] RPS	[N × 5] RPS
Off-peak (nights/weekends)	[N × 0.2] RPS	[N × 0.05] RPS	[N] RPS
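
As a rough illustration of how the baseline table could drive a traffic check, the sketch below derives the floor and ceiling for each period from a single peak RPS figure and classifies a current reading. The multipliers come from the table above; the function names and the example value of 200 RPS are assumptions.

```python
def traffic_baselines(peak_rps):
    """Build the floor/ceiling table from the expected peak RPS (N in the table)."""
    return {
        "peak": {"expected": peak_rps, "floor": peak_rps * 0.5, "ceiling": peak_rps * 5},
        "off_peak": {"expected": peak_rps * 0.2, "floor": peak_rps * 0.05, "ceiling": peak_rps},
    }


def classify_traffic(current_rps, period, peak_rps):
    """Return 'drop', 'spike', or 'normal' for the given period."""
    baseline = traffic_baselines(peak_rps)[period]
    if current_rps < baseline["floor"]:
        return "drop"
    if current_rps > baseline["ceiling"]:
        return "spike"
    return "normal"


# Hypothetical service with an expected peak of 200 RPS
peak_reading = classify_traffic(80, "peak", 200)          # below the 100 RPS floor
off_peak_reading = classify_traffic(80, "off_peak", 200)  # inside the 10..200 RPS band
```

The same 80 RPS reading counts as a drop at peak and as normal off-peak, which is why the baselines are kept per time period.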

Errors

Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).

Metric	Description	Alert on?
`[service].request.error_rate`	5xx errors / total requests	Yes — see alert rules
`[service].request.client_error_rate`	4xx errors / total requests	Threshold alert — sudden spike may indicate API misuse
`[service].dependency.error_rate`	Errors calling downstream dependencies	Yes (downstream health signal)
`[service].queue.dlq_depth`	Messages in dead-letter queue	Yes — indicates processing failures
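
The 4xx/5xx distinction can be sketched as a small calculation over response status codes. This is a minimal illustration with a hypothetical function name and sample data, not how any particular metrics backend computes its rates.

```python
from collections import Counter


def error_rates(status_codes):
    """Split failures into server (5xx) and client (4xx) rates over all requests."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return {"server_error_rate": 0.0, "client_error_rate": 0.0}
    return {
        "server_error_rate": classes[5] / total,
        "client_error_rate": classes[4] / total,
    }


# Hypothetical window: 90 successes, 8 client errors, 2 server errors
rates = error_rates([200] * 90 + [404] * 8 + [500] * 2)
```

Here the server error rate is 2% and the client error rate is 8%; only the first would breach a 1% 5xx threshold.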

Saturation

Saturation measures how "full" the service is: how close its most constrained resources are to maximum capacity.

Resource	Metric	Alert threshold	Source
CPU	`[service].cpu.utilisation_pct`	>80% sustained 5 min	Container / VM metrics
Memory	`[service].memory.utilisation_pct`	>85% sustained 5 min	Container / VM metrics
DB connections	`[service].db.connection_pool.utilisation_pct`	>75%	Application / DB metrics
Thread pool / goroutines	`[service].runtime.goroutine_count` / `thread_count`	>N (establish baseline)	Runtime metrics
Disk (if applicable)	`[service].disk.utilisation_pct`	>75%	Infrastructure
Queue depth (if applicable)	`[service].queue.depth`	>[backlog threshold]	Queue metrics
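
Thresholds such as ">80% sustained 5 min" only fire when every recent sample is above the limit for the whole duration. A minimal sketch of that check follows, assuming samples arrive as (timestamp in seconds, value) pairs sorted by time; the function name and sample data are hypothetical.

```python
def sustained_breach(samples, threshold, duration_s):
    """True if the most recent unbroken run above threshold spans at least duration_s."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the first value at or below the threshold
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= duration_s


# Hypothetical CPU readings, one per minute
steady_high = [(0, 70), (60, 85), (120, 90), (180, 88), (240, 91), (300, 86), (360, 83)]
with_dip = [(0, 85), (60, 85), (120, 79), (180, 85), (240, 85), (300, 85), (360, 85)]
```

With an 80% threshold and a 300-second duration, `steady_high` breaches while `with_dip` does not, because the dip at 120 s resets the run.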

3. Business Metrics

Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.

Metric	Description	Source	Alert?
`[service].[primary_action].success_rate`	[e.g. "Payment success rate"]	Application	Yes — if drops >5% vs 1h average
`[service].[primary_action].count`	[e.g. "Payments processed per minute"]	Application	Yes — sudden drop (traffic anomaly)
`[service].[resource].created_per_hour`	[e.g. "New accounts created"]	Application / DB	No — informational
`[service].cache.hit_rate`	Fraction of requests served from cache	Cache instrumentation	Yes — if drops below [60]%
`[service].job.[name].success_rate`	[Background job success rate]	Job framework	Yes — if drops below [99]%
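
A rule like "drops >5% vs 1h average" compares the current value with a trailing baseline rather than a fixed threshold. The sketch below interprets the 5% as percentage points, which is an assumption; the function name and sample rates are also hypothetical.

```python
def success_rate_dropped(current_rate, trailing_rates, max_drop=0.05):
    """True if current_rate is more than max_drop below the trailing average."""
    if not trailing_rates:
        raise ValueError("need at least one baseline sample")
    baseline = sum(trailing_rates) / len(trailing_rates)
    return baseline - current_rate > max_drop


# Hypothetical payment success rates sampled over the last hour
last_hour = [0.99, 0.98, 0.99, 0.98]
```

With this baseline of roughly 98.5%, a current rate of 92% triggers the alert and 96% does not.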

4. Log Strategy

Structured Logging Schema

All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.

Mandatory fields (every log line):

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "trace_id": "[uuid-from-request-context]",
  "span_id": "[span-uuid]",
  "request_id": "[uuid-per-request]",
  "message": "[human readable description]"
}

Request log (emit for every HTTP request):

{
  "timestamp": "...",
  "level": "info",
  "service": "[service-name]",
  "event": "http_request",
  "method": "POST",
  "path": "/api/v1/[resource]",
  "status_code": 201,
  "duration_ms": 45,
  "user_id": "[uuid — DO NOT log PII directly]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]"
}

Error log (emit for every error with context):

{
  "timestamp": "...",
  "level": "error",
  "service": "[service-name]",
  "event": "error",
  "error_code": "[application-error-code]",
  "error_message": "[description — no sensitive data]",
  "stack_trace": "[stack trace]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]",
  "context": {
    "[key]": "[relevant context without PII]"
  }
}
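
One way to emit the mandatory fields is a custom formatter on top of Python's standard `logging` module. The sketch below is a minimal illustration: the class name, the service name `payments-api`, and the version string are assumptions, and the trace context is passed in through `extra` rather than read from a real tracing library.

```python
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "span_id", "request_id")
LEVEL_NAMES = {"warning": "warn"}  # match the level names used in this guide


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        level = record.levelname.lower()
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(level, level),
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Context fields arrive via logger.info(..., extra={...}); missing ones become null
        for key in CONTEXT_FIELDS:
            entry[key] = getattr(record, key, None)
        return json.dumps(entry)


logger = logging.getLogger("payments")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payments-api", "abc1234"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment processed", extra={"trace_id": "t-1", "span_id": "s-1", "request_id": "r-1"})
```

Emitting null for a missing context field, rather than omitting the key, keeps every line on the same schema and makes gaps in propagation easy to query.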

Log Levels — When to Use Each

Level	Use when	Example
`error`	Something failed that requires attention — this should page on-call eventually	Database query failed, external API returned 5xx, required config missing
`warn`	Something unexpected happened but service is still functioning	Retry succeeded after failure, cache miss on expected hit, rate limit approaching
`info`	Significant business events and request lifecycle	Request received, payment processed, user authenticated, job started/completed
`debug`	Detailed diagnostic information — off in production by default	Query parameters, intermediate computation results, cache key lookups

What NOT to Log

Never log:

Passwords, tokens, API keys, or secrets (even hashed)
Full credit card numbers or PAN data
Social security numbers or government IDs
Full names + dates of birth + contact info in the same log line (PII aggregation)
Request/response bodies in full (use field-level extraction instead)
Health check requests (too noisy — exclude GET /health from access logs)
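
A key-based redaction pass before logging can act as a safety net for the list above. The sketch below is an assumption-heavy illustration: the set of sensitive key names is hypothetical and must be adapted per service, and matching on key names will not catch sensitive values embedded in free text.

```python
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization", "card_number", "ssn"}
REDACTED = "[REDACTED]"


def redact(value):
    """Return a copy of value with sensitive dict keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical log context containing secrets at several depths
context = {
    "user_id": "u1",
    "password": "hunter2",
    "nested": {"Token": "t", "items": [{"api_key": "k", "ok": 1}]},
}
safe_context = redact(context)
```

The function builds a new structure instead of mutating its input, so the original request data stays usable by the rest of the handler.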

5. Distributed Tracing Setup

Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.

Instrumentation Checklist

[ ] Tracing library installed:
    - Go: go.opentelemetry.io/otel
    - Python: opentelemetry-sdk, opentelemetry-instrumentation
    - Node: @opentelemetry/sdk-node
    - Java: opentelemetry-java-instrumentation

[ ] Tracer initialized at service startup with service name and version

[ ] Trace context propagated via W3C Trace Context headers:
    traceparent: 00-[trace-id]-[span-id]-01
    tracestate: [optional vendor-specific]

[ ] Automatic instrumentation enabled for:
    [ ] Inbound HTTP/gRPC requests (creates root span)
    [ ] Outbound HTTP/gRPC calls (creates child spans)
    [ ] Database queries (creates child spans with sanitized query)
    [ ] Cache operations (Redis, Memcached)
    [ ] Message queue produce/consume

[ ] Custom spans added for:
    [ ] Key business operations ([e.g. payment processing, user lookup])
    [ ] Background jobs (each job execution = root span)
    [ ] Third-party API calls with custom attributes

[ ] Span attributes to capture on all spans:
    - user.id (if authenticated — no PII)
    - deployment.environment (production/staging)
    - service.version (git SHA)
    - [service-specific key attributes]

[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]

[ ] Sampling rate configured:
    - Production: [1–10]% of requests (adjust based on volume and cost)
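    - Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint] (keeping errors and slow requests requires tail-based sampling, since a head-based sampler decides before the outcome is known)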
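
To show what the `traceparent` header in the checklist carries, the sketch below parses a header and builds the header for a child span. It is a simplified illustration that only handles the version 00 layout; the function names are hypothetical, and in a real service the OpenTelemetry SDK's propagators would normally do this for you.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version ff are invalid per the W3C Trace Context spec
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
```

The trace ID is carried unchanged across the hop while the span ID is regenerated, which is the property to verify when checking propagation with a test request.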

Trace Instrumentation Examples

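# Python: OpenTelemetry example
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("[service-name]")


class PaymentError(Exception):
    """Placeholder for the service's own payment failure exception."""


def _do_process(payment_data):
    """Placeholder for the real payment logic."""
    raise NotImplementedError


def process_payment(payment_data):
    # One span per business operation; it nests under the active request span
    with tracer.start_as_current_span("process_payment") as span:
        # Record only non-sensitive attributes
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(Status(StatusCode.OK))
            return result
        except PaymentError as e:
            # Mark the span as failed and attach the exception for debugging
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise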

6. Alert Rules Specification

Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.

Alert Definitions

Alert name	Condition	Threshold	Severity	On-call action
`[Service]HighErrorRate`	5xx error rate, 5-min rolling window	>1% for 2 consecutive windows	P1	Check recent deploys; inspect error logs; see runbook [link]
`[Service]CriticalErrorRate`	5xx error rate, 2-min rolling window	>5%	P1 — immediate	Same as above — page immediately, do not wait
`[Service]HighP99Latency`	p99 latency on key endpoints	>2× SLO target for 3 min	P2	Check DB latency, cache hit rate, and downstream dependencies
`[Service]LatencySLOBreach`	p99 latency	>SLO target for 5 consecutive minutes	P1	SLO burn — page on-call, escalate if not resolved in 20 min
`[Service]HighCPU`	CPU utilisation	>80% sustained for 5 min	P2	Check for traffic spike; scale up if needed; check for runaway processes
`[Service]HighMemory`	Memory utilisation	>85% sustained for 5 min	P2	Check for memory leak (especially after deploys); restart pod if OOM imminent
`[Service]DBConnectionPoolHigh`	DB connection pool utilisation	>75%	P2	Check for long-running queries; consider scaling service or increasing pool size
`[Service]DLQDepthHigh`	Dead-letter queue depth	>10 messages	P2	Inspect DLQ messages for error pattern; fix bug and replay if safe
`[Service]TrafficDropAnomaly`	RPS, compared to same hour yesterday	>50% drop sustained 5 min	P1	Upstream may be down; check caller health; check load balancer
`[Service]PrimaryActionSuccessRateDrop`	[Business metric success rate]	<[95]% over 10 min	P1	[Service-specific action — e.g. "Check payment provider status"]
`[Service]DownstreamDependencyErrors`	Error rate calling [dependency]	>5% over 5 min	P2	Check [dependency] status page; enable fallback if available
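
The rule that every alert needs a name, condition, threshold, severity, and concrete action can be enforced mechanically, for example in a review script. The sketch below is one possible shape; the class, the list of vague actions, and the severity labels checked are assumptions rather than any platform's schema.

```python
from dataclasses import dataclass, fields

VAGUE_ACTIONS = {"investigate", "check", "look into it"}
KNOWN_SEVERITIES = ("P1", "P2", "P3")


@dataclass
class AlertRule:
    name: str
    condition: str
    threshold: str
    severity: str
    action: str


def validate_alert(rule):
    """Return a list of problems; an empty list means the rule is complete."""
    problems = [f"missing {f.name}" for f in fields(rule) if not getattr(rule, f.name).strip()]
    if rule.severity.strip() and not rule.severity.startswith(KNOWN_SEVERITIES):
        problems.append(f"unknown severity: {rule.severity}")
    if rule.action.strip().lower().rstrip(".") in VAGUE_ACTIONS:
        problems.append("action is too vague to act on")
    return problems


# Hypothetical rules: one complete, one that should be rejected
good = AlertRule("PaymentsHighErrorRate", "5xx error rate, 5-min rolling window", ">1%",
                 "P1", "Check recent deploys; inspect error logs")
bad = AlertRule("PaymentsHighCPU", "CPU utilisation", "", "urgent", "Investigate.")
```

Running `validate_alert(bad)` reports the missing threshold, the unknown severity, and the vague action, while `validate_alert(good)` returns an empty list.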

Alert Configuration Examples

# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:

      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"

      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"

# Datadog monitor configuration (Python SDK, datadogpy)
import os

from datadog import api, initialize

# Credentials come from the environment; the variable names are a convention, adjust to your setup
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    # Plain string, not an f-string: the tag-scope braces must reach the API unchanged
    query="sum(last_5m):sum:[service].http.errors{service:[service-name]} / sum:[service].http.requests{service:[service-name]} > 0.01",
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    }
)

7. Dashboard Layout Specification

The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:

┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Service Health Dashboard           [Time range ▼] │
├───────────────┬───────────────┬───────────────┬─────────────────────┤
│  Error rate   │  p99 Latency  │  RPS (current)│ SLO budget remaining│
│  [BIG NUMBER] │  [BIG NUMBER] │  [BIG NUMBER] │ [BIG NUMBER / days] │
│  vs SLO: 0.1% │  vs SLO: 500ms│  vs avg: [N]  │ [Error budget gauge]│
├───────────────┴───────────────┴───────────────┴─────────────────────┤
│                   Error rate over time (24h)                        │
│  [Time series: 5xx rate line, SLO threshold line]                   │
├─────────────────────────────────┬───────────────────────────────────┤
│  Latency percentiles over time  │  Request throughput over time     │
│  [Lines: p50, p95, p99, p999]   │  [Bars: RPS by endpoint]          │
│  [SLO threshold horizontal line]│                                   │
├─────────────────────────────────┴───────────────────────────────────┤
│  Latency heatmap (all requests — shows distribution shape)          │
├─────────────────────────────────┬───────────────────────────────────┤
│  CPU utilisation over time      │  Memory utilisation over time     │
│  [All instances/pods — lines]   │  [All instances/pods — lines]     │
│  [Alert threshold: 80%]         │  [Alert threshold: 85%]           │
├─────────────────────────────────┼───────────────────────────────────┤
│  DB: connection pool utilisation│  DB: query latency (p99 per query)│
├─────────────────────────────────┼───────────────────────────────────┤
│  [Business metric 1 over time]  │  [Business metric 2 over time]    │
│  e.g. Payment success rate      │  e.g. Orders created/min          │
└─────────────────────────────────┴───────────────────────────────────┘
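
The "SLO budget remaining" panel can be backed by a simple calculation over the SLO window. The sketch below assumes a request-based availability SLO; the function name and the example figures are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical 30-day window: 99.9% SLO, 1,000,000 requests, 250 failures
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With a 99.9% target, one million requests allow about 1,000 failures, so 250 failures leave roughly 75% of the budget.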

Second dashboard — Dependency Health:

┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Dependency Health                                 │
├─────────────────────────────────────────────────────────────────────┤
│  For each dependency: error rate | latency | current status         │
│  [Database]    [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded    │
│  [Redis]       [N]% errors | [N]ms p99 | ● Healthy                 │
│  [External API][N]% errors | [N]ms p99 | ● Healthy                 │
├─────────────────────────────────────────────────────────────────────┤
│  Outbound call latency over time (one line per dependency)          │
├─────────────────────────────────────────────────────────────────────┤
│  Circuit breaker / fallback state (if implemented)                  │
└─────────────────────────────────────────────────────────────────────┘

8. Observability Debt Analysis

An honest assessment of what is missing today and how urgently each gap should be closed:

Gap	Impact	Priority	Effort	Owner	Target date
[e.g. No distributed tracing — can't see cross-service latency]	High — blind to dependency issues	P1	[2 days]	[Name]	[Date]
[e.g. No business metric alerts — only infra alerts]	High — silent business failures	P1	[1 day]	[Name]	[Date]
[e.g. Logs are unstructured text — not searchable]	Medium — slow incident investigation	P2	[3 days]	[Name]	[Date]
[e.g. No dead-letter queue monitoring]	Medium — failed messages go unnoticed	P2	[4 hours]	[Name]	[Date]
[e.g. Alert thresholds not calibrated to production baseline]	Medium — alert fatigue or missed alerts	P2	[1 day]	[Name]	[Date]
[e.g. No latency heatmap — outliers invisible in averages]	Low — harder to spot tail latency issues	P3	[2 hours]	[Name]	[Date]

Total observability debt: [N] items | Estimated effort: [N days]

Quality Checks

Every alert has a named on-call action — no alert says "investigate" without specifying what to investigate first
Alert thresholds are calibrated against production baselines, not set to default values from a template
Structured logging is implemented — no unstructured text log lines in production
PII is explicitly excluded from logs — a named engineer has verified this
Distributed tracing is propagating trace IDs across all service boundaries (verify with a test request)
The primary dashboard answers "is the service healthy?" in under 10 seconds — no hunting for the right panel
Business metrics are tracked alongside infrastructure metrics — not just four golden signals
Observability debt items have owners and dates — not just "would be nice to have"

Anti-Patterns

Do not create alerts without a specific on-call action — an alert that just says "investigate" trains engineers to ignore it
Do not set alert thresholds from a template without calibrating against production baselines — uncalibrated thresholds cause either alert fatigue or missed incidents
Do not log PII, tokens, or secrets — a logging standard is incomplete without an explicit list of what must never be logged
Do not measure only the four golden signals without adding at least one business metric alert — infrastructure health can be green while the business-critical path is silently failing
Do not deploy distributed tracing without verifying that trace IDs propagate across all service boundaries — partial tracing is worse than no tracing because it produces misleading incomplete traces

Info

Category Hardware Engineering

Name monitoring-setup-guide

Version v20260618

Size 23.25KB

Source mohitagw15856/pm-claude-skills

Updated 2026-06-19