Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.
Ask for these if not already provided:
Team: [Team name] | Tech lead: [Name]
Stack: [Language/Framework] on [Infrastructure]
Monitoring platform: [Datadog / Prometheus+Grafana / CloudWatch / etc.]
Date: [Date] | Review cycle: Quarterly
Good monitoring answers three questions:
This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.
Key user journeys monitored:
Apply the four golden signals specifically to [Service Name]:
Latency measures how long requests take to complete. Track it separately for successful and failed requests: errors that fail fast pull the aggregate down and can mask slow successes, while failures that time out inflate it.
| Metric | Description | Source | Dimensions |
|---|---|---|---|
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | endpoint, method, status_code |
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | query_name, table |
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | target_service, endpoint |
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | queue_name, message_type |
Latency SLO targets:
| Endpoint / operation | p50 target | p95 target | p99 target |
|---|---|---|---|
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
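To make the "track successes and failures separately" advice concrete, here is a minimal sketch that computes nearest-rank percentiles per outcome and compares them with SLO targets like those in the table above. The function names, the input shape (a list of `(duration_ms, status_code)` pairs), and the choice to count only 5xx responses as failures are assumptions for illustration; in practice the monitoring platform usually computes percentiles from histograms.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(requests, targets_ms):
    """Group durations by outcome and check each percentile against its target.

    requests:   iterable of (duration_ms, status_code) pairs
    targets_ms: mapping such as {"p50": 50, "p95": 200, "p99": 500}
    Returns {outcome: {percentile_name: (value_ms, within_target)}}.
    """
    groups = {"success": [], "failure": []}
    for duration_ms, status_code in requests:
        # Assumption: only 5xx responses count as failed requests.
        groups["failure" if status_code >= 500 else "success"].append(duration_ms)

    report = {}
    for outcome, durations in groups.items():
        if not durations:
            continue  # no requests of this kind in the window
        report[outcome] = {}
        for name, target in targets_ms.items():
            value = percentile(durations, int(name.lstrip("p")))
            report[outcome][name] = (value, value < target)
    return report
```

Splitting by outcome keeps a burst of fast 5xx responses from making the success-path percentiles look better than they are.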
Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to plan capacity.
| Metric | Description | Source |
|---|---|---|
| `[service].request.count` | Requests per second | Application / load balancer |
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |
Traffic baselines (update after observing production for 2+ weeks):
| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|---|---|---|---|
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
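The baseline table can be read as a set of multipliers on the peak rate N. The sketch below encodes those multipliers and classifies a current reading; `BASELINE_FACTORS` and `classify_traffic` are hypothetical names, and the factors are simply the placeholder values from the table, to be replaced once real baselines are known.

```python
# Multipliers of the peak expected rate N, taken from the baseline table:
# period -> (low-traffic floor, spike ceiling)
BASELINE_FACTORS = {
    "peak": (0.5, 5.0),
    "off_peak": (0.05, 1.0),
}


def classify_traffic(current_rps, peak_rps, period):
    """Return 'below_floor', 'above_ceiling', or 'normal' for the given period."""
    floor_factor, ceiling_factor = BASELINE_FACTORS[period]
    floor = peak_rps * floor_factor
    ceiling = peak_rps * ceiling_factor
    if current_rps < floor:
        return "below_floor"
    if current_rps > ceiling:
        return "above_ceiling"
    return "normal"
```

With N = 200 RPS, a reading of 90 RPS is below the peak-hours floor (100 RPS) but normal off-peak, where the floor is 10 RPS.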
Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).
| Metric | Description | Alert on? |
|---|---|---|
| `[service].request.error_rate` | 5xx errors / total requests | Yes (see alert rules) |
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert: a sudden spike may indicate API misuse |
| `[service].dependency.error_rate` | Errors calling downstream dependencies | Yes (dependency health signal) |
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes (indicates processing failures) |
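As a small illustration of the 4xx/5xx split, the sketch below derives both rates from a list of HTTP status codes. The function name and return keys are assumptions; real deployments would normally compute these ratios in the metrics backend.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the server (5xx) and client (4xx) error rates as fractions of all requests."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        # No traffic: report zero rather than dividing by zero.
        return {"server_error_rate": 0.0, "client_error_rate": 0.0}
    return {
        "server_error_rate": classes[5] / total,
        "client_error_rate": classes[4] / total,
    }
```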
Saturation measures how "full" the service is: how close its constrained resources are to maximum capacity.
| Resource | Metric | Alert threshold | Source |
|---|---|---|---|
| CPU | `[service].cpu.utilisation_pct` | >80% sustained 5 min | Container / VM metrics |
| Memory | `[service].memory.utilisation_pct` | >85% sustained 5 min | Container / VM metrics |
| DB connections | `[service].db.connection_pool.utilisation_pct` | >75% | Application / DB metrics |
| Thread pool / goroutines | `[service].runtime.goroutine_count` / `thread_count` | >N (establish baseline) | Runtime metrics |
| Disk (if applicable) | `[service].disk.utilisation_pct` | >75% | Infrastructure |
| Queue depth (if applicable) | `[service].queue.depth` | >[backlog threshold] | Queue metrics |
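"Sustained 5 min" means every reading in the window is above the threshold, not just the latest one. A minimal sketch of that check follows; the function name, the `(timestamp_seconds, value)` sample shape, and the `min_samples` guard against sparse data are assumptions for illustration.

```python
def sustained_breach(samples, threshold, window_seconds, now, min_samples=5):
    """True if every sample in the last window_seconds exceeds threshold.

    samples: iterable of (timestamp_seconds, value) pairs
    min_samples: minimum readings required in the window, so that one stray
                 data point cannot count as a sustained breach
    """
    window_start = now - window_seconds
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    if len(in_window) < min_samples:
        return False
    return all(value > threshold for value in in_window)
```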
Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.
| Metric | Description | Source | Alert? |
|---|---|---|---|
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"] | Application | Yes, if it drops >5% vs 1h average |
| `[service].[primary_action].count` | [e.g. "Payments processed per minute"] | Application | Yes, on a sudden drop (traffic anomaly) |
| `[service].[resource].created_per_hour` | [e.g. "New accounts created"] | Application / DB | No (informational) |
| `[service].cache.hit_rate` | Fraction of requests served from cache | Cache instrumentation | Yes, if it drops below [60]% |
| `[service].job.[name].success_rate` | [Background job success rate] | Job framework | Yes, if it drops below [99]% |
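The "drops >5% vs 1h average" rule can be sketched as below. Note the assumption: the 5% is treated as a relative drop against the trailing one-hour mean, not as 5 percentage points; pick whichever reading matches your intent and document it. The function and parameter names are hypothetical.

```python
def success_rate_dropped(current_rate, last_hour_rates, max_relative_drop=0.05):
    """True if current_rate is more than max_relative_drop below the 1h average.

    current_rate:    latest success rate as a fraction (0.0 to 1.0)
    last_hour_rates: success-rate samples from the trailing hour
    """
    if not last_hour_rates:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(last_hour_rates) / len(last_hour_rates)
    if baseline == 0:
        return False
    relative_drop = (baseline - current_rate) / baseline
    return relative_drop > max_relative_drop
```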
All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.
Mandatory fields (every log line):
```json
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "trace_id": "[uuid-from-request-context]",
  "span_id": "[span-uuid]",
  "request_id": "[uuid-per-request]",
  "message": "[human readable description]"
}
```
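One way to guarantee the mandatory fields is a formatter that builds every line from the same template. The sketch below uses only the standard-library `logging` and `json` modules; the class name `JsonFormatter` and the convention of passing `trace_id`, `span_id`, and `request_id` through `extra=` are assumptions, not a prescribed implementation.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the level names used in this guide.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object containing the mandatory fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            "version": self.version,
            # Correlation IDs are expected via logger.info(..., extra={...}).
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter("[service-name]", "[git-sha-short]"))` and pass the IDs on each call, for example `logger.info("payment processed", extra={"trace_id": tid, "span_id": sid, "request_id": rid})`.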
Request log (emit for every HTTP request):
```json
{
  "timestamp": "...",
  "level": "info",
  "service": "[service-name]",
  "event": "http_request",
  "method": "POST",
  "path": "/api/v1/[resource]",
  "status_code": 201,
  "duration_ms": 45,
  "user_id": "[uuid; DO NOT log PII directly]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]"
}
```
Error log (emit for every error with context):
```json
{
  "timestamp": "...",
  "level": "error",
  "service": "[service-name]",
  "event": "error",
  "error_code": "[application-error-code]",
  "error_message": "[description; no sensitive data]",
  "stack_trace": "[stack trace]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]",
  "context": {
    "[key]": "[relevant context without PII]"
  }
}
```
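A sketch of how such an error entry might be assembled from a caught exception, with the context scrubbed before it is logged. The helper names and the `SENSITIVE_KEYS` list are assumptions for illustration; the real deny-list should come from your own data classification, and the caller is still responsible for keeping sensitive values out of the exception message itself.

```python
import traceback
from datetime import datetime, timezone

# Assumption: example deny-list only; replace with your own classification.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "email"}


def scrub_context(context):
    """Replace values of known-sensitive keys (case-insensitive) with a marker."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in context.items()
    }


def build_error_entry(service, error_code, exc, request_id, trace_id, context=None):
    """Build a dict with the error-log fields shown above from a caught exception."""
    now = datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": "error",
        "service": service,
        "event": "error",
        "error_code": error_code,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "request_id": request_id,
        "trace_id": trace_id,
        "context": scrub_context(context or {}),
    }
```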
| Level | Use when | Example |
|---|---|---|
| `error` | Something failed and needs attention; sustained errors should eventually page on-call | Database query failed, external API returned 5xx, required config missing |
| `warn` | Something unexpected happened but the service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
| `info` | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
| `debug` | Detailed diagnostic information; off in production by default | Query parameters, intermediate computation results, cache key lookups |
Never log:
GET /health from access logs)
Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.
- [ ] Tracing library installed:
  - Go: `go.opentelemetry.io/otel`
  - Python: `opentelemetry-sdk`, `opentelemetry-instrumentation`
  - Node: `@opentelemetry/sdk-node`
  - Java: `opentelemetry-java-instrumentation`
- [ ] Tracer initialized at service startup with service name and version
- [ ] Trace context propagated via W3C Trace Context headers:
  - `traceparent: 00-[trace-id]-[span-id]-01`
  - `tracestate: [optional vendor-specific]`
- [ ] Automatic instrumentation enabled for:
  - [ ] Inbound HTTP/gRPC requests (creates root span)
  - [ ] Outbound HTTP/gRPC calls (creates child spans)
  - [ ] Database queries (creates child spans with sanitized query)
  - [ ] Cache operations (Redis, Memcached)
  - [ ] Message queue produce/consume
- [ ] Custom spans added for:
  - [ ] Key business operations ([e.g. payment processing, user lookup])
  - [ ] Background jobs (each job execution = root span)
  - [ ] Third-party API calls with custom attributes
- [ ] Span attributes to capture on all spans:
  - `user.id` (if authenticated; no PII)
  - `deployment.environment` (production/staging)
  - `service.version` (git SHA)
  - [service-specific key attributes]
- [ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]
- [ ] Sampling rate configured:
  - Production: [1–10]% of requests (adjust based on volume and cost)
  - Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint]
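The sampling rules in the checklist can be sketched as a single decision function. Because errors and slow requests are only known once a request has finished, this kind of rule is normally applied as tail-based sampling (for example in a collector), which is an assumption about your setup. The function name, its parameters, and the trick of deriving a stable bucket from the trace ID are illustrative, not part of any particular SDK.

```python
def should_sample(trace_id_hex, sample_rate, is_error=False,
                  duration_ms=None, slow_threshold_ms=None,
                  critical_endpoint=False):
    """Decide whether to keep a finished trace.

    trace_id_hex: 32-character lowercase hex trace ID
    sample_rate:  baseline fraction to keep, e.g. 0.05 for 5%
    """
    # Always keep errors and traffic on endpoints marked as critical.
    if is_error or critical_endpoint:
        return True
    # Always keep requests slower than the configured threshold (e.g. p99 target).
    if (duration_ms is not None and slow_threshold_ms is not None
            and duration_ms > slow_threshold_ms):
        return True
    # Otherwise keep a deterministic fraction: the same trace ID yields the
    # same decision in every service that evaluates it.
    bucket = int(trace_id_hex[-8:], 16) / 0x100000000
    return bucket < sample_rate
```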
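```python
# Python: OpenTelemetry example
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("[service-name]")


class PaymentError(Exception):
    """Placeholder for the service's own payment failure exception."""


def _do_process(payment_data):
    """Placeholder for the real payment logic."""
    raise NotImplementedError


def process_payment(payment_data):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(Status(StatusCode.OK))
            return result
        except PaymentError as e:
            # Mark the span as failed and attach the exception for debugging
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```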
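For services that have to read or write the `traceparent` header by hand (the OpenTelemetry propagators normally do this for you), the version `00` layout from the checklist can be handled as sketched below. The function names are hypothetical, and only version `00` is accepted in this sketch.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent/span-id - 2 hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace_id, span_id and the sampled flag, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def format_traceparent(trace_id, span_id, sampled):
    """Build a version-00 traceparent header value for an outbound call."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```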
Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.
| Alert name | Condition | Threshold | Severity | On-call action |
|---|---|---|---|---|
| `[Service]HighErrorRate` | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
| `[Service]CriticalErrorRate` | 5xx error rate, 2-min rolling window | >5% | P1 (immediate) | Same as above; page immediately, do not wait |
| `[Service]HighP99Latency` | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and upstream dependencies |
| `[Service]LatencySLOBreach` | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn: page on-call, escalate if not resolved in 20 min |
| `[Service]HighCPU` | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
| `[Service]HighMemory` | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
| `[Service]DBConnectionPoolHigh` | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
| `[Service]DLQDepthHigh` | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
| `[Service]TrafficDropAnomaly` | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
| `[Service]PrimaryActionSuccessRateDrop` | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action, e.g. "Check payment provider status"] |
| `[Service]DownstreamDependencyErrors` | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |
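The rule that every alert needs a name, condition, threshold, severity, and on-call action can be enforced mechanically, for example in CI over the alert definitions. The sketch below is one hypothetical way to do that; `AlertRule` and `missing_fields` are illustrative names, not part of any monitoring platform.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AlertRule:
    name: str
    condition: str
    threshold: str
    severity: str
    action: str


def missing_fields(rule):
    """Return the names of required fields that are empty or whitespace-only."""
    return [f.name for f in fields(rule) if not getattr(rule, f.name).strip()]
```

A rule with a non-empty result from `missing_fields` (most often a blank `action`) is a candidate for deletion rather than for paging someone.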
```yaml
# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:
      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        # 5m of a breaching 5m rate covers two consecutive 5-minute windows
        for: 5m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"
      - alert: [Service]HighP99Latency
        # Threshold is 2 × the p99 SLO target, in seconds (1.0 for a 500 ms target)
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [1.0]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"
```
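```python
# Datadog monitor via the Python SDK (`datadog` package); the same monitor can be defined in Terraform
import os

from datadog import api, initialize

# Assumption: credentials come from environment variables, never from source code
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

METRIC_PREFIX = "[service]"     # placeholder: metric namespace
SERVICE_TAG = "[service-name]"  # placeholder: value of the service tag

# Doubled braces produce the literal { } that Datadog uses for tag scopes
query = (
    f"sum(last_5m):sum:{METRIC_PREFIX}.http.errors{{service:{SERVICE_TAG}}} / "
    f"sum:{METRIC_PREFIX}.http.requests{{service:{SERVICE_TAG}}} > 0.01"
)

api.Monitor.create(
    type="metric alert",
    query=query,
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=[f"service:{SERVICE_TAG}", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    },
)
```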
The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:
┌─────────────────────────────────────────────────────────────────────┐
│ [SERVICE NAME] — Service Health Dashboard [Time range ▼] │
├───────────────┬───────────────┬───────────────┬─────────────────────┤
│ Error rate │ p99 Latency │ RPS (current)│ SLO budget remaining│
│ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER / days] │
│ vs SLO: 0.1% │ vs SLO: 500ms│ vs avg: [N] │ [Error budget gauge]│
├───────────────┴───────────────┴───────────────┴─────────────────────┤
│ Error rate over time (24h) │
│ [Time series: 5xx rate line, SLO threshold line] │
├─────────────────────────────────┬───────────────────────────────────┤
│ Latency percentiles over time │ Request throughput over time │
│ [Lines: p50, p95, p99, p999] │ [Bars: RPS by endpoint] │
│ [SLO threshold horizontal line]│ │
├─────────────────────────────────┴───────────────────────────────────┤
│ Latency heatmap (all requests — shows distribution shape) │
├─────────────────────────────────┬───────────────────────────────────┤
│ CPU utilisation over time │ Memory utilisation over time │
│ [All instances/pods — lines] │ [All instances/pods — lines] │
│ [Alert threshold: 80%] │ [Alert threshold: 85%] │
├─────────────────────────────────┴───────────────────────────────────┤
│ DB: connection pool utilisation│ DB: query latency (p99 per query)│
├─────────────────────────────────┴───────────────────────────────────┤
│ [Business metric 1 over time] │ [Business metric 2 over time] │
│ e.g. Payment success rate │ e.g. Orders created/min │
└─────────────────────────────────┴───────────────────────────────────┘
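The "SLO budget remaining" panel is simple arithmetic over the SLO window. A minimal sketch follows, assuming an availability-style SLO expressed as an allowed error rate (0.001 for the 0.1% shown in the layout); the function name and parameters are hypothetical.

```python
def error_budget_remaining(allowed_error_rate, total_requests, failed_requests):
    """Fraction of the error budget still unspent over the SLO window.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has already been breached.
    """
    if allowed_error_rate <= 0:
        raise ValueError("allowed_error_rate must be positive")
    if total_requests == 0:
        return 1.0  # no traffic, so no budget consumed
    allowed_failures = allowed_error_rate * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For example, with a 0.1% allowed error rate, 1,000,000 requests, and 250 failures, 1,000 failures were allowed, so 75% of the budget remains.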
Second dashboard — Dependency Health:
┌─────────────────────────────────────────────────────────────────────┐
│ [SERVICE NAME] — Dependency Health │
├─────────────────────────────────────────────────────────────────────┤
│ For each dependency: error rate | latency | current status │
│ [Database] [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded │
│ [Redis] [N]% errors | [N]ms p99 | ● Healthy │
│ [External API][N]% errors | [N]ms p99 | ● Healthy │
├─────────────────────────────────────────────────────────────────────┤
│ Outbound call latency over time (one line per dependency) │
├─────────────────────────────────────────────────────────────────────┤
│ Circuit breaker / fallback state (if implemented) │
└─────────────────────────────────────────────────────────────────────┘
Record honestly what is missing today and how urgently each gap should be closed:
| Gap | Impact | Priority | Effort | Owner | Target date |
|---|---|---|---|---|---|
| [e.g. No distributed tracing — can't see cross-service latency] | High — blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
| [e.g. No business metric alerts — only infra alerts] | High — silent business failures | P1 | [1 day] | [Name] | [Date] |
| [e.g. Logs are unstructured text — not searchable] | Medium — slow incident investigation | P2 | [3 days] | [Name] | [Date] |
| [e.g. No dead-letter queue monitoring] | Medium — failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
| [e.g. Alert thresholds not calibrated to production baseline] | Medium — alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
| [e.g. No latency heatmap — outliers invisible in averages] | Low — harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |
Total observability debt: [N] items | Estimated effort: [N days]