Server Management Best-Practice Principles

v20260423
server-management
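This skill collects a systematic body of knowledge for production-grade server management. It covers process management, monitoring metric design, log standardization, horizontal and vertical scaling strategies, and workflows for system security and troubleshooting. The core goal is to build the systems thinking of a systems architect and to keep applications highly available and robust.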
Overview

Server Management

Server management principles for production operations. Learn to THINK, not memorize commands.


1. Process Management Principles

Tool Selection

Scenario       Tool
Node.js app    PM2 (clustering, reload)
Any app        systemd (Linux native)
Containers     Docker/Podman
Orchestration  Kubernetes, Docker Swarm

Process Management Goals

Goal                  What It Means
Restart on crash      Auto-recovery
Zero-downtime reload  No service interruption
Clustering            Use all CPU cores
Persistence           Survive server reboot
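
To make the "restart on crash" goal concrete, here is a minimal sketch of what a supervisor does under the hood: rerun a command after a non-zero exit, waiting longer between attempts. The names `backoff_delay` and `supervise`, the retry limit, and the delay values are illustrative assumptions, not the behavior of PM2 or systemd; in production you would normally rely on one of those tools rather than a hand-rolled loop.

```python
import subprocess
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def supervise(cmd, max_restarts=5, base=1.0, sleep=time.sleep):
    """Run `cmd` and restart it after each non-zero exit.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError once `max_restarts` is exhausted (hypothetical policy).
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts (last exit code {exit_code})"
            )
        # Wait before restarting so a crash loop does not saturate the host.
        sleep(backoff_delay(restarts, base))
        restarts += 1
```

The backoff is the part worth copying into any restart policy: restarting instantly and forever tends to hide a crash loop instead of surfacing it.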

2. Monitoring Principles

What to Monitor

Category      Key Metrics
Availability  Uptime, health checks
Performance   Response time, throughput
Errors        Error rate, types
Resources     CPU, memory, disk

Alert Severity Strategy

Level     Response
Critical  Immediate action
Warning   Investigate soon
Info      Review daily
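
One way to apply the severity levels above is to map each metric reading onto a level with explicit thresholds. The sketch below assumes a "higher is worse" metric such as error rate or CPU usage; the `Thresholds` class, the `classify` function, and the numbers in the usage example are hypothetical and should be tuned per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Cut-off values for one metric (assumed: higher readings are worse)."""
    warning: float
    critical: float


def classify(value, thresholds):
    """Return the alert level for a single metric reading."""
    if value >= thresholds.critical:
        return "critical"  # immediate action
    if value >= thresholds.warning:
        return "warning"   # investigate soon
    return "info"          # review daily


# Usage with made-up thresholds for an error-rate metric (fraction of requests):
error_rate = Thresholds(warning=0.01, critical=0.05)
print(classify(0.02, error_rate))  # warning
```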

Monitoring Tool Selection

Need                Options
Simple/Free         PM2 metrics, htop
Full observability  Grafana, Datadog
Error tracking      Sentry
Uptime              UptimeRobot, Pingdom

3. Log Management Principles

Log Strategy

Log Type          Purpose
Application logs  Debug, audit
Access logs       Traffic analysis
Error logs        Issue detection

Log Principles

  1. Rotate logs to prevent disk fill
  2. Structured logging (JSON) for parsing
  3. Appropriate levels (error/warn/info/debug)
  4. No sensitive data in logs
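
The four principles above can be combined in a few lines with Python's standard `logging` module. This is a sketch under assumptions: the `SENSITIVE_KEYS` set, the `context` field name, the logger name `app`, and the 10 MB / 5 file rotation policy are illustrative choices, not recommendations from this skill.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of field names that must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, redacting sensitive fields."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields are passed via logger.info(..., extra={"context": {...}}).
        for key, value in getattr(record, "context", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload)


def build_logger(path, level=logging.INFO):
    """Create a logger that writes JSON lines and rotates at 10 MB, keeping 5 backups."""
    logger = logging.getLogger("app")
    logger.setLevel(level)
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("app.log").info("login", extra={"context": {"user": "alice", "password": "hunter2"}})` would write the user name but replace the password with `[REDACTED]`.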

4. Scaling Decisions

When to Scale

Symptom         Solution
High CPU        Add instances (horizontal)
High memory     Increase RAM or fix leak
Slow response   Profile first, then scale
Traffic spikes  Auto-scaling

Scaling Strategy

Type        When to Use
Vertical    Quick fix, single instance
Horizontal  Sustainable, distributed
Auto        Variable traffic
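
The symptom-to-solution mapping above can be written down as a small decision function, which is useful as a checklist even if it is never automated. The thresholds `CPU_HIGH_PCT` and `MEM_HIGH_PCT` and the function name `recommend_scaling` are assumptions made for this sketch; the returned strings mirror the table.

```python
# Hypothetical cut-offs; what counts as "high" depends on the workload.
CPU_HIGH_PCT = 80.0
MEM_HIGH_PCT = 85.0


def recommend_scaling(cpu_pct, mem_pct, slow_response=False, traffic_spikes=False):
    """Return the actions suggested by the observed symptoms, in priority order."""
    actions = []
    if slow_response:
        # Profiling comes first: scaling a slow code path only multiplies the cost.
        actions.append("profile first, then scale")
    if cpu_pct >= CPU_HIGH_PCT:
        actions.append("add instances (horizontal)")
    if mem_pct >= MEM_HIGH_PCT:
        actions.append("increase RAM or fix leak")
    if traffic_spikes:
        actions.append("enable auto-scaling")
    return actions


print(recommend_scaling(cpu_pct=92.0, mem_pct=40.0))
# ['add instances (horizontal)']
```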

5. Health Check Principles

What Constitutes Healthy

Check               Meaning
HTTP 200            Service responding
Database connected  Data accessible
Dependencies OK     External services reachable
Resources OK        CPU/memory not exhausted

Health Check Implementation

  • Simple: Just return 200
  • Deep: Check all dependencies
  • Choose based on load balancer needs
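
The difference between the simple and deep variants can be sketched as two framework-independent functions that return a status code and a body. The names `liveness` and `readiness`, the 503 status for a failed deep check, and the shape of the response body are assumptions for illustration; the dependency checks are passed in as callables so the sketch does not depend on any particular database or HTTP client.

```python
def liveness():
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def readiness(checks):
    """Deep check: run every dependency probe and report each result.

    `checks` maps a dependency name to a zero-argument callable that returns
    True when the dependency is usable. A probe that raises counts as failed.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "degraded", "checks": results}


# Usage with stand-in probes (replace with real database and API checks):
print(readiness({"database": lambda: True, "payments_api": lambda: False}))
# (503, {'status': 'degraded', 'checks': {'database': True, 'payments_api': False}})
```

A load balancer that polls every few seconds is usually better served by the simple check, since a deep check multiplies that polling load onto every dependency.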

6. Security Principles

Area      Principle
Access    SSH keys only, no passwords
Firewall  Only needed ports open
Updates   Regular security patches
Secrets   Environment vars, not files
Audit     Log access and changes
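
For the secrets principle, a common pattern is to read each secret from the environment at startup and fail fast when one is missing, rather than falling back to a default baked into the code. The helper name `require_env` and the variable name `DATABASE_URL` below are hypothetical.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of a required environment variable or raise if unset/empty."""
    value = environ.get(name)
    if not value:
        # The message names the variable but never echoes a secret value.
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# Usage (assumes the deployment sets DATABASE_URL):
# database_url = require_env("DATABASE_URL")
```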

7. Troubleshooting Priority

When something's wrong:

  1. Check if running (process status)
  2. Check logs (error messages)
  3. Check resources (disk, memory, CPU)
  4. Check network (ports, DNS)
  5. Check dependencies (database, APIs)
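
The priority order above can be encoded as an ordered list of checks that stops at the first failure, so the cheapest and most likely causes are ruled out first. This sketch uses only the standard library; the function names, the 90% disk threshold, and the 2-second timeout are assumptions, and the process and log checks are left as callables you supply because they depend on how the service is run.

```python
import shutil
import socket


def disk_ok(path="/", max_used_pct=90.0):
    """Resource check: True while disk usage at `path` is below the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 < max_used_pct


def port_open(host, port, timeout=2.0):
    """Network check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(steps):
    """Run (name, check) pairs in order; return the first failing name, or None."""
    for name, check in steps:
        if not check():
            return name
    return None


# Usage: steps follow the priority order from the list above.
steps = [
    ("resources", lambda: disk_ok("/")),
    ("network", lambda: port_open("127.0.0.1", 5432)),  # hypothetical database port
]
# failing_step = first_failure(steps)
```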

8. Anti-Patterns

❌ Don't         ✅ Do
Run as root      Use non-root user
Ignore logs      Set up log rotation
Skip monitoring  Monitor from day one
Manual restarts  Auto-restart config
No backups       Regular backup schedule

Remember: A well-managed server is boring. That's the goal.

When to Use

Use this skill when a task calls for the workflow or actions described in the overview.

Limitations

  • Use this skill only when the task clearly matches the scope described above.
  • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
  • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
Info
Category Programming & Development
Name server-management
Version v20260423
Size 4.06 KB
Updated 2026-04-24