Server Management
Server management principles for production operations.
Learn to THINK, not memorize commands.
1. Process Management Principles
Tool Selection
| Scenario | Tool |
|---|---|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |
Process Management Goals
| Goal | What It Means |
|---|---|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
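
To make the "restart on crash" goal concrete, here is a minimal sketch of what a supervisor does under the hood. It is not how PM2 or systemd are implemented; the function name `supervise` and its parameters (`max_restarts`, `base_delay`, `max_delay`) are hypothetical and chosen only for illustration.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, base_delay=0.1, max_delay=5.0):
    """Run cmd and restart it on a non-zero exit, with exponential backoff.

    Returns the number of restarts performed. All names and defaults
    here are illustrative assumptions, not taken from any real tool.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover from
        if restarts >= max_restarts:
            return restarts  # give up rather than crash-loop forever
        # Back off a little longer after each consecutive failure.
        time.sleep(min(base_delay * (2 ** restarts), max_delay))
        restarts += 1


if __name__ == "__main__":
    # A command that always fails, to exercise the restart path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("restarts:", supervise(failing, max_restarts=2, base_delay=0.01))
```

The restart cap and the backoff are the two design choices that matter: without them, a process that fails at startup is restarted in a tight loop and hides the real problem.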
2. Monitoring Principles
What to Monitor
| Category | Key Metrics |
|---|---|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
Alert Severity Strategy
| Level | Response |
|---|---|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
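
The severity levels above can be sketched as a simple threshold mapping. The threshold values (80% and 95% disk usage) and the names `Thresholds`, `classify`, and `DISK_USED_PERCENT` are assumptions for illustration; real thresholds depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    critical: float


def classify(value, thresholds):
    """Map a metric value to a severity level (higher value = worse)."""
    if value >= thresholds.critical:
        return "critical"  # immediate action
    if value >= thresholds.warning:
        return "warning"   # investigate soon
    return "info"          # review daily


# Hypothetical thresholds for disk usage, in percent.
DISK_USED_PERCENT = Thresholds(warning=80.0, critical=95.0)

if __name__ == "__main__":
    for used in (42.0, 85.0, 97.5):
        print(used, classify(used, DISK_USED_PERCENT))
```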
Monitoring Tool Selection
| Need | Options |
|---|---|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
3. Log Management Principles
Log Strategy
| Log Type | Purpose |
|---|---|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |
Log Principles
- Rotate logs to prevent disk fill
- Structured logging (JSON) for parsing
- Appropriate levels (error/warn/info/debug)
- No sensitive data in logs
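
The four principles above can be sketched together with Python's standard `logging` module. This is one possible arrangement, not a prescribed setup: the `context` attribute, the `SENSITIVE_KEYS` list, the logger name `app`, and the size limits are all assumptions made for the example.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of field names that must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, redacting sensitive fields."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Callers attach fields with: logger.info("msg", extra={"context": {...}})
        context = getattr(record, "context", {})
        for key, value in context.items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload, default=str)


def build_logger(path, max_bytes=10_000_000, backups=5):
    """Create a logger that rotates its file once it reaches max_bytes."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)  # debug messages are dropped at this level
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("app.log").warning("login failed", extra={"context": {"user": "alice", "password": "..."}})` would then write a single JSON line with the password replaced by `[REDACTED]`.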
4. Scaling Decisions
When to Scale
| Symptom | Solution |
|---|---|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
Scaling Strategy
| Type | When to Use |
|---|---|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
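
For the auto-scaling case, a common approach is a proportional rule: scale the instance count by the ratio of observed load to target load, then clamp it. The sketch below assumes CPU utilization as the metric; the function name, parameters, and limits are hypothetical. The rule resembles the one documented for the Kubernetes Horizontal Pod Autoscaler, but this is a simplified illustration, not that implementation.

```python
import math


def desired_instances(current, observed_cpu, target_cpu,
                      min_instances=1, max_instances=10):
    """Return the instance count that would bring CPU back to target.

    current:      number of instances running now
    observed_cpu: average CPU utilization across instances (percent)
    target_cpu:   utilization we want to hold (percent)
    """
    if current < 1 or target_cpu <= 0:
        raise ValueError("current must be >= 1 and target_cpu must be > 0")
    raw = math.ceil(current * observed_cpu / target_cpu)
    # Clamp so a metric glitch cannot scale to zero or to an unbounded size.
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    print(desired_instances(4, observed_cpu=90, target_cpu=60))  # scale out
    print(desired_instances(4, observed_cpu=30, target_cpu=60))  # scale in
```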
5. Health Check Principles
What Constitutes Healthy
| Check | Meaning |
|---|---|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |
Health Check Implementation
- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs
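
The difference between a simple and a deep check can be sketched with the standard library alone. The endpoint paths `/healthz` and `/readyz`, the `CHECKS` registry, and the function names are assumptions for this example; use whatever paths your load balancer is configured to probe.

```python
import json
from http.server import BaseHTTPRequestHandler


def shallow_health():
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks):
    """Deep check: run every dependency probe and report each result.

    checks maps a name to a zero-argument callable returning True/False.
    A probe that raises is treated as a failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


# Register real probes here, e.g. a database ping (hypothetical).
CHECKS = {}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status, body = shallow_health()
        elif self.path == "/readyz":
            status, body = deep_health(CHECKS)
        else:
            status, body = 404, {"error": "not found"}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally, pass `HealthHandler` to `http.server.HTTPServer` on a port of your choice and call `serve_forever()`. Keeping the two endpoints separate lets a load balancer poll the cheap check frequently while the deep check runs less often.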
6. Security Principles
| Area | Principle |
|---|---|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
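
As a small illustration of the secrets principle, an application can read credentials from the environment at startup and fail fast when one is missing. The helper name `require_env` and the variable name `DEMO_DB_PASSWORD` are hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail at startup."""
    value = os.environ.get(name)
    if not value:
        # Report only the variable name, never a secret value.
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


if __name__ == "__main__":
    os.environ.setdefault("DEMO_DB_PASSWORD", "example-only")
    password = require_env("DEMO_DB_PASSWORD")
    print("loaded secret of length", len(password))
```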
7. Troubleshooting Priority
When something's wrong:
- Check if running (process status)
- Check logs (error messages)
- Check resources (disk, memory, CPU)
- Check network (ports, DNS)
- Check dependencies (database, APIs)
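
The ordering above can be sketched as a triage helper that runs checks in priority order and stops at the first failure. Only the disk and port probes are implemented here; the process, log, and dependency checks are left to the caller. All function names and the 90% disk threshold are assumptions for illustration.

```python
import shutil
import socket


def disk_ok(path="/", max_used_fraction=0.9):
    """Resource check: is disk usage below the given fraction?"""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used_fraction


def port_open(host, port, timeout=1.0):
    """Network check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(steps):
    """Run (label, check) pairs in order; return the first failing label.

    Returns None when every check passes. A check that raises counts
    as a failure, so one broken probe cannot abort the triage.
    """
    for label, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return label
    return None


if __name__ == "__main__":
    steps = [
        ("resources: disk", disk_ok),
        ("network: local port 8080", lambda: port_open("127.0.0.1", 8080)),
    ]
    print("first failing step:", first_failure(steps))
```

Stopping at the first failure mirrors the priority list: there is little value in probing the database if the process itself is not running.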
8. Anti-Patterns
| ❌ Don't | ✅ Do |
|---|---|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
Remember: A well-managed server is boring. That's the goal.
When to Use
Use this skill when carrying out the workflows or actions described in the overview.