This workflow analyzes a specific AWS resource to assess its health status, diagnose potential issues using CloudWatch logs and metrics, and develop a comprehensive remediation plan for any problems discovered.
Fetch https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ for monitoring and troubleshooting guidance to inform the diagnostic approach.
Locate the target resource using the appropriate AWS CLI command for its type:
# EC2
aws ec2 describe-instances --filters "Name=tag:Name,Values=<name>"
# Lambda
aws lambda get-function --function-name <name>
# RDS
aws rds describe-db-instances --db-instance-identifier <name>
# ECS
aws ecs describe-services --cluster <cluster> --services <name>
# ALB
aws elbv2 describe-load-balancers --names <name>
# DynamoDB
aws dynamodb describe-table --table-name <name>
# SQS
aws sqs get-queue-attributes --queue-url <url> --attribute-names All
# API Gateway
aws apigatewayv2 get-apis
If multiple matches are found, prompt the user to specify region/account.
Run service-specific health checks:
# EC2
aws ec2 describe-instance-status --instance-ids <id>
# RDS
aws rds describe-db-instances --db-instance-identifier <name> \
--query 'DBInstances[0].DBInstanceStatus'
# Lambda - error rate over 24h
aws cloudwatch get-metric-statistics --namespace AWS/Lambda \
--metric-name Errors --dimensions Name=FunctionName,Value=<name> \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 --statistics Sum
# ECS
aws ecs describe-services --cluster <cluster> --services <name> \
--query 'services[0].[status,runningCount,desiredCount,pendingCount]'
Key health indicators by service type:
Find log groups and run CloudWatch Logs Insights queries:
# Find log groups
aws logs describe-log-groups --log-group-name-prefix /aws/<service>/<name>
# Start a query (last 24h errors)
aws logs start-query \
--log-group-name /aws/lambda/<name> \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @message like /ERROR/ | stats count(*) as errorCount by bin(1h)'
# Get results
aws logs get-query-results --query-id <id>
# Lambda cold starts
aws logs start-query \
--log-group-name /aws/lambda/<name> \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @type = "REPORT" | filter @initDuration > 0 | stats count() as coldStarts by bin(1h)'
# RDS Performance Insights (if enabled)
aws pi get-resource-metrics \
--service-type RDS --identifier db:<identifier> \
--metric-queries '[{"Metric":"db.load.avg"}]' \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period-in-seconds 3600
Identify: recurring error patterns, correlation with deployments (CloudTrail), performance trends, dependency failures.
Severity:
Root Cause Categories:
Immediate Actions (Critical):
# Lambda throttling โ increase reserved concurrency
aws lambda put-reserved-concurrency \
--function-name <name> --reserved-concurrent-executions 100
# RDS connection exhaustion โ reboot to reset connections
aws rds reboot-db-instance --db-instance-identifier <name>
Short-term Fixes (High/Medium): Configuration adjustments, right-sizing, CloudWatch alarm improvements, IAM corrections.
Long-term Improvements: Architectural changes for resilience, preventive monitoring, enable AWS Health Dashboard notifications via EventBridge.
Present findings:
๐ฅ AWS Resource Health Assessment
๐ Resource Overview:
โข Resource: [Name] ([Type])
โข Status: [Healthy/Warning/Critical]
โข Region: [Region] | Account: [Account ID]
๐จ Issues Identified:
โข Critical: X | High: Y | Medium: Z | Low: N
๐ Top Issues:
1. [Issue]: [Description] โ Impact: [High/Medium/Low]
2. [Issue]: [Description] โ Impact: [High/Medium/Low]
๐ ๏ธ Remediation: X immediate, Y short-term, Z long-term actions
โ Proceed with detailed remediation plan? (y/n)
Then generate a full markdown report covering: health metrics, issues with root cause analysis, phased remediation steps with AWS CLI commands, CloudWatch alarm recommendations, and validation checklist.
aws configure
logs:*, cloudwatch:*, pi:*)