Azure Resource Health & Issue Diagnosis
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
Prerequisites
- Azure MCP server configured and authenticated
- Target Azure resource identified (name and optionally resource group/subscription)
- Resource must be deployed and running to generate logs/telemetry
- Prefer Azure MCP tools (
azmcp-*) over direct Azure CLI when available
Workflow Steps
Step 1: Get Azure Best Practices
Action: Retrieve diagnostic and troubleshooting best practices
Tools: Azure MCP best practices tool
Process:
-
Load Best Practices:
- Execute Azure best practices tool to get diagnostic guidelines
- Focus on health monitoring, log analysis, and issue resolution patterns
- Use these practices to inform diagnostic approach and remediation recommendations
Step 2: Resource Discovery & Identification
Action: Locate and identify the target Azure resource
Tools: Azure MCP tools + Azure CLI fallback
Process:
-
Resource Lookup:
- If only resource name provided: Search across subscriptions using
azmcp-subscription-list
- Use
az resource list --name <resource-name> to find matching resources
- If multiple matches found, prompt user to specify subscription/resource group
- Gather detailed resource information:
- Resource type and current status
- Location, tags, and configuration
- Associated services and dependencies
-
Resource Type Detection:
- Identify resource type to determine appropriate diagnostic approach:
-
Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
-
Virtual Machines: System logs, performance counters, boot diagnostics
-
Cosmos DB: Request metrics, throttling, partition statistics
-
Storage Accounts: Access logs, performance metrics, availability
-
SQL Database: Query performance, connection logs, resource utilization
-
Application Insights: Application telemetry, exceptions, dependencies
-
Key Vault: Access logs, certificate status, secret usage
-
Service Bus: Message metrics, dead letter queues, throughput
Step 3: Health Status Assessment
Action: Evaluate current resource health and availability
Tools: Azure MCP monitoring tools + Azure CLI
Process:
-
Basic Health Check:
- Check resource provisioning state and operational status
- Verify service availability and responsiveness
- Review recent deployment or configuration changes
- Assess current resource utilization (CPU, memory, storage, etc.)
-
Service-Specific Health Indicators:
-
Web Apps: HTTP response codes, response times, uptime
-
Databases: Connection success rate, query performance, deadlocks
-
Storage: Availability percentage, request success rate, latency
-
VMs: Boot diagnostics, guest OS metrics, network connectivity
-
Functions: Execution success rate, duration, error frequency
Step 4: Log & Telemetry Analysis
Action: Analyze logs and telemetry to identify issues and patterns
Tools: Azure MCP monitoring tools for Log Analytics queries
Process:
-
Find Monitoring Sources:
- Use
azmcp-monitor-workspace-list to identify Log Analytics workspaces
- Locate Application Insights instances associated with the resource
- Identify relevant log tables using
azmcp-monitor-table-list
-
Execute Diagnostic Queries:
Use azmcp-monitor-log-query with targeted KQL queries based on resource type:
General Error Analysis:
// Recent errors and exceptions
union isfuzzy=true
AzureDiagnostics,
AppServiceHTTPLogs,
AppServiceAppLogs,
AzureActivity
| where TimeGenerated > ago(24h)
| where Level == "Error" or ResultType != "Success"
| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Performance Analysis:
// Performance degradation patterns
Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where avg_CounterValue > 80
Application-Specific Queries:
// Application Insights - Failed requests
requests
| where timestamp > ago(24h)
| where success == false
| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
| order by timestamp desc
// Database - Connection failures
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| where Category == "SQLSecurityAuditEvents"
| where action_name_s == "CONNECTION_FAILED"
| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
-
Pattern Recognition:
- Identify recurring error patterns or anomalies
- Correlate errors with deployment times or configuration changes
- Analyze performance trends and degradation patterns
- Look for dependency failures or external service issues
Step 5: Issue Classification & Root Cause Analysis
Action: Categorize identified issues and determine root causes
Process:
-
Issue Classification:
-
Critical: Service unavailable, data loss, security breaches
-
High: Performance degradation, intermittent failures, high error rates
-
Medium: Warnings, suboptimal configuration, minor performance issues
-
Low: Informational alerts, optimization opportunities
-
Root Cause Analysis:
-
Configuration Issues: Incorrect settings, missing dependencies
-
Resource Constraints: CPU/memory/disk limitations, throttling
-
Network Issues: Connectivity problems, DNS resolution, firewall rules
-
Application Issues: Code bugs, memory leaks, inefficient queries
-
External Dependencies: Third-party service failures, API limits
-
Security Issues: Authentication failures, certificate expiration
-
Impact Assessment:
- Determine business impact and affected users/systems
- Evaluate data integrity and security implications
- Assess recovery time objectives and priorities
Step 6: Generate Remediation Plan
Action: Create a comprehensive plan to address identified issues
Process:
-
Immediate Actions (Critical issues):
- Emergency fixes to restore service availability
- Temporary workarounds to mitigate impact
- Escalation procedures for complex issues
-
Short-term Fixes (High/Medium issues):
- Configuration adjustments and resource scaling
- Application updates and patches
- Monitoring and alerting improvements
-
Long-term Improvements (All issues):
- Architectural changes for better resilience
- Preventive measures and monitoring enhancements
- Documentation and process improvements
-
Implementation Steps:
- Prioritized action items with specific Azure CLI commands
- Testing and validation procedures
- Rollback plans for each change
- Monitoring to verify issue resolution
Step 7: User Confirmation & Report Generation
Action: Present findings and get approval for remediation actions
Process:
-
Display Health Assessment Summary:
๐ฅ Azure Resource Health Assessment
๐ Resource Overview:
โข Resource: [Name] ([Type])
โข Status: [Healthy/Warning/Critical]
โข Location: [Region]
โข Last Analyzed: [Timestamp]
๐จ Issues Identified:
โข Critical: X issues requiring immediate attention
โข High: Y issues affecting performance/reliability
โข Medium: Z issues for optimization
โข Low: N informational items
๐ Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]
๐ ๏ธ Remediation Plan:
โข Immediate Actions: X items
โข Short-term Fixes: Y items
โข Long-term Improvements: Z items
โข Estimated Resolution Time: [Timeline]
โ Proceed with detailed remediation plan? (y/n)
-
Generate Detailed Report:
# Azure Resource Health Report: [Resource Name]
**Generated**: [Timestamp]
**Resource**: [Full Resource ID]
**Overall Health**: [Status with color indicator]
## ๐ Executive Summary
[Brief overview of health status and key findings]
## ๐ Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]
## ๐จ Issues Identified
### Critical Issues
- **[Issue 1]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Business impact]
- **Immediate Action**: [Required steps]
### High Priority Issues
- **[Issue 2]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Performance/reliability impact]
- **Recommended Fix**: [Solution steps]
## ๐ ๏ธ Remediation Plan
### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
Phase 2: Short-term Fixes (2-24 hours)
# Performance and reliability improvements
[Azure CLI commands with explanations]
Phase 3: Long-term Improvements (1-4 weeks)
# Architectural and preventive measures
[Azure CLI commands and configuration changes]
๐ Monitoring Recommendations
-
Alerts to Configure: [List of recommended alerts]
-
Dashboards to Create: [Monitoring dashboard suggestions]
-
Regular Health Checks: [Recommended frequency and scope]
โ
Validation Steps
๐ Prevention Measures
- [Recommendations to prevent similar issues]
- [Process improvements]
- [Monitoring enhancements]
Error Handling
-
Resource Not Found: Provide guidance on resource name/location specification
-
Authentication Issues: Guide user through Azure authentication setup
-
Insufficient Permissions: List required RBAC roles for resource access
-
No Logs Available: Suggest enabling diagnostic settings and waiting for data
-
Query Timeouts: Break down analysis into smaller time windows
-
Service-Specific Issues: Provide generic health assessment with limitations noted
Success Criteria
- โ
Resource health status accurately assessed
- โ
All significant issues identified and categorized
- โ
Root cause analysis completed for major problems
- โ
Actionable remediation plan with specific steps provided
- โ
Monitoring and prevention recommendations included
- โ
Clear prioritization of issues by business impact
- โ
Implementation steps include validation and rollback procedures