AI驱动的智能故障排除流程

v20260509

incident-response-smart-fix

本技能提供了一个基于多智能体编排的复杂生产环境故障排除流程。它将AI代码助手、可观测性平台和自动化工具（如分布式追踪、Git bisect）结合，形成“分析-调查-修复-验证”的完整闭环。旨在指导用户解决跨系统的复杂Bug，显著降低平均恢复时间（MTTR），提升系统整体的韧性和稳定性。

AI 调试可观测性 DevOps 多智能体故障排查代码质量系统设计

获取技能

397 次下载

概览

Intelligent Issue Resolution with Multi-Agent Orchestration

[Extended thinking: This workflow implements a sophisticated debugging and resolution pipeline that leverages AI-assisted debugging tools and observability platforms to systematically diagnose and resolve production issues. The intelligent debugging strategy combines automated root cause analysis with human expertise, using modern 2024/2025 practices including AI code assistants (GitHub Copilot, Claude Code), observability platforms (Sentry, DataDog, OpenTelemetry), git bisect automation for regression tracking, and production-safe debugging techniques like distributed tracing and structured logging. The process follows a rigorous four-phase approach: (1) Issue Analysis Phase - error-detective and debugger agents analyze error traces, logs, reproduction steps, and observability data to understand the full context of the failure including upstream/downstream impacts, (2) Root Cause Investigation Phase - debugger and code-reviewer agents perform deep code analysis, automated git bisect to identify introducing commit, dependency compatibility checks, and state inspection to isolate the exact failure mechanism, (3) Fix Implementation Phase - domain-specific agents (python-pro, typescript-pro, rust-expert, etc.) implement minimal fixes with comprehensive test coverage including unit, integration, and edge case tests while following production-safe practices, (4) Verification Phase - test-automator and performance-engineer agents run regression suites, performance benchmarks, security scans, and verify no new issues are introduced. Complex issues spanning multiple systems require orchestrated coordination between specialist agents (database-optimizer → performance-engineer → devops-troubleshooter) with explicit context passing and state sharing. The workflow emphasizes understanding root causes over treating symptoms, implementing lasting architectural improvements, automating detection through enhanced monitoring and alerting, and preventing future occurrences through type system enhancements, static analysis rules, and improved error handling patterns. Success is measured not just by issue resolution but by reduced mean time to recovery (MTTR), prevention of similar issues, and improved system resilience.]

Use this skill when

Working on intelligent issue resolution with multi-agent orchestration tasks or workflows
Needing guidance, best practices, or checklists for intelligent issue resolution with multi-agent orchestration

Do not use this skill when

The task is unrelated to intelligent issue resolution with multi-agent orchestration
You need a different domain or tool outside this scope

Instructions

Clarify goals, constraints, and required inputs.
Apply relevant best practices and validate outcomes.
Provide actionable steps and verification.
If detailed examples are required, open resources/implementation-playbook.md.

Resources

resources/implementation-playbook.md for detailed patterns and examples.

Limitations

Use this skill only when the task clearly matches the scope described above.
Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

信息

Category 编程开发

Name incident-response-smart-fix

版本 v20260509

大小 13.12KB

Source sickn33/antigravity-awesome-skills

更新时间 2026-05-10