下载

安全上线部署

v20260420

shipping-and-launch

指导团队通过代码质量、安全、性能、基础设施、文档、功能开关与分阶段灰度发布等检查流程，确保生产上线可控、可观测并可回滚。

发布部署监控功能开关灰度清单

获取技能

234 次下载

概览

Shipping and Launch

Overview

Ship with confidence. The goal is not just to deploy — it's to deploy safely, with monitoring in place, a rollback plan ready, and a clear understanding of what success looks like. Every launch should be reversible, observable, and incremental.

When to Use

Deploying a feature to production for the first time
Releasing a significant change to users
Migrating data or infrastructure
Opening a beta or early access program
Any deployment that carries risk (all of them)

The Pre-Launch Checklist

Code Quality

All tests pass (unit, integration, e2e)
Build succeeds with no warnings
Lint and type checking pass
Code reviewed and approved
No TODO comments that should be resolved before launch
No console.log debugging statements in production code
Error handling covers expected failure modes

Security

No secrets in code or version control
npm audit shows no critical or high vulnerabilities
Input validation on all user-facing endpoints
Authentication and authorization checks in place
Security headers configured (CSP, HSTS, etc.)
Rate limiting on authentication endpoints
CORS configured to specific origins (not wildcard)

Performance

Core Web Vitals within "Good" thresholds
No N+1 queries in critical paths
Images optimized (compression, responsive sizes, lazy loading)
Bundle size within budget
Database queries have appropriate indexes
Caching configured for static assets and repeated queries

Accessibility

Keyboard navigation works for all interactive elements
Screen reader can convey page content and structure
Color contrast meets WCAG 2.1 AA (4.5:1 for text)
Focus management correct for modals and dynamic content
Error messages are descriptive and associated with form fields
No accessibility warnings in axe-core or Lighthouse

Infrastructure

Environment variables set in production
Database migrations applied (or ready to apply)
DNS and SSL configured
CDN configured for static assets
Logging and error reporting configured
Health check endpoint exists and responds

Documentation

README updated with any new setup requirements
API documentation current
ADRs written for any architectural decisions
Changelog updated
User-facing documentation updated (if applicable)

Feature Flag Strategy

Ship behind feature flags to decouple deployment from release:

// Feature flag check
const flags = await getFeatureFlags(userId);

if (flags.taskSharing) {
  // New feature: task sharing
  return <TaskSharingPanel task={task} />;
}

// Default: existing behavior
return null;

Feature flag lifecycle:

1. DEPLOY with flag OFF     → Code is in production but inactive
2. ENABLE for team/beta     → Internal testing in production environment
3. GRADUAL ROLLOUT          → 5% → 25% → 50% → 100% of users
4. MONITOR at each stage    → Watch error rates, performance, user feedback
5. CLEAN UP                 → Remove flag and dead code path after full rollout

Rules:

Every feature flag has an owner and an expiration date
Clean up flags within 2 weeks of full rollout
Don't nest feature flags (creates exponential combinations)
Test both flag states (on and off) in CI

Staged Rollout

The Rollout Sequence

1. DEPLOY to staging
   └── Full test suite in staging environment
   └── Manual smoke test of critical flows

2. DEPLOY to production (feature flag OFF)
   └── Verify deployment succeeded (health check)
   └── Check error monitoring (no new errors)

3. ENABLE for team (flag ON for internal users)
   └── Team uses the feature in production
   └── 24-hour monitoring window

4. CANARY rollout (flag ON for 5% of users)
   └── Monitor error rates, latency, user behavior
   └── Compare metrics: canary vs. baseline
   └── 24-48 hour monitoring window
   └── Advance only if all thresholds pass (see table below)

5. GRADUAL increase (25% -> 50% -> 100%)
   └── Same monitoring at each step
   └── Ability to roll back to previous percentage at any point

6. FULL rollout (flag ON for all users)
   └── Monitor for 1 week
   └── Clean up feature flag

Rollout Decision Thresholds

Use these thresholds to decide whether to advance, hold, or roll back at each stage:

Metric	Advance (green)	Hold and investigate (yellow)	Roll back (red)
Error rate	Within 10% of baseline	10-100% above baseline	>2x baseline
P95 latency	Within 20% of baseline	20-50% above baseline	>50% above baseline
Client JS errors	No new error types	New errors at <0.1% of sessions	New errors at >0.1% of sessions
Business metrics	Neutral or positive	Decline <5% (may be noise)	Decline >5%

When to Roll Back

Roll back immediately if:

Error rate increases by more than 2x baseline
P95 latency increases by more than 50%
User-reported issues spike
Data integrity issues detected
Security vulnerability discovered

Monitoring and Observability

What to Monitor

Application metrics:
├── Error rate (total and by endpoint)
├── Response time (p50, p95, p99)
├── Request volume
├── Active users
└── Key business metrics (conversion, engagement)

Infrastructure metrics:
├── CPU and memory utilization
├── Database connection pool usage
├── Disk space
├── Network latency
└── Queue depth (if applicable)

Client metrics:
├── Core Web Vitals (LCP, INP, CLS)
├── JavaScript errors
├── API error rates from client perspective
└── Page load time

Error Reporting

// Set up error boundary with reporting
class ErrorBoundary extends React.Component {
  componentDidCatch(error: Error, info: React.ErrorInfo) {
    // Report to error tracking service
    reportError(error, {
      componentStack: info.componentStack,
      userId: getCurrentUser()?.id,
      page: window.location.pathname,
    });
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallback onRetry={() => this.setState({ hasError: false })} />;
    }
    return this.props.children;
  }
}

// Server-side error reporting
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  reportError(err, {
    method: req.method,
    url: req.url,
    userId: req.user?.id,
  });

  // Don't expose internals to users
  res.status(500).json({
    error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' },
  });
});

Post-Launch Verification

In the first hour after launch:

1. Check health endpoint returns 200
2. Check error monitoring dashboard (no new error types)
3. Check latency dashboard (no regression)
4. Test the critical user flow manually
5. Verify logs are flowing and readable
6. Confirm rollback mechanism works (dry run if possible)

Rollback Strategy

Every deployment needs a rollback plan before it happens:

## Rollback Plan for [Feature/Release]

### Trigger Conditions
- Error rate > 2x baseline
- P95 latency > [X]ms
- User reports of [specific issue]

### Rollback Steps
1. Disable feature flag (if applicable)
   OR
1. Deploy previous version: `git revert <commit> && git push`
2. Verify rollback: health check, error monitoring
3. Communicate: notify team of rollback

### Database Considerations
- Migration [X] has a rollback: `npx prisma migrate rollback`
- Data inserted by new feature: [preserved / cleaned up]

### Time to Rollback
- Feature flag: < 1 minute
- Redeploy previous version: < 5 minutes
- Database rollback: < 15 minutes

Common Rationalizations

Rationalization	Reality
"It works in staging, it'll work in production"	Production has different data, traffic patterns, and edge cases. Monitor after deploy.
"We don't need feature flags for this"	Every feature benefits from a kill switch. Even "simple" changes can break things.
"Monitoring is overhead"	Not having monitoring means you discover problems from user complaints instead of dashboards.
"We'll add monitoring later"	Add it before launch. You can't debug what you can't see.
"Rolling back is admitting failure"	Rolling back is responsible engineering. Shipping a broken feature is the failure.

Red Flags

Deploying without a rollback plan
No monitoring or error reporting in production
Big-bang releases (everything at once, no staging)
Feature flags with no expiration or owner
No one monitoring the deploy for the first hour
Production environment configuration done by memory, not code
"It's Friday afternoon, let's ship it"

Verification

Before deploying:

Pre-launch checklist completed (all sections green)
Feature flag configured (if applicable)
Rollback plan documented
Monitoring dashboards set up
Team notified of deployment

After deploying:

Health check returns 200
Error rate is normal
Latency is normal
Critical user flow works
Logs are flowing
Rollback tested or verified ready

信息

Category 编程开发

Name shipping-and-launch

版本 v20260420

大小 9.68KB

Source addyosmani/agent-skills

更新时间 2026-04-24

语言

简体中文

English