
CAST AI Autoscaling Configuration

v20260423
castai-core-workflow-a
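This workflow guides users through configuring CAST AI autoscaling policies to achieve optimal cost management and resource utilization for Kubernetes clusters. It covers enabling Spot Instances, setting node downscaling and eviction rules, defining cluster limits, and creating workload-specific node templates with Terraform, so that resource allocation stays stable and cost-effective.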

CAST AI Core Workflow: Autoscaler & Policies

Overview

Primary workflow for CAST AI: configure autoscaler policies to optimize cluster costs. Covers enabling spot instances, configuring the node downscaler and evictor, setting cluster CPU/memory limits, and creating node templates for workload-specific requirements.

Prerequisites

  • Completed castai-install-auth with Phase 2 (cluster controller + evictor)
  • CASTAI_API_KEY and CASTAI_CLUSTER_ID set
  • Cluster in "ready" status
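
The environment-variable prerequisite above can be checked before any API call is made. The following is a minimal sketch; the function name check_castai_env is hypothetical and not part of any CAST AI tooling. It only confirms that the two variables are non-empty and does not verify that the key is valid or that the cluster is ready.

```shell
# Preflight sketch: confirm the variables named in the prerequisites are set.
# check_castai_env is a hypothetical helper name, not a CAST AI command.
check_castai_env() {
  castai_env_missing=0
  if [ -z "${CASTAI_API_KEY:-}" ]; then
    echo "missing: CASTAI_API_KEY" >&2
    castai_env_missing=1
  fi
  if [ -z "${CASTAI_CLUSTER_ID:-}" ]; then
    echo "missing: CASTAI_CLUSTER_ID" >&2
    castai_env_missing=1
  fi
  # Exit status 0 means both variables are present.
  return "${castai_env_missing}"
}
```

A typical use would be to run `check_castai_env || exit 1` at the top of any script that wraps the commands in this workflow.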

Instructions

Step 1: Read Current Policies

curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq .
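
Because the next step overwrites the policies with a PUT, it may be worth saving the current state first. The sketch below wraps the same endpoint as the command above; the helper names castai_policies_url and backup_policies are hypothetical, and the idea of restoring by sending the saved file back is an assumption about how the API treats its own output, not something this document confirms.

```shell
# Build the policies endpoint URL used in Step 1 for a given cluster ID.
# Both function names here are hypothetical helpers.
castai_policies_url() {
  printf 'https://api.cast.ai/v1/kubernetes/clusters/%s/policies' "$1"
}

# Save the current policies to a file so they can be inspected or restored later.
# Usage: backup_policies <output-file>
backup_policies() {
  # --fail makes curl return non-zero on HTTP errors instead of saving an error body.
  curl -sS --fail -H "X-API-Key: ${CASTAI_API_KEY}" \
    "$(castai_policies_url "${CASTAI_CLUSTER_ID}")" -o "$1"
}
```

For example, `backup_policies "policies-$(date +%Y%m%d-%H%M%S).json"` would write a timestamped snapshot to the current directory.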

Step 2: Enable Cost-Optimized Autoscaling

curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
  -H "Content-Type: application/json" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  -d '{
    "enabled": true,
    "unschedulablePods": {
      "enabled": true,
      "headroom": {
        "cpuPercentage": 10,
        "memoryPercentage": 10,
        "enabled": true
      }
    },
    "nodeDownscaler": {
      "enabled": true,
      "emptyNodes": {
        "enabled": true,
        "delaySeconds": 180
      }
    },
    "spotInstances": {
      "enabled": true,
      "clouds": ["aws"],
      "spotDiversityEnabled": true,
      "spotDiversityPriceIncreaseLimitPercent": 20
    },
    "clusterLimits": {
      "enabled": true,
      "cpu": {
        "minCores": 4,
        "maxCores": 100
      }
    }
  }'
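
A malformed payload is the cause this document lists for a 400 response, so a local check before sending can save a round trip. The sketch below assumes the payload above has been saved to a file; the function names valid_cpu_limits and put_policy are hypothetical. The min <= max rule is a common-sense sanity check assumed here, not a documented API constraint.

```shell
# Return 0 if both CPU limits are non-negative integers and min <= max.
# valid_cpu_limits is a hypothetical helper; the rule is an assumed sanity check.
valid_cpu_limits() {
  case "$1$2" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" -le "$2" ]
}

# PUT a policy file only after the local checks pass.
# Usage: put_policy <policy.json>
put_policy() {
  # "jq empty" parses the file and exits non-zero if the JSON is malformed.
  jq empty "$1" || return 1
  cpu_min="$(jq -r '.clusterLimits.cpu.minCores' "$1")"
  cpu_max="$(jq -r '.clusterLimits.cpu.maxCores' "$1")"
  if ! valid_cpu_limits "${cpu_min}" "${cpu_max}"; then
    echo "clusterLimits.cpu is missing or inconsistent" >&2
    return 1
  fi
  curl -sS --fail -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
    -H "Content-Type: application/json" \
    "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
    -d @"$1"
}
```

Keeping the payload in a file rather than inline also makes it easier to review changes to the policy in version control.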

Step 3: Configure Node Templates via Terraform

resource "castai_node_template" "spot_workers" {
  cluster_id = castai_eks_cluster.this.id
  name       = "spot-workers"
  is_default = false
  is_enabled = true

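  constraints {
    min_cpu                       = 2
    max_cpu                       = 16
    min_memory                    = 4096  # MiB
    max_memory                    = 65536 # MiB
    spot                          = true
    use_spot_fallbacks            = true # use on-demand nodes when spot capacity is unavailable
    fallback_restore_rate_seconds = 600  # wait before trying to move fallback nodes back to spot

    instance_families {
      include = ["m5", "m6i", "c5", "c6i", "r5", "r6i"]
    }

    architectures = ["amd64"]
  }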

  custom_labels = {
    "workload-type" = "batch"
  }
}

resource "castai_node_template" "gpu_ondemand" {
  cluster_id = castai_eks_cluster.this.id
  name       = "gpu-ondemand"
  is_default = false
  is_enabled = true

  constraints {
    spot              = false
    gpu_manufacturers = ["NVIDIA"]

    instance_families {
      include = ["p3", "p4d", "g4dn", "g5"]
    }
  }

  custom_labels = {
    "workload-type" = "gpu"
  }
}
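
A node template only takes effect when workloads are scheduled onto it. The sketch below prints a Deployment manifest that selects nodes by the workload-type label defined in custom_labels above. The function name batch_deployment_manifest and the example names passed to it are hypothetical. The scheduling.cast.ai/spot toleration is an assumption based on CAST AI's usual spot taint convention and should be checked against the cluster's actual node taints.

```shell
# Print a Deployment manifest pinned to nodes labelled workload-type=<value>.
# Usage: batch_deployment_manifest <name> <image> <workload-type>
# batch_deployment_manifest is a hypothetical helper name.
batch_deployment_manifest() {
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: $1
  template:
    metadata:
      labels:
        app: $1
    spec:
      nodeSelector:
        workload-type: $3
      tolerations:
        # Assumption: spot nodes carry the scheduling.cast.ai/spot taint.
        - key: scheduling.cast.ai/spot
          operator: Exists
      containers:
        - name: $1
          image: $2
EOF
}
```

Piping the output to kubectl, for instance `batch_deployment_manifest demo-batch busybox:1.36 batch | kubectl apply -f -`, would create the Deployment; the pods should then stay Pending until a node matching the spot-workers template is available.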

Step 4: Verify Autoscaler is Working

# Check if the autoscaler is processing nodes
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/external-clusters/${CASTAI_CLUSTER_ID}/nodes" \
  | jq '[.items[] | {name, instanceType, lifecycle, castaiManaged: .castaiManaged}]
        | group_by(.lifecycle)
        | map({lifecycle: .[0].lifecycle, count: length})'

# Expected: mix of spot and on-demand nodes
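
The per-lifecycle counts from the query above can be reduced to a single number for tracking over time. The sketch below computes the share of spot nodes as an integer percentage; spot_share is a hypothetical helper name, and the counts are expected to be read off the jq output by the caller.

```shell
# Integer percentage of spot nodes, rounded down.
# Usage: spot_share <spot_count> <total_count>
# spot_share is a hypothetical helper name.
spot_share() {
  # Avoid dividing by zero when the cluster reports no nodes.
  if [ "$2" -eq 0 ]; then
    echo 0
    return 0
  fi
  echo $(( $1 * 100 / $2 ))
}
```

With 3 spot nodes out of 4 total, `spot_share 3 4` prints 75. A value that stays at 0 after spot instances are enabled would suggest the spot policy is not being applied.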

Error Handling

Error                           Cause                    Solution
Policy update returns 400       Invalid policy JSON      Validate the payload with jq before sending
Nodes not scaling               Policy not enabled       Verify that .enabled is true in the policy
Spot instances not used         Provider not configured  Add the cloud provider to spotInstances.clouds
Node downscaler too aggressive  Low delay threshold      Increase nodeDownscaler.emptyNodes.delaySeconds
Cluster limit hit               maxCores too low         Increase clusterLimits.cpu.maxCores
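
The first row of the table can be turned into an automated hint when scripting policy updates. The sketch below maps an HTTP status code to a first troubleshooting step. The function name policy_status_hint is hypothetical; only the 400 case comes from this document, while the 401, 403, and 404 hints are assumptions based on general HTTP semantics rather than documented CAST AI behavior.

```shell
# Map the HTTP status of a policy update to a first troubleshooting hint.
# policy_status_hint is a hypothetical helper name.
policy_status_hint() {
  case "$1" in
    2??) echo "ok" ;;
    # From the error table: a 400 points at the payload.
    400) echo "invalid policy JSON: validate the payload with jq before sending" ;;
    # Assumed from general HTTP semantics, not from CAST AI documentation.
    401|403) echo "authentication problem: check CASTAI_API_KEY" ;;
    404) echo "cluster not found: check CASTAI_CLUSTER_ID" ;;
    *) echo "unexpected status $1" ;;
  esac
}
```

The status code itself can be captured by adding `-o /dev/null -w '%{http_code}'` to the curl command and passing the result to the function.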

Next Steps

For workload-level autoscaling, see castai-core-workflow-b.

Info
Category Programming and Development
Name castai-core-workflow-a
Version v20260423
Size 4.42KB
Updated 2026-04-28