Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.
Prerequisites:
- kubectl access to the cluster
- `CASTAI_API_KEY` configured

```shell
kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50
```
Causes and fixes:
- The cloud provider is not set correctly in Helm (`--set provider=eks|gke|aks`)
- Egress to `api.cast.ai` is not allowed by a firewall or network policy

```shell
# Check agent heartbeat
kubectl logs -n castai-agent deployment/castai-agent | grep -i "heartbeat\|connect\|error"

# Verify network connectivity from inside the cluster
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters
```

Fix: Restart the agent pod: `kubectl rollout restart deployment/castai-agent -n castai-agent`
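The HTTP status printed by the connectivity probe above tells you which failure mode you are in. A minimal sketch with a hypothetical helper (not part of CAST AI tooling) that turns the code into a diagnosis:

```shell
# classify_probe: map the "%{http_code}" output of the curl probe to a
# diagnosis. curl prints 000 when it never reached api.cast.ai at all.
classify_probe() {
  case "$1" in
    200)     echo "connectivity ok" ;;
    401|403) echo "reachable, but the API key is rejected" ;;
    000)     echo "no route to api.cast.ai - check egress rules" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Example: classify_probe 000  ->  no route to api.cast.ai - check egress rules
```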
```shell
# Test the API key
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-API-Key: ${CASTAI_API_KEY}" \
  https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401
```

Fix: Generate a new API key at console.cast.ai > API > API Access Keys.
```shell
# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Verify the unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.unschedulablePods'
```

Causes:
- `unschedulablePods.enabled` is false -- enable it
- `clusterLimits.cpu.maxCores` has been reached
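If `clusterLimits` is the suspect, compare the cluster's current vCPU total (for example, summed from `kubectl get nodes`) against `clusterLimits.cpu.maxCores`. A sketch with a hypothetical helper:

```shell
# cores_headroom: given the current total cores and the policy's maxCores,
# report whether the autoscaler still has room to add nodes.
cores_headroom() {
  local current=$1 max=$2
  if [ "$current" -ge "$max" ]; then
    echo "limit reached ($current/$max cores) - raise clusterLimits.cpu.maxCores"
  else
    echo "headroom: $((max - current)) of $max cores available"
  fi
}

# Example: cores_headroom 96 96  ->  limit reached
```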
```shell
# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.nodeDownscaler'
```

Causes:
- `nodeDownscaler.enabled` is false
- A PodDisruptionBudget is blocking eviction
- `emptyNodes.delaySeconds` has not elapsed yet
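To find which PodDisruptionBudgets are blocking eviction, look for PDBs whose status reports zero allowed disruptions. A sketch (assumes `jq`, which the checks above already use):

```shell
# blocking_pdbs: read `kubectl get pdb -A -o json` on stdin and print
# namespace/name of every PDB that currently allows zero disruptions.
blocking_pdbs() {
  jq -r '.items[]
         | select(.status.disruptionsAllowed == 0)
         | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Usage: kubectl get pdb -A -o json | blocking_pdbs
```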
```shell
# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.spotInstances'
```

Fix: Enable `spotDiversityEnabled: true` and set `spotDiversityPriceIncreaseLimitPercent` to 20-30 for better availability.
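After applying that fix, the relevant fields in the `.spotInstances` policy payload would look roughly like this (illustrative values; 25 sits inside the suggested 20-30 range):

```json
{
  "spotDiversityEnabled": true,
  "spotDiversityPriceIncreaseLimitPercent": 25
}
```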
Symptoms: Pods being evicted too frequently, service disruption.

```shell
kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20
```

Fix: Increase the evictor cycle interval or switch to non-aggressive mode:

```shell
helm upgrade castai-evictor castai-helm/castai-evictor \
  -n castai-agent \
  --set castai.apiKey="${CASTAI_API_KEY}" \
  --set castai.clusterID="${CASTAI_CLUSTER_ID}" \
  --set evictor.aggressiveMode=false \
  --set evictor.cycleInterval=600
```
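The same settings can be kept in a values file instead of repeated `--set` flags (assuming the chart accepts these keys, which the flags above imply):

```yaml
# evictor-values.yaml -- mirrors the --set flags above
castai:
  apiKey: "<CASTAI_API_KEY>"
  clusterID: "<CASTAI_CLUSTER_ID>"
evictor:
  aggressiveMode: false
  cycleInterval: 600
```

Apply with `helm upgrade castai-evictor castai-helm/castai-evictor -n castai-agent -f evictor-values.yaml`.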
```shell
terraform plan -var-file=environments/prod.tfvars

# If drift is detected:
terraform refresh -var-file=environments/prod.tfvars
```

Fix: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.
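Drift checks are easier to automate with `terraform plan -detailed-exitcode`, which exits 0 when state matches, 2 when drift is detected, and 1 on error. A sketch of interpreting that exit code:

```shell
# interpret_plan: translate terraform's -detailed-exitcode result.
interpret_plan() {
  case "$1" in
    0) echo "no drift" ;;
    2) echo "drift detected - reconcile before changing policies" ;;
    *) echo "plan failed" ;;
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -var-file=environments/prod.tfvars
#   interpret_plan $?
```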
```shell
# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10

# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
  --reuse-values
```
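To script the version check, compare the installed chart version from `helm list` against the newest one from `helm search repo`. A sketch with a hypothetical helper (assumes GNU `sort -V` for version ordering):

```shell
# chart_status: compare installed vs latest chart version strings.
chart_status() {
  local installed=$1 latest=$2
  if [ "$installed" = "$latest" ]; then
    echo "up to date"
  elif [ "$(printf '%s\n%s\n' "$installed" "$latest" | sort -V | head -n1)" = "$installed" ]; then
    echo "outdated ($installed < $latest) - run helm upgrade"
  else
    echo "installed ($installed) is newer than repo ($latest)"
  fi
}

# Example: chart_status 0.50.0 0.52.1  ->  outdated
```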
```shell
kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50
```

Causes:
- The workload is missing the `autoscaling.cast.ai/enabled: "true"` annotation
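A minimal manifest fragment showing where the annotation goes, assuming it is set on the workload's metadata (the Deployment name here is hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical workload name
  annotations:
    autoscaling.cast.ai/enabled: "true"
```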
For comprehensive diagnostics, see `castai-debug-bundle`.