Databricks CI Integration

databricks-ci-integration
v20260311

Automate Databricks Asset Bundle validation, unit testing, staging deployment, and production release through GitHub Actions; suited to teams that need to bring Databricks projects into a stable CI/CD workflow.

Overview

Automate Databricks deployments with Asset Bundles and GitHub Actions. Covers bundle validation, unit testing notebooks, deploying to staging/production, and integration testing against Databricks workspaces.

Prerequisites

  • Databricks workspace with service principal
  • Databricks CLI v0.200+ installed
  • GitHub secrets for authentication
  • Asset Bundle (databricks.yml) configured
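
The prerequisites assume an existing Asset Bundle. As a hedged illustration only, a minimal databricks.yml with staging and prod targets might look like the following; the workspace hosts and notebook path are placeholders, the job name daily_etl is taken from Step 3, and cluster settings are omitted for brevity:

```yaml
# databricks.yml — minimal sketch (hosts and paths are placeholders)
bundle:
  name: databricks-ci-integration

targets:
  staging:
    mode: development
    workspace:
      host: https://<staging-workspace>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com

resources:
  jobs:
    daily_etl:
      name: daily_etl
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/etl_notebook.py
```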

Instructions

Step 1: GitHub Actions for Bundle Validation

# .github/workflows/databricks-ci.yml
name: Databricks CI

on:
  pull_request:
    paths:
      - 'src/**'
      - 'resources/**'
      - 'databricks.yml'
      - 'tests/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: databricks bundle validate

      - name: Run Python unit tests
        run: |
          pip install pytest pyspark delta-spark
          pytest tests/unit/ -v --tb=short

  deploy-staging:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: databricks bundle deploy --target staging

Step 2: Unit Tests for Notebooks

# tests/unit/test_transformations.py
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").getOrCreate()

def test_clean_events(spark):
    """Test silver layer cleaning logic."""
    from src.transformation.silver_clean_events import clean_events

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("timestamp", StringType()),
    ])

    raw_data = [
        ("1", "user-1", "click", "2024-01-01T00:00:00Z"),  # 2024 year
        ("1", "user-1", "click", "2024-01-01T00:00:00Z"),  # Duplicate
        ("2", None, "click", "2024-01-02T00:00:00Z"),       # Null user
    ]

    df = spark.createDataFrame(raw_data, schema)
    result = clean_events(df)

    assert result.count() == 1  # Deduped and nulls removed
    assert result.first()["user_id"] == "user-1"

def test_aggregate_metrics(spark):
    """Test gold layer aggregation."""
    from src.aggregation.gold_daily_metrics import aggregate_daily

    # Create test data...
    result = aggregate_daily(spark, "2024-01-01")
    assert result.count() > 0
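
The first test above assumes a clean_events transformation that deduplicates on event_id and drops rows with a null user_id. The production version would use PySpark DataFrame operations (dropDuplicates, filter); as a hedged sketch of the expected contract only, the same semantics over plain Python tuples (the function and field names come from the test, everything else is an assumption):

```python
# Sketch of the cleaning contract clean_events is expected to satisfy:
# keep one row per event_id, then drop rows with a null user_id.
# Illustrative only — the real function operates on a Spark DataFrame.

def clean_events(rows):
    """Deduplicate by event_id and drop rows whose user_id is None."""
    seen_ids = set()
    cleaned = []
    for event_id, user_id, event_type, timestamp in rows:
        if event_id in seen_ids:
            continue  # duplicate event_id, keep the first occurrence
        seen_ids.add(event_id)
        if user_id is None:
            continue  # null user, drop the row
        cleaned.append((event_id, user_id, event_type, timestamp))
    return cleaned

raw = [
    ("1", "user-1", "click", "2024-01-01T00:00:00Z"),
    ("1", "user-1", "click", "2024-01-01T00:00:00Z"),  # duplicate
    ("2", None, "click", "2024-01-02T00:00:00Z"),      # null user
]
result = clean_events(raw)
assert len(result) == 1 and result[0][1] == "user-1"
```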

Step 3: Deploy to Production on Merge

# .github/workflows/databricks-deploy.yml
name: Databricks Deploy

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'resources/**'
      - 'databricks.yml'

jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_PROD }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
        run: databricks bundle validate --target prod

      - name: Deploy to production
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_PROD }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
        run: |
          databricks bundle deploy --target prod
          databricks bundle run daily_etl --target prod --no-wait

Step 4: Integration Tests

  # Add this job to .github/workflows/databricks-ci.yml
  integration-tests:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Run integration test job
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          # Run the test notebook on staging
          databricks bundle run integration_tests --target staging

      - name: Verify output tables
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          # Query the row count via the SQL Statement Execution API
          # (replace <warehouse-id> with a SQL warehouse ID)
          databricks api post /api/2.0/sql/statements --json '{
            "warehouse_id": "<warehouse-id>",
            "statement": "SELECT COUNT(*) FROM staging_catalog.silver.clean_events WHERE date = current_date()",
            "wait_timeout": "30s"
          }'

Error Handling

Issue                    Cause                   Solution
------------------------ ----------------------- ------------------------------------------------
Bundle validation fails  Invalid YAML            Run databricks bundle validate locally first
Auth error in CI         Token expired           Use a service principal with OIDC federation
Test cluster timeout     Cluster not started     Increase the timeout or use an existing cluster
Deploy conflict          Concurrent deployments  Use GitHub environments with a concurrency limit
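
For the "Deploy conflict" row, a GitHub Actions concurrency group serializes runs against one target; for example, added at the top level of the deploy workflow (the group name is illustrative):

```yaml
# Queue concurrent runs instead of letting two deployments
# race against the same Databricks target.
concurrency:
  group: databricks-deploy-prod
  cancel-in-progress: false
```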

Examples

Quick Local Validation

# Validate the bundle and inspect what would be deployed
databricks bundle validate --target staging
databricks bundle summary --target staging

Output

  • Configuration files or code changes applied to the project
  • Validation report confirming correct implementation
  • Summary of changes made and their rationale
Info
Category: Programming & Development
Name: databricks-ci-integration
Version: v20260311
Updated: 2026-03-12