
Delta Lake ETL Pipeline

v20260311
databricks-core-workflow-a
Guides building Delta Lake ETL pipelines on Databricks with a Bronze/Silver/Gold layered design, covering Auto Loader ingestion, cleansing, aggregation, and optional Delta Live Tables quality guards, producing analytics-ready tables.

Databricks Core Workflow A: Delta Lake ETL

Overview

Build production Delta Lake ETL pipelines using medallion architecture (Bronze -> Silver -> Gold).

Prerequisites

  • Completed databricks-install-auth setup
  • Understanding of Delta Lake concepts
  • Unity Catalog configured (recommended)

Medallion Architecture

Raw Sources -> Bronze (Raw/Landing) -> Silver (Cleaned/Business Logic) -> Gold (Aggregated/Analytics Ready)

Instructions

Step 1: Bronze Layer - Raw Ingestion

Ingest raw data with metadata columns (_ingested_at, _source_file). Use mergeSchema for schema evolution. Use Auto Loader (cloudFiles) for streaming ingestion with schema inference.
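The Bronze step above can be sketched as a small Auto Loader job. Everything here is illustrative: the `spark` session comes from the Databricks runtime, and the source format, schema/checkpoint paths, and table names are assumptions.

```python
def ingest_to_bronze(spark, source_path: str, target_table: str):
    """Stream raw files into a Bronze Delta table via Auto Loader (sketch)."""
    # pyspark ships with the Databricks runtime; imported here so the
    # sketch stands alone outside a cluster
    from pyspark.sql import functions as F

    stream = (
        spark.readStream.format("cloudFiles")                # Auto Loader
        .option("cloudFiles.format", "json")                 # assumed source format
        .option("cloudFiles.schemaLocation", f"{source_path}/_schema")  # inference state
        .load(source_path)
        .withColumn("_ingested_at", F.current_timestamp())   # ingestion metadata
        .withColumn("_source_file", F.col("_metadata.file_path"))
    )
    return (
        stream.writeStream
        .option("checkpointLocation", f"{source_path}/_checkpoint")
        .option("mergeSchema", "true")                       # tolerate schema evolution
        .trigger(availableNow=True)                          # drain available files, then stop
        .toTable(target_table)
    )
```

The `availableNow` trigger lets the same streaming code run as a scheduled incremental batch job.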

Step 2: Silver Layer - Data Cleansing

Read from Bronze using Change Data Feed. Apply transformations: trim/lowercase strings, parse timestamps, hash PII, filter nulls, generate surrogate keys. Merge into Silver with upsert pattern.
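A minimal sketch of the Silver cleanse-and-upsert, assuming an `order_id` primary key, an `email` PII column, and Delta's `DeltaTable` merge API; for brevity it reads the full Bronze table rather than the Change Data Feed, and all names are illustrative.

```python
def transform_to_silver(spark, bronze_table: str, silver_table: str):
    """Cleanse Bronze rows and upsert them into Silver (sketch)."""
    # runtime-provided libraries; imported here so the sketch stands alone
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    cleaned = (
        spark.read.table(bronze_table)
        .withColumn("order_ts", F.to_timestamp("order_ts"))               # parse timestamps
        .withColumn("email_hash", F.sha2(F.lower(F.trim("email")), 256))  # hash PII
        .drop("email")
        .filter(F.col("order_id").isNotNull())                            # filter nulls
        .dropDuplicates(["order_id"])                                     # dedupe on PK
    )
    (
        DeltaTable.forName(spark, silver_table).alias("t")
        .merge(cleaned.alias("s"), "t.order_id = s.order_id")             # upsert pattern
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```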

Step 3: Gold Layer - Business Aggregations

Aggregate Silver data by business dimensions and time grain. Use partition-level overwrites for efficient updates.
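The Gold step can be sketched with `replaceWhere`, which overwrites only the partition for the run date instead of the whole table; the dimension and measure columns here are assumptions.

```python
def aggregate_to_gold(spark, silver_table: str, gold_table: str, run_date: str):
    """Recompute one day's aggregates and overwrite only that partition (sketch)."""
    # pyspark ships with the Databricks runtime; imported here so the
    # sketch stands alone outside a cluster
    from pyspark.sql import functions as F

    daily = (
        spark.read.table(silver_table)
        .filter(F.col("order_date") == run_date)             # one time grain
        .groupBy("region", "order_date")                     # business dimensions
        .agg(
            F.sum("amount").alias("total_amount"),
            F.countDistinct("order_id").alias("order_count"),
        )
    )
    (
        daily.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"order_date = '{run_date}'")  # partition-level overwrite
        .saveAsTable(gold_table)
    )
```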

Step 4: Delta Live Tables (Optional)

Define a declarative pipeline with @dlt.table decorators and data-quality expectations (@dlt.expect_or_drop).
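A hedged sketch of a DLT table definition; this code only runs inside a Delta Live Tables pipeline (the `dlt` module is Databricks-only), and the table and column names are illustrative.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleansed orders with quality expectations")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows failing the rule
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")                       # upstream DLT table
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .dropDuplicates(["order_id"])
    )
```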

See the detailed implementation for the complete Bronze/Silver/Gold pipeline code, Auto Loader configuration, DLT pipeline, and an orchestration example.

Output

  • Bronze layer with raw data and ingestion metadata
  • Silver layer with cleansed, deduplicated data
  • Gold layer with business aggregations
  • Delta Lake tables with ACID transactions

Error Handling

  • Schema mismatch: the source schema changed. Solution: write with the mergeSchema option.
  • Duplicate records: deduplication is missing. Solution: add merge logic keyed on primary keys.
  • Null values: upstream data quality issues. Solution: add expectations or filters in the Silver layer.
  • Memory errors: aggregations too large for the cluster. Solution: increase cluster size or partition the data.
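For the schema-mismatch case, a minimal sketch of an append that opts into schema evolution (the DataFrame and table name are illustrative):

```python
def append_with_schema_evolution(df, target_table: str):
    """Append a DataFrame whose schema may have gained columns (sketch)."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # add new source columns to the table schema
        .saveAsTable(target_table)
    )
```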

Examples

Quick Pipeline Run

# Full medallion pipeline
bronze.ingest_to_bronze(spark, "/mnt/landing/orders/", "catalog.bronze.orders")
silver.transform_to_silver(spark, "catalog.bronze.orders", "catalog.silver.orders", primary_keys=["order_id"])
gold.aggregate_to_gold(spark, "catalog.silver.orders", "catalog.gold.metrics", group_by_columns=["region"])

Next Steps

For ML workflows, see databricks-core-workflow-b.

Info
Category: Data Science
Name: databricks-core-workflow-a
Version: v20260311
Size: 3.56 KB
Updated: 2026-03-12