技能 数据科学 使用DuckDB查询Fabric湖仓数据

使用DuckDB查询Fabric湖仓数据

v20260619
using-duckdb
该技能允许用户使用DuckDB查询OneLake中存储的Delta Lake表和原始文件。它支持两种运行模式:本地命令行或Fabric Notebook环境。适用于数据探索、数据质量验证、Schema发现以及跨Fabric生态系统的即时数据分析。
获取技能
280 次下载
概览

Using DuckDB with Fabric

Query Delta Lake tables and raw files in OneLake using DuckDB. Works both locally (CLI/Python) and inside Fabric notebooks. Read-only; for writes, use the executing-spark skill.

Two Modes

Mode Where it runs Auth Best for
Local Developer machine Azure CLI (az login) Exploration, validation, ad-hoc analysis
In-notebook Fabric Spark container notebookutils.credentials.getToken('storage') Combining DuckDB speed with Spark write-back

Local: Prerequisites

  • DuckDB installed (brew install duckdb on macOS)
  • Azure CLI authenticated (az login)
  • Extensions installed: INSTALL delta; INSTALL azure; (one-time)

Local: Querying Delta Tables

WS_ID=$(fab get "Workspace.Workspace" -q "id" | tr -d '"')
LH_ID=$(fab get "Workspace.Workspace/LH.Lakehouse" -q "id" | tr -d '"')

duckdb -c "
LOAD delta; LOAD azure;
CREATE SECRET (TYPE azure, PROVIDER credential_chain, CHAIN 'cli');

SELECT * FROM delta_scan(
  'abfss://${WS_ID}@onelake.dfs.fabric.microsoft.com/${LH_ID}/Tables/schema/table'
) LIMIT 10;
"

The CHAIN 'cli' parameter uses Azure CLI credentials. Without it, DuckDB tries managed identity first (fails on local machines).

Local: Querying Raw Files

BASE="abfss://${WS_ID}@onelake.dfs.fabric.microsoft.com/${LH_ID}/Files"

duckdb -c "
LOAD azure;
CREATE SECRET (TYPE azure, PROVIDER credential_chain, CHAIN 'cli');

SELECT * FROM read_csv('${BASE}/data.csv') LIMIT 10;
SELECT * FROM read_parquet('${BASE}/facts.parquet') LIMIT 10;
SELECT * FROM read_json('${BASE}/events/*.json');
"

Glob patterns (*, **) work for reading multiple files.

In-Notebook: Attaching DuckDB to a Lakehouse

Inside a Fabric notebook, DuckDB can query lakehouse Delta tables directly using a storage token. This approach is faster than Spark SQL for analytical queries on single-node data.

import duckdb
import time

# Get storage token from notebook context
token = notebookutils.credentials.getToken('storage')

# Create DuckDB connection
con = duckdb.connect(f'temp_{time.time_ns()}.duckdb')
con.sql('SET enable_object_cache=true')

# Register OneLake secret
con.sql(f"""
    CREATE OR REPLACE SECRET onelake (
        TYPE AZURE,
        PROVIDER ACCESS_TOKEN,
        ACCESS_TOKEN '{token}'
    )
""")

# Query Delta tables
workspace = "<workspace-id>"
lakehouse = "<lakehouse-name>"
path = f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/Tables"

df = con.sql(f"""
    SELECT * FROM delta_scan('{path}/schema/table_name') LIMIT 100
""").df()
print(df)

Auto-Discovering Tables

Dynamically find all Delta tables in a lakehouse:

tables = con.sql(f"""
    SELECT DISTINCT split_part(file, '_delta_log', 1) as table_path
    FROM glob('{path}/*/*/*_delta_log/*.json')
""").df()['table_path'].tolist()

for t in tables:
    view_name = t.split('/')[-1]
    con.sql(f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM delta_scan('{t}')")
    print(f"Created view: {view_name}")

OneLake Path Format

abfss://<workspace-id>@onelake.dfs.fabric.microsoft.com/<item-id>/Tables/<schema>/<table>
abfss://<workspace-id>@onelake.dfs.fabric.microsoft.com/<item-id>/Files/<path>
Item type ID source
Lakehouse fab get "ws/LH.Lakehouse" -q "id"
Warehouse fab get "ws/WH.Warehouse" -q "id"
SQL Database fab get "ws/DB.SQLDatabase" -q "id"

Cross-item joins work in a single DuckDB query; use different abfss:// paths.

Common Patterns

For data freshness checks, quality validation, schema discovery, cross-table joins, and row count audits, see references/common-patterns.md.

References

信息
Category 数据科学
Name using-duckdb
版本 v20260619
大小 4.03KB
更新时间 2026-06-20
语言