
Executing Spark Code Via Livy API

v20260619
executing-spark
Run arbitrary PySpark or pure Python code directly on Fabric Spark compute using the Livy API. Sessions are ephemeral: no notebook artifact is created or persisted, which suits automated ETL processes, programmatic data transformations, and agent-driven compute workflows. Sessions have full read/write access to lakehouse Delta tables via Spark SQL.
Overview

Executing Spark Code in Fabric (No Notebook)

Run arbitrary PySpark or Python code on Fabric Spark compute via the Livy API. No notebook artifact is created or persisted; sessions are ephemeral. Full read/write access to lakehouse Delta tables via Spark SQL.

Prerequisites

  • Azure CLI authenticated (az login)
  • A lakehouse in the target workspace (the Livy session runs against it)
  • Fabric capacity (F or trial)

Critical: Authentication

The Livy API requires a token from az account get-access-token --resource https://api.fabric.microsoft.com. Tokens from fab auth do not work for OneLake storage access inside the Spark session.

import json
import subprocess

# Request a Fabric-scoped token from the Azure CLI; check=True raises if az fails.
result = subprocess.run(
    ["az", "account", "get-access-token", "--resource", "https://api.fabric.microsoft.com"],
    capture_output=True, text=True, check=True,
)
token = json.loads(result.stdout)["accessToken"]

Do not output or log the token. Pass it directly to the API call.
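
One way to keep the token out of logs is to build the request headers in one place and redact them before anything is printed. The sketch below assumes the API accepts the token as a standard Bearer header; auth_headers and redact are hypothetical helper names, not part of any Fabric SDK.

```python
def auth_headers(token: str) -> dict:
    """Headers for a Livy API call (assumes a standard Bearer scheme)."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


def redact(headers: dict) -> dict:
    """Copy of headers that is safe to log: the Authorization value is masked."""
    return {
        key: "<redacted>" if key.lower() == "authorization" else value
        for key, value in headers.items()
    }
```

Log redact(headers) rather than headers whenever a request needs to be traced.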

Lifecycle

1. Create session   POST .../sessions              {"kind": "pyspark"}
2. Wait for idle    GET  .../sessions/{id}          poll until state: "idle" (~30-90s)
3. Submit code      POST .../sessions/{id}/statements   {"code": "...", "kind": "pyspark"}
4. Get result       GET  .../sessions/{id}/statements/{n}   poll until state: "available"
5. Delete session   DELETE .../sessions/{id}        ALWAYS do this

Base URL: https://api.fabric.microsoft.com/v1/workspaces/{wsId}/lakehouses/{lhId}/livyapi/versions/2023-12-01

CRITICAL: Always delete sessions when done. Idle sessions consume Fabric capacity units (CUs). A forgotten session burns compute until it times out (default: 20 minutes). In automation, wrap cleanup in a finally block.
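
The five lifecycle steps and the finally-based cleanup can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference client: it assumes the create-session and create-statement responses carry an "id" field, that polled resources expose "state", and that the terminal failure states match Apache Livy's. make_request, wait_for_state and run_code are hypothetical helper names; only the standard library is used.

```python
import json
import time
import urllib.request

# Terminal failure states from Apache Livy; assumed to apply to Fabric as well.
FAILED_STATES = {"error", "dead", "killed", "cancelled"}


def make_request(token):
    """Build request(method, url, body) on top of urllib (hypothetical helper)."""
    def request(method, url, body=None):
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            url, data=data, method=method,
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            raw = resp.read()
        return json.loads(raw) if raw else {}
    return request


def wait_for_state(request, url, target, poll_seconds, sleep, max_polls):
    """Poll url until its "state" equals target; fail fast on terminal states."""
    for _ in range(max_polls):
        body = request("GET", url)
        state = body.get("state")
        if state == target:
            return body
        if state in FAILED_STATES:
            raise RuntimeError(f"{url} entered state {state!r}")
        sleep(poll_seconds)
    raise TimeoutError(f"{url} did not reach {target!r}")


def run_code(request, base_url, code, poll_seconds=5, sleep=time.sleep, max_polls=120):
    """Create a session, run one statement, and always delete the session."""
    session = request("POST", f"{base_url}/sessions", {"kind": "pyspark"})
    session_url = f"{base_url}/sessions/{session['id']}"
    try:
        wait_for_state(request, session_url, "idle", poll_seconds, sleep, max_polls)
        stmt = request("POST", f"{session_url}/statements",
                       {"code": code, "kind": "pyspark"})
        stmt_url = f"{session_url}/statements/{stmt['id']}"
        done = wait_for_state(request, stmt_url, "available",
                              poll_seconds, sleep, max_polls)
        return done["output"]
    finally:
        # Runs on success, error, or timeout so the session never lingers.
        request("DELETE", session_url)
```

Passing the transport in as request keeps the lifecycle logic testable without network access; in real use, run_code(make_request(token), base_url, "spark.sql('SHOW TABLES').show()") would exercise all five steps.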

Getting IDs

WS_ID=$(fab get "Workspace.Workspace" -q "id" | tr -d '"')
LH_ID=$(fab get "Workspace.Workspace/Lakehouse.Lakehouse" -q "id" | tr -d '"')
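
Once the two IDs are known, the base URL can be assembled in Python. The sketch below is illustrative: livy_base_url is a hypothetical helper, and it assumes both IDs are GUIDs, so a malformed value raises ValueError before any request is sent.

```python
import uuid

LIVY_API_VERSION = "2023-12-01"


def livy_base_url(ws_id: str, lh_id: str) -> str:
    """Build the Livy base URL from workspace and lakehouse IDs."""
    # Strip whitespace and stray quotes, then validate the GUID shape.
    ws = uuid.UUID(ws_id.strip().strip('"'))
    lh = uuid.UUID(lh_id.strip().strip('"'))
    return (
        "https://api.fabric.microsoft.com/v1"
        f"/workspaces/{ws}/lakehouses/{lh}"
        f"/livyapi/versions/{LIVY_API_VERSION}"
    )
```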

Submitting Code

Submit PySpark or pure Python as statements. The spark object is available automatically.

# Statement payload
{"code": "df = spark.sql('SELECT * FROM products LIMIT 10')\ndf.show()", "kind": "pyspark"}

When a statement reaches state: "available" with output.status: "ok", its result text is in output.data["text/plain"].
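
Reading the result out of a statement response might look like the sketch below. statement_text is a hypothetical helper; the ename and evalue fields used for error reporting come from Apache Livy's error output and are assumed, not confirmed, to be present in Fabric's responses.

```python
def statement_text(statement: dict) -> str:
    """Return the text/plain result of a finished statement, or raise."""
    if statement.get("state") != "available":
        raise RuntimeError(f"statement not finished: state={statement.get('state')!r}")
    output = statement.get("output") or {}
    if output.get("status") != "ok":
        # ename/evalue follow Apache Livy's error shape (assumption).
        raise RuntimeError(f"{output.get('ename')}: {output.get('evalue')}")
    return output["data"]["text/plain"]
```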

What Works

  • spark.sql("SELECT ..."): full Spark SQL against lakehouse tables
  • spark.sql("SHOW TABLES"): metastore access
  • df.write.mode("overwrite").saveAsTable(...): write Delta tables
  • Pure Python (pandas, numpy, pyarrow): runs on the Spark container
  • In-memory Spark DataFrames and transformations
  • Multiple sequential statements in one session

What Does Not Work

  • deltalake (delta-rs) is not pre-installed; use Spark SQL instead
  • notebookutils has limited functionality (no FUSE mount at /lakehouse/default/)
  • Tokens from fab auth do not work; use an az CLI token
  • Tokens expire after ~60 minutes; long sessions need token refresh
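
For sessions that outlive a token, one option is a small cache that fetches a fresh token shortly before the old one expires. This is a sketch under stated assumptions: TokenCache is a hypothetical class, the 3600-second lifetime reflects the ~60 minute figure above rather than a value read from the token, and fetch is any zero-argument callable that returns a new token (for example, a wrapper around the az call).

```python
import time


class TokenCache:
    """Return a cached token and refetch it shortly before it expires."""

    def __init__(self, fetch, lifetime_seconds=3600, margin_seconds=300,
                 clock=time.monotonic):
        self._fetch = fetch
        self._lifetime = lifetime_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh on first use, or once inside the safety margin before expiry.
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token
```

Calling cache.get() before every request keeps the refresh logic out of the polling loops.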

When to Use This vs Alternatives

| Scenario | Approach |
|---|---|
| Quick read-only exploration | DuckDB locally (fastest; see using-duckdb skill) |
| Write data back to lakehouse | Livy session or notebook |
| Ephemeral transform, no artifact | Livy session (this skill) |
| Complex multi-cell workflow | Notebook (nb exec or portal) |
| Scheduled ETL | Notebook via fab job run |
| Agent-driven compute (Dagster, orchestrators) | Livy session |

References

  • references/livy-api.md -- Full API reference with endpoints, request/response formats, and error handling
  • references/example-script.md -- Complete working script that creates a session, queries data, writes results, and cleans up
Info
Category Data Science
Name executing-spark
Version v20260619
Size 5.54KB
Updated At 2026-06-20