Databricks Cheatsheet

Lakehouse Platform - Delta Lake, Unity Catalog, Spark SQL & MLflow

Platform: Databricks | Version: v3.0 | Updated: Jan 2025
💡 Databricks Lakehouse: combines the data warehouse and the data lake in one platform, with Delta Lake, Unity Catalog, and integrated MLflow.
🗄️ Delta Lake Operations
Delta Table Creation
Create a Delta table:
-- SQL
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_parquet;

# Python
df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/datalake/sales")
ACID Transactions
UPDATE, DELETE, MERGE operations:
-- UPDATE
UPDATE sales_delta
SET price = price * 1.1
WHERE category = 'Electronics';

-- MERGE (UPSERT)
MERGE INTO customers_target t
USING customers_source s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
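The same upsert can be written in Python with the Delta Lake API. A minimal sketch, assuming a source DataFrame named customers_source_df and the target table from the SQL example above:

# Python upsert via the Delta Lake API (delta-spark); customers_source_df is an assumed DataFrame
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "customers_target")
(target.alias("t")
    .merge(customers_source_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"email": "s.email"})
    .whenNotMatchedInsert(values={"customer_id": "s.customer_id", "email": "s.email"})
    .execute())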
Time Travel
View data history:
-- Show table history
DESCRIBE HISTORY sales_delta;

-- Data as of a specific version
SELECT * FROM sales_delta VERSION AS OF 5;

-- Data as of a specific timestamp
SELECT * FROM sales_delta TIMESTAMP AS OF '2024-01-15 10:30:00';
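Time travel also works from Python via reader options. A minimal sketch, reusing the Delta path from the creation example above:

# Read an older snapshot of a Delta table (versionAsOf / timestampAsOf)
df_v5 = (spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("/mnt/datalake/sales"))

df_ts = (spark.read.format("delta")
         .option("timestampAsOf", "2024-01-15 10:30:00")
         .load("/mnt/datalake/sales"))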
Optimization & Vacuum
Performance optimization:
-- Z-ordering for query performance
OPTIMIZE sales_delta ZORDER BY (customer_id, transaction_date);

-- Compact small files
OPTIMIZE sales_delta;

-- Clean up old versions (7-day retention)
VACUUM sales_delta RETAIN 168 HOURS;
๐Ÿ›๏ธ Unity Catalog - Unified Governance
Object | Command | Description
Catalog | CREATE CATALOG sales_catalog | Top-level container for data
Schema | CREATE SCHEMA sales_catalog.raw | Database within a catalog
Table | CREATE TABLE sales_catalog.raw.customers | Table within a schema
View | CREATE VIEW sales_catalog.analytics.daily_sales | View for data abstraction
Volume | CREATE VOLUME sales_catalog.raw.files | For unstructured data
Share | CREATE SHARE external_partners | Share data with external parties
-- Complete Unity Catalog setup
-- 1. Create the catalog
CREATE CATALOG IF NOT EXISTS enterprise_data
COMMENT 'Enterprise data catalog for all business units';

-- 2. Schemas for the different zones
CREATE SCHEMA enterprise_data.raw COMMENT 'Raw landing zone for source systems';
CREATE SCHEMA enterprise_data.staging COMMENT 'Staging zone for data transformation';
CREATE SCHEMA enterprise_data.analytics COMMENT 'Analytics zone for business reports';

-- 3. Table with managed storage
CREATE TABLE enterprise_data.raw.customers (
  customer_id BIGINT,
  first_name STRING,
  last_name STRING,
  email STRING,
  created_at TIMESTAMP
)
USING delta
COMMENT 'Customer master data'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
📊 Spark SQL - Databricks Notebooks
Notebook Basics
Notebook cell commands:
# Magic commands
%sql      # SQL cell
%python   # Python cell
%scala    # Scala cell
%r        # R cell
%sh       # Shell commands
%fs       # File system commands
%md       # Markdown documentation
# Widgets for interaction
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])
env = dbutils.widgets.get("environment")
dbutils Utilities
Handy Databricks utilities:
# File system operations
dbutils.fs.ls("/mnt/datalake")
dbutils.fs.cp("source", "destination")
dbutils.fs.mkdirs("/mnt/new_folder")

# Secrets management
secret = dbutils.secrets.get("scope", "key")

# Notebook workflows
dbutils.notebook.run("notebook_path", timeout_seconds=300, arguments={"param": "value"})

# Display helpers
display(df)  # Show DataFrame with interactive UI
Spark Configuration
Cluster and Spark config:
# Python - Spark session config
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()

-- SQL - Session config
SET spark.sql.shuffle.partitions = 200;
SET spark.databricks.delta.optimizeWrite.enabled = true;
⚙️ Workflows & Job Scheduling
Task Types
The different task types:
# Notebook task
{
  "task_key": "process_sales",
  "notebook_task": {
    "notebook_path": "/Shared/ETL/sales_pipeline",
    "base_parameters": {
      "start_date": "2024-01-01",
      "end_date": "2024-12-31"
    }
  },
  "timeout_seconds": 3600
}

# Spark JAR task
{
  "spark_jar_task": {
    "main_class_name": "com.company.ETLJob",
    "parameters": ["--date", "2024-01-01"]
  }
}
Schedule & Triggers
Job scheduling:
# Cron schedule (daily at 06:00)
{
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "Europe/Amsterdam",
    "pause_status": "UNPAUSED"
  }
}

# File arrival trigger
{
  "file_arrival": {
    "url": "dbfs:/mnt/landing/incoming/",
    "min_time_between_triggers_seconds": 300
  }
}
Cluster Configuration
Job cluster settings:
# New job cluster config (use either a fixed num_workers or autoscale, not both)
{
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "spark_conf": {
      "spark.databricks.delta.preview.enabled": "true"
    },
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
🤖 MLflow - Machine Learning
Functionality | Code | Description
Start run | mlflow.start_run() | Start a new MLflow run
Log parameters | mlflow.log_param("learning_rate", 0.01) | Log model parameters
Log metrics | mlflow.log_metric("accuracy", 0.95) | Log evaluation metrics
Log model | mlflow.sklearn.log_model(model, "model") | Save the model artifact
Autologging | mlflow.autolog() | Log automatically
Load model | mlflow.pyfunc.load_model("runs:/run_id/model") | Load a model for inference
# Complete MLflow example
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Select (or create) the experiment
mlflow.set_experiment("/Shared/churn_prediction")

with mlflow.start_run(run_name="rf_baseline"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)

    # Log model
    mlflow.sklearn.log_model(model, "churn_model")

    # Log artifact (feature importance plot)
    mlflow.log_artifact("feature_importance.png")
🔍 Databricks SQL - BI & Analytics
SQL Warehouse
SQL warehouse configuration:
Note: SQL warehouses are created via the workspace UI or the SQL Warehouses API, not with a SQL statement.
# Python (databricks-sdk); field names follow the SQL Warehouses API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.warehouses.create(
    name="analytics_wh",
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=5,
    auto_stop_mins=10,
)
-- Query warehouse metadata
SELECT * FROM system.compute.warehouses
WHERE state = 'RUNNING';
Query History
Monitor query performance:
-- Recent queries
SELECT
  query_text,
  execution_time,
  rows_produced,
  user_name,
  warehouse_id
FROM system.query.history
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL 1 DAY
ORDER BY execution_time DESC
LIMIT 100;
BI Integration
Connect Power BI/Tableau:
# Connection settings for Power BI
Server: https://<workspace>.cloud.databricks.com
Port: 443
HTTP Path: /sql/1.0/warehouses/<warehouse-id>
Authentication: Personal Access Token

# Python connection (databricks-sql-connector)
from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="dapi..."
)
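Once the connection is open, queries run through a standard DB-API cursor. A minimal sketch (the table name is illustrative):

# Run a query over the SQL warehouse connection opened above
cursor = connection.cursor()
cursor.execute("SELECT * FROM sales_delta LIMIT 10")
for row in cursor.fetchall():
    print(row)
cursor.close()
connection.close()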
⚡ Performance Optimization
Technique | Code | Impact
Partitioning | PARTITIONED BY (date DATE) | Faster date-range queries (see the sketch below)
Z-ordering | ZORDER BY customer_id | Better data locality
File size | spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB") | Optimal partition size
Caching | df.cache() or CACHE SELECT * FROM table | Faster repeated queries
Bloom filters | CREATE BLOOMFILTER INDEX ON TABLE sales FOR COLUMNS(customer_id) | Faster point lookups
Data skipping | TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = 32) | Statistics on more columns for better file pruning
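A minimal sketch of the partitioning technique from the table above, writing a Delta table partitioned by date (the path and column names are illustrative):

# Write a date-partitioned Delta table so date-range queries prune whole partitions
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("sale_date")
   .save("/mnt/datalake/sales_partitioned"))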
💡 Performance Tips: 1. Use the Photon engine for faster queries
2. Configure auto-scaling for variable workloads
3. Use the disk (Delta) cache for repeated queries (see the sketch below)
4. Optimize regularly with OPTIMIZE and ZORDER BY
5. Monitor performance with Query History
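A minimal sketch of tips 3 and 4 from a notebook (the config key follows the Databricks disk cache docs; the table name reuses the examples above):

# Tip 3: enable the disk (Delta) cache for repeated reads of the same files
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Tip 4: regular compaction plus Z-ordering on a frequently filtered column
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")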
๐Ÿ” Security & Governance
Row & Column Security
Fine-grained access control:
-- Row filter (dynamic): a boolean SQL UDF attached to the table
CREATE FUNCTION sales_region_filter(region_param STRING)
RETURN EXISTS (
  SELECT 1 FROM user_regions ur
  WHERE ur.user = CURRENT_USER() AND ur.region = region_param
);

ALTER TABLE sales SET ROW FILTER sales_region_filter ON (region);

-- Column mask (PII): a masking UDF attached to the column
CREATE FUNCTION email_mask(email STRING)
RETURN CASE
  WHEN CURRENT_USER() = 'admin@company.com' THEN email
  ELSE REGEXP_REPLACE(email, '(.)(.*)@', '***@')
END;

ALTER TABLE customers ALTER COLUMN email SET MASK email_mask;
Lineage & Auditing
Data lineage tracking:
-- Table details and metadata
DESCRIBE EXTENDED analytics.daily_sales;

-- Unity Catalog lineage via system tables
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'catalog.schema.table';
Backup & Recovery
Disaster recovery:
-- Clone for backup
CREATE TABLE sales_backup DEEP CLONE sales_production;

-- Restore to an earlier version
RESTORE TABLE sales TO VERSION AS OF 10;
# Export to external storage for off-platform backup (Python)
df.write.format("delta") \
    .save("abfss://container@storage.dfs.core.windows.net/backup/sales")
📋 Quick Reference - Daily Use
-- Most-used Databricks commands

-- 1. Read data from a Delta table by path
SELECT * FROM delta.`/mnt/datalake/sales`;

-- 2. Create a new Delta table
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_csv;

-- 3. Time travel query
SELECT * FROM sales_delta TIMESTAMP AS OF '2024-01-15';

-- 4. MERGE (UPSERT) operation
MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- 5. Performance optimization
OPTIMIZE sales_delta ZORDER BY (customer_id, date);

-- 6. Unity Catalog table
CREATE TABLE catalog.schema.table_name AS
SELECT * FROM source_table;

-- 7. Query MLflow runs (Python): mlflow.search_runs(experiment_ids=["123"])

How to use:

1. Print this page as a PDF (Ctrl+P → Save as PDF)

2. Keep it as a reference next to your Databricks workspace

3. Share it with your team for consistent development