Databricks Cheatsheet

Lakehouse Platform - Delta Lake, Unity Catalog, Spark SQL & MLflow

Platform: Databricks | Version: v3.0 | Updated: Jan 2025
💡 Databricks Lakehouse: combines the data warehouse and the data lake in one platform, with Delta Lake, Unity Catalog, and integrated MLflow.
🗄️ Delta Lake Operations
Delta Table Creation
Create a Delta table:
-- SQL
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_parquet;

# Python
df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/datalake/sales")
ACID Transactions
UPDATE, DELETE, MERGE operations:
-- UPDATE
UPDATE sales_delta
SET price = price * 1.1
WHERE category = 'Electronics';

-- MERGE (UPSERT)
MERGE INTO customers_target t
USING customers_source s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
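The same upsert can be written in Python with the Delta Lake API. A minimal sketch, assuming a source DataFrame named customers_source_df and the target table from the SQL example above:

# Python upsert via the Delta Lake API (delta-spark); customers_source_df is an assumed DataFrame
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "customers_target")
(target.alias("t")
    .merge(customers_source_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"email": "s.email"})
    .whenNotMatchedInsert(values={"customer_id": "s.customer_id", "email": "s.email"})
    .execute())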
Time Travel
View data history:
-- Show table history
DESCRIBE HISTORY sales_delta;

-- Data as of a specific version
SELECT * FROM sales_delta VERSION AS OF 5;

-- Data as of a specific timestamp
SELECT * FROM sales_delta TIMESTAMP AS OF '2024-01-15 10:30:00';
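Time travel also works from Python via reader options. A minimal sketch, reusing the Delta path from the creation example above:

# Read an older snapshot of a Delta table (versionAsOf / timestampAsOf)
df_v5 = (spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("/mnt/datalake/sales"))

df_ts = (spark.read.format("delta")
         .option("timestampAsOf", "2024-01-15 10:30:00")
         .load("/mnt/datalake/sales"))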
Optimization & Vacuum
Performance optimization:
-- Z-ordering for query performance
OPTIMIZE sales_delta ZORDER BY (customer_id, transaction_date);

-- Compact small files
OPTIMIZE sales_delta;

-- Clean up old versions (7-day retention)
VACUUM sales_delta RETAIN 168 HOURS;
๐Ÿ›๏ธ Unity Catalog - Unified Governance
Object | Command | Description
Catalog | CREATE CATALOG sales_catalog | Top-level container for data
Schema | CREATE SCHEMA sales_catalog.raw | Database within a catalog
Table | CREATE TABLE sales_catalog.raw.customers | Table within a schema
View | CREATE VIEW sales_catalog.analytics.daily_sales | View for data abstraction
Volume | CREATE VOLUME sales_catalog.raw.files | For unstructured data
Share | CREATE SHARE external_partners | Share data with external parties
-- Complete Unity Catalog setup
-- 1. Create the catalog
CREATE CATALOG IF NOT EXISTS enterprise_data
COMMENT 'Enterprise data catalog for all business units';

-- 2. Schemas for the different zones
CREATE SCHEMA enterprise_data.raw COMMENT 'Raw landing zone for source systems';
CREATE SCHEMA enterprise_data.staging COMMENT 'Staging zone for data transformation';
CREATE SCHEMA enterprise_data.analytics COMMENT 'Analytics zone for business reports';

-- 3. Table with managed storage
CREATE TABLE enterprise_data.raw.customers (
  customer_id BIGINT,
  first_name STRING,
  last_name STRING,
  email STRING,
  created_at TIMESTAMP
)
USING delta
COMMENT 'Customer master data'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
📊 Spark SQL - Databricks Notebooks
Notebook Basics
Notebook cell commands:
# Magic commands
%sql      # SQL cell
%python   # Python cell
%scala    # Scala cell
%r        # R cell
%sh       # Shell commands
%fs       # File system commands
%md       # Markdown documentation
# Widgets for interaction
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])
env = dbutils.widgets.get("environment")
dbutils Utilities
Handy Databricks utilities:
# File system operations
dbutils.fs.ls("/mnt/datalake")
dbutils.fs.cp("source", "destination")
dbutils.fs.mkdirs("/mnt/new_folder")

# Secrets management
secret = dbutils.secrets.get("scope", "key")

# Notebook workflows
dbutils.notebook.run("notebook_path", timeout_seconds=300, arguments={"param": "value"})

# Display helpers
display(df)  # Show DataFrame with interactive UI
Spark Configuration
Cluster and Spark config:
# Python - Spark session config
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()

-- SQL - Session config
SET spark.sql.shuffle.partitions = 200;
SET spark.databricks.delta.optimizeWrite.enabled = true;
⚙️ Workflows & Job Scheduling
Task Types
The different task types:
# Notebook task
{
  "task_key": "process_sales",
  "notebook_task": {
    "notebook_path": "/Shared/ETL/sales_pipeline",
    "base_parameters": {
      "start_date": "2024-01-01",
      "end_date": "2024-12-31"
    }
  },
  "timeout_seconds": 3600
}

# Spark JAR task
{
  "spark_jar_task": {
    "main_class_name": "com.company.ETLJob",
    "parameters": ["--date", "2024-01-01"]
  }
}
Schedule & Triggers
Job scheduling:
# Cron schedule (daily at 06:00)
{
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "Europe/Amsterdam",
    "pause_status": "UNPAUSED"
  }
}

# File arrival trigger
{
  "file_arrival": {
    "url": "dbfs:/mnt/landing/incoming/",
    "min_time_between_triggers_seconds": 300
  }
}
Cluster Configuration
Job cluster settings:
# New job cluster config (use either a fixed num_workers or autoscale, not both)
{
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "spark_conf": {
      "spark.databricks.delta.preview.enabled": "true"
    },
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
🤖 MLflow - Machine Learning
Functionality | Code | Description
Start run | mlflow.start_run() | Start a new MLflow run
Log parameters | mlflow.log_param("learning_rate", 0.01) | Log model parameters
Log metrics | mlflow.log_metric("accuracy", 0.95) | Log evaluation metrics
Log model | mlflow.sklearn.log_model(model, "model") | Save the model artifact
Autologging | mlflow.autolog() | Log automatically
Load model | mlflow.pyfunc.load_model("runs:/run_id/model") | Load a model for inference
# Complete MLflow example
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Select (or create) the experiment
mlflow.set_experiment("/Shared/churn_prediction")

with mlflow.start_run(run_name="rf_baseline"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)

    # Log model
    mlflow.sklearn.log_model(model, "churn_model")

    # Log artifact (feature importance plot)
    mlflow.log_artifact("feature_importance.png")
🔍 Databricks SQL - BI & Analytics
SQL Warehouse
SQL warehouse configuration:
Note: SQL warehouses are created via the workspace UI or the SQL Warehouses API, not with a SQL statement.
# Python (databricks-sdk); field names follow the SQL Warehouses API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.warehouses.create(
    name="analytics_wh",
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=5,
    auto_stop_mins=10,
)
-- Query warehouse metadata
SELECT * FROM system.compute.warehouses
WHERE state = 'RUNNING';
Query History
Monitor query performance:
-- Recent queries
SELECT
  query_text,
  execution_time,
  rows_produced,
  user_name,
  warehouse_id
FROM system.query.history
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL 1 DAY
ORDER BY execution_time DESC
LIMIT 100;
BI Integration
Connect Power BI/Tableau:
# Connection settings for Power BI
Server: https://<workspace>.cloud.databricks.com
Port: 443
HTTP Path: /sql/1.0/warehouses/<warehouse-id>
Authentication: Personal Access Token

# Python connection (databricks-sql-connector)
from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<id>",
    access_token="dapi..."
)
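Once the connection is open, queries run through a standard DB-API cursor. A minimal sketch (the table name is illustrative):

# Run a query over the SQL warehouse connection opened above
cursor = connection.cursor()
cursor.execute("SELECT * FROM sales_delta LIMIT 10")
for row in cursor.fetchall():
    print(row)
cursor.close()
connection.close()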
⚡ Performance Optimization
Technique | Code | Impact
Partitioning | PARTITIONED BY (date DATE) | Faster date-range queries (see the sketch below)
Z-ordering | ZORDER BY customer_id | Better data locality
File size | spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB") | Optimal partition size
Caching | df.cache() or CACHE SELECT * FROM table | Faster repeated queries
Bloom filters | CREATE BLOOMFILTER INDEX ON TABLE sales FOR COLUMNS(customer_id) | Faster point lookups
Data skipping | TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = 32) | Statistics on more columns for better file pruning
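A minimal sketch of the partitioning technique from the table above, writing a Delta table partitioned by date (the path and column names are illustrative):

# Write a date-partitioned Delta table so date-range queries prune whole partitions
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("sale_date")
   .save("/mnt/datalake/sales_partitioned"))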
💡 Performance Tips: 1. Use the Photon engine for faster queries
2. Configure auto-scaling for variable workloads
3. Use the disk (Delta) cache for repeated queries (see the sketch below)
4. Optimize regularly with OPTIMIZE and ZORDER BY
5. Monitor performance with Query History
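A minimal sketch of tips 3 and 4 from a notebook (the config key follows the Databricks disk cache docs; the table name reuses the examples above):

# Tip 3: enable the disk (Delta) cache for repeated reads of the same files
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Tip 4: regular compaction plus Z-ordering on a frequently filtered column
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")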
๐Ÿ” Security & Governance
Row & Column Security
Fine-grained access control:
-- Row filter (dynamic): a boolean SQL UDF attached to the table
CREATE FUNCTION sales_region_filter(region_param STRING)
RETURN EXISTS (
  SELECT 1 FROM user_regions ur
  WHERE ur.user = CURRENT_USER() AND ur.region = region_param
);

ALTER TABLE sales SET ROW FILTER sales_region_filter ON (region);

-- Column mask (PII): a masking UDF attached to the column
CREATE FUNCTION email_mask(email STRING)
RETURN CASE
  WHEN CURRENT_USER() = 'admin@company.com' THEN email
  ELSE REGEXP_REPLACE(email, '(.)(.*)@', '***@')
END;

ALTER TABLE customers ALTER COLUMN email SET MASK email_mask;
Lineage & Auditing
Data lineage tracking:
-- Table details and metadata
DESCRIBE EXTENDED analytics.daily_sales;

-- Unity Catalog lineage via system tables
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'catalog.schema.table';
Backup & Recovery
Disaster recovery:
-- Clone for backup
CREATE TABLE sales_backup DEEP CLONE sales_production;

-- Restore to an earlier version
RESTORE TABLE sales TO VERSION AS OF 10;
# Export to external storage for off-platform backup (Python)
df.write.format("delta") \
    .save("abfss://container@storage.dfs.core.windows.net/backup/sales")
📋 Quick Reference - Daily Use
-- Most-used Databricks commands

-- 1. Read data from a Delta table by path
SELECT * FROM delta.`/mnt/datalake/sales`;

-- 2. Create a new Delta table
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_csv;

-- 3. Time travel query
SELECT * FROM sales_delta TIMESTAMP AS OF '2024-01-15';

-- 4. MERGE (UPSERT) operation
MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- 5. Performance optimization
OPTIMIZE sales_delta ZORDER BY (customer_id, date);

-- 6. Unity Catalog table
CREATE TABLE catalog.schema.table_name AS
SELECT * FROM source_table;

-- 7. Query MLflow runs (Python): mlflow.search_runs(experiment_ids=["123"])

How to use:

1. Print this page as a PDF (Ctrl+P → Save as PDF)

2. Keep it as a reference next to your Databricks workspace

3. Share it with your team for consistent development