DataPartner365

Your partner for data-driven growth and insights

Databricks Lakehouse Platform: A Complete Guide to Unified Analytics

Last updated: 20 December 2024
Reading time: 50 minutes
Databricks, Databricks Lakehouse, Apache Spark, Delta Lake, Unity Catalog, MLflow, Data Engineering, Data Science, Lakehouse Architecture

Learn the fundamentals and advanced techniques of the Databricks Lakehouse Platform: from Delta Lake and Unity Catalog to MLflow and practical implementation strategies for unified analytics.

Looking for Databricks Experts?

Find experienced Databricks Data Engineers and Lakehouse Architects for your unified analytics projects

1. Introduction to the Databricks Lakehouse

What is Databricks?

Databricks is a unified data analytics platform that brings data engineering, data science, and business intelligence together on a single platform. Founded by the creators of Apache Spark, it offers a lakehouse architecture that combines the best elements of data lakes and data warehouses.

Unified Platform

Combines data engineering, data science, and BI in one platform

High Performance

Photon engine for sub-second query performance

Enterprise Ready

Unity Catalog for enterprise governance

Collaborative

Collaborative workspace for data teams

Feature | Traditional Approach | Databricks Lakehouse | Benefits
Architecture | Separate silos | Unified lakehouse | Simplified architecture
Data Storage | Data warehouse + data lake | Single Delta Lake | Eliminates data duplication
Governance | Multiple tools | Unity Catalog (centralized) | Consistent governance
ML Support | External tools | Native MLflow integration | Streamlined ML lifecycle
Performance | Batch processing | Real-time + batch | Faster insights
Cost | High (multiple tools) | Optimized (single platform) | Cost efficiency

2. Lakehouse Architecture Overview

Databricks Architecture

The Databricks Lakehouse architecture combines data lake flexibility with data warehouse performance, using Unity Catalog for unified governance and Delta Lake for reliable data storage.

Core Architecture Components

# DATABRICKS LAKEHOUSE ARCHITECTURE OVERVIEW

# 1. WORKSPACE CONFIGURATION AND STRUCTURE
"""
Databricks Workspace Organization:
├── Workspace (per regio/organisatie)
├── Users & Groups
├── Clusters (Compute Resources)
├── Notebooks & Jobs
├── Libraries & Packages
└── Secrets & Configuration
"""

# 2. CLOUD PROVIDER INTEGRATION
# AWS Integration
storage_bucket = "s3://databricks-data-lake/"
iam_role = "arn:aws:iam::123456789012:role/databricks-role"

# Azure Integration
storage_account = "abfss://databricks@storageaccount.dfs.core.windows.net/"
managed_identity = "/subscriptions/.../resourceGroups/.../providers/Microsoft.ManagedIdentity/..."

# 3. MEDALLION ARCHITECTURE IMPLEMENTATION
# Bronze Layer - Raw data ingestion
bronze_path = "/mnt/datalake/bronze/sales/raw/"

# Silver Layer - Cleaned, validated data
silver_path = "/mnt/datalake/silver/sales/cleaned/"

# Gold Layer - Business-ready aggregates
gold_path = "/mnt/datalake/gold/sales/aggregates/"

# 4. DELTA LAKE TABLE CREATION
from pyspark.sql import SparkSession
from delta.tables import *

# Initialize Spark session with Delta Lake configuration
spark = SparkSession.builder \
    .appName("Databricks Lakehouse") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .config("spark.databricks.delta.autoCompact.enabled", "true") \
    .getOrCreate()

# Create Delta table with schema evolution
spark.sql("""
CREATE TABLE IF NOT EXISTS bronze.sales_raw (
    customer_id INT,
    product_id INT,
    sales_date DATE,
    amount DECIMAL(18,2),
    quantity INT,
    region STRING,
    file_source STRING,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
LOCATION '/mnt/datalake/bronze/sales/raw/'
TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true,
    delta.enableChangeDataFeed = true
)
""")

# 5. UNITY CATALOG CONFIGURATION
# spark.sql() executes one statement at a time, so each statement gets its own call
spark.sql("CREATE CATALOG IF NOT EXISTS production")
spark.sql("USE CATALOG production")

spark.sql("CREATE SCHEMA IF NOT EXISTS sales COMMENT 'Sales data schema for business analytics'")
spark.sql("CREATE SCHEMA IF NOT EXISTS marketing COMMENT 'Marketing data schema for campaign analysis'")

spark.sql("GRANT USE CATALOG ON CATALOG production TO `analysts@datapartner365.nl`")
spark.sql("GRANT SELECT ON SCHEMA sales TO `analysts@datapartner365.nl`")

# 6. CLUSTER CONFIGURATION
"""
Cluster Configuration Options:
- Single Node: Development/Testing
- Standard: General purpose workloads
- High Concurrency: Multiple users, Databricks SQL
- GPU: Machine Learning workloads
- Photon: High-performance analytics

Auto-scaling Configuration:
  min_workers: 2
  max_workers: 10
  auto_termination_minutes: 30
"""

# 7. NOTEBOOK WORKSPACE ORGANIZATION
"""
Workspace Folder Structure:
├── 01_Data_Ingestion
│   ├── Sales_Data_Ingestion
│   ├── Customer_Data_Ingestion
│   └── Product_Data_Ingestion
├── 02_Data_Transformation
│   ├── Bronze_to_Silver
│   ├── Silver_to_Gold
│   └── Data_Quality_Checks
├── 03_Data_Science
│   ├── Customer_Segmentation
│   ├── Sales_Forecasting
│   └── Recommendation_Engine
├── 04_BI_Reports
│   ├── Daily_Sales_Dashboard
│   ├── Customer_Analytics
│   └── Executive_Reports
└── 05_Orchestration
    ├── Main_ETL_Pipeline
    └── ML_Pipeline_Orchestration
"""

3. Delta Lake: ACID-compliant Data Lake

Delta Lake Fundamentals

Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and data versioning to data lakes, enabling reliable data engineering.

Delta Lake Implementation

# DELTA LAKE IMPLEMENTATION IN DATABRICKS

# 1. DELTA TABLE CREATION WITH OPTIMIZATIONS
from delta import *
from pyspark.sql.functions import *

# Create a partitioned Delta table (Z-ordered later via OPTIMIZE)
spark.sql("""
CREATE TABLE IF NOT EXISTS silver.sales_cleaned (
    sales_id BIGINT GENERATED ALWAYS AS IDENTITY,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    sales_date DATE NOT NULL,
    sales_amount DECIMAL(18,2) NOT NULL,
    quantity INT NOT NULL,
    discount_amount DECIMAL(18,2),
    region STRING,
    store_id INT,
    created_timestamp TIMESTAMP,
    updated_timestamp TIMESTAMP
)
USING DELTA
PARTITIONED BY (sales_date)
LOCATION '/mnt/datalake/silver/sales/cleaned/'
TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true,
    delta.dataSkippingNumIndexedCols = 32,
    'delta.deletedFileRetentionDuration' = 'interval 15 days',
    'delta.logRetentionDuration' = 'interval 30 days'
)
""")

# 2. UPSERT OPERATIONS (MERGE) WITH DELTA
# Merge operation for incremental updates
sales_updates_df = spark.table("staging.sales_updates")

target_table = DeltaTable.forPath(spark, "/mnt/datalake/silver/sales/cleaned/")

target_table.alias("target").merge(
    sales_updates_df.alias("source"),
    "target.sales_id = source.sales_id"
).whenMatchedUpdate(
    set={
        "sales_amount": "source.sales_amount",
        "quantity": "source.quantity",
        "updated_timestamp": current_timestamp()
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "product_id": "source.product_id",
        "sales_date": "source.sales_date",
        "sales_amount": "source.sales_amount",
        "quantity": "source.quantity",
        "region": "source.region",
        "created_timestamp": current_timestamp(),
        "updated_timestamp": current_timestamp()
    }
).execute()

# 3. TIME TRAVEL AND VERSIONING
# Query data from specific version
historical_data = spark.sql("""
SELECT * FROM silver.sales_cleaned 
VERSION AS OF 10
WHERE sales_date = '2024-01-15'
""")

# Query data from specific timestamp
timestamp_data = spark.sql("""
SELECT * FROM silver.sales_cleaned 
TIMESTAMP AS OF '2024-01-15 10:00:00'
""")

# Restore table to previous version
spark.sql("""
RESTORE TABLE silver.sales_cleaned TO VERSION AS OF 5
""")

# 4. CHANGE DATA FEED (CDF)
# Enable Change Data Feed for table
spark.sql("""
ALTER TABLE silver.sales_cleaned 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Query change data
changes_df = spark.sql("""
SELECT * FROM table_changes('silver.sales_cleaned', 1)
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
""")

# 5. DELTA LIVE TABLES (DLT) PIPELINES
# DLT pipeline definition for the medallion architecture
dlt_code = """
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

@dlt.table(
    comment="Bronze layer: Raw sales data",
    table_properties={
        "quality": "bronze",
        "pipelines.autoOptimize.managed": "true"
    }
)
def sales_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/mnt/datalake/landing/sales/")
        .withColumn("ingestion_timestamp", current_timestamp())
        .withColumn("file_source", input_file_name())
    )

@dlt.table(
    comment="Silver layer: Cleaned sales data",
    table_properties={
        "quality": "silver",
        "pipelines.autoOptimize.managed": "true"
    }
)
@dlt.expect_or_drop("valid_sales_amount", "sales_amount > 0")
@dlt.expect_or_drop("valid_quantity", "quantity > 0")
def sales_silver():
    return (
        dlt.read_stream("sales_bronze")
        .dropDuplicates(["customer_id", "product_id", "sales_date"])
        .withColumn("sales_year", year(col("sales_date")))
        .withColumn("sales_month", month(col("sales_date")))
        .withColumn("sales_quarter", quarter(col("sales_date")))
    )

@dlt.table(
    comment="Gold layer: Daily sales aggregates",
    table_properties={
        "quality": "gold",
        "pipelines.autoOptimize.managed": "true"
    }
)
def sales_gold_daily():
    return (
        dlt.read("sales_silver")
        .groupBy("sales_date", "region", "product_id")
        .agg(
            sum("sales_amount").alias("daily_sales"),
            avg("sales_amount").alias("avg_sale_amount"),
            count("*").alias("transaction_count")
        )
    )
"""

# 6. OPTIMIZE AND VACUUM OPERATIONS
# Optimize table for performance (partition columns cannot be Z-ordered, so use non-partition columns)
spark.sql("""
OPTIMIZE silver.sales_cleaned
ZORDER BY (customer_id, region)
""")

# Vacuum old files (retention period)
spark.sql("""
VACUUM silver.sales_cleaned RETAIN 168 HOURS
""")

# 7. DELTA SHARING FOR EXTERNAL DATA SHARING
# Create a share for external data sharing (one statement per spark.sql() call)
spark.sql("CREATE SHARE IF NOT EXISTS sales_share COMMENT 'Share for sales data with partners'")

spark.sql("ALTER SHARE sales_share ADD TABLE silver.sales_cleaned")
spark.sql("ALTER SHARE sales_share ADD TABLE gold.sales_aggregates")

# Delta Sharing grants access to a recipient object rather than directly to an email address
spark.sql("CREATE RECIPIENT IF NOT EXISTS external_partner COMMENT 'Recipient for partner@externalcompany.com'")
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT external_partner")

4. Unity Catalog: Unified Governance

Centralized Data Governance

Unity Catalog provides unified governance for data and AI on the Databricks Lakehouse Platform, with centralized access control, audit logging, and data discovery. A short SQL sketch of these controls follows the overview below.

Hierarchical Namespace

Catalog → Schema → Table hierarchy

  • Centralized metadata management
  • Consistent naming conventions
  • Simplified data discovery

Fine-grained Security

Row and column level security

  • Dynamic views for data masking
  • Column encryption
  • Audit logging

Data Discovery

Centralized data catalog

  • Data lineage tracking
  • Schema evolution tracking
  • Data quality monitoring

Access Management

Unified access control

  • Role-Based Access Control
  • Group management
  • Service principal support
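
The sketch below shows how these controls can look in practice: a dynamic view that combines row-level filtering and column masking, granted to analysts instead of the underlying table. It is a minimal sketch that reuses the silver.sales_cleaned table from section 3; the group names and the region_access mapping table are hypothetical.

# Minimal sketch: row- and column-level security via a dynamic view
spark.sql("""
CREATE OR REPLACE VIEW silver.sales_restricted AS
SELECT
    customer_id,
    product_id,
    sales_date,
    sales_amount,
    -- Column masking: only members of the (hypothetical) finance group see discounts
    CASE WHEN is_account_group_member('finance') THEN discount_amount ELSE NULL END AS discount_amount,
    region
FROM silver.sales_cleaned
-- Row-level security: admins see everything, others only their own regions
-- (region_access is a hypothetical mapping table of user e-mail to region)
WHERE is_account_group_member('admins')
   OR region IN (SELECT region FROM silver.region_access WHERE user_email = current_user())
""")

# Grant access to the view instead of the underlying table
spark.sql("GRANT SELECT ON VIEW silver.sales_restricted TO `analysts@datapartner365.nl`")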

Need Databricks Experts?

Find experienced Databricks Data Engineers and Lakehouse Architects for your unified analytics projects

5. Apache Spark Engine and Optimizations

Spark in Databricks

Apache Spark in Databricks is optimized with the Photon engine, auto-scaling, and automatic performance tuning for maximum efficiency.

Spark Optimization Techniques

# SPARK OPTIMIZATIONS IN DATABRICKS

# 1. SPARK SESSION CONFIGURATION
spark = SparkSession.builder \
    .appName("Optimized Data Processing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB") \
    .config("spark.sql.autoBroadcastJoinThreshold", "100MB") \
    .config("spark.databricks.io.cache.enabled", "true") \
    .config("spark.databricks.optimizer.dynamicPartitionPruning", "true") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .config("spark.databricks.delta.autoCompact.enabled", "true") \
    .getOrCreate()

# 2. PHOTON ENGINE CONFIGURATION (Photon is normally enabled at the cluster or SQL warehouse level)
spark.conf.set("spark.databricks.photon.enabled", "true")
spark.conf.set("spark.databricks.photon.parquetWriter.enabled", "true")
spark.conf.set("spark.databricks.photon.maxMemoryPercent", "50")

# 3. AUTO-SCALING CONFIGURATION
"""
Cluster Auto-scaling Configuration:
Initial Workers: 2
Min Workers: 2
Max Workers: 10
Scale-up: When tasks are pending
Scale-down: After 10 minutes of inactivity
Spot instances: For non-critical workloads
"""

# 4. CACHE MANAGEMENT AND OPTIMIZATION
# Cache frequently used tables
spark.sql("CACHE TABLE silver.sales_cleaned")

# Cache with a specific storage level (persist expects a StorageLevel, not a string)
from pyspark import StorageLevel
sales_df = spark.table("silver.sales_cleaned")
sales_df.persist(StorageLevel.MEMORY_AND_DISK)

# Clear cache when needed
spark.catalog.clearCache()

# 5. PARTITIONING STRATEGIES
# Create partitioned table
spark.sql("""
CREATE TABLE gold.sales_monthly_partitioned
USING DELTA
PARTITIONED BY (sales_year, sales_month)
AS
SELECT 
    YEAR(sales_date) AS sales_year,
    MONTH(sales_date) AS sales_month,
    customer_id,
    SUM(sales_amount) AS monthly_sales,
    COUNT(*) AS transaction_count
FROM silver.sales_cleaned
GROUP BY 
    YEAR(sales_date),
    MONTH(sales_date),
    customer_id
""")

# 6. DYNAMIC PARTITION PRUNING
# Query with dynamic partition pruning
recent_sales = spark.sql("""
SELECT * 
FROM gold.sales_monthly_partitioned p
WHERE p.sales_year = 2024 
    AND p.sales_month IN (10, 11, 12)
""")

# 7. SPARK UI AND MONITORING
# Access Spark UI for job monitoring
spark_ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI: {spark_ui_url}")

# 8. SPARK TUNING PARAMETERS
tuning_params = {
    "spark.sql.shuffle.partitions": "200",  # Default shuffle partitions
    "spark.sql.adaptive.maxNumPostShufflePartitions": "1000",
    "spark.sql.files.maxPartitionBytes": "134217728",  # 128MB
    "spark.sql.broadcastTimeout": "300",
    "spark.sql.adaptive.enabled": "true",
    "spark.databricks.adaptive.autoOptimizeShuffle.enabled": "true"
}

for key, value in tuning_params.items():
    spark.conf.set(key, value)

# 9. STRUCTURED STREAMING CONFIGURATION
streaming_query = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sales_topic")
    .load()
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/sales_stream/")
    .trigger(processingTime="1 minute")
    .table("bronze.sales_stream")
)

6. Databricks SQL and BI Integration

SQL Analytics in Databricks

Databricks SQL provides a serverless SQL endpoint for BI and analytics workloads with sub-second query performance on Delta Lake data. A minimal connection sketch follows the comparison table below.

Feature | Databricks SQL | Traditional BI Tools | Benefits
Performance | Photon engine, sub-second | Minutes to hours | Real-time analytics
Data Freshness | Real-time on Delta Lake | ETL delays | Up-to-date insights
Cost | Pay per query | Fixed infrastructure | Cost efficient
Governance | Unity Catalog integrated | Separate tools | Consistent policies
Collaboration | Shared dashboards | Individual files | Team collaboration
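
As a minimal connection sketch, the snippet below queries a SQL warehouse from Python using the databricks-sql-connector package; the hostname, HTTP path, and access token are placeholders for your own workspace, and the gold table comes from the DLT example in section 3.

# Minimal sketch: querying a Databricks SQL warehouse with databricks-sql-connector
# (pip install databricks-sql-connector); connection details below are placeholders
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="<personal-access-token>"                         # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT sales_date, region, SUM(daily_sales) AS total_sales
            FROM gold.sales_gold_daily
            GROUP BY sales_date, region
            ORDER BY sales_date DESC
            LIMIT 10
        """)
        for row in cursor.fetchall():
            print(row)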

7. MLflow: Machine Learning Lifecycle

ML Operations in Databricks

MLflow is an open-source platform for the complete machine learning lifecycle, integrated into Databricks for experiment tracking, model registry, and deployment.

MLflow Implementation

# MLFLOW IMPLEMENTATION IN DATABRICKS

import mlflow
import mlflow.sklearn
import mlflow.pyfunc
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np

# 1. MLFLOW EXPERIMENT SETUP
mlflow.set_experiment("/Users/username@datapartner365.nl/sales_forecasting")

# 2. MLFLOW AUTOLOGGING CONFIGURATION
mlflow.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    log_models=True,
    log_datasets=True,
    disable=False,
    exclusive=True,
    disable_for_unsupported_versions=True,
    silent=True
)

# 3. MLFLOW RUN WITH PARAMETERS AND METRICS
with mlflow.start_run(run_name="random_forest_sales_forecast") as run:
    
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 2,
        "min_samples_leaf": 1,
        "random_state": 42
    }
    
    mlflow.log_params(params)
    
    # Prepare data
    data = spark.sql("""
        SELECT 
            customer_id,
            product_id,
            sales_date,
            sales_amount,
            quantity,
            region,
            DAYOFWEEK(sales_date) AS day_of_week,
            MONTH(sales_date) AS month,
            YEAR(sales_date) AS year
        FROM silver.sales_cleaned
        WHERE sales_date >= '2023-01-01'
    """).toPandas()
    
    # Feature engineering
    data['sales_lag_1'] = data.groupby(['customer_id', 'product_id'])['sales_amount'].shift(1)
    data['sales_lag_7'] = data.groupby(['customer_id', 'product_id'])['sales_amount'].shift(7)
    data['rolling_mean_7'] = data.groupby(['customer_id', 'product_id'])['sales_amount'].transform(lambda s: s.rolling(7).mean())
    
    # Prepare features and target
    features = ['sales_lag_1', 'sales_lag_7', 'rolling_mean_7', 'day_of_week', 'month']
    target = 'sales_amount'
    
    X = data[features].fillna(0)
    y = data[target]
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train model
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    # Log metrics
    mlflow.log_metrics({
        "mse": mse,
        "rmse": rmse,
        "r2": r2,
        "training_samples": len(X_train),
        "test_samples": len(X_test)
    })
    
    # Log model
    mlflow.sklearn.log_model(
        model,
        "random_forest_model",
        registered_model_name="sales_forecasting_rf",
        input_example=X_train.iloc[:5],
        pip_requirements=["scikit-learn==1.3.0"]
    )
    
    # Log artifacts (feature importance plot)
    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    feature_importance.to_csv("feature_importance.csv", index=False)
    mlflow.log_artifact("feature_importance.csv")
    
    # Log tags
    mlflow.set_tags({
        "project": "sales_forecasting",
        "team": "data_science",
        "model_type": "random_forest",
        "data_source": "silver.sales_cleaned"
    })

# 4. MODEL REGISTRY INTEGRATION
# Transition model to staging
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="sales_forecasting_rf",
    version=1,
    stage="Staging",
    archive_existing_versions=False
)

# Add model description
client.update_model_version(
    name="sales_forecasting_rf",
    version=1,
    description="Random Forest model for sales forecasting with lag features"
)

# 5. MODEL DEPLOYMENT
# Load model for batch predictions
model_uri = f"models:/sales_forecasting_rf/1"
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Make batch predictions
batch_data = spark.sql("""
    SELECT 
        customer_id,
        product_id,
        sales_date,
        LAG(sales_amount, 1) OVER (PARTITION BY customer_id, product_id ORDER BY sales_date) AS sales_lag_1,
        LAG(sales_amount, 7) OVER (PARTITION BY customer_id, product_id ORDER BY sales_date) AS sales_lag_7,
        AVG(sales_amount) OVER (
            PARTITION BY customer_id, product_id 
            ORDER BY sales_date 
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_mean_7,
        DAYOFWEEK(sales_date) AS day_of_week,
        MONTH(sales_date) AS month
    FROM silver.sales_cleaned
    WHERE sales_date >= '2024-12-01'
""").toPandas()

predictions = loaded_model.predict(batch_data[features].fillna(0))

# 6. MLFLOW AUTOML (DATABRICKS AUTOML)
import databricks.automl

summary = databricks.automl.regress(
    dataset=spark.table("silver.sales_cleaned"),
    target_col="sales_amount",
    primary_metric="r2",
    timeout_minutes=30,
    experiment_dir="/Users/username@datapartner365.nl/automl_sales"
)

# 7. MODEL MONITORING AND DRIFT DETECTION
from mlflow.models import infer_signature
from mlflow.data import from_pandas

# Infer the model signature and log the training dataset as an MLflow input
signature = infer_signature(X_train, y_train)
training_data = from_pandas(X_train, source="silver.sales_cleaned")

mlflow.log_input(training_data, context="training")

# Log evaluation results
evaluation_results = {
    "model": "sales_forecasting_rf",
    "version": 1,
    "metrics": {"mse": mse, "rmse": rmse, "r2": r2},
    "timestamp": pd.Timestamp.now().isoformat()
}

mlflow.log_dict(evaluation_results, "evaluation_results.json")

8. Data Engineering Workflows

ETL/ELT in Databricks

Data engineering workflows in Databricks support both batch and streaming data processing with Delta Live Tables, Apache Spark, and integrated orchestration. A streaming sketch follows the overview below.

Medallion Architecture

Bronze → Silver → Gold data quality layers

  • Incremental processing
  • Data quality constraints
  • Schema evolution

Streaming Processing

Structured Streaming integration

  • Kafka/Kinesis integration
  • Watermarking
  • Checkpointing

Delta Live Tables

Declarative pipeline framework

  • Automatic dependency management
  • Data quality monitoring
  • Pipeline observability

Workflow Orchestration

Databricks Workflows

  • Multi-task workflows
  • DAG-based scheduling
  • Job monitoring
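
The sketch below illustrates the streaming building blocks from this overview (watermarking, windowed aggregation, checkpointing) on the bronze.sales_raw table from section 2; the silver.sales_windowed target and the checkpoint path are illustrative.

# Minimal sketch: Structured Streaming with watermarking and checkpointing
from pyspark.sql.functions import window, col, sum as _sum

windowed_sales = (
    spark.readStream
    .table("bronze.sales_raw")                                      # Delta table from section 2
    .withWatermark("ingestion_timestamp", "15 minutes")             # tolerate 15 minutes of late data
    .groupBy(window(col("ingestion_timestamp"), "10 minutes"), col("region"))
    .agg(_sum("amount").alias("windowed_sales"))
)

(
    windowed_sales.writeStream
    .format("delta")
    .outputMode("append")                                           # emit only finalized windows
    .option("checkpointLocation", "/mnt/datalake/checkpoints/sales_windowed/")  # illustrative path
    .trigger(processingTime="1 minute")
    .toTable("silver.sales_windowed")                               # illustrative target table
)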

Ready for Databricks Projects?

Find the right Databricks experts or post your lakehouse engineering vacancy

9. Security and Data Governance

Enterprise Security Framework

Security and governance in Databricks encompass Unity Catalog, fine-grained access control, encryption, and compliance controls for enterprise workloads. An audit-log sketch follows the table below.

Security Feature | Implementation | Use Case | Best Practices
Unity Catalog | Centralized metadata management | Enterprise-wide governance | Implement a catalog hierarchy
Row-Level Security | Dynamic views with predicates | Data segmentation per department | Use Unity Catalog policies
Column Encryption | Customer-managed keys | PII data protection | Use the cloud provider's KMS
Audit Logging | Unity Catalog audit logs | Compliance and forensics | Centralized log aggregation
Network Security | VPC/VNet, private links | Network isolation | Use private connectivity
Data Lineage | Unity Catalog lineage tracking | Data provenance | Enable lineage for critical data
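
For the audit-logging row above, the sketch below queries the Unity Catalog audit events exposed via system tables; column names follow the documented system.access.audit schema but may need adjusting to your workspace and retention settings.

# Minimal sketch: reviewing Unity Catalog audit events from the last 7 days
audit_events = spark.sql("""
SELECT
    event_time,
    user_identity.email AS user_email,
    service_name,
    action_name,
    request_params
FROM system.access.audit
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 7)
  AND service_name = 'unityCatalog'
ORDER BY event_time DESC
LIMIT 100
""")

display(audit_events)  # display() is available in Databricks notebooks; use .show() elsewhere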

10. Performance Tuning and Optimization

Databricks Performance Tuning

Performance tuning and optimization in Databricks require attention to cluster configuration, data layout, caching strategies, and query optimization. A short inspection sketch follows the checklists below.

Cluster Optimization

  • Right-size instance types
  • Enable auto-scaling
  • Use spot instances
  • Monitor utilization

Data Layout

  • Z-ordering for frequently filtered columns
  • Optimal partitioning
  • Data skipping indexes
  • Compression strategies

Query Optimization

  • Photon engine enablement
  • Adaptive query execution
  • Dynamic partition pruning
  • Cost-based optimization

Monitoring Tools

  • Spark UI integration
  • Cluster metrics
  • Query profiling
  • Performance alerts
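
The sketch below ties these checklists together: it inspects the physical layout of a Delta table to decide whether OPTIMIZE is needed, and prints a formatted query plan to verify that partition pruning (and Photon operators, when enabled) show up.

# Minimal sketch: inspecting data layout and query plans during performance tuning

# Table layout: file count and size help decide whether OPTIMIZE / compaction is needed
spark.sql("DESCRIBE DETAIL silver.sales_cleaned") \
    .select("numFiles", "sizeInBytes", "partitionColumns") \
    .show()

# Query plan: check that partition filters are pushed down
sales_2024 = spark.sql("""
    SELECT region, SUM(sales_amount) AS total_sales
    FROM silver.sales_cleaned
    WHERE sales_date >= '2024-01-01'
    GROUP BY region
""")
sales_2024.explain("formatted")  # look for PartitionFilters and, with Photon enabled, Photon operators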

11. Cost Management and Optimization

Databricks Cost Control

Cost optimization in Databricks requires a combination of workload management, instance selection, auto-scaling, and monitoring to maximize ROI.

Cost Optimization Techniques

# DATABRICKS COST OPTIMIZATION STRATEGIES

# 1. WORKLOAD MANAGEMENT AND TAGGING
"""
Cost Allocation Tags:
- Department: sales, marketing, finance
- Project: customer360, sales_forecasting
- Environment: dev, staging, prod
- Cost Center: 12345, 67890
"""

# 2. INSTANCE TYPE SELECTION
"""
Instance Selection Guidelines:
- Development: Standard_F4s_v2 (4 vCPUs, 8 GB RAM)
- Production (Small): Standard_DS3_v2 (4 vCPUs, 14 GB RAM)
- Production (Medium): Standard_DS5_v2 (16 vCPUs, 56 GB RAM)
- Production (Large): Standard_E16_v3 (16 vCPUs, 128 GB RAM)
- GPU Workloads: Standard_NC6s_v3 (6 vCPUs, 112 GB RAM, 1 GPU)
"""

# 3. AUTO-SCALING CONFIGURATION
# Cost allocation tags are normally defined in the cluster configuration or a cluster policy
spark.conf.set("spark.databricks.clusterUsageTags.clusterAllTags", '{"CostCenter": "12345"}')

"""
Auto-scaling Configuration:
- Min Workers: 2
- Max Workers: 10
- Scale-up Delta: 2 workers
- Scale-down Delta: 1 worker
- Spot Bid Price: 70% of on-demand
"""

# 4. AUTO-TERMINATION AND PAUSE
"""
Cluster Termination Settings:
- Development: 30 minutes idle
- Staging: 60 minutes idle
- Production: No auto-termination
- Weekend pause: Friday 18:00 - Monday 08:00
"""

# 5. DBU USAGE MONITORING
# Query DBU consumption per cluster from the billing system table
# (usage_quantity is expressed in DBUs; join system.billing.list_prices for monetary cost)
dbu_usage = spark.sql("""
SELECT 
    usage_metadata.cluster_id AS cluster_id,
    usage_metadata.job_id AS job_id,
    sku_name,
    usage_date,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY 
    usage_metadata.cluster_id,
    usage_metadata.job_id,
    sku_name,
    usage_date
ORDER BY dbus DESC
""")

# 6. COST ALLOCATION DASHBOARD
# Aggregate DBU consumption per department tag and visualize it
import plotly.express as px

costs_by_dept = spark.sql("""
    SELECT 
        custom_tags['Department'] AS department,
        SUM(usage_quantity) AS dbus,
        COUNT(DISTINCT usage_metadata.cluster_id) AS cluster_count
    FROM system.billing.usage
    WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
        AND custom_tags['Department'] IS NOT NULL
    GROUP BY custom_tags['Department']
    ORDER BY dbus DESC
""").toPandas()

fig = px.bar(costs_by_dept,
             x='department',
             y='dbus',
             title='Monthly DBU Consumption by Department',
             labels={'department': 'Department', 'dbus': 'DBUs'})
fig.show()

# 7. COST OPTIMIZATION RECOMMENDATIONS
recommendations = [
    "Use spot instances for non-critical workloads",
    "Enable auto-termination for development clusters",
    "Right-size clusters based on workload patterns",
    "Use Photon engine for SQL workloads",
    "Implement cost allocation tags for chargeback",
    "Monitor and optimize Delta Lake storage",
    "Use Databricks SQL for BI workloads",
    "Implement workload management for resource sharing"
]

12. Migration to Databricks

Migration Strategies

Migration to Databricks requires careful planning, schema conversion, data migration, and validation for successful adoption. A validation sketch follows the phase overview below.

Assessment Phase

  • Current state analysis
  • Workload assessment
  • Cost-benefit analysis
  • Readiness assessment

Planning Phase

  • Migration strategy
  • Timeline planning
  • Resource allocation
  • Risk assessment

Execution Phase

  • Data migration
  • Code conversion
  • Testing validation
  • Performance tuning

Validation Phase

  • Data validation
  • Performance validation
  • User acceptance testing
  • Documentation update
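
A simple way to support the validation phase is to compare row counts and control totals between the legacy source and the migrated Delta table, as in the sketch below; the JDBC connection details and table names are placeholders.

# Minimal sketch: data validation after migration (placeholder connection details)
from pyspark.sql.functions import count, sum as _sum

legacy_df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://legacy-dwh:1433;databaseName=sales") \
    .option("dbtable", "dbo.sales") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()

migrated_df = spark.table("silver.sales_cleaned")

legacy_stats = legacy_df.agg(count("*").alias("rows"), _sum("sales_amount").alias("amount_total")).collect()[0]
migrated_stats = migrated_df.agg(count("*").alias("rows"), _sum("sales_amount").alias("amount_total")).collect()[0]

assert legacy_stats["rows"] == migrated_stats["rows"], "Row counts do not match"
assert abs(float(legacy_stats["amount_total"]) - float(migrated_stats["amount_total"])) < 0.01, "Control totals do not match"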

13. Best Practices and Patterns

Databricks Best Practices

Best practices and patterns help you build scalable, performant, and maintainable Databricks solutions. A feature store sketch follows the pattern overview below.

Design Patterns for Databricks

Lakehouse Pattern
  • Medallion architecture
  • Delta Lake format
  • Unity Catalog governance
  • Multi-engine access
Data Engineering Pattern
  • Delta Live Tables
  • Incremental processing
  • Data quality checks
  • Orchestration workflows
MLOps Pattern
  • MLflow integration
  • Model registry
  • Feature store
  • Model monitoring
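
For the feature store element of the MLOps pattern, the sketch below registers customer-level features as a governed feature table; it assumes the databricks-feature-engineering package and a Unity Catalog-enabled workspace, and the target table name is illustrative.

# Minimal sketch: publishing reusable features to a feature table
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql.functions import sum as _sum, avg, count

fe = FeatureEngineeringClient()

customer_features = (
    spark.table("silver.sales_cleaned")
    .groupBy("customer_id")
    .agg(
        _sum("sales_amount").alias("total_sales"),
        avg("sales_amount").alias("avg_sale_amount"),
        count("*").alias("transaction_count"),
    )
)

fe.create_table(
    name="production.sales.customer_features",   # catalog.schema.table (illustrative)
    primary_keys=["customer_id"],
    df=customer_features,
    description="Customer-level sales features for forecasting and segmentation",
)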

14. Practical Case Studies

Real-World Implementations

Practical case studies show how Databricks is implemented across different industries with measurable business outcomes.

Case Study: Retail Analytics

Challenge: legacy data warehouse, real-time inventory management, personalized recommendations.

Solution: Databricks Lakehouse with real-time streaming and ML-powered recommendations.

Metric | Before | After | Improvement
Data Processing | 24h batch cycle | Real-time (seconds) | 86,400x faster
Query Performance | Minutes to hours | Sub-second | 100x+ faster
Recommendation Accuracy | 35% click-through | 65% click-through | 85% improvement
Cost | $1.5M/year | $300k/year | 80% reduction
Time to Insight | Weeks | Hours | 168x faster

Conclusion and Key Takeaways

Key Lessons Learned

DO's
  • Start with the medallion architecture
  • Implement Unity Catalog from day one
  • Use Delta Lake for reliable data
  • Monitor costs with allocation tags
  • Use MLflow for MLOps
DON'Ts
  • Don't ignore data governance
  • Don't underestimate migration complexity
  • Don't forget training and adoption
  • Don't copy on-premises patterns one-to-one
  • Don't underestimate the team skills gap
Emerging Trends
  • Lakehouse Federation
  • AI/ML integration
  • Real-time analytics
  • Low-code data engineering
  • Multi-cloud deployments

Frequently Asked Questions (FAQ)

Q: What is the difference between Databricks and Snowflake?

A: Databricks is a unified analytics platform for data engineering, data science, and BI with native Apache Spark. Snowflake is a cloud data warehouse optimized for SQL workloads. Databricks offers stronger ML integration and data engineering capabilities, while Snowflake excels in SQL performance and data sharing.

Q: How do I implement CI/CD for Databricks?

A: CI/CD for Databricks: 1) Git integration for notebooks, 2) Databricks CLI/API for automation, 3) Terraform/ARM templates for infrastructure, 4) Databricks Asset Bundles, 5) automated testing frameworks, 6) deployment pipelines with Azure DevOps/GitHub Actions.

Q: What are Databricks' SLAs?

A: Databricks SLAs: 1) platform availability: 99.95%, 2) support response times: 1-4 hours (depending on severity), 3) data durability: 99.999999999%, 4) enterprise SLAs available through premium support.

Q: How do I monitor Databricks performance?

A: Monitoring tools: 1) Databricks metrics dashboards, 2) Spark UI for job monitoring, 3) Ganglia metrics for cluster monitoring, 4) Unity Catalog audit logs, 5) custom dashboards with Grafana/Power BI, 6) alerting via webhooks and email.