What is Unity Catalog? (Beyond the Basics)
Unity Catalog is not just another data catalog: it is a **centralized governance layer** that sits across your entire Databricks deployment. It provides a unified management layer for data, AI assets, and access control across your whole data estate.
Unity Catalog Architecture: The Three-Layer Model
1. Governance Layer
- Central metadata store: a single source of truth
- Access control policies: fine-grained RBAC
- Audit logging: complete activity tracking
- Lineage tracking: end-to-end data flow visualization
2. Data Layer
- Multi-cloud support: AWS, Azure, GCP
- Storage abstraction: uniform access
- Delta format: ACID transactions
- Time travel: data versioning
3. Compute Layer
- Shared infrastructure: efficient resource usage
- Isolated execution: secure query isolation
- Autoscaling: dynamic scaling
- Photon engine: high-performance queries
Core Concepts and Hierarchy
Unity Catalog Object Model Hierarchy
-- Unity Catalog hierarchy: top-down
-- 1. METASTORE (top level)
-- ├── Holds all metadata
-- ├── Cross-workspace sharing
-- └── Central point of administration
-- 2. CATALOG (organizational level)
-- ├── main (default catalog)
-- ├── analytics (for BI/data science)
-- ├── sandbox (for development)
-- └── sensitive (for PII data)
-- 3. SCHEMA (database/dataset level)
-- ├── default (inside every catalog)
-- ├── bronze (raw data)
-- ├── silver (cleaned data)
-- └── gold (business ready)
-- 4. TABLE/VIEW (data objects)
-- ├── Delta tables (ACID compliant)
-- ├── Views (logical abstractions)
-- ├── Materialized views (precomputed)
-- └── External tables (on-prem/cloud)
-- 5. COLUMN (lowest level)
-- ├── Column-level security
-- ├── Masking policies
-- ├── Tagging (PII, sensitivity)
-- └── Lineage tracking
-- Practical hierarchy example
METASTORE: "company_metastore"
│
├── CATALOG: "main"
│ │
│ ├── SCHEMA: "sales"
│ │ ├── TABLE: "orders"
│ │ │ ├── COLUMN: "customer_id" (PII tagged)
│ │ │ ├── COLUMN: "order_amount"
│ │ │ └── COLUMN: "order_date"
│ │ │
│ │ └── VIEW: "monthly_sales"
│ │
│ └── SCHEMA: "marketing"
│ ├── TABLE: "campaigns"
│ └── TABLE: "leads"
│
└── CATALOG: "analytics"
│
└── SCHEMA: "bi"
├── TABLE: "kpi_dashboard"
└── VIEW: "executive_summary"
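The tree above maps directly onto a nested data model. As a minimal sketch (the classes and names below are illustrative, not a Databricks API), every object resolves to a three-level name:

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    name: str
    tables: list = field(default_factory=list)

@dataclass
class Catalog:
    name: str
    schemas: dict = field(default_factory=dict)

@dataclass
class Metastore:
    name: str
    catalogs: dict = field(default_factory=dict)

    def full_name(self, catalog: str, schema: str, table: str) -> str:
        # Every table is addressed by a three-level name:
        # <catalog>.<schema>.<table>
        if table not in self.catalogs[catalog].schemas[schema].tables:
            raise KeyError(f"unknown table: {table}")
        return f"{catalog}.{schema}.{table}"

ms = Metastore("company_metastore")
ms.catalogs["main"] = Catalog("main", {"sales": Schema("sales", ["orders"])})
print(ms.full_name("main", "sales", "orders"))  # main.sales.orders
```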
Metastore
- Top-level container: holds all metadata
- Cross-workspace: shared across multiple workspaces
- Central management: a single point of governance
- Backup/restore: metadata backup capabilities
Catalog
- First organizational level: e.g. departments
- Isolation boundary: security and access boundary
- Resource grouping: logical grouping of data
- Default: "main": the standard catalog
Schema
- Database equivalent: container for tables/views
- Data lake mapping: maps to directories
- Medallion architecture: bronze/silver/gold
- Access control: permissions at schema level
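With the hierarchy and medallion conventions in place, it is worth validating object names before creating them. A small hypothetical helper (the allowed catalog and schema names are the ones used in this post, not anything Unity Catalog enforces):

```python
# Hypothetical naming-convention check for three-level names:
# <catalog>.<schema>.<table>, restricted to the catalogs and
# medallion schemas used as examples in this post.
ALLOWED_CATALOGS = {"main", "analytics", "sandbox", "sensitive"}
ALLOWED_SCHEMAS = {"default", "bronze", "silver", "gold", "sensitive"}

def validate_name(full_name: str) -> bool:
    parts = full_name.split(".")
    if len(parts) != 3:
        return False
    catalog, schema, table = parts
    return (catalog in ALLOWED_CATALOGS
            and schema in ALLOWED_SCHEMAS
            and table.isidentifier())

print(validate_name("main.gold.sales"))  # True
print(validate_name("main.gold"))        # False (not a three-level name)
```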
Enterprise Setup for SMBs
Step-by-Step Unity Catalog Setup
Infrastructure Setup
# Terraform for the Unity Catalog setup
resource "databricks_metastore" "company_metastore" {
  name          = "company_metastore"
  storage_root  = "s3://company-uc-metadata"
  region        = "eu-west-1"
  force_destroy = true # only advisable outside production
}

resource "databricks_metastore_data_access" "uc_access" {
  metastore_id = databricks_metastore.company_metastore.id
  name         = "uc-storage-access"
  aws_iam_role {
    role_arn = aws_iam_role.uc_role.arn
  }
}

resource "databricks_metastore_assignment" "assignment" {
  metastore_id         = databricks_metastore.company_metastore.id
  workspace_id         = databricks_mws_workspaces.workspace.workspace_id
  default_catalog_name = "main"
}
Initial Catalog Setup
-- SQL: catalog and schema creation
-- Create catalogs for different purposes
CREATE CATALOG IF NOT EXISTS main
COMMENT 'Primary catalog for business data';
CREATE CATALOG IF NOT EXISTS analytics
COMMENT 'Catalog for BI and analytics data';
CREATE CATALOG IF NOT EXISTS sandbox
COMMENT 'Development and testing catalog';
-- Create schemas within each catalog
USE CATALOG main;
CREATE SCHEMA IF NOT EXISTS bronze
COMMENT 'Raw/bronze layer data';
CREATE SCHEMA IF NOT EXISTS silver
COMMENT 'Cleaned/silver layer data';
CREATE SCHEMA IF NOT EXISTS gold
COMMENT 'Business ready/gold layer data';
CREATE SCHEMA IF NOT EXISTS sensitive
COMMENT 'Schema for sensitive/PII data';
Access Control Setup
-- SQL: RBAC setup for different teams
-- Note: Unity Catalog grants privileges to groups; groups are
-- created in the account console (or via SCIM), not with SQL
-- Grant catalog-level privileges
GRANT USE CATALOG ON CATALOG main TO `data_engineers`;
GRANT USE CATALOG ON CATALOG analytics TO `business_analysts`;
GRANT USE CATALOG ON CATALOG sandbox TO `data_scientists`;
-- Grant specific privileges at schema level
GRANT USE SCHEMA, SELECT ON SCHEMA main.bronze TO `data_engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.silver TO `data_scientists`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.gold TO `business_analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.gold TO `executives`;
-- Grant create privileges for engineers
GRANT CREATE TABLE ON SCHEMA main.bronze TO `data_engineers`;
GRANT CREATE TABLE ON SCHEMA main.silver TO `data_engineers`;
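Grant patterns like this get repetitive fast. A sketch that generates the statements from a group-to-privileges mapping (group and schema names are the assumed examples from this post):

```python
# Generate GRANT statements from a group -> (schema, privileges)
# mapping; all names below are illustrative.
GRANTS = {
    "data_engineers": [
        ("main.bronze", ["USE SCHEMA", "SELECT", "CREATE TABLE"]),
        ("main.silver", ["USE SCHEMA", "SELECT", "CREATE TABLE"]),
    ],
    "business_analysts": [
        ("main.gold", ["USE SCHEMA", "SELECT"]),
    ],
}

def grant_statements(grants: dict) -> list:
    stmts = []
    for group, targets in grants.items():
        for schema, privileges in targets:
            stmts.append(
                f"GRANT {', '.join(privileges)} ON SCHEMA {schema} TO `{group}`;")
    return stmts

for stmt in grant_statements(GRANTS):
    print(stmt)
```

Generating grants from one source of truth keeps the RBAC model reviewable in version control instead of scattered across ad-hoc queries.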
Advanced Unity Catalog Features
Lineage Tracking
-- SQL: lineage queries (system tables must be enabled for the account)
-- Full upstream lineage of a table
SELECT * FROM system.access.table_lineage
WHERE target_table_full_name = 'main.gold.customer_360';
-- Downstream impact analysis for a single column
SELECT * FROM system.access.column_lineage
WHERE source_table_full_name = 'main.silver.transactions'
AND source_column_name = 'amount';
-- Lineage visualization in a notebook
%python
from databricks.sdk import WorkspaceClient
import networkx as nx

def visualize_lineage(table_name: str):
    # The SDK has no dedicated lineage client; call the REST
    # endpoint directly (response shape may vary by API version)
    w = WorkspaceClient()
    lineage = w.api_client.do(
        "GET", "/api/2.0/lineage-tracking/table-lineage",
        query={"table_name": table_name, "include_entity_lineage": "true"},
    )
    # Build a directed graph: upstream -> table -> downstream
    G = nx.DiGraph()
    for up in lineage.get("upstreams", []):
        G.add_edge(up.get("tableInfo", {}).get("name", "?"), table_name)
    for down in lineage.get("downstreams", []):
        G.add_edge(table_name, down.get("tableInfo", {}).get("name", "?"))
    # Visualize
    nx.draw(G, with_labels=True)
    return G
Tagging & Classification
-- SQL: tagging and classification
-- Note: Databricks has no CREATE TAG statement; tags are free-form
-- key/value pairs applied with SET TAGS (governed tag policies are
-- managed in the account console)
-- Apply tags to tables
ALTER TABLE main.sensitive.customers
SET TAGS ('data_sensitivity' = 'CONFIDENTIAL');
ALTER TABLE main.gold.sales
SET TAGS ('business_domain' = 'SALES',
          'data_quality' = 'CERTIFIED');
-- Apply tags to columns (PII detection)
ALTER TABLE main.sensitive.customers
ALTER COLUMN email
SET TAGS ('data_sensitivity' = 'PII');
ALTER TABLE main.sensitive.customers
ALTER COLUMN bsn
SET TAGS ('data_sensitivity' = 'SENSITIVE_PII');
-- Query by tag
SELECT
catalog_name,
schema_name,
table_name,
column_name,
tag_name,
tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'data_sensitivity'
AND tag_value = 'PII';
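Once the tag inventory is queryable, filtering it down to PII columns is simple list processing. A sketch over rows shaped like the tag views above (the sample rows are made up):

```python
# Filter a column-tag inventory (rows shaped like the
# information_schema tag views) down to PII-tagged columns.
rows = [
    {"table_name": "customers", "column_name": "email",
     "tag_name": "data_sensitivity", "tag_value": "PII"},
    {"table_name": "customers", "column_name": "bsn",
     "tag_name": "data_sensitivity", "tag_value": "SENSITIVE_PII"},
    {"table_name": "sales", "column_name": "order_amount",
     "tag_name": "data_quality", "tag_value": "CERTIFIED"},
]

def pii_columns(tag_rows):
    # Both 'PII' and 'SENSITIVE_PII' count as PII
    return [(r["table_name"], r["column_name"]) for r in tag_rows
            if r["tag_name"] == "data_sensitivity"
            and r["tag_value"].endswith("PII")]

print(pii_columns(rows))  # [('customers', 'email'), ('customers', 'bsn')]
```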
Fine-grained Access Control
-- SQL: advanced access control
-- Column-level security: in Unity Catalog a column mask is a SQL UDF
CREATE OR REPLACE FUNCTION email_mask(email STRING)
RETURN CASE
WHEN is_account_group_member('admins')
OR is_account_group_member('data_engineers')
THEN email
ELSE regexp_replace(email, '(?<=.).(?=.*@)', '*')
END;
CREATE OR REPLACE FUNCTION amount_mask(amount DECIMAL(18, 2))
RETURN CASE
WHEN is_account_group_member('admins')
OR is_account_group_member('executives')
THEN amount
ELSE NULL
END;
-- Apply the masking functions
ALTER TABLE main.sensitive.customers
ALTER COLUMN email
SET MASK email_mask;
ALTER TABLE main.gold.sales
ALTER COLUMN revenue
SET MASK amount_mask;
-- Row-level security: a row filter is a SQL UDF returning BOOLEAN
CREATE OR REPLACE FUNCTION ssn_filter(ssn STRING)
RETURN is_account_group_member('admins')
OR is_account_group_member('hr');
ALTER TABLE main.sensitive.employees
SET ROW FILTER ssn_filter ON (ssn);
-- Dynamic views with security built in
CREATE VIEW main.analytics.safe_customer_data AS
SELECT
customer_id,
CASE
WHEN current_user() IN ('admin', 'sales_manager')
THEN customer_name
ELSE 'CONFIDENTIAL'
END as customer_name,
CASE
WHEN current_user() IN ('admin', 'sales_manager')
THEN email
ELSE regexp_replace(email, '(?<=.).(?=.*@)', '*')
END as email,
country,
customer_segment
FROM main.sensitive.customers;
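The masking expression uses a Java-style regex that Python's `re` module also understands, so you can preview what non-privileged users would see without touching the platform:

```python
import re

def mask_email(email: str) -> str:
    # Same pattern as the SQL mask: star out every character that
    # has at least one character before it and an '@' after it
    return re.sub(r"(?<=.).(?=.*@)", "*", email)

print(mask_email("jdoe@example.com"))  # j***@example.com
```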
Data Sharing & Collaboration
-- SQL: data sharing inside and outside the organization
-- A share is a named collection of objects (Delta Sharing)
CREATE SHARE sales_data_share
COMMENT 'Share sales data with the analytics team';
-- Add objects to the share
ALTER SHARE sales_data_share
ADD TABLE main.gold.sales;
ALTER SHARE sales_data_share
ADD VIEW main.analytics.sales_dashboard;
-- Create a recipient (the party that receives your data)
CREATE RECIPIENT supplier_data
COMMENT 'External business partner';
-- Grant the recipient access to the share
GRANT SELECT ON SHARE sales_data_share
TO RECIPIENT supplier_data;
-- As a recipient, a provider object appears once you import the
-- share credential; mount the share as a catalog and query it
CREATE CATALOG supplier_catalog
USING SHARE supplier_provider.sales_share;
SELECT * FROM supplier_catalog.default.sales_metrics;
-- Near real-time sharing: include table history so recipients can
-- read the shared table as a stream or with time travel
ALTER SHARE sales_data_share
ADD TABLE main.gold.real_time_metrics
WITH HISTORY;
PII Detection and GDPR Compliance
Automated PII Workflow
-- SQL: complete PII compliance workflow
-- 1. There is no metastore-level PII switch in Databricks;
-- detection means scanning sampled column values (for example with
-- the ai_classify() SQL function) and persisting the results
-- 2. Create a dedicated schema structure for PII
CREATE SCHEMA IF NOT EXISTS main.pii_data
MANAGED LOCATION 's3://company-data/pii/'
COMMENT 'Schema for PII/GDPR-sensitive data';
-- 3. Build a PII detection log (assumes a scan job has written
-- candidate columns to main.audit.pii_scan_results)
CREATE OR REPLACE TABLE main.audit.pii_detection_log
AS SELECT
current_timestamp() as detection_time,
catalog_name,
schema_name,
table_name,
column_name,
pii_type,
confidence_score,
sample_value,
CASE
WHEN pii_type IN ('EMAIL', 'PHONE', 'SSN', 'IBAN')
THEN 'SENSITIVE'
WHEN pii_type IN ('NAME', 'ADDRESS')
THEN 'IDENTIFIER'
ELSE 'NON_PII'
END as sensitivity_level
FROM main.audit.pii_scan_results;
-- 4. Automated PII tagging (SQL scripting; requires a recent
-- Databricks Runtime that supports BEGIN/END compound statements)
BEGIN
DECLARE stmt STRING;
FOR rec AS
SELECT
catalog_name,
schema_name,
table_name,
column_name,
pii_type
FROM main.audit.pii_detection_log
WHERE confidence_score > 0.8
DO
SET stmt = format_string(
'ALTER TABLE %s.%s.%s ALTER COLUMN %s
SET TAGS (''pii_type'' = ''%s'', ''sensitivity'' = ''%s'')',
rec.catalog_name,
rec.schema_name,
rec.table_name,
rec.column_name,
rec.pii_type,
CASE
WHEN rec.pii_type IN ('EMAIL', 'PHONE', 'SSN')
THEN 'HIGH'
ELSE 'MEDIUM'
END);
EXECUTE IMMEDIATE stmt;
END FOR;
END;
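The classification rules in the detection log and the tagging step are plain conditional logic, which makes them easy to unit-test outside the platform. A Python mirror of both CASE expressions:

```python
# Mirror of the CASE expressions used in the PII workflow above.
def sensitivity_level(pii_type: str) -> str:
    if pii_type in {"EMAIL", "PHONE", "SSN", "IBAN"}:
        return "SENSITIVE"
    if pii_type in {"NAME", "ADDRESS"}:
        return "IDENTIFIER"
    return "NON_PII"

def tag_priority(pii_type: str) -> str:
    # The tagging step assigns HIGH to direct identifiers
    return "HIGH" if pii_type in {"EMAIL", "PHONE", "SSN"} else "MEDIUM"

print(sensitivity_level("IBAN"))   # SENSITIVE
print(sensitivity_level("NAME"))   # IDENTIFIER
print(tag_priority("ADDRESS"))     # MEDIUM
```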
-- 5. GDPR compliance dashboard query
CREATE OR REPLACE VIEW main.compliance.gdpr_dashboard AS
WITH pii_inventory AS (
SELECT
catalog_name,
schema_name,
table_name,
COUNT(DISTINCT column_name) as total_columns,
SUM(CASE WHEN tag_value = 'HIGH' THEN 1 ELSE 0 END) as high_sensitivity,
SUM(CASE WHEN tag_value = 'MEDIUM' THEN 1 ELSE 0 END) as medium_sensitivity
FROM system.information_schema.column_tags
WHERE tag_name = 'sensitivity'
GROUP BY 1, 2, 3
),
access_audit AS (
-- simplified: system.access.audit records the accessed object in
-- request_params; the flat table_* columns are assumed here
SELECT
table_catalog,
table_schema,
table_name,
COUNT(DISTINCT user_name) as unique_users,
MIN(event_time) as first_access,
MAX(event_time) as last_access
FROM system.access.audit
WHERE service_name = 'databricks-sql'
GROUP BY 1, 2, 3
)
SELECT
pi.catalog_name,
pi.schema_name,
pi.table_name,
pi.total_columns,
pi.high_sensitivity,
pi.medium_sensitivity,
COALESCE(aa.unique_users, 0) as access_count,
COALESCE(aa.first_access, current_timestamp()) as first_access,
COALESCE(aa.last_access, current_timestamp()) as last_access,
-- Compliance score
CASE
WHEN pi.high_sensitivity = 0 THEN 100
WHEN pi.high_sensitivity > 0 AND aa.unique_users <= 3 THEN 90
WHEN pi.high_sensitivity > 0 AND aa.unique_users <= 10 THEN 80
ELSE 70
END as compliance_score
FROM pii_inventory pi
LEFT JOIN access_audit aa
ON pi.catalog_name = aa.table_catalog
AND pi.schema_name = aa.table_schema
AND pi.table_name = aa.table_name
ORDER BY compliance_score ASC;
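The compliance score in the view above is a fixed rubric, so it can be verified in isolation. A Python mirror of the scoring CASE:

```python
# Mirror of the compliance-score CASE in the GDPR dashboard view.
def compliance_score(high_sensitivity_cols: int, unique_users: int) -> int:
    if high_sensitivity_cols == 0:
        return 100          # no high-sensitivity columns at all
    if unique_users <= 3:
        return 90           # sensitive data, tightly scoped access
    if unique_users <= 10:
        return 80
    return 70               # sensitive data with broad access

print(compliance_score(0, 50))  # 100
print(compliance_score(2, 2))   # 90
print(compliance_score(2, 25))  # 70
```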
Performance Optimization
Query Performance
-- SQL: performance monitoring and optimization
-- 1. Query performance analysis
CREATE OR REPLACE TABLE main.monitoring.query_performance
AS SELECT
query_id,
query_text,
user_name,
execution_time,
result_size,
rows_produced,
-- Performance metrics
execution_time / NULLIF(rows_produced, 0) as ms_per_row,
CASE
WHEN execution_time > 60000 THEN 'SLOW'
WHEN execution_time > 30000 THEN 'MEDIUM'
ELSE 'FAST'
END as performance_category
FROM system.query.history
WHERE start_time > current_date() - 7;
-- 2. Table statistics for the query optimizer
ANALYZE TABLE main.gold.sales
COMPUTE STATISTICS FOR ALL COLUMNS;
-- Show table statistics
DESCRIBE DETAIL main.gold.sales;
-- 3. Liquid clustering (co-locates related rows)
CREATE OR REPLACE TABLE main.gold.optimized_sales
USING DELTA
-- Optimize for the most common query patterns
CLUSTER BY (customer_id, order_date)
LOCATION 's3://company-data/gold/sales/'
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
AS SELECT * FROM main.silver.sales;
-- 4. Materialized views for frequently used aggregations
CREATE MATERIALIZED VIEW main.gold.daily_sales_mv
SCHEDULE EVERY 1 HOUR
AS SELECT
date_trunc('day', order_date) as sales_day,
customer_segment,
product_category,
SUM(order_amount) as daily_sales,
COUNT(DISTINCT customer_id) as unique_customers,
AVG(order_amount) as avg_order_value
FROM main.gold.sales
GROUP BY 1, 2, 3;
-- 5. Query acceleration with Photon
-- Photon is enabled on the SQL warehouse or cluster, not per query;
-- with Photon on, scans and aggregations like this are vectorized
SELECT
customer_id,
SUM(order_amount) as total_spent
FROM main.gold.sales
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
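The FAST/MEDIUM/SLOW buckets from the monitoring table are worth unit-testing, including the division-by-zero edge case the `ms_per_row` metric needs to handle:

```python
# Mirror of the classification logic in main.monitoring.query_performance;
# execution times are in milliseconds.
def performance_category(execution_time_ms: int) -> str:
    if execution_time_ms > 60_000:
        return "SLOW"
    if execution_time_ms > 30_000:
        return "MEDIUM"
    return "FAST"

def ms_per_row(execution_time_ms: int, rows_produced: int):
    # Guard against queries that produce no rows
    return execution_time_ms / rows_produced if rows_produced else None

print(performance_category(45_000))  # MEDIUM
print(ms_per_row(45_000, 0))         # None
```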
Storage Optimization
-- SQL: storage management and optimization
-- 1. Storage analytics
CREATE OR REPLACE VIEW main.monitoring.storage_usage AS
SELECT
catalog_name,
schema_name,
table_name,
-- Storage metrics
SUM(size_in_bytes) / 1024 / 1024 / 1024 as size_gb,
SUM(num_files) as total_files,
AVG(size_in_bytes / NULLIF(num_files, 0)) / 1024 / 1024 as avg_file_size_mb,
-- Optimization opportunities
CASE
WHEN AVG(size_in_bytes / NULLIF(num_files, 0)) < 100 * 1024 * 1024 THEN 'SMALL_FILES'
WHEN SUM(num_files) > 1000 THEN 'TOO_MANY_FILES'
ELSE 'OPTIMAL'
END as optimization_status
-- assumed inventory table: Databricks has no table_storage system
-- view, so populate one yourself from DESCRIBE DETAIL output
FROM system.information_schema.table_storage
GROUP BY 1, 2, 3
ORDER BY size_gb DESC;
-- 2. Auto-optimization configuration
ALTER TABLE main.gold.large_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.targetFileSize' = '128mb'
);
-- 3. Manual optimization
OPTIMIZE main.gold.sales
ZORDER BY (customer_id, order_date);
-- Show optimization results
DESCRIBE HISTORY main.gold.sales;
-- 4. Vacuum (clean up old versions)
VACUUM main.gold.sales
RETAIN 168 HOURS; -- 7 days retention
-- 5. Clones for testing without copying data
-- Zero-copy clone (metadata only)
CREATE TABLE main.sandbox.sales_test
SHALLOW CLONE main.gold.sales;
-- Deep clone (copies the data)
CREATE TABLE main.backup.sales_backup
DEEP CLONE main.gold.sales;
-- 6. Table maintenance script (SQL scripting; requires a recent
-- Databricks Runtime that supports BEGIN/END compound statements)
BEGIN
DECLARE stmt STRING;
FOR rec AS
SELECT
table_catalog,
table_schema,
table_name
FROM system.information_schema.tables
WHERE table_schema IN ('bronze', 'silver', 'gold')
AND table_type = 'BASE TABLE'
DO
-- Compact small files
SET stmt = format_string('OPTIMIZE %s.%s.%s',
rec.table_catalog, rec.table_schema, rec.table_name);
EXECUTE IMMEDIATE stmt;
-- Vacuum old versions (7 days retention)
SET stmt = format_string('VACUUM %s.%s.%s RETAIN 168 HOURS',
rec.table_catalog, rec.table_schema, rec.table_name);
EXECUTE IMMEDIATE stmt;
END FOR;
END;
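The SMALL_FILES / TOO_MANY_FILES heuristic from the storage view drives when this maintenance runs, and it is easy to verify standalone:

```python
# Mirror of the optimization_status CASE in the storage-usage view.
SMALL_FILE_THRESHOLD = 100 * 1024 * 1024  # 100 MB

def optimization_status(avg_file_size_bytes: float, num_files: int) -> str:
    if avg_file_size_bytes < SMALL_FILE_THRESHOLD:
        return "SMALL_FILES"      # candidate for OPTIMIZE / compaction
    if num_files > 1000:
        return "TOO_MANY_FILES"
    return "OPTIMAL"

print(optimization_status(10 * 1024 * 1024, 50))    # SMALL_FILES
print(optimization_status(200 * 1024 * 1024, 5000)) # TOO_MANY_FILES
print(optimization_status(200 * 1024 * 1024, 100))  # OPTIMAL
```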
Monitoring, Audit, and Alerts
Complete Monitoring Stack
-- SQL: enterprise monitoring setup
-- 1. Audit logging: Databricks delivers audit events to the
-- system.access.audit system table (there is no CREATE AUDIT LOG
-- TABLE statement); keep a curated copy with Change Data Feed
-- enabled for downstream consumers
CREATE TABLE IF NOT EXISTS main.audit.unified_audit_log
USING DELTA
LOCATION 's3://company-audit/logs/'
TBLPROPERTIES (
'delta.enableChangeDataFeed' = 'true'
)
AS SELECT * FROM system.access.audit WHERE 1 = 0; -- schema only
-- What gets logged (queries, data access, schema and security
-- changes) is defined by the audit log schema itself; system
-- tables are enabled per account, not via metastore properties
-- 2. Real-time monitoring dashboard
CREATE OR REPLACE VIEW main.monitoring.realtime_dashboard AS
WITH system_metrics AS (
SELECT
date_trunc('minute', start_time) as timestamp,
-- Query metrics
COUNT(DISTINCT query_id) as active_queries,
SUM(execution_time) / 1000 as total_execution_seconds,
-- User metrics
COUNT(DISTINCT user_name) as active_users,
-- Error metrics
SUM(CASE WHEN error_message IS NOT NULL THEN 1 ELSE 0 END) as errors,
-- Performance metrics
AVG(execution_time) as avg_query_time,
PERCENTILE(execution_time, 0.95) as p95_query_time
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '1 hour'
GROUP BY 1
),
storage_metrics AS (
SELECT
current_timestamp() as timestamp,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as total_storage_gb,
COUNT(DISTINCT table_name) as total_tables,
SUM(num_files) as total_files
FROM system.information_schema.table_storage
),
access_metrics AS (
SELECT
date_trunc('minute', event_time) as timestamp,
COUNT(*) as total_accesses,
COUNT(DISTINCT user_name) as unique_users,
COUNT(DISTINCT table_name) as tables_accessed
FROM system.access.audit
WHERE event_time > current_timestamp() - INTERVAL '1 hour'
GROUP BY 1
)
SELECT
COALESCE(sm.timestamp, am.timestamp) as timestamp,
sm.active_queries,
sm.total_execution_seconds,
sm.active_users,
sm.errors,
sm.avg_query_time,
sm.p95_query_time,
stm.total_storage_gb,
stm.total_tables,
stm.total_files,
am.total_accesses,
am.unique_users,
am.tables_accessed,
-- Health score (0-100)
CASE
WHEN sm.errors > 10 THEN 50
WHEN sm.p95_query_time > 30000 THEN 70
ELSE 95
END as system_health_score
FROM system_metrics sm
FULL OUTER JOIN storage_metrics stm
ON sm.timestamp = stm.timestamp
FULL OUTER JOIN access_metrics am
ON sm.timestamp = am.timestamp;
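The health score in this dashboard is a simple tiered rubric; mirroring it in Python makes the thresholds testable before they drive alerts:

```python
# Mirror of the system_health_score CASE in the dashboard view.
def system_health_score(errors: int, p95_query_time_ms: float) -> int:
    if errors > 10:
        return 50   # error burst dominates everything else
    if p95_query_time_ms > 30_000:
        return 70   # tail latency is degraded
    return 95

print(system_health_score(0, 5_000))   # 95
print(system_health_score(3, 45_000))  # 70
print(system_health_score(20, 5_000))  # 50
```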
-- 3. Alerting (CREATE PROCEDURE requires a recent Databricks
-- runtime; system.alert() is a placeholder for your own webhook,
-- Slack, or e-mail hook)
CREATE OR REPLACE PROCEDURE check_and_alert()
LANGUAGE SQL
AS BEGIN
DECLARE error_count INT;
DECLARE slow_query_count INT;
DECLARE unauthorized_access_count INT;
-- Check for errors
SELECT COUNT(*) INTO error_count
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '5 minutes'
AND error_message IS NOT NULL;
-- Check for slow queries
SELECT COUNT(*) INTO slow_query_count
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '5 minutes'
AND execution_time > 60000; -- > 60 seconds
-- Check for denied access attempts
SELECT COUNT(*) INTO unauthorized_access_count
FROM system.access.audit
WHERE event_time > current_timestamp() - INTERVAL '5 minutes'
AND response.status_code = 403;
-- Send alerts if thresholds are exceeded
IF error_count > 5 THEN
CALL system.alert('HIGH_ERROR_RATE',
'High error rate detected: ' || error_count || ' errors in last 5 minutes');
END IF;
IF slow_query_count > 10 THEN
CALL system.alert('SLOW_QUERIES',
'Multiple slow queries detected: ' || slow_query_count);
END IF;
IF unauthorized_access_count > 0 THEN
CALL system.alert('UNAUTHORIZED_ACCESS',
'Unauthorized access attempts: ' || unauthorized_access_count);
END IF;
END;
-- 4. Scheduling: there is no CREATE JOB SQL statement; run the
-- statements below as a Databricks Workflows (Jobs) SQL task on a
-- 5-minute schedule
CALL check_and_alert();
-- Update monitoring tables
INSERT INTO main.monitoring.hourly_metrics
SELECT * FROM main.monitoring.realtime_dashboard
WHERE timestamp > current_timestamp() - INTERVAL '1 hour';
-- 5. GDPR compliance monitoring
CREATE OR REPLACE VIEW main.compliance.access_violations AS
SELECT
event_time,
user_name,
service_name,
action_name,
table_name,
query_text,
ip_address,
user_agent,
CASE
WHEN action_name IN ('SELECT', 'SHOW', 'DESCRIBE')
THEN 'DATA_ACCESS'
WHEN action_name IN ('INSERT', 'UPDATE', 'DELETE', 'MERGE')
THEN 'DATA_MODIFICATION'
WHEN action_name IN ('CREATE', 'ALTER', 'DROP')
THEN 'SCHEMA_CHANGE'
ELSE 'OTHER'
END as event_category
-- simplified: system.access.audit stores the accessed object in
-- request_params; flat table_name / query_text columns are assumed
FROM system.access.audit
WHERE response.status_code = 403 -- denied attempts
OR table_name IN (
SELECT table_name
FROM system.information_schema.table_tags
WHERE tag_name = 'data_sensitivity'
AND tag_value IN ('HIGH', 'SENSITIVE_PII')
)
ORDER BY event_time DESC;
Cost Management for SMBs
Cost Breakdown
-- SQL: cost monitoring and optimization
-- 1. Cost tracking per team/department
CREATE OR REPLACE VIEW main.finance.cost_analysis AS
WITH compute_costs AS (
SELECT
-- simplified: in system.billing.usage, DBUs are recorded in the
-- usage_quantity column and the user in identity_metadata; the
-- column names below are a readable approximation
user_name,
-- Extract team from email/username
SPLIT(user_name, '@')[0] as team,
SUM(DBUs_used) as total_dbus,
SUM(DBUs_used * 0.65) as compute_cost, -- €0.65/DBU (assumed rate)
COUNT(DISTINCT query_id) as query_count,
AVG(execution_time) / 1000 as avg_query_seconds
FROM system.billing.usage
WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE())
GROUP BY 1
),
storage_costs AS (
SELECT
table_catalog,
table_schema,
-- Map schema to team
CASE
WHEN table_schema LIKE '%sales%' THEN 'sales'
WHEN table_schema LIKE '%marketing%' THEN 'marketing'
WHEN table_schema LIKE '%finance%' THEN 'finance'
ELSE 'other'
END as team,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as storage_gb,
SUM(size_in_bytes) / 1024 / 1024 / 1024 * 0.023 as storage_cost -- €0.023/GB/month (assumed rate)
FROM system.information_schema.table_storage
GROUP BY 1, 2
),
unity_catalog_costs AS (
SELECT
'unity_catalog' as cost_type,
COUNT(DISTINCT table_name) * 0.20 as catalog_cost -- €0.20/table/month (assumed rate)
FROM system.information_schema.tables
)
SELECT
COALESCE(cc.team, sc.team) as team,
SUM(COALESCE(cc.compute_cost, 0)) as compute_cost,
SUM(COALESCE(sc.storage_cost, 0)) as storage_cost,
SUM(COALESCE(cc.compute_cost, 0) + COALESCE(sc.storage_cost, 0)) as total_cost,
-- Cost efficiency metrics
SUM(COALESCE(cc.query_count, 0)) /
NULLIF(SUM(COALESCE(cc.compute_cost, 0)), 0) as queries_per_euro,
SUM(COALESCE(sc.storage_gb, 0)) /
NULLIF(SUM(COALESCE(cc.query_count, 0)), 0) as gb_per_query
FROM compute_costs cc
FULL OUTER JOIN storage_costs sc
ON cc.team = sc.team
GROUP BY 1
ORDER BY total_cost DESC;
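The cost model behind this view boils down to two rates. A sketch with the same assumed prices (€0.65 per DBU, €0.023 per GB per month; check your actual contract rates):

```python
# Sketch of the per-team cost model used in the view above;
# both rates are assumptions, not official Databricks pricing.
DBU_RATE_EUR = 0.65
STORAGE_RATE_EUR_PER_GB_MONTH = 0.023

def monthly_cost(dbus: float, storage_gb: float) -> float:
    return dbus * DBU_RATE_EUR + storage_gb * STORAGE_RATE_EUR_PER_GB_MONTH

# 1000 DBUs and 500 GB of storage in a month
print(round(monthly_cost(1000, 500), 2))  # 661.5
```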
-- 2. Cost optimization recommendations
CREATE OR REPLACE VIEW main.finance.cost_optimization AS
SELECT
'STORAGE' as category,
table_name,
size_gb,
CASE
WHEN size_gb > 1000 THEN 'CONSIDER ARCHIVING'
WHEN last_access_days > 90 THEN 'CONSIDER DELETION'
WHEN file_count > 10000 THEN 'CONSIDER OPTIMIZE'
ELSE 'OPTIMAL'
END as recommendation,
estimated_savings
FROM (
SELECT
table_name,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as size_gb,
COUNT(*) as file_count,
DATEDIFF(day, MAX(last_access), CURRENT_DATE()) as last_access_days,
CASE
WHEN DATEDIFF(day, MAX(last_access), CURRENT_DATE()) > 365
THEN SUM(size_in_bytes) / 1024 / 1024 / 1024 * 0.023 * 12
ELSE 0
END as estimated_savings
FROM system.information_schema.table_storage
GROUP BY 1
)
UNION ALL
SELECT
'COMPUTE' as category,
user_name as resource,
total_dbus,
CASE
WHEN avg_query_time > 30000 THEN 'OPTIMIZE QUERIES'
WHEN query_count < 10 THEN 'REDUCE ENVIRONMENT SIZE'
ELSE 'OPTIMAL'
END as recommendation,
CASE
WHEN avg_query_time > 30000 THEN total_dbus * 0.10
ELSE 0
END as estimated_savings
FROM (
SELECT
user_name,
SUM(DBUs_used) as total_dbus,
AVG(execution_time) as avg_query_time,
COUNT(*) as query_count
FROM system.billing.usage
GROUP BY 1
);
-- 3. Auto-scale recommendations
CREATE OR REPLACE PROCEDURE optimize_cluster_sizes()
LANGUAGE SQL
AS BEGIN
DECLARE avg_utilization DECIMAL(5, 2);
DECLARE peak_utilization DECIMAL(5, 2);
-- Calculate utilization (assumed helper table; in practice derive
-- this from system.compute.node_timeline)
SELECT
AVG(active_workers / max_workers),
MAX(active_workers / max_workers)
INTO avg_utilization, peak_utilization
FROM main.monitoring.cluster_usage
WHERE timestamp > CURRENT_TIMESTAMP() - INTERVAL '7 days';
-- Make recommendations (system.alert() is a placeholder hook)
IF avg_utilization < 0.3 THEN
CALL system.alert('UNDERUTILIZED_CLUSTERS',
'Average cluster utilization only ' || ROUND(avg_utilization * 100, 1) || '%. Consider downsizing.');
ELSEIF peak_utilization > 0.9 THEN
CALL system.alert('OVERUTILIZED_CLUSTERS',
'Peak cluster utilization ' || ROUND(peak_utilization * 100, 1) || '%. Consider upsizing.');
END IF;
END;
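The sizing thresholds above (30% average, 90% peak) are the whole decision rule, so they deserve a standalone test:

```python
# Mirror of the sizing thresholds in the procedure above.
def sizing_recommendation(avg_util: float, peak_util: float) -> str:
    if avg_util < 0.3:
        return "DOWNSIZE"   # paying for idle capacity
    if peak_util > 0.9:
        return "UPSIZE"     # peaks saturate the cluster
    return "OK"

print(sizing_recommendation(0.2, 0.5))   # DOWNSIZE
print(sizing_recommendation(0.6, 0.95))  # UPSIZE
print(sizing_recommendation(0.5, 0.8))   # OK
```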
Best Practices for SMB Implementation
SMB Unity Catalog Checklist
Phase 1: Foundation (Weeks 1-2)
- ✅ Metastore set up on central cloud storage
- ✅ Catalog structure (main, analytics, sandbox)
- ✅ Schema structure (bronze/silver/gold)
- ✅ Basic RBAC (admin, engineer, analyst roles)
- ✅ Audit logging enabled
- Cost: €500-€1,000 setup
Phase 2: Governance (Weeks 3-4)
- ✅ PII detection enabled
- ✅ Auto-tagging of sensitive data
- ✅ Column masking policies
- ✅ Lineage tracking enabled
- ✅ Data quality monitoring
- Cost: €200-€500/month (UC Premium)
Phase 3: Optimization (Weeks 5-8)
- ✅ Query performance monitoring
- ✅ Storage optimization (Z-order, compaction)
- ✅ Cost monitoring and alerts
- ✅ Automated maintenance jobs
- ✅ Team training and documentation
- Cost: €300-€800/month (depending on usage)
Common SMB Mistakes
- Starting too complex: begin with 1 catalog, not 10
- No naming conventions: use clear, consistent names
- No cost monitoring: track costs from day one
- Skipping access control: RBAC is not optional
- No backup strategy: back up both metadata and data
- Not measuring: put monitoring in place before problems appear
Conclusion: Unity Catalog as the SMB Data Foundation
ROI Analysis for SMBs
Cost Breakdown (Annual)
- Unity Catalog Premium: €2,400-€6,000
- Compute (DBUs): €3,000-€12,000
- Storage: €600-€2,400
- Total: €6,000-€20,400/year
Savings (Annual)
- Less manual work: €5,000-€15,000
- GDPR compliance automation: €10,000-€25,000
- Better data quality: €3,000-€8,000
- Data breach prevention: €20,000+ (risk)
- Total: €38,000-€68,000/year
ROI Summary
- Investment: €6,000-€20,400
- Savings: €38,000-€68,000
- Net benefit: €32,000-€47,600
- Payback period: 3-6 months
- ROI percentage: 200-400%
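The net-benefit range follows directly from the cost and savings ranges, which a two-line check confirms:

```python
# Sanity check of the net-benefit figures in the ROI summary:
# net benefit = annual savings - annual investment, per endpoint.
investment = (6_000, 20_400)
savings = (38_000, 68_000)
net = (savings[0] - investment[0], savings[1] - investment[1])
print(net)  # (32000, 47600)
```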
Next Steps for You
Step 1: Free Unity Catalog Assessment
Use our free assessment tool to analyze your current state and receive a personalized implementation plan.
Step 2: Proof of Concept
Start with a 30-day PoC on Databricks with Unity Catalog (free credits available).
Step 3: Phased Implementation
Follow the 8-week roadmap from this blog and measure ROI after each phase.
"Unity Catalog transforms data governance from expensive enterprise overhead into a worthwhile SMB investment that pays for itself within months. The question is no longer 'can we afford it?' but 'can we afford not to?'"