What is Unity Catalog? (Beyond the Basics)
Unity Catalog is not just another data catalog: it is a **centralized governance layer** that sits across your entire Databricks deployment. It provides a unified management layer for data, AI assets, and access control across your whole data estate.
Unity Catalog Architecture: The Three-Layer Model
1. Governance Layer
- Central metadata store: a single source of truth
- Access control policies: fine-grained RBAC
- Audit logging: complete activity tracking
- Lineage tracking: end-to-end data flow visualization
2. Data Layer
- Multi-cloud support: AWS, Azure, GCP
- Storage abstraction: uniform access
- Delta format: ACID transactions
- Time travel: data versioning
3. Compute Layer
- Shared infrastructure: efficient resource usage
- Isolated execution: secure query isolation
- Autoscaling: dynamic scaling
- Photon engine: high-performance queries
Core Concepts and Hierarchy
Unity Catalog Object Model Hierarchy
-- Unity Catalog hierarchy: top-down
-- 1. METASTORE (top level)
-- ├── Holds all metadata
-- ├── Cross-workspace sharing
-- └── Central point of administration
-- 2. CATALOG (organizational level)
-- ├── main (default catalog)
-- ├── analytics (for BI/data science)
-- ├── sandbox (for development)
-- └── sensitive (for PII data)
-- 3. SCHEMA (database/dataset level)
-- ├── default (inside every catalog)
-- ├── bronze (raw data)
-- ├── silver (cleaned data)
-- └── gold (business ready)
-- 4. TABLE/VIEW (data objects)
-- ├── Delta tables (ACID compliant)
-- ├── Views (logical abstractions)
-- ├── Materialized views (precomputed)
-- └── External tables (on-prem/cloud)
-- 5. COLUMN (lowest level)
-- ├── Column-level security
-- ├── Masking policies
-- ├── Tagging (PII, sensitivity)
-- └── Lineage tracking
-- Practical hierarchy example
METASTORE: "company_metastore"
│
├── CATALOG: "main"
│ │
│ ├── SCHEMA: "sales"
│ │ ├── TABLE: "orders"
│ │ │ ├── COLUMN: "customer_id" (PII tagged)
│ │ │ ├── COLUMN: "order_amount"
│ │ │ └── COLUMN: "order_date"
│ │ │
│ │ └── VIEW: "monthly_sales"
│ │
│ └── SCHEMA: "marketing"
│ ├── TABLE: "campaigns"
│ └── TABLE: "leads"
│
└── CATALOG: "analytics"
│
└── SCHEMA: "bi"
├── TABLE: "kpi_dashboard"
└── VIEW: "executive_summary"
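The tree above maps directly onto a nested data model. As a minimal sketch (the classes and names below are illustrative, not a Databricks API), every object resolves to a three-level name:

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    name: str
    tables: list = field(default_factory=list)

@dataclass
class Catalog:
    name: str
    schemas: dict = field(default_factory=dict)

@dataclass
class Metastore:
    name: str
    catalogs: dict = field(default_factory=dict)

    def full_name(self, catalog: str, schema: str, table: str) -> str:
        # Every table is addressed by a three-level name:
        # <catalog>.<schema>.<table>
        if table not in self.catalogs[catalog].schemas[schema].tables:
            raise KeyError(f"unknown table: {table}")
        return f"{catalog}.{schema}.{table}"

ms = Metastore("company_metastore")
ms.catalogs["main"] = Catalog("main", {"sales": Schema("sales", ["orders"])})
print(ms.full_name("main", "sales", "orders"))  # main.sales.orders
```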
Metastore
- Top-level container: holds all metadata
- Cross-workspace: shared across multiple workspaces
- Central management: a single point of governance
- Backup/restore: metadata backup capabilities
Catalog
- First organizational level: e.g. departments
- Isolation boundary: security and access boundary
- Resource grouping: logical grouping of data
- Default: "main": the standard catalog
Schema
- Database equivalent: container for tables/views
- Data lake mapping: maps to directories
- Medallion architecture: bronze/silver/gold
- Access control: permissions at schema level
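With the hierarchy and medallion conventions in place, it is worth validating object names before creating them. A small hypothetical helper (the allowed catalog and schema names are the ones used in this post, not anything Unity Catalog enforces):

```python
# Hypothetical naming-convention check for three-level names:
# <catalog>.<schema>.<table>, restricted to the catalogs and
# medallion schemas used as examples in this post.
ALLOWED_CATALOGS = {"main", "analytics", "sandbox", "sensitive"}
ALLOWED_SCHEMAS = {"default", "bronze", "silver", "gold", "sensitive"}

def validate_name(full_name: str) -> bool:
    parts = full_name.split(".")
    if len(parts) != 3:
        return False
    catalog, schema, table = parts
    return (catalog in ALLOWED_CATALOGS
            and schema in ALLOWED_SCHEMAS
            and table.isidentifier())

print(validate_name("main.gold.sales"))  # True
print(validate_name("main.gold"))        # False (not a three-level name)
```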
Enterprise Setup for SMBs
Step-by-Step Unity Catalog Setup
Infrastructure Setup
# Terraform for the Unity Catalog setup
resource "databricks_metastore" "company_metastore" {
  name          = "company_metastore"
  storage_root  = "s3://company-uc-metadata"
  region        = "eu-west-1"
  force_destroy = true # only advisable outside production
}

resource "databricks_metastore_data_access" "uc_access" {
  metastore_id = databricks_metastore.company_metastore.id
  name         = "uc-storage-access"
  aws_iam_role {
    role_arn = aws_iam_role.uc_role.arn
  }
}

resource "databricks_metastore_assignment" "assignment" {
  metastore_id         = databricks_metastore.company_metastore.id
  workspace_id         = databricks_mws_workspaces.workspace.workspace_id
  default_catalog_name = "main"
}
Initial Catalog Setup
-- SQL: catalog and schema creation
-- Create catalogs for different purposes
CREATE CATALOG IF NOT EXISTS main
COMMENT 'Primary catalog for business data';
CREATE CATALOG IF NOT EXISTS analytics
COMMENT 'Catalog for BI and analytics data';
CREATE CATALOG IF NOT EXISTS sandbox
COMMENT 'Development and testing catalog';
-- Create schemas within each catalog
USE CATALOG main;
CREATE SCHEMA IF NOT EXISTS bronze
COMMENT 'Raw/bronze layer data';
CREATE SCHEMA IF NOT EXISTS silver
COMMENT 'Cleaned/silver layer data';
CREATE SCHEMA IF NOT EXISTS gold
COMMENT 'Business ready/gold layer data';
CREATE SCHEMA IF NOT EXISTS sensitive
COMMENT 'Schema for sensitive/PII data';
Access Control Setup
-- SQL: RBAC setup for different teams
-- Note: Unity Catalog grants privileges to groups; groups are
-- created in the account console (or via SCIM), not with SQL
-- Grant catalog-level privileges
GRANT USE CATALOG ON CATALOG main TO `data_engineers`;
GRANT USE CATALOG ON CATALOG analytics TO `business_analysts`;
GRANT USE CATALOG ON CATALOG sandbox TO `data_scientists`;
-- Grant specific privileges at schema level
GRANT USE SCHEMA, SELECT ON SCHEMA main.bronze TO `data_engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.silver TO `data_scientists`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.gold TO `business_analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA main.gold TO `executives`;
-- Grant create privileges for engineers
GRANT CREATE TABLE ON SCHEMA main.bronze TO `data_engineers`;
GRANT CREATE TABLE ON SCHEMA main.silver TO `data_engineers`;
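Grant patterns like this get repetitive fast. A sketch that generates the statements from a group-to-privileges mapping (group and schema names are the assumed examples from this post):

```python
# Generate GRANT statements from a group -> (schema, privileges)
# mapping; all names below are illustrative.
GRANTS = {
    "data_engineers": [
        ("main.bronze", ["USE SCHEMA", "SELECT", "CREATE TABLE"]),
        ("main.silver", ["USE SCHEMA", "SELECT", "CREATE TABLE"]),
    ],
    "business_analysts": [
        ("main.gold", ["USE SCHEMA", "SELECT"]),
    ],
}

def grant_statements(grants: dict) -> list:
    stmts = []
    for group, targets in grants.items():
        for schema, privileges in targets:
            stmts.append(
                f"GRANT {', '.join(privileges)} ON SCHEMA {schema} TO `{group}`;")
    return stmts

for stmt in grant_statements(GRANTS):
    print(stmt)
```

Generating grants from one source of truth keeps the RBAC model reviewable in version control instead of scattered across ad-hoc queries.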
Advanced Unity Catalog Features
Lineage Tracking
-- SQL: lineage queries (system tables must be enabled for the account)
-- Full upstream lineage of a table
SELECT * FROM system.access.table_lineage
WHERE target_table_full_name = 'main.gold.customer_360';
-- Downstream impact analysis for a single column
SELECT * FROM system.access.column_lineage
WHERE source_table_full_name = 'main.silver.transactions'
AND source_column_name = 'amount';
-- Lineage visualization in a notebook
%python
from databricks.sdk import WorkspaceClient
import networkx as nx

def visualize_lineage(table_name: str):
    # The SDK has no dedicated lineage client; call the REST
    # endpoint directly (response shape may vary by API version)
    w = WorkspaceClient()
    lineage = w.api_client.do(
        "GET", "/api/2.0/lineage-tracking/table-lineage",
        query={"table_name": table_name, "include_entity_lineage": "true"},
    )
    # Build a directed graph: upstream -> table -> downstream
    G = nx.DiGraph()
    for up in lineage.get("upstreams", []):
        G.add_edge(up.get("tableInfo", {}).get("name", "?"), table_name)
    for down in lineage.get("downstreams", []):
        G.add_edge(table_name, down.get("tableInfo", {}).get("name", "?"))
    # Visualize
    nx.draw(G, with_labels=True)
    return G
Tagging & Classification
-- SQL: tagging and classification
-- Note: Databricks has no CREATE TAG statement; tags are free-form
-- key/value pairs applied with SET TAGS (governed tag policies are
-- managed in the account console)
-- Apply tags to tables
ALTER TABLE main.sensitive.customers
SET TAGS ('data_sensitivity' = 'CONFIDENTIAL');
ALTER TABLE main.gold.sales
SET TAGS ('business_domain' = 'SALES',
          'data_quality' = 'CERTIFIED');
-- Apply tags to columns (PII detection)
ALTER TABLE main.sensitive.customers
ALTER COLUMN email
SET TAGS ('data_sensitivity' = 'PII');
ALTER TABLE main.sensitive.customers
ALTER COLUMN bsn
SET TAGS ('data_sensitivity' = 'SENSITIVE_PII');
-- Query by tag
SELECT
catalog_name,
schema_name,
table_name,
column_name,
tag_name,
tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'data_sensitivity'
AND tag_value = 'PII';
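Once the tag inventory is queryable, filtering it down to PII columns is simple list processing. A sketch over rows shaped like the tag views above (the sample rows are made up):

```python
# Filter a column-tag inventory (rows shaped like the
# information_schema tag views) down to PII-tagged columns.
rows = [
    {"table_name": "customers", "column_name": "email",
     "tag_name": "data_sensitivity", "tag_value": "PII"},
    {"table_name": "customers", "column_name": "bsn",
     "tag_name": "data_sensitivity", "tag_value": "SENSITIVE_PII"},
    {"table_name": "sales", "column_name": "order_amount",
     "tag_name": "data_quality", "tag_value": "CERTIFIED"},
]

def pii_columns(tag_rows):
    # Both 'PII' and 'SENSITIVE_PII' count as PII
    return [(r["table_name"], r["column_name"]) for r in tag_rows
            if r["tag_name"] == "data_sensitivity"
            and r["tag_value"].endswith("PII")]

print(pii_columns(rows))  # [('customers', 'email'), ('customers', 'bsn')]
```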
Fine-grained Access Control
-- SQL: advanced access control
-- Column-level security: in Unity Catalog a column mask is a SQL UDF
CREATE OR REPLACE FUNCTION email_mask(email STRING)
RETURN CASE
WHEN is_account_group_member('admins')
OR is_account_group_member('data_engineers')
THEN email
ELSE regexp_replace(email, '(?<=.).(?=.*@)', '*')
END;
CREATE OR REPLACE FUNCTION amount_mask(amount DECIMAL(18, 2))
RETURN CASE
WHEN is_account_group_member('admins')
OR is_account_group_member('executives')
THEN amount
ELSE NULL
END;
-- Apply the masking functions
ALTER TABLE main.sensitive.customers
ALTER COLUMN email
SET MASK email_mask;
ALTER TABLE main.gold.sales
ALTER COLUMN revenue
SET MASK amount_mask;
-- Row-level security: a row filter is a SQL UDF returning BOOLEAN
CREATE OR REPLACE FUNCTION ssn_filter(ssn STRING)
RETURN is_account_group_member('admins')
OR is_account_group_member('hr');
ALTER TABLE main.sensitive.employees
SET ROW FILTER ssn_filter ON (ssn);
-- Dynamic views with security built in
CREATE VIEW main.analytics.safe_customer_data AS
SELECT
customer_id,
CASE
WHEN current_user() IN ('admin', 'sales_manager')
THEN customer_name
ELSE 'CONFIDENTIAL'
END as customer_name,
CASE
WHEN current_user() IN ('admin', 'sales_manager')
THEN email
ELSE regexp_replace(email, '(?<=.).(?=.*@)', '*')
END as email,
country,
customer_segment
FROM main.sensitive.customers;
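The masking expression uses a Java-style regex that Python's `re` module also understands, so you can preview what non-privileged users would see without touching the platform:

```python
import re

def mask_email(email: str) -> str:
    # Same pattern as the SQL mask: star out every character that
    # has at least one character before it and an '@' after it
    return re.sub(r"(?<=.).(?=.*@)", "*", email)

print(mask_email("jdoe@example.com"))  # j***@example.com
```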
Data Sharing & Collaboration
-- SQL: data sharing inside and outside the organization
-- A share is a named collection of objects (Delta Sharing)
CREATE SHARE sales_data_share
COMMENT 'Share sales data with the analytics team';
-- Add objects to the share
ALTER SHARE sales_data_share
ADD TABLE main.gold.sales;
ALTER SHARE sales_data_share
ADD VIEW main.analytics.sales_dashboard;
-- Create a recipient (the party that receives your data)
CREATE RECIPIENT supplier_data
COMMENT 'External business partner';
-- Grant the recipient access to the share
GRANT SELECT ON SHARE sales_data_share
TO RECIPIENT supplier_data;
-- As a recipient, a provider object appears once you import the
-- share credential; mount the share as a catalog and query it
CREATE CATALOG supplier_catalog
USING SHARE supplier_provider.sales_share;
SELECT * FROM supplier_catalog.default.sales_metrics;
-- Near real-time sharing: include table history so recipients can
-- read the shared table as a stream or with time travel
ALTER SHARE sales_data_share
ADD TABLE main.gold.real_time_metrics
WITH HISTORY;
PII Detection and GDPR Compliance
Automated PII Workflow
-- SQL: complete PII compliance workflow
-- 1. There is no metastore-level PII switch in Databricks;
-- detection means scanning sampled column values (for example with
-- the ai_classify() SQL function) and persisting the results
-- 2. Create a dedicated schema structure for PII
CREATE SCHEMA IF NOT EXISTS main.pii_data
MANAGED LOCATION 's3://company-data/pii/'
COMMENT 'Schema for PII/GDPR-sensitive data';
-- 3. Build a PII detection log (assumes a scan job has written
-- candidate columns to main.audit.pii_scan_results)
CREATE OR REPLACE TABLE main.audit.pii_detection_log
AS SELECT
current_timestamp() as detection_time,
catalog_name,
schema_name,
table_name,
column_name,
pii_type,
confidence_score,
sample_value,
CASE
WHEN pii_type IN ('EMAIL', 'PHONE', 'SSN', 'IBAN')
THEN 'SENSITIVE'
WHEN pii_type IN ('NAME', 'ADDRESS')
THEN 'IDENTIFIER'
ELSE 'NON_PII'
END as sensitivity_level
FROM main.audit.pii_scan_results;
-- 4. Automated PII tagging (SQL scripting; requires a recent
-- Databricks Runtime that supports BEGIN/END compound statements)
BEGIN
DECLARE stmt STRING;
FOR rec AS
SELECT
catalog_name,
schema_name,
table_name,
column_name,
pii_type
FROM main.audit.pii_detection_log
WHERE confidence_score > 0.8
DO
SET stmt = format_string(
'ALTER TABLE %s.%s.%s ALTER COLUMN %s
SET TAGS (''pii_type'' = ''%s'', ''sensitivity'' = ''%s'')',
rec.catalog_name,
rec.schema_name,
rec.table_name,
rec.column_name,
rec.pii_type,
CASE
WHEN rec.pii_type IN ('EMAIL', 'PHONE', 'SSN')
THEN 'HIGH'
ELSE 'MEDIUM'
END);
EXECUTE IMMEDIATE stmt;
END FOR;
END;
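The classification rules in the detection log and the tagging step are plain conditional logic, which makes them easy to unit-test outside the platform. A Python mirror of both CASE expressions:

```python
# Mirror of the CASE expressions used in the PII workflow above.
def sensitivity_level(pii_type: str) -> str:
    if pii_type in {"EMAIL", "PHONE", "SSN", "IBAN"}:
        return "SENSITIVE"
    if pii_type in {"NAME", "ADDRESS"}:
        return "IDENTIFIER"
    return "NON_PII"

def tag_priority(pii_type: str) -> str:
    # The tagging step assigns HIGH to direct identifiers
    return "HIGH" if pii_type in {"EMAIL", "PHONE", "SSN"} else "MEDIUM"

print(sensitivity_level("IBAN"))   # SENSITIVE
print(sensitivity_level("NAME"))   # IDENTIFIER
print(tag_priority("ADDRESS"))     # MEDIUM
```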
-- 5. GDPR compliance dashboard query
CREATE OR REPLACE VIEW main.compliance.gdpr_dashboard AS
WITH pii_inventory AS (
SELECT
catalog_name,
schema_name,
table_name,
COUNT(DISTINCT column_name) as total_columns,
SUM(CASE WHEN tag_value = 'HIGH' THEN 1 ELSE 0 END) as high_sensitivity,
SUM(CASE WHEN tag_value = 'MEDIUM' THEN 1 ELSE 0 END) as medium_sensitivity
FROM system.information_schema.column_tags
WHERE tag_name = 'sensitivity'
GROUP BY 1, 2, 3
),
access_audit AS (
-- simplified: system.access.audit records the accessed object in
-- request_params; the flat table_* columns are assumed here
SELECT
table_catalog,
table_schema,
table_name,
COUNT(DISTINCT user_name) as unique_users,
MIN(event_time) as first_access,
MAX(event_time) as last_access
FROM system.access.audit
WHERE service_name = 'databricks-sql'
GROUP BY 1, 2, 3
)
SELECT
pi.catalog_name,
pi.schema_name,
pi.table_name,
pi.total_columns,
pi.high_sensitivity,
pi.medium_sensitivity,
COALESCE(aa.unique_users, 0) as access_count,
COALESCE(aa.first_access, current_timestamp()) as first_access,
COALESCE(aa.last_access, current_timestamp()) as last_access,
-- Compliance score
CASE
WHEN pi.high_sensitivity = 0 THEN 100
WHEN pi.high_sensitivity > 0 AND aa.unique_users <= 3 THEN 90
WHEN pi.high_sensitivity > 0 AND aa.unique_users <= 10 THEN 80
ELSE 70
END as compliance_score
FROM pii_inventory pi
LEFT JOIN access_audit aa
ON pi.catalog_name = aa.table_catalog
AND pi.schema_name = aa.table_schema
AND pi.table_name = aa.table_name
ORDER BY compliance_score ASC;
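The compliance score in the view above is a fixed rubric, so it can be verified in isolation. A Python mirror of the scoring CASE:

```python
# Mirror of the compliance-score CASE in the GDPR dashboard view.
def compliance_score(high_sensitivity_cols: int, unique_users: int) -> int:
    if high_sensitivity_cols == 0:
        return 100          # no high-sensitivity columns at all
    if unique_users <= 3:
        return 90           # sensitive data, tightly scoped access
    if unique_users <= 10:
        return 80
    return 70               # sensitive data with broad access

print(compliance_score(0, 50))  # 100
print(compliance_score(2, 2))   # 90
print(compliance_score(2, 25))  # 70
```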
Performance Optimization
Query Performance
-- SQL: performance monitoring and optimization
-- 1. Query performance analysis
CREATE OR REPLACE TABLE main.monitoring.query_performance
AS SELECT
query_id,
query_text,
user_name,
execution_time,
result_size,
rows_produced,
-- Performance metrics
execution_time / NULLIF(rows_produced, 0) as ms_per_row,
CASE
WHEN execution_time > 60000 THEN 'SLOW'
WHEN execution_time > 30000 THEN 'MEDIUM'
ELSE 'FAST'
END as performance_category
FROM system.query.history
WHERE start_time > current_date() - 7;
-- 2. Table statistics for the query optimizer
ANALYZE TABLE main.gold.sales
COMPUTE STATISTICS FOR ALL COLUMNS;
-- Show table statistics
DESCRIBE DETAIL main.gold.sales;
-- 3. Liquid clustering (co-locates related rows)
CREATE OR REPLACE TABLE main.gold.optimized_sales
USING DELTA
-- Optimize for the most common query patterns
CLUSTER BY (customer_id, order_date)
LOCATION 's3://company-data/gold/sales/'
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
AS SELECT * FROM main.silver.sales;
-- 4. Materialized views for frequently used aggregations
CREATE MATERIALIZED VIEW main.gold.daily_sales_mv
SCHEDULE EVERY 1 HOUR
AS SELECT
date_trunc('day', order_date) as sales_day,
customer_segment,
product_category,
SUM(order_amount) as daily_sales,
COUNT(DISTINCT customer_id) as unique_customers,
AVG(order_amount) as avg_order_value
FROM main.gold.sales
GROUP BY 1, 2, 3;
-- 5. Query acceleration with Photon
-- Photon is enabled on the SQL warehouse or cluster, not per query;
-- with Photon on, scans and aggregations like this are vectorized
SELECT
customer_id,
SUM(order_amount) as total_spent
FROM main.gold.sales
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
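The FAST/MEDIUM/SLOW buckets from the monitoring table are worth unit-testing, including the division-by-zero edge case the `ms_per_row` metric needs to handle:

```python
# Mirror of the classification logic in main.monitoring.query_performance;
# execution times are in milliseconds.
def performance_category(execution_time_ms: int) -> str:
    if execution_time_ms > 60_000:
        return "SLOW"
    if execution_time_ms > 30_000:
        return "MEDIUM"
    return "FAST"

def ms_per_row(execution_time_ms: int, rows_produced: int):
    # Guard against queries that produce no rows
    return execution_time_ms / rows_produced if rows_produced else None

print(performance_category(45_000))  # MEDIUM
print(ms_per_row(45_000, 0))         # None
```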
Storage Optimization
-- SQL: storage management and optimization
-- 1. Storage analytics
CREATE OR REPLACE VIEW main.monitoring.storage_usage AS
SELECT
catalog_name,
schema_name,
table_name,
-- Storage metrics
SUM(size_in_bytes) / 1024 / 1024 / 1024 as size_gb,
SUM(num_files) as total_files,
AVG(size_in_bytes / NULLIF(num_files, 0)) / 1024 / 1024 as avg_file_size_mb,
-- Optimization opportunities
CASE
WHEN AVG(size_in_bytes / NULLIF(num_files, 0)) < 100 * 1024 * 1024 THEN 'SMALL_FILES'
WHEN SUM(num_files) > 1000 THEN 'TOO_MANY_FILES'
ELSE 'OPTIMAL'
END as optimization_status
-- assumed inventory table: Databricks has no table_storage system
-- view, so populate one yourself from DESCRIBE DETAIL output
FROM system.information_schema.table_storage
GROUP BY 1, 2, 3
ORDER BY size_gb DESC;
-- 2. Auto-optimization configuration
ALTER TABLE main.gold.large_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.targetFileSize' = '128mb'
);
-- 3. Manual optimization
OPTIMIZE main.gold.sales
ZORDER BY (customer_id, order_date);
-- Show optimization results
DESCRIBE HISTORY main.gold.sales;
-- 4. Vacuum (clean up old versions)
VACUUM main.gold.sales
RETAIN 168 HOURS; -- 7 days retention
-- 5. Clones for testing without copying data
-- Zero-copy clone (metadata only)
CREATE TABLE main.sandbox.sales_test
SHALLOW CLONE main.gold.sales;
-- Deep clone (copies the data)
CREATE TABLE main.backup.sales_backup
DEEP CLONE main.gold.sales;
-- 6. Table maintenance script (SQL scripting; requires a recent
-- Databricks Runtime that supports BEGIN/END compound statements)
BEGIN
DECLARE stmt STRING;
FOR rec AS
SELECT
table_catalog,
table_schema,
table_name
FROM system.information_schema.tables
WHERE table_schema IN ('bronze', 'silver', 'gold')
AND table_type = 'BASE TABLE'
DO
-- Compact small files
SET stmt = format_string('OPTIMIZE %s.%s.%s',
rec.table_catalog, rec.table_schema, rec.table_name);
EXECUTE IMMEDIATE stmt;
-- Vacuum old versions (7 days retention)
SET stmt = format_string('VACUUM %s.%s.%s RETAIN 168 HOURS',
rec.table_catalog, rec.table_schema, rec.table_name);
EXECUTE IMMEDIATE stmt;
END FOR;
END;
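The SMALL_FILES / TOO_MANY_FILES heuristic from the storage view drives when this maintenance runs, and it is easy to verify standalone:

```python
# Mirror of the optimization_status CASE in the storage-usage view.
SMALL_FILE_THRESHOLD = 100 * 1024 * 1024  # 100 MB

def optimization_status(avg_file_size_bytes: float, num_files: int) -> str:
    if avg_file_size_bytes < SMALL_FILE_THRESHOLD:
        return "SMALL_FILES"      # candidate for OPTIMIZE / compaction
    if num_files > 1000:
        return "TOO_MANY_FILES"
    return "OPTIMAL"

print(optimization_status(10 * 1024 * 1024, 50))    # SMALL_FILES
print(optimization_status(200 * 1024 * 1024, 5000)) # TOO_MANY_FILES
print(optimization_status(200 * 1024 * 1024, 100))  # OPTIMAL
```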
Monitoring, Audit, and Alerts
Complete Monitoring Stack
-- SQL: enterprise monitoring setup
-- 1. Audit logging: Databricks delivers audit events to the
-- system.access.audit system table (there is no CREATE AUDIT LOG
-- TABLE statement); keep a curated copy with Change Data Feed
-- enabled for downstream consumers
CREATE TABLE IF NOT EXISTS main.audit.unified_audit_log
USING DELTA
LOCATION 's3://company-audit/logs/'
TBLPROPERTIES (
'delta.enableChangeDataFeed' = 'true'
)
AS SELECT * FROM system.access.audit WHERE 1 = 0; -- schema only
-- What gets logged (queries, data access, schema and security
-- changes) is defined by the audit log schema itself; system
-- tables are enabled per account, not via metastore properties
-- 2. Real-time monitoring dashboard
CREATE OR REPLACE VIEW main.monitoring.realtime_dashboard AS
WITH system_metrics AS (
SELECT
date_trunc('minute', start_time) as timestamp,
-- Query metrics
COUNT(DISTINCT query_id) as active_queries,
SUM(execution_time) / 1000 as total_execution_seconds,
-- User metrics
COUNT(DISTINCT user_name) as active_users,
-- Error metrics
SUM(CASE WHEN error_message IS NOT NULL THEN 1 ELSE 0 END) as errors,
-- Performance metrics
AVG(execution_time) as avg_query_time,
PERCENTILE(execution_time, 0.95) as p95_query_time
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '1 hour'
GROUP BY 1
),
storage_metrics AS (
SELECT
current_timestamp() as timestamp,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as total_storage_gb,
COUNT(DISTINCT table_name) as total_tables,
SUM(num_files) as total_files
FROM system.information_schema.table_storage
),
access_metrics AS (
SELECT
date_trunc('minute', event_time) as timestamp,
COUNT(*) as total_accesses,
COUNT(DISTINCT user_name) as unique_users,
COUNT(DISTINCT table_name) as tables_accessed
FROM system.access.audit
WHERE event_time > current_timestamp() - INTERVAL '1 hour'
GROUP BY 1
)
SELECT
COALESCE(sm.timestamp, am.timestamp) as timestamp,
sm.active_queries,
sm.total_execution_seconds,
sm.active_users,
sm.errors,
sm.avg_query_time,
sm.p95_query_time,
stm.total_storage_gb,
stm.total_tables,
stm.total_files,
am.total_accesses,
am.unique_users,
am.tables_accessed,
-- Health score (0-100)
CASE
WHEN sm.errors > 10 THEN 50
WHEN sm.p95_query_time > 30000 THEN 70
ELSE 95
END as system_health_score
FROM system_metrics sm
FULL OUTER JOIN storage_metrics stm
ON sm.timestamp = stm.timestamp
FULL OUTER JOIN access_metrics am
ON sm.timestamp = am.timestamp;
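The health score in this dashboard is a simple tiered rubric; mirroring it in Python makes the thresholds testable before they drive alerts:

```python
# Mirror of the system_health_score CASE in the dashboard view.
def system_health_score(errors: int, p95_query_time_ms: float) -> int:
    if errors > 10:
        return 50   # error burst dominates everything else
    if p95_query_time_ms > 30_000:
        return 70   # tail latency is degraded
    return 95

print(system_health_score(0, 5_000))   # 95
print(system_health_score(3, 45_000))  # 70
print(system_health_score(20, 5_000))  # 50
```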
-- 3. Alerting (CREATE PROCEDURE requires a recent Databricks
-- runtime; system.alert() is a placeholder for your own webhook,
-- Slack, or e-mail hook)
CREATE OR REPLACE PROCEDURE check_and_alert()
LANGUAGE SQL
AS BEGIN
DECLARE error_count INT;
DECLARE slow_query_count INT;
DECLARE unauthorized_access_count INT;
-- Check for errors
SELECT COUNT(*) INTO error_count
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '5 minutes'
AND error_message IS NOT NULL;
-- Check for slow queries
SELECT COUNT(*) INTO slow_query_count
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL '5 minutes'
AND execution_time > 60000; -- > 60 seconds
-- Check for denied access attempts
SELECT COUNT(*) INTO unauthorized_access_count
FROM system.access.audit
WHERE event_time > current_timestamp() - INTERVAL '5 minutes'
AND response.status_code = 403;
-- Send alerts if thresholds are exceeded
IF error_count > 5 THEN
CALL system.alert('HIGH_ERROR_RATE',
'High error rate detected: ' || error_count || ' errors in last 5 minutes');
END IF;
IF slow_query_count > 10 THEN
CALL system.alert('SLOW_QUERIES',
'Multiple slow queries detected: ' || slow_query_count);
END IF;
IF unauthorized_access_count > 0 THEN
CALL system.alert('UNAUTHORIZED_ACCESS',
'Unauthorized access attempts: ' || unauthorized_access_count);
END IF;
END;
-- 4. Scheduling: there is no CREATE JOB SQL statement; run the
-- statements below as a Databricks Workflows (Jobs) SQL task on a
-- 5-minute schedule
CALL check_and_alert();
-- Update monitoring tables
INSERT INTO main.monitoring.hourly_metrics
SELECT * FROM main.monitoring.realtime_dashboard
WHERE timestamp > current_timestamp() - INTERVAL '1 hour';
-- 5. GDPR compliance monitoring
CREATE OR REPLACE VIEW main.compliance.access_violations AS
SELECT
event_time,
user_name,
service_name,
action_name,
table_name,
query_text,
ip_address,
user_agent,
CASE
WHEN action_name IN ('SELECT', 'SHOW', 'DESCRIBE')
THEN 'DATA_ACCESS'
WHEN action_name IN ('INSERT', 'UPDATE', 'DELETE', 'MERGE')
THEN 'DATA_MODIFICATION'
WHEN action_name IN ('CREATE', 'ALTER', 'DROP')
THEN 'SCHEMA_CHANGE'
ELSE 'OTHER'
END as event_category
-- simplified: system.access.audit stores the accessed object in
-- request_params; flat table_name / query_text columns are assumed
FROM system.access.audit
WHERE response.status_code = 403 -- denied attempts
OR table_name IN (
SELECT table_name
FROM system.information_schema.table_tags
WHERE tag_name = 'data_sensitivity'
AND tag_value IN ('HIGH', 'SENSITIVE_PII')
)
ORDER BY event_time DESC;
Cost Management for SMBs
Cost Breakdown
-- SQL: cost monitoring and optimization
-- 1. Cost tracking per team/department
CREATE OR REPLACE VIEW main.finance.cost_analysis AS
WITH compute_costs AS (
SELECT
-- simplified: in system.billing.usage, DBUs are recorded in the
-- usage_quantity column and the user in identity_metadata; the
-- column names below are a readable approximation
user_name,
-- Extract team from email/username
SPLIT(user_name, '@')[0] as team,
SUM(DBUs_used) as total_dbus,
SUM(DBUs_used * 0.65) as compute_cost, -- €0.65/DBU (assumed rate)
COUNT(DISTINCT query_id) as query_count,
AVG(execution_time) / 1000 as avg_query_seconds
FROM system.billing.usage
WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE())
GROUP BY 1
),
storage_costs AS (
SELECT
table_catalog,
table_schema,
-- Map schema to team
CASE
WHEN table_schema LIKE '%sales%' THEN 'sales'
WHEN table_schema LIKE '%marketing%' THEN 'marketing'
WHEN table_schema LIKE '%finance%' THEN 'finance'
ELSE 'other'
END as team,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as storage_gb,
SUM(size_in_bytes) / 1024 / 1024 / 1024 * 0.023 as storage_cost -- €0.023/GB/month (assumed rate)
FROM system.information_schema.table_storage
GROUP BY 1, 2
),
unity_catalog_costs AS (
SELECT
'unity_catalog' as cost_type,
COUNT(DISTINCT table_name) * 0.20 as catalog_cost -- €0.20/table/month (assumed rate)
FROM system.information_schema.tables
)
SELECT
COALESCE(cc.team, sc.team) as team,
SUM(COALESCE(cc.compute_cost, 0)) as compute_cost,
SUM(COALESCE(sc.storage_cost, 0)) as storage_cost,
SUM(COALESCE(cc.compute_cost, 0) + COALESCE(sc.storage_cost, 0)) as total_cost,
-- Cost efficiency metrics
SUM(COALESCE(cc.query_count, 0)) /
NULLIF(SUM(COALESCE(cc.compute_cost, 0)), 0) as queries_per_euro,
SUM(COALESCE(sc.storage_gb, 0)) /
NULLIF(SUM(COALESCE(cc.query_count, 0)), 0) as gb_per_query
FROM compute_costs cc
FULL OUTER JOIN storage_costs sc
ON cc.team = sc.team
GROUP BY 1
ORDER BY total_cost DESC;
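The cost model behind this view boils down to two rates. A sketch with the same assumed prices (€0.65 per DBU, €0.023 per GB per month; check your actual contract rates):

```python
# Sketch of the per-team cost model used in the view above;
# both rates are assumptions, not official Databricks pricing.
DBU_RATE_EUR = 0.65
STORAGE_RATE_EUR_PER_GB_MONTH = 0.023

def monthly_cost(dbus: float, storage_gb: float) -> float:
    return dbus * DBU_RATE_EUR + storage_gb * STORAGE_RATE_EUR_PER_GB_MONTH

# 1000 DBUs and 500 GB of storage in a month
print(round(monthly_cost(1000, 500), 2))  # 661.5
```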
-- 2. Cost optimization recommendations
CREATE OR REPLACE VIEW main.finance.cost_optimization AS
SELECT
'STORAGE' as category,
table_name,
size_gb,
CASE
WHEN size_gb > 1000 THEN 'CONSIDER ARCHIVING'
WHEN last_access_days > 90 THEN 'CONSIDER DELETION'
WHEN file_count > 10000 THEN 'CONSIDER OPTIMIZE'
ELSE 'OPTIMAL'
END as recommendation,
estimated_savings
FROM (
SELECT
table_name,
SUM(size_in_bytes) / 1024 / 1024 / 1024 as size_gb,
COUNT(*) as file_count,
DATEDIFF(day, MAX(last_access), CURRENT_DATE()) as last_access_days,
CASE
WHEN DATEDIFF(day, MAX(last_access), CURRENT_DATE()) > 365
THEN SUM(size_in_bytes) / 1024 / 1024 / 1024 * 0.023 * 12
ELSE 0
END as estimated_savings
FROM system.information_schema.table_storage
GROUP BY 1
)
UNION ALL
SELECT
'COMPUTE' as category,
user_name as resource,
total_dbus,
CASE
WHEN avg_query_time > 30000 THEN 'OPTIMIZE QUERIES'
WHEN query_count < 10 THEN 'REDUCE ENVIRONMENT SIZE'
ELSE 'OPTIMAL'
END as recommendation,
CASE
WHEN avg_query_time > 30000 THEN total_dbus * 0.10
ELSE 0
END as estimated_savings
FROM (
SELECT
user_name,
SUM(DBUs_used) as total_dbus,
AVG(execution_time) as avg_query_time,
COUNT(*) as query_count
FROM system.billing.usage
GROUP BY 1
);
-- 3. Auto-scale recommendations
CREATE OR REPLACE PROCEDURE optimize_cluster_sizes()
LANGUAGE SQL
AS BEGIN
DECLARE avg_utilization DECIMAL(5, 2);
DECLARE peak_utilization DECIMAL(5, 2);
-- Calculate utilization (assumed helper table; in practice derive
-- this from system.compute.node_timeline)
SELECT
AVG(active_workers / max_workers),
MAX(active_workers / max_workers)
INTO avg_utilization, peak_utilization
FROM main.monitoring.cluster_usage
WHERE timestamp > CURRENT_TIMESTAMP() - INTERVAL '7 days';
-- Make recommendations (system.alert() is a placeholder hook)
IF avg_utilization < 0.3 THEN
CALL system.alert('UNDERUTILIZED_CLUSTERS',
'Average cluster utilization only ' || ROUND(avg_utilization * 100, 1) || '%. Consider downsizing.');
ELSEIF peak_utilization > 0.9 THEN
CALL system.alert('OVERUTILIZED_CLUSTERS',
'Peak cluster utilization ' || ROUND(peak_utilization * 100, 1) || '%. Consider upsizing.');
END IF;
END;
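The sizing thresholds above (30% average, 90% peak) are the whole decision rule, so they deserve a standalone test:

```python
# Mirror of the sizing thresholds in the procedure above.
def sizing_recommendation(avg_util: float, peak_util: float) -> str:
    if avg_util < 0.3:
        return "DOWNSIZE"   # paying for idle capacity
    if peak_util > 0.9:
        return "UPSIZE"     # peaks saturate the cluster
    return "OK"

print(sizing_recommendation(0.2, 0.5))   # DOWNSIZE
print(sizing_recommendation(0.6, 0.95))  # UPSIZE
print(sizing_recommendation(0.5, 0.8))   # OK
```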
Best Practices for SMB Implementation
SMB Unity Catalog Checklist
Phase 1: Foundation (Weeks 1-2)
- ✅ Metastore set up on central cloud storage
- ✅ Catalog structure (main, analytics, sandbox)
- ✅ Schema structure (bronze/silver/gold)
- ✅ Basic RBAC (admin, engineer, analyst roles)
- ✅ Audit logging enabled
- Cost: €500-€1,000 setup
Phase 2: Governance (Weeks 3-4)
- ✅ PII detection enabled
- ✅ Auto-tagging of sensitive data
- ✅ Column masking policies
- ✅ Lineage tracking enabled
- ✅ Data quality monitoring
- Cost: €200-€500/month (UC Premium)
Phase 3: Optimization (Weeks 5-8)
- ✅ Query performance monitoring
- ✅ Storage optimization (Z-order, compaction)
- ✅ Cost monitoring and alerts
- ✅ Automated maintenance jobs
- ✅ Team training and documentation
- Cost: €300-€800/month (depending on usage)
Common SMB Mistakes
- Starting too complex: begin with 1 catalog, not 10
- No naming conventions: use clear, consistent names
- No cost monitoring: track costs from day one
- Skipping access control: RBAC is not optional
- No backup strategy: back up both metadata and data
- Not measuring: put monitoring in place before problems appear
Conclusion: Unity Catalog as the SMB Data Foundation
ROI Analysis for SMBs
Cost Breakdown (Annual)
- Unity Catalog Premium: €2,400-€6,000
- Compute (DBUs): €3,000-€12,000
- Storage: €600-€2,400
- Total: €6,000-€20,400/year
Savings (Annual)
- Less manual work: €5,000-€15,000
- GDPR compliance automation: €10,000-€25,000
- Better data quality: €3,000-€8,000
- Data breach prevention: €20,000+ (risk)
- Total: €38,000-€68,000/year
ROI Summary
- Investment: €6,000-€20,400
- Savings: €38,000-€68,000
- Net benefit: €32,000-€47,600
- Payback period: 3-6 months
- ROI percentage: 200-400%
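The net-benefit range follows directly from the cost and savings ranges, which a two-line check confirms:

```python
# Sanity check of the net-benefit figures in the ROI summary:
# net benefit = annual savings - annual investment, per endpoint.
investment = (6_000, 20_400)
savings = (38_000, 68_000)
net = (savings[0] - investment[0], savings[1] - investment[1])
print(net)  # (32000, 47600)
```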
Next Steps for You
Step 1: Free Unity Catalog Assessment
Use our free assessment tool to analyze your current state and receive a personalized implementation plan.
Step 2: Proof of Concept
Start with a 30-day PoC on Databricks with Unity Catalog (free credits available).
Step 3: Phased Implementation
Follow the 8-week roadmap from this blog and measure ROI after each phase.
"Unity Catalog transforms data governance from expensive enterprise overhead into a worthwhile SMB investment that pays for itself within months. The question is no longer 'can we afford it?' but 'can we afford not to?'"