What Is Apache Spark?

Last updated: June 25, 2025

Reading time: 12 minutes

Apache Spark, Big Data, Distributed Computing, Data Engineering, ETL

A complete guide to Apache Spark: the distributed computing framework that revolutionized big data processing, machine learning, and real-time analytics.

Definition

Apache Spark is an open-source distributed computing framework for large-scale data processing. It provides programming interfaces for entire clusters, with built-in fault tolerance and data parallelism. Spark can run up to 100x faster than Hadoop MapReduce for certain workloads because it keeps intermediate data in memory rather than writing it to disk between stages.

Why Is Apache Spark Important?

Apache Spark has transformed big data processing through:

Speed: In-memory computing makes iterative processing efficient
Simplicity: High-level APIs in Java, Scala, Python, and R
Generality: One framework for batch, streaming, SQL, and ML
Scalability: From a few servers to thousands of nodes
Fault tolerance: Automatic recovery from node failures

Key Insight

Spark's strength lies in its Resilient Distributed Datasets (RDDs) and its Directed Acyclic Graph (DAG) execution engine, which together enable efficient data processing without the complexity of traditional MapReduce.
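To make the lazy, DAG-based execution model concrete, here is a minimal sketch. It assumes Spark 3.x with Scala 2.12 or 2.13 on the classpath and a local master; the names `numbers`, `evens`, and `squares` are illustrative and not part of the original text. It chains two transformations, prints the lineage Spark has recorded, and only then runs an action.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration; in spark-shell, getOrCreate() reuses the existing session
val spark = SparkSession.builder()
  .appName("lazy-dag-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Transformations only record lineage; no job runs yet
val numbers = sc.parallelize(1 to 10)
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Print the recorded lineage, which the scheduler turns into stages
println(squares.toDebugString)

// The action triggers execution of the whole chain
val total = squares.reduce(_ + _)
```

Because nothing executes before `reduce`, Spark can plan the filter and the map together as a single stage instead of materializing each intermediate result.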

Spark Architecture Components

1. Spark Core

The foundations of the Spark framework:

RDD (Resilient Distributed Dataset): The basic data abstraction
Task Scheduling: Distributed task execution
Memory Management: In-memory computing optimization
Fault Recovery: Automatic recovery from failures
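The memory management and fault recovery points above can be illustrated with a small caching sketch. It assumes Spark 3.x in local mode; the data and the names `lengths`, `total`, and `longest` are made up for illustration. An RDD is persisted in memory, reused by two actions, and then released.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("core-cache-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val lengths = sc.parallelize(Seq("spark", "core", "rdd")).map(_.length)

// Keep computed partitions in memory (the same storage level that cache() uses for RDDs)
lengths.persist(StorageLevel.MEMORY_ONLY)
val level = lengths.getStorageLevel

// The first action computes the partitions and caches them
val total = lengths.reduce(_ + _)
// The second action can read the cached partitions instead of recomputing
val longest = lengths.max()

// Release the cached partitions when they are no longer needed
lengths.unpersist()
```

If a cached partition is lost, for example because an executor fails, Spark recomputes it from the RDD's lineage rather than relying on replicated copies.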

2. Spark SQL

Structured data processing with SQL and DataFrames:

Component	Description	Usage
DataFrames	Distributed collection of data organized into named columns	Structured data manipulation
Datasets	Type-safe DataFrames	Object-oriented programming
Spark SQL	SQL query engine	ANSI SQL compliant queries
Catalyst Optimizer	Query optimization engine	Automatic performance tuning
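The components in the table can be compared side by side in a short sketch. It assumes Spark 3.x with Scala 2.12 or 2.13 in local mode; the `Person` case class and its sample rows are hypothetical. The same filter is expressed through the typed Dataset API, the untyped DataFrame API, and SQL, and `explain(true)` prints the plans Catalyst produces.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type for illustration
case class Person(name: String, age: Long)

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("spark-sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._  // encoders, toDS(), and the $"column" syntax

val people: Dataset[Person] = Seq(Person("Ada", 36L), Person("Linus", 19L)).toDS()

// Typed Dataset API: the lambda is checked at compile time
val adults: Dataset[Person] = people.filter(p => p.age > 21)

// Untyped DataFrame API: column names are resolved when the query is analyzed
val adultNames: DataFrame = people.toDF().filter($"age" > 21).select("name")

// The same query through the SQL engine
people.createOrReplaceTempView("people")
val adultNamesSql: DataFrame = spark.sql("SELECT name FROM people WHERE age > 21")

// Print the logical and physical plans produced by Catalyst
adultNamesSql.explain(true)
```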

3. Spark Streaming

Real-time data processing:

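// Spark Streaming example: real-time word count over 1-second micro-batches
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext named sparkContext (in spark-shell this is sc)
val streamingContext = new StreamingContext(sparkContext, Seconds(1))
// Read lines from a TCP socket on localhost:9999
val lines = streamingContext.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
// Count each word within the current batch
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()
// Nothing runs until start(); awaitTermination() blocks until the stream stops
streamingContext.start()
streamingContext.awaitTermination()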

4. MLlib (Machine Learning)

Scalable machine learning library:

Algorithms: Classification, regression, clustering
Feature Engineering: TF-IDF, Word2Vec, PCA
Pipelines: End-to-end ML workflows
Model Selection: Cross-validation, hyperparameter tuning
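As a sketch of how an MLlib pipeline ties feature engineering and an algorithm together, the example below chains a tokenizer, a hashing term-frequency transformer, and logistic regression. It assumes Spark 3.x with the spark-mllib module in local mode; the four training rows are toy data invented for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("mllib-pipeline-sketch")
  .master("local[*]")
  .getOrCreate()

// Toy training data: label 1.0 marks texts that mention spark
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Feature engineering stages
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Classification stage
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// The pipeline fits all stages in order and returns one reusable model
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

val predictions = model.transform(training).select("id", "text", "prediction")
```

The fitted `model` applies the same tokenization and hashing at prediction time, which avoids training/serving mismatches in the feature steps.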

5. GraphX

Graph processing and graph-parallel computation:

Graph Algorithms: PageRank, Connected Components
Graph Builders: Construct graphs from RDDs
Graph Operators: Subgraph, mapVertices, joinVertices
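The builders and algorithms listed above can be sketched on a tiny graph. The example assumes Spark 3.x with the spark-graphx module in local mode; the three users and their "follows" edges are hypothetical. It builds a graph from two RDDs and runs PageRank and Connected Components.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("graphx-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Vertices are (id, attribute) pairs; edges carry a source id, a destination id, and an attribute
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

// Graph builder: construct the graph from the two RDDs
val graph = Graph(vertices, edges)

// PageRank, iterating until the ranks change by less than the tolerance
val ranks = graph.pageRank(0.0001).vertices

// Connected Components labels each vertex with the smallest vertex id in its component
val components = graph.connectedComponents().vertices
```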

Spark Cluster Architecture

Cluster Managers

Spark supports several cluster managers:

Cluster Components
Driver Program: Coordinates job execution
Cluster Manager: Allocates resources (Standalone, YARN, Mesos, Kubernetes)
Worker Nodes: Execute tasks and store data
Executors: Run tasks on worker nodes

Cluster Manager	Description	Use Case
Standalone	Simple cluster manager included with Spark	Development and testing
Apache YARN	Hadoop's resource manager	Hadoop ecosystem integration
Apache Mesos	General-purpose cluster manager (deprecated since Spark 3.2)	Multi-tenant clusters
Kubernetes	Container orchestration platform	Cloud-native deployments
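In application code, the choice of cluster manager mostly shows up as the master URL. The sketch below assumes Spark 3.x; the host names and ports in the comments are placeholders, not real endpoints. It runs locally with two threads so it can be tried without a cluster. If a session already exists, as in spark-shell, `getOrCreate()` returns that session and its master setting stays unchanged.

```scala
import org.apache.spark.sql.SparkSession

// Master URL formats per cluster manager (hosts and ports are placeholders):
//   local[*]                      in-process, using all available cores
//   spark://master-host:7077      Standalone
//   yarn                          YARN (the cluster is located via the Hadoop configuration)
//   k8s://https://api-host:6443   Kubernetes
val spark = SparkSession.builder()
  .appName("cluster-manager-sketch")
  .master("local[2]")  // swap for one of the URLs above, or omit and pass --master to spark-submit
  .getOrCreate()

// Inspect what the driver actually connected to
val master = spark.sparkContext.master
val parallelism = spark.sparkContext.defaultParallelism
```

Leaving the master out of the code and supplying it through `spark-submit --master` keeps the same application portable across the cluster managers in the table.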

Practical Spark Programming

RDD Operations

// RDD transformations and actions (sc is the SparkContext, predefined in spark-shell)
val textFile = sc.textFile("hdfs://...")
// Transformations are lazy: they only describe the computation
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

// Actions trigger execution
counts.saveAsTextFile("hdfs://...")

DataFrame API

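// DataFrame operations: the modern Spark approach
import org.apache.spark.sql.functions.{avg, count}
import spark.implicits._  // enables the $"column" syntax

val df = spark.read.json("examples/src/main/resources/people.json")

// Transformations
// Note: the aggregation assumes "department" and "salary" columns, which the
// people.json bundled with Spark (name and age only) does not contain
val filtered = df.filter($"age" > 21)
                .groupBy("department")
                .agg(avg("salary"), count("*"))

// SQL queries
df.createOrReplaceTempView("people")
val results = spark.sql("SELECT * FROM people WHERE age > 21")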

Spark Session

// Spark session initialization
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("My Spark Application")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Read different data formats
val csvDF = spark.read.option("header", "true").csv("file.csv")
val jsonDF = spark.read.json("file.json")
val parquetDF = spark.read.parquet("file.parquet")
      

Spark in Cloud Platforms

Platform	Service	Benefits
Databricks	Databricks Runtime	Optimized Spark, Delta Lake, MLflow
Azure	Azure Databricks, HDInsight	Azure ecosystem integration
AWS	EMR, Glue	Managed Spark clusters
Google Cloud	Dataproc	Serverless Spark jobs

Common Pitfalls
Data skew in transformations (unevenly distributed data)
Reading or writing too many small files
Insufficient memory configuration (OOM errors)
Forgetting to broadcast the small side of a join
No partitioning for large datasets
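For the first pitfall, data skew, one common mitigation is key salting. The sketch below assumes Spark 3.x in local mode; the tables, the `saltBuckets` value, and all column names are hypothetical. The large side gets a random salt, the small side is replicated once per salt value, and the join runs on the combined key so that a single hot key is spread over several partitions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, explode, floor, lit, rand}

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("salting-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val saltBuckets = 4  // hypothetical value; tune to the degree of skew

// Skewed fact table: the key "hot" dominates
val facts = (Seq.fill(1000)("hot") ++ Seq("cold", "warm")).toDF("key")
val dims = Seq(("hot", "H"), ("cold", "C"), ("warm", "W")).toDF("key", "label")

// Large side: attach a random salt in [0, saltBuckets)
val saltedFacts = facts.withColumn("salt", floor(rand() * saltBuckets).cast("int"))

// Small side: replicate every row once per possible salt value
val saltedDims = dims.withColumn(
  "salt",
  explode(array((0 until saltBuckets).map(i => lit(i)): _*))
)

// Join on (key, salt), then drop the helper column
val joined = saltedFacts.join(saltedDims, Seq("key", "salt")).drop("salt")
```

The result contains the same rows as a plain join on `key`; the salt only changes how rows are distributed during the shuffle. On Spark 3.x, Adaptive Query Execution's skew join handling may make manual salting unnecessary for many joins.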

      

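Performance Optimization

Configuration Best Practices

// Example tuning configuration; suitable values depend on the workload
spark.conf.set("spark.sql.adaptive.enabled", "true")  // Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")  // split skewed join partitions
// Broadcast threshold in bytes (about 95.8 MiB here; exactly 100 MiB would be 104857600)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100485760")
spark.conf.set("spark.sql.shuffle.partitions", "200")  // 200 is also the default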

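Data Partitioning

// Partitioning for better performance
import spark.implicits._  // enables the $"column" syntax

// On write: one directory per year/month combination
df.write.partitionBy("year", "month").parquet("/path/to/data")

// On read: filters on partition columns let Spark skip non-matching directories.
// Partition values are type-inferred on read (month=01 may become the integer 1)
// unless spark.sql.sources.partitionColumnTypeInference.enabled is false,
// so match the literal types to the inferred schema
val partitionedDF = spark.read.parquet("/path/to/data")
  .where($"year" === 2024 && $"month".isin("01", "02"))

// Repartition for a more balanced data distribution
val repartitionedDF = df.repartition(100, $"customer_id")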

Broadcast Joins

// Use broadcast joins when one side of the join is small
import org.apache.spark.sql.functions.broadcast

val largeDF = spark.table("large_table")
val smallDF = spark.table("small_table")

// Spark broadcasts automatically when the estimated table size is below
// spark.sql.autoBroadcastJoinThreshold
val autoJoinedDF = largeDF.join(smallDF, "id")

// Or request a broadcast explicitly with a hint
val broadcastJoinedDF = largeDF.join(broadcast(smallDF), Seq("id"), "inner")

Spark vs. Alternatives

Framework	Strengths	Weaknesses	Typical Use
Apache Spark	Fast, general-purpose, good APIs	Complex setup, memory intensive	ETL, ML, Streaming, Analytics
Hadoop MapReduce	Highly scalable, fault tolerant	Slow, complex programming	Batch processing only
Apache Flink	Very fast streaming, low latency	Less mature ecosystem	Real-time streaming
Apache Beam	Unified programming model	Performance overhead	Multi-platform deployments

New Features in Spark 3.0

Spark 3.0 Innovations
Adaptive Query Execution: Dynamic query optimization at runtime
Dynamic Partition Pruning: Smarter data filtering
ANSI SQL Compliance: Better support for the SQL standard
Pandas UDFs: Vectorized Python UDFs, redesigned in 3.0 around Python type hints, for better performance
Kubernetes Native: Improved K8s integration
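The ANSI SQL compliance item can be observed through the `spark.sql.ansi.enabled` flag. The sketch below assumes Spark 3.x in local mode and uses integer overflow as the example: with the flag off, the addition wraps around silently; with it on, the same query fails. `Try` is used so the sketch does not depend on the exact exception class.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: local mode for illustration
val spark = SparkSession.builder()
  .appName("ansi-mode-sketch")
  .master("local[*]")
  .getOrCreate()

// Non-ANSI behaviour (the default in Spark 3.x): integer overflow wraps around silently
spark.conf.set("spark.sql.ansi.enabled", "false")
val wrapped = spark.sql("SELECT 2147483647 + 1 AS v").first().getInt(0)

// ANSI mode: the same expression raises an arithmetic error instead of wrapping
spark.conf.set("spark.sql.ansi.enabled", "true")
val ansiResult = Try(spark.sql("SELECT 2147483647 + 1 AS v").first().getInt(0))

println(s"non-ANSI result: $wrapped, ANSI mode failed: ${ansiResult.isFailure}")
```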

      

Use Cases & Applications

Enterprise Use Cases

Industry	Use Case	Spark Component
Finance	Fraud detection, risk analysis	MLlib, Streaming
E-commerce	Recommendation engines, real-time analytics	MLlib, SQL, Streaming
Healthcare	Genome sequencing, patient analytics	MLlib, GraphX
Telecom	Network optimization, customer churn	Streaming, MLlib