Python Data Analysis: The Complete Guide to Pandas & NumPy

Last updated: 20 December 2025
Reading time: 25 minutes
Data Analysis, Python, Pandas, NumPy, Data Cleaning, Data Science

Learn professional data analysis in Python. A complete tutorial covering everything from data cleaning to advanced analytics with Pandas and NumPy.

1. Why Python for data analysis?

What is data analysis?

Data analysis is the process of inspecting, transforming and modeling data to discover useful information, draw conclusions and support decision-making. Python is the most popular language for data analysis thanks to its simple syntax and its powerful library ecosystem.

Rapid prototyping

Python's simple syntax lets you run analyses and build prototypes quickly.

Extensive libraries

Pandas, NumPy, SciPy and Scikit-learn cover virtually every analysis need.

Visualization integration

Seamless integration with Matplotlib, Seaborn and Plotly for visualization.

Active community

The largest data science community, with extensive documentation and support.

Tool             | Python + Pandas               | Excel             | SQL
Data size        | GBs to TBs (with chunks)      | Max ~1M rows      | TB+ (server-dependent)
Reproducibility  | Excellent (scripts)           | Limited (manual)  | Good (queries)
Automation       | Full automation               | Limited (VBA)     | Good (stored procedures)
Complex analyses | Machine learning, statistics  | Basic formulas    | Aggregations only
Learning curve   | Medium                        | Easy              | Medium
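
The "GBs to TBs" claim for Pandas rests on chunked processing: you stream the file in pieces instead of loading it all at once. A minimal sketch (the file name and column are hypothetical):

# Aggregate a CSV that does not fit in memory, 100,000 rows at a time
import pandas as pd

total = 0.0
for chunk in pd.read_csv('sales_2023.csv', chunksize=100_000):
    total += chunk['total_price'].sum()  # reduce per chunk, combine partial results
print(f"Total revenue: {total:,.2f}")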

2. Setup and the Python environment

  • Python 3.9+ — the core programming language
  • Pandas — data manipulation
  • NumPy — numerical computing
  • Matplotlib — data visualization
  • Seaborn — statistical plots
  • SciPy — scientific computing

Installation and setup

# Option 1: Anaconda (recommended for beginners)
# Download from: https://www.anaconda.com/products/distribution
# Ships with all major data science libraries pre-installed

# Option 2: pip installation
# Core data science stack (statsmodels is needed for the time series section)
pip install pandas numpy matplotlib seaborn scipy scikit-learn statsmodels jupyter

# Option 3: requirements.txt
# Create a requirements.txt file:
pandas==2.0.3
numpy==1.24.3
matplotlib==3.7.1
seaborn==0.12.2
scipy==1.10.1
statsmodels==0.14.0
jupyter==1.0.0
scikit-learn==1.3.0

# Install all packages:
pip install -r requirements.txt

# Import statements for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings for more readable output
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.max_rows', 100)       # Show at most 100 rows
pd.set_option('display.float_format', '{:.2f}'.format)  # 2 decimals for floats
sns.set_style("whitegrid")                   # Seaborn styling
plt.rcParams['figure.figsize'] = (12, 6)     # Default figure size

# Check versions
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Jupyter Notebook/Lab for data analysis

Jupyter is the standard environment for data analysis in Python:

  1. Start Jupyter: jupyter notebook or jupyter lab
  2. Create a new notebook: Python 3 kernel
  3. Cell types: code cells and Markdown cells
  4. Magic commands: %matplotlib inline to render plots in the notebook
  5. Keyboard shortcuts: Shift+Enter (run cell), Esc then A/B (insert cell above/below)

Tip: use !pip install package in a Jupyter cell to install packages, as in the sketch below.
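
A few cells worth trying in a fresh notebook (the %timeit result will vary per machine):

# Render Matplotlib plots inline in the notebook
%matplotlib inline

# Time a single expression
%timeit sum(range(1_000_000))

# Install a package without leaving the notebook
!pip install seaborn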

3. Pandas: Data manipulation basics

Pandas data structures

  • Series: a one-dimensional labeled array (like a column in Excel)
  • DataFrame: a two-dimensional labeled data structure (like a spreadsheet)
  • Index: unique labels for the rows (can be integers, strings or datetimes)
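
To make the Series and Index concepts concrete before moving on to DataFrames, a minimal sketch:

# A Series is a labeled 1D array: values plus an Index
import pandas as pd

s = pd.Series([45000, 52000, 48000], index=['Jan', 'Piet', 'Klaas'], name='Salaris')
print(s['Piet'])   # label-based access -> 52000
print(s.mean())    # vectorized aggregation -> 48333.33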

Creating DataFrames and basic operations

# 1. Creating DataFrames from different sources

# From a dictionary
data = {
    'Naam': ['Jan', 'Piet', 'Klaas', 'Marie', 'Anna'],
    'Leeftijd': [25, 30, 35, 28, 32],
    'Stad': ['Amsterdam', 'Rotterdam', 'Den Haag', 'Utrecht', 'Amsterdam'],
    'Salaris': [45000, 52000, 48000, 51000, 47000],
    'Afdeling': ['Sales', 'IT', 'HR', 'Sales', 'Marketing']
}

df = pd.DataFrame(data)
print("DataFrame shape:", df.shape)  # (5, 5)
print("DataFrame info:")
print(df.info())
print("\nFirst 3 rows:")
print(df.head(3))

# From external sources
# df = pd.read_csv('data.csv')
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# df = pd.read_json('data.json')
# df = pd.read_sql('SELECT * FROM table', connection)

# 2. Basic DataFrame operations
print("\n=== BASIC OPERATIONS ===")
print("Column names:", df.columns.tolist())
print("Index:", df.index.tolist())
print("Data types:")
print(df.dtypes)
print("\nDescriptive statistics:")
print(df.describe())
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

# 3. Data selection and filtering
print("\n=== DATA SELECTION ===")
# Select a single column
salaris_series = df['Salaris']
print("Salaris column:", salaris_series.tolist())

# Select multiple columns
subset = df[['Naam', 'Salaris', 'Afdeling']]
print("\nSubset with 3 columns:")
print(subset)

# Filter rows based on a condition
sales_team = df[df['Afdeling'] == 'Sales']
print("\nSales team:")
print(sales_team)

# Complex filtering
high_earners = df[(df['Salaris'] > 50000) & (df['Leeftijd'] < 35)]
print("\nHigh earners under 35:")
print(high_earners)

# 4. Sorting
print("\n=== SORTING ===")
df_sorted = df.sort_values(by='Salaris', ascending=False)
print("Sorted by salary (high to low):")
print(df_sorted)

# 5. Adding new columns
df['Salaris_Categorie'] = pd.cut(df['Salaris'],
                                 bins=[40000, 47000, 50000, 55000],
                                 labels=['Laag', 'Medium', 'Hoog'])
df['Leeftijd_Volgend_Jaar'] = df['Leeftijd'] + 1

print("\nDataFrame with new columns:")
print(df)

Advanced DataFrame operations

# 1. GroupBy operations
print("=== GROUPBY OPERATIONS ===")

# Simple groupby
afdeling_stats = df.groupby('Afdeling').agg({
    'Salaris': ['mean', 'sum', 'count', 'min', 'max'],
    'Leeftijd': 'mean'
})

print("Department statistics:")
print(afdeling_stats)

# Multi-level groupby
stad_afdeling_stats = df.groupby(['Stad', 'Afdeling']).agg({
    'Salaris': 'mean',
    'Leeftijd': 'mean',
    'Naam': 'count'  # Number of employees
}).rename(columns={'Naam': 'Aantal_Werknemers'})

print("\nStatistics per city and department:")
print(stad_afdeling_stats)

# 2. Pivot tables
print("\n=== PIVOT TABLES ===")
pivot_table = pd.pivot_table(df,
                             values='Salaris',
                             index='Stad',
                             columns='Afdeling',
                             aggfunc=['mean', 'count'],
                             fill_value=0)

print("Pivot table:")
print(pivot_table)

# 3. Merge/join operations
print("\n=== MERGE/JOIN OPERATIONS ===")

# Create a second DataFrame
afdeling_data = pd.DataFrame({
    'Afdeling': ['Sales', 'IT', 'HR', 'Marketing', 'Finance'],
    'Manager': ['Peter', 'Lisa', 'Mike', 'Sarah', 'David'],
    'Budget': [100000, 150000, 80000, 120000, 90000]
})

# Inner join (default)
merged_df = pd.merge(df, afdeling_data, on='Afdeling', how='inner')
print("Inner join result:")
print(merged_df)

# Left join
left_merged = pd.merge(df, afdeling_data, on='Afdeling', how='left')
print("\nLeft join result:")
print(left_merged)

# 4. Apply functions
print("\n=== APPLY FUNCTIONS ===")

# Apply on a single column
df['Salaris_Bonus'] = df['Salaris'].apply(lambda x: x * 1.1 if x > 50000 else x * 1.05)

# Row-wise apply with a custom function
def categorize_employee(row):
    if row['Leeftijd'] < 30 and row['Salaris'] > 45000:
        return 'High Potential'
    elif row['Leeftijd'] >= 40:
        return 'Experienced'
    else:
        return 'Regular'

df['Categorie'] = df.apply(categorize_employee, axis=1)

print("DataFrame after apply:")
print(df[['Naam', 'Leeftijd', 'Salaris', 'Salaris_Bonus', 'Categorie']])

# 5. String operations
print("\n=== STRING OPERATIONS ===")
df['Naam_Upper'] = df['Naam'].str.upper()
df['Naam_Lower'] = df['Naam'].str.lower()
df['Naam_Length'] = df['Naam'].str.len()

# Substring check
df['Stad_Contains_A'] = df['Stad'].str.contains('a', case=False)

print("String operations:")
print(df[['Naam', 'Naam_Upper', 'Naam_Lower', 'Naam_Length', 'Stad', 'Stad_Contains_A']])

4. Data cleaning and preprocessing

Common data quality problems

  • Missing values: NaN, None, empty strings
  • Inconsistent formatting: date/time formats, capitalization
  • Outliers: extreme values that distort analyses
  • Duplicates: repeated rows in the dataset
  • Incorrect data types: numbers stored as strings, dates as text
  • Inconsistent categories: "M", "Male", "Man" for the same value
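
Before cleaning, it helps to quantify each of these problems first. A minimal diagnostic sketch, assuming a DataFrame df is already loaded:

# Quick data-quality scan
print(df.isnull().sum())               # missing values per column
print(df.duplicated().sum())           # number of fully duplicated rows
print(df.dtypes)                       # spot numbers stored as object/string
for col in df.select_dtypes(include=['object']).columns:
    print(col, df[col].unique()[:10])  # inspect category spellings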

A complete data cleaning process

# 1. Sample dataset with deliberate problems
np.random.seed(42)
n_rows = 20

dirty_data = pd.DataFrame({
    'ID': range(1, n_rows + 1),
    'Naam': ['Jan', 'Piet', np.nan, 'Marie', 'Anna', 'Jan', 'peter', 'LISA', 'Tom', 'Sarah',
               'Mike', 'Eva', 'David', 'laura', 'JOHN', 'Sara', 'Mark', 'Emma', 'Paul', 'Lisa'],
    'Leeftijd': [25, 30, 35, '28', 32, 25, 40, 22, 45, 29,
                   33, 27, 38, 31, 26, 34, 150, 24, 36, 28],  # 150 is an outlier
    'Salaris': [45000, 52000, None, 51000, 47000, 45000, 60000, 42000, 75000, 48000,
                  55000, 46000, 68000, 49000, 43000, 54000, 1000000, 41000, 72000, 53000],  # 1,000,000 is an outlier
    'Datum_Aanmelding': ['2023-01-15', '15-02-2023', '2023/03/10', '01-04-2023', '2023-05-20',
                           '2023-01-15', '2023-06-15', '2023-07-01', '01-08-2023', '2023-09-10',
                           '10-10-2023', '2023-11-05', '2023/12/15', '2024-01-10', '15-01-2024',
                           '2024-02-20', '2024/03/05', '05-04-2024', '2024-05-15', '2024-06-01'],
    'Geslacht': ['M', 'M', 'V', 'V', 'V', 'M', 'm', 'F', 'M', 'V',
                   'M', 'v', 'M', 'female', 'M', 'V', 'Male', 'V', 'M', 'f'],
    'Afdeling': ['Sales', 'IT', None, 'Sales', 'Marketing', 'Sales', 'IT', 'HR', 'IT', 'Marketing',
                   'Finance', 'HR', 'IT', 'Marketing', 'Sales', 'HR', 'IT', 'Marketing', 'Finance', None]
})

print("=== ORIGINAL DIRTY DATA ===")
print(dirty_data)
print("\nData info:")
print(dirty_data.info())
print("\nMissing values:")
print(dirty_data.isnull().sum())

# 2. Data cleaning pipeline
def clean_dataframe(df):
    """Complete data cleaning pipeline"""
    df_clean = df.copy()
    
    # A. Remove exact duplicates
    df_clean = df_clean.drop_duplicates()
    print(f"Duplicates removed: {len(df) - len(df_clean)}")
    
    # B. Standardize text columns (.str operations skip NaN, so missing
    # names remain NaN for the imputation step below; astype(str) would
    # turn NaN into the literal string 'Nan')
    if 'Naam' in df_clean.columns:
        df_clean['Naam'] = df_clean['Naam'].str.title().str.strip()
    
    # C. Fix gender column
    gender_mapping = {
        'M': 'Man', 'm': 'Man', 'Male': 'Man',
        'V': 'Vrouw', 'v': 'Vrouw', 'F': 'Vrouw', 
        'f': 'Vrouw', 'female': 'Vrouw'
    }
    if 'Geslacht' in df_clean.columns:
        df_clean['Geslacht'] = df_clean['Geslacht'].replace(gender_mapping)
    
    # D. Convert data types
    if 'Leeftijd' in df_clean.columns:
        df_clean['Leeftijd'] = pd.to_numeric(df_clean['Leeftijd'], errors='coerce')
    
    # E. Handle missing values
    # Numeric columns: mean imputation (note: in production, handle outliers
    # before imputing, since extreme values skew the mean used for filling)
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df_clean[col].isnull().any():
            mean_val = df_clean[col].mean()
            df_clean[col] = df_clean[col].fillna(mean_val)
            print(f"Missing values in {col} filled with mean: {mean_val:.2f}")
    
    # Categorical columns: mode imputation
    categorical_cols = df_clean.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        if df_clean[col].isnull().any():
            mode_val = df_clean[col].mode()[0]
            df_clean[col] = df_clean[col].fillna(mode_val)
            print(f"Missing values in {col} filled with mode: {mode_val}")
    
    # F. Handle outliers
    if 'Leeftijd' in df_clean.columns:
        # Remove age > 100 (obvious outliers)
        outlier_mask = df_clean['Leeftijd'] > 100
        if outlier_mask.any():
            print(f"Removing {outlier_mask.sum()} age outliers")
            df_clean = df_clean[~outlier_mask]
    
    if 'Salaris' in df_clean.columns:
        # Cap salary outliers at 99th percentile
        salary_cap = df_clean['Salaris'].quantile(0.99)
        high_salary_mask = df_clean['Salaris'] > salary_cap
        if high_salary_mask.any():
            print(f"Capping {high_salary_mask.sum()} salary outliers at {salary_cap:.0f}")
            df_clean.loc[high_salary_mask, 'Salaris'] = salary_cap
    
    # G. Fix date formats
    if 'Datum_Aanmelding' in df_clean.columns:
        # Try multiple date formats
        date_formats = ['%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y']
        
        def parse_date(date_str):
            for fmt in date_formats:
                try:
                    return pd.to_datetime(date_str, format=fmt)
                except ValueError:
                    continue
            return pd.NaT
        
        df_clean['Datum_Aanmelding'] = df_clean['Datum_Aanmelding'].apply(parse_date)
        print("Dates standardized")
    
    return df_clean

# 3. Apply cleaning pipeline
cleaned_data = clean_dataframe(dirty_data)

print("\n\n=== CLEANED DATA ===")
print(cleaned_data)
print("\nCleaned data info:")
print(cleaned_data.info())
print("\nMissing values after cleaning:")
print(cleaned_data.isnull().sum())
print("\nData statistics:")
print(cleaned_data.describe())

5. NumPy: Numerical computing

Why use NumPy?

  • Speed: 10-100x faster than Python lists
  • Memory efficiency: homogeneous data types in contiguous buffers
  • Vectorized operations: operations on entire arrays at once
  • Mathematical functions: linear algebra, statistics, random numbers
  • Broadcasting: operations between arrays of different shapes
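
The memory claim is easy to verify: a Python list stores pointers to separate int objects, while a NumPy array stores the raw values in one contiguous block. A minimal sketch (sizes are approximate and platform-dependent):

# Compare memory footprints of a list and an array with the same values
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# sys.getsizeof(list) counts only the pointer array; each int object
# adds roughly another 28 bytes on top of that
print(f"List container (pointers only): {sys.getsizeof(py_list) / 1e6:.1f} MB")
print(f"NumPy array (raw values):       {np_arr.nbytes / 1e6:.1f} MB")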

NumPy basics and advanced operations

import numpy as np

# 1. Creating NumPy arrays
print("=== CREATING NUMPY ARRAYS ===")

# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)
print("Shape:", arr1.shape)
print("Data type:", arr1.dtype)

# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Array:")
print(arr2d)
print("Shape:", arr2d.shape)

# Special arrays
zeros = np.zeros((3, 4))          # 3x4 matrix of zeros
ones = np.ones((2, 3))            # 2x3 matrix of ones
identity = np.eye(4)              # 4x4 identity matrix
range_arr = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)   # [0., 0.25, 0.5, 0.75, 1.]

print("\nSpecial arrays:")
print("Zeros matrix:\n", zeros)
print("Linspace:", linspace)

# Random arrays
np.random.seed(42)  # For reproducibility
random_arr = np.random.rand(3, 3)     # Uniform distribution [0, 1)
randn_arr = np.random.randn(100)        # Standard normal distribution
randint_arr = np.random.randint(1, 100, 10)  # 10 random integers 1-99

print("\nRandom arrays:")
print("Random 3x3:\n", random_arr)
print("Random integers:", randint_arr)

# 2. Array operations
print("\n=== ARRAY OPERATIONS ===")

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-wise operations
print("a + b =", a + b)      # [11, 22, 33, 44, 55]
print("a * 2 =", a * 2)      # [2, 4, 6, 8, 10]
print("a ** 2 =", a ** 2)    # [1, 4, 9, 16, 25]
print("np.sin(a) =", np.sin(a))

# Aggregation functions
print("\nAggregations:")
print("Sum:", np.sum(a))
print("Mean:", np.mean(a))
print("Std:", np.std(a))
print("Min:", np.min(a))
print("Max:", np.max(a))
print("Median:", np.median(a))

# 3. Matrix operations
print("\n=== MATRIX OPERATIONS ===")

matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

print("Matrix A:\n", matrix_a)
print("Matrix B:\n", matrix_b)
print("A + B:\n", matrix_a + matrix_b)      # Element-wise addition
print("A * B:\n", matrix_a * matrix_b)      # Element-wise multiplication
print("A @ B:\n", matrix_a @ matrix_b)      # Matrix multiplication
print("A.T:\n", matrix_a.T)                # Transpose
print("det(A):", np.linalg.det(matrix_a))  # Determinant
print("inv(A):\n", np.linalg.inv(matrix_a)) # Inverse

# 4. Broadcasting
print("\n=== BROADCASTING ===")

# Scalar broadcasting
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array:\n", arr)
print("Array + 10:\n", arr + 10)

# Row/column broadcasting
row_vector = np.array([1, 2, 3])     # shape (3,): broadcasts across rows
col_vector = np.array([[10], [20]])  # shape (2, 1): broadcasts across columns

print("\nRow vector:", row_vector)
print("Col vector:\n", col_vector)
print("Array + row vector:\n", arr + row_vector)
print("Array + col vector:\n", arr + col_vector)

# 5. Advanced operations
print("\n=== ADVANCED OPERATIONS ===")

# Reshaping
arr = np.arange(12)
print("Original (12,):", arr)
print("Reshaped to (3, 4):\n", arr.reshape(3, 4))

# Flattening
matrix = np.array([[1, 2], [3, 4]])
print("\nMatrix:\n", matrix)
print("Flattened:", matrix.flatten())
print("Ravel:", matrix.ravel())

# Sorting
random_data = np.random.randn(10)
print("\nRandom data:", random_data)
print("Sorted:", np.sort(random_data))
print("Argsort (indices):", np.argsort(random_data))

# Filtering with boolean indexing
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mask = data > 5
print("\nData:", data)
print("Mask (data > 5):", mask)
print("Filtered data:", data[mask])

# Vectorized operations (much faster than loops)
import time

large_array = np.random.randn(1000000)

# SLOW: Python loop
start = time.time()
result_slow = [x * 2 for x in large_array]
loop_time = time.time() - start
print(f"\nPython loop time: {loop_time:.4f} seconds")

# FAST: NumPy vectorized
start = time.time()
result_fast = large_array * 2
vector_time = time.time() - start
print(f"NumPy vectorized time: {vector_time:.4f} seconds")
print(f"Speedup: {loop_time / vector_time:.0f}x faster")  # typically on the order of 100x

6. Data aggregation and grouping

Aggregation functions in Pandas

  • Count: count() - number of non-null values
  • Sum: sum() - sum of values
  • Mean: mean() - average value
  • Median: median() - median value
  • Std: std() - standard deviation
  • Min/Max: min(), max() - minimum/maximum
  • Quantile: quantile() - percentiles
  • Unique: nunique() - number of unique values
  • Mode: mode() - most frequent value
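
Most of these can be combined in a single groupby call. A minimal sketch, reusing the structure of the employee DataFrame from section 3:

# Combine several aggregation functions in one groupby
import pandas as pd

df = pd.DataFrame({
    'Afdeling': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'Salaris': [45000, 52000, 51000, 58000, 48000]
})

summary = df.groupby('Afdeling')['Salaris'].agg(
    ['count', 'sum', 'mean', 'median', 'std', 'min', 'max', 'nunique']
)
summary['p90'] = df.groupby('Afdeling')['Salaris'].quantile(0.9)
print(summary)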

7. Time series analysis

A complete time series analysis

# 1. Creating time series data
np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
n_days = len(date_range)

# Simulate sales data with a trend, seasonality and noise
trend = np.linspace(100, 200, n_days)  # Linear trend
seasonality = 50 * np.sin(2 * np.pi * np.arange(n_days) / 365)  # Yearly seasonal pattern
noise = np.random.normal(0, 10, n_days)  # Random noise

sales = trend + seasonality + noise

# Build the DataFrame
sales_df = pd.DataFrame({
    'Date': date_range,
    'Sales': sales,
    'DayOfWeek': date_range.day_name(),
    'Month': date_range.month_name(),
    'Quarter': date_range.quarter
})

print("=== TIME SERIES DATA ===")
print(sales_df.head())
print("\nData info:")
print(sales_df.info())

# 2. Time series visualization
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Raw time series
axes[0, 0].plot(sales_df['Date'], sales_df['Sales'], linewidth=1)
axes[0, 0].set_title('Daily Sales 2023')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')
axes[0, 0].grid(True, alpha=0.3)

# Monthly aggregation
monthly_sales = sales_df.resample('M', on='Date')['Sales'].mean()
axes[0, 1].bar(monthly_sales.index.strftime('%b'), monthly_sales.values)
axes[0, 1].set_title('Average Monthly Sales')
axes[0, 1].set_xlabel('Month')
axes[0, 1].set_ylabel('Average Sales')

# Day of week analysis
dow_sales = sales_df.groupby('DayOfWeek')['Sales'].mean()
dow_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_sales = dow_sales.reindex(dow_order)
axes[1, 0].plot(dow_sales.index, dow_sales.values, marker='o')
axes[1, 0].set_title('Average Sales by Day of Week')
axes[1, 0].set_xlabel('Day of Week')
axes[1, 0].set_ylabel('Average Sales')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)

# Rolling average (smoothing)
sales_df['Sales_7D_Avg'] = sales_df['Sales'].rolling(window=7, center=True).mean()
axes[1, 1].plot(sales_df['Date'], sales_df['Sales'], alpha=0.5, label='Daily')
axes[1, 1].plot(sales_df['Date'], sales_df['Sales_7D_Avg'], linewidth=2, label='7-Day Avg')
axes[1, 1].set_title('Daily Sales with 7-Day Moving Average')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Sales')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('time_series_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# 3. Time series decomposition
from statsmodels.tsa.seasonal import seasonal_decompose

# Make a copy with Date as the index
sales_ts = sales_df.set_index('Date')['Sales']

# Decompose the series (additive model)
# Note: seasonal_decompose needs at least two full cycles; with only one
# year of daily data, period=365 would fail, so we decompose weekly instead
decomposition = seasonal_decompose(sales_ts, model='additive', period=7)

# Plot decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 10))

axes[0].plot(decomposition.observed)
axes[0].set_title('Observed')
axes[0].grid(True, alpha=0.3)

axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].grid(True, alpha=0.3)

axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonality')
axes[2].grid(True, alpha=0.3)

axes[3].plot(decomposition.resid)
axes[3].set_title('Residuals')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('time_series_decomposition.png', dpi=300, bbox_inches='tight')
plt.show()

# 4. Time series feature engineering
print("\n=== TIME SERIES FEATURES ===")

# Lag features (previous values)
for lag in [1, 7, 30]:
    sales_df[f'Sales_Lag_{lag}'] = sales_df['Sales'].shift(lag)

# Rolling statistics
sales_df['Sales_Rolling_Mean_7'] = sales_df['Sales'].rolling(window=7).mean()
sales_df['Sales_Rolling_Std_7'] = sales_df['Sales'].rolling(window=7).std()
sales_df['Sales_Rolling_Min_7'] = sales_df['Sales'].rolling(window=7).min()
sales_df['Sales_Rolling_Max_7'] = sales_df['Sales'].rolling(window=7).max()

# Difference features
sales_df['Sales_Diff_1'] = sales_df['Sales'].diff(1)
sales_df['Sales_Pct_Change'] = sales_df['Sales'].pct_change() * 100

# Date-based features
sales_df['DayOfMonth'] = sales_df['Date'].dt.day
sales_df['WeekOfYear'] = sales_df['Date'].dt.isocalendar().week
sales_df['Is_Weekend'] = sales_df['Date'].dt.dayofweek >= 5
sales_df['Is_Month_Start'] = sales_df['Date'].dt.is_month_start
sales_df['Is_Month_End'] = sales_df['Date'].dt.is_month_end

print("Features added:")
print(sales_df.columns.tolist())
print("\nSample of engineered features:")
print(sales_df[['Date', 'Sales', 'Sales_Lag_1', 'Sales_Rolling_Mean_7', 
                 'Sales_Diff_1', 'Sales_Pct_Change', 'Is_Weekend']].head(10))

# 5. Time series forecasting (simple example)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Prepare data for forecasting
# Use last 30 days to predict next 7 days
train_size = 30
forecast_horizon = 7

# Create lag features for training
for lag in range(1, train_size + 1):
    sales_df[f'Lag_{lag}'] = sales_df['Sales'].shift(lag)

# Split data (last 'forecast_horizon' days for testing)
train_data = sales_df.iloc[train_size:-forecast_horizon].dropna()
test_data = sales_df.iloc[-forecast_horizon:]

# Prepare features (X) and target (y)
feature_cols = [f'Lag_{i}' for i in range(1, train_size + 1)]
X_train = train_data[feature_cols]
y_train = train_data['Sales']

# Train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
X_test = test_data[feature_cols]
predictions = model.predict(X_test)

# Evaluate predictions
actual = test_data['Sales'].values
mae = mean_absolute_error(actual, predictions)
rmse = np.sqrt(mean_squared_error(actual, predictions))

print(f"\n=== FORECASTING RESULTS ===")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print("\nActual vs Predicted:")
for i, (act, pred) in enumerate(zip(actual, predictions), 1):
    print(f"Day {i}: Actual={act:.1f}, Predicted={pred:.1f}, Error={abs(act-pred):.1f}")
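
# Sanity check: compare against a naive baseline that predicts each day
# with the previous day's actual value; if the model cannot beat this,
# the lag features add little (a sketch reusing the variables above)
naive_pred = sales_df['Sales'].shift(1).iloc[-forecast_horizon:].values
naive_mae = mean_absolute_error(actual, naive_pred)
print(f"Naive lag-1 MAE: {naive_mae:.2f}  vs  model MAE: {mae:.2f}")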

8. Case study: Sales data analysis

Complete sales data analysis pipeline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# 1. Simulate realistic sales data
np.random.seed(42)
n_transactions = 10000

# Product catalog
products = {
    'Laptop': {'price': 999, 'category': 'Electronics'},
    'Smartphone': {'price': 699, 'category': 'Electronics'},
    'Tablet': {'price': 399, 'category': 'Electronics'},
    'T-Shirt': {'price': 29.99, 'category': 'Clothing'},
    'Jeans': {'price': 79.99, 'category': 'Clothing'},
    'Jacket': {'price': 129.99, 'category': 'Clothing'},
    'Coffee Maker': {'price': 89.99, 'category': 'Home'},
    'Blender': {'price': 49.99, 'category': 'Home'},
    'Book': {'price': 19.99, 'category': 'Books'},
    'Headphones': {'price': 149.99, 'category': 'Electronics'}
}

# Generate sales data
data = []
transaction_id = 1000

for _ in range(n_transactions):
    transaction_id += 1
    
    # Random product selection (with probabilities)
    product_name = np.random.choice(
        list(products.keys()),
        p=[0.15, 0.20, 0.10, 0.08, 0.07, 0.05, 0.06, 0.04, 0.15, 0.10]
    )
    
    product_info = products[product_name]
    
    # Random date in 2023
    date = pd.to_datetime('2023-01-01') + pd.Timedelta(days=np.random.randint(0, 365))
    
    # Random customer
    customer_id = np.random.randint(1000, 2000)
    
    # Random quantity (mostly 1, sometimes more)
    quantity = np.random.choice([1, 2, 3], p=[0.8, 0.15, 0.05])
    
    # Random region
    region = np.random.choice(['North', 'South', 'East', 'West'], 
                            p=[0.3, 0.25, 0.25, 0.2])
    
    # Random discount (10% chance of 10% discount)
    discount = 0.1 if np.random.random() < 0.1 else 0
    
    # Calculate revenue
    unit_price = product_info['price']
    total_price = unit_price * quantity * (1 - discount)
    
    data.append({
        'transaction_id': transaction_id,
        'date': date,
        'customer_id': customer_id,
        'product': product_name,
        'category': product_info['category'],
        'unit_price': unit_price,
        'quantity': quantity,
        'discount': discount,
        'total_price': total_price,
        'region': region
    })

# Create DataFrame
sales_df = pd.DataFrame(data)

print("=== SALES DATA OVERVIEW ===")
print(f"Total transactions: {len(sales_df):,}")
print(f"Date range: {sales_df['date'].min().date()} to {sales_df['date'].max().date()}")
print(f"Unique customers: {sales_df['customer_id'].nunique():,}")
print(f"Unique products: {sales_df['product'].nunique()}")
print(f"Total revenue: €{sales_df['total_price'].sum():,.2f}")

# 2. Sales Performance Analysis
print("\n=== SALES PERFORMANCE ANALYSIS ===")

# Monthly revenue
sales_df['month'] = sales_df['date'].dt.to_period('M')
monthly_revenue = sales_df.groupby('month').agg({
    'total_price': 'sum',
    'transaction_id': 'count',
    'customer_id': 'nunique'
}).rename(columns={
    'total_price': 'revenue',
    'transaction_id': 'transactions',
    'customer_id': 'customers'
})

monthly_revenue['avg_transaction_value'] = monthly_revenue['revenue'] / monthly_revenue['transactions']
monthly_revenue['revenue_growth'] = monthly_revenue['revenue'].pct_change() * 100

print("Monthly Performance:")
print(monthly_revenue)

# Product performance
product_performance = sales_df.groupby('product').agg({
    'total_price': ['sum', 'mean'],
    'quantity': 'sum',
    'transaction_id': 'count',
    'customer_id': 'nunique'
})

product_performance.columns = ['revenue', 'avg_order_value', 'units_sold', 'transactions', 'customers']
product_performance = product_performance.sort_values('revenue', ascending=False)

print("\nTop 5 Products by Revenue:")
print(product_performance.head())

# Category performance
category_performance = sales_df.groupby('category').agg({
    'total_price': 'sum',
    'quantity': 'sum',
    'transaction_id': 'count'
}).rename(columns={
    'total_price': 'revenue',
    'quantity': 'units_sold',
    'transaction_id': 'transactions'
}).sort_values('revenue', ascending=False)

category_performance['revenue_share'] = (category_performance['revenue'] / category_performance['revenue'].sum()) * 100
category_performance['avg_price_per_unit'] = category_performance['revenue'] / category_performance['units_sold']

print("\nCategory Performance:")
print(category_performance)

# 3. Customer Analysis
print("\n=== CUSTOMER ANALYSIS ===")

# Customer segmentation by purchase frequency and value
customer_stats = sales_df.groupby('customer_id').agg({
    'total_price': 'sum',
    'transaction_id': 'count',
    'date': ['min', 'max']
})

customer_stats.columns = ['total_spent', 'purchase_count', 'first_purchase', 'last_purchase']
customer_stats['avg_order_value'] = customer_stats['total_spent'] / customer_stats['purchase_count']

# Recency calculation (days since last purchase)
latest_date = sales_df['date'].max()
customer_stats['recency_days'] = (latest_date - customer_stats['last_purchase']).dt.days

print("Customer Statistics:")
print(customer_stats.describe())

# RFM Segmentation (Recency, Frequency, Monetary)
# Create RFM scores (1-5, where 5 is best); rank(method='first') breaks ties
# so qcut can always form five equal-sized bins despite duplicate values
customer_stats['recency_score'] = pd.qcut(customer_stats['recency_days'].rank(method='first'), 5, labels=[5, 4, 3, 2, 1])
customer_stats['frequency_score'] = pd.qcut(customer_stats['purchase_count'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
customer_stats['monetary_score'] = pd.qcut(customer_stats['total_spent'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Combine RFM scores
customer_stats['rfm_score'] = (
    customer_stats['recency_score'].astype(str) +
    customer_stats['frequency_score'].astype(str) +
    customer_stats['monetary_score'].astype(str)
)

# Define RFM segments (note: this compares the 3-digit RFM string
# lexicographically, which is a deliberately crude segmentation)
def get_rfm_segment(row):
    if row['rfm_score'] >= '555':
        return 'Champions'
    elif row['rfm_score'] >= '444':
        return 'Loyal Customers'
    elif row['rfm_score'] >= '333':
        return 'Potential Loyalists'
    elif row['rfm_score'] >= '222':
        return 'Need Attention'
    else:
        return 'At Risk'

customer_stats['rfm_segment'] = customer_stats.apply(get_rfm_segment, axis=1)

print("\nRFM Segments Distribution:")
print(customer_stats['rfm_segment'].value_counts())
print("\nAverage Value by Segment:")
print(customer_stats.groupby('rfm_segment')['total_spent'].mean().sort_values(ascending=False))

# 4. Regional Analysis
print("\n=== REGIONAL ANALYSIS ===")

regional_performance = sales_df.groupby('region').agg({
    'total_price': 'sum',
    'transaction_id': 'count',
    'customer_id': 'nunique',
    'quantity': 'sum'
}).rename(columns={
    'total_price': 'revenue',
    'transaction_id': 'transactions',
    'customer_id': 'customers',
    'quantity': 'units_sold'
})

regional_performance['avg_transaction_value'] = regional_performance['revenue'] / regional_performance['transactions']
regional_performance['revenue_per_customer'] = regional_performance['revenue'] / regional_performance['customers']

print("Regional Performance:")
print(regional_performance)

# 5. Time-based Analysis
print("\n=== TIME-BASED ANALYSIS ===")

# Daily patterns (the simulated dates carry no time-of-day component,
# so we derive day-of-week and weekend features rather than hourly ones)
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
sales_df['is_weekend'] = sales_df['date'].dt.dayofweek >= 5

# Revenue by day of week
dow_revenue = sales_df.groupby('day_of_week').agg({
    'total_price': 'sum',
    'transaction_id': 'count'
}).rename(columns={
    'total_price': 'revenue',
    'transaction_id': 'transactions'
})

# Reorder days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_revenue = dow_revenue.reindex(day_order)

print("Revenue by Day of Week:")
print(dow_revenue)

# Weekend vs weekday comparison
weekend_stats = sales_df.groupby('is_weekend').agg({
    'total_price': 'sum',
    'transaction_id': 'count',
    'quantity': 'sum'
}).rename(columns={
    'total_price': 'revenue',
    'transaction_id': 'transactions',
    'quantity': 'units_sold'
})

weekend_stats['avg_transaction_value'] = weekend_stats['revenue'] / weekend_stats['transactions']
weekend_stats['revenue_share'] = (weekend_stats['revenue'] / weekend_stats['revenue'].sum()) * 100

print("\nWeekend vs Weekday Performance:")
print(weekend_stats)

# 6. Statistical Analysis
print("\n=== STATISTICAL ANALYSIS ===")

# Correlation analysis (drop the ID columns: numeric, but meaningless to correlate)
numeric_cols = sales_df.select_dtypes(include=[np.number]).columns.drop(['transaction_id', 'customer_id'])
correlation_matrix = sales_df[numeric_cols].corr()

print("Correlation Matrix:")
print(correlation_matrix)

# Hypothesis testing: Is there a significant difference between regions?
# ANOVA test for revenue differences between regions
regions = sales_df['region'].unique()
region_groups = [sales_df[sales_df['region'] == region]['total_price'] for region in regions]

f_stat, p_value = stats.f_oneway(*region_groups)

print(f"\nANOVA Test for Regional Revenue Differences:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.05:
    print("Conclusion: Significant differences exist between regions (p < 0.05)")
else:
    print("Conclusion: No significant differences between regions")
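
# ANOVA only says that *some* difference exists, not which regions differ.
# A standard follow-up is a Tukey HSD post-hoc test; a minimal sketch using
# statsmodels (installed in the setup section)
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=sales_df['total_price'],
                          groups=sales_df['region'],
                          alpha=0.05)
print(tukey)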

# 7. Export results
# Assemble a summary report dict (kept in memory here; it can be serialized
# to JSON or fed into a reporting template)
summary_report = {
    'Overall Metrics': {
        'Total Revenue': f"€{sales_df['total_price'].sum():,.2f}",
        'Total Transactions': f"{len(sales_df):,}",
        'Unique Customers': f"{sales_df['customer_id'].nunique():,}",
        'Average Transaction Value': f"€{sales_df['total_price'].mean():.2f}",
        'Date Range': f"{sales_df['date'].min().date()} to {sales_df['date'].max().date()}"
    },
    'Top Products': product_performance.head(3).to_dict(),
    'Category Performance': category_performance.to_dict(),
    'Regional Performance': regional_performance.to_dict(),
    'Monthly Trends': monthly_revenue.to_dict()
}

# Save to CSV
monthly_revenue.to_csv('monthly_sales_report.csv')
product_performance.to_csv('product_performance.csv')
customer_stats.to_csv('customer_analysis.csv')

print("\n=== ANALYSIS COMPLETE ===")
print("Reports saved to CSV files:")
print("- monthly_sales_report.csv")
print("- product_performance.csv")
print("- customer_analysis.csv")

# Key insights
print("\n=== KEY INSIGHTS ===")
print("1. Total Revenue: €{:,}".format(int(sales_df['total_price'].sum())))
print("2. Best Performing Category: {} (€{:.2f} revenue)".format(
    category_performance.index[0], 
    category_performance.iloc[0]['revenue']
))
print("3. Top Product: {} (€{:.2f} revenue)".format(
    product_performance.index[0],
    product_performance.iloc[0]['revenue']
))
print("4. Best Region: {} (€{:.2f} revenue)".format(
    regional_performance.index[0],
    regional_performance.iloc[0]['revenue']
))
print("5. Customer Segmentation:")
for segment, count in customer_stats['rfm_segment'].value_counts().items():
    print(f"   - {segment}: {count} customers")

Conclusion and next steps

Data analysis in Python is an essential skill for every data professional. You have now learned:

  1. Pandas: for data manipulation, cleaning and aggregation
  2. NumPy: for numerical computing and array operations
  3. Data cleaning: for preparing data for analysis
  4. Time series analysis: for trend analysis and forecasting
  5. Statistical analysis: for drawing meaningful conclusions

Next steps:

  • Start your own data analysis project
  • Learn machine learning for predictive analytics
  • Integrate with databases and APIs
  • Build dashboards for data presentation
  • Follow our advanced data science tutorials