Python Data Analysis: Complete Guide with Pandas & NumPy
Learn professional data analysis in Python. A complete tutorial from data cleaning to advanced analytics with Pandas and NumPy.
1. Why Python for data analysis?
What is data analysis?
Data analysis is the process of inspecting, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. Python is the most popular language for data analysis thanks to its simple syntax and powerful libraries.
Rapid prototyping
Python's simple syntax lets you run analyses and build prototypes quickly.
Extensive libraries
Pandas, NumPy, SciPy, and scikit-learn for every analysis need.
Visualization integration
Seamless integration with Matplotlib, Seaborn, and Plotly for visualization.
Active community
The largest data science community, with extensive documentation and support.
| Tool | Python + Pandas | Excel | SQL |
|---|---|---|---|
| Data size | GBs to TBs (with chunking) | Max ~1M rows | TB+ (server dependent) |
| Reproducibility | Excellent (scripts) | Limited (manual) | Good (queries) |
| Automation | Full automation | Limited (VBA) | Good (stored procedures) |
| Complex analyses | Machine learning, statistics | Basic formulas | Aggregations only |
| Learning curve | Medium | Easy | Medium |
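The "with chunking" qualifier refers to Pandas' chunked readers, which let you stream a file that does not fit in memory. A minimal sketch (the file name and the amount column are hypothetical):
import pandas as pd

# Stream a large CSV in 100,000-row chunks instead of loading it at once.
# 'large_file.csv' and the 'amount' column are placeholder examples.
total = 0.0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    total += chunk['amount'].sum()  # aggregate per chunk, combine afterwards
print(f"Total amount: {total:,.2f}")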
2. Setup and Python environment
Python 3.9+
Core programming language
Pandas
Data manipulation
NumPy
Numerical computing
Matplotlib
Data visualization
Seaborn
Statistical plots
SciPy
Scientific computing
Installation and setup
# Option 1: Anaconda (recommended for beginners)
# Download from: https://www.anaconda.com/products/distribution
# Ships with all major data science libraries preinstalled
# Option 2: pip installation
# Basic data science stack (statsmodels is needed for the time series section)
pip install pandas numpy matplotlib seaborn scipy scikit-learn statsmodels jupyter
# Option 3: requirements.txt
# Create a requirements.txt file:
pandas==2.0.3
numpy==1.24.3
matplotlib==3.7.1
seaborn==0.12.2
scipy==1.10.1
statsmodels==0.14.0
jupyter==1.0.0
scikit-learn==1.3.0
# Install all packages:
pip install -r requirements.txt
# Import statements for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Display settings for more readable output
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)  # Show at most 100 rows
pd.set_option('display.float_format', '{:.2f}'.format)  # 2 decimals for floats
sns.set_style("whitegrid")  # Seaborn styling
plt.rcParams['figure.figsize'] = (12, 6)  # Default figure size
# Check versions
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Jupyter Notebook/Lab for data analysis
Jupyter is the standard environment for data analysis in Python:
- Start Jupyter: jupyter notebook or jupyter lab
- Create a new notebook: Python 3 kernel
- Cell types: code cells and Markdown cells
- Magic commands: %matplotlib inline for inline plots in the notebook
- Keyboard shortcuts: Shift+Enter (run cell), Esc then A/B (insert cell above/below)
Tip: Use !pip install package in a Jupyter cell to install packages.
3. Pandas: Data manipulation basics
Pandas Data Structures
- Series: one-dimensional labeled array (like a single column in Excel)
- DataFrame: two-dimensional labeled data structure (like a spreadsheet)
- Index: unique labels for the rows (can be integers, strings, or datetimes)
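A minimal sketch of how the three structures relate (the names used here are illustrative):
import pandas as pd

# A Series is one-dimensional and carries its own labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # 20 — access by label

# A DataFrame is essentially a dict of Series sharing one Index
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
print(df_demo.index)  # RangeIndex(start=0, stop=3, step=1)
print(df_demo['x'])   # selecting one column returns a Series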
Creating DataFrames and basic operations
# 1. Creating DataFrames from different sources
# From a dictionary
data = {
'Naam': ['Jan', 'Piet', 'Klaas', 'Marie', 'Anna'],
'Leeftijd': [25, 30, 35, 28, 32],
'Stad': ['Amsterdam', 'Rotterdam', 'Den Haag', 'Utrecht', 'Amsterdam'],
'Salaris': [45000, 52000, 48000, 51000, 47000],
'Afdeling': ['Sales', 'IT', 'HR', 'Sales', 'Marketing']
}
df = pd.DataFrame(data)
print("DataFrame shape:", df.shape) # (5, 5)
print("DataFrame info:")
print(df.info())
print("\nFirst 3 rows:")
print(df.head(3))
# From a CSV file
# df = pd.read_csv('data.csv')
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# df = pd.read_json('data.json')
# df = pd.read_sql('SELECT * FROM table', connection)
# 2. Basic DataFrame operations
print("\n=== BASIC OPERATIONS ===")
print("Column names:", df.columns.tolist())
print("Index:", df.index.tolist())
print("Data types:")
print(df.dtypes)
print("\nDescriptive statistics:")
print(df.describe())
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
# 3. Data selection and filtering
print("\n=== DATA SELECTION ===")
# Select single column
salaris_series = df['Salaris']
print("Salaris column:", salaris_series.tolist())
# Select multiple columns
subset = df[['Naam', 'Salaris', 'Afdeling']]
print("\nSubset with 3 columns:")
print(subset)
# Filter rows based on condition
sales_team = df[df['Afdeling'] == 'Sales']
print("\nSales team:")
print(sales_team)
# Complex filtering
high_earners = df[(df['Salaris'] > 50000) & (df['Leeftijd'] < 35)]
print("\nHigh earners under 35:")
print(high_earners)
# 4. Sorting
print("\n=== SORTING ===")
df_sorted = df.sort_values(by='Salaris', ascending=False)
print("Sorted by salary (high to low):")
print(df_sorted)
# 5. Adding new columns
df['Salaris_Categorie'] = pd.cut(df['Salaris'],
bins=[40000, 47000, 50000, 55000],
labels=['Laag', 'Medium', 'Hoog'])
df['Leeftijd_Volgend_Jaar'] = df['Leeftijd'] + 1
print("\nDataFrame with new columns:")
print(df)
Advanced DataFrame operations
# 1. GroupBy operations
print("=== GROUPBY OPERATIONS ===")
# Simple groupby
afdeling_stats = df.groupby('Afdeling').agg({
'Salaris': ['mean', 'sum', 'count', 'min', 'max'],
'Leeftijd': 'mean'
})
print("Department statistics:")
print(afdeling_stats)
# Multi-level groupby
stad_afdeling_stats = df.groupby(['Stad', 'Afdeling']).agg({
'Salaris': 'mean',
'Leeftijd': 'mean',
    'Naam': 'count'  # Number of employees
}).rename(columns={'Naam': 'Aantal_Werknemers'})
print("\nCity and department statistics:")
print(stad_afdeling_stats)
# 2. Pivot tables
print("\n=== PIVOT TABLES ===")
pivot_table = pd.pivot_table(df,
values='Salaris',
index='Stad',
columns='Afdeling',
aggfunc=['mean', 'count'],
fill_value=0)
print("Pivot table:")
print(pivot_table)
# 3. Merge/Join operations
print("\n=== MERGE/JOIN OPERATIONS ===")
# Create a second DataFrame
afdeling_data = pd.DataFrame({
'Afdeling': ['Sales', 'IT', 'HR', 'Marketing', 'Finance'],
'Manager': ['Peter', 'Lisa', 'Mike', 'Sarah', 'David'],
'Budget': [100000, 150000, 80000, 120000, 90000]
})
# Inner join (default)
merged_df = pd.merge(df, afdeling_data, on='Afdeling', how='inner')
print("Inner join result:")
print(merged_df)
# Left join
left_merged = pd.merge(df, afdeling_data, on='Afdeling', how='left')
print("\nLeft join result:")
print(left_merged)
# 4. Apply functions
print("\n=== APPLY FUNCTIONS ===")
# Apply on a single column
df['Salaris_Bonus'] = df['Salaris'].apply(lambda x: x * 1.1 if x > 50000 else x * 1.05)
# More complex row-wise apply function
def categorize_employee(row):
if row['Leeftijd'] < 30 and row['Salaris'] > 45000:
return 'High Potential'
elif row['Leeftijd'] >= 40:
return 'Experienced'
else:
return 'Regular'
df['Categorie'] = df.apply(categorize_employee, axis=1)
print("DataFrame with apply results:")
print(df[['Naam', 'Leeftijd', 'Salaris', 'Salaris_Bonus', 'Categorie']])
# 5. String operations
print("\n=== STRING OPERATIONS ===")
df['Naam_Upper'] = df['Naam'].str.upper()
df['Naam_Lower'] = df['Naam'].str.lower()
df['Naam_Length'] = df['Naam'].str.len()
# Substring check
df['Stad_Contains_A'] = df['Stad'].str.contains('a', case=False)
print("String operations:")
print(df[['Naam', 'Naam_Upper', 'Naam_Lower', 'Naam_Length', 'Stad', 'Stad_Contains_A']])
4. Data cleaning and preprocessing
Common data problems
- Missing values: NaN, None, empty strings
- Inconsistent formatting: date/time formats, capitalization
- Outliers: extreme values that distort analyses
- Duplicates: duplicate rows in the dataset
- Incorrect data types: numbers stored as strings, dates as text
- Inconsistent categories: "M", "Male", "Man" for the same value
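Before cleaning, a quick diagnostic pass shows which of these problems a dataset actually has. A minimal sketch (assuming an arbitrary DataFrame df):
import pandas as pd

def diagnose(df: pd.DataFrame) -> None:
    """Print a quick overview of common data quality issues."""
    print("Missing values per column:")
    print(df.isnull().sum())
    print(f"\nExact duplicate rows: {df.duplicated().sum()}")
    print("\nData types (watch for numbers stored as object):")
    print(df.dtypes)
    # Inconsistent categories show up as unexpectedly many variants
    for col in df.select_dtypes(include='object').columns:
        print(f"{col}: {df[col].dropna().unique()[:10]}")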
Complete data cleaning process
# 1. Sample dataset with problems
np.random.seed(42)
n_rows = 20
dirty_data = pd.DataFrame({
'ID': range(1, n_rows + 1),
'Naam': ['Jan', 'Piet', np.nan, 'Marie', 'Anna', 'Jan', 'peter', 'LISA', 'Tom', 'Sarah',
'Mike', 'Eva', 'David', 'laura', 'JOHN', 'Sara', 'Mark', 'Emma', 'Paul', 'Lisa'],
'Leeftijd': [25, 30, 35, '28', 32, 25, 40, 22, 45, 29,
33, 27, 38, 31, 26, 34, 150, 24, 36, 28], # 150 is outlier
'Salaris': [45000, 52000, None, 51000, 47000, 45000, 60000, 42000, 75000, 48000,
55000, 46000, 68000, 49000, 43000, 54000, 1000000, 41000, 72000, 53000], # 1M is outlier
'Datum_Aanmelding': ['2023-01-15', '15-02-2023', '2023/03/10', '01-04-2023', '2023-05-20',
'2023-01-15', '2023-06-15', '2023-07-01', '01-08-2023', '2023-09-10',
'10-10-2023', '2023-11-05', '2023/12/15', '2024-01-10', '15-01-2024',
'2024-02-20', '2024/03/05', '05-04-2024', '2024-05-15', '2024-06-01'],
'Geslacht': ['M', 'M', 'V', 'V', 'V', 'M', 'm', 'F', 'M', 'V',
'M', 'v', 'M', 'female', 'M', 'V', 'Male', 'V', 'M', 'f'],
'Afdeling': ['Sales', 'IT', None, 'Sales', 'Marketing', 'Sales', 'IT', 'HR', 'IT', 'Marketing',
'Finance', 'HR', 'IT', 'Marketing', 'Sales', 'HR', 'IT', 'Marketing', 'Finance', None]
})
print("=== ORIGINAL DIRTY DATA ===")
print(dirty_data)
print("\nData info:")
print(dirty_data.info())
print("\nMissing values:")
print(dirty_data.isnull().sum())
# 2. Data cleaning pipeline
def clean_dataframe(df):
"""Complete data cleaning pipeline"""
df_clean = df.copy()
# A. Remove exact duplicates
df_clean = df_clean.drop_duplicates()
print(f"Duplicates removed: {len(df) - len(df_clean)}")
    # B. Standardize text columns (.str methods propagate NaN instead of
    # turning it into the string 'nan', as astype(str) would)
    if 'Naam' in df_clean.columns:
        df_clean['Naam'] = df_clean['Naam'].str.title().str.strip()
# C. Fix gender column
gender_mapping = {
'M': 'Man', 'm': 'Man', 'Male': 'Man',
'V': 'Vrouw', 'v': 'Vrouw', 'F': 'Vrouw',
'f': 'Vrouw', 'female': 'Vrouw'
}
if 'Geslacht' in df_clean.columns:
df_clean['Geslacht'] = df_clean['Geslacht'].replace(gender_mapping)
# D. Convert data types
if 'Leeftijd' in df_clean.columns:
df_clean['Leeftijd'] = pd.to_numeric(df_clean['Leeftijd'], errors='coerce')
# E. Handle missing values
# Numeric columns: mean imputation
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
if df_clean[col].isnull().any():
mean_val = df_clean[col].mean()
df_clean[col] = df_clean[col].fillna(mean_val)
print(f"Missing values in {col} filled with mean: {mean_val:.2f}")
# Categorical columns: mode imputation
categorical_cols = df_clean.select_dtypes(include=['object']).columns
for col in categorical_cols:
if df_clean[col].isnull().any():
mode_val = df_clean[col].mode()[0]
df_clean[col] = df_clean[col].fillna(mode_val)
print(f"Missing values in {col} filled with mode: {mode_val}")
# F. Handle outliers
if 'Leeftijd' in df_clean.columns:
# Remove age > 100 (obvious outliers)
outlier_mask = df_clean['Leeftijd'] > 100
if outlier_mask.any():
print(f"Removing {outlier_mask.sum()} age outliers")
df_clean = df_clean[~outlier_mask]
if 'Salaris' in df_clean.columns:
# Cap salary outliers at 99th percentile
salary_cap = df_clean['Salaris'].quantile(0.99)
high_salary_mask = df_clean['Salaris'] > salary_cap
if high_salary_mask.any():
print(f"Capping {high_salary_mask.sum()} salary outliers at {salary_cap:.0f}")
df_clean.loc[high_salary_mask, 'Salaris'] = salary_cap
# G. Fix date formats
if 'Datum_Aanmelding' in df_clean.columns:
# Try multiple date formats
date_formats = ['%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y']
def parse_date(date_str):
for fmt in date_formats:
try:
return pd.to_datetime(date_str, format=fmt)
except ValueError:
continue
return pd.NaT
df_clean['Datum_Aanmelding'] = df_clean['Datum_Aanmelding'].apply(parse_date)
print("Dates standardized")
return df_clean
# 3. Apply cleaning pipeline
cleaned_data = clean_dataframe(dirty_data)
print("\n\n=== CLEANED DATA ===")
print(cleaned_data)
print("\nCleaned data info:")
print(cleaned_data.info())
print("\nMissing values after cleaning:")
print(cleaned_data.isnull().sum())
print("\nData statistics:")
print(cleaned_data.describe())
5. NumPy: Numerical computing
Why use NumPy?
- Speed: 10-100x faster than Python lists
- Memory efficiency: homogeneous data types
- Vectorized operations: operations on entire arrays at once
- Mathematical functions: linear algebra, statistics, random numbers
- Broadcasting: operations between arrays of different shapes
NumPy basics and advanced operations
import numpy as np
# 1. Creating NumPy arrays
print("=== CREATING NUMPY ARRAYS ===")
# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)
print("Shape:", arr1.shape)
print("Data type:", arr1.dtype)
# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Array:")
print(arr2d)
print("Shape:", arr2d.shape)
# Special arrays
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
ones = np.ones((2, 3)) # 2x3 matrix of ones
identity = np.eye(4) # 4x4 identity matrix
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # [0., 0.25, 0.5, 0.75, 1.]
print("\nSpecial arrays:")
print("Zeros matrix:\n", zeros)
print("Linspace:", linspace)
# Random arrays
np.random.seed(42) # For reproducibility
random_arr = np.random.rand(3, 3) # Uniform distribution [0, 1)
randn_arr = np.random.randn(100) # Standard normal distribution
randint_arr = np.random.randint(1, 100, 10) # 10 random integers 1-99
print("\nRandom arrays:")
print("Random 3x3:\n", random_arr)
print("Random integers:", randint_arr)
# 2. Array operations
print("\n=== ARRAY OPERATIONS ===")
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise operations
print("a + b =", a + b) # [11, 22, 33, 44, 55]
print("a * 2 =", a * 2) # [2, 4, 6, 8, 10]
print("a ** 2 =", a ** 2) # [1, 4, 9, 16, 25]
print("np.sin(a) =", np.sin(a))
# Aggregation functions
print("\nAggregations:")
print("Sum:", np.sum(a))
print("Mean:", np.mean(a))
print("Std:", np.std(a))
print("Min:", np.min(a))
print("Max:", np.max(a))
print("Median:", np.median(a))
# 3. Matrix operations
print("\n=== MATRIX OPERATIONS ===")
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print("Matrix A:\n", matrix_a)
print("Matrix B:\n", matrix_b)
print("A + B:\n", matrix_a + matrix_b) # Element-wise addition
print("A * B:\n", matrix_a * matrix_b) # Element-wise multiplication
print("A @ B:\n", matrix_a @ matrix_b) # Matrix multiplication
print("A.T:\n", matrix_a.T) # Transpose
print("det(A):", np.linalg.det(matrix_a)) # Determinant
print("inv(A):\n", np.linalg.inv(matrix_a)) # Inverse
# 4. Broadcasting
print("\n=== BROADCASTING ===")
# Scalar broadcasting
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array:\n", arr)
print("Array + 10:\n", arr + 10)
# Row/column broadcasting
row_vector = np.array([1, 2, 3])
col_vector = np.array([[1], [2], [3]])
print("\nRow vector:", row_vector)
print("Col vector:\n", col_vector)
print("Array + row vector:\n", arr + row_vector)
print("Array + col vector:\n", arr + col_vector)
# 5. Advanced operations
print("\n=== ADVANCED OPERATIONS ===")
# Reshaping
arr = np.arange(12)
print("Original (12,):", arr)
print("Reshaped to (3, 4):\n", arr.reshape(3, 4))
# Flattening
matrix = np.array([[1, 2], [3, 4]])
print("\nMatrix:\n", matrix)
print("Flattened:", matrix.flatten())
print("Ravel:", matrix.ravel())
# Sorting
random_data = np.random.randn(10)
print("\nRandom data:", random_data)
print("Sorted:", np.sort(random_data))
print("Argsort (indices):", np.argsort(random_data))
# Filtering with boolean indexing
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mask = data > 5
print("\nData:", data)
print("Mask (data > 5):", mask)
print("Filtered data:", data[mask])
# Vectorized operations (much faster than loops)
large_array = np.random.randn(1000000)
# SLOW: Python loop
import time
start = time.time()
result_slow = [x * 2 for x in large_array]
loop_time = time.time() - start
print(f"\nPython loop time: {loop_time:.4f} seconds")
# FAST: NumPy vectorized
start = time.time()
result_fast = large_array * 2
vectorized_time = time.time() - start
print(f"NumPy vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time / vectorized_time:.0f}x faster") # typically on the order of 100x
6. Data aggregation and grouping
Aggregation functions in Pandas
- Count: count() - number of non-null values
- Sum: sum() - sum of values
- Mean: mean() - average value
- Median: median() - median value
- Std: std() - standard deviation
- Min/Max: min(), max() - minimum/maximum
- Quantile: quantile() - percentiles
- Unique: nunique() - number of unique values
- Mode: mode() - most frequent value
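A minimal sketch applying these aggregations, assuming the employee DataFrame df from section 3 is still in scope:
# Single aggregations on one column
print(df['Salaris'].mean())         # average salary
print(df['Salaris'].median())       # median salary
print(df['Salaris'].quantile(0.9))  # 90th percentile

# Several aggregations at once, per group
print(df.groupby('Afdeling')['Salaris'].agg(['count', 'sum', 'mean', 'std', 'min', 'max']))

# Named aggregations across multiple columns
print(df.groupby('Stad').agg(
    employees=('Naam', 'count'),
    avg_salary=('Salaris', 'mean'),
    departments=('Afdeling', 'nunique'),
))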
7. Time series analysis
Complete time series analysis
# 1. Create time series data
np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
n_days = len(date_range)
# Simulate sales data with trend, seasonality, and noise
trend = np.linspace(100, 200, n_days) # Linear trend
seasonality = 50 * np.sin(2 * np.pi * np.arange(n_days) / 365) # Yearly seasonal pattern
noise = np.random.normal(0, 10, n_days) # Random noise
sales = trend + seasonality + noise
# Create the DataFrame
sales_df = pd.DataFrame({
'Date': date_range,
'Sales': sales,
'DayOfWeek': date_range.day_name(),
'Month': date_range.month_name(),
'Quarter': date_range.quarter
})
print("=== TIME SERIES DATA ===")
print(sales_df.head())
print("\nData info:")
print(sales_df.info())
# 2. Time series visualization
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Raw time series
axes[0, 0].plot(sales_df['Date'], sales_df['Sales'], linewidth=1)
axes[0, 0].set_title('Daily Sales 2023')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')
axes[0, 0].grid(True, alpha=0.3)
# Monthly aggregation
monthly_sales = sales_df.resample('M', on='Date')['Sales'].mean()
axes[0, 1].bar(monthly_sales.index.strftime('%b'), monthly_sales.values)
axes[0, 1].set_title('Average Monthly Sales')
axes[0, 1].set_xlabel('Month')
axes[0, 1].set_ylabel('Average Sales')
# Day of week analysis
dow_sales = sales_df.groupby('DayOfWeek')['Sales'].mean()
dow_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_sales = dow_sales.reindex(dow_order)
axes[1, 0].plot(dow_sales.index, dow_sales.values, marker='o')
axes[1, 0].set_title('Average Sales by Day of Week')
axes[1, 0].set_xlabel('Day of Week')
axes[1, 0].set_ylabel('Average Sales')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)
# Rolling average (smoothing)
sales_df['Sales_7D_Avg'] = sales_df['Sales'].rolling(window=7, center=True).mean()
axes[1, 1].plot(sales_df['Date'], sales_df['Sales'], alpha=0.5, label='Daily')
axes[1, 1].plot(sales_df['Date'], sales_df['Sales_7D_Avg'], linewidth=2, label='7-Day Avg')
axes[1, 1].set_title('Daily Sales with 7-Day Moving Average')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Sales')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('time_series_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# 3. Time series decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
# Make a copy with Date as the index
sales_ts = sales_df.set_index('Date')['Sales']
# Decompose the series (additive model). seasonal_decompose needs at least two
# full cycles, so one year of daily data cannot use period=365; period=30
# extracts a monthly-scale pattern instead.
decomposition = seasonal_decompose(sales_ts, model='additive', period=30)
# Plot decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 10))
axes[0].plot(decomposition.observed)
axes[0].set_title('Observed')
axes[0].grid(True, alpha=0.3)
axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].grid(True, alpha=0.3)
axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonality')
axes[2].grid(True, alpha=0.3)
axes[3].plot(decomposition.resid)
axes[3].set_title('Residuals')
axes[3].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('time_series_decomposition.png', dpi=300, bbox_inches='tight')
plt.show()
# 4. Time series features engineering
print("\n=== TIME SERIES FEATURES ===")
# Lag features (previous values)
for lag in [1, 7, 30]:
sales_df[f'Sales_Lag_{lag}'] = sales_df['Sales'].shift(lag)
# Rolling statistics
sales_df['Sales_Rolling_Mean_7'] = sales_df['Sales'].rolling(window=7).mean()
sales_df['Sales_Rolling_Std_7'] = sales_df['Sales'].rolling(window=7).std()
sales_df['Sales_Rolling_Min_7'] = sales_df['Sales'].rolling(window=7).min()
sales_df['Sales_Rolling_Max_7'] = sales_df['Sales'].rolling(window=7).max()
# Difference features
sales_df['Sales_Diff_1'] = sales_df['Sales'].diff(1)
sales_df['Sales_Pct_Change'] = sales_df['Sales'].pct_change() * 100
# Date-based features
sales_df['DayOfMonth'] = sales_df['Date'].dt.day
sales_df['WeekOfYear'] = sales_df['Date'].dt.isocalendar().week
sales_df['Is_Weekend'] = sales_df['Date'].dt.dayofweek >= 5
sales_df['Is_Month_Start'] = sales_df['Date'].dt.is_month_start
sales_df['Is_Month_End'] = sales_df['Date'].dt.is_month_end
print("Features added:")
print(sales_df.columns.tolist())
print("\nSample of engineered features:")
print(sales_df[['Date', 'Sales', 'Sales_Lag_1', 'Sales_Rolling_Mean_7',
'Sales_Diff_1', 'Sales_Pct_Change', 'Is_Weekend']].head(10))
# 5. Time series forecasting (simple example)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Prepare data for forecasting
# Use last 30 days to predict next 7 days
train_size = 30
forecast_horizon = 7
# Create lag features for training
for lag in range(1, train_size + 1):
sales_df[f'Lag_{lag}'] = sales_df['Sales'].shift(lag)
# Split data (last 'forecast_horizon' days for testing)
train_data = sales_df.iloc[train_size:-forecast_horizon].dropna()
test_data = sales_df.iloc[-forecast_horizon:]
# Prepare features (X) and target (y)
feature_cols = [f'Lag_{i}' for i in range(1, train_size + 1)]
X_train = train_data[feature_cols]
y_train = train_data['Sales']
# Train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
X_test = test_data[feature_cols]
predictions = model.predict(X_test)
# Evaluate predictions
actual = test_data['Sales'].values
mae = mean_absolute_error(actual, predictions)
rmse = np.sqrt(mean_squared_error(actual, predictions))
print(f"\n=== FORECASTING RESULTS ===")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print("\nActual vs Predicted:")
for i, (act, pred) in enumerate(zip(actual, predictions), 1):
print(f"Day {i}: Actual={act:.1f}, Predicted={pred:.1f}, Error={abs(act-pred):.1f}")
8. Worked example: Sales data analysis
Complete sales data analysis pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# 1. Simulate realistic sales data
np.random.seed(42)
n_transactions = 10000
# Product catalog
products = {
'Laptop': {'price': 999, 'category': 'Electronics'},
'Smartphone': {'price': 699, 'category': 'Electronics'},
'Tablet': {'price': 399, 'category': 'Electronics'},
'T-Shirt': {'price': 29.99, 'category': 'Clothing'},
'Jeans': {'price': 79.99, 'category': 'Clothing'},
'Jacket': {'price': 129.99, 'category': 'Clothing'},
'Coffee Maker': {'price': 89.99, 'category': 'Home'},
'Blender': {'price': 49.99, 'category': 'Home'},
'Book': {'price': 19.99, 'category': 'Books'},
'Headphones': {'price': 149.99, 'category': 'Electronics'}
}
# Generate sales data
data = []
transaction_id = 1000
for _ in range(n_transactions):
transaction_id += 1
# Random product selection (with probabilities)
product_name = np.random.choice(
list(products.keys()),
p=[0.15, 0.20, 0.10, 0.08, 0.07, 0.05, 0.06, 0.04, 0.15, 0.10]
)
product_info = products[product_name]
# Random date in 2023
date = pd.to_datetime('2023-01-01') + pd.Timedelta(days=np.random.randint(0, 365))
# Random customer
customer_id = np.random.randint(1000, 2000)
# Random quantity (mostly 1, sometimes more)
quantity = np.random.choice([1, 2, 3], p=[0.8, 0.15, 0.05])
# Random region
region = np.random.choice(['North', 'South', 'East', 'West'],
p=[0.3, 0.25, 0.25, 0.2])
# Random discount (10% chance of 10% discount)
discount = 0.1 if np.random.random() < 0.1 else 0
# Calculate revenue
unit_price = product_info['price']
total_price = unit_price * quantity * (1 - discount)
data.append({
'transaction_id': transaction_id,
'date': date,
'customer_id': customer_id,
'product': product_name,
'category': product_info['category'],
'unit_price': unit_price,
'quantity': quantity,
'discount': discount,
'total_price': total_price,
'region': region
})
# Create DataFrame
sales_df = pd.DataFrame(data)
print("=== SALES DATA OVERVIEW ===")
print(f"Total transactions: {len(sales_df):,}")
print(f"Date range: {sales_df['date'].min().date()} to {sales_df['date'].max().date()}")
print(f"Unique customers: {sales_df['customer_id'].nunique():,}")
print(f"Unique products: {sales_df['product'].nunique()}")
print(f"Total revenue: €{sales_df['total_price'].sum():,.2f}")
# 2. Sales Performance Analysis
print("\n=== SALES PERFORMANCE ANALYSIS ===")
# Monthly revenue
sales_df['month'] = sales_df['date'].dt.to_period('M')
monthly_revenue = sales_df.groupby('month').agg({
'total_price': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'customer_id': 'customers'
})
monthly_revenue['avg_transaction_value'] = monthly_revenue['revenue'] / monthly_revenue['transactions']
monthly_revenue['revenue_growth'] = monthly_revenue['revenue'].pct_change() * 100
print("Monthly Performance:")
print(monthly_revenue)
# Product performance
product_performance = sales_df.groupby('product').agg({
'total_price': ['sum', 'mean'],
'quantity': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique'
})
product_performance.columns = ['revenue', 'avg_order_value', 'units_sold', 'transactions', 'customers']
product_performance = product_performance.sort_values('revenue', ascending=False)
print("\nTop 5 Products by Revenue:")
print(product_performance.head())
# Category performance
category_performance = sales_df.groupby('category').agg({
'total_price': 'sum',
'quantity': 'sum',
'transaction_id': 'count'
}).rename(columns={
'total_price': 'revenue',
'quantity': 'units_sold',
'transaction_id': 'transactions'
}).sort_values('revenue', ascending=False)
category_performance['revenue_share'] = (category_performance['revenue'] / category_performance['revenue'].sum()) * 100
category_performance['avg_price_per_unit'] = category_performance['revenue'] / category_performance['units_sold']
print("\nCategory Performance:")
print(category_performance)
# 3. Customer Analysis
print("\n=== CUSTOMER ANALYSIS ===")
# Customer segmentation by purchase frequency and value
customer_stats = sales_df.groupby('customer_id').agg({
'total_price': 'sum',
'transaction_id': 'count',
'date': ['min', 'max']
})
customer_stats.columns = ['total_spent', 'purchase_count', 'first_purchase', 'last_purchase']
customer_stats['avg_order_value'] = customer_stats['total_spent'] / customer_stats['purchase_count']
# Recency calculation (days since last purchase)
latest_date = sales_df['date'].max()
customer_stats['recency_days'] = (latest_date - customer_stats['last_purchase']).dt.days
print("Customer Statistics:")
print(customer_stats.describe())
# RFM Segmentation (Recency, Frequency, Monetary)
# Create RFM scores (1-5, where 5 is best)
# rank(method='first') breaks ties so qcut always finds 5 unique bin edges
customer_stats['recency_score'] = pd.qcut(customer_stats['recency_days'].rank(method='first'), 5, labels=[5, 4, 3, 2, 1])
customer_stats['frequency_score'] = pd.qcut(customer_stats['purchase_count'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
customer_stats['monetary_score'] = pd.qcut(customer_stats['total_spent'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
# Combine RFM scores
customer_stats['rfm_score'] = (
customer_stats['recency_score'].astype(str) +
customer_stats['frequency_score'].astype(str) +
customer_stats['monetary_score'].astype(str)
)
# Define RFM segments (the comparisons below are lexicographic on the 3-digit score string)
def get_rfm_segment(row):
if row['rfm_score'] >= '555':
return 'Champions'
elif row['rfm_score'] >= '444':
return 'Loyal Customers'
elif row['rfm_score'] >= '333':
return 'Potential Loyalists'
elif row['rfm_score'] >= '222':
return 'Need Attention'
else:
return 'At Risk'
customer_stats['rfm_segment'] = customer_stats.apply(get_rfm_segment, axis=1)
print("\nRFM Segments Distribution:")
print(customer_stats['rfm_segment'].value_counts())
print("\nAverage Value by Segment:")
print(customer_stats.groupby('rfm_segment')['total_spent'].mean().sort_values(ascending=False))
# 4. Regional Analysis
print("\n=== REGIONAL ANALYSIS ===")
regional_performance = sales_df.groupby('region').agg({
'total_price': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique',
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'customer_id': 'customers',
'quantity': 'units_sold'
})
regional_performance['avg_transaction_value'] = regional_performance['revenue'] / regional_performance['transactions']
regional_performance['revenue_per_customer'] = regional_performance['revenue'] / regional_performance['customers']
print("Regional Performance:")
print(regional_performance)
# 5. Time-based Analysis
print("\n=== TIME-BASED ANALYSIS ===")
# Daily patterns (the simulated dates carry no time component, so an hour column would be meaningless)
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
sales_df['is_weekend'] = sales_df['date'].dt.dayofweek >= 5
# Revenue by day of week
dow_revenue = sales_df.groupby('day_of_week').agg({
'total_price': 'sum',
'transaction_id': 'count'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions'
})
# Reorder days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_revenue = dow_revenue.reindex(day_order)
print("Revenue by Day of Week:")
print(dow_revenue)
# Weekend vs weekday comparison
weekend_stats = sales_df.groupby('is_weekend').agg({
'total_price': 'sum',
'transaction_id': 'count',
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'quantity': 'units_sold'
})
weekend_stats['avg_transaction_value'] = weekend_stats['revenue'] / weekend_stats['transactions']
weekend_stats['revenue_share'] = (weekend_stats['revenue'] / weekend_stats['revenue'].sum()) * 100
print("\nWeekend vs Weekday Performance:")
print(weekend_stats)
# 6. Statistical Analysis
print("\n=== STATISTICAL ANALYSIS ===")
# Correlation analysis (ID columns are excluded: they are labels, not metrics)
numeric_cols = sales_df.select_dtypes(include=[np.number]).columns.drop(['transaction_id', 'customer_id'])
correlation_matrix = sales_df[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix)
# Hypothesis testing: Is there a significant difference between regions?
# ANOVA test for revenue differences between regions
regions = sales_df['region'].unique()
region_groups = [sales_df[sales_df['region'] == region]['total_price'] for region in regions]
f_stat, p_value = stats.f_oneway(*region_groups)
print(f"\nANOVA Test for Regional Revenue Differences:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.6f}")
if p_value < 0.05:
print("Conclusion: Significant differences exist between regions (p < 0.05)")
else:
print("Conclusion: No significant differences between regions")
# 7. Export results
# Create summary report
summary_report = {
'Overall Metrics': {
'Total Revenue': f"€{sales_df['total_price'].sum():,.2f}",
'Total Transactions': f"{len(sales_df):,}",
'Unique Customers': f"{sales_df['customer_id'].nunique():,}",
'Average Transaction Value': f"€{sales_df['total_price'].mean():.2f}",
'Date Range': f"{sales_df['date'].min().date()} to {sales_df['date'].max().date()}"
},
'Top Products': product_performance.head(3).to_dict(),
'Category Performance': category_performance.to_dict(),
'Regional Performance': regional_performance.to_dict(),
'Monthly Trends': monthly_revenue.to_dict()
}
# Save to CSV
monthly_revenue.to_csv('monthly_sales_report.csv')
product_performance.to_csv('product_performance.csv')
customer_stats.to_csv('customer_analysis.csv')
print("\n=== ANALYSIS COMPLETE ===")
print("Reports saved to CSV files:")
print("- monthly_sales_report.csv")
print("- product_performance.csv")
print("- customer_analysis.csv")
# Key insights
print("\n=== KEY INSIGHTS ===")
print("1. Total Revenue: €{:,}".format(int(sales_df['total_price'].sum())))
print("2. Best Performing Category: {} (€{:.2f} revenue)".format(
category_performance.index[0],
category_performance.iloc[0]['revenue']
))
print("3. Top Product: {} (€{:.2f} revenue)".format(
product_performance.index[0],
product_performance.iloc[0]['revenue']
))
print("4. Best Region: {} (€{:.2f} revenue)".format(
regional_performance.index[0],
regional_performance.iloc[0]['revenue']
))
print("5. Customer Segmentation:")
for segment, count in customer_stats['rfm_segment'].value_counts().items():
print(f" - {segment}: {count} customers")
Conclusion and next steps
Data analysis in Python is an essential skill for any data professional. You have now learned:
- Pandas: for data manipulation, cleaning, and aggregation
- NumPy: for numerical computing and array operations
- Data cleaning: for preparing data for analysis
- Time series analysis: for trend analysis and forecasting
- Statistical analysis: for drawing meaningful conclusions
Next steps:
- Start your own data analysis project
- Learn machine learning for predictive analytics
- Integrate with databases and APIs
- Build dashboards to present your data
- Follow our advanced data science tutorials