Python Data Analysis: Complete Guide with Pandas & NumPy
Learn professional data analysis in Python. A complete tutorial from data cleaning to advanced analytics with Pandas and NumPy.
1. Why Python for data analysis?
What is data analysis?
Data analysis is the process of inspecting, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. Python is the most popular language for data analysis thanks to its simple syntax and powerful libraries.
Rapid prototyping
Python's simple syntax lets you run analyses and build prototypes quickly.
Extensive libraries
Pandas, NumPy, SciPy, and scikit-learn for every analysis need.
Visualization integration
Seamless integration with Matplotlib, Seaborn, and Plotly for visualization.
Active community
The largest data science community, with extensive documentation and support.
| Tool | Python + Pandas | Excel | SQL |
|---|---|---|---|
| Data size | GBs to TBs (with chunking) | Max ~1M rows | TB+ (server dependent) |
| Reproducibility | Excellent (scripts) | Limited (manual) | Good (queries) |
| Automation | Full automation | Limited (VBA) | Good (stored procedures) |
| Complex analyses | Machine learning, statistics | Basic formulas | Aggregations only |
| Learning curve | Medium | Easy | Medium |
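The "with chunking" qualifier refers to Pandas' chunked readers, which let you stream a file that does not fit in memory. A minimal sketch (the file name and the amount column are hypothetical):
import pandas as pd

# Stream a large CSV in 100,000-row chunks instead of loading it at once.
# 'large_file.csv' and the 'amount' column are placeholder examples.
total = 0.0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    total += chunk['amount'].sum()  # aggregate per chunk, combine afterwards
print(f"Total amount: {total:,.2f}")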
2. Setup and Python environment
Python 3.9+
Core programming language
Pandas
Data manipulation
NumPy
Numerical computing
Matplotlib
Data visualization
Seaborn
Statistical plots
SciPy
Scientific computing
Installation and setup
# Option 1: Anaconda (recommended for beginners)
# Download from: https://www.anaconda.com/products/distribution
# Ships with all major data science libraries preinstalled
# Option 2: pip installation
# Basic data science stack (statsmodels is needed for the time series section)
pip install pandas numpy matplotlib seaborn scipy scikit-learn statsmodels jupyter
# Option 3: requirements.txt
# Create a requirements.txt file:
pandas==2.0.3
numpy==1.24.3
matplotlib==3.7.1
seaborn==0.12.2
scipy==1.10.1
statsmodels==0.14.0
jupyter==1.0.0
scikit-learn==1.3.0
# Install all packages:
pip install -r requirements.txt
# Import statements for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Display settings for more readable output
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)  # Show at most 100 rows
pd.set_option('display.float_format', '{:.2f}'.format)  # 2 decimals for floats
sns.set_style("whitegrid")  # Seaborn styling
plt.rcParams['figure.figsize'] = (12, 6)  # Default figure size
# Check versions
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Jupyter Notebook/Lab for data analysis
Jupyter is the standard environment for data analysis in Python:
- Start Jupyter: jupyter notebook or jupyter lab
- Create a new notebook: Python 3 kernel
- Cell types: code cells and Markdown cells
- Magic commands: %matplotlib inline for inline plots in the notebook
- Keyboard shortcuts: Shift+Enter (run cell), Esc then A/B (insert cell above/below)
Tip: Use !pip install package in a Jupyter cell to install packages.
3. Pandas: Data manipulation basics
Pandas Data Structures
- Series: one-dimensional labeled array (like a single column in Excel)
- DataFrame: two-dimensional labeled data structure (like a spreadsheet)
- Index: unique labels for the rows (can be integers, strings, or datetimes)
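A minimal sketch of how the three structures relate (the names used here are illustrative):
import pandas as pd

# A Series is one-dimensional and carries its own labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # 20 — access by label

# A DataFrame is essentially a dict of Series sharing one Index
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
print(df_demo.index)  # RangeIndex(start=0, stop=3, step=1)
print(df_demo['x'])   # selecting one column returns a Series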
Creating DataFrames and basic operations
# 1. Creating DataFrames from different sources
# From a dictionary
data = {
'Naam': ['Jan', 'Piet', 'Klaas', 'Marie', 'Anna'],
'Leeftijd': [25, 30, 35, 28, 32],
'Stad': ['Amsterdam', 'Rotterdam', 'Den Haag', 'Utrecht', 'Amsterdam'],
'Salaris': [45000, 52000, 48000, 51000, 47000],
'Afdeling': ['Sales', 'IT', 'HR', 'Sales', 'Marketing']
}
df = pd.DataFrame(data)
print("DataFrame shape:", df.shape) # (5, 5)
print("DataFrame info:")
print(df.info())
print("\nFirst 3 rows:")
print(df.head(3))
# From a CSV file
# df = pd.read_csv('data.csv')
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# df = pd.read_json('data.json')
# df = pd.read_sql('SELECT * FROM table', connection)
# 2. Basic DataFrame operations
print("\n=== BASIC OPERATIONS ===")
print("Column names:", df.columns.tolist())
print("Index:", df.index.tolist())
print("Data types:")
print(df.dtypes)
print("\nDescriptive statistics:")
print(df.describe())
print("\nUnique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
# 3. Data selection and filtering
print("\n=== DATA SELECTION ===")
# Select single column
salaris_series = df['Salaris']
print("Salaris column:", salaris_series.tolist())
# Select multiple columns
subset = df[['Naam', 'Salaris', 'Afdeling']]
print("\nSubset with 3 columns:")
print(subset)
# Filter rows based on condition
sales_team = df[df['Afdeling'] == 'Sales']
print("\nSales team:")
print(sales_team)
# Complex filtering
high_earners = df[(df['Salaris'] > 50000) & (df['Leeftijd'] < 35)]
print("\nHigh earners under 35:")
print(high_earners)
# 4. Sorting
print("\n=== SORTING ===")
df_sorted = df.sort_values(by='Salaris', ascending=False)
print("Sorted by salary (high to low):")
print(df_sorted)
# 5. Adding new columns
df['Salaris_Categorie'] = pd.cut(df['Salaris'],
bins=[40000, 47000, 50000, 55000],
labels=['Laag', 'Medium', 'Hoog'])
df['Leeftijd_Volgend_Jaar'] = df['Leeftijd'] + 1
print("\nDataFrame with new columns:")
print(df)
Advanced DataFrame operations
# 1. GroupBy operations
print("=== GROUPBY OPERATIONS ===")
# Simple groupby
afdeling_stats = df.groupby('Afdeling').agg({
'Salaris': ['mean', 'sum', 'count', 'min', 'max'],
'Leeftijd': 'mean'
})
print("Department statistics:")
print(afdeling_stats)
# Multi-level groupby
stad_afdeling_stats = df.groupby(['Stad', 'Afdeling']).agg({
'Salaris': 'mean',
'Leeftijd': 'mean',
    'Naam': 'count'  # Number of employees
}).rename(columns={'Naam': 'Aantal_Werknemers'})
print("\nCity and department statistics:")
print(stad_afdeling_stats)
# 2. Pivot tables
print("\n=== PIVOT TABLES ===")
pivot_table = pd.pivot_table(df,
values='Salaris',
index='Stad',
columns='Afdeling',
aggfunc=['mean', 'count'],
fill_value=0)
print("Pivot table:")
print(pivot_table)
# 3. Merge/Join operations
print("\n=== MERGE/JOIN OPERATIONS ===")
# Create a second DataFrame
afdeling_data = pd.DataFrame({
'Afdeling': ['Sales', 'IT', 'HR', 'Marketing', 'Finance'],
'Manager': ['Peter', 'Lisa', 'Mike', 'Sarah', 'David'],
'Budget': [100000, 150000, 80000, 120000, 90000]
})
# Inner join (default)
merged_df = pd.merge(df, afdeling_data, on='Afdeling', how='inner')
print("Inner join result:")
print(merged_df)
# Left join
left_merged = pd.merge(df, afdeling_data, on='Afdeling', how='left')
print("\nLeft join result:")
print(left_merged)
# 4. Apply functions
print("\n=== APPLY FUNCTIONS ===")
# Apply on a single column
df['Salaris_Bonus'] = df['Salaris'].apply(lambda x: x * 1.1 if x > 50000 else x * 1.05)
# More complex row-wise apply function
def categorize_employee(row):
if row['Leeftijd'] < 30 and row['Salaris'] > 45000:
return 'High Potential'
elif row['Leeftijd'] >= 40:
return 'Experienced'
else:
return 'Regular'
df['Categorie'] = df.apply(categorize_employee, axis=1)
print("DataFrame with apply results:")
print(df[['Naam', 'Leeftijd', 'Salaris', 'Salaris_Bonus', 'Categorie']])
# 5. String operations
print("\n=== STRING OPERATIONS ===")
df['Naam_Upper'] = df['Naam'].str.upper()
df['Naam_Lower'] = df['Naam'].str.lower()
df['Naam_Length'] = df['Naam'].str.len()
# Substring check
df['Stad_Contains_A'] = df['Stad'].str.contains('a', case=False)
print("String operations:")
print(df[['Naam', 'Naam_Upper', 'Naam_Lower', 'Naam_Length', 'Stad', 'Stad_Contains_A']])
4. Data cleaning and preprocessing
Common data problems
- Missing values: NaN, None, empty strings
- Inconsistent formatting: date/time formats, capitalization
- Outliers: extreme values that distort analyses
- Duplicates: duplicate rows in the dataset
- Incorrect data types: numbers stored as strings, dates as text
- Inconsistent categories: "M", "Male", "Man" for the same value
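Before cleaning, a quick diagnostic pass shows which of these problems a dataset actually has. A minimal sketch (assuming an arbitrary DataFrame df):
import pandas as pd

def diagnose(df: pd.DataFrame) -> None:
    """Print a quick overview of common data quality issues."""
    print("Missing values per column:")
    print(df.isnull().sum())
    print(f"\nExact duplicate rows: {df.duplicated().sum()}")
    print("\nData types (watch for numbers stored as object):")
    print(df.dtypes)
    # Inconsistent categories show up as unexpectedly many variants
    for col in df.select_dtypes(include='object').columns:
        print(f"{col}: {df[col].dropna().unique()[:10]}")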
Complete data cleaning process
# 1. Sample dataset with problems
np.random.seed(42)
n_rows = 20
dirty_data = pd.DataFrame({
'ID': range(1, n_rows + 1),
'Naam': ['Jan', 'Piet', np.nan, 'Marie', 'Anna', 'Jan', 'peter', 'LISA', 'Tom', 'Sarah',
'Mike', 'Eva', 'David', 'laura', 'JOHN', 'Sara', 'Mark', 'Emma', 'Paul', 'Lisa'],
'Leeftijd': [25, 30, 35, '28', 32, 25, 40, 22, 45, 29,
33, 27, 38, 31, 26, 34, 150, 24, 36, 28], # 150 is outlier
'Salaris': [45000, 52000, None, 51000, 47000, 45000, 60000, 42000, 75000, 48000,
55000, 46000, 68000, 49000, 43000, 54000, 1000000, 41000, 72000, 53000], # 1M is outlier
'Datum_Aanmelding': ['2023-01-15', '15-02-2023', '2023/03/10', '01-04-2023', '2023-05-20',
'2023-01-15', '2023-06-15', '2023-07-01', '01-08-2023', '2023-09-10',
'10-10-2023', '2023-11-05', '2023/12/15', '2024-01-10', '15-01-2024',
'2024-02-20', '2024/03/05', '05-04-2024', '2024-05-15', '2024-06-01'],
'Geslacht': ['M', 'M', 'V', 'V', 'V', 'M', 'm', 'F', 'M', 'V',
'M', 'v', 'M', 'female', 'M', 'V', 'Male', 'V', 'M', 'f'],
'Afdeling': ['Sales', 'IT', None, 'Sales', 'Marketing', 'Sales', 'IT', 'HR', 'IT', 'Marketing',
'Finance', 'HR', 'IT', 'Marketing', 'Sales', 'HR', 'IT', 'Marketing', 'Finance', None]
})
print("=== ORIGINAL DIRTY DATA ===")
print(dirty_data)
print("\nData info:")
print(dirty_data.info())
print("\nMissing values:")
print(dirty_data.isnull().sum())
# 2. Data cleaning pipeline
def clean_dataframe(df):
"""Complete data cleaning pipeline"""
df_clean = df.copy()
# A. Remove exact duplicates
df_clean = df_clean.drop_duplicates()
print(f"Duplicates removed: {len(df) - len(df_clean)}")
    # B. Standardize text columns (.str methods propagate NaN instead of
    # turning it into the string 'nan', as astype(str) would)
    if 'Naam' in df_clean.columns:
        df_clean['Naam'] = df_clean['Naam'].str.title().str.strip()
# C. Fix gender column
gender_mapping = {
'M': 'Man', 'm': 'Man', 'Male': 'Man',
'V': 'Vrouw', 'v': 'Vrouw', 'F': 'Vrouw',
'f': 'Vrouw', 'female': 'Vrouw'
}
if 'Geslacht' in df_clean.columns:
df_clean['Geslacht'] = df_clean['Geslacht'].replace(gender_mapping)
# D. Convert data types
if 'Leeftijd' in df_clean.columns:
df_clean['Leeftijd'] = pd.to_numeric(df_clean['Leeftijd'], errors='coerce')
# E. Handle missing values
# Numeric columns: mean imputation
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
if df_clean[col].isnull().any():
mean_val = df_clean[col].mean()
df_clean[col] = df_clean[col].fillna(mean_val)
print(f"Missing values in {col} filled with mean: {mean_val:.2f}")
# Categorical columns: mode imputation
categorical_cols = df_clean.select_dtypes(include=['object']).columns
for col in categorical_cols:
if df_clean[col].isnull().any():
mode_val = df_clean[col].mode()[0]
df_clean[col] = df_clean[col].fillna(mode_val)
print(f"Missing values in {col} filled with mode: {mode_val}")
# F. Handle outliers
if 'Leeftijd' in df_clean.columns:
# Remove age > 100 (obvious outliers)
outlier_mask = df_clean['Leeftijd'] > 100
if outlier_mask.any():
print(f"Removing {outlier_mask.sum()} age outliers")
df_clean = df_clean[~outlier_mask]
if 'Salaris' in df_clean.columns:
# Cap salary outliers at 99th percentile
salary_cap = df_clean['Salaris'].quantile(0.99)
high_salary_mask = df_clean['Salaris'] > salary_cap
if high_salary_mask.any():
print(f"Capping {high_salary_mask.sum()} salary outliers at {salary_cap:.0f}")
df_clean.loc[high_salary_mask, 'Salaris'] = salary_cap
# G. Fix date formats
if 'Datum_Aanmelding' in df_clean.columns:
# Try multiple date formats
date_formats = ['%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y']
def parse_date(date_str):
for fmt in date_formats:
try:
return pd.to_datetime(date_str, format=fmt)
except ValueError:
continue
return pd.NaT
df_clean['Datum_Aanmelding'] = df_clean['Datum_Aanmelding'].apply(parse_date)
print("Dates standardized")
return df_clean
# 3. Apply cleaning pipeline
cleaned_data = clean_dataframe(dirty_data)
print("\n\n=== CLEANED DATA ===")
print(cleaned_data)
print("\nCleaned data info:")
print(cleaned_data.info())
print("\nMissing values after cleaning:")
print(cleaned_data.isnull().sum())
print("\nData statistics:")
print(cleaned_data.describe())
5. NumPy: Numerical computing
Why use NumPy?
- Speed: 10-100x faster than Python lists
- Memory efficiency: homogeneous data types
- Vectorized operations: operations on entire arrays at once
- Mathematical functions: linear algebra, statistics, random numbers
- Broadcasting: operations between arrays of different shapes
NumPy basics and advanced operations
import numpy as np
# 1. Creating NumPy arrays
print("=== CREATING NUMPY ARRAYS ===")
# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)
print("Shape:", arr1.shape)
print("Data type:", arr1.dtype)
# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Array:")
print(arr2d)
print("Shape:", arr2d.shape)
# Special arrays
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
ones = np.ones((2, 3)) # 2x3 matrix of ones
identity = np.eye(4) # 4x4 identity matrix
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # [0., 0.25, 0.5, 0.75, 1.]
print("\nSpecial arrays:")
print("Zeros matrix:\n", zeros)
print("Linspace:", linspace)
# Random arrays
np.random.seed(42) # For reproducibility
random_arr = np.random.rand(3, 3) # Uniform distribution [0, 1)
randn_arr = np.random.randn(100) # Standard normal distribution
randint_arr = np.random.randint(1, 100, 10) # 10 random integers 1-99
print("\nRandom arrays:")
print("Random 3x3:\n", random_arr)
print("Random integers:", randint_arr)
# 2. Array operations
print("\n=== ARRAY OPERATIONS ===")
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise operations
print("a + b =", a + b) # [11, 22, 33, 44, 55]
print("a * 2 =", a * 2) # [2, 4, 6, 8, 10]
print("a ** 2 =", a ** 2) # [1, 4, 9, 16, 25]
print("np.sin(a) =", np.sin(a))
# Aggregation functions
print("\nAggregations:")
print("Sum:", np.sum(a))
print("Mean:", np.mean(a))
print("Std:", np.std(a))
print("Min:", np.min(a))
print("Max:", np.max(a))
print("Median:", np.median(a))
# 3. Matrix operations
print("\n=== MATRIX OPERATIONS ===")
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print("Matrix A:\n", matrix_a)
print("Matrix B:\n", matrix_b)
print("A + B:\n", matrix_a + matrix_b) # Element-wise addition
print("A * B:\n", matrix_a * matrix_b) # Element-wise multiplication
print("A @ B:\n", matrix_a @ matrix_b) # Matrix multiplication
print("A.T:\n", matrix_a.T) # Transpose
print("det(A):", np.linalg.det(matrix_a)) # Determinant
print("inv(A):\n", np.linalg.inv(matrix_a)) # Inverse
# 4. Broadcasting
print("\n=== BROADCASTING ===")
# Scalar broadcasting
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array:\n", arr)
print("Array + 10:\n", arr + 10)
# Row/column broadcasting
row_vector = np.array([1, 2, 3])
col_vector = np.array([[1], [2], [3]])
print("\nRow vector:", row_vector)
print("Col vector:\n", col_vector)
print("Array + row vector:\n", arr + row_vector)
print("Array + col vector:\n", arr + col_vector)
# 5. Advanced operations
print("\n=== ADVANCED OPERATIONS ===")
# Reshaping
arr = np.arange(12)
print("Original (12,):", arr)
print("Reshaped to (3, 4):\n", arr.reshape(3, 4))
# Flattening
matrix = np.array([[1, 2], [3, 4]])
print("\nMatrix:\n", matrix)
print("Flattened:", matrix.flatten())
print("Ravel:", matrix.ravel())
# Sorting
random_data = np.random.randn(10)
print("\nRandom data:", random_data)
print("Sorted:", np.sort(random_data))
print("Argsort (indices):", np.argsort(random_data))
# Filtering with boolean indexing
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mask = data > 5
print("\nData:", data)
print("Mask (data > 5):", mask)
print("Filtered data:", data[mask])
# Vectorized operations (much faster than loops)
large_array = np.random.randn(1000000)
# SLOW: Python loop
import time
start = time.time()
result_slow = [x * 2 for x in large_array]
loop_time = time.time() - start
print(f"\nPython loop time: {loop_time:.4f} seconds")
# FAST: NumPy vectorized
start = time.time()
result_fast = large_array * 2
vectorized_time = time.time() - start
print(f"NumPy vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time / vectorized_time:.0f}x faster") # typically on the order of 100x
6. Data aggregation and grouping
Aggregation functions in Pandas
- Count: count() - number of non-null values
- Sum: sum() - sum of values
- Mean: mean() - average value
- Median: median() - median value
- Std: std() - standard deviation
- Min/Max: min(), max() - minimum/maximum
- Quantile: quantile() - percentiles
- Unique: nunique() - number of unique values
- Mode: mode() - most frequent value
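A minimal sketch applying these aggregations, assuming the employee DataFrame df from section 3 is still in scope:
# Single aggregations on one column
print(df['Salaris'].mean())         # average salary
print(df['Salaris'].median())       # median salary
print(df['Salaris'].quantile(0.9))  # 90th percentile

# Several aggregations at once, per group
print(df.groupby('Afdeling')['Salaris'].agg(['count', 'sum', 'mean', 'std', 'min', 'max']))

# Named aggregations across multiple columns
print(df.groupby('Stad').agg(
    employees=('Naam', 'count'),
    avg_salary=('Salaris', 'mean'),
    departments=('Afdeling', 'nunique'),
))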
7. Time series analysis
Complete time series analysis
# 1. Create time series data
np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
n_days = len(date_range)
# Simulate sales data with trend, seasonality, and noise
trend = np.linspace(100, 200, n_days) # Linear trend
seasonality = 50 * np.sin(2 * np.pi * np.arange(n_days) / 365) # Yearly seasonal pattern
noise = np.random.normal(0, 10, n_days) # Random noise
sales = trend + seasonality + noise
# Create the DataFrame
sales_df = pd.DataFrame({
'Date': date_range,
'Sales': sales,
'DayOfWeek': date_range.day_name(),
'Month': date_range.month_name(),
'Quarter': date_range.quarter
})
print("=== TIME SERIES DATA ===")
print(sales_df.head())
print("\nData info:")
print(sales_df.info())
# 2. Time series visualization
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Raw time series
axes[0, 0].plot(sales_df['Date'], sales_df['Sales'], linewidth=1)
axes[0, 0].set_title('Daily Sales 2023')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')
axes[0, 0].grid(True, alpha=0.3)
# Monthly aggregation
monthly_sales = sales_df.resample('M', on='Date')['Sales'].mean()
axes[0, 1].bar(monthly_sales.index.strftime('%b'), monthly_sales.values)
axes[0, 1].set_title('Average Monthly Sales')
axes[0, 1].set_xlabel('Month')
axes[0, 1].set_ylabel('Average Sales')
# Day of week analysis
dow_sales = sales_df.groupby('DayOfWeek')['Sales'].mean()
dow_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_sales = dow_sales.reindex(dow_order)
axes[1, 0].plot(dow_sales.index, dow_sales.values, marker='o')
axes[1, 0].set_title('Average Sales by Day of Week')
axes[1, 0].set_xlabel('Day of Week')
axes[1, 0].set_ylabel('Average Sales')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)
# Rolling average (smoothing)
sales_df['Sales_7D_Avg'] = sales_df['Sales'].rolling(window=7, center=True).mean()
axes[1, 1].plot(sales_df['Date'], sales_df['Sales'], alpha=0.5, label='Daily')
axes[1, 1].plot(sales_df['Date'], sales_df['Sales_7D_Avg'], linewidth=2, label='7-Day Avg')
axes[1, 1].set_title('Daily Sales with 7-Day Moving Average')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Sales')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('time_series_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# 3. Time series decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
# Make a copy with Date as the index
sales_ts = sales_df.set_index('Date')['Sales']
# Decompose the series (additive model). seasonal_decompose needs at least two
# full cycles, so one year of daily data cannot use period=365; period=30
# extracts a monthly-scale pattern instead.
decomposition = seasonal_decompose(sales_ts, model='additive', period=30)
# Plot decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 10))
axes[0].plot(decomposition.observed)
axes[0].set_title('Observed')
axes[0].grid(True, alpha=0.3)
axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].grid(True, alpha=0.3)
axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonality')
axes[2].grid(True, alpha=0.3)
axes[3].plot(decomposition.resid)
axes[3].set_title('Residuals')
axes[3].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('time_series_decomposition.png', dpi=300, bbox_inches='tight')
plt.show()
# 4. Time series features engineering
print("\n=== TIME SERIES FEATURES ===")
# Lag features (previous values)
for lag in [1, 7, 30]:
sales_df[f'Sales_Lag_{lag}'] = sales_df['Sales'].shift(lag)
# Rolling statistics
sales_df['Sales_Rolling_Mean_7'] = sales_df['Sales'].rolling(window=7).mean()
sales_df['Sales_Rolling_Std_7'] = sales_df['Sales'].rolling(window=7).std()
sales_df['Sales_Rolling_Min_7'] = sales_df['Sales'].rolling(window=7).min()
sales_df['Sales_Rolling_Max_7'] = sales_df['Sales'].rolling(window=7).max()
# Difference features
sales_df['Sales_Diff_1'] = sales_df['Sales'].diff(1)
sales_df['Sales_Pct_Change'] = sales_df['Sales'].pct_change() * 100
# Date-based features
sales_df['DayOfMonth'] = sales_df['Date'].dt.day
sales_df['WeekOfYear'] = sales_df['Date'].dt.isocalendar().week
sales_df['Is_Weekend'] = sales_df['Date'].dt.dayofweek >= 5
sales_df['Is_Month_Start'] = sales_df['Date'].dt.is_month_start
sales_df['Is_Month_End'] = sales_df['Date'].dt.is_month_end
print("Features added:")
print(sales_df.columns.tolist())
print("\nSample of engineered features:")
print(sales_df[['Date', 'Sales', 'Sales_Lag_1', 'Sales_Rolling_Mean_7',
'Sales_Diff_1', 'Sales_Pct_Change', 'Is_Weekend']].head(10))
# 5. Time series forecasting (simple example)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Prepare data for forecasting
# Use last 30 days to predict next 7 days
train_size = 30
forecast_horizon = 7
# Create lag features for training
for lag in range(1, train_size + 1):
sales_df[f'Lag_{lag}'] = sales_df['Sales'].shift(lag)
# Split data (last 'forecast_horizon' days for testing)
train_data = sales_df.iloc[train_size:-forecast_horizon].dropna()
test_data = sales_df.iloc[-forecast_horizon:]
# Prepare features (X) and target (y)
feature_cols = [f'Lag_{i}' for i in range(1, train_size + 1)]
X_train = train_data[feature_cols]
y_train = train_data['Sales']
# Train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
X_test = test_data[feature_cols]
predictions = model.predict(X_test)
# Evaluate predictions
actual = test_data['Sales'].values
mae = mean_absolute_error(actual, predictions)
rmse = np.sqrt(mean_squared_error(actual, predictions))
print(f"\n=== FORECASTING RESULTS ===")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print("\nActual vs Predicted:")
for i, (act, pred) in enumerate(zip(actual, predictions), 1):
print(f"Day {i}: Actual={act:.1f}, Predicted={pred:.1f}, Error={abs(act-pred):.1f}")
8. Worked example: Sales data analysis
Complete sales data analysis pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# 1. Simulate realistic sales data
np.random.seed(42)
n_transactions = 10000
# Product catalog
products = {
'Laptop': {'price': 999, 'category': 'Electronics'},
'Smartphone': {'price': 699, 'category': 'Electronics'},
'Tablet': {'price': 399, 'category': 'Electronics'},
'T-Shirt': {'price': 29.99, 'category': 'Clothing'},
'Jeans': {'price': 79.99, 'category': 'Clothing'},
'Jacket': {'price': 129.99, 'category': 'Clothing'},
'Coffee Maker': {'price': 89.99, 'category': 'Home'},
'Blender': {'price': 49.99, 'category': 'Home'},
'Book': {'price': 19.99, 'category': 'Books'},
'Headphones': {'price': 149.99, 'category': 'Electronics'}
}
# Generate sales data
data = []
transaction_id = 1000
for _ in range(n_transactions):
transaction_id += 1
# Random product selection (with probabilities)
product_name = np.random.choice(
list(products.keys()),
p=[0.15, 0.20, 0.10, 0.08, 0.07, 0.05, 0.06, 0.04, 0.15, 0.10]
)
product_info = products[product_name]
# Random date in 2023
date = pd.to_datetime('2023-01-01') + pd.Timedelta(days=np.random.randint(0, 365))
# Random customer
customer_id = np.random.randint(1000, 2000)
# Random quantity (mostly 1, sometimes more)
quantity = np.random.choice([1, 2, 3], p=[0.8, 0.15, 0.05])
# Random region
region = np.random.choice(['North', 'South', 'East', 'West'],
p=[0.3, 0.25, 0.25, 0.2])
# Random discount (10% chance of 10% discount)
discount = 0.1 if np.random.random() < 0.1 else 0
# Calculate revenue
unit_price = product_info['price']
total_price = unit_price * quantity * (1 - discount)
data.append({
'transaction_id': transaction_id,
'date': date,
'customer_id': customer_id,
'product': product_name,
'category': product_info['category'],
'unit_price': unit_price,
'quantity': quantity,
'discount': discount,
'total_price': total_price,
'region': region
})
# Create DataFrame
sales_df = pd.DataFrame(data)
print("=== SALES DATA OVERVIEW ===")
print(f"Total transactions: {len(sales_df):,}")
print(f"Date range: {sales_df['date'].min().date()} to {sales_df['date'].max().date()}")
print(f"Unique customers: {sales_df['customer_id'].nunique():,}")
print(f"Unique products: {sales_df['product'].nunique()}")
print(f"Total revenue: €{sales_df['total_price'].sum():,.2f}")
# 2. Sales Performance Analysis
print("\n=== SALES PERFORMANCE ANALYSIS ===")
# Monthly revenue
sales_df['month'] = sales_df['date'].dt.to_period('M')
monthly_revenue = sales_df.groupby('month').agg({
'total_price': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'customer_id': 'customers'
})
monthly_revenue['avg_transaction_value'] = monthly_revenue['revenue'] / monthly_revenue['transactions']
monthly_revenue['revenue_growth'] = monthly_revenue['revenue'].pct_change() * 100
print("Monthly Performance:")
print(monthly_revenue)
# Product performance
product_performance = sales_df.groupby('product').agg({
'total_price': ['sum', 'mean'],
'quantity': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique'
})
product_performance.columns = ['revenue', 'avg_order_value', 'units_sold', 'transactions', 'customers']
product_performance = product_performance.sort_values('revenue', ascending=False)
print("\nTop 5 Products by Revenue:")
print(product_performance.head())
# Category performance
category_performance = sales_df.groupby('category').agg({
'total_price': 'sum',
'quantity': 'sum',
'transaction_id': 'count'
}).rename(columns={
'total_price': 'revenue',
'quantity': 'units_sold',
'transaction_id': 'transactions'
}).sort_values('revenue', ascending=False)
category_performance['revenue_share'] = (category_performance['revenue'] / category_performance['revenue'].sum()) * 100
category_performance['avg_price_per_unit'] = category_performance['revenue'] / category_performance['units_sold']
print("\nCategory Performance:")
print(category_performance)
# 3. Customer Analysis
print("\n=== CUSTOMER ANALYSIS ===")
# Customer segmentation by purchase frequency and value
customer_stats = sales_df.groupby('customer_id').agg({
'total_price': 'sum',
'transaction_id': 'count',
'date': ['min', 'max']
})
customer_stats.columns = ['total_spent', 'purchase_count', 'first_purchase', 'last_purchase']
customer_stats['avg_order_value'] = customer_stats['total_spent'] / customer_stats['purchase_count']
# Recency calculation (days since last purchase)
latest_date = sales_df['date'].max()
customer_stats['recency_days'] = (latest_date - customer_stats['last_purchase']).dt.days
print("Customer Statistics:")
print(customer_stats.describe())
# RFM Segmentation (Recency, Frequency, Monetary)
# Create RFM scores (1-5, where 5 is best)
# rank(method='first') breaks ties so qcut always finds 5 unique bin edges
customer_stats['recency_score'] = pd.qcut(customer_stats['recency_days'].rank(method='first'), 5, labels=[5, 4, 3, 2, 1])
customer_stats['frequency_score'] = pd.qcut(customer_stats['purchase_count'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
customer_stats['monetary_score'] = pd.qcut(customer_stats['total_spent'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
# Combine RFM scores
customer_stats['rfm_score'] = (
customer_stats['recency_score'].astype(str) +
customer_stats['frequency_score'].astype(str) +
customer_stats['monetary_score'].astype(str)
)
# Define RFM segments (the comparisons below are lexicographic on the 3-digit score string)
def get_rfm_segment(row):
if row['rfm_score'] >= '555':
return 'Champions'
elif row['rfm_score'] >= '444':
return 'Loyal Customers'
elif row['rfm_score'] >= '333':
return 'Potential Loyalists'
elif row['rfm_score'] >= '222':
return 'Need Attention'
else:
return 'At Risk'
customer_stats['rfm_segment'] = customer_stats.apply(get_rfm_segment, axis=1)
print("\nRFM Segments Distribution:")
print(customer_stats['rfm_segment'].value_counts())
print("\nAverage Value by Segment:")
print(customer_stats.groupby('rfm_segment')['total_spent'].mean().sort_values(ascending=False))
# 4. Regional Analysis
print("\n=== REGIONAL ANALYSIS ===")
regional_performance = sales_df.groupby('region').agg({
'total_price': 'sum',
'transaction_id': 'count',
'customer_id': 'nunique',
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'customer_id': 'customers',
'quantity': 'units_sold'
})
regional_performance['avg_transaction_value'] = regional_performance['revenue'] / regional_performance['transactions']
regional_performance['revenue_per_customer'] = regional_performance['revenue'] / regional_performance['customers']
print("Regional Performance:")
print(regional_performance)
# 5. Time-based Analysis
print("\n=== TIME-BASED ANALYSIS ===")
# Daily patterns (the simulated dates carry no time component, so an hour column would be meaningless)
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
sales_df['is_weekend'] = sales_df['date'].dt.dayofweek >= 5
# Revenue by day of week
dow_revenue = sales_df.groupby('day_of_week').agg({
'total_price': 'sum',
'transaction_id': 'count'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions'
})
# Reorder days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_revenue = dow_revenue.reindex(day_order)
print("Revenue by Day of Week:")
print(dow_revenue)
# Weekend vs weekday comparison
weekend_stats = sales_df.groupby('is_weekend').agg({
'total_price': 'sum',
'transaction_id': 'count',
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'transaction_id': 'transactions',
'quantity': 'units_sold'
})
weekend_stats['avg_transaction_value'] = weekend_stats['revenue'] / weekend_stats['transactions']
weekend_stats['revenue_share'] = (weekend_stats['revenue'] / weekend_stats['revenue'].sum()) * 100
print("\nWeekend vs Weekday Performance:")
print(weekend_stats)
# 6. Statistical Analysis
print("\n=== STATISTICAL ANALYSIS ===")
# Correlation analysis (ID columns are excluded: they are labels, not metrics)
numeric_cols = sales_df.select_dtypes(include=[np.number]).columns.drop(['transaction_id', 'customer_id'])
correlation_matrix = sales_df[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix)
# Hypothesis testing: Is there a significant difference between regions?
# ANOVA test for revenue differences between regions
regions = sales_df['region'].unique()
region_groups = [sales_df[sales_df['region'] == region]['total_price'] for region in regions]
f_stat, p_value = stats.f_oneway(*region_groups)
print(f"\nANOVA Test for Regional Revenue Differences:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.6f}")
if p_value < 0.05:
print("Conclusion: Significant differences exist between regions (p < 0.05)")
else:
print("Conclusion: No significant differences between regions")
# 7. Export results
# Create summary report
summary_report = {
'Overall Metrics': {
'Total Revenue': f"€{sales_df['total_price'].sum():,.2f}",
'Total Transactions': f"{len(sales_df):,}",
'Unique Customers': f"{sales_df['customer_id'].nunique():,}",
'Average Transaction Value': f"€{sales_df['total_price'].mean():.2f}",
'Date Range': f"{sales_df['date'].min().date()} to {sales_df['date'].max().date()}"
},
'Top Products': product_performance.head(3).to_dict(),
'Category Performance': category_performance.to_dict(),
'Regional Performance': regional_performance.to_dict(),
'Monthly Trends': monthly_revenue.to_dict()
}
# Save to CSV
monthly_revenue.to_csv('monthly_sales_report.csv')
product_performance.to_csv('product_performance.csv')
customer_stats.to_csv('customer_analysis.csv')
print("\n=== ANALYSIS COMPLETE ===")
print("Reports saved to CSV files:")
print("- monthly_sales_report.csv")
print("- product_performance.csv")
print("- customer_analysis.csv")
# Key insights
print("\n=== KEY INSIGHTS ===")
print("1. Total Revenue: €{:,}".format(int(sales_df['total_price'].sum())))
print("2. Best Performing Category: {} (€{:.2f} revenue)".format(
category_performance.index[0],
category_performance.iloc[0]['revenue']
))
print("3. Top Product: {} (€{:.2f} revenue)".format(
product_performance.index[0],
product_performance.iloc[0]['revenue']
))
print("4. Best Region: {} (€{:.2f} revenue)".format(
regional_performance.index[0],
regional_performance.iloc[0]['revenue']
))
print("5. Customer Segmentation:")
for segment, count in customer_stats['rfm_segment'].value_counts().items():
print(f" - {segment}: {count} customers")
Conclusion and next steps
Data analysis in Python is an essential skill for any data professional. You have now learned:
- Pandas: for data manipulation, cleaning, and aggregation
- NumPy: for numerical computing and array operations
- Data cleaning: for preparing data for analysis
- Time series analysis: for trend analysis and forecasting
- Statistical analysis: for drawing meaningful conclusions
Next steps:
- Start your own data analysis project
- Learn machine learning for predictive analytics
- Integrate with databases and APIs
- Build dashboards to present your data
- Follow our advanced data science tutorials