DataPartner365

Your partner for data-driven growth and insights

Machine Learning in Python: A Complete Guide with Scikit-learn

Last updated: 20 December 2025
Reading time: 30 minutes
Machine Learning, AI, Python, Scikit-learn, Deep Learning, Data Science

Learn professional machine learning in Python: a complete tutorial from basic concepts to advanced deep learning with Scikit-learn and TensorFlow.


1. What is Machine Learning?

Machine Learning Definition

Machine Learning is a subfield of artificial intelligence (AI) in which computer systems learn from data to perform tasks without being explicitly programmed. Instead of writing rules by hand, models learn patterns from examples.

Predictive Analytics

Predict future events based on historical data.

Pattern Recognition

Recognize complex patterns in images, text, and audio.

Personalization

Deliver personalized recommendations and experiences.

Automation

Automate complex decision-making processes.

Traditional programming | Machine Learning | Advantages
Input + rules → output | Input + output → rules | Learns from data
Rules are defined manually | Patterns are learned automatically | Scales to complex problems
Static logic | Adaptive models | Keeps improving with new data
Requires human expertise | Data-driven decisions | Can reduce human bias
Good for structured problems | Good for complex, unstructured problems | Broadly applicable
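
To make the "input + rules → output" versus "input + output → rules" contrast concrete, here is a minimal sketch; the toy data, feature names, and hand-picked threshold are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Toy data: [usage_hours, support_tickets] -> churned (1) or stayed (0)
X = [[2, 5], [1, 4], [10, 0], [12, 1], [3, 6], [11, 0]]
y = [1, 1, 0, 0, 1, 0]

# Traditional programming: we write the rule ourselves
def rule_based(usage_hours, support_tickets):
    return 1 if support_tickets >= 3 else 0  # hand-picked threshold

# Machine learning: the model derives its own rule from input/output examples
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(rule_based(4, 5))         # output of the rule we wrote: 1
print(model.predict([[4, 5]]))  # output of the learned rule: [1]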

Real-world applications of Machine Learning

E-commerce

  • Product recommendations
  • Price optimization
  • Fraud detection
  • Customer segmentation

Healthcare

  • Disease diagnosis
  • Drug discovery
  • Medical image analysis
  • Patient risk prediction

Automotive

  • Self-driving cars
  • Predictive maintenance
  • Traffic prediction
  • Driver behavior analysis

Finance

  • Credit scoring
  • Algorithmic trading
  • Risk assessment
  • Customer service chatbots


2. Machine learning types and applications

Supervised

Labeled data

Unsupervised

Unlabeled data

Reinforcement

Reward-based learning

Deep Learning

Neural networks

Semi-supervised

Mixed data

Transfer Learning

Pre-trained models

Supervised Learning

How it works: the model learns from labeled examples (input → output pairs); a runnable sketch follows the algorithm list below.

  • Classification: predict discrete labels (spam / not spam)
  • Regression: predict continuous values (house price)
# Examples of supervised learning algorithms:
# 1. Linear Regression - for regression problems
# 2. Logistic Regression - for binary classification
# 3. Decision Trees - for classification and regression
# 4. Random Forest - ensemble of decision trees
# 5. SVM (Support Vector Machines) - for classification
# 6. Neural Networks - for complex patterns
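
A minimal, runnable sketch of the supervised workflow; the built-in breast cancer dataset, the 80/20 split, and the choice of Random Forest are example assumptions, not requirements.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Labeled data: feature matrix X with known labels y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn from labeled examples, then predict labels for unseen data
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")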

Unsupervised Learning

How it works: the model searches for patterns in unlabeled data; a short sketch follows the algorithm list below.

  • Clustering: groups similar data points
  • Dimensionality Reduction: reduces the number of features
  • Anomaly Detection: finds deviating observations
# Examples of unsupervised learning algorithms:
# 1. K-Means Clustering - for data segmentation
# 2. Hierarchical Clustering - for nested clusters
# 3. PCA (Principal Component Analysis) - for dimensionality reduction
# 4. t-SNE - for data visualization
# 5. Autoencoders - neural networks for compression
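
A short sketch of clustering plus dimensionality reduction on unlabeled data; the iris features and the choice of k=3 clusters are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: only the features are used, the targets are ignored
X, _ = load_iris(return_X_y=True)

# Clustering: group similar observations into 3 clusters (k chosen for illustration)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Dimensionality reduction: compress 4 features into 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

print(f"Cluster sizes: {[(labels == k).sum() for k in range(3)]}")
print(f"Reduced shape: {X_2d.shape}")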

Deep Learning

How it works: uses neural networks with multiple layers to capture complex patterns; a minimal Keras sketch follows the framework list below.

  • CNN: Convolutional Neural Networks - for images
  • RNN: Recurrent Neural Networks - for sequences
  • GAN: Generative Adversarial Networks - for content generation
  • Transformer: for NLP (BERT, GPT)
# Deep learning frameworks:
# 1. TensorFlow - Google's DL framework
# 2. PyTorch - Meta's (formerly Facebook's) DL framework
# 3. Keras - high-level API for TensorFlow
# 4. FastAI - for rapid prototyping
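
For orientation, a minimal Keras model definition; the layer sizes, the 20-feature input, and the binary-classification setup are arbitrary placeholders (section 7 builds and trains a full network).

import tensorflow as tf
from tensorflow.keras import layers

# A tiny feed-forward network: two hidden layers, sigmoid output for binary classification
model = tf.keras.Sequential([
    layers.Input(shape=(20,)),   # 20 input features (placeholder)
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
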
ML type | When to use it? | Algorithms | Examples
Supervised | You have labeled data and want to make predictions | Linear/Logistic Regression, Random Forest, SVM | Spam detection, price prediction
Unsupervised | You have unlabeled data and want to discover patterns | K-Means, PCA, DBSCAN | Customer segmentation, anomaly detection
Semi-supervised | You have little labeled and much unlabeled data | Label Propagation, Self-training | Medical imaging with few labels
Reinforcement | You want an agent to learn by trial and error | Q-Learning, Deep Q-Networks | Game playing, robotics
Deep Learning | You have lots of data and complex patterns | CNN, RNN, Transformer | Image recognition, NLP, speech recognition

3. Python ML stack setup

Complete Machine Learning Stack

Scikit-learn

Core ML algorithms and tools

TensorFlow/PyTorch

Deep learning frameworks

XGBoost/LightGBM

Gradient boosting for tabular data

Pandas/NumPy

Data manipulation and numerical computing

Matplotlib/Seaborn

Data visualization

SciPy/Statsmodels

Statistical analysis

# Complete ML environment setup

# Option 1: Miniconda/Anaconda (recommended)
# Download from: https://docs.conda.io/en/latest/miniconda.html
# Create a new environment:
conda create -n ml_env python=3.9
conda activate ml_env

# Option 2: Virtual environment with pip
python -m venv ml_env
# Windows:
ml_env\Scripts\activate
# Mac/Linux:
source ml_env/bin/activate

# Install core ML packages
pip install numpy pandas matplotlib seaborn scipy scikit-learn jupyter

# Advanced ML packages
pip install xgboost lightgbm catboost

# Deep learning packages
pip install tensorflow keras torch torchvision torchaudio

# Extra utilities
pip install joblib imbalanced-learn yellowbrick eli5 shap

# For deployment
pip install flask fastapi streamlit mlflow

# Or use a requirements.txt file:
# requirements.txt contents:
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tensorflow==2.13.0
xgboost==2.0.0
matplotlib==3.7.1
seaborn==0.12.2
jupyter==1.0.0
flask==2.3.2

# Install all packages:
pip install -r requirements.txt

# Import statements for ML projects
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# ML algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# Deep learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models

# Configure plotting styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

print("✅ ML environment ready!")
print(f"TensorFlow versie: {tf.__version__}")
print(f"Scikit-learn versie: {sklearn.__version__}")

4. Data preparation for ML

Key data issues for ML

  • Missing values: can break models or degrade results
  • Categorical data: most ML algorithms require numeric input
  • Feature scaling: some algorithms require scaled features
  • Imbalanced data: can lead to biased models
  • Outliers: can disrupt model training
  • Multicollinearity: highly correlated features can cause problems

Complete data preprocessing pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Sample dataset for ML
np.random.seed(42)
n_samples = 1000

# Create a dataset with mixed data types and common issues
data = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'gender': np.random.choice(['Male', 'Female', None], n_samples, p=[0.45, 0.45, 0.10]),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD', None], n_samples),
    'employment_status': np.random.choice(['Employed', 'Unemployed', 'Self-Employed'], n_samples),
    'loan_amount': np.random.exponential(50000, n_samples),
    'loan_term': np.random.choice([12, 24, 36, 48, 60], n_samples),
    'default': np.random.choice([0, 1], n_samples, p=[0.85, 0.15])  # Target variable
})

# Introduce missing values and outliers
data.loc[np.random.choice(n_samples, 50, replace=False), 'income'] = np.nan
data.loc[np.random.choice(n_samples, 30, replace=False), 'credit_score'] = np.nan
data.loc[np.random.choice(n_samples, 5, replace=False), 'income'] = 1000000  # Outliers

print("=== DATASET OVERVIEW ===")
print(f"Shape: {data.shape}")
print("\nData types:")
print(data.dtypes)
print("\nMissing values:")
print(data.isnull().sum())
print("\nFirst 5 rows:")
print(data.head())

# 2. Data exploration en analysis
print("\n=== DATA EXPLORATION ===")

# Basic statistics
print("Numeric columns statistics:")
print(data.select_dtypes(include=[np.number]).describe())

# Categorical columns analysis
categorical_cols = data.select_dtypes(include=['object']).columns
print("\nCategorical columns distribution:")
for col in categorical_cols:
    print(f"\n{col}:")
    print(data[col].value_counts(dropna=False))

# Target variable distribution
print("\nTarget variable distribution:")
print(data['default'].value_counts())
print(f"Imbalance ratio: {data['default'].value_counts()[0] / data['default'].value_counts()[1]:.2f}:1")

# 3. Data cleaning pipeline
print("\n=== DATA CLEANING ===")

# Split features and target
X = data.drop('default', axis=1)
y = data['default']

# Split data before preprocessing (to avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

print(f"\nNumeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")

# 4. Create preprocessing pipelines
print("\n=== PREPROCESSING PIPELINES ===")

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing values
    ('scaler', StandardScaler())  # Standardize features
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing with mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # Convert to numeric
])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 5. Apply preprocessing
print("Applying preprocessing...")

# Fit on training data only, then transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Get feature names after one-hot encoding
try:
    # Extract feature names from the preprocessor
    cat_encoder = preprocessor.named_transformers_['cat'].named_steps['onehot']
    cat_features = cat_encoder.get_feature_names_out(categorical_features)
    
    # Combine all feature names
    all_features = np.concatenate([numeric_features, cat_features])
    print(f"\nTotal features after preprocessing: {len(all_features)}")
    print("First 10 features:", all_features[:10])
    
except Exception as e:
    print(f"Could not extract feature names: {e}")

print(f"\nX_train shape after preprocessing: {X_train_processed.shape}")
print(f"X_test shape after preprocessing: {X_test_processed.shape}")

# 6. Handle class imbalance (for classification problems)
print("\n=== HANDLE CLASS IMBALANCE ===")

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Option 1: SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

print(f"Before SMOTE - Class distribution: {np.bincount(y_train)}")
print(f"After SMOTE - Class distribution: {np.bincount(y_train_resampled)}")

# Option 2: Combined sampling
over = SMOTE(sampling_strategy=0.5, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)

# Create pipeline with resampling
resampling_pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('over', over),
    ('under', under)
])

# 7. Feature engineering (adding new features)
print("\n=== FEATURE ENGINEERING ===")

def engineer_features(df):
    """Add engineered features to the dataset"""
    df_engineered = df.copy()
    
    # Example engineered features for loan dataset
    if 'income' in df.columns and 'loan_amount' in df.columns:
        # Debt-to-income ratio
        df_engineered['debt_to_income'] = df_engineered['loan_amount'] / df_engineered['income']
        
        # Monthly payment (simplified)
        if 'loan_term' in df.columns:
            df_engineered['monthly_payment'] = df_engineered['loan_amount'] / df_engineered['loan_term']
    
    # Age groups
    if 'age' in df.columns:
        df_engineered['age_group'] = pd.cut(
            df_engineered['age'], 
            bins=[0, 25, 35, 45, 55, 100],
            labels=['18-25', '26-35', '36-45', '46-55', '56+']
        )
    
    # Credit score categories
    if 'credit_score' in df.columns:
        df_engineered['credit_category'] = pd.cut(
            df_engineered['credit_score'],
            bins=[300, 580, 670, 740, 800, 850],
            labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']
        )
    
    return df_engineered

# Apply feature engineering
X_train_engineered = engineer_features(X_train)
X_test_engineered = engineer_features(X_test)

print(f"Original features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_engineered.shape[1]}")
print("\nNew engineered features:")
new_features = set(X_train_engineered.columns) - set(X_train.columns)
print(list(new_features))

# 8. Feature selection (optional)
print("\n=== FEATURE SELECTION ===")

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# After preprocessing and engineering, we might have many features
# We can select the most important ones

# First need to preprocess the engineered data
preprocessor_engineered = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X_train_engineered.select_dtypes(include=[np.number]).columns.tolist()),
        ('cat', categorical_transformer, X_train_engineered.select_dtypes(include=['object', 'category']).columns.tolist())
    ])

X_train_eng_processed = preprocessor_engineered.fit_transform(X_train_engineered)
X_test_eng_processed = preprocessor_engineered.transform(X_test_engineered)

# Select top 20 features
selector = SelectKBest(score_func=f_classif, k=20)
X_train_selected = selector.fit_transform(X_train_eng_processed, y_train)
X_test_selected = selector.transform(X_test_eng_processed)

print(f"Selected {X_train_selected.shape[1]} best features out of {X_train_eng_processed.shape[1]}")

print("\n✅ Data preparation complete!")
print("Preprocessed data ready for model training.")


5. Supervised Learning: Classification & Regression

Supervised Learning Workflow

  1. Data Collection: gather labeled data
  2. Preprocessing: clean and transform the data
  3. Train/Test Split: split the data for evaluation
  4. Model Selection: choose a suitable algorithm
  5. Training: fit the model on the training data
  6. Evaluation: evaluate it on the test data
  7. Hyperparameter Tuning: optimize the model parameters
  8. Deployment: put the model into production

Classification: comparing different algorithms

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, confusion_matrix, classification_report,
                           roc_curve, auc, roc_auc_score)

# Import classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# 1. Load or create dataset
from sklearn.datasets import make_classification

# Create synthetic classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.8, 0.2],  # Imbalanced classes
    random_state=42
)

print("Dataset shape:", X.shape)
print("Class distribution:", np.bincount(y))

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Define classification models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

# 4. Train and evaluate all models
results = []

for name, model in models.items():
    print(f"\n=== Training {name} ===")
    
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc,
        'CV Mean': cv_mean,
        'CV Std': cv_std
    })
    
    # Print classification report
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    if roc_auc is not None:
        print(f"ROC-AUC: {roc_auc:.4f}")
    print(f"CV Accuracy: {cv_mean:.4f} (+/- {cv_std:.4f})")

# 5. Compare all models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False)

print("\n" + "="*60)
print("MODEL COMPARISON (sorted by F1-Score)")
print("="*60)
print(results_df.to_string(index=False))

# 6. Hyperparameter tuning for best model
print("\n" + "="*60)
print("HYPERPARAMETER TUNING FOR RANDOM FOREST")
print("="*60)

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Create and train GridSearchCV
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best F1-Score: {grid_search.best_score_:.4f}")

# 7. Evaluate tuned model
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test_scaled)
y_pred_proba_best = best_rf.predict_proba(X_test_scaled)[:, 1]

print("\n" + "="*60)
print("TUNED RANDOM FOREST PERFORMANCE")
print("="*60)

print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_best):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred_best)
print(cm)

# 8. Feature importance analysis
print("\n" + "="*60)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*60)

import matplotlib.pyplot as plt

# Get feature importance from tuned Random Forest
feature_importance = best_rf.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]

# Print top 10 features
print("Top 10 most important features:")
for i in range(10):
    print(f"Feature {sorted_idx[i]}: {feature_importance[sorted_idx[i]]:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(20), feature_importance[sorted_idx], align='center')
plt.xlabel('Feature Index')
plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()

print("\n✅ Classification model training complete!")

Regression: predictive models for continuous values

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline

# Import regression algorithms
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# 1. Load or create regression dataset
from sklearn.datasets import make_regression

# Create synthetic regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=15,
    n_informative=10,
    noise=10,
    random_state=42
)

print("Dataset shape:", X.shape)

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Define regression models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(random_state=42),
    'Lasso Regression': Lasso(random_state=42),
    'ElasticNet': ElasticNet(random_state=42),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'SVR': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor()
}

# 4. Train and evaluate all models
results = []

for name, model in models.items():
    print(f"\n=== Training {name} ===")
    
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    # Store results
    results.append({
        'Model': name,
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'CV R² Mean': cv_mean,
        'CV R² Std': cv_std
    })
    
    print(f"MSE: {mse:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"R²: {r2:.4f}")
    print(f"CV R²: {cv_mean:.4f} (+/- {cv_std:.4f})")

# 5. Compare all models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('R²', ascending=False)

print("\n" + "="*60)
print("REGRESSION MODEL COMPARISON (sorted by R²)")
print("="*60)
print(results_df.to_string(index=False))

# 6. Polynomial regression and feature engineering
print("\n" + "="*60)
print("POLYNOMIAL REGRESSION WITH FEATURE ENGINEERING")
print("="*60)

# Create polynomial features pipeline
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit and evaluate polynomial regression
poly_pipeline.fit(X_train, y_train)
y_pred_poly = poly_pipeline.predict(X_test)

mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

print(f"Polynomial Regression (degree=2) R²: {r2_poly:.4f}")
print(f"Polynomial Regression MSE: {mse_poly:.4f}")

# 7. Hyperparameter tuning for best regression model
print("\n" + "="*60)
print("HYPERPARAMETER TUNING FOR GRADIENT BOOSTING REGRESSOR")
print("="*60)

# Define parameter grid for Gradient Boosting
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Create and train GridSearchCV
gbr = GradientBoostingRegressor(random_state=42)
grid_search = GridSearchCV(
    estimator=gbr,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best R² Score: {grid_search.best_score_:.4f}")

# 8. Evaluate tuned model
best_gbr = grid_search.best_estimator_
y_pred_best = best_gbr.predict(X_test_scaled)

print("\n" + "="*60)
print("TUNED GRADIENT BOOSTING REGRESSOR PERFORMANCE")
print("="*60)

print(f"MSE: {mean_squared_error(y_test, y_pred_best):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_best):.4f}")
print(f"R²: {r2_score(y_test, y_pred_best):.4f}")

# 9. Visualize predictions vs actual values
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values (Gradient Boosting Regressor)')
plt.tight_layout()
plt.savefig('regression_predictions.png', dpi=300)
plt.show()

# 10. Residual analysis
residuals = y_test - y_pred_best

plt.figure(figsize=(12, 4))

# Residuals vs Predicted
plt.subplot(1, 3, 1)
plt.scatter(y_pred_best, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')

# Histogram of residuals
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')

# Q-Q plot for normality check
plt.subplot(1, 3, 3)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')

plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=300)
plt.show()

print("\n✅ Regression model training complete!")

6. Model evaluation and validation

Key evaluation metrics

Classification metrics

  • Accuracy: (TP + TN) / total
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: harmonic mean of precision and recall
  • ROC-AUC: area under the ROC curve (all five are computed in the sketch below)
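
A minimal sketch that computes these classification metrics on a synthetic split; the generated data and the logistic regression model are placeholder choices.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic binary classification data (placeholder for your own dataset)
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-Score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC  : {roc_auc_score(y_test, y_proba):.3f}")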

Regression metrics

  • MSE: Mean Squared Error
  • RMSE: Root Mean Squared Error
  • MAE: Mean Absolute Error
  • R²: coefficient of determination
  • MAPE: Mean Absolute Percentage Error (computed in the sketch below)
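
The regression counterparts; note that scikit-learn exposes MAPE directly as mean_absolute_percentage_error (available since version 0.24). The synthetic data and linear model are again placeholder choices.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, mean_absolute_percentage_error)

# Synthetic regression data (placeholder for your own dataset)
X, y = make_regression(n_samples=500, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f"MSE : {mse:.2f}")
print(f"RMSE: {np.sqrt(mse):.2f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.3f}")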

Validation techniques

  • Train/Test Split: basic validation
  • K-Fold CV: more robust validation (the main splitters are demonstrated in the sketch below)
  • Stratified K-Fold: for imbalanced data
  • Time Series Split: for time series
  • Bootstrapping: for small datasets
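
A brief sketch of the main cross-validation splitters; the synthetic data, the Random Forest estimator, and the default accuracy scoring are assumptions for illustration (TimeSeriesSplit is shown on non-temporal data purely to demonstrate the API).

from sklearn.datasets import make_classification
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     cross_val_score)
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=42)
clf = RandomForestClassifier(random_state=42)

# K-Fold: generic k-fold cross-validation
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))

# Stratified K-Fold: preserves the class ratio in every fold (useful for imbalanced data)
strat_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

# TimeSeriesSplit: training folds always precede the validation fold (for time series)
ts_scores = cross_val_score(clf, X, y, cv=TimeSeriesSplit(n_splits=5))

print(f"K-Fold accuracy           : {kfold_scores.mean():.3f}")
print(f"Stratified K-Fold accuracy: {strat_scores.mean():.3f}")
print(f"TimeSeriesSplit accuracy  : {ts_scores.mean():.3f}")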

7. Deep Learning with TensorFlow/Keras

Neural network for classification

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# 1. Load or create dataset
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=30,
    n_informative=20,
    n_redundant=5,
    n_classes=2,
    weights=[0.8, 0.2],
    random_state=42
)

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

# 2. Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train_scaled.shape}")
print(f"Validation set: {X_val_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

# 3. Build neural network model
def build_model(input_shape, dropout_rate=0.3):
    """Build a neural network for binary classification"""
    model = keras.Sequential([
        # Input layer
        layers.Input(shape=(input_shape,)),
        
        # Hidden layers with batch normalization and dropout
        layers.Dense(128, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        
        layers.Dense(64, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        
        layers.Dense(32, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        
        # Output layer
        layers.Dense(1, activation='sigmoid')
    ])
    
    return model

# Create model
input_shape = X_train_scaled.shape[1]
model = build_model(input_shape)

# 4. Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)

# Display model architecture
model.summary()

# 5. Define callbacks for better training
callbacks_list = [
    # Early stopping to prevent overfitting
    callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True,
        verbose=1
    ),
    
    # Reduce learning rate when plateau
    callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6,
        verbose=1
    ),
    
    # Model checkpoint
    callbacks.ModelCheckpoint(
        filepath='best_model.keras',
        monitor='val_loss',
        save_best_only=True,
        verbose=1
    ),
    
    # TensorBoard for visualization
    # callbacks.TensorBoard(log_dir='./logs')
]

# 6. Train the model
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_val_scaled, y_val),
    epochs=100,
    batch_size=32,
    callbacks=callbacks_list,
    verbose=1
)

# 7. Evaluate the model
print("\n" + "="*60)
print("NEURAL NETWORK EVALUATION")
print("="*60)

# Load best model
best_model = keras.models.load_model('best_model.keras')

# Evaluate on test set
test_loss, test_accuracy, test_precision, test_recall, test_auc = best_model.evaluate(
    X_test_scaled, y_test, verbose=0
)

print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test AUC: {test_auc:.4f}")

# Calculate F1-Score
test_f1 = 2 * (test_precision * test_recall) / (test_precision + test_recall)
print(f"Test F1-Score: {test_f1:.4f}")

# Make predictions
y_pred_proba = best_model.predict(X_test_scaled)
y_pred = (y_pred_proba > 0.5).astype(int)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# 8. Plot training history
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot loss
axes[0, 0].plot(history.history['loss'], label='Training Loss')
axes[0, 0].plot(history.history['val_loss'], label='Validation Loss')
axes[0, 0].set_title('Model Loss')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot accuracy
axes[0, 1].plot(history.history['accuracy'], label='Training Accuracy')
axes[0, 1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[0, 1].set_title('Model Accuracy')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot precision
axes[1, 0].plot(history.history['precision'], label='Training Precision')
axes[1, 0].plot(history.history['val_precision'], label='Validation Precision')
axes[1, 0].set_title('Model Precision')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Precision')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot recall
axes[1, 1].plot(history.history['recall'], label='Training Recall')
axes[1, 1].plot(history.history['val_recall'], label='Validation Recall')
axes[1, 1].set_title('Model Recall')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Recall')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_history.png', dpi=300)
plt.show()

# 9. ROC Curve
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.savefig('roc_curve.png', dpi=300)
plt.show()

print("\n✅ Neural Network training complete!")
print("Best model saved as 'best_model.keras'")

10. Practical example: Customer Churn Prediction

End-to-end ML project: Customer Churn Prediction

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, roc_auc_score, confusion_matrix, 
                           classification_report, roc_curve)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Import ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Import for feature importance visualization
import shap

# 1. Load and explore the dataset
# For this example, we'll create a realistic customer churn dataset
np.random.seed(42)
n_customers = 10000

# Create realistic customer data
data = pd.DataFrame({
    'customer_id': range(1000, 1000 + n_customers),
    'age': np.random.randint(18, 70, n_customers),
    'gender': np.random.choice(['Male', 'Female'], n_customers),
    'tenure_months': np.random.exponential(24, n_customers).astype(int),
    'monthly_charges': np.random.normal(65, 20, n_customers),
    'total_charges': np.random.normal(1500, 500, n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], 
                                      n_customers, p=[0.5, 0.3, 0.2]),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], 
                                        n_customers, p=[0.4, 0.4, 0.2]),
    'online_security': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
    'online_backup': np.random.choice(['Yes', 'No'], n_customers, p=[0.35, 0.65]),
    'device_protection': np.random.choice(['Yes', 'No'], n_customers, p=[0.25, 0.75]),
    'tech_support': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
    'streaming_tv': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
    'streaming_movies': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_customers, p=[0.6, 0.4]),
    'payment_method': np.random.choice([
        'Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'
    ], n_customers, p=[0.3, 0.2, 0.25, 0.25]),
    'monthly_usage_gb': np.random.gamma(2, 50, n_customers),
    'customer_service_calls': np.random.poisson(1, n_customers),
    'satisfaction_score': np.random.randint(1, 6, n_customers)
})

# Create realistic churn based on features
def calculate_churn_probability(row):
    """Calculate churn probability based on customer features"""
    prob = 0.1  # Base probability
    
    # Factors that increase churn probability
    if row['contract_type'] == 'Month-to-month':
        prob += 0.2
    if row['customer_service_calls'] > 2:
        prob += 0.15
    if row['satisfaction_score'] < 3:
        prob += 0.1
    if row['tenure_months'] < 12:
        prob += 0.1
    if row['online_security'] == 'No':
        prob += 0.05
    if row['tech_support'] == 'No':
        prob += 0.05
    
    # Factors that decrease churn probability
    if row['contract_type'] == 'Two year':
        prob -= 0.15
    if row['tenure_months'] > 24:
        prob -= 0.1
    if row['satisfaction_score'] > 4:
        prob -= 0.08
    
    return min(max(prob, 0.05), 0.95)

# Apply churn probability and generate churn labels
data['churn_probability'] = data.apply(calculate_churn_probability, axis=1)
data['churn'] = np.random.binomial(1, data['churn_probability'])

# Drop probability column (not available in real scenario)
data = data.drop('churn_probability', axis=1)

print("=== CUSTOMER CHURN DATASET ===")
print(f"Dataset shape: {data.shape}")
print(f"\nChurn distribution:")
print(data['churn'].value_counts())
print(f"Churn rate: {data['churn'].mean() * 100:.2f}%")

# 2. Exploratory Data Analysis (EDA)
print("\n=== EXPLORATORY DATA ANALYSIS ===")

# Basic statistics
print("\nNumeric columns statistics:")
print(data.select_dtypes(include=[np.number]).describe())

# Churn by categorical features
categorical_features = ['contract_type', 'internet_service', 'online_security', 
                       'tech_support', 'paperless_billing', 'payment_method']

print("\nChurn rates by category:")
for feature in categorical_features:
    churn_rates = data.groupby(feature)['churn'].mean().sort_values(ascending=False)
    print(f"\n{feature}:")
    for category, rate in churn_rates.items():
        print(f"  {category}: {rate * 100:.1f}%")

# 3. Feature engineering
print("\n=== FEATURE ENGINEERING ===")

# Create new features
data['tenure_years'] = data['tenure_months'] / 12
data['avg_monthly_spend'] = data['total_charges'] / data['tenure_months'].replace(0, 1)
data['has_multiple_services'] = (
    (data['online_security'] == 'Yes').astype(int) +
    (data['online_backup'] == 'Yes').astype(int) +
    (data['device_protection'] == 'Yes').astype(int) +
    (data['tech_support'] == 'Yes').astype(int) +
    (data['streaming_tv'] == 'Yes').astype(int) +
    (data['streaming_movies'] == 'Yes').astype(int)
)
data['high_usage'] = (data['monthly_usage_gb'] > data['monthly_usage_gb'].quantile(0.75)).astype(int)
data['frequent_service_calls'] = (data['customer_service_calls'] > 2).astype(int)

print(f"Added {len(['tenure_years', 'avg_monthly_spend', 'has_multiple_services', 'high_usage', 'frequent_service_calls'])} new features")

# 4. Prepare data for modeling
print("\n=== DATA PREPARATION ===")

# Drop customer_id (not a feature)
X = data.drop(['customer_id', 'churn'], axis=1)
y = data['churn']

# Identify feature types
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f"Numeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Training churn rate: {y_train.mean() * 100:.2f}%")
print(f"Test churn rate: {y_test.mean() * 100:.2f}%")

# 5. Create preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# 6. Build and compare multiple models
print("\n=== MODEL TRAINING AND COMPARISON ===")

# Define models with initial parameters
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(random_state=42, class_weight='balanced_subsample'),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(random_state=42, class_weight='balanced')
}

# Create pipelines with SMOTE for each model
pipelines = {}
for name, model in models.items():
    pipelines[name] = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', model)
    ])

# Train and evaluate all models
results = []
for name, pipeline in pipelines.items():
    print(f"\nTraining {name}...")
    
    # Train model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Cross-validation score
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc,
        'CV ROC-AUC Mean': cv_mean,
        'CV ROC-AUC Std': cv_std
    })
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-Score: {f1:.4f}")
    print(f"  ROC-AUC: {roc_auc:.4f}")

# Compare models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False)

print("\n" + "="*70)
print("MODEL COMPARISON FOR CUSTOMER CHURN PREDICTION")
print("="*70)
print(results_df.to_string(index=False))

# 7. Hyperparameter tuning for best model
print("\n" + "="*70)
print("HYPERPARAMETER TUNING FOR XGBOOST (BEST MODEL)")
print("="*70)

# Define parameter grid for XGBoost
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [3, 5, 7],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__subsample': [0.8, 0.9, 1.0],
    'classifier__colsample_bytree': [0.8, 0.9, 1.0]
}

# Create GridSearchCV for XGBoost pipeline
xgb_pipeline = pipelines['XGBoost']
grid_search = GridSearchCV(
    estimator=xgb_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC Score: {grid_search.best_score_:.4f}")

# 8. Evaluate tuned model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
y_pred_proba_best = best_model.predict_proba(X_test)[:, 1]

print("\n" + "="*70)
print("TUNED XGBOOST PERFORMANCE ON TEST SET")
print("="*70)

print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_best):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred_best)
print(cm)

# 9. Feature importance analysis
print("\n" + "="*70)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*70)

# Get feature names after preprocessing
preprocessor = best_model.named_steps['preprocessor']
xgb_model = best_model.named_steps['classifier']

# Get one-hot encoded feature names
onehot_encoder = preprocessor.named_transformers_['cat'].named_steps['onehot']
cat_feature_names = onehot_encoder.get_feature_names_out(categorical_features)

# Combine all feature names
all_feature_names = np.concatenate([numeric_features, cat_feature_names])

# Get feature importance from XGBoost
feature_importance = xgb_model.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]

# Print top 20 features
print("\nTop 20 most important features for churn prediction:")
for i in range(min(20, len(all_feature_names))):
    feature_name = all_feature_names[sorted_idx[i]]
    importance = feature_importance[sorted_idx[i]]
    print(f"{i+1:2d}. {feature_name:30s}: {importance:.4f}")

# 10. SHAP analysis for model interpretability
print("\n" + "="*70)
print("SHAP ANALYSIS FOR MODEL INTERPRETABILITY")
print("="*70)

try:
    # Prepare data for SHAP
    X_test_processed = preprocessor.transform(X_test)
    
    # Create SHAP explainer
    explainer = shap.TreeExplainer(xgb_model)
    shap_values = explainer.shap_values(X_test_processed)
    
    # Summary plot
    plt.figure(figsize=(12, 8))
    shap.summary_plot(shap_values, X_test_processed, feature_names=all_feature_names, 
                      max_display=20, show=False)
    plt.tight_layout()
    plt.savefig('shap_summary.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("SHAP summary plot saved as 'shap_summary.png'")
    
except Exception as e:
    print(f"SHAP analysis failed: {e}")

# 11. Business insights and recommendations
print("\n" + "="*70)
print("BUSINESS INSIGHTS AND RECOMMENDATIONS")
print("="*70)

# Calculate precision at different thresholds
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print("\nPerformance at different probability thresholds:")
for threshold in thresholds:
    y_pred_thresh = (y_pred_proba_best >= threshold).astype(int)
    precision_thresh = precision_score(y_test, y_pred_thresh)
    recall_thresh = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision_thresh:.3f}, Recall={recall_thresh:.3f}")

# Identify high-risk customers
high_risk_threshold = 0.7
high_risk_indices = np.where(y_pred_proba_best >= high_risk_threshold)[0]
high_risk_customers = X_test.iloc[high_risk_indices].copy()
high_risk_customers['churn_probability'] = y_pred_proba_best[high_risk_indices]

print(f"\nIdentified {len(high_risk_customers)} high-risk customers (probability >= {high_risk_threshold})")

# Analyze characteristics of high-risk customers
print("\nCharacteristics of high-risk customers:")
print(f"- Average tenure: {high_risk_customers['tenure_months'].mean():.1f} months (vs {X_test['tenure_months'].mean():.1f} overall)")
print(f"- Monthly charges: €{high_risk_customers['monthly_charges'].mean():.2f} (vs €{X_test['monthly_charges'].mean():.2f} overall)")
print(f"- Month-to-month contracts: {high_risk_customers[high_risk_customers['contract_type'] == 'Month-to-month'].shape[0] / len(high_risk_customers) * 100:.1f}%")

# 12. Save the model
import joblib

# Save the entire pipeline
joblib.dump(best_model, 'customer_churn_model.pkl')

# Save feature names
joblib.dump(all_feature_names, 'feature_names.pkl')

print("\n" + "="*70)
print("MODEL DEPLOYMENT READY")
print("="*70)
print("Model saved as 'customer_churn_model.pkl'")
print("Feature names saved as 'feature_names.pkl'")

print("\n✅ Customer Churn Prediction project complete!")
print(f"Final model achieves ROC-AUC of {roc_auc_score(y_test, y_pred_proba_best):.4f} on test data")


Conclusion and next steps

Machine Learning in Python is a powerful skill that lets you build intelligent systems. You have now learned:

  1. Core ML concepts: supervised vs. unsupervised learning
  2. Data preprocessing: preparing data for ML
  3. Model training: training and comparing different algorithms
  4. Model evaluation: metrics and validation techniques
  5. Deep Learning: neural networks with TensorFlow/Keras
  6. Practical project: end-to-end customer churn prediction

Next steps:

  • Experiment with your own datasets
  • Learn advanced deep learning (CNNs, RNNs, Transformers)
  • Deploy ML models to production with Flask/FastAPI
  • Learn MLOps for model monitoring and management
  • Follow our advanced ML and AI tutorials