Machine Learning in Python: Complete Guide with Scikit-learn
Learn professional machine learning in Python. A complete tutorial from basic concepts to advanced deep learning with Scikit-learn and TensorFlow.
Table of Contents
- What is Machine Learning?
- Machine Learning types and applications
- Python ML stack setup
- Data preparation for ML
- Supervised Learning: Classification & Regression
- Model evaluation and validation
- Unsupervised Learning: Clustering & PCA
- Deep Learning with TensorFlow/Keras
- Model deployment and monitoring
- Practical example: Customer Churn Prediction
- ML best practices and tips
1. What is Machine Learning?
Machine Learning Definition
Machine Learning is a subfield of artificial intelligence (AI) in which computer systems learn from data to perform tasks without being explicitly programmed. Instead of writing rules, models learn patterns from examples.
Predictive Analytics
Predict future events based on historical data.
Pattern Recognition
Recognize complex patterns in images, text and audio.
Personalization
Deliver personalized recommendations and experiences.
Automation
Automate complex decision-making processes.
| Traditional Programming | Machine Learning | Advantages |
|---|---|---|
| Input + Rules → Output | Input + Output → Rules | Learns from data |
| Rules defined by hand | Patterns learned automatically | Scales to complex problems |
| Static logic | Adaptive models | Keeps improving with new data |
| Requires human expertise | Data-driven decisions | Can reduce human bias |
| Good for structured problems | Good for complex, unstructured problems | Broadly applicable |
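To make the "Input + Output → Rules" idea concrete, here is a minimal sketch contrasting a hand-written rule with a model that learns a comparable decision from labeled examples. The keyword list and toy messages are illustrative assumptions, not a real spam filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Traditional programming: the rule is written by hand
def rule_based_spam_filter(message):
    suspicious_words = ['free', 'winner', 'prize']  # assumption: a toy keyword list
    return any(word in message.lower() for word in suspicious_words)
# Machine learning: the "rule" is learned from labeled examples
messages = ['Free prize, click now', 'Meeting at 10 tomorrow',
            'You are a winner', 'Lunch later?']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = LogisticRegression().fit(X, labels)
print(rule_based_spam_filter('Claim your free prize'))                   # rule-based output
print(model.predict(vectorizer.transform(['Claim your free prize'])))    # learned output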
Real-world applications of Machine Learning
E-commerce
- Product recommendations
- Price optimization
- Fraud detection
- Customer segmentation
Healthcare
- Disease diagnosis
- Drug discovery
- Medical image analysis
- Patient risk prediction
Automotive
- Self-driving cars
- Predictive maintenance
- Traffic prediction
- Driver behavior analysis
Finance
- Credit scoring
- Algorithmic trading
- Risk assessment
- Chatbots for customer service
2. Machine Learning types and applications
Supervised
Labeled data
Unsupervised
Unlabeled data
Reinforcement
Reward-based learning
Deep Learning
Neural networks
Semi-supervised
Mixed data
Transfer Learning
Pre-trained models
Supervised Learning
How it works: the model learns from labeled examples (input → output pairs); a runnable mini-example follows the algorithm list below.
- Classification: predict discrete labels (spam/not spam)
- Regression: predict continuous values (house price)
# Examples of supervised learning algorithms:
# 1. Linear Regression - for regression problems
# 2. Logistic Regression - for binary classification
# 3. Decision Trees - for classification and regression
# 4. Random Forest - ensemble of decision trees
# 5. SVM (Support Vector Machines) - for classification
# 6. Neural Networks - for complex patterns
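As a quick taste of the supervised workflow (the full version follows in section 5), this minimal sketch fits one classifier and one regressor from the list above on scikit-learn's built-in toy datasets; the choice of datasets and models is purely illustrative.
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
# Classification: predict a discrete label
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"Classification accuracy: {clf.score(X_test, y_test):.2f}")
# Regression: predict a continuous value
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
print(f"Regression R²: {reg.score(X_test, y_test):.2f}")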
Unsupervised Learning
How it works: the model looks for patterns in unlabeled data; a short clustering and PCA sketch follows the algorithm list below.
- Clustering: groups similar data points
- Dimensionality Reduction: reduces the number of features
- Anomaly Detection: finds deviating observations
# Examples of unsupervised learning algorithms:
# 1. K-Means Clustering - for data segmentation
# 2. Hierarchical Clustering - for nested clusters
# 3. PCA (Principal Component Analysis) - for dimensionality reduction
# 4. t-SNE - for data visualization
# 5. Autoencoders - neural networks for compression
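Since clustering and PCA are not worked out in a separate code section later on, here is a minimal sketch of both on synthetic data; the number of clusters and components are assumptions you would normally tune.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Synthetic data with 4 natural groups and 10 features
X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=42)
# Clustering: assign each point to one of k groups (k=4 is an assumption)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)
print("Cluster sizes:", np.bincount(cluster_labels))
# Dimensionality reduction: project 10 features onto 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Explained variance of 2 components: {pca.explained_variance_ratio_.sum():.2%}")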
Deep Learning
How it works: uses neural networks with multiple layers to capture complex patterns.
- CNN: Convolutional Neural Networks - for images
- RNN: Recurrent Neural Networks - for sequences
- GAN: Generative Adversarial Networks - for content generation
- Transformer: for NLP (BERT, GPT)
# Deep learning frameworks:
# 1. TensorFlow - Google's DL framework
# 2. PyTorch - Facebook's DL framework
# 3. Keras - high-level API for TensorFlow
# 4. FastAI - for rapid prototyping
| ML Type | When to use? | Algorithms | Examples |
|---|---|---|---|
| Supervised | You have labeled data and want to make predictions | Linear/Logistic Regression, Random Forest, SVM | Spam detection, price prediction |
| Unsupervised | You have unlabeled data and want to discover patterns | K-Means, PCA, DBSCAN | Customer segmentation, anomaly detection |
| Semi-supervised | You have little labeled and a lot of unlabeled data | Label Propagation, Self-training | Medical imaging with few labels |
| Reinforcement | You want an agent to learn through trial and error | Q-Learning, Deep Q Networks | Game playing, robotics |
| Deep Learning | You have a lot of data and complex patterns | CNN, RNN, Transformer | Image recognition, NLP, speech recognition |
3. Python ML stack setup
The complete Machine Learning stack
Scikit-learn
Core ML algorithms and tools
TensorFlow/PyTorch
Deep learning frameworks
XGBoost/LightGBM
Gradient boosting for tabular data
Pandas/NumPy
Data manipulation and numerical computing
Matplotlib/Seaborn
Data visualization
SciPy/Statsmodels
Statistical analysis
# Complete ML environment setup
# Option 1: Miniconda/Anaconda (recommended)
# Download from: https://docs.conda.io/en/latest/miniconda.html
# Create a new environment:
conda create -n ml_env python=3.9
conda activate ml_env
# Option 2: Virtual environment with pip
python -m venv ml_env
# Windows:
ml_env\Scripts\activate
# Mac/Linux:
source ml_env/bin/activate
# Install the basic ML packages
pip install numpy pandas matplotlib seaborn scipy scikit-learn jupyter
# Advanced ML packages
pip install xgboost lightgbm catboost
# Deep learning packages
pip install tensorflow keras torch torchvision torchaudio
# Extra utilities
pip install joblib imbalanced-learn yellowbrick eli5 shap
# For deployment
pip install flask fastapi streamlit mlflow
# Or use a requirements.txt file:
# requirements.txt contents:
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tensorflow==2.13.0
xgboost==2.0.0
matplotlib==3.7.1
seaborn==0.12.2
jupyter==1.0.0
flask==2.3.2
# Install all packages:
pip install -r requirements.txt
# Import statements for ML projects
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Scikit-learn imports
import sklearn  # needed for the version check below
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# ML algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
# Deep learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
# Configure plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
print("✅ ML environment ready!")
print(f"TensorFlow versie: {tf.__version__}")
print(f"Scikit-learn versie: {sklearn.__version__}")
4. Data preparation for ML
Key data issues for ML
- Missing values: can break models or degrade results
- Categorical data: ML algorithms only work with numeric data
- Feature scaling: some algorithms require scaled features
- Imbalanced data: can lead to biased models
- Outliers: can disrupt model training
- Multicollinearity: highly correlated features can cause problems (quick checks for outliers and multicollinearity are sketched below)
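The preprocessing pipeline below handles missing values, encoding, scaling and class imbalance; outliers and multicollinearity are not treated explicitly there, so here is a minimal sketch of how you might check for them first. The 1.5 × IQR factor and the 0.9 correlation threshold are common rules of thumb, not fixed requirements.
import numpy as np
import pandas as pd
def quick_data_checks(df, corr_threshold=0.9):
    """Flag potential outliers (IQR rule) and highly correlated feature pairs."""
    numeric = df.select_dtypes(include=[np.number])
    # Outlier check: count values outside 1.5 * IQR per column
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outlier_counts = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    print("Potential outliers per column:")
    print(outlier_counts[outlier_counts > 0])
    # Multicollinearity check: report feature pairs above the correlation threshold
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    high_pairs = [(a, b, round(upper.loc[a, b], 2))
                  for a in upper.index for b in upper.columns
                  if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > corr_threshold]
    print(f"\nHighly correlated pairs (> {corr_threshold}): {high_pairs}")
# Example usage on the loan dataset created below:
# quick_data_checks(data.drop('default', axis=1))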
Complete data preprocessing pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# 1. Sample dataset for ML
np.random.seed(42)
n_samples = 1000
# Create a dataset with different data types and issues
data = pd.DataFrame({
'age': np.random.randint(18, 70, n_samples),
'income': np.random.normal(50000, 15000, n_samples),
'credit_score': np.random.randint(300, 850, n_samples),
'gender': np.random.choice(['Male', 'Female', None], n_samples, p=[0.45, 0.45, 0.10]),
'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD', None], n_samples),
'employment_status': np.random.choice(['Employed', 'Unemployed', 'Self-Employed'], n_samples),
'loan_amount': np.random.exponential(50000, n_samples),
'loan_term': np.random.choice([12, 24, 36, 48, 60], n_samples),
'default': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]) # Target variable
})
# Introduce missing values and outliers
data.loc[np.random.choice(n_samples, 50, replace=False), 'income'] = np.nan
data.loc[np.random.choice(n_samples, 30, replace=False), 'credit_score'] = np.nan
data.loc[np.random.choice(n_samples, 5, replace=False), 'income'] = 1000000 # Outliers
print("=== DATASET OVERVIEW ===")
print(f"Shape: {data.shape}")
print("\nData types:")
print(data.dtypes)
print("\nMissing values:")
print(data.isnull().sum())
print("\nFirst 5 rows:")
print(data.head())
# 2. Data exploration en analysis
print("\n=== DATA EXPLORATION ===")
# Basic statistics
print("Numeric columns statistics:")
print(data.select_dtypes(include=[np.number]).describe())
# Categorical columns analysis
categorical_cols = data.select_dtypes(include=['object']).columns
print("\nCategorical columns distribution:")
for col in categorical_cols:
print(f"\n{col}:")
print(data[col].value_counts(dropna=False))
# Target variable distribution
print("\nTarget variable distribution:")
print(data['default'].value_counts())
print(f"Imbalance ratio: {data['default'].value_counts()[0] / data['default'].value_counts()[1]:.2f}:1")
# 3. Data cleaning pipeline
print("\n=== DATA CLEANING ===")
# Split features and target
X = data.drop('default', axis=1)
y = data['default']
# Split data before preprocessing (to avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
print(f"\nNumeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")
# 4. Create preprocessing pipelines
print("\n=== PREPROCESSING PIPELINES ===")
# Numeric pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # Handle missing values
('scaler', StandardScaler()) # Standardize features
])
# Categorical pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')), # Fill missing with mode
('onehot', OneHotEncoder(handle_unknown='ignore')) # Convert to numeric
])
# Combine pipelines
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# 5. Apply preprocessing
print("Applying preprocessing...")
# Fit on training data only, then transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# Get feature names after one-hot encoding
try:
# Extract feature names from the preprocessor
cat_encoder = preprocessor.named_transformers_['cat'].named_steps['onehot']
cat_features = cat_encoder.get_feature_names_out(categorical_features)
# Combine all feature names
all_features = np.concatenate([numeric_features, cat_features])
print(f"\nTotal features after preprocessing: {len(all_features)}")
print("First 10 features:", all_features[:10])
except Exception as e:
print(f"Could not extract feature names: {e}")
print(f"\nX_train shape after preprocessing: {X_train_processed.shape}")
print(f"X_test shape after preprocessing: {X_test_processed.shape}")
# 6. Handle class imbalance (for classification problems)
print("\n=== HANDLE CLASS IMBALANCE ===")
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# Option 1: SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)
print(f"Before SMOTE - Class distribution: {np.bincount(y_train)}")
print(f"After SMOTE - Class distribution: {np.bincount(y_train_resampled)}")
# Option 2: Combined sampling
over = SMOTE(sampling_strategy=0.5, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)
# Create pipeline with resampling
resampling_pipeline = ImbPipeline(steps=[
('preprocessor', preprocessor),
('over', over),
('under', under)
])
# 7. Feature engineering (adding new features)
print("\n=== FEATURE ENGINEERING ===")
def engineer_features(df):
"""Add engineered features to the dataset"""
df_engineered = df.copy()
# Example engineered features for loan dataset
if 'income' in df.columns and 'loan_amount' in df.columns:
# Debt-to-income ratio
df_engineered['debt_to_income'] = df_engineered['loan_amount'] / df_engineered['income']
# Monthly payment (simplified)
if 'loan_term' in df.columns:
df_engineered['monthly_payment'] = df_engineered['loan_amount'] / df_engineered['loan_term']
# Age groups
if 'age' in df.columns:
df_engineered['age_group'] = pd.cut(
df_engineered['age'],
bins=[0, 25, 35, 45, 55, 100],
labels=['18-25', '26-35', '36-45', '46-55', '56+']
)
# Credit score categories
if 'credit_score' in df.columns:
df_engineered['credit_category'] = pd.cut(
df_engineered['credit_score'],
bins=[300, 580, 670, 740, 800, 850],
labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']
)
return df_engineered
# Apply feature engineering
X_train_engineered = engineer_features(X_train)
X_test_engineered = engineer_features(X_test)
print(f"Original features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_engineered.shape[1]}")
print("\nNew engineered features:")
new_features = set(X_train_engineered.columns) - set(X_train.columns)
print(list(new_features))
# 8. Feature selection (optional)
print("\n=== FEATURE SELECTION ===")
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
# After preprocessing and engineering, we might have many features
# We can select the most important ones
# First need to preprocess the engineered data
preprocessor_engineered = ColumnTransformer(
transformers=[
('num', numeric_transformer, X_train_engineered.select_dtypes(include=[np.number]).columns.tolist()),
('cat', categorical_transformer, X_train_engineered.select_dtypes(include=['object', 'category']).columns.tolist())
])
X_train_eng_processed = preprocessor_engineered.fit_transform(X_train_engineered)
X_test_eng_processed = preprocessor_engineered.transform(X_test_engineered)
# Select top 20 features
selector = SelectKBest(score_func=f_classif, k=20)
X_train_selected = selector.fit_transform(X_train_eng_processed, y_train)
X_test_selected = selector.transform(X_test_eng_processed)
print(f"Selected {X_train_selected.shape[1]} best features out of {X_train_eng_processed.shape[1]}")
print("\n✅ Data preparation complete!")
print("Preprocessed data ready for model training.")
5. Supervised Learning: Classificatie & Regressie
Supervised Learning Workflow
- Data Collection: Verzamel gelabelde data
- Preprocessing: Clean en transformeer data
- Train/Test Split: Split data voor evaluatie
- Model Selection: Kies geschikt algoritme
- Training: Train model op training data
- Evaluation: Evalueer op test data
- Hyperparameter Tuning: Optimaliseer model parameters
- Deployment: Implementeer model in productie
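Before comparing individual algorithms in detail, here is a compressed sketch of steps 2–7 of this workflow using a single scikit-learn Pipeline, so scaling and model fitting stay bundled together during cross-validation; the dataset and parameter grid are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Steps 1-3: labeled data and a train/test split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
# Steps 4-5: model selection and training inside a pipeline (prevents leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
# Step 7: hyperparameter tuning with cross-validation (illustrative grid)
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
# Step 6: evaluation on the held-out test set
print(f"Best C: {search.best_params_['clf__C']}")
print(f"Test F1-score: {search.score(X_test, y_test):.4f}")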
Classification: comparing different algorithms
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report,
roc_curve, auc, roc_auc_score)
# Import classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# 1. Load or create dataset
from sklearn.datasets import make_classification
# Create synthetic classification dataset
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=15,
n_redundant=5,
n_classes=2,
weights=[0.8, 0.2], # Imbalanced classes
random_state=42
)
print("Dataset shape:", X.shape)
print("Class distribution:", np.bincount(y))
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Define classification models to compare
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(random_state=42, probability=True),
'K-Nearest Neighbors': KNeighborsClassifier(),
'Naive Bayes': GaussianNB()
}
# 4. Train and evaluate all models
results = []
for name, model in models.items():
print(f"\n=== Training {name} ===")
# Train model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
# Cross-validation score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
# Store results
results.append({
'Model': name,
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1-Score': f1,
'ROC-AUC': roc_auc,
'CV Mean': cv_mean,
'CV Std': cv_std
})
# Print classification report
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
if roc_auc:
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"CV Accuracy: {cv_mean:.4f} (+/- {cv_std:.4f})")
# 5. Compare all models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False)
print("\n" + "="*60)
print("MODEL COMPARISON (sorted by F1-Score)")
print("="*60)
print(results_df.to_string(index=False))
# 6. Hyperparameter tuning for best model
print("\n" + "="*60)
print("HYPERPARAMETER TUNING FOR RANDOM FOREST")
print("="*60)
# Define parameter grid for Random Forest
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2']
}
# Create and train GridSearchCV
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
estimator=rf,
param_grid=param_grid,
cv=5,
scoring='f1',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best F1-Score: {grid_search.best_score_:.4f}")
# 7. Evaluate tuned model
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test_scaled)
y_pred_proba_best = best_rf.predict_proba(X_test_scaled)[:, 1]
print("\n" + "="*60)
print("TUNED RANDOM FOREST PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_best):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred_best)
print(cm)
# 8. Feature importance analysis
print("\n" + "="*60)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*60)
import matplotlib.pyplot as plt
# Get feature importance from tuned Random Forest
feature_importance = best_rf.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]
# Print top 10 features
print("Top 10 most important features:")
for i in range(10):
print(f"Feature {sorted_idx[i]}: {feature_importance[sorted_idx[i]]:.4f}")
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(20), feature_importance[sorted_idx], align='center')
plt.xlabel('Feature Index')
plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()
print("\n✅ Classification model training complete!")
Regression: predictive models for continuous values
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
# Import regression algorithms
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
# 1. Load or create regression dataset
from sklearn.datasets import make_regression
# Create synthetic regression dataset
X, y = make_regression(
n_samples=1000,
n_features=15,
n_informative=10,
noise=10,
random_state=42
)
print("Dataset shape:", X.shape)
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Define regression models to compare
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(random_state=42),
'Lasso Regression': Lasso(random_state=42),
'ElasticNet': ElasticNet(random_state=42),
'Decision Tree': DecisionTreeRegressor(random_state=42),
'Random Forest': RandomForestRegressor(random_state=42),
'Gradient Boosting': GradientBoostingRegressor(random_state=42),
'SVR': SVR(),
'K-Nearest Neighbors': KNeighborsRegressor()
}
# 4. Train and evaluate all models
results = []
for name, model in models.items():
print(f"\n=== Training {name} ===")
# Train model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Cross-validation score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
# Store results
results.append({
'Model': name,
'MSE': mse,
'RMSE': rmse,
'MAE': mae,
'R²': r2,
'CV R² Mean': cv_mean,
'CV R² Std': cv_std
})
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")
print(f"CV R²: {cv_mean:.4f} (+/- {cv_std:.4f})")
# 5. Compare all models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('R²', ascending=False)
print("\n" + "="*60)
print("REGRESSION MODEL COMPARISON (sorted by R²)")
print("="*60)
print(results_df.to_string(index=False))
# 6. Polynomial regression and feature engineering
print("\n" + "="*60)
print("POLYNOMIAL REGRESSION WITH FEATURE ENGINEERING")
print("="*60)
# Create polynomial features pipeline
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('regressor', LinearRegression())
])
# Fit and evaluate polynomial regression
poly_pipeline.fit(X_train, y_train)
y_pred_poly = poly_pipeline.predict(X_test)
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print(f"Polynomial Regression (degree=2) R²: {r2_poly:.4f}")
print(f"Polynomial Regression MSE: {mse_poly:.4f}")
# 7. Hyperparameter tuning for best regression model
print("\n" + "="*60)
print("HYPERPARAMETER TUNING FOR GRADIENT BOOSTING REGRESSOR")
print("="*60)
# Define parameter grid for Gradient Boosting
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.05, 0.1],
'max_depth': [3, 5, 7],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
# Create and train GridSearchCV
gbr = GradientBoostingRegressor(random_state=42)
grid_search = GridSearchCV(
estimator=gbr,
param_grid=param_grid,
cv=5,
scoring='r2',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best R² Score: {grid_search.best_score_:.4f}")
# 8. Evaluate tuned model
best_gbr = grid_search.best_estimator_
y_pred_best = best_gbr.predict(X_test_scaled)
print("\n" + "="*60)
print("TUNED GRADIENT BOOSTING REGRESSOR PERFORMANCE")
print("="*60)
print(f"MSE: {mean_squared_error(y_test, y_pred_best):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_best):.4f}")
print(f"R²: {r2_score(y_test, y_pred_best):.4f}")
# 9. Visualize predictions vs actual values
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values (Gradient Boosting Regressor)')
plt.tight_layout()
plt.savefig('regression_predictions.png', dpi=300)
plt.show()
# 10. Residual analysis
residuals = y_test - y_pred_best
plt.figure(figsize=(12, 4))
# Residuals vs Predicted
plt.subplot(1, 3, 1)
plt.scatter(y_pred_best, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
# Histogram of residuals
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
# Q-Q plot for normality check
plt.subplot(1, 3, 3)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=300)
plt.show()
print("\n✅ Regression model training complete!")
6. Model evaluation and validation
Key evaluation metrics (a short sketch computing them follows the two lists below)
Classification Metrics
- Accuracy: (TP+TN) / Total
- Precision: TP / (TP+FP)
- Recall: TP / (TP+FN)
- F1-Score: harmonic mean of Precision and Recall
- ROC-AUC: area under the ROC curve
Regression Metrics
- MSE: Mean Squared Error
- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error
- R²: Coefficient of Determination
- MAPE: Mean Absolute Percentage Error
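A minimal sketch of how these metrics are computed with scikit-learn, using small hard-coded prediction vectors purely as illustrative assumptions (MAPE is available as mean_absolute_percentage_error since scikit-learn 0.24):
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, mean_squared_error, mean_absolute_error,
                             r2_score, mean_absolute_percentage_error)
# Classification metrics on toy predictions (illustrative values)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_proba = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1])
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")
print(f"ROC-AUC:   {roc_auc_score(y_true, y_proba):.2f}")
# Regression metrics on toy predictions (illustrative values)
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.8, 5.4, 2.9, 6.4])
print(f"MSE:  {mean_squared_error(y_true_reg, y_pred_reg):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)):.3f}")
print(f"MAE:  {mean_absolute_error(y_true_reg, y_pred_reg):.3f}")
print(f"R²:   {r2_score(y_true_reg, y_pred_reg):.3f}")
print(f"MAPE: {mean_absolute_percentage_error(y_true_reg, y_pred_reg):.3f}")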
Validation Techniques (see the sketch after this list)
- Train/Test Split: basic validation
- K-Fold CV: more robust validation
- Stratified K-Fold: for imbalanced data
- Time Series Split: for time series
- Bootstrapping: for small datasets
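The classification and regression examples above only used train/test splits and plain k-fold cross-validation. The sketch below shows, on an assumed synthetic dataset, how the remaining splitters from this list are set up in scikit-learn; bootstrapping is sketched manually with out-of-bag evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
model = LogisticRegression(max_iter=1000)
# Stratified K-Fold: every fold keeps the original class ratio (useful for imbalanced data)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"Stratified 5-fold F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Time Series Split: training folds always come before the validation fold in time
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train up to index {train_idx.max()}, validate on {len(val_idx)} samples")
# Bootstrapping: resample with replacement, evaluate on the out-of-bag samples
rng = np.random.RandomState(42)
n = len(X)
boot_scores = []
for i in range(20):
    boot_idx = rng.randint(0, n, n)                  # indices sampled with replacement
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)   # samples not drawn -> evaluation set
    model.fit(X[boot_idx], y[boot_idx])
    boot_scores.append(model.score(X[oob_idx], y[oob_idx]))
print(f"Bootstrap (out-of-bag) accuracy: {np.mean(boot_scores):.3f} (+/- {np.std(boot_scores):.3f})")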
7. Deep Learning with TensorFlow/Keras
Neural network for classification
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
# 1. Load or create dataset
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=10000,
n_features=30,
n_informative=20,
n_redundant=5,
n_classes=2,
weights=[0.8, 0.2],
random_state=42
)
print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
# 2. Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {X_train_scaled.shape}")
print(f"Validation set: {X_val_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")
# 3. Build neural network model
def build_model(input_shape, dropout_rate=0.3):
"""Build a neural network for binary classification"""
model = keras.Sequential([
# Input layer
layers.Input(shape=(input_shape,)),
# Hidden layers with batch normalization and dropout
layers.Dense(128, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(dropout_rate),
layers.Dense(64, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(dropout_rate),
layers.Dense(32, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(dropout_rate),
# Output layer
layers.Dense(1, activation='sigmoid')
])
return model
# Create model
input_shape = X_train_scaled.shape[1]
model = build_model(input_shape)
# 4. Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=[
'accuracy',
keras.metrics.Precision(name='precision'),
keras.metrics.Recall(name='recall'),
keras.metrics.AUC(name='auc')
]
)
# Display model architecture
model.summary()
# 5. Define callbacks for better training
callbacks_list = [
# Early stopping to prevent overfitting
callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True,
verbose=1
),
# Reduce learning rate when plateau
callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=5,
min_lr=1e-6,
verbose=1
),
# Model checkpoint
callbacks.ModelCheckpoint(
filepath='best_model.keras',
monitor='val_loss',
save_best_only=True,
verbose=1
),
# TensorBoard for visualization
# callbacks.TensorBoard(log_dir='./logs')
]
# 6. Train the model
history = model.fit(
X_train_scaled, y_train,
validation_data=(X_val_scaled, y_val),
epochs=100,
batch_size=32,
callbacks=callbacks_list,
verbose=1
)
# 7. Evaluate the model
print("\n" + "="*60)
print("NEURAL NETWORK EVALUATION")
print("="*60)
# Load best model
best_model = keras.models.load_model('best_model.keras')
# Evaluate on test set
test_loss, test_accuracy, test_precision, test_recall, test_auc = best_model.evaluate(
X_test_scaled, y_test, verbose=0
)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test AUC: {test_auc:.4f}")
# Calculate F1-Score
test_f1 = 2 * (test_precision * test_recall) / (test_precision + test_recall)
print(f"Test F1-Score: {test_f1:.4f}")
# Make predictions
y_pred_proba = best_model.predict(X_test_scaled).ravel()  # flatten (n, 1) output to 1-D
y_pred = (y_pred_proba > 0.5).astype(int)
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# 8. Plot training history
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Plot loss
axes[0, 0].plot(history.history['loss'], label='Training Loss')
axes[0, 0].plot(history.history['val_loss'], label='Validation Loss')
axes[0, 0].set_title('Model Loss')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Plot accuracy
axes[0, 1].plot(history.history['accuracy'], label='Training Accuracy')
axes[0, 1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[0, 1].set_title('Model Accuracy')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Plot precision
axes[1, 0].plot(history.history['precision'], label='Training Precision')
axes[1, 0].plot(history.history['val_precision'], label='Validation Precision')
axes[1, 0].set_title('Model Precision')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Precision')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Plot recall
axes[1, 1].plot(history.history['recall'], label='Training Recall')
axes[1, 1].plot(history.history['val_recall'], label='Validation Recall')
axes[1, 1].set_title('Model Recall')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Recall')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('training_history.png', dpi=300)
plt.show()
# 9. ROC Curve
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.savefig('roc_curve.png', dpi=300)
plt.show()
print("\n✅ Neural Network training complete!")
print("Best model saved as 'best_model.keras'")
10. Practical example: Customer Churn Prediction
End-to-end ML project: Customer Churn Prediction
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# Import ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Import for feature importance visualization
import shap
# 1. Load and explore the dataset
# For this example, we'll create a realistic customer churn dataset
np.random.seed(42)
n_customers = 10000
# Create realistic customer data
data = pd.DataFrame({
'customer_id': range(1000, 1000 + n_customers),
'age': np.random.randint(18, 70, n_customers),
'gender': np.random.choice(['Male', 'Female'], n_customers),
'tenure_months': np.random.exponential(24, n_customers).astype(int),
'monthly_charges': np.random.normal(65, 20, n_customers),
'total_charges': np.random.normal(1500, 500, n_customers),
'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'],
n_customers, p=[0.5, 0.3, 0.2]),
'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'],
n_customers, p=[0.4, 0.4, 0.2]),
'online_security': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
'online_backup': np.random.choice(['Yes', 'No'], n_customers, p=[0.35, 0.65]),
'device_protection': np.random.choice(['Yes', 'No'], n_customers, p=[0.25, 0.75]),
'tech_support': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
'streaming_tv': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
'streaming_movies': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
'paperless_billing': np.random.choice(['Yes', 'No'], n_customers, p=[0.6, 0.4]),
'payment_method': np.random.choice([
'Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'
], n_customers, p=[0.3, 0.2, 0.25, 0.25]),
'monthly_usage_gb': np.random.gamma(2, 50, n_customers),
'customer_service_calls': np.random.poisson(1, n_customers),
'satisfaction_score': np.random.randint(1, 6, n_customers)
})
# Create realistic churn based on features
def calculate_churn_probability(row):
"""Calculate churn probability based on customer features"""
prob = 0.1 # Base probability
# Factors that increase churn probability
if row['contract_type'] == 'Month-to-month':
prob += 0.2
if row['customer_service_calls'] > 2:
prob += 0.15
if row['satisfaction_score'] < 3:
prob += 0.1
if row['tenure_months'] < 12:
prob += 0.1
if row['online_security'] == 'No':
prob += 0.05
if row['tech_support'] == 'No':
prob += 0.05
# Factors that decrease churn probability
if row['contract_type'] == 'Two year':
prob -= 0.15
if row['tenure_months'] > 24:
prob -= 0.1
if row['satisfaction_score'] > 4:
prob -= 0.08
return min(max(prob, 0.05), 0.95)
# Apply churn probability and generate churn labels
data['churn_probability'] = data.apply(calculate_churn_probability, axis=1)
data['churn'] = np.random.binomial(1, data['churn_probability'])
# Drop probability column (not available in real scenario)
data = data.drop('churn_probability', axis=1)
print("=== CUSTOMER CHURN DATASET ===")
print(f"Dataset shape: {data.shape}")
print(f"\nChurn distribution:")
print(data['churn'].value_counts())
print(f"Churn rate: {data['churn'].mean() * 100:.2f}%")
# 2. Exploratory Data Analysis (EDA)
print("\n=== EXPLORATORY DATA ANALYSIS ===")
# Basic statistics
print("\nNumeric columns statistics:")
print(data.select_dtypes(include=[np.number]).describe())
# Churn by categorical features
categorical_features = ['contract_type', 'internet_service', 'online_security',
'tech_support', 'paperless_billing', 'payment_method']
print("\nChurn rates by category:")
for feature in categorical_features:
churn_rates = data.groupby(feature)['churn'].mean().sort_values(ascending=False)
print(f"\n{feature}:")
for category, rate in churn_rates.items():
print(f" {category}: {rate * 100:.1f}%")
# 3. Feature engineering
print("\n=== FEATURE ENGINEERING ===")
# Create new features
data['tenure_years'] = data['tenure_months'] / 12
data['avg_monthly_spend'] = data['total_charges'] / data['tenure_months'].replace(0, 1)
data['has_multiple_services'] = (
(data['online_security'] == 'Yes').astype(int) +
(data['online_backup'] == 'Yes').astype(int) +
(data['device_protection'] == 'Yes').astype(int) +
(data['tech_support'] == 'Yes').astype(int) +
(data['streaming_tv'] == 'Yes').astype(int) +
(data['streaming_movies'] == 'Yes').astype(int)
)
data['high_usage'] = (data['monthly_usage_gb'] > data['monthly_usage_gb'].quantile(0.75)).astype(int)
data['frequent_service_calls'] = (data['customer_service_calls'] > 2).astype(int)
print(f"Added {len(['tenure_years', 'avg_monthly_spend', 'has_multiple_services', 'high_usage', 'frequent_service_calls'])} new features")
# 4. Prepare data for modeling
print("\n=== DATA PREPARATION ===")
# Drop customer_id (not a feature)
X = data.drop(['customer_id', 'churn'], axis=1)
y = data['churn']
# Identify feature types
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Training churn rate: {y_train.mean() * 100:.2f}%")
print(f"Test churn rate: {y_test.mean() * 100:.2f}%")
# 5. Create preprocessing pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# 6. Build and compare multiple models
print("\n=== MODEL TRAINING AND COMPARISON ===")
# Define models with initial parameters
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'),
'Random Forest': RandomForestClassifier(random_state=42, class_weight='balanced_subsample'),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder was removed in recent XGBoost versions
'LightGBM': LGBMClassifier(random_state=42, class_weight='balanced')
}
# Create pipelines with SMOTE for each model
pipelines = {}
for name, model in models.items():
pipelines[name] = ImbPipeline(steps=[
('preprocessor', preprocessor),
('smote', SMOTE(random_state=42)),
('classifier', model)
])
# Train and evaluate all models
results = []
for name, pipeline in pipelines.items():
print(f"\nTraining {name}...")
# Train model
pipeline.fit(X_train, y_train)
# Make predictions
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
# Cross-validation score
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
# Store results
results.append({
'Model': name,
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1-Score': f1,
'ROC-AUC': roc_auc,
'CV ROC-AUC Mean': cv_mean,
'CV ROC-AUC Std': cv_std
})
print(f" Accuracy: {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1-Score: {f1:.4f}")
print(f" ROC-AUC: {roc_auc:.4f}")
# Compare models
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False)
print("\n" + "="*70)
print("MODEL COMPARISON FOR CUSTOMER CHURN PREDICTION")
print("="*70)
print(results_df.to_string(index=False))
# 7. Hyperparameter tuning for best model
print("\n" + "="*70)
print("HYPERPARAMETER TUNING FOR XGBOOST (BEST MODEL)")
print("="*70)
# Define parameter grid for XGBoost
param_grid = {
'classifier__n_estimators': [100, 200, 300],
'classifier__max_depth': [3, 5, 7],
'classifier__learning_rate': [0.01, 0.05, 0.1],
'classifier__subsample': [0.8, 0.9, 1.0],
'classifier__colsample_bytree': [0.8, 0.9, 1.0]
}
# Create GridSearchCV for XGBoost pipeline
xgb_pipeline = pipelines['XGBoost']
grid_search = GridSearchCV(
estimator=xgb_pipeline,
param_grid=param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC Score: {grid_search.best_score_:.4f}")
# 8. Evaluate tuned model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
y_pred_proba_best = best_model.predict_proba(X_test)[:, 1]
print("\n" + "="*70)
print("TUNED XGBOOST PERFORMANCE ON TEST SET")
print("="*70)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba_best):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred_best)
print(cm)
# 9. Feature importance analysis
print("\n" + "="*70)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*70)
# Get feature names after preprocessing
preprocessor = best_model.named_steps['preprocessor']
xgb_model = best_model.named_steps['classifier']
# Get one-hot encoded feature names
onehot_encoder = preprocessor.named_transformers_['cat'].named_steps['onehot']
cat_feature_names = onehot_encoder.get_feature_names_out(categorical_features)
# Combine all feature names
all_feature_names = np.concatenate([numeric_features, cat_feature_names])
# Get feature importance from XGBoost
feature_importance = xgb_model.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]
# Print top 20 features
print("\nTop 20 most important features for churn prediction:")
for i in range(min(20, len(all_feature_names))):
feature_name = all_feature_names[sorted_idx[i]]
importance = feature_importance[sorted_idx[i]]
print(f"{i+1:2d}. {feature_name:30s}: {importance:.4f}")
# 10. SHAP analysis for model interpretability
print("\n" + "="*70)
print("SHAP ANALYSIS FOR MODEL INTERPRETABILITY")
print("="*70)
try:
# Prepare data for SHAP
X_test_processed = preprocessor.transform(X_test)
# Create SHAP explainer
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_processed)
# Summary plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test_processed, feature_names=all_feature_names,
max_display=20, show=False)
plt.tight_layout()
plt.savefig('shap_summary.png', dpi=300, bbox_inches='tight')
plt.show()
print("SHAP summary plot saved as 'shap_summary.png'")
except Exception as e:
print(f"SHAP analysis failed: {e}")
# 11. Business insights and recommendations
print("\n" + "="*70)
print("BUSINESS INSIGHTS AND RECOMMENDATIONS")
print("="*70)
# Calculate precision at different thresholds
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print("\nPerformance at different probability thresholds:")
for threshold in thresholds:
y_pred_thresh = (y_pred_proba_best >= threshold).astype(int)
precision_thresh = precision_score(y_test, y_pred_thresh)
recall_thresh = recall_score(y_test, y_pred_thresh)
print(f"Threshold {threshold}: Precision={precision_thresh:.3f}, Recall={recall_thresh:.3f}")
# Identify high-risk customers
high_risk_threshold = 0.7
high_risk_indices = np.where(y_pred_proba_best >= high_risk_threshold)[0]
high_risk_customers = X_test.iloc[high_risk_indices].copy()
high_risk_customers['churn_probability'] = y_pred_proba_best[high_risk_indices]
print(f"\nIdentified {len(high_risk_customers)} high-risk customers (probability >= {high_risk_threshold})")
# Analyze characteristics of high-risk customers
print("\nCharacteristics of high-risk customers:")
print(f"- Average tenure: {high_risk_customers['tenure_months'].mean():.1f} months (vs {X_test['tenure_months'].mean():.1f} overall)")
print(f"- Monthly charges: €{high_risk_customers['monthly_charges'].mean():.2f} (vs €{X_test['monthly_charges'].mean():.2f} overall)")
print(f"- Month-to-month contracts: {high_risk_customers[high_risk_customers['contract_type'] == 'Month-to-month'].shape[0] / len(high_risk_customers) * 100:.1f}%")
# 12. Save the model
import joblib
# Save the entire pipeline
joblib.dump(best_model, 'customer_churn_model.pkl')
# Save feature names
joblib.dump(all_feature_names, 'feature_names.pkl')
print("\n" + "="*70)
print("MODEL DEPLOYMENT READY")
print("="*70)
print("Model saved as 'customer_churn_model.pkl'")
print("Feature names saved as 'feature_names.pkl'")
print("\n✅ Customer Churn Prediction project complete!")
print(f"Final model achieves ROC-AUC of {roc_auc_score(y_test, y_pred_proba_best):.4f} on test data")
Conclusion and next steps
Machine Learning in Python is a powerful skill that enables you to build intelligent systems. You have now learned:
- ML core concepts: supervised vs unsupervised learning
- Data preprocessing: preparing data for ML
- Model training: training and comparing different algorithms
- Model evaluation: metrics and validation techniques
- Deep Learning: neural networks with TensorFlow/Keras
- Practical project: end-to-end customer churn prediction
Next steps:
- Experiment with your own datasets
- Learn advanced deep learning (CNNs, RNNs, Transformers)
- Deploy ML models to production with Flask/FastAPI
- Learn MLOps for model monitoring and management
- Follow our advanced ML and AI tutorials