Classification Models


Learning Objectives

After completing this recipe, you will be able to:

  • Predict churn with Logistic Regression
  • Implement Decision Trees and Random Forests
  • Improve performance with XGBoost
  • Interpret model evaluation metrics (Accuracy, Precision, Recall, F1, AUC-ROC)
  • Handle class imbalance

1. What is a Classification Problem?

Theory

Classification is a supervised learning task that assigns each observation to one of a set of predefined categories.

Business Application Examples:

Problem                      Target Variable   Business Value
Customer Churn Prediction    Churn (0/1)       Churn prevention campaigns
Purchase Prediction          Purchase (0/1)    Target marketing
Fraud Detection              Fraud (0/1)       Loss prevention
Product Recommendation       Click (0/1)       CTR improvement

2. Data Preparation

Sample Data Generation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducible results
np.random.seed(42)

# Generate sample data for customer churn prediction
n_customers = 1000

customer_features = pd.DataFrame({
    'user_id': range(1, n_customers + 1),
    'total_orders': np.random.poisson(5, n_customers),
    'total_items': np.random.poisson(15, n_customers),
    'total_spent': np.random.exponential(500, n_customers),
    'avg_order_value': np.random.exponential(100, n_customers),
    'order_span_days': np.random.randint(1, 365, n_customers),
    'unique_categories': np.random.randint(1, 10, n_customers),
    'unique_brands': np.random.randint(1, 20, n_customers),
    'days_since_last_order': np.random.exponential(60, n_customers)
})

# Churn definition: churned if no purchase for 90+ days (+ random noise)
churn_prob = 1 / (1 + np.exp(-(customer_features['days_since_last_order'] - 90) / 30))
customer_features['churned'] = (np.random.random(n_customers) < churn_prob).astype(int)

print(f"Total customers: {len(customer_features)}")
print(f"Churned customers: {customer_features['churned'].sum()}")
print(f"Churn rate: {customer_features['churned'].mean():.1%}")
Output
Total customers: 1000
Churned customers: 371
Churn rate: 37.1%

Train/Test Split

# Separate features and target
feature_cols = ['total_orders', 'total_items', 'total_spent', 'avg_order_value',
                'order_span_days', 'unique_categories', 'unique_brands']
X = customer_features[feature_cols]
y = customer_features['churned']

# Train/test split (80:20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Training set churn rate: {y_train.mean():.1%}")
print(f"Test set churn rate: {y_test.mean():.1%}")
Output
Training set: 800 samples
Test set: 200 samples
Training set churn rate: 37.1%
Test set churn rate: 37.0%

Feature Scaling

# Scaling is required for Logistic Regression, SVM, etc.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Tree-based models (RandomForest, XGBoost) don't need scaling
# and can use the original values

print("Scaling complete!")
print(f"X_train_scaled mean: {X_train_scaled.mean():.4f}")
print(f"X_train_scaled std: {X_train_scaled.std():.4f}")
Output
Scaling complete!
X_train_scaled mean: 0.0000
X_train_scaled std: 1.0000

3. Logistic Regression

Theory

Logistic Regression is a linear model that predicts probabilities using the sigmoid function.

P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}

Advantages:

  • High interpretability (coefficients = influence)
  • Low overfitting risk
  • Fast training
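
As a quick sanity check of the sigmoid formula above, here is a minimal sketch (illustrative only — the intercept and coefficients are made-up values, not ones learned from this recipe's data) showing how a linear score is squashed into a probability between 0 and 1.

import numpy as np

# A minimal sketch of the sigmoid formula above.
# beta_0 and betas are hypothetical illustration values, not learned coefficients.
beta_0 = -0.5                      # intercept
betas = np.array([0.8, -1.2])      # hypothetical coefficients for two features
x = np.array([1.5, 0.3])           # one (scaled) customer

linear_score = beta_0 + betas @ x  # beta_0 + beta_1*x1 + beta_2*x2
p_churn = 1 / (1 + np.exp(-linear_score))

print(f"Linear score: {linear_score:.3f}")
print(f"P(churn) via sigmoid: {p_churn:.3f}")  # always between 0 and 1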

Implementation

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Model training
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Prediction
y_pred_lr = lr_model.predict(X_test_scaled)
y_prob_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("=== Logistic Regression Results ===")
print(classification_report(y_test, y_pred_lr, target_names=['Retained', 'Churned']))
Output
=== Logistic Regression Results ===
              precision    recall  f1-score   support

    Retained       0.68      0.83      0.75       126
     Churned       0.60      0.39      0.47        74

    accuracy                           0.67       200
   macro avg       0.64      0.61      0.61       200
weighted avg       0.65      0.67      0.65       200

Coefficient Interpretation

import matplotlib.pyplot as plt

# Coefficients by feature (influence)
coef_df = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': lr_model.coef_[0]
})
coef_df['abs_coef'] = coef_df['coefficient'].abs()
coef_df = coef_df.sort_values('abs_coef', ascending=False)

print("Feature Importance (Coefficients):")
print(coef_df.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in coef_df['coefficient']]
plt.barh(coef_df['feature'], coef_df['coefficient'], color=colors)
plt.xlabel('Coefficient (positive: increases churn, negative: decreases churn)')
plt.title('Logistic Regression Feature Importance', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()
Output
Feature Importance (Coefficients):
          feature  coefficient  abs_coef
      total_spent    -0.428513  0.428513
     total_orders    -0.312847  0.312847
      total_items    -0.245129  0.245129
  avg_order_value    -0.189234  0.189234
  order_span_days     0.156782  0.156782
unique_categories    -0.098456  0.098456
    unique_brands    -0.067321  0.067321
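
Because the model was trained on standardized features, exponentiating a coefficient gives the odds ratio for a one-standard-deviation increase in that feature. A short follow-up sketch (assuming coef_df from the block above is still in scope):

# Odds ratios from the standardized coefficients.
# odds_ratio < 1 → the feature lowers the odds of churn; > 1 → raises them.
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])

print("Odds ratio per +1 standard deviation of each feature:")
print(coef_df[['feature', 'coefficient', 'odds_ratio']].to_string(index=False))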

4. Decision Tree

Theory

A Decision Tree makes predictions by recursively splitting the data on feature values.

Advantages:

  • Interpretable (tree visualization)
  • No scaling required
  • Learns non-linear relationships

Disadvantages:

  • Prone to overfitting (see the depth comparison sketch below)
  • Instability (sensitive to data changes)
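
To see the overfitting tendency in practice, a quick sketch (reusing the train/test split created earlier) compares training and test accuracy as max_depth grows; the exact numbers depend on your data, but training accuracy typically keeps rising while test accuracy stalls or drops.

from sklearn.tree import DecisionTreeClassifier

# Sketch: train vs. test accuracy for increasingly deep trees.
for depth in [2, 4, 6, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={str(depth):>4}  "
          f"train acc={tree.score(X_train, y_train):.3f}  "
          f"test acc={tree.score(X_test, y_test):.3f}")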

Implementation

from sklearn.tree import DecisionTreeClassifier, plot_tree

# Model training
dt_model = DecisionTreeClassifier(
    max_depth=5,           # Prevent overfitting
    min_samples_split=20,  # Minimum samples to split
    random_state=42
)
dt_model.fit(X_train, y_train)

# Prediction
y_pred_dt = dt_model.predict(X_test)
y_prob_dt = dt_model.predict_proba(X_test)[:, 1]

# Evaluation
print("=== Decision Tree Results ===")
print(classification_report(y_test, y_pred_dt, target_names=['Retained', 'Churned']))
Output
=== Decision Tree Results ===
              precision    recall  f1-score   support

    Retained       0.70      0.79      0.74       126
     Churned       0.57      0.45      0.50        74

    accuracy                           0.66       200
   macro avg       0.63      0.62      0.62       200
weighted avg       0.65      0.66      0.65       200

Tree Visualization

# Decision tree visualization
plt.figure(figsize=(20, 10))
plot_tree(
    dt_model,
    feature_names=feature_cols,
    class_names=['Retained', 'Churned'],
    filled=True,
    rounded=True,
    fontsize=10,
    max_depth=3  # Limit depth for visualization
)
plt.title('Decision Tree Visualization (up to depth 3)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
Output
[Decision Tree Visualization Output]
- Root node: total_spent <= 245.32
- Left(True): order_span_days <= 156
  - Left: total_orders <= 3.5 → Churned (gini=0.38)
  - Right: Retained (gini=0.42)
- Right(False): total_orders <= 4.5
  - Left: Churned (gini=0.35)
  - Right: Retained (gini=0.28)

5. Random Forest

Theory

Random Forest ensembles multiple decision trees to make predictions.

How it works:

  1. Create multiple datasets through bootstrap sampling
  2. Train a decision tree on each dataset
  3. Combine all tree predictions by majority vote (see the sketch below)
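
The sketch below imitates these three steps by hand with five small trees. It is purely illustrative — RandomForestClassifier does the same thing internally and additionally samples a random subset of features at each split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hand-rolled mini "forest": bootstrap sample → fit a tree → majority vote.
rng = np.random.default_rng(42)
n_trees = 5
tree_preds = []

for _ in range(n_trees):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap indices
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    tree_preds.append(tree.predict(X_test))

# Majority vote across the five trees for each test customer
votes = np.mean(tree_preds, axis=0)
y_pred_mini_forest = (votes >= 0.5).astype(int)
print(f"Mini-forest test accuracy: {(y_pred_mini_forest == y_test).mean():.3f}")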

Implementation

from sklearn.ensemble import RandomForestClassifier

# Model training
rf_model = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Maximum depth
    min_samples_split=10,
    random_state=42,
    n_jobs=-1              # Parallel processing
)
rf_model.fit(X_train, y_train)

# Prediction
y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]

# Evaluation
print("=== Random Forest Results ===")
print(classification_report(y_test, y_pred_rf, target_names=['Retained', 'Churned']))
Output
=== Random Forest Results ===
              precision    recall  f1-score   support

    Retained       0.72      0.84      0.78       126
     Churned       0.64      0.47      0.54        74

    accuracy                           0.70       200
   macro avg       0.68      0.66      0.66       200
weighted avg       0.69      0.70      0.69       200

Feature Importance

# Feature importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(importance_df.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'], color='steelblue')
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

[Chart: Random Forest Feature Importance]

Feature importance analysis shows that total_spent and total_items have the greatest impact on churn prediction.


6. XGBoost

Theory

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of gradient boosting.

Advantages:

  • High prediction performance
  • Regularization to prevent overfitting (see the early-stopping sketch below)
  • Automatic handling of missing values
  • Parallel processing support
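
One practical way to use this overfitting protection is early stopping on a held-out validation set: boosting stops once the validation metric stops improving. A hedged sketch — it assumes a recent xgboost version (1.6+) where early_stopping_rounds is a constructor argument; older versions take it as a fit() keyword instead.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a validation set from the training data for early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

xgb_es = XGBClassifier(
    n_estimators=500,           # upper bound; early stopping picks the actual count
    learning_rate=0.1,
    max_depth=6,
    eval_metric='logloss',
    early_stopping_rounds=20,   # assumes xgboost >= 1.6 (else pass to fit())
    random_state=42
)
xgb_es.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

print(f"Best iteration: {xgb_es.best_iteration}")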

Implementation

from xgboost import XGBClassifier

# Model training
xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,         # Row sampling
    colsample_bytree=0.8,  # Column sampling
    random_state=42,
    eval_metric='logloss'
)
xgb_model.fit(X_train, y_train)

# Prediction
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Evaluation
print("=== XGBoost Results ===")
print(classification_report(y_test, y_pred_xgb, target_names=['Retained', 'Churned']))
Output
=== XGBoost Results ===
              precision    recall  f1-score   support

    Retained       0.74      0.83      0.78       126
     Churned       0.65      0.53      0.58        74

    accuracy                           0.72       200
   macro avg       0.70      0.68      0.68       200
weighted avg       0.71      0.72      0.71       200

7. Model Evaluation

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Confusion matrix visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

models = [
    ('Logistic Regression', y_pred_lr),
    ('Random Forest', y_pred_rf),
    ('XGBoost', y_pred_xgb)
]

for ax, (name, y_pred) in zip(axes, models):
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=['Retained', 'Churned'])
    disp.plot(ax=ax, cmap='Blues', values_format='d')
    ax.set_title(name)

plt.tight_layout()
plt.show()

[Chart: Confusion Matrix Comparison]

In the confusion matrix, the diagonal (top-left to bottom-right) represents correct predictions. XGBoost correctly predicted the most churned customers (bottom-right).
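
If you prefer raw numbers to the chart, a short sketch prints the four cells of the XGBoost confusion matrix with their usual names, using sklearn's convention of rows = actual class, columns = predicted class:

# Label the four confusion-matrix cells for the XGBoost predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_xgb).ravel()
print(f"True Negatives  (retained, predicted retained): {tn}")
print(f"False Positives (retained, predicted churned):  {fp}")
print(f"False Negatives (churned, predicted retained):  {fn}")
print(f"True Positives  (churned, predicted churned):   {tp}")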

ROC Curve

from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(10, 8))

# ROC curve for each model
for name, y_prob in [('Logistic Regression', y_prob_lr),
                     ('Random Forest', y_prob_rf),
                     ('XGBoost', y_prob_xgb)]:
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.3f})')

# Baseline (random prediction)
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.500)')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

[Chart: ROC Curve Comparison]

The closer the ROC curve is to the top-left corner, the better the model. As a rough rule of thumb, an AUC of 0.7 or higher is considered acceptable performance.
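
Because churned customers are the minority class here, a Precision-Recall curve is a useful complement to ROC. A minimal sketch for the XGBoost probabilities (note the PR baseline is the positive-class rate, not 0.5):

from sklearn.metrics import precision_recall_curve, average_precision_score

# Precision-Recall curve for the XGBoost model
precision, recall, _ = precision_recall_curve(y_test, y_prob_xgb)
ap = average_precision_score(y_test, y_prob_xgb)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, linewidth=2, label=f'XGBoost (AP={ap:.3f})')
plt.axhline(y=y_test.mean(), color='k', linestyle='--', linewidth=1,
            label=f'Baseline (churn rate={y_test.mean():.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (XGBoost)')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()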

Evaluation Metrics Summary

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model performance comparison
results = []
for name, y_pred, y_prob in [('Logistic Regression', y_pred_lr, y_prob_lr),
                             ('Random Forest', y_pred_rf, y_prob_rf),
                             ('XGBoost', y_pred_xgb, y_prob_xgb)]:
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC': roc_auc_score(y_test, y_prob)
    })

results_df = pd.DataFrame(results).round(3)
print("=== Model Performance Comparison ===")
print(results_df.to_string(index=False))
Output
=== Model Performance Comparison ===
              Model  Accuracy  Precision  Recall     F1    AUC
Logistic Regression     0.670      0.580   0.392  0.468  0.687
      Random Forest     0.705      0.636   0.473  0.543  0.724
            XGBoost     0.720      0.650   0.527  0.582  0.738
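
A single 80:20 split can be noisy, so scores like these shift from run to run. As a more stable check, the sketch below estimates 5-fold cross-validated AUC on the training data; it refits fresh models, so the numbers may differ slightly from the table above.

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 5-fold cross-validated AUC. Logistic regression is wrapped in a pipeline
# so scaling is refit inside each fold (no leakage across folds).
candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000, random_state=42)),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10,
                                            min_samples_split=10, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1,
                             eval_metric='logloss', random_state=42)
}

for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name}: mean AUC={scores.mean():.3f} (+/- {scores.std():.3f})")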

8. Handling Class Imbalance

Problem

In real churn prediction, churned customers often make up only 10-20% of the data. With such imbalanced data, a model optimized for accuracy tends to predict mostly the majority class, as the baseline sketch below illustrates.
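
The sketch below scores a DummyClassifier that always predicts the majority class: its accuracy can look respectable while its recall for churners is zero. (With this recipe's roughly 37% churn rate the effect is milder than in a 10-20% scenario, but the pattern is the same.)

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Baseline that always predicts the majority class ("retained").
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

print(f"Baseline accuracy:         {accuracy_score(y_test, y_pred_dummy):.3f}")
print(f"Baseline recall (churned): {recall_score(y_test, y_pred_dummy):.3f}")  # 0.0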

Solutions

# Method 1: class_weight adjustment
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Weight the minority class more heavily
    random_state=42
)
rf_balanced.fit(X_train, y_train)
y_pred_balanced = rf_balanced.predict(X_test)

print("=== class_weight='balanced' Results ===")
print(f"Original recall: {recall_score(y_test, y_pred_rf):.3f}")
print(f"Balanced recall: {recall_score(y_test, y_pred_balanced):.3f}")

# Method 2: Threshold adjustment
threshold = 0.3  # Lower from the default 0.5
y_pred_adjusted = (y_prob_xgb >= threshold).astype(int)

print(f"\n=== Threshold Adjustment (0.5 → 0.3) ===")
print(f"Original recall: {recall_score(y_test, y_pred_xgb):.3f}")
print(f"Adjusted recall: {recall_score(y_test, y_pred_adjusted):.3f}")
print(f"Original precision: {precision_score(y_test, y_pred_xgb):.3f}")
print(f"Adjusted precision: {precision_score(y_test, y_pred_adjusted):.3f}")
Output
=== class_weight='balanced' Results ===
Original recall: 0.473
Balanced recall: 0.568

=== Threshold Adjustment (0.5 → 0.3) ===
Original recall: 0.527
Adjusted recall: 0.716
Original precision: 0.650
Adjusted precision: 0.485
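
A third option, referenced in the quiz below, is to oversample the minority class with SMOTE. This needs the separate imbalanced-learn package (pip install imbalanced-learn); the sketch resamples only the training set and then refits a random forest.

# Method 3: SMOTE oversampling (requires imbalanced-learn)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print(f"Before SMOTE: {pd.Series(y_train).value_counts().to_dict()}")
print(f"After SMOTE:  {pd.Series(y_train_sm).value_counts().to_dict()}")

rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_sm, y_train_sm)
print(f"SMOTE recall: {recall_score(y_test, rf_smote.predict(X_test)):.3f}")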

Quiz 1: Evaluation Metrics Interpretation

Problem

When the churn prediction model has the following results, which metric should be prioritized?

Metric     Value
Accuracy   0.92
Precision  0.75
Recall     0.45
AUC        0.82

Answer

Recall should be prioritized.

  • Recall 0.45 = Only 45% of actual churned customers detected
  • 55% of churned customers are missed (False Negative)
  • Limits the effectiveness of churn prevention campaigns

Improvement methods:

  1. Lower threshold from 0.5 to 0.3
  2. Use class_weight='balanced'
  3. Oversample with SMOTE

From a business perspective: Cost of missing churned customers > Cost of campaigns to non-churned customers


Quiz 2: Model Selection

Problem

Which model should you choose in the following situation?

  1. Model interpretation is important, need to explain which features affect churn
  2. Data is small, less than 1,000 samples

Answer

Choose Logistic Regression.

Reasons:

  1. Interpretability: Coefficients directly show each feature’s influence

    • Positive coefficient: Increases churn probability
    • Negative coefficient: Decreases churn probability
  2. Data size: Simple models are more stable with small data

    • XGBoost needs more data to show its advantages
    • Lower overfitting risk
  3. Business explanation: You can tell executives something like "If total_spent increases by 100, churn probability decreases by about 5%."


Summary

Model Selection Guide

Situation                                        Recommended Model
Interpretation needed, small data                Logistic Regression
Non-linear relationships, interpretation needed  Decision Tree
High performance, large data                     XGBoost
Balanced performance                             Random Forest

Evaluation Metric Selection

Situation                                    Priority Metric
High False Positive cost (spam filter)       Precision
High False Negative cost (churn prediction)  Recall
Balanced evaluation                          F1 Score
Overall classification ability               AUC-ROC

Next Steps

You’ve mastered classification models! Next, learn regression techniques for CLV prediction and sales forecasting in Regression Prediction.
