Machine Learning
Learn the machine learning techniques most frequently used in practice, with a focus on solving business problems such as customer segmentation, churn prediction, and sales forecasting.
Machine Learning vs Statistics
| Perspective | Statistics | Machine Learning |
|---|---|---|
| Purpose | Inference, hypothesis testing | Prediction, pattern discovery |
| Interpretation | Model interpretation focused | Prediction performance focused |
| Data | Valid even with small samples | Requires large-scale data |
| Approach | Assumption-based | Data-driven |
Curriculum
1. Clustering
Intermediate · Perform customer segmentation using unsupervised learning; a short code sketch follows this section.
- K-Means Clustering
- Determining optimal cluster count (Elbow, Silhouette)
- RFM-based customer segmentation
- Cluster profiling
- DBSCAN (Density-based clustering)
Get Started with Clustering →
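A minimal sketch of this flow, assuming a small hypothetical RFM table (the column names and values below are made up for illustration): scale the features, compare cluster counts with inertia and silhouette scores, then profile the resulting segments.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical RFM table; in practice this is aggregated from transaction data
rfm = pd.DataFrame({
    "recency":   [5, 40, 200, 3, 150, 30],
    "frequency": [20, 5, 1, 25, 2, 8],
    "monetary":  [500, 120, 30, 650, 40, 200],
})

# Scale features so no single RFM dimension dominates the distance metric
X = StandardScaler().fit_transform(rfm)

# Compare cluster counts with inertia (elbow) and silhouette score
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# Cluster profiling: average RFM values per segment
rfm["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(rfm.groupby("cluster").mean())
```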
2. Classification Models
Intermediate/Advanced · Solve classification problems such as customer churn prediction and purchase prediction; see the example after this section.
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
- Model Evaluation: Accuracy, Precision, Recall, F1, AUC-ROC
Get Started with Classification Models →
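A minimal sketch of a churn-style classification run; the data here is synthetic (`make_classification`) and stands in for real customer features and a churn flag.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Synthetic, imbalanced data: ~20% positives playing the role of churners
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# Precision, recall, and F1 per class plus overall accuracy
print(classification_report(y_test, model.predict(X_test)))
# AUC-ROC uses predicted probabilities, not hard labels
print("AUC-ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```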
3. Regression Prediction
Intermediate/Advanced · Predict continuous values such as Customer Lifetime Value (CLV) and sales; see the example after this section.
- Linear Regression
- Ridge, Lasso Regression
- Random Forest Regression
- XGBoost Regression
- Model Evaluation: MAE, RMSE, R²
Get Started with Regression Prediction →
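A minimal sketch comparing a linear and a tree-based regressor on synthetic data standing in for CLV or sales features, evaluated with MAE, RMSE, and R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic regression data in place of real CLV/sales features
X, y = make_regression(n_samples=1500, n_features=15, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

for name, model in [("Ridge", Ridge(alpha=1.0)),
                    ("RandomForest", RandomForestRegressor(random_state=42))]:
    pred = model.fit(X_train, y_train).predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE from MSE
    r2 = r2_score(y_test, pred)
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f} R²={r2:.3f}")
```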
4. Time Series Forecasting
Advanced · Forecast time series data such as sales and demand; a Prophet sketch follows this section.
- Basic Prophet usage
- Trend and seasonality modeling
- Holiday effect incorporation
- Anomaly detection
- Multi-time series forecasting
Get Started with Time Series Forecasting →
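A minimal Prophet sketch on a synthetic daily sales series; Prophet expects a DataFrame with `ds` (date) and `y` (value) columns, and the country used for holidays here is just an example.

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Two years of synthetic daily sales: trend + weekly seasonality + noise
dates = pd.date_range("2022-01-01", periods=730, freq="D")
sales = (100 + 0.05 * np.arange(730)
         + 10 * np.sin(2 * np.pi * np.arange(730) / 7)
         + np.random.default_rng(42).normal(0, 3, 730))
df = pd.DataFrame({"ds": dates, "y": sales})

# Yearly/weekly seasonality are handled by default; add holiday effects per country
m = Prophet()
m.add_country_holidays(country_name="US")
m.fit(df)

future = m.make_future_dataframe(periods=90)  # forecast 90 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```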
5. Recommendation Systems
Advanced · Implement product recommendation algorithms; a collaborative filtering sketch follows the list below.
- Collaborative Filtering
- Content-Based Filtering
- Hybrid Recommendation
- Evaluation Metrics: Precision@K, Recall@K, NDCG
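A minimal item-based collaborative filtering sketch on a tiny hypothetical rating matrix; production systems use sparse matrices, implicit feedback, and far more data.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item ratings (0 = not rated)
ratings = pd.DataFrame(
    [[5, 4, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [0, 1, 5, 4]],
    index=["user1", "user2", "user3", "user4"],
    columns=["itemA", "itemB", "itemC", "itemD"],
)

# Item-item similarity computed from the columns of the rating matrix
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

# Score unseen items for a user as a similarity-weighted sum of their ratings
user = ratings.loc["user2"]
scores = item_sim.dot(user)
scores = scores[user == 0]  # keep only items the user has not rated yet
print(scores.sort_values(ascending=False))
```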
ML Workflow
1. Problem Definition
   └─ Transform business goals into ML problems
      ↓
2. Data Collection and Exploration (EDA)
   └─ Understand data, check quality
      ↓
3. Feature Engineering
   └─ Create new features, transformations
      ↓
4. Model Training
   ├─ Train/validation data split
   └─ Compare multiple models
      ↓
5. Model Evaluation
   └─ Measure performance on test data
      ↓
6. Deployment and Monitoring
   ├─ Apply to production environment
   └─ Performance monitoring, retraining
Key Libraries
# Data preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
# Models
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# Evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)
# Time series
from prophet import Prophet  # requires a separate install: pip install prophet
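As a rough illustration of how these libraries map onto the workflow above (the dataset is synthetic and the model choices are examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Split: hold out a test set that is only touched at the very end
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Compare candidate models with cross-validation on the training split
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(name, "CV AUC:", round(cv_auc.mean(), 3))

# Fit the chosen model and measure final performance on the held-out test data
best = candidates["rf"].fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))
```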
Model Selection Guide
When data is small (< 1,000 samples)
- Logistic Regression, Decision Tree
- Watch out for overfitting; cross-validation is essential
When data is large (> 10,000 samples)
- Random Forest, XGBoost
- Hyperparameter tuning is important (a tuning sketch follows this guide)
When interpretation is important
- Logistic Regression, Decision Tree
- Feature importance analysis
When prediction performance is important
- XGBoost, LightGBM
- Ensemble techniques
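A minimal hyperparameter-tuning sketch for the larger-data case; the grid values are illustrative only, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a larger dataset
X, y = make_classification(n_samples=5000, n_features=30, random_state=42)

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
# Exhaustive search over the grid with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```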
Practical Tips
- Data Leakage: Be careful not to include future information in training
- Class Imbalance: Few churners in churn prediction → handle with SMOTE or class-weight adjustment
- Overfitting: Good training performance but poor performance on unseen data → cross-validation is essential
- Feature Scaling: Models other than tree-based ones need normalization/standardization (see the sketch below)
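A minimal sketch addressing two of these tips: class-weight adjustment for imbalance, and scaling inside a Pipeline so preprocessing is fit only on the training split and cannot leak test information.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Imbalanced synthetic target: ~5% positives, like a churn flag
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" reweights the minority class; SMOTE (imbalanced-learn)
# is an alternative that oversamples it instead.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```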