Skip to Content

Clustering

IntermediateAdvanced

Learning Objectives

After completing this recipe, you will be able to:

  • Perform K-Means Clustering
  • Determine optimal cluster count (Elbow, Silhouette)
  • RFM-based customer segmentation
  • Cluster profiling and visualization

1. What is RFM Analysis?

Theory

RFM is a marketing analysis technique for evaluating customer value.

  • R (Recency): How recently did they make a purchase?
  • F (Frequency): How often do they purchase?
  • M (Monetary): How much have they spent in total?

Business Applications

SegmentCharacteristicsMarketing Strategy
ChampionsRecent purchase, frequent, high spendingVIP benefits, priority new product announcements
LoyalFrequent purchasesLoyalty programs, upselling
NewRecent first purchaseOnboarding, retention campaigns
At RiskPreviously active, no recent activityReactivation campaigns, discounts
LostPurchased long ago, rarelyWin-back campaigns, surveys

2. Preparing RFM Data with SQL

Recency Calculation

-- Days since last purchase for each customer SELECT u.id as user_id, u.email, MAX(o.created_at) as last_purchase_date, DATE_DIFF(CURRENT_DATE(), DATE(MAX(o.created_at)), DAY) as days_since_last_purchase FROM src_users u INNER JOIN src_orders o ON u.id = o.user_id WHERE o.status = 'Complete' GROUP BY u.id, u.email ORDER BY days_since_last_purchase LIMIT 50;

Frequency Calculation

-- Total order count per customer SELECT u.id as user_id, u.email, COUNT(DISTINCT o.order_id) as total_orders, COUNT(oi.id) as total_items FROM src_users u INNER JOIN src_orders o ON u.id = o.user_id INNER JOIN src_order_items oi ON o.order_id = oi.order_id WHERE o.status = 'Complete' GROUP BY u.id, u.email ORDER BY total_orders DESC LIMIT 100;

Monetary Calculation

-- Total spending per customer SELECT u.id as user_id, u.email, ROUND(SUM(oi.sale_price), 2) as total_spent, ROUND(AVG(oi.sale_price), 2) as avg_item_price FROM src_users u INNER JOIN src_orders o ON u.id = o.user_id INNER JOIN src_order_items oi ON o.order_id = oi.order_id WHERE o.status = 'Complete' GROUP BY u.id, u.email ORDER BY total_spent DESC LIMIT 100;

Complete RFM Query

-- RFM metrics and scores per customer WITH rfm_calc AS ( SELECT u.id as user_id, u.email, u.state, -- Recency DATE_DIFF(CURRENT_DATE(), DATE(MAX(o.created_at)), DAY) as recency, -- Frequency COUNT(DISTINCT o.order_id) as frequency, -- Monetary ROUND(SUM(oi.sale_price), 2) as monetary FROM src_users u INNER JOIN src_orders o ON u.id = o.user_id INNER JOIN src_order_items oi ON o.order_id = oi.order_id WHERE o.status = 'Complete' GROUP BY u.id, u.email, u.state ), rfm_score AS ( SELECT *, -- R score (lower is better - recent purchase) NTILE(5) OVER (ORDER BY recency DESC) as r_score, -- F score (higher is better - frequent purchases) NTILE(5) OVER (ORDER BY frequency ASC) as f_score, -- M score (higher is better - high spending) NTILE(5) OVER (ORDER BY monetary ASC) as m_score FROM rfm_calc ) SELECT user_id, email, state, recency, frequency, monetary, r_score, f_score, m_score, (r_score + f_score + m_score) as rfm_total_score, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 3 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers' WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END as customer_segment FROM rfm_score ORDER BY rfm_total_score DESC;

3. K-Means Clustering

Theory

K-Means is an unsupervised learning algorithm that classifies data into K clusters.

Algorithm steps:

  1. Randomly initialize K centroids
  2. Assign each data point to the nearest centroid
  3. Calculate new centroids for each cluster
  4. Repeat steps 2-3 until centroids don’t change

Sample RFM Data Generation

import pandas as pd import numpy as np from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans import matplotlib.pyplot as plt import seaborn as sns import warnings warnings.filterwarnings('ignore') # Set seed for reproducible results np.random.seed(42) # Generate RFM sample data n_customers = 500 rfm_df = pd.DataFrame({ 'user_id': range(1, n_customers + 1), 'recency': np.random.exponential(60, n_customers).astype(int) + 1, # Days since last purchase 'frequency': np.random.poisson(5, n_customers) + 1, # Purchase count 'monetary': np.random.exponential(500, n_customers) + 50 # Total spending }) # Limit outliers rfm_df['recency'] = rfm_df['recency'].clip(1, 365) rfm_df['frequency'] = rfm_df['frequency'].clip(1, 30) rfm_df['monetary'] = rfm_df['monetary'].clip(50, 5000) print(f"Number of customers: {len(rfm_df)}") print("\nRFM Data Summary:") print(rfm_df[['recency', 'frequency', 'monetary']].describe().round(2))
실행 ź²°ź³¼
Number of customers: 500

RFM Data Summary:
      recency  frequency   monetary
count    500.00     500.00     500.00
mean      57.82       5.56     498.23
std       49.15       2.43     421.87
min        1.00       1.00      50.00
25%       21.00       4.00     182.34
50%       45.00       5.00     367.89
75%       79.00       7.00     654.21
max      295.00      15.00    2876.45

Feature Scaling

# Scaling is essential before clustering! scaler = StandardScaler() rfm_scaled = scaler.fit_transform(rfm_df[['recency', 'frequency', 'monetary']]) print("Mean after scaling:", rfm_scaled.mean(axis=0).round(4)) print("Std after scaling:", rfm_scaled.std(axis=0).round(4))
실행 ź²°ź³¼
Mean after scaling: [ 0.     0.    -0.   ]
Std after scaling: [1. 1. 1.]
āš ļø
Why Scaling is Important

K-Means is a distance-based algorithm, so variables with different scales will have variables with larger values dominating the clustering. If Monetary is in thousands while Frequency is single digits, clusters will be determined primarily by Monetary.


4. Determining Optimal Cluster Count

Elbow Method

# Elbow Method - Inertia by cluster count inertias = [] K_range = range(2, 11) for k in K_range: kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) kmeans.fit(rfm_scaled) inertias.append(kmeans.inertia_) # Visualization plt.figure(figsize=(10, 5)) plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8) plt.xlabel('Number of Clusters (K)', fontsize=12) plt.ylabel('Inertia (Within-cluster Sum of Squares)', fontsize=12) plt.title('Finding Optimal K with Elbow Method', fontsize=14, fontweight='bold') plt.grid(True, alpha=0.3) plt.tight_layout() plt.show() # Calculate decrease rate print("\nInertia Decrease Rate by Cluster Count:") for i in range(1, len(inertias)): decrease = (inertias[i-1] - inertias[i]) / inertias[i-1] * 100 print(f"K={i+2}: Inertia = {inertias[i]:.1f}, Decrease rate {decrease:.1f}%")

Elbow Method

The ā€œelbowā€ appears around K=4 or K=5, after which the decrease rate drops significantly.

Silhouette Score

from sklearn.metrics import silhouette_score, silhouette_samples # Silhouette Score - higher is better (-1 ~ 1) silhouette_scores = [] for k in K_range: kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) labels = kmeans.fit_predict(rfm_scaled) score = silhouette_score(rfm_scaled, labels) silhouette_scores.append(score) print(f"K={k}: Silhouette Score = {score:.3f}") # Visualization plt.figure(figsize=(10, 5)) plt.bar(K_range, silhouette_scores, color='steelblue', edgecolor='black') plt.xlabel('Number of Clusters (K)', fontsize=12) plt.ylabel('Silhouette Score', fontsize=12) plt.title('Finding Optimal K with Silhouette Score', fontsize=14, fontweight='bold') plt.axhline(y=max(silhouette_scores), color='red', linestyle='--', alpha=0.7) plt.tight_layout() plt.show() print(f"\nOptimal K: {K_range[np.argmax(silhouette_scores)]}")

Silhouette Score

Choose K with the highest Silhouette Score. However, business interpretability should also be considered.


5. Performing Clustering

Applying K-Means

# Choose 4 clusters from a business perspective (ease of interpretation) optimal_k = 4 kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10) rfm_df['cluster'] = kmeans.fit_predict(rfm_scaled) print(f"Customers per cluster:") print(rfm_df['cluster'].value_counts().sort_index())
실행 ź²°ź³¼
Customers per cluster:
0    142
1    118
2    134
3    106
Name: cluster, dtype: int64

Cluster Profiling

# RFM averages by cluster cluster_profile = rfm_df.groupby('cluster').agg( Customer_Count=('user_id', 'count'), Avg_Recency=('recency', 'mean'), Avg_Frequency=('frequency', 'mean'), Avg_Monetary=('monetary', 'mean'), Total_Revenue=('monetary', 'sum') ).round(2) # Add percentages cluster_profile['Customer_Pct'] = (cluster_profile['Customer_Count'] / cluster_profile['Customer_Count'].sum() * 100).round(1) cluster_profile['Revenue_Pct'] = (cluster_profile['Total_Revenue'] / cluster_profile['Total_Revenue'].sum() * 100).round(1) print("=== Cluster Profile ===") print(cluster_profile)
실행 ź²°ź³¼
=== Cluster Profile ===
       Customer_Count  Avg_Recency  Avg_Frequency  Avg_Monetary  Total_Revenue  Customer_Pct  Revenue_Pct
cluster
0                   142        23.45           7.82        812.34      115352.28          28.4         46.3
1                   118       102.67           3.21        245.67       28989.06          23.6         11.6
2                   134        45.23           5.45        456.78       61208.52          26.8         24.6
3                   106        89.12           4.12        412.34       43708.04          21.2         17.5

Segment Naming

# Assign names based on cluster characteristics def name_cluster(row): r, f, m = row['Avg_Recency'], row['Avg_Frequency'], row['Avg_Monetary'] if r < 50 and f > 6 and m > 600: return 'Champions' elif r < 60 and f > 4: return 'Loyal Customers' elif r > 80 and f < 4: return 'At Risk' else: return 'Potential' cluster_profile['Segment'] = cluster_profile.apply(name_cluster, axis=1) print("\n=== Segment Naming Results ===") print(cluster_profile[['Segment', 'Customer_Count', 'Customer_Pct', 'Revenue_Pct']])

Segment Distribution

Key Insight: Champions are 28% of customers but account for 46% of revenue. Reactivation campaigns are needed for At Risk customers.


6. Cluster Visualization

3D Scatter Plot

from mpl_toolkits.mplot3d import Axes3D fig = plt.figure(figsize=(12, 8)) ax = fig.add_subplot(111, projection='3d') colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'] segment_names = cluster_profile['Segment'].values for cluster in range(optimal_k): mask = rfm_df['cluster'] == cluster ax.scatter( rfm_df.loc[mask, 'recency'], rfm_df.loc[mask, 'frequency'], rfm_df.loc[mask, 'monetary'], c=colors[cluster], label=f'{segment_names[cluster]} (n={mask.sum()})', alpha=0.6, s=50 ) ax.set_xlabel('Recency (days)') ax.set_ylabel('Frequency (count)') ax.set_zlabel('Monetary ($)') ax.set_title('RFM Cluster 3D Visualization', fontsize=14, fontweight='bold') ax.legend() plt.tight_layout() plt.show()

Cluster Scatter

Each segment is clearly distinguished:

  • Champions (red): Low Recency, High Monetary
  • At Risk (teal): High Recency, Low Monetary
  • Loyal Customers (blue): Moderate levels
  • Potential (orange): Dispersed

Radar Chart (Spider Chart)

from math import pi # Use normalized values cluster_means = rfm_df.groupby('cluster')[['recency', 'frequency', 'monetary']].mean() cluster_means_norm = (cluster_means - cluster_means.min()) / (cluster_means.max() - cluster_means.min()) # Radar chart categories = ['Recency\n(lower is better)', 'Frequency', 'Monetary'] N = len(categories) fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(polar=True)) angles = [n / float(N) * 2 * pi for n in range(N)] angles += angles[:1] for idx, cluster in enumerate(cluster_means_norm.index): values = cluster_means_norm.loc[cluster].values.tolist() values += values[:1] ax.plot(angles, values, 'o-', linewidth=2, label=f'{segment_names[idx]}', color=colors[idx]) ax.fill(angles, values, alpha=0.25, color=colors[idx]) ax.set_xticks(angles[:-1]) ax.set_xticklabels(categories, fontsize=12) ax.set_title('RFM Profile by Cluster', fontsize=14, fontweight='bold', pad=20) ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0)) plt.tight_layout() plt.show()

Radar Chart

The radar chart allows you to compare the RFM profile of each segment at a glance.

Heatmap

# Cluster x RFM metrics heatmap plt.figure(figsize=(10, 6)) heatmap_data = cluster_means.T # Rows: RFM, Columns: Clusters heatmap_data.columns = segment_names sns.heatmap( heatmap_data, annot=True, fmt='.1f', cmap='YlOrRd', cbar_kws={'label': 'Average Value'}, linewidths=0.5 ) plt.title('RFM Averages by Cluster', fontsize=14, fontweight='bold') plt.xlabel('Segment') plt.ylabel('RFM Metric') plt.tight_layout() plt.show()

Cluster Heatmap

In the heatmap, darker colors indicate higher values. You can confirm that Champions have the highest Monetary and At Risk has the highest Recency.


Quiz 1: RFM SQL Analysis

Problem

Write SQL to calculate customer count and total revenue by segment.

View Answer

-- Customer count and revenue by RFM segment WITH rfm_calc AS ( SELECT u.id as user_id, DATE_DIFF(CURRENT_DATE(), DATE(MAX(o.created_at)), DAY) as recency, COUNT(DISTINCT o.order_id) as frequency, ROUND(SUM(oi.sale_price), 2) as monetary FROM src_users u INNER JOIN src_orders o ON u.id = o.user_id INNER JOIN src_order_items oi ON o.order_id = oi.order_id WHERE o.status = 'Complete' GROUP BY u.id ), rfm_score AS ( SELECT *, NTILE(5) OVER (ORDER BY recency DESC) as r_score, NTILE(5) OVER (ORDER BY frequency ASC) as f_score, NTILE(5) OVER (ORDER BY monetary ASC) as m_score FROM rfm_calc ), segments AS ( SELECT *, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 3 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers' WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END as customer_segment FROM rfm_score ) SELECT customer_segment, COUNT(*) as customer_count, ROUND(AVG(recency), 1) as avg_recency, ROUND(AVG(frequency), 1) as avg_frequency, ROUND(AVG(monetary), 2) as avg_monetary, ROUND(SUM(monetary), 2) as total_revenue, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage FROM segments GROUP BY customer_segment ORDER BY total_revenue DESC;

Quiz 2: Determining Optimal K

Problem

What is the optimal K given the following Silhouette Score results?

KSilhouette Score
20.45
30.52
40.48
50.41

View Answer

Optimal K = 3

Choose K with the highest Silhouette Score.

  • K=3 has the highest at 0.52
  • A score above 0.5 indicates good clustering

However, if K=4 or K=5 creates more meaningful segments from a business perspective, those could be chosen. The statistical optimum and business optimum may differ.


Summary

Clustering Checklist

  1. Feature selection and outlier treatment
  2. Scaling with StandardScaler
  3. Identify K candidates with Elbow Method
  4. Confirm optimal K with Silhouette Score
  5. Cluster profiling
  6. Assign business meaning (segment naming)
  7. Validate results with visualization

K-Means vs DBSCAN

CharacteristicK-MeansDBSCAN
Cluster countMust be specifiedAutomatically determined
Cluster shapeSphericalArbitrary shape
Outlier handlingSensitiveClassified as noise
Best forUniform-sized clustersDensity-based clusters

Next Steps

You’ve mastered clustering! Next, learn supervised learning classification techniques for customer churn prediction and purchase prediction in Classification Models.

Last updated on

šŸ¤–AI ėŖØģ˜ė©“ģ ‘ģ‹¤ģ „ģ²˜ėŸ¼ ģ—°ģŠµķ•˜źø°