05. Customer Segmentation
1. Overview and Scenario
Situation: Providing the same CS service to all customers is increasing costs while satisfaction stays flat. The CMO suggests:
“Shouldn’t we respond faster to VIP customers, and give coupons to customers who seem likely to churn?”
But we don’t know who’s a VIP and who’s at risk of churning. In this chapter, we’ll use RFM Analysis to segment customers and develop strategies for each group.
2. Data Preparation
We need order data (src_orders) and order details (src_order_items).
We’ll connect these with CS ticket data (cs_tickets_dummy) for analysis.
BigQuery (SQL)
# BigQuery connection setup
from google.cloud import bigquery
client = bigquery.Client()3. Basic Segmentation: By Ticket Frequency
The easiest method is to look at “How often do they contact us?”
❓ Problem 1: Group by Ticket Frequency
Q. Calculate total tickets per customer and group them by the following criteria:
- 1 time: 1 inquiry
- 2-3 times: 2-3 inquiries
- 4+ times: Frequent inquiries
BigQuery (SQL)
Hint: Use CASE WHEN to create groups.
View Solution
WITH user_tickets AS (
SELECT user_id, COUNT(*) as ticket_count
FROM `your-project-id.retail_analytics_us.cs_tickets_dummy`
GROUP BY user_id
)
SELECT
CASE
WHEN ticket_count = 1 THEN '1 inquiry'
WHEN ticket_count BETWEEN 2 AND 3 THEN '2-3 inquiries'
ELSE '4+ (frequent)'
END as frequency_segment,
COUNT(*) as user_count
FROM user_tickets
GROUP BY 1
ORDER BY user_count DESC;4. RFM Analysis (Recency, Frequency, Monetary)
This is the most commonly used marketing segmentation technique.
- Recency: How recently did they purchase?
- Frequency: How often did they purchase?
- Monetary: How much did they spend?
❓ Problem 2: Calculate RFM Metrics per Customer
Q. Calculate each user’s Recency (days since last purchase), Frequency (total orders), and Monetary (total spend).
BigQuery (SQL)
Hint: Use DATE_DIFF and CURRENT_DATE() to calculate days elapsed.
View Solution
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(DATE(o.created_at)), DAY) as recency,
COUNT(DISTINCT o.order_id) as frequency,
ROUND(SUM(oi.sale_price), 2) as monetary
FROM `your-project-id.retail_analytics_us.src_orders` o
JOIN `your-project-id.retail_analytics_us.src_order_items` oi
ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
GROUP BY o.user_id
LIMIT 10;❓ Problem 3: Assign RFM Scores and Classify Segments
Q. Use NTILE(5) function to divide each metric into 1-5 scores and define segments with the following rules:
- Champions: R, F, M all 4 points or higher
- Loyal: R, F, M all 3 points or higher
- At Risk: R 2 points or lower, F 3 points or higher (used to buy frequently but hasn’t come recently)
- New: R 4 points or higher, F 2 points or lower (came recently but low frequency)
- Others: The rest
BigQuery (SQL)
Hint: Use a CTE to calculate scores first, then classify with CASE WHEN.
View Solution
WITH rfm_base AS (
-- (Problem 2 query above)
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(DATE(o.created_at)), DAY) as recency,
COUNT(DISTINCT o.order_id) as frequency,
SUM(oi.sale_price) as monetary
FROM `your-project-id.retail_analytics_us.src_orders` o
JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
GROUP BY o.user_id
),
rfm_scores AS (
SELECT *,
NTILE(5) OVER (ORDER BY recency ASC) as r_score, -- The more recent (smaller), the better score should be
-- Usually 5 points is best, so ORDER BY recency DESC would be correct.
-- However, SQL NTILE(5) OVER (ORDER BY col) assigns 1 to smallest values.
-- Recency: smaller (more recent) should be better -> Is rank 1 the best?
-- For convenience, sort descending so larger values get 5 points
-- or depends on problem definition. Here we use '5-point scale'.
-- Modified: Configure logic so 5 is the best score
NTILE(5) OVER (ORDER BY recency DESC) as r_score_raw, -- Old gets 1, recent gets 5
NTILE(5) OVER (ORDER BY frequency ASC) as f_score, -- More is 5
NTILE(5) OVER (ORDER BY monetary ASC) as m_score -- More is 5
FROM rfm_base
)
SELECT
user_id,
CASE
WHEN r_score_raw >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score_raw >= 3 AND f_score >= 3 AND m_score >= 3 THEN 'Loyal'
WHEN r_score_raw <= 2 AND f_score >= 3 THEN 'At Risk'
WHEN r_score_raw >= 4 AND f_score <= 2 THEN 'New'
ELSE 'Others'
END as segment
FROM rfm_scores;💡 Summary and Insights
Customers are all different and want to be treated differently.
- Top 20% create 80% of revenue (Pareto Principle): Give special treatment to Champions.
- Detect churn signals: Send win-back coupons to ‘At Risk’ group (used to buy frequently but became inactive).
- Resource optimization: Instead of urgently responding to all tickets, you can build a system that prioritizes VIP tickets.
Wrapping Up Part 1 Great work! You’ve now built a foundation in data analysis (exploration, metrics, time series, segmentation). Starting from the next chapter, we’ll dive deeper into Machine Learning (Sentiment Analysis) and AI.