05. Customer Segmentation

Advanced3 hours

1. Overview and Scenario

Situation: Providing the same CS service to all customers is increasing costs while satisfaction stays flat. The CMO suggests:

“Shouldn’t we respond faster to VIP customers, and give coupons to customers who seem likely to churn?”

But we don’t know who’s a VIP and who’s at risk of churning. In this chapter, we’ll use RFM Analysis to segment customers and develop strategies for each group.

2. Data Preparation

We need order data (src_orders) and order details (src_order_items). We’ll connect these with CS ticket data (cs_tickets_dummy) for analysis.

BigQuery (SQL)


# BigQuery connection setup
from google.cloud import bigquery
client = bigquery.Client()

3. Basic Segmentation: By Ticket Frequency

The easiest method is to look at “How often do they contact us?”

❓ Problem 1: Group by Ticket Frequency

Q. Calculate total tickets per customer and group them by the following criteria:

1 time: 1 inquiry
2-3 times: 2-3 inquiries
4+ times: Frequent inquiries

BigQuery (SQL)

💡

Hint: Use CASE WHEN to create groups.

View Solution


WITH user_tickets AS (
    SELECT user_id, COUNT(*) as ticket_count
    FROM `your-project-id.retail_analytics_us.cs_tickets_dummy`
    GROUP BY user_id
)
SELECT
    CASE
        WHEN ticket_count = 1 THEN '1 inquiry'
        WHEN ticket_count BETWEEN 2 AND 3 THEN '2-3 inquiries'
        ELSE '4+ (frequent)'
    END as frequency_segment,
    COUNT(*) as user_count
FROM user_tickets
GROUP BY 1
ORDER BY user_count DESC;

4. RFM Analysis (Recency, Frequency, Monetary)

This is the most commonly used marketing segmentation technique.

Recency: How recently did they purchase?
Frequency: How often did they purchase?
Monetary: How much did they spend?

❓ Problem 2: Calculate RFM Metrics per Customer

Q. Calculate each user’s Recency (days since last purchase), Frequency (total orders), and Monetary (total spend).

BigQuery (SQL)

💡

Hint: Use DATE_DIFF and CURRENT_DATE() to calculate days elapsed.

View Solution


SELECT
    o.user_id,
    DATE_DIFF(CURRENT_DATE(), MAX(DATE(o.created_at)), DAY) as recency,
    COUNT(DISTINCT o.order_id) as frequency,
    ROUND(SUM(oi.sale_price), 2) as monetary
FROM `your-project-id.retail_analytics_us.src_orders` o
JOIN `your-project-id.retail_analytics_us.src_order_items` oi
  ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
GROUP BY o.user_id
LIMIT 10;

❓ Problem 3: Assign RFM Scores and Classify Segments

Q. Use NTILE(5) function to divide each metric into 1-5 scores and define segments with the following rules:

Champions: R, F, M all 4 points or higher
Loyal: R, F, M all 3 points or higher
At Risk: R 2 points or lower, F 3 points or higher (used to buy frequently but hasn’t come recently)
New: R 4 points or higher, F 2 points or lower (came recently but low frequency)
Others: The rest

BigQuery (SQL)

💡

Hint: Use a CTE to calculate scores first, then classify with CASE WHEN.

View Solution


WITH rfm_base AS (
    -- (Problem 2 query above)
    SELECT
        o.user_id,
        DATE_DIFF(CURRENT_DATE(), MAX(DATE(o.created_at)), DAY) as recency,
        COUNT(DISTINCT o.order_id) as frequency,
        SUM(oi.sale_price) as monetary
    FROM `your-project-id.retail_analytics_us.src_orders` o
    JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
    WHERE o.status NOT IN ('Cancelled', 'Returned')
    GROUP BY o.user_id
),
rfm_scores AS (
    SELECT *,
        NTILE(5) OVER (ORDER BY recency ASC) as r_score,   -- The more recent (smaller), the better score should be
                                                           -- Usually 5 points is best, so ORDER BY recency DESC would be correct.
                                                           -- However, SQL NTILE(5) OVER (ORDER BY col) assigns 1 to smallest values.
                                                           -- Recency: smaller (more recent) should be better -> Is rank 1 the best?
                                                           -- For convenience, sort descending so larger values get 5 points
                                                           -- or depends on problem definition. Here we use '5-point scale'.
        -- Modified: Configure logic so 5 is the best score
        NTILE(5) OVER (ORDER BY recency DESC) as r_score_raw, -- Old gets 1, recent gets 5
        NTILE(5) OVER (ORDER BY frequency ASC) as f_score,     -- More is 5
        NTILE(5) OVER (ORDER BY monetary ASC) as m_score       -- More is 5
    FROM rfm_base
)
SELECT
    user_id,
    CASE
        WHEN r_score_raw >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
        WHEN r_score_raw >= 3 AND f_score >= 3 AND m_score >= 3 THEN 'Loyal'
        WHEN r_score_raw <= 2 AND f_score >= 3 THEN 'At Risk'
        WHEN r_score_raw >= 4 AND f_score <= 2 THEN 'New'
        ELSE 'Others'
    END as segment
FROM rfm_scores;

💡 Summary and Insights

Customers are all different and want to be treated differently.

Top 20% create 80% of revenue (Pareto Principle): Give special treatment to Champions.
Detect churn signals: Send win-back coupons to ‘At Risk’ group (used to buy frequently but became inactive).
Resource optimization: Instead of urgently responding to all tickets, you can build a system that prioritizes VIP tickets.

Wrapping Up Part 1 Great work! You’ve now built a foundation in data analysis (exploration, metrics, time series, segmentation). Starting from the next chapter, we’ll dive deeper into Machine Learning (Sentiment Analysis) and AI.

05. Customer Segmentation

1. Overview and Scenario

2. Data Preparation

BigQuery (SQL)

Pandas (CSV)

3. Basic Segmentation: By Ticket Frequency

❓ Problem 1: Group by Ticket Frequency

BigQuery (SQL)

View Solution

Pandas

View Solution

4. RFM Analysis (Recency, Frequency, Monetary)

❓ Problem 2: Calculate RFM Metrics per Customer

BigQuery (SQL)

View Solution

Pandas

View Solution

❓ Problem 3: Assign RFM Scores and Classify Segments

BigQuery (SQL)

View Solution

Pandas

View Solution

💡 Summary and Insights