Cohort Analysis with CTE
BigQuery Execution Environment Setup
from google.cloud import bigquery
import pandas as pd
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
# Query execution function
def run_query(query):
return client.query(query).to_dataframe()Learning Objectives
After completing this recipe, you will be able to:
- Understand the concept and benefits of CTE (Common Table Expression)
- Structure queries step by step with the WITH clause
- Build complex analysis pipelines with multiple CTEs
- Measure customer retention with cohort analysis
- Segment customers with RFM analysis
- Analyze repurchase intervals
1. Basic CTE - WITH Clause
Theory
CTE is a method to define a temporary result set within a query.
Subquery vs CTE
Subquery (hard to read):
SELECT * FROM (
SELECT * FROM (
SELECT ...
)
)CTE (easy to read):
WITH step1 AS (...),
step2 AS (...)
SELECT * FROM step2Benefits
- Improved readability
- Reusable
- Easy debugging
- Clear step-by-step logic
Practice: Customer Purchase Summary
WITH customer_summary AS (
SELECT
o.user_id,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(oi.sale_price) AS total_spent,
MIN(o.created_at) AS first_order_date,
MAX(o.created_at) AS last_order_date
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
GROUP BY o.user_id
)
SELECT
user_id,
total_orders,
total_spent,
ROUND(total_spent / total_orders, 2) AS avg_order_value,
first_order_date,
last_order_date,
DATE_DIFF(CAST(last_order_date AS DATE), CAST(first_order_date AS DATE), DAY) AS customer_lifetime_days
FROM customer_summary
ORDER BY total_spent DESC
LIMIT 20Execution Result:
| user_id | total_orders | total_spent | avg_order_value | customer_lifetime_days |
|---|---|---|---|---|
| 59071 | 4 | 1633.89 | 408.47 | 961 |
| 84005 | 2 | 1500.73 | 750.36 | 275 |
| 90572 | 4 | 1477.39 | 369.35 | 71 |
- Readability: Separate complex queries into logical steps
- Reusability: Reference the same CTE multiple times
- Debugging: Test each step individually
Quiz 1: Monthly Revenue Summary
Problem
Using CTE:
- First CTE: Aggregate monthly revenue
- Second CTE: Calculate month-over-month growth rate
- Final: Query only months with positive growth rate
Hint: DATE_TRUNC, LAG
View Answer
WITH monthly_revenue AS (
SELECT
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS month,
SUM(oi.sale_price) AS revenue,
COUNT(DISTINCT o.order_id) AS order_count
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY month
),
monthly_growth AS (
SELECT
month,
revenue,
order_count,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) /
NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2) AS growth_rate
FROM monthly_revenue
)
SELECT
month,
revenue,
order_count,
prev_revenue,
growth_rate
FROM monthly_growth
WHERE growth_rate > 0
ORDER BY monthExecution Result:
| month | revenue | order_count | prev_revenue | growth_rate |
|---|---|---|---|---|
| 2024-01-01 | 156234.50 | 12890 | 142567.25 | 9.58 |
| 2024-03-01 | 172456.30 | 14560 | 148923.75 | 15.80 |
Key Points:
- Build complex logic by stacking CTEs step by step
- Use LAG function to reference previous row values
- Prevent division by zero with NULLIF
2. Multiple CTEs - Cohort Analysis
Theory
Cohort analysis tracks behavior patterns of user groups that share common characteristics (signup time, first purchase date, etc.).
Core Concepts
- Cohort: A group of users who experienced a specific event at the same time
- Retention: The percentage of users who continue to be active over time
- Cohort period: Time elapsed since the first event
Practice: Monthly Cohort Retention
WITH user_first_order AS (
-- Step 1: Define each user's first order month (cohort definition)
SELECT
user_id,
DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month
FROM `your-project-id.thelook_ecommerce.orders`
GROUP BY user_id
),
user_orders AS (
-- Step 2: Add cohort info to all orders
SELECT
o.user_id,
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month,
ufo.cohort_month
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN user_first_order ufo ON o.user_id = ufo.user_id
),
cohort_data AS (
-- Step 3: Calculate cohort period
SELECT
cohort_month,
order_month,
DATE_DIFF(order_month, cohort_month, MONTH) AS months_since_first,
COUNT(DISTINCT user_id) AS users
FROM user_orders
GROUP BY cohort_month, order_month
),
cohort_size AS (
-- Step 4: Initial user count per cohort
SELECT
cohort_month,
users AS cohort_users
FROM cohort_data
WHERE months_since_first = 0
)
-- Step 5: Calculate retention rate
SELECT
cd.cohort_month,
cd.months_since_first,
cd.users AS active_users,
cs.cohort_users,
ROUND(cd.users * 100.0 / cs.cohort_users, 2) AS retention_rate
FROM cohort_data cd
JOIN cohort_size cs ON cd.cohort_month = cs.cohort_month
WHERE cd.cohort_month >= '2024-01-01'
ORDER BY cd.cohort_month, cd.months_since_firstExecution Result:
| cohort_month | months_since_first | active_users | cohort_users | retention_rate |
|---|---|---|---|---|
| 2024-01-01 | 0 | 1892 | 1892 | 100.00 |
| 2024-01-01 | 1 | 156 | 1892 | 8.24 |
| 2024-01-01 | 2 | 98 | 1892 | 5.18 |
| 2024-02-01 | 0 | 1845 | 1845 | 100.00 |
| 2024-02-01 | 1 | 142 | 1845 | 7.69 |
- Month 0: Always 100% (by definition)
- Month 1 drop: Need strategies to encourage return visits after first purchase
- Stabilization point: The point where retention becomes steady is the loyal customer benchmark
Quiz 2: Cumulative Revenue by Cohort
Problem
With multiple CTEs:
- Define user cohort (first purchase month)
- Calculate cumulative revenue by cohort and period
- Estimate 12-month LTV by cohort
Hint: SUM() OVER(PARTITION BY cohort ORDER BY month)
View Answer
WITH user_cohort AS (
SELECT
user_id,
DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month
FROM `your-project-id.thelook_ecommerce.orders`
GROUP BY user_id
),
user_monthly_revenue AS (
SELECT
o.user_id,
uc.cohort_month,
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month,
DATE_DIFF(DATE_TRUNC(CAST(o.created_at AS DATE), MONTH), uc.cohort_month, MONTH) AS months_since_first,
SUM(oi.sale_price) AS revenue
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN user_cohort uc ON o.user_id = uc.user_id
GROUP BY o.user_id, uc.cohort_month, order_month
),
cohort_revenue AS (
SELECT
cohort_month,
months_since_first,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT user_id) AS active_users
FROM user_monthly_revenue
GROUP BY cohort_month, months_since_first
)
SELECT
cohort_month,
months_since_first,
total_revenue,
active_users,
SUM(total_revenue) OVER (
PARTITION BY cohort_month
ORDER BY months_since_first
) AS cumulative_revenue
FROM cohort_revenue
WHERE cohort_month >= '2024-01-01'
AND months_since_first <= 12
ORDER BY cohort_month, months_since_firstExecution Result:
| cohort_month | months_since_first | total_revenue | active_users | cumulative_revenue |
|---|---|---|---|---|
| 2024-01-01 | 0 | 285600.50 | 1892 | 285600.50 |
| 2024-01-01 | 1 | 23450.75 | 156 | 309051.25 |
| 2024-01-01 | 2 | 15890.30 | 98 | 324941.55 |
| 2024-02-01 | 0 | 268920.40 | 1845 | 268920.40 |
| 2024-02-01 | 1 | 21560.80 | 142 | 290481.20 |
Business Insights:
- Initial revenue: Most revenue occurs in Month 0
- LTV pattern: Predict customer value from cumulative revenue curve
- Cohort comparison: Measure marketing campaign effectiveness
3. RFM Analysis
Theory
RFM is a representative customer segmentation methodology:
- R (Recency): How recently - days since last purchase
- F (Frequency): How often - total purchase count
- M (Monetary): How much - total purchase amount
Practice: RFM Score Calculation
WITH customer_rfm AS (
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency,
COUNT(DISTINCT o.order_id) AS frequency,
SUM(oi.sale_price) AS monetary
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY o.user_id
),
rfm_scores AS (
SELECT
user_id,
recency,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score, -- Low recency = high score
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM customer_rfm
)
SELECT
user_id,
recency,
frequency,
monetary,
r_score,
f_score,
m_score,
CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_segment,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers'
WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
ELSE 'Others'
END AS customer_segment
FROM rfm_scores
ORDER BY monetary DESC
LIMIT 100Execution Result:
| user_id | recency | frequency | monetary | r_score | f_score | m_score | rfm_segment | customer_segment |
|---|---|---|---|---|---|---|---|---|
| 59071 | 123 | 4 | 1633.89 | 4 | 4 | 5 | 445 | Champions |
| 84005 | 89 | 2 | 1500.73 | 4 | 3 | 5 | 435 | Loyal Customers |
| 90572 | 45 | 4 | 1477.39 | 5 | 4 | 5 | 545 | Champions |
| 67234 | 256 | 3 | 1389.45 | 2 | 3 | 4 | 234 | Others |
| 12456 | 312 | 1 | 987.50 | 2 | 2 | 3 | 223 | Lost |
RFM Segment Definitions:
| Segment | R Score | F Score | M Score | Strategy |
|---|---|---|---|---|
| Champions | 4-5 | 4-5 | 4-5 | Rewards program, VIP benefits |
| Loyal Customers | 4-5 | 3-5 | - | Upselling, referral incentives |
| Recent Customers | 4-5 | 1-2 | - | Encourage repurchase, onboarding |
| At Risk | 1-2 | 4-5 | - | Reactivation campaigns |
| Lost | 1-2 | 1-2 | - | Win-back promotions |
Quiz 3: Statistics by RFM Segment
Problem
From RFM analysis results:
- Aggregate customer count by segment
- Calculate average revenue by segment
- Calculate each segment’s share of total revenue
Hint: GROUP BY customer_segment
View Answer
WITH customer_rfm AS (
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency,
COUNT(DISTINCT o.order_id) AS frequency,
SUM(oi.sale_price) AS monetary
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY o.user_id
),
rfm_scores AS (
SELECT
user_id,
recency,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM customer_rfm
),
rfm_segments AS (
SELECT
user_id,
monetary,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers'
WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
ELSE 'Others'
END AS customer_segment
FROM rfm_scores
)
SELECT
customer_segment,
COUNT(*) AS customer_count,
ROUND(AVG(monetary), 2) AS avg_monetary,
SUM(monetary) AS total_revenue,
ROUND(SUM(monetary) * 100.0 / (SELECT SUM(monetary) FROM rfm_segments), 2) AS revenue_share
FROM rfm_segments
GROUP BY customer_segment
ORDER BY total_revenue DESCExecution Result:
| customer_segment | customer_count | avg_monetary | total_revenue | revenue_share |
|---|---|---|---|---|
| Loyal Customers | 2340 | 678.90 | 1588626.00 | 25.97 |
| Champions | 1250 | 1245.67 | 1557087.50 | 25.45 |
| Others | 2180 | 438.90 | 956802.00 | 15.64 |
| At Risk | 1560 | 456.78 | 712576.80 | 11.65 |
| Recent Customers | 1890 | 345.60 | 653184.00 | 10.67 |
| Lost | 1780 | 234.56 | 417516.80 | 6.82 |
Business Insights:
- Champions: 20% of customers generate 80% of revenue (Pareto principle)
- At Risk: High F score but low R score → Need immediate reactivation
- Lost: Decide targeting after cost-benefit analysis
4. Repurchase Interval Analysis
Theory
Understanding customer purchase patterns allows you to send marketing messages at the right time.
Practice: Average Repurchase Interval by Customer
WITH purchase_gaps AS (
SELECT
user_id,
created_at,
LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_purchase,
DATE_DIFF(
CAST(created_at AS DATE),
CAST(LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS DATE),
DAY
) AS days_gap
FROM `your-project-id.thelook_ecommerce.orders`
),
user_purchase_patterns AS (
SELECT
user_id,
COUNT(*) AS purchase_count,
AVG(days_gap) AS avg_gap,
MIN(days_gap) AS min_gap,
MAX(days_gap) AS max_gap,
STDDEV(days_gap) AS stddev_gap
FROM purchase_gaps
WHERE days_gap IS NOT NULL
GROUP BY user_id
)
SELECT
CASE
WHEN avg_gap <= 30 THEN 'Frequent (<=30 days)'
WHEN avg_gap <= 90 THEN 'Regular (31-90 days)'
WHEN avg_gap <= 180 THEN 'Occasional (91-180 days)'
ELSE 'Rare (>180 days)'
END AS purchase_frequency_group,
COUNT(*) AS customer_count,
ROUND(AVG(avg_gap), 1) AS avg_repurchase_days,
ROUND(AVG(purchase_count), 1) AS avg_purchases
FROM user_purchase_patterns
GROUP BY purchase_frequency_group
ORDER BY avg_repurchase_daysExecution Result:
| purchase_frequency_group | customer_count | avg_repurchase_days | avg_purchases |
|---|---|---|---|
| Frequent (≤30 days) | 1234 | 18.5 | 5.2 |
| Regular (31-90 days) | 4567 | 58.3 | 3.1 |
| Occasional (91-180 days) | 3456 | 128.7 | 2.4 |
| Rare (>180 days) | 2345 | 245.6 | 2.1 |
Quiz 4: Repurchase Pattern by Category
Problem
By product category:
- Calculate average repurchase interval
- Calculate repurchase rate (percentage of customers with 2+ purchases)
- Sort results by repurchase interval
View Answer
WITH category_purchases AS (
SELECT
o.user_id,
p.category,
o.created_at,
LAG(o.created_at) OVER (
PARTITION BY o.user_id, p.category
ORDER BY o.created_at
) AS prev_purchase
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id
),
category_gaps AS (
SELECT
category,
user_id,
DATE_DIFF(CAST(created_at AS DATE), CAST(prev_purchase AS DATE), DAY) AS days_gap
FROM category_purchases
WHERE prev_purchase IS NOT NULL
),
category_stats AS (
SELECT
category,
COUNT(DISTINCT user_id) AS repeat_customers,
AVG(days_gap) AS avg_repurchase_days
FROM category_gaps
GROUP BY category
),
category_total AS (
SELECT
p.category,
COUNT(DISTINCT o.user_id) AS total_customers
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id
GROUP BY p.category
)
SELECT
ct.category,
ct.total_customers,
COALESCE(cs.repeat_customers, 0) AS repeat_customers,
ROUND(COALESCE(cs.repeat_customers, 0) * 100.0 / ct.total_customers, 2) AS repeat_rate,
ROUND(cs.avg_repurchase_days, 1) AS avg_repurchase_days
FROM category_total ct
LEFT JOIN category_stats cs ON ct.category = cs.category
ORDER BY avg_repurchase_daysExecution Result:
| category | total_customers | repeat_customers | repeat_rate | avg_repurchase_days |
|---|---|---|---|---|
| Jeans | 12450 | 2340 | 18.80 | 38.4 |
| Tops & Tees | 11230 | 1890 | 16.83 | 42.1 |
| Accessories | 8560 | 1245 | 14.55 | 45.6 |
| Sweaters | 9870 | 1560 | 15.80 | 48.9 |
| Active | 7890 | 1120 | 14.20 | 52.3 |
| Outerwear & Coats | 6540 | 890 | 13.61 | 65.2 |
Business Insights:
- Short repurchase interval categories: Consider subscription model
- Long repurchase interval categories: Reminder marketing
- High repurchase rate categories: Cross-selling opportunities
Summary
CTE Patterns Covered
| Pattern | Use | Example |
|---|---|---|
| Single CTE | Define intermediate results | Customer summary table |
| Multiple CTEs | Step-by-step pipeline | Cohort analysis |
| Recursive CTE | Hierarchical structure processing | Org chart, category tree |
| CTE + Window Functions | Complex analysis | RFM, retention |
CTE Best Practices
- Clear naming: CTE names should explain their purpose
- Step-by-step separation: Break complex logic into small units
- Reusability: Reference the same CTE multiple times to remove duplication
- Debugging: Verify by running each CTE individually
Cohort Analysis Applications
- Product team: Track user retention after feature launches
- Marketing team: Analyze customer acquisition and retention by campaign
- Finance team: Forecast LTV by cohort and budget allocation
Next Steps
You’ve mastered CTE and cohort analysis! Next, learn text data processing techniques in String Functions.