Skip to Content
ConceptsSQLCTE and Cohort Analysis

Cohort Analysis with CTE

IntermediateAdvanced

BigQuery Execution Environment Setup

from google.cloud import bigquery import pandas as pd # Authentication setup (when using service account key) # import os # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json" # Create BigQuery client client = bigquery.Client(project='your-project-id') # Query execution function def run_query(query): return client.query(query).to_dataframe()

Learning Objectives

After completing this recipe, you will be able to:

  • Understand the concept and benefits of CTE (Common Table Expression)
  • Structure queries step by step with the WITH clause
  • Build complex analysis pipelines with multiple CTEs
  • Measure customer retention with cohort analysis
  • Segment customers with RFM analysis
  • Analyze repurchase intervals

1. Basic CTE - WITH Clause

Theory

CTE is a method to define a temporary result set within a query.

Subquery vs CTE

Subquery (hard to read):

SELECT * FROM ( SELECT * FROM ( SELECT ... ) )

CTE (easy to read):

WITH step1 AS (...), step2 AS (...) SELECT * FROM step2

Benefits

  • Improved readability
  • Reusable
  • Easy debugging
  • Clear step-by-step logic

Practice: Customer Purchase Summary

WITH customer_summary AS ( SELECT o.user_id, COUNT(DISTINCT o.order_id) AS total_orders, SUM(oi.sale_price) AS total_spent, MIN(o.created_at) AS first_order_date, MAX(o.created_at) AS last_order_date FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ) SELECT user_id, total_orders, total_spent, ROUND(total_spent / total_orders, 2) AS avg_order_value, first_order_date, last_order_date, DATE_DIFF(CAST(last_order_date AS DATE), CAST(first_order_date AS DATE), DAY) AS customer_lifetime_days FROM customer_summary ORDER BY total_spent DESC LIMIT 20

Execution Result:

user_idtotal_orderstotal_spentavg_order_valuecustomer_lifetime_days
5907141633.89408.47961
8400521500.73750.36275
9057241477.39369.3571
ℹ️
Benefits of CTE
  • Readability: Separate complex queries into logical steps
  • Reusability: Reference the same CTE multiple times
  • Debugging: Test each step individually

Quiz 1: Monthly Revenue Summary

Problem

Using CTE:

  1. First CTE: Aggregate monthly revenue
  2. Second CTE: Calculate month-over-month growth rate
  3. Final: Query only months with positive growth rate

Hint: DATE_TRUNC, LAG

View Answer

WITH monthly_revenue AS ( SELECT DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS month, SUM(oi.sale_price) AS revenue, COUNT(DISTINCT o.order_id) AS order_count FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY month ), monthly_growth AS ( SELECT month, revenue, order_count, LAG(revenue) OVER (ORDER BY month) AS prev_revenue, ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2) AS growth_rate FROM monthly_revenue ) SELECT month, revenue, order_count, prev_revenue, growth_rate FROM monthly_growth WHERE growth_rate > 0 ORDER BY month

Execution Result:

monthrevenueorder_countprev_revenuegrowth_rate
2024-01-01156234.5012890142567.259.58
2024-03-01172456.3014560148923.7515.80

Key Points:

  • Build complex logic by stacking CTEs step by step
  • Use LAG function to reference previous row values
  • Prevent division by zero with NULLIF

2. Multiple CTEs - Cohort Analysis

Theory

Cohort analysis tracks behavior patterns of user groups that share common characteristics (signup time, first purchase date, etc.).

Core Concepts

  • Cohort: A group of users who experienced a specific event at the same time
  • Retention: The percentage of users who continue to be active over time
  • Cohort period: Time elapsed since the first event

Practice: Monthly Cohort Retention

WITH user_first_order AS ( -- Step 1: Define each user's first order month (cohort definition) SELECT user_id, DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month FROM `your-project-id.thelook_ecommerce.orders` GROUP BY user_id ), user_orders AS ( -- Step 2: Add cohort info to all orders SELECT o.user_id, DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month, ufo.cohort_month FROM `your-project-id.thelook_ecommerce.orders` o JOIN user_first_order ufo ON o.user_id = ufo.user_id ), cohort_data AS ( -- Step 3: Calculate cohort period SELECT cohort_month, order_month, DATE_DIFF(order_month, cohort_month, MONTH) AS months_since_first, COUNT(DISTINCT user_id) AS users FROM user_orders GROUP BY cohort_month, order_month ), cohort_size AS ( -- Step 4: Initial user count per cohort SELECT cohort_month, users AS cohort_users FROM cohort_data WHERE months_since_first = 0 ) -- Step 5: Calculate retention rate SELECT cd.cohort_month, cd.months_since_first, cd.users AS active_users, cs.cohort_users, ROUND(cd.users * 100.0 / cs.cohort_users, 2) AS retention_rate FROM cohort_data cd JOIN cohort_size cs ON cd.cohort_month = cs.cohort_month WHERE cd.cohort_month >= '2024-01-01' ORDER BY cd.cohort_month, cd.months_since_first

Execution Result:

cohort_monthmonths_since_firstactive_userscohort_usersretention_rate
2024-01-01018921892100.00
2024-01-01115618928.24
2024-01-0129818925.18
2024-02-01018451845100.00
2024-02-01114218457.69
💡
Cohort Analysis Insights
  • Month 0: Always 100% (by definition)
  • Month 1 drop: Need strategies to encourage return visits after first purchase
  • Stabilization point: The point where retention becomes steady is the loyal customer benchmark

Quiz 2: Cumulative Revenue by Cohort

Problem

With multiple CTEs:

  1. Define user cohort (first purchase month)
  2. Calculate cumulative revenue by cohort and period
  3. Estimate 12-month LTV by cohort

Hint: SUM() OVER(PARTITION BY cohort ORDER BY month)

View Answer

WITH user_cohort AS ( SELECT user_id, DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month FROM `your-project-id.thelook_ecommerce.orders` GROUP BY user_id ), user_monthly_revenue AS ( SELECT o.user_id, uc.cohort_month, DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month, DATE_DIFF(DATE_TRUNC(CAST(o.created_at AS DATE), MONTH), uc.cohort_month, MONTH) AS months_since_first, SUM(oi.sale_price) AS revenue FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN user_cohort uc ON o.user_id = uc.user_id GROUP BY o.user_id, uc.cohort_month, order_month ), cohort_revenue AS ( SELECT cohort_month, months_since_first, SUM(revenue) AS total_revenue, COUNT(DISTINCT user_id) AS active_users FROM user_monthly_revenue GROUP BY cohort_month, months_since_first ) SELECT cohort_month, months_since_first, total_revenue, active_users, SUM(total_revenue) OVER ( PARTITION BY cohort_month ORDER BY months_since_first ) AS cumulative_revenue FROM cohort_revenue WHERE cohort_month >= '2024-01-01' AND months_since_first <= 12 ORDER BY cohort_month, months_since_first

Execution Result:

cohort_monthmonths_since_firsttotal_revenueactive_userscumulative_revenue
2024-01-010285600.501892285600.50
2024-01-01123450.75156309051.25
2024-01-01215890.3098324941.55
2024-02-010268920.401845268920.40
2024-02-01121560.80142290481.20

Business Insights:

  • Initial revenue: Most revenue occurs in Month 0
  • LTV pattern: Predict customer value from cumulative revenue curve
  • Cohort comparison: Measure marketing campaign effectiveness

3. RFM Analysis

Theory

RFM is a representative customer segmentation methodology:

  • R (Recency): How recently - days since last purchase
  • F (Frequency): How often - total purchase count
  • M (Monetary): How much - total purchase amount

Practice: RFM Score Calculation

WITH customer_rfm AS ( SELECT o.user_id, DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency, COUNT(DISTINCT o.order_id) AS frequency, SUM(oi.sale_price) AS monetary FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ), rfm_scores AS ( SELECT user_id, recency, frequency, monetary, NTILE(5) OVER (ORDER BY recency DESC) AS r_score, -- Low recency = high score NTILE(5) OVER (ORDER BY frequency ASC) AS f_score, NTILE(5) OVER (ORDER BY monetary ASC) AS m_score FROM customer_rfm ) SELECT user_id, recency, frequency, monetary, r_score, f_score, m_score, CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_segment, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers' WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END AS customer_segment FROM rfm_scores ORDER BY monetary DESC LIMIT 100

Execution Result:

user_idrecencyfrequencymonetaryr_scoref_scorem_scorerfm_segmentcustomer_segment
5907112341633.89445445Champions
840058921500.73435435Loyal Customers
905724541477.39545545Champions
6723425631389.45234234Others
124563121987.50223223Lost

RFM Segment Definitions:

SegmentR ScoreF ScoreM ScoreStrategy
Champions4-54-54-5Rewards program, VIP benefits
Loyal Customers4-53-5-Upselling, referral incentives
Recent Customers4-51-2-Encourage repurchase, onboarding
At Risk1-24-5-Reactivation campaigns
Lost1-21-2-Win-back promotions

Quiz 3: Statistics by RFM Segment

Problem

From RFM analysis results:

  1. Aggregate customer count by segment
  2. Calculate average revenue by segment
  3. Calculate each segment’s share of total revenue

Hint: GROUP BY customer_segment

View Answer

WITH customer_rfm AS ( SELECT o.user_id, DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency, COUNT(DISTINCT o.order_id) AS frequency, SUM(oi.sale_price) AS monetary FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ), rfm_scores AS ( SELECT user_id, recency, frequency, monetary, NTILE(5) OVER (ORDER BY recency DESC) AS r_score, NTILE(5) OVER (ORDER BY frequency ASC) AS f_score, NTILE(5) OVER (ORDER BY monetary ASC) AS m_score FROM customer_rfm ), rfm_segments AS ( SELECT user_id, monetary, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers' WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END AS customer_segment FROM rfm_scores ) SELECT customer_segment, COUNT(*) AS customer_count, ROUND(AVG(monetary), 2) AS avg_monetary, SUM(monetary) AS total_revenue, ROUND(SUM(monetary) * 100.0 / (SELECT SUM(monetary) FROM rfm_segments), 2) AS revenue_share FROM rfm_segments GROUP BY customer_segment ORDER BY total_revenue DESC

Execution Result:

customer_segmentcustomer_countavg_monetarytotal_revenuerevenue_share
Loyal Customers2340678.901588626.0025.97
Champions12501245.671557087.5025.45
Others2180438.90956802.0015.64
At Risk1560456.78712576.8011.65
Recent Customers1890345.60653184.0010.67
Lost1780234.56417516.806.82

Business Insights:

  • Champions: 20% of customers generate 80% of revenue (Pareto principle)
  • At Risk: High F score but low R score → Need immediate reactivation
  • Lost: Decide targeting after cost-benefit analysis

4. Repurchase Interval Analysis

Theory

Understanding customer purchase patterns allows you to send marketing messages at the right time.

Practice: Average Repurchase Interval by Customer

WITH purchase_gaps AS ( SELECT user_id, created_at, LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_purchase, DATE_DIFF( CAST(created_at AS DATE), CAST(LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS DATE), DAY ) AS days_gap FROM `your-project-id.thelook_ecommerce.orders` ), user_purchase_patterns AS ( SELECT user_id, COUNT(*) AS purchase_count, AVG(days_gap) AS avg_gap, MIN(days_gap) AS min_gap, MAX(days_gap) AS max_gap, STDDEV(days_gap) AS stddev_gap FROM purchase_gaps WHERE days_gap IS NOT NULL GROUP BY user_id ) SELECT CASE WHEN avg_gap <= 30 THEN 'Frequent (<=30 days)' WHEN avg_gap <= 90 THEN 'Regular (31-90 days)' WHEN avg_gap <= 180 THEN 'Occasional (91-180 days)' ELSE 'Rare (>180 days)' END AS purchase_frequency_group, COUNT(*) AS customer_count, ROUND(AVG(avg_gap), 1) AS avg_repurchase_days, ROUND(AVG(purchase_count), 1) AS avg_purchases FROM user_purchase_patterns GROUP BY purchase_frequency_group ORDER BY avg_repurchase_days

Execution Result:

purchase_frequency_groupcustomer_countavg_repurchase_daysavg_purchases
Frequent (≤30 days)123418.55.2
Regular (31-90 days)456758.33.1
Occasional (91-180 days)3456128.72.4
Rare (>180 days)2345245.62.1

Quiz 4: Repurchase Pattern by Category

Problem

By product category:

  1. Calculate average repurchase interval
  2. Calculate repurchase rate (percentage of customers with 2+ purchases)
  3. Sort results by repurchase interval

View Answer

WITH category_purchases AS ( SELECT o.user_id, p.category, o.created_at, LAG(o.created_at) OVER ( PARTITION BY o.user_id, p.category ORDER BY o.created_at ) AS prev_purchase FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id ), category_gaps AS ( SELECT category, user_id, DATE_DIFF(CAST(created_at AS DATE), CAST(prev_purchase AS DATE), DAY) AS days_gap FROM category_purchases WHERE prev_purchase IS NOT NULL ), category_stats AS ( SELECT category, COUNT(DISTINCT user_id) AS repeat_customers, AVG(days_gap) AS avg_repurchase_days FROM category_gaps GROUP BY category ), category_total AS ( SELECT p.category, COUNT(DISTINCT o.user_id) AS total_customers FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.category ) SELECT ct.category, ct.total_customers, COALESCE(cs.repeat_customers, 0) AS repeat_customers, ROUND(COALESCE(cs.repeat_customers, 0) * 100.0 / ct.total_customers, 2) AS repeat_rate, ROUND(cs.avg_repurchase_days, 1) AS avg_repurchase_days FROM category_total ct LEFT JOIN category_stats cs ON ct.category = cs.category ORDER BY avg_repurchase_days

Execution Result:

categorytotal_customersrepeat_customersrepeat_rateavg_repurchase_days
Jeans12450234018.8038.4
Tops & Tees11230189016.8342.1
Accessories8560124514.5545.6
Sweaters9870156015.8048.9
Active7890112014.2052.3
Outerwear & Coats654089013.6165.2

Business Insights:

  • Short repurchase interval categories: Consider subscription model
  • Long repurchase interval categories: Reminder marketing
  • High repurchase rate categories: Cross-selling opportunities

Summary

CTE Patterns Covered

PatternUseExample
Single CTEDefine intermediate resultsCustomer summary table
Multiple CTEsStep-by-step pipelineCohort analysis
Recursive CTEHierarchical structure processingOrg chart, category tree
CTE + Window FunctionsComplex analysisRFM, retention

CTE Best Practices

  1. Clear naming: CTE names should explain their purpose
  2. Step-by-step separation: Break complex logic into small units
  3. Reusability: Reference the same CTE multiple times to remove duplication
  4. Debugging: Verify by running each CTE individually

Cohort Analysis Applications

  • Product team: Track user retention after feature launches
  • Marketing team: Analyze customer acquisition and retention by campaign
  • Finance team: Forecast LTV by cohort and budget allocation

Next Steps

You’ve mastered CTE and cohort analysis! Next, learn text data processing techniques in String Functions.

Last updated on

🤖AI 모의면접실전처럼 연습하기