Skip to Content
이론 및 개념 (Concepts)SQLCTE와 코호트 분석

CTE를 활용한 코호트 분석

중급고급

BigQuery 실행 환경 설정

from google.cloud import bigquery import pandas as pd # 인증 설정 (서비스 계정 키 사용 시) # import os # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json" # BigQuery 클라이언트 생성 client = bigquery.Client(project='your-project-id') # 쿼리 실행 함수 def run_query(query): return client.query(query).to_dataframe()

학습 목표

이 레시피를 완료하면 다음을 할 수 있습니다:

  • CTE(Common Table Expression)의 개념과 장점 이해
  • WITH 절로 단계적 쿼리 구조화
  • 다중 CTE로 복잡한 분석 파이프라인 구축
  • 코호트 분석으로 고객 리텐션 측정
  • RFM 분석으로 고객 세분화
  • 재구매 간격 분석

1. 기본 CTE - WITH 절

이론

CTE는 쿼리 내에서 임시 결과 세트를 정의하는 방법입니다.

서브쿼리 vs CTE

서브쿼리 (읽기 어려움):

SELECT * FROM ( SELECT * FROM ( SELECT ... ) )

CTE (읽기 쉬움):

WITH step1 AS (...), step2 AS (...) SELECT * FROM step2

장점

  • 가독성 향상
  • 재사용 가능
  • 디버깅 용이
  • 단계별 로직 명확화

실습: 고객 구매 요약

WITH customer_summary AS ( SELECT o.user_id, COUNT(DISTINCT o.order_id) AS total_orders, SUM(oi.sale_price) AS total_spent, MIN(o.created_at) AS first_order_date, MAX(o.created_at) AS last_order_date FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ) SELECT user_id, total_orders, total_spent, ROUND(total_spent / total_orders, 2) AS avg_order_value, first_order_date, last_order_date, DATE_DIFF(CAST(last_order_date AS DATE), CAST(first_order_date AS DATE), DAY) AS customer_lifetime_days FROM customer_summary ORDER BY total_spent DESC LIMIT 20

실행 결과:

user_idtotal_orderstotal_spentavg_order_valuecustomer_lifetime_days
5907141633.89408.47961
8400521500.73750.36275
9057241477.39369.3571
ℹ️
CTE의 장점
  • 가독성: 복잡한 쿼리를 논리적 단계로 분리
  • 재사용: 같은 CTE를 여러 번 참조 가능
  • 디버깅: 각 단계를 개별적으로 테스트 가능

퀴즈 1: 월별 매출 요약

문제

CTE를 사용하여:

  1. 첫 번째 CTE: 월별 매출 집계
  2. 두 번째 CTE: 전월 대비 성장률 계산
  3. 최종: 성장률이 양수인 월만 조회

힌트: DATE_TRUNC, LAG

정답 보기

WITH monthly_revenue AS ( SELECT DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS month, SUM(oi.sale_price) AS revenue, COUNT(DISTINCT o.order_id) AS order_count FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY month ), monthly_growth AS ( SELECT month, revenue, order_count, LAG(revenue) OVER (ORDER BY month) AS prev_revenue, ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2) AS growth_rate FROM monthly_revenue ) SELECT month, revenue, order_count, prev_revenue, growth_rate FROM monthly_growth WHERE growth_rate > 0 ORDER BY month

실행 결과:

monthrevenueorder_countprev_revenuegrowth_rate
2024-01-01156234.5012890142567.259.58
2024-03-01172456.3014560148923.7515.80

핵심 포인트:

  • CTE를 단계별로 쌓아가며 복잡한 로직 구현
  • LAG 함수로 이전 행 값 참조
  • NULLIF로 0 나누기 방지

2. 다중 CTE - 코호트 분석

이론

코호트 분석은 동일한 특성(가입 시점, 첫 구매일 등)을 가진 사용자 그룹의 행동 패턴을 추적합니다.

핵심 개념

  • 코호트: 같은 시점에 특정 이벤트를 경험한 사용자 그룹
  • 리텐션: 시간이 지나도 계속 활동하는 사용자 비율
  • 코호트 기간: 첫 이벤트 이후 경과 기간

실습: 월별 코호트 리텐션

WITH user_first_order AS ( -- 1단계: 각 사용자의 첫 주문 월 (코호트 정의) SELECT user_id, DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month FROM `your-project-id.thelook_ecommerce.orders` GROUP BY user_id ), user_orders AS ( -- 2단계: 모든 주문에 코호트 정보 추가 SELECT o.user_id, DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month, ufo.cohort_month FROM `your-project-id.thelook_ecommerce.orders` o JOIN user_first_order ufo ON o.user_id = ufo.user_id ), cohort_data AS ( -- 3단계: 코호트 기간 계산 SELECT cohort_month, order_month, DATE_DIFF(order_month, cohort_month, MONTH) AS months_since_first, COUNT(DISTINCT user_id) AS users FROM user_orders GROUP BY cohort_month, order_month ), cohort_size AS ( -- 4단계: 코호트별 초기 사용자 수 SELECT cohort_month, users AS cohort_users FROM cohort_data WHERE months_since_first = 0 ) -- 5단계: 리텐션율 계산 SELECT cd.cohort_month, cd.months_since_first, cd.users AS active_users, cs.cohort_users, ROUND(cd.users * 100.0 / cs.cohort_users, 2) AS retention_rate FROM cohort_data cd JOIN cohort_size cs ON cd.cohort_month = cs.cohort_month WHERE cd.cohort_month >= '2024-01-01' ORDER BY cd.cohort_month, cd.months_since_first

실행 결과:

cohort_monthmonths_since_firstactive_userscohort_usersretention_rate
2024-01-01018921892100.00
2024-01-01115618928.24
2024-01-0129818925.18
2024-02-01018451845100.00
2024-02-01114218457.69
💡
코호트 분석 인사이트
  • Month 0: 항상 100% (정의상)
  • Month 1 급감: 첫 구매 후 재방문 유도 전략 필요
  • 안정화 시점: 리텐션이 일정해지는 시점이 충성 고객 기준

퀴즈 2: 코호트별 누적 매출

문제

다중 CTE로:

  1. 사용자별 코호트(첫 구매 월) 정의
  2. 코호트별, 기간별 누적 매출 계산
  3. 코호트별 12개월 LTV 추정

힌트: SUM() OVER(PARTITION BY cohort ORDER BY month)

정답 보기

WITH user_cohort AS ( SELECT user_id, DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month FROM `your-project-id.thelook_ecommerce.orders` GROUP BY user_id ), user_monthly_revenue AS ( SELECT o.user_id, uc.cohort_month, DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month, DATE_DIFF(DATE_TRUNC(CAST(o.created_at AS DATE), MONTH), uc.cohort_month, MONTH) AS months_since_first, SUM(oi.sale_price) AS revenue FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN user_cohort uc ON o.user_id = uc.user_id GROUP BY o.user_id, uc.cohort_month, order_month ), cohort_revenue AS ( SELECT cohort_month, months_since_first, SUM(revenue) AS total_revenue, COUNT(DISTINCT user_id) AS active_users FROM user_monthly_revenue GROUP BY cohort_month, months_since_first ) SELECT cohort_month, months_since_first, total_revenue, active_users, SUM(total_revenue) OVER ( PARTITION BY cohort_month ORDER BY months_since_first ) AS cumulative_revenue FROM cohort_revenue WHERE cohort_month >= '2024-01-01' AND months_since_first <= 12 ORDER BY cohort_month, months_since_first

실행 결과:

cohort_monthmonths_since_firsttotal_revenueactive_userscumulative_revenue
2024-01-010285600.501892285600.50
2024-01-01123450.75156309051.25
2024-01-01215890.3098324941.55
2024-02-010268920.401845268920.40
2024-02-01121560.80142290481.20

비즈니스 인사이트:

  • 초기 매출: Month 0에서 대부분의 매출 발생
  • LTV 패턴: 누적 매출 곡선으로 고객 가치 예측
  • 코호트 비교: 마케팅 캠페인 효과 측정

3. RFM 분석

이론

RFM은 고객 세분화의 대표적 방법론입니다:

  • R (Recency): 최근성 - 마지막 구매로부터 경과일
  • F (Frequency): 빈도 - 총 구매 횟수
  • M (Monetary): 금액 - 총 구매 금액

실습: RFM 점수 계산

WITH customer_rfm AS ( SELECT o.user_id, DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency, COUNT(DISTINCT o.order_id) AS frequency, SUM(oi.sale_price) AS monetary FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ), rfm_scores AS ( SELECT user_id, recency, frequency, monetary, NTILE(5) OVER (ORDER BY recency DESC) AS r_score, -- 낮은 recency = 높은 점수 NTILE(5) OVER (ORDER BY frequency ASC) AS f_score, NTILE(5) OVER (ORDER BY monetary ASC) AS m_score FROM customer_rfm ) SELECT user_id, recency, frequency, monetary, r_score, f_score, m_score, CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_segment, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers' WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END AS customer_segment FROM rfm_scores ORDER BY monetary DESC LIMIT 100

실행 결과:

user_idrecencyfrequencymonetaryr_scoref_scorem_scorerfm_segmentcustomer_segment
5907112341633.89445445Champions
840058921500.73435435Loyal Customers
905724541477.39545545Champions
6723425631389.45234234Others
124563121987.50223223Lost

RFM 세그먼트 정의:

세그먼트R점수F점수M점수전략
Champions4-54-54-5보상 프로그램, VIP 혜택
Loyal Customers4-53-5-업셀링, 추천 유도
Recent Customers4-51-2-재구매 유도, 온보딩
At Risk1-24-5-재활성화 캠페인
Lost1-21-2-윈백 프로모션

퀴즈 3: RFM 세그먼트별 통계

문제

RFM 분석 결과로:

  1. 세그먼트별 고객 수 집계
  2. 세그먼트별 평균 매출 계산
  3. 전체 매출에서 각 세그먼트 비중 산출

힌트: GROUP BY customer_segment

정답 보기

WITH customer_rfm AS ( SELECT o.user_id, DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency, COUNT(DISTINCT o.order_id) AS frequency, SUM(oi.sale_price) AS monetary FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id GROUP BY o.user_id ), rfm_scores AS ( SELECT user_id, recency, frequency, monetary, NTILE(5) OVER (ORDER BY recency DESC) AS r_score, NTILE(5) OVER (ORDER BY frequency ASC) AS f_score, NTILE(5) OVER (ORDER BY monetary ASC) AS m_score FROM customer_rfm ), rfm_segments AS ( SELECT user_id, monetary, CASE WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions' WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers' WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers' WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk' WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost' ELSE 'Others' END AS customer_segment FROM rfm_scores ) SELECT customer_segment, COUNT(*) AS customer_count, ROUND(AVG(monetary), 2) AS avg_monetary, SUM(monetary) AS total_revenue, ROUND(SUM(monetary) * 100.0 / (SELECT SUM(monetary) FROM rfm_segments), 2) AS revenue_share FROM rfm_segments GROUP BY customer_segment ORDER BY total_revenue DESC

실행 결과:

customer_segmentcustomer_countavg_monetarytotal_revenuerevenue_share
Loyal Customers2340678.901588626.0025.97
Champions12501245.671557087.5025.45
Others2180438.90956802.0015.64
At Risk1560456.78712576.8011.65
Recent Customers1890345.60653184.0010.67
Lost1780234.56417516.806.82

비즈니스 인사이트:

  • Champions: 전체 고객의 20%가 80%의 매출 생성 (파레토 법칙)
  • At Risk: 높은 F점수지만 낮은 R점수 → 즉각적인 재활성화 필요
  • Lost: 비용 대비 효과 분석 후 타겟팅 결정

4. 재구매 간격 분석

이론

고객의 구매 패턴을 이해하면 적절한 시점에 마케팅 메시지를 발송할 수 있습니다.

실습: 고객별 평균 재구매 간격

WITH purchase_gaps AS ( SELECT user_id, created_at, LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_purchase, DATE_DIFF( CAST(created_at AS DATE), CAST(LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS DATE), DAY ) AS days_gap FROM `your-project-id.thelook_ecommerce.orders` ), user_purchase_patterns AS ( SELECT user_id, COUNT(*) AS purchase_count, AVG(days_gap) AS avg_gap, MIN(days_gap) AS min_gap, MAX(days_gap) AS max_gap, STDDEV(days_gap) AS stddev_gap FROM purchase_gaps WHERE days_gap IS NOT NULL GROUP BY user_id ) SELECT CASE WHEN avg_gap <= 30 THEN 'Frequent (≤30일)' WHEN avg_gap <= 90 THEN 'Regular (31-90일)' WHEN avg_gap <= 180 THEN 'Occasional (91-180일)' ELSE 'Rare (>180일)' END AS purchase_frequency_group, COUNT(*) AS customer_count, ROUND(AVG(avg_gap), 1) AS avg_repurchase_days, ROUND(AVG(purchase_count), 1) AS avg_purchases FROM user_purchase_patterns GROUP BY purchase_frequency_group ORDER BY avg_repurchase_days

실행 결과:

purchase_frequency_groupcustomer_countavg_repurchase_daysavg_purchases
Frequent (≤30일)123418.55.2
Regular (31-90일)456758.33.1
Occasional (91-180일)3456128.72.4
Rare (>180일)2345245.62.1

퀴즈 4: 카테고리별 재구매 패턴

문제

제품 카테고리별로:

  1. 평균 재구매 간격 계산
  2. 재구매율 (2회 이상 구매 고객 비율) 산출
  3. 결과를 재구매 간격 순으로 정렬

정답 보기

WITH category_purchases AS ( SELECT o.user_id, p.category, o.created_at, LAG(o.created_at) OVER ( PARTITION BY o.user_id, p.category ORDER BY o.created_at ) AS prev_purchase FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id ), category_gaps AS ( SELECT category, user_id, DATE_DIFF(CAST(created_at AS DATE), CAST(prev_purchase AS DATE), DAY) AS days_gap FROM category_purchases WHERE prev_purchase IS NOT NULL ), category_stats AS ( SELECT category, COUNT(DISTINCT user_id) AS repeat_customers, AVG(days_gap) AS avg_repurchase_days FROM category_gaps GROUP BY category ), category_total AS ( SELECT p.category, COUNT(DISTINCT o.user_id) AS total_customers FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.category ) SELECT ct.category, ct.total_customers, COALESCE(cs.repeat_customers, 0) AS repeat_customers, ROUND(COALESCE(cs.repeat_customers, 0) * 100.0 / ct.total_customers, 2) AS repeat_rate, ROUND(cs.avg_repurchase_days, 1) AS avg_repurchase_days FROM category_total ct LEFT JOIN category_stats cs ON ct.category = cs.category ORDER BY avg_repurchase_days

실행 결과:

categorytotal_customersrepeat_customersrepeat_rateavg_repurchase_days
Jeans12450234018.8038.4
Tops & Tees11230189016.8342.1
Accessories8560124514.5545.6
Sweaters9870156015.8048.9
Active7890112014.2052.3
Outerwear & Coats654089013.6165.2

비즈니스 인사이트:

  • 짧은 재구매 간격 카테고리: 구독 모델 적용 검토
  • 긴 재구매 간격 카테고리: 리마인더 마케팅
  • 높은 재구매율 카테고리: 크로스셀링 기회

정리

학습한 CTE 패턴

패턴용도예시
단일 CTE중간 결과 정의고객 요약 테이블
다중 CTE단계별 파이프라인코호트 분석
재귀 CTE계층 구조 처리조직도, 카테고리 트리
CTE + 윈도우 함수복잡한 분석RFM, 리텐션

CTE 베스트 프랙티스

  1. 명확한 네이밍: CTE 이름은 그 목적을 설명해야 함
  2. 단계별 분리: 복잡한 로직을 작은 단위로 분할
  3. 재사용: 같은 CTE를 여러 번 참조하여 중복 제거
  4. 디버깅: 각 CTE를 개별적으로 실행하여 검증

코호트 분석 활용

  • 제품 팀: 기능 출시 후 사용자 리텐션 추적
  • 마케팅 팀: 캠페인별 고객 획득 및 유지 분석
  • 재무 팀: 코호트별 LTV 예측 및 예산 책정

다음 단계

CTE와 코호트 분석을 마스터했습니다! 다음으로 문자열 함수에서 텍스트 데이터 처리 기법을 배워보세요.

Last updated on

🤖AI 모의면접실전처럼 연습하기