CTE를 활용한 코호트 분석
중급고급
BigQuery 실행 환경 설정
from google.cloud import bigquery
import pandas as pd
# 인증 설정 (서비스 계정 키 사용 시)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# BigQuery 클라이언트 생성
client = bigquery.Client(project='your-project-id')
# 쿼리 실행 함수
def run_query(query):
return client.query(query).to_dataframe()학습 목표
이 레시피를 완료하면 다음을 할 수 있습니다:
- CTE(Common Table Expression)의 개념과 장점 이해
- WITH 절로 단계적 쿼리 구조화
- 다중 CTE로 복잡한 분석 파이프라인 구축
- 코호트 분석으로 고객 리텐션 측정
- RFM 분석으로 고객 세분화
- 재구매 간격 분석
1. 기본 CTE - WITH 절
이론
CTE는 쿼리 내에서 임시 결과 세트를 정의하는 방법입니다.
서브쿼리 vs CTE
서브쿼리 (읽기 어려움):
SELECT * FROM (
SELECT * FROM (
SELECT ...
)
)CTE (읽기 쉬움):
WITH step1 AS (...),
step2 AS (...)
SELECT * FROM step2장점
- 가독성 향상
- 재사용 가능
- 디버깅 용이
- 단계별 로직 명확화
실습: 고객 구매 요약
WITH customer_summary AS (
SELECT
o.user_id,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(oi.sale_price) AS total_spent,
MIN(o.created_at) AS first_order_date,
MAX(o.created_at) AS last_order_date
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
GROUP BY o.user_id
)
SELECT
user_id,
total_orders,
total_spent,
ROUND(total_spent / total_orders, 2) AS avg_order_value,
first_order_date,
last_order_date,
DATE_DIFF(CAST(last_order_date AS DATE), CAST(first_order_date AS DATE), DAY) AS customer_lifetime_days
FROM customer_summary
ORDER BY total_spent DESC
LIMIT 20실행 결과:
| user_id | total_orders | total_spent | avg_order_value | customer_lifetime_days |
|---|---|---|---|---|
| 59071 | 4 | 1633.89 | 408.47 | 961 |
| 84005 | 2 | 1500.73 | 750.36 | 275 |
| 90572 | 4 | 1477.39 | 369.35 | 71 |
ℹ️
CTE의 장점
- 가독성: 복잡한 쿼리를 논리적 단계로 분리
- 재사용: 같은 CTE를 여러 번 참조 가능
- 디버깅: 각 단계를 개별적으로 테스트 가능
퀴즈 1: 월별 매출 요약
문제
CTE를 사용하여:
- 첫 번째 CTE: 월별 매출 집계
- 두 번째 CTE: 전월 대비 성장률 계산
- 최종: 성장률이 양수인 월만 조회
힌트: DATE_TRUNC, LAG
정답 보기
WITH monthly_revenue AS (
SELECT
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS month,
SUM(oi.sale_price) AS revenue,
COUNT(DISTINCT o.order_id) AS order_count
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY month
),
monthly_growth AS (
SELECT
month,
revenue,
order_count,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) /
NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2) AS growth_rate
FROM monthly_revenue
)
SELECT
month,
revenue,
order_count,
prev_revenue,
growth_rate
FROM monthly_growth
WHERE growth_rate > 0
ORDER BY month실행 결과:
| month | revenue | order_count | prev_revenue | growth_rate |
|---|---|---|---|---|
| 2024-01-01 | 156234.50 | 12890 | 142567.25 | 9.58 |
| 2024-03-01 | 172456.30 | 14560 | 148923.75 | 15.80 |
핵심 포인트:
- CTE를 단계별로 쌓아가며 복잡한 로직 구현
- LAG 함수로 이전 행 값 참조
- NULLIF로 0 나누기 방지
2. 다중 CTE - 코호트 분석
이론
코호트 분석은 동일한 특성(가입 시점, 첫 구매일 등)을 가진 사용자 그룹의 행동 패턴을 추적합니다.
핵심 개념
- 코호트: 같은 시점에 특정 이벤트를 경험한 사용자 그룹
- 리텐션: 시간이 지나도 계속 활동하는 사용자 비율
- 코호트 기간: 첫 이벤트 이후 경과 기간
실습: 월별 코호트 리텐션
WITH user_first_order AS (
-- 1단계: 각 사용자의 첫 주문 월 (코호트 정의)
SELECT
user_id,
DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month
FROM `your-project-id.thelook_ecommerce.orders`
GROUP BY user_id
),
user_orders AS (
-- 2단계: 모든 주문에 코호트 정보 추가
SELECT
o.user_id,
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month,
ufo.cohort_month
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN user_first_order ufo ON o.user_id = ufo.user_id
),
cohort_data AS (
-- 3단계: 코호트 기간 계산
SELECT
cohort_month,
order_month,
DATE_DIFF(order_month, cohort_month, MONTH) AS months_since_first,
COUNT(DISTINCT user_id) AS users
FROM user_orders
GROUP BY cohort_month, order_month
),
cohort_size AS (
-- 4단계: 코호트별 초기 사용자 수
SELECT
cohort_month,
users AS cohort_users
FROM cohort_data
WHERE months_since_first = 0
)
-- 5단계: 리텐션율 계산
SELECT
cd.cohort_month,
cd.months_since_first,
cd.users AS active_users,
cs.cohort_users,
ROUND(cd.users * 100.0 / cs.cohort_users, 2) AS retention_rate
FROM cohort_data cd
JOIN cohort_size cs ON cd.cohort_month = cs.cohort_month
WHERE cd.cohort_month >= '2024-01-01'
ORDER BY cd.cohort_month, cd.months_since_first실행 결과:
| cohort_month | months_since_first | active_users | cohort_users | retention_rate |
|---|---|---|---|---|
| 2024-01-01 | 0 | 1892 | 1892 | 100.00 |
| 2024-01-01 | 1 | 156 | 1892 | 8.24 |
| 2024-01-01 | 2 | 98 | 1892 | 5.18 |
| 2024-02-01 | 0 | 1845 | 1845 | 100.00 |
| 2024-02-01 | 1 | 142 | 1845 | 7.69 |
💡
코호트 분석 인사이트
- Month 0: 항상 100% (정의상)
- Month 1 급감: 첫 구매 후 재방문 유도 전략 필요
- 안정화 시점: 리텐션이 일정해지는 시점이 충성 고객 기준
퀴즈 2: 코호트별 누적 매출
문제
다중 CTE로:
- 사용자별 코호트(첫 구매 월) 정의
- 코호트별, 기간별 누적 매출 계산
- 코호트별 12개월 LTV 추정
힌트: SUM() OVER(PARTITION BY cohort ORDER BY month)
정답 보기
WITH user_cohort AS (
SELECT
user_id,
DATE_TRUNC(MIN(CAST(created_at AS DATE)), MONTH) AS cohort_month
FROM `your-project-id.thelook_ecommerce.orders`
GROUP BY user_id
),
user_monthly_revenue AS (
SELECT
o.user_id,
uc.cohort_month,
DATE_TRUNC(CAST(o.created_at AS DATE), MONTH) AS order_month,
DATE_DIFF(DATE_TRUNC(CAST(o.created_at AS DATE), MONTH), uc.cohort_month, MONTH) AS months_since_first,
SUM(oi.sale_price) AS revenue
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN user_cohort uc ON o.user_id = uc.user_id
GROUP BY o.user_id, uc.cohort_month, order_month
),
cohort_revenue AS (
SELECT
cohort_month,
months_since_first,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT user_id) AS active_users
FROM user_monthly_revenue
GROUP BY cohort_month, months_since_first
)
SELECT
cohort_month,
months_since_first,
total_revenue,
active_users,
SUM(total_revenue) OVER (
PARTITION BY cohort_month
ORDER BY months_since_first
) AS cumulative_revenue
FROM cohort_revenue
WHERE cohort_month >= '2024-01-01'
AND months_since_first <= 12
ORDER BY cohort_month, months_since_first실행 결과:
| cohort_month | months_since_first | total_revenue | active_users | cumulative_revenue |
|---|---|---|---|---|
| 2024-01-01 | 0 | 285600.50 | 1892 | 285600.50 |
| 2024-01-01 | 1 | 23450.75 | 156 | 309051.25 |
| 2024-01-01 | 2 | 15890.30 | 98 | 324941.55 |
| 2024-02-01 | 0 | 268920.40 | 1845 | 268920.40 |
| 2024-02-01 | 1 | 21560.80 | 142 | 290481.20 |
비즈니스 인사이트:
- 초기 매출: Month 0에서 대부분의 매출 발생
- LTV 패턴: 누적 매출 곡선으로 고객 가치 예측
- 코호트 비교: 마케팅 캠페인 효과 측정
3. RFM 분석
이론
RFM은 고객 세분화의 대표적 방법론입니다:
- R (Recency): 최근성 - 마지막 구매로부터 경과일
- F (Frequency): 빈도 - 총 구매 횟수
- M (Monetary): 금액 - 총 구매 금액
실습: RFM 점수 계산
WITH customer_rfm AS (
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency,
COUNT(DISTINCT o.order_id) AS frequency,
SUM(oi.sale_price) AS monetary
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY o.user_id
),
rfm_scores AS (
SELECT
user_id,
recency,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score, -- 낮은 recency = 높은 점수
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM customer_rfm
)
SELECT
user_id,
recency,
frequency,
monetary,
r_score,
f_score,
m_score,
CONCAT(CAST(r_score AS STRING), CAST(f_score AS STRING), CAST(m_score AS STRING)) AS rfm_segment,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers'
WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
ELSE 'Others'
END AS customer_segment
FROM rfm_scores
ORDER BY monetary DESC
LIMIT 100실행 결과:
| user_id | recency | frequency | monetary | r_score | f_score | m_score | rfm_segment | customer_segment |
|---|---|---|---|---|---|---|---|---|
| 59071 | 123 | 4 | 1633.89 | 4 | 4 | 5 | 445 | Champions |
| 84005 | 89 | 2 | 1500.73 | 4 | 3 | 5 | 435 | Loyal Customers |
| 90572 | 45 | 4 | 1477.39 | 5 | 4 | 5 | 545 | Champions |
| 67234 | 256 | 3 | 1389.45 | 2 | 3 | 4 | 234 | Others |
| 12456 | 312 | 1 | 987.50 | 2 | 2 | 3 | 223 | Lost |
RFM 세그먼트 정의:
| 세그먼트 | R점수 | F점수 | M점수 | 전략 |
|---|---|---|---|---|
| Champions | 4-5 | 4-5 | 4-5 | 보상 프로그램, VIP 혜택 |
| Loyal Customers | 4-5 | 3-5 | - | 업셀링, 추천 유도 |
| Recent Customers | 4-5 | 1-2 | - | 재구매 유도, 온보딩 |
| At Risk | 1-2 | 4-5 | - | 재활성화 캠페인 |
| Lost | 1-2 | 1-2 | - | 윈백 프로모션 |
퀴즈 3: RFM 세그먼트별 통계
문제
RFM 분석 결과로:
- 세그먼트별 고객 수 집계
- 세그먼트별 평균 매출 계산
- 전체 매출에서 각 세그먼트 비중 산출
힌트: GROUP BY customer_segment
정답 보기
WITH customer_rfm AS (
SELECT
o.user_id,
DATE_DIFF(CURRENT_DATE(), MAX(CAST(o.created_at AS DATE)), DAY) AS recency,
COUNT(DISTINCT o.order_id) AS frequency,
SUM(oi.sale_price) AS monetary
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
GROUP BY o.user_id
),
rfm_scores AS (
SELECT
user_id,
recency,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM customer_rfm
),
rfm_segments AS (
SELECT
user_id,
monetary,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score >= 4 AND f_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 4 AND f_score <= 2 THEN 'Recent Customers'
WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
ELSE 'Others'
END AS customer_segment
FROM rfm_scores
)
SELECT
customer_segment,
COUNT(*) AS customer_count,
ROUND(AVG(monetary), 2) AS avg_monetary,
SUM(monetary) AS total_revenue,
ROUND(SUM(monetary) * 100.0 / (SELECT SUM(monetary) FROM rfm_segments), 2) AS revenue_share
FROM rfm_segments
GROUP BY customer_segment
ORDER BY total_revenue DESC실행 결과:
| customer_segment | customer_count | avg_monetary | total_revenue | revenue_share |
|---|---|---|---|---|
| Loyal Customers | 2340 | 678.90 | 1588626.00 | 25.97 |
| Champions | 1250 | 1245.67 | 1557087.50 | 25.45 |
| Others | 2180 | 438.90 | 956802.00 | 15.64 |
| At Risk | 1560 | 456.78 | 712576.80 | 11.65 |
| Recent Customers | 1890 | 345.60 | 653184.00 | 10.67 |
| Lost | 1780 | 234.56 | 417516.80 | 6.82 |
비즈니스 인사이트:
- Champions: 전체 고객의 20%가 80%의 매출 생성 (파레토 법칙)
- At Risk: 높은 F점수지만 낮은 R점수 → 즉각적인 재활성화 필요
- Lost: 비용 대비 효과 분석 후 타겟팅 결정
4. 재구매 간격 분석
이론
고객의 구매 패턴을 이해하면 적절한 시점에 마케팅 메시지를 발송할 수 있습니다.
실습: 고객별 평균 재구매 간격
WITH purchase_gaps AS (
SELECT
user_id,
created_at,
LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_purchase,
DATE_DIFF(
CAST(created_at AS DATE),
CAST(LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS DATE),
DAY
) AS days_gap
FROM `your-project-id.thelook_ecommerce.orders`
),
user_purchase_patterns AS (
SELECT
user_id,
COUNT(*) AS purchase_count,
AVG(days_gap) AS avg_gap,
MIN(days_gap) AS min_gap,
MAX(days_gap) AS max_gap,
STDDEV(days_gap) AS stddev_gap
FROM purchase_gaps
WHERE days_gap IS NOT NULL
GROUP BY user_id
)
SELECT
CASE
WHEN avg_gap <= 30 THEN 'Frequent (≤30일)'
WHEN avg_gap <= 90 THEN 'Regular (31-90일)'
WHEN avg_gap <= 180 THEN 'Occasional (91-180일)'
ELSE 'Rare (>180일)'
END AS purchase_frequency_group,
COUNT(*) AS customer_count,
ROUND(AVG(avg_gap), 1) AS avg_repurchase_days,
ROUND(AVG(purchase_count), 1) AS avg_purchases
FROM user_purchase_patterns
GROUP BY purchase_frequency_group
ORDER BY avg_repurchase_days실행 결과:
| purchase_frequency_group | customer_count | avg_repurchase_days | avg_purchases |
|---|---|---|---|
| Frequent (≤30일) | 1234 | 18.5 | 5.2 |
| Regular (31-90일) | 4567 | 58.3 | 3.1 |
| Occasional (91-180일) | 3456 | 128.7 | 2.4 |
| Rare (>180일) | 2345 | 245.6 | 2.1 |
퀴즈 4: 카테고리별 재구매 패턴
문제
제품 카테고리별로:
- 평균 재구매 간격 계산
- 재구매율 (2회 이상 구매 고객 비율) 산출
- 결과를 재구매 간격 순으로 정렬
정답 보기
WITH category_purchases AS (
SELECT
o.user_id,
p.category,
o.created_at,
LAG(o.created_at) OVER (
PARTITION BY o.user_id, p.category
ORDER BY o.created_at
) AS prev_purchase
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id
),
category_gaps AS (
SELECT
category,
user_id,
DATE_DIFF(CAST(created_at AS DATE), CAST(prev_purchase AS DATE), DAY) AS days_gap
FROM category_purchases
WHERE prev_purchase IS NOT NULL
),
category_stats AS (
SELECT
category,
COUNT(DISTINCT user_id) AS repeat_customers,
AVG(days_gap) AS avg_repurchase_days
FROM category_gaps
GROUP BY category
),
category_total AS (
SELECT
p.category,
COUNT(DISTINCT o.user_id) AS total_customers
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id
GROUP BY p.category
)
SELECT
ct.category,
ct.total_customers,
COALESCE(cs.repeat_customers, 0) AS repeat_customers,
ROUND(COALESCE(cs.repeat_customers, 0) * 100.0 / ct.total_customers, 2) AS repeat_rate,
ROUND(cs.avg_repurchase_days, 1) AS avg_repurchase_days
FROM category_total ct
LEFT JOIN category_stats cs ON ct.category = cs.category
ORDER BY avg_repurchase_days실행 결과:
| category | total_customers | repeat_customers | repeat_rate | avg_repurchase_days |
|---|---|---|---|---|
| Jeans | 12450 | 2340 | 18.80 | 38.4 |
| Tops & Tees | 11230 | 1890 | 16.83 | 42.1 |
| Accessories | 8560 | 1245 | 14.55 | 45.6 |
| Sweaters | 9870 | 1560 | 15.80 | 48.9 |
| Active | 7890 | 1120 | 14.20 | 52.3 |
| Outerwear & Coats | 6540 | 890 | 13.61 | 65.2 |
비즈니스 인사이트:
- 짧은 재구매 간격 카테고리: 구독 모델 적용 검토
- 긴 재구매 간격 카테고리: 리마인더 마케팅
- 높은 재구매율 카테고리: 크로스셀링 기회
정리
학습한 CTE 패턴
| 패턴 | 용도 | 예시 |
|---|---|---|
| 단일 CTE | 중간 결과 정의 | 고객 요약 테이블 |
| 다중 CTE | 단계별 파이프라인 | 코호트 분석 |
| 재귀 CTE | 계층 구조 처리 | 조직도, 카테고리 트리 |
| CTE + 윈도우 함수 | 복잡한 분석 | RFM, 리텐션 |
CTE 베스트 프랙티스
- 명확한 네이밍: CTE 이름은 그 목적을 설명해야 함
- 단계별 분리: 복잡한 로직을 작은 단위로 분할
- 재사용: 같은 CTE를 여러 번 참조하여 중복 제거
- 디버깅: 각 CTE를 개별적으로 실행하여 검증
코호트 분석 활용
- 제품 팀: 기능 출시 후 사용자 리텐션 추적
- 마케팅 팀: 캠페인별 고객 획득 및 유지 분석
- 재무 팀: 코호트별 LTV 예측 및 예산 책정
다음 단계
CTE와 코호트 분석을 마스터했습니다! 다음으로 문자열 함수에서 텍스트 데이터 처리 기법을 배워보세요.
Last updated on