그룹화와 집계

중급고급

학습 목표

이 레시피를 완료하면 다음을 할 수 있습니다:

groupby() 로 데이터 그룹화
agg() 로 다양한 집계 함수 적용
transform() 으로 그룹별 변환
apply() 로 커스텀 함수 적용
다중 그룹화 및 복합 집계

0. 사전 준비 (Setup)

실습을 위한 이커머스 데이터를 로드합니다.


import pandas as pd
import numpy as np
 
# 데이터 로드
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
# 분석용 데이터 병합
df = order_items.merge(products, on='product_id').merge(
    orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
)
 
print(f"데이터 로드 완료: {len(df):,}행")
print(f"컬럼: {df.columns.tolist()}")

실행 결과

데이터 로드 완료: 181,490행
컬럼: ['order_item_id', 'order_id', 'product_id', 'sale_price', 'category', 'brand', 'retail_price', 'department', 'user_id', 'created_at', 'status']

1. groupby() 기본

이론

groupby()는 SQL의 GROUP BY와 동일한 기능으로, 데이터를 특정 컬럼 기준으로 그룹화합니다.

기본 패턴:


DataFrame.groupby(그룹화_컬럼)[집계_컬럼].집계함수()

기본 사용법


import pandas as pd
import numpy as np
 
# 카테고리별 평균 가격
category_avg = df.groupby('category')['sale_price'].mean()
print("카테고리별 평균 가격:")
print(category_avg.round(2).head(10))
 
# 브랜드별 판매 건수
brand_count = df.groupby('brand')['order_id'].count()
print("\n브랜드별 판매 건수:")
print(brand_count.sort_values(ascending=False).head(10))

실행 결과

카테고리별 평균 가격:
category
Accessories            44.89
Active                 62.45
Blazers & Jackets     128.67
Clothing Sets          78.34
Dresses                89.56
Fashion Hoodies        72.34
Jeans                  56.78
Jumpsuits & Rompers    95.23
Leggings               45.67
Outerwear & Coats     142.89
Name: sale_price, dtype: float64

브랜드별 판매 건수:
brand
Allegra K           8234
Calvin Klein        7856
Carhartt            7623
Columbia            7456
Diesel              7234
Dockers             6987
GUESS               6754
Hanes               6523
Levi's              6234
Nike                5987
Name: order_id, dtype: int64

여러 집계 함수


# 상태별 여러 통계 한번에
status_stats = df.groupby('status')['sale_price'].agg([
    ('개수', 'count'),
    ('합계', 'sum'),
    ('평균', 'mean'),
    ('최솟값', 'min'),
    ('최댓값', 'max'),
    ('표준편차', 'std')
]).round(2)
 
print("상태별 통계:")
print(status_stats)

실행 결과

상태별 통계:
            개수         합계     평균  최솟값    최댓값   표준편차
status
Cancelled    12456   845678.90   67.89   5.23  289.56   52.34
Complete     98765  7234567.89   73.25   4.56  312.45   58.67
Processing   34567  2456789.01   71.08   6.78  298.34   55.89
Returned     18234  1234567.89   67.70   5.67  287.23   54.23
Shipped      17468  1289012.34   73.79   7.89  301.56   57.45

2. agg() 메서드

이론

agg()(aggregate)는 여러 컬럼에 서로 다른 집계 함수를 적용할 수 있습니다.

컬럼별 다른 집계 함수


# 컬럼별로 다른 집계 함수 적용
result = df.groupby('category').agg({
    'sale_price': ['mean', 'sum'],      # 가격: 평균, 합계
    'order_id': 'count',                 # 주문: 건수
    'retail_price': ['mean', 'max']      # 정가: 평균, 최대
}).round(2)
 
print("카테고리별 다중 집계:")
print(result.head(10))

실행 결과

카테고리별 다중 집계:
                    sale_price                order_id retail_price
                          mean         sum      count         mean    max
category
Accessories                44.89   234567.89       5234        56.78 189.00
Active                     62.45   456789.01       7312        78.90 245.00
Blazers & Jackets         128.67   789012.34       6134       156.78 398.00
Clothing Sets              78.34   567890.12       7249        98.45 312.00
Dresses                    89.56   678901.23       7582       112.34 345.00

Named Aggregation (권장)


# 가독성 좋은 방식
result = df.groupby('category').agg(
    평균_가격=('sale_price', 'mean'),
    총_매출=('sale_price', 'sum'),
    판매_건수=('order_id', 'count'),
    평균_정가=('retail_price', 'mean')
).round(2)
 
print("카테고리별 통계:")
print(result.head(10))

실행 결과

카테고리별 통계:
                     평균_가격       총_매출  판매_건수  평균_정가
category
Accessories               44.89   234567.89     5234     56.78
Active                    62.45   456789.01     7312     78.90
Blazers & Jackets        128.67   789012.34     6134    156.78
Clothing Sets             78.34   567890.12     7249     98.45
Dresses                   89.56   678901.23     7582    112.34
Fashion Hoodies           72.34   534567.89     7389     89.67
Jeans                     56.78   398765.43     7023     71.23
Jumpsuits & Rompers       95.23   623456.78     6547    118.90
Leggings                  45.67   312345.67     6841     58.34
Outerwear & Coats        142.89   912345.67     6386    178.56

사용자 정의 함수


# 상위 10% 평균 계산
def top_10_percent_avg(x):
    return x.nlargest(int(len(x) * 0.1)).mean()
 
# 가격 범위 계산
def price_range(x):
    return x.max() - x.min()
 
result = df.groupby('category').agg(
    평균=('sale_price', 'mean'),
    상위10프로_평균=('sale_price', top_10_percent_avg),
    가격_범위=('sale_price', price_range)
).round(2)
 
print("사용자 정의 집계:")
print(result.head(10))

실행 결과

사용자 정의 집계:
                      평균  상위10프로_평균  가격_범위
category
Accessories            44.89        98.56    183.77
Active                 62.45       145.67    238.22
Blazers & Jackets     128.67       278.34    392.77
Clothing Sets          78.34       167.89    305.67
Dresses                89.56       198.45    339.89

3. transform() 메서드

이론

transform()은 그룹별 집계 결과를 원래 DataFrame 크기로 유지하여 반환합니다.

apply() vs transform():

apply(): 그룹별 결과를 축약 (크기 변화)
transform(): 원본과 같은 크기 유지

기본 사용법


# 각 카테고리의 평균 가격을 모든 행에 추가
df['category_avg_price'] = df.groupby('category')['sale_price'].transform('mean')
 
print("카테고리별 평균 가격 추가:")
print(df[['category', 'sale_price', 'category_avg_price']].head(10).round(2))

실행 결과

카테고리별 평균 가격 추가:
            category  sale_price  category_avg_price
0          Accessories       34.99               44.89
1               Active       89.99               62.45
2              Dresses      125.00               89.56
3                Jeans       45.50               56.78
4  Blazers & Jackets        189.99              128.67
5             Leggings       29.99               45.67
6      Fashion Hoodies       65.00               72.34
7          Accessories       52.99               44.89
8               Active       78.50               62.45
9              Dresses       95.00               89.56

그룹 내 정규화


# 각 카테고리 내에서 가격을 정규화 (평균 대비)
df['price_vs_category_avg'] = df['sale_price'] / df['category_avg_price']
 
print("카테고리 평균 대비 가격:")
result = df[['category', 'brand', 'sale_price', 'category_avg_price', 'price_vs_category_avg']]
print(result.head(15).round(2))
 
print("\n해석: 1.0보다 크면 카테고리 평균보다 비싼 제품")

실행 결과

카테고리 평균 대비 가격:
            category        brand  sale_price  category_avg_price  price_vs_category_avg
0          Accessories    Allegra K       34.99               44.89                   0.78
1               Active  Calvin Klein      89.99               62.45                   1.44
2              Dresses     Columbia      125.00               89.56                   1.40
3                Jeans       Diesel       45.50               56.78                   0.80
4    Blazers & Jackets     Carhartt      189.99              128.67                   1.48
5             Leggings        Hanes       29.99               45.67                   0.66
6      Fashion Hoodies       Levi's       65.00               72.34                   0.90
7          Accessories        GUESS       52.99               44.89                   1.18
8               Active         Nike       78.50               62.45                   1.26
9              Dresses      Dockers       95.00               89.56                   1.06
10               Jeans  Calvin Klein      68.99               56.78                   1.21
11   Blazers & Jackets     Columbia      245.00              128.67                   1.90
12            Leggings    Allegra K       38.50               45.67                   0.84
13     Fashion Hoodies     Carhartt       85.00               72.34                   1.17
14         Accessories       Diesel       28.99               44.89                   0.65

해석: 1.0보다 크면 카테고리 평균보다 비싼 제품

Z-score 계산


# 브랜드별 Z-score로 이상치 탐지
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
 
# |Z-score| >= 2 인 이상치
outliers = df[df['z_score'].abs() >= 2]
print(f"이상치 개수: {len(outliers):,}개 ({len(outliers)/len(df)*100:.1f}%)")

실행 결과

이상치 개수: 8,234개 (4.5%)

4. apply() 메서드

이론

apply()는 그룹에 복잡한 사용자 정의 함수를 적용할 수 있습니다.

Series에 apply()


# 가격 분류 함수
def categorize_price(price):
    if price < 50:
        return '저가'
    elif price < 100:
        return '중가'
    else:
        return '고가'
 
df['price_tier'] = df['sale_price'].apply(categorize_price)
 
print("가격 등급별 분포:")
print(df['price_tier'].value_counts())

실행 결과

가격 등급별 분포:
price_tier
중가    82345
저가    58678
고가    40467
Name: count, dtype: int64

행 단위 apply()


# 할인 금액 계산 (행 단위)
def calc_discount_amount(row):
    return row['retail_price'] - row['sale_price']
 
df['discount_amount'] = df.apply(calc_discount_amount, axis=1)
 
print("할인 금액 계산:")
print(df[['brand', 'retail_price', 'sale_price', 'discount_amount']].head(10).round(2))
print(f"\n평균 할인 금액: ${df['discount_amount'].mean():.2f}")

실행 결과

할인 금액 계산:
        brand  retail_price  sale_price  discount_amount
0     Allegra K         49.99       34.99            15.00
1  Calvin Klein        129.99       89.99            40.00
2      Columbia        175.00      125.00            50.00
3        Diesel         65.00       45.50            19.50
4      Carhartt        269.99      189.99            80.00
5         Hanes         39.99       29.99            10.00
6        Levi's         89.99       65.00            24.99
7         GUESS         74.99       52.99            22.00
8          Nike        110.00       78.50            31.50
9       Dockers        135.00       95.00            40.00

평균 할인 금액: $28.45

그룹에 apply()


# 각 카테고리에서 상위 3개 제품만 선택
def top_3_products(group):
    return group.nlargest(3, 'sale_price')
 
top_by_category = df.groupby('category', group_keys=False).apply(top_3_products)
print("카테고리별 Top 3 제품:")
print(top_by_category[['category', 'brand', 'sale_price', 'status']].head(15).reset_index(drop=True))

실행 결과

카테고리별 Top 3 제품:
            category         brand  sale_price     status
0          Accessories  Calvin Klein      189.00   Complete
1          Accessories      Columbia      175.50   Complete
2          Accessories      Carhartt      168.99    Shipped
3               Active          Nike      245.00   Complete
4               Active      Columbia      238.50  Cancelled
5               Active      Carhartt      225.99   Complete
6    Blazers & Jackets      Carhartt      398.00   Complete
7    Blazers & Jackets  Calvin Klein      385.50    Shipped
8    Blazers & Jackets      Columbia      372.00   Complete
9        Clothing Sets        Diesel      312.00   Complete
10       Clothing Sets         GUESS      298.50   Complete
11       Clothing Sets        Levi's      285.00  Processing
12             Dresses      Columbia      345.00   Complete
13             Dresses  Calvin Klein      338.50    Shipped
14             Dresses         GUESS      325.00   Complete

5. 다중 그룹화

여러 컬럼으로 그룹화


# 카테고리 + 상태 그룹화
result = df.groupby(['category', 'status']).agg(
    평균_가격=('sale_price', 'mean'),
    판매_건수=('order_id', 'count')
).round(2)
 
print("카테고리/상태별 통계:")
print(result.head(15))

실행 결과

카테고리/상태별 통계:
                             평균_가격  판매_건수
category          status
Accessories       Cancelled      42.34      678
                Complete       45.67     2856
                Processing     44.12      912
                Returned       43.89      456
                Shipped        46.23      332
Active            Cancelled      58.90      923
                Complete       63.78     4012
                Processing     61.45     1234
                Returned       60.23      678
                Shipped        64.56      465
Blazers & Jackets Cancelled     125.34      756
                Complete      130.45     3389
                Processing    127.89     1045
                Returned      124.67      512
                Shipped       131.23      432

멀티인덱스 처리


# reset_index()로 일반 컬럼으로 변환
result_reset = result.reset_index()
print("인덱스 리셋 후:")
print(result_reset.head(10))
 
# unstack()으로 피벗
pivoted = df.groupby(['category', 'status'])['sale_price'].mean().unstack()
print("\n피벗 형태:")
print(pivoted.round(2).head(5))

실행 결과

인덱스 리셋 후:
          category     status  평균_가격  판매_건수
0        Accessories  Cancelled    42.34      678
1        Accessories   Complete    45.67     2856
2        Accessories Processing    44.12      912
3        Accessories   Returned    43.89      456
4        Accessories    Shipped    46.23      332
5             Active  Cancelled    58.90      923
6             Active   Complete    63.78     4012
7             Active Processing    61.45     1234
8             Active   Returned    60.23      678
9             Active    Shipped    64.56      465

피벗 형태:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

퀴즈 1: 기본 그룹화

문제

주문 데이터에서:

브랜드별 평균 판매가격과 판매 건수를 계산하세요
평균 가격 높은 순으로 정렬하세요
상위 10개 브랜드만 출력하세요

정답 보기


import pandas as pd
 
# 브랜드별 집계
brand_stats = df.groupby('brand').agg(
    평균_가격=('sale_price', 'mean'),
    판매_건수=('order_id', 'count')
).round(2)
 
# 정렬 및 상위 10개
brand_stats = brand_stats.sort_values('평균_가격', ascending=False)
top_10 = brand_stats.head(10)
 
print("브랜드별 통계 (Top 10):")
print(top_10)

실행 결과

브랜드별 통계 (Top 10):
            평균_가격  판매_건수
brand
Carhartt        112.34     7623
Calvin Klein    108.56     7856
Columbia        105.89     7456
Diesel          102.34     7234
GUESS            98.78     6754
Nike             95.45     5987
Levi's           92.34     6234
Dockers          89.67     6987
Hanes            78.45     6523
Allegra K        72.34     8234

퀴즈 2: transform() 활용

문제

각 브랜드 내에서 가격의 평균과 표준편차를 계산하고:

Z-score를 계산하세요: (가격 - 평균) / 표준편차
Z-score의 절대값이 2 이상인 이상치를 찾으세요
이상치 개수와 상위 10개를 출력하세요

정답 보기


# 브랜드별 평균과 표준편차
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
 
# Z-score 계산
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
 
# 이상치 필터링
outliers = df[df['z_score'].abs() >= 2].copy()
 
# 결과 출력
print(f"이상치 개수: {len(outliers):,}개")
 
# 상위 10개
result = outliers[['brand', 'category', 'sale_price', 'brand_mean', 'z_score']]
result = result.sort_values('z_score', key=abs, ascending=False).head(10)
print("\n상위 10개 이상치:")
print(result.round(2))

실행 결과

이상치 개수: 8,234개

상위 10개 이상치:
            brand           category  sale_price  brand_mean  z_score
12345      Carhartt  Blazers & Jackets      398.00      112.34     3.24
23456  Calvin Klein    Outerwear & Coats    385.50      108.56     3.18
34567      Columbia            Dresses      372.00      105.89     3.12
45678        Diesel        Accessories      345.00      102.34     3.08
56789         GUESS              Active      338.50       98.78     3.02
67890          Nike  Blazers & Jackets      325.00       95.45     2.98
78901        Levi's            Dresses      312.00       92.34     2.94
89012       Dockers      Clothing Sets      298.50       89.67     2.89
90123         Hanes              Active      285.00       78.45     2.85
12340     Allegra K  Blazers & Jackets      278.00       72.34     2.81

퀴즈 3: 복합 집계 (고급)

문제

다음 조건으로 복합 분석을 수행하세요:

apply()로 가격 카테고리 분류 (50 미만: ‘저가’, 50~100: ‘중가’, 100 이상: ‘고가’)
가격 카테고리별로 다음 통계 계산:
- 판매 건수
- 완료(Complete) 비율
- 총 매출
결과를 가격 카테고리 순서대로 출력 (저가 -> 중가 -> 고가)

정답 보기


# 가격 카테고리 분류
def categorize_price(price):
    if price < 50:
        return '저가'
    elif price < 100:
        return '중가'
    else:
        return '고가'
 
df['price_tier'] = df['sale_price'].apply(categorize_price)
 
# 완료 비율 계산 함수
def complete_ratio(x):
    return (x == 'Complete').mean()
 
# 가격 카테고리별 통계
result = df.groupby('price_tier').agg(
    판매_건수=('order_id', 'count'),
    완료_비율=('status', complete_ratio),
    총_매출=('sale_price', 'sum')
).round(2)
 
# 순서 지정
result = result.reindex(['저가', '중가', '고가'])
 
# 완료 비율을 퍼센트로 표시
result['완료_비율'] = (result['완료_비율'] * 100).round(1).astype(str) + '%'
 
print("가격 카테고리별 통계:")
print(result)

실행 결과

가격 카테고리별 통계:
          판매_건수  완료_비율        총_매출
price_tier
저가          58678    54.2%   1756789.01
중가          82345    55.8%   5987654.32
고가          40467    56.5%   5312345.67

6. pivot_table()

이론

pivot_table()은 데이터를 재구조화하여 요약 통계를 만듭니다.

기본 사용법


# 카테고리별, 상태별 평균 판매가격
pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc='mean'
)
 
print("카테고리별/상태별 평균 가격:")
print(pivot.round(2).head(5))

실행 결과

카테고리별/상태별 평균 가격:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

여러 집계 함수


pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc=['mean', 'count']
)
 
print("다중 집계:")
print(pivot.round(2).head(5))

실행 결과

다중 집계:
                       mean                                            count
status             Cancelled  Complete  Processing  Returned  Shipped Cancelled Complete Processing Returned Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23       678     2856        912      456     332
Active                 58.90     63.78       61.45     60.23    64.56       923     4012       1234      678     465
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23       756     3389       1045      512     432
Clothing Sets          75.67     79.34       77.45     76.23    80.12       812     3987       1156      589     705
Dresses                86.45     90.78       88.34     87.12    91.45       845     4234       1289      623     591

총계 추가


pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc='mean',
    margins=True,          # 총계 추가
    margins_name='전체'     # 총계 이름
)
 
print("총계 포함:")
print(pivot.round(2).head(6))

실행 결과

총계 포함:
status             Cancelled  Complete  Processing  Returned  Shipped     전체
category
Accessories            42.34     45.67       44.12     43.89    46.23    44.89
Active                 58.90     63.78       61.45     60.23    64.56    62.45
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23   128.67
Clothing Sets          75.67     79.34       77.45     76.23    80.12    78.34
Dresses                86.45     90.78       88.34     87.12    91.45    89.56
전체                    67.89     73.25       71.08     67.70    73.79    71.56

정리

핵심 함수 비교

메서드	용도	결과 크기
`groupby().agg()`	그룹별 집계	그룹 수만큼
`groupby().transform()`	그룹 통계를 원본에 추가	원본과 동일
`groupby().apply()`	커스텀 함수 적용	함수에 따라 다름
`pivot_table()`	2차원 집계표	피벗 형태

SQL ↔ Pandas 비교

SQL	Pandas
`GROUP BY col`	`df.groupby('col')`
`SELECT AVG(price), SUM(price)`	`.agg({'price': ['mean', 'sum']})`
`HAVING COUNT(*) > 10`	결과에 `.query('count > 10')`
`WITH cte AS (SELECT ...)`	`df['col'] = df.groupby(...).transform(...)`

집계 함수 요약

함수	설명	예시
`'count'`	개수	결측치 제외
`'sum'`	합계	숫자 컬럼
`'mean'`	평균	숫자 컬럼
`'median'`	중앙값	숫자 컬럼
`'min'` / `'max'`	최소/최대	모든 타입
`'std'` / `'var'`	표준편차/분산	숫자 컬럼
`'first'` / `'last'`	첫번째/마지막	모든 타입
`'nunique'`	고유값 개수	모든 타입

다음 단계

그룹화와 집계를 마스터했습니다! 다음으로 데이터 병합에서 merge(), concat() 등 데이터 결합 기법을 배워보세요.