데이터 병합과 결합

중급고급

학습 목표

이 레시피를 완료하면 다음을 할 수 있습니다:

merge() 로 SQL 스타일 JOIN
concat() 으로 데이터 연결
join() 으로 인덱스 기반 결합
indicator 옵션으로 결합 결과 추적
복합 키로 데이터 병합

0. 사전 준비 (Setup)

실습을 위해 Cookbook 샘플 데이터를 로드합니다.


import pandas as pd
import numpy as np
 
# 데이터 로드
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
print(f"✅ 데이터 로드 완료!")
print(f"   - orders: {len(orders):,}행")
print(f"   - order_items: {len(order_items):,}행")
print(f"   - products: {len(products):,}행")
print(f"   - users: {len(users):,}행")

실행 결과

✅ 데이터 로드 완료!
 - orders: 100,000행
 - order_items: 150,000행
 - products: 1,000행
 - users: 10,000행

1. merge() 기본

이론

merge()는 SQL의 JOIN과 동일한 기능으로, 공통 컬럼을 기준으로 두 DataFrame을 결합합니다.

merge 유형:

유형	SQL 대응	설명
`inner`	INNER JOIN	양쪽 모두 있는 행만
`left`	LEFT JOIN	왼쪽 전체 + 매칭
`right`	RIGHT JOIN	오른쪽 전체 + 매칭
`outer`	FULL OUTER JOIN	양쪽 전체

예제 데이터 준비


import pandas as pd
 
# order_items와 products 샘플 (예시용 작은 데이터셋)
sample_order_items = order_items.head(5)[['order_item_id', 'order_id', 'product_id', 'sale_price']]
sample_products = products.head(4)[['product_id', 'name', 'brand', 'category']]
 
print("주문 상품 데이터:")
print(sample_order_items)
print("\n상품 데이터:")
print(sample_products)

실행 결과

주문 상품 데이터:
 order_item_id  order_id  product_id  sale_price
0              1         1         234       89.99
1              2         1         567      145.50
2              3         2         234       89.99
3              4         2         891       67.00
4              5         3         123      199.99

상품 데이터:
 product_id          name         brand    category
0         234    Air Max 90          Nike       Shoes
1         567  Ultra Boost        Adidas       Shoes
2         891    Classic T        Reebok    T-Shirts
3         123       Suede         Puma       Shoes

Inner Join (교집합)


# inner join: 양쪽 모두 있는 데이터만
inner = pd.merge(sample_order_items, sample_products, on='product_id', how='inner')
print(f"inner join 결과: {len(inner)}행")
print(inner)

실행 결과

inner join 결과: 4행
 order_item_id  order_id  product_id  sale_price          name   brand  category
0              1         1         234       89.99    Air Max 90    Nike     Shoes
1              3         2         234       89.99    Air Max 90    Nike     Shoes
2              2         1         567      145.50  Ultra Boost  Adidas     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts

결과: product_id 123은 sample_products에 있지만 order_item_id 5의 product_id와 매칭

Left Join (왼쪽 기준)


# left join: 왼쪽(order_items) 전체 + 매칭되는 오른쪽
left = pd.merge(sample_order_items, sample_products, on='product_id', how='left')
print(f"left join 결과: {len(left)}행")
print(left)

실행 결과

left join 결과: 5행
 order_item_id  order_id  product_id  sale_price          name   brand  category
0              1         1         234       89.99    Air Max 90    Nike     Shoes
1              2         1         567      145.50  Ultra Boost  Adidas     Shoes
2              3         2         234       89.99    Air Max 90    Nike     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
4              5         3         123      199.99        Suede    Puma     Shoes

결과: 모든 주문 상품이 유지되고, 상품 정보가 추가됨

Right Join (오른쪽 기준)


# right join: 오른쪽(products) 전체 + 매칭되는 왼쪽
right = pd.merge(sample_order_items, sample_products, on='product_id', how='right')
print(f"right join 결과: {len(right)}행")
print(right)

실행 결과

right join 결과: 5행
 order_item_id  order_id  product_id  sale_price          name   brand  category
0            1.0       1.0         234       89.99    Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99    Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            NaN       NaN         123         NaN        Suede    Puma     Shoes

Outer Join (합집합)


# outer join: 양쪽 모두
outer = pd.merge(sample_order_items, sample_products, on='product_id', how='outer')
print(f"outer join 결과: {len(outer)}행")
print(outer)

실행 결과

outer join 결과: 5행
 order_item_id  order_id  product_id  sale_price          name   brand  category
0            1.0       1.0         234       89.99    Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99    Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            5.0       3.0         123      199.99        Suede    Puma     Shoes

2. 고급 merge 옵션

indicator 옵션

indicator=True로 어느 쪽 데이터인지 추적할 수 있습니다.


# 일부 사용자가 주문을 안 했을 수 있음을 확인
sample_orders = orders.head(100)
sample_users = users.head(80)  # 일부 사용자만
 
# indicator로 결합 출처 확인
merged = pd.merge(sample_orders, sample_users, on='user_id', how='outer', indicator=True)
print("merge 출처:")
print(merged['_merge'].value_counts())
 
# left_only: 왼쪽에만 있음 (주문은 있지만 사용자 정보 없음)
# right_only: 오른쪽에만 있음 (사용자는 있지만 주문 없음)
# both: 양쪽 모두 있음
 
print("\n오른쪽에만 있는 데이터 (주문 없는 사용자):")
print(merged[merged['_merge'] == 'right_only'][['user_id', 'first_name', 'last_name', '_merge']].head())

실행 결과

merge 출처:
_merge
both          72
right_only    15
left_only     28
Name: count, dtype: int64

오른쪽에만 있는 데이터 (주문 없는 사용자):
  user_id first_name last_name      _merge
72      156       민수        김  right_only
73      203       지영        박  right_only
74      287       철수        이  right_only
75      312       영희        최  right_only
76      345       현우        정  right_only

다른 컬럼명으로 결합


# orders와 users는 user_id로 결합
# 컬럼명이 같으므로 on 사용
result = pd.merge(
    orders[['order_id', 'user_id', 'created_at']].head(5),
    users[['user_id', 'first_name', 'last_name', 'city']].head(100),
    on='user_id',
    how='left'
)
print(result)

실행 결과

   order_id  user_id          created_at first_name last_name   city
0         1       42 2023-01-15 10:23:45        철수        김   서울
1         2      156 2023-01-15 11:45:12        민수        박   부산
2         3       42 2023-01-15 14:30:00        철수        김   서울
3         4      287 2023-01-15 16:20:33        영희        이   대구
4         5       78 2023-01-15 18:10:22        지영        최   인천

복수 키로 결합


# order_items에 products 정보를 결합한 후 브랜드+카테고리별 통계
df = pd.merge(order_items, products, on='product_id', how='left')
 
# 브랜드+카테고리별 평균 가격
brand_cat_stats = df.groupby(['brand', 'category']).agg(
    avg_price=('sale_price', 'mean')
).reset_index()
 
# 원본 데이터에 브랜드+카테고리 평균 가격 추가
sample = df[['order_item_id', 'brand', 'category', 'sale_price']].head(10)
result = pd.merge(
    sample,
    brand_cat_stats,
    on=['brand', 'category'],
    how='left'
)
 
print("복수 키 merge 결과:")
print(result.round(2))

실행 결과

복수 키 merge 결과:
 order_item_id   brand    category  sale_price  avg_price
0              1    Nike       Shoes       89.99      95.67
1              2  Adidas       Shoes      145.50      98.23
2              3    Nike       Shoes       89.99      95.67
3              4  Reebok    T-Shirts       67.00      45.12
4              5    Puma       Shoes      199.99      87.34
5              6    Nike    T-Shirts       45.00      42.89
6              7  Adidas       Pants       78.50      72.45
7              8    Vans       Shoes       65.00      68.90
8              9    Fila  Accessories       35.00      38.56
9             10    Nike     Jackets      125.00     118.23

suffixes 옵션


# 동일한 컬럼명이 있을 때 접미사 지정
# orders와 order_items 모두 created_at이 있을 수 있음
orders_sample = orders[['order_id', 'created_at']].head(3)
order_items_sample = order_items[['order_id', 'product_id', 'sale_price']].head(5)
 
result = pd.merge(orders_sample, order_items_sample, on='order_id', suffixes=['_order', '_item'])
print(result)

실행 결과

   order_id          created_at  product_id  sale_price
0         1 2023-01-15 10:23:45         234       89.99
1         1 2023-01-15 10:23:45         567      145.50
2         2 2023-01-15 11:45:12         234       89.99
3         2 2023-01-15 11:45:12         891       67.00
4         3 2023-01-15 14:30:00         123      199.99

3. concat() 데이터 연결

이론

concat()은 여러 DataFrame을 수직(행) 또는 수평(열) 방향으로 연결합니다.

수직 연결 (행 추가)


# 데이터 분할
df1 = orders.head(100)
df2 = orders.tail(100)
 
# 수직 연결
vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
print(f"수직 연결: {len(vertical)}행")
 
# ignore_index=True: 인덱스 재설정

실행 결과

수직 연결: 200행

수평 연결 (열 추가)


# order_items와 products를 merge한 데이터에서
df = pd.merge(order_items, products, on='product_id', how='left')
 
# 서로 다른 컬럼 데이터
df_info = df[['order_item_id', 'order_id']].head(5)
df_product = df[['brand', 'category']].head(5)
 
# 수평 연결
horizontal = pd.concat([df_info, df_product], axis=1)
print("수평 연결:")
print(horizontal)

실행 결과

수평 연결:
 order_item_id  order_id   brand    category
0              1         1    Nike       Shoes
1              2         1  Adidas       Shoes
2              3         2    Nike       Shoes
3              4         2  Reebok    T-Shirts
4              5         3    Puma       Shoes

여러 DataFrame 한번에 연결


# 여러 기간의 주문 데이터를 하나로
df1 = orders.head(50)
df2 = orders.iloc[50:100]
df3 = orders.iloc[100:150]
 
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(f"3개 DataFrame 연결: {len(combined)}행")

실행 결과

3개 DataFrame 연결: 150행

퀴즈 1: 기본 merge

문제

다음 두 테이블을 결합하세요:

브랜드별 통계: 총 판매건수, 평균 가격
브랜드 등급 테이블: 상위 5개 브랜드는 ‘Premium’, 나머지는 ‘Standard’

left join으로 모든 브랜드 통계에 등급을 추가하고, 등급이 없는 브랜드는 ‘Standard’로 채우세요.

정답 보기


# order_items와 products 결합
df = pd.merge(order_items, products, on='product_id', how='left')
 
# 1. 브랜드별 통계
brand_stats = df.groupby('brand').agg(
    total_sales=('order_item_id', 'count'),
    avg_price=('sale_price', 'mean')
).round(2).reset_index()
 
# 2. 브랜드 등급 (상위 5개만 Premium)
top_5_brands = brand_stats.nlargest(5, 'total_sales')['brand'].tolist()
brand_tier = pd.DataFrame({
    'brand': top_5_brands,
    'tier': ['Premium'] * 5
})
 
# 3. left join
result = pd.merge(brand_stats, brand_tier, on='brand', how='left')
 
# 4. NaN을 'Standard'로 채우기
result['tier'] = result['tier'].fillna('Standard')
 
print("브랜드별 통계 + 등급:")
print(result)
print(f"\n등급별 브랜드 수:")
print(result['tier'].value_counts())

실행 결과

브랜드별 통계 + 등급:
        brand  total_sales  avg_price      tier
0        Adidas        16234      97.45   Premium
1         Asics        14645      96.78  Standard
2      Converse        14912      95.34  Standard
3          Fila        14789      94.56  Standard
4   New Balance        15123      98.23   Premium
5          Nike        16067      99.12   Premium
6          Puma        15189      97.89   Premium
7        Reebok        14987      96.45  Standard
8  Under Armour        15098      98.67   Premium
9          Vans        14956      95.12  Standard

등급별 브랜드 수:
tier
Standard    5
Premium     5
Name: count, dtype: int64

퀴즈 2: 복합 merge

문제

원본 데이터에 다음 두 가지 통계를 merge하세요:

브랜드별 평균 가격
카테고리별 평균 가격

그 다음 새 컬럼을 추가하세요:

vs_brand_avg: 현재 가격 / 브랜드 평균 가격
vs_cat_avg: 현재 가격 / 카테고리 평균 가격

두 비율이 모두 1.5 이상인 제품(프리미엄 제품)만 필터링하세요.

정답 보기


# order_items와 products 결합
df = pd.merge(order_items, products, on='product_id', how='left')
 
# 1. 브랜드별 통계
brand_stats = df.groupby('brand').agg(
    brand_avg_price=('sale_price', 'mean')
).reset_index()
 
# 2. 카테고리별 통계
cat_stats = df.groupby('category').agg(
    cat_avg_price=('sale_price', 'mean')
).reset_index()
 
# 3. 원본에 merge
result = df.copy()
result = pd.merge(result, brand_stats, on='brand', how='left')
result = pd.merge(result, cat_stats, on='category', how='left')
 
# 4. 비교 컬럼 추가
result['vs_brand_avg'] = result['sale_price'] / result['brand_avg_price']
result['vs_cat_avg'] = result['sale_price'] / result['cat_avg_price']
 
# 5. 프리미엄 제품 필터링
premium = result[
    (result['vs_brand_avg'] >= 1.5) &
    (result['vs_cat_avg'] >= 1.5)
]
 
print(f"프리미엄 제품: {len(premium):,}개")
print("\n결과:")
display_cols = ['brand', 'category', 'sale_price', 'brand_avg_price', 'cat_avg_price', 'vs_brand_avg', 'vs_cat_avg']
print(premium[display_cols].head(10).round(2))

실행 결과

프리미엄 제품: 4,523개

결과:
        brand    category  sale_price  brand_avg_price  cat_avg_price  vs_brand_avg  vs_cat_avg
19         Nike       Shoes      245.67            99.12          95.67          2.48        2.57
38       Adidas       Shoes      234.56            97.45          95.67          2.41        2.45
50         Puma    T-Shirts      189.34            97.89          45.12          1.93        4.20
68   New Balance      Shoes      267.89            98.23          95.67          2.73        2.80
74        Asics       Shoes      198.45            96.78          95.67          2.05        2.07
77       Reebok    T-Shirts      189.23            96.45          45.12          1.96        4.19
80         Vans       Shoes      178.90            95.12          95.67          1.88        1.87
84         Fila  Accessories      167.89            94.56          38.56          1.78        4.35
109    Converse       Shoes      156.78            95.34          95.67          1.64        1.64
155  Under Armour   Jackets      234.56            98.67         118.23          2.38        1.98

퀴즈 3: indicator 활용

문제

두 데이터셋의 매칭 상태를 분석하세요:

주문 데이터와 사용자 데이터를 outer join
indicator=True로 매칭 상태 확인
다음 통계를 출력하세요:
- 양쪽 모두 있는 데이터 (both)
- 주문만 있고 사용자 정보 없는 데이터 (left_only)
- 사용자만 있고 주문 없는 데이터 (right_only)

정답 보기


# 샘플 데이터로 테스트 (전체 데이터는 너무 클 수 있음)
sample_orders = orders.head(1000)
sample_users = users.head(500)
 
# outer join with indicator
merged = pd.merge(
    sample_orders, sample_users,
    on='user_id',
    how='outer',
    indicator=True
)
 
# 매칭 상태 분석
print("=== 매칭 상태 분석 ===")
status_counts = merged['_merge'].value_counts()
print(status_counts)
 
print(f"\n양쪽 모두 있는 데이터: {status_counts.get('both', 0)}건")
print(f"주문만 있는 데이터: {status_counts.get('left_only', 0)}건")
print(f"사용자만 있는 데이터: {status_counts.get('right_only', 0)}건")
 
# 각 그룹 상세
print("\n--- 주문만 있는 데이터 (사용자 정보 없음) ---")
left_only = merged[merged['_merge'] == 'left_only']
print(left_only[['order_id', 'user_id', 'created_at']].head())
 
print("\n--- 사용자만 있는 데이터 (주문 없음) ---")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['user_id', 'first_name', 'last_name', 'city']].head())

실행 결과

=== 매칭 상태 분석 ===
_merge
both          678
right_only    245
left_only     322
Name: count, dtype: int64

양쪽 모두 있는 데이터: 678건
주문만 있는 데이터: 322건
사용자만 있는 데이터: 245건

--- 주문만 있는 데이터 (사용자 정보 없음) ---
   order_id  user_id          created_at
678        45      512 2023-01-18 09:15:33
679        89      623 2023-01-20 14:22:11
680       123      578 2023-01-22 16:45:00
681       156      601 2023-01-24 11:30:45
682       178      534 2023-01-25 08:55:12

--- 사용자만 있는 데이터 (주문 없음) ---
    user_id first_name last_name   city
1000      156       민수        김   서울
1001      203       지영        박   부산
1002      287       철수        이   대구
1003      312       영희        최   인천
1004      345       현우        정   광주

4. join() 인덱스 기반 결합

이론

join()은 인덱스를 기준으로 결합합니다. merge보다 간결하지만 인덱스 설정이 필요합니다.


# order_items와 products 결합
df = pd.merge(order_items, products, on='product_id', how='left')
 
# 인덱스 기반 join
df1 = df[['brand', 'sale_price']].head(10).set_index('brand')
df2 = df[['brand', 'retail_price']].head(10).set_index('brand')
 
# join (같은 인덱스끼리 결합)
result = df1.join(df2, lsuffix='_sale', rsuffix='_retail')
print(result.head())

실행 결과

        sale_price_sale  retail_price_retail
brand
Nike              89.99                156.78
Adidas           145.50                178.90
Nike              89.99                156.78
Reebok            67.00                123.45
Puma             199.99                234.56

정리

merge vs concat vs join

메서드	용도	SQL 대응
`merge()`	컬럼 기준 결합	JOIN
`concat()`	단순 연결	UNION
`join()`	인덱스 기준 결합	JOIN

SQL ↔ Pandas 비교

SQL	Pandas
`INNER JOIN`	`pd.merge(df1, df2, how='inner')`
`LEFT JOIN`	`pd.merge(df1, df2, how='left')`
`RIGHT JOIN`	`pd.merge(df1, df2, how='right')`
`FULL OUTER JOIN`	`pd.merge(df1, df2, how='outer')`
`UNION ALL`	`pd.concat([df1, df2])`
`UNION`	`pd.concat([df1, df2]).drop_duplicates()`

merge() 주요 파라미터

파라미터	설명	예시
`on`	결합 키 (동일 컬럼명)	`on='user_id'`
`left_on` / `right_on`	다른 컬럼명	`left_on='id', right_on='user_id'`
`how`	결합 유형	`'inner'`, `'left'`, `'right'`, `'outer'`
`indicator`	출처 컬럼 추가	`indicator=True`
`suffixes`	중복 컬럼 접미사	`suffixes=['_left', '_right']`

다음 단계

데이터 병합을 마스터했습니다! 다음으로 날짜/시간 처리에서 시계열 데이터 분석 기법을 배워보세요.