
Data Merging and Joining

Intermediate · Advanced

Learning Objectives

After completing this recipe, you will be able to:

  • Perform SQL-style JOIN with merge()
  • Concatenate data with concat()
  • Join by index with join()
  • Track join results with indicator option
  • Merge data using composite keys

0. Setup

Load Cookbook sample data for hands-on practice.

import pandas as pd
import numpy as np

# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')

print("Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")
Output
Data loaded!
 - orders: 100,000 rows
 - order_items: 150,000 rows
 - products: 1,000 rows
 - users: 10,000 rows

1. merge() Basics

Theory

merge() is equivalent to SQL JOIN, combining two DataFrames based on common columns.

merge Types:

Type  | SQL Equivalent  | Description
inner | INNER JOIN      | Only rows in both
left  | LEFT JOIN       | All left + matching
right | RIGHT JOIN      | All right + matching
outer | FULL OUTER JOIN | All from both
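
Before moving to the Cookbook data, the minimal sketch below (toy frames, not part of the Cookbook dataset) shows how each how= value changes the number of result rows:

# Toy frames with partially overlapping keys (illustration only)
left_df = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right_df = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left_df, right_df, on='key', how=how)
    print(f"{how:>5}: {len(merged)} rows, keys = {sorted(merged['key'].tolist())}")

# inner: 2 rows (keys 2, 3)   left: 3 rows   right: 3 rows   outer: 4 rows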

Preparing Example Data

import pandas as pd

# Sample order_items and products (small dataset for examples)
sample_order_items = order_items.head(5)[['order_item_id', 'order_id', 'product_id', 'sale_price']]
sample_products = products.head(4)[['product_id', 'name', 'brand', 'category']]

print("Order items data:")
print(sample_order_items)
print("\nProducts data:")
print(sample_products)
Output
Order items data:
 order_item_id  order_id  product_id  sale_price
0              1         1         234       89.99
1              2         1         567      145.50
2              3         2         234       89.99
3              4         2         891       67.00
4              5         3         123      199.99

Products data:
 product_id          name         brand    category
0         234    Air Max 90          Nike       Shoes
1         567  Ultra Boost        Adidas       Shoes
2         891    Classic T        Reebok    T-Shirts
3         123       Suede         Puma       Shoes

Inner Join (Intersection)

# inner join: keep only rows whose key exists in both frames
inner = pd.merge(sample_order_items, sample_products, on='product_id', how='inner')
print(f"inner join result: {len(inner)} rows")
print(inner)
Output
inner join result: 4 rows
 order_item_id  order_id  product_id  sale_price          name   brand  category
0              1         1         234       89.99    Air Max 90    Nike     Shoes
1              3         2         234       89.99    Air Max 90    Nike     Shoes
2              2         1         567      145.50  Ultra Boost  Adidas     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts

Result: only order items whose product_id also exists in sample_products are kept, so one of the five order items is dropped

Left Join (Left-based)

# left join: all rows from the left (order_items) + matching columns from the right
left = pd.merge(sample_order_items, sample_products, on='product_id', how='left')
print(f"left join result: {len(left)} rows")
print(left)
Output
left join result: 5 rows
 order_item_id  order_id  product_id  sale_price          name   brand  category
0              1         1         234       89.99    Air Max 90    Nike     Shoes
1              2         1         567      145.50  Ultra Boost  Adidas     Shoes
2              3         2         234       89.99    Air Max 90    Nike     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
4              5         3         123      199.99        Suede    Puma     Shoes

Result: all order items are kept, and matching product info is added (order items without a matching product would get NaN in the product columns)

Right Join (Right-based)

# right join: all rows from the right (products) + matching columns from the left
right = pd.merge(sample_order_items, sample_products, on='product_id', how='right')
print(f"right join result: {len(right)} rows")
print(right)
Output
right join result: 5 rows
 order_item_id  order_id  product_id  sale_price          name   brand  category
0            1.0       1.0         234       89.99    Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99    Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            NaN       NaN         123         NaN        Suede    Puma     Shoes

Outer Join (Union)

# outer join: all rows from both frames
outer = pd.merge(sample_order_items, sample_products, on='product_id', how='outer')
print(f"outer join result: {len(outer)} rows")
print(outer)
Output
outer join result: 5 rows
 order_item_id  order_id  product_id  sale_price          name   brand  category
0            1.0       1.0         234       89.99    Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99    Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            5.0       3.0         123      199.99        Suede    Puma     Shoes

2. Advanced merge Options

indicator Option

With indicator=True, you can track which side each row came from.

# Some users may not have placed orders
sample_orders = orders.head(100)
sample_users = users.head(80)   # Only some users

# Check the join source with indicator
merged = pd.merge(sample_orders, sample_users, on='user_id', how='outer', indicator=True)
print("merge source:")
print(merged['_merge'].value_counts())

# left_only:  only in the left frame (order exists but no user info)
# right_only: only in the right frame (user exists but no orders)
# both:       present in both frames
print("\nData only in right (users with no orders):")
print(merged[merged['_merge'] == 'right_only'][['user_id', 'first_name', 'last_name', '_merge']].head())
Output
merge source:
_merge
both          72
right_only    15
left_only     28
Name: count, dtype: int64

Data only in right (users with no orders):
  user_id first_name last_name      _merge
72      156      Mike       Kim  right_only
73      203     Sarah      Park  right_only
74      287      John       Lee  right_only
75      312     Emily      Choi  right_only
76      345    Daniel     Jung  right_only
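
As a side note, indicator also accepts a string instead of True; the merge source column is then created under that name rather than _merge. A small sketch reusing the frames above (the column name 'source' is just an illustrative choice):

# Same outer join, but name the indicator column 'source' instead of '_merge'
merged_named = pd.merge(sample_orders, sample_users, on='user_id',
                        how='outer', indicator='source')
print(merged_named['source'].value_counts())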

Joining with Different Column Names

# orders and users both store the key as user_id,
# so on= is enough here (see the left_on/right_on sketch below for differently named keys)
result = pd.merge(
    orders[['order_id', 'user_id', 'created_at']].head(5),
    users[['user_id', 'first_name', 'last_name', 'city']].head(100),
    on='user_id',
    how='left'
)
print(result)
Output
   order_id  user_id          created_at first_name last_name   city
0         1       42 2023-01-15 10:23:45       John       Kim   Seoul
1         2      156 2023-01-15 11:45:12       Mike      Park   Busan
2         3       42 2023-01-15 14:30:00       John       Kim   Seoul
3         4      287 2023-01-15 16:20:33      Emily       Lee   Daegu
4         5       78 2023-01-15 18:10:22      Sarah      Choi   Incheon
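
When the key columns do have different names on each side, pass left_on/right_on instead of on. The sketch below fakes that situation by renaming user_id to customer_id on the orders side (a hypothetical rename for illustration):

# Hypothetical: pretend the left frame stores the key as 'customer_id'
orders_renamed = orders[['order_id', 'user_id', 'created_at']].head(5).rename(
    columns={'user_id': 'customer_id'}
)

result = pd.merge(
    orders_renamed,
    users[['user_id', 'first_name', 'last_name', 'city']],
    left_on='customer_id',   # key column name in the left frame
    right_on='user_id',      # key column name in the right frame
    how='left'
)

# Both key columns are kept in the result; drop one if you only need a single key
print(result.drop(columns='user_id'))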

Joining with Multiple Keys

# After merging order_items with products, compute brand+category statistics
df = pd.merge(order_items, products, on='product_id', how='left')

# Average price by brand + category
brand_cat_stats = df.groupby(['brand', 'category']).agg(
    avg_price=('sale_price', 'mean')
).reset_index()

# Add the brand+category average price back to the original data
sample = df[['order_item_id', 'brand', 'category', 'sale_price']].head(10)
result = pd.merge(
    sample,
    brand_cat_stats,
    on=['brand', 'category'],
    how='left'
)
print("Multiple key merge result:")
print(result.round(2))
Output
Multiple key merge result:
 order_item_id   brand    category  sale_price  avg_price
0              1    Nike       Shoes       89.99      95.67
1              2  Adidas       Shoes      145.50      98.23
2              3    Nike       Shoes       89.99      95.67
3              4  Reebok    T-Shirts       67.00      45.12
4              5    Puma       Shoes      199.99      87.34
5              6    Nike    T-Shirts       45.00      42.89
6              7  Adidas       Pants       78.50      72.45
7              8    Vans       Shoes       65.00      68.90
8              9    Fila  Accessories       35.00      38.56
9             10    Nike     Jackets      125.00     118.23

suffixes Option

# Specify suffixes when both frames have non-key columns with the same name
# orders and order_items may both have created_at
orders_sample = orders[['order_id', 'created_at']].head(3)
order_items_sample = order_items[['order_id', 'product_id', 'sale_price']].head(5)

result = pd.merge(orders_sample, order_items_sample, on='order_id', suffixes=['_order', '_item'])
print(result)
Output
   order_id          created_at  product_id  sale_price
0         1 2023-01-15 10:23:45         234       89.99
1         1 2023-01-15 10:23:45         567      145.50
2         2 2023-01-15 11:45:12         234       89.99
3         2 2023-01-15 11:45:12         891       67.00
4         3 2023-01-15 14:30:00         123      199.99
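
In the result above no suffix is visible because, apart from the join key, the two selections share no column name. The sketch below keeps created_at on both sides (assuming order_items also carries a created_at column, as the comment suggests) so the suffixes actually appear:

# Assumes order_items also has a created_at column; adjust the selection if it does not
left_part = orders[['order_id', 'created_at']].head(3)
right_part = order_items[['order_id', 'created_at', 'sale_price']].head(5)

with_suffixes = pd.merge(left_part, right_part, on='order_id',
                         suffixes=('_order', '_item'))

# Overlapping non-key columns are renamed to created_at_order / created_at_item
print(with_suffixes.columns.tolist())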

3. concat() Data Concatenation

Theory

concat() concatenates multiple DataFrames vertically (rows) or horizontally (columns).

Vertical Concatenation (Adding Rows)

# Split the data
df1 = orders.head(100)
df2 = orders.tail(100)

# Vertical concatenation
vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
print(f"Vertical concatenation: {len(vertical)} rows")

# ignore_index=True: reset the index instead of keeping the original labels
Output
Vertical concatenation: 200 rows

Horizontal Concatenation (Adding Columns)

# Start from order_items merged with products
df = pd.merge(order_items, products, on='product_id', how='left')

# Two frames with different columns
df_info = df[['order_item_id', 'order_id']].head(5)
df_product = df[['brand', 'category']].head(5)

# Horizontal concatenation
horizontal = pd.concat([df_info, df_product], axis=1)
print("Horizontal concatenation:")
print(horizontal)
Output
Horizontal concatenation:
 order_item_id  order_id   brand    category
0              1         1    Nike       Shoes
1              2         1  Adidas       Shoes
2              3         2    Nike       Shoes
3              4         2  Reebok    T-Shirts
4              5         3    Puma       Shoes
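
One caveat: pd.concat(..., axis=1) pairs rows by index label, not by position. If the pieces carry different index labels (for example after slicing or filtering), the result contains NaN instead of neatly paired columns; resetting the index first restores positional pairing. A short sketch using the merged df from above:

# Pieces with different index labels
a = df[['order_item_id']].head(3)   # index labels 0, 1, 2
b = df[['brand']].iloc[2:5]         # index labels 2, 3, 4

# Aligned by label: union of labels 0..4, NaN where a piece has no such label
print(pd.concat([a, b], axis=1))

# Paired by position after resetting the index
print(pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1))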

Concatenating Multiple DataFrames at Once

# Combine order data from multiple periods into one frame
df1 = orders.head(50)
df2 = orders.iloc[50:100]
df3 = orders.iloc[100:150]

combined = pd.concat([df1, df2, df3], ignore_index=True)
print(f"3 DataFrames concatenated: {len(combined)} rows")
Output
3 DataFrames concatenated: 150 rows
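
The same pattern scales to any number of frames collected in a list, which is the usual way to combine many period files. A sketch under the assumption that monthly CSVs exist under DATA_PATH (the file names below are hypothetical):

# Hypothetical monthly files; collect frames in a list, concatenate once at the end
monthly_files = ['orders_2023_01.csv', 'orders_2023_02.csv', 'orders_2023_03.csv']
frames = [pd.read_csv(DATA_PATH + name) for name in monthly_files]
all_orders = pd.concat(frames, ignore_index=True)
print(f"Combined: {len(all_orders):,} rows")

Concatenating once at the end is also faster than appending inside the loop, since each concat copies the data.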

Quiz 1: Basic merge

Problem

Combine the following two tables:

  1. Brand statistics: total sales count, average price
  2. Brand tier table: top 5 brands are ‘Premium’, rest are ‘Standard’

Use left join to add tier to all brand statistics, and fill brands without tier as ‘Standard’.

View Answer

# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')

# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
    total_sales=('order_item_id', 'count'),
    avg_price=('sale_price', 'mean')
).round(2).reset_index()

# 2. Brand tier (only the top 5 are Premium)
top_5_brands = brand_stats.nlargest(5, 'total_sales')['brand'].tolist()
brand_tier = pd.DataFrame({
    'brand': top_5_brands,
    'tier': ['Premium'] * 5
})

# 3. left join
result = pd.merge(brand_stats, brand_tier, on='brand', how='left')

# 4. Fill NaN with 'Standard'
result['tier'] = result['tier'].fillna('Standard')

print("Brand statistics + tier:")
print(result)
print("\nBrand count by tier:")
print(result['tier'].value_counts())
Output
Brand statistics + tier:
        brand  total_sales  avg_price      tier
0        Adidas        16234      97.45   Premium
1         Asics        14645      96.78  Standard
2      Converse        14912      95.34  Standard
3          Fila        14789      94.56  Standard
4   New Balance        15123      98.23   Premium
5          Nike        16067      99.12   Premium
6          Puma        15189      97.89   Premium
7        Reebok        14987      96.45  Standard
8  Under Armour        15098      98.67   Premium
9          Vans        14956      95.12  Standard

Brand count by tier:
tier
Standard    5
Premium     5
Name: count, dtype: int64

Quiz 2: Complex merge

Problem

Merge the following two statistics into the original data:

  1. Average price by brand
  2. Average price by category

Then add new columns:

  • vs_brand_avg: current price / brand average price
  • vs_cat_avg: current price / category average price

Filter only products where both ratios are 1.5 or above (premium products).

View Answer

# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')

# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
    brand_avg_price=('sale_price', 'mean')
).reset_index()

# 2. Category statistics
cat_stats = df.groupby('category').agg(
    cat_avg_price=('sale_price', 'mean')
).reset_index()

# 3. Merge both back onto the original data
result = df.copy()
result = pd.merge(result, brand_stats, on='brand', how='left')
result = pd.merge(result, cat_stats, on='category', how='left')

# 4. Add comparison columns
result['vs_brand_avg'] = result['sale_price'] / result['brand_avg_price']
result['vs_cat_avg'] = result['sale_price'] / result['cat_avg_price']

# 5. Filter premium products
premium = result[
    (result['vs_brand_avg'] >= 1.5) &
    (result['vs_cat_avg'] >= 1.5)
]

print(f"Premium products: {len(premium):,}")
print("\nResult:")
display_cols = ['brand', 'category', 'sale_price', 'brand_avg_price',
                'cat_avg_price', 'vs_brand_avg', 'vs_cat_avg']
print(premium[display_cols].head(10).round(2))
Output
Premium products: 4,523

Result:
        brand    category  sale_price  brand_avg_price  cat_avg_price  vs_brand_avg  vs_cat_avg
19         Nike       Shoes      245.67            99.12          95.67          2.48        2.57
38       Adidas       Shoes      234.56            97.45          95.67          2.41        2.45
50         Puma    T-Shirts      189.34            97.89          45.12          1.93        4.20
68   New Balance      Shoes      267.89            98.23          95.67          2.73        2.80
74        Asics       Shoes      198.45            96.78          95.67          2.05        2.07
77       Reebok    T-Shirts      189.23            96.45          45.12          1.96        4.19
80         Vans       Shoes      178.90            95.12          95.67          1.88        1.87
84         Fila  Accessories      167.89            94.56          38.56          1.78        4.35
109    Converse       Shoes      156.78            95.34          95.67          1.64        1.64
155  Under Armour   Jackets      234.56            98.67         118.23          2.38        1.98

Quiz 3: Using indicator

Problem

Analyze the matching status of two datasets:

  1. Outer join orders data and users data
  2. Check matching status with indicator=True
  3. Output the following statistics:
    • Data in both (both)
    • Data with orders but no user info (left_only)
    • Data with users but no orders (right_only)

View Answer

# Test with sample data (the full data may be too large)
sample_orders = orders.head(1000)
sample_users = users.head(500)

# outer join with indicator
merged = pd.merge(
    sample_orders,
    sample_users,
    on='user_id',
    how='outer',
    indicator=True
)

# Analyze the matching status
print("=== Matching Status Analysis ===")
status_counts = merged['_merge'].value_counts()
print(status_counts)

print(f"\nData in both: {status_counts.get('both', 0)}")
print(f"Orders only (no user info): {status_counts.get('left_only', 0)}")
print(f"Users only (no orders): {status_counts.get('right_only', 0)}")

# Details for each group
print("\n--- Data with orders only (no user info) ---")
left_only = merged[merged['_merge'] == 'left_only']
print(left_only[['order_id', 'user_id', 'created_at']].head())

print("\n--- Data with users only (no orders) ---")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['user_id', 'first_name', 'last_name', 'city']].head())
Output
=== Matching Status Analysis ===
_merge
both          678
right_only    245
left_only     322
Name: count, dtype: int64

Data in both: 678
Orders only (no user info): 322
Users only (no orders): 245

--- Data with orders only (no user info) ---
   order_id  user_id          created_at
678        45      512 2023-01-18 09:15:33
679        89      623 2023-01-20 14:22:11
680       123      578 2023-01-22 16:45:00
681       156      601 2023-01-24 11:30:45
682       178      534 2023-01-25 08:55:12

--- Data with users only (no orders) ---
    user_id first_name last_name   city
1000      156       Mike       Kim   Seoul
1001      203      Sarah      Park   Busan
1002      287       John       Lee   Daegu
1003      312      Emily      Choi   Incheon
1004      345     Daniel     Jung   Gwangju

4. join() Index-based Join

Theory

join() combines DataFrames based on their index. It is more concise than merge(), but requires setting the index first.

# Start from order_items merged with products
df = pd.merge(order_items, products, on='product_id', how='left')

# Prepare two frames indexed by the same key
df1 = df[['brand', 'sale_price']].head(10).set_index('brand')
df2 = df[['brand', 'retail_price']].head(10).set_index('brand')

# join: combine on the shared index
result = df1.join(df2, lsuffix='_sale', rsuffix='_retail')
print(result.head())
Output
        sale_price_sale  retail_price_retail
brand
Nike              89.99                156.78
Adidas           145.50                178.90
Nike              89.99                156.78
Reebok            67.00                123.45
Puma             199.99                234.56
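
Note that the 'brand' index above is not unique (Nike appears more than once), so duplicated labels are matched against each other. join() is easiest to reason about on a unique index; a brief sketch joining two product-level frames on product_id (assuming one row per product in this dataset):

# Index-based join on a (presumably) unique key
prod_names = products.set_index('product_id')[['name']]
prod_details = products.set_index('product_id')[['brand', 'category']]

# join() defaults to a left join on the shared index
joined = prod_names.join(prod_details)
print(joined.head())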

Summary

merge vs concat vs join

Method   | Use Case             | SQL Equivalent
merge()  | Join by columns      | JOIN
concat() | Simple concatenation | UNION
join()   | Join by index        | JOIN

SQL to Pandas Comparison

SQL             | Pandas
INNER JOIN      | pd.merge(df1, df2, how='inner')
LEFT JOIN       | pd.merge(df1, df2, how='left')
RIGHT JOIN      | pd.merge(df1, df2, how='right')
FULL OUTER JOIN | pd.merge(df1, df2, how='outer')
UNION ALL       | pd.concat([df1, df2])
UNION           | pd.concat([df1, df2]).drop_duplicates()
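
As a quick illustration of the last two rows of this table, UNION ALL keeps duplicate rows while UNION removes them (toy frames for illustration only):

# Toy frames: 'Puma' appears on both sides
d1 = pd.DataFrame({'brand': ['Nike', 'Puma']})
d2 = pd.DataFrame({'brand': ['Puma', 'Vans']})

union_all = pd.concat([d1, d2], ignore_index=True)                # 4 rows, like UNION ALL
union = pd.concat([d1, d2], ignore_index=True).drop_duplicates()  # 3 rows, like UNION
print(len(union_all), len(union))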

merge() Key Parameters

Parameter          | Description                   | Example
on                 | Join key (same column name)   | on='user_id'
left_on / right_on | Different column names        | left_on='id', right_on='user_id'
how                | Join type                     | 'inner', 'left', 'right', 'outer'
indicator          | Add source column             | indicator=True
suffixes           | Suffix for duplicate columns  | suffixes=['_left', '_right']

Next Steps

You’ve mastered data merging! Next, learn time series data analysis techniques in Date/Time Handling.
