Data Merging and Joining
Learning Objectives
After completing this recipe, you will be able to:
- Perform SQL-style joins with merge()
- Concatenate data with concat()
- Join by index with join()
- Track join results with the indicator option
- Merge data using composite keys
0. Setup
Load Cookbook sample data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")Data loaded! - orders: 100,000 rows - order_items: 150,000 rows - products: 1,000 rows - users: 10,000 rows
1. merge() Basics
Theory
merge() is equivalent to SQL JOIN, combining two DataFrames based on common columns.
merge Types:
| Type | SQL Equivalent | Description |
|---|---|---|
| inner | INNER JOIN | Only rows in both |
| left | LEFT JOIN | All left + matching |
| right | RIGHT JOIN | All right + matching |
| outer | FULL OUTER JOIN | All from both |
Preparing Example Data
import pandas as pd
# Sample order_items and products (small dataset for examples)
sample_order_items = order_items.head(5)[['order_item_id', 'order_id', 'product_id', 'sale_price']]
sample_products = products.head(4)[['product_id', 'name', 'brand', 'category']]
print("Order items data:")
print(sample_order_items)
print("\nProducts data:")
print(sample_products)

Output:
Order items data:
   order_item_id  order_id  product_id  sale_price
0              1         1         234       89.99
1              2         1         567      145.50
2              3         2         234       89.99
3              4         2         891       67.00
4              5         3         123      199.99

Products data:
   product_id         name   brand  category
0         234   Air Max 90    Nike     Shoes
1         567  Ultra Boost  Adidas     Shoes
2         891    Classic T  Reebok  T-Shirts
3         123        Suede    Puma     Shoes
Inner Join (Intersection)
# inner join: only data present in both
inner = pd.merge(sample_order_items, sample_products, on='product_id', how='inner')
print(f"inner join result: {len(inner)} rows")
print(inner)

Output:
inner join result: 4 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0              1         1         234       89.99   Air Max 90    Nike     Shoes
1              3         2         234       89.99   Air Max 90    Nike     Shoes
2              2         1         567      145.50  Ultra Boost  Adidas     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
Result: only rows whose product_id appears in both tables are kept; rows without a match on either side are dropped.
Left Join (Keep All Left Rows)
# left join: all left (order_items) + matching right
left = pd.merge(sample_order_items, sample_products, on='product_id', how='left')
print(f"left join result: {len(left)} rows")
print(left)

Output:
left join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0              1         1         234       89.99   Air Max 90    Nike     Shoes
1              2         1         567      145.50  Ultra Boost  Adidas     Shoes
2              3         2         234       89.99   Air Max 90    Nike     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
4              5         3         123      199.99        Suede    Puma     Shoes
Result: all order items are kept, and product info is added where a matching product exists.
Right Join (Keep All Right Rows)
# right join: all right (products) + matching left
right = pd.merge(sample_order_items, sample_products, on='product_id', how='right')
print(f"right join result: {len(right)} rows")
print(right)

Output:
right join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0            1.0       1.0         234       89.99   Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99   Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            NaN       NaN         123         NaN        Suede    Puma     Shoes
Outer Join (Union)
# outer join: all from both
outer = pd.merge(sample_order_items, sample_products, on='product_id', how='outer')
print(f"outer join result: {len(outer)} rows")
print(outer)

Output:
outer join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0            1.0       1.0         234       89.99   Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99   Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            5.0       3.0         123      199.99        Suede    Puma     Shoes
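To see how the four join types differ at a glance, the row counts can be compared in a single loop over the how values; a minimal sketch reusing the same sample tables:

# Compare row counts for each join type on the same sample data
for how in ['inner', 'left', 'right', 'outer']:
    n = len(pd.merge(sample_order_items, sample_products, on='product_id', how=how))
    print(f"{how:>6}: {n} rows")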
2. Advanced merge Options
indicator Option
With indicator=True, you can track which side the data came from.
# Some users may not have placed orders
sample_orders = orders.head(100)
sample_users = users.head(80) # Only some users
# Check join source with indicator
merged = pd.merge(sample_orders, sample_users, on='user_id', how='outer', indicator=True)
print("merge source:")
print(merged['_merge'].value_counts())
# left_only: only in left (orders exist but no user info)
# right_only: only in right (users exist but no orders)
# both: in both
print("\nData only in right (users with no orders):")
print(merged[merged['_merge'] == 'right_only'][['user_id', 'first_name', 'last_name', '_merge']].head())

Output:
merge source:
_merge
both          72
right_only    15
left_only     28
Name: count, dtype: int64

Data only in right (users with no orders):
    user_id first_name last_name      _merge
72      156       Mike       Kim  right_only
73      203      Sarah      Park  right_only
74      287       John       Lee  right_only
75      312      Emily      Choi  right_only
76      345     Daniel      Jung  right_only
Joining with Different Column Names
# orders and users share the user_id column, so on='user_id' is enough
# (when the key columns have different names, use left_on/right_on -- see the sketch below)
result = pd.merge(
orders[['order_id', 'user_id', 'created_at']].head(5),
users[['user_id', 'first_name', 'last_name', 'city']].head(100),
on='user_id',
how='left'
)
print(result)

Output:
   order_id  user_id           created_at first_name last_name     city
0         1       42  2023-01-15 10:23:45       John       Kim    Seoul
1         2      156  2023-01-15 11:45:12       Mike      Park    Busan
2         3       42  2023-01-15 14:30:00       John       Kim    Seoul
3         4      287  2023-01-15 16:20:33      Emily       Lee    Daegu
4         5       78  2023-01-15 18:10:22      Sarah      Choi  Incheon
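When the key columns have different names on each side, left_on/right_on replaces on. A minimal sketch, assuming a hypothetical customers table whose key column is called id instead of user_id:

# Hypothetical: the same users data, with the key column renamed to 'id'
customers = users[['user_id', 'first_name', 'city']].rename(columns={'user_id': 'id'})
result = pd.merge(
    orders[['order_id', 'user_id']].head(5),
    customers,
    left_on='user_id',   # key column in the left DataFrame
    right_on='id',       # key column in the right DataFrame
    how='left'
)
# Both key columns are kept after the merge; drop the redundant one if not needed
result = result.drop(columns='id')
print(result)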
Joining with Multiple Keys
# After merging order_items with products, get brand+category statistics
df = pd.merge(order_items, products, on='product_id', how='left')
# Average price by brand+category
brand_cat_stats = df.groupby(['brand', 'category']).agg(
avg_price=('sale_price', 'mean')
).reset_index()
# Add brand+category average price to original data
sample = df[['order_item_id', 'brand', 'category', 'sale_price']].head(10)
result = pd.merge(
sample,
brand_cat_stats,
on=['brand', 'category'],
how='left'
)
print("Multiple key merge result:")
print(result.round(2))

Output:
Multiple key merge result:
   order_item_id   brand     category  sale_price  avg_price
0              1    Nike        Shoes       89.99      95.67
1              2  Adidas        Shoes      145.50      98.23
2              3    Nike        Shoes       89.99      95.67
3              4  Reebok     T-Shirts       67.00      45.12
4              5    Puma        Shoes      199.99      87.34
5              6    Nike     T-Shirts       45.00      42.89
6              7  Adidas        Pants       78.50      72.45
7              8    Vans        Shoes       65.00      68.90
8              9    Fila  Accessories       35.00      38.56
9             10    Nike      Jackets      125.00     118.23
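As a design note, when a group-level statistic only needs to be attached back onto the same DataFrame, groupby().transform() can achieve the same result without a separate merge; a minimal sketch of that alternative:

# Alternative to the merge above: broadcast the group mean directly onto each row
df['avg_price'] = df.groupby(['brand', 'category'])['sale_price'].transform('mean')
print(df[['order_item_id', 'brand', 'category', 'sale_price', 'avg_price']].head(10).round(2))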
suffixes Option
# suffixes renames overlapping (non-key) columns from each side
# Note: order_items_sample below shares only the key column with orders_sample,
# so the suffixes have no visible effect here; see the sketch below for a real collision
orders_sample = orders[['order_id', 'created_at']].head(3)
order_items_sample = order_items[['order_id', 'product_id', 'sale_price']].head(5)
result = pd.merge(orders_sample, order_items_sample, on='order_id', suffixes=['_order', '_item'])
print(result)

Output:
   order_id           created_at  product_id  sale_price
0         1  2023-01-15 10:23:45         234       89.99
1         1  2023-01-15 10:23:45         567      145.50
2         2  2023-01-15 11:45:12         234       89.99
3         2  2023-01-15 11:45:12         891       67.00
4         3  2023-01-15 14:30:00         123      199.99
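To see the suffixes take effect, both sides need a non-key column with the same name. A minimal sketch that forces such a collision by renaming sale_price and retail_price to a shared name (assuming products carries a retail_price column, as used in the join() section later):

# Hypothetical collision: both frames carry a column literally named 'price'
items_px = order_items[['product_id', 'sale_price']].head(5).rename(columns={'sale_price': 'price'})
products_px = products[['product_id', 'retail_price']].rename(columns={'retail_price': 'price'})
collide = pd.merge(items_px, products_px, on='product_id', how='left',
                   suffixes=('_sale', '_retail'))
print(collide)  # columns: product_id, price_sale, price_retail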
3. concat() Data Concatenation
Theory
concat() concatenates multiple DataFrames vertically (rows) or horizontally (columns).
Vertical Concatenation (Adding Rows)
# Split data
df1 = orders.head(100)
df2 = orders.tail(100)
# Vertical concatenation
vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
print(f"Vertical concatenation: {len(vertical)} rows")
# ignore_index=True: reset the index instead of keeping the original row labels

Output:
Vertical concatenation: 200 rows
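A quick way to see what ignore_index=True changes is to compare the resulting index with and without it; a minimal sketch:

# Without ignore_index, the original row labels from df1 and df2 are kept (and may repeat)
kept = pd.concat([df1, df2], axis=0)
reset = pd.concat([df1, df2], axis=0, ignore_index=True)
print(list(kept.index[:3]), list(kept.index[-3:]))    # labels carried over from orders
print(list(reset.index[:3]), list(reset.index[-3:]))  # 0, 1, 2, ..., 199 after the reset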
Horizontal Concatenation (Adding Columns)
# From data merged with order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# Different column data
df_info = df[['order_item_id', 'order_id']].head(5)
df_product = df[['brand', 'category']].head(5)
# Horizontal concatenation
horizontal = pd.concat([df_info, df_product], axis=1)
print("Horizontal concatenation:")
print(horizontal)

Output:
Horizontal concatenation:
   order_item_id  order_id   brand  category
0              1         1    Nike     Shoes
1              2         1  Adidas     Shoes
2              3         2    Nike     Shoes
3              4         2  Reebok  T-Shirts
4              5         3    Puma     Shoes
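One caveat worth knowing: concat(axis=1) aligns rows by index label, not by position, so pieces with mismatched indexes produce NaN instead of raising an error. A minimal sketch:

# Misaligned indexes: rows 0-2 get NaN for brand, rows 5-7 get NaN for order_item_id
a = df[['order_item_id']].head(5)   # index 0..4
b = df[['brand']].iloc[3:8]         # index 3..7
misaligned = pd.concat([a, b], axis=1)
print(misaligned)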
Concatenating Multiple DataFrames at Once
# Combine order data from multiple periods into one
df1 = orders.head(50)
df2 = orders.iloc[50:100]
df3 = orders.iloc[100:150]
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(f"3 DataFrames concatenated: {len(combined)} rows")3 DataFrames concatenated: 150 rows
Quiz 1: Basic merge
Problem
Combine the following two tables:
- Brand statistics: total sales count and average price per brand
- Brand tier table: contains only the top 5 brands by total sales count, labeled 'Premium'
Use a left join to attach the tier to every brand's statistics, and fill brands without a tier with 'Standard'.
View Answer
# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
total_sales=('order_item_id', 'count'),
avg_price=('sale_price', 'mean')
).round(2).reset_index()
# 2. Brand tier (only top 5 are Premium)
top_5_brands = brand_stats.nlargest(5, 'total_sales')['brand'].tolist()
brand_tier = pd.DataFrame({
'brand': top_5_brands,
'tier': ['Premium'] * 5
})
# 3. left join
result = pd.merge(brand_stats, brand_tier, on='brand', how='left')
# 4. Fill NaN with 'Standard'
result['tier'] = result['tier'].fillna('Standard')
print("Brand statistics + tier:")
print(result)
print(f"\nBrand count by tier:")
print(result['tier'].value_counts())

Output:
Brand statistics + tier:
brand total_sales avg_price tier
0 Adidas 16234 97.45 Premium
1 Asics 14645 96.78 Standard
2 Converse 14912 95.34 Standard
3 Fila 14789 94.56 Standard
4 New Balance 15123 98.23 Premium
5 Nike 16067 99.12 Premium
6 Puma 15189 97.89 Premium
7 Reebok 14987 96.45 Standard
8 Under Armour 15098 98.67 Premium
9 Vans 14956 95.12 Standard
Brand count by tier:
tier
Standard 5
Premium 5
Name: count, dtype: int64

Quiz 2: Complex merge
Problem
Merge the following two statistics to the original data:
- Average price by brand
- Average price by category
Then add new columns:
- vs_brand_avg: current price / brand average price
- vs_cat_avg: current price / category average price
Filter only products where both ratios are 1.5 or above (premium products).
View Answer
# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
brand_avg_price=('sale_price', 'mean')
).reset_index()
# 2. Category statistics
cat_stats = df.groupby('category').agg(
cat_avg_price=('sale_price', 'mean')
).reset_index()
# 3. Merge to original
result = df.copy()
result = pd.merge(result, brand_stats, on='brand', how='left')
result = pd.merge(result, cat_stats, on='category', how='left')
# 4. Add comparison columns
result['vs_brand_avg'] = result['sale_price'] / result['brand_avg_price']
result['vs_cat_avg'] = result['sale_price'] / result['cat_avg_price']
# 5. Filter premium products
premium = result[
(result['vs_brand_avg'] >= 1.5) &
(result['vs_cat_avg'] >= 1.5)
]
print(f"Premium products: {len(premium):,}")
print("\nResult:")
display_cols = ['brand', 'category', 'sale_price', 'brand_avg_price', 'cat_avg_price', 'vs_brand_avg', 'vs_cat_avg']
print(premium[display_cols].head(10).round(2))

Output:
Premium products: 4,523
Result:
brand category sale_price brand_avg_price cat_avg_price vs_brand_avg vs_cat_avg
19 Nike Shoes 245.67 99.12 95.67 2.48 2.57
38 Adidas Shoes 234.56 97.45 95.67 2.41 2.45
50 Puma T-Shirts 189.34 97.89 45.12 1.93 4.20
68 New Balance Shoes 267.89 98.23 95.67 2.73 2.80
74 Asics Shoes 198.45 96.78 95.67 2.05 2.07
77 Reebok T-Shirts 189.23 96.45 45.12 1.96 4.19
80 Vans Shoes 178.90 95.12 95.67 1.88 1.87
84 Fila Accessories 167.89 94.56 38.56 1.78 4.35
109 Converse Shoes 156.78 95.34 95.67 1.64 1.64
155 Under Armour Jackets 234.56 98.67 118.23 2.38 1.98

Quiz 3: Using indicator
Problem
Analyze the matching status of two datasets:
- Outer join orders data and users data
- Check matching status with indicator=True
- Output the following statistics:
- Data in both (both)
- Data with orders but no user info (left_only)
- Data with users but no orders (right_only)
View Answer
# Test with sample data (full data may be too large)
sample_orders = orders.head(1000)
sample_users = users.head(500)
# outer join with indicator
merged = pd.merge(
sample_orders, sample_users,
on='user_id',
how='outer',
indicator=True
)
# Analyze matching status
print("=== Matching Status Analysis ===")
status_counts = merged['_merge'].value_counts()
print(status_counts)
print(f"\nData in both: {status_counts.get('both', 0)}")
print(f"Orders only (no user info): {status_counts.get('left_only', 0)}")
print(f"Users only (no orders): {status_counts.get('right_only', 0)}")
# Details for each group
print("\n--- Data with orders only (no user info) ---")
left_only = merged[merged['_merge'] == 'left_only']
print(left_only[['order_id', 'user_id', 'created_at']].head())
print("\n--- Data with users only (no orders) ---")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['user_id', 'first_name', 'last_name', 'city']].head())

Output:
=== Matching Status Analysis ===
_merge
both 678
right_only 245
left_only 322
Name: count, dtype: int64
Data in both: 678
Orders only (no user info): 322
Users only (no orders): 245
--- Data with orders only (no user info) ---
order_id user_id created_at
678 45 512 2023-01-18 09:15:33
679 89 623 2023-01-20 14:22:11
680 123 578 2023-01-22 16:45:00
681 156 601 2023-01-24 11:30:45
682 178 534 2023-01-25 08:55:12
--- Data with users only (no orders) ---
user_id first_name last_name city
1000 156 Mike Kim Seoul
1001 203 Sarah Park Busan
1002 287 John Lee Daegu
1003 312 Emily Choi Incheon
1004 345 Daniel Jung Gwangju

4. join() Index-based Join
Theory
join() joins DataFrames on their index. It is more concise than merge(), but the join key must first be set as the index (or matched against a column via the on parameter, as shown in the sketch below).
# From data merged with order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# Index-based join
df1 = df[['brand', 'sale_price']].head(10).set_index('brand')
df2 = df[['brand', 'retail_price']].head(10).set_index('brand')
# join (join by same index)
result = df1.join(df2, lsuffix='_sale', rsuffix='_retail')
print(result.head())

Output:
        sale_price_sale  retail_price_retail
brand
Nike              89.99               156.78
Adidas           145.50               178.90
Nike              89.99               156.78
Reebok            67.00               123.45
Puma             199.99               234.56
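join() can also match a column on the left against the index of the right via the on parameter, which avoids re-indexing the left frame; a minimal sketch using a product lookup indexed by product_id:

# Column-to-index join: left keeps product_id as a column, right is indexed by product_id
products_idx = products.set_index('product_id')[['name', 'brand']]
joined = order_items[['order_item_id', 'product_id', 'sale_price']].head(5).join(products_idx, on='product_id')
print(joined)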
Summary
merge vs concat vs join
| Method | Use Case | SQL Equivalent |
|---|---|---|
| merge() | Join by columns | JOIN |
| concat() | Simple concatenation | UNION |
| join() | Join by index | JOIN |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| INNER JOIN | pd.merge(df1, df2, how='inner') |
| LEFT JOIN | pd.merge(df1, df2, how='left') |
| RIGHT JOIN | pd.merge(df1, df2, how='right') |
| FULL OUTER JOIN | pd.merge(df1, df2, how='outer') |
| UNION ALL | pd.concat([df1, df2]) |
| UNION | pd.concat([df1, df2]).drop_duplicates() |
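To make the UNION vs UNION ALL rows concrete, here is a minimal sketch using two overlapping slices of orders:

# UNION ALL keeps duplicate rows; UNION removes them
part_a = orders.head(60)
part_b = orders.iloc[40:100]   # overlaps part_a on rows 40-59
union_all = pd.concat([part_a, part_b], ignore_index=True)
union = pd.concat([part_a, part_b], ignore_index=True).drop_duplicates()
print(len(union_all), len(union))  # 120 vs 100, since the overlapping rows are identical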
merge() Key Parameters
| Parameter | Description | Example |
|---|---|---|
| on | Join key (same column name) | on='user_id' |
| left_on / right_on | Different column names | left_on='id', right_on='user_id' |
| how | Join type | 'inner', 'left', 'right', 'outer' |
| indicator | Add source column | indicator=True |
| suffixes | Suffix for duplicate columns | suffixes=['_left', '_right'] |
Next Steps
You’ve mastered data merging! Next, learn time series data analysis techniques in Date/Time Handling.