Data Merging and Joining
Learning Objectives
After completing this recipe, you will be able to:
- Perform SQL-style joins with merge()
- Concatenate data with concat()
- Join by index with join()
- Track join results with the indicator option
- Merge data using composite keys
0. Setup
Load Cookbook sample data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")Data loaded! - orders: 100,000 rows - order_items: 150,000 rows - products: 1,000 rows - users: 10,000 rows
1. merge() Basics
Theory
merge() is equivalent to SQL JOIN, combining two DataFrames based on common columns.
merge Types:
| Type | SQL Equivalent | Description |
|---|---|---|
| inner | INNER JOIN | Only rows in both |
| left | LEFT JOIN | All left + matching |
| right | RIGHT JOIN | All right + matching |
| outer | FULL OUTER JOIN | All from both |
Preparing Example Data
import pandas as pd
# Sample order_items and products (small dataset for examples)
sample_order_items = order_items.head(5)[['order_item_id', 'order_id', 'product_id', 'sale_price']]
sample_products = products.head(4)[['product_id', 'name', 'brand', 'category']]
print("Order items data:")
print(sample_order_items)
print("\nProducts data:")
print(sample_products)

Output:
Order items data:
   order_item_id  order_id  product_id  sale_price
0              1         1         234       89.99
1              2         1         567      145.50
2              3         2         234       89.99
3              4         2         891       67.00
4              5         3         123      199.99

Products data:
   product_id         name   brand  category
0         234   Air Max 90    Nike     Shoes
1         567  Ultra Boost  Adidas     Shoes
2         891    Classic T  Reebok  T-Shirts
3         123        Suede    Puma     Shoes
Inner Join (Intersection)
# inner join: only data present in both
inner = pd.merge(sample_order_items, sample_products, on='product_id', how='inner')
print(f"inner join result: {len(inner)} rows")
print(inner)

Output:
inner join result: 4 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0              1         1         234       89.99   Air Max 90    Nike     Shoes
1              3         2         234       89.99   Air Max 90    Nike     Shoes
2              2         1         567      145.50  Ultra Boost  Adidas     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
Result: only rows whose product_id appears in both tables are kept; rows without a match on either side are dropped.
Left Join (Keep All Left Rows)
# left join: all left (order_items) + matching right
left = pd.merge(sample_order_items, sample_products, on='product_id', how='left')
print(f"left join result: {len(left)} rows")
print(left)

Output:
left join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0              1         1         234       89.99   Air Max 90    Nike     Shoes
1              2         1         567      145.50  Ultra Boost  Adidas     Shoes
2              3         2         234       89.99   Air Max 90    Nike     Shoes
3              4         2         891       67.00    Classic T  Reebok  T-Shirts
4              5         3         123      199.99        Suede    Puma     Shoes
Result: all order items are kept, and product info is added where a matching product exists.
Right Join (Keep All Right Rows)
# right join: all right (products) + matching left
right = pd.merge(sample_order_items, sample_products, on='product_id', how='right')
print(f"right join result: {len(right)} rows")
print(right)

Output:
right join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0            1.0       1.0         234       89.99   Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99   Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            NaN       NaN         123         NaN        Suede    Puma     Shoes
Outer Join (Union)
# outer join: all from both
outer = pd.merge(sample_order_items, sample_products, on='product_id', how='outer')
print(f"outer join result: {len(outer)} rows")
print(outer)

Output:
outer join result: 5 rows
   order_item_id  order_id  product_id  sale_price         name   brand  category
0            1.0       1.0         234       89.99   Air Max 90    Nike     Shoes
1            3.0       2.0         234       89.99   Air Max 90    Nike     Shoes
2            2.0       1.0         567      145.50  Ultra Boost  Adidas     Shoes
3            4.0       2.0         891       67.00    Classic T  Reebok  T-Shirts
4            5.0       3.0         123      199.99        Suede    Puma     Shoes
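To see how the four join types differ at a glance, the row counts can be compared in a single loop over the how values; a minimal sketch reusing the same sample tables:

# Compare row counts for each join type on the same sample data
for how in ['inner', 'left', 'right', 'outer']:
    n = len(pd.merge(sample_order_items, sample_products, on='product_id', how=how))
    print(f"{how:>6}: {n} rows")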
2. Advanced merge Options
indicator Option
With indicator=True, you can track which side the data came from.
# Some users may not have placed orders
sample_orders = orders.head(100)
sample_users = users.head(80) # Only some users
# Check join source with indicator
merged = pd.merge(sample_orders, sample_users, on='user_id', how='outer', indicator=True)
print("merge source:")
print(merged['_merge'].value_counts())
# left_only: only in left (orders exist but no user info)
# right_only: only in right (users exist but no orders)
# both: in both
print("\nData only in right (users with no orders):")
print(merged[merged['_merge'] == 'right_only'][['user_id', 'first_name', 'last_name', '_merge']].head())

Output:
merge source:
_merge
both          72
right_only    15
left_only     28
Name: count, dtype: int64

Data only in right (users with no orders):
    user_id first_name last_name      _merge
72      156       Mike       Kim  right_only
73      203      Sarah      Park  right_only
74      287       John       Lee  right_only
75      312      Emily      Choi  right_only
76      345     Daniel      Jung  right_only
Joining with Different Column Names
# orders and users share the user_id column, so on='user_id' is enough
# (when the key columns have different names, use left_on/right_on -- see the sketch below)
result = pd.merge(
orders[['order_id', 'user_id', 'created_at']].head(5),
users[['user_id', 'first_name', 'last_name', 'city']].head(100),
on='user_id',
how='left'
)
print(result)

Output:
   order_id  user_id           created_at first_name last_name     city
0         1       42  2023-01-15 10:23:45       John       Kim    Seoul
1         2      156  2023-01-15 11:45:12       Mike      Park    Busan
2         3       42  2023-01-15 14:30:00       John       Kim    Seoul
3         4      287  2023-01-15 16:20:33      Emily       Lee    Daegu
4         5       78  2023-01-15 18:10:22      Sarah      Choi  Incheon
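When the key columns have different names on each side, left_on/right_on replaces on. A minimal sketch, assuming a hypothetical customers table whose key column is called id instead of user_id:

# Hypothetical: the same users data, with the key column renamed to 'id'
customers = users[['user_id', 'first_name', 'city']].rename(columns={'user_id': 'id'})
result = pd.merge(
    orders[['order_id', 'user_id']].head(5),
    customers,
    left_on='user_id',   # key column in the left DataFrame
    right_on='id',       # key column in the right DataFrame
    how='left'
)
# Both key columns are kept after the merge; drop the redundant one if not needed
result = result.drop(columns='id')
print(result)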
Joining with Multiple Keys
# After merging order_items with products, get brand+category statistics
df = pd.merge(order_items, products, on='product_id', how='left')
# Average price by brand+category
brand_cat_stats = df.groupby(['brand', 'category']).agg(
avg_price=('sale_price', 'mean')
).reset_index()
# Add brand+category average price to original data
sample = df[['order_item_id', 'brand', 'category', 'sale_price']].head(10)
result = pd.merge(
sample,
brand_cat_stats,
on=['brand', 'category'],
how='left'
)
print("Multiple key merge result:")
print(result.round(2))

Output:
Multiple key merge result:
   order_item_id   brand     category  sale_price  avg_price
0              1    Nike        Shoes       89.99      95.67
1              2  Adidas        Shoes      145.50      98.23
2              3    Nike        Shoes       89.99      95.67
3              4  Reebok     T-Shirts       67.00      45.12
4              5    Puma        Shoes      199.99      87.34
5              6    Nike     T-Shirts       45.00      42.89
6              7  Adidas        Pants       78.50      72.45
7              8    Vans        Shoes       65.00      68.90
8              9    Fila  Accessories       35.00      38.56
9             10    Nike      Jackets      125.00     118.23
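As a design note, when a group-level statistic only needs to be attached back onto the same DataFrame, groupby().transform() can achieve the same result without a separate merge; a minimal sketch of that alternative:

# Alternative to the merge above: broadcast the group mean directly onto each row
df['avg_price'] = df.groupby(['brand', 'category'])['sale_price'].transform('mean')
print(df[['order_item_id', 'brand', 'category', 'sale_price', 'avg_price']].head(10).round(2))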
suffixes Option
# suffixes renames overlapping (non-key) columns from each side
# Note: order_items_sample below shares only the key column with orders_sample,
# so the suffixes have no visible effect here; see the sketch below for a real collision
orders_sample = orders[['order_id', 'created_at']].head(3)
order_items_sample = order_items[['order_id', 'product_id', 'sale_price']].head(5)
result = pd.merge(orders_sample, order_items_sample, on='order_id', suffixes=['_order', '_item'])
print(result)

Output:
   order_id           created_at  product_id  sale_price
0         1  2023-01-15 10:23:45         234       89.99
1         1  2023-01-15 10:23:45         567      145.50
2         2  2023-01-15 11:45:12         234       89.99
3         2  2023-01-15 11:45:12         891       67.00
4         3  2023-01-15 14:30:00         123      199.99
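To see the suffixes take effect, both sides need a non-key column with the same name. A minimal sketch that forces such a collision by renaming sale_price and retail_price to a shared name (assuming products carries a retail_price column, as used in the join() section later):

# Hypothetical collision: both frames carry a column literally named 'price'
items_px = order_items[['product_id', 'sale_price']].head(5).rename(columns={'sale_price': 'price'})
products_px = products[['product_id', 'retail_price']].rename(columns={'retail_price': 'price'})
collide = pd.merge(items_px, products_px, on='product_id', how='left',
                   suffixes=('_sale', '_retail'))
print(collide)  # columns: product_id, price_sale, price_retail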
3. concat() Data Concatenation
Theory
concat() concatenates multiple DataFrames vertically (rows) or horizontally (columns).
Vertical Concatenation (Adding Rows)
# Split data
df1 = orders.head(100)
df2 = orders.tail(100)
# Vertical concatenation
vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
print(f"Vertical concatenation: {len(vertical)} rows")
# ignore_index=True: reset the index instead of keeping the original row labels

Output:
Vertical concatenation: 200 rows
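A quick way to see what ignore_index=True changes is to compare the resulting index with and without it; a minimal sketch:

# Without ignore_index, the original row labels from df1 and df2 are kept (and may repeat)
kept = pd.concat([df1, df2], axis=0)
reset = pd.concat([df1, df2], axis=0, ignore_index=True)
print(list(kept.index[:3]), list(kept.index[-3:]))    # labels carried over from orders
print(list(reset.index[:3]), list(reset.index[-3:]))  # 0, 1, 2, ..., 199 after the reset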
Horizontal Concatenation (Adding Columns)
# From data merged with order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# Different column data
df_info = df[['order_item_id', 'order_id']].head(5)
df_product = df[['brand', 'category']].head(5)
# Horizontal concatenation
horizontal = pd.concat([df_info, df_product], axis=1)
print("Horizontal concatenation:")
print(horizontal)

Output:
Horizontal concatenation:
   order_item_id  order_id   brand  category
0              1         1    Nike     Shoes
1              2         1  Adidas     Shoes
2              3         2    Nike     Shoes
3              4         2  Reebok  T-Shirts
4              5         3    Puma     Shoes
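One caveat worth knowing: concat(axis=1) aligns rows by index label, not by position, so pieces with mismatched indexes produce NaN instead of raising an error. A minimal sketch:

# Misaligned indexes: rows 0-2 get NaN for brand, rows 5-7 get NaN for order_item_id
a = df[['order_item_id']].head(5)   # index 0..4
b = df[['brand']].iloc[3:8]         # index 3..7
misaligned = pd.concat([a, b], axis=1)
print(misaligned)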
Concatenating Multiple DataFrames at Once
# Combine order data from multiple periods into one
df1 = orders.head(50)
df2 = orders.iloc[50:100]
df3 = orders.iloc[100:150]
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(f"3 DataFrames concatenated: {len(combined)} rows")3 DataFrames concatenated: 150 rows
Quiz 1: Basic merge
Problem
Combine the following two tables:
- Brand statistics: total sales count and average price per brand
- Brand tier table: contains only the top 5 brands by total sales count, labeled 'Premium'
Use a left join to attach the tier to every brand's statistics, and fill brands without a tier with 'Standard'.
View Answer
# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
total_sales=('order_item_id', 'count'),
avg_price=('sale_price', 'mean')
).round(2).reset_index()
# 2. Brand tier (only top 5 are Premium)
top_5_brands = brand_stats.nlargest(5, 'total_sales')['brand'].tolist()
brand_tier = pd.DataFrame({
'brand': top_5_brands,
'tier': ['Premium'] * 5
})
# 3. left join
result = pd.merge(brand_stats, brand_tier, on='brand', how='left')
# 4. Fill NaN with 'Standard'
result['tier'] = result['tier'].fillna('Standard')
print("Brand statistics + tier:")
print(result)
print(f"\nBrand count by tier:")
print(result['tier'].value_counts())

Output:
Brand statistics + tier:
brand total_sales avg_price tier
0 Adidas 16234 97.45 Premium
1 Asics 14645 96.78 Standard
2 Converse 14912 95.34 Standard
3 Fila 14789 94.56 Standard
4 New Balance 15123 98.23 Premium
5 Nike 16067 99.12 Premium
6 Puma 15189 97.89 Premium
7 Reebok 14987 96.45 Standard
8 Under Armour 15098 98.67 Premium
9 Vans 14956 95.12 Standard
Brand count by tier:
tier
Standard 5
Premium 5
Name: count, dtype: int64

Quiz 2: Complex merge
Problem
Merge the following two statistics to the original data:
- Average price by brand
- Average price by category
Then add new columns:
- vs_brand_avg: current price / brand average price
- vs_cat_avg: current price / category average price
Filter only products where both ratios are 1.5 or above (premium products).
View Answer
# Merge order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# 1. Brand statistics
brand_stats = df.groupby('brand').agg(
brand_avg_price=('sale_price', 'mean')
).reset_index()
# 2. Category statistics
cat_stats = df.groupby('category').agg(
cat_avg_price=('sale_price', 'mean')
).reset_index()
# 3. Merge to original
result = df.copy()
result = pd.merge(result, brand_stats, on='brand', how='left')
result = pd.merge(result, cat_stats, on='category', how='left')
# 4. Add comparison columns
result['vs_brand_avg'] = result['sale_price'] / result['brand_avg_price']
result['vs_cat_avg'] = result['sale_price'] / result['cat_avg_price']
# 5. Filter premium products
premium = result[
(result['vs_brand_avg'] >= 1.5) &
(result['vs_cat_avg'] >= 1.5)
]
print(f"Premium products: {len(premium):,}")
print("\nResult:")
display_cols = ['brand', 'category', 'sale_price', 'brand_avg_price', 'cat_avg_price', 'vs_brand_avg', 'vs_cat_avg']
print(premium[display_cols].head(10).round(2))

Output:
Premium products: 4,523
Result:
brand category sale_price brand_avg_price cat_avg_price vs_brand_avg vs_cat_avg
19 Nike Shoes 245.67 99.12 95.67 2.48 2.57
38 Adidas Shoes 234.56 97.45 95.67 2.41 2.45
50 Puma T-Shirts 189.34 97.89 45.12 1.93 4.20
68 New Balance Shoes 267.89 98.23 95.67 2.73 2.80
74 Asics Shoes 198.45 96.78 95.67 2.05 2.07
77 Reebok T-Shirts 189.23 96.45 45.12 1.96 4.19
80 Vans Shoes 178.90 95.12 95.67 1.88 1.87
84 Fila Accessories 167.89 94.56 38.56 1.78 4.35
109 Converse Shoes 156.78 95.34 95.67 1.64 1.64
155 Under Armour Jackets 234.56 98.67 118.23 2.38 1.98

Quiz 3: Using indicator
Problem
Analyze the matching status of two datasets:
- Outer join orders data and users data
- Check matching status with indicator=True
- Output the following statistics:
- Data in both (both)
- Data with orders but no user info (left_only)
- Data with users but no orders (right_only)
View Answer
# Test with sample data (full data may be too large)
sample_orders = orders.head(1000)
sample_users = users.head(500)
# outer join with indicator
merged = pd.merge(
sample_orders, sample_users,
on='user_id',
how='outer',
indicator=True
)
# Analyze matching status
print("=== Matching Status Analysis ===")
status_counts = merged['_merge'].value_counts()
print(status_counts)
print(f"\nData in both: {status_counts.get('both', 0)}")
print(f"Orders only (no user info): {status_counts.get('left_only', 0)}")
print(f"Users only (no orders): {status_counts.get('right_only', 0)}")
# Details for each group
print("\n--- Data with orders only (no user info) ---")
left_only = merged[merged['_merge'] == 'left_only']
print(left_only[['order_id', 'user_id', 'created_at']].head())
print("\n--- Data with users only (no orders) ---")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['user_id', 'first_name', 'last_name', 'city']].head())

Output:
=== Matching Status Analysis ===
_merge
both 678
right_only 245
left_only 322
Name: count, dtype: int64
Data in both: 678
Orders only (no user info): 322
Users only (no orders): 245
--- Data with orders only (no user info) ---
order_id user_id created_at
678 45 512 2023-01-18 09:15:33
679 89 623 2023-01-20 14:22:11
680 123 578 2023-01-22 16:45:00
681 156 601 2023-01-24 11:30:45
682 178 534 2023-01-25 08:55:12
--- Data with users only (no orders) ---
user_id first_name last_name city
1000 156 Mike Kim Seoul
1001 203 Sarah Park Busan
1002 287 John Lee Daegu
1003 312 Emily Choi Incheon
1004 345 Daniel Jung Gwangju

4. join() Index-based Join
Theory
join() joins DataFrames on their index. It is more concise than merge(), but the join key must first be set as the index (or matched against a column via the on parameter, as shown in the sketch below).
# From data merged with order_items and products
df = pd.merge(order_items, products, on='product_id', how='left')
# Index-based join
df1 = df[['brand', 'sale_price']].head(10).set_index('brand')
df2 = df[['brand', 'retail_price']].head(10).set_index('brand')
# join (join by same index)
result = df1.join(df2, lsuffix='_sale', rsuffix='_retail')
print(result.head())

Output:
        sale_price_sale  retail_price_retail
brand
Nike              89.99               156.78
Adidas           145.50               178.90
Nike              89.99               156.78
Reebok            67.00               123.45
Puma             199.99               234.56
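join() can also match a column on the left against the index of the right via the on parameter, which avoids re-indexing the left frame; a minimal sketch using a product lookup indexed by product_id:

# Column-to-index join: left keeps product_id as a column, right is indexed by product_id
products_idx = products.set_index('product_id')[['name', 'brand']]
joined = order_items[['order_item_id', 'product_id', 'sale_price']].head(5).join(products_idx, on='product_id')
print(joined)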
Summary
merge vs concat vs join
| Method | Use Case | SQL Equivalent |
|---|---|---|
| merge() | Join by columns | JOIN |
| concat() | Simple concatenation | UNION |
| join() | Join by index | JOIN |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| INNER JOIN | pd.merge(df1, df2, how='inner') |
| LEFT JOIN | pd.merge(df1, df2, how='left') |
| RIGHT JOIN | pd.merge(df1, df2, how='right') |
| FULL OUTER JOIN | pd.merge(df1, df2, how='outer') |
| UNION ALL | pd.concat([df1, df2]) |
| UNION | pd.concat([df1, df2]).drop_duplicates() |
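To make the UNION vs UNION ALL rows concrete, here is a minimal sketch using two overlapping slices of orders:

# UNION ALL keeps duplicate rows; UNION removes them
part_a = orders.head(60)
part_b = orders.iloc[40:100]   # overlaps part_a on rows 40-59
union_all = pd.concat([part_a, part_b], ignore_index=True)
union = pd.concat([part_a, part_b], ignore_index=True).drop_duplicates()
print(len(union_all), len(union))  # 120 vs 100, since the overlapping rows are identical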
merge() Key Parameters
| Parameter | Description | Example |
|---|---|---|
| on | Join key (same column name) | on='user_id' |
| left_on / right_on | Different column names | left_on='id', right_on='user_id' |
| how | Join type | 'inner', 'left', 'right', 'outer' |
| indicator | Add source column | indicator=True |
| suffixes | Suffix for duplicate columns | suffixes=['_left', '_right'] |
Next Steps
You’ve mastered data merging! Next, learn time series data analysis techniques in Date/Time Handling.