Pivot and Data Reshaping
Learning Objectives
After completing this recipe, you will be able to:
- Create cross-tabs with
pivot_table() - Convert Wide to Long format with
melt() - Transform shapes with
stack()/unstack() - Work with
MultiIndex - Understand data reshaping patterns
0. Setup
Load e-commerce data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
)
print(f"Data loaded: {len(df):,} rows")Data loaded: 181,043 rows
1. pivot_table() Basics
Theory
pivot_table() restructures data into rows and columns for aggregation. It’s equivalent to Excel’s pivot table.
Key Parameters:
values: Value to aggregateindex: Row indexcolumns: Column indexaggfunc: Aggregation function
Basic Usage
import pandas as pd
# Average sale price by department and order status
pivot = df.pivot_table(
values='sale_price',
index='department',
columns='status',
aggfunc='mean'
)
print("Average price by department/order status:")
print(pivot.round(2))Average price by department/order status: status Cancelled Complete Processing Returned Shipped department Kids 48.23 47.89 48.12 47.95 48.34 Men 58.67 59.12 58.45 58.89 59.23 Women 54.34 53.56 54.12 53.78 54.45
Multiple Aggregation Functions
# Mean, count, and sum at once
pivot = df.pivot_table(
values='sale_price',
index='department',
columns='status',
aggfunc=['mean', 'count', 'sum']
)
print("Multiple aggregations:")
print(pivot.round(2))Multiple aggregations:
mean count sum
status Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped
department
Kids 48.23 47.89 48.12 47.95 48.34 2345 8567 1234 1456 3421 113119.35 410309.63 59380.08 69815.20 165387.14
Men 58.67 59.12 58.45 58.89 59.23 4567 16234 2456 2789 6543 267878.89 959674.08 143515.20 164223.21 387583.89
Women 54.34 53.56 54.12 53.78 54.45 5678 19456 2890 3123 7654 308538.52 1042060.16 156406.80 167955.94 416779.30Multiple Values Pivot
# Sale price and retail price at once
pivot = df.pivot_table(
values=['sale_price', 'retail_price'],
index='department',
aggfunc='mean'
)
print("Multiple column pivot:")
print(pivot.round(2))Multiple column pivot:
retail_price sale_price
department
Kids 68.45 48.15
Men 84.23 58.87
Women 77.56 54.02Adding Totals (margins)
# Add row/column totals
pivot = df.pivot_table(
values='sale_price',
index='department',
columns='status',
aggfunc='mean',
margins=True, # Add totals
margins_name='Total' # Total label
)
print("With totals:")
print(pivot.round(2))With totals: status Cancelled Complete Processing Returned Shipped Total department Kids 48.23 47.89 48.12 47.95 48.34 48.15 Men 58.67 59.12 58.45 58.89 59.23 58.87 Women 54.34 53.56 54.12 53.78 54.45 54.02 Total 54.12 53.89 54.01 53.92 54.23 54.01
2. melt() Wide to Long Conversion
Theory
melt() converts wide format data to long format. It’s the reverse operation of pivot.
Wide format: Each variable is in a column Long format: Variable names are values (suitable for visualization)
Basic Usage
# Wide format sample
wide_df = pd.DataFrame({
'brand': ['Nike', 'Adidas'],
'Q1_sales': [100, 150],
'Q2_sales': [120, 140],
'Q3_sales': [90, 160],
'Q4_sales': [130, 180]
})
print("Wide format:")
print(wide_df)
# Convert to Long format
long_df = wide_df.melt(
id_vars=['brand'], # Columns to keep
value_vars=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'], # Columns to transform
var_name='quarter', # Name for variable column
value_name='sales' # Name for value column
)
print("\nLong format:")
print(long_df)Wide format: brand Q1_sales Q2_sales Q3_sales Q4_sales 0 Nike 100 120 90 130 1 Adidas 150 140 160 180 Long format: brand quarter sales 0 Nike Q1_sales 100 1 Adidas Q1_sales 150 2 Nike Q2_sales 120 3 Adidas Q2_sales 140 4 Nike Q3_sales 90 5 Adidas Q3_sales 160 6 Nike Q4_sales 130 7 Adidas Q4_sales 180
Practical Application
# Average price by department (Wide)
dept_stats = df.groupby('department')[['sale_price', 'retail_price']].mean().reset_index()
print("Wide format:")
print(dept_stats.round(2))
# Convert to Long format (suitable for visualization)
dept_long = dept_stats.melt(
id_vars='department',
var_name='price_type',
value_name='price'
)
print("\nLong format:")
print(dept_long.round(2))Wide format: department sale_price retail_price 0 Kids 48.15 68.45 1 Men 58.87 84.23 2 Women 54.02 77.56 Long format: department price_type price 0 Kids sale_price 48.15 1 Men sale_price 58.87 2 Women sale_price 54.02 3 Kids retail_price 68.45 4 Men retail_price 84.23 5 Women retail_price 77.56
3. stack() / unstack()
Theory
stack(): Move columns to index (Wide to Long)unstack(): Move index to columns (Long to Wide)
unstack() Example
# Multi-group result (Long format)
multi = df.groupby(['department', 'status'])['sale_price'].mean()
print("Long format (MultiIndex):")
print(multi.round(2))
# unstack: Order status to columns
unstacked = multi.unstack(level='status')
print("\nAfter unstack (Wide format):")
print(unstacked.round(2))Long format (MultiIndex):
department status
Kids Cancelled 48.23
Complete 47.89
Processing 48.12
Returned 47.95
Shipped 48.34
Men Cancelled 58.67
Complete 59.12
Processing 58.45
Returned 58.89
Shipped 59.23
Women Cancelled 54.34
Complete 53.56
Processing 54.12
Returned 53.78
Shipped 54.45
Name: sale_price, dtype: float64
After unstack (Wide format):
status Cancelled Complete Processing Returned Shipped
department
Kids 48.23 47.89 48.12 47.95 48.34
Men 58.67 59.12 58.45 58.89 59.23
Women 54.34 53.56 54.12 53.78 54.45stack() Example
# Wide format
wide = unstacked.copy()
print("Wide format:")
print(wide.round(2))
# stack: Back to Long format
stacked = wide.stack()
print("\nAfter stack (Long format):")
print(stacked.round(2))Wide format:
status Cancelled Complete Processing Returned Shipped
department
Kids 48.23 47.89 48.12 47.95 48.34
Men 58.67 59.12 58.45 58.89 59.23
Women 54.34 53.56 54.12 53.78 54.45
After stack (Long format):
department status
Kids Cancelled 48.23
Complete 47.89
Processing 48.12
Returned 47.95
Shipped 48.34
Men Cancelled 58.67
Complete 59.12
Processing 58.45
Returned 58.89
Shipped 59.23
Women Cancelled 54.34
Complete 53.56
Processing 54.12
Returned 53.78
Shipped 54.45
dtype: float644. Working with MultiIndex
Theory
MultiIndex is a data structure with multiple levels of index. Useful for hierarchical data representation.
Creating MultiIndex
# Create MultiIndex with groupby
multi_df = df.groupby(['department', 'status'])['sale_price'].agg(['mean', 'count', 'sum'])
print("MultiIndex DataFrame:")
print(multi_df.round(2))
print(f"\nIndex levels: {multi_df.index.names}")MultiIndex DataFrame:
mean count sum
department status
Kids Cancelled 48.23 2345 113119.35
Complete 47.89 8567 410309.63
Processing 48.12 1234 59380.08
Returned 47.95 1456 69815.20
Shipped 48.34 3421 165387.14
Men Cancelled 58.67 4567 267878.89
Complete 59.12 16234 959674.08
Processing 58.45 2456 143515.20
Returned 58.89 2789 164223.21
Shipped 59.23 6543 387583.89
Women Cancelled 54.34 5678 308538.52
Complete 53.56 19456 1042060.16
Processing 54.12 2890 156406.80
Returned 53.78 3123 167955.94
Shipped 54.45 7654 416779.30
Index levels: ['department', 'status']Selecting MultiIndex Data
# Select by first level
print("Women department all:")
print(multi_df.loc['Women'].round(2))
# Specify both levels
print("\nWomen department, Complete status:")
print(multi_df.loc[('Women', 'Complete')].round(2))Women department all:
mean count sum
status
Cancelled 54.34 5678 308538.52
Complete 53.56 19456 1042060.16
Processing 54.12 2890 156406.80
Returned 53.78 3123 167955.94
Shipped 54.45 7654 416779.30
Women department, Complete status:
mean 53.56
count 19456.00
sum 1042060.16
Name: (Women, Complete), dtype: float64Flattening MultiIndex
# Convert to regular columns with reset_index()
flat_df = multi_df.reset_index()
print("Flattened DataFrame:")
print(flat_df.round(2))Flattened DataFrame: department status mean count sum 0 Kids Cancelled 48.23 2345 113119.35 1 Kids Complete 47.89 8567 410309.63 2 Kids Processing 48.12 1234 59380.08 3 Kids Returned 47.95 1456 69815.20 4 Kids Shipped 48.34 3421 165387.14 5 Men Cancelled 58.67 4567 267878.89 6 Men Complete 59.12 16234 959674.08 7 Men Processing 58.45 2456 143515.20 8 Men Returned 58.89 2789 164223.21 9 Men Shipped 59.23 6543 387583.89 10 Women Cancelled 54.34 5678 308538.52 11 Women Complete 53.56 19456 1042060.16 12 Women Processing 54.12 2890 156406.80 13 Women Returned 53.78 3123 167955.94 14 Women Shipped 54.45 7654 416779.30
Setting MultiIndex with set_index()
# Multiple columns as index
indexed = df.set_index(['department', 'category', 'brand']).head(10)
print("3-level MultiIndex:")
print(indexed[['sale_price', 'retail_price']].round(2))3-level MultiIndex:
sale_price retail_price
department category brand
Women Accessories Funny Girl Designs 5.99 6.99
Accessories Fossil 45.67 89.00
Men Jeans Levi's 35.99 59.99
Women Tops & Tees Calvin Klein 28.45 45.00
Kids Outerwear The North Face 65.50 110.00
Women Dresses Free People 78.90 125.00
Men Shorts Nike 24.99 35.00
Kids Accessories Disney 8.99 12.99
Women Swim Billabong 42.00 68.00
Men Active Under Armour 38.75 55.00Quiz 1: pivot_table() Basics
Problem
From order data:
- Pivot with department as rows and order status as columns
- Calculate mean and count of sale_price
- Include row/column totals
- Round to 2 decimal places
View Answer
import pandas as pd
# Pivot table
pivot = df.pivot_table(
values='sale_price',
index='department',
columns='status',
aggfunc=['mean', 'count'],
margins=True,
margins_name='Total'
)
print("Pivot table by department/order status:")
print(pivot.round(2))Pivot table by department/order status:
mean count
status Cancelled Complete Processing Returned Shipped Total Cancelled Complete Processing Returned Shipped Total
department
Kids 48.23 47.89 48.12 47.95 48.34 48.15 2345 8567 1234 1456 3421 17023
Men 58.67 59.12 58.45 58.89 59.23 58.87 4567 16234 2456 2789 6543 32589
Women 54.34 53.56 54.12 53.78 54.45 54.02 5678 19456 2890 3123 7654 38801
Total 54.12 53.89 54.01 53.92 54.23 54.01 12590 44257 6580 7368 17618 88413Quiz 2: Using melt()
Problem
Convert the following wide format monthly sales data to long format:
monthly_sales = pd.DataFrame({
'brand': ['Nike', 'Adidas', 'Puma'],
'Jan': [10000, 8000, 5000],
'Feb': [12000, 9000, 6000],
'Mar': [11000, 8500, 5500]
})After converting to long format, calculate the average sales by brand.
View Answer
# Wide format
monthly_sales = pd.DataFrame({
'brand': ['Nike', 'Adidas', 'Puma'],
'Jan': [10000, 8000, 5000],
'Feb': [12000, 9000, 6000],
'Mar': [11000, 8500, 5500]
})
print("Wide format:")
print(monthly_sales)
# Convert to Long format
long_df = monthly_sales.melt(
id_vars='brand',
var_name='month',
value_name='sales'
)
print("\nLong format:")
print(long_df)
# Average sales by brand
brand_avg = long_df.groupby('brand')['sales'].mean()
print("\nAverage sales by brand:")
print(brand_avg)Wide format: brand Jan Feb Mar 0 Nike 10000 12000 11000 1 Adidas 8000 9000 8500 2 Puma 5000 6000 5500 Long format: brand month sales 0 Nike Jan 10000 1 Adidas Jan 8000 2 Puma Jan 5000 3 Nike Feb 12000 4 Adidas Feb 9000 5 Puma Feb 6000 6 Nike Mar 11000 7 Adidas Mar 8500 8 Puma Mar 5500 Average sales by brand: brand Adidas 8500.0 Nike 11000.0 Puma 5500.0 Name: sales, dtype: float64
Quiz 3: Using MultiIndex (Advanced)
Problem
- Group by 3 levels: department, category, order status
- Calculate mean, min, max of sale_price
- Use unstack() to convert order status to columns
- Output top 10 by average price (Complete column), sorted descending
View Answer
# 3-level grouping
multi = df.groupby(['department', 'category', 'status'])['sale_price'].agg(['mean', 'min', 'max'])
# unstack to move order status to columns
unstacked = multi.unstack(level='status')
# Sort by average price (Complete) descending
result = unstacked.sort_values(('mean', 'Complete'), ascending=False).head(10)
print("Statistics by department/category/order status (top 10):")
print(result.round(2))Statistics by department/category/order status (top 10):
mean min max
status Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped
department category
Men Outerwear 112.34 113.45 111.89 112.67 113.12 23.45 25.67 22.89 24.12 25.34 289.56 287.34 285.67 288.12 286.45
Women Outerwear 105.45 106.78 104.89 105.23 106.34 19.78 20.89 19.23 19.56 20.45 267.34 265.89 264.23 266.78 265.12
Men Suits 103.23 104.56 102.78 103.45 104.12 18.45 19.23 17.89 18.12 18.89 256.78 254.34 253.67 255.23 253.89
Kids Outerwear 101.89 102.34 101.23 101.67 102.01 17.89 18.12 17.34 17.56 17.89 245.67 243.89 242.34 244.56 243.12
Women Suits 99.45 100.12 98.89 99.23 99.78 16.45 17.23 16.12 16.34 16.89 234.56 232.78 231.23 233.45 232.12
Men Blazers 97.23 98.45 96.78 97.12 97.89 15.23 16.12 14.89 15.12 15.78 223.45 221.67 220.12 222.34 221.01
Women Dresses 95.67 96.78 95.12 95.45 96.23 14.67 15.34 14.23 14.45 15.01 212.34 210.89 209.45 211.23 210.34
Kids Suits 93.45 94.56 92.89 93.23 93.89 13.45 14.23 13.12 13.34 13.89 201.23 199.45 198.12 200.34 199.01
Men Jackets 91.23 92.34 90.78 91.12 91.78 12.34 13.12 12.01 12.23 12.78 189.67 187.89 186.34 188.56 187.23
Women Jackets 89.12 90.23 88.67 89.01 89.67 11.45 12.23 11.12 11.34 11.89 178.45 176.67 175.23 177.34 176.015. Practical Patterns
Pattern 1: Cross-Tab Analysis
# Order status x Department cross-analysis
crosstab = pd.crosstab(
df['status'],
df['department'],
values=df['sale_price'],
aggfunc='mean',
margins=True
)
print("Cross-tab:")
print(crosstab.round(2))Cross-tab: department Kids Men Women All status Cancelled 48.23 58.67 54.34 54.12 Complete 47.89 59.12 53.56 53.89 Processing 48.12 58.45 54.12 54.01 Returned 47.95 58.89 53.78 53.92 Shipped 48.34 59.23 54.45 54.23 All 48.15 58.87 54.02 54.01
Pattern 2: Pivot for Heatmap
import matplotlib.pyplot as plt
# Monthly/day-of-week sales pivot
df['month'] = df['created_at'].dt.month
df['dayofweek'] = df['created_at'].dt.dayofweek
heatmap_data = df.pivot_table(
values='sale_price',
index='dayofweek',
columns='month',
aggfunc='sum'
)
print("Monthly/day-of-week sales heatmap data:")
print(heatmap_data.round(0).head())Monthly/day-of-week sales heatmap data: month 1 2 3 4 5 ... dayofweek 0 234567.0 212345.0 245678.0 223456.0 256789.0 ... 1 223456.0 201234.0 234567.0 212345.0 245678.0 ... 2 212345.0 190123.0 223456.0 201234.0 234567.0 ... 3 245678.0 223456.0 256789.0 234567.0 267890.0 ... 4 234567.0 212345.0 245678.0 223456.0 256789.0 ...
Pattern 3: Long Format for Visualization
# Average price by department/order status (Wide)
stats = df.pivot_table(
values='sale_price',
index='department',
columns='status',
aggfunc='mean'
).round(2)
print("Wide format (pivot result):")
print(stats)
# Convert to Long format
stats_long = stats.reset_index().melt(
id_vars='department',
var_name='status',
value_name='avg_price'
)
print("\nLong format (for visualization):")
print(stats_long)Wide format (pivot result): status Cancelled Complete Processing Returned Shipped department Kids 48.23 47.89 48.12 47.95 48.34 Men 58.67 59.12 58.45 58.89 59.23 Women 54.34 53.56 54.12 53.78 54.45 Long format (for visualization): department status avg_price 0 Kids Cancelled 48.23 1 Men Cancelled 58.67 2 Women Cancelled 54.34 3 Kids Complete 47.89 4 Men Complete 59.12 5 Women Complete 53.56 6 Kids Processing 48.12 7 Men Processing 58.45 8 Women Processing 54.12 9 Kids Returned 47.95 10 Men Returned 58.89 11 Women Returned 53.78 12 Kids Shipped 48.34 13 Men Shipped 59.23 14 Women Shipped 54.45
Summary
Wide vs Long Comparison
| Format | Characteristic | Advantage | Use Case |
|---|---|---|---|
| Wide | Variables in columns | Intuitive, easy to read | Reports, tables |
| Long | Variables as values | Easy to group | Visualization, statistical analysis |
Shape Transformation Functions
| Transformation | Function | Direction |
|---|---|---|
| Wide to Long | melt() | Columns to rows |
| Wide to Long | stack() | Columns to index |
| Long to Wide | pivot_table() | Values to rows x columns |
| Long to Wide | unstack() | Index to columns |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
PIVOT | df.pivot_table() |
UNPIVOT | df.melt() |
GROUP BY ROLLUP | pivot_table(margins=True) |
pivot_table() Key Parameters
| Parameter | Description | Example |
|---|---|---|
values | Value to aggregate | 'sale_price' |
index | Row index | 'department' |
columns | Column index | 'status' |
aggfunc | Aggregation function | 'mean', ['mean', 'sum'] |
margins | Add totals | True |
fill_value | Replace NaN | 0 |
Conclusion
You’ve completed all the basic Pandas recipes! You’ve now learned all the core features of Pandas from data loading to advanced reshaping.
As a next step, learn data visualization using Matplotlib, Seaborn, and Plotly in the Visualization section.