Pivot and Data Reshaping

IntermediateAdvanced

Learning Objectives

After completing this recipe, you will be able to:

Create cross-tabs with pivot_table()
Convert Wide to Long format with melt()
Transform shapes with stack() / unstack()
Work with MultiIndex
Understand data reshaping patterns

0. Setup

Load e-commerce data for hands-on practice.


import pandas as pd
import numpy as np
 
# Load data
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
    orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
)
 
print(f"Data loaded: {len(df):,} rows")

실행 결과

Data loaded: 181,043 rows

1. pivot_table() Basics

Theory

pivot_table() restructures data into rows and columns for aggregation. It’s equivalent to Excel’s pivot table.

Key Parameters:

values: Value to aggregate
index: Row index
columns: Column index
aggfunc: Aggregation function

Basic Usage


import pandas as pd
 
# Average sale price by department and order status
pivot = df.pivot_table(
    values='sale_price',
    index='department',
    columns='status',
    aggfunc='mean'
)
 
print("Average price by department/order status:")
print(pivot.round(2))

실행 결과

Average price by department/order status:
status      Cancelled  Complete  Processing  Returned  Shipped
department
Kids            48.23     47.89       48.12     47.95    48.34
Men             58.67     59.12       58.45     58.89    59.23
Women           54.34     53.56       54.12     53.78    54.45

Multiple Aggregation Functions


# Mean, count, and sum at once
pivot = df.pivot_table(
    values='sale_price',
    index='department',
    columns='status',
    aggfunc=['mean', 'count', 'sum']
)
 
print("Multiple aggregations:")
print(pivot.round(2))

실행 결과

Multiple aggregations:
              mean                                 count                              sum
status     Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped      Cancelled      Complete   Processing     Returned       Shipped
department
Kids           48.23    47.89      48.12    47.95   48.34      2345     8567       1234     1456    3421      113119.35    410309.63    59380.08    69815.20   165387.14
Men            58.67    59.12      58.45    58.89   59.23      4567    16234       2456     2789    6543      267878.89    959674.08   143515.20   164223.21   387583.89
Women          54.34    53.56      54.12    53.78   54.45      5678    19456       2890     3123    7654      308538.52   1042060.16   156406.80   167955.94   416779.30

Multiple Values Pivot


# Sale price and retail price at once
pivot = df.pivot_table(
    values=['sale_price', 'retail_price'],
    index='department',
    aggfunc='mean'
)
 
print("Multiple column pivot:")
print(pivot.round(2))

실행 결과

Multiple column pivot:
          retail_price  sale_price
department
Kids               68.45       48.15
Men                84.23       58.87
Women              77.56       54.02

Adding Totals (margins)


# Add row/column totals
pivot = df.pivot_table(
    values='sale_price',
    index='department',
    columns='status',
    aggfunc='mean',
    margins=True,          # Add totals
    margins_name='Total'   # Total label
)
 
print("With totals:")
print(pivot.round(2))

실행 결과

With totals:
status      Cancelled  Complete  Processing  Returned  Shipped     Total
department
Kids            48.23     47.89       48.12     47.95    48.34    48.15
Men             58.67     59.12       58.45     58.89    59.23    58.87
Women           54.34     53.56       54.12     53.78    54.45    54.02
Total           54.12     53.89       54.01     53.92    54.23    54.01

2. melt() Wide to Long Conversion

Theory

melt() converts wide format data to long format. It’s the reverse operation of pivot.

Wide format: Each variable is in a column Long format: Variable names are values (suitable for visualization)

Basic Usage


# Wide format sample
wide_df = pd.DataFrame({
    'brand': ['Nike', 'Adidas'],
    'Q1_sales': [100, 150],
    'Q2_sales': [120, 140],
    'Q3_sales': [90, 160],
    'Q4_sales': [130, 180]
})
 
print("Wide format:")
print(wide_df)
 
# Convert to Long format
long_df = wide_df.melt(
    id_vars=['brand'],           # Columns to keep
    value_vars=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],  # Columns to transform
    var_name='quarter',          # Name for variable column
    value_name='sales'           # Name for value column
)
 
print("\nLong format:")
print(long_df)

실행 결과

Wide format:
  brand  Q1_sales  Q2_sales  Q3_sales  Q4_sales
0    Nike       100       120        90       130
1  Adidas       150       140       160       180

Long format:
  brand   quarter  sales
0    Nike  Q1_sales    100
1  Adidas  Q1_sales    150
2    Nike  Q2_sales    120
3  Adidas  Q2_sales    140
4    Nike  Q3_sales     90
5  Adidas  Q3_sales    160
6    Nike  Q4_sales    130
7  Adidas  Q4_sales    180

Practical Application


# Average price by department (Wide)
dept_stats = df.groupby('department')[['sale_price', 'retail_price']].mean().reset_index()
print("Wide format:")
print(dept_stats.round(2))
 
# Convert to Long format (suitable for visualization)
dept_long = dept_stats.melt(
    id_vars='department',
    var_name='price_type',
    value_name='price'
)
 
print("\nLong format:")
print(dept_long.round(2))

실행 결과

Wide format:
department  sale_price  retail_price
0       Kids       48.15         68.45
1        Men       58.87         84.23
2      Women       54.02         77.56

Long format:
department    price_type   price
0       Kids    sale_price   48.15
1        Men    sale_price   58.87
2      Women    sale_price   54.02
3       Kids  retail_price   68.45
4        Men  retail_price   84.23
5      Women  retail_price   77.56

3. stack() / unstack()

Theory

stack(): Move columns to index (Wide to Long)
unstack(): Move index to columns (Long to Wide)

unstack() Example


# Multi-group result (Long format)
multi = df.groupby(['department', 'status'])['sale_price'].mean()
print("Long format (MultiIndex):")
print(multi.round(2))
 
# unstack: Order status to columns
unstacked = multi.unstack(level='status')
print("\nAfter unstack (Wide format):")
print(unstacked.round(2))

실행 결과

Long format (MultiIndex):
department  status
Kids        Cancelled     48.23
          Complete      47.89
          Processing    48.12
          Returned      47.95
          Shipped       48.34
Men         Cancelled     58.67
          Complete      59.12
          Processing    58.45
          Returned      58.89
          Shipped       59.23
Women       Cancelled     54.34
          Complete      53.56
          Processing    54.12
          Returned      53.78
          Shipped       54.45
Name: sale_price, dtype: float64

After unstack (Wide format):
status      Cancelled  Complete  Processing  Returned  Shipped
department
Kids            48.23     47.89       48.12     47.95    48.34
Men             58.67     59.12       58.45     58.89    59.23
Women           54.34     53.56       54.12     53.78    54.45

stack() Example


# Wide format
wide = unstacked.copy()
print("Wide format:")
print(wide.round(2))
 
# stack: Back to Long format
stacked = wide.stack()
print("\nAfter stack (Long format):")
print(stacked.round(2))

실행 결과

Wide format:
status      Cancelled  Complete  Processing  Returned  Shipped
department
Kids            48.23     47.89       48.12     47.95    48.34
Men             58.67     59.12       58.45     58.89    59.23
Women           54.34     53.56       54.12     53.78    54.45

After stack (Long format):
department  status
Kids        Cancelled     48.23
          Complete      47.89
          Processing    48.12
          Returned      47.95
          Shipped       48.34
Men         Cancelled     58.67
          Complete      59.12
          Processing    58.45
          Returned      58.89
          Shipped       59.23
Women       Cancelled     54.34
          Complete      53.56
          Processing    54.12
          Returned      53.78
          Shipped       54.45
dtype: float64

4. Working with MultiIndex

Theory

MultiIndex is a data structure with multiple levels of index. Useful for hierarchical data representation.

Creating MultiIndex


# Create MultiIndex with groupby
multi_df = df.groupby(['department', 'status'])['sale_price'].agg(['mean', 'count', 'sum'])
 
print("MultiIndex DataFrame:")
print(multi_df.round(2))
print(f"\nIndex levels: {multi_df.index.names}")

실행 결과

MultiIndex DataFrame:
                      mean  count         sum
department status
Kids       Cancelled   48.23   2345   113119.35
         Complete    47.89   8567   410309.63
         Processing  48.12   1234    59380.08
         Returned    47.95   1456    69815.20
         Shipped     48.34   3421   165387.14
Men        Cancelled   58.67   4567   267878.89
         Complete    59.12  16234   959674.08
         Processing  58.45   2456   143515.20
         Returned    58.89   2789   164223.21
         Shipped     59.23   6543   387583.89
Women      Cancelled   54.34   5678   308538.52
         Complete    53.56  19456  1042060.16
         Processing  54.12   2890   156406.80
         Returned    53.78   3123   167955.94
         Shipped     54.45   7654   416779.30

Index levels: ['department', 'status']

Selecting MultiIndex Data


# Select by first level
print("Women department all:")
print(multi_df.loc['Women'].round(2))
 
# Specify both levels
print("\nWomen department, Complete status:")
print(multi_df.loc[('Women', 'Complete')].round(2))

실행 결과

Women department all:
           mean  count          sum
status
Cancelled   54.34   5678    308538.52
Complete    53.56  19456   1042060.16
Processing  54.12   2890    156406.80
Returned    53.78   3123    167955.94
Shipped     54.45   7654    416779.30

Women department, Complete status:
mean          53.56
count      19456.00
sum      1042060.16
Name: (Women, Complete), dtype: float64

Flattening MultiIndex


# Convert to regular columns with reset_index()
flat_df = multi_df.reset_index()
print("Flattened DataFrame:")
print(flat_df.round(2))

실행 결과

Flattened DataFrame:
 department      status   mean  count          sum
0        Kids   Cancelled  48.23   2345    113119.35
1        Kids    Complete  47.89   8567    410309.63
2        Kids  Processing  48.12   1234     59380.08
3        Kids    Returned  47.95   1456     69815.20
4        Kids     Shipped  48.34   3421    165387.14
5         Men   Cancelled  58.67   4567    267878.89
6         Men    Complete  59.12  16234    959674.08
7         Men  Processing  58.45   2456    143515.20
8         Men    Returned  58.89   2789    164223.21
9         Men     Shipped  59.23   6543    387583.89
10      Women   Cancelled  54.34   5678    308538.52
11      Women    Complete  53.56  19456   1042060.16
12      Women  Processing  54.12   2890    156406.80
13      Women    Returned  53.78   3123    167955.94
14      Women     Shipped  54.45   7654    416779.30

Setting MultiIndex with set_index()


# Multiple columns as index
indexed = df.set_index(['department', 'category', 'brand']).head(10)
print("3-level MultiIndex:")
print(indexed[['sale_price', 'retail_price']].round(2))

실행 결과

3-level MultiIndex:
                                            sale_price  retail_price
department category    brand
Women      Accessories Funny Girl Designs          5.99          6.99
         Accessories Fossil                     45.67         89.00
Men        Jeans       Levi's                     35.99         59.99
Women      Tops & Tees Calvin Klein               28.45         45.00
Kids       Outerwear   The North Face             65.50        110.00
Women      Dresses     Free People                78.90        125.00
Men        Shorts      Nike                       24.99         35.00
Kids       Accessories Disney                      8.99         12.99
Women      Swim        Billabong                  42.00         68.00
Men        Active      Under Armour               38.75         55.00

Quiz 1: pivot_table() Basics

Problem

From order data:

Pivot with department as rows and order status as columns
Calculate mean and count of sale_price
Include row/column totals
Round to 2 decimal places

View Answer


import pandas as pd
 
# Pivot table
pivot = df.pivot_table(
    values='sale_price',
    index='department',
    columns='status',
    aggfunc=['mean', 'count'],
    margins=True,
    margins_name='Total'
)
 
print("Pivot table by department/order status:")
print(pivot.round(2))

실행 결과

Pivot table by department/order status:
              mean                                               count
status     Cancelled  Complete  Processing  Returned  Shipped    Total  Cancelled  Complete  Processing  Returned  Shipped    Total
department
Kids           48.23     47.89       48.12     47.95    48.34   48.15       2345      8567        1234      1456     3421   17023
Men            58.67     59.12       58.45     58.89    59.23   58.87       4567     16234        2456      2789     6543   32589
Women          54.34     53.56       54.12     53.78    54.45   54.02       5678     19456        2890      3123     7654   38801
Total          54.12     53.89       54.01     53.92    54.23   54.01      12590     44257        6580      7368    17618   88413

Quiz 2: Using melt()

Problem

Convert the following wide format monthly sales data to long format:


monthly_sales = pd.DataFrame({
    'brand': ['Nike', 'Adidas', 'Puma'],
    'Jan': [10000, 8000, 5000],
    'Feb': [12000, 9000, 6000],
    'Mar': [11000, 8500, 5500]
})

After converting to long format, calculate the average sales by brand.

View Answer


# Wide format
monthly_sales = pd.DataFrame({
    'brand': ['Nike', 'Adidas', 'Puma'],
    'Jan': [10000, 8000, 5000],
    'Feb': [12000, 9000, 6000],
    'Mar': [11000, 8500, 5500]
})
 
print("Wide format:")
print(monthly_sales)
 
# Convert to Long format
long_df = monthly_sales.melt(
    id_vars='brand',
    var_name='month',
    value_name='sales'
)
 
print("\nLong format:")
print(long_df)
 
# Average sales by brand
brand_avg = long_df.groupby('brand')['sales'].mean()
print("\nAverage sales by brand:")
print(brand_avg)

실행 결과

Wide format:
  brand    Jan    Feb    Mar
0    Nike  10000  12000  11000
1  Adidas   8000   9000   8500
2    Puma   5000   6000   5500

Long format:
  brand month  sales
0    Nike   Jan  10000
1  Adidas   Jan   8000
2    Puma   Jan   5000
3    Nike   Feb  12000
4  Adidas   Feb   9000
5    Puma   Feb   6000
6    Nike   Mar  11000
7  Adidas   Mar   8500
8    Puma   Mar   5500

Average sales by brand:
brand
Adidas     8500.0
Nike      11000.0
Puma       5500.0
Name: sales, dtype: float64

Quiz 3: Using MultiIndex (Advanced)

Problem

Group by 3 levels: department, category, order status
Calculate mean, min, max of sale_price
Use unstack() to convert order status to columns
Output top 10 by average price (Complete column), sorted descending

View Answer


# 3-level grouping
multi = df.groupby(['department', 'category', 'status'])['sale_price'].agg(['mean', 'min', 'max'])
 
# unstack to move order status to columns
unstacked = multi.unstack(level='status')
 
# Sort by average price (Complete) descending
result = unstacked.sort_values(('mean', 'Complete'), ascending=False).head(10)
 
print("Statistics by department/category/order status (top 10):")
print(result.round(2))

실행 결과

Statistics by department/category/order status (top 10):
                         mean                                                 min                                              max
status                Cancelled  Complete  Processing  Returned  Shipped  Cancelled  Complete  Processing  Returned  Shipped  Cancelled  Complete  Processing  Returned  Shipped
department category
Men        Outerwear     112.34    113.45      111.89    112.67   113.12       23.45     25.67       22.89     24.12    25.34     289.56    287.34      285.67    288.12   286.45
Women      Outerwear     105.45    106.78      104.89    105.23   106.34       19.78     20.89       19.23     19.56    20.45     267.34    265.89      264.23    266.78   265.12
Men        Suits         103.23    104.56      102.78    103.45   104.12       18.45     19.23       17.89     18.12    18.89     256.78    254.34      253.67    255.23   253.89
Kids       Outerwear     101.89    102.34      101.23    101.67   102.01       17.89     18.12       17.34     17.56    17.89     245.67    243.89      242.34    244.56   243.12
Women      Suits          99.45    100.12       98.89     99.23    99.78       16.45     17.23       16.12     16.34    16.89     234.56    232.78      231.23    233.45   232.12
Men        Blazers        97.23     98.45       96.78     97.12    97.89       15.23     16.12       14.89     15.12    15.78     223.45    221.67      220.12    222.34   221.01
Women      Dresses        95.67     96.78       95.12     95.45    96.23       14.67     15.34       14.23     14.45    15.01     212.34    210.89      209.45    211.23   210.34
Kids       Suits          93.45     94.56       92.89     93.23    93.89       13.45     14.23       13.12     13.34    13.89     201.23    199.45      198.12    200.34   199.01
Men        Jackets        91.23     92.34       90.78     91.12    91.78       12.34     13.12       12.01     12.23    12.78     189.67    187.89      186.34    188.56   187.23
Women      Jackets        89.12     90.23       88.67     89.01    89.67       11.45     12.23       11.12     11.34    11.89     178.45    176.67      175.23    177.34   176.01

5. Practical Patterns

Pattern 1: Cross-Tab Analysis


# Order status x Department cross-analysis
crosstab = pd.crosstab(
    df['status'],
    df['department'],
    values=df['sale_price'],
    aggfunc='mean',
    margins=True
)
 
print("Cross-tab:")
print(crosstab.round(2))

실행 결과

Cross-tab:
department   Kids     Men   Women    All
status
Cancelled   48.23   58.67   54.34  54.12
Complete    47.89   59.12   53.56  53.89
Processing  48.12   58.45   54.12  54.01
Returned    47.95   58.89   53.78  53.92
Shipped     48.34   59.23   54.45  54.23
All         48.15   58.87   54.02  54.01

Pattern 2: Pivot for Heatmap


import matplotlib.pyplot as plt
 
# Monthly/day-of-week sales pivot
df['month'] = df['created_at'].dt.month
df['dayofweek'] = df['created_at'].dt.dayofweek
 
heatmap_data = df.pivot_table(
    values='sale_price',
    index='dayofweek',
    columns='month',
    aggfunc='sum'
)
 
print("Monthly/day-of-week sales heatmap data:")
print(heatmap_data.round(0).head())

실행 결과

Monthly/day-of-week sales heatmap data:
month            1          2          3          4          5  ...
dayofweek
0           234567.0   212345.0   245678.0   223456.0   256789.0  ...
1           223456.0   201234.0   234567.0   212345.0   245678.0  ...
2           212345.0   190123.0   223456.0   201234.0   234567.0  ...
3           245678.0   223456.0   256789.0   234567.0   267890.0  ...
4           234567.0   212345.0   245678.0   223456.0   256789.0  ...

Pattern 3: Long Format for Visualization


# Average price by department/order status (Wide)
stats = df.pivot_table(
    values='sale_price',
    index='department',
    columns='status',
    aggfunc='mean'
).round(2)
 
print("Wide format (pivot result):")
print(stats)
 
# Convert to Long format
stats_long = stats.reset_index().melt(
    id_vars='department',
    var_name='status',
    value_name='avg_price'
)
 
print("\nLong format (for visualization):")
print(stats_long)

실행 결과

Wide format (pivot result):
status      Cancelled  Complete  Processing  Returned  Shipped
department
Kids            48.23     47.89       48.12     47.95    48.34
Men             58.67     59.12       58.45     58.89    59.23
Women           54.34     53.56       54.12     53.78    54.45

Long format (for visualization):
 department      status  avg_price
0        Kids   Cancelled      48.23
1         Men   Cancelled      58.67
2       Women   Cancelled      54.34
3        Kids    Complete      47.89
4         Men    Complete      59.12
5       Women    Complete      53.56
6        Kids  Processing      48.12
7         Men  Processing      58.45
8       Women  Processing      54.12
9        Kids    Returned      47.95
10        Men    Returned      58.89
11      Women    Returned      53.78
12       Kids     Shipped      48.34
13        Men     Shipped      59.23
14      Women     Shipped      54.45

Summary

Wide vs Long Comparison

Format	Characteristic	Advantage	Use Case
Wide	Variables in columns	Intuitive, easy to read	Reports, tables
Long	Variables as values	Easy to group	Visualization, statistical analysis

Shape Transformation Functions

Transformation	Function	Direction
Wide to Long	`melt()`	Columns to rows
Wide to Long	`stack()`	Columns to index
Long to Wide	`pivot_table()`	Values to rows x columns
Long to Wide	`unstack()`	Index to columns

SQL to Pandas Comparison

SQL	Pandas
`PIVOT`	`df.pivot_table()`
`UNPIVOT`	`df.melt()`
`GROUP BY ROLLUP`	`pivot_table(margins=True)`

pivot_table() Key Parameters

Parameter	Description	Example
`values`	Value to aggregate	`'sale_price'`
`index`	Row index	`'department'`
`columns`	Column index	`'status'`
`aggfunc`	Aggregation function	`'mean'`, `['mean', 'sum']`
`margins`	Add totals	`True`
`fill_value`	Replace NaN	`0`

Conclusion

You’ve completed all the basic Pandas recipes! You’ve now learned all the core features of Pandas from data loading to advanced reshaping.

As a next step, learn data visualization using Matplotlib, Seaborn, and Plotly in the Visualization section.