Grouping and Aggregation
Learning Objectives
After completing this recipe, you will be able to:
- Group data with
groupby() - Apply various aggregation functions with
agg() - Transform by group with
transform() - Apply custom functions with
apply() - Perform multi-level grouping and complex aggregations
0. Setup
Load e-commerce data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
)
print(f"Data loaded: {len(df):,} rows")
print(f"Columns: {df.columns.tolist()}")Data loaded: 181,490 rows Columns: ['order_item_id', 'order_id', 'product_id', 'sale_price', 'category', 'brand', 'retail_price', 'department', 'user_id', 'created_at', 'status']
1. groupby() Basics
Theory
groupby() is equivalent to SQL’s GROUP BY, grouping data by specific columns.
Basic Pattern:
DataFrame.groupby(grouping_column)[aggregation_column].aggregation_function()Basic Usage
import pandas as pd
import numpy as np
# Average price by category
category_avg = df.groupby('category')['sale_price'].mean()
print("Average price by category:")
print(category_avg.round(2).head(10))
# Sales count by brand
brand_count = df.groupby('brand')['order_id'].count()
print("\nSales count by brand:")
print(brand_count.sort_values(ascending=False).head(10))Average price by category: category Accessories 44.89 Active 62.45 Blazers & Jackets 128.67 Clothing Sets 78.34 Dresses 89.56 Fashion Hoodies 72.34 Jeans 56.78 Jumpsuits & Rompers 95.23 Leggings 45.67 Outerwear & Coats 142.89 Name: sale_price, dtype: float64 Sales count by brand: brand Allegra K 8234 Calvin Klein 7856 Carhartt 7623 Columbia 7456 Diesel 7234 Dockers 6987 GUESS 6754 Hanes 6523 Levi's 6234 Nike 5987 Name: order_id, dtype: int64
Multiple Aggregation Functions
# Multiple statistics by status at once
status_stats = df.groupby('status')['sale_price'].agg([
('count', 'count'),
('sum', 'sum'),
('mean', 'mean'),
('min', 'min'),
('max', 'max'),
('std', 'std')
]).round(2)
print("Statistics by status:")
print(status_stats)Statistics by status:
count sum mean min max std
status
Cancelled 12456 845678.90 67.89 5.23 289.56 52.34
Complete 98765 7234567.89 73.25 4.56 312.45 58.67
Processing 34567 2456789.01 71.08 6.78 298.34 55.89
Returned 18234 1234567.89 67.70 5.67 287.23 54.23
Shipped 17468 1289012.34 73.79 7.89 301.56 57.452. agg() Method
Theory
agg() (aggregate) allows applying different aggregation functions to multiple columns.
Different Aggregation Functions per Column
# Apply different aggregation functions per column
result = df.groupby('category').agg({
'sale_price': ['mean', 'sum'], # price: mean, sum
'order_id': 'count', # orders: count
'retail_price': ['mean', 'max'] # retail: mean, max
}).round(2)
print("Multiple aggregations by category:")
print(result.head(10))Multiple aggregations by category:
sale_price order_id retail_price
mean sum count mean max
category
Accessories 44.89 234567.89 5234 56.78 189.00
Active 62.45 456789.01 7312 78.90 245.00
Blazers & Jackets 128.67 789012.34 6134 156.78 398.00
Clothing Sets 78.34 567890.12 7249 98.45 312.00
Dresses 89.56 678901.23 7582 112.34 345.00Named Aggregation (Recommended)
# More readable approach
result = df.groupby('category').agg(
avg_price=('sale_price', 'mean'),
total_sales=('sale_price', 'sum'),
sales_count=('order_id', 'count'),
avg_retail=('retail_price', 'mean')
).round(2)
print("Statistics by category:")
print(result.head(10))Statistics by category:
avg_price total_sales sales_count avg_retail
category
Accessories 44.89 234567.89 5234 56.78
Active 62.45 456789.01 7312 78.90
Blazers & Jackets 128.67 789012.34 6134 156.78
Clothing Sets 78.34 567890.12 7249 98.45
Dresses 89.56 678901.23 7582 112.34
Fashion Hoodies 72.34 534567.89 7389 89.67
Jeans 56.78 398765.43 7023 71.23
Jumpsuits & Rompers 95.23 623456.78 6547 118.90
Leggings 45.67 312345.67 6841 58.34
Outerwear & Coats 142.89 912345.67 6386 178.56Custom Functions
# Calculate top 10% average
def top_10_percent_avg(x):
return x.nlargest(int(len(x) * 0.1)).mean()
# Calculate price range
def price_range(x):
return x.max() - x.min()
result = df.groupby('category').agg(
mean=('sale_price', 'mean'),
top_10_pct_avg=('sale_price', top_10_percent_avg),
price_range=('sale_price', price_range)
).round(2)
print("Custom aggregations:")
print(result.head(10))Custom aggregations:
mean top_10_pct_avg price_range
category
Accessories 44.89 98.56 183.77
Active 62.45 145.67 238.22
Blazers & Jackets 128.67 278.34 392.77
Clothing Sets 78.34 167.89 305.67
Dresses 89.56 198.45 339.893. transform() Method
Theory
transform() returns group-level aggregation results while maintaining the original DataFrame size.
apply() vs transform():
apply(): Summarizes results by group (size changes)transform(): Maintains same size as original
Basic Usage
# Add each category's average price to all rows
df['category_avg_price'] = df.groupby('category')['sale_price'].transform('mean')
print("Category average price added:")
print(df[['category', 'sale_price', 'category_avg_price']].head(10).round(2))Category average price added:
category sale_price category_avg_price
0 Accessories 34.99 44.89
1 Active 89.99 62.45
2 Dresses 125.00 89.56
3 Jeans 45.50 56.78
4 Blazers & Jackets 189.99 128.67
5 Leggings 29.99 45.67
6 Fashion Hoodies 65.00 72.34
7 Accessories 52.99 44.89
8 Active 78.50 62.45
9 Dresses 95.00 89.56Within-Group Normalization
# Normalize price within each category (vs average)
df['price_vs_category_avg'] = df['sale_price'] / df['category_avg_price']
print("Price vs category average:")
result = df[['category', 'brand', 'sale_price', 'category_avg_price', 'price_vs_category_avg']]
print(result.head(15).round(2))
print("\nInterpretation: Values > 1.0 mean products more expensive than category average")Price vs category average:
category brand sale_price category_avg_price price_vs_category_avg
0 Accessories Allegra K 34.99 44.89 0.78
1 Active Calvin Klein 89.99 62.45 1.44
2 Dresses Columbia 125.00 89.56 1.40
3 Jeans Diesel 45.50 56.78 0.80
4 Blazers & Jackets Carhartt 189.99 128.67 1.48
5 Leggings Hanes 29.99 45.67 0.66
6 Fashion Hoodies Levi's 65.00 72.34 0.90
7 Accessories GUESS 52.99 44.89 1.18
8 Active Nike 78.50 62.45 1.26
9 Dresses Dockers 95.00 89.56 1.06
10 Jeans Calvin Klein 68.99 56.78 1.21
11 Blazers & Jackets Columbia 245.00 128.67 1.90
12 Leggings Allegra K 38.50 45.67 0.84
13 Fashion Hoodies Carhartt 85.00 72.34 1.17
14 Accessories Diesel 28.99 44.89 0.65
Interpretation: Values > 1.0 mean products more expensive than category averageZ-score Calculation
# Detect outliers with Z-score by brand
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
# Outliers with |Z-score| >= 2
outliers = df[df['z_score'].abs() >= 2]
print(f"Number of outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.1f}%)")Number of outliers: 8,234 (4.5%)
4. apply() Method
Theory
apply() can apply complex custom functions to groups.
apply() on Series
# Price categorization function
def categorize_price(price):
if price < 50:
return 'Low'
elif price < 100:
return 'Medium'
else:
return 'High'
df['price_tier'] = df['sale_price'].apply(categorize_price)
print("Distribution by price tier:")
print(df['price_tier'].value_counts())Distribution by price tier: price_tier Medium 82345 Low 58678 High 40467 Name: count, dtype: int64
Row-wise apply()
# Calculate discount amount (row-wise)
def calc_discount_amount(row):
return row['retail_price'] - row['sale_price']
df['discount_amount'] = df.apply(calc_discount_amount, axis=1)
print("Discount amount calculation:")
print(df[['brand', 'retail_price', 'sale_price', 'discount_amount']].head(10).round(2))
print(f"\nAverage discount amount: ${df['discount_amount'].mean():.2f}")Discount amount calculation:
brand retail_price sale_price discount_amount
0 Allegra K 49.99 34.99 15.00
1 Calvin Klein 129.99 89.99 40.00
2 Columbia 175.00 125.00 50.00
3 Diesel 65.00 45.50 19.50
4 Carhartt 269.99 189.99 80.00
5 Hanes 39.99 29.99 10.00
6 Levi's 89.99 65.00 24.99
7 GUESS 74.99 52.99 22.00
8 Nike 110.00 78.50 31.50
9 Dockers 135.00 95.00 40.00
Average discount amount: $28.45apply() on Groups
# Select only top 3 products from each category
def top_3_products(group):
return group.nlargest(3, 'sale_price')
top_by_category = df.groupby('category', group_keys=False).apply(top_3_products)
print("Top 3 products by category:")
print(top_by_category[['category', 'brand', 'sale_price', 'status']].head(15).reset_index(drop=True))Top 3 products by category:
category brand sale_price status
0 Accessories Calvin Klein 189.00 Complete
1 Accessories Columbia 175.50 Complete
2 Accessories Carhartt 168.99 Shipped
3 Active Nike 245.00 Complete
4 Active Columbia 238.50 Cancelled
5 Active Carhartt 225.99 Complete
6 Blazers & Jackets Carhartt 398.00 Complete
7 Blazers & Jackets Calvin Klein 385.50 Shipped
8 Blazers & Jackets Columbia 372.00 Complete
9 Clothing Sets Diesel 312.00 Complete
10 Clothing Sets GUESS 298.50 Complete
11 Clothing Sets Levi's 285.00 Processing
12 Dresses Columbia 345.00 Complete
13 Dresses Calvin Klein 338.50 Shipped
14 Dresses GUESS 325.00 Complete5. Multi-Level Grouping
Grouping by Multiple Columns
# Group by category + status
result = df.groupby(['category', 'status']).agg(
avg_price=('sale_price', 'mean'),
sales_count=('order_id', 'count')
).round(2)
print("Statistics by category/status:")
print(result.head(15))Statistics by category/status:
avg_price sales_count
category status
Accessories Cancelled 42.34 678
Complete 45.67 2856
Processing 44.12 912
Returned 43.89 456
Shipped 46.23 332
Active Cancelled 58.90 923
Complete 63.78 4012
Processing 61.45 1234
Returned 60.23 678
Shipped 64.56 465
Blazers & Jackets Cancelled 125.34 756
Complete 130.45 3389
Processing 127.89 1045
Returned 124.67 512
Shipped 131.23 432Handling MultiIndex
# Convert to regular columns with reset_index()
result_reset = result.reset_index()
print("After index reset:")
print(result_reset.head(10))
# Pivot with unstack()
pivoted = df.groupby(['category', 'status'])['sale_price'].mean().unstack()
print("\nPivot format:")
print(pivoted.round(2).head(5))After index reset:
category status avg_price sales_count
0 Accessories Cancelled 42.34 678
1 Accessories Complete 45.67 2856
2 Accessories Processing 44.12 912
3 Accessories Returned 43.89 456
4 Accessories Shipped 46.23 332
5 Active Cancelled 58.90 923
6 Active Complete 63.78 4012
7 Active Processing 61.45 1234
8 Active Returned 60.23 678
9 Active Shipped 64.56 465
Pivot format:
status Cancelled Complete Processing Returned Shipped
category
Accessories 42.34 45.67 44.12 43.89 46.23
Active 58.90 63.78 61.45 60.23 64.56
Blazers & Jackets 125.34 130.45 127.89 124.67 131.23
Clothing Sets 75.67 79.34 77.45 76.23 80.12
Dresses 86.45 90.78 88.34 87.12 91.45Quiz 1: Basic Grouping
Problem
From order data:
- Calculate average sale price and sales count by brand
- Sort by average price in descending order
- Output top 10 brands only
View Answer
import pandas as pd
# Aggregate by brand
brand_stats = df.groupby('brand').agg(
avg_price=('sale_price', 'mean'),
sales_count=('order_id', 'count')
).round(2)
# Sort and get top 10
brand_stats = brand_stats.sort_values('avg_price', ascending=False)
top_10 = brand_stats.head(10)
print("Brand statistics (Top 10):")
print(top_10)Brand statistics (Top 10):
avg_price sales_count
brand
Carhartt 112.34 7623
Calvin Klein 108.56 7856
Columbia 105.89 7456
Diesel 102.34 7234
GUESS 98.78 6754
Nike 95.45 5987
Levi's 92.34 6234
Dockers 89.67 6987
Hanes 78.45 6523
Allegra K 72.34 8234Quiz 2: Using transform()
Problem
Calculate mean and standard deviation of price within each brand:
- Calculate Z-score: (price - mean) / standard deviation
- Find outliers where absolute value of Z-score is 2 or more
- Output the count and top 10 outliers
View Answer
# Mean and std by brand
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
# Calculate Z-score
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
# Filter outliers
outliers = df[df['z_score'].abs() >= 2].copy()
# Output results
print(f"Number of outliers: {len(outliers):,}")
# Top 10
result = outliers[['brand', 'category', 'sale_price', 'brand_mean', 'z_score']]
result = result.sort_values('z_score', key=abs, ascending=False).head(10)
print("\nTop 10 outliers:")
print(result.round(2))Number of outliers: 8,234
Top 10 outliers:
brand category sale_price brand_mean z_score
12345 Carhartt Blazers & Jackets 398.00 112.34 3.24
23456 Calvin Klein Outerwear & Coats 385.50 108.56 3.18
34567 Columbia Dresses 372.00 105.89 3.12
45678 Diesel Accessories 345.00 102.34 3.08
56789 GUESS Active 338.50 98.78 3.02
67890 Nike Blazers & Jackets 325.00 95.45 2.98
78901 Levi's Dresses 312.00 92.34 2.94
89012 Dockers Clothing Sets 298.50 89.67 2.89
90123 Hanes Active 285.00 78.45 2.85
12340 Allegra K Blazers & Jackets 278.00 72.34 2.81Quiz 3: Complex Aggregation (Advanced)
Problem
Perform complex analysis with the following conditions:
- Categorize prices with
apply()(below 50: ‘Low’, 50-100: ‘Medium’, 100 and above: ‘High’) - Calculate the following statistics by price category:
- Sales count
- Complete ratio
- Total sales
- Output results in order of price category (Low -> Medium -> High)
View Answer
# Price category classification
def categorize_price(price):
if price < 50:
return 'Low'
elif price < 100:
return 'Medium'
else:
return 'High'
df['price_tier'] = df['sale_price'].apply(categorize_price)
# Complete ratio calculation function
def complete_ratio(x):
return (x == 'Complete').mean()
# Statistics by price category
result = df.groupby('price_tier').agg(
sales_count=('order_id', 'count'),
complete_ratio=('status', complete_ratio),
total_sales=('sale_price', 'sum')
).round(2)
# Specify order
result = result.reindex(['Low', 'Medium', 'High'])
# Display complete ratio as percentage
result['complete_ratio'] = (result['complete_ratio'] * 100).round(1).astype(str) + '%'
print("Statistics by price category:")
print(result)Statistics by price category:
sales_count complete_ratio total_sales
price_tier
Low 58678 54.2% 1756789.01
Medium 82345 55.8% 5987654.32
High 40467 56.5% 5312345.676. pivot_table()
Theory
pivot_table() restructures data to create summary statistics.
Basic Usage
# Average sale price by category and status
pivot = df.pivot_table(
values='sale_price',
index='category',
columns='status',
aggfunc='mean'
)
print("Average price by category/status:")
print(pivot.round(2).head(5))Average price by category/status: status Cancelled Complete Processing Returned Shipped category Accessories 42.34 45.67 44.12 43.89 46.23 Active 58.90 63.78 61.45 60.23 64.56 Blazers & Jackets 125.34 130.45 127.89 124.67 131.23 Clothing Sets 75.67 79.34 77.45 76.23 80.12 Dresses 86.45 90.78 88.34 87.12 91.45
Multiple Aggregation Functions
pivot = df.pivot_table(
values='sale_price',
index='category',
columns='status',
aggfunc=['mean', 'count']
)
print("Multiple aggregations:")
print(pivot.round(2).head(5))Multiple aggregations:
mean count
status Cancelled Complete Processing Returned Shipped Cancelled Complete Processing Returned Shipped
category
Accessories 42.34 45.67 44.12 43.89 46.23 678 2856 912 456 332
Active 58.90 63.78 61.45 60.23 64.56 923 4012 1234 678 465
Blazers & Jackets 125.34 130.45 127.89 124.67 131.23 756 3389 1045 512 432
Clothing Sets 75.67 79.34 77.45 76.23 80.12 812 3987 1156 589 705
Dresses 86.45 90.78 88.34 87.12 91.45 845 4234 1289 623 591Adding Totals
pivot = df.pivot_table(
values='sale_price',
index='category',
columns='status',
aggfunc='mean',
margins=True, # Add totals
margins_name='Total' # Total name
)
print("With totals:")
print(pivot.round(2).head(6))With totals: status Cancelled Complete Processing Returned Shipped Total category Accessories 42.34 45.67 44.12 43.89 46.23 44.89 Active 58.90 63.78 61.45 60.23 64.56 62.45 Blazers & Jackets 125.34 130.45 127.89 124.67 131.23 128.67 Clothing Sets 75.67 79.34 77.45 76.23 80.12 78.34 Dresses 86.45 90.78 88.34 87.12 91.45 89.56 Total 67.89 73.25 71.08 67.70 73.79 71.56
Summary
Key Function Comparison
| Method | Use Case | Result Size |
|---|---|---|
groupby().agg() | Aggregate by group | Number of groups |
groupby().transform() | Add group stats to original | Same as original |
groupby().apply() | Apply custom functions | Depends on function |
pivot_table() | 2D summary table | Pivot format |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
GROUP BY col | df.groupby('col') |
SELECT AVG(price), SUM(price) | .agg({'price': ['mean', 'sum']}) |
HAVING COUNT(*) > 10 | Apply .query('count > 10') to result |
WITH cte AS (SELECT ...) | df['col'] = df.groupby(...).transform(...) |
Aggregation Function Summary
| Function | Description | Example |
|---|---|---|
'count' | Count | Excludes NaN |
'sum' | Sum | Numeric columns |
'mean' | Mean | Numeric columns |
'median' | Median | Numeric columns |
'min' / 'max' | Min/Max | All types |
'std' / 'var' | Std/Variance | Numeric columns |
'first' / 'last' | First/Last | All types |
'nunique' | Unique count | All types |
Next Steps
You’ve mastered grouping and aggregation! Next, learn data joining techniques including merge(), concat() in Data Merging.