Grouping and Aggregation

IntermediateAdvanced

Learning Objectives

After completing this recipe, you will be able to:

Group data with groupby()
Apply various aggregation functions with agg()
Transform by group with transform()
Apply custom functions with apply()
Perform multi-level grouping and complex aggregations

0. Setup

Load e-commerce data for hands-on practice.


import pandas as pd
import numpy as np
 
# Load data
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
    orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
)
 
print(f"Data loaded: {len(df):,} rows")
print(f"Columns: {df.columns.tolist()}")

실행 결과

Data loaded: 181,490 rows
Columns: ['order_item_id', 'order_id', 'product_id', 'sale_price', 'category', 'brand', 'retail_price', 'department', 'user_id', 'created_at', 'status']

1. groupby() Basics

Theory

groupby() is equivalent to SQL’s GROUP BY, grouping data by specific columns.

Basic Pattern:


DataFrame.groupby(grouping_column)[aggregation_column].aggregation_function()

Basic Usage


import pandas as pd
import numpy as np
 
# Average price by category
category_avg = df.groupby('category')['sale_price'].mean()
print("Average price by category:")
print(category_avg.round(2).head(10))
 
# Sales count by brand
brand_count = df.groupby('brand')['order_id'].count()
print("\nSales count by brand:")
print(brand_count.sort_values(ascending=False).head(10))

실행 결과

Average price by category:
category
Accessories            44.89
Active                 62.45
Blazers & Jackets     128.67
Clothing Sets          78.34
Dresses                89.56
Fashion Hoodies        72.34
Jeans                  56.78
Jumpsuits & Rompers    95.23
Leggings               45.67
Outerwear & Coats     142.89
Name: sale_price, dtype: float64

Sales count by brand:
brand
Allegra K           8234
Calvin Klein        7856
Carhartt            7623
Columbia            7456
Diesel              7234
Dockers             6987
GUESS               6754
Hanes               6523
Levi's              6234
Nike                5987
Name: order_id, dtype: int64

Multiple Aggregation Functions


# Multiple statistics by status at once
status_stats = df.groupby('status')['sale_price'].agg([
    ('count', 'count'),
    ('sum', 'sum'),
    ('mean', 'mean'),
    ('min', 'min'),
    ('max', 'max'),
    ('std', 'std')
]).round(2)
 
print("Statistics by status:")
print(status_stats)

실행 결과

Statistics by status:
            count         sum     mean  min    max   std
status
Cancelled    12456   845678.90   67.89   5.23  289.56   52.34
Complete     98765  7234567.89   73.25   4.56  312.45   58.67
Processing   34567  2456789.01   71.08   6.78  298.34   55.89
Returned     18234  1234567.89   67.70   5.67  287.23   54.23
Shipped      17468  1289012.34   73.79   7.89  301.56   57.45

2. agg() Method

Theory

agg() (aggregate) allows applying different aggregation functions to multiple columns.

Different Aggregation Functions per Column


# Apply different aggregation functions per column
result = df.groupby('category').agg({
    'sale_price': ['mean', 'sum'],      # price: mean, sum
    'order_id': 'count',                 # orders: count
    'retail_price': ['mean', 'max']      # retail: mean, max
}).round(2)
 
print("Multiple aggregations by category:")
print(result.head(10))

실행 결과

Multiple aggregations by category:
                    sale_price                order_id retail_price
                          mean         sum      count         mean    max
category
Accessories                44.89   234567.89       5234        56.78 189.00
Active                     62.45   456789.01       7312        78.90 245.00
Blazers & Jackets         128.67   789012.34       6134       156.78 398.00
Clothing Sets              78.34   567890.12       7249        98.45 312.00
Dresses                    89.56   678901.23       7582       112.34 345.00

Named Aggregation (Recommended)


# More readable approach
result = df.groupby('category').agg(
    avg_price=('sale_price', 'mean'),
    total_sales=('sale_price', 'sum'),
    sales_count=('order_id', 'count'),
    avg_retail=('retail_price', 'mean')
).round(2)
 
print("Statistics by category:")
print(result.head(10))

실행 결과

Statistics by category:
                     avg_price       total_sales  sales_count  avg_retail
category
Accessories               44.89   234567.89     5234     56.78
Active                    62.45   456789.01     7312     78.90
Blazers & Jackets        128.67   789012.34     6134    156.78
Clothing Sets             78.34   567890.12     7249     98.45
Dresses                   89.56   678901.23     7582    112.34
Fashion Hoodies           72.34   534567.89     7389     89.67
Jeans                     56.78   398765.43     7023     71.23
Jumpsuits & Rompers       95.23   623456.78     6547    118.90
Leggings                  45.67   312345.67     6841     58.34
Outerwear & Coats        142.89   912345.67     6386    178.56

Custom Functions


# Calculate top 10% average
def top_10_percent_avg(x):
    return x.nlargest(int(len(x) * 0.1)).mean()
 
# Calculate price range
def price_range(x):
    return x.max() - x.min()
 
result = df.groupby('category').agg(
    mean=('sale_price', 'mean'),
    top_10_pct_avg=('sale_price', top_10_percent_avg),
    price_range=('sale_price', price_range)
).round(2)
 
print("Custom aggregations:")
print(result.head(10))

실행 결과

Custom aggregations:
                      mean  top_10_pct_avg  price_range
category
Accessories            44.89        98.56    183.77
Active                 62.45       145.67    238.22
Blazers & Jackets     128.67       278.34    392.77
Clothing Sets          78.34       167.89    305.67
Dresses                89.56       198.45    339.89

3. transform() Method

Theory

transform() returns group-level aggregation results while maintaining the original DataFrame size.

apply() vs transform():

apply(): Summarizes results by group (size changes)
transform(): Maintains same size as original

Basic Usage


# Add each category's average price to all rows
df['category_avg_price'] = df.groupby('category')['sale_price'].transform('mean')
 
print("Category average price added:")
print(df[['category', 'sale_price', 'category_avg_price']].head(10).round(2))

실행 결과

Category average price added:
            category  sale_price  category_avg_price
0          Accessories       34.99               44.89
1               Active       89.99               62.45
2              Dresses      125.00               89.56
3                Jeans       45.50               56.78
4  Blazers & Jackets        189.99              128.67
5             Leggings       29.99               45.67
6      Fashion Hoodies       65.00               72.34
7          Accessories       52.99               44.89
8               Active       78.50               62.45
9              Dresses       95.00               89.56

Within-Group Normalization


# Normalize price within each category (vs average)
df['price_vs_category_avg'] = df['sale_price'] / df['category_avg_price']
 
print("Price vs category average:")
result = df[['category', 'brand', 'sale_price', 'category_avg_price', 'price_vs_category_avg']]
print(result.head(15).round(2))
 
print("\nInterpretation: Values > 1.0 mean products more expensive than category average")

실행 결과

Price vs category average:
            category        brand  sale_price  category_avg_price  price_vs_category_avg
0          Accessories    Allegra K       34.99               44.89                   0.78
1               Active  Calvin Klein      89.99               62.45                   1.44
2              Dresses     Columbia      125.00               89.56                   1.40
3                Jeans       Diesel       45.50               56.78                   0.80
4    Blazers & Jackets     Carhartt      189.99              128.67                   1.48
5             Leggings        Hanes       29.99               45.67                   0.66
6      Fashion Hoodies       Levi's       65.00               72.34                   0.90
7          Accessories        GUESS       52.99               44.89                   1.18
8               Active         Nike       78.50               62.45                   1.26
9              Dresses      Dockers       95.00               89.56                   1.06
10               Jeans  Calvin Klein      68.99               56.78                   1.21
11   Blazers & Jackets     Columbia      245.00              128.67                   1.90
12            Leggings    Allegra K       38.50               45.67                   0.84
13     Fashion Hoodies     Carhartt       85.00               72.34                   1.17
14         Accessories       Diesel       28.99               44.89                   0.65

Interpretation: Values > 1.0 mean products more expensive than category average

Z-score Calculation


# Detect outliers with Z-score by brand
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
 
# Outliers with |Z-score| >= 2
outliers = df[df['z_score'].abs() >= 2]
print(f"Number of outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.1f}%)")

실행 결과

Number of outliers: 8,234 (4.5%)

4. apply() Method

Theory

apply() can apply complex custom functions to groups.

apply() on Series


# Price categorization function
def categorize_price(price):
    if price < 50:
        return 'Low'
    elif price < 100:
        return 'Medium'
    else:
        return 'High'
 
df['price_tier'] = df['sale_price'].apply(categorize_price)
 
print("Distribution by price tier:")
print(df['price_tier'].value_counts())

실행 결과

Distribution by price tier:
price_tier
Medium    82345
Low       58678
High      40467
Name: count, dtype: int64

Row-wise apply()


# Calculate discount amount (row-wise)
def calc_discount_amount(row):
    return row['retail_price'] - row['sale_price']
 
df['discount_amount'] = df.apply(calc_discount_amount, axis=1)
 
print("Discount amount calculation:")
print(df[['brand', 'retail_price', 'sale_price', 'discount_amount']].head(10).round(2))
print(f"\nAverage discount amount: ${df['discount_amount'].mean():.2f}")

실행 결과

Discount amount calculation:
        brand  retail_price  sale_price  discount_amount
0     Allegra K         49.99       34.99            15.00
1  Calvin Klein        129.99       89.99            40.00
2      Columbia        175.00      125.00            50.00
3        Diesel         65.00       45.50            19.50
4      Carhartt        269.99      189.99            80.00
5         Hanes         39.99       29.99            10.00
6        Levi's         89.99       65.00            24.99
7         GUESS         74.99       52.99            22.00
8          Nike        110.00       78.50            31.50
9       Dockers        135.00       95.00            40.00

Average discount amount: $28.45

apply() on Groups


# Select only top 3 products from each category
def top_3_products(group):
    return group.nlargest(3, 'sale_price')
 
top_by_category = df.groupby('category', group_keys=False).apply(top_3_products)
print("Top 3 products by category:")
print(top_by_category[['category', 'brand', 'sale_price', 'status']].head(15).reset_index(drop=True))

실행 결과

Top 3 products by category:
            category         brand  sale_price     status
0          Accessories  Calvin Klein      189.00   Complete
1          Accessories      Columbia      175.50   Complete
2          Accessories      Carhartt      168.99    Shipped
3               Active          Nike      245.00   Complete
4               Active      Columbia      238.50  Cancelled
5               Active      Carhartt      225.99   Complete
6    Blazers & Jackets      Carhartt      398.00   Complete
7    Blazers & Jackets  Calvin Klein      385.50    Shipped
8    Blazers & Jackets      Columbia      372.00   Complete
9        Clothing Sets        Diesel      312.00   Complete
10       Clothing Sets         GUESS      298.50   Complete
11       Clothing Sets        Levi's      285.00  Processing
12             Dresses      Columbia      345.00   Complete
13             Dresses  Calvin Klein      338.50    Shipped
14             Dresses         GUESS      325.00   Complete

5. Multi-Level Grouping

Grouping by Multiple Columns


# Group by category + status
result = df.groupby(['category', 'status']).agg(
    avg_price=('sale_price', 'mean'),
    sales_count=('order_id', 'count')
).round(2)
 
print("Statistics by category/status:")
print(result.head(15))

실행 결과

Statistics by category/status:
                             avg_price  sales_count
category          status
Accessories       Cancelled      42.34      678
                Complete       45.67     2856
                Processing     44.12      912
                Returned       43.89      456
                Shipped        46.23      332
Active            Cancelled      58.90      923
                Complete       63.78     4012
                Processing     61.45     1234
                Returned       60.23      678
                Shipped        64.56      465
Blazers & Jackets Cancelled     125.34      756
                Complete      130.45     3389
                Processing    127.89     1045
                Returned      124.67      512
                Shipped       131.23      432

Handling MultiIndex


# Convert to regular columns with reset_index()
result_reset = result.reset_index()
print("After index reset:")
print(result_reset.head(10))
 
# Pivot with unstack()
pivoted = df.groupby(['category', 'status'])['sale_price'].mean().unstack()
print("\nPivot format:")
print(pivoted.round(2).head(5))

실행 결과

After index reset:
          category     status  avg_price  sales_count
0        Accessories  Cancelled    42.34      678
1        Accessories   Complete    45.67     2856
2        Accessories Processing    44.12      912
3        Accessories   Returned    43.89      456
4        Accessories    Shipped    46.23      332
5             Active  Cancelled    58.90      923
6             Active   Complete    63.78     4012
7             Active Processing    61.45     1234
8             Active   Returned    60.23      678
9             Active    Shipped    64.56      465

Pivot format:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

Quiz 1: Basic Grouping

Problem

From order data:

Calculate average sale price and sales count by brand
Sort by average price in descending order
Output top 10 brands only

View Answer


import pandas as pd
 
# Aggregate by brand
brand_stats = df.groupby('brand').agg(
    avg_price=('sale_price', 'mean'),
    sales_count=('order_id', 'count')
).round(2)
 
# Sort and get top 10
brand_stats = brand_stats.sort_values('avg_price', ascending=False)
top_10 = brand_stats.head(10)
 
print("Brand statistics (Top 10):")
print(top_10)

실행 결과

Brand statistics (Top 10):
            avg_price  sales_count
brand
Carhartt        112.34     7623
Calvin Klein    108.56     7856
Columbia        105.89     7456
Diesel          102.34     7234
GUESS            98.78     6754
Nike             95.45     5987
Levi's           92.34     6234
Dockers          89.67     6987
Hanes            78.45     6523
Allegra K        72.34     8234

Quiz 2: Using transform()

Problem

Calculate mean and standard deviation of price within each brand:

Calculate Z-score: (price - mean) / standard deviation
Find outliers where absolute value of Z-score is 2 or more
Output the count and top 10 outliers

View Answer


# Mean and std by brand
df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean')
df['brand_std'] = df.groupby('brand')['sale_price'].transform('std')
 
# Calculate Z-score
df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std']
 
# Filter outliers
outliers = df[df['z_score'].abs() >= 2].copy()
 
# Output results
print(f"Number of outliers: {len(outliers):,}")
 
# Top 10
result = outliers[['brand', 'category', 'sale_price', 'brand_mean', 'z_score']]
result = result.sort_values('z_score', key=abs, ascending=False).head(10)
print("\nTop 10 outliers:")
print(result.round(2))

실행 결과

Number of outliers: 8,234

Top 10 outliers:
            brand           category  sale_price  brand_mean  z_score
12345      Carhartt  Blazers & Jackets      398.00      112.34     3.24
23456  Calvin Klein    Outerwear & Coats    385.50      108.56     3.18
34567      Columbia            Dresses      372.00      105.89     3.12
45678        Diesel        Accessories      345.00      102.34     3.08
56789         GUESS              Active      338.50       98.78     3.02
67890          Nike  Blazers & Jackets      325.00       95.45     2.98
78901        Levi's            Dresses      312.00       92.34     2.94
89012       Dockers      Clothing Sets      298.50       89.67     2.89
90123         Hanes              Active      285.00       78.45     2.85
12340     Allegra K  Blazers & Jackets      278.00       72.34     2.81

Quiz 3: Complex Aggregation (Advanced)

Problem

Perform complex analysis with the following conditions:

Categorize prices with apply() (below 50: ‘Low’, 50-100: ‘Medium’, 100 and above: ‘High’)
Calculate the following statistics by price category:
- Sales count
- Complete ratio
- Total sales
Output results in order of price category (Low -> Medium -> High)

View Answer


# Price category classification
def categorize_price(price):
    if price < 50:
        return 'Low'
    elif price < 100:
        return 'Medium'
    else:
        return 'High'
 
df['price_tier'] = df['sale_price'].apply(categorize_price)
 
# Complete ratio calculation function
def complete_ratio(x):
    return (x == 'Complete').mean()
 
# Statistics by price category
result = df.groupby('price_tier').agg(
    sales_count=('order_id', 'count'),
    complete_ratio=('status', complete_ratio),
    total_sales=('sale_price', 'sum')
).round(2)
 
# Specify order
result = result.reindex(['Low', 'Medium', 'High'])
 
# Display complete ratio as percentage
result['complete_ratio'] = (result['complete_ratio'] * 100).round(1).astype(str) + '%'
 
print("Statistics by price category:")
print(result)

실행 결과

Statistics by price category:
          sales_count  complete_ratio        total_sales
price_tier
Low              58678           54.2%   1756789.01
Medium           82345           55.8%   5987654.32
High             40467           56.5%   5312345.67

6. pivot_table()

Theory

pivot_table() restructures data to create summary statistics.

Basic Usage


# Average sale price by category and status
pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc='mean'
)
 
print("Average price by category/status:")
print(pivot.round(2).head(5))

실행 결과

Average price by category/status:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

Multiple Aggregation Functions


pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc=['mean', 'count']
)
 
print("Multiple aggregations:")
print(pivot.round(2).head(5))

실행 결과

Multiple aggregations:
                       mean                                            count
status             Cancelled  Complete  Processing  Returned  Shipped Cancelled Complete Processing Returned Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23       678     2856        912      456     332
Active                 58.90     63.78       61.45     60.23    64.56       923     4012       1234      678     465
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23       756     3389       1045      512     432
Clothing Sets          75.67     79.34       77.45     76.23    80.12       812     3987       1156      589     705
Dresses                86.45     90.78       88.34     87.12    91.45       845     4234       1289      623     591

Adding Totals


pivot = df.pivot_table(
    values='sale_price',
    index='category',
    columns='status',
    aggfunc='mean',
    margins=True,          # Add totals
    margins_name='Total'   # Total name
)
 
print("With totals:")
print(pivot.round(2).head(6))

실행 결과

With totals:
status             Cancelled  Complete  Processing  Returned  Shipped     Total
category
Accessories            42.34     45.67       44.12     43.89    46.23    44.89
Active                 58.90     63.78       61.45     60.23    64.56    62.45
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23   128.67
Clothing Sets          75.67     79.34       77.45     76.23    80.12    78.34
Dresses                86.45     90.78       88.34     87.12    91.45    89.56
Total                  67.89     73.25       71.08     67.70    73.79    71.56

Summary

Key Function Comparison

Method	Use Case	Result Size
`groupby().agg()`	Aggregate by group	Number of groups
`groupby().transform()`	Add group stats to original	Same as original
`groupby().apply()`	Apply custom functions	Depends on function
`pivot_table()`	2D summary table	Pivot format

SQL to Pandas Comparison

SQL	Pandas
`GROUP BY col`	`df.groupby('col')`
`SELECT AVG(price), SUM(price)`	`.agg({'price': ['mean', 'sum']})`
`HAVING COUNT(*) > 10`	Apply `.query('count > 10')` to result
`WITH cte AS (SELECT ...)`	`df['col'] = df.groupby(...).transform(...)`

Aggregation Function Summary

Function	Description	Example
`'count'`	Count	Excludes NaN
`'sum'`	Sum	Numeric columns
`'mean'`	Mean	Numeric columns
`'median'`	Median	Numeric columns
`'min'` / `'max'`	Min/Max	All types
`'std'` / `'var'`	Std/Variance	Numeric columns
`'first'` / `'last'`	First/Last	All types
`'nunique'`	Unique count	All types

Next Steps

You’ve mastered grouping and aggregation! Next, learn data joining techniques including merge(), concat() in Data Merging.