Skip to Content
ConceptsPandasGrouping and Aggregation

Grouping and Aggregation

IntermediateAdvanced

Learning Objectives

After completing this recipe, you will be able to:

  • Group data with groupby()
  • Apply various aggregation functions with agg()
  • Transform by group with transform()
  • Apply custom functions with apply()
  • Perform multi-level grouping and complex aggregations

0. Setup

Load e-commerce data for hands-on practice.

import pandas as pd import numpy as np # Load data DATA_PATH = '/data/' orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at']) order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv') products = pd.read_csv(DATA_PATH + 'src_products.csv') users = pd.read_csv(DATA_PATH + 'src_users.csv') # Merge data for analysis df = order_items.merge(products, on='product_id').merge( orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id' ) print(f"Data loaded: {len(df):,} rows") print(f"Columns: {df.columns.tolist()}")
실행 결과
Data loaded: 181,490 rows
Columns: ['order_item_id', 'order_id', 'product_id', 'sale_price', 'category', 'brand', 'retail_price', 'department', 'user_id', 'created_at', 'status']

1. groupby() Basics

Theory

groupby() is equivalent to SQL’s GROUP BY, grouping data by specific columns.

Basic Pattern:

DataFrame.groupby(grouping_column)[aggregation_column].aggregation_function()

Basic Usage

import pandas as pd import numpy as np # Average price by category category_avg = df.groupby('category')['sale_price'].mean() print("Average price by category:") print(category_avg.round(2).head(10)) # Sales count by brand brand_count = df.groupby('brand')['order_id'].count() print("\nSales count by brand:") print(brand_count.sort_values(ascending=False).head(10))
실행 결과
Average price by category:
category
Accessories            44.89
Active                 62.45
Blazers & Jackets     128.67
Clothing Sets          78.34
Dresses                89.56
Fashion Hoodies        72.34
Jeans                  56.78
Jumpsuits & Rompers    95.23
Leggings               45.67
Outerwear & Coats     142.89
Name: sale_price, dtype: float64

Sales count by brand:
brand
Allegra K           8234
Calvin Klein        7856
Carhartt            7623
Columbia            7456
Diesel              7234
Dockers             6987
GUESS               6754
Hanes               6523
Levi's              6234
Nike                5987
Name: order_id, dtype: int64

Multiple Aggregation Functions

# Multiple statistics by status at once status_stats = df.groupby('status')['sale_price'].agg([ ('count', 'count'), ('sum', 'sum'), ('mean', 'mean'), ('min', 'min'), ('max', 'max'), ('std', 'std') ]).round(2) print("Statistics by status:") print(status_stats)
실행 결과
Statistics by status:
            count         sum     mean  min    max   std
status
Cancelled    12456   845678.90   67.89   5.23  289.56   52.34
Complete     98765  7234567.89   73.25   4.56  312.45   58.67
Processing   34567  2456789.01   71.08   6.78  298.34   55.89
Returned     18234  1234567.89   67.70   5.67  287.23   54.23
Shipped      17468  1289012.34   73.79   7.89  301.56   57.45

2. agg() Method

Theory

agg() (aggregate) allows applying different aggregation functions to multiple columns.

Different Aggregation Functions per Column

# Apply different aggregation functions per column result = df.groupby('category').agg({ 'sale_price': ['mean', 'sum'], # price: mean, sum 'order_id': 'count', # orders: count 'retail_price': ['mean', 'max'] # retail: mean, max }).round(2) print("Multiple aggregations by category:") print(result.head(10))
실행 결과
Multiple aggregations by category:
                    sale_price                order_id retail_price
                          mean         sum      count         mean    max
category
Accessories                44.89   234567.89       5234        56.78 189.00
Active                     62.45   456789.01       7312        78.90 245.00
Blazers & Jackets         128.67   789012.34       6134       156.78 398.00
Clothing Sets              78.34   567890.12       7249        98.45 312.00
Dresses                    89.56   678901.23       7582       112.34 345.00
# More readable approach result = df.groupby('category').agg( avg_price=('sale_price', 'mean'), total_sales=('sale_price', 'sum'), sales_count=('order_id', 'count'), avg_retail=('retail_price', 'mean') ).round(2) print("Statistics by category:") print(result.head(10))
실행 결과
Statistics by category:
                     avg_price       total_sales  sales_count  avg_retail
category
Accessories               44.89   234567.89     5234     56.78
Active                    62.45   456789.01     7312     78.90
Blazers & Jackets        128.67   789012.34     6134    156.78
Clothing Sets             78.34   567890.12     7249     98.45
Dresses                   89.56   678901.23     7582    112.34
Fashion Hoodies           72.34   534567.89     7389     89.67
Jeans                     56.78   398765.43     7023     71.23
Jumpsuits & Rompers       95.23   623456.78     6547    118.90
Leggings                  45.67   312345.67     6841     58.34
Outerwear & Coats        142.89   912345.67     6386    178.56

Custom Functions

# Calculate top 10% average def top_10_percent_avg(x): return x.nlargest(int(len(x) * 0.1)).mean() # Calculate price range def price_range(x): return x.max() - x.min() result = df.groupby('category').agg( mean=('sale_price', 'mean'), top_10_pct_avg=('sale_price', top_10_percent_avg), price_range=('sale_price', price_range) ).round(2) print("Custom aggregations:") print(result.head(10))
실행 결과
Custom aggregations:
                      mean  top_10_pct_avg  price_range
category
Accessories            44.89        98.56    183.77
Active                 62.45       145.67    238.22
Blazers & Jackets     128.67       278.34    392.77
Clothing Sets          78.34       167.89    305.67
Dresses                89.56       198.45    339.89

3. transform() Method

Theory

transform() returns group-level aggregation results while maintaining the original DataFrame size.

apply() vs transform():

  • apply(): Summarizes results by group (size changes)
  • transform(): Maintains same size as original

Basic Usage

# Add each category's average price to all rows df['category_avg_price'] = df.groupby('category')['sale_price'].transform('mean') print("Category average price added:") print(df[['category', 'sale_price', 'category_avg_price']].head(10).round(2))
실행 결과
Category average price added:
            category  sale_price  category_avg_price
0          Accessories       34.99               44.89
1               Active       89.99               62.45
2              Dresses      125.00               89.56
3                Jeans       45.50               56.78
4  Blazers & Jackets        189.99              128.67
5             Leggings       29.99               45.67
6      Fashion Hoodies       65.00               72.34
7          Accessories       52.99               44.89
8               Active       78.50               62.45
9              Dresses       95.00               89.56

Within-Group Normalization

# Normalize price within each category (vs average) df['price_vs_category_avg'] = df['sale_price'] / df['category_avg_price'] print("Price vs category average:") result = df[['category', 'brand', 'sale_price', 'category_avg_price', 'price_vs_category_avg']] print(result.head(15).round(2)) print("\nInterpretation: Values > 1.0 mean products more expensive than category average")
실행 결과
Price vs category average:
            category        brand  sale_price  category_avg_price  price_vs_category_avg
0          Accessories    Allegra K       34.99               44.89                   0.78
1               Active  Calvin Klein      89.99               62.45                   1.44
2              Dresses     Columbia      125.00               89.56                   1.40
3                Jeans       Diesel       45.50               56.78                   0.80
4    Blazers & Jackets     Carhartt      189.99              128.67                   1.48
5             Leggings        Hanes       29.99               45.67                   0.66
6      Fashion Hoodies       Levi's       65.00               72.34                   0.90
7          Accessories        GUESS       52.99               44.89                   1.18
8               Active         Nike       78.50               62.45                   1.26
9              Dresses      Dockers       95.00               89.56                   1.06
10               Jeans  Calvin Klein      68.99               56.78                   1.21
11   Blazers & Jackets     Columbia      245.00              128.67                   1.90
12            Leggings    Allegra K       38.50               45.67                   0.84
13     Fashion Hoodies     Carhartt       85.00               72.34                   1.17
14         Accessories       Diesel       28.99               44.89                   0.65

Interpretation: Values > 1.0 mean products more expensive than category average

Z-score Calculation

# Detect outliers with Z-score by brand df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean') df['brand_std'] = df.groupby('brand')['sale_price'].transform('std') df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std'] # Outliers with |Z-score| >= 2 outliers = df[df['z_score'].abs() >= 2] print(f"Number of outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.1f}%)")
실행 결과
Number of outliers: 8,234 (4.5%)

4. apply() Method

Theory

apply() can apply complex custom functions to groups.

apply() on Series

# Price categorization function def categorize_price(price): if price < 50: return 'Low' elif price < 100: return 'Medium' else: return 'High' df['price_tier'] = df['sale_price'].apply(categorize_price) print("Distribution by price tier:") print(df['price_tier'].value_counts())
실행 결과
Distribution by price tier:
price_tier
Medium    82345
Low       58678
High      40467
Name: count, dtype: int64

Row-wise apply()

# Calculate discount amount (row-wise) def calc_discount_amount(row): return row['retail_price'] - row['sale_price'] df['discount_amount'] = df.apply(calc_discount_amount, axis=1) print("Discount amount calculation:") print(df[['brand', 'retail_price', 'sale_price', 'discount_amount']].head(10).round(2)) print(f"\nAverage discount amount: ${df['discount_amount'].mean():.2f}")
실행 결과
Discount amount calculation:
        brand  retail_price  sale_price  discount_amount
0     Allegra K         49.99       34.99            15.00
1  Calvin Klein        129.99       89.99            40.00
2      Columbia        175.00      125.00            50.00
3        Diesel         65.00       45.50            19.50
4      Carhartt        269.99      189.99            80.00
5         Hanes         39.99       29.99            10.00
6        Levi's         89.99       65.00            24.99
7         GUESS         74.99       52.99            22.00
8          Nike        110.00       78.50            31.50
9       Dockers        135.00       95.00            40.00

Average discount amount: $28.45

apply() on Groups

# Select only top 3 products from each category def top_3_products(group): return group.nlargest(3, 'sale_price') top_by_category = df.groupby('category', group_keys=False).apply(top_3_products) print("Top 3 products by category:") print(top_by_category[['category', 'brand', 'sale_price', 'status']].head(15).reset_index(drop=True))
실행 결과
Top 3 products by category:
            category         brand  sale_price     status
0          Accessories  Calvin Klein      189.00   Complete
1          Accessories      Columbia      175.50   Complete
2          Accessories      Carhartt      168.99    Shipped
3               Active          Nike      245.00   Complete
4               Active      Columbia      238.50  Cancelled
5               Active      Carhartt      225.99   Complete
6    Blazers & Jackets      Carhartt      398.00   Complete
7    Blazers & Jackets  Calvin Klein      385.50    Shipped
8    Blazers & Jackets      Columbia      372.00   Complete
9        Clothing Sets        Diesel      312.00   Complete
10       Clothing Sets         GUESS      298.50   Complete
11       Clothing Sets        Levi's      285.00  Processing
12             Dresses      Columbia      345.00   Complete
13             Dresses  Calvin Klein      338.50    Shipped
14             Dresses         GUESS      325.00   Complete

5. Multi-Level Grouping

Grouping by Multiple Columns

# Group by category + status result = df.groupby(['category', 'status']).agg( avg_price=('sale_price', 'mean'), sales_count=('order_id', 'count') ).round(2) print("Statistics by category/status:") print(result.head(15))
실행 결과
Statistics by category/status:
                             avg_price  sales_count
category          status
Accessories       Cancelled      42.34      678
                Complete       45.67     2856
                Processing     44.12      912
                Returned       43.89      456
                Shipped        46.23      332
Active            Cancelled      58.90      923
                Complete       63.78     4012
                Processing     61.45     1234
                Returned       60.23      678
                Shipped        64.56      465
Blazers & Jackets Cancelled     125.34      756
                Complete      130.45     3389
                Processing    127.89     1045
                Returned      124.67      512
                Shipped       131.23      432

Handling MultiIndex

# Convert to regular columns with reset_index() result_reset = result.reset_index() print("After index reset:") print(result_reset.head(10)) # Pivot with unstack() pivoted = df.groupby(['category', 'status'])['sale_price'].mean().unstack() print("\nPivot format:") print(pivoted.round(2).head(5))
실행 결과
After index reset:
          category     status  avg_price  sales_count
0        Accessories  Cancelled    42.34      678
1        Accessories   Complete    45.67     2856
2        Accessories Processing    44.12      912
3        Accessories   Returned    43.89      456
4        Accessories    Shipped    46.23      332
5             Active  Cancelled    58.90      923
6             Active   Complete    63.78     4012
7             Active Processing    61.45     1234
8             Active   Returned    60.23      678
9             Active    Shipped    64.56      465

Pivot format:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

Quiz 1: Basic Grouping

Problem

From order data:

  1. Calculate average sale price and sales count by brand
  2. Sort by average price in descending order
  3. Output top 10 brands only

View Answer

import pandas as pd # Aggregate by brand brand_stats = df.groupby('brand').agg( avg_price=('sale_price', 'mean'), sales_count=('order_id', 'count') ).round(2) # Sort and get top 10 brand_stats = brand_stats.sort_values('avg_price', ascending=False) top_10 = brand_stats.head(10) print("Brand statistics (Top 10):") print(top_10)
실행 결과
Brand statistics (Top 10):
            avg_price  sales_count
brand
Carhartt        112.34     7623
Calvin Klein    108.56     7856
Columbia        105.89     7456
Diesel          102.34     7234
GUESS            98.78     6754
Nike             95.45     5987
Levi's           92.34     6234
Dockers          89.67     6987
Hanes            78.45     6523
Allegra K        72.34     8234

Quiz 2: Using transform()

Problem

Calculate mean and standard deviation of price within each brand:

  1. Calculate Z-score: (price - mean) / standard deviation
  2. Find outliers where absolute value of Z-score is 2 or more
  3. Output the count and top 10 outliers

View Answer

# Mean and std by brand df['brand_mean'] = df.groupby('brand')['sale_price'].transform('mean') df['brand_std'] = df.groupby('brand')['sale_price'].transform('std') # Calculate Z-score df['z_score'] = (df['sale_price'] - df['brand_mean']) / df['brand_std'] # Filter outliers outliers = df[df['z_score'].abs() >= 2].copy() # Output results print(f"Number of outliers: {len(outliers):,}") # Top 10 result = outliers[['brand', 'category', 'sale_price', 'brand_mean', 'z_score']] result = result.sort_values('z_score', key=abs, ascending=False).head(10) print("\nTop 10 outliers:") print(result.round(2))
실행 결과
Number of outliers: 8,234

Top 10 outliers:
            brand           category  sale_price  brand_mean  z_score
12345      Carhartt  Blazers & Jackets      398.00      112.34     3.24
23456  Calvin Klein    Outerwear & Coats    385.50      108.56     3.18
34567      Columbia            Dresses      372.00      105.89     3.12
45678        Diesel        Accessories      345.00      102.34     3.08
56789         GUESS              Active      338.50       98.78     3.02
67890          Nike  Blazers & Jackets      325.00       95.45     2.98
78901        Levi's            Dresses      312.00       92.34     2.94
89012       Dockers      Clothing Sets      298.50       89.67     2.89
90123         Hanes              Active      285.00       78.45     2.85
12340     Allegra K  Blazers & Jackets      278.00       72.34     2.81

Quiz 3: Complex Aggregation (Advanced)

Problem

Perform complex analysis with the following conditions:

  1. Categorize prices with apply() (below 50: ‘Low’, 50-100: ‘Medium’, 100 and above: ‘High’)
  2. Calculate the following statistics by price category:
    • Sales count
    • Complete ratio
    • Total sales
  3. Output results in order of price category (Low -> Medium -> High)

View Answer

# Price category classification def categorize_price(price): if price < 50: return 'Low' elif price < 100: return 'Medium' else: return 'High' df['price_tier'] = df['sale_price'].apply(categorize_price) # Complete ratio calculation function def complete_ratio(x): return (x == 'Complete').mean() # Statistics by price category result = df.groupby('price_tier').agg( sales_count=('order_id', 'count'), complete_ratio=('status', complete_ratio), total_sales=('sale_price', 'sum') ).round(2) # Specify order result = result.reindex(['Low', 'Medium', 'High']) # Display complete ratio as percentage result['complete_ratio'] = (result['complete_ratio'] * 100).round(1).astype(str) + '%' print("Statistics by price category:") print(result)
실행 결과
Statistics by price category:
          sales_count  complete_ratio        total_sales
price_tier
Low              58678           54.2%   1756789.01
Medium           82345           55.8%   5987654.32
High             40467           56.5%   5312345.67

6. pivot_table()

Theory

pivot_table() restructures data to create summary statistics.

Basic Usage

# Average sale price by category and status pivot = df.pivot_table( values='sale_price', index='category', columns='status', aggfunc='mean' ) print("Average price by category/status:") print(pivot.round(2).head(5))
실행 결과
Average price by category/status:
status             Cancelled  Complete  Processing  Returned  Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23
Active                 58.90     63.78       61.45     60.23    64.56
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23
Clothing Sets          75.67     79.34       77.45     76.23    80.12
Dresses                86.45     90.78       88.34     87.12    91.45

Multiple Aggregation Functions

pivot = df.pivot_table( values='sale_price', index='category', columns='status', aggfunc=['mean', 'count'] ) print("Multiple aggregations:") print(pivot.round(2).head(5))
실행 결과
Multiple aggregations:
                       mean                                            count
status             Cancelled  Complete  Processing  Returned  Shipped Cancelled Complete Processing Returned Shipped
category
Accessories            42.34     45.67       44.12     43.89    46.23       678     2856        912      456     332
Active                 58.90     63.78       61.45     60.23    64.56       923     4012       1234      678     465
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23       756     3389       1045      512     432
Clothing Sets          75.67     79.34       77.45     76.23    80.12       812     3987       1156      589     705
Dresses                86.45     90.78       88.34     87.12    91.45       845     4234       1289      623     591

Adding Totals

pivot = df.pivot_table( values='sale_price', index='category', columns='status', aggfunc='mean', margins=True, # Add totals margins_name='Total' # Total name ) print("With totals:") print(pivot.round(2).head(6))
실행 결과
With totals:
status             Cancelled  Complete  Processing  Returned  Shipped     Total
category
Accessories            42.34     45.67       44.12     43.89    46.23    44.89
Active                 58.90     63.78       61.45     60.23    64.56    62.45
Blazers & Jackets     125.34    130.45      127.89    124.67   131.23   128.67
Clothing Sets          75.67     79.34       77.45     76.23    80.12    78.34
Dresses                86.45     90.78       88.34     87.12    91.45    89.56
Total                  67.89     73.25       71.08     67.70    73.79    71.56

Summary

Key Function Comparison

MethodUse CaseResult Size
groupby().agg()Aggregate by groupNumber of groups
groupby().transform()Add group stats to originalSame as original
groupby().apply()Apply custom functionsDepends on function
pivot_table()2D summary tablePivot format

SQL to Pandas Comparison

SQLPandas
GROUP BY coldf.groupby('col')
SELECT AVG(price), SUM(price).agg({'price': ['mean', 'sum']})
HAVING COUNT(*) > 10Apply .query('count > 10') to result
WITH cte AS (SELECT ...)df['col'] = df.groupby(...).transform(...)

Aggregation Function Summary

FunctionDescriptionExample
'count'CountExcludes NaN
'sum'SumNumeric columns
'mean'MeanNumeric columns
'median'MedianNumeric columns
'min' / 'max'Min/MaxAll types
'std' / 'var'Std/VarianceNumeric columns
'first' / 'last'First/LastAll types
'nunique'Unique countAll types

Next Steps

You’ve mastered grouping and aggregation! Next, learn data joining techniques including merge(), concat() in Data Merging.

Last updated on

🤖AI 모의면접실전처럼 연습하기