Skip to Content
ConceptsSQLGrouping and Aggregation

Category Performance Analysis with GROUP BY

Beginner Intermediate

Learning Objectives

  • Core concepts of GROUP BY and using aggregate functions
  • Single and multi-column grouping techniques
  • Filtering aggregated results with HAVING clause
  • Unique value analysis with COUNT DISTINCT
  • Deriving integrated metrics by combining aggregate functions

BigQuery Execution Environment Setup

from google.cloud import bigquery import pandas as pd # Authentication setup (when using service account key) # import os # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json" # Create BigQuery client client = bigquery.Client(project='your-project-id') # Query execution function def run_query(query): return client.query(query).to_dataframe()
ℹ️
Project ID Setup

Replace your-project-id with your GCP project ID in the code above.


Example 1: Basic GROUP BY - Category Aggregation

Theory

GROUP BY groups rows with the same values together for aggregation.

Basic Syntax

SELECT column, AGG_FUNCTION(column) FROM table GROUP BY column

Key Aggregate Functions

FunctionDescriptionExample
COUNT(*)Row countOrder count
COUNT(DISTINCT col)Unique value countUnique customer count
SUM(col)SumTotal revenue
AVG(col)AverageAverage order amount
MIN(col) / MAX(col)Min/Max valueLowest/highest price

Example Code

Execution Result
categoryorder_countitem_counttotal_revenueavg_price
Outerwear & Coats8,8649,0771,311,691144.51
Jeans11,98612,3651,212,71298.08
Sweaters10,74911,073837,37675.62
Swim10,90511,224646,69457.62
Suits & Sport Coats4,9725,095646,693126.93
ℹ️
COUNT(*) vs COUNT(column)
  • COUNT(*): Counts all rows including NULL
  • COUNT(column): Counts rows excluding NULL
  • COUNT(DISTINCT column): Counts only unique values

Quiz 1: Calculate Revenue and Profit Margin by Brand

Problem

JOIN order_items and products to:

  1. Total revenue by brand (sum of sale_price)
  2. Total cost by brand (sum of cost)
  3. Profit = Revenue - Cost
  4. Profit margin(%) = (Profit / Revenue) * 100
  5. Query only top 10 brands by revenue

Hint: SUM(), GROUP BY brand, ORDER BY DESC, LIMIT

View Answer

SELECT p.brand, SUM(oi.sale_price) AS total_revenue, SUM(p.cost) AS total_cost, SUM(oi.sale_price) - SUM(p.cost) AS profit, ROUND((SUM(oi.sale_price) - SUM(p.cost)) * 100.0 / SUM(oi.sale_price), 2) AS profit_margin FROM `your-project-id.thelook_ecommerce.order_items` oi JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.brand ORDER BY total_revenue DESC LIMIT 10
Execution Result
brandtotal_revenuetotal_costprofitprofit_margin
Calvin Klein207,13197,083110,04953.13
Diesel202,497101,074101,42350.09
Carhartt177,72183,01094,71153.29
7 For All Mankind176,22591,90884,31747.85
True Religion175,55291,98683,56747.60

Explanation:

  • SUM(oi.sale_price) - SUM(p.cost): Calculate total profit
  • Profit margin = (Profit / Revenue) * 100
  • ROUND(..., 2): Round to 2 decimal places

Business Insights:

  • High margin brands: Maintain premium positioning
  • Low margin brands: Need volume strategy or cost reduction
  • View revenue and margin together for optimal portfolio composition

Example 2: Multi-Column GROUP BY

Theory

Grouping by multiple columns enables more granular analysis.

Syntax

GROUP BY column1, column2, column3

Use Cases

  • Category + brand analysis
  • Year + month trends
  • Region + gender segmentation

Example Code

Execution Result
categorybranditem_counttotal_revenueavg_price
AccessoriesRay-Ban67179,911119.09
AccessoriesOakley38852,646135.69
AccessoriesTom Ford4611,043240.07
AccessoriesKate Spade698,978130.11

Quiz 2: Revenue Aggregation by Department and Category

Problem

JOIN order_items and products to:

  1. Group by department and category
  2. Calculate items sold, total revenue, average price
  3. Query only top 20 combinations by revenue

Hint: GROUP BY department, category

View Answer

SELECT p.department, p.category, COUNT(*) AS item_count, SUM(oi.sale_price) AS total_revenue, ROUND(AVG(oi.sale_price), 2) AS avg_price FROM `your-project-id.thelook_ecommerce.order_items` oi JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.department, p.category ORDER BY total_revenue DESC LIMIT 20

Explanation:

  • GROUP BY department, category: Simultaneous grouping by two dimensions
  • Understand revenue status by department-category combination

Business Applications:

  • Identify strong categories by department
  • Discover cross-selling opportunities
  • Develop inventory allocation strategies

Example 3: HAVING Clause - Filtering Aggregated Results

Theory

WHERE filters before grouping, HAVING filters after grouping.

Execution Order

SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY ↑ ↑ Pre-grouping filter Post-grouping filter

WHERE vs HAVING

AspectWHEREHAVING
When appliedBefore GROUP BYAfter GROUP BY
TargetIndividual rowsGroups
Aggregate functionsCannot useCan use

Example Code

Execution Result
categoryorder_counttotal_revenueavg_price
Outerwear & Coats8,8641,311,691144.51
Jeans11,9861,212,71298.08
Sweaters10,749837,37675.62
… (22 categories)

Quiz 3: Output Only Top 10 Brands by Revenue

Problem

JOIN order_items and products to:

  1. Calculate total revenue, order count, items sold by brand
  2. Filter with HAVING for brands with 100+ orders
  3. Sort by revenue descending
  4. Query only top 10 brands

Hint: HAVING COUNT(DISTINCT ...) >= 100

View Answer

SELECT p.brand, COUNT(DISTINCT oi.order_id) AS order_count, COUNT(*) AS item_count, SUM(oi.sale_price) AS total_revenue FROM `your-project-id.thelook_ecommerce.order_items` oi JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.brand HAVING COUNT(DISTINCT oi.order_id) >= 100 ORDER BY total_revenue DESC LIMIT 10
Execution Result
brandorder_countitem_counttotal_revenue
Calvin Klein3,2153,244207,131
Diesel1,4681,481202,497
Carhartt2,5462,562177,721
7 For All Mankind1,1181,120176,225
True Religion881882175,552

Explanation:

  • HAVING COUNT(DISTINCT oi.order_id) >= 100: Filter by order count after grouping
  • WHERE applies before aggregation, HAVING applies after
  • LIMIT restricts final result count

Practical Applications:

  • Analyze only brands above a certain scale (noise removal)
  • Extract segments above KPI thresholds
  • Focus analysis on top N items

Example 4: COUNT DISTINCT - Unique Value Aggregation

Theory

Calculate the count of unique values with duplicates removed.

Syntax

COUNT(DISTINCT column)

Use Cases

  • Unique customers
  • Unique products
  • Unique transaction dates

Example Code

Execution Result
categoryunique_productsunique_brandstotal_items_soldtotal_revenue
Outerwear & Coats1,4162869,0771,311,691
Jeans1,99822212,3651,212,712
Sweaters1,73334711,073837,376

Quiz 4: Unique Customer Count by Category

Problem

JOIN three tables to:

  1. JOIN orders, order_items, products
  2. Calculate unique customer count by category
  3. Calculate average purchase amount per customer
  4. Also query total order count
  5. Query only top 10 categories by unique customer count

Hint: COUNT(DISTINCT o.user_id), SUM(sale_price) / COUNT(DISTINCT user_id)

View Answer

SELECT p.category, COUNT(DISTINCT o.user_id) AS unique_customers, COUNT(DISTINCT o.order_id) AS order_count, SUM(oi.sale_price) AS total_revenue, ROUND(SUM(oi.sale_price) / COUNT(DISTINCT o.user_id), 2) AS avg_revenue_per_customer FROM `your-project-id.thelook_ecommerce.orders` o JOIN `your-project-id.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id JOIN `your-project-id.thelook_ecommerce.products` p ON oi.product_id = p.id GROUP BY p.category ORDER BY unique_customers DESC LIMIT 10
Execution Result
categoryunique_customersorder_counttotal_revenueavg_revenue_per_customer
Jeans11,36011,9861,212,712106.75
Intimates11,26412,435449,36339.89
Tops & Tees10,91111,486488,32444.76
Fashion Hoodies & Sweatshirts10,76711,300632,60458.75

Explanation:

  • COUNT(DISTINCT o.user_id): Unique customer count with duplicates removed
  • SUM(sale_price) / COUNT(DISTINCT user_id): Average revenue per customer (ARPU)
  • Derive integrated metrics by JOINing three tables

Business Insights:

  • High customer count + low avg revenue: Mass market category, volume strategy
  • Low customer count + high avg revenue: Premium category, needs customer expansion
  • Ideal: Categories with both high customer count and high avg revenue

Example 5: Combining Aggregate Functions - Comprehensive Analysis

Theory

Combine multiple aggregate functions to derive multi-dimensional insights.

Key Aggregate Functions

FunctionUse
SUM()Sum
AVG()Average
MIN() / MAX()Min/Max
COUNT()Count
COUNT(DISTINCT)Unique count
STDDEV()Standard deviation
APPROX_QUANTILES()Percentiles (BigQuery)

Example Code

Execution Result
categorytotal_itemstotal_revenueavg_pricemin_pricemax_priceprice_rangeprice_stddev
Outerwear & Coats9,0771,311,691144.5110.51999.00988.49136.41
Jeans12,3651,212,71298.089.99389.00379.0164.21
Sweaters11,073837,37675.625.99588.00582.0159.62

Quiz 5: Price Range and Average Analysis by Category

Problem

From the products table:

  1. Calculate statistics based on retail_price by category
    • Average, min, max, median (50th percentile), standard deviation
  2. Filter only categories with 10+ products
  3. Sort by average price descending
  4. Query top 10

Hint: APPROX_QUANTILES(retail_price, 100)[OFFSET(50)] AS median

View Answer

SELECT category, COUNT(*) AS product_count, ROUND(AVG(retail_price), 2) AS avg_price, MIN(retail_price) AS min_price, MAX(retail_price) AS max_price, APPROX_QUANTILES(retail_price, 100)[OFFSET(50)] AS median_price, ROUND(STDDEV(retail_price), 2) AS price_stddev FROM `your-project-id.thelook_ecommerce.products` GROUP BY category HAVING COUNT(*) >= 10 ORDER BY avg_price DESC LIMIT 10
Execution Result
categoryproduct_countavg_pricemin_pricemax_pricemedian_priceprice_stddev
Outerwear & Coats1,420146.0210.51999.00109.95138.25
Suits & Sport Coats739126.569.99698.0099.5098.77
Suits188116.1613.99239.72122.6046.29
Jeans1,99997.859.99389.0078.0063.85

Explanation:

  • APPROX_QUANTILES(retail_price, 100)[OFFSET(50)]: Calculate median
  • STDDEV(): Measure price volatility
  • HAVING COUNT(*) >= 10: Ensure statistical significance

Business Insights:

  • High standard deviation: High price diversity (mix of low and high-priced products)
  • Low standard deviation: Consistent price range (clear target customer segment)
  • Average > Median: High-priced products pulling up the average
  • Average < Median: Many low-priced products

Summary

GROUP BY Techniques Covered

TechniqueUseExample
GROUP BY columnSingle column aggregationRevenue by category
GROUP BY col1, col2Multi-column aggregationRevenue by category+brand
HAVINGFilter aggregated resultsOnly $100K+ revenue
COUNT(DISTINCT)Unique value countUnique customer count
MIN/MAX/AVG/SUMBasic aggregate functionsPrice statistics

WHERE vs HAVING

-- WHERE: Pre-grouping filter (individual rows) WHERE created_at >= '2024-01-01' -- HAVING: Post-grouping filter (aggregated results) HAVING SUM(revenue) > 100000

Practical Use Scenarios

  1. Category Performance Analysis: GROUP BY category
  2. Brand Portfolio Optimization: GROUP BY brand + HAVING
  3. Customer Segmentation: COUNT(DISTINCT user_id)
  4. Pricing Strategy: MIN/MAX/AVG/STDDEV
  5. Cross Analysis: Multi-column GROUP BY
💡
Next Steps
  • Ranking and cumulative aggregation with Window Functions
  • Build complex analysis pipelines by combining with CTE
  • Multi-dimensional aggregation with ROLLUP/CUBE
Last updated on

🤖AI 모의면접실전처럼 연습하기