Category Performance Analysis with GROUP BY
Beginner IntermediateLearning Objectives
- Core concepts of GROUP BY and using aggregate functions
- Single and multi-column grouping techniques
- Filtering aggregated results with HAVING clause
- Unique value analysis with COUNT DISTINCT
- Deriving integrated metrics by combining aggregate functions
BigQuery Execution Environment Setup
from google.cloud import bigquery
import pandas as pd
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
# Query execution function
def run_query(query):
return client.query(query).to_dataframe()Replace your-project-id with your GCP project ID in the code above.
Example 1: Basic GROUP BY - Category Aggregation
Theory
GROUP BY groups rows with the same values together for aggregation.
Basic Syntax
SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY columnKey Aggregate Functions
| Function | Description | Example |
|---|---|---|
COUNT(*) | Row count | Order count |
COUNT(DISTINCT col) | Unique value count | Unique customer count |
SUM(col) | Sum | Total revenue |
AVG(col) | Average | Average order amount |
MIN(col) / MAX(col) | Min/Max value | Lowest/highest price |
Example Code
| category | order_count | item_count | total_revenue | avg_price |
|---|---|---|---|---|
| Outerwear & Coats | 8,864 | 9,077 | 1,311,691 | 144.51 |
| Jeans | 11,986 | 12,365 | 1,212,712 | 98.08 |
| Sweaters | 10,749 | 11,073 | 837,376 | 75.62 |
| Swim | 10,905 | 11,224 | 646,694 | 57.62 |
| Suits & Sport Coats | 4,972 | 5,095 | 646,693 | 126.93 |
| … | … | … | … | … |
COUNT(*): Counts all rows including NULLCOUNT(column): Counts rows excluding NULLCOUNT(DISTINCT column): Counts only unique values
Quiz 1: Calculate Revenue and Profit Margin by Brand
Problem
JOIN order_items and products to:
- Total revenue by brand (sum of sale_price)
- Total cost by brand (sum of cost)
- Profit = Revenue - Cost
- Profit margin(%) = (Profit / Revenue) * 100
- Query only top 10 brands by revenue
Hint: SUM(), GROUP BY brand, ORDER BY DESC, LIMIT
View Answer
SELECT
p.brand,
SUM(oi.sale_price) AS total_revenue,
SUM(p.cost) AS total_cost,
SUM(oi.sale_price) - SUM(p.cost) AS profit,
ROUND((SUM(oi.sale_price) - SUM(p.cost)) * 100.0 / SUM(oi.sale_price), 2) AS profit_margin
FROM `your-project-id.thelook_ecommerce.order_items` oi
JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
GROUP BY p.brand
ORDER BY total_revenue DESC
LIMIT 10| brand | total_revenue | total_cost | profit | profit_margin |
|---|---|---|---|---|
| Calvin Klein | 207,131 | 97,083 | 110,049 | 53.13 |
| Diesel | 202,497 | 101,074 | 101,423 | 50.09 |
| Carhartt | 177,721 | 83,010 | 94,711 | 53.29 |
| 7 For All Mankind | 176,225 | 91,908 | 84,317 | 47.85 |
| True Religion | 175,552 | 91,986 | 83,567 | 47.60 |
| … | … | … | … | … |
Explanation:
SUM(oi.sale_price) - SUM(p.cost): Calculate total profit- Profit margin =
(Profit / Revenue) * 100 ROUND(..., 2): Round to 2 decimal places
Business Insights:
- High margin brands: Maintain premium positioning
- Low margin brands: Need volume strategy or cost reduction
- View revenue and margin together for optimal portfolio composition
Example 2: Multi-Column GROUP BY
Theory
Grouping by multiple columns enables more granular analysis.
Syntax
GROUP BY column1, column2, column3Use Cases
- Category + brand analysis
- Year + month trends
- Region + gender segmentation
Example Code
| category | brand | item_count | total_revenue | avg_price |
|---|---|---|---|---|
| Accessories | Ray-Ban | 671 | 79,911 | 119.09 |
| Accessories | Oakley | 388 | 52,646 | 135.69 |
| Accessories | Tom Ford | 46 | 11,043 | 240.07 |
| Accessories | Kate Spade | 69 | 8,978 | 130.11 |
| … | … | … | … | … |
Quiz 2: Revenue Aggregation by Department and Category
Problem
JOIN order_items and products to:
- Group by department and category
- Calculate items sold, total revenue, average price
- Query only top 20 combinations by revenue
Hint: GROUP BY department, category
View Answer
SELECT
p.department,
p.category,
COUNT(*) AS item_count,
SUM(oi.sale_price) AS total_revenue,
ROUND(AVG(oi.sale_price), 2) AS avg_price
FROM `your-project-id.thelook_ecommerce.order_items` oi
JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
GROUP BY p.department, p.category
ORDER BY total_revenue DESC
LIMIT 20Explanation:
GROUP BY department, category: Simultaneous grouping by two dimensions- Understand revenue status by department-category combination
Business Applications:
- Identify strong categories by department
- Discover cross-selling opportunities
- Develop inventory allocation strategies
Example 3: HAVING Clause - Filtering Aggregated Results
Theory
WHERE filters before grouping, HAVING filters after grouping.
Execution Order
SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY
↑ ↑
Pre-grouping filter Post-grouping filterWHERE vs HAVING
| Aspect | WHERE | HAVING |
|---|---|---|
| When applied | Before GROUP BY | After GROUP BY |
| Target | Individual rows | Groups |
| Aggregate functions | Cannot use | Can use |
Example Code
| category | order_count | total_revenue | avg_price |
|---|---|---|---|
| Outerwear & Coats | 8,864 | 1,311,691 | 144.51 |
| Jeans | 11,986 | 1,212,712 | 98.08 |
| Sweaters | 10,749 | 837,376 | 75.62 |
| … (22 categories) | … | … | … |
Quiz 3: Output Only Top 10 Brands by Revenue
Problem
JOIN order_items and products to:
- Calculate total revenue, order count, items sold by brand
- Filter with HAVING for brands with 100+ orders
- Sort by revenue descending
- Query only top 10 brands
Hint: HAVING COUNT(DISTINCT ...) >= 100
View Answer
SELECT
p.brand,
COUNT(DISTINCT oi.order_id) AS order_count,
COUNT(*) AS item_count,
SUM(oi.sale_price) AS total_revenue
FROM `your-project-id.thelook_ecommerce.order_items` oi
JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
GROUP BY p.brand
HAVING COUNT(DISTINCT oi.order_id) >= 100
ORDER BY total_revenue DESC
LIMIT 10| brand | order_count | item_count | total_revenue |
|---|---|---|---|
| Calvin Klein | 3,215 | 3,244 | 207,131 |
| Diesel | 1,468 | 1,481 | 202,497 |
| Carhartt | 2,546 | 2,562 | 177,721 |
| 7 For All Mankind | 1,118 | 1,120 | 176,225 |
| True Religion | 881 | 882 | 175,552 |
| … | … | … | … |
Explanation:
HAVING COUNT(DISTINCT oi.order_id) >= 100: Filter by order count after grouping- WHERE applies before aggregation, HAVING applies after
- LIMIT restricts final result count
Practical Applications:
- Analyze only brands above a certain scale (noise removal)
- Extract segments above KPI thresholds
- Focus analysis on top N items
Example 4: COUNT DISTINCT - Unique Value Aggregation
Theory
Calculate the count of unique values with duplicates removed.
Syntax
COUNT(DISTINCT column)Use Cases
- Unique customers
- Unique products
- Unique transaction dates
Example Code
| category | unique_products | unique_brands | total_items_sold | total_revenue |
|---|---|---|---|---|
| Outerwear & Coats | 1,416 | 286 | 9,077 | 1,311,691 |
| Jeans | 1,998 | 222 | 12,365 | 1,212,712 |
| Sweaters | 1,733 | 347 | 11,073 | 837,376 |
| … | … | … | … | … |
Quiz 4: Unique Customer Count by Category
Problem
JOIN three tables to:
- JOIN
orders,order_items,products - Calculate unique customer count by category
- Calculate average purchase amount per customer
- Also query total order count
- Query only top 10 categories by unique customer count
Hint: COUNT(DISTINCT o.user_id), SUM(sale_price) / COUNT(DISTINCT user_id)
View Answer
SELECT
p.category,
COUNT(DISTINCT o.user_id) AS unique_customers,
COUNT(DISTINCT o.order_id) AS order_count,
SUM(oi.sale_price) AS total_revenue,
ROUND(SUM(oi.sale_price) / COUNT(DISTINCT o.user_id), 2) AS avg_revenue_per_customer
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
GROUP BY p.category
ORDER BY unique_customers DESC
LIMIT 10| category | unique_customers | order_count | total_revenue | avg_revenue_per_customer |
|---|---|---|---|---|
| Jeans | 11,360 | 11,986 | 1,212,712 | 106.75 |
| Intimates | 11,264 | 12,435 | 449,363 | 39.89 |
| Tops & Tees | 10,911 | 11,486 | 488,324 | 44.76 |
| Fashion Hoodies & Sweatshirts | 10,767 | 11,300 | 632,604 | 58.75 |
| … | … | … | … | … |
Explanation:
COUNT(DISTINCT o.user_id): Unique customer count with duplicates removedSUM(sale_price) / COUNT(DISTINCT user_id): Average revenue per customer (ARPU)- Derive integrated metrics by JOINing three tables
Business Insights:
- High customer count + low avg revenue: Mass market category, volume strategy
- Low customer count + high avg revenue: Premium category, needs customer expansion
- Ideal: Categories with both high customer count and high avg revenue
Example 5: Combining Aggregate Functions - Comprehensive Analysis
Theory
Combine multiple aggregate functions to derive multi-dimensional insights.
Key Aggregate Functions
| Function | Use |
|---|---|
SUM() | Sum |
AVG() | Average |
MIN() / MAX() | Min/Max |
COUNT() | Count |
COUNT(DISTINCT) | Unique count |
STDDEV() | Standard deviation |
APPROX_QUANTILES() | Percentiles (BigQuery) |
Example Code
| category | total_items | total_revenue | avg_price | min_price | max_price | price_range | price_stddev |
|---|---|---|---|---|---|---|---|
| Outerwear & Coats | 9,077 | 1,311,691 | 144.51 | 10.51 | 999.00 | 988.49 | 136.41 |
| Jeans | 12,365 | 1,212,712 | 98.08 | 9.99 | 389.00 | 379.01 | 64.21 |
| Sweaters | 11,073 | 837,376 | 75.62 | 5.99 | 588.00 | 582.01 | 59.62 |
| … | … | … | … | … | … | … | … |
Quiz 5: Price Range and Average Analysis by Category
Problem
From the products table:
- Calculate statistics based on retail_price by category
- Average, min, max, median (50th percentile), standard deviation
- Filter only categories with 10+ products
- Sort by average price descending
- Query top 10
Hint: APPROX_QUANTILES(retail_price, 100)[OFFSET(50)] AS median
View Answer
SELECT
category,
COUNT(*) AS product_count,
ROUND(AVG(retail_price), 2) AS avg_price,
MIN(retail_price) AS min_price,
MAX(retail_price) AS max_price,
APPROX_QUANTILES(retail_price, 100)[OFFSET(50)] AS median_price,
ROUND(STDDEV(retail_price), 2) AS price_stddev
FROM `your-project-id.thelook_ecommerce.products`
GROUP BY category
HAVING COUNT(*) >= 10
ORDER BY avg_price DESC
LIMIT 10| category | product_count | avg_price | min_price | max_price | median_price | price_stddev |
|---|---|---|---|---|---|---|
| Outerwear & Coats | 1,420 | 146.02 | 10.51 | 999.00 | 109.95 | 138.25 |
| Suits & Sport Coats | 739 | 126.56 | 9.99 | 698.00 | 99.50 | 98.77 |
| Suits | 188 | 116.16 | 13.99 | 239.72 | 122.60 | 46.29 |
| Jeans | 1,999 | 97.85 | 9.99 | 389.00 | 78.00 | 63.85 |
| … | … | … | … | … | … | … |
Explanation:
APPROX_QUANTILES(retail_price, 100)[OFFSET(50)]: Calculate medianSTDDEV(): Measure price volatilityHAVING COUNT(*) >= 10: Ensure statistical significance
Business Insights:
- High standard deviation: High price diversity (mix of low and high-priced products)
- Low standard deviation: Consistent price range (clear target customer segment)
- Average > Median: High-priced products pulling up the average
- Average < Median: Many low-priced products
Summary
GROUP BY Techniques Covered
| Technique | Use | Example |
|---|---|---|
GROUP BY column | Single column aggregation | Revenue by category |
GROUP BY col1, col2 | Multi-column aggregation | Revenue by category+brand |
HAVING | Filter aggregated results | Only $100K+ revenue |
COUNT(DISTINCT) | Unique value count | Unique customer count |
MIN/MAX/AVG/SUM | Basic aggregate functions | Price statistics |
WHERE vs HAVING
-- WHERE: Pre-grouping filter (individual rows)
WHERE created_at >= '2024-01-01'
-- HAVING: Post-grouping filter (aggregated results)
HAVING SUM(revenue) > 100000Practical Use Scenarios
- Category Performance Analysis:
GROUP BY category - Brand Portfolio Optimization:
GROUP BY brand+HAVING - Customer Segmentation:
COUNT(DISTINCT user_id) - Pricing Strategy:
MIN/MAX/AVG/STDDEV - Cross Analysis: Multi-column
GROUP BY
- Ranking and cumulative aggregation with Window Functions
- Build complex analysis pipelines by combining with CTE
- Multi-dimensional aggregation with ROLLUP/CUBE