Combining Tables with JOIN
Learning Objectives
After completing this recipe, you will be able to:
- Understand the concept of JOIN and set theory
- Extract intersection data with INNER JOIN
- Maintain the base table with LEFT/RIGHT JOIN
- Create unions with FULL OUTER JOIN
- Generate all combinations with CROSS JOIN
- Multi-table JOIN and performance optimization
BigQuery Execution Environment Setup
from google.cloud import bigquery
import pandas as pd
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
# Query execution function
def run_query(query):
return client.query(query).to_dataframe()Replace your-project-id with your GCP project ID in the code above.
1. What is JOIN?
Theory
JOIN is a SQL operation that combines two or more tables based on specific conditions. Since data in relational databases is stored across multiple tables, you need to combine tables for meaningful analysis.
Understanding with Set Theory
| JOIN Type | Set Operation | Description |
|---|---|---|
| INNER JOIN | A ∩ B | Only rows that exist in both tables |
| LEFT JOIN | A + (A ∩ B) | All of the left table + matching rows from the right |
| RIGHT JOIN | (A ∩ B) + B | All of the right table + matching rows from the left |
| FULL OUTER JOIN | A ∪ B | Both tables entirely |
| CROSS JOIN | A × B | All possible combinations (Cartesian product) |
Basic Syntax
SELECT columns
FROM table_a
[INNER | LEFT | RIGHT | FULL OUTER | CROSS] JOIN table_b
ON table_a.key = table_b.key2. INNER JOIN - Intersection
Theory
INNER JOIN returns only rows where the condition matches in both tables. It is the most commonly used JOIN type.
Table A Table B INNER JOIN Result
+----+----+ +----+----+ +----+----+----+
| id | a | | id | b | | id | a | b |
+----+----+ +----+----+ +----+----+----+
| 1 | X | | 1 | P | → | 1 | X | P |
| 2 | Y | | 3 | Q | | 3 | Z | Q |
| 3 | Z | | 4 | R | +----+----+----+
+----+----+ +----+----+Practice: Combining Orders and Order Details
SELECT
o.order_id,
o.user_id,
o.status,
o.created_at,
oi.product_id,
oi.sale_price
FROM `your-project-id.thelook_ecommerce.orders` o
INNER JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
WHERE o.status = 'Complete'
LIMIT 10Execution Result:
| order_id | user_id | status | created_at | product_id | sale_price |
|---|---|---|---|---|---|
| 1 | 54321 | Complete | 2024-01-15 10:30:00 | 12345 | 59.99 |
| 1 | 54321 | Complete | 2024-01-15 10:30:00 | 67890 | 29.99 |
| 2 | 12345 | Complete | 2024-01-15 11:45:00 | 11111 | 89.99 |
Assigning an alias to a table like orders o makes the query more concise. This is essential when JOINing multiple tables.
3. LEFT JOIN - Left-Based
Theory
LEFT JOIN keeps all rows from the left table and combines matching rows from the right table. If there’s no match, it fills with NULL.
Table A Table B LEFT JOIN Result
+----+----+ +----+----+ +----+----+------+
| id | a | | id | b | | id | a | b |
+----+----+ +----+----+ +----+----+------+
| 1 | X | | 1 | P | → | 1 | X | P |
| 2 | Y | | 3 | Q | | 2 | Y | NULL |
| 3 | Z | | 4 | R | | 3 | Z | Q |
+----+----+ +----+----+ +----+----+------+Practice: All Customers and Order History
SELECT
u.id AS user_id,
u.first_name,
u.last_name,
u.email,
COUNT(o.order_id) AS order_count,
COALESCE(SUM(oi.sale_price), 0) AS total_spent
FROM `your-project-id.thelook_ecommerce.users` u
LEFT JOIN `your-project-id.thelook_ecommerce.orders` o
ON u.id = o.user_id
LEFT JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
GROUP BY u.id, u.first_name, u.last_name, u.email
ORDER BY total_spent DESC
LIMIT 15Execution Result:
| user_id | first_name | last_name | order_count | total_spent | |
|---|---|---|---|---|---|
| 59071 | John | Smith | john.smith@example.com | 8 | 1633.89 |
| 84005 | Jane | Doe | jane.doe@example.com | 4 | 1500.73 |
| 90572 | Bob | Johnson | bob.j@example.com | 6 | 1477.39 |
| 12345 | Alice | Brown | alice.b@example.com | 0 | 0 |
- Finding customers who haven’t ordered (
WHERE o.order_id IS NULL) - Finding products without reviews
- Finding orders with failed payments
Practice: Finding Customers Who Haven’t Ordered
SELECT
u.id AS user_id,
u.first_name,
u.last_name,
u.email,
u.created_at AS signup_date
FROM `your-project-id.thelook_ecommerce.users` u
LEFT JOIN `your-project-id.thelook_ecommerce.orders` o
ON u.id = o.user_id
WHERE o.order_id IS NULL
ORDER BY u.created_at DESC
LIMIT 10Execution Result:
| user_id | first_name | last_name | signup_date | |
|---|---|---|---|---|
| 99876 | Mike | Wilson | mike.w@example.com | 2024-11-15 |
| 99543 | Sarah | Lee | sarah.l@example.com | 2024-11-14 |
4. RIGHT JOIN and FULL OUTER JOIN
RIGHT JOIN
RIGHT JOIN is the opposite of LEFT JOIN. It keeps all rows from the right table.
-- RIGHT JOIN (LEFT JOIN is more commonly used in practice)
SELECT
o.order_id,
o.status,
u.first_name,
u.last_name
FROM `your-project-id.thelook_ecommerce.users` u
RIGHT JOIN `your-project-id.thelook_ecommerce.orders` o
ON u.id = o.user_id
LIMIT 10In most cases, instead of RIGHT JOIN, swap the table order and use LEFT JOIN. It’s more readable.
FULL OUTER JOIN
FULL OUTER JOIN includes all rows from both tables.
-- All users and all orders (regardless of matching)
SELECT
u.id AS user_id,
u.first_name,
o.order_id,
o.status
FROM `your-project-id.thelook_ecommerce.users` u
FULL OUTER JOIN `your-project-id.thelook_ecommerce.orders` o
ON u.id = o.user_id
WHERE u.id IS NULL OR o.order_id IS NULL -- Only unmatched ones
LIMIT 105. CROSS JOIN - Cartesian Product
Theory
CROSS JOIN generates all possible combinations of two tables. Since it combines without conditions, the resulting row count = rows in A × rows in B.
Practice: Generate All Product-Promotion Combinations
-- All combinations of promotions and categories
WITH promotions AS (
SELECT promotion_name FROM UNNEST(['Black Friday', 'Summer Sale', 'New Year']) AS promotion_name
),
categories AS (
SELECT DISTINCT category
FROM `your-project-id.thelook_ecommerce.products`
LIMIT 5
)
SELECT
p.promotion_name,
c.category,
CONCAT(p.promotion_name, ' - ', c.category) AS campaign_name
FROM promotions p
CROSS JOIN categories c
ORDER BY p.promotion_name, c.categoryExecution Result:
| promotion_name | category | campaign_name |
|---|---|---|
| Black Friday | Accessories | Black Friday - Accessories |
| Black Friday | Active | Black Friday - Active |
| Black Friday | Jeans | Black Friday - Jeans |
| Summer Sale | Accessories | Summer Sale - Accessories |
| … | … | … |
CROSS JOIN results can grow rapidly. 10,000 rows × 10,000 rows = 100 million rows! Use it carefully.
6. Multi-Table JOIN
Practice: Order Detail Analysis (4 Tables)
SELECT
o.order_id,
o.created_at AS order_date,
o.status,
u.first_name || ' ' || u.last_name AS customer_name,
u.country,
p.name AS product_name,
p.category,
p.brand,
oi.sale_price,
p.retail_price,
ROUND(oi.sale_price - p.cost, 2) AS profit
FROM `your-project-id.thelook_ecommerce.orders` o
INNER JOIN `your-project-id.thelook_ecommerce.users` u
ON o.user_id = u.id
INNER JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
INNER JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.created_at DESC
LIMIT 15Execution Result:
| order_id | order_date | status | customer_name | country | product_name | category | brand | sale_price | profit |
|---|---|---|---|---|---|---|---|---|---|
| 125000 | 2024-11-15 | Complete | John Smith | United States | Classic Jeans | Jeans | Levi’s | 89.99 | 35.50 |
| 124999 | 2024-11-15 | Shipped | Jane Doe | China | Cotton T-Shirt | Tops & Tees | H&M | 24.99 | 12.30 |
7. JOIN Performance Optimization
Core Principles
- Use indexes on JOIN keys: BigQuery auto-optimizes, but leverage partitioning/clustering keys
- Put smaller tables on the right: BigQuery broadcasts the right table
- SELECT only necessary columns: Avoid
SELECT * - Filter early: Reduce data with WHERE before JOIN
Good vs Bad Examples
-- ❌ Bad example: Filtering after JOIN
SELECT *
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.created_at >= '2024-01-01'
-- ✅ Good example: Filter first with subquery
SELECT o.order_id, o.status, oi.sale_price
FROM (
SELECT order_id, status
FROM orders
WHERE created_at >= '2024-01-01'
) o
JOIN order_items oi ON o.order_id = oi.order_idCheck Execution Plan with EXPLAIN
-- Add EXPLAIN before the query
EXPLAIN
SELECT o.order_id, COUNT(oi.id)
FROM `your-project-id.thelook_ecommerce.orders` o
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON o.order_id = oi.order_id
GROUP BY o.order_idQuiz
Problem 1: Top 3 Brands by Revenue per Category
Query the top 3 brands by revenue in each category.
Hint: Use ROW_NUMBER() after JOINing products and order_items
View Answer
WITH brand_revenue AS (
SELECT
p.category,
p.brand,
SUM(oi.sale_price) AS total_revenue,
COUNT(*) AS items_sold
FROM `your-project-id.thelook_ecommerce.order_items` oi
INNER JOIN `your-project-id.thelook_ecommerce.products` p
ON oi.product_id = p.id
GROUP BY p.category, p.brand
),
ranked AS (
SELECT
category,
brand,
total_revenue,
items_sold,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS rank
FROM brand_revenue
)
SELECT
category,
brand,
total_revenue,
items_sold,
rank
FROM ranked
WHERE rank <= 3
ORDER BY category, rankExecution Result:
| category | brand | total_revenue | items_sold | rank |
|---|---|---|---|---|
| Accessories | Ray-Ban | 125,340.50 | 1,523 | 1 |
| Accessories | Oakley | 98,230.75 | 1,245 | 2 |
| Accessories | Persol | 87,650.00 | 987 | 3 |
| Jeans | Levi’s | 234,560.25 | 2,890 | 1 |
| … | … | … | … | … |
Problem 2: Repeat Customer Analysis
Query the first and last order information for customers who have ordered 2 or more times.
Hint: Self Join - JOIN the orders table with itself
View Answer
WITH customer_orders AS (
SELECT
user_id,
MIN(created_at) AS first_order_date,
MAX(created_at) AS last_order_date,
COUNT(DISTINCT order_id) AS total_orders
FROM `your-project-id.thelook_ecommerce.orders`
WHERE status = 'Complete'
GROUP BY user_id
HAVING COUNT(DISTINCT order_id) >= 2
)
SELECT
co.user_id,
u.first_name,
u.last_name,
co.first_order_date,
co.last_order_date,
co.total_orders,
DATE_DIFF(CAST(co.last_order_date AS DATE), CAST(co.first_order_date AS DATE), DAY) AS days_as_customer
FROM customer_orders co
INNER JOIN `your-project-id.thelook_ecommerce.users` u
ON co.user_id = u.id
ORDER BY co.total_orders DESC, days_as_customer DESC
LIMIT 20Execution Result:
| user_id | first_name | last_name | first_order_date | last_order_date | total_orders | days_as_customer |
|---|---|---|---|---|---|---|
| 4523 | Michael | Chen | 2023-01-02 | 2024-11-15 | 12 | 682 |
| 7891 | Emily | Wang | 2023-02-15 | 2024-10-28 | 9 | 621 |
Problem 3: Finding Products Never Sold
Find products that have never been sold.
Hint: LEFT JOIN + IS NULL
View Answer
SELECT
p.id AS product_id,
p.name,
p.category,
p.brand,
p.retail_price,
p.created_at AS product_created_at
FROM `your-project-id.thelook_ecommerce.products` p
LEFT JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON p.id = oi.product_id
WHERE oi.id IS NULL
ORDER BY p.retail_price DESC
LIMIT 20Execution Result:
| product_id | name | category | brand | retail_price | product_created_at |
|---|---|---|---|---|---|
| 28456 | Premium Jacket X | Outerwear & Coats | Canada Goose | 899.99 | 2024-10-01 |
| 29012 | Designer Dress Y | Dresses | Gucci | 750.00 | 2024-09-15 |
Business Insights:
- If there are many expensive products: Review pricing strategy
- If they’re new products: Strengthen marketing
- Identify candidates for inventory clearance
Summary
| JOIN Type | When to Use | Result |
|---|---|---|
INNER JOIN | Only need data in both | Intersection |
LEFT JOIN | All left + matching right | Left-based |
RIGHT JOIN | All right + matching left | Right-based |
FULL OUTER JOIN | Include both sides | Union |
CROSS JOIN | Generate all combinations | Cartesian product |
JOIN Selection Guide
What data do you need?
├─ Must exist in both → INNER JOIN
├─ All of one side, attach other if exists → LEFT/RIGHT JOIN
├─ Need all of both sides → FULL OUTER JOIN
└─ Need all combinations → CROSS JOINNext Steps
You’ve mastered JOIN! Next, learn about GROUP BY and aggregate functions in Aggregation and Grouping.