Understanding Data Structure
📦 Download Dataset (BigQuery Setup Guide)
Understand the structure and relationships of the datasets used in this Cookbook.
Dataset Overview
This Cookbook is based on data from a fictional e-commerce company. It extends Google BigQuery’s thelook_ecommerce public dataset with CS tickets, marketing campaigns, external market data, and more.
Data Model (ERD)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ src_users │ │ src_orders │ │ src_order_items │
├─────────────────┤ ├─────────────────┤ ├─────────────────┤
│ user_id (PK) │──────▶│ order_id (PK) │◀──────│ order_id (FK) │
│ first_name │ │ user_id (FK) │ │ product_id (FK) │
│ last_name │ │ created_at │ │ sale_price │
│ email │ │ status │ │ returned_at │
│ gender │ │ num_of_item │ │ shipped_at │
│ age │ └─────────────────┘ │ delivered_at │
│ country │ └────────┬────────┘
│ state │ │
│ created_at │ │
└─────────────────┘ ▼
┌─────────────────┐
┌─────────────────┐ │ src_products │
│ src_events │ ├─────────────────┤
├─────────────────┤ │ product_id (PK) │
│ id (PK) │ │ category │
│ user_id (FK) │ │ brand │
│ session_id │ │ department │
│ sequence_number │ │ name │
│ created_at │ │ retail_price │
│ ip_address │ │ cost │
└─────────────────┘ └─────────────────┘Detailed Table Descriptions
1. Core Tables (E-commerce)
src_users - Customer Information
| Column | Type | Description |
|---|---|---|
| user_id | INT64 | Customer unique identifier (PK) |
| first_name | STRING | First name |
| last_name | STRING | Last name |
| STRING | Email address | |
| gender | STRING | Gender (M/F) |
| age | INT64 | Age |
| country | STRING | Country |
| state | STRING | State/Region |
| created_at | TIMESTAMP | Registration date |
ℹ️
Data Scale
Contains approximately 100,000 customer records.
src_orders - Order Information
| Column | Type | Description |
|---|---|---|
| order_id | INT64 | Order unique identifier (PK) |
| user_id | INT64 | Customer ID (FK) |
| created_at | TIMESTAMP | Order date/time |
| status | STRING | Order status (Shipped, Complete, Cancelled, etc.) |
| num_of_item | INT64 | Number of items in order |
Order Status Values:
Shipped: In transitComplete: DeliveredCancelled: CancelledReturned: ReturnedProcessing: Processing
src_order_items - Order Item Details
| Column | Type | Description |
|---|---|---|
| order_id | INT64 | Order ID (FK) |
| product_id | INT64 | Product ID (FK) |
| sale_price | FLOAT64 | Sale price |
| returned_at | TIMESTAMP | Return date/time (nullable) |
| shipped_at | TIMESTAMP | Shipping start date/time |
| delivered_at | TIMESTAMP | Delivery completion date/time |
💡
Composite Key
The combination of order_id and product_id is the composite primary key for this table.
src_products - Product Information
| Column | Type | Description |
|---|---|---|
| product_id | INT64 | Product unique identifier (PK) |
| category | STRING | Category |
| brand | STRING | Brand |
| department | STRING | Department (Men/Women) |
| name | STRING | Product name |
| retail_price | FLOAT64 | Retail price |
| cost | FLOAT64 | Cost |
| sku | STRING | SKU code |
| distribution_center_id | INT64 | Distribution center ID |
Main Categories:
- Accessories, Active, Blazers & Jackets, Clothing Sets
- Dresses, Fashion Hoodies & Sweatshirts, Jeans
- Leggings, Maternity, Outerwear & Coats, Pants
- Pants & Capris, Plus, Shorts, Skirts, Sleep & Lounge
- Socks, Socks & Hosiery, Suits, Suits & Sport Coats
- Sweaters, Swim, Tops & Tees, Underwear
src_events - Web Event Logs
| Column | Type | Description |
|---|---|---|
| id | INT64 | Event ID (PK) |
| user_id | INT64 | Customer ID (FK, nullable) |
| session_id | STRING | Session ID |
| sequence_number | INT64 | Event sequence within session |
| created_at | TIMESTAMP | Event timestamp |
| ip_address | STRING | IP address |
2. Augmented Data (for Session Analysis)
events_augmented - Session-level Aggregation
| Column | Type | Description |
|---|---|---|
| session_id | STRING | Session ID (PK) |
| user_id | INT64 | Customer ID |
| session_start_at | TIMESTAMP | Session start time |
| session_date | DATE | Session date |
| events_in_session | INT64 | Number of events in session |
| channel_key | STRING | Acquisition channel |
| device_key | STRING | Device type |
| landing_page | STRING | Landing page URL |
| pageviews_est | INT64 | Estimated pageviews |
| add_to_cart_est | INT64 | Estimated add-to-cart count |
Channel Types:
organic: Organic searchpaid_search: Paid search adspaid_social: Paid social adsdisplay: Display adsemail: Email marketingreferral: Referral/affiliatedirect: Direct visit
3. Customer Service Data
cs_tickets_dummy - CS Tickets
| Column | Type | Description |
|---|---|---|
| ticket_id | STRING | Ticket ID (PK) |
| user_id | INT64 | Customer ID (FK) |
| opened_at | TIMESTAMP | Ticket creation time |
| first_response_at | TIMESTAMP | First response time |
| resolved_at | TIMESTAMP | Resolution time |
| status | STRING | Status (open, pending, solved, escalated) |
| priority | STRING | Priority (low, normal, high, urgent) |
| channel | STRING | Contact channel (email, chat, phone) |
| issue_type | STRING | Issue type |
| agent_id | STRING | Assigned agent ID |
| country_name | STRING | Customer country |
| category | STRING | Related product category |
| satisfaction_score | INT64 | Satisfaction score (1-5) |
| comment | STRING | Customer comment |
Issue Types:
size: Size inquiryquality: Quality issueshipping: Shipping relatedpayment: Payment issuerefund: Refund request
survey_cs_dummy - Customer Satisfaction Survey
| Column | Type | Description |
|---|---|---|
| survey_id | STRING | Survey ID (PK) |
| user_id | INT64 | Customer ID |
| sent_at | TIMESTAMP | Survey sent time |
| completed_at | TIMESTAMP | Survey completion time |
| nps_score | INT64 | NPS score (0-10) |
| csat_score | INT64 | CSAT score (1-5) |
| free_text | STRING | Free-form response |
| related_ticket_id | STRING | Related ticket ID |
4. Marketing Data
mkt_campaigns_dummy - Campaign Metadata
| Column | Type | Description |
|---|---|---|
| campaign_id | STRING | Campaign ID (PK) |
| campaign_name | STRING | Campaign name |
| channel_key | STRING | Channel |
| start_date | DATE | Start date |
| end_date | DATE | End date |
| target_category | STRING | Target category |
| target_country | STRING | Target country |
| objective | STRING | Objective (awareness, traffic, conversion, etc.) |
| kpi | STRING | KPI (CTR, CVR, ROAS, LTV) |
| budget | FLOAT64 | Budget |
mkt_channel_spend_dummy - Channel Ad Spend
| Column | Type | Description |
|---|---|---|
| date | DATE | Date |
| channel_key | STRING | Channel |
| campaign_id | STRING | Campaign ID |
| impressions | INT64 | Impressions |
| clicks | INT64 | Clicks |
| spend | FLOAT64 | Spend amount |
| landing_page | STRING | Landing page |
| country_name | STRING | Country |
| device_key | STRING | Device |
5. External Data
ext_demand_trends - Market Demand Trends
| Column | Type | Description |
|---|---|---|
| week | DATE | Week start date |
| keyword | STRING | Keyword |
| region_code | STRING | Region code |
| score | INT64 | Trend score (0-100) |
ext_competitive_signal - Competitive Signals
| Column | Type | Description |
|---|---|---|
| period | DATE | Period (month) |
| topic | STRING | Topic |
| region_code | STRING | Region code |
| signal | FLOAT64 | Signal strength |
| source | STRING | Source (patent, news, social) |
Using Data Relationships
Basic Join Patterns
-- Join orders and customer information
SELECT
o.order_id,
o.created_at,
u.country,
u.age
FROM src_orders o
JOIN src_users u ON o.user_id = u.user_id
-- Join order details and product information
SELECT
oi.order_id,
oi.sale_price,
p.category,
p.brand
FROM src_order_items oi
JOIN src_products p ON oi.product_id = p.product_id
-- Full join (orders → products → customers)
SELECT
o.order_id,
o.created_at,
u.country,
p.category,
oi.sale_price
FROM src_orders o
JOIN src_users u ON o.user_id = u.user_id
JOIN src_order_items oi ON o.order_id = oi.order_id
JOIN src_products p ON oi.product_id = p.product_idPandas Equivalent Code
# Load data
orders = pd.read_csv('src_orders.csv')
users = pd.read_csv('src_users.csv')
order_items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
# Join orders and customer information
orders_with_users = orders.merge(users, on='user_id')
# Join order details and product information
items_with_products = order_items.merge(products, on='product_id')
# Full join
full_data = (orders
.merge(users, on='user_id')
.merge(order_items, on='order_id')
.merge(products, on='product_id'))Next Steps
Now that you understand the data structure, you’re ready to start the actual analysis:
- Start SQL Track - Data analysis with BigQuery SQL
- Start Pandas Track - Data analysis with Python Pandas
Last updated on