BigQuery Setup
This guide walks you through setting up a SQL-based data analysis environment using Google BigQuery.
1. Create a Google Cloud Project
Google Cloud offers new users $300 in credits and a 90-day free trial. On top of that, BigQuery's free tier includes 1 TB of query processing and 10 GB of storage per month.
1.1 Access Google Cloud Console
- Go to the Google Cloud Console.
- Sign in with your Google account.
- Accept the terms of service.
1.2 Create a Project
- Click the project selection dropdown at the top.
- Click “New Project”.
- Enter a project name (e.g., my-analytics-project).
- Click “Create”.
2. Enable BigQuery API
- Go to the BigQuery API page.
- Click the “Enable” button to activate the API.
3. Create Service Account and Key
You need a service account to access BigQuery from Jupyter Notebook or Python scripts.
3.1 Create Service Account
- Go to IAM & Admin > Service Accounts.
- Click “Create Service Account”.
- Enter the following information:
  - Service account name: bigquery-access
  - Description: Service account for BigQuery data analysis
- Click “Create and Continue”.
3.2 Grant Permissions
Add the following roles:
- BigQuery Data Viewer (roles/bigquery.dataViewer): read data
- BigQuery Job User (roles/bigquery.jobUser): run queries
- BigQuery Data Editor (roles/bigquery.dataEditor): write data
3.3 Create JSON Key
- Click on the created service account.
- Go to the “Keys” tab.
- Click “Add Key” > “Create new key”.
- Select JSON format and click “Create”.
- The JSON file will be downloaded automatically.
Never commit the JSON key file to Git or share it publicly. Add it to .gitignore and store it in a secure location.
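If you prefer to load the key explicitly in code (rather than via the GOOGLE_APPLICATION_CREDENTIALS environment variable used later in this guide), a minimal sketch looks like this; the key path is a placeholder:
from google.cloud import bigquery
from google.oauth2 import service_account

# Path is a placeholder - point it at the key you just downloaded,
# stored outside your repository.
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/your-service-account-key.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)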
4. Create Dataset and Tables
All examples in this Cookbook use data generated by the SQL scripts below. Run the entire script in the BigQuery console.
Replace your-project-id with your actual project ID in the SQL below.
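If you would rather run the script from Python than paste it into the console, here is a minimal sketch. It assumes you saved the statements in this section to a local file named setup.sql (a hypothetical name); BigQuery accepts multi-statement scripts in a single query job.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
# setup.sql is a hypothetical local copy of the statements in this section.
sql = open("setup.sql").read().replace("your-project-id", client.project)
client.query(sql).result()  # waits for the whole script to finish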
4.1 Create Dataset
-- Create dataset (US multi-region)
CREATE SCHEMA IF NOT EXISTS `your-project-id.retail_analytics_us`
OPTIONS(location="US");4.2 Copy Source Data (thelook_ecommerce)
Copy the tables you need from the BigQuery public dataset thelook_ecommerce. Partitioning and clustering are applied to optimize query performance and cut scan costs; section 5 includes a dry-run check that shows the effect.
-- 1) Products table (no partition - no date column)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_products` AS
SELECT
id AS product_id,
category,
brand,
department,
name,
retail_price,
cost,
sku,
distribution_center_id
FROM `bigquery-public-data.thelook_ecommerce.products`;
-- 2) Orders table (partitioned by created_at, clustered by user_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_orders`
PARTITION BY DATE(created_at)
CLUSTER BY user_id AS
SELECT
order_id,
user_id,
created_at,
status,
num_of_item
FROM `bigquery-public-data.thelook_ecommerce.orders`;
-- 3) Order items table (clustered by order_id, product_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_order_items`
CLUSTER BY order_id, product_id AS
SELECT
order_id,
product_id,
sale_price,
returned_at,
shipped_at,
delivered_at
FROM `bigquery-public-data.thelook_ecommerce.order_items`;
-- 4) Events table (partitioned by created_at, clustered by session_id/user_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_events`
PARTITION BY DATE(created_at)
CLUSTER BY session_id, user_id AS
SELECT
id,
user_id,
sequence_number,
session_id,
created_at,
ip_address
FROM `bigquery-public-data.thelook_ecommerce.events`;
-- 5) Users table (partitioned by created_at, clustered by country)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_users`
PARTITION BY DATE(created_at)
CLUSTER BY country AS
SELECT
id AS user_id,
first_name,
last_name,
email,
gender,
age,
state,
country,
created_at
FROM `bigquery-public-data.thelook_ecommerce.users`;
4.3 Augmented Events Table (Session-level Channel/Device)
Create session-level augmented data for marketing analysis. Each session's channel, device, and landing page are derived deterministically from a hash of its session ID, so the assignments are stable across runs.
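The script below relies on a simple trick: hash the session ID with FARM_FINGERPRINT, then take MOD to pick a category. The same idea in Python, purely for intuition (MD5 here is a stand-in hash, not what BigQuery uses):
import hashlib

def bucket(key: str, n: int) -> int:
    # Deterministic hash -> bucket index; the same key always lands in the same bucket.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

channels = ['organic', 'paid_search', 'paid_social', 'display', 'email', 'referral', 'direct']
print(channels[bucket('session-123', len(channels))])  # stable across runs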
-- Augmented events (session-level channel/device/landing generated)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.events_augmented`
PARTITION BY session_date
CLUSTER BY channel_key, device_key AS
WITH base AS (
SELECT
e.session_id,
ANY_VALUE(e.user_id) AS user_id,
MIN(e.created_at) AS session_start_at,
DATE(MIN(e.created_at)) AS session_date,
COUNT(*) AS events_in_session,
ABS(MOD(FARM_FINGERPRINT(CAST(e.session_id AS STRING)), 1000000)) AS h
FROM `your-project-id.retail_analytics_us.src_events` e
GROUP BY e.session_id
),
cooked AS (
SELECT
b.session_id,
b.user_id,
b.session_start_at,
b.session_date,
b.events_in_session,
b.h,
CASE MOD(b.h, 7)
WHEN 0 THEN 'organic' WHEN 1 THEN 'paid_search' WHEN 2 THEN 'paid_social'
WHEN 3 THEN 'display' WHEN 4 THEN 'email' WHEN 5 THEN 'referral' ELSE 'direct'
END AS channel_key,
CASE MOD(DIV(b.h,7), 3)
WHEN 0 THEN 'desktop' WHEN 1 THEN 'mobile' ELSE 'tablet'
END AS device_key,
CASE MOD(DIV(b.h,21), 5)
WHEN 0 THEN 'https://shop.example.com/home'
WHEN 1 THEN 'https://shop.example.com/women/heels'
WHEN 2 THEN 'https://shop.example.com/men/tee'
WHEN 3 THEN 'https://shop.example.com/accessories/belts'
ELSE 'https://shop.example.com/watches/quartz'
END AS landing_page
FROM base b
)
SELECT
c.session_id,
c.user_id,
c.session_start_at,
c.session_date,
c.events_in_session,
c.channel_key,
c.device_key,
c.landing_page,
LEAST(c.events_in_session, 10) AS pageviews_est,
CAST(LEAST(c.events_in_session, 10) * (0.15 + MOD(DIV(c.h,105), 5) * 0.05) AS INT64) AS add_to_cart_est
FROM cooked c;
4.4 CS Ticket Dummy Data (for Project 1)
Generate ticket data for customer service analysis projects.
-- CS ticket dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.cs_tickets_dummy`
PARTITION BY DATE(opened_at)
CLUSTER BY country_name, category, issue_type AS
WITH user_pool AS (
SELECT user_id, country, ROW_NUMBER() OVER (ORDER BY user_id) as rn
FROM `your-project-id.retail_analytics_us.src_users`
),
user_count AS (
SELECT COUNT(*) as total_users FROM user_pool
),
ticket_numbers AS (
SELECT off
FROM UNNEST(GENERATE_ARRAY(1, 5000)) AS off
),
gen AS (
SELECT
t.off,
CONCAT('TKT_', CAST(100000+t.off AS STRING)) AS ticket_id,
u.user_id,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(t.off * 30 + MOD(ABS(FARM_FINGERPRINT(CONCAT('ts', CAST(t.off AS STRING)))), 90*24*60) AS INT64) MINUTE) AS opened_at,
t.off AS row_num
FROM ticket_numbers t
CROSS JOIN user_count uc
LEFT JOIN user_pool u ON u.rn = 1 + MOD(ABS(FARM_FINGERPRINT(CAST(t.off AS STRING))), uc.total_users)
)
SELECT
ticket_id,
user_id,
opened_at,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resp'))), 10) < 7
THEN TIMESTAMP_ADD(opened_at, INTERVAL CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resptime'))), 48) AS INT64) HOUR)
END AS first_response_at,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resv'))), 10) < 6
THEN TIMESTAMP_ADD(opened_at, INTERVAL CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resvtime'))), 120) AS INT64) HOUR)
END AS resolved_at,
CASE
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 15 THEN 'escalated'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 25 THEN 'pending'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 35 THEN 'open'
ELSE 'solved'
END AS status,
CASE
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 10 THEN 'urgent'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 40 THEN 'high'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 70 THEN 'normal'
ELSE 'low'
END AS priority,
CASE
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-channel'))), 10) < 4 THEN 'email'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-channel'))), 10) < 7 THEN 'chat'
ELSE 'phone'
END AS channel,
CASE
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 25 THEN 'size'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 50 THEN 'quality'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 70 THEN 'shipping'
WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 85 THEN 'payment'
ELSE 'refund'
END AS issue_type,
CONCAT('AGT_', CAST(100 + MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-agent'))), 20) AS STRING)) AS agent_id,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-country'))), 2) = 0 THEN 'United States' ELSE 'Korea, Republic of' END AS country_name,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-category'))), 2) = 0 THEN 'Womens Shoes' ELSE 'Mens Apparel' END AS category,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-sat'))), 10) < 6
THEN CAST(1 + MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-satval'))), 5) AS INT64)
END AS satisfaction_score,
CASE WHEN MOD(row_num, 2) = 0 THEN 'Thanks for quick support' ELSE 'I want to return due to size issue' END AS comment
FROM gen;
4.5 Satisfaction/NPS Survey Dummy Data
-- Satisfaction/NPS survey dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.survey_cs_dummy`
PARTITION BY DATE(sent_at)
CLUSTER BY country_name, category AS
WITH tickets_with_index AS (
SELECT
ticket_id,
user_id,
ROW_NUMBER() OVER (ORDER BY ticket_id) as rn
FROM `your-project-id.retail_analytics_us.cs_tickets_dummy`
WHERE status IN ('solved', 'pending')
),
ticket_count AS (
SELECT COUNT(*) as total_tickets FROM tickets_with_index
),
survey_numbers AS (
SELECT off FROM UNNEST(GENERATE_ARRAY(1, 1500)) AS off
)
SELECT
CONCAT('SVY_', CAST(100000+s.off AS STRING)) AS survey_id,
t.user_id,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(s.off * 40 + MOD(ABS(FARM_FINGERPRINT(CONCAT('svy', CAST(s.off AS STRING)))), 60*24*60) AS INT64) MINUTE) AS sent_at,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-complete'))), 10) < 6
THEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(s.off * 40 + MOD(ABS(FARM_FINGERPRINT(CONCAT('svy-c', CAST(s.off AS STRING)))), 60*24*55) AS INT64) MINUTE)
END AS completed_at,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-nps'))), 10) < 6
THEN CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-npsval'))), 11) AS INT64)
END AS nps_score,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-csat'))), 10) < 7
THEN CAST(1 + MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-csatval'))), 5) AS INT64)
END AS csat_score,
CASE WHEN MOD(s.off, 2) = 0 THEN 'Good quality' ELSE 'Delivery took longer than expected' END AS free_text,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-hasticket'))), 10) < 4
THEN t.ticket_id
END AS related_ticket_id,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-country'))), 2) = 0 THEN 'United States' ELSE 'Korea, Republic of' END AS country_name,
CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-cat'))), 2) = 0 THEN 'Womens Shoes' ELSE 'Mens Apparel' END AS category
FROM survey_numbers s
CROSS JOIN ticket_count tc
LEFT JOIN tickets_with_index t ON t.rn = 1 + MOD(ABS(FARM_FINGERPRINT(CAST(s.off AS STRING))), tc.total_tickets);
4.6 Return Reason Dummy Data
-- Return reason label dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.returns_reason_dummy`
PARTITION BY return_date
CLUSTER BY reason_code, country_name AS
WITH order_items_sample AS (
SELECT
oi.order_id,
oi.product_id,
o.created_at,
u.country as country_name,
ROW_NUMBER() OVER (ORDER BY oi.order_id, oi.product_id) as row_num
FROM `your-project-id.retail_analytics_us.src_order_items` oi
INNER JOIN `your-project-id.retail_analytics_us.src_orders` o ON oi.order_id = o.order_id
INNER JOIN `your-project-id.retail_analytics_us.src_users` u ON o.user_id = u.user_id
WHERE RAND() < 0.1
AND oi.returned_at IS NULL
),
reasons AS (
SELECT * FROM UNNEST(['size_issue','defect','damaged','changed_mind','shipping_delay']) AS reason_code WITH OFFSET AS reason_offset
),
responsibilities AS (
SELECT * FROM UNNEST(['merchant','logistics','customer']) AS responsibility WITH OFFSET AS resp_offset
)
SELECT
order_id,
product_id,
DATE_ADD(DATE(created_at), INTERVAL CAST(7 + MOD(row_num, 30) AS INT64) DAY) AS return_date,
(SELECT reason_code FROM reasons WHERE reason_offset = MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(order_id AS STRING), '-', CAST(product_id AS STRING), '-reason'))), 5) LIMIT 1) AS reason_code,
'auto-generated reason' AS reason_text,
(SELECT responsibility FROM responsibilities WHERE resp_offset = MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(order_id AS STRING), '-', CAST(product_id AS STRING), '-resp'))), 3) LIMIT 1) AS responsibility,
country_name
FROM order_items_sample;
4.7 External/Strategic Signal Dummy (for Project 2)
-- Demand trends (similar to Google Trends)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_demand_trends`
PARTITION BY week
CLUSTER BY keyword, region_code AS
WITH weeks AS (
SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off WEEK), WEEK(MONDAY)) AS week
FROM UNNEST(GENERATE_ARRAY(0, 26)) AS off
),
kw AS (SELECT * FROM UNNEST(['heels','women shoes','men clothing','handbags','watches']) AS keyword),
rg AS (SELECT * FROM UNNEST(['US','KR']) AS region_code)
SELECT week, keyword, region_code, 50 + CAST(FLOOR(30*RAND()) AS INT64) AS score
FROM weeks CROSS JOIN kw CROSS JOIN rg;
-- Competitive/technology signals
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_competitive_signal`
PARTITION BY period
CLUSTER BY topic, region_code AS
WITH months AS (
SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off MONTH), MONTH) AS period
FROM UNNEST(GENERATE_ARRAY(0, 12)) AS off
),
topics AS (SELECT * FROM UNNEST(['fabric','eco','sportswear','accessories']) AS topic),
regions AS (SELECT * FROM UNNEST(['US','KR']) AS region_code)
SELECT
m.period AS period,
t.topic AS topic,
r.region_code AS region_code,
ROUND(10 + RAND()*90, 2) AS signal,
(SELECT AS VALUE s FROM UNNEST(['patent','news','social']) s ORDER BY RAND() LIMIT 1) AS source
FROM months m
CROSS JOIN topics t
CROSS JOIN regions r;
-- Macro/weather adjustment
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_macro_weather`
PARTITION BY period
CLUSTER BY country_iso AS
WITH months AS (
SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off MONTH), MONTH) AS period
FROM UNNEST(GENERATE_ARRAY(0, 12)) AS off
),
countries AS (SELECT * FROM UNNEST(['US','KR']) AS country_iso)
SELECT
m.period AS period,
c.country_iso AS country_iso,
ROUND(3 + RAND()*7, 2) AS unemployment_rate,
ROUND(80 + RAND()*40, 2) AS income_index,
ROUND(40 + RAND()*60, 2) AS season_temp_idx
FROM months m
CROSS JOIN countries c;
4.8 Marketing Campaign Dummy (for Project 3)
-- Campaign metadata
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.mkt_campaigns_dummy`
PARTITION BY start_date
CLUSTER BY channel_key, target_category AS
WITH ids AS (
SELECT CONCAT('CMP_', CAST(1000+off AS STRING)) AS campaign_id
FROM UNNEST(GENERATE_ARRAY(1, 20)) AS off
)
SELECT
campaign_id,
CONCAT('Campaign ', CAST(ROW_NUMBER() OVER() AS STRING)) AS campaign_name,
(SELECT AS VALUE c FROM UNNEST(['paid_search','paid_social','display','email']) c ORDER BY RAND() LIMIT 1) AS channel_key,
DATE_SUB(CURRENT_DATE(), INTERVAL CAST(RAND()*90 AS INT64) DAY) AS start_date,
DATE_ADD(CURRENT_DATE(), INTERVAL CAST(RAND()*30 AS INT64) DAY) AS end_date,
(SELECT AS VALUE cat FROM UNNEST(['Womens Shoes','Mens Apparel','Accessories','Watches','Bags']) cat ORDER BY RAND() LIMIT 1) AS target_category,
(SELECT AS VALUE cn FROM UNNEST(['United States','Korea, Republic of']) cn ORDER BY RAND() LIMIT 1) AS target_country,
(SELECT AS VALUE o FROM UNNEST(['awareness','traffic','conversion','retention']) o ORDER BY RAND() LIMIT 1) AS objective,
(SELECT AS VALUE k FROM UNNEST(['CTR','CVR','ROAS','LTV']) k ORDER BY RAND() LIMIT 1) AS kpi,
ROUND(1000 + RAND()*9000, 2) AS budget
FROM ids;
-- Channel spend/performance
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.mkt_channel_spend_dummy`
PARTITION BY date
CLUSTER BY channel_key, campaign_id, country_name AS
WITH cal AS (
SELECT DATE_SUB(CURRENT_DATE(), INTERVAL off DAY) AS date
FROM UNNEST(GENERATE_ARRAY(0, 90)) AS off
),
ch AS (SELECT * FROM UNNEST(['paid_search','paid_social','display','email','referral','direct']) AS channel_key),
co AS (SELECT * FROM UNNEST(['United States','Korea, Republic of']) AS country_name),
dv AS (SELECT * FROM UNNEST(['desktop','mobile']) AS device_key)
SELECT
date, channel_key,
(SELECT campaign_id FROM `your-project-id.retail_analytics_us.mkt_campaigns_dummy` ORDER BY RAND() LIMIT 1) AS campaign_id,
CAST(1000 + RAND()*9000 AS INT64) AS impressions,
CAST(100 + RAND()*2000 AS INT64) AS clicks,
ROUND(100 + RAND()*800, 2) AS spend,
CONCAT('https://shop.example.com/lp/', channel_key) AS landing_page,
country_name, device_key
FROM cal, ch, co, dv;
5. Verify Creation Results
Verify that all tables were created successfully.
-- Check table names and row counts
-- (the legacy __TABLES__ view exposes row_count; INFORMATION_SCHEMA.TABLES does not)
SELECT table_id AS table_name, row_count
FROM `your-project-id.retail_analytics_us.__TABLES__`
ORDER BY table_id;
Expected table list:
| Table Name | Description |
|---|---|
| src_products | Product master |
| src_orders | Order header |
| src_order_items | Order details |
| src_events | Web event logs |
| src_users | Customer information |
| events_augmented | Session-level augmented data |
| cs_tickets_dummy | CS tickets |
| survey_cs_dummy | Satisfaction surveys |
| returns_reason_dummy | Return reasons |
| ext_demand_trends | Demand trends |
| ext_competitive_signal | Competitive signals |
| ext_macro_weather | Macroeconomic/weather |
| mkt_campaigns_dummy | Marketing campaigns |
| mkt_channel_spend_dummy | Channel spend |
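To confirm that partitioning (section 4.2) actually reduces the data scanned, compare dry-run estimates for an unfiltered query and one filtered on the partition column. A minimal sketch using the Python client configured in the next section; the date literal is arbitrary:
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
# Dry-run jobs return the bytes-processed estimate without running (or billing).
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

table = "`your-project-id.retail_analytics_us.src_orders`"
full = client.query(f"SELECT user_id, status FROM {table}", job_config=config)
pruned = client.query(
    f"SELECT user_id, status FROM {table} WHERE DATE(created_at) = '2024-01-15'",
    job_config=config,
)
print(f"full scan:     {full.total_bytes_processed:,} bytes")
print(f"one partition: {pruned.total_bytes_processed:,} bytes")  # much smaller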
6. Connect to BigQuery from Python
6.1 Install Required Packages
pip install google-cloud-bigquery pandas db-dtypes pyarrow
6.2 Set Up Authentication
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-service-account-key.json"6.3 Test Connection
from google.cloud import bigquery
import pandas as pd
# Initialize client
client = bigquery.Client(project='your-project-id')
# Run test query
query = """
SELECT
COUNT(*) as total_orders,
COUNT(DISTINCT user_id) as unique_customers,
SUM(num_of_item) as total_items
FROM `your-project-id.retail_analytics_us.src_orders`
"""
result = client.query(query).to_dataframe()
print("✅ BigQuery connection successful!")
print(result)
Expected output:
✅ BigQuery connection successful!
total_orders unique_customers total_items
0         80643             49891       158846
7. Set Environment Variables (Recommended)
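With the variables below in place, notebooks pick up credentials automatically and can read the project ID instead of hard-coding it; a minimal sketch:
import os
from google.cloud import bigquery

# GOOGLE_APPLICATION_CREDENTIALS is read automatically by the client library;
# GCP_PROJECT_ID is this guide's own convention.
client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"])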
macOS / Linux
# Add to ~/.bashrc or ~/.zshrc
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-key.json"
export GCP_PROJECT_ID="your-project-id"
Windows
setx GOOGLE_APPLICATION_CREDENTIALS "C:\path\to\your-service-account-key.json"
setx GCP_PROJECT_ID "your-project-id"
Next Steps
Your environment setup is complete!
- Understanding Data Structure - Understand table schemas and relationships
- SQL Track - Start SQL-based analysis
- Pandas Track - Start Python-based analysis