
BigQuery Setup

This guide walks you through setting up a SQL-based data analysis environment using Google BigQuery.

1. Create a Google Cloud Project

ℹ️ Free Trial

Google Cloud offers new users $300 in credits during a 90-day free trial. Separately from the trial, BigQuery's free tier includes 1 TB of query processing and 10 GB of storage per month.

1.1 Access Google Cloud Console

  1. Go to the Google Cloud Console.
  2. Sign in with your Google account.
  3. Accept the terms of service.

1.2 Create a Project

  1. Click the project selection dropdown at the top.
  2. Click “New Project”.
  3. Enter a project name (e.g., my-analytics-project).
  4. Click “Create”.

2. Enable BigQuery API

  1. Go to the BigQuery API page.
  2. Click the “Enable” button to activate the API.

3. Create Service Account and Key

You need a service account to access BigQuery from Jupyter Notebook or Python scripts.

3.1 Create Service Account

  1. Go to IAM & Admin > Service Accounts.
  2. Click “Create Service Account”.
  3. Enter the following information:
    • Service account name: bigquery-access
    • Description: Service account for BigQuery data analysis
  4. Click “Create and Continue”.

3.2 Grant Permissions

Add the following roles:

  • BigQuery Data Viewer (roles/bigquery.dataViewer) - Read data
  • BigQuery Job User (roles/bigquery.jobUser) - Run queries
  • BigQuery Data Editor (roles/bigquery.dataEditor) - Write data

3.3 Create JSON Key

  1. Click on the created service account.
  2. Go to the “Keys” tab.
  3. Click “Add Key” > “Create new key”.
  4. Select JSON format and click “Create”.
  5. The JSON file will be downloaded automatically.
⚠️ Security Warning

Never commit the JSON key file to Git or share it publicly. Add it to .gitignore and store it in a secure location.


4. Create Dataset and Tables

All examples in this Cookbook use data generated by the SQL scripts below. Run them in order in the BigQuery console.

💡 Change Project ID

Replace your-project-id with your actual project ID in the SQL below.

4.1 Create Dataset

-- Create dataset (US multi-region)
CREATE SCHEMA IF NOT EXISTS `your-project-id.retail_analytics_us`
OPTIONS(location = "US");
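If you prefer to run the setup from Python rather than the console, the same dataset can be created with the client library. A minimal sketch, assuming the google-cloud-bigquery package from Section 6 is installed and authentication is already configured:

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS is set (see Section 6.2).
client = bigquery.Client(project="your-project-id")

dataset = bigquery.Dataset("your-project-id.retail_analytics_us")
dataset.location = "US"  # must match the location used in the SQL above

# exists_ok=True makes the call idempotent, like CREATE SCHEMA IF NOT EXISTS.
client.create_dataset(dataset, exists_ok=True)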

4.2 Copy Source Data (thelook_ecommerce)

Copy the tables from the BigQuery public dataset thelook_ecommerce. Partitioning and clustering are applied to reduce the amount of data scanned and improve query performance.

-- 1) Products table (no partition - no date column)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_products` AS
SELECT
  id AS product_id,
  category,
  brand,
  department,
  name,
  retail_price,
  cost,
  sku,
  distribution_center_id
FROM `bigquery-public-data.thelook_ecommerce.products`;

-- 2) Orders table (partitioned by created_at, clustered by user_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_orders`
PARTITION BY DATE(created_at)
CLUSTER BY user_id AS
SELECT
  order_id,
  user_id,
  created_at,
  status,
  num_of_item
FROM `bigquery-public-data.thelook_ecommerce.orders`;

-- 3) Order items table (clustered by order_id, product_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_order_items`
CLUSTER BY order_id, product_id AS
SELECT
  order_id,
  product_id,
  sale_price,
  returned_at,
  shipped_at,
  delivered_at
FROM `bigquery-public-data.thelook_ecommerce.order_items`;

-- 4) Events table (partitioned by created_at, clustered by session_id/user_id)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_events`
PARTITION BY DATE(created_at)
CLUSTER BY session_id, user_id AS
SELECT
  id,
  user_id,
  sequence_number,
  session_id,
  created_at,
  ip_address
FROM `bigquery-public-data.thelook_ecommerce.events`;

-- 5) Users table (partitioned by created_at, clustered by country)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.src_users`
PARTITION BY DATE(created_at)
CLUSTER BY country AS
SELECT
  id AS user_id,
  first_name,
  last_name,
  email,
  gender,
  age,
  state,
  country,
  created_at
FROM `bigquery-public-data.thelook_ecommerce.users`;
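To see the effect of partitioning, you can dry-run the same query with and without a filter on the partition column and compare the bytes BigQuery would scan. A sketch, assuming the Python setup from Section 6; the date literal is illustrative only, since thelook_ecommerce timestamps shift over time:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# dry_run=True estimates bytes scanned without executing (or billing) the query.
cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

full = client.query(
    "SELECT order_id, status FROM `your-project-id.retail_analytics_us.src_orders`",
    job_config=cfg,
)
pruned = client.query(
    """
    SELECT order_id, status
    FROM `your-project-id.retail_analytics_us.src_orders`
    WHERE DATE(created_at) = '2024-01-01'  -- illustrative date
    """,
    job_config=cfg,
)
print(f"full scan: {full.total_bytes_processed:,} bytes")
print(f"pruned:    {pruned.total_bytes_processed:,} bytes")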

4.3 Augmented Events Table (Session-level Channel/Device)

Create session-level augmented data for marketing analysis.

-- Augmented events (session-level channel/device/landing generated)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.events_augmented`
PARTITION BY session_date
CLUSTER BY channel_key, device_key AS
WITH base AS (
  SELECT
    e.session_id,
    ANY_VALUE(e.user_id) AS user_id,
    MIN(e.created_at) AS session_start_at,
    DATE(MIN(e.created_at)) AS session_date,
    COUNT(*) AS events_in_session,
    ABS(MOD(FARM_FINGERPRINT(CAST(e.session_id AS STRING)), 1000000)) AS h
  FROM `your-project-id.retail_analytics_us.src_events` e
  GROUP BY e.session_id
),
cooked AS (
  SELECT
    b.session_id,
    b.user_id,
    b.session_start_at,
    b.session_date,
    b.events_in_session,
    b.h,
    CASE MOD(b.h, 7)
      WHEN 0 THEN 'organic'
      WHEN 1 THEN 'paid_search'
      WHEN 2 THEN 'paid_social'
      WHEN 3 THEN 'display'
      WHEN 4 THEN 'email'
      WHEN 5 THEN 'referral'
      ELSE 'direct'
    END AS channel_key,
    CASE MOD(DIV(b.h, 7), 3)
      WHEN 0 THEN 'desktop'
      WHEN 1 THEN 'mobile'
      ELSE 'tablet'
    END AS device_key,
    CASE MOD(DIV(b.h, 21), 5)
      WHEN 0 THEN 'https://shop.example.com/home'
      WHEN 1 THEN 'https://shop.example.com/women/heels'
      WHEN 2 THEN 'https://shop.example.com/men/tee'
      WHEN 3 THEN 'https://shop.example.com/accessories/belts'
      ELSE 'https://shop.example.com/watches/quartz'
    END AS landing_page
  FROM base b
)
SELECT
  c.session_id,
  c.user_id,
  c.session_start_at,
  c.session_date,
  c.events_in_session,
  c.channel_key,
  c.device_key,
  c.landing_page,
  LEAST(c.events_in_session, 10) AS pageviews_est,
  CAST(LEAST(c.events_in_session, 10) * (0.15 + MOD(DIV(c.h, 105), 5) * 0.05) AS INT64) AS add_to_cart_est
FROM cooked c;
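Because channel_key is derived from MOD(h, 7), sessions should spread across the seven channels in roughly equal shares. A quick sanity check, again assuming the Python setup from Section 6:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# Channel/device shares should look roughly uniform across buckets.
print(client.query("""
SELECT channel_key, device_key, COUNT(*) AS sessions
FROM `your-project-id.retail_analytics_us.events_augmented`
GROUP BY channel_key, device_key
ORDER BY sessions DESC
""").to_dataframe())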

4.4 CS Ticket Dummy Data (for Project 1)

Generate ticket data for customer service analysis projects.

-- CS ticket dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.cs_tickets_dummy`
PARTITION BY DATE(opened_at)
CLUSTER BY country_name, category, issue_type AS
WITH user_pool AS (
  SELECT user_id, country, ROW_NUMBER() OVER (ORDER BY user_id) AS rn
  FROM `your-project-id.retail_analytics_us.src_users`
),
user_count AS (
  SELECT COUNT(*) AS total_users FROM user_pool
),
ticket_numbers AS (
  SELECT off FROM UNNEST(GENERATE_ARRAY(1, 5000)) AS off
),
gen AS (
  SELECT
    t.off,
    CONCAT('TKT_', CAST(100000 + t.off AS STRING)) AS ticket_id,
    u.user_id,
    TIMESTAMP_SUB(
      CURRENT_TIMESTAMP(),
      INTERVAL CAST(t.off * 30 + MOD(ABS(FARM_FINGERPRINT(CONCAT('ts', CAST(t.off AS STRING)))), 90*24*60) AS INT64) MINUTE
    ) AS opened_at,
    t.off AS row_num
  FROM ticket_numbers t
  CROSS JOIN user_count uc
  LEFT JOIN user_pool u
    ON u.rn = 1 + MOD(ABS(FARM_FINGERPRINT(CAST(t.off AS STRING))), uc.total_users)
)
SELECT
  ticket_id,
  user_id,
  opened_at,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resp'))), 10) < 7
       THEN TIMESTAMP_SUB(opened_at, INTERVAL -1 * CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resptime'))), 48) AS INT64) HOUR)
  END AS first_response_at,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resv'))), 10) < 6
       THEN TIMESTAMP_SUB(opened_at, INTERVAL -1 * CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-resvtime'))), 120) AS INT64) HOUR)
  END AS resolved_at,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 15 THEN 'escalated'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 25 THEN 'pending'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-status'))), 100) < 35 THEN 'open'
    ELSE 'solved'
  END AS status,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 10 THEN 'urgent'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 40 THEN 'high'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-priority'))), 100) < 70 THEN 'normal'
    ELSE 'low'
  END AS priority,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-channel'))), 10) < 4 THEN 'email'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-channel'))), 10) < 7 THEN 'chat'
    ELSE 'phone'
  END AS channel,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 25 THEN 'size'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 50 THEN 'quality'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 70 THEN 'shipping'
    WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-issue'))), 100) < 85 THEN 'payment'
    ELSE 'refund'
  END AS issue_type,
  CONCAT('AGT_', CAST(100 + MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-agent'))), 20) AS STRING)) AS agent_id,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-country'))), 2) = 0
       THEN 'United States' ELSE 'Korea, Republic of'
  END AS country_name,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-category'))), 2) = 0
       THEN 'Womens Shoes' ELSE 'Mens Apparel'
  END AS category,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-sat'))), 10) < 6
       THEN CAST(1 + MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket_id, '-satval'))), 5) AS INT64)
  END AS satisfaction_score,
  CASE WHEN MOD(row_num, 2) = 0
       THEN 'Thanks for quick support'
       ELSE 'I want to return due to size issue'
  END AS comment
FROM gen;
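The generator above relies on deterministic hashing: MOD(ABS(FARM_FINGERPRINT(key)), N) assigns each key a stable bucket, and the CASE thresholds convert buckets into categories with fixed proportions (for example, buckets 0-14 of 100 become 'escalated', roughly 15% of tickets). Unlike RAND(), the result is reproducible. A small demonstration, assuming the Python setup from Section 6:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# FARM_FINGERPRINT is deterministic: rerunning this yields identical buckets.
print(client.query("""
SELECT
  ticket,
  MOD(ABS(FARM_FINGERPRINT(CONCAT(ticket, '-priority'))), 100) AS priority_bucket
FROM UNNEST(['TKT_100001', 'TKT_100002', 'TKT_100003']) AS ticket
""").to_dataframe())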

4.5 Satisfaction/NPS Survey Dummy Data

Generate survey responses linked to a sample of the CS tickets created above.

-- Satisfaction/NPS survey dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.survey_cs_dummy`
PARTITION BY DATE(sent_at)
CLUSTER BY country_name, category AS
WITH tickets_with_index AS (
  SELECT ticket_id, user_id, ROW_NUMBER() OVER (ORDER BY ticket_id) AS rn
  FROM `your-project-id.retail_analytics_us.cs_tickets_dummy`
  WHERE status IN ('solved', 'pending')
),
ticket_count AS (
  SELECT COUNT(*) AS total_tickets FROM tickets_with_index
),
survey_numbers AS (
  SELECT off FROM UNNEST(GENERATE_ARRAY(1, 1500)) AS off
)
SELECT
  CONCAT('SVY_', CAST(100000 + s.off AS STRING)) AS survey_id,
  t.user_id,
  TIMESTAMP_SUB(
    CURRENT_TIMESTAMP(),
    INTERVAL CAST(s.off * 40 + MOD(ABS(FARM_FINGERPRINT(CONCAT('svy', CAST(s.off AS STRING)))), 60*24*60) AS INT64) MINUTE
  ) AS sent_at,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-complete'))), 10) < 6
       THEN TIMESTAMP_SUB(
         CURRENT_TIMESTAMP(),
         INTERVAL CAST(s.off * 40 + MOD(ABS(FARM_FINGERPRINT(CONCAT('svy-c', CAST(s.off AS STRING)))), 60*24*55) AS INT64) MINUTE
       )
  END AS completed_at,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-nps'))), 10) < 6
       THEN CAST(MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-npsval'))), 11) AS INT64)
  END AS nps_score,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-csat'))), 10) < 7
       THEN CAST(1 + MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-csatval'))), 5) AS INT64)
  END AS csat_score,
  CASE WHEN MOD(s.off, 2) = 0 THEN 'Good quality' ELSE 'Delivery took longer than expected' END AS free_text,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-hasticket'))), 10) < 4
       THEN t.ticket_id
  END AS related_ticket_id,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-country'))), 2) = 0
       THEN 'United States' ELSE 'Korea, Republic of'
  END AS country_name,
  CASE WHEN MOD(ABS(FARM_FINGERPRINT(CONCAT('SVY_', CAST(s.off AS STRING), '-cat'))), 2) = 0
       THEN 'Womens Shoes' ELSE 'Mens Apparel'
  END AS category
FROM survey_numbers s
CROSS JOIN ticket_count tc
LEFT JOIN tickets_with_index t
  ON t.rn = 1 + MOD(ABS(FARM_FINGERPRINT(CAST(s.off AS STRING))), tc.total_tickets);

4.6 Return Reason Dummy Data

Generate return-reason labels for roughly a 10% sample of order items. Note that this script uses RAND(), so the sampled rows differ on each run.

-- Return reason label dummy
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.returns_reason_dummy`
PARTITION BY return_date
CLUSTER BY reason_code, country_name AS
WITH order_items_sample AS (
  SELECT
    oi.order_id,
    oi.product_id,
    o.created_at,
    u.country AS country_name,
    ROW_NUMBER() OVER (ORDER BY oi.order_id, oi.product_id) AS row_num
  FROM `your-project-id.retail_analytics_us.src_order_items` oi
  INNER JOIN `your-project-id.retail_analytics_us.src_orders` o
    ON oi.order_id = o.order_id
  INNER JOIN `your-project-id.retail_analytics_us.src_users` u
    ON o.user_id = u.user_id
  WHERE RAND() < 0.1
    AND oi.returned_at IS NULL
),
reasons AS (
  SELECT * FROM UNNEST(['size_issue','defect','damaged','changed_mind','shipping_delay']) AS reason_code
  WITH OFFSET AS reason_offset
),
responsibilities AS (
  SELECT * FROM UNNEST(['merchant','logistics','customer']) AS responsibility
  WITH OFFSET AS resp_offset
)
SELECT
  order_id,
  product_id,
  DATE_ADD(DATE(created_at), INTERVAL CAST(7 + MOD(row_num, 30) AS INT64) DAY) AS return_date,
  (SELECT reason_code FROM reasons
   WHERE reason_offset = MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(order_id AS STRING), '-', CAST(product_id AS STRING), '-reason'))), 5)
   LIMIT 1) AS reason_code,
  'auto-generated reason' AS reason_text,
  (SELECT responsibility FROM responsibilities
   WHERE resp_offset = MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(order_id AS STRING), '-', CAST(product_id AS STRING), '-resp'))), 3)
   LIMIT 1) AS responsibility,
  country_name
FROM order_items_sample;

4.7 External/Strategic Signal Dummy (for Project 2)

Generate external demand, competitive, and macro/weather signals. These tables also use RAND(), so values change each time the script runs.

-- Demand trends (similar to Google Trends)
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_demand_trends`
PARTITION BY week
CLUSTER BY keyword, region_code AS
WITH weeks AS (
  SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off WEEK), WEEK(MONDAY)) AS week
  FROM UNNEST(GENERATE_ARRAY(0, 26)) AS off
),
kw AS (SELECT * FROM UNNEST(['heels','women shoes','men clothing','handbags','watches']) AS keyword),
rg AS (SELECT * FROM UNNEST(['US','KR']) AS region_code)
SELECT week, keyword, region_code, 50 + CAST(FLOOR(30*RAND()) AS INT64) AS score
FROM weeks CROSS JOIN kw CROSS JOIN rg;

-- Competitive/technology signals
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_competitive_signal`
PARTITION BY period
CLUSTER BY topic, region_code AS
WITH months AS (
  SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off MONTH), MONTH) AS period
  FROM UNNEST(GENERATE_ARRAY(0, 12)) AS off
),
topics AS (SELECT * FROM UNNEST(['fabric','eco','sportswear','accessories']) AS topic),
regions AS (SELECT * FROM UNNEST(['US','KR']) AS region_code)
SELECT
  m.period AS period,
  t.topic AS topic,
  r.region_code AS region_code,
  ROUND(10 + RAND()*90, 2) AS signal,
  (SELECT AS VALUE s FROM UNNEST(['patent','news','social']) s ORDER BY RAND() LIMIT 1) AS source
FROM months m CROSS JOIN topics t CROSS JOIN regions r;

-- Macro/weather adjustment
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.ext_macro_weather`
PARTITION BY period
CLUSTER BY country_iso AS
WITH months AS (
  SELECT DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL off MONTH), MONTH) AS period
  FROM UNNEST(GENERATE_ARRAY(0, 12)) AS off
),
countries AS (SELECT * FROM UNNEST(['US','KR']) AS country_iso)
SELECT
  m.period AS period,
  c.country_iso AS country_iso,
  ROUND(3 + RAND()*7, 2) AS unemployment_rate,
  ROUND(80 + RAND()*40, 2) AS income_index,
  ROUND(40 + RAND()*60, 2) AS season_temp_idx
FROM months m CROSS JOIN countries c;

4.8 Marketing Campaign Dummy (for Project 3)

Generate campaign metadata and daily channel spend/performance rows.

-- Campaign metadata
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.mkt_campaigns_dummy`
PARTITION BY start_date
CLUSTER BY channel_key, target_category AS
WITH ids AS (
  SELECT CONCAT('CMP_', CAST(1000 + off AS STRING)) AS campaign_id
  FROM UNNEST(GENERATE_ARRAY(1, 20)) AS off
)
SELECT
  campaign_id,
  CONCAT('Campaign ', CAST(ROW_NUMBER() OVER() AS STRING)) AS campaign_name,
  (SELECT AS VALUE c FROM UNNEST(['paid_search','paid_social','display','email']) c ORDER BY RAND() LIMIT 1) AS channel_key,
  DATE_SUB(CURRENT_DATE(), INTERVAL CAST(RAND()*90 AS INT64) DAY) AS start_date,
  DATE_ADD(CURRENT_DATE(), INTERVAL CAST(RAND()*30 AS INT64) DAY) AS end_date,
  (SELECT AS VALUE cat FROM UNNEST(['Womens Shoes','Mens Apparel','Accessories','Watches','Bags']) cat ORDER BY RAND() LIMIT 1) AS target_category,
  (SELECT AS VALUE cn FROM UNNEST(['United States','Korea, Republic of']) cn ORDER BY RAND() LIMIT 1) AS target_country,
  (SELECT AS VALUE o FROM UNNEST(['awareness','traffic','conversion','retention']) o ORDER BY RAND() LIMIT 1) AS objective,
  (SELECT AS VALUE k FROM UNNEST(['CTR','CVR','ROAS','LTV']) k ORDER BY RAND() LIMIT 1) AS kpi,
  ROUND(1000 + RAND()*9000, 2) AS budget
FROM ids;

-- Channel spend/performance
CREATE OR REPLACE TABLE `your-project-id.retail_analytics_us.mkt_channel_spend_dummy`
PARTITION BY date
CLUSTER BY channel_key, campaign_id, country_name AS
WITH cal AS (
  SELECT DATE_SUB(CURRENT_DATE(), INTERVAL off DAY) AS date
  FROM UNNEST(GENERATE_ARRAY(0, 90)) AS off
),
ch AS (SELECT * FROM UNNEST(['paid_search','paid_social','display','email','referral','direct']) AS channel_key),
co AS (SELECT * FROM UNNEST(['United States','Korea, Republic of']) AS country_name),
dv AS (SELECT * FROM UNNEST(['desktop','mobile']) AS device_key)
SELECT
  date,
  channel_key,
  (SELECT campaign_id FROM `your-project-id.retail_analytics_us.mkt_campaigns_dummy` ORDER BY RAND() LIMIT 1) AS campaign_id,
  CAST(1000 + RAND()*9000 AS INT64) AS impressions,
  CAST(100 + RAND()*2000 AS INT64) AS clicks,
  ROUND(100 + RAND()*800, 2) AS spend,
  CONCAT('https://shop.example.com/lp/', channel_key) AS landing_page,
  country_name,
  device_key
FROM cal, ch, co, dv;

5. Verify Creation Results

Verify that all tables were created successfully.

-- Check table list and row counts.
-- Note: INFORMATION_SCHEMA.TABLES has no row_count column; the __TABLES__
-- meta-table does.
SELECT
  table_id AS table_name,
  row_count
FROM `your-project-id.retail_analytics_us.__TABLES__`
ORDER BY table_id;
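The same check can be done from Python with the client library's list_tables; a minimal sketch, assuming the setup from Section 6:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# Print every table in the dataset.
for table in client.list_tables("your-project-id.retail_analytics_us"):
    print(table.table_id)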

Expected table list:

Table Name               Description
src_products             Product master
src_orders               Order header
src_order_items          Order details
src_events               Web event logs
src_users                Customer information
events_augmented         Session-level augmented data
cs_tickets_dummy         CS tickets
survey_cs_dummy          Satisfaction surveys
returns_reason_dummy     Return reasons
ext_demand_trends        Demand trends
ext_competitive_signal   Competitive signals
ext_macro_weather        Macroeconomic/weather
mkt_campaigns_dummy      Marketing campaigns
mkt_channel_spend_dummy  Channel spend

6. Connect to BigQuery from Python

6.1 Install Required Packages

pip install google-cloud-bigquery pandas db-dtypes pyarrow

db-dtypes and pyarrow are used by the client library when converting query results to a pandas DataFrame.

6.2 Set Up Authentication

import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-service-account-key.json"
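If you'd rather not rely on the environment variable, the client also accepts explicit credentials. A sketch using google-auth, which is installed as a dependency of google-cloud-bigquery:

from google.cloud import bigquery
from google.oauth2 import service_account

# Load the JSON key downloaded in Section 3.3.
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/your-service-account-key.json"
)
client = bigquery.Client(project="your-project-id", credentials=credentials)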

6.3 Test Connection

from google.cloud import bigquery
import pandas as pd

# Initialize client
client = bigquery.Client(project='your-project-id')

# Run test query
query = """
SELECT
  COUNT(*) AS total_orders,
  COUNT(DISTINCT user_id) AS unique_customers,
  SUM(num_of_item) AS total_items
FROM `your-project-id.retail_analytics_us.src_orders`
"""
result = client.query(query).to_dataframe()

print("✅ BigQuery connection successful!")
print(result)

Expected output:

✅ BigQuery connection successful!
   total_orders  unique_customers  total_items
0         80643             49891       158846

Your exact numbers will likely differ, since the public thelook_ecommerce dataset is updated over time.

6.4 Set Environment Variables (Optional)

Rather than setting the key path in every script, you can export it (along with the project ID) once in your shell.

macOS / Linux

# Add to ~/.bashrc or ~/.zshrc
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account-key.json"
export GCP_PROJECT_ID="your-project-id"

Windows

setx GOOGLE_APPLICATION_CREDENTIALS "C:\path\to\your-service-account-key.json"
setx GCP_PROJECT_ID "your-project-id"
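With these variables set, scripts can pick up the project ID instead of hard-coding it. A minimal sketch; note that GCP_PROJECT_ID is this guide's own convention, while GOOGLE_APPLICATION_CREDENTIALS is read automatically by the client:

import os

from google.cloud import bigquery

# The key path is picked up from GOOGLE_APPLICATION_CREDENTIALS;
# the project ID comes from the export above.
client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"])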

Next Steps

Your environment setup is complete!
