Product Analysis with String Functions
BigQuery Execution Environment Setup
from google.cloud import bigquery
import pandas as pd
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
# Query execution function
def run_query(query):
return client.query(query).to_dataframe()Learning Objectives
After completing this recipe, you will be able to:
- Combine strings and format data with CONCAT
- Pattern matching and keyword search with LIKE/CONTAINS
- Extract substrings with SUBSTR/LEFT/RIGHT
- Normalize strings with UPPER/LOWER/TRIM
- Split and structure strings with SPLIT
- Clean and analyze real-world text data
1. CONCAT - String Concatenation
Theory
CONCAT combines multiple strings into one.
Syntax
CONCAT(string1, string2, string3, ...)Similar Functions
CONCAT(a, b, c): Combines all values (NULL treated as empty string)||operator:a || b || c(BigQuery)CONCAT_WS(sep, a, b, c): Combine with separator
Use Cases
- Creating full names
- Address formatting
- ID generation
- Report text creation
Practice: Creating Customer Full Name and Location Info
SELECT
user_id,
first_name,
last_name,
CONCAT(first_name, ' ', last_name) AS full_name,
state,
country,
CONCAT(IFNULL(state, 'Unknown'), ', ', IFNULL(country, 'Unknown')) AS location,
email,
CONCAT('Customer ID: ', CAST(user_id AS STRING), ' - ', first_name, ' ', last_name) AS customer_label
FROM `your-project-id.thelook_ecommerce.users`
WHERE first_name IS NOT NULL AND last_name IS NOT NULL
LIMIT 20Execution Result:
| user_id | first_name | last_name | full_name | location | customer_label |
|---|---|---|---|---|---|
| 46787 | Shelly | Rice | Shelly Rice | Guangdong, China | Customer ID: 46787 - Shelly Rice |
| 94573 | Emily | Brown | Emily Brown | Zhejiang, China | Customer ID: 94573 - Emily Brown |
Quiz 1: Create Full Product Name
Problem
From the products table:
- Create full_product_name in “[Brand] Product Name (Category)” format with CONCAT
- Create product_info including price: “Product Name - $Price”
- Query only the top 15 products by revenue (JOIN with order_items)
Hint: CONCAT(’[’, brand, ’] ’, name, ’ (’, category, ’)‘)
View Answer
WITH product_revenue AS (
SELECT
p.id AS product_id,
p.name,
p.brand,
p.category,
p.retail_price,
SUM(oi.sale_price) AS total_revenue
FROM `your-project-id.thelook_ecommerce.products` p
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON p.id = oi.product_id
GROUP BY p.id, p.name, p.brand, p.category, p.retail_price
)
SELECT
product_id,
name,
brand,
category,
retail_price,
total_revenue,
CONCAT('[', brand, '] ', name, ' (', category, ')') AS full_product_name,
CONCAT(name, ' - $', CAST(retail_price AS STRING)) AS product_info
FROM product_revenue
ORDER BY total_revenue DESC
LIMIT 15Execution Result:
| product_id | name | brand | total_revenue | full_product_name | product_info |
|---|---|---|---|---|---|
| 15234 | Jeans Classic 15234 | Diesel | 12580.50 | [Diesel] Jeans Classic 15234 (Jeans) | Jeans Classic 15234 - $89.99 |
| 8456 | Jacket Premium 8456 | Carhartt | 11250.00 | [Carhartt] Jacket Premium 8456 (Outerwear & Coats) | Jacket Premium 8456 - $159.99 |
| 22891 | Sweater Vintage 22891 | Calvin Klein | 10890.75 | [Calvin Klein] Sweater Vintage 22891 (Sweaters) | Sweater Vintage 22891 - $75.50 |
Practical Applications:
- Report Generation: User-friendly product names
- ID Generation: Unique keys combining multiple columns
- Labeling: Dashboard and chart labels
2. LIKE/CONTAINS - Pattern Matching
Theory
Used for text search and filtering.
LIKE Patterns
%: 0 or more arbitrary characters_: Exactly 1 arbitrary character'%dress%': Any string containing ‘dress’'dress%': Starts with ‘dress’'%dress': Ends with ‘dress’
CONTAINS (BigQuery)
WHERE CONTAINS(column, 'keyword')Case Sensitivity
LIKE: Case-sensitiveLOWER(column) LIKE LOWER(pattern): Case-insensitive
Practice: Categorize Products by Keyword
WITH keyword_products AS (
SELECT
p.id AS product_id,
p.name,
p.category,
p.brand,
p.retail_price,
CASE
WHEN LOWER(p.name) LIKE '%shirt%' THEN 'Shirt'
WHEN LOWER(p.name) LIKE '%pant%' THEN 'Pants'
WHEN LOWER(p.name) LIKE '%jean%' THEN 'Jeans'
WHEN LOWER(p.name) LIKE '%dress%' THEN 'Dress'
WHEN LOWER(p.name) LIKE '%jacket%' THEN 'Jacket'
ELSE 'Other'
END AS product_type
FROM `your-project-id.thelook_ecommerce.products` p
)
SELECT
product_type,
COUNT(*) AS product_count,
AVG(retail_price) AS avg_price
FROM keyword_products
GROUP BY product_type
ORDER BY product_count DESCExecution Result:
| product_type | product_count | avg_price |
|---|---|---|
| Other | 19424 | 52.12 |
| Pants | 2867 | 50.06 |
| Jeans | 2073 | 93.34 |
| Shirt | 1873 | 40.05 |
| Jacket | 1563 | 129.62 |
| Dress | 1320 | 73.91 |
Quiz 2: Revenue Analysis by Keyword
Problem
JOIN products and order_items to:
- Search for ‘sweater’, ‘hoodie’, ‘coat’ keywords with LIKE pattern
- Aggregate total revenue and sales quantity by keyword
- Calculate average sale price
Hint: LOWER(name) LIKE ‘%sweater%‘
View Answer
WITH keyword_sales AS (
SELECT
p.id AS product_id,
p.name,
oi.sale_price,
CASE
WHEN LOWER(p.name) LIKE '%sweater%' THEN 'Sweater'
WHEN LOWER(p.name) LIKE '%hoodie%' THEN 'Hoodie'
WHEN LOWER(p.name) LIKE '%coat%' THEN 'Coat'
ELSE 'Other'
END AS keyword
FROM `your-project-id.thelook_ecommerce.products` p
JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON p.id = oi.product_id
)
SELECT
keyword,
COUNT(*) AS sales_count,
SUM(sale_price) AS total_revenue,
AVG(sale_price) AS avg_sale_price
FROM keyword_sales
WHERE keyword != 'Other'
GROUP BY keyword
ORDER BY total_revenue DESCExecution Result:
| keyword | sales_count | total_revenue | avg_sale_price |
|---|---|---|---|
| Sweater | 8656 | 610092.34 | 70.48 |
| Coat | 4589 | 526580.48 | 114.75 |
| Hoodie | 6966 | 373351.96 | 53.60 |
Performance Considerations:
%keyword%: Full table scan (slow)keyword%: Can use index (fast)- Large datasets: Consider full-text search engines
3. SUBSTR/LEFT/RIGHT - Substring Extraction
Theory
Extract only a part of a string.
Functions
SUBSTR(string, start, length): From specified position, specified lengthLEFT(string, length): N characters from the leftRIGHT(string, length): N characters from the right
Examples
SUBSTR('Hello World', 1, 5) → 'Hello'
LEFT('Hello World', 5) → 'Hello'
RIGHT('Hello World', 5) → 'World'Index
- BigQuery: Starts from 1 (1-indexed)
Practice: SKU Code Analysis
SELECT
id AS product_id,
name,
sku,
LEFT(sku, 3) AS sku_prefix,
SUBSTR(sku, 4, 3) AS sku_middle,
RIGHT(sku, 3) AS sku_suffix,
LENGTH(sku) AS sku_length
FROM `your-project-id.thelook_ecommerce.products`
WHERE sku IS NOT NULL
LIMIT 30Execution Result:
| product_id | name | sku | sku_prefix | sku_middle | sku_suffix | sku_length |
|---|---|---|---|---|---|---|
| 13844 | (ONE) 1 Satin Headband | 2A3E95… | 2A3 | E95 | 4C8 | 32 |
| 14086 | CHEER Rhinestone… | 8EFA90… | 8EF | A90 | 4AD | 32 |
Quiz 3: Product Code Pattern Analysis
Problem
From the products table:
- Extract first 2 characters of SKU as category code
- Extract last 2 characters as product type code
- Aggregate product count and average price by category code
Hint: LEFT(sku, 2), RIGHT(sku, 2)
View Answer
SELECT
LEFT(sku, 2) AS category_code,
RIGHT(sku, 2) AS type_code,
COUNT(*) AS product_count,
AVG(retail_price) AS avg_price,
MIN(retail_price) AS min_price,
MAX(retail_price) AS max_price
FROM `your-project-id.thelook_ecommerce.products`
WHERE sku IS NOT NULL
GROUP BY category_code, type_code
HAVING COUNT(*) >= 5
ORDER BY product_count DESC
LIMIT 20Execution Result:
| category_code | type_code | product_count | avg_price | min_price | max_price |
|---|---|---|---|---|---|
| 2A | C7 | 125 | 85.50 | 12.99 | 450.00 |
| 8E | D3 | 118 | 92.30 | 15.00 | 380.00 |
| 1C | D9 | 112 | 78.25 | 10.50 | 320.00 |
Practical Applications:
- SKU System Analysis: Understanding product code rules
- Inventory Management: Stock status by code
- Auto-categorization: Category assignment based on code
4. UPPER/LOWER/TRIM - String Normalization
Theory
Used for data cleaning and standardization.
Functions
UPPER(string): Convert to uppercaseLOWER(string): Convert to lowercaseTRIM(string): Remove leading and trailing whitespaceLTRIM(string): Remove left whitespaceRTRIM(string): Remove right whitespace
Use Cases
- Unify case
- Remove whitespace
- Remove duplicates
- Data standardization
Practice: Normalize Brand Names
WITH normalized_brands AS (
SELECT
brand AS original_brand,
UPPER(TRIM(brand)) AS normalized_brand,
COUNT(*) AS product_count
FROM `your-project-id.thelook_ecommerce.products`
WHERE brand IS NOT NULL
GROUP BY brand
)
SELECT
normalized_brand,
SUM(product_count) AS total_products,
COUNT(DISTINCT original_brand) AS brand_variations
FROM normalized_brands
GROUP BY normalized_brand
HAVING COUNT(DISTINCT original_brand) > 1 OR SUM(product_count) >= 10
ORDER BY brand_variations DESC, total_products DESC
LIMIT 20Execution Result:
| normalized_brand | total_products | brand_variations |
|---|---|---|
| VIP BOUTIQUE | 29 | 3 |
| TURKISHTOWELS | 7 | 3 |
| HUGO BOSS | 135 | 2 |
| ICEBREAKER | 70 | 2 |
- Remove Duplicates: Treat ‘Gmail.com’ and ‘gmail.com’ as the same
- Whitespace Issues: Unify ’ user@example.com ’ and ‘user@example.com’
- Consistency: Improve analysis and aggregation accuracy
Quiz 4: Email Domain Normalization and Analysis
Problem
From the users table:
- Convert email addresses to lowercase and remove whitespace
- Extract domain after ’@’ (using SPLIT)
- Aggregate customer count by domain
Hint: LOWER(TRIM(email)), SPLIT(email, ’@’)[OFFSET(1)]
View Answer
WITH normalized_emails AS (
SELECT
id AS user_id,
email,
LOWER(TRIM(email)) AS clean_email,
SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)] AS email_domain
FROM `your-project-id.thelook_ecommerce.users`
WHERE email IS NOT NULL
)
SELECT
email_domain,
COUNT(*) AS customer_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM normalized_emails), 2) AS percentage
FROM normalized_emails
WHERE email_domain IS NOT NULL
GROUP BY email_domain
ORDER BY customer_count DESC
LIMIT 10Execution Result:
| email_domain | customer_count | percentage |
|---|---|---|
| example.net | 33601 | 33.60 |
| example.org | 33212 | 33.21 |
| example.com | 33187 | 33.19 |
Practical Applications:
- Email marketing segmentation
- Distinguishing company email vs personal email
- B2B vs B2C analysis
5. SPLIT - String Splitting
Theory
Splits a string into an array based on a delimiter.
Syntax
SPLIT(string, delimiter)
-- Result: ARRAYArray Indexing
SPLIT('a,b,c', ',')[OFFSET(0)] → 'a'
SPLIT('a,b,c', ',')[OFFSET(1)] → 'b'
SPLIT('a,b,c', ',')[SAFE_OFFSET(10)] → NULL (safe)Use Cases
- CSV parsing
- Tag separation
- Path decomposition
- Email, URL analysis
Practice: Email Analysis
WITH email_parts AS (
SELECT
id AS user_id,
email,
SPLIT(email, '@')[SAFE_OFFSET(0)] AS username,
SPLIT(email, '@')[SAFE_OFFSET(1)] AS domain,
ARRAY_LENGTH(SPLIT(email, '@')) AS parts_count
FROM `your-project-id.thelook_ecommerce.users`
WHERE email IS NOT NULL
AND email LIKE '%@%'
)
SELECT
user_id,
email,
username,
domain,
LENGTH(username) AS username_length,
CASE
WHEN domain LIKE '%.com' THEN 'Commercial'
WHEN domain LIKE '%.org' THEN 'Organization'
WHEN domain LIKE '%.edu' THEN 'Education'
WHEN domain LIKE '%.gov' THEN 'Government'
ELSE 'Other'
END AS domain_type
FROM email_parts
WHERE parts_count = 2
LIMIT 30Execution Result:
| user_id | username | domain | username_length | domain_type | |
|---|---|---|---|---|---|
| 97381 | dianamartinez@example.org | dianamartinez | example.org | 13 | Organization |
| 30683 | breannafrancis@example.org | breannafrancis | example.org | 14 | Organization |
Quiz 5: Extract Keywords from Product Names
Problem
From the products table:
- SPLIT product names by space
- Extract the first word as the main keyword
- Aggregate product count and average price by keyword
Hint: SPLIT(name, ’ ’)[SAFE_OFFSET(0)]
View Answer
WITH product_keywords AS (
SELECT
id AS product_id,
name,
retail_price,
UPPER(SPLIT(TRIM(name), ' ')[SAFE_OFFSET(0)]) AS first_keyword
FROM `your-project-id.thelook_ecommerce.products`
WHERE name IS NOT NULL
)
SELECT
first_keyword,
COUNT(*) AS product_count,
AVG(retail_price) AS avg_price,
MIN(retail_price) AS min_price,
MAX(retail_price) AS max_price
FROM product_keywords
WHERE first_keyword IS NOT NULL
AND LENGTH(first_keyword) > 2
GROUP BY first_keyword
HAVING COUNT(*) >= 5
ORDER BY product_count DESC
LIMIT 15Execution Result:
| first_keyword | product_count | avg_price | min_price | max_price |
|---|---|---|---|---|
| ALLEGRA | 1034 | 14.13 | 3.28 | 41.63 |
| CALVIN | 615 | 64.26 | 10.00 | 259.99 |
| CARHARTT | 387 | 68.86 | 7.00 | 448.99 |
| TOMMY | 324 | 70.88 | 9.00 | 289.99 |
| VOLCOM | 293 | 58.54 | 11.96 | 320.00 |
SPLIT Use Scenarios:
- Tag Analysis: Separate comma-delimited tags
- Path Parsing: Decompose URLs, file paths
- Keyword Extraction: Core words from product names, descriptions
Summary
String Functions Covered
| Function | Use Case | Example |
|---|---|---|
CONCAT() | String concatenation | CONCAT(first_name, ’ ’, last_name) |
LIKE | Pattern matching | name LIKE ‘%shirt%‘ |
SUBSTR() | Substring | SUBSTR(sku, 1, 3) |
LEFT/RIGHT() | First/last N characters | LEFT(sku, 2) |
UPPER/LOWER() | Case conversion | UPPER(brand) |
TRIM() | Remove whitespace | TRIM(email) |
SPLIT() | Split string | SPLIT(email, ’@‘) |
Additional Useful Functions
| Function | Description |
|---|---|
LENGTH() | String length |
REPLACE() | String replacement |
REGEXP_CONTAINS() | Regular expression matching |
REGEXP_EXTRACT() | Regular expression extraction |
FORMAT() | Formatting |
Data Cleaning Patterns
-- 1. Normalization
UPPER(TRIM(column))
-- 2. Extract email domain
SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)]
-- 3. Code parsing
LEFT(sku, 3) AS category_code
-- 4. Keyword search
WHERE LOWER(name) LIKE '%keyword%'
-- 5. Formatting
CONCAT('[', brand, '] ', name)Next Steps
You’ve mastered string functions! Now learn Python-based data analysis in the Pandas track, or study data visualization techniques in Visualization.