Product Analysis with String Functions

BeginnerIntermediate

BigQuery Execution Environment Setup


from google.cloud import bigquery
import pandas as pd
 
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
 
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
 
# Query execution function
def run_query(query):
    return client.query(query).to_dataframe()

Learning Objectives

After completing this recipe, you will be able to:

Combine strings and format data with CONCAT
Pattern matching and keyword search with LIKE/CONTAINS
Extract substrings with SUBSTR/LEFT/RIGHT
Normalize strings with UPPER/LOWER/TRIM
Split and structure strings with SPLIT
Clean and analyze real-world text data

1. CONCAT - String Concatenation

Theory

CONCAT combines multiple strings into one.

Syntax


CONCAT(string1, string2, string3, ...)

Similar Functions

CONCAT(a, b, c): Combines all values (NULL treated as empty string)
|| operator: a || b || c (BigQuery)
CONCAT_WS(sep, a, b, c): Combine with separator

Use Cases

Creating full names
Address formatting
ID generation
Report text creation

Practice: Creating Customer Full Name and Location Info


SELECT
    user_id,
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name,
    state,
    country,
    CONCAT(IFNULL(state, 'Unknown'), ', ', IFNULL(country, 'Unknown')) AS location,
    email,
    CONCAT('Customer ID: ', CAST(user_id AS STRING), ' - ', first_name, ' ', last_name) AS customer_label
FROM `your-project-id.thelook_ecommerce.users`
WHERE first_name IS NOT NULL AND last_name IS NOT NULL
LIMIT 20

Execution Result:

user_id	first_name	last_name	full_name	location	customer_label
46787	Shelly	Rice	Shelly Rice	Guangdong, China	Customer ID: 46787 - Shelly Rice
94573	Emily	Brown	Emily Brown	Zhejiang, China	Customer ID: 94573 - Emily Brown

Quiz 1: Create Full Product Name

Problem

From the products table:

Create full_product_name in “[Brand] Product Name (Category)” format with CONCAT
Create product_info including price: “Product Name - $Price”
Query only the top 15 products by revenue (JOIN with order_items)

Hint: CONCAT(’[’, brand, ’] ’, name, ’ (’, category, ’)‘)

View Answer


WITH product_revenue AS (
    SELECT
        p.id AS product_id,
        p.name,
        p.brand,
        p.category,
        p.retail_price,
        SUM(oi.sale_price) AS total_revenue
    FROM `your-project-id.thelook_ecommerce.products` p
    JOIN `your-project-id.thelook_ecommerce.order_items` oi
        ON p.id = oi.product_id
    GROUP BY p.id, p.name, p.brand, p.category, p.retail_price
)
SELECT
    product_id,
    name,
    brand,
    category,
    retail_price,
    total_revenue,
    CONCAT('[', brand, '] ', name, ' (', category, ')') AS full_product_name,
    CONCAT(name, ' - $', CAST(retail_price AS STRING)) AS product_info
FROM product_revenue
ORDER BY total_revenue DESC
LIMIT 15

Execution Result:

product_id	name	brand	total_revenue	full_product_name	product_info
15234	Jeans Classic 15234	Diesel	12580.50	[Diesel] Jeans Classic 15234 (Jeans)	Jeans Classic 15234 - $89.99
8456	Jacket Premium 8456	Carhartt	11250.00	[Carhartt] Jacket Premium 8456 (Outerwear & Coats)	Jacket Premium 8456 - $159.99
22891	Sweater Vintage 22891	Calvin Klein	10890.75	[Calvin Klein] Sweater Vintage 22891 (Sweaters)	Sweater Vintage 22891 - $75.50

Practical Applications:

Report Generation: User-friendly product names
ID Generation: Unique keys combining multiple columns
Labeling: Dashboard and chart labels

2. LIKE/CONTAINS - Pattern Matching

Theory

Used for text search and filtering.

LIKE Patterns

%: 0 or more arbitrary characters
_: Exactly 1 arbitrary character
'%dress%': Any string containing ‘dress’
'dress%': Starts with ‘dress’
'%dress': Ends with ‘dress’

CONTAINS (BigQuery)


WHERE CONTAINS(column, 'keyword')

Case Sensitivity

LIKE: Case-sensitive
LOWER(column) LIKE LOWER(pattern): Case-insensitive

Practice: Categorize Products by Keyword


WITH keyword_products AS (
    SELECT
        p.id AS product_id,
        p.name,
        p.category,
        p.brand,
        p.retail_price,
        CASE
            WHEN LOWER(p.name) LIKE '%shirt%' THEN 'Shirt'
            WHEN LOWER(p.name) LIKE '%pant%' THEN 'Pants'
            WHEN LOWER(p.name) LIKE '%jean%' THEN 'Jeans'
            WHEN LOWER(p.name) LIKE '%dress%' THEN 'Dress'
            WHEN LOWER(p.name) LIKE '%jacket%' THEN 'Jacket'
            ELSE 'Other'
        END AS product_type
    FROM `your-project-id.thelook_ecommerce.products` p
)
SELECT
    product_type,
    COUNT(*) AS product_count,
    AVG(retail_price) AS avg_price
FROM keyword_products
GROUP BY product_type
ORDER BY product_count DESC

Execution Result:

product_type	product_count	avg_price
Other	19424	52.12
Pants	2867	50.06
Jeans	2073	93.34
Shirt	1873	40.05
Jacket	1563	129.62
Dress	1320	73.91

Quiz 2: Revenue Analysis by Keyword

Problem

JOIN products and order_items to:

Search for ‘sweater’, ‘hoodie’, ‘coat’ keywords with LIKE pattern
Aggregate total revenue and sales quantity by keyword
Calculate average sale price

Hint: LOWER(name) LIKE ‘%sweater%‘

View Answer


WITH keyword_sales AS (
    SELECT
        p.id AS product_id,
        p.name,
        oi.sale_price,
        CASE
            WHEN LOWER(p.name) LIKE '%sweater%' THEN 'Sweater'
            WHEN LOWER(p.name) LIKE '%hoodie%' THEN 'Hoodie'
            WHEN LOWER(p.name) LIKE '%coat%' THEN 'Coat'
            ELSE 'Other'
        END AS keyword
    FROM `your-project-id.thelook_ecommerce.products` p
    JOIN `your-project-id.thelook_ecommerce.order_items` oi
        ON p.id = oi.product_id
)
SELECT
    keyword,
    COUNT(*) AS sales_count,
    SUM(sale_price) AS total_revenue,
    AVG(sale_price) AS avg_sale_price
FROM keyword_sales
WHERE keyword != 'Other'
GROUP BY keyword
ORDER BY total_revenue DESC

Execution Result:

keyword	sales_count	total_revenue	avg_sale_price
Sweater	8656	610092.34	70.48
Coat	4589	526580.48	114.75
Hoodie	6966	373351.96	53.60

Performance Considerations:

%keyword%: Full table scan (slow)
keyword%: Can use index (fast)
Large datasets: Consider full-text search engines

3. SUBSTR/LEFT/RIGHT - Substring Extraction

Theory

Extract only a part of a string.

Functions

SUBSTR(string, start, length): From specified position, specified length
LEFT(string, length): N characters from the left
RIGHT(string, length): N characters from the right

Examples


SUBSTR('Hello World', 1, 5)  → 'Hello'
LEFT('Hello World', 5)       → 'Hello'
RIGHT('Hello World', 5)      → 'World'

Index

BigQuery: Starts from 1 (1-indexed)

Practice: SKU Code Analysis


SELECT
    id AS product_id,
    name,
    sku,
    LEFT(sku, 3) AS sku_prefix,
    SUBSTR(sku, 4, 3) AS sku_middle,
    RIGHT(sku, 3) AS sku_suffix,
    LENGTH(sku) AS sku_length
FROM `your-project-id.thelook_ecommerce.products`
WHERE sku IS NOT NULL
LIMIT 30

Execution Result:

product_id	name	sku	sku_prefix	sku_middle	sku_suffix	sku_length
13844	(ONE) 1 Satin Headband	2A3E95…	2A3	E95	4C8	32
14086	CHEER Rhinestone…	8EFA90…	8EF	A90	4AD	32

Quiz 3: Product Code Pattern Analysis

Problem

From the products table:

Extract first 2 characters of SKU as category code
Extract last 2 characters as product type code
Aggregate product count and average price by category code

Hint: LEFT(sku, 2), RIGHT(sku, 2)

View Answer


SELECT
    LEFT(sku, 2) AS category_code,
    RIGHT(sku, 2) AS type_code,
    COUNT(*) AS product_count,
    AVG(retail_price) AS avg_price,
    MIN(retail_price) AS min_price,
    MAX(retail_price) AS max_price
FROM `your-project-id.thelook_ecommerce.products`
WHERE sku IS NOT NULL
GROUP BY category_code, type_code
HAVING COUNT(*) >= 5
ORDER BY product_count DESC
LIMIT 20

Execution Result:

category_code	type_code	product_count	avg_price	min_price	max_price
2A	C7	125	85.50	12.99	450.00
8E	D3	118	92.30	15.00	380.00
1C	D9	112	78.25	10.50	320.00

Practical Applications:

SKU System Analysis: Understanding product code rules
Inventory Management: Stock status by code
Auto-categorization: Category assignment based on code

4. UPPER/LOWER/TRIM - String Normalization

Theory

Used for data cleaning and standardization.

Functions

UPPER(string): Convert to uppercase
LOWER(string): Convert to lowercase
TRIM(string): Remove leading and trailing whitespace
LTRIM(string): Remove left whitespace
RTRIM(string): Remove right whitespace

Use Cases

Unify case
Remove whitespace
Remove duplicates
Data standardization

Practice: Normalize Brand Names


WITH normalized_brands AS (
    SELECT
        brand AS original_brand,
        UPPER(TRIM(brand)) AS normalized_brand,
        COUNT(*) AS product_count
    FROM `your-project-id.thelook_ecommerce.products`
    WHERE brand IS NOT NULL
    GROUP BY brand
)
SELECT
    normalized_brand,
    SUM(product_count) AS total_products,
    COUNT(DISTINCT original_brand) AS brand_variations
FROM normalized_brands
GROUP BY normalized_brand
HAVING COUNT(DISTINCT original_brand) > 1 OR SUM(product_count) >= 10
ORDER BY brand_variations DESC, total_products DESC
LIMIT 20

Execution Result:

normalized_brand	total_products	brand_variations
VIP BOUTIQUE	29	3
TURKISHTOWELS	7	3
HUGO BOSS	135	2
ICEBREAKER	70	2

⚠️

Importance of Normalization

Remove Duplicates: Treat ‘Gmail.com’ and ‘gmail.com’ as the same
Whitespace Issues: Unify ’ user@example.com ’ and ‘user@example.com’
Consistency: Improve analysis and aggregation accuracy

Quiz 4: Email Domain Normalization and Analysis

Problem

From the users table:

Convert email addresses to lowercase and remove whitespace
Extract domain after ’@’ (using SPLIT)
Aggregate customer count by domain

Hint: LOWER(TRIM(email)), SPLIT(email, ’@’)[OFFSET(1)]

View Answer


WITH normalized_emails AS (
    SELECT
        id AS user_id,
        email,
        LOWER(TRIM(email)) AS clean_email,
        SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)] AS email_domain
    FROM `your-project-id.thelook_ecommerce.users`
    WHERE email IS NOT NULL
)
SELECT
    email_domain,
    COUNT(*) AS customer_count,
    ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM normalized_emails), 2) AS percentage
FROM normalized_emails
WHERE email_domain IS NOT NULL
GROUP BY email_domain
ORDER BY customer_count DESC
LIMIT 10

Execution Result:

email_domain	customer_count	percentage
example.net	33601	33.60
example.org	33212	33.21
example.com	33187	33.19

Practical Applications:

Email marketing segmentation
Distinguishing company email vs personal email
B2B vs B2C analysis

5. SPLIT - String Splitting

Theory

Splits a string into an array based on a delimiter.

Syntax


SPLIT(string, delimiter)
-- Result: ARRAY

Array Indexing


SPLIT('a,b,c', ',')[OFFSET(0)]  → 'a'
SPLIT('a,b,c', ',')[OFFSET(1)]  → 'b'
SPLIT('a,b,c', ',')[SAFE_OFFSET(10)]  → NULL (safe)

Use Cases

CSV parsing
Tag separation
Path decomposition
Email, URL analysis

Practice: Email Analysis


WITH email_parts AS (
    SELECT
        id AS user_id,
        email,
        SPLIT(email, '@')[SAFE_OFFSET(0)] AS username,
        SPLIT(email, '@')[SAFE_OFFSET(1)] AS domain,
        ARRAY_LENGTH(SPLIT(email, '@')) AS parts_count
    FROM `your-project-id.thelook_ecommerce.users`
    WHERE email IS NOT NULL
        AND email LIKE '%@%'
)
SELECT
    user_id,
    email,
    username,
    domain,
    LENGTH(username) AS username_length,
    CASE
        WHEN domain LIKE '%.com' THEN 'Commercial'
        WHEN domain LIKE '%.org' THEN 'Organization'
        WHEN domain LIKE '%.edu' THEN 'Education'
        WHEN domain LIKE '%.gov' THEN 'Government'
        ELSE 'Other'
    END AS domain_type
FROM email_parts
WHERE parts_count = 2
LIMIT 30

Execution Result:

user_id	email	username	domain	username_length	domain_type
97381	dianamartinez@example.org	dianamartinez	example.org	13	Organization
30683	breannafrancis@example.org	breannafrancis	example.org	14	Organization

Quiz 5: Extract Keywords from Product Names

Problem

From the products table:

SPLIT product names by space
Extract the first word as the main keyword
Aggregate product count and average price by keyword

Hint: SPLIT(name, ’ ’)[SAFE_OFFSET(0)]

View Answer


WITH product_keywords AS (
    SELECT
        id AS product_id,
        name,
        retail_price,
        UPPER(SPLIT(TRIM(name), ' ')[SAFE_OFFSET(0)]) AS first_keyword
    FROM `your-project-id.thelook_ecommerce.products`
    WHERE name IS NOT NULL
)
SELECT
    first_keyword,
    COUNT(*) AS product_count,
    AVG(retail_price) AS avg_price,
    MIN(retail_price) AS min_price,
    MAX(retail_price) AS max_price
FROM product_keywords
WHERE first_keyword IS NOT NULL
    AND LENGTH(first_keyword) > 2
GROUP BY first_keyword
HAVING COUNT(*) >= 5
ORDER BY product_count DESC
LIMIT 15

Execution Result:

first_keyword	product_count	avg_price	min_price	max_price
ALLEGRA	1034	14.13	3.28	41.63
CALVIN	615	64.26	10.00	259.99
CARHARTT	387	68.86	7.00	448.99
TOMMY	324	70.88	9.00	289.99
VOLCOM	293	58.54	11.96	320.00

SPLIT Use Scenarios:

Tag Analysis: Separate comma-delimited tags
Path Parsing: Decompose URLs, file paths
Keyword Extraction: Core words from product names, descriptions

Summary

String Functions Covered

Function	Use Case	Example
`CONCAT()`	String concatenation	CONCAT(first_name, ’ ’, last_name)
`LIKE`	Pattern matching	name LIKE ‘%shirt%‘
`SUBSTR()`	Substring	SUBSTR(sku, 1, 3)
`LEFT/RIGHT()`	First/last N characters	LEFT(sku, 2)
`UPPER/LOWER()`	Case conversion	UPPER(brand)
`TRIM()`	Remove whitespace	TRIM(email)
`SPLIT()`	Split string	SPLIT(email, ’@‘)

Additional Useful Functions

Function	Description
`LENGTH()`	String length
`REPLACE()`	String replacement
`REGEXP_CONTAINS()`	Regular expression matching
`REGEXP_EXTRACT()`	Regular expression extraction
`FORMAT()`	Formatting

Data Cleaning Patterns


-- 1. Normalization
UPPER(TRIM(column))
 
-- 2. Extract email domain
SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)]
 
-- 3. Code parsing
LEFT(sku, 3) AS category_code
 
-- 4. Keyword search
WHERE LOWER(name) LIKE '%keyword%'
 
-- 5. Formatting
CONCAT('[', brand, '] ', name)

Next Steps

You’ve mastered string functions! Now learn Python-based data analysis in the Pandas track, or study data visualization techniques in Visualization.