Skip to Content
ConceptsSQLString Functions

Product Analysis with String Functions

BeginnerIntermediate

BigQuery Execution Environment Setup

from google.cloud import bigquery import pandas as pd # Authentication setup (when using service account key) # import os # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json" # Create BigQuery client client = bigquery.Client(project='your-project-id') # Query execution function def run_query(query): return client.query(query).to_dataframe()

Learning Objectives

After completing this recipe, you will be able to:

  • Combine strings and format data with CONCAT
  • Pattern matching and keyword search with LIKE/CONTAINS
  • Extract substrings with SUBSTR/LEFT/RIGHT
  • Normalize strings with UPPER/LOWER/TRIM
  • Split and structure strings with SPLIT
  • Clean and analyze real-world text data

1. CONCAT - String Concatenation

Theory

CONCAT combines multiple strings into one.

Syntax

CONCAT(string1, string2, string3, ...)

Similar Functions

  • CONCAT(a, b, c): Combines all values (NULL treated as empty string)
  • || operator: a || b || c (BigQuery)
  • CONCAT_WS(sep, a, b, c): Combine with separator

Use Cases

  • Creating full names
  • Address formatting
  • ID generation
  • Report text creation

Practice: Creating Customer Full Name and Location Info

SELECT user_id, first_name, last_name, CONCAT(first_name, ' ', last_name) AS full_name, state, country, CONCAT(IFNULL(state, 'Unknown'), ', ', IFNULL(country, 'Unknown')) AS location, email, CONCAT('Customer ID: ', CAST(user_id AS STRING), ' - ', first_name, ' ', last_name) AS customer_label FROM `your-project-id.thelook_ecommerce.users` WHERE first_name IS NOT NULL AND last_name IS NOT NULL LIMIT 20

Execution Result:

user_idfirst_namelast_namefull_namelocationcustomer_label
46787ShellyRiceShelly RiceGuangdong, ChinaCustomer ID: 46787 - Shelly Rice
94573EmilyBrownEmily BrownZhejiang, ChinaCustomer ID: 94573 - Emily Brown

Quiz 1: Create Full Product Name

Problem

From the products table:

  1. Create full_product_name in “[Brand] Product Name (Category)” format with CONCAT
  2. Create product_info including price: “Product Name - $Price”
  3. Query only the top 15 products by revenue (JOIN with order_items)

Hint: CONCAT(’[’, brand, ’] ’, name, ’ (’, category, ’)‘)

View Answer

WITH product_revenue AS ( SELECT p.id AS product_id, p.name, p.brand, p.category, p.retail_price, SUM(oi.sale_price) AS total_revenue FROM `your-project-id.thelook_ecommerce.products` p JOIN `your-project-id.thelook_ecommerce.order_items` oi ON p.id = oi.product_id GROUP BY p.id, p.name, p.brand, p.category, p.retail_price ) SELECT product_id, name, brand, category, retail_price, total_revenue, CONCAT('[', brand, '] ', name, ' (', category, ')') AS full_product_name, CONCAT(name, ' - $', CAST(retail_price AS STRING)) AS product_info FROM product_revenue ORDER BY total_revenue DESC LIMIT 15

Execution Result:

product_idnamebrandtotal_revenuefull_product_nameproduct_info
15234Jeans Classic 15234Diesel12580.50[Diesel] Jeans Classic 15234 (Jeans)Jeans Classic 15234 - $89.99
8456Jacket Premium 8456Carhartt11250.00[Carhartt] Jacket Premium 8456 (Outerwear & Coats)Jacket Premium 8456 - $159.99
22891Sweater Vintage 22891Calvin Klein10890.75[Calvin Klein] Sweater Vintage 22891 (Sweaters)Sweater Vintage 22891 - $75.50

Practical Applications:

  • Report Generation: User-friendly product names
  • ID Generation: Unique keys combining multiple columns
  • Labeling: Dashboard and chart labels

2. LIKE/CONTAINS - Pattern Matching

Theory

Used for text search and filtering.

LIKE Patterns

  • %: 0 or more arbitrary characters
  • _: Exactly 1 arbitrary character
  • '%dress%': Any string containing ‘dress’
  • 'dress%': Starts with ‘dress’
  • '%dress': Ends with ‘dress’

CONTAINS (BigQuery)

WHERE CONTAINS(column, 'keyword')

Case Sensitivity

  • LIKE: Case-sensitive
  • LOWER(column) LIKE LOWER(pattern): Case-insensitive

Practice: Categorize Products by Keyword

WITH keyword_products AS ( SELECT p.id AS product_id, p.name, p.category, p.brand, p.retail_price, CASE WHEN LOWER(p.name) LIKE '%shirt%' THEN 'Shirt' WHEN LOWER(p.name) LIKE '%pant%' THEN 'Pants' WHEN LOWER(p.name) LIKE '%jean%' THEN 'Jeans' WHEN LOWER(p.name) LIKE '%dress%' THEN 'Dress' WHEN LOWER(p.name) LIKE '%jacket%' THEN 'Jacket' ELSE 'Other' END AS product_type FROM `your-project-id.thelook_ecommerce.products` p ) SELECT product_type, COUNT(*) AS product_count, AVG(retail_price) AS avg_price FROM keyword_products GROUP BY product_type ORDER BY product_count DESC

Execution Result:

product_typeproduct_countavg_price
Other1942452.12
Pants286750.06
Jeans207393.34
Shirt187340.05
Jacket1563129.62
Dress132073.91

Quiz 2: Revenue Analysis by Keyword

Problem

JOIN products and order_items to:

  1. Search for ‘sweater’, ‘hoodie’, ‘coat’ keywords with LIKE pattern
  2. Aggregate total revenue and sales quantity by keyword
  3. Calculate average sale price

Hint: LOWER(name) LIKE ‘%sweater%‘

View Answer

WITH keyword_sales AS ( SELECT p.id AS product_id, p.name, oi.sale_price, CASE WHEN LOWER(p.name) LIKE '%sweater%' THEN 'Sweater' WHEN LOWER(p.name) LIKE '%hoodie%' THEN 'Hoodie' WHEN LOWER(p.name) LIKE '%coat%' THEN 'Coat' ELSE 'Other' END AS keyword FROM `your-project-id.thelook_ecommerce.products` p JOIN `your-project-id.thelook_ecommerce.order_items` oi ON p.id = oi.product_id ) SELECT keyword, COUNT(*) AS sales_count, SUM(sale_price) AS total_revenue, AVG(sale_price) AS avg_sale_price FROM keyword_sales WHERE keyword != 'Other' GROUP BY keyword ORDER BY total_revenue DESC

Execution Result:

keywordsales_counttotal_revenueavg_sale_price
Sweater8656610092.3470.48
Coat4589526580.48114.75
Hoodie6966373351.9653.60

Performance Considerations:

  • %keyword%: Full table scan (slow)
  • keyword%: Can use index (fast)
  • Large datasets: Consider full-text search engines

3. SUBSTR/LEFT/RIGHT - Substring Extraction

Theory

Extract only a part of a string.

Functions

  • SUBSTR(string, start, length): From specified position, specified length
  • LEFT(string, length): N characters from the left
  • RIGHT(string, length): N characters from the right

Examples

SUBSTR('Hello World', 1, 5) → 'Hello' LEFT('Hello World', 5) → 'Hello' RIGHT('Hello World', 5) → 'World'

Index

  • BigQuery: Starts from 1 (1-indexed)

Practice: SKU Code Analysis

SELECT id AS product_id, name, sku, LEFT(sku, 3) AS sku_prefix, SUBSTR(sku, 4, 3) AS sku_middle, RIGHT(sku, 3) AS sku_suffix, LENGTH(sku) AS sku_length FROM `your-project-id.thelook_ecommerce.products` WHERE sku IS NOT NULL LIMIT 30

Execution Result:

product_idnameskusku_prefixsku_middlesku_suffixsku_length
13844(ONE) 1 Satin Headband2A3E95…2A3E954C832
14086CHEER Rhinestone…8EFA90…8EFA904AD32

Quiz 3: Product Code Pattern Analysis

Problem

From the products table:

  1. Extract first 2 characters of SKU as category code
  2. Extract last 2 characters as product type code
  3. Aggregate product count and average price by category code

Hint: LEFT(sku, 2), RIGHT(sku, 2)

View Answer

SELECT LEFT(sku, 2) AS category_code, RIGHT(sku, 2) AS type_code, COUNT(*) AS product_count, AVG(retail_price) AS avg_price, MIN(retail_price) AS min_price, MAX(retail_price) AS max_price FROM `your-project-id.thelook_ecommerce.products` WHERE sku IS NOT NULL GROUP BY category_code, type_code HAVING COUNT(*) >= 5 ORDER BY product_count DESC LIMIT 20

Execution Result:

category_codetype_codeproduct_countavg_pricemin_pricemax_price
2AC712585.5012.99450.00
8ED311892.3015.00380.00
1CD911278.2510.50320.00

Practical Applications:

  • SKU System Analysis: Understanding product code rules
  • Inventory Management: Stock status by code
  • Auto-categorization: Category assignment based on code

4. UPPER/LOWER/TRIM - String Normalization

Theory

Used for data cleaning and standardization.

Functions

  • UPPER(string): Convert to uppercase
  • LOWER(string): Convert to lowercase
  • TRIM(string): Remove leading and trailing whitespace
  • LTRIM(string): Remove left whitespace
  • RTRIM(string): Remove right whitespace

Use Cases

  • Unify case
  • Remove whitespace
  • Remove duplicates
  • Data standardization

Practice: Normalize Brand Names

WITH normalized_brands AS ( SELECT brand AS original_brand, UPPER(TRIM(brand)) AS normalized_brand, COUNT(*) AS product_count FROM `your-project-id.thelook_ecommerce.products` WHERE brand IS NOT NULL GROUP BY brand ) SELECT normalized_brand, SUM(product_count) AS total_products, COUNT(DISTINCT original_brand) AS brand_variations FROM normalized_brands GROUP BY normalized_brand HAVING COUNT(DISTINCT original_brand) > 1 OR SUM(product_count) >= 10 ORDER BY brand_variations DESC, total_products DESC LIMIT 20

Execution Result:

normalized_brandtotal_productsbrand_variations
VIP BOUTIQUE293
TURKISHTOWELS73
HUGO BOSS1352
ICEBREAKER702
⚠️
Importance of Normalization
  • Remove Duplicates: Treat ‘Gmail.com’ and ‘gmail.com’ as the same
  • Whitespace Issues: Unify ’ user@example.com ’ and ‘user@example.com
  • Consistency: Improve analysis and aggregation accuracy

Quiz 4: Email Domain Normalization and Analysis

Problem

From the users table:

  1. Convert email addresses to lowercase and remove whitespace
  2. Extract domain after ’@’ (using SPLIT)
  3. Aggregate customer count by domain

Hint: LOWER(TRIM(email)), SPLIT(email, ’@’)[OFFSET(1)]

View Answer

WITH normalized_emails AS ( SELECT id AS user_id, email, LOWER(TRIM(email)) AS clean_email, SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)] AS email_domain FROM `your-project-id.thelook_ecommerce.users` WHERE email IS NOT NULL ) SELECT email_domain, COUNT(*) AS customer_count, ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM normalized_emails), 2) AS percentage FROM normalized_emails WHERE email_domain IS NOT NULL GROUP BY email_domain ORDER BY customer_count DESC LIMIT 10

Execution Result:

email_domaincustomer_countpercentage
example.net3360133.60
example.org3321233.21
example.com3318733.19

Practical Applications:

  • Email marketing segmentation
  • Distinguishing company email vs personal email
  • B2B vs B2C analysis

5. SPLIT - String Splitting

Theory

Splits a string into an array based on a delimiter.

Syntax

SPLIT(string, delimiter) -- Result: ARRAY

Array Indexing

SPLIT('a,b,c', ',')[OFFSET(0)] → 'a' SPLIT('a,b,c', ',')[OFFSET(1)] → 'b' SPLIT('a,b,c', ',')[SAFE_OFFSET(10)] → NULL (safe)

Use Cases

  • CSV parsing
  • Tag separation
  • Path decomposition
  • Email, URL analysis

Practice: Email Analysis

WITH email_parts AS ( SELECT id AS user_id, email, SPLIT(email, '@')[SAFE_OFFSET(0)] AS username, SPLIT(email, '@')[SAFE_OFFSET(1)] AS domain, ARRAY_LENGTH(SPLIT(email, '@')) AS parts_count FROM `your-project-id.thelook_ecommerce.users` WHERE email IS NOT NULL AND email LIKE '%@%' ) SELECT user_id, email, username, domain, LENGTH(username) AS username_length, CASE WHEN domain LIKE '%.com' THEN 'Commercial' WHEN domain LIKE '%.org' THEN 'Organization' WHEN domain LIKE '%.edu' THEN 'Education' WHEN domain LIKE '%.gov' THEN 'Government' ELSE 'Other' END AS domain_type FROM email_parts WHERE parts_count = 2 LIMIT 30

Execution Result:

user_idemailusernamedomainusername_lengthdomain_type
97381dianamartinez@example.orgdianamartinezexample.org13Organization
30683breannafrancis@example.orgbreannafrancisexample.org14Organization

Quiz 5: Extract Keywords from Product Names

Problem

From the products table:

  1. SPLIT product names by space
  2. Extract the first word as the main keyword
  3. Aggregate product count and average price by keyword

Hint: SPLIT(name, ’ ’)[SAFE_OFFSET(0)]

View Answer

WITH product_keywords AS ( SELECT id AS product_id, name, retail_price, UPPER(SPLIT(TRIM(name), ' ')[SAFE_OFFSET(0)]) AS first_keyword FROM `your-project-id.thelook_ecommerce.products` WHERE name IS NOT NULL ) SELECT first_keyword, COUNT(*) AS product_count, AVG(retail_price) AS avg_price, MIN(retail_price) AS min_price, MAX(retail_price) AS max_price FROM product_keywords WHERE first_keyword IS NOT NULL AND LENGTH(first_keyword) > 2 GROUP BY first_keyword HAVING COUNT(*) >= 5 ORDER BY product_count DESC LIMIT 15

Execution Result:

first_keywordproduct_countavg_pricemin_pricemax_price
ALLEGRA103414.133.2841.63
CALVIN61564.2610.00259.99
CARHARTT38768.867.00448.99
TOMMY32470.889.00289.99
VOLCOM29358.5411.96320.00

SPLIT Use Scenarios:

  • Tag Analysis: Separate comma-delimited tags
  • Path Parsing: Decompose URLs, file paths
  • Keyword Extraction: Core words from product names, descriptions

Summary

String Functions Covered

FunctionUse CaseExample
CONCAT()String concatenationCONCAT(first_name, ’ ’, last_name)
LIKEPattern matchingname LIKE ‘%shirt%‘
SUBSTR()SubstringSUBSTR(sku, 1, 3)
LEFT/RIGHT()First/last N charactersLEFT(sku, 2)
UPPER/LOWER()Case conversionUPPER(brand)
TRIM()Remove whitespaceTRIM(email)
SPLIT()Split stringSPLIT(email, ’@‘)

Additional Useful Functions

FunctionDescription
LENGTH()String length
REPLACE()String replacement
REGEXP_CONTAINS()Regular expression matching
REGEXP_EXTRACT()Regular expression extraction
FORMAT()Formatting

Data Cleaning Patterns

-- 1. Normalization UPPER(TRIM(column)) -- 2. Extract email domain SPLIT(LOWER(TRIM(email)), '@')[SAFE_OFFSET(1)] -- 3. Code parsing LEFT(sku, 3) AS category_code -- 4. Keyword search WHERE LOWER(name) LIKE '%keyword%' -- 5. Formatting CONCAT('[', brand, '] ', name)

Next Steps

You’ve mastered string functions! Now learn Python-based data analysis in the Pandas track, or study data visualization techniques in Visualization.

Last updated on

🤖AI 모의면접실전처럼 연습하기