Customer Data Cleaning with NULL Handling
BigQuery Execution Environment Setup
from google.cloud import bigquery
import pandas as pd
# Authentication setup (when using service account key)
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json"
# Create BigQuery client
client = bigquery.Client(project='your-project-id')
# Query execution function
def run_query(query):
return client.query(query).to_dataframe()Learning Objectives
After completing this recipe, you will be able to:
- Understand the concept of NULL values and their impact on data quality
- Detect missing values with IS NULL / IS NOT NULL
- Replace NULL and handle priority with COALESCE
- Set simple default values with IFNULL
- Implement complex conditional logic with CASE WHEN
- Generate NULL under specific conditions with NULLIF
1. NULL Checking - IS NULL, IS NOT NULL
Theory
NULL represents “no value” or “unknown”.
Important Notes
column = NULL❌ (doesn’t work)column IS NULL✅ (correct method)- All operations with NULL result in NULL
Syntax
WHERE column IS NULL
WHERE column IS NOT NULLPractice: NULL Analysis of Customer Data
SELECT
'age' AS column_name,
COUNT(*) AS total_rows,
COUNT(age) AS non_null_count,
COUNT(*) - COUNT(age) AS null_count,
ROUND((COUNT(*) - COUNT(age)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.users`
UNION ALL
SELECT
'gender' AS column_name,
COUNT(*) AS total_rows,
COUNT(gender) AS non_null_count,
COUNT(*) - COUNT(gender) AS null_count,
ROUND((COUNT(*) - COUNT(gender)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.users`
UNION ALL
SELECT
'state' AS column_name,
COUNT(*) AS total_rows,
COUNT(state) AS non_null_count,
COUNT(*) - COUNT(state) AS null_count,
ROUND((COUNT(*) - COUNT(state)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.users`
ORDER BY null_percentage DESCExecution Result:
| column_name | total_rows | non_null_count | null_count | null_percentage |
|---|---|---|---|---|
| state | 100000 | 100000 | 0 | 0.0 |
| age | 100000 | 100000 | 0 | 0.0 |
| country | 100000 | 100000 | 0 | 0.0 |
| gender | 100000 | 100000 | 0 | 0.0 |
COUNT(*): Total row count (includes NULL)COUNT(column): Counts only non-NULL rows- NULL percentage =
(Total - Non-NULL) / Total * 100
Quiz 1: NULL Analysis of Order Items
Problem
From the order_items table:
- Calculate the NULL percentage for returned_at, shipped_at, delivered_at columns
- Output total rows, NULL count, NULL percentage(%)
- Sort by highest NULL percentage
Hint: COUNT(*) - COUNT(column), UNION ALL
View Answer
SELECT
'returned_at' AS column_name,
COUNT(*) AS total_rows,
COUNT(returned_at) AS non_null_count,
COUNT(*) - COUNT(returned_at) AS null_count,
ROUND((COUNT(*) - COUNT(returned_at)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.order_items`
UNION ALL
SELECT
'shipped_at' AS column_name,
COUNT(*) AS total_rows,
COUNT(shipped_at) AS non_null_count,
COUNT(*) - COUNT(shipped_at) AS null_count,
ROUND((COUNT(*) - COUNT(shipped_at)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.order_items`
UNION ALL
SELECT
'delivered_at' AS column_name,
COUNT(*) AS total_rows,
COUNT(delivered_at) AS non_null_count,
COUNT(*) - COUNT(delivered_at) AS null_count,
ROUND((COUNT(*) - COUNT(delivered_at)) * 100.0 / COUNT(*), 2) AS null_percentage
FROM `your-project-id.thelook_ecommerce.order_items`
ORDER BY null_percentage DESCExecution Result:
| column_name | total_rows | non_null_count | null_count | null_percentage |
|---|---|---|---|---|
| returned_at | 180120 | 18058 | 162062 | 89.97 |
| delivered_at | 180120 | 63226 | 116894 | 64.90 |
| shipped_at | 180120 | 117014 | 63106 | 35.04 |
Business Insights:
- High returned_at NULL: Most products aren’t returned (normal)
- shipped_at NULL: Orders not yet shipped
- delivered_at NULL: In transit or undelivered status
2. COALESCE - NULL Replacement (Priority)
Theory
COALESCE returns the first NOT NULL value among multiple values.
Syntax
COALESCE(value1, value2, value3, default_value)Use Cases
- Select valid contact from multiple options
- Set default values
- Priority-based selection
Practice: Setting Default Value for State
SELECT
user_id,
first_name,
last_name,
state,
country,
COALESCE(state, 'Unknown State') AS state_cleaned,
COALESCE(country, 'Unknown Country') AS country_cleaned,
COALESCE(state, country, 'Unknown Location') AS location
FROM `your-project-id.thelook_ecommerce.users`
WHERE state IS NULL OR country IS NULL
LIMIT 20Quiz 2: Replace NULL with Average Value
Problem
From the users table:
- Calculate the overall average age (using subquery or CTE)
- Replace age with average age when NULL using COALESCE
- Compare original age and cleaned age_cleaned
Hint: COALESCE(age, (SELECT AVG(age) FROM …))
View Answer
WITH avg_age AS (
SELECT ROUND(AVG(age), 0) AS average_age
FROM `your-project-id.thelook_ecommerce.users`
WHERE age IS NOT NULL
)
SELECT
user_id,
first_name,
age AS age_original,
COALESCE(age, (SELECT average_age FROM avg_age)) AS age_cleaned,
CASE
WHEN age IS NULL THEN 'Imputed'
ELSE 'Original'
END AS data_source
FROM `your-project-id.thelook_ecommerce.users`
LIMIT 1000Missing Value Replacement Strategies:
- Mean: Numeric data, when distribution is close to normal
- Median: More stable when there are many outliers
- Mode: Categorical data
- Group mean: Apply different averages by gender, region
3. IFNULL - Simple NULL Replacement
Theory
IFNULL is a simpler version of COALESCE that only accepts 2 arguments.
Syntax
IFNULL(column, default_value)COALESCE vs IFNULL
COALESCE(a, b, c, d): Compare multiple valuesIFNULL(a, b): Simple replacement (more concise)
Practice: Simple Default Value Setting
SELECT
user_id,
first_name,
last_name,
gender,
state,
IFNULL(gender, 'Unknown') AS gender_cleaned,
IFNULL(state, 'N/A') AS state_cleaned,
CONCAT(IFNULL(first_name, ''), ' ', IFNULL(last_name, '')) AS full_name
FROM `your-project-id.thelook_ecommerce.users`
WHERE gender IS NULL OR state IS NULL
LIMIT 30Quiz 3: Replace State NULL with ‘Unknown’ and Analyze
Problem
From the users table:
- Replace NULL state with ‘Unknown’ using IFNULL
- Aggregate customer count by cleaned state
- Query only top 15 states
Hint: IFNULL(state, ‘Unknown’), GROUP BY, ORDER BY
View Answer
SELECT
IFNULL(state, 'Unknown') AS state_cleaned,
COUNT(*) AS customer_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM `your-project-id.thelook_ecommerce.users`), 2) AS percentage
FROM `your-project-id.thelook_ecommerce.users`
GROUP BY state_cleaned
ORDER BY customer_count DESC
LIMIT 15Execution Result:
| state_cleaned | customer_count | percentage |
|---|---|---|
| Guangdong | 5376 | 5.38 |
| England | 4126 | 4.13 |
| California | 3748 | 3.75 |
| Shanghai | 2443 | 2.44 |
| Texas | 2407 | 2.41 |
Business Applications:
- High Unknown percentage: Need to improve data collection process
- Regional customer distribution: Marketing targeting, logistics center location decisions
4. CASE WHEN - Conditional Logic
Theory
CASE WHEN is SQL’s if-else statement.
Syntax
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
ELSE default_result
ENDUse Cases
- Customer segmentation (age groups, tiers, etc.)
- Conditional aggregation
- Complex business logic implementation
Practice: Customer Segmentation by Age Group
SELECT
user_id,
first_name,
age,
CASE
WHEN age IS NULL THEN 'Unknown'
WHEN age < 20 THEN 'Teens'
WHEN age < 30 THEN '20s'
WHEN age < 40 THEN '30s'
WHEN age < 50 THEN '40s'
WHEN age < 60 THEN '50s'
ELSE '60s+'
END AS age_group,
CASE
WHEN age IS NULL THEN 'N/A'
WHEN age >= 40 THEN 'VIP'
ELSE 'Regular'
END AS customer_tier
FROM `your-project-id.thelook_ecommerce.users`
LIMIT 50Execution Result:
| user_id | first_name | age | age_group | customer_tier |
|---|---|---|---|---|
| 92691 | Tara | 42 | 40s | VIP |
| 64986 | Jennifer | 42 | 40s | VIP |
| 46547 | Kelly | 46 | 40s | VIP |
| 26697 | Christina | 27 | 20s | Regular |
Quiz 4: VIP/Regular Customer Order Analysis by Age Group
Problem
JOIN users and orders to:
- Classify age groups with CASE WHEN (Teens, 20s, 30s, 40s+, Unknown)
- Classify customer tier with CASE WHEN:
- age >= 40: ‘VIP’
- age < 40: ‘Regular’
- age IS NULL: ‘Unknown’
- Calculate order count and average items per order by age group and tier
Hint: JOIN, GROUP BY age_group, customer_tier
View Answer
SELECT
CASE
WHEN u.age IS NULL THEN 'Unknown'
WHEN u.age < 20 THEN 'Teens'
WHEN u.age < 30 THEN '20s'
WHEN u.age < 40 THEN '30s'
ELSE '40s+'
END AS age_group,
CASE
WHEN u.age IS NULL THEN 'Unknown'
WHEN u.age >= 40 THEN 'VIP'
ELSE 'Regular'
END AS customer_tier,
COUNT(DISTINCT o.order_id) AS order_count,
AVG(o.num_of_item) AS avg_items_per_order
FROM `your-project-id.thelook_ecommerce.users` u
JOIN `your-project-id.thelook_ecommerce.orders` o
ON u.user_id = o.user_id
GROUP BY age_group, customer_tier
ORDER BY
CASE age_group
WHEN 'Teens' THEN 1
WHEN '20s' THEN 2
WHEN '30s' THEN 3
WHEN '40s+' THEN 4
WHEN 'Unknown' THEN 5
END,
customer_tierExecution Result:
| age_group | customer_tier | order_count | avg_items_per_order |
|---|---|---|---|
| Teens | Regular | 16861 | 1.44 |
| 20s | Regular | 21385 | 1.45 |
| 30s | Regular | 21076 | 1.45 |
| 40s+ | VIP | 65471 | 1.44 |
Business Insights:
- VIP customers (40s+): Analyze order count and purchase patterns
- Regular customers (20-30s): Segments with high growth potential
5. NULLIF - Conditional NULL Generation
Theory
NULLIF returns NULL if two values are equal.
Syntax
NULLIF(expression1, expression2)
-- Returns NULL if expression1 == expression2, otherwise expression1Use Cases
- Prevent division by zero:
value / NULLIF(count, 0) - Remove default values:
NULLIF(column, 'default') - Exclude meaningless values
Practice: Prevent Division by Zero with NULLIF
SELECT
p.category,
COUNT(*) AS total_products,
SUM(CASE WHEN oi.order_id IS NOT NULL THEN 1 ELSE 0 END) AS sold_products,
-- Prevent division by zero with NULLIF
ROUND(
SUM(CASE WHEN oi.order_id IS NOT NULL THEN 1 ELSE 0 END) * 100.0 /
NULLIF(COUNT(*), 0),
2
) AS sell_through_rate
FROM `your-project-id.thelook_ecommerce.products` p
LEFT JOIN `your-project-id.thelook_ecommerce.order_items` oi
ON p.product_id = oi.product_id
GROUP BY p.category
ORDER BY sell_through_rate DESC
LIMIT 15Quiz 5: Safe Average Calculation
Problem
From orders:
- Group by status
- Calculate order count, total items, average items per order
- Prevent division by zero with NULLIF for average calculation
Hint: SUM(num_of_item) / NULLIF(COUNT(*), 0)
View Answer
SELECT
status,
COUNT(*) AS order_count,
SUM(num_of_item) AS total_items,
ROUND(SUM(num_of_item) / NULLIF(COUNT(*), 0), 2) AS avg_items_per_order
FROM `your-project-id.thelook_ecommerce.orders`
GROUP BY status
ORDER BY order_count DESCExecution Result:
| status | order_count | total_items | avg_items_per_order |
|---|---|---|---|
| Shipped | 37302 | 53788 | 1.44 |
| Complete | 31161 | 45168 | 1.45 |
| Processing | 25171 | 36205 | 1.44 |
| Cancelled | 18648 | 26901 | 1.44 |
| Returned | 12511 | 18058 | 1.44 |
NULLIF Use Scenarios:
- Prevent division by zero:
value / NULLIF(denominator, 0) - Remove meaningless values:
NULLIF(column, '') - Filter default values:
NULLIF(column, 'N/A')
Summary
NULL Handling Functions Covered
| Function | Use | Example |
|---|---|---|
IS NULL | Check for NULL | WHERE age IS NULL |
COALESCE(a,b,c) | First NOT NULL value | COALESCE(state, country, ‘Unknown’) |
IFNULL(a,b) | Simple replacement | IFNULL(age, 0) |
CASE WHEN | Conditional logic | Age group classification, customer tier |
NULLIF(a,b) | Conditional NULL generation | NULLIF(count, 0) |
NULL Handling Best Practices
-
Data Quality Monitoring
- Regularly measure NULL percentages
- Alert when threshold exceeded
-
Appropriate Replacement Strategies
- Numbers: Mean, median, 0
- Categories: Mode, ‘Unknown’
- Dates: Specific reference date, previous/next value
-
Business Logic Implementation
- Complex classification with CASE WHEN
- Customer segmentation, product grading
-
Safe Operations
- Prevent division by zero with NULLIF
- Guarantee default values with COALESCE
Next Steps
You’ve mastered NULL handling! Next, learn complex query structuring and advanced analysis techniques in CTE and Cohort Analysis.