Skip to Content
ConceptsSQLNULL Handling

Customer Data Cleaning with NULL Handling

BeginnerIntermediate

BigQuery Execution Environment Setup

from google.cloud import bigquery import pandas as pd # Authentication setup (when using service account key) # import os # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-key.json" # Create BigQuery client client = bigquery.Client(project='your-project-id') # Query execution function def run_query(query): return client.query(query).to_dataframe()

Learning Objectives

After completing this recipe, you will be able to:

  • Understand the concept of NULL values and their impact on data quality
  • Detect missing values with IS NULL / IS NOT NULL
  • Replace NULL and handle priority with COALESCE
  • Set simple default values with IFNULL
  • Implement complex conditional logic with CASE WHEN
  • Generate NULL under specific conditions with NULLIF

1. NULL Checking - IS NULL, IS NOT NULL

Theory

NULL represents “no value” or “unknown”.

Important Notes

  • column = NULL ❌ (doesn’t work)
  • column IS NULL ✅ (correct method)
  • All operations with NULL result in NULL

Syntax

WHERE column IS NULL WHERE column IS NOT NULL

Practice: NULL Analysis of Customer Data

SELECT 'age' AS column_name, COUNT(*) AS total_rows, COUNT(age) AS non_null_count, COUNT(*) - COUNT(age) AS null_count, ROUND((COUNT(*) - COUNT(age)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.users` UNION ALL SELECT 'gender' AS column_name, COUNT(*) AS total_rows, COUNT(gender) AS non_null_count, COUNT(*) - COUNT(gender) AS null_count, ROUND((COUNT(*) - COUNT(gender)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.users` UNION ALL SELECT 'state' AS column_name, COUNT(*) AS total_rows, COUNT(state) AS non_null_count, COUNT(*) - COUNT(state) AS null_count, ROUND((COUNT(*) - COUNT(state)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.users` ORDER BY null_percentage DESC

Execution Result:

column_nametotal_rowsnon_null_countnull_countnull_percentage
state10000010000000.0
age10000010000000.0
country10000010000000.0
gender10000010000000.0
ℹ️
Core Concepts
  • COUNT(*): Total row count (includes NULL)
  • COUNT(column): Counts only non-NULL rows
  • NULL percentage = (Total - Non-NULL) / Total * 100

Quiz 1: NULL Analysis of Order Items

Problem

From the order_items table:

  1. Calculate the NULL percentage for returned_at, shipped_at, delivered_at columns
  2. Output total rows, NULL count, NULL percentage(%)
  3. Sort by highest NULL percentage

Hint: COUNT(*) - COUNT(column), UNION ALL

View Answer

SELECT 'returned_at' AS column_name, COUNT(*) AS total_rows, COUNT(returned_at) AS non_null_count, COUNT(*) - COUNT(returned_at) AS null_count, ROUND((COUNT(*) - COUNT(returned_at)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.order_items` UNION ALL SELECT 'shipped_at' AS column_name, COUNT(*) AS total_rows, COUNT(shipped_at) AS non_null_count, COUNT(*) - COUNT(shipped_at) AS null_count, ROUND((COUNT(*) - COUNT(shipped_at)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.order_items` UNION ALL SELECT 'delivered_at' AS column_name, COUNT(*) AS total_rows, COUNT(delivered_at) AS non_null_count, COUNT(*) - COUNT(delivered_at) AS null_count, ROUND((COUNT(*) - COUNT(delivered_at)) * 100.0 / COUNT(*), 2) AS null_percentage FROM `your-project-id.thelook_ecommerce.order_items` ORDER BY null_percentage DESC

Execution Result:

column_nametotal_rowsnon_null_countnull_countnull_percentage
returned_at1801201805816206289.97
delivered_at1801206322611689464.90
shipped_at1801201170146310635.04

Business Insights:

  • High returned_at NULL: Most products aren’t returned (normal)
  • shipped_at NULL: Orders not yet shipped
  • delivered_at NULL: In transit or undelivered status

2. COALESCE - NULL Replacement (Priority)

Theory

COALESCE returns the first NOT NULL value among multiple values.

Syntax

COALESCE(value1, value2, value3, default_value)

Use Cases

  • Select valid contact from multiple options
  • Set default values
  • Priority-based selection

Practice: Setting Default Value for State

SELECT user_id, first_name, last_name, state, country, COALESCE(state, 'Unknown State') AS state_cleaned, COALESCE(country, 'Unknown Country') AS country_cleaned, COALESCE(state, country, 'Unknown Location') AS location FROM `your-project-id.thelook_ecommerce.users` WHERE state IS NULL OR country IS NULL LIMIT 20

Quiz 2: Replace NULL with Average Value

Problem

From the users table:

  1. Calculate the overall average age (using subquery or CTE)
  2. Replace age with average age when NULL using COALESCE
  3. Compare original age and cleaned age_cleaned

Hint: COALESCE(age, (SELECT AVG(age) FROM …))

View Answer

WITH avg_age AS ( SELECT ROUND(AVG(age), 0) AS average_age FROM `your-project-id.thelook_ecommerce.users` WHERE age IS NOT NULL ) SELECT user_id, first_name, age AS age_original, COALESCE(age, (SELECT average_age FROM avg_age)) AS age_cleaned, CASE WHEN age IS NULL THEN 'Imputed' ELSE 'Original' END AS data_source FROM `your-project-id.thelook_ecommerce.users` LIMIT 1000

Missing Value Replacement Strategies:

  • Mean: Numeric data, when distribution is close to normal
  • Median: More stable when there are many outliers
  • Mode: Categorical data
  • Group mean: Apply different averages by gender, region

3. IFNULL - Simple NULL Replacement

Theory

IFNULL is a simpler version of COALESCE that only accepts 2 arguments.

Syntax

IFNULL(column, default_value)

COALESCE vs IFNULL

  • COALESCE(a, b, c, d): Compare multiple values
  • IFNULL(a, b): Simple replacement (more concise)

Practice: Simple Default Value Setting

SELECT user_id, first_name, last_name, gender, state, IFNULL(gender, 'Unknown') AS gender_cleaned, IFNULL(state, 'N/A') AS state_cleaned, CONCAT(IFNULL(first_name, ''), ' ', IFNULL(last_name, '')) AS full_name FROM `your-project-id.thelook_ecommerce.users` WHERE gender IS NULL OR state IS NULL LIMIT 30

Quiz 3: Replace State NULL with ‘Unknown’ and Analyze

Problem

From the users table:

  1. Replace NULL state with ‘Unknown’ using IFNULL
  2. Aggregate customer count by cleaned state
  3. Query only top 15 states

Hint: IFNULL(state, ‘Unknown’), GROUP BY, ORDER BY

View Answer

SELECT IFNULL(state, 'Unknown') AS state_cleaned, COUNT(*) AS customer_count, ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM `your-project-id.thelook_ecommerce.users`), 2) AS percentage FROM `your-project-id.thelook_ecommerce.users` GROUP BY state_cleaned ORDER BY customer_count DESC LIMIT 15

Execution Result:

state_cleanedcustomer_countpercentage
Guangdong53765.38
England41264.13
California37483.75
Shanghai24432.44
Texas24072.41

Business Applications:

  • High Unknown percentage: Need to improve data collection process
  • Regional customer distribution: Marketing targeting, logistics center location decisions

4. CASE WHEN - Conditional Logic

Theory

CASE WHEN is SQL’s if-else statement.

Syntax

CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ELSE default_result END

Use Cases

  • Customer segmentation (age groups, tiers, etc.)
  • Conditional aggregation
  • Complex business logic implementation

Practice: Customer Segmentation by Age Group

SELECT user_id, first_name, age, CASE WHEN age IS NULL THEN 'Unknown' WHEN age < 20 THEN 'Teens' WHEN age < 30 THEN '20s' WHEN age < 40 THEN '30s' WHEN age < 50 THEN '40s' WHEN age < 60 THEN '50s' ELSE '60s+' END AS age_group, CASE WHEN age IS NULL THEN 'N/A' WHEN age >= 40 THEN 'VIP' ELSE 'Regular' END AS customer_tier FROM `your-project-id.thelook_ecommerce.users` LIMIT 50

Execution Result:

user_idfirst_nameageage_groupcustomer_tier
92691Tara4240sVIP
64986Jennifer4240sVIP
46547Kelly4640sVIP
26697Christina2720sRegular

Quiz 4: VIP/Regular Customer Order Analysis by Age Group

Problem

JOIN users and orders to:

  1. Classify age groups with CASE WHEN (Teens, 20s, 30s, 40s+, Unknown)
  2. Classify customer tier with CASE WHEN:
    • age >= 40: ‘VIP’
    • age < 40: ‘Regular’
    • age IS NULL: ‘Unknown’
  3. Calculate order count and average items per order by age group and tier

Hint: JOIN, GROUP BY age_group, customer_tier

View Answer

SELECT CASE WHEN u.age IS NULL THEN 'Unknown' WHEN u.age < 20 THEN 'Teens' WHEN u.age < 30 THEN '20s' WHEN u.age < 40 THEN '30s' ELSE '40s+' END AS age_group, CASE WHEN u.age IS NULL THEN 'Unknown' WHEN u.age >= 40 THEN 'VIP' ELSE 'Regular' END AS customer_tier, COUNT(DISTINCT o.order_id) AS order_count, AVG(o.num_of_item) AS avg_items_per_order FROM `your-project-id.thelook_ecommerce.users` u JOIN `your-project-id.thelook_ecommerce.orders` o ON u.user_id = o.user_id GROUP BY age_group, customer_tier ORDER BY CASE age_group WHEN 'Teens' THEN 1 WHEN '20s' THEN 2 WHEN '30s' THEN 3 WHEN '40s+' THEN 4 WHEN 'Unknown' THEN 5 END, customer_tier

Execution Result:

age_groupcustomer_tierorder_countavg_items_per_order
TeensRegular168611.44
20sRegular213851.45
30sRegular210761.45
40s+VIP654711.44

Business Insights:

  • VIP customers (40s+): Analyze order count and purchase patterns
  • Regular customers (20-30s): Segments with high growth potential

5. NULLIF - Conditional NULL Generation

Theory

NULLIF returns NULL if two values are equal.

Syntax

NULLIF(expression1, expression2) -- Returns NULL if expression1 == expression2, otherwise expression1

Use Cases

  • Prevent division by zero: value / NULLIF(count, 0)
  • Remove default values: NULLIF(column, 'default')
  • Exclude meaningless values

Practice: Prevent Division by Zero with NULLIF

SELECT p.category, COUNT(*) AS total_products, SUM(CASE WHEN oi.order_id IS NOT NULL THEN 1 ELSE 0 END) AS sold_products, -- Prevent division by zero with NULLIF ROUND( SUM(CASE WHEN oi.order_id IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / NULLIF(COUNT(*), 0), 2 ) AS sell_through_rate FROM `your-project-id.thelook_ecommerce.products` p LEFT JOIN `your-project-id.thelook_ecommerce.order_items` oi ON p.product_id = oi.product_id GROUP BY p.category ORDER BY sell_through_rate DESC LIMIT 15

Quiz 5: Safe Average Calculation

Problem

From orders:

  1. Group by status
  2. Calculate order count, total items, average items per order
  3. Prevent division by zero with NULLIF for average calculation

Hint: SUM(num_of_item) / NULLIF(COUNT(*), 0)

View Answer

SELECT status, COUNT(*) AS order_count, SUM(num_of_item) AS total_items, ROUND(SUM(num_of_item) / NULLIF(COUNT(*), 0), 2) AS avg_items_per_order FROM `your-project-id.thelook_ecommerce.orders` GROUP BY status ORDER BY order_count DESC

Execution Result:

statusorder_counttotal_itemsavg_items_per_order
Shipped37302537881.44
Complete31161451681.45
Processing25171362051.44
Cancelled18648269011.44
Returned12511180581.44

NULLIF Use Scenarios:

  • Prevent division by zero: value / NULLIF(denominator, 0)
  • Remove meaningless values: NULLIF(column, '')
  • Filter default values: NULLIF(column, 'N/A')

Summary

NULL Handling Functions Covered

FunctionUseExample
IS NULLCheck for NULLWHERE age IS NULL
COALESCE(a,b,c)First NOT NULL valueCOALESCE(state, country, ‘Unknown’)
IFNULL(a,b)Simple replacementIFNULL(age, 0)
CASE WHENConditional logicAge group classification, customer tier
NULLIF(a,b)Conditional NULL generationNULLIF(count, 0)

NULL Handling Best Practices

  1. Data Quality Monitoring

    • Regularly measure NULL percentages
    • Alert when threshold exceeded
  2. Appropriate Replacement Strategies

    • Numbers: Mean, median, 0
    • Categories: Mode, ‘Unknown’
    • Dates: Specific reference date, previous/next value
  3. Business Logic Implementation

    • Complex classification with CASE WHEN
    • Customer segmentation, product grading
  4. Safe Operations

    • Prevent division by zero with NULLIF
    • Guarantee default values with COALESCE

Next Steps

You’ve mastered NULL handling! Next, learn complex query structuring and advanced analysis techniques in CTE and Cohort Analysis.

Last updated on

🤖AI 모의면접실전처럼 연습하기