Data Cleaning and Preprocessing
Level: Beginner to Intermediate
Learning Objectives
After completing this recipe, you will be able to:
- Detect and handle missing values
- Check for and remove duplicate data
- Detect outliers with the IQR method
- Convert and normalize data types
0. Setup
Load CSV data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")실행 결과
Data loaded! - orders: 29,761 rows - order_items: 60,350 rows
1. Checking Missing Values
Theory
Missing values are entries for which no value was recorded. In pandas, they are represented as NaN (Not a Number) or None.
Methods to Check Missing Values
import pandas as pd
import numpy as np
# Sample data
df_sample = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})
# 1. Check missing value count
print("Missing value count by column:")
print(df_sample.isnull().sum())
# 2. Check missing value ratio
print("\nMissing value ratio (%):")
print((df_sample.isnull().sum() / len(df_sample) * 100).round(2))
# 3. Check overall info
df_sample.info()

Execution Result:
Missing value count by column:
user_id    0
name       1
age        1
city       1
dtype: int64

Missing value ratio (%):
user_id     0.0
name       20.0
age        20.0
city       20.0
dtype: float64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_id  5 non-null      int64
 1   name     4 non-null      object
 2   age      4 non-null      float64
 3   city     4 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 292.0+ bytes

Visualizing Missing Values
# Show only columns with missing values
missing = df_sample.isnull().sum()
missing_pct = (missing / len(df_sample) * 100).round(2)
missing_df = pd.DataFrame({
'missing_count': missing,
'missing_ratio(%)': missing_pct
})
# Filter only columns with missing values
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values('missing_count', ascending=False)
if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values!")

Execution Result:
      missing_count  missing_ratio(%)
name              1              20.0
age               1              20.0
city              1              20.0
2. Handling Missing Values
Summary of Methods
| Method | Function | Description | Best For |
|---|---|---|---|
| Delete | dropna() | Remove rows/columns with missing values | When missing values are few |
| Replace | fillna() | Fill with specific value | When missing values are many |
| Interpolate | interpolate() | Estimate from surrounding values | Time series data |
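The interpolate() row above is not otherwise demonstrated in this recipe, so here is a minimal sketch of linear interpolation on a synthetic hourly series (the sensor_temp column is made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical hourly readings with gaps
ts = pd.DataFrame(
    {'sensor_temp': [20.0, np.nan, 22.0, np.nan, np.nan, 25.0]},
    index=pd.date_range('2024-01-01', periods=6, freq='h')
)
# Linear interpolation estimates each gap from its neighbors
ts['sensor_temp'] = ts['sensor_temp'].interpolate(method='linear')
print(ts['sensor_temp'].tolist())  # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]

For irregularly spaced timestamps, method='time' weights by the actual time gaps instead.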
2-1. Deleting Missing Values (dropna)
# Recreate sample data
df_test = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})
# Delete rows with missing values
df_dropped = df_test.dropna()
print(f"Before: {len(df_test)} rows -> After: {len(df_dropped)} rows")
# Delete rows with missing values only in specific columns
df_dropped_name = df_test.dropna(subset=['name'])
print(f"After deleting based on name: {len(df_dropped_name)} rows")
# Delete only rows where all values are missing
df_dropped_all = df_test.dropna(how='all')
print(f"After how='all': {len(df_dropped_all)} rows")실행 결과
Before: 5 rows -> After: 2 rows After deleting based on name: 4 rows After how='all': 5 rows
2-2. Replacing Missing Values (fillna)
# Recreate sample data
df_fill = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York'],
'value': [100, np.nan, 150, np.nan, 200]
})
# 1. Replace with specific value
df_fill['city'] = df_fill['city'].fillna('Unknown')
print("After city replacement:")
print(df_fill['city'].tolist())
# 2. Replace with mean (numeric)
df_fill['age'] = df_fill['age'].fillna(df_fill['age'].mean())
print(f"\nAfter age mean replacement: {df_fill['age'].tolist()}")
# 3. Forward fill (fill with previous value)
df_fill['value'] = df_fill['value'].ffill()
print(f"After value ffill: {df_fill['value'].tolist()}")
# 4. Backward fill (fill with next value)
df_test2 = pd.DataFrame({'v': [np.nan, 10, np.nan, 30, np.nan]})
df_test2['v'] = df_test2['v'].bfill()
print(f"bfill result: {df_test2['v'].tolist()}")실행 결과
After city replacement: ['New York', 'Boston', 'Chicago', 'Unknown', 'New York'] After age mean replacement: [25.0, 30.0, 29.5, 28.0, 35.0] After value ffill: [100.0, 100.0, 150.0, 150.0, 200.0] bfill result: [10.0, 10.0, 30.0, 30.0, nan]
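When several columns need different fill values, one fillna() call with a column-to-value dict keeps this tidy; a minimal sketch on a small hypothetical frame:

df_multi = pd.DataFrame({
    'age': [25, np.nan, 31],
    'city': ['Boston', None, 'Chicago']
})
# One call, per-column fill values
df_multi = df_multi.fillna({'age': df_multi['age'].median(), 'city': 'Unknown'})
print(df_multi['age'].tolist())   # [25.0, 28.0, 31.0]
print(df_multi['city'].tolist())  # ['Boston', 'Unknown', 'Chicago']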
Practical: Auto-handling by Data Type
def handle_missing_values(df):
    """Auto-handle missing values based on data type"""
    df_clean = df.copy()
    # Numeric columns: replace with median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df_clean[col].isnull().sum() > 0:
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            print(f"[OK] {col}: replaced with median ({median_val:.2f})")
    # Text columns: replace with 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        if df_clean[col].isnull().sum() > 0:
            df_clean[col] = df_clean[col].fillna('Unknown')
            print(f"[OK] {col}: replaced with 'Unknown'")
    return df_clean
# Apply (using orders data)
print("Missing value status:")
print(orders.isnull().sum()[orders.isnull().sum() > 0])
print()
orders_clean = handle_missing_values(orders)
print(f"\nMissing values after processing: {orders_clean.isnull().sum().sum()}")실행 결과
Missing value status: status 743 num_of_item 2089 dtype: int64 [OK] num_of_item: replaced with median (2.00) [OK] status: replaced with 'Unknown' Missing values after processing: 0
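A single global median can blur real differences between segments. If a grouping column exists, one refinement is to fill each group with its own median via groupby().transform(); a sketch on a hypothetical frame with a category column:

# Fill missing prices with the median of each category
# (df_grp and its columns are hypothetical)
df_grp = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'price': [10.0, np.nan, 100.0, 120.0, np.nan]
})
group_median = df_grp.groupby('category')['price'].transform('median')
df_grp['price'] = df_grp['price'].fillna(group_median)
print(df_grp['price'].tolist())  # [10.0, 10.0, 100.0, 120.0, 110.0]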
Quiz 1: Missing Value Handling
Problem
From order data:
- Check missing value count and ratio for each column
- Replace numeric columns with median, text columns with 'Unconfirmed'
- Compare missing value counts before and after processing
Answer
import pandas as pd
import numpy as np
# Create sample data with missing values
np.random.seed(42)
df_quiz = pd.DataFrame({
'order_id': range(1, 1001),
'user_id': np.random.randint(1, 100, 1000),
'status': np.random.choice(['Complete', 'Pending', None], 1000, p=[0.7, 0.2, 0.1]),
'amount': np.random.choice([100, 200, 300, np.nan], 1000, p=[0.4, 0.3, 0.2, 0.1])
})
# 1. Missing value status
print("=== Missing Value Status ===")
missing = df_quiz.isnull().sum()
missing_pct = (missing / len(df_quiz) * 100).round(2)
for col in df_quiz.columns:
    if missing[col] > 0:
        print(f"{col}: {missing[col]} ({missing_pct[col]}%)")
# 2. Handle missing values
print("\n=== Handling Missing Values ===")
before = df_quiz.isnull().sum().sum()
# Numeric -> median
for col in df_quiz.select_dtypes(include=[np.number]).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna(df_quiz[col].median())
# Text -> 'Unconfirmed'
for col in df_quiz.select_dtypes(include=['object']).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna('Unconfirmed')
after = df_quiz.isnull().sum().sum()
# 3. Compare results
print(f"Before: {before}")
print(f"After: {after}")
print(f"Handled: {before - after}")실행 결과
=== Missing Value Status === status: 100 (10.0%) amount: 100 (10.0%) === Handling Missing Values === Before: 200 After: 0 Handled: 200
3. Handling Duplicate Data
Theory
Duplicate data occurs when the same row appears multiple times. It should always be checked, because duplicates inflate counts and can distort analysis results.
Checking and Removing Duplicates
# Check duplicates in order_items
print("=== order_items Duplicate Check ===")
# 1. Check completely identical rows
n_duplicates = order_items.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")
# 2. View duplicated rows (first 5)
print("\nDuplicated rows (sample):")
duplicates = order_items[order_items.duplicated(keep=False)]
print(duplicates.head())
# 3. Remove duplicates (keep first row)
order_items_clean = order_items.drop_duplicates(keep='first')
print(f"\nBefore removal: {len(order_items):,} rows")
print(f"After removal: {len(order_items_clean):,} rows")실행 결과
=== order_items Duplicate Check === Duplicate row count: 0 Duplicated rows (sample): Empty DataFrame Columns: [id, order_id, product_id, inventory_item_id, sale_price, status, created_at, shipped_at, delivered_at, returned_at] Index: [] Before removal: 60,350 rows After removal: 60,350 rows
Duplicate Removal Options
| Option | Description |
|---|---|
| keep='first' | Keep first row (default) |
| keep='last' | Keep last row |
| keep=False | Delete all duplicated rows |
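A tiny frame makes the three options easy to compare (df_kp is made up here):

df_kp = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(len(df_kp.drop_duplicates(keep='first')))  # 2: first (1, 'x') row kept
print(len(df_kp.drop_duplicates(keep='last')))   # 2: last (1, 'x') row kept
print(len(df_kp.drop_duplicates(keep=False)))    # 1: both (1, 'x') rows dropped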
Practical: Duplicate Data Analysis
# Detailed duplicate data analysis
def analyze_duplicates(df, subset=None):
    """Detailed duplicate data analysis"""
    print("Duplicate Data Analysis")
    print("=" * 60)
    # Total duplicates
    n_dup = df.duplicated(subset=subset).sum()
    pct_dup = (n_dup / len(df) * 100)
    print(f"Total rows: {len(df):,}")
    print(f"Duplicated rows: {n_dup:,} ({pct_dup:.2f}%)")
    # Removal simulation
    df_dedup = df.drop_duplicates(subset=subset, keep='first')
    n_after = len(df_dedup)
    print(f"After removal: {n_after:,}")
    print(f"Rows to be removed: {len(df) - n_after:,}")
    return df_dedup
# Execute
order_items_clean = analyze_duplicates(order_items)

Execution Result:
Duplicate Data Analysis
============================================================
Total rows: 60,350
Duplicated rows: 0 (0.00%)
After removal: 60,350
Rows to be removed: 0
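Because analyze_duplicates() accepts a subset, you can also check a business key rather than whole rows. For example, if each order were expected to list a product only once (an assumption about this dataset, not a confirmed rule), the call would look like:

# Duplicate check on a candidate key instead of full rows
_ = analyze_duplicates(order_items, subset=['order_id', 'product_id'])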
Quiz 2: Duplicate Data Handling
Problem
- Check how many completely duplicated rows exist in the data
- Remove duplicated rows (keep first row only)
- Compare and output data size before and after removal
Answer
import pandas as pd
import numpy as np
# Create test data with duplicates
np.random.seed(42)
df_test = pd.DataFrame({
'id': np.random.randint(1, 100, 1000),
'value': np.random.randint(1, 50, 1000)
})
# Intentionally add duplicates
df_test = pd.concat([df_test, df_test.sample(50)], ignore_index=True)
print("Duplicate Data Handling")
print("=" * 60)
# 1. Check duplicate row count
n_duplicates = df_test.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")
# Size before removal
original_size = df_test.shape[0]
print(f"Data size before removal: {original_size:,} rows")
# 2. Remove duplicates (keep first row only)
df_no_dup = df_test.drop_duplicates(keep='first')
# 3. Size after removal
new_size = df_no_dup.shape[0]
print(f"Data size after removal: {new_size:,} rows")
# Comparison
removed = original_size - new_size
removed_pct = (removed / original_size * 100)
print(f"\nRows removed: {removed:,}")
print(f"Removal ratio: {removed_pct:.2f}%")실행 결과
Duplicate Data Handling ============================================================ Duplicate row count: 89 Data size before removal: 1,050 rows Data size after removal: 961 rows Rows removed: 89 Removal ratio: 8.48%
4. Outlier Detection (IQR Method)
Theory
Outliers are extreme values that deviate markedly from the rest of the data. The IQR (Interquartile Range) method is one of the most widely used ways to detect them.
IQR Formula:
- Q1 = 25th percentile (25% of values fall below it)
- Q3 = 75th percentile (75% of values fall below it)
- IQR = Q3 - Q1
- Lower bound: Q1 - 1.5 x IQR
- Upper bound: Q3 + 1.5 x IQR
- Below lower bound or above upper bound -> Outlier
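As a quick sanity check of the formula before applying it to real data, a tiny made-up sample where the outlier is obvious:

import numpy as np

values = np.array([12, 14, 15, 15, 16, 18, 95])  # 95 is the obvious outlier
Q1, Q3 = np.percentile(values, [25, 75])          # 14.50 and 17.00
IQR = Q3 - Q1                                     # 2.50
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR     # 10.75 and 20.75
print(values[(values < lower) | (values > upper)])  # [95]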
Outlier Detection Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method"""
    print(f"Outlier Detection for {column} (IQR Method)")
    print("=" * 60)
    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    print(f"Q1 (25th percentile): {Q1:,.2f}")
    print(f"Q3 (75th percentile): {Q3:,.2f}")
    print(f"IQR: {IQR:,.2f}")
    print(f"Lower bound: {lower_bound:,.2f}")
    print(f"Upper bound: {upper_bound:,.2f}")
    # Find outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    n_outliers = len(outliers)
    print(f"\nOutlier count: {n_outliers:,} ({n_outliers/len(df)*100:.2f}%)")
    return outliers, lower_bound, upper_bound
# Execute (using sale_price column from order_items)
outliers, lower_bound, upper_bound = detect_outliers_iqr(order_items, 'sale_price')

Execution Result:
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 29.99
Q3 (75th percentile): 74.00
IQR: 44.01
Lower bound: -36.02
Upper bound: 140.02

Outlier count: 8,521 (14.12%)
Testing with Data Containing Outliers
# Create test data with outliers
np.random.seed(42)
test_prices = pd.DataFrame({
'sale_price': np.concatenate([
np.random.normal(100, 30, 9500), # Normal data
np.random.uniform(300, 500, 300), # High outliers
np.random.uniform(-50, 0, 200) # Low outliers (negative prices)
])
})
outliers, lower_bound, upper_bound = detect_outliers_iqr(test_prices, 'sale_price')
print(f"\nOutlier sample (top 5):")
print(outliers.nlargest(5, 'sale_price'))

Execution Result:
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 80.12
Q3 (75th percentile): 120.45
IQR: 40.33
Lower bound: 19.63
Upper bound: 180.95
Outlier count: 756 (7.56%)
Outlier sample (top 5):
sale_price
9567 498.23
9612 495.87
9534 492.15
9589 489.34
9501        487.92

Visualizing Outliers
def visualize_outliers(df, column):
    """Visualize outliers with boxplot and histogram"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Boxplot
    axes[0].boxplot(df[column].dropna(), vert=False)
    axes[0].set_xlabel(column)
    axes[0].set_title(f'{column} Boxplot')
    # Histogram
    axes[1].hist(df[column].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[1].axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:.0f}')
    axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:.0f}')
    axes[1].set_xlabel(column)
    axes[1].set_ylabel('Frequency')
    axes[1].set_title(f'{column} Histogram')
    axes[1].legend()
    plt.tight_layout()
    plt.show()

visualize_outliers(test_prices, 'sale_price')

Execution Result:
[Graph Displayed]
Handling Outliers
# Outlier handling example (using test_prices data)
df_outlier = test_prices.copy()
# Calculate bounds
Q1 = df_outlier['sale_price'].quantile(0.25)
Q3 = df_outlier['sale_price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Outlier bounds: {lower_bound:.2f} ~ {upper_bound:.2f}")
# 1. Remove outliers
df_removed = df_outlier[(df_outlier['sale_price'] >= lower_bound) &
(df_outlier['sale_price'] <= upper_bound)]
print(f"\n1. Remove outliers: {len(df_outlier):,} rows -> {len(df_removed):,} rows")
# 2. Replace outliers with bounds (Winsorizing)
df_capped = df_outlier.copy()
df_capped['sale_price_capped'] = df_capped['sale_price'].clip(lower=lower_bound, upper=upper_bound)
print(f"2. Winsorizing applied")
print(f" Original max: {df_outlier['sale_price'].max():.2f}")
print(f" After clipping max: {df_capped['sale_price_capped'].max():.2f}")
# 3. Replace outliers with NaN then median
df_replaced = df_outlier.copy()
mask = (df_replaced['sale_price'] < lower_bound) | (df_replaced['sale_price'] > upper_bound)
median_val = df_replaced.loc[~mask, 'sale_price'].median()
df_replaced.loc[mask, 'sale_price'] = median_val
print(f"3. Median replacement: {mask.sum()} outliers replaced with {median_val:.2f}")실행 결과
Outlier bounds: 19.63 ~ 180.95 1. Remove outliers: 10,000 rows -> 9,244 rows 2. Winsorizing applied Original max: 498.23 After clipping max: 180.95 3. Median replacement: 756 outliers replaced with 100.12
Quiz 3: Outlier Detection
Problem
Select a numeric column and:
- Calculate outlier criteria using IQR method
- Output outlier count and ratio to total
- Visualize with boxplot
Answer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create test data with outliers
np.random.seed(42)
df_quiz = pd.DataFrame({
'quantity': np.concatenate([
np.random.poisson(5, 9000), # Normal data
np.random.randint(50, 100, 500), # Outliers
np.array([0] * 500) # Zero values
])
})
selected_col = 'quantity'
print(f"Outlier Detection for {selected_col} (IQR Method)")
print("=" * 60)
# 1. Calculate IQR
Q1 = df_quiz[selected_col].quantile(0.25)
Q3 = df_quiz[selected_col].quantile(0.75)
IQR = Q3 - Q1
# Outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1 (25th percentile): {Q1:,.2f}")
print(f"Q3 (75th percentile): {Q3:,.2f}")
print(f"IQR: {IQR:,.2f}")
print(f"Outlier lower bound: {lower_bound:,.2f}")
print(f"Outlier upper bound: {upper_bound:,.2f}")
# 2. Outlier count
outliers = df_quiz[(df_quiz[selected_col] < lower_bound) | (df_quiz[selected_col] > upper_bound)]
outlier_count = len(outliers)
outlier_pct = (outlier_count / len(df_quiz) * 100)
print(f"\nOutlier count: {outlier_count:,}")
print(f"Ratio to total: {outlier_pct:.2f}%")
# 3. Boxplot visualization
plt.figure(figsize=(10, 6))
plt.boxplot(df_quiz[selected_col].dropna(), vert=False)
plt.axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:,.0f}')
plt.axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:,.0f}')
plt.xlabel(selected_col)
plt.title(f'{selected_col} Boxplot (Outlier Check)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Execution Result:
Outlier Detection for quantity (IQR Method)
============================================================
Q1 (25th percentile): 3.00
Q3 (75th percentile): 7.00
IQR: 4.00
Outlier lower bound: -3.00
Outlier upper bound: 13.00

Outlier count: 512
Ratio to total: 5.12%
[Graph Displayed]
5. Data Cleaning Pipeline
Practical Example: Comprehensive Data Cleaning
import pandas as pd
import numpy as np
def clean_data_pipeline(df, verbose=True):
    """Data cleaning pipeline"""
    df_clean = df.copy()
    if verbose:
        print("Data Cleaning Pipeline Started")
        print("=" * 60)
        print(f"Original data: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")
    # 1. Remove duplicates
    before = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    after = len(df_clean)
    if verbose:
        print(f"\n[1] Duplicate removal: {before - after:,} rows removed")
    # 2. Handle missing values
    missing_before = df_clean.isnull().sum().sum()
    # Numeric -> median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())
    # Text -> 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_clean[col] = df_clean[col].fillna('Unknown')
    missing_after = df_clean.isnull().sum().sum()
    if verbose:
        print(f"[2] Missing value handling: {missing_before - missing_after:,} handled")
    # 3. Data type optimization
    for col in object_cols:
        if df_clean[col].nunique() / len(df_clean) < 0.5:
            df_clean[col] = df_clean[col].astype('category')
    if verbose:
        print("[3] Data type optimization completed")
        print(f"\nCleaning complete: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")
    return df_clean
# Execute (using orders data)
orders_clean = clean_data_pipeline(orders)

Execution Result:
Data Cleaning Pipeline Started
============================================================
Original data: 29,761 rows x 6 columns

[1] Duplicate removal: 0 rows removed
[2] Missing value handling: 2,832 handled
[3] Data type optimization completed

Cleaning complete: 29,761 rows x 6 columns
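The category conversion in step 3 mainly saves memory on low-cardinality text columns. One way to verify the effect (the exact numbers will vary with the data):

# Compare total memory before and after cleaning
mem_before = orders.memory_usage(deep=True).sum()
mem_after = orders_clean.memory_usage(deep=True).sum()
print(f"Before: {mem_before / 1024:,.0f} KB -> After: {mem_after / 1024:,.0f} KB")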
Summary
Key Functions Summary
| Task | Function | Example |
|---|---|---|
| Check missing values | df.isnull().sum() | Missing count by column |
| Missing value ratio | df.isnull().mean() * 100 | Missing ratio by column |
| Delete missing values | df.dropna() | Delete rows with missing values |
| Replace missing values | df.fillna(value) | Replace with specific value |
| Check duplicates | df.duplicated().sum() | Duplicate row count |
| Remove duplicates | df.drop_duplicates() | Remove duplicate rows |
| Outlier detection | df.quantile([0.25, 0.75]) | Calculate IQR |
| Value clipping | df['col'].clip(lower, upper) | Limit to bounds |
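When the defaults above are acceptable, the same functions chain into a one-pass cleanup; a sketch on a small hypothetical frame (column names and fill choices are illustrative):

df_raw = pd.DataFrame({
    'status': ['Complete', None, 'Pending', 'Pending'],
    'amount': [100.0, np.nan, 250.0, 250.0]   # last row duplicates the third
})
df_tidy = (
    df_raw
    .drop_duplicates()                                      # drop the repeated row
    .fillna({'status': 'Unknown',                           # text -> label
             'amount': df_raw['amount'].median()})          # numeric -> median
    .assign(amount=lambda d: d['amount'].clip(upper=200))   # cap extreme values
)
print(df_tidy)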
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| WHERE col IS NOT NULL | df[df['col'].notna()] |
| COALESCE(col, replacement) | df['col'].fillna(replacement) |
| SELECT DISTINCT * | df.drop_duplicates() |
| COUNT(*) minus the distinct row count | df.duplicated().sum() |
Next Steps
You've mastered data cleaning! Next, learn advanced filtering techniques, including query(), isin(), and between(), in Data Filtering.