Data Cleaning and Preprocessing
Level: Beginner to Intermediate
Learning Objectives
After completing this recipe, you will be able to:
- Detect and handle missing values
- Check for and remove duplicate data
- Detect outliers with the IQR method
- Convert and normalize data types
0. Setup
Load CSV data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")실행 결과
Data loaded! - orders: 29,761 rows - order_items: 60,350 rows
1. Checking Missing Values
Theory
Missing values are entries for which no value was recorded. In pandas, they are represented as NaN (Not a Number) or None.
Methods to Check Missing Values
import pandas as pd
import numpy as np
# Sample data
df_sample = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})
# 1. Check missing value count
print("Missing value count by column:")
print(df_sample.isnull().sum())
# 2. Check missing value ratio
print("\nMissing value ratio (%):")
print((df_sample.isnull().sum() / len(df_sample) * 100).round(2))
# 3. Check overall info
df_sample.info()

Execution Result:
Missing value count by column:
user_id    0
name       1
age        1
city       1
dtype: int64

Missing value ratio (%):
user_id     0.0
name       20.0
age        20.0
city       20.0
dtype: float64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_id  5 non-null      int64
 1   name     4 non-null      object
 2   age      4 non-null      float64
 3   city     4 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 292.0+ bytes

Visualizing Missing Values
# Show only columns with missing values
missing = df_sample.isnull().sum()
missing_pct = (missing / len(df_sample) * 100).round(2)
missing_df = pd.DataFrame({
'missing_count': missing,
'missing_ratio(%)': missing_pct
})
# Filter only columns with missing values
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values('missing_count', ascending=False)
if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values!")

Execution Result:
      missing_count  missing_ratio(%)
name              1              20.0
age               1              20.0
city              1              20.0
2. Handling Missing Values
Summary of Methods
| Method | Function | Description | Best For |
|---|---|---|---|
| Delete | dropna() | Remove rows/columns with missing values | When missing values are few |
| Replace | fillna() | Fill with specific value | When missing values are many |
| Interpolate | interpolate() | Estimate from surrounding values | Time series data |
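The interpolate() row above is not otherwise demonstrated in this recipe, so here is a minimal sketch of linear interpolation on a synthetic hourly series (the sensor_temp column is made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical hourly readings with gaps
ts = pd.DataFrame(
    {'sensor_temp': [20.0, np.nan, 22.0, np.nan, np.nan, 25.0]},
    index=pd.date_range('2024-01-01', periods=6, freq='h')
)
# Linear interpolation estimates each gap from its neighbors
ts['sensor_temp'] = ts['sensor_temp'].interpolate(method='linear')
print(ts['sensor_temp'].tolist())  # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]

For irregularly spaced timestamps, method='time' weights by the actual time gaps instead.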
2-1. Deleting Missing Values (dropna)
# Recreate sample data
df_test = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})
# Delete rows with missing values
df_dropped = df_test.dropna()
print(f"Before: {len(df_test)} rows -> After: {len(df_dropped)} rows")
# Delete rows with missing values only in specific columns
df_dropped_name = df_test.dropna(subset=['name'])
print(f"After deleting based on name: {len(df_dropped_name)} rows")
# Delete only rows where all values are missing
df_dropped_all = df_test.dropna(how='all')
print(f"After how='all': {len(df_dropped_all)} rows")실행 결과
Before: 5 rows -> After: 2 rows After deleting based on name: 4 rows After how='all': 5 rows
2-2. Replacing Missing Values (fillna)
# Recreate sample data
df_fill = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
'age': [25, 30, np.nan, 28, 35],
'city': ['New York', 'Boston', 'Chicago', None, 'New York'],
'value': [100, np.nan, 150, np.nan, 200]
})
# 1. Replace with specific value
df_fill['city'] = df_fill['city'].fillna('Unknown')
print("After city replacement:")
print(df_fill['city'].tolist())
# 2. Replace with mean (numeric)
df_fill['age'] = df_fill['age'].fillna(df_fill['age'].mean())
print(f"\nAfter age mean replacement: {df_fill['age'].tolist()}")
# 3. Forward fill (fill with previous value)
df_fill['value'] = df_fill['value'].ffill()
print(f"After value ffill: {df_fill['value'].tolist()}")
# 4. Backward fill (fill with next value)
df_test2 = pd.DataFrame({'v': [np.nan, 10, np.nan, 30, np.nan]})
df_test2['v'] = df_test2['v'].bfill()
print(f"bfill result: {df_test2['v'].tolist()}")실행 결과
After city replacement: ['New York', 'Boston', 'Chicago', 'Unknown', 'New York'] After age mean replacement: [25.0, 30.0, 29.5, 28.0, 35.0] After value ffill: [100.0, 100.0, 150.0, 150.0, 200.0] bfill result: [10.0, 10.0, 30.0, 30.0, nan]
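When several columns need different fill values, one fillna() call with a column-to-value dict keeps this tidy; a minimal sketch on a small hypothetical frame:

df_multi = pd.DataFrame({
    'age': [25, np.nan, 31],
    'city': ['Boston', None, 'Chicago']
})
# One call, per-column fill values
df_multi = df_multi.fillna({'age': df_multi['age'].median(), 'city': 'Unknown'})
print(df_multi['age'].tolist())   # [25.0, 28.0, 31.0]
print(df_multi['city'].tolist())  # ['Boston', 'Unknown', 'Chicago']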
Practical: Auto-handling by Data Type
def handle_missing_values(df):
    """Auto-handle missing values based on data type"""
    df_clean = df.copy()
    # Numeric columns: replace with median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df_clean[col].isnull().sum() > 0:
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            print(f"[OK] {col}: replaced with median ({median_val:.2f})")
    # Text columns: replace with 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        if df_clean[col].isnull().sum() > 0:
            df_clean[col] = df_clean[col].fillna('Unknown')
            print(f"[OK] {col}: replaced with 'Unknown'")
    return df_clean
# Apply (using orders data)
print("Missing value status:")
print(orders.isnull().sum()[orders.isnull().sum() > 0])
print()
orders_clean = handle_missing_values(orders)
print(f"\nMissing values after processing: {orders_clean.isnull().sum().sum()}")실행 결과
Missing value status: status 743 num_of_item 2089 dtype: int64 [OK] num_of_item: replaced with median (2.00) [OK] status: replaced with 'Unknown' Missing values after processing: 0
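A single global median can blur real differences between segments. If a grouping column exists, one refinement is to fill each group with its own median via groupby().transform(); a sketch on a hypothetical frame with a category column:

# Fill missing prices with the median of each category
# (df_grp and its columns are hypothetical)
df_grp = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'price': [10.0, np.nan, 100.0, 120.0, np.nan]
})
group_median = df_grp.groupby('category')['price'].transform('median')
df_grp['price'] = df_grp['price'].fillna(group_median)
print(df_grp['price'].tolist())  # [10.0, 10.0, 100.0, 120.0, 110.0]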
Quiz 1: Missing Value Handling
Problem
From order data:
- Check missing value count and ratio for each column
- Replace numeric columns with median, text columns with 'Unconfirmed'
- Compare missing value counts before and after processing
Answer
import pandas as pd
import numpy as np
# Create sample data with missing values
np.random.seed(42)
df_quiz = pd.DataFrame({
'order_id': range(1, 1001),
'user_id': np.random.randint(1, 100, 1000),
'status': np.random.choice(['Complete', 'Pending', None], 1000, p=[0.7, 0.2, 0.1]),
'amount': np.random.choice([100, 200, 300, np.nan], 1000, p=[0.4, 0.3, 0.2, 0.1])
})
# 1. Missing value status
print("=== Missing Value Status ===")
missing = df_quiz.isnull().sum()
missing_pct = (missing / len(df_quiz) * 100).round(2)
for col in df_quiz.columns:
    if missing[col] > 0:
        print(f"{col}: {missing[col]} ({missing_pct[col]}%)")
# 2. Handle missing values
print("\n=== Handling Missing Values ===")
before = df_quiz.isnull().sum().sum()
# Numeric -> median
for col in df_quiz.select_dtypes(include=[np.number]).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna(df_quiz[col].median())
# Text -> 'Unconfirmed'
for col in df_quiz.select_dtypes(include=['object']).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna('Unconfirmed')
after = df_quiz.isnull().sum().sum()
# 3. Compare results
print(f"Before: {before}")
print(f"After: {after}")
print(f"Handled: {before - after}")실행 결과
=== Missing Value Status === status: 100 (10.0%) amount: 100 (10.0%) === Handling Missing Values === Before: 200 After: 0 Handled: 200
3. Handling Duplicate Data
Theory
Duplicate data occurs when the same row appears multiple times. It should always be checked, because duplicates inflate counts and can distort analysis results.
Checking and Removing Duplicates
# Check duplicates in order_items
print("=== order_items Duplicate Check ===")
# 1. Check completely identical rows
n_duplicates = order_items.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")
# 2. View duplicated rows (first 5)
print("\nDuplicated rows (sample):")
duplicates = order_items[order_items.duplicated(keep=False)]
print(duplicates.head())
# 3. Remove duplicates (keep first row)
order_items_clean = order_items.drop_duplicates(keep='first')
print(f"\nBefore removal: {len(order_items):,} rows")
print(f"After removal: {len(order_items_clean):,} rows")실행 결과
=== order_items Duplicate Check === Duplicate row count: 0 Duplicated rows (sample): Empty DataFrame Columns: [id, order_id, product_id, inventory_item_id, sale_price, status, created_at, shipped_at, delivered_at, returned_at] Index: [] Before removal: 60,350 rows After removal: 60,350 rows
Duplicate Removal Options
| Option | Description |
|---|---|
| keep='first' | Keep first row (default) |
| keep='last' | Keep last row |
| keep=False | Delete all duplicated rows |
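A tiny frame makes the three options easy to compare (df_kp is made up here):

df_kp = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(len(df_kp.drop_duplicates(keep='first')))  # 2: first (1, 'x') row kept
print(len(df_kp.drop_duplicates(keep='last')))   # 2: last (1, 'x') row kept
print(len(df_kp.drop_duplicates(keep=False)))    # 1: both (1, 'x') rows dropped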
Practical: Duplicate Data Analysis
# Detailed duplicate data analysis
def analyze_duplicates(df, subset=None):
    """Detailed duplicate data analysis"""
    print("Duplicate Data Analysis")
    print("=" * 60)
    # Total duplicates
    n_dup = df.duplicated(subset=subset).sum()
    pct_dup = (n_dup / len(df) * 100)
    print(f"Total rows: {len(df):,}")
    print(f"Duplicated rows: {n_dup:,} ({pct_dup:.2f}%)")
    # Removal simulation
    df_dedup = df.drop_duplicates(subset=subset, keep='first')
    n_after = len(df_dedup)
    print(f"After removal: {n_after:,}")
    print(f"Rows to be removed: {len(df) - n_after:,}")
    return df_dedup
# Execute
order_items_clean = analyze_duplicates(order_items)

Execution Result:
Duplicate Data Analysis
============================================================
Total rows: 60,350
Duplicated rows: 0 (0.00%)
After removal: 60,350
Rows to be removed: 0
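Because analyze_duplicates() accepts a subset, you can also check a business key rather than whole rows. For example, if each order were expected to list a product only once (an assumption about this dataset, not a confirmed rule), the call would look like:

# Duplicate check on a candidate key instead of full rows
_ = analyze_duplicates(order_items, subset=['order_id', 'product_id'])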
Quiz 2: Duplicate Data Handling
Problem
- Check how many completely duplicated rows exist in the data
- Remove duplicated rows (keep first row only)
- Compare and output data size before and after removal
Answer
import pandas as pd
import numpy as np
# Create test data with duplicates
np.random.seed(42)
df_test = pd.DataFrame({
'id': np.random.randint(1, 100, 1000),
'value': np.random.randint(1, 50, 1000)
})
# Intentionally add duplicates
df_test = pd.concat([df_test, df_test.sample(50)], ignore_index=True)
print("Duplicate Data Handling")
print("=" * 60)
# 1. Check duplicate row count
n_duplicates = df_test.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")
# Size before removal
original_size = df_test.shape[0]
print(f"Data size before removal: {original_size:,} rows")
# 2. Remove duplicates (keep first row only)
df_no_dup = df_test.drop_duplicates(keep='first')
# 3. Size after removal
new_size = df_no_dup.shape[0]
print(f"Data size after removal: {new_size:,} rows")
# Comparison
removed = original_size - new_size
removed_pct = (removed / original_size * 100)
print(f"\nRows removed: {removed:,}")
print(f"Removal ratio: {removed_pct:.2f}%")실행 결과
Duplicate Data Handling ============================================================ Duplicate row count: 89 Data size before removal: 1,050 rows Data size after removal: 961 rows Rows removed: 89 Removal ratio: 8.48%
4. Outlier Detection (IQR Method)
Theory
Outliers are extreme values that deviate markedly from the rest of the data. The IQR (Interquartile Range) method is one of the most widely used ways to detect them.
IQR Formula:
- Q1 = 25th percentile (25% of values fall below it)
- Q3 = 75th percentile (75% of values fall below it)
- IQR = Q3 - Q1
- Lower bound: Q1 - 1.5 x IQR
- Upper bound: Q3 + 1.5 x IQR
- Below lower bound or above upper bound -> Outlier
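As a quick sanity check of the formula before applying it to real data, a tiny made-up sample where the outlier is obvious:

import numpy as np

values = np.array([12, 14, 15, 15, 16, 18, 95])  # 95 is the obvious outlier
Q1, Q3 = np.percentile(values, [25, 75])          # 14.50 and 17.00
IQR = Q3 - Q1                                     # 2.50
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR     # 10.75 and 20.75
print(values[(values < lower) | (values > upper)])  # [95]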
Outlier Detection Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method"""
    print(f"Outlier Detection for {column} (IQR Method)")
    print("=" * 60)
    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    print(f"Q1 (25th percentile): {Q1:,.2f}")
    print(f"Q3 (75th percentile): {Q3:,.2f}")
    print(f"IQR: {IQR:,.2f}")
    print(f"Lower bound: {lower_bound:,.2f}")
    print(f"Upper bound: {upper_bound:,.2f}")
    # Find outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    n_outliers = len(outliers)
    print(f"\nOutlier count: {n_outliers:,} ({n_outliers/len(df)*100:.2f}%)")
    return outliers, lower_bound, upper_bound
# Execute (using sale_price column from order_items)
outliers, lower_bound, upper_bound = detect_outliers_iqr(order_items, 'sale_price')

Execution Result:
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 29.99
Q3 (75th percentile): 74.00
IQR: 44.01
Lower bound: -36.02
Upper bound: 140.02

Outlier count: 8,521 (14.12%)
Testing with Data Containing Outliers
# Create test data with outliers
np.random.seed(42)
test_prices = pd.DataFrame({
'sale_price': np.concatenate([
np.random.normal(100, 30, 9500), # Normal data
np.random.uniform(300, 500, 300), # High outliers
np.random.uniform(-50, 0, 200) # Low outliers (negative prices)
])
})
outliers, lower_bound, upper_bound = detect_outliers_iqr(test_prices, 'sale_price')
print(f"\nOutlier sample (top 5):")
print(outliers.nlargest(5, 'sale_price'))

Execution Result:
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 80.12
Q3 (75th percentile): 120.45
IQR: 40.33
Lower bound: 19.63
Upper bound: 180.95
Outlier count: 756 (7.56%)
Outlier sample (top 5):
sale_price
9567 498.23
9612 495.87
9534 492.15
9589 489.34
9501        487.92

Visualizing Outliers
def visualize_outliers(df, column):
    """Visualize outliers with boxplot and histogram"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Boxplot
    axes[0].boxplot(df[column].dropna(), vert=False)
    axes[0].set_xlabel(column)
    axes[0].set_title(f'{column} Boxplot')
    # Histogram
    axes[1].hist(df[column].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[1].axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:.0f}')
    axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:.0f}')
    axes[1].set_xlabel(column)
    axes[1].set_ylabel('Frequency')
    axes[1].set_title(f'{column} Histogram')
    axes[1].legend()
    plt.tight_layout()
    plt.show()

visualize_outliers(test_prices, 'sale_price')

Execution Result:
[Graph Displayed]
Handling Outliers
# Outlier handling example (using test_prices data)
df_outlier = test_prices.copy()
# Calculate bounds
Q1 = df_outlier['sale_price'].quantile(0.25)
Q3 = df_outlier['sale_price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Outlier bounds: {lower_bound:.2f} ~ {upper_bound:.2f}")
# 1. Remove outliers
df_removed = df_outlier[(df_outlier['sale_price'] >= lower_bound) &
(df_outlier['sale_price'] <= upper_bound)]
print(f"\n1. Remove outliers: {len(df_outlier):,} rows -> {len(df_removed):,} rows")
# 2. Replace outliers with bounds (Winsorizing)
df_capped = df_outlier.copy()
df_capped['sale_price_capped'] = df_capped['sale_price'].clip(lower=lower_bound, upper=upper_bound)
print(f"2. Winsorizing applied")
print(f" Original max: {df_outlier['sale_price'].max():.2f}")
print(f" After clipping max: {df_capped['sale_price_capped'].max():.2f}")
# 3. Replace outliers with NaN then median
df_replaced = df_outlier.copy()
mask = (df_replaced['sale_price'] < lower_bound) | (df_replaced['sale_price'] > upper_bound)
median_val = df_replaced.loc[~mask, 'sale_price'].median()
df_replaced.loc[mask, 'sale_price'] = median_val
print(f"3. Median replacement: {mask.sum()} outliers replaced with {median_val:.2f}")실행 결과
Outlier bounds: 19.63 ~ 180.95 1. Remove outliers: 10,000 rows -> 9,244 rows 2. Winsorizing applied Original max: 498.23 After clipping max: 180.95 3. Median replacement: 756 outliers replaced with 100.12
Quiz 3: Outlier Detection
Problem
Select a numeric column and:
- Calculate outlier criteria using IQR method
- Output outlier count and ratio to total
- Visualize with boxplot
Answer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create test data with outliers
np.random.seed(42)
df_quiz = pd.DataFrame({
'quantity': np.concatenate([
np.random.poisson(5, 9000), # Normal data
np.random.randint(50, 100, 500), # Outliers
np.array([0] * 500) # Zero values
])
})
selected_col = 'quantity'
print(f"Outlier Detection for {selected_col} (IQR Method)")
print("=" * 60)
# 1. Calculate IQR
Q1 = df_quiz[selected_col].quantile(0.25)
Q3 = df_quiz[selected_col].quantile(0.75)
IQR = Q3 - Q1
# Outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1 (25th percentile): {Q1:,.2f}")
print(f"Q3 (75th percentile): {Q3:,.2f}")
print(f"IQR: {IQR:,.2f}")
print(f"Outlier lower bound: {lower_bound:,.2f}")
print(f"Outlier upper bound: {upper_bound:,.2f}")
# 2. Outlier count
outliers = df_quiz[(df_quiz[selected_col] < lower_bound) | (df_quiz[selected_col] > upper_bound)]
outlier_count = len(outliers)
outlier_pct = (outlier_count / len(df_quiz) * 100)
print(f"\nOutlier count: {outlier_count:,}")
print(f"Ratio to total: {outlier_pct:.2f}%")
# 3. Boxplot visualization
plt.figure(figsize=(10, 6))
plt.boxplot(df_quiz[selected_col].dropna(), vert=False)
plt.axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:,.0f}')
plt.axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:,.0f}')
plt.xlabel(selected_col)
plt.title(f'{selected_col} Boxplot (Outlier Check)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Execution Result:
Outlier Detection for quantity (IQR Method)
============================================================
Q1 (25th percentile): 3.00
Q3 (75th percentile): 7.00
IQR: 4.00
Outlier lower bound: -3.00
Outlier upper bound: 13.00

Outlier count: 512
Ratio to total: 5.12%
[Graph Displayed]
5. Data Cleaning Pipeline
Practical Example: Comprehensive Data Cleaning
import pandas as pd
import numpy as np
def clean_data_pipeline(df, verbose=True):
    """Data cleaning pipeline"""
    df_clean = df.copy()
    if verbose:
        print("Data Cleaning Pipeline Started")
        print("=" * 60)
        print(f"Original data: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")
    # 1. Remove duplicates
    before = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    after = len(df_clean)
    if verbose:
        print(f"\n[1] Duplicate removal: {before - after:,} rows removed")
    # 2. Handle missing values
    missing_before = df_clean.isnull().sum().sum()
    # Numeric -> median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())
    # Text -> 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_clean[col] = df_clean[col].fillna('Unknown')
    missing_after = df_clean.isnull().sum().sum()
    if verbose:
        print(f"[2] Missing value handling: {missing_before - missing_after:,} handled")
    # 3. Data type optimization
    for col in object_cols:
        if df_clean[col].nunique() / len(df_clean) < 0.5:
            df_clean[col] = df_clean[col].astype('category')
    if verbose:
        print("[3] Data type optimization completed")
        print(f"\nCleaning complete: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")
    return df_clean
# Execute (using orders data)
orders_clean = clean_data_pipeline(orders)

Execution Result:
Data Cleaning Pipeline Started
============================================================
Original data: 29,761 rows x 6 columns

[1] Duplicate removal: 0 rows removed
[2] Missing value handling: 2,832 handled
[3] Data type optimization completed

Cleaning complete: 29,761 rows x 6 columns
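The category conversion in step 3 mainly saves memory on low-cardinality text columns. One way to verify the effect (the exact numbers will vary with the data):

# Compare total memory before and after cleaning
mem_before = orders.memory_usage(deep=True).sum()
mem_after = orders_clean.memory_usage(deep=True).sum()
print(f"Before: {mem_before / 1024:,.0f} KB -> After: {mem_after / 1024:,.0f} KB")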
Summary
Key Functions Summary
| Task | Function | Example |
|---|---|---|
| Check missing values | df.isnull().sum() | Missing count by column |
| Missing value ratio | df.isnull().mean() * 100 | Missing ratio by column |
| Delete missing values | df.dropna() | Delete rows with missing values |
| Replace missing values | df.fillna(value) | Replace with specific value |
| Check duplicates | df.duplicated().sum() | Duplicate row count |
| Remove duplicates | df.drop_duplicates() | Remove duplicate rows |
| Outlier detection | df.quantile([0.25, 0.75]) | Calculate IQR |
| Value clipping | df['col'].clip(lower, upper) | Limit to bounds |
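When the defaults above are acceptable, the same functions chain into a one-pass cleanup; a sketch on a small hypothetical frame (column names and fill choices are illustrative):

df_raw = pd.DataFrame({
    'status': ['Complete', None, 'Pending', 'Pending'],
    'amount': [100.0, np.nan, 250.0, 250.0]   # last row duplicates the third
})
df_tidy = (
    df_raw
    .drop_duplicates()                                      # drop the repeated row
    .fillna({'status': 'Unknown',                           # text -> label
             'amount': df_raw['amount'].median()})          # numeric -> median
    .assign(amount=lambda d: d['amount'].clip(upper=200))   # cap extreme values
)
print(df_tidy)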
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| WHERE col IS NOT NULL | df[df['col'].notna()] |
| COALESCE(col, replacement) | df['col'].fillna(replacement) |
| SELECT DISTINCT * | df.drop_duplicates() |
| COUNT(*) minus the distinct row count | df.duplicated().sum() |
Next Steps
You've mastered data cleaning! Next, learn advanced filtering techniques, including query(), isin(), and between(), in Data Filtering.