
Data Cleaning and Preprocessing

Level: Beginner to Intermediate

Learning Objectives

After completing this recipe, you will be able to:

  • Detect and handle missing values
  • Check for and remove duplicate data
  • Detect outliers with the IQR method
  • Convert and optimize data types

0. Setup

Load CSV data for hands-on practice.

import pandas as pd
import numpy as np

# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')

print("Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
Execution Result
Data loaded!
 - orders: 29,761 rows
 - order_items: 60,350 rows

1. Checking Missing Values

Theory

Missing values are entries for which no value was recorded. Pandas represents them as NaN (Not a Number) or None.
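As a minimal illustration (the toy Series below are made up for the demo): None placed in a numeric column is coerced to NaN, and isna()/isnull() detects both markers.

import pandas as pd
import numpy as np

s_num = pd.Series([1.0, None, np.nan])   # None is coerced to NaN -> dtype float64
s_obj = pd.Series(['a', None, np.nan])   # object columns keep None as-is

print(s_num.isna().tolist())  # [False, True, True]
print(s_obj.isna().tolist())  # [False, True, True]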

Methods to Check Missing Values

import pandas as pd
import numpy as np

# Sample data
df_sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})

# 1. Check missing value count
print("Missing value count by column:")
print(df_sample.isnull().sum())

# 2. Check missing value ratio
print("\nMissing value ratio (%):")
print((df_sample.isnull().sum() / len(df_sample) * 100).round(2))

# 3. Check overall info
df_sample.info()
Execution Result
Missing value count by column:
user_id    0
name       1
age        1
city       1
dtype: int64

Missing value ratio (%):
user_id     0.0
name       20.0
age        20.0
city       20.0
dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
#   Column   Non-Null Count  Dtype
---  ------   --------------  -----
0   user_id  5 non-null      int64
1   name     4 non-null      object
2   age      4 non-null      float64
3   city     4 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 292.0+ bytes


Visualizing Missing Values

# Show only columns with missing values
missing = df_sample.isnull().sum()
missing_pct = (missing / len(df_sample) * 100).round(2)

missing_df = pd.DataFrame({
    'missing_count': missing,
    'missing_ratio(%)': missing_pct
})

# Filter only columns with missing values
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values('missing_count', ascending=False)

if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values!")
Execution Result
      missing_count  missing_ratio(%)
name              1              20.0
age               1              20.0
city              1              20.0

2. Handling Missing Values

Summary of Methods

Method      | Function      | Description                              | Best For
Delete      | dropna()      | Remove rows/columns with missing values  | When missing values are few
Replace     | fillna()      | Fill with a specific value               | When missing values are many
Interpolate | interpolate() | Estimate from surrounding values         | Time series data
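dropna() and fillna() are demonstrated in 2-1 and 2-2 below. interpolate() is not shown elsewhere in this recipe, so here is a minimal sketch on a made-up daily series (the index and values are assumptions for the demo):

# Linear interpolation estimates missing points from their neighbors
ts = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 20.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D')
)

# Linear by default; method='time' weights by actual timestamps for irregular series
print(ts.interpolate())  # fills the gaps as 12.0 and 14.0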

2-1. Deleting Missing Values (dropna)

# Recreate sample data
df_test = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['New York', 'Boston', 'Chicago', None, 'New York']
})

# Delete rows with any missing value
df_dropped = df_test.dropna()
print(f"Before: {len(df_test)} rows -> After: {len(df_dropped)} rows")

# Delete rows with missing values only in specific columns
df_dropped_name = df_test.dropna(subset=['name'])
print(f"After deleting based on name: {len(df_dropped_name)} rows")

# Delete only rows where all values are missing
df_dropped_all = df_test.dropna(how='all')
print(f"After how='all': {len(df_dropped_all)} rows")
Execution Result
Before: 5 rows -> After: 2 rows
After deleting based on name: 4 rows
After how='all': 5 rows

2-2. Replacing Missing Values (fillna)

# Recreate sample data
df_fill = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['John', None, 'Mike', 'Sarah', 'Emily'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['New York', 'Boston', 'Chicago', None, 'New York'],
    'value': [100, np.nan, 150, np.nan, 200]
})

# 1. Replace with a specific value
df_fill['city'] = df_fill['city'].fillna('Unknown')
print("After city replacement:")
print(df_fill['city'].tolist())

# 2. Replace with the mean (numeric)
df_fill['age'] = df_fill['age'].fillna(df_fill['age'].mean())
print(f"\nAfter age mean replacement: {df_fill['age'].tolist()}")

# 3. Forward fill (fill with the previous value)
df_fill['value'] = df_fill['value'].ffill()
print(f"After value ffill: {df_fill['value'].tolist()}")

# 4. Backward fill (fill with the next value)
df_test2 = pd.DataFrame({'v': [np.nan, 10, np.nan, 30, np.nan]})
df_test2['v'] = df_test2['v'].bfill()
print(f"bfill result: {df_test2['v'].tolist()}")
Execution Result
After city replacement:
['New York', 'Boston', 'Chicago', 'Unknown', 'New York']

After age mean replacement: [25.0, 30.0, 29.5, 28.0, 35.0]
After value ffill: [100.0, 100.0, 150.0, 150.0, 200.0]
bfill result: [10.0, 10.0, 30.0, 30.0, nan]

Practical: Auto-handling by Data Type

def handle_missing_values(df):
    """Auto-handle missing values based on data type"""
    df_clean = df.copy()

    # Numeric columns: replace with the median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df_clean[col].isnull().sum() > 0:
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            print(f"[OK] {col}: replaced with median ({median_val:.2f})")

    # Text columns: replace with 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        if df_clean[col].isnull().sum() > 0:
            df_clean[col] = df_clean[col].fillna('Unknown')
            print(f"[OK] {col}: replaced with 'Unknown'")

    return df_clean

# Apply (using orders data)
print("Missing value status:")
print(orders.isnull().sum()[orders.isnull().sum() > 0])
print()

orders_clean = handle_missing_values(orders)
print(f"\nMissing values after processing: {orders_clean.isnull().sum().sum()}")
Execution Result
Missing value status:
status         743
num_of_item    2089
dtype: int64

[OK] num_of_item: replaced with median (2.00)
[OK] status: replaced with 'Unknown'

Missing values after processing: 0

Quiz 1: Missing Value Handling

Problem

Given the order data:

  1. Check the missing value count and ratio for each column
  2. Replace missing values in numeric columns with the median and in text columns with 'Unconfirmed'
  3. Compare the missing value counts before and after processing

View Answer

import pandas as pd
import numpy as np

# Create sample data with missing values
np.random.seed(42)
df_quiz = pd.DataFrame({
    'order_id': range(1, 1001),
    'user_id': np.random.randint(1, 100, 1000),
    'status': np.random.choice(['Complete', 'Pending', None], 1000, p=[0.7, 0.2, 0.1]),
    'amount': np.random.choice([100, 200, 300, np.nan], 1000, p=[0.4, 0.3, 0.2, 0.1])
})

# 1. Missing value status
print("=== Missing Value Status ===")
missing = df_quiz.isnull().sum()
missing_pct = (missing / len(df_quiz) * 100).round(2)
for col in df_quiz.columns:
    if missing[col] > 0:
        print(f"{col}: {missing[col]} ({missing_pct[col]}%)")

# 2. Handle missing values
print("\n=== Handling Missing Values ===")
before = df_quiz.isnull().sum().sum()

# Numeric -> median
for col in df_quiz.select_dtypes(include=[np.number]).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna(df_quiz[col].median())

# Text -> 'Unconfirmed'
for col in df_quiz.select_dtypes(include=['object']).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna('Unconfirmed')

after = df_quiz.isnull().sum().sum()

# 3. Compare results
print(f"Before: {before}")
print(f"After: {after}")
print(f"Handled: {before - after}")
Execution Result
=== Missing Value Status ===
status: 100 (10.0%)
amount: 100 (10.0%)

=== Handling Missing Values ===
Before: 200
After: 0
Handled: 200

3. Handling Duplicate Data

Theory

Duplicate data occurs when the same row appears more than once. Always check for it, since duplicates inflate counts and distort analysis results, as the small example below shows.
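A made-up example of the distortion: if the same order row is loaded twice, a revenue sum is silently inflated until the duplicate is dropped.

# Duplicates inflate aggregates
df_dup = pd.DataFrame({
    'order_id': [1, 1, 2],   # order 1 appears twice
    'amount': [100, 100, 250]
})

print(df_dup['amount'].sum())                    # 450 -> inflated by the duplicate
print(df_dup.drop_duplicates()['amount'].sum())  # 350 -> correct total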

Checking and Removing Duplicates

# Check duplicates in order_items
print("=== order_items Duplicate Check ===")

# 1. Check completely identical rows
n_duplicates = order_items.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")

# 2. View duplicated rows (first 5)
print("\nDuplicated rows (sample):")
duplicates = order_items[order_items.duplicated(keep=False)]
print(duplicates.head())

# 3. Remove duplicates (keep the first row)
order_items_clean = order_items.drop_duplicates(keep='first')
print(f"\nBefore removal: {len(order_items):,} rows")
print(f"After removal: {len(order_items_clean):,} rows")
Execution Result
=== order_items Duplicate Check ===
Duplicate row count: 0

Duplicated rows (sample):
Empty DataFrame
Columns: [id, order_id, product_id, inventory_item_id, sale_price, status, created_at, shipped_at, delivered_at, returned_at]
Index: []

Before removal: 60,350 rows
After removal: 60,350 rows

Duplicate Removal Options

Option       | Description
keep='first' | Keep the first occurrence (default)
keep='last'  | Keep the last occurrence
keep=False   | Drop all duplicated rows
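A minimal sketch of how each option behaves on a three-row toy frame (the frame is an assumption for the demo):

df_keep = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(len(df_keep.drop_duplicates(keep='first')))  # 2 -> first (1, 'x') kept
print(len(df_keep.drop_duplicates(keep='last')))   # 2 -> last (1, 'x') kept
print(len(df_keep.drop_duplicates(keep=False)))    # 1 -> both (1, 'x') rows dropped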

Practical: Duplicate Data Analysis

# Detailed duplicate data analysis
def analyze_duplicates(df, subset=None):
    """Detailed duplicate data analysis"""
    print("Duplicate Data Analysis")
    print("=" * 60)

    # Total duplicates
    n_dup = df.duplicated(subset=subset).sum()
    pct_dup = (n_dup / len(df) * 100)
    print(f"Total rows: {len(df):,}")
    print(f"Duplicated rows: {n_dup:,} ({pct_dup:.2f}%)")

    # Removal simulation
    df_dedup = df.drop_duplicates(subset=subset, keep='first')
    n_after = len(df_dedup)
    print(f"After removal: {n_after:,}")
    print(f"Rows to be removed: {len(df) - n_after:,}")

    return df_dedup

# Execute
order_items_clean = analyze_duplicates(order_items)
Execution Result
Duplicate Data Analysis
============================================================
Total rows: 60,350
Duplicated rows: 0 (0.00%)
After removal: 60,350
Rows to be removed: 0

Quiz 2: Duplicate Data Handling

Problem

  1. Check how many completely duplicated rows exist in the data
  2. Remove duplicated rows (keep first row only)
  3. Compare and output data size before and after removal

View Answer

import pandas as pd
import numpy as np

# Create test data with duplicates
np.random.seed(42)
df_test = pd.DataFrame({
    'id': np.random.randint(1, 100, 1000),
    'value': np.random.randint(1, 50, 1000)
})

# Intentionally add duplicates
df_test = pd.concat([df_test, df_test.sample(50)], ignore_index=True)

print("Duplicate Data Handling")
print("=" * 60)

# 1. Check duplicate row count
n_duplicates = df_test.duplicated().sum()
print(f"Duplicate row count: {n_duplicates:,}")

# Size before removal
original_size = df_test.shape[0]
print(f"Data size before removal: {original_size:,} rows")

# 2. Remove duplicates (keep the first row only)
df_no_dup = df_test.drop_duplicates(keep='first')

# 3. Size after removal
new_size = df_no_dup.shape[0]
print(f"Data size after removal: {new_size:,} rows")

# Comparison
removed = original_size - new_size
removed_pct = (removed / original_size * 100)
print(f"\nRows removed: {removed:,}")
print(f"Removal ratio: {removed_pct:.2f}%")
Execution Result
Duplicate Data Handling
============================================================
Duplicate row count: 89
Data size before removal: 1,050 rows
Data size after removal: 961 rows

Rows removed: 89
Removal ratio: 8.48%

4. Outlier Detection (IQR Method)

Theory

Outliers are extreme values that differ significantly from the rest of the data. The IQR (Interquartile Range) method is one of the most widely used ways to detect them.

IQR Formula:

  • Q1 = 25th percentile (the value below which 25% of the data falls)
  • Q3 = 75th percentile (the value below which 75% of the data falls)
  • IQR = Q3 - Q1
  • Lower bound: Q1 - 1.5 × IQR
  • Upper bound: Q3 + 1.5 × IQR
  • Any value below the lower bound or above the upper bound is flagged as an outlier
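To make the formula concrete before the full function below, here is a hand-checkable sketch on a small made-up series (pandas' default linear quantile interpolation gives Q1 = 13.0 and Q3 = 19.0 here):

s = pd.Series([10, 12, 14, 16, 18, 20, 200])

Q1 = s.quantile(0.25)    # 13.0
Q3 = s.quantile(0.75)    # 19.0
IQR = Q3 - Q1            # 6.0

lower = Q1 - 1.5 * IQR   # 4.0
upper = Q3 + 1.5 * IQR   # 28.0

print(s[(s < lower) | (s > upper)].tolist())  # [200] -> the only flagged outlier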

Outlier Detection Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method"""
    print(f"Outlier Detection for {column} (IQR Method)")
    print("=" * 60)

    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"Q1 (25th percentile): {Q1:,.2f}")
    print(f"Q3 (75th percentile): {Q3:,.2f}")
    print(f"IQR: {IQR:,.2f}")
    print(f"Lower bound: {lower_bound:,.2f}")
    print(f"Upper bound: {upper_bound:,.2f}")

    # Find outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    n_outliers = len(outliers)
    print(f"\nOutlier count: {n_outliers:,} ({n_outliers/len(df)*100:.2f}%)")

    return outliers, lower_bound, upper_bound

# Execute (using the sale_price column from order_items)
outliers, lower_bound, upper_bound = detect_outliers_iqr(order_items, 'sale_price')
Execution Result
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 29.99
Q3 (75th percentile): 74.00
IQR: 44.01
Lower bound: -36.02
Upper bound: 140.02

Outlier count: 8,521 (14.12%)

Testing with Data Containing Outliers

# Create test data with outliers
np.random.seed(42)
test_prices = pd.DataFrame({
    'sale_price': np.concatenate([
        np.random.normal(100, 30, 9500),   # Normal data
        np.random.uniform(300, 500, 300),  # High outliers
        np.random.uniform(-50, 0, 200)     # Low outliers (negative prices)
    ])
})

outliers, lower_bound, upper_bound = detect_outliers_iqr(test_prices, 'sale_price')

print("\nOutlier sample (top 5):")
print(outliers.nlargest(5, 'sale_price'))
Execution Result
Outlier Detection for sale_price (IQR Method)
============================================================
Q1 (25th percentile): 80.12
Q3 (75th percentile): 120.45
IQR: 40.33
Lower bound: 19.63
Upper bound: 180.95

Outlier count: 756 (7.56%)

Outlier sample (top 5):
    sale_price
9567      498.23
9612      495.87
9534      492.15
9589      489.34
9501      487.92

Visualizing Outliers

def visualize_outliers(df, column):
    """Visualize outliers with a boxplot and histogram"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Boxplot
    axes[0].boxplot(df[column].dropna(), vert=False)
    axes[0].set_xlabel(column)
    axes[0].set_title(f'{column} Boxplot')

    # Histogram
    axes[1].hist(df[column].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[1].axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:.0f}')
    axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:.0f}')
    axes[1].set_xlabel(column)
    axes[1].set_ylabel('Frequency')
    axes[1].set_title(f'{column} Histogram')
    axes[1].legend()

    plt.tight_layout()
    plt.show()

visualize_outliers(test_prices, 'sale_price')
Execution Result
[Graph Displayed]

Handling Outliers

# Outlier handling example (using test_prices data)
df_outlier = test_prices.copy()

# Calculate bounds
Q1 = df_outlier['sale_price'].quantile(0.25)
Q3 = df_outlier['sale_price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Outlier bounds: {lower_bound:.2f} ~ {upper_bound:.2f}")

# 1. Remove outliers
df_removed = df_outlier[(df_outlier['sale_price'] >= lower_bound) &
                        (df_outlier['sale_price'] <= upper_bound)]
print(f"\n1. Remove outliers: {len(df_outlier):,} rows -> {len(df_removed):,} rows")

# 2. Replace outliers with the bounds (Winsorizing)
df_capped = df_outlier.copy()
df_capped['sale_price_capped'] = df_capped['sale_price'].clip(lower=lower_bound, upper=upper_bound)
print("2. Winsorizing applied")
print(f"   Original max: {df_outlier['sale_price'].max():.2f}")
print(f"   After clipping max: {df_capped['sale_price_capped'].max():.2f}")

# 3. Replace outliers with the median of the non-outlier values
df_replaced = df_outlier.copy()
mask = (df_replaced['sale_price'] < lower_bound) | (df_replaced['sale_price'] > upper_bound)
median_val = df_replaced.loc[~mask, 'sale_price'].median()
df_replaced.loc[mask, 'sale_price'] = median_val
print(f"3. Median replacement: {mask.sum()} outliers replaced with {median_val:.2f}")
Execution Result
Outlier bounds: 19.63 ~ 180.95

1. Remove outliers: 10,000 rows -> 9,244 rows
2. Winsorizing applied
 Original max: 498.23
 After clipping max: 180.95
3. Median replacement: 756 outliers replaced with 100.12

Quiz 3: Outlier Detection

Problem

Select a numeric column and:

  1. Calculate outlier criteria using IQR method
  2. Output outlier count and ratio to total
  3. Visualize with boxplot

View Answer

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create test data with outliers
np.random.seed(42)
df_quiz = pd.DataFrame({
    'quantity': np.concatenate([
        np.random.poisson(5, 9000),       # Normal data
        np.random.randint(50, 100, 500),  # Outliers
        np.array([0] * 500)               # Zero values
    ])
})

selected_col = 'quantity'

print(f"Outlier Detection for {selected_col} (IQR Method)")
print("=" * 60)

# 1. Calculate IQR
Q1 = df_quiz[selected_col].quantile(0.25)
Q3 = df_quiz[selected_col].quantile(0.75)
IQR = Q3 - Q1

# Outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1 (25th percentile): {Q1:,.2f}")
print(f"Q3 (75th percentile): {Q3:,.2f}")
print(f"IQR: {IQR:,.2f}")
print(f"Outlier lower bound: {lower_bound:,.2f}")
print(f"Outlier upper bound: {upper_bound:,.2f}")

# 2. Outlier count
outliers = df_quiz[(df_quiz[selected_col] < lower_bound) |
                   (df_quiz[selected_col] > upper_bound)]
outlier_count = len(outliers)
outlier_pct = (outlier_count / len(df_quiz) * 100)
print(f"\nOutlier count: {outlier_count:,}")
print(f"Ratio to total: {outlier_pct:.2f}%")

# 3. Boxplot visualization
plt.figure(figsize=(10, 6))
plt.boxplot(df_quiz[selected_col].dropna(), vert=False)
plt.axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:,.0f}')
plt.axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:,.0f}')
plt.xlabel(selected_col)
plt.title(f'{selected_col} Boxplot (Outlier Check)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Execution Result
Outlier Detection for quantity (IQR Method)
============================================================
Q1 (25th percentile): 3.00
Q3 (75th percentile): 7.00
IQR: 4.00
Outlier lower bound: -3.00
Outlier upper bound: 13.00

Outlier count: 512
Ratio to total: 5.12%
[Graph Displayed]

5. Data Cleaning Pipeline

Practical Example: Comprehensive Data Cleaning

import pandas as pd
import numpy as np

def clean_data_pipeline(df, verbose=True):
    """Data cleaning pipeline"""
    df_clean = df.copy()

    if verbose:
        print("Data Cleaning Pipeline Started")
        print("=" * 60)
        print(f"Original data: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")

    # 1. Remove duplicates
    before = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    after = len(df_clean)
    if verbose:
        print(f"\n[1] Duplicate removal: {before - after:,} rows removed")

    # 2. Handle missing values
    missing_before = df_clean.isnull().sum().sum()

    # Numeric -> median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

    # Text -> 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_clean[col] = df_clean[col].fillna('Unknown')

    missing_after = df_clean.isnull().sum().sum()
    if verbose:
        print(f"[2] Missing value handling: {missing_before - missing_after:,} handled")

    # 3. Data type optimization
    for col in object_cols:
        if df_clean[col].nunique() / len(df_clean) < 0.5:
            df_clean[col] = df_clean[col].astype('category')
    if verbose:
        print("[3] Data type optimization completed")
        print(f"\nCleaning complete: {df_clean.shape[0]:,} rows x {df_clean.shape[1]} columns")

    return df_clean

# Execute (using orders data)
orders_clean = clean_data_pipeline(orders)
Execution Result
Data Cleaning Pipeline Started
============================================================
Original data: 29,761 rows x 6 columns

[1] Duplicate removal: 0 rows removed
[2] Missing value handling: 2,832 handled
[3] Data type optimization completed

Cleaning complete: 29,761 rows x 6 columns
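Step [3] of the pipeline converts low-cardinality text columns to the memory-efficient category dtype. For reference, here is a minimal sketch of the common dtype conversions used during cleaning (the column values are made up for the demo):

df_types = pd.DataFrame({
    'status': ['Complete', 'Pending', 'Complete'],
    'amount': ['100', '200', 'N/A'],
    'created_at': ['2024-01-01', '2024-01-02', '2024-01-03']
})

df_types['status'] = df_types['status'].astype('category')               # low-cardinality text
df_types['amount'] = pd.to_numeric(df_types['amount'], errors='coerce')  # unparseable 'N/A' -> NaN
df_types['created_at'] = pd.to_datetime(df_types['created_at'])          # strings -> datetime64

print(df_types.dtypes)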

Summary

Key Functions Summary

Task                   | Function                     | Description
Check missing values   | df.isnull().sum()            | Missing count by column
Missing value ratio    | df.isnull().mean() * 100     | Missing ratio by column
Delete missing values  | df.dropna()                  | Delete rows with missing values
Replace missing values | df.fillna(value)             | Replace with a specific value
Check duplicates       | df.duplicated().sum()        | Duplicate row count
Remove duplicates      | df.drop_duplicates()         | Remove duplicate rows
Outlier detection      | df.quantile([0.25, 0.75])    | Calculate Q1/Q3 for the IQR
Value clipping         | df['col'].clip(lower, upper) | Limit values to the bounds

SQL to Pandas Comparison

SQL                               | Pandas
WHERE col IS NOT NULL             | df[df['col'].notna()]
COALESCE(col, replacement)        | df['col'].fillna(replacement)
SELECT DISTINCT *                 | df.drop_duplicates()
COUNT(*) minus distinct row count | df.duplicated().sum()
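A small sketch of these equivalences on a made-up frame (the frame and the 'fallback' value are assumptions for the demo):

df_sql = pd.DataFrame({'col': ['a', None, 'a', 'b']})

print(df_sql[df_sql['col'].notna()])     # WHERE col IS NOT NULL
print(df_sql['col'].fillna('fallback'))  # COALESCE(col, 'fallback')
print(df_sql.drop_duplicates())          # SELECT DISTINCT *
print(df_sql.duplicated().sum())         # duplicate row count -> 1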

Next Steps

You’ve mastered data cleaning! Next, learn advanced filtering techniques, including query(), isin(), and between(), in Data Filtering.
