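Data Cleaning and Preprocessing

Beginner to Intermediate

Learning Objectives

After completing this recipe, you will be able to:

Detect and handle missing values
Check for and remove duplicate data
Detect outliers with the IQR method
Convert and normalize data types

0. Setup

For the hands-on exercises, load the data from CSV files.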


import pandas as pd
import numpy as np
 
# Load the data
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
print("✅ Data loaded!")
print(f"   - orders: {len(orders):,} rows")
print(f"   - order_items: {len(order_items):,} rows")

Output

✅ Data loaded!
   - orders: 29,761 rows
   - order_items: 60,350 rows

1. Checking for Missing Values

Theory

A missing value is an entry for which no value was recorded. In pandas, missing values appear as NaN (Not a Number) or None.
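
To make the NaN/None distinction concrete, here is a minimal sketch. The two Series below are hypothetical data made up for illustration; the point is that `isna()` (of which `isnull()` is an alias) flags both markers, and that None placed in a numeric Series is stored as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: None in a text column, NaN and None in a numeric column
names = pd.Series(['철수', None, '민수'])
ages = pd.Series([25, np.nan, None])  # None is stored as NaN in a numeric Series

print(names.isna().tolist())  # [False, True, False]
print(ages.isna().tolist())   # [False, True, True]
print(ages.dtype)             # float64
```

Because a single NaN forces an integer column to float64, a column such as `age` may show up as float after loading even if every recorded value is a whole number.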

How to Check for Missing Values


import pandas as pd
import numpy as np
 
# Sample data
df_sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['철수', None, '민수', '지영', '영희'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['서울', '부산', '대구', None, '서울']
})
 
# 1. Count missing values per column
print("Missing values per column:")
print(df_sample.isnull().sum())
 
# 2. Percentage of missing values per column
print("\nMissing values per column (%):")
print((df_sample.isnull().sum() / len(df_sample) * 100).round(2))
 
# 3. Overall summary (non-null counts and dtypes)
df_sample.info()

Output

Missing values per column:
user_id    0
name       1
age        1
city       1
dtype: int64

Missing values per column (%):
user_id     0.0
name       20.0
age        20.0
city       20.0
dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_id  5 non-null      int64
 1   name     4 non-null      object
 2   age      4 non-null      float64
 3   city     4 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 292.0+ bytes

Summarizing Missing Values


# Build a summary table of missing values
missing = df_sample.isnull().sum()
missing_pct = (missing / len(df_sample) * 100).round(2)
 
missing_df = pd.DataFrame({
    'missing_count': missing,
    'missing_pct(%)': missing_pct
})
 
# Keep only the columns that have missing values
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values('missing_count', ascending=False)
 
if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values!")

Output

      missing_count  missing_pct(%)
name              1            20.0
age               1            20.0
city              1            20.0

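2. Handling Missing Values

Summary of Methods

Method	Function	Description	Best suited for
Drop	`dropna()`	Remove rows/columns that contain missing values	Few missing values
Fill	`fillna()`	Fill with a specified value	Many missing values
Interpolate	`interpolate()`	Estimate from neighboring values	Time-series data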
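
The table lists `interpolate()`, which the sections below do not demonstrate, so here is a minimal sketch of it. The daily readings are hypothetical, and linear interpolation is only one of the methods `interpolate()` supports; treat this as an illustration rather than a recommended default.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps on days 2, 4 and 5
s = pd.Series(
    [10.0, np.nan, 30.0, np.nan, np.nan, 60.0],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

# Linear interpolation draws a straight line between the known neighbors
filled = s.interpolate(method='linear')
print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
```

Unlike `ffill()`, which would repeat 30.0 across both missing days, interpolation spreads the change evenly across the gap, which tends to suit smoothly varying time series.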

2-1. Dropping Missing Values (dropna)


# Recreate the sample data
df_test = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['철수', None, '민수', '지영', '영희'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['서울', '부산', '대구', None, '서울']
})
 
# Drop rows that contain any missing value
df_dropped = df_test.dropna()
print(f"Before: {len(df_test)} rows → after: {len(df_dropped)} rows")
 
# Drop only rows where a specific column is missing
df_dropped_name = df_test.dropna(subset=['name'])
print(f"After dropping on name: {len(df_dropped_name)} rows")
 
# Drop only rows in which every value is missing
df_dropped_all = df_test.dropna(how='all')
print(f"After how='all': {len(df_dropped_all)} rows")

Output

Before: 5 rows → after: 2 rows
After dropping on name: 4 rows
After how='all': 5 rows

2-2. Filling Missing Values (fillna)


# Recreate the sample data
df_fill = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['철수', None, '민수', '지영', '영희'],
    'age': [25, 30, np.nan, 28, 35],
    'city': ['서울', '부산', '대구', None, '서울'],
    'value': [100, np.nan, 150, np.nan, 200]
})
 
# 1. Fill with a fixed value
df_fill['city'] = df_fill['city'].fillna('Unknown')
print("city after filling:")
print(df_fill['city'].tolist())
 
# 2. Fill with the mean (numeric column)
df_fill['age'] = df_fill['age'].fillna(df_fill['age'].mean())
print(f"\nage after mean fill: {df_fill['age'].tolist()}")
 
# 3. Fill with the previous value (forward fill)
df_fill['value'] = df_fill['value'].ffill()
print(f"value after ffill: {df_fill['value'].tolist()}")
 
# 4. Fill with the next value (backward fill)
#    A trailing NaN stays NaN because there is no later value to copy
df_test2 = pd.DataFrame({'v': [np.nan, 10, np.nan, 30, np.nan]})
df_test2['v'] = df_test2['v'].bfill()
print(f"bfill result: {df_test2['v'].tolist()}")

Output

city after filling:
['서울', '부산', '대구', 'Unknown', '서울']

age after mean fill: [25.0, 30.0, 29.5, 28.0, 35.0]
value after ffill: [100.0, 100.0, 150.0, 150.0, 200.0]
bfill result: [10.0, 10.0, 30.0, 30.0, nan]

In Practice: Automatic Handling by Data Type


def handle_missing_values(df):
    """Fill missing values automatically based on each column's dtype."""
    df_clean = df.copy()
 
    # Numeric columns: fill with the median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df_clean[col].isnull().sum() > 0:
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            print(f"✅ {col}: filled with median ({median_val:.2f})")
 
    # Text columns: fill with 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        if df_clean[col].isnull().sum() > 0:
            df_clean[col] = df_clean[col].fillna('Unknown')
            print(f"✅ {col}: filled with 'Unknown'")
 
    return df_clean
 
# Apply (using the orders data)
print("Missing values:")
print(orders.isnull().sum()[orders.isnull().sum() > 0])
print()
 
orders_clean = handle_missing_values(orders)
print(f"\nMissing values after handling: {orders_clean.isnull().sum().sum()}")

Output

Missing values:
status          743
num_of_item    2089
dtype: int64

✅ num_of_item: filled with median (2.00)
✅ status: filled with 'Unknown'

Missing values after handling: 0

Quiz 1: Handling Missing Values

Problem

Using the order data:

Check the count and percentage of missing values in each column
Fill numeric columns with the median and string columns with 'Unverified'
Compare the number of missing values before and after handling

Answer


import pandas as pd
import numpy as np
 
# Create sample data that contains missing values
np.random.seed(42)
df_quiz = pd.DataFrame({
    'order_id': range(1, 1001),
    'user_id': np.random.randint(1, 100, 1000),
    'status': np.random.choice(['Complete', 'Pending', None], 1000, p=[0.7, 0.2, 0.1]),
    'amount': np.random.choice([100, 200, 300, np.nan], 1000, p=[0.4, 0.3, 0.2, 0.1])
})
 
# 1. Missing-value summary
print("=== Missing values ===")
missing = df_quiz.isnull().sum()
missing_pct = (missing / len(df_quiz) * 100).round(2)
 
for col in df_quiz.columns:
    if missing[col] > 0:
        print(f"{col}: {missing[col]} ({missing_pct[col]}%)")
 
# 2. Handle missing values
print("\n=== Handling missing values ===")
before = df_quiz.isnull().sum().sum()
 
# Numeric → median
for col in df_quiz.select_dtypes(include=[np.number]).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna(df_quiz[col].median())
 
# String → 'Unverified'
for col in df_quiz.select_dtypes(include=['object']).columns:
    if df_quiz[col].isnull().sum() > 0:
        df_quiz[col] = df_quiz[col].fillna('Unverified')
 
after = df_quiz.isnull().sum().sum()
 
# 3. Compare the results
print(f"Before: {before}")
print(f"After: {after}")
print(f"Missing values handled: {before - after}")

Output

=== Missing values ===
status: 100 (10.0%)
amount: 100 (10.0%)

=== Handling missing values ===
Before: 200
After: 0
Missing values handled: 200

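3. Handling Duplicate Data

Theory

Duplicate data means the same row appears more than once. Because duplicates can distort analysis results, always check for them.

Checking for and Removing Duplicates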


# Check order_items for duplicates
print("=== order_items duplicate check ===")
 
# 1. Count fully identical rows
n_duplicates = order_items.duplicated().sum()
print(f"Duplicate rows: {n_duplicates:,}")
 
# 2. View the duplicated rows (first 5 only)
print("\nDuplicate rows (sample):")
duplicates = order_items[order_items.duplicated(keep=False)]
print(duplicates.head())
 
# 3. Remove duplicates (keep only the first occurrence)
order_items_clean = order_items.drop_duplicates(keep='first')
print(f"\nBefore: {len(order_items):,} rows")
print(f"After: {len(order_items_clean):,} rows")

Output

=== order_items duplicate check ===
Duplicate rows: 0

Duplicate rows (sample):
Empty DataFrame
Columns: [id, order_id, product_id, inventory_item_id, sale_price, status, created_at, shipped_at, delivered_at, returned_at]
Index: []

Before: 60,350 rows
After: 60,350 rows

Duplicate Removal Options

Option	Description
`keep='first'`	Keep the first occurrence (default)
`keep='last'`	Keep the last occurrence
`keep=False`	Drop every row that has a duplicate
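
Since `order_items` contains no duplicates, the effect of each `keep` option is easier to see on a tiny frame. The sketch below uses hypothetical order lines invented for illustration, and also shows the `subset` argument for deduplicating on a key column only.

```python
import pandas as pd

# Hypothetical order lines: rows 0 and 1 are exact duplicates,
# and row 3 repeats order_id 2 with a different price
df = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 3],
    'price':    [10, 10, 20, 25, 30],
})

first = df.drop_duplicates(keep='first')   # drops row 1
last = df.drop_duplicates(keep='last')     # drops row 0
none = df.drop_duplicates(keep=False)      # drops rows 0 and 1
by_key = df.drop_duplicates(subset=['order_id'])  # one row per order_id

print(first.index.tolist())   # [0, 2, 3, 4]
print(last.index.tolist())    # [1, 2, 3, 4]
print(none.index.tolist())    # [2, 3, 4]
print(by_key.index.tolist())  # [0, 2, 4]
```

Note that `subset=['order_id']` discards row 3 even though its price differs, so deduplicating on a key is only safe when the remaining columns are expected to agree.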

In Practice: Duplicate Analysis


# Detailed duplicate analysis
def analyze_duplicates(df, subset=None):
    """Print duplicate statistics and return the deduplicated DataFrame."""
    print("🔍 Duplicate analysis")
    print("=" * 60)
 
    # Duplicates across all columns (or only `subset`, if given)
    n_dup = df.duplicated(subset=subset).sum()
    pct_dup = (n_dup / len(df) * 100)
 
    print(f"✓ Total rows: {len(df):,}")
    print(f"✓ Duplicate rows: {n_dup:,} ({pct_dup:.2f}%)")
 
    # Simulate the removal
    df_dedup = df.drop_duplicates(subset=subset, keep='first')
    n_after = len(df_dedup)
 
    print(f"✓ Rows after deduplication: {n_after:,}")
    print(f"✓ Rows to be removed: {len(df) - n_after:,}")
 
    return df_dedup
 
# Run
order_items_clean = analyze_duplicates(order_items)

Output

🔍 Duplicate analysis
============================================================
✓ Total rows: 60,350
✓ Duplicate rows: 0 (0.00%)
✓ Rows after deduplication: 60,350
✓ Rows to be removed: 0

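Quiz 2: Handling Duplicate Data

Problem

Check how many fully duplicated rows exist in the dataset
Remove the duplicate rows (keep only the first occurrence)
Compare and print the data size before and after removal

Answer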


import pandas as pd
import numpy as np
 
# Create test data that contains duplicates
np.random.seed(42)
df_test = pd.DataFrame({
    'id': np.random.randint(1, 100, 1000),
    'value': np.random.randint(1, 50, 1000)
})
# Add intentional duplicates
df_test = pd.concat([df_test, df_test.sample(50)], ignore_index=True)
 
print("🔍 Duplicate handling")
print("=" * 60)
 
# 1. Count duplicate rows
n_duplicates = df_test.duplicated().sum()
print(f"✓ Duplicate rows: {n_duplicates:,}")
 
# Size before removal
original_size = df_test.shape[0]
print(f"✓ Size before removal: {original_size:,} rows")
 
# 2. Remove duplicates (keep only the first occurrence)
df_no_dup = df_test.drop_duplicates(keep='first')
 
# 3. Size after removal
new_size = df_no_dup.shape[0]
print(f"✓ Size after removal: {new_size:,} rows")
 
# Compare
removed = original_size - new_size
removed_pct = (removed / original_size * 100)
 
print(f"\nRows removed: {removed:,}")
print(f"Removal rate: {removed_pct:.2f}%")

Output

🔍 Duplicate handling
============================================================
✓ Duplicate rows: 89
✓ Size before removal: 1,050 rows
✓ Size after removal: 961 rows

Rows removed: 89
Removal rate: 8.48%

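4. Outlier Detection (IQR Method)

Theory

An outlier is an extreme value that differs markedly from the rest of the data. The IQR (interquartile range) method is one of the most widely used ways to detect outliers.

IQR formulas:

Q1 = 25th percentile (bottom 25%)
Q3 = 75th percentile (top 25%)
IQR = Q3 - Q1
Lower bound: Q1 - 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR
Below the lower bound or above the upper bound → outlier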
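
The formulas above can be checked by hand on a small example. The eight values below are hypothetical, chosen so the arithmetic is easy to follow; `quantile()` is used with its default linear interpolation.

```python
import pandas as pd

# Hypothetical values: seven ordinary readings and one extreme value
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 100])

q1 = values.quantile(0.25)   # 13.5 (interpolated between 12 and 14)
q3 = values.quantile(0.75)   # 20.5 (interpolated between 20 and 22)
iqr = q3 - q1                # 7.0

lower = q1 - 1.5 * iqr       # 13.5 - 10.5 = 3.0
upper = q3 + 1.5 * iqr       # 20.5 + 10.5 = 31.0

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())     # [100]
```

Only 100 falls outside the 3.0 to 31.0 range. Because the bounds are built from quartiles, the extreme value itself barely moves them, which is the main reason this rule is more robust than one based on the mean and standard deviation.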

Outlier Detection Code


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
def detect_outliers_iqr(df, column):
    """Detect outliers in `column` using the IQR method."""
    print(f"🔍 {column} outlier detection (IQR method)")
    print("=" * 60)
 
    # Compute the IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
 
    # Outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
 
    print(f"✓ Q1 (25th percentile): {Q1:,.2f}")
    print(f"✓ Q3 (75th percentile): {Q3:,.2f}")
    print(f"✓ IQR: {IQR:,.2f}")
    print(f"✓ Lower bound: {lower_bound:,.2f}")
    print(f"✓ Upper bound: {upper_bound:,.2f}")
 
    # Find the outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    n_outliers = len(outliers)
 
    print(f"\nOutliers: {n_outliers:,} ({n_outliers/len(df)*100:.2f}%)")
 
    return outliers, lower_bound, upper_bound
 
# Run (using the sale_price column of order_items)
outliers, lower_bound, upper_bound = detect_outliers_iqr(order_items, 'sale_price')

Output

🔍 sale_price outlier detection (IQR method)
============================================================
✓ Q1 (25th percentile): 29.99
✓ Q3 (75th percentile): 74.00
✓ IQR: 44.01
✓ Lower bound: -36.02
✓ Upper bound: 140.02

Outliers: 8,521 (14.12%)

Testing with Data That Contains Outliers


# Create test data that contains outliers
np.random.seed(42)
test_prices = pd.DataFrame({
    'sale_price': np.concatenate([
        np.random.normal(100, 30, 9500),  # normal data
        np.random.uniform(300, 500, 300),  # high outliers
        np.random.uniform(-50, 0, 200)     # low outliers (negative prices)
    ])
})
 
outliers, lower_bound, upper_bound = detect_outliers_iqr(test_prices, 'sale_price')
print("\nOutlier sample (top 5):")
print(outliers.nlargest(5, 'sale_price'))

Output

🔍 sale_price outlier detection (IQR method)
============================================================
✓ Q1 (25th percentile): 80.12
✓ Q3 (75th percentile): 120.45
✓ IQR: 40.33
✓ Lower bound: 19.63
✓ Upper bound: 180.95

Outliers: 756 (7.56%)

Outlier sample (top 5):
      sale_price
9567      498.23
9612      495.87
9534      492.15
9589      489.34
9501      487.92

Visualizing Outliers


def visualize_outliers(df, column):
    """Visualize outliers with a box plot and a histogram."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
    # Compute the IQR bounds
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
 
    # Box plot
    axes[0].boxplot(df[column].dropna(), vert=False)
    axes[0].set_xlabel(column)
    axes[0].set_title(f'{column} box plot')
 
    # Histogram with the IQR bounds marked
    axes[1].hist(df[column].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[1].axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:.0f}')
    axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:.0f}')
    axes[1].set_xlabel(column)
    axes[1].set_ylabel('Frequency')
    axes[1].set_title(f'{column} histogram')
    axes[1].legend()
 
    plt.tight_layout()
    plt.show()
 
visualize_outliers(test_prices, 'sale_price')

Output

[Graph Displayed]

Ways to Handle Outliers


# Outlier handling examples (using the test_prices data)
df_outlier = test_prices.copy()
 
# Compute the bounds
Q1 = df_outlier['sale_price'].quantile(0.25)
Q3 = df_outlier['sale_price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
 
print(f"Outlier bounds: {lower_bound:.2f} ~ {upper_bound:.2f}")
 
# 1. Remove outliers
df_removed = df_outlier[(df_outlier['sale_price'] >= lower_bound) &
                        (df_outlier['sale_price'] <= upper_bound)]
print(f"\n1. Outlier removal: {len(df_outlier):,} rows → {len(df_removed):,} rows")
 
# 2. Cap outliers at the bounds (clipping, a Winsorizing-style treatment)
df_capped = df_outlier.copy()
df_capped['sale_price_capped'] = df_capped['sale_price'].clip(lower=lower_bound, upper=upper_bound)
print("2. Capping (Winsorizing) applied")
print(f"   Original max: {df_outlier['sale_price'].max():.2f}")
print(f"   Max after clipping: {df_capped['sale_price_capped'].max():.2f}")
 
# 3. Replace outliers with the median of the non-outlier values
df_replaced = df_outlier.copy()
mask = (df_replaced['sale_price'] < lower_bound) | (df_replaced['sale_price'] > upper_bound)
median_val = df_replaced.loc[~mask, 'sale_price'].median()
df_replaced.loc[mask, 'sale_price'] = median_val
print(f"3. Median replacement: replaced {mask.sum()} outliers with {median_val:.2f}")

Output

Outlier bounds: 19.63 ~ 180.95

1. Outlier removal: 10,000 rows → 9,244 rows
2. Capping (Winsorizing) applied
   Original max: 498.23
   Max after clipping: 180.95
3. Median replacement: replaced 756 outliers with 100.12

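Quiz 3: Outlier Detection

Problem

Choose one numeric column and:

Compute the outlier bounds with the IQR method
Print the number of outliers and their share of the total
Visualize the column with a box plot

Answer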


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
# Create test data that contains outliers
np.random.seed(42)
df_quiz = pd.DataFrame({
    'quantity': np.concatenate([
        np.random.poisson(5, 9000),  # normal data
        np.random.randint(50, 100, 500),  # outliers
        np.array([0] * 500)  # zero values
    ])
})
 
selected_col = 'quantity'
 
print(f"🔍 {selected_col} outlier detection (IQR method)")
print("=" * 60)
 
# 1. Compute the IQR
Q1 = df_quiz[selected_col].quantile(0.25)
Q3 = df_quiz[selected_col].quantile(0.75)
IQR = Q3 - Q1
 
# Outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
 
print(f"Q1 (25th percentile): {Q1:,.2f}")
print(f"Q3 (75th percentile): {Q3:,.2f}")
print(f"IQR: {IQR:,.2f}")
print(f"Lower bound: {lower_bound:,.2f}")
print(f"Upper bound: {upper_bound:,.2f}")
 
# 2. Count the outliers
outliers = df_quiz[(df_quiz[selected_col] < lower_bound) | (df_quiz[selected_col] > upper_bound)]
outlier_count = len(outliers)
outlier_pct = (outlier_count / len(df_quiz) * 100)
 
print(f"\nOutliers: {outlier_count:,}")
print(f"Share of total: {outlier_pct:.2f}%")
 
# 3. Box plot
plt.figure(figsize=(10, 6))
plt.boxplot(df_quiz[selected_col].dropna(), vert=False)
plt.axvline(lower_bound, color='red', linestyle='--', label=f'Lower: {lower_bound:,.0f}')
plt.axvline(upper_bound, color='red', linestyle='--', label=f'Upper: {upper_bound:,.0f}')
plt.xlabel(selected_col)
plt.title(f'{selected_col} box plot (outlier check)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Output

🔍 quantity outlier detection (IQR method)
============================================================
Q1 (25th percentile): 3.00
Q3 (75th percentile): 7.00
IQR: 4.00
Lower bound: -3.00
Upper bound: 13.00

Outliers: 512
Share of total: 5.12%
[Graph Displayed]

5. Data Cleaning Pipeline

In Practice: End-to-End Data Cleaning


import pandas as pd
import numpy as np
 
def clean_data_pipeline(df, verbose=True):
    """Data cleaning pipeline: deduplicate, fill missing values, optimize dtypes."""
 
    df_clean = df.copy()
 
    if verbose:
        print("🔧 Starting data cleaning pipeline")
        print("=" * 60)
        print(f"Original data: {df_clean.shape[0]:,} rows × {df_clean.shape[1]} columns")
 
    # 1. Remove duplicates
    before = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    after = len(df_clean)
    if verbose:
        print(f"\n[1] Duplicates: {before - after:,} rows removed")
 
    # 2. Handle missing values
    missing_before = df_clean.isnull().sum().sum()
 
    # Numeric → median
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())
 
    # String → 'Unknown'
    object_cols = df_clean.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_clean[col] = df_clean[col].fillna('Unknown')
 
    missing_after = df_clean.isnull().sum().sum()
    if verbose:
        print(f"[2] Missing values: {missing_before - missing_after:,} handled")
 
    # 3. Optimize dtypes: low-cardinality text columns become 'category'
    for col in object_cols:
        if df_clean[col].nunique() / len(df_clean) < 0.5:
            df_clean[col] = df_clean[col].astype('category')
 
    if verbose:
        print("[3] Dtype optimization done")
        print(f"\n✅ Cleaning complete: {df_clean.shape[0]:,} rows × {df_clean.shape[1]} columns")
 
    return df_clean
 
# Run (using the orders data)
orders_clean = clean_data_pipeline(orders)

Output

🔧 Starting data cleaning pipeline
============================================================
Original data: 29,761 rows × 6 columns

[1] Duplicates: 0 rows removed
[2] Missing values: 2,832 handled
[3] Dtype optimization done

✅ Cleaning complete: 29,761 rows × 6 columns
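
Step [3] of the pipeline converts low-cardinality text columns to the `category` dtype. The sketch below illustrates why that can pay off; the `status` Series is hypothetical data generated for this example, and the exact byte counts depend on the pandas version, so none are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical status column: 3 distinct labels repeated 10,000 times
rng = np.random.default_rng(0)
status = pd.Series(rng.choice(['Complete', 'Pending', 'Cancelled'], size=10_000))

as_category = status.astype('category')

# deep=True counts the memory used by the strings themselves
mem_before = status.memory_usage(deep=True)
mem_after = as_category.memory_usage(deep=True)

print(f"text column: {mem_before:,} bytes")
print(f"category:    {mem_after:,} bytes")
print(as_category.cat.categories.tolist())
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeats. For columns where nearly every value is unique, such as free-text comments, the conversion brings little benefit, which is what the `nunique() / len(df) < 0.5` check in the pipeline guards against.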

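Summary

Key Functions

Task	Function	Description
Check missing values	`df.isnull().sum()`	Missing count per column
Missing percentage	`df.isnull().mean() * 100`	Missing percentage per column
Drop missing values	`df.dropna()`	Drop rows with missing values
Fill missing values	`df.fillna(value)`	Fill with a specific value
Check duplicates	`df.duplicated().sum()`	Number of duplicate rows
Remove duplicates	`df.drop_duplicates()`	Drop duplicate rows
Detect outliers	`df.quantile([0.25, 0.75])`	Compute the IQR
Clip values	`df['col'].clip(lower, upper)`	Limit values to the bounds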

SQL ↔ Pandas Comparison

SQL	Pandas
`WHERE col IS NOT NULL`	`df[df['col'].notna()]`
`COALESCE(col, value)`	`df['col'].fillna(value)`
`SELECT DISTINCT *`	`df.drop_duplicates()`
`COUNT(*) - COUNT(DISTINCT col)` (non-null `col`)	`df.duplicated(subset=['col']).sum()`
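
To see the correspondences side by side, the sketch below runs a few of the SQL forms against an in-memory SQLite database and compares them with the pandas calls. The table name `t` and its contents are hypothetical, and SQLite is used only because it ships with Python; other databases may differ in details such as how NULLs are treated.

```python
import sqlite3
import pandas as pd

# Hypothetical table with one repeated row and missing cities
df = pd.DataFrame({
    'id':   [1, 2, 2, 3],
    'city': ['서울', None, None, '부산'],
})

conn = sqlite3.connect(':memory:')
df.to_sql('t', conn, index=False)

# WHERE col IS NOT NULL  <->  notna()
sql_not_null = pd.read_sql('SELECT * FROM t WHERE city IS NOT NULL', conn)
pd_not_null = df[df['city'].notna()]
print(len(sql_not_null), len(pd_not_null))   # 2 2

# COALESCE(col, value)  <->  fillna(value)
sql_filled = pd.read_sql("SELECT COALESCE(city, 'Unknown') AS city FROM t", conn)
pd_filled = df['city'].fillna('Unknown')
print(sql_filled['city'].tolist() == pd_filled.tolist())   # True

# SELECT DISTINCT *  <->  drop_duplicates()
sql_distinct = pd.read_sql('SELECT DISTINCT * FROM t', conn)
pd_distinct = df.drop_duplicates()
print(len(sql_distinct), len(pd_distinct))   # 3 3

conn.close()
```

Both sides treat the two `(2, NULL)` rows as duplicates of each other here, which is why the distinct counts agree.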

Next Steps

You have mastered data cleaning! Next, in Data Filtering, learn advanced filtering techniques such as query(), isin(), and between().