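Regression Analysis
Intermediate / Advanced
Learning Objectives
- Understand simple and multiple linear regression
- Interpret regression coefficients
- Evaluate models (R², RMSE)
0. Setup
Load the CSV files used in the hands-on exercises.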
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')
1. Simple Linear Regression
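Theory
Simple linear regression predicts a dependent variable (Y) from a single independent variable (X).
Model: Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the error term.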
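For this model, the least-squares estimates have a closed form: β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄. The sketch below computes them directly with NumPy. The helper name fit_simple_ols and the synthetic data are assumptions made for illustration and are not part of the tutorial's dataset.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through (x_mean, y_mean)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Synthetic data (made up for illustration): Y = 2 + 3X + noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(0, 1, size=200)

beta0_hat, beta1_hat = fit_simple_ols(x_demo, y_demo)
print(f"beta0 ≈ {beta0_hat:.3f}, beta1 ≈ {beta1_hat:.3f}")
```

The estimates should land close to the true values of 2 and 3, and they should agree with what a library fit such as sm.OLS returns on the same data.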
Python Implementation
import statsmodels.api as sm
# Independent and dependent variables
X = df['retail_price']
y = df['sale_price']
# Add the constant (intercept) term
X = sm.add_constant(X)
# Fit the regression model
model = sm.OLS(y, X).fit()
# Print the results
print(model.summary())
print("\nInterpretation:")
print(f"- Intercept (β₀): {model.params['const']:.2f}")
print(f"- Slope (β₁): {model.params['retail_price']:.4f}")
print(f"- R²: {model.rsquared:.4f}")
print(f"→ A $1 increase in retail price raises the sale price by ${model.params['retail_price']:.4f}")
Output
                            OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.045e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4535e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181024   BIC:                        -1.091e+07
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -1.644e-15   6.28e-17    -26.160      0.000   -1.77e-15   -1.52e-15
retail_price     1.0000   6.99e-19   1.43e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   210423.122   Durbin-Watson:                   1.142
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         32784753.264
Skew:                           6.036   Prob(JB):                         0.00
Kurtosis:                      67.813   Cond. No.                         120.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation:
- Intercept (β₀): -0.00
- Slope (β₁): 1.0000
- R²: 1.0000
→ A $1 increase in retail price raises the sale price by $1.0000
2. Multiple Linear Regression
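The simple model in the previous section reports R² = 1.000, a slope of exactly 1.0000, and an intercept of about -1.6e-15. That pattern is what one would expect if sale_price were numerically identical to retail_price in this dataset, in which case any additional predictors can only pick up floating-point noise. A quick way to check such a suspicion is sketched below. The helper share_identical and the small demo frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def share_identical(a, b, tol=1e-9):
    """Fraction of rows where two numeric columns agree within an absolute tolerance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean(np.isclose(a, b, rtol=0.0, atol=tol)))

# Made-up frame in which sale_price is an exact copy of retail_price
demo = pd.DataFrame({"retail_price": [19.99, 45.00, 120.50, 8.25]})
demo["sale_price"] = demo["retail_price"]

frac = share_identical(demo["sale_price"], demo["retail_price"])
print(f"Share of identical rows: {frac:.1%}")
```

On the tutorial's data the same call would be share_identical(df['sale_price'], df['retail_price']). A result of 1.0 would mean the regression has nothing left to estimate beyond the identity, so the coefficients on other variables should not be read as real effects.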
Using Multiple Independent Variables
from statsmodels.formula.api import ols
# Define the model with a formula
model = ols('sale_price ~ retail_price + cost + num_of_item', data=df).fit()
print(model.summary())
# Interpret the coefficients
print("\nEffect of each variable:")
for var, coef in model.params.items():
    if var != 'Intercept':
        print(f"- {var}: {coef:.4f}")
Output
                            OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 9.789e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.2778e+06
No. Observations:              181026   AIC:                        -1.056e+07
Df Residuals:                  181022   BIC:                        -1.056e+07
Df Model:                           3
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -2.618e-14   2.79e-16    -93.752      0.000   -2.67e-14   -2.56e-14
retail_price     1.0000   9.97e-18      1e+17      0.000       1.000       1.000
cost          1.442e-16   2.16e-17      6.681      0.000    1.02e-16    1.87e-16
num_of_item   2.086e-15   1.17e-16     17.822      0.000    1.86e-15    2.32e-15
==============================================================================
Omnibus:                   174196.359   Durbin-Watson:                   1.677
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         11170271.467
Skew:                          -4.624   Prob(JB):                         0.00
Kurtosis:                      40.355   Cond. No.                         236.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Effect of each variable:
- retail_price: 1.0000
- cost: 0.0000
- num_of_item: 0.0000
Including Categorical Variables
# C() marks a variable as categorical; the formula interface dummy-encodes it automatically
model = ols('sale_price ~ retail_price + C(department)', data=df).fit()
print(model.summary())
Output
                            OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.018e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4530e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181023   BIC:                        -1.091e+07
Df Model:                           2
Covariance Type:            nonrobust
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                -1.7e-14   8.01e-17   -212.274      0.000   -1.72e-14   -1.68e-14
C(department)[T.Women] -1.179e-14   9.43e-17   -125.084      0.000    -1.2e-14   -1.16e-14
retail_price               1.0000   7.02e-19   1.42e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   176025.839   Durbin-Watson:                   1.474
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         15454549.554
Skew:                          -4.573   Prob(JB):                         0.00
Kurtosis:                      47.331   Cond. No.                         213.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C(department)[T.Women] is the estimated offset for the Women department relative to the reference level, which is omitted from the table. Here it is about -1.2e-14, which is effectively zero.
3. Model Evaluation
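R² (Coefficient of Determination)
- The proportion of the variance in the dependent variable that the model explains
- Between 0 and 1 for an OLS model with an intercept evaluated on its training data; values closer to 1 indicate a closer fit, but a high R² alone does not show that the model is appropriate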
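The definition can be written as R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observations from their mean. The following is a minimal sketch of that formula; the function name r_squared and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_obs = np.array([3.0, 5.0, 7.0, 9.0])
perfect = y_obs.copy()                         # predictions equal to the observations
mean_only = np.full_like(y_obs, y_obs.mean())  # always predict the mean

print(r_squared(y_obs, perfect))    # 1.0
print(r_squared(y_obs, mean_only))  # 0.0
```

Predictions that are worse than always guessing the mean produce a negative value under this formula, which is why the 0 to 1 range only holds in the setting described above.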
RMSE (Root Mean Squared Error)
from sklearn.metrics import mean_squared_error
import numpy as np
# Predict with the formula model fitted in the previous section
y_pred = model.predict(df)
# Compute RMSE
rmse = np.sqrt(mean_squared_error(df['sale_price'], y_pred))
print(f"RMSE: ${rmse:.2f}")
Output
RMSE: $0.00
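An RMSE of $0.00 is consistent with the R² of 1.000 reported earlier: the predictions match the observed sale prices to within rounding. To make the metric itself concrete, here is a small sketch that computes RMSE by hand as the square root of the mean squared error. The function name rmse and the three toy values are assumptions, chosen so the errors are not zero.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = np.array([10.0, 20.0, 30.0])
predicted = np.array([12.0, 18.0, 33.0])  # errors of -2, +2, -3

print(f"RMSE: ${rmse(actual, predicted):.2f}")  # RMSE: $2.38
```

Because RMSE is expressed in the same unit as the target (dollars here), it reads as a typical size of the prediction error, with larger errors weighted more heavily than small ones.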
Checking Multicollinearity (VIF)
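The variance inflation factor of a predictor is commonly defined as 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the other predictors plus an intercept. A value near 1 means the predictor is almost unrelated to the others, while large values mean it is nearly a linear combination of them. The sketch below computes this by hand on made-up data as a rough companion to the statsmodels call used in this section. The function vif and the columns price, cost_like, and unrelated are hypothetical.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R²) from regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic predictors (made up): cost_like is built to track price closely
rng = np.random.default_rng(42)
price = rng.uniform(10, 200, size=500)
cost_like = 0.45 * price + rng.normal(0, 3, size=500)
unrelated = rng.normal(0, 1, size=500)
X_demo = np.column_stack([price, cost_like, unrelated])

for j, name in enumerate(["price", "cost_like", "unrelated"]):
    print(f"{name}: VIF = {vif(X_demo, j):.2f}")
```

In this synthetic setup the two linked columns should come out well above 10 and the unrelated column close to 1.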
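from statsmodels.stats.outliers_influence import variance_inflation_factor
# Compute VIF
X = df[['retail_price', 'cost', 'num_of_item']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data['variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF (10 or higher suggests multicollinearity):")
print(vif_data)
Output
VIF (10 or higher suggests multicollinearity):
       variable        VIF
0         const   5.081528
1  retail_price  29.215176
2          cost  29.215126
3   num_of_item   1.000010
retail_price and cost both have a VIF of about 29, well above the threshold of 10, so the two are strongly correlated with each other; num_of_item, at about 1.0, is not. The VIF reported for const is generally not interpreted.
Quiz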
Problem
Build a regression model that predicts the sale price (sale_price) from the retail price (retail_price) and cost (cost), and interpret the effect of each variable.
Answer
from statsmodels.formula.api import ols
# Multiple regression model
model = ols('sale_price ~ retail_price + cost', data=df).fit()
print(model.summary())
print("\n=== Interpretation ===")
print(f"R²: {model.rsquared:.4f} (the model explains {model.rsquared*100:.1f}% of the variance)")
print("\nCoefficients:")
print(f"- A $1 increase in retail price → sale price rises by ${model.params['retail_price']:.4f}")
print(f"- A $1 increase in cost → sale price rises by ${model.params['cost']:.4f}")
# Check the p-values
for var in ['retail_price', 'cost']:
    p = model.pvalues[var]
    sig = "significant" if p < 0.05 else "not significant"
    print(f"- {var}: p={p:.4f} ({sig})")
Output
                            OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.002e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:24   Log-Likelihood:             5.1341e+06
No. Observations:              181026   AIC:                        -1.027e+07
Df Residuals:                  181023   BIC:                        -1.027e+07
Df Model:                           2
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -3.617e-15   3.75e-16     -9.655      0.000   -4.35e-15   -2.88e-15
retail_price     1.0000   2.21e-17   4.53e+16      0.000       1.000       1.000
cost          9.236e-16   4.77e-17     19.345      0.000     8.3e-16    1.02e-15
==============================================================================
Omnibus:                   179421.631   Durbin-Watson:                   1.178
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         13824323.079
Skew:                          -4.786   Prob(JB):                         0.00
Kurtosis:                      44.727   Cond. No.                         136.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Interpretation ===
R²: 1.0000 (the model explains 100.0% of the variance)
Coefficients:
- A $1 increase in retail price → sale price rises by $1.0000
- A $1 increase in cost → sale price rises by $0.0000
- retail_price: p=0.0000 (significant)
- cost: p=0.0000 (significant)
Although cost is flagged as significant, its coefficient is about 9.2e-16, so its practical effect on sale_price is zero. With R² = 1.000 the standard errors sit at floating-point precision, so these p-values should be read with caution rather than as evidence of a real effect.
Next Steps
Continue to A/B Testing to learn how experiments are used to verify causal relationships.