
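Regression Analysis

Intermediate, Advanced

Learning Objectives

  • Understand simple and multiple linear regression
  • Interpret regression coefficients
  • Evaluate models (R², RMSE)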

0. Setup

Load the CSV files used in the exercises.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')

# Merge for analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')

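1. Simple Linear Regression

Theory

Simple linear regression predicts a dependent variable (Y) from a single independent variable (X).

Model: Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the error term.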
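For a single predictor, the least-squares estimates have a closed form: β₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β₀ = ȳ − β₁x̄. The sketch below computes them with NumPy on small made-up data. The helper name `fit_simple_ols` and the numbers are illustrative assumptions, not part of the course CSVs.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # the fitted line always passes through (mean_x, mean_y)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up data lying exactly on y = 2 + 3x
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 + 3.0 * x_demo

b0, b1 = fit_simple_ols(x_demo, y_demo)
print(f"intercept={b0:.4f}, slope={b1:.4f}")
```

Because the demo points have no noise, the recovered intercept and slope match the generating values; with real data the estimates carry sampling error, which is what the standard errors in a regression summary quantify.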

Python Implementation

import statsmodels.api as sm

# Independent and dependent variables
X = df['retail_price']
y = df['sale_price']

# Add the constant (intercept) term
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the results
print(model.summary())
print("\nInterpretation:")
print(f"- Intercept (β₀): {model.params['const']:.2f}")
print(f"- Slope (β₁): {model.params['retail_price']:.4f}")
print(f"- R²: {model.rsquared:.4f}")
print(f"→ A $1 increase in retail price raises the sale price by ${model.params['retail_price']:.4f}")
Output
OLS Regression Results                            
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.045e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4535e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181024   BIC:                        -1.091e+07
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -1.644e-15   6.28e-17    -26.160      0.000   -1.77e-15   -1.52e-15
retail_price     1.0000   6.99e-19   1.43e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   210423.122   Durbin-Watson:                   1.142
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         32784753.264
Skew:                           6.036   Prob(JB):                         0.00
Kurtosis:                      67.813   Cond. No.                         120.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation:
- Intercept (β₀): -0.00
- Slope (β₁): 1.0000
- R²: 1.0000
→ A $1 increase in retail price raises the sale price by $1.0000
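An R² of exactly 1.000 together with a slope of 1.0000 and an intercept on the order of 1e-15 usually indicates that the target is, up to floating-point error, a copy of the predictor, so the regression has nothing to estimate. A quick check before fitting can make this visible. The sketch below uses made-up arrays, and `is_near_copy` is a hypothetical helper; with the course data you would pass `df['sale_price']` and `df['retail_price']`.

```python
import numpy as np

def is_near_copy(y, x, tol=1e-9):
    """True if y equals x element-wise within an absolute tolerance."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return bool(np.allclose(y, x, rtol=0.0, atol=tol))

# Made-up prices, not taken from the course CSV files
retail_demo = np.array([19.99, 49.50, 120.00, 8.25])
sale_copy = retail_demo.copy()                                   # identical to retail
sale_discounted = retail_demo * np.array([1.0, 0.9, 0.8, 1.0])   # some items discounted

copy_flag = is_near_copy(sale_copy, retail_demo)
discount_flag = is_near_copy(sale_discounted, retail_demo)
print(copy_flag, discount_flag)
```

If the check returns `True` on real data, the coefficient table is better read as a confirmation that the two columns are identical than as an estimate of a pricing effect.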

2. Multiple Linear Regression

Using Multiple Independent Variables

from statsmodels.formula.api import ols

# Define the model with a formula
model = ols('sale_price ~ retail_price + cost + num_of_item', data=df).fit()
print(model.summary())

# Interpret the coefficients
print("\nEffect of each variable:")
for var, coef in model.params.items():
    if var != 'Intercept':
        print(f"- {var}: {coef:.4f}")
Output
OLS Regression Results                            
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 9.789e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.2778e+06
No. Observations:              181026   AIC:                        -1.056e+07
Df Residuals:                  181022   BIC:                        -1.056e+07
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -2.618e-14   2.79e-16    -93.752      0.000   -2.67e-14   -2.56e-14
retail_price     1.0000   9.97e-18      1e+17      0.000       1.000       1.000
cost          1.442e-16   2.16e-17      6.681      0.000    1.02e-16    1.87e-16
num_of_item   2.086e-15   1.17e-16     17.822      0.000    1.86e-15    2.32e-15
==============================================================================
Omnibus:                   174196.359   Durbin-Watson:                   1.677
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         11170271.467
Skew:                          -4.624   Prob(JB):                         0.00
Kurtosis:                      40.355   Cond. No.                         236.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Effect of each variable:
- retail_price: 1.0000
- cost: 0.0000
- num_of_item: 0.0000
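With several predictors, the model becomes Y = β₀ + β₁X₁ + … + βₖXₖ + ε, and the coefficients are the least-squares solution of the design matrix (a column of ones plus one column per predictor). As a rough sketch of what the fit computes, the example below solves that problem with `np.linalg.lstsq` on synthetic data. The variable names and generating coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two synthetic predictors (not the course columns)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Noise-free target so the known coefficients are recovered exactly
y_synth = 1.5 + 2.0 * x1 - 0.5 * x2

# Design matrix: intercept column followed by the predictors
design = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution of design @ beta ≈ y_synth
beta, *_ = np.linalg.lstsq(design, y_synth, rcond=None)
print(np.round(beta, 4))
```

Each coefficient is the expected change in Y for a one-unit change in that predictor with the other predictors held fixed, which is how the per-variable values printed above are meant to be read.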

Including Categorical Variables

# Categorical variables are dummy-encoded automatically
model = ols('sale_price ~ retail_price + C(department)', data=df).fit()
print(model.summary())
Output
OLS Regression Results                            
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.018e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4530e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181023   BIC:                        -1.091e+07
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                -1.7e-14   8.01e-17   -212.274      0.000   -1.72e-14   -1.68e-14
C(department)[T.Women] -1.179e-14   9.43e-17   -125.084      0.000    -1.2e-14   -1.16e-14
retail_price               1.0000   7.02e-19   1.42e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   176025.839   Durbin-Watson:                   1.474
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         15454549.554
Skew:                          -4.573   Prob(JB):                         0.00
Kurtosis:                      47.331   Cond. No.                         213.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
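The `C(department)[T.Women]` row in the summary is a dummy (treatment-coded) variable: one level acts as the baseline and every other level gets its own 0/1 column, whose coefficient is the shift relative to the baseline. The sketch below shows that encoding by hand. `treatment_code` is a hypothetical helper, and the baseline name `Men` is an assumption, since only `Women` appears in the output.

```python
import numpy as np

def treatment_code(values, baseline):
    """Build one 0/1 indicator column per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    return {
        f"[T.{level}]": np.array([1.0 if v == level else 0.0 for v in values])
        for level in levels
    }

# Made-up department labels; "Men" as the baseline is an assumption
department_demo = ["Men", "Women", "Women", "Men"]
dummies = treatment_code(department_demo, baseline="Men")
print(dummies)
```

Dropping the baseline column avoids perfect collinearity with the intercept, which is why a two-level variable contributes a single coefficient to the model.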

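3. Model Evaluation

R² (Coefficient of Determination)

  • The proportion of the variance in the dependent variable that the model explains
  • Ranges from 0 to 1 for an OLS model with an intercept; higher values indicate a closer fit to the data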
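The definition above corresponds to R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of Y from its mean. A minimal NumPy sketch on made-up numbers follows; `r_squared` is a hypothetical helper, not a statsmodels function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_obs = np.array([3.0, 5.0, 7.0, 9.0])  # made-up observations

r2_perfect = r_squared(y_obs, y_obs)                                # predictions equal the data
r2_mean_only = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))  # always predict the mean
print(r2_perfect, r2_mean_only)
```

The two cases mark the ends of the scale: predictions that reproduce the data give 1, and a model that only predicts the mean gives 0.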

RMSE (Root Mean Squared Error)

from sklearn.metrics import mean_squared_error
import numpy as np

# Predict
y_pred = model.predict(df)

# Compute RMSE
rmse = np.sqrt(mean_squared_error(df['sale_price'], y_pred))
print(f"RMSE: ${rmse:.2f}")
Output
RMSE: $0.00
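RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same unit as the target (dollars here). The sketch below computes it directly with NumPy on made-up numbers. The name `rmse_manual` is an assumption used to keep it distinct from the `rmse` variable in the lesson code.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up observed and predicted prices
y_obs = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

error = rmse_manual(y_obs, y_hat)
print(f"RMSE: ${error:.2f}")
```

An RMSE computed on the same rows used for fitting, as in the lesson code, measures in-sample fit only; evaluating on held-out rows would give a more honest estimate of prediction error.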

Checking Multicollinearity (VIF)

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF
X = df[['retail_price', 'cost', 'num_of_item']]
X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data['variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("VIF (10 or above suggests multicollinearity):")
print(vif_data)
Output
VIF (10 or above suggests multicollinearity):
       variable        VIF
0         const   5.081528
1  retail_price  29.215176
2          cost  29.215126
3   num_of_item   1.000010
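The VIF of a predictor is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. That is why `retail_price` and `cost`, which move together, both show values near 29 while `num_of_item` stays near 1. The sketch below reproduces the idea with NumPy on synthetic data. `vif_manual` is a hypothetical helper, and the generated columns only mimic a price/cost relationship.

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: 1 / (1 - R²) from regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=500)
cost = 0.5 * price + rng.normal(0, 1.0, size=500)   # nearly proportional to price
qty = rng.integers(1, 5, size=500).astype(float)    # unrelated to the other two

X_demo = np.column_stack([price, cost, qty])
vifs = [vif_manual(X_demo, j) for j in range(X_demo.shape[1])]
print([round(v, 2) for v in vifs])
```

A high VIF inflates the standard errors of the affected coefficients. A common remedy is to drop or combine one of the correlated predictors.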

Quiz

Problem

Build a regression model that predicts the sale price (sale_price) from the retail price (retail_price) and the cost (cost), and interpret the effect of each variable.

Answer

from statsmodels.formula.api import ols

# Multiple regression model
model = ols('sale_price ~ retail_price + cost', data=df).fit()
print(model.summary())

print("\n=== Interpretation ===")
print(f"R²: {model.rsquared:.4f} (the model explains {model.rsquared*100:.1f}% of the variance)")
print("\nCoefficients:")
print(f"- Retail price up $1 → sale price up ${model.params['retail_price']:.4f}")
print(f"- Cost up $1 → sale price up ${model.params['cost']:.4f}")

# Check p-values
for var in ['retail_price', 'cost']:
    p = model.pvalues[var]
    sig = "significant" if p < 0.05 else "not significant"
    print(f"- {var}: p={p:.4f} ({sig})")
Output
OLS Regression Results                            
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.002e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:24   Log-Likelihood:             5.1341e+06
No. Observations:              181026   AIC:                        -1.027e+07
Df Residuals:                  181023   BIC:                        -1.027e+07
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -3.617e-15   3.75e-16     -9.655      0.000   -4.35e-15   -2.88e-15
retail_price     1.0000   2.21e-17   4.53e+16      0.000       1.000       1.000
cost          9.236e-16   4.77e-17     19.345      0.000     8.3e-16    1.02e-15
==============================================================================
Omnibus:                   179421.631   Durbin-Watson:                   1.178
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         13824323.079
Skew:                          -4.786   Prob(JB):                         0.00
Kurtosis:                      44.727   Cond. No.                         136.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

=== Interpretation ===
R²: 1.0000 (the model explains 100.0% of the variance)

Coefficients:
- Retail price up $1 → sale price up $1.0000
- Cost up $1 → sale price up $0.0000
- retail_price: p=0.0000 (significant)
- cost: p=0.0000 (significant)
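A small p-value does not by itself mean an effect matters. In the output above, `cost` is flagged as significant even though its coefficient is about 9.236e-16, which is at the level of floating-point rounding. One way to keep the two questions apart is to check effect size alongside the p-value, as in the sketch below. `classify_effect` and its `min_effect` threshold are assumptions for illustration, and a sensible threshold depends on the unit of the target.

```python
def classify_effect(coef, p_value, alpha=0.05, min_effect=1e-6):
    """Label a coefficient by statistical and practical significance."""
    if p_value >= alpha:
        return "not significant"
    if abs(coef) < min_effect:
        # statistically detectable, but too small to matter in practice
        return "significant but negligible"
    return "significant"

# Coefficient values copied from the summary table above; p-values print as 0.000
retail_label = classify_effect(1.0000, 0.0)
cost_label = classify_effect(9.236e-16, 0.0)
print(f"retail_price: {retail_label}")
print(f"cost: {cost_label}")
```

Under this reading, only the retail price has a meaningful effect on the sale price in this fit; the cost coefficient is effectively zero.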

Next Steps

Continue to A/B Testing to learn how to verify causal relationships through experiments.
