
Regression Analysis

Level: Intermediate / Advanced

Learning Objectives

  • Understand simple and multiple linear regression
  • Interpret regression coefficients
  • Evaluate models using R² and RMSE

0. Setup

Load the CSV files used for the hands-on examples and merge them into a single analysis table.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')

# Merge for analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')

1. Simple Linear Regression

Theory

Simple linear regression predicts the dependent variable (Y) using one independent variable (X).

Model: Y = β₀ + β₁X + ε
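
Before reaching for a library, the coefficients can be computed from the closed-form OLS formulas: the slope β₁ is Cov(X, Y) / Var(X), and the intercept β₀ is the mean of Y minus β₁ times the mean of X. A minimal sketch with NumPy, assuming df has been built as in the setup step:

# OLS estimates by hand (sketch; assumes df from the setup step)
x = df['retail_price'].to_numpy()
y = df['sale_price'].to_numpy()

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = Cov(X, Y) / Var(X)
beta0 = y.mean() - beta1 * x.mean()                      # intercept = mean(Y) - β₁·mean(X)

print(f"β₀ (intercept): {beta0:.4f}")
print(f"β₁ (slope): {beta1:.4f}")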

Python Implementation

import statsmodels.api as sm

# Independent and dependent variables
X = df['retail_price']
y = df['sale_price']

# Add constant term
X = sm.add_constant(X)

# Fit regression model
model = sm.OLS(y, X).fit()

# Print results
print(model.summary())

print(f"\nInterpretation:")
print(f"- Intercept (β₀): {model.params['const']:.2f}")
print(f"- Slope (β₁): {model.params['retail_price']:.4f}")
print(f"- R²: {model.rsquared:.4f}")
print(f"→ For each $1 increase in retail price, sale price increases by ${model.params['retail_price']:.4f}")
Output
OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.045e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4535e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181024   BIC:                        -1.091e+07
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -1.644e-15   6.28e-17    -26.160      0.000   -1.77e-15   -1.52e-15
retail_price     1.0000   6.99e-19   1.43e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   210423.122   Durbin-Watson:                   1.142
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         32784753.264
Skew:                           6.036   Prob(JB):                         0.00
Kurtosis:                      67.813   Cond. No.                         120.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation:
- Intercept (β₀): -0.00
- Slope (β₁): 1.0000
- R²: 1.0000
→ For each $1 increase in retail price, sale price increases by $1.0000
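
Note that in this practice dataset the fit is exact: the slope is 1.0, the intercept is essentially 0, and R² is 1.000, because sale_price appears to mirror retail_price exactly. A scatter plot with the fitted line makes the relationship easy to see; a minimal sketch using the seaborn and matplotlib imports from the setup (subsampling is only to keep the plot readable):

# Scatter plot with fitted regression line (sketch; assumes df from the setup step)
sample = df.sample(n=5000, random_state=42)  # subsample for a lighter plot
sns.regplot(x='retail_price', y='sale_price', data=sample,
            scatter_kws={'alpha': 0.2}, line_kws={'color': 'red'})
plt.title('Sale price vs. retail price')
plt.xlabel('Retail price ($)')
plt.ylabel('Sale price ($)')
plt.show()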

2. Multiple Linear Regression

Using Multiple Independent Variables

from statsmodels.formula.api import ols

# Define model with a formula
model = ols('sale_price ~ retail_price + cost + num_of_item', data=df).fit()
print(model.summary())

# Coefficient interpretation
print("\nVariable Effects:")
for var, coef in model.params.items():
    if var != 'Intercept':
        print(f"- {var}: {coef:.4f}")
Output
OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 9.789e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.2778e+06
No. Observations:              181026   AIC:                        -1.056e+07
Df Residuals:                  181022   BIC:                        -1.056e+07
Df Model:                           3
Covariance Type:            nonrobust
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -2.618e-14   2.79e-16    -93.752      0.000   -2.67e-14   -2.56e-14
retail_price     1.0000   9.97e-18      1e+17      0.000       1.000       1.000
cost          1.442e-16   2.16e-17      6.681      0.000    1.02e-16    1.87e-16
num_of_item   2.086e-15   1.17e-16     17.822      0.000    1.86e-15    2.32e-15
==============================================================================
Omnibus:                   174196.359   Durbin-Watson:                   1.677
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         11170271.467
Skew:                          -4.624   Prob(JB):                         0.00
Kurtosis:                      40.355   Cond. No.                         236.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Variable Effects:
- retail_price: 1.0000
- cost: 0.0000
- num_of_item: 0.0000
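
Reading this output: once retail_price is in the model, the coefficients on cost and num_of_item are numerically zero (on the order of 10⁻¹⁵), so they add no explanatory power here. In a dataset where sale_price did not track retail_price exactly, each coefficient would estimate the effect of that variable with the other variables held fixed.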

Including Categorical Variables

# Categorical variables are automatically dummy encoded
model = ols('sale_price ~ retail_price + C(department)', data=df).fit()
print(model.summary())
Output
OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.018e+36
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:23   Log-Likelihood:             5.4530e+06
No. Observations:              181026   AIC:                        -1.091e+07
Df Residuals:                  181023   BIC:                        -1.091e+07
Df Model:                           2
Covariance Type:            nonrobust
==========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                -1.7e-14   8.01e-17   -212.274      0.000   -1.72e-14   -1.68e-14
C(department)[T.Women] -1.179e-14   9.43e-17   -125.084      0.000    -1.2e-14   -1.16e-14
retail_price               1.0000   7.02e-19   1.42e+18      0.000       1.000       1.000
==============================================================================
Omnibus:                   176025.839   Durbin-Watson:                   1.474
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         15454549.554
Skew:                          -4.573   Prob(JB):                         0.00
Kurtosis:                      47.331   Cond. No.                         213.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
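
With dummy encoding, the coefficient on C(department)[T.Women] is the expected difference in sale price for the Women department relative to the omitted baseline department, holding retail_price fixed; here it is numerically zero for the same reason as above.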

3. Model Evaluation

R² (Coefficient of Determination)

  • Proportion of variance in the dependent variable explained by the model
  • Ranges from 0 to 1; higher is better (see the sketch below)
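
The definition can be checked directly: R² = 1 − SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean. A minimal sketch, assuming model is the most recently fitted formula-based OLS result and df comes from the setup step:

# R² from its definition (sketch; assumes `model` is a fitted formula-based OLS result)
y_true = df['sale_price']
y_hat = model.predict(df)

ss_res = ((y_true - y_hat) ** 2).sum()          # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"R² (manual): {r_squared:.4f}")
print(f"R² (statsmodels): {model.rsquared:.4f}")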

RMSE (Root Mean Squared Error)
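
RMSE is the typical size of a prediction error, expressed in the same units as the dependent variable (here, dollars); lower is better.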

from sklearn.metrics import mean_squared_error
import numpy as np

# Predictions
y_pred = model.predict(df)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(df['sale_price'], y_pred))
print(f"RMSE: ${rmse:.2f}")
Output
RMSE: $0.00

Checking Multicollinearity (VIF)

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF
X = df[['retail_price', 'cost', 'num_of_item']]
X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("VIF (Multicollinearity suspected if > 10):")
print(vif_data)
Output
VIF (Multicollinearity suspected if > 10):
     Variable        VIF
0         const   5.081528
1  retail_price  29.215176
2          cost  29.215126
3   num_of_item   1.000010
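
In this output retail_price and cost both have VIF values around 29, well above the common threshold of 10, which indicates they are strongly collinear with each other; in practice you would consider keeping only one of them (or combining them), while num_of_item (VIF ≈ 1) poses no problem.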

Quiz

Problem

Create a regression model to predict sale price using retail price and cost, and interpret the influence of each variable.

Answer

from statsmodels.formula.api import ols

# Multiple regression model
model = ols('sale_price ~ retail_price + cost', data=df).fit()
print(model.summary())

print("\n=== Interpretation ===")
print(f"R²: {model.rsquared:.4f} (Model explains {model.rsquared*100:.1f}% of variance)")
print(f"\nCoefficients:")
print(f"- Retail price $1 increase → Sale price ${model.params['retail_price']:.4f} increase")
print(f"- Cost $1 increase → Sale price ${model.params['cost']:.4f} increase")

# Check p-values
for var in ['retail_price', 'cost']:
    p = model.pvalues[var]
    sig = "Significant" if p < 0.05 else "Not significant"
    print(f"- {var}: p={p:.4f} ({sig})")
Output
OLS Regression Results
==============================================================================
Dep. Variable:             sale_price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.002e+34
Date:                Sat, 20 Dec 2025   Prob (F-statistic):               0.00
Time:                        00:24:24   Log-Likelihood:             5.1341e+06
No. Observations:              181026   AIC:                        -1.027e+07
Df Residuals:                  181023   BIC:                        -1.027e+07
Df Model:                           2
Covariance Type:            nonrobust
================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -3.617e-15   3.75e-16     -9.655      0.000   -4.35e-15   -2.88e-15
retail_price     1.0000   2.21e-17   4.53e+16      0.000       1.000       1.000
cost          9.236e-16   4.77e-17     19.345      0.000     8.3e-16    1.02e-15
==============================================================================
Omnibus:                   179421.631   Durbin-Watson:                   1.178
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         13824323.079
Skew:                          -4.786   Prob(JB):                         0.00
Kurtosis:                      44.727   Cond. No.                         136.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

=== Interpretation ===
R²: 1.0000 (Model explains 100.0% of variance)

Coefficients:
- Retail price $1 increase → Sale price $1.0000 increase
- Cost $1 increase → Sale price $0.0000 increase
- retail_price: p=0.0000 (Significant)
- cost: p=0.0000 (Significant)

Next Steps

Learn how to verify causal relationships through experiments in A/B Testing.
