Regression Analysis
Intermediate / Advanced
Learning Objectives
- Understand simple/multiple linear regression
- Interpret regression coefficients
- Model evaluation (R², RMSE)
0. Setup
Load CSV files for data practice.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')

1. Simple Linear Regression
Theory
Simple linear regression predicts the dependent variable (Y) using one independent variable (X).
Model: Y = β₀ + β₁X + ε
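OLS picks the coefficients that minimize the sum of squared residuals. For the simple model this has a closed form: β₁ = Cov(X, Y) / Var(X) and β₀ = Ȳ − β₁X̄. A quick check on made-up numbers (not the course data):

```python
import numpy as np

# Tiny made-up sample, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form OLS estimates for Y = b0 + b1*X
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
print(f"b0={b0:.2f}, b1={b1:.2f}")  # b0=0.15, b1=1.94
```

Fitting the same data with `sm.OLS` (as below) returns identical estimates; the library just automates this minimization.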
Python Implementation
import statsmodels.api as sm
# Independent and dependent variables
X = df['retail_price']
y = df['sale_price']
# Add constant term
X = sm.add_constant(X)
# Fit regression model
model = sm.OLS(y, X).fit()
# Print results
print(model.summary())
print(f"\nInterpretation:")
print(f"- Intercept (β₀): {model.params['const']:.2f}")
print(f"- Slope (β₁): {model.params['retail_price']:.4f}")
print(f"- R²: {model.rsquared:.4f}")
print(f"→ For each $1 increase in retail price, sale price increases by ${model.params['retail_price']:.4f}")

Execution Result
OLS Regression Results
==============================================================================
Dep. Variable: sale_price R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.045e+36
Date: Sat, 20 Dec 2025 Prob (F-statistic): 0.00
Time: 00:24:23 Log-Likelihood: 5.4535e+06
No. Observations: 181026 AIC: -1.091e+07
Df Residuals: 181024 BIC: -1.091e+07
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -1.644e-15 6.28e-17 -26.160 0.000 -1.77e-15 -1.52e-15
retail_price 1.0000 6.99e-19 1.43e+18 0.000 1.000 1.000
==============================================================================
Omnibus: 210423.122 Durbin-Watson: 1.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32784753.264
Skew: 6.036 Prob(JB): 0.00
Kurtosis: 67.813 Cond. No. 120.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation:
- Intercept (β₀): -0.00
- Slope (β₁): 1.0000
- R²: 1.0000
→ For each $1 increase in retail price, sale price increases by $1.0000

2. Multiple Linear Regression
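In this dataset `sale_price` equals `retail_price` exactly, so every model above fits perfectly (R² = 1; coefficients on the order of e-15 are numerical zero). To see more typical output, here is the same workflow on noisy synthetic data, where estimates approximate but do not exactly equal the true coefficients (the column names are reused purely for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the course data: price driven by two variables plus noise
rng = np.random.default_rng(42)
n = 1000
d = pd.DataFrame({
    'retail_price': rng.uniform(10.0, 100.0, n),
    'num_of_item': rng.integers(1, 5, n),
})
d['sale_price'] = 0.9 * d['retail_price'] + 2.0 * d['num_of_item'] + rng.normal(0.0, 3.0, n)

fit = ols('sale_price ~ retail_price + num_of_item', data=d).fit()
print(fit.params.round(2))    # coefficients near the true 0.9 and 2.0
print(round(fit.rsquared, 3)) # high, but below 1 because of the noise
```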
Using Multiple Independent Variables
from statsmodels.formula.api import ols
# Define model with formula
model = ols('sale_price ~ retail_price + cost + num_of_item', data=df).fit()
print(model.summary())
# Coefficient interpretation
print("\nVariable Effects:")
for var, coef in model.params.items():
    if var != 'Intercept':
        print(f"- {var}: {coef:.4f}")

Execution Result
OLS Regression Results
==============================================================================
Dep. Variable: sale_price R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 9.789e+34
Date: Sat, 20 Dec 2025 Prob (F-statistic): 0.00
Time: 00:24:23 Log-Likelihood: 5.2778e+06
No. Observations: 181026 AIC: -1.056e+07
Df Residuals: 181022 BIC: -1.056e+07
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -2.618e-14 2.79e-16 -93.752 0.000 -2.67e-14 -2.56e-14
retail_price 1.0000 9.97e-18 1e+17 0.000 1.000 1.000
cost 1.442e-16 2.16e-17 6.681 0.000 1.02e-16 1.87e-16
num_of_item 2.086e-15 1.17e-16 17.822 0.000 1.86e-15 2.32e-15
==============================================================================
Omnibus: 174196.359 Durbin-Watson: 1.677
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11170271.467
Skew: -4.624 Prob(JB): 0.00
Kurtosis: 40.355 Cond. No. 236.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Variable Effects:
- retail_price: 1.0000
- cost: 0.0000
- num_of_item: 0.0000

Including Categorical Variables
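To see what this encoding looks like, `pd.get_dummies` applied to a small made-up column (not the course data) mirrors the treatment coding that the formula interface uses by default:

```python
import pandas as pd

# Hypothetical department column, for illustration only
demo = pd.DataFrame({'department': ['Men', 'Women', 'Women', 'Men']})
# drop_first=True drops the baseline level ('Men'), just as ols() keeps
# one level as the reference category
print(pd.get_dummies(demo['department'], drop_first=True))
```

The remaining `Women` indicator matches the `C(department)[T.Women]` term in the summary below: its coefficient is the difference relative to the dropped reference level.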
# Categorical variables are automatically dummy encoded
model = ols('sale_price ~ retail_price + C(department)', data=df).fit()
print(model.summary())

Execution Result
OLS Regression Results
==============================================================================
Dep. Variable: sale_price R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.018e+36
Date: Sat, 20 Dec 2025 Prob (F-statistic): 0.00
Time: 00:24:23 Log-Likelihood: 5.4530e+06
No. Observations: 181026 AIC: -1.091e+07
Df Residuals: 181023 BIC: -1.091e+07
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept -1.7e-14 8.01e-17 -212.274 0.000 -1.72e-14 -1.68e-14
C(department)[T.Women] -1.179e-14 9.43e-17 -125.084 0.000 -1.2e-14 -1.16e-14
retail_price 1.0000 7.02e-19 1.42e+18 0.000 1.000 1.000
==============================================================================
Omnibus: 176025.839 Durbin-Watson: 1.474
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15454549.554
Skew: -4.573 Prob(JB): 0.00
Kurtosis: 47.331 Cond. No. 213.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3. Model Evaluation
R² (Coefficient of Determination)
- Proportion of variance explained by the model
- Ranges from 0 to 1, higher is better
RMSE (Root Mean Squared Error)
from sklearn.metrics import mean_squared_error
import numpy as np
# Predictions
y_pred = model.predict(df)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(df['sale_price'], y_pred))
print(f"RMSE: ${rmse:.2f}")

Execution Result
RMSE: $0.00
Checking Multicollinearity (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF
X = df[['retail_price', 'cost', 'num_of_item']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF (Multicollinearity suspected if > 10):")
print(vif_data)

Execution Result
VIF (Multicollinearity suspected if > 10):
Variable VIF
0 const 5.081528
1 retail_price 29.215176
2 cost 29.215126
3 num_of_item 1.000010

Quiz
Problem
Create a regression model to predict sale price using retail price and cost, and interpret the influence of each variable.
View Answer
from statsmodels.formula.api import ols
# Multiple regression model
model = ols('sale_price ~ retail_price + cost', data=df).fit()
print(model.summary())
print("\n=== Interpretation ===")
print(f"R²: {model.rsquared:.4f} (Model explains {model.rsquared*100:.1f}% of variance)")
print(f"\nCoefficients:")
print(f"- Retail price $1 increase → Sale price ${model.params['retail_price']:.4f} increase")
print(f"- Cost $1 increase → Sale price ${model.params['cost']:.4f} increase")
# Check p-values
for var in ['retail_price', 'cost']:
    p = model.pvalues[var]
    sig = "Significant" if p < 0.05 else "Not significant"
    print(f"- {var}: p={p:.4f} ({sig})")

Execution Result
OLS Regression Results
==============================================================================
Dep. Variable: sale_price R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 3.002e+34
Date: Sat, 20 Dec 2025 Prob (F-statistic): 0.00
Time: 00:24:24 Log-Likelihood: 5.1341e+06
No. Observations: 181026 AIC: -1.027e+07
Df Residuals: 181023 BIC: -1.027e+07
Df Model: 2
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -3.617e-15 3.75e-16 -9.655 0.000 -4.35e-15 -2.88e-15
retail_price 1.0000 2.21e-17 4.53e+16 0.000 1.000 1.000
cost 9.236e-16 4.77e-17 19.345 0.000 8.3e-16 1.02e-15
==============================================================================
Omnibus: 179421.631 Durbin-Watson: 1.178
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13824323.079
Skew: -4.786 Prob(JB): 0.00
Kurtosis: 44.727 Cond. No. 136.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Interpretation ===
R²: 1.0000 (Model explains 100.0% of variance)
Coefficients:
- Retail price $1 increase → Sale price $1.0000 increase
- Cost $1 increase → Sale price $0.0000 increase
- retail_price: p=0.0000 (Significant)
- cost: p=0.0000 (Significant)

Next Steps
Learn how to verify causal relationships through experiments in A/B Testing.