03. Correlation and Regression Analysis

Intermediate2 hours

1. Overview and Scenario

Situation: “How upset do customers get when delivery is late?” Intuitively we know “very upset,” but in business, we need numbers.

“For every 1 day delay in delivery, customer satisfaction (CSAT) drops by an average of 0.5 points.”

To say this, we need Regression Analysis. Let’s express the relationship between two variables as a number (coefficient).

2. Data Preparation

Join cs_tickets_dummy and survey_cs_dummy to examine the relationship between response time and satisfaction score.

BigQuery (SQL)


import seaborn as sns
from scipy import stats
import statsmodels.api as sm
 
# ... BigQuery client setup

3. Correlation Analysis

First, let’s see if the two variables are related.

❓ Problem 1: Correlation Coefficient between First Response Time and CSAT

Q. Calculate the Pearson correlation coefficient between the ticket’s first_response_time (in hours) and the survey’s csat_score.

BigQuery + Python

💡

Hint: Use TIMESTAMP_DIFF to calculate time, join the data, then compute in Python. (You could use BigQuery’s CORR() function, but we’ll fetch it for visualization.)

View Solution


# 1. Extract data
query = """
SELECT
    TIMESTAMP_DIFF(t.first_response_at, t.opened_at, HOUR) as response_hours,
    s.csat_score
FROM `your-project-id.retail_analytics_us.cs_tickets_dummy` t
JOIN `your-project-id.retail_analytics_us.survey_cs_dummy` s ON t.ticket_id = s.related_ticket_id
WHERE t.status = 'solved'
AND s.csat_score IS NOT NULL
"""
df = client.query(query).to_dataframe()
 
# 2. Calculate correlation coefficient
corr, p_val = stats.pearsonr(df['response_hours'], df['csat_score'])
 
print(f"Correlation coefficient (r): {corr:.4f}")
print(f"P-value: {p_val:.4f}")
 
# 3. Visualization
sns.scatterplot(x=df['response_hours'], y=df['csat_score'], alpha=0.1)

실행 결과

Error: name 'client' is not defined

💡 Interpretation Guide

r value range: -1 to 1
Closer to -1: As one increases, the other decreases (negative correlation)
Closer to 0: No relationship
Generally, $|r| > 0.3$ indicates a noticeable relationship, $> 0.7$ indicates a strong relationship.

4. Simple Linear Regression

Correlation only tells us “there’s a relationship,” but regression tells us “how much influence.” $y = \beta_0 + \beta_1 x$ (CSAT = intercept + coefficient × response time)

❓ Problem 2: Fit a Regression Model

Q. Use statsmodels to create a linear regression model with csat_score as the dependent variable ( $y$ ) and response_hours as the independent variable ( $x$ ).

Python (Common)

View Solution


# Add constant term (intercept) to X (explanatory variable)
X = sm.add_constant(df['response_hours'])
y = df['csat_score']
 
# Fit model
model = sm.OLS(y, X).fit()
 
# Summary of results
print(model.summary())

실행 결과

Error: name 'df' is not defined

💡 Result Interpretation (OLS Summary)

Coef (Coefficient): Look at the coef for response_hours.
- Example: If it’s -0.05 → “For every 1 hour delay in response, satisfaction drops by 0.05 points.”
P>|t|: Significance of the coefficient. Must be less than 0.05 to be reliable.
R-squared: Explanatory power. How much of the data variation does this model explain (0~1).

5. Advanced: Multiple Regression (Preview)

Reality isn’t explained by just one variable. Satisfaction is influenced not only by response_hours but also by priority, issue_type, and more. Including all of these is Multiple Regression Analysis.

(This is covered in more detail in Project 1’s machine learning section.)

💡 Summary

Correlation Analysis: Strength and direction of relationship ( $r$ )
Regression Analysis: Estimate causality and make predictions ( $y = ax + b$ )
Data Analyst’s Weapon: You should be able to say “When X changes by 1 unit, Y changes by this amount” rather than “they’re just related.”

Project 2 Complete! Congratulations! Now you can explore data and verify whether the results are statistically valid. In the next phase, we’ll move on to predicting the future (AI/ML) using all this data.

03. Correlation and Regression Analysis

1. Overview and Scenario

2. Data Preparation

BigQuery (SQL)

Pandas (CSV)

3. Correlation Analysis

❓ Problem 1: Correlation Coefficient between First Response Time and CSAT

BigQuery + Python

View Solution

Pandas

View Solution

💡 Interpretation Guide

4. Simple Linear Regression

❓ Problem 2: Fit a Regression Model

Python (Common)

View Solution

💡 Result Interpretation (OLS Summary)

5. Advanced: Multiple Regression (Preview)

💡 Summary