03. Correlation and Regression Analysis
1. Overview and Scenario
Situation: “How upset do customers get when delivery is late?” Intuitively we know “very upset,” but in business, we need numbers.
“For every 1 day delay in delivery, customer satisfaction (CSAT) drops by an average of 0.5 points.”
To say this, we need Regression Analysis. Let’s express the relationship between two variables as a number (coefficient).
2. Data Preparation
Join cs_tickets_dummy and survey_cs_dummy to examine the relationship between response time and satisfaction score.
BigQuery (SQL)
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
# ... BigQuery client setup3. Correlation Analysis
First, let’s see if the two variables are related.
❓ Problem 1: Correlation Coefficient between First Response Time and CSAT
Q. Calculate the Pearson correlation coefficient between the ticket’s first_response_time (in hours) and the survey’s csat_score.
BigQuery + Python
Hint: Use TIMESTAMP_DIFF to calculate time, join the data, then compute in Python.
(You could use BigQuery’s CORR() function, but we’ll fetch it for visualization.)
View Solution
# 1. Extract data
query = """
SELECT
TIMESTAMP_DIFF(t.first_response_at, t.opened_at, HOUR) as response_hours,
s.csat_score
FROM `your-project-id.retail_analytics_us.cs_tickets_dummy` t
JOIN `your-project-id.retail_analytics_us.survey_cs_dummy` s ON t.ticket_id = s.related_ticket_id
WHERE t.status = 'solved'
AND s.csat_score IS NOT NULL
"""
df = client.query(query).to_dataframe()
# 2. Calculate correlation coefficient
corr, p_val = stats.pearsonr(df['response_hours'], df['csat_score'])
print(f"Correlation coefficient (r): {corr:.4f}")
print(f"P-value: {p_val:.4f}")
# 3. Visualization
sns.scatterplot(x=df['response_hours'], y=df['csat_score'], alpha=0.1)Error: name 'client' is not defined
💡 Interpretation Guide
- r value range: -1 to 1
- Closer to -1: As one increases, the other decreases (negative correlation)
- Closer to 0: No relationship
- Generally, indicates a noticeable relationship, indicates a strong relationship.
4. Simple Linear Regression
Correlation only tells us “there’s a relationship,” but regression tells us “how much influence.” (CSAT = intercept + coefficient × response time)
❓ Problem 2: Fit a Regression Model
Q. Use statsmodels to create a linear regression model with csat_score as the dependent variable () and response_hours as the independent variable ().
Python (Common)
View Solution
# Add constant term (intercept) to X (explanatory variable)
X = sm.add_constant(df['response_hours'])
y = df['csat_score']
# Fit model
model = sm.OLS(y, X).fit()
# Summary of results
print(model.summary())Error: name 'df' is not defined
💡 Result Interpretation (OLS Summary)
- Coef (Coefficient): Look at the coef for
response_hours.- Example: If it’s -0.05 → “For every 1 hour delay in response, satisfaction drops by 0.05 points.”
- P>|t|: Significance of the coefficient. Must be less than 0.05 to be reliable.
- R-squared: Explanatory power. How much of the data variation does this model explain (0~1).
5. Advanced: Multiple Regression (Preview)
Reality isn’t explained by just one variable.
Satisfaction is influenced not only by response_hours but also by priority, issue_type, and more.
Including all of these is Multiple Regression Analysis.
(This is covered in more detail in Project 1’s machine learning section.)
💡 Summary
- Correlation Analysis: Strength and direction of relationship ()
- Regression Analysis: Estimate causality and make predictions ()
- Data Analyst’s Weapon: You should be able to say “When X changes by 1 unit, Y changes by this amount” rather than “they’re just related.”
Project 2 Complete! Congratulations! Now you can explore data and verify whether the results are statistically valid. In the next phase, we’ll move on to predicting the future (AI/ML) using all this data.