03. Correlation and Regression Analysis

Intermediate · 2 hours

1. Overview and Scenario

Situation: “How upset do customers get when delivery is late?” Intuitively we know “very upset,” but in business, we need numbers.

“For every 1 day delay in delivery, customer satisfaction (CSAT) drops by an average of 0.5 points.”

To make a statement like this, we need regression analysis, which expresses the relationship between two variables as a number (a coefficient).


2. Data Preparation

Join cs_tickets_dummy and survey_cs_dummy to examine the relationship between response time and satisfaction score.

import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# ... BigQuery client setup

3. Correlation Analysis

First, let’s see if the two variables are related.

❓ Problem 1: Correlation Coefficient between First Response Time and CSAT

Q. Calculate the Pearson correlation coefficient between the ticket’s first_response_time (in hours) and the survey’s csat_score.

💡 Hint: Use TIMESTAMP_DIFF to compute the response time, join the two tables, then compute the correlation in Python. (BigQuery’s CORR() function would also work, but we fetch the rows so we can visualize them.)

Solution

# `client` is the BigQuery client created in the setup step of section 2.

# 1. Extract data
query = """
SELECT
  TIMESTAMP_DIFF(t.first_response_at, t.opened_at, HOUR) AS response_hours,
  s.csat_score
FROM `your-project-id.retail_analytics_us.cs_tickets_dummy` t
JOIN `your-project-id.retail_analytics_us.survey_cs_dummy` s
  ON t.ticket_id = s.related_ticket_id
WHERE t.status = 'solved'
  AND s.csat_score IS NOT NULL
"""
df = client.query(query).to_dataframe()

# Drop incomplete rows and cast to float so SciPy and statsmodels get plain numeric arrays
df = df.dropna().astype(float)

# 2. Calculate correlation coefficient
corr, p_val = stats.pearsonr(df['response_hours'], df['csat_score'])
print(f"Correlation coefficient (r): {corr:.4f}")
print(f"P-value: {p_val:.4f}")

# 3. Visualization (low alpha keeps overlapping points readable)
sns.scatterplot(data=df, x='response_hours', y='csat_score', alpha=0.1)

💡 Interpretation Guide

  • r value range: -1 to 1
  • Closer to 1: As one increases, the other increases (positive correlation)
  • Closer to -1: As one increases, the other decreases (negative correlation)
  • Closer to 0: No linear relationship
  • As a rule of thumb, $|r| > 0.3$ indicates a noticeable relationship and $|r| > 0.7$ indicates a strong relationship.
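To make the guide concrete, here is a minimal sketch that computes Pearson's r by hand and with SciPy. The data is synthetic (generated with an assumed slope of -0.05 and random noise), not the project tables, and `describe_strength` is a hypothetical helper that only encodes the rule-of-thumb thresholds from the guide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for response_hours and csat_score (assumed values, not real data)
response_hours = rng.uniform(0, 48, size=500)
noise = rng.normal(0, 0.8, size=500)
csat_score = np.clip(4.5 - 0.05 * response_hours + noise, 1, 5)

# Pearson r by definition: co-variation divided by the product of the spreads
x_dev = response_hours - response_hours.mean()
y_dev = csat_score - csat_score.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Same quantity from SciPy, which also returns a p-value
r_scipy, p_value = stats.pearsonr(response_hours, csat_score)


def describe_strength(r):
    """Hypothetical helper: label |r| with the rule-of-thumb thresholds."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size > 0.3:
        return "noticeable"
    return "weak"


direction = "negative" if r_scipy < 0 else "positive"
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.4g}")
print(f"strength: {describe_strength(r_scipy)}, direction: {direction}")
```

The sign of r gives the direction and its absolute value gives the strength, which is why the thresholds are stated in terms of $|r|$.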

4. Simple Linear Regression

Correlation only tells us how strongly two variables move together; regression tells us how much $y$ changes when $x$ changes by one unit. The model is $y = \beta_0 + \beta_1 x$ (CSAT = intercept + coefficient × response time).
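As a rough illustration of where the two numbers in the line equation come from, the sketch below estimates the slope and intercept with the standard least-squares formulas (slope = covariance of x and y divided by variance of x) and checks them against `np.polyfit`. The data is synthetic, generated with an assumed intercept of 4.5 and slope of -0.05.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): true intercept 4.5, true slope -0.05, plus noise
x = rng.uniform(0, 48, size=200)
y = 4.5 - 0.05 * x + rng.normal(0, 0.5, size=200)

# Least-squares estimates for a single explanatory variable
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
beta_0 = y.mean() - beta_1 * x.mean()                     # intercept

# Cross-check against NumPy's degree-1 polynomial fit
slope_check, intercept_check = np.polyfit(x, y, deg=1)

# Use the fitted line to predict CSAT for a 10-hour response time
predicted_csat = beta_0 + beta_1 * 10

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.4f}")
print(f"predicted CSAT at 10 hours = {predicted_csat:.2f}")
```

The estimated slope should land close to the value used to generate the data, which is the sense in which the coefficient measures "change in y per unit of x".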

❓ Problem 2: Fit a Regression Model

Q. Use statsmodels to create a linear regression model with csat_score as the dependent variable ($y$) and response_hours as the independent variable ($x$).

Solution

# Uses the `df` DataFrame loaded in Problem 1.

# Add constant term (intercept) to X (explanatory variable)
X = sm.add_constant(df['response_hours'])
y = df['csat_score']

# Fit model
model = sm.OLS(y, X).fit()

# Summary of results
print(model.summary())

💡 Result Interpretation (OLS Summary)

  1. Coef (Coefficient): Look at the coef for response_hours.
    • Example: If it’s -0.05 → “For every 1 hour delay in response, satisfaction drops by 0.05 points on average.”
  2. P>|t|: The p-value for the test that the coefficient is zero. A value below 0.05 is the conventional threshold for calling the coefficient statistically significant.
  3. R-squared: Explanatory power. The share of the variation in csat_score that the model explains (0 to 1).

5. Advanced: Multiple Regression (Preview)

Reality isn’t explained by just one variable. Satisfaction is influenced not only by response_hours but also by priority, issue_type, and more. Including all of these is Multiple Regression Analysis.
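As a preview, here is a minimal sketch of a multiple regression using the statsmodels formula API. Everything about the data is an assumption: the rows are synthetic, and the category values for priority and issue_type are hypothetical placeholders rather than the values in the project tables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400

# Synthetic data with hypothetical category values (not the project tables)
df_multi = pd.DataFrame({
    "response_hours": rng.uniform(0, 48, size=n),
    "priority": rng.choice(["low", "medium", "high"], size=n),
    "issue_type": rng.choice(["billing", "delivery", "product"], size=n),
})
priority_effect = df_multi["priority"].map({"low": 0.0, "medium": -0.2, "high": -0.5})
df_multi["csat_score"] = (
    4.5
    - 0.05 * df_multi["response_hours"]
    + priority_effect
    + rng.normal(0, 0.5, size=n)
)

# C(...) marks a column as categorical; statsmodels builds dummy variables for it
multi_model = smf.ols(
    "csat_score ~ response_hours + C(priority) + C(issue_type)",
    data=df_multi,
).fit()

print(multi_model.params)
```

Each coefficient is now read as the effect of that variable with the others held fixed, so the response_hours coefficient is no longer mixed up with differences between priorities or issue types.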

(This is covered in more detail in Project 1’s machine learning section.)


💡 Summary

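  • Correlation Analysis: Strength and direction of a relationship ($r$)
  • Regression Analysis: Quantify how much $y$ changes per unit of $x$ and make predictions ($y = \beta_0 + \beta_1 x$). A regression coefficient by itself does not prove causality.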
  • Data Analyst’s Weapon: You should be able to say “When X changes by 1 unit, Y changes by this amount” rather than “they’re just related.”

Project 2 Complete! Congratulations! Now you can explore data and verify whether the results are statistically valid. In the next phase, we’ll move on to predicting the future (AI/ML) using all this data.
