Correlation Analysis
Beginner / Intermediate
Learning Objectives
- Understand the Pearson and Spearman correlation coefficients
- Distinguish between correlation and causation
- Visualize correlation matrices
0. Setup
Load the CSV files used for practice.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')
1. Pearson Correlation Coefficient
Theory
The Pearson correlation coefficient (r) measures the strength of the linear relationship between two continuous variables.
- r = 1: Perfect positive correlation
- r = 0: No linear correlation (a nonlinear relationship may still exist)
- r = -1: Perfect negative correlation
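To make the definition concrete, the following is a minimal sketch that computes r by hand as the covariance of the two variables divided by the product of their standard deviations, then cross-checks it against `np.corrcoef`. The arrays `x` and `y` are hypothetical toy data, not values from the tutorial's CSV files.

```python
import numpy as np

# Hypothetical toy data: y is roughly linear in x, with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Deviations of each value from its variable's mean.
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Both values should agree up to floating-point error, since `np.corrcoef` evaluates the same formula.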
Calculation
import pandas as pd
from scipy import stats
# Correlation coefficient between two variables
r, p_value = stats.pearsonr(df['retail_price'], df['sale_price'])
print(f"Correlation coefficient: {r:.3f}")
print(f"p-value: {p_value:.4f}")
# Correlation interpretation
if abs(r) >= 0.7:
    print("→ Strong correlation")
elif abs(r) >= 0.4:
    print("→ Moderate correlation")
else:
    print("→ Weak correlation")
Output
Correlation coefficient: 1.000
p-value: 0.0000
→ Strong correlation
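The learning objectives also mention the Spearman coefficient, which measures monotonic rather than strictly linear association. As a rough sketch, it can be computed as the Pearson coefficient of the ranks. The data below is hypothetical (an exponential curve chosen to be monotonic but clearly nonlinear) and is not taken from the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

# Pearson measures linear association, so the curvature pulls it below 1.
pearson_r = x.corr(y)  # method="pearson" is the default

# Spearman is the Pearson coefficient computed on the ranks of the values.
spearman_rho = x.rank().corr(y.rank())

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

On real data, `scipy.stats.spearmanr` returns the same coefficient together with a p-value, in the same style as the `stats.pearsonr` call above.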
2. Correlation Matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Correlation matrix
numeric_cols = ['retail_price', 'cost', 'sale_price', 'num_of_item']
corr_matrix = df[numeric_cols].corr()
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', center=0, vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
Output
[Graph Displayed]
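A correlation matrix is symmetric, so every pair appears twice and the diagonal is always 1. When a heatmap has many columns, it can help to list each pair once, sorted by strength. The sketch below shows one possible approach; the DataFrame `demo` is synthetic stand-in data that only reuses some of the tutorial's column names for readability.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the tutorial's CSV files).
rng = np.random.default_rng(0)
cost = rng.uniform(5, 50, size=200)
demo = pd.DataFrame({
    "cost": cost,
    "retail_price": cost * 2 + rng.normal(0, 3, size=200),
    "num_of_item": rng.integers(1, 5, size=200),
})

corr = demo.corr()

# Boolean mask for the upper triangle; k=1 excludes the diagonal,
# so each pair of variables is kept exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to (row, column) pairs and sort by absolute correlation.
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs)
```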
3. Correlation vs Causation
⚠️ Caution: Correlation ≠ Causation
Just because two variables change together doesn’t mean one causes the other.
Example: Ice cream sales and drowning incidents are positively correlated, but neither causes the other. Both rise with a third variable (a confounder): hot weather.
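The effect of a confounder can be illustrated with a small simulation. In this sketch, all numbers are made up: temperature drives both ice cream sales and drownings, and there is no direct link between the two. The raw correlation is still high, but it should largely vanish once the linear effect of temperature is removed from both variables (a partial correlation). The helper `residuals` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Simulated data, not real measurements: temperature drives both outcomes,
# and ice cream sales have no direct effect on drownings.
temperature = rng.normal(25, 5, size=n)
ice_cream = 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlate what remains once temperature is accounted for.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:             {raw_r:.3f}")
print(f"controlling for temperature: {partial_r:.3f}")
```

This only works here because the confounder is known and measured. With real data, a correlation that survives such a check is still not proof of causation.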
Quiz
Problem
Analyze the correlation between retail price, cost, and sale price in the product data and visualize it as a heatmap.
Answer
# Calculate correlation
cols = ['retail_price', 'cost', 'sale_price']
corr = df[cols].corr()
print("Correlation Matrix:")
print(corr.round(3))
# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='RdYlBu_r', center=0,
            fmt='.3f', square=True, vmin=-1, vmax=1)
plt.title('Correlation Between Price Variables')
plt.tight_layout()
plt.show()
Output
Correlation Matrix:
              retail_price  ...  sale_price
retail_price         1.000  ...       1.000
cost                 0.983  ...       0.983
sale_price           1.000  ...       1.000
[3 rows x 3 columns]
[Graph Displayed]
Next Steps
Learn how to test statistical significance in Hypothesis Testing.