Correlation Analysis

Beginner / Intermediate

Learning Objectives

  • Understand Pearson/Spearman correlation coefficients
  • Distinguish between correlation and causation
  • Visualize correlation matrices

0. Setup

Load the practice CSV files and merge them into a single DataFrame for analysis.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')

# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id')
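The chained merge can be sanity-checked on a toy example. The sketch below is only an illustration: the three small frames and their values are hypothetical stand-ins for the CSV files, and the assignment of columns to files is an assumption. The `validate` argument of `DataFrame.merge` raises an error if a key is unexpectedly duplicated, which helps catch silent row multiplication before any correlation is computed.

```python
import pandas as pd

# Hypothetical stand-ins for the three CSV files (values are made up).
orders = pd.DataFrame({
    "order_id": [1, 2],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
items = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "sale_price": [9.0, 20.0, 9.0],
})
products = pd.DataFrame({
    "product_id": [10, 11],
    "retail_price": [9.0, 20.0],
    "cost": [4.0, 11.0],
})

# One order has many items; many items point to one product.
df = (
    orders.merge(items, on="order_id", validate="one_to_many")
          .merge(products, on="product_id", validate="many_to_one")
)

# The merged frame should have exactly one row per order item.
print(df.shape)
```

If the row count after merging exceeds the number of order items, a key is probably duplicated in one of the lookup tables.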

1. Pearson Correlation Coefficient

Theory

The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to 1.

  • r = 1: Perfect positive correlation
  • r = 0: No linear correlation (a nonlinear relationship may still exist)
  • r = -1: Perfect negative correlation
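To make the three cases concrete, here is a minimal sketch that computes r directly from its textbook definition (the covariance divided by the product of the standard deviations) and compares it with `scipy.stats.pearsonr`. The helper name `pearson_r` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson r from its definition: sum of co-deviations over the
    square root of the product of the summed squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Synthetic x values, symmetric around zero.
x = np.linspace(-1.0, 1.0, 101)

r_linear = pearson_r(x, 3 * x + 2)   # perfectly linear, increasing
r_negative = pearson_r(x, -x)        # perfectly linear, decreasing
r_parabola = pearson_r(x, x ** 2)    # fully dependent, but not linear

# Cross-check the hand-written version against SciPy.
r_scipy, _ = stats.pearsonr(x, 3 * x + 2)

print(f"linear:   {r_linear:.3f}")
print(f"negative: {r_negative:.3f}")
print(f"parabola: {r_parabola:.3f}")
```

The parabola case shows why r = 0 should be read as "no linear relationship": y is completely determined by x, yet r is essentially zero.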

Calculation

import pandas as pd
from scipy import stats

# Correlation coefficient between two variables
r, p_value = stats.pearsonr(df['retail_price'], df['sale_price'])
print(f"Correlation coefficient: {r:.3f}")
print(f"p-value: {p_value:.4f}")

# Correlation interpretation
if abs(r) >= 0.7:
    print("→ Strong correlation")
elif abs(r) >= 0.4:
    print("→ Moderate correlation")
else:
    print("→ Weak correlation")
Output
Correlation coefficient: 1.000
p-value: 0.0000
→ Strong correlation
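Spearman's rank correlation, named in the learning objectives, measures how monotonic a relationship is rather than how linear it is, because it is computed on ranks instead of raw values. The sketch below contrasts the two coefficients on synthetic data (an assumption; it does not use the practice dataset) where y always increases with x but along a curve.

```python
import numpy as np
from scipy import stats

# Synthetic data: y grows with x, but exponentially rather than linearly.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3.0)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman's rho reaches 1 here because the ordering of x and y is identical, while Pearson's r is pulled below 1 by the curvature. Spearman tends to be the safer choice for skewed data or data with outliers.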

2. Correlation Matrix

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix
numeric_cols = ['retail_price', 'cost', 'sale_price', 'num_of_item']
corr_matrix = df[numeric_cols].corr()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', center=0, vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
Output
[Graph Displayed]
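A heatmap is easy to scan, but with many columns it can help to list the variable pairs in order of strength. The following is a rough sketch of one way to do that; the `demo` frame is synthetic (its columns only borrow names from the dataset, and the generated relationships are assumptions), so the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: cost tracks retail_price closely,
# num_of_item is generated independently.
rng = np.random.default_rng(0)
n = 500
retail_price = rng.uniform(10, 100, n)
demo = pd.DataFrame({
    "retail_price": retail_price,
    "cost": retail_price * 0.5 + rng.normal(0, 2, n),
    "num_of_item": rng.integers(1, 5, n),
})

corr = demo.corr()  # Pearson by default; method="spearman" is also available

# Keep only the upper triangle (k=1 excludes the diagonal)
# so each pair of variables appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=np.abs, ascending=False)
)

print(pairs.round(3))
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.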

3. Correlation vs Causation

⚠️ Caution

Correlation ≠ Causation

Just because two variables change together doesn’t mean one causes the other.

Example: Ice cream sales and drowning incidents are positively correlated, but neither causes the other. Both rise with a third variable: “hot weather”.
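The ice cream example can be imitated with a small simulation. This is an illustrative sketch with made-up numbers: temperature drives both series, and the two series have no direct link to each other. Removing the effect of temperature from each series (by correlating the residuals of a simple linear fit, a form of partial correlation) should make the apparent relationship disappear.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical standardized data: temperature is the hidden common cause.
temperature = rng.normal(0, 1, n)
ice_cream = 2.0 * temperature + rng.normal(0, 1, n)
drownings = 1.0 * temperature + rng.normal(0, 1, n)

# Raw correlation between the two outcomes.
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:         {raw_r:.3f}")
print(f"controlling for weather: {partial_r:.3f}")
```

In real data the confounder is often unknown or unmeasured, so a vanishing partial correlation can confirm a suspected common cause, but a persistent one does not by itself prove causation.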


Quiz

Problem

Analyze the correlation between retail price, cost, and sale price in the product data and visualize it as a heatmap.

Answer

# Calculate correlation
cols = ['retail_price', 'cost', 'sale_price']
corr = df[cols].corr()
print("Correlation Matrix:")
print(corr.round(3))

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='RdYlBu_r', center=0, fmt='.3f', square=True, vmin=-1, vmax=1)
plt.title('Correlation Between Price Variables')
plt.tight_layout()
plt.show()
Output
Correlation Matrix:
            retail_price  ...  sale_price
retail_price         1.000  ...       1.000
cost                 0.983  ...       0.983
sale_price           1.000  ...       1.000

[3 rows x 3 columns]
[Graph Displayed]

Next Steps

Learn how to test statistical significance in Hypothesis Testing.
