Hypothesis Testing
0. Setup
Practice hypothesis testing using various datasets.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import (
ttest_ind, ttest_rel, ttest_1samp,
mannwhitneyu, wilcoxon, kruskal,
chi2_contingency, fisher_exact,
f_oneway, shapiro, levene, bartlett,
spearmanr, pearsonr, kendalltau,
kstest, normaltest, anderson
)
import warnings
warnings.filterwarnings('ignore')
# 1. Titanic Dataset
titanic = sns.load_dataset('titanic')
print(f"Titanic: {titanic.shape}")
# 2. Iris Dataset
iris = sns.load_dataset('iris')
print(f"Iris: {iris.shape}")
# 3. Tips Dataset
tips = sns.load_dataset('tips')
print(f"Tips: {tips.shape}")
# 4. Diamonds Dataset (sampled)
diamonds = sns.load_dataset('diamonds').sample(n=1000, random_state=42)
print(f"Diamonds: {diamonds.shape}")Titanic: (891, 15) Iris: (150, 5) Tips: (244, 7) Diamonds: (1000, 10)
1. Fundamentals of Hypothesis Testing
Key Concepts
| Term | Description |
|---|---|
| Null Hypothesis (Hā) | āThere is no difference/effectā - Default assumption |
| Alternative Hypothesis (Hā) | āThere is a difference/effectā - The claim we seek evidence for |
| p-value | Probability of a result at least as extreme as the observed one, assuming Hā is true |
| Significance Level (α) | Threshold for rejecting Hā; usually 0.05 (5%) |
| Power | Probability of detecting an effect when it truly exists |
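Both the p-value and power rows can be made concrete with a tiny simulation. The sketch below uses synthetic data (not the datasets above): under Hā roughly α of tests come out significant (the Type I error rate), and under a true effect the significant fraction estimates power.
# Simulation sketch: Type I error rate under H0, power under H1 (synthetic data)
rng = np.random.default_rng(0)
n_sims, n = 2000, 50
# Under H0 (both groups share the same mean), ~5% of p-values fall below 0.05
p_null = [ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue for _ in range(n_sims)]
print(f"False positive rate under H0: {np.mean(np.array(p_null) < 0.05):.3f}")
# Under H1 (true difference of 0.5 SD), the significant fraction estimates power
p_alt = [ttest_ind(rng.normal(0.5, 1, n), rng.normal(0, 1, n)).pvalue for _ in range(n_sims)]
print(f"Estimated power (d=0.5, n=50 per group): {np.mean(np.array(p_alt) < 0.05):.3f}")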
Test Selection Guide
Data Type?
āāā Continuous
ā āāā Normal distribution ā Parametric tests (t-test, ANOVA)
ā āāā Non-normal distribution ā Non-parametric tests (Mann-Whitney, Kruskal-Wallis)
ā
āāā Categorical
āāā 2Ć2 table (small sample) ā Fisher's Exact Test
āāā Otherwise ā Chi-Square Test
2. Normality Tests
When to use?
Use before performing parametric tests like t-tests or ANOVA to check if your data follows a normal distribution. If normality is not satisfied, use non-parametric tests (Mann-Whitney, Kruskal-Wallis, etc.).
Shapiro-Wilk Test
Example Use Cases
- Check if blood pressure measurements in a clinical trial follow normal distribution
- Verify distribution of user session duration before A/B testing
- Quality control: Check if product weight data is normally distributed
Characteristics: Generally the most powerful normality test. Recommended for n < 5000.
# Titanic: Normality test for age distribution
ages = titanic['age'].dropna()
stat, p_value = shapiro(ages)
print("=== Shapiro-Wilk Normality Test ===")
print(f"Data: Titanic passenger ages (n={len(ages)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(ages, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
stats.probplot(ages, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')
plt.tight_layout()
plt.show()
=== Shapiro-Wilk Normality Test ===
Data: Titanic passenger ages (n=714)
Statistic: 0.9816
p-value: 0.0000
Conclusion: Does not follow normal distribution
DāAgostino-Pearson Test
Example Use Cases
- Check if financial return data follows normal distribution (when skewness/kurtosis matters)
- Test distribution shape of survey scores
- Check asymmetry of exam score distribution
Characteristics: Considers both skewness and kurtosis. Useful when distribution shape matters. Can be used when n >= 20.
# Iris: Normality test for petal length
petal_length = iris['petal_length']
stat, p_value = normaltest(petal_length)
print("=== D'Agostino-Pearson Normality Test ===")
print(f"Data: Iris petal length (n={len(petal_length)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")=== D'Agostino-Pearson Normality Test === Data: Iris petal length (n=150) Statistic: 31.5324 p-value: 0.0000 Conclusion: Does not follow normal distribution
Kolmogorov-Smirnov Test
Example Use Cases
- Normality testing for large datasets (n > 5000)
- Compare with other theoretical distributions (exponential, uniform, etc.) besides normal
- Compare if two samples have identical distributions (2-sample KS test)
Characteristics: Less sensitive to sample size, suitable for large datasets. Can compare with various distributions. Caveat: estimating the mean and standard deviation from the sample itself (as done below) makes the standard KS p-value only approximate; the Lilliefors correction exists for exactly this case.
# Tips: Normality test for tip amounts
tip_values = tips['tip']
# Test after standardization
tip_standardized = (tip_values - tip_values.mean()) / tip_values.std()
stat, p_value = kstest(tip_standardized, 'norm')
print("=== Kolmogorov-Smirnov Normality Test ===")
print(f"Data: Tips tip amounts (n={len(tip_values)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")=== Kolmogorov-Smirnov Normality Test === Data: Tips tip amounts (n=244) Statistic: 0.0975 p-value: 0.0186 Conclusion: Does not follow normal distribution
Anderson-Darling Test
Example Use Cases
- Normality check in analyses where extreme values (outliers) are important (risk management, insurance, etc.)
- Check how much the distribution tails differ from normal distribution
- When decisions are needed at multiple significance levels simultaneously
Characteristics: More sensitive to distribution tails. Preferred in finance/insurance where extreme values matter.
# Diamonds: Normality test for prices
prices = diamonds['price']
result = anderson(prices, dist='norm')
print("=== Anderson-Darling Normality Test ===")
print(f"Data: Diamonds prices (n={len(prices)})")
print(f"Statistic: {result.statistic:.4f}")
print("\nCritical values by significance level:")
for cv, sl in zip(result.critical_values, result.significance_level):
    result_str = "Reject" if result.statistic > cv else "Fail to reject"
    print(f" {sl}%: Critical value = {cv:.4f} ā Hā {result_str}")
=== Anderson-Darling Normality Test ===
Data: Diamonds prices (n=1000)
Statistic: 47.8932
Critical values by significance level:
 15.0%: Critical value = 0.5740 ā Hā Reject
 10.0%: Critical value = 0.6540 ā Hā Reject
 5.0%: Critical value = 0.7850 ā Hā Reject
 2.5%: Critical value = 0.9150 ā Hā Reject
 1.0%: Critical value = 1.0890 ā Hā Reject
3. Homogeneity of Variance Tests
When to use?
Use before performing independent samples t-tests or ANOVA to check if the groups being compared have equal variances. If the equal variance assumption is violated, use Welchās t-test or Games-Howell post-hoc test.
Leveneās Test
Example Use Cases
- Check if purchase amount variance is equal between treatment and control groups in A/B testing
- Test if exam score variance is equal between male and female groups
- Check if quality dispersion is equal across products from multiple factories
Characteristics: Robust as it doesnāt require normality assumption. Can be used with non-normal data.
# Titanic: Compare age variance by survival status
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()
stat, p_value = levene(survived_ages, died_ages)
print("=== Levene's Test for Equal Variances ===")
print(f"Survived age variance: {survived_ages.var():.2f}")
print(f"Died age variance: {died_ages.var():.2f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")=== Levene's Test for Equal Variances === Survived age variance: 207.03 Died age variance: 199.41 Statistic: 0.1557 p-value: 0.6933 Conclusion: Equal variance assumption satisfied
Bartlettās Test
Example Use Cases
- Compare variance between groups in normally distributed data
- Compare response variability across multiple dose groups in clinical trials
- Test variance equality across multiple production lines in quality control
Characteristics: Most powerful equal variance test when data follows normal distribution. Use Leveneās if normality is violated.
# Iris: Compare sepal length variance by species
setosa = iris[iris['species'] == 'setosa']['sepal_length']
versicolor = iris[iris['species'] == 'versicolor']['sepal_length']
virginica = iris[iris['species'] == 'virginica']['sepal_length']
stat, p_value = bartlett(setosa, versicolor, virginica)
print("=== Bartlett's Test for Equal Variances ===")
print(f"Setosa variance: {setosa.var():.4f}")
print(f"Versicolor variance: {versicolor.var():.4f}")
print(f"Virginica variance: {virginica.var():.4f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")=== Bartlett's Test for Equal Variances === Setosa variance: 0.1242 Versicolor variance: 0.2664 Virginica variance: 0.4043 Statistic: 16.0057 p-value: 0.0003 Conclusion: Equal variance assumption not satisfied
4. T-Tests
One-Sample T-Test
Example Use Cases
- Quality Control: Verify if the mean lifespan of batteries produced at a factory equals the nominal lifespan of 1000 hours
- Marketing: Check if average customer satisfaction has reached the target of 4.0 points
- Education: Test if studentsā average scores differ from the national average of 75 points
- Service: Check if average response time is within the SLA standard of 3 seconds
Key Question: āIs our sample mean equal to/different from a specific reference value?ā
# Tips: Test if mean tip is $3
# Situation: Restaurant manager claims "Our average tip is $3". Verify if this is true.
tip_values = tips['tip']
hypothesized_mean = 3.0
stat, p_value = ttest_1samp(tip_values, hypothesized_mean)
print("=== One-Sample t-test ===")
print(f"Hā: Mean tip = ${hypothesized_mean:.2f}")
print(f"Hā: Mean tip ā ${hypothesized_mean:.2f}")
print(f"\nSample mean: ${tip_values.mean():.2f}")
print(f"Sample standard deviation: ${tip_values.std():.2f}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Mean tip differs from $3' if p_value < 0.05 else 'Mean tip can be considered equal to $3'}")=== One-Sample t-test === Hā: Mean tip = $3.00 Hā: Mean tip ā $3.00 Sample mean: $3.00 Sample standard deviation: $1.38 t-statistic: -0.0363 p-value: 0.9711 Conclusion: Mean tip can be considered equal to $3
Independent Samples T-Test
Example Use Cases
- A/B Testing: Verify if the new website design (B) has a higher conversion rate than the old design (A)
- Medicine: Compare blood pressure changes between drug treatment group and placebo group
- Education: Compare grade differences between online and offline classes
- HR: Analyze productivity differences between remote workers and office workers
- Marketing: Compare average purchase amounts between male and female customers
Key Question: āAre the means of two different groups equal/different?ā
Note: The two groups must be independent (same person cannot belong to both groups)
# Titanic: Compare fare by gender
# Situation: Analyze if there were differences in fares paid by males and females on the Titanic
male_fare = titanic[titanic['sex'] == 'male']['fare'].dropna()
female_fare = titanic[titanic['sex'] == 'female']['fare'].dropna()
# Test equal variance assumption
_, levene_p = levene(male_fare, female_fare)
equal_var = levene_p >= 0.05
# t-test (Welch's t-test if unequal variance)
stat, p_value = ttest_ind(male_fare, female_fare, equal_var=equal_var)
print("=== Independent Samples t-test ===")
print(f"Hā: Male mean fare = Female mean fare")
print(f"Hā: Male mean fare ā Female mean fare")
print(f"\nMale mean fare: ${male_fare.mean():.2f} (n={len(male_fare)})")
print(f"Female mean fare: ${female_fare.mean():.2f} (n={len(female_fare)})")
print(f"Difference: ${female_fare.mean() - male_fare.mean():.2f}")
print(f"\nEqual variance assumption: {'Satisfied (Student t-test)' if equal_var else 'Not satisfied (Welch t-test used)'}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference in fare by gender' if p_value < 0.05 else 'No difference in fare by gender'}")=== Independent Samples t-test === Hā: Male mean fare = Female mean fare Hā: Male mean fare ā Female mean fare Male mean fare: $25.52 (n=577) Female mean fare: $44.48 (n=314) Difference: $18.95 Equal variance assumption: Not satisfied (Welch t-test used) t-statistic: -4.7994 p-value: 0.0000 Conclusion: Significant difference in fare by gender
Paired Samples T-Test
Example Use Cases
- Diet Effect: Compare weight before and after diet for the same people
- Educational Effect: Compare test scores before and after class for the same students
- Drug Effect: Compare blood pressure before and after medication for the same patients
- UX Improvement: Compare task completion time when the same users use old version/new version app
- Marketing: Compare sales before and after promotion for the same stores
Key Question: āDid the before and after values of the same subjects change?ā
Difference from independent samples: Paired samples measure the same subject twice, so individual differences can be controlled, making it more sensitive to detecting changes
# Simulation: A/B test conversion rate (same users experience both UIs)
# Situation: Show 100 users old UI and new UI sequentially and measure click rates
np.random.seed(42)
n_users = 100
# Before: Old UI click rate
before = np.random.beta(2, 8, n_users) # Mean about 20%
# After: New UI click rate (slight improvement)
after = before + np.random.normal(0.05, 0.03, n_users)
after = np.clip(after, 0, 1)
stat, p_value = ttest_rel(before, after)
print("=== Paired Samples t-test ===")
print(f"Hā: No change in conversion rate (new UI has no effect)")
print(f"Hā: Change in conversion rate (new UI has effect)")
print(f"\nBefore (old UI) mean: {before.mean():.4f} ({before.mean()*100:.1f}%)")
print(f"After (new UI) mean: {after.mean():.4f} ({after.mean()*100:.1f}%)")
print(f"Mean difference: {(after - before).mean():.4f} (+{(after - before).mean()*100:.1f}%p)")
print(f"\nt-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'New UI significantly improved' if p_value < 0.05 else 'No significant change'}")=== Paired Samples t-test === Hā: No change in conversion rate (new UI has no effect) Hā: Change in conversion rate (new UI has effect) Before (old UI) mean: 0.2037 (20.4%) After (new UI) mean: 0.2503 (25.0%) Mean difference: 0.0466 (+4.7%p) t-statistic: -14.7221 p-value: 0.0000 Conclusion: New UI significantly improved
5. Non-parametric Tests
When to use?
- When data does not follow normal distribution
- When sample size is small (n < 30)
- When data is rank data or ordinal scale
- When there are many outliers (non-parametric tests are less sensitive to outliers)
Mann-Whitney U Test
Example Use Cases
- E-commerce: Compare order amount distribution between premium and regular members (amount data usually has a non-normal right-skewed distribution)
- Gaming: Compare play time between paying and non-paying users
- Medicine: Compare pain scale (1-10) between two treatments
- Satisfaction: Compare customer rating distributions between two products
Key Question: āAre the distributions (ranks) of two independent groups different?ā
Non-parametric alternative to independent samples t-test. Think of it as comparing median/rank rather than mean.
# Diamonds: Compare prices by cut quality (Ideal vs Good)
# Situation: Compare price distributions of diamonds with Ideal vs Good cut quality
# Price data generally doesn't follow normal distribution, so use non-parametric test
ideal_price = diamonds[diamonds['cut'] == 'Ideal']['price']
good_price = diamonds[diamonds['cut'] == 'Good']['price']
stat, p_value = mannwhitneyu(ideal_price, good_price, alternative='two-sided')
print("=== Mann-Whitney U Test ===")
print(f"Hā: Ideal and Good cut have the same price distribution")
print(f"Hā: Price distributions are different")
print(f"\nIdeal median: ${ideal_price.median():,.2f} (n={len(ideal_price)})")
print(f"Good median: ${good_price.median():,.2f} (n={len(good_price)})")
print(f"\nU-statistic: {stat:,.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Price distributions significantly different' if p_value < 0.05 else 'No difference in price distributions'}")=== Mann-Whitney U Test === Hā: Ideal and Good cut have the same price distribution Hā: Price distributions are different Ideal median: $1,810.00 (n=393) Good median: $3,086.50 (n=96) U-statistic: 14,985.00 p-value: 0.0031 Conclusion: Price distributions significantly different
Wilcoxon Signed-Rank Test
Example Use Cases
- Weight Loss: Compare weight before and after diet (when weight changes are not normally distributed)
- Survey: Compare satisfaction scores (1-5) before and after policy change for the same respondents
- Education: Compare confidence scores before and after special lecture for the same students
- App Ratings: Compare rating changes from the same users before and after update
Key Question: āDid the before and after value distribution of the same subjects change?ā
Non-parametric alternative to paired samples t-test. Uses sign and rank of differences.
# Tips: Compare lunch vs dinner tip rates (same waiters)
# Situation: Test if 50 waiters receive different tip rates at lunch vs dinner
np.random.seed(42)
n_waiters = 50
lunch_tip_rate = np.random.uniform(0.12, 0.22, n_waiters)
dinner_tip_rate = lunch_tip_rate + np.random.normal(0.02, 0.03, n_waiters)
stat, p_value = wilcoxon(lunch_tip_rate, dinner_tip_rate)
print("=== Wilcoxon Signed-Rank Test ===")
print(f"Hā: No difference between lunch and dinner tip rates")
print(f"Hā: Difference between lunch and dinner tip rates")
print(f"\nLunch tip rate median: {np.median(lunch_tip_rate)*100:.1f}%")
print(f"Dinner tip rate median: {np.median(dinner_tip_rate)*100:.1f}%")
print(f"\nW-statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Dinner tip rate significantly higher' if p_value < 0.05 else 'No difference'}")=== Wilcoxon Signed-Rank Test === Hā: No difference between lunch and dinner tip rates Hā: Difference between lunch and dinner tip rates Lunch tip rate median: 16.9% Dinner tip rate median: 19.1% W-statistic: 304.00 p-value: 0.0004 Conclusion: Dinner tip rate significantly higher
Kruskal-Wallis H Test
Example Use Cases
- Marketing: Compare brand awareness scores across 3 ad types (TV, Online, SNS)
- Products: Compare customer satisfaction across multiple smartphone brands
- Education: Compare student grade distributions across 3 schools
- Medicine: Compare recovery periods across multiple treatments (non-normal data)
Key Question: āAre the distributions of 3 or more groups all the same, or is at least one different?ā
Non-parametric alternative to One-Way ANOVA. Post-hoc tests needed to determine which groups differ.
# Iris: Compare petal width by species
# Situation: Test if petal width distributions differ among three iris species (setosa, versicolor, virginica)
setosa_pw = iris[iris['species'] == 'setosa']['petal_width']
versicolor_pw = iris[iris['species'] == 'versicolor']['petal_width']
virginica_pw = iris[iris['species'] == 'virginica']['petal_width']
stat, p_value = kruskal(setosa_pw, versicolor_pw, virginica_pw)
print("=== Kruskal-Wallis H Test ===")
print(f"Hā: All species have the same petal width distribution")
print(f"Hā: At least one species is different")
print(f"\nSetosa median: {setosa_pw.median():.2f}cm")
print(f"Versicolor median: {versicolor_pw.median():.2f}cm")
print(f"Virginica median: {virginica_pw.median():.2f}cm")
print(f"\nH-statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant difference among species' if p_value < 0.05 else 'No difference'}")=== Kruskal-Wallis H Test === Hā: All species have the same petal width distribution Hā: At least one species is different Setosa median: 0.20cm Versicolor median: 1.30cm Virginica median: 2.00cm H-statistic: 130.0111 p-value: 0.000000 Conclusion: Significant difference among species
6. Analysis of Variance (ANOVA)
One-Way ANOVA
Example Use Cases
- Marketing: Compare average purchase amount across 4 promotion types (discount, points, gifts, free shipping)
- Manufacturing: Compare average quality scores of products from 3 factories
- HR: Compare employee satisfaction across departments (Development, Marketing, Sales, Support)
- Education: Compare learning effects across multiple teaching methods
Key Question: āAre the means of 3 or more groups all the same?ā
Prerequisites: Normality, equal variance. Use Kruskal-Wallis if violated.
# Tips: Compare total bill amounts by day of week
# Situation: Analyze if customers' average payment amounts differ by day of the week
thur = tips[tips['day'] == 'Thur']['total_bill']
fri = tips[tips['day'] == 'Fri']['total_bill']
sat = tips[tips['day'] == 'Sat']['total_bill']
sun = tips[tips['day'] == 'Sun']['total_bill']
stat, p_value = f_oneway(thur, fri, sat, sun)
print("=== One-Way ANOVA ===")
print(f"Hā: Average payment amounts are the same for all days")
print(f"Hā: At least one day is different")
print(f"\nAverage payment by day:")
print(f" Thursday: ${thur.mean():.2f} (n={len(thur)})")
print(f" Friday: ${fri.mean():.2f} (n={len(fri)})")
print(f" Saturday: ${sat.mean():.2f} (n={len(sat)})")
print(f" Sunday: ${sun.mean():.2f} (n={len(sun)})")
print(f"\nF-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference by day ā Post-hoc test needed' if p_value < 0.05 else 'No difference by day'}")=== One-Way ANOVA === Hā: Average payment amounts are the same for all days Hā: At least one day is different Average payment by day: Thursday: $17.68 (n=62) Friday: $17.15 (n=19) Saturday: $20.44 (n=87) Sunday: $21.41 (n=76) F-statistic: 2.7675 p-value: 0.0424 Conclusion: Significant difference by day ā Post-hoc test needed
Two-Way ANOVA
Example Use Cases
- Marketing: Analyze the effect of ad channel (TV/Online) and target age group (20s/30s/40s) on purchase intent
- Manufacturing: Effect of machine type and operator skill level on production volume
- Education: Effect of teaching method and class size on learning outcomes
- Medicine: Effect of drug type and dosage on treatment efficacy
Key Questions:
- Is there a main effect of Factor A?
- Is there a main effect of Factor B?
- Is there an interaction effect between A and B? (e.g., Is there an effect only in specific combinations?)
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Tips: Effect of gender and smoking status on tips
# Situation: Analyze if tip amounts differ by gender and smoking status, and if there's a combination effect
model = ols('tip ~ C(sex) + C(smoker) + C(sex):C(smoker)', data=tips).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("=== Two-Way ANOVA ===")
print(f"Dependent Variable: Tip amount")
print(f"Factor 1: Gender (sex)")
print(f"Factor 2: Smoking status (smoker)")
print(f"\n{anova_table.round(4)}")
print("\nInterpretation:")
for idx, row in anova_table.iterrows():
    if idx != 'Residual':
        sig = "Significant ***" if row['PR(>F)'] < 0.001 else \
              "Significant **" if row['PR(>F)'] < 0.01 else \
              "Significant *" if row['PR(>F)'] < 0.05 else "Not significant"
        print(f" {idx}: p = {row['PR(>F)']:.4f} ā {sig}")
=== Two-Way ANOVA ===
Dependent Variable: Tip amount
Factor 1: Gender (sex)
Factor 2: Smoking status (smoker)
sum_sq df F PR(>F)
C(sex) 1.0554 1.0 0.5596 0.4551
C(smoker) 0.1477 1.0 0.0783 0.7799
C(sex):C(smoker) 0.2077 1.0 0.1101 0.7404
Residual 452.5604 240.0 NaN NaN
Interpretation:
C(sex): p = 0.4551 ā Not significant
C(smoker): p = 0.7799 ā Not significant
C(sex):C(smoker): p = 0.7404 ā Not significant
7. Chi-Square Tests
Test of Independence
Example Use Cases
- Marketing: Analyze if age group (20s/30s/40s) and preferred brand (A/B/C) are associated
- Medicine: Test if smoking status and lung cancer occurrence are associated
- Education: Analyze if gender and major selection are associated
- HR: Analyze if education level and turnover status are associated
- Elections: Analyze if region and party support are associated
Key Question: āAre two categorical variables independent or associated?ā
# Titanic: Relationship between gender and survival
# Situation: Analyze if survival rates differed by gender during the Titanic sinking ("Women and children first" rule)
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
print("Contingency Table:")
print(contingency)
print()
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("=== Chi-Square Test of Independence ===")
print(f"Hā: Gender and survival are independent (no relationship)")
print(f"Hā: Gender and survival are associated")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.6f}")
print(f"\nExpected frequencies (if independent, these would be expected):")
print(pd.DataFrame(expected,
index=contingency.index,
columns=contingency.columns).round(1))
print(f"\nConclusion: {'Gender and survival are strongly associated (higher female survival rate)' if p_value < 0.05 else 'Independent'}")Contingency Table: survived 0 1 sex female 81 233 male 468 109 === Chi-Square Test of Independence === Hā: Gender and survival are independent (no relationship) Hā: Gender and survival are associated Chi-square statistic: 260.7170 Degrees of freedom: 1 p-value: 0.000000 Expected frequencies (if independent, these would be expected): survived 0 1 sex female 193.5 120.5 male 355.5 221.5 Conclusion: Gender and survival are strongly associated (higher female survival rate)
Goodness of Fit Test
Example Use Cases
- Quality Control: Test if defect occurrences are evenly distributed (1/5 each) across days of the week
- Marketing: Check if customer distribution matches the expected ratio (40:35:25)
- Genetics: Test if observed genotype ratios match Mendelās laws (9:3:3:1)
- Survey: Check if response distribution follows a uniform distribution
Key Question: āDoes observed frequency match the expected theoretical distribution?ā
# Titanic: Test if cabin class distribution is uniform
# Situation: Check if Titanic passengers were evenly distributed across 1st, 2nd, 3rd class
observed = titanic['pclass'].value_counts().sort_index()
n = len(titanic)
expected = np.array([n/3, n/3, n/3]) # Expected uniform distribution
chi2, p_value = stats.chisquare(observed, expected)
print("=== Chi-Square Goodness of Fit Test ===")
print(f"Hā: Cabin classes are evenly distributed (each 33.3%)")
print(f"Hā: Not evenly distributed")
print(f"\nObserved frequencies:")
for cls, count in observed.items():
    print(f" Class {cls}: {count} passengers ({count/n*100:.1f}%)")
print(f"\nExpected frequency (uniform distribution): {n/3:.0f} each (33.3%)")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Not uniform - 3rd class is the majority' if p_value < 0.05 else 'Uniform distribution'}")=== Chi-Square Goodness of Fit Test === Hā: Cabin classes are evenly distributed (each 33.3%) Hā: Not evenly distributed Observed frequencies: Class 1: 216 passengers (24.2%) Class 2: 184 passengers (20.7%) Class 3: 491 passengers (55.1%) Expected frequency (uniform distribution): 297 each (33.3%) Chi-square statistic: 110.8417 p-value: 0.000000 Conclusion: Not uniform - 3rd class is the majority
Fisherās Exact Test
Example Use Cases
- Clinical Trials: Test treatment effect in small-scale pilot studies (n < 20)
- Rare Diseases: Association between rare disease occurrence and genetic mutations
- Quality Control: Analysis of rare defect types where expected frequency is less than 5
- Epidemiology: Association between infection status and specific behaviors in small groups
Key Question: āAre two variables associated in a 2x2 contingency table when sample size is small?ā
Alternative to chi-square test: Use Fisherās Exact when any cell has expected frequency less than 5
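That rule can be checked mechanically. A minimal helper (an assumption of this sketch, not part of scipy) that inspects the expected frequencies returned by chi2_contingency:
def needs_fisher(contingency_table):
    # chi2_contingency returns (chi2, p, dof, expected); index 3 is the expected table
    expected = chi2_contingency(contingency_table)[3]
    return (expected < 5).any()
print(needs_fisher(pd.crosstab(titanic['sex'], titanic['survived'])))  # False: all cells large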
# Small-scale clinical trial simulation
# Situation: Pilot study with 20 subjects. Compare cure rates between drug treatment group (10) and placebo group (10)
contingency = np.array([[8, 2], # Treatment group: 8 cured, 2 not cured
[3, 7]]) # Control group: 3 cured, 7 not cured
odds_ratio, p_value = fisher_exact(contingency)
print("=== Fisher's Exact Test ===")
print("Situation: Small-scale clinical trial (n=20)")
print("\nContingency Table:")
print(" Cured Not Cured")
print(f"Treatment {contingency[0,0]} {contingency[0,1]}")
print(f"Control {contingency[1,0]} {contingency[1,1]}")
print(f"\nTreatment cure rate: {contingency[0,0]/contingency[0].sum()*100:.0f}%")
print(f"Control cure rate: {contingency[1,0]/contingency[1].sum()*100:.0f}%")
print(f"\nOdds Ratio: {odds_ratio:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Treatment effect exists' if p_value < 0.05 else 'No treatment effect'}")
print(f"\nInterpretation: Treatment group has {odds_ratio:.1f}x higher cure odds than control group")=== Fisher's Exact Test ===
Situation: Small-scale clinical trial (n=20)
Contingency Table:
Cured Not Cured
Treatment 8 2
Control 3 7
Treatment cure rate: 80%
Control cure rate: 30%
Odds Ratio: 9.3333
p-value: 0.0350
Conclusion: Treatment effect exists
Interpretation: Treatment group has 9.3x higher cure odds than control group
McNemar's Test
Example Use Cases
- Marketing: Change in brand awareness of same customers before/after ad campaign (AwareāAware, UnawareāAware, etc.)
- Medicine: Change in symptom presence of same patients before/after treatment
- Politics: Change in party support of same voters before/after election
- Education: Change in understanding of specific concepts of same students before/after class
Key Question: āDid the categorical response of the same subjects change before/after?ā
Use for paired samples + categorical data. Use Wilcoxon/paired t-test for continuous data.
from statsmodels.stats.contingency_tables import mcnemar
# Marketing campaign before/after purchase behavior change
# Situation: Track purchase status of 100 customers before and after campaign
# [Bought before/Bought after, Bought before/Didn't buy after]
# [Didn't buy before/Bought after, Didn't buy before/Didn't buy after]
table = np.array([[45, 15], # Bought before & after / Bought before, didn't buy after
[35, 5]]) # Didn't buy before, bought after / Didn't buy before & after
result = mcnemar(table, exact=True)
print("=== McNemar's Test ===")
print("Situation: Purchase behavior change of 100 customers before/after campaign")
print("\nPaired Table:")
print(" Bought After Didn't Buy After")
print(f"Bought Before {table[0,0]} {table[0,1]}")
print(f"Didn't Buy Before {table[1,0]} {table[1,1]}")
print(f"\nChange Analysis:")
print(f" New purchase (Didn't buy ā Bought): {table[1,0]} people")
print(f" Churned (Bought ā Didn't buy): {table[0,1]} people")
print(f" No change: {table[0,0] + table[1,1]} people")
print(f"\np-value: {result.pvalue:.4f}")
print(f"\nConclusion: {'Campaign significantly changed purchase behavior (New purchases > Churn)' if result.pvalue < 0.05 else 'No significant change'}")=== McNemar's Test ===
Situation: Purchase behavior change of 100 customers before/after campaign
Paired Table:
Bought After Didn't Buy After
Bought Before 45 15
Didn't Buy Before 35 5
Change Analysis:
New purchase (Didn't buy ā Bought): 35 people
Churned (Bought ā Didn't buy): 15 people
No change: 50 people
p-value: 0.0066
Conclusion: Campaign significantly changed purchase behavior (New purchases > Churn)
8. Correlation Analysis
Pearson Correlation Coefficient
Example Use Cases
- Marketing: Analyze linear relationship between ad spending and sales
- HR: Correlation between years of service and salary
- Education: Relationship between study hours and test scores
- Finance: Relationship between interest rates and stock prices
Key Question: āIs there a linear relationship between two continuous variables?ā
Prerequisites: Both variables follow normal distribution, linear relationship. Cannot detect non-linear relationships.
# Diamonds: Correlation between carat and price
# Situation: Analyze if there is a linear relationship between diamond carat (weight) and price
carat = diamonds['carat']
price = diamonds['price']
r, p_value = pearsonr(carat, price)
print("=== Pearson Correlation Analysis ===")
print(f"Hā: There is no correlation between carat and price (r = 0)")
print(f"Hā: There is correlation between carat and price (r ā 0)")
print(f"\nPearson r: {r:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Coefficient of determination (R²): {r**2:.4f} ā {r**2*100:.1f}% of price variation explained by carat")
print(f"\nCorrelation strength interpretation:")
print(f" |r| < 0.3: Weak correlation")
print(f" 0.3 <= |r| < 0.7: Moderate correlation")
print(f" |r| >= 0.7: Strong correlation")
print(f"\nCurrent |r| = {abs(r):.4f} ā Strong positive correlation (caratā ā priceā)")=== Pearson Correlation Analysis === Hā: There is no correlation between carat and price (r = 0) Hā: There is correlation between carat and price (r ā 0) Pearson r: 0.9209 p-value: 0.000000 Coefficient of determination (R²): 0.8481 ā 84.8% of price variation explained by carat Correlation strength interpretation: |r| < 0.3: Weak correlation 0.3 <= |r| < 0.7: Moderate correlation |r| >= 0.7: Strong correlation Current |r| = 0.9209 ā Strong positive correlation (caratā ā priceā)
Spearman Rank Correlation Coefficient
Example Use Cases
- Survey: Relationship between satisfaction ranking and repurchase intent ranking
- Economics: Relationship between GDP ranking and happiness index ranking
- Education: Relationship between academic grade ranking and employment rate ranking
- Sports: Relationship between salary ranking and performance ranking
Key Question: āIs there a monotonic relationship between two variables?ā
Non-parametric alternative to Pearson: Doesn't require normal distribution and can detect non-linear but monotonic relationships. Example: y = x² for x ā„ 0 is non-linear but strictly monotonic, so Spearman captures it fully while Pearson understates it (see the sketch below).
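A quick synthetic demonstration of that difference (made-up data, not from the tips dataset):
x = np.linspace(1, 10, 100)
y = x ** 3  # strongly non-linear but strictly monotonic
r_demo, _ = pearsonr(x, y)
rho_demo, _ = spearmanr(x, y)
print(f"Pearson r = {r_demo:.4f} (understates the relationship)")
print(f"Spearman rho = {rho_demo:.4f} (perfect monotonic agreement)")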
# Tips: Rank correlation between total bill and tip
# Situation: Check if higher bills tend to have higher tips (even if not exactly proportional)
total_bill = tips['total_bill']
tip = tips['tip']
rho, p_value = spearmanr(total_bill, tip)
r_pearson, _ = pearsonr(total_bill, tip)
print("=== Spearman Rank Correlation Analysis ===")
print(f"Hā: No monotonic relationship between bill and tip")
print(f"Hā: Monotonic relationship exists between bill and tip")
print(f"\nSpearman rho: {rho:.4f}")
print(f"(Comparison) Pearson r: {r_pearson:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant monotonic relationship - higher bills tend to have higher tips' if p_value < 0.05 else 'No relationship'}")=== Spearman Rank Correlation Analysis === Hā: No monotonic relationship between bill and tip Hā: Monotonic relationship exists between bill and tip Spearman rho: 0.8264 (Comparison) Pearson r: 0.6757 p-value: 0.000000 Conclusion: Significant monotonic relationship - higher bills tend to have higher tips
Kendallās Tau
Example Use Cases
- Rank Data: Agreement between two judgesā ranking evaluations (e.g., restaurant rankings)
- Ordinal Scale: Relationship between education level (Elementary/Middle/High/College) and income level (Low/Medium/High)
- When Many Ties Exist: Data with many same values like 5-point Likert scale surveys
Key Question: āWhat is the concordance between two variables in rank data?ā
More conservative than Spearman (tau values tend to be smaller in magnitude). More reliable when there are many ties.
# Titanic: Relationship between cabin class and age
# Situation: Check if 1st class passengers tend to be older (ordinal vs continuous)
pclass = titanic['pclass'].dropna()
age = titanic['age'].dropna()
# Align indices
common_idx = pclass.index.intersection(age.index)
pclass_aligned = pclass.loc[common_idx]
age_aligned = age.loc[common_idx]
tau, p_value = kendalltau(pclass_aligned, age_aligned)
print("=== Kendall's Tau Correlation Analysis ===")
print(f"Hā: No relationship between cabin class and age")
print(f"Hā: Relationship exists between cabin class and age")
print(f"\nKendall tau: {tau:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant relationship' if p_value < 0.05 else 'No relationship'}")
if tau < 0:
    print(f"Interpretation: tau < 0 means cabin classā(1st class) ā ageā (older passengers in premium cabins)")
=== Kendall's Tau Correlation Analysis ===
Hā: No relationship between cabin class and age
Hā: Relationship exists between cabin class and age
Kendall tau: -0.1080
p-value: 0.0000
Conclusion: Significant relationship
Interpretation: tau < 0 means cabin classā(1st class) ā ageā (older passengers in premium cabins)
9. Effect Size
Why is effect size important?
p-value only tells you āIs there a difference?ā but not āHow big is the difference?ā With large samples, even tiny differences can become significant.
Example: In an A/B test with 1 million people, a 0.01% difference in conversion rate can yield p < 0.05, but you need effect size to judge if this difference is actually meaningful for the business.
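A small simulation of that pitfall (synthetic data; Cohen's d, computed inline here, is defined formally just below):
rng = np.random.default_rng(42)
big_a = rng.normal(0.000, 1, 1_000_000)
big_b = rng.normal(0.005, 1, 1_000_000)  # trivial true difference of 0.005 SD
_, p_big = ttest_ind(big_a, big_b)
d_big = (big_b.mean() - big_a.mean()) / np.sqrt((big_a.var() + big_b.var()) / 2)
print(f"p-value: {p_big:.4f} ā 'significant', but Cohen's d = {d_big:.4f} ā negligible")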
Cohenās d
When to use: Interpret the practical meaning of mean difference between two groups
Interpretation guidelines (Cohen, 1988):
- |d| < 0.2: Negligible effect
- 0.2 ⤠|d| < 0.5: Small effect
- 0.5 ⤠|d| < 0.8: Medium effect
- |d| ā„ 0.8: Large effect
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled_std
# Titanic: Effect size of age difference between survivors vs non-survivors
# Situation: Relationship between age and survival. Even if p-value is significant, is it a meaningful difference?
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()
d = cohens_d(survived_ages, died_ages)
t_stat, p_value = ttest_ind(survived_ages, died_ages)
print("=== Effect Size Analysis ===")
print(f"Survivor mean age: {survived_ages.mean():.2f} years")
print(f"Non-survivor mean age: {died_ages.mean():.2f} years")
print(f"Difference: {abs(survived_ages.mean() - died_ages.mean()):.2f} years")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f} ā {'Significant' if p_value < 0.05 else 'Not significant'}")
print(f"Cohen's d: {d:.4f}")
print(f"\nEffect size interpretation:")
print(f" |d| < 0.2: Small effect")
print(f" 0.2 <= |d| < 0.5: Medium effect")
print(f" 0.5 <= |d| < 0.8: Large effect")
print(f" |d| >= 0.8: Very large effect")
print(f"\nCurrent: |d| = {abs(d):.4f} ā ", end="")
if abs(d) >= 0.8:
print("Very large effect")
elif abs(d) >= 0.5:
print("Large effect")
elif abs(d) >= 0.2:
print("Medium effect")
else:
print("Small effect ā Statistically significant but practical meaning is limited!")=== Effect Size Analysis === Survivor mean age: 28.34 years Non-survivor mean age: 30.63 years Difference: 2.29 years t-statistic: -2.0551 p-value: 0.0402 ā Significant Cohen's d: -0.1616 Effect size interpretation: |d| < 0.2: Small effect 0.2 <= |d| < 0.5: Medium effect 0.5 <= |d| < 0.8: Large effect |d| >= 0.8: Very large effect Current: |d| = 0.1616 ā Small effect ā Statistically significant but practical meaning is limited!
Cramerās V
When to use: Measure association strength between categorical variables (after chi-square test)
Interpretation guidelines:
- V < 0.1: Negligible
- 0.1 ⤠V < 0.3: Weak association
- 0.3 ⤠V < 0.5: Moderate association
- V ā„ 0.5: Strong association
def cramers_v(contingency_table):
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))
# Titanic: Strength of association between gender and survival
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
v = cramers_v(contingency)
chi2, p_value, _, _ = chi2_contingency(contingency)
print("=== Cramer's V (Association Strength) ===")
print(f"Chi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"Cramer's V = {v:.4f}")
print(f"\nInterpretation guidelines:")
print(f" V < 0.1: Negligible")
print(f" 0.1 <= V < 0.3: Weak association")
print(f" 0.3 <= V < 0.5: Moderate association")
print(f" V >= 0.5: Strong association")
print(f"\nCurrent: V = {v:.4f} ā ", end="")
if v >= 0.5:
    print("Strong association ā Gender strongly affected survival!")
elif v >= 0.3:
    print("Moderate association")
elif v >= 0.1:
    print("Weak association")
else:
    print("Negligible")
=== Cramer's V (Association Strength) ===
Chi-square = 260.7170, p-value = 0.000000
Cramer's V = 0.5410
Interpretation guidelines:
 V < 0.1: Negligible
 0.1 <= V < 0.3: Weak association
 0.3 <= V < 0.5: Moderate association
 V >= 0.5: Strong association
Current: V = 0.5410 ā Strong association ā Gender strongly affected survival!
10. Multiple Testing Correction
Why is correction needed?
When performing multiple tests simultaneously, Type I errors (false positives) accumulate.
Example: Performing 20 tests at α = 0.05
- Probability of at least 1 false positive = 1 - (0.95)^20 ā 64%!
This is called the Multiple Comparison Problem.
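A quick simulation of that accumulation (synthetic data; Hā is true in every one of the 20 tests):
rng = np.random.default_rng(1)
n_experiments, n_tests = 1000, 20
families_with_fp = 0
for _ in range(n_experiments):
    # 20 independent tests, all with no true effect
    p_vals = [ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue for _ in range(n_tests)]
    families_with_fp += min(p_vals) < 0.05
print(f"Experiments with >= 1 false positive: {families_with_fp / n_experiments:.1%}")
print(f"Theoretical: 1 - 0.95**20 = {1 - 0.95**20:.1%}")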
Bonferroni Correction
When to use:
- Analyzing multiple A/B tests simultaneously
- Post-hoc tests after ANOVA comparing multiple pairs
- Genome studies testing thousands of genes
Method: Divide α by number of tests (k). Example: 5 tests ā 0.05/5 = 0.01
Very conservative. May miss real effects (increased Type II error)
from statsmodels.stats.multitest import multipletests
# Multiple A/B test results
# Situation: Testing 5 UI elements simultaneously. Which ones have real effects?
p_values = [0.03, 0.04, 0.01, 0.08, 0.002]
test_names = ['Button Color', 'Headline', 'CTA Position', 'Image', 'Price Display']
# Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, method='bonferroni')
print("=== Multiple Testing Correction (Bonferroni) ===")
print(f"Number of tests: {len(p_values)}")
print(f"Corrected significance level: 0.05 / {len(p_values)} = {0.05/len(p_values):.3f}")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p, rejected):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {min(cp, 1.0):<18.4f} {result}")
=== Multiple Testing Correction (Bonferroni) ===
Number of tests: 5
Corrected significance level: 0.05 / 5 = 0.010
Test            Original p-value   Corrected p-value  Conclusion
----------------------------------------------------------------------
Button Color    0.0300             0.1500             Not significant
Headline        0.0400             0.2000             Not significant
CTA Position    0.0100             0.0500             Not significant
Image           0.0800             0.4000             Not significant
Price Display   0.0020             0.0100             Significant
Benjamini-Hochberg (FDR)
When to use:
- When performing many tests but can tolerate some false positives
- Screening candidates in exploratory analysis
- Genome studies where Bonferroni is too conservative
Method: Controls False Discovery Rate (FDR). Controls āproportion of false positives among those declared significantā to 5%
# Benjamini-Hochberg correction
rejected_bh, corrected_p_bh, _, _ = multipletests(p_values, method='fdr_bh')
print("=== Multiple Testing Correction (Benjamini-Hochberg FDR) ===")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p_bh, rejected_bh):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {cp:<18.4f} {result}")
print(f"\nComparison:")
print(f" Significant with Bonferroni: {sum(rejected)}")
print(f" Significant with FDR(BH): {sum(rejected_bh)}")
print(f"\nā FDR is less conservative, allowing more discoveries")
print(f" However, about 5% of these may be false positives")=== Multiple Testing Correction (Benjamini-Hochberg FDR) === Test Original p-value Corrected p-value Conclusion ---------------------------------------------------------------------- Button Color 0.0300 0.0500 Significant Headline 0.0400 0.0500 Significant CTA Position 0.0100 0.0250 Significant Image 0.0800 0.0800 Not significant Price Display 0.0020 0.0100 Significant Comparison: Significant with Bonferroni: 1 Significant with FDR(BH): 4 ā FDR is less conservative, allowing more discoveries However, about 5% of these may be false positives
11. Power Analysis
When to use?
Use at the experimental design stage to calculate āHow many samples are needed?ā
Too few samples ā Cannot detect real effects even when they exist (Type II error)
Too many samples ā Waste of resources
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# Scenario: Effect size 0.3, Power 80%, Significance level 5%
# Situation: "Expecting the new UI to improve conversion rate moderately (d=0.3).
# How many people are needed to detect this effect with 80% probability?"
effect_size = 0.3
alpha = 0.05
power = 0.8
n = power_analysis.solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
ratio=1.0,
alternative='two-sided')
print("=== Sample Size Calculation ===")
print(f"Target effect size (Cohen's d): {effect_size} (medium effect)")
print(f"Significance level (α): {alpha}")
print(f"Target power (1-β): {power} (80% probability of detecting effect)")
print(f"\nRequired sample size: {n:.0f} per group")
print(f"Total required: {n*2:.0f} people")
# Required sample sizes for various effect sizes
print("\nRequired sample sizes by effect size (Power 80%):")
for es, desc in [(0.2, 'small effect'), (0.3, 'small-to-medium effect'), (0.5, 'medium effect'), (0.8, 'large effect')]:
    n = power_analysis.solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1.0)
    print(f" d = {es} ({desc}): {n:.0f} per group (total {n*2:.0f})")
=== Sample Size Calculation ===
Target effect size (Cohen's d): 0.3 (small-to-medium effect)
Significance level (α): 0.05
Target power (1-β): 0.8 (80% probability of detecting effect)
Required sample size: 176 per group
Total required: 352 people
Required sample sizes by effect size (Power 80%):
 d = 0.2 (small effect): 394 per group (total 787)
 d = 0.3 (small-to-medium effect): 176 per group (total 352)
 d = 0.5 (medium effect): 64 per group (total 128)
 d = 0.8 (large effect): 26 per group (total 51)
12. Test Selection Summary
Test Selection by Data Type
| Situation | Parametric Test | Non-parametric Test |
|---|---|---|
| 1 sample mean vs reference value | One-sample t-test | Wilcoxon signed-rank |
| 2 independent samples comparison | Independent t-test | Mann-Whitney U |
| 2 paired samples comparison | Paired t-test | Wilcoxon signed-rank |
| 3+ independent samples comparison | One-way ANOVA | Kruskal-Wallis H |
| 2Ć2 categories (small sample) | - | Fisherās exact |
| Category independence | - | Chi-square |
| Paired category before/after comparison | - | McNemarās |
| Correlation | Pearson r | Spearman Ļ, Kendall Ļ |
Quick Guide by Situation
Q: Comparing means of two groups?
āāā Same subjects before/after? ā Paired t-test (normal) / Wilcoxon (non-normal)
āāā Different subjects? ā Independent t-test (normal) / Mann-Whitney (non-normal)
Q: Comparing 3+ groups?
āāā Normal distribution? ā One-way ANOVA
āāā Non-normal distribution? ā Kruskal-Wallis
Q: Relationship between two categorical variables?
āāā Any expected frequency < 5? ā Fisher's exact
āāā All expected frequencies >= 5? ā Chi-square
Q: Relationship between two continuous variables?
āāā Linear relationship? ā Pearson r
āāā Monotonic relationship? ā Spearman Ļ
Quiz
Problem 1
Perform an appropriate test to check if there is a significant difference in survival rate by cabin class (pclass) in the Titanic data.
View Answer
# Categorical vs Categorical ā Chi-square test
contingency = pd.crosstab(titanic['pclass'], titanic['survived'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("Contingency Table:")
print(contingency)
print(f"\nChi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"\nConclusion: {'Cabin class and survival rate are associated' if p_value < 0.05 else 'No association'}")
# Effect size
v = cramers_v(contingency)
print(f"Cramer's V = {v:.4f} (moderate association)")Problem 2
Use an appropriate test to check if the tip amount distribution differs between smokers and non-smokers in the Tips data.
View Answer
smoker_tip = tips[tips['smoker'] == 'Yes']['tip']
nonsmoker_tip = tips[tips['smoker'] == 'No']['tip']
# Normality test
_, p_smoker = shapiro(smoker_tip)
_, p_nonsmoker = shapiro(nonsmoker_tip)
print(f"Normality (smoker): p = {p_smoker:.4f}")
print(f"Normality (non-smoker): p = {p_nonsmoker:.4f}")
# Normality not satisfied ā Use Mann-Whitney U
stat, p_value = mannwhitneyu(smoker_tip, nonsmoker_tip)
print(f"\nMann-Whitney U: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Tip distributions differ' if p_value < 0.05 else 'No difference in tip distributions'}")Next Steps
- Learn how to model relationships between variables in Regression Analysis.
- Learn experimental design in A/B Testing.