
Hypothesis Testing


0. Setup

Practice hypothesis testing using various datasets.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import (
    ttest_ind, ttest_rel, ttest_1samp,
    mannwhitneyu, wilcoxon, kruskal,
    chi2_contingency, fisher_exact, f_oneway,
    shapiro, levene, bartlett,
    spearmanr, pearsonr, kendalltau,
    kstest, normaltest, anderson
)
import warnings
warnings.filterwarnings('ignore')

# 1. Titanic Dataset
titanic = sns.load_dataset('titanic')
print(f"Titanic: {titanic.shape}")

# 2. Iris Dataset
iris = sns.load_dataset('iris')
print(f"Iris: {iris.shape}")

# 3. Tips Dataset
tips = sns.load_dataset('tips')
print(f"Tips: {tips.shape}")

# 4. Diamonds Dataset (sampled)
diamonds = sns.load_dataset('diamonds').sample(n=1000, random_state=42)
print(f"Diamonds: {diamonds.shape}")
Output
Titanic: (891, 15)
Iris: (150, 5)
Tips: (244, 7)
Diamonds: (1000, 10)

1. Fundamentals of Hypothesis Testing

Key Concepts

Term | Description
Null Hypothesis (Hā‚€) | ā€œThere is no difference/effectā€ - the default assumption
Alternative Hypothesis (H₁) | ā€œThere is a difference/effectā€ - what we want to show
p-value | Probability of a result at least as extreme as the one observed, assuming Hā‚€ is true
Significance Level (α) | Threshold for rejecting Hā‚€; usually 0.05 (5%)
Power | Probability of detecting an effect when it truly exists
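To make the p-value definition concrete, here is a minimal simulation sketch (the fair-coin scenario and the observed 60 heads are assumptions for illustration): under Hā‚€ the coin is fair, and the p-value is the probability of a result at least as extreme as the one observed.

import numpy as np

rng = np.random.default_rng(0)
# Simulate Hā‚€: 100 flips of a fair coin, repeated 100,000 times
heads = rng.binomial(n=100, p=0.5, size=100_000)
# Two-sided p-value: how often is the result at least as extreme as 60 heads?
p_value = np.mean(np.abs(heads - 50) >= 10)
print(f"Simulated two-sided p-value: {p_value:.3f}")  # roughly 0.057, just above α = 0.05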

Test Selection Guide

Data Type?
ā”œā”€ā”€ Continuous
│   ā”œā”€ā”€ Normal distribution → Parametric tests (t-test, ANOVA)
│   └── Non-normal distribution → Non-parametric tests (Mann-Whitney, Kruskal-Wallis)
└── Categorical
    ā”œā”€ā”€ 2Ɨ2 table (small sample) → Fisher's Exact Test
    └── Otherwise → Chi-Square Test

2. Normality Tests

When to use?

Use before performing parametric tests like t-tests or ANOVA to check if your data follows a normal distribution. If normality is not satisfied, use non-parametric tests (Mann-Whitney, Kruskal-Wallis, etc.).

Shapiro-Wilk Test

Example Use Cases

  • Check if blood pressure measurements in a clinical trial follow normal distribution
  • Verify distribution of user session duration before A/B testing
  • Quality control: Check if product weight data is normally distributed

Characteristics: Generally the most powerful of the common normality tests. Recommended for n < 5000.

# Titanic: Normality test for age distribution
ages = titanic['age'].dropna()

stat, p_value = shapiro(ages)

print("=== Shapiro-Wilk Normality Test ===")
print(f"Data: Titanic passenger ages (n={len(ages)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(ages, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
stats.probplot(ages, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')
plt.tight_layout()
plt.show()
Output
=== Shapiro-Wilk Normality Test ===
Data: Titanic passenger ages (n=714)
Statistic: 0.9816
p-value: 0.0000
Conclusion: Does not follow normal distribution

D’Agostino-Pearson Test

Example Use Cases

  • Check if financial return data follows normal distribution (when skewness/kurtosis matters)
  • Test distribution shape of survey scores
  • Check asymmetry of exam score distribution

Characteristics: Considers both skewness and kurtosis. Useful when distribution shape matters. Can be used when n >= 20.

# Iris: Normality test for petal length
petal_length = iris['petal_length']

stat, p_value = normaltest(petal_length)

print("=== D'Agostino-Pearson Normality Test ===")
print(f"Data: Iris petal length (n={len(petal_length)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")
Output
=== D'Agostino-Pearson Normality Test ===
Data: Iris petal length (n=150)
Statistic: 31.5324
p-value: 0.0000
Conclusion: Does not follow normal distribution

Kolmogorov-Smirnov Test

Example Use Cases

  • Normality testing for large datasets (n > 5000)
  • Compare with other theoretical distributions (exponential, uniform, etc.) besides normal
  • Compare if two samples have identical distributions (2-sample KS test)

Characteristics: Less sensitive to sample size, suitable for large datasets. Can compare with various distributions.

# Tips: Normality test for tip amounts
tip_values = tips['tip']

# Test after standardization
# Note: estimating the mean/std from the same data biases the plain KS test;
# a corrected variant (e.g., Lilliefors) is more accurate in that case
tip_standardized = (tip_values - tip_values.mean()) / tip_values.std()
stat, p_value = kstest(tip_standardized, 'norm')

print("=== Kolmogorov-Smirnov Normality Test ===")
print(f"Data: Tips tip amounts (n={len(tip_values)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")
Output
=== Kolmogorov-Smirnov Normality Test ===
Data: Tips tip amounts (n=244)
Statistic: 0.0975
p-value: 0.0186
Conclusion: Does not follow normal distribution

Anderson-Darling Test

Example Use Cases

  • Normality check in analyses where extreme values (outliers) are important (risk management, insurance, etc.)
  • Check how much the distribution tails differ from normal distribution
  • When decisions are needed at multiple significance levels simultaneously

Characteristics: More sensitive to distribution tails. Preferred in finance/insurance where extreme values matter.

# Diamonds: Normality test for prices
prices = diamonds['price']

result = anderson(prices, dist='norm')

print("=== Anderson-Darling Normality Test ===")
print(f"Data: Diamonds prices (n={len(prices)})")
print(f"Statistic: {result.statistic:.4f}")
print("\nCritical values by significance level:")
for cv, sl in zip(result.critical_values, result.significance_level):
    result_str = "Reject" if result.statistic > cv else "Fail to reject"
    print(f"  {sl}%: Critical value = {cv:.4f} → Hā‚€ {result_str}")
Output
=== Anderson-Darling Normality Test ===
Data: Diamonds prices (n=1000)
Statistic: 47.8932

Critical values by significance level:
15.0%: Critical value = 0.5740 → Hā‚€ Reject
10.0%: Critical value = 0.6540 → Hā‚€ Reject
5.0%: Critical value = 0.7850 → Hā‚€ Reject
2.5%: Critical value = 0.9150 → Hā‚€ Reject
1.0%: Critical value = 1.0890 → Hā‚€ Reject

3. Homogeneity of Variance Tests

When to use?

Use before performing independent samples t-tests or ANOVA to check if the groups being compared have equal variances. If the equal variance assumption is violated, use Welch’s t-test or Games-Howell post-hoc test.
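As a quick illustration of the Welch fallback, here is a minimal sketch on synthetic data with deliberately unequal variances and sample sizes (the group parameters are assumptions, not taken from any dataset here):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(10, 1, 50)    # small variance, small n
b = rng.normal(10, 5, 500)   # much larger variance and n
# Student's t-test pools the variances; Welch's does not
print(f"Student p = {ttest_ind(a, b, equal_var=True).pvalue:.4f}")
print(f"Welch   p = {ttest_ind(a, b, equal_var=False).pvalue:.4f}")

When variances and group sizes differ like this, the two p-values can disagree noticeably, which is why the equal-variance check comes first.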

Levene’s Test

Example Use Cases

  • Check if purchase amount variance is equal between treatment and control groups in A/B testing
  • Test if exam score variance is equal between male and female groups
  • Check if quality dispersion is equal across products from multiple factories

Characteristics: Robust as it doesn’t require normality assumption. Can be used with non-normal data.

# Titanic: Compare age variance by survival status
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()

stat, p_value = levene(survived_ages, died_ages)

print("=== Levene's Test for Equal Variances ===")
print(f"Survived age variance: {survived_ages.var():.2f}")
print(f"Died age variance: {died_ages.var():.2f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")
Output
=== Levene's Test for Equal Variances ===
Survived age variance: 207.03
Died age variance: 199.41
Statistic: 0.1557
p-value: 0.6933
Conclusion: Equal variance assumption satisfied

Bartlett’s Test

Example Use Cases

  • Compare variance between groups in normally distributed data
  • Compare response variability across multiple dose groups in clinical trials
  • Test variance equality across multiple production lines in quality control

Characteristics: Most powerful equal variance test when data follows normal distribution. Use Levene’s if normality is violated.

# Iris: Compare sepal length variance by species
setosa = iris[iris['species'] == 'setosa']['sepal_length']
versicolor = iris[iris['species'] == 'versicolor']['sepal_length']
virginica = iris[iris['species'] == 'virginica']['sepal_length']

stat, p_value = bartlett(setosa, versicolor, virginica)

print("=== Bartlett's Test for Equal Variances ===")
print(f"Setosa variance: {setosa.var():.4f}")
print(f"Versicolor variance: {versicolor.var():.4f}")
print(f"Virginica variance: {virginica.var():.4f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")
Output
=== Bartlett's Test for Equal Variances ===
Setosa variance: 0.1242
Versicolor variance: 0.2664
Virginica variance: 0.4043
Statistic: 16.0057
p-value: 0.0003
Conclusion: Equal variance assumption not satisfied

4. T-Tests

One-Sample T-Test

Example Use Cases

  • Quality Control: Verify if the mean lifespan of batteries produced at a factory equals the nominal lifespan of 1000 hours
  • Marketing: Check if average customer satisfaction has reached the target of 4.0 points
  • Education: Test if students’ average scores differ from the national average of 75 points
  • Service: Check if average response time is within the SLA standard of 3 seconds

Key Question: ā€œIs our sample mean equal to/different from a specific reference value?ā€

# Tips: Test if mean tip is $3
# Situation: Restaurant manager claims "Our average tip is $3". Verify if this is true.
tip_values = tips['tip']
hypothesized_mean = 3.0

stat, p_value = ttest_1samp(tip_values, hypothesized_mean)

print("=== One-Sample t-test ===")
print(f"Hā‚€: Mean tip = ${hypothesized_mean:.2f}")
print(f"H₁: Mean tip ≠ ${hypothesized_mean:.2f}")
print(f"\nSample mean: ${tip_values.mean():.2f}")
print(f"Sample standard deviation: ${tip_values.std():.2f}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Mean tip differs from $3' if p_value < 0.05 else 'Mean tip can be considered equal to $3'}")
Output
=== One-Sample t-test ===
Hā‚€: Mean tip = $3.00
H₁: Mean tip ≠ $3.00

Sample mean: $3.00
Sample standard deviation: $1.38
t-statistic: -0.0363
p-value: 0.9711

Conclusion: Mean tip can be considered equal to $3

Independent Samples T-Test

Example Use Cases

  • A/B Testing: Verify if the new website design (B) has a higher conversion rate than the old design (A)
  • Medicine: Compare blood pressure changes between drug treatment group and placebo group
  • Education: Compare grade differences between online and offline classes
  • HR: Analyze productivity differences between remote workers and office workers
  • Marketing: Compare average purchase amounts between male and female customers

Key Question: ā€œAre the means of two different groups equal/different?ā€

Note: The two groups must be independent (same person cannot belong to both groups)

# Titanic: Compare fare by gender
# Situation: Analyze if there were differences in fares paid by males and females on the Titanic
male_fare = titanic[titanic['sex'] == 'male']['fare'].dropna()
female_fare = titanic[titanic['sex'] == 'female']['fare'].dropna()

# Test equal variance assumption
_, levene_p = levene(male_fare, female_fare)
equal_var = levene_p >= 0.05

# t-test (Welch's t-test if unequal variance)
stat, p_value = ttest_ind(male_fare, female_fare, equal_var=equal_var)

print("=== Independent Samples t-test ===")
print(f"Hā‚€: Male mean fare = Female mean fare")
print(f"H₁: Male mean fare ≠ Female mean fare")
print(f"\nMale mean fare: ${male_fare.mean():.2f} (n={len(male_fare)})")
print(f"Female mean fare: ${female_fare.mean():.2f} (n={len(female_fare)})")
print(f"Difference: ${female_fare.mean() - male_fare.mean():.2f}")
print(f"\nEqual variance assumption: {'Satisfied (Student t-test)' if equal_var else 'Not satisfied (Welch t-test used)'}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference in fare by gender' if p_value < 0.05 else 'No difference in fare by gender'}")
Output
=== Independent Samples t-test ===
Hā‚€: Male mean fare = Female mean fare
H₁: Male mean fare ≠ Female mean fare

Male mean fare: $25.52 (n=577)
Female mean fare: $44.48 (n=314)
Difference: $18.95

Equal variance assumption: Not satisfied (Welch t-test used)
t-statistic: -4.7994
p-value: 0.0000

Conclusion: Significant difference in fare by gender

Paired Samples T-Test

Example Use Cases

  • Diet Effect: Compare weight before and after diet for the same people
  • Educational Effect: Compare test scores before and after class for the same students
  • Drug Effect: Compare blood pressure before and after medication for the same patients
  • UX Improvement: Compare task completion time when the same users use old version/new version app
  • Marketing: Compare sales before and after promotion for the same stores

Key Question: ā€œDid the before and after values of the same subjects change?ā€

Difference from independent samples: paired samples measure the same subject twice, so individual differences are controlled, making the test more sensitive to changes (see the sketch below).
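A minimal sketch of that sensitivity gain, using synthetic data (the shared per-subject baseline and the +1 true effect are assumptions for illustration):

import numpy as np
from scipy.stats import ttest_rel, ttest_ind

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 30)            # large between-subject differences
before = baseline + rng.normal(0, 1, 30)
after = baseline + 1 + rng.normal(0, 1, 30)  # true effect: +1
# Pairing removes the baseline noise; the independent test drowns in it
print(f"Paired p      = {ttest_rel(before, after).pvalue:.4f}")
print(f"Independent p = {ttest_ind(before, after).pvalue:.4f}")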

# Simulation: A/B test conversion rate (same users experience both UIs)
# Situation: Show 100 users old UI and new UI sequentially and measure click rates
np.random.seed(42)
n_users = 100

# Before: Old UI click rate
before = np.random.beta(2, 8, n_users)  # Mean about 20%
# After: New UI click rate (slight improvement)
after = before + np.random.normal(0.05, 0.03, n_users)
after = np.clip(after, 0, 1)

stat, p_value = ttest_rel(before, after)

print("=== Paired Samples t-test ===")
print(f"Hā‚€: No change in conversion rate (new UI has no effect)")
print(f"H₁: Change in conversion rate (new UI has effect)")
print(f"\nBefore (old UI) mean: {before.mean():.4f} ({before.mean()*100:.1f}%)")
print(f"After (new UI) mean: {after.mean():.4f} ({after.mean()*100:.1f}%)")
print(f"Mean difference: {(after - before).mean():.4f} (+{(after - before).mean()*100:.1f}%p)")
print(f"\nt-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'New UI significantly improved' if p_value < 0.05 else 'No significant change'}")
Output
=== Paired Samples t-test ===
Hā‚€: No change in conversion rate (new UI has no effect)
H₁: Change in conversion rate (new UI has effect)

Before (old UI) mean: 0.2037 (20.4%)
After (new UI) mean: 0.2503 (25.0%)
Mean difference: 0.0466 (+4.7%p)

t-statistic: -14.7221
p-value: 0.0000

Conclusion: New UI significantly improved

5. Non-parametric Tests

When to use?

  • When data does not follow normal distribution
  • When sample size is small (n < 30)
  • When data is rank data or ordinal scale
  • When there are many outliers (non-parametric tests are less sensitive to outliers)

Mann-Whitney U Test

Example Use Cases

  • E-commerce: Compare order amount distribution between premium and regular members (amount data usually has a non-normal right-skewed distribution)
  • Gaming: Compare play time between paying and non-paying users
  • Medicine: Compare pain scale (1-10) between two treatments
  • Satisfaction: Compare customer rating distributions between two products

Key Question: ā€œAre the distributions (ranks) of two independent groups different?ā€

Non-parametric alternative to independent samples t-test. Think of it as comparing median/rank rather than mean.

# Diamonds: Compare prices by cut quality (Ideal vs Good)
# Situation: Compare price distributions of diamonds with Ideal vs Good cut quality
# Price data generally doesn't follow normal distribution, so use non-parametric test
ideal_price = diamonds[diamonds['cut'] == 'Ideal']['price']
good_price = diamonds[diamonds['cut'] == 'Good']['price']

stat, p_value = mannwhitneyu(ideal_price, good_price, alternative='two-sided')

print("=== Mann-Whitney U Test ===")
print(f"Hā‚€: Ideal and Good cut have the same price distribution")
print(f"H₁: Price distributions are different")
print(f"\nIdeal median: ${ideal_price.median():,.2f} (n={len(ideal_price)})")
print(f"Good median: ${good_price.median():,.2f} (n={len(good_price)})")
print(f"\nU-statistic: {stat:,.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Price distributions significantly different' if p_value < 0.05 else 'No difference in price distributions'}")
Output
=== Mann-Whitney U Test ===
Hā‚€: Ideal and Good cut have the same price distribution
H₁: Price distributions are different

Ideal median: $1,810.00 (n=393)
Good median: $3,086.50 (n=96)

U-statistic: 14,985.00
p-value: 0.0031

Conclusion: Price distributions significantly different

Wilcoxon Signed-Rank Test

Example Use Cases

  • Weight Loss: Compare weight before and after diet (when weight changes are not normally distributed)
  • Survey: Compare satisfaction scores (1-5) before and after policy change for the same respondents
  • Education: Compare confidence scores before and after special lecture for the same students
  • App Ratings: Compare rating changes from the same users before and after update

Key Question: ā€œDid the before and after value distribution of the same subjects change?ā€

Non-parametric alternative to paired samples t-test. Uses sign and rank of differences.

# Tips: Compare lunch vs dinner tip rates (same waiters)
# Situation: Test if 50 waiters receive different tip rates at lunch vs dinner
np.random.seed(42)
n_waiters = 50
lunch_tip_rate = np.random.uniform(0.12, 0.22, n_waiters)
dinner_tip_rate = lunch_tip_rate + np.random.normal(0.02, 0.03, n_waiters)

stat, p_value = wilcoxon(lunch_tip_rate, dinner_tip_rate)

print("=== Wilcoxon Signed-Rank Test ===")
print(f"Hā‚€: No difference between lunch and dinner tip rates")
print(f"H₁: Difference between lunch and dinner tip rates")
print(f"\nLunch tip rate median: {np.median(lunch_tip_rate)*100:.1f}%")
print(f"Dinner tip rate median: {np.median(dinner_tip_rate)*100:.1f}%")
print(f"\nW-statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Dinner tip rate significantly higher' if p_value < 0.05 else 'No difference'}")
Output
=== Wilcoxon Signed-Rank Test ===
Hā‚€: No difference between lunch and dinner tip rates
H₁: Difference between lunch and dinner tip rates

Lunch tip rate median: 16.9%
Dinner tip rate median: 19.1%

W-statistic: 304.00
p-value: 0.0004

Conclusion: Dinner tip rate significantly higher

Kruskal-Wallis H Test

Example Use Cases

  • Marketing: Compare brand awareness scores across 3 ad types (TV, Online, SNS)
  • Products: Compare customer satisfaction across multiple smartphone brands
  • Education: Compare student grade distributions across 3 schools
  • Medicine: Compare recovery periods across multiple treatments (non-normal data)

Key Question: ā€œAre the distributions of 3 or more groups all the same, or is at least one different?ā€

Non-parametric alternative to One-Way ANOVA. Post-hoc tests are needed to determine which groups differ (a sketch follows the example below).

# Iris: Compare petal width by species
# Situation: Test if petal width distributions differ among three iris species (setosa, versicolor, virginica)
setosa_pw = iris[iris['species'] == 'setosa']['petal_width']
versicolor_pw = iris[iris['species'] == 'versicolor']['petal_width']
virginica_pw = iris[iris['species'] == 'virginica']['petal_width']

stat, p_value = kruskal(setosa_pw, versicolor_pw, virginica_pw)

print("=== Kruskal-Wallis H Test ===")
print(f"Hā‚€: All species have the same petal width distribution")
print(f"H₁: At least one species is different")
print(f"\nSetosa median: {setosa_pw.median():.2f}cm")
print(f"Versicolor median: {versicolor_pw.median():.2f}cm")
print(f"Virginica median: {virginica_pw.median():.2f}cm")
print(f"\nH-statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant difference among species' if p_value < 0.05 else 'No difference'}")
Output
=== Kruskal-Wallis H Test ===
Hā‚€: All species have the same petal width distribution
H₁: At least one species is different

Setosa median: 0.20cm
Versicolor median: 1.30cm
Virginica median: 2.00cm

H-statistic: 130.0111
p-value: 0.000000

Conclusion: Significant difference among species
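Kruskal-Wallis only says that at least one group differs. A minimal post-hoc sketch using pairwise Mann-Whitney tests with Bonferroni correction (one common choice among several; it reuses the variables from the example above):

from itertools import combinations
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

groups = {'setosa': setosa_pw, 'versicolor': versicolor_pw, 'virginica': virginica_pw}
pairs = list(combinations(groups, 2))
# Raw p-values for every pairwise comparison
raw_p = [mannwhitneyu(groups[a], groups[b], alternative='two-sided').pvalue
         for a, b in pairs]
# Bonferroni keeps the family-wise error rate at 5% across the three comparisons
rejected, corrected_p, _, _ = multipletests(raw_p, method='bonferroni')
for (a, b), p, rej in zip(pairs, corrected_p, rejected):
    print(f"{a} vs {b}: corrected p = {p:.6f} → {'different' if rej else 'not different'}")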

6. Analysis of Variance (ANOVA)

One-Way ANOVA

Example Use Cases

  • Marketing: Compare average purchase amount across 4 promotion types (discount, points, gifts, free shipping)
  • Manufacturing: Compare average quality scores of products from 3 factories
  • HR: Compare employee satisfaction across departments (Development, Marketing, Sales, Support)
  • Education: Compare learning effects across multiple teaching methods

Key Question: ā€œAre the means of 3 or more groups all the same?ā€

Prerequisites: Normality, equal variance. Use Kruskal-Wallis if violated.

# Tips: Compare total bill amounts by day of week
# Situation: Analyze if customers' average payment amounts differ by day of the week
thur = tips[tips['day'] == 'Thur']['total_bill']
fri = tips[tips['day'] == 'Fri']['total_bill']
sat = tips[tips['day'] == 'Sat']['total_bill']
sun = tips[tips['day'] == 'Sun']['total_bill']

stat, p_value = f_oneway(thur, fri, sat, sun)

print("=== One-Way ANOVA ===")
print(f"Hā‚€: Average payment amounts are the same for all days")
print(f"H₁: At least one day is different")
print(f"\nAverage payment by day:")
print(f"  Thursday: ${thur.mean():.2f} (n={len(thur)})")
print(f"  Friday: ${fri.mean():.2f} (n={len(fri)})")
print(f"  Saturday: ${sat.mean():.2f} (n={len(sat)})")
print(f"  Sunday: ${sun.mean():.2f} (n={len(sun)})")
print(f"\nF-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference by day → Post-hoc test needed' if p_value < 0.05 else 'No difference by day'}")
Output
=== One-Way ANOVA ===
Hā‚€: Average payment amounts are the same for all days
H₁: At least one day is different

Average payment by day:
Thursday: $17.68 (n=62)
Friday: $17.15 (n=19)
Saturday: $20.44 (n=87)
Sunday: $21.41 (n=76)

F-statistic: 2.7675
p-value: 0.0424

Conclusion: Significant difference by day → Post-hoc test needed
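Since the ANOVA is significant, a post-hoc test identifies which days differ. A minimal sketch using Tukey's HSD from statsmodels on the same data:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey's HSD compares every pair of days while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=tips['total_bill'], groups=tips['day'], alpha=0.05)
print(tukey.summary())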

Two-Way ANOVA

Example Use Cases

  • Marketing: Analyze the effect of ad channel (TV/Online) and target age group (20s/30s/40s) on purchase intent
  • Manufacturing: Effect of machine type and operator skill level on production volume
  • Education: Effect of teaching method and class size on learning outcomes
  • Medicine: Effect of drug type and dosage on treatment efficacy

Key Questions:

  1. Is there a main effect of Factor A?
  2. Is there a main effect of Factor B?
  3. Is there an interaction effect between A and B? (e.g., Is there an effect only in specific combinations?)
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Tips: Effect of gender and smoking status on tips
# Situation: Analyze if tip amounts differ by gender and smoking status, and if there's a combination effect
model = ols('tip ~ C(sex) + C(smoker) + C(sex):C(smoker)', data=tips).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("=== Two-Way ANOVA ===")
print(f"Dependent Variable: Tip amount")
print(f"Factor 1: Gender (sex)")
print(f"Factor 2: Smoking status (smoker)")
print(f"\n{anova_table.round(4)}")

print("\nInterpretation:")
for idx, row in anova_table.iterrows():
    if idx != 'Residual':
        sig = "Significant ***" if row['PR(>F)'] < 0.001 else \
              "Significant **" if row['PR(>F)'] < 0.01 else \
              "Significant *" if row['PR(>F)'] < 0.05 else "Not significant"
        print(f"  {idx}: p = {row['PR(>F)']:.4f} → {sig}")
Output
=== Two-Way ANOVA ===
Dependent Variable: Tip amount
Factor 1: Gender (sex)
Factor 2: Smoking status (smoker)

                     sum_sq     df         F    PR(>F)
C(sex)                  1.0554    1.0    0.5596    0.4551
C(smoker)               0.1477    1.0    0.0783    0.7799
C(sex):C(smoker)        0.2077    1.0    0.1101    0.7404
Residual              452.5604  240.0       NaN       NaN

Interpretation:
C(sex): p = 0.4551 → Not significant
C(smoker): p = 0.7799 → Not significant
C(sex):C(smoker): p = 0.7404 → Not significant

7. Chi-Square Tests

Test of Independence

Example Use Cases

  • Marketing: Analyze if age group (20s/30s/40s) and preferred brand (A/B/C) are associated
  • Medicine: Test if smoking status and lung cancer occurrence are associated
  • Education: Analyze if gender and major selection are associated
  • HR: Analyze if education level and turnover status are associated
  • Elections: Analyze if region and party support are associated

Key Question: ā€œAre two categorical variables independent or associated?ā€

# Titanic: Relationship between gender and survival
# Situation: Analyze if survival rates differed by gender during the Titanic sinking ("Women and children first" rule)
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
print("Contingency Table:")
print(contingency)
print()

chi2, p_value, dof, expected = chi2_contingency(contingency)

print("=== Chi-Square Test of Independence ===")
print(f"Hā‚€: Gender and survival are independent (no relationship)")
print(f"H₁: Gender and survival are associated")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.6f}")
print(f"\nExpected frequencies (if independent, these would be expected):")
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns).round(1))
print(f"\nConclusion: {'Gender and survival are strongly associated (higher female survival rate)' if p_value < 0.05 else 'Independent'}")
Output
Contingency Table:
survived    0    1
sex
female     81  233
male      468  109

=== Chi-Square Test of Independence ===
Hā‚€: Gender and survival are independent (no relationship)
H₁: Gender and survival are associated

Chi-square statistic: 260.7170
Degrees of freedom: 1
p-value: 0.000000

Expected frequencies (if independent, these would be expected):
survived      0      1
sex
female    193.5  120.5
male      355.5  221.5

Conclusion: Gender and survival are strongly associated (higher female survival rate)

Goodness of Fit Test

Example Use Cases

  • Quality Control: Test if defect occurrences are evenly distributed (1/5 each) across days of the week
  • Marketing: Check if customer distribution matches the expected ratio (40:35:25)
  • Genetics: Test if observed genotype ratios match Mendel’s laws (9:3:3:1)
  • Survey: Check if response distribution follows a uniform distribution

Key Question: ā€œDoes observed frequency match the expected theoretical distribution?ā€

# Titanic: Test if cabin class distribution is uniform
# Situation: Check if Titanic passengers were evenly distributed across 1st, 2nd, 3rd class
observed = titanic['pclass'].value_counts().sort_index()
n = len(titanic)
expected = np.array([n/3, n/3, n/3])  # Expected uniform distribution

chi2, p_value = stats.chisquare(observed, expected)

print("=== Chi-Square Goodness of Fit Test ===")
print(f"Hā‚€: Cabin classes are evenly distributed (each 33.3%)")
print(f"H₁: Not evenly distributed")
print(f"\nObserved frequencies:")
for cls, count in observed.items():
    print(f"  Class {cls}: {count} passengers ({count/n*100:.1f}%)")
print(f"\nExpected frequency (uniform distribution): {n/3:.0f} each (33.3%)")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Not uniform - 3rd class is the majority' if p_value < 0.05 else 'Uniform distribution'}")
Output
=== Chi-Square Goodness of Fit Test ===
Hā‚€: Cabin classes are evenly distributed (each 33.3%)
H₁: Not evenly distributed

Observed frequencies:
Class 1: 216 passengers (24.2%)
Class 2: 184 passengers (20.7%)
Class 3: 491 passengers (55.1%)

Expected frequency (uniform distribution): 297 each (33.3%)

Chi-square statistic: 110.8417
p-value: 0.000000

Conclusion: Not uniform - 3rd class is the majority

Fisher’s Exact Test

Example Use Cases

  • Clinical Trials: Test treatment effect in small-scale pilot studies (n < 20)
  • Rare Diseases: Association between rare disease occurrence and genetic mutations
  • Quality Control: Analysis of rare defect types where expected frequency is less than 5
  • Epidemiology: Association between infection status and specific behaviors in small groups

Key Question: ā€œAre two variables associated in a 2x2 contingency table when sample size is small?ā€

Alternative to chi-square test: Use Fisher’s Exact when any cell has expected frequency less than 5

# Small-scale clinical trial simulation
# Situation: Pilot study with 20 subjects. Compare cure rates between drug treatment group (10) and placebo group (10)
contingency = np.array([[8, 2],   # Treatment group: 8 cured, 2 not cured
                        [3, 7]])  # Control group: 3 cured, 7 not cured

odds_ratio, p_value = fisher_exact(contingency)

print("=== Fisher's Exact Test ===")
print("Situation: Small-scale clinical trial (n=20)")
print("\nContingency Table:")
print("           Cured  Not Cured")
print(f"Treatment    {contingency[0,0]}      {contingency[0,1]}")
print(f"Control      {contingency[1,0]}      {contingency[1,1]}")
print(f"\nTreatment cure rate: {contingency[0,0]/contingency[0].sum()*100:.0f}%")
print(f"Control cure rate: {contingency[1,0]/contingency[1].sum()*100:.0f}%")
print(f"\nOdds Ratio: {odds_ratio:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Treatment effect exists' if p_value < 0.05 else 'No treatment effect'}")
print(f"\nInterpretation: Treatment group has {odds_ratio:.1f}x higher cure odds than control group")
Output
=== Fisher's Exact Test ===
Situation: Small-scale clinical trial (n=20)

Contingency Table:
       Cured  Not Cured
Treatment     8      2
Control       3      7

Treatment cure rate: 80%
Control cure rate: 30%

Odds Ratio: 9.3333
p-value: 0.0350

Conclusion: Treatment effect exists

Interpretation: Treatment group has 9.3x higher cure odds than control group
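To decide between chi-square and Fisher's exact in practice, you can inspect the expected frequencies directly. A minimal sketch on the table above:

from scipy.stats import chi2_contingency, fisher_exact

chi2, p_chi2, dof, expected = chi2_contingency(contingency)
print("Expected frequencies:")
print(expected.round(2))
if (expected < 5).any():
    # At least one expected cell count is below 5 → the chi-square approximation is unreliable
    print("Some expected frequency < 5 → prefer Fisher's exact test")
    print(f"Fisher p-value: {fisher_exact(contingency)[1]:.4f}")
else:
    print(f"All expected frequencies >= 5 → chi-square is fine (p = {p_chi2:.4f})")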

McNemar’s Test

Example Use Cases

  • Marketing: Change in brand awareness of same customers before/after ad campaign (Aware→Aware, Unaware→Aware, etc.)
  • Medicine: Change in symptom presence of same patients before/after treatment
  • Politics: Change in party support of same voters before/after election
  • Education: Change in understanding of specific concepts of same students before/after class

Key Question: ā€œDid the categorical response of the same subjects change before/after?ā€

Use for paired samples + categorical data. Use Wilcoxon/paired t-test for continuous data.

from statsmodels.stats.contingency_tables import mcnemar

# Marketing campaign before/after purchase behavior change
# Situation: Track purchase status of 100 customers before and after campaign
# Rows: before (bought / didn't buy), Columns: after (bought / didn't buy)
table = np.array([[45, 15],  # Bought before: 45 also bought after, 15 didn't
                  [35, 5]])  # Didn't buy before: 35 bought after, 5 still didn't

result = mcnemar(table, exact=True)

print("=== McNemar's Test ===")
print("Situation: Purchase behavior change of 100 customers before/after campaign")
print("\nPaired Table:")
print("                   Bought After  Didn't Buy After")
print(f"Bought Before           {table[0,0]}             {table[0,1]}")
print(f"Didn't Buy Before       {table[1,0]}              {table[1,1]}")
print(f"\nChange Analysis:")
print(f"  New purchase (Didn't buy → Bought): {table[1,0]} people")
print(f"  Churned (Bought → Didn't buy): {table[0,1]} people")
print(f"  No change: {table[0,0] + table[1,1]} people")
print(f"\np-value: {result.pvalue:.4f}")
print(f"\nConclusion: {'Campaign significantly changed purchase behavior (New purchases > Churn)' if result.pvalue < 0.05 else 'No significant change'}")
Output
=== McNemar's Test ===
Situation: Purchase behavior change of 100 customers before/after campaign

Paired Table:
            Bought After  Didn't Buy After
Bought Before      45            15
Didn't Buy Before      35             5

Change Analysis:
New purchase (Didn't buy → Bought): 35 people
Churned (Bought → Didn't buy): 15 people
No change: 50 people

p-value: 0.0066

Conclusion: Campaign significantly changed purchase behavior (New purchases > Churn)

8. Correlation Analysis

Pearson Correlation Coefficient

Example Use Cases

  • Marketing: Analyze linear relationship between ad spending and sales
  • HR: Correlation between years of service and salary
  • Education: Relationship between study hours and test scores
  • Finance: Relationship between interest rates and stock prices

Key Question: ā€œIs there a linear relationship between two continuous variables?ā€

Prerequisites: Both variables follow normal distribution, linear relationship. Cannot detect non-linear relationships.

# Diamonds: Correlation between carat and price
# Situation: Analyze if there is a linear relationship between diamond carat (weight) and price
carat = diamonds['carat']
price = diamonds['price']

r, p_value = pearsonr(carat, price)

print("=== Pearson Correlation Analysis ===")
print(f"Hā‚€: There is no correlation between carat and price (r = 0)")
print(f"H₁: There is correlation between carat and price (r ≠ 0)")
print(f"\nPearson r: {r:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Coefficient of determination (R²): {r**2:.4f} → {r**2*100:.1f}% of price variation explained by carat")
print(f"\nCorrelation strength interpretation:")
print(f"  |r| < 0.3: Weak correlation")
print(f"  0.3 <= |r| < 0.7: Moderate correlation")
print(f"  |r| >= 0.7: Strong correlation")
print(f"\nCurrent |r| = {abs(r):.4f} → Strong positive correlation (carat↑ → price↑)")
Output
=== Pearson Correlation Analysis ===
Hā‚€: There is no correlation between carat and price (r = 0)
H₁: There is correlation between carat and price (r ≠ 0)

Pearson r: 0.9209
p-value: 0.000000
Coefficient of determination (R²): 0.8481 → 84.8% of price variation explained by carat

Correlation strength interpretation:
|r| < 0.3: Weak correlation
0.3 <= |r| < 0.7: Moderate correlation
|r| >= 0.7: Strong correlation

Current |r| = 0.9209 → Strong positive correlation (carat↑ → price↑)

Spearman Rank Correlation Coefficient

Example Use Cases

  • Survey: Relationship between satisfaction ranking and repurchase intent ranking
  • Economics: Relationship between GDP ranking and happiness index ranking
  • Education: Relationship between academic grade ranking and employment rate ranking
  • Sports: Relationship between salary ranking and performance ranking

Key Question: ā€œIs there a monotonic relationship between two variables?ā€

Non-parametric alternative to Pearson: doesn't require normality and can detect non-linear but monotonic relationships. Example: y = x² on x ≄ 0 is monotonic but non-linear, so Spearman captures it fully while Pearson understates it (see the sketch below).
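A quick synthetic check of that point (the exponential curve is an assumption for illustration): the relationship below is perfectly monotonic but far from linear, so Spearman's rho is exactly 1 while Pearson's r is noticeably lower.

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 100)
y = np.exp(x)  # strictly increasing, strongly non-linear
print(f"Pearson r:    {pearsonr(x, y)[0]:.4f}")   # well below 1
print(f"Spearman rho: {spearmanr(x, y)[0]:.4f}")  # exactly 1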

# Tips: Rank correlation between total bill and tip
# Situation: Check if higher bills tend to have higher tips (even if not exactly proportional)
total_bill = tips['total_bill']
tip = tips['tip']

rho, p_value = spearmanr(total_bill, tip)
r_pearson, _ = pearsonr(total_bill, tip)

print("=== Spearman Rank Correlation Analysis ===")
print(f"Hā‚€: No monotonic relationship between bill and tip")
print(f"H₁: Monotonic relationship exists between bill and tip")
print(f"\nSpearman rho: {rho:.4f}")
print(f"(Comparison) Pearson r: {r_pearson:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant monotonic relationship - higher bills tend to have higher tips' if p_value < 0.05 else 'No relationship'}")
Output
=== Spearman Rank Correlation Analysis ===
Hā‚€: No monotonic relationship between bill and tip
H₁: Monotonic relationship exists between bill and tip

Spearman rho: 0.8264
(Comparison) Pearson r: 0.6757
p-value: 0.000000

Conclusion: Significant monotonic relationship - higher bills tend to have higher tips

Kendall’s Tau

Example Use Cases

  • Rank Data: Agreement between two judges’ ranking evaluations (e.g., restaurant rankings)
  • Ordinal Scale: Relationship between education level (Elementary/Middle/High/College) and income level (Low/Medium/High)
  • When Many Ties Exist: Data with many same values like 5-point Likert scale surveys

Key Question: ā€œWhat is the concordance between two variables in rank data?ā€

More conservative than Spearman. More accurate when there are many ties.

# Titanic: Relationship between cabin class and age
# Situation: Check if 1st class passengers tend to be older (ordinal vs continuous)
pclass = titanic['pclass'].dropna()
age = titanic['age'].dropna()

# Align indices (age has missing values)
common_idx = pclass.index.intersection(age.index)
pclass_aligned = pclass.loc[common_idx]
age_aligned = age.loc[common_idx]

tau, p_value = kendalltau(pclass_aligned, age_aligned)

print("=== Kendall's Tau Correlation Analysis ===")
print(f"Hā‚€: No relationship between cabin class and age")
print(f"H₁: Relationship exists between cabin class and age")
print(f"\nKendall tau: {tau:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant relationship' if p_value < 0.05 else 'No relationship'}")
if tau < 0:
    print(f"Interpretation: tau < 0 means a lower pclass number (higher class) goes with higher age (older passengers in premium cabins)")
Output
=== Kendall's Tau Correlation Analysis ===
Hā‚€: No relationship between cabin class and age
H₁: Relationship exists between cabin class and age

Kendall tau: -0.1080
p-value: 0.0000

Conclusion: Significant relationship
Interpretation: tau < 0 means a lower pclass number (higher class) goes with higher age (older passengers in premium cabins)

9. Effect Size

Why is effect size important?

p-value only tells you ā€œIs there a difference?ā€ but not ā€œHow big is the difference?ā€ With large samples, even tiny differences can become significant.

Example: In an A/B test with 1 million people, a 0.01% difference in conversion rate can yield p < 0.05, but you need effect size to judge if this difference is actually meaningful for the business.
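A minimal simulation of this pitfall (the sample sizes and the tiny true effect, d ā‰ˆ 0.01, are assumptions for illustration):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
a = rng.normal(100.0, 10, 500_000)
b = rng.normal(100.1, 10, 500_000)  # true effect: d = 0.1 / 10 = 0.01
t, p = ttest_ind(a, b)
print(f"p-value: {p:.6f}")  # tiny p despite a negligible effect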

Cohen’s d

When to use: Interpret the practical meaning of mean difference between two groups

Interpretation guidelines (Cohen, 1988):

  • |d| < 0.2: Negligible
  • 0.2 ≤ |d| < 0.5: Small effect
  • 0.5 ≤ |d| < 0.8: Medium effect
  • |d| ≄ 0.8: Large effect
def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled_std

# Titanic: Effect size of age difference between survivors vs non-survivors
# Situation: Relationship between age and survival. Even if p-value is significant, is it a meaningful difference?
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()

d = cohens_d(survived_ages, died_ages)
t_stat, p_value = ttest_ind(survived_ages, died_ages)

print("=== Effect Size Analysis ===")
print(f"Survivor mean age: {survived_ages.mean():.2f} years")
print(f"Non-survivor mean age: {died_ages.mean():.2f} years")
print(f"Difference: {abs(survived_ages.mean() - died_ages.mean()):.2f} years")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f} → {'Significant' if p_value < 0.05 else 'Not significant'}")
print(f"Cohen's d: {d:.4f}")
print(f"\nEffect size interpretation (Cohen, 1988):")
print(f"  |d| < 0.2: Negligible")
print(f"  0.2 <= |d| < 0.5: Small effect")
print(f"  0.5 <= |d| < 0.8: Medium effect")
print(f"  |d| >= 0.8: Large effect")
print(f"\nCurrent: |d| = {abs(d):.4f} → ", end="")
if abs(d) >= 0.8:
    print("Large effect")
elif abs(d) >= 0.5:
    print("Medium effect")
elif abs(d) >= 0.2:
    print("Small effect")
else:
    print("Negligible → Statistically significant, but the practical meaning is limited!")
Output
=== Effect Size Analysis ===
Survivor mean age: 28.34 years
Non-survivor mean age: 30.63 years
Difference: 2.29 years

t-statistic: -2.0551
p-value: 0.0402 → Significant
Cohen's d: -0.1616

Effect size interpretation (Cohen, 1988):
|d| < 0.2: Negligible
0.2 <= |d| < 0.5: Small effect
0.5 <= |d| < 0.8: Medium effect
|d| >= 0.8: Large effect

Current: |d| = 0.1616 → Negligible → Statistically significant, but the practical meaning is limited!

Cramer’s V

When to use: Measure association strength between categorical variables (after chi-square test)

Interpretation guidelines:

  • V < 0.1: Negligible
  • 0.1 ≤ V < 0.3: Weak association
  • 0.3 ≤ V < 0.5: Moderate association
  • V ≄ 0.5: Strong association
def cramers_v(contingency_table):
    """Cramer's V computed from the chi-square statistic."""
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Titanic: Strength of association between gender and survival
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
v = cramers_v(contingency)
chi2, p_value, _, _ = chi2_contingency(contingency)

print("=== Cramer's V (Association Strength) ===")
print(f"Chi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"Cramer's V = {v:.4f}")
print(f"\nInterpretation guidelines:")
print(f"  V < 0.1: Negligible")
print(f"  0.1 <= V < 0.3: Weak association")
print(f"  0.3 <= V < 0.5: Moderate association")
print(f"  V >= 0.5: Strong association")
print(f"\nCurrent: V = {v:.4f} → ", end="")
if v >= 0.5:
    print("Strong association → Gender strongly affected survival!")
elif v >= 0.3:
    print("Moderate association")
elif v >= 0.1:
    print("Weak association")
else:
    print("Negligible")
Output
=== Cramer's V (Association Strength) ===
Chi-square = 260.7170, p-value = 0.000000
Cramer's V = 0.5410

Interpretation guidelines:
V < 0.1: Negligible
0.1 <= V < 0.3: Weak association
0.3 <= V < 0.5: Moderate association
V >= 0.5: Strong association

Current: V = 0.5410 → Strong association → Gender strongly affected survival!

10. Multiple Testing Correction

Why is correction needed?

When performing multiple tests simultaneously, Type I errors (false positives) accumulate.

Example: Performing 20 tests at α = 0.05

  • Probability of at least 1 false positive = 1 - (0.95)^20 ā‰ˆ 64%!

This is called the Multiple Comparison Problem.
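A quick simulation of this accumulation (all nulls are true by construction; the batch and test counts are assumptions for illustration):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_batches, n_tests = 2000, 20
false_positive_batches = 0
for _ in range(n_batches):
    # 20 t-tests where Hā‚€ is true in every one
    ps = [ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
          for _ in range(n_tests)]
    false_positive_batches += min(ps) < 0.05
print(f"Batches with >= 1 false positive: {false_positive_batches / n_batches:.2%}")
# roughly 64%, matching 1 - 0.95^20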

Bonferroni Correction

When to use:

  • Analyzing multiple A/B tests simultaneously
  • Post-hoc tests after ANOVA comparing multiple pairs
  • Genome studies testing thousands of genes

Method: Divide α by number of tests (k). Example: 5 tests → 0.05/5 = 0.01

Very conservative; may miss real effects (increased Type II error).

from statsmodels.stats.multitest import multipletests

# Multiple A/B test results
# Situation: Testing 5 UI elements simultaneously. Which ones have real effects?
p_values = [0.03, 0.04, 0.01, 0.08, 0.002]
test_names = ['Button Color', 'Headline', 'CTA Position', 'Image', 'Price Display']

# Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, method='bonferroni')

print("=== Multiple Testing Correction (Bonferroni) ===")
print(f"Number of tests: {len(p_values)}")
print(f"Corrected significance level: 0.05 / {len(p_values)} = {0.05/len(p_values):.3f}")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p, rejected):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {min(cp, 1.0):<18.4f} {result}")
Output
=== Multiple Testing Correction (Bonferroni) ===
Number of tests: 5
Corrected significance level: 0.05 / 5 = 0.010

Test            Original p-value   Corrected p-value  Conclusion
----------------------------------------------------------------------
Button Color    0.0300             0.1500             Not significant
Headline        0.0400             0.2000             Not significant
CTA Position    0.0100             0.0500             Not significant
Image           0.0800             0.4000             Not significant
Price Display   0.0020             0.0100             Significant

Benjamini-Hochberg (FDR)

When to use:

  • When performing many tests but can tolerate some false positives
  • Screening candidates in exploratory analysis
  • Genome studies where Bonferroni is too conservative

Method: controls the False Discovery Rate (FDR), i.e., keeps the expected proportion of false positives among the results declared significant at 5%.

# Benjamini-Hochberg correction
rejected_bh, corrected_p_bh, _, _ = multipletests(p_values, method='fdr_bh')

print("=== Multiple Testing Correction (Benjamini-Hochberg FDR) ===")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p_bh, rejected_bh):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {cp:<18.4f} {result}")

print(f"\nComparison:")
print(f"  Significant with Bonferroni: {sum(rejected)}")
print(f"  Significant with FDR(BH): {sum(rejected_bh)}")
print(f"\n→ FDR is less conservative, allowing more discoveries")
print(f"  However, about 5% of these may be false positives")
Output
=== Multiple Testing Correction (Benjamini-Hochberg FDR) ===

Test            Original p-value   Corrected p-value  Conclusion
----------------------------------------------------------------------
Button Color    0.0300             0.0500             Significant
Headline        0.0400             0.0500             Significant
CTA Position    0.0100             0.0250             Significant
Image           0.0800             0.0800             Not significant
Price Display   0.0020             0.0100             Significant

Comparison:
Significant with Bonferroni: 1
Significant with FDR(BH): 4

→ FDR is less conservative, allowing more discoveries
 However, about 5% of these may be false positives

11. Power Analysis

When to use?

Use at the experimental design stage to calculate ā€œHow many samples are needed?ā€

Too few samples → Cannot detect real effects even when they exist (Type II error)
Too many samples → Waste of resources

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Scenario: Effect size 0.3, Power 80%, Significance level 5%
# Situation: "Expecting the new UI to improve conversion rate moderately (d=0.3).
#             How many people are needed to detect this effect with 80% probability?"
effect_size = 0.3
alpha = 0.05
power = 0.8

n = power_analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power,
                               ratio=1.0, alternative='two-sided')

print("=== Sample Size Calculation ===")
print(f"Target effect size (Cohen's d): {effect_size} (small-to-medium effect)")
print(f"Significance level (α): {alpha}")
print(f"Target power (1-β): {power} (80% probability of detecting effect)")
print(f"\nRequired sample size: {n:.0f} per group")
print(f"Total required: {n*2:.0f} people")

# Required sample sizes for various effect sizes
print("\nRequired sample sizes by effect size (Power 80%):")
for es, desc in [(0.2, 'small effect'), (0.3, 'small-to-medium effect'),
                 (0.5, 'medium effect'), (0.8, 'large effect')]:
    n = power_analysis.solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1.0)
    print(f"  d = {es} ({desc}): {n:.0f} per group (total {n*2:.0f})")
Output
=== Sample Size Calculation ===
Target effect size (Cohen's d): 0.3 (small-to-medium effect)
Significance level (α): 0.05
Target power (1-β): 0.8 (80% probability of detecting effect)

Required sample size: 176 per group
Total required: 352 people

Required sample sizes by effect size (Power 80%):
d = 0.2 (small effect): 394 per group (total 787)
d = 0.3 (small-to-medium effect): 176 per group (total 352)
d = 0.5 (medium effect): 64 per group (total 128)
d = 0.8 (large effect): 26 per group (total 51)

12. Test Selection Summary

Test Selection by Data Type

Situation | Parametric Test | Non-parametric Test
1 sample mean vs reference value | One-sample t-test | Wilcoxon signed-rank
2 independent samples comparison | Independent t-test | Mann-Whitney U
2 paired samples comparison | Paired t-test | Wilcoxon signed-rank
3+ independent samples comparison | One-way ANOVA | Kruskal-Wallis H
2Ɨ2 categories (small sample) | - | Fisher's exact
Category independence | - | Chi-square
Paired category before/after comparison | - | McNemar's
Correlation | Pearson r | Spearman ρ, Kendall Ļ„

Quick Guide by Situation

Q: Comparing means of two groups?
ā”œā”€ā”€ Same subjects before/after? → Paired t-test (normal) / Wilcoxon (non-normal)
└── Different subjects? → Independent t-test (normal) / Mann-Whitney (non-normal)

Q: Comparing 3+ groups?
ā”œā”€ā”€ Normal distribution? → One-way ANOVA
└── Non-normal distribution? → Kruskal-Wallis

Q: Relationship between two categorical variables?
ā”œā”€ā”€ Any expected frequency < 5? → Fisher's exact
└── All expected frequencies >= 5? → Chi-square

Q: Relationship between two continuous variables?
ā”œā”€ā”€ Linear relationship? → Pearson r
└── Monotonic relationship? → Spearman ρ
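The two-group branch of this guide can be wrapped into a helper. A minimal sketch (the function name and the use of α = 0.05 for the assumption checks are illustrative choices, not a fixed rule):

from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

def compare_two_groups(a, b, alpha=0.05):
    """Pick and run a two-sample test following the guide above (illustrative)."""
    # Normality check on both groups decides parametric vs non-parametric
    normal = shapiro(a).pvalue >= alpha and shapiro(b).pvalue >= alpha
    if normal:
        # Equal-variance check decides Student vs Welch
        equal_var = levene(a, b).pvalue >= alpha
        name = "Student t-test" if equal_var else "Welch t-test"
        stat, p = ttest_ind(a, b, equal_var=equal_var)
    else:
        name = "Mann-Whitney U"
        stat, p = mannwhitneyu(a, b, alternative='two-sided')
    return name, stat, p

name, stat, p = compare_two_groups(tips[tips['smoker'] == 'Yes']['tip'],
                                   tips[tips['smoker'] == 'No']['tip'])
print(f"{name}: statistic = {stat:.4f}, p = {p:.4f}")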

Quiz

Problem 1

Perform an appropriate test to check if there is a significant difference in survival rate by cabin class (pclass) in the Titanic data.

Answer:

# Categorical vs Categorical → Chi-square test
contingency = pd.crosstab(titanic['pclass'], titanic['survived'])
chi2, p_value, dof, expected = chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"\nChi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"\nConclusion: {'Cabin class and survival rate are associated' if p_value < 0.05 else 'No association'}")

# Effect size (cramers_v defined in section 9)
v = cramers_v(contingency)
print(f"Cramer's V = {v:.4f} (moderate association)")

Problem 2

Use an appropriate test to check if the tip amount distribution differs between smokers and non-smokers in the Tips data.

Answer:

smoker_tip = tips[tips['smoker'] == 'Yes']['tip']
nonsmoker_tip = tips[tips['smoker'] == 'No']['tip']

# Normality test
_, p_smoker = shapiro(smoker_tip)
_, p_nonsmoker = shapiro(nonsmoker_tip)
print(f"Normality (smoker): p = {p_smoker:.4f}")
print(f"Normality (non-smoker): p = {p_nonsmoker:.4f}")

# Normality not satisfied → Use Mann-Whitney U
stat, p_value = mannwhitneyu(smoker_tip, nonsmoker_tip)
print(f"\nMann-Whitney U: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Tip distributions differ' if p_value < 0.05 else 'No difference in tip distributions'}")
