Hypothesis Testing
0. Setup
Practice hypothesis testing using various datasets.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import (
ttest_ind, ttest_rel, ttest_1samp,
mannwhitneyu, wilcoxon, kruskal,
chi2_contingency, fisher_exact,
f_oneway, shapiro, levene, bartlett,
spearmanr, pearsonr, kendalltau,
kstest, normaltest, anderson
)
import warnings
warnings.filterwarnings('ignore')
# 1. Titanic Dataset
titanic = sns.load_dataset('titanic')
print(f"Titanic: {titanic.shape}")
# 2. Iris Dataset
iris = sns.load_dataset('iris')
print(f"Iris: {iris.shape}")
# 3. Tips Dataset
tips = sns.load_dataset('tips')
print(f"Tips: {tips.shape}")
# 4. Diamonds Dataset (sampled)
diamonds = sns.load_dataset('diamonds').sample(n=1000, random_state=42)
print(f"Diamonds: {diamonds.shape}")Titanic: (891, 15) Iris: (150, 5) Tips: (244, 7) Diamonds: (1000, 10)
1. Fundamentals of Hypothesis Testing
Key Concepts
| Term | Description |
|---|---|
| Null Hypothesis (Hā) | āThere is no difference/effectā - Default assumption |
| Alternative Hypothesis (Hā) | āThere is a difference/effectā - The claim we seek evidence for |
| p-value | Probability of a result at least as extreme as the observed one, assuming Hā is true |
| Significance Level (α) | Threshold for rejecting Hā; usually 0.05 (5%) |
| Power | Probability of detecting an effect when it truly exists |
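Both the p-value and power rows can be made concrete with a tiny simulation. The sketch below uses synthetic data (not the datasets above): under Hā roughly α of tests come out significant (the Type I error rate), and under a true effect the significant fraction estimates power.
# Simulation sketch: Type I error rate under H0, power under H1 (synthetic data)
rng = np.random.default_rng(0)
n_sims, n = 2000, 50
# Under H0 (both groups share the same mean), ~5% of p-values fall below 0.05
p_null = [ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue for _ in range(n_sims)]
print(f"False positive rate under H0: {np.mean(np.array(p_null) < 0.05):.3f}")
# Under H1 (true difference of 0.5 SD), the significant fraction estimates power
p_alt = [ttest_ind(rng.normal(0.5, 1, n), rng.normal(0, 1, n)).pvalue for _ in range(n_sims)]
print(f"Estimated power (d=0.5, n=50 per group): {np.mean(np.array(p_alt) < 0.05):.3f}")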
Test Selection Guide
Data Type?
āāā Continuous
ā āāā Normal distribution ā Parametric tests (t-test, ANOVA)
ā āāā Non-normal distribution ā Non-parametric tests (Mann-Whitney, Kruskal-Wallis)
ā
āāā Categorical
āāā 2Ć2 table (small sample) ā Fisher's Exact Test
āāā Otherwise ā Chi-Square Test
2. Normality Tests
When to use?
Use before performing parametric tests like t-tests or ANOVA to check if your data follows a normal distribution. If normality is not satisfied, use non-parametric tests (Mann-Whitney, Kruskal-Wallis, etc.).
Shapiro-Wilk Test
Example Use Cases
- Check if blood pressure measurements in a clinical trial follow normal distribution
- Verify distribution of user session duration before A/B testing
- Quality control: Check if product weight data is normally distributed
Characteristics: Generally the most powerful normality test. Recommended for n < 5000.
# Titanic: Normality test for age distribution
ages = titanic['age'].dropna()
stat, p_value = shapiro(ages)
print("=== Shapiro-Wilk Normality Test ===")
print(f"Data: Titanic passenger ages (n={len(ages)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(ages, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
stats.probplot(ages, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')
plt.tight_layout()
plt.show()
=== Shapiro-Wilk Normality Test ===
Data: Titanic passenger ages (n=714)
Statistic: 0.9816
p-value: 0.0000
Conclusion: Does not follow normal distribution
DāAgostino-Pearson Test
Example Use Cases
- Check if financial return data follows normal distribution (when skewness/kurtosis matters)
- Test distribution shape of survey scores
- Check asymmetry of exam score distribution
Characteristics: Considers both skewness and kurtosis. Useful when distribution shape matters. Can be used when n >= 20.
# Iris: Normality test for petal length
petal_length = iris['petal_length']
stat, p_value = normaltest(petal_length)
print("=== D'Agostino-Pearson Normality Test ===")
print(f"Data: Iris petal length (n={len(petal_length)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")=== D'Agostino-Pearson Normality Test === Data: Iris petal length (n=150) Statistic: 31.5324 p-value: 0.0000 Conclusion: Does not follow normal distribution
Kolmogorov-Smirnov Test
Example Use Cases
- Normality testing for large datasets (n > 5000)
- Compare with other theoretical distributions (exponential, uniform, etc.) besides normal
- Compare if two samples have identical distributions (2-sample KS test)
Characteristics: Less sensitive to sample size, suitable for large datasets. Can compare with various distributions. Caveat: estimating the mean and standard deviation from the sample itself (as done below) makes the standard KS p-value only approximate; the Lilliefors correction exists for exactly this case.
# Tips: Normality test for tip amounts
tip_values = tips['tip']
# Test after standardization
tip_standardized = (tip_values - tip_values.mean()) / tip_values.std()
stat, p_value = kstest(tip_standardized, 'norm')
print("=== Kolmogorov-Smirnov Normality Test ===")
print(f"Data: Tips tip amounts (n={len(tip_values)})")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Follows normal distribution' if p_value >= 0.05 else 'Does not follow normal distribution'}")=== Kolmogorov-Smirnov Normality Test === Data: Tips tip amounts (n=244) Statistic: 0.0975 p-value: 0.0186 Conclusion: Does not follow normal distribution
Anderson-Darling Test
Example Use Cases
- Normality check in analyses where extreme values (outliers) are important (risk management, insurance, etc.)
- Check how much the distribution tails differ from normal distribution
- When decisions are needed at multiple significance levels simultaneously
Characteristics: More sensitive to distribution tails. Preferred in finance/insurance where extreme values matter.
# Diamonds: Normality test for prices
prices = diamonds['price']
result = anderson(prices, dist='norm')
print("=== Anderson-Darling Normality Test ===")
print(f"Data: Diamonds prices (n={len(prices)})")
print(f"Statistic: {result.statistic:.4f}")
print("\nCritical values by significance level:")
for cv, sl in zip(result.critical_values, result.significance_level):
    result_str = "Reject" if result.statistic > cv else "Fail to reject"
    print(f" {sl}%: Critical value = {cv:.4f} ā Hā {result_str}")
=== Anderson-Darling Normality Test ===
Data: Diamonds prices (n=1000)
Statistic: 47.8932
Critical values by significance level:
 15.0%: Critical value = 0.5740 ā Hā Reject
 10.0%: Critical value = 0.6540 ā Hā Reject
 5.0%: Critical value = 0.7850 ā Hā Reject
 2.5%: Critical value = 0.9150 ā Hā Reject
 1.0%: Critical value = 1.0890 ā Hā Reject
3. Homogeneity of Variance Tests
When to use?
Use before performing independent samples t-tests or ANOVA to check if the groups being compared have equal variances. If the equal variance assumption is violated, use Welchās t-test or Games-Howell post-hoc test.
Leveneās Test
Example Use Cases
- Check if purchase amount variance is equal between treatment and control groups in A/B testing
- Test if exam score variance is equal between male and female groups
- Check if quality dispersion is equal across products from multiple factories
Characteristics: Robust as it doesnāt require normality assumption. Can be used with non-normal data.
# Titanic: Compare age variance by survival status
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()
stat, p_value = levene(survived_ages, died_ages)
print("=== Levene's Test for Equal Variances ===")
print(f"Survived age variance: {survived_ages.var():.2f}")
print(f"Died age variance: {died_ages.var():.2f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")=== Levene's Test for Equal Variances === Survived age variance: 207.03 Died age variance: 199.41 Statistic: 0.1557 p-value: 0.6933 Conclusion: Equal variance assumption satisfied
Bartlettās Test
Example Use Cases
- Compare variance between groups in normally distributed data
- Compare response variability across multiple dose groups in clinical trials
- Test variance equality across multiple production lines in quality control
Characteristics: Most powerful equal variance test when data follows normal distribution. Use Leveneās if normality is violated.
# Iris: Compare sepal length variance by species
setosa = iris[iris['species'] == 'setosa']['sepal_length']
versicolor = iris[iris['species'] == 'versicolor']['sepal_length']
virginica = iris[iris['species'] == 'virginica']['sepal_length']
stat, p_value = bartlett(setosa, versicolor, virginica)
print("=== Bartlett's Test for Equal Variances ===")
print(f"Setosa variance: {setosa.var():.4f}")
print(f"Versicolor variance: {versicolor.var():.4f}")
print(f"Virginica variance: {virginica.var():.4f}")
print(f"Statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Equal variance assumption satisfied' if p_value >= 0.05 else 'Equal variance assumption not satisfied'}")=== Bartlett's Test for Equal Variances === Setosa variance: 0.1242 Versicolor variance: 0.2664 Virginica variance: 0.4043 Statistic: 16.0057 p-value: 0.0003 Conclusion: Equal variance assumption not satisfied
4. T-Tests
One-Sample T-Test
Example Use Cases
- Quality Control: Verify if the mean lifespan of batteries produced at a factory equals the nominal lifespan of 1000 hours
- Marketing: Check if average customer satisfaction has reached the target of 4.0 points
- Education: Test if studentsā average scores differ from the national average of 75 points
- Service: Check if average response time is within the SLA standard of 3 seconds
Key Question: āIs our sample mean equal to/different from a specific reference value?ā
# Tips: Test if mean tip is $3
# Situation: Restaurant manager claims "Our average tip is $3". Verify if this is true.
tip_values = tips['tip']
hypothesized_mean = 3.0
stat, p_value = ttest_1samp(tip_values, hypothesized_mean)
print("=== One-Sample t-test ===")
print(f"Hā: Mean tip = ${hypothesized_mean:.2f}")
print(f"Hā: Mean tip ā ${hypothesized_mean:.2f}")
print(f"\nSample mean: ${tip_values.mean():.2f}")
print(f"Sample standard deviation: ${tip_values.std():.2f}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Mean tip differs from $3' if p_value < 0.05 else 'Mean tip can be considered equal to $3'}")=== One-Sample t-test === Hā: Mean tip = $3.00 Hā: Mean tip ā $3.00 Sample mean: $3.00 Sample standard deviation: $1.38 t-statistic: -0.0363 p-value: 0.9711 Conclusion: Mean tip can be considered equal to $3
Independent Samples T-Test
Example Use Cases
- A/B Testing: Verify if the new website design (B) has a higher conversion rate than the old design (A)
- Medicine: Compare blood pressure changes between drug treatment group and placebo group
- Education: Compare grade differences between online and offline classes
- HR: Analyze productivity differences between remote workers and office workers
- Marketing: Compare average purchase amounts between male and female customers
Key Question: āAre the means of two different groups equal/different?ā
Note: The two groups must be independent (same person cannot belong to both groups)
# Titanic: Compare fare by gender
# Situation: Analyze if there were differences in fares paid by males and females on the Titanic
male_fare = titanic[titanic['sex'] == 'male']['fare'].dropna()
female_fare = titanic[titanic['sex'] == 'female']['fare'].dropna()
# Test equal variance assumption
_, levene_p = levene(male_fare, female_fare)
equal_var = levene_p >= 0.05
# t-test (Welch's t-test if unequal variance)
stat, p_value = ttest_ind(male_fare, female_fare, equal_var=equal_var)
print("=== Independent Samples t-test ===")
print(f"Hā: Male mean fare = Female mean fare")
print(f"Hā: Male mean fare ā Female mean fare")
print(f"\nMale mean fare: ${male_fare.mean():.2f} (n={len(male_fare)})")
print(f"Female mean fare: ${female_fare.mean():.2f} (n={len(female_fare)})")
print(f"Difference: ${female_fare.mean() - male_fare.mean():.2f}")
print(f"\nEqual variance assumption: {'Satisfied (Student t-test)' if equal_var else 'Not satisfied (Welch t-test used)'}")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference in fare by gender' if p_value < 0.05 else 'No difference in fare by gender'}")=== Independent Samples t-test === Hā: Male mean fare = Female mean fare Hā: Male mean fare ā Female mean fare Male mean fare: $25.52 (n=577) Female mean fare: $44.48 (n=314) Difference: $18.95 Equal variance assumption: Not satisfied (Welch t-test used) t-statistic: -4.7994 p-value: 0.0000 Conclusion: Significant difference in fare by gender
Paired Samples T-Test
Example Use Cases
- Diet Effect: Compare weight before and after diet for the same people
- Educational Effect: Compare test scores before and after class for the same students
- Drug Effect: Compare blood pressure before and after medication for the same patients
- UX Improvement: Compare task completion time when the same users use old version/new version app
- Marketing: Compare sales before and after promotion for the same stores
Key Question: āDid the before and after values of the same subjects change?ā
Difference from independent samples: Paired samples measure the same subject twice, so individual differences can be controlled, making it more sensitive to detecting changes
# Simulation: A/B test conversion rate (same users experience both UIs)
# Situation: Show 100 users old UI and new UI sequentially and measure click rates
np.random.seed(42)
n_users = 100
# Before: Old UI click rate
before = np.random.beta(2, 8, n_users) # Mean about 20%
# After: New UI click rate (slight improvement)
after = before + np.random.normal(0.05, 0.03, n_users)
after = np.clip(after, 0, 1)
stat, p_value = ttest_rel(before, after)
print("=== Paired Samples t-test ===")
print(f"Hā: No change in conversion rate (new UI has no effect)")
print(f"Hā: Change in conversion rate (new UI has effect)")
print(f"\nBefore (old UI) mean: {before.mean():.4f} ({before.mean()*100:.1f}%)")
print(f"After (new UI) mean: {after.mean():.4f} ({after.mean()*100:.1f}%)")
print(f"Mean difference: {(after - before).mean():.4f} (+{(after - before).mean()*100:.1f}%p)")
print(f"\nt-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'New UI significantly improved' if p_value < 0.05 else 'No significant change'}")=== Paired Samples t-test === Hā: No change in conversion rate (new UI has no effect) Hā: Change in conversion rate (new UI has effect) Before (old UI) mean: 0.2037 (20.4%) After (new UI) mean: 0.2503 (25.0%) Mean difference: 0.0466 (+4.7%p) t-statistic: -14.7221 p-value: 0.0000 Conclusion: New UI significantly improved
5. Non-parametric Tests
When to use?
- When data does not follow normal distribution
- When sample size is small (n < 30)
- When data is rank data or ordinal scale
- When there are many outliers (non-parametric tests are less sensitive to outliers)
Mann-Whitney U Test
Example Use Cases
- E-commerce: Compare order amount distribution between premium and regular members (amount data usually has a non-normal right-skewed distribution)
- Gaming: Compare play time between paying and non-paying users
- Medicine: Compare pain scale (1-10) between two treatments
- Satisfaction: Compare customer rating distributions between two products
Key Question: āAre the distributions (ranks) of two independent groups different?ā
Non-parametric alternative to independent samples t-test. Think of it as comparing median/rank rather than mean.
# Diamonds: Compare prices by cut quality (Ideal vs Good)
# Situation: Compare price distributions of diamonds with Ideal vs Good cut quality
# Price data generally doesn't follow normal distribution, so use non-parametric test
ideal_price = diamonds[diamonds['cut'] == 'Ideal']['price']
good_price = diamonds[diamonds['cut'] == 'Good']['price']
stat, p_value = mannwhitneyu(ideal_price, good_price, alternative='two-sided')
print("=== Mann-Whitney U Test ===")
print(f"Hā: Ideal and Good cut have the same price distribution")
print(f"Hā: Price distributions are different")
print(f"\nIdeal median: ${ideal_price.median():,.2f} (n={len(ideal_price)})")
print(f"Good median: ${good_price.median():,.2f} (n={len(good_price)})")
print(f"\nU-statistic: {stat:,.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Price distributions significantly different' if p_value < 0.05 else 'No difference in price distributions'}")=== Mann-Whitney U Test === Hā: Ideal and Good cut have the same price distribution Hā: Price distributions are different Ideal median: $1,810.00 (n=393) Good median: $3,086.50 (n=96) U-statistic: 14,985.00 p-value: 0.0031 Conclusion: Price distributions significantly different
Wilcoxon Signed-Rank Test
Example Use Cases
- Weight Loss: Compare weight before and after diet (when weight changes are not normally distributed)
- Survey: Compare satisfaction scores (1-5) before and after policy change for the same respondents
- Education: Compare confidence scores before and after special lecture for the same students
- App Ratings: Compare rating changes from the same users before and after update
Key Question: āDid the before and after value distribution of the same subjects change?ā
Non-parametric alternative to paired samples t-test. Uses sign and rank of differences.
# Tips: Compare lunch vs dinner tip rates (same waiters)
# Situation: Test if 50 waiters receive different tip rates at lunch vs dinner
np.random.seed(42)
n_waiters = 50
lunch_tip_rate = np.random.uniform(0.12, 0.22, n_waiters)
dinner_tip_rate = lunch_tip_rate + np.random.normal(0.02, 0.03, n_waiters)
stat, p_value = wilcoxon(lunch_tip_rate, dinner_tip_rate)
print("=== Wilcoxon Signed-Rank Test ===")
print(f"Hā: No difference between lunch and dinner tip rates")
print(f"Hā: Difference between lunch and dinner tip rates")
print(f"\nLunch tip rate median: {np.median(lunch_tip_rate)*100:.1f}%")
print(f"Dinner tip rate median: {np.median(dinner_tip_rate)*100:.1f}%")
print(f"\nW-statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Dinner tip rate significantly higher' if p_value < 0.05 else 'No difference'}")=== Wilcoxon Signed-Rank Test === Hā: No difference between lunch and dinner tip rates Hā: Difference between lunch and dinner tip rates Lunch tip rate median: 16.9% Dinner tip rate median: 19.1% W-statistic: 304.00 p-value: 0.0004 Conclusion: Dinner tip rate significantly higher
Kruskal-Wallis H Test
Example Use Cases
- Marketing: Compare brand awareness scores across 3 ad types (TV, Online, SNS)
- Products: Compare customer satisfaction across multiple smartphone brands
- Education: Compare student grade distributions across 3 schools
- Medicine: Compare recovery periods across multiple treatments (non-normal data)
Key Question: āAre the distributions of 3 or more groups all the same, or is at least one different?ā
Non-parametric alternative to One-Way ANOVA. Post-hoc tests needed to determine which groups differ.
# Iris: Compare petal width by species
# Situation: Test if petal width distributions differ among three iris species (setosa, versicolor, virginica)
setosa_pw = iris[iris['species'] == 'setosa']['petal_width']
versicolor_pw = iris[iris['species'] == 'versicolor']['petal_width']
virginica_pw = iris[iris['species'] == 'virginica']['petal_width']
stat, p_value = kruskal(setosa_pw, versicolor_pw, virginica_pw)
print("=== Kruskal-Wallis H Test ===")
print(f"Hā: All species have the same petal width distribution")
print(f"Hā: At least one species is different")
print(f"\nSetosa median: {setosa_pw.median():.2f}cm")
print(f"Versicolor median: {versicolor_pw.median():.2f}cm")
print(f"Virginica median: {virginica_pw.median():.2f}cm")
print(f"\nH-statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant difference among species' if p_value < 0.05 else 'No difference'}")=== Kruskal-Wallis H Test === Hā: All species have the same petal width distribution Hā: At least one species is different Setosa median: 0.20cm Versicolor median: 1.30cm Virginica median: 2.00cm H-statistic: 130.0111 p-value: 0.000000 Conclusion: Significant difference among species
6. Analysis of Variance (ANOVA)
One-Way ANOVA
Example Use Cases
- Marketing: Compare average purchase amount across 4 promotion types (discount, points, gifts, free shipping)
- Manufacturing: Compare average quality scores of products from 3 factories
- HR: Compare employee satisfaction across departments (Development, Marketing, Sales, Support)
- Education: Compare learning effects across multiple teaching methods
Key Question: āAre the means of 3 or more groups all the same?ā
Prerequisites: Normality, equal variance. Use Kruskal-Wallis if violated.
# Tips: Compare total bill amounts by day of week
# Situation: Analyze if customers' average payment amounts differ by day of the week
thur = tips[tips['day'] == 'Thur']['total_bill']
fri = tips[tips['day'] == 'Fri']['total_bill']
sat = tips[tips['day'] == 'Sat']['total_bill']
sun = tips[tips['day'] == 'Sun']['total_bill']
stat, p_value = f_oneway(thur, fri, sat, sun)
print("=== One-Way ANOVA ===")
print(f"Hā: Average payment amounts are the same for all days")
print(f"Hā: At least one day is different")
print(f"\nAverage payment by day:")
print(f" Thursday: ${thur.mean():.2f} (n={len(thur)})")
print(f" Friday: ${fri.mean():.2f} (n={len(fri)})")
print(f" Saturday: ${sat.mean():.2f} (n={len(sat)})")
print(f" Sunday: ${sun.mean():.2f} (n={len(sun)})")
print(f"\nF-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant difference by day ā Post-hoc test needed' if p_value < 0.05 else 'No difference by day'}")=== One-Way ANOVA === Hā: Average payment amounts are the same for all days Hā: At least one day is different Average payment by day: Thursday: $17.68 (n=62) Friday: $17.15 (n=19) Saturday: $20.44 (n=87) Sunday: $21.41 (n=76) F-statistic: 2.7675 p-value: 0.0424 Conclusion: Significant difference by day ā Post-hoc test needed
Two-Way ANOVA
Example Use Cases
- Marketing: Analyze the effect of ad channel (TV/Online) and target age group (20s/30s/40s) on purchase intent
- Manufacturing: Effect of machine type and operator skill level on production volume
- Education: Effect of teaching method and class size on learning outcomes
- Medicine: Effect of drug type and dosage on treatment efficacy
Key Questions:
- Is there a main effect of Factor A?
- Is there a main effect of Factor B?
- Is there an interaction effect between A and B? (e.g., Is there an effect only in specific combinations?)
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Tips: Effect of gender and smoking status on tips
# Situation: Analyze if tip amounts differ by gender and smoking status, and if there's a combination effect
model = ols('tip ~ C(sex) + C(smoker) + C(sex):C(smoker)', data=tips).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("=== Two-Way ANOVA ===")
print(f"Dependent Variable: Tip amount")
print(f"Factor 1: Gender (sex)")
print(f"Factor 2: Smoking status (smoker)")
print(f"\n{anova_table.round(4)}")
print("\nInterpretation:")
for idx, row in anova_table.iterrows():
    if idx != 'Residual':
        sig = "Significant ***" if row['PR(>F)'] < 0.001 else \
              "Significant **" if row['PR(>F)'] < 0.01 else \
              "Significant *" if row['PR(>F)'] < 0.05 else "Not significant"
        print(f" {idx}: p = {row['PR(>F)']:.4f} ā {sig}")
=== Two-Way ANOVA ===
Dependent Variable: Tip amount
Factor 1: Gender (sex)
Factor 2: Smoking status (smoker)
sum_sq df F PR(>F)
C(sex) 1.0554 1.0 0.5596 0.4551
C(smoker) 0.1477 1.0 0.0783 0.7799
C(sex):C(smoker) 0.2077 1.0 0.1101 0.7404
Residual 452.5604 240.0 NaN NaN
Interpretation:
C(sex): p = 0.4551 ā Not significant
C(smoker): p = 0.7799 ā Not significant
C(sex):C(smoker): p = 0.7404 ā Not significant
7. Chi-Square Tests
Test of Independence
Example Use Cases
- Marketing: Analyze if age group (20s/30s/40s) and preferred brand (A/B/C) are associated
- Medicine: Test if smoking status and lung cancer occurrence are associated
- Education: Analyze if gender and major selection are associated
- HR: Analyze if education level and turnover status are associated
- Elections: Analyze if region and party support are associated
Key Question: āAre two categorical variables independent or associated?ā
# Titanic: Relationship between gender and survival
# Situation: Analyze if survival rates differed by gender during the Titanic sinking ("Women and children first" rule)
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
print("Contingency Table:")
print(contingency)
print()
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("=== Chi-Square Test of Independence ===")
print(f"Hā: Gender and survival are independent (no relationship)")
print(f"Hā: Gender and survival are associated")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.6f}")
print(f"\nExpected frequencies (if independent, these would be expected):")
print(pd.DataFrame(expected,
index=contingency.index,
columns=contingency.columns).round(1))
print(f"\nConclusion: {'Gender and survival are strongly associated (higher female survival rate)' if p_value < 0.05 else 'Independent'}")Contingency Table: survived 0 1 sex female 81 233 male 468 109 === Chi-Square Test of Independence === Hā: Gender and survival are independent (no relationship) Hā: Gender and survival are associated Chi-square statistic: 260.7170 Degrees of freedom: 1 p-value: 0.000000 Expected frequencies (if independent, these would be expected): survived 0 1 sex female 193.5 120.5 male 355.5 221.5 Conclusion: Gender and survival are strongly associated (higher female survival rate)
Goodness of Fit Test
Example Use Cases
- Quality Control: Test if defect occurrences are evenly distributed (1/5 each) across days of the week
- Marketing: Check if customer distribution matches the expected ratio (40:35:25)
- Genetics: Test if observed genotype ratios match Mendelās laws (9:3:3:1)
- Survey: Check if response distribution follows a uniform distribution
Key Question: āDoes observed frequency match the expected theoretical distribution?ā
# Titanic: Test if cabin class distribution is uniform
# Situation: Check if Titanic passengers were evenly distributed across 1st, 2nd, 3rd class
observed = titanic['pclass'].value_counts().sort_index()
n = len(titanic)
expected = np.array([n/3, n/3, n/3]) # Expected uniform distribution
chi2, p_value = stats.chisquare(observed, expected)
print("=== Chi-Square Goodness of Fit Test ===")
print(f"Hā: Cabin classes are evenly distributed (each 33.3%)")
print(f"Hā: Not evenly distributed")
print(f"\nObserved frequencies:")
for cls, count in observed.items():
    print(f" Class {cls}: {count} passengers ({count/n*100:.1f}%)")
print(f"\nExpected frequency (uniform distribution): {n/3:.0f} each (33.3%)")
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Not uniform - 3rd class is the majority' if p_value < 0.05 else 'Uniform distribution'}")=== Chi-Square Goodness of Fit Test === Hā: Cabin classes are evenly distributed (each 33.3%) Hā: Not evenly distributed Observed frequencies: Class 1: 216 passengers (24.2%) Class 2: 184 passengers (20.7%) Class 3: 491 passengers (55.1%) Expected frequency (uniform distribution): 297 each (33.3%) Chi-square statistic: 110.8417 p-value: 0.000000 Conclusion: Not uniform - 3rd class is the majority
Fisherās Exact Test
Example Use Cases
- Clinical Trials: Test treatment effect in small-scale pilot studies (n < 20)
- Rare Diseases: Association between rare disease occurrence and genetic mutations
- Quality Control: Analysis of rare defect types where expected frequency is less than 5
- Epidemiology: Association between infection status and specific behaviors in small groups
Key Question: āAre two variables associated in a 2x2 contingency table when sample size is small?ā
Alternative to chi-square test: Use Fisherās Exact when any cell has expected frequency less than 5
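That rule can be checked mechanically. A minimal helper (an assumption of this sketch, not part of scipy) that inspects the expected frequencies returned by chi2_contingency:
def needs_fisher(contingency_table):
    # chi2_contingency returns (chi2, p, dof, expected); index 3 is the expected table
    expected = chi2_contingency(contingency_table)[3]
    return (expected < 5).any()
print(needs_fisher(pd.crosstab(titanic['sex'], titanic['survived'])))  # False: all cells large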
# Small-scale clinical trial simulation
# Situation: Pilot study with 20 subjects. Compare cure rates between drug treatment group (10) and placebo group (10)
contingency = np.array([[8, 2], # Treatment group: 8 cured, 2 not cured
[3, 7]]) # Control group: 3 cured, 7 not cured
odds_ratio, p_value = fisher_exact(contingency)
print("=== Fisher's Exact Test ===")
print("Situation: Small-scale clinical trial (n=20)")
print("\nContingency Table:")
print(" Cured Not Cured")
print(f"Treatment {contingency[0,0]} {contingency[0,1]}")
print(f"Control {contingency[1,0]} {contingency[1,1]}")
print(f"\nTreatment cure rate: {contingency[0,0]/contingency[0].sum()*100:.0f}%")
print(f"Control cure rate: {contingency[1,0]/contingency[1].sum()*100:.0f}%")
print(f"\nOdds Ratio: {odds_ratio:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Treatment effect exists' if p_value < 0.05 else 'No treatment effect'}")
print(f"\nInterpretation: Treatment group has {odds_ratio:.1f}x higher cure odds than control group")=== Fisher's Exact Test ===
Situation: Small-scale clinical trial (n=20)
Contingency Table:
Cured Not Cured
Treatment 8 2
Control 3 7
Treatment cure rate: 80%
Control cure rate: 30%
Odds Ratio: 9.3333
p-value: 0.0350
Conclusion: Treatment effect exists
Interpretation: Treatment group has 9.3x higher cure odds than control group
McNemar's Test
Example Use Cases
- Marketing: Change in brand awareness of same customers before/after ad campaign (AwareāAware, UnawareāAware, etc.)
- Medicine: Change in symptom presence of same patients before/after treatment
- Politics: Change in party support of same voters before/after election
- Education: Change in understanding of specific concepts of same students before/after class
Key Question: āDid the categorical response of the same subjects change before/after?ā
Use for paired samples + categorical data. Use Wilcoxon/paired t-test for continuous data.
from statsmodels.stats.contingency_tables import mcnemar
# Marketing campaign before/after purchase behavior change
# Situation: Track purchase status of 100 customers before and after campaign
# [Bought before/Bought after, Bought before/Didn't buy after]
# [Didn't buy before/Bought after, Didn't buy before/Didn't buy after]
table = np.array([[45, 15], # Bought before & after / Bought before, didn't buy after
[35, 5]]) # Didn't buy before, bought after / Didn't buy before & after
result = mcnemar(table, exact=True)
print("=== McNemar's Test ===")
print("Situation: Purchase behavior change of 100 customers before/after campaign")
print("\nPaired Table:")
print(" Bought After Didn't Buy After")
print(f"Bought Before {table[0,0]} {table[0,1]}")
print(f"Didn't Buy Before {table[1,0]} {table[1,1]}")
print(f"\nChange Analysis:")
print(f" New purchase (Didn't buy ā Bought): {table[1,0]} people")
print(f" Churned (Bought ā Didn't buy): {table[0,1]} people")
print(f" No change: {table[0,0] + table[1,1]} people")
print(f"\np-value: {result.pvalue:.4f}")
print(f"\nConclusion: {'Campaign significantly changed purchase behavior (New purchases > Churn)' if result.pvalue < 0.05 else 'No significant change'}")=== McNemar's Test ===
Situation: Purchase behavior change of 100 customers before/after campaign
Paired Table:
Bought After Didn't Buy After
Bought Before 45 15
Didn't Buy Before 35 5
Change Analysis:
New purchase (Didn't buy ā Bought): 35 people
Churned (Bought ā Didn't buy): 15 people
No change: 50 people
p-value: 0.0066
Conclusion: Campaign significantly changed purchase behavior (New purchases > Churn)
8. Correlation Analysis
Pearson Correlation Coefficient
Example Use Cases
- Marketing: Analyze linear relationship between ad spending and sales
- HR: Correlation between years of service and salary
- Education: Relationship between study hours and test scores
- Finance: Relationship between interest rates and stock prices
Key Question: āIs there a linear relationship between two continuous variables?ā
Prerequisites: Both variables follow normal distribution, linear relationship. Cannot detect non-linear relationships.
# Diamonds: Correlation between carat and price
# Situation: Analyze if there is a linear relationship between diamond carat (weight) and price
carat = diamonds['carat']
price = diamonds['price']
r, p_value = pearsonr(carat, price)
print("=== Pearson Correlation Analysis ===")
print(f"Hā: There is no correlation between carat and price (r = 0)")
print(f"Hā: There is correlation between carat and price (r ā 0)")
print(f"\nPearson r: {r:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Coefficient of determination (R²): {r**2:.4f} ā {r**2*100:.1f}% of price variation explained by carat")
print(f"\nCorrelation strength interpretation:")
print(f" |r| < 0.3: Weak correlation")
print(f" 0.3 <= |r| < 0.7: Moderate correlation")
print(f" |r| >= 0.7: Strong correlation")
print(f"\nCurrent |r| = {abs(r):.4f} ā Strong positive correlation (caratā ā priceā)")=== Pearson Correlation Analysis === Hā: There is no correlation between carat and price (r = 0) Hā: There is correlation between carat and price (r ā 0) Pearson r: 0.9209 p-value: 0.000000 Coefficient of determination (R²): 0.8481 ā 84.8% of price variation explained by carat Correlation strength interpretation: |r| < 0.3: Weak correlation 0.3 <= |r| < 0.7: Moderate correlation |r| >= 0.7: Strong correlation Current |r| = 0.9209 ā Strong positive correlation (caratā ā priceā)
Spearman Rank Correlation Coefficient
Example Use Cases
- Survey: Relationship between satisfaction ranking and repurchase intent ranking
- Economics: Relationship between GDP ranking and happiness index ranking
- Education: Relationship between academic grade ranking and employment rate ranking
- Sports: Relationship between salary ranking and performance ranking
Key Question: āIs there a monotonic relationship between two variables?ā
Non-parametric alternative to Pearson: Doesn't require normal distribution and can detect non-linear but monotonic relationships. Example: y = x² for x ā„ 0 is non-linear but strictly monotonic, so Spearman captures it fully while Pearson understates it (see the sketch below).
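A quick synthetic demonstration of that difference (made-up data, not from the tips dataset):
x = np.linspace(1, 10, 100)
y = x ** 3  # strongly non-linear but strictly monotonic
r_demo, _ = pearsonr(x, y)
rho_demo, _ = spearmanr(x, y)
print(f"Pearson r = {r_demo:.4f} (understates the relationship)")
print(f"Spearman rho = {rho_demo:.4f} (perfect monotonic agreement)")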
# Tips: Rank correlation between total bill and tip
# Situation: Check if higher bills tend to have higher tips (even if not exactly proportional)
total_bill = tips['total_bill']
tip = tips['tip']
rho, p_value = spearmanr(total_bill, tip)
r_pearson, _ = pearsonr(total_bill, tip)
print("=== Spearman Rank Correlation Analysis ===")
print(f"Hā: No monotonic relationship between bill and tip")
print(f"Hā: Monotonic relationship exists between bill and tip")
print(f"\nSpearman rho: {rho:.4f}")
print(f"(Comparison) Pearson r: {r_pearson:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"\nConclusion: {'Significant monotonic relationship - higher bills tend to have higher tips' if p_value < 0.05 else 'No relationship'}")=== Spearman Rank Correlation Analysis === Hā: No monotonic relationship between bill and tip Hā: Monotonic relationship exists between bill and tip Spearman rho: 0.8264 (Comparison) Pearson r: 0.6757 p-value: 0.000000 Conclusion: Significant monotonic relationship - higher bills tend to have higher tips
Kendallās Tau
Example Use Cases
- Rank Data: Agreement between two judgesā ranking evaluations (e.g., restaurant rankings)
- Ordinal Scale: Relationship between education level (Elementary/Middle/High/College) and income level (Low/Medium/High)
- When Many Ties Exist: Data with many same values like 5-point Likert scale surveys
Key Question: āWhat is the concordance between two variables in rank data?ā
More conservative than Spearman (tau values tend to be smaller in magnitude). More reliable when there are many ties.
# Titanic: Relationship between cabin class and age
# Situation: Check if 1st class passengers tend to be older (ordinal vs continuous)
pclass = titanic['pclass'].dropna()
age = titanic['age'].dropna()
# Align indices
common_idx = pclass.index.intersection(age.index)
pclass_aligned = pclass.loc[common_idx]
age_aligned = age.loc[common_idx]
tau, p_value = kendalltau(pclass_aligned, age_aligned)
print("=== Kendall's Tau Correlation Analysis ===")
print(f"Hā: No relationship between cabin class and age")
print(f"Hā: Relationship exists between cabin class and age")
print(f"\nKendall tau: {tau:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Significant relationship' if p_value < 0.05 else 'No relationship'}")
if tau < 0:
    print(f"Interpretation: tau < 0 means cabin classā(1st class) ā ageā (older passengers in premium cabins)")
=== Kendall's Tau Correlation Analysis ===
Hā: No relationship between cabin class and age
Hā: Relationship exists between cabin class and age
Kendall tau: -0.1080
p-value: 0.0000
Conclusion: Significant relationship
Interpretation: tau < 0 means cabin classā(1st class) ā ageā (older passengers in premium cabins)
9. Effect Size
Why is effect size important?
p-value only tells you āIs there a difference?ā but not āHow big is the difference?ā With large samples, even tiny differences can become significant.
Example: In an A/B test with 1 million people, a 0.01% difference in conversion rate can yield p < 0.05, but you need effect size to judge if this difference is actually meaningful for the business.
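A small simulation of that pitfall (synthetic data; Cohen's d, computed inline here, is defined formally just below):
rng = np.random.default_rng(42)
big_a = rng.normal(0.000, 1, 1_000_000)
big_b = rng.normal(0.005, 1, 1_000_000)  # trivial true difference of 0.005 SD
_, p_big = ttest_ind(big_a, big_b)
d_big = (big_b.mean() - big_a.mean()) / np.sqrt((big_a.var() + big_b.var()) / 2)
print(f"p-value: {p_big:.4f} ā 'significant', but Cohen's d = {d_big:.4f} ā negligible")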
Cohenās d
When to use: Interpret the practical meaning of mean difference between two groups
Interpretation guidelines (Cohen, 1988):
- |d| < 0.2: Negligible effect
- 0.2 ⤠|d| < 0.5: Small effect
- 0.5 ⤠|d| < 0.8: Medium effect
- |d| ā„ 0.8: Large effect
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled_std
# Titanic: Effect size of age difference between survivors vs non-survivors
# Situation: Relationship between age and survival. Even if p-value is significant, is it a meaningful difference?
survived_ages = titanic[titanic['survived'] == 1]['age'].dropna()
died_ages = titanic[titanic['survived'] == 0]['age'].dropna()
d = cohens_d(survived_ages, died_ages)
t_stat, p_value = ttest_ind(survived_ages, died_ages)
print("=== Effect Size Analysis ===")
print(f"Survivor mean age: {survived_ages.mean():.2f} years")
print(f"Non-survivor mean age: {died_ages.mean():.2f} years")
print(f"Difference: {abs(survived_ages.mean() - died_ages.mean()):.2f} years")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f} ā {'Significant' if p_value < 0.05 else 'Not significant'}")
print(f"Cohen's d: {d:.4f}")
print(f"\nEffect size interpretation:")
print(f" |d| < 0.2: Small effect")
print(f" 0.2 <= |d| < 0.5: Medium effect")
print(f" 0.5 <= |d| < 0.8: Large effect")
print(f" |d| >= 0.8: Very large effect")
print(f"\nCurrent: |d| = {abs(d):.4f} ā ", end="")
if abs(d) >= 0.8:
print("Very large effect")
elif abs(d) >= 0.5:
print("Large effect")
elif abs(d) >= 0.2:
print("Medium effect")
else:
print("Small effect ā Statistically significant but practical meaning is limited!")=== Effect Size Analysis === Survivor mean age: 28.34 years Non-survivor mean age: 30.63 years Difference: 2.29 years t-statistic: -2.0551 p-value: 0.0402 ā Significant Cohen's d: -0.1616 Effect size interpretation: |d| < 0.2: Small effect 0.2 <= |d| < 0.5: Medium effect 0.5 <= |d| < 0.8: Large effect |d| >= 0.8: Very large effect Current: |d| = 0.1616 ā Small effect ā Statistically significant but practical meaning is limited!
Cramerās V
When to use: Measure association strength between categorical variables (after chi-square test)
Interpretation guidelines:
- V < 0.1: Negligible
- 0.1 ⤠V < 0.3: Weak association
- 0.3 ⤠V < 0.5: Moderate association
- V ā„ 0.5: Strong association
def cramers_v(contingency_table):
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))
# Titanic: Strength of association between gender and survival
contingency = pd.crosstab(titanic['sex'], titanic['survived'])
v = cramers_v(contingency)
chi2, p_value, _, _ = chi2_contingency(contingency)
print("=== Cramer's V (Association Strength) ===")
print(f"Chi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"Cramer's V = {v:.4f}")
print(f"\nInterpretation guidelines:")
print(f" V < 0.1: Negligible")
print(f" 0.1 <= V < 0.3: Weak association")
print(f" 0.3 <= V < 0.5: Moderate association")
print(f" V >= 0.5: Strong association")
print(f"\nCurrent: V = {v:.4f} ā ", end="")
if v >= 0.5:
    print("Strong association ā Gender strongly affected survival!")
elif v >= 0.3:
    print("Moderate association")
elif v >= 0.1:
    print("Weak association")
else:
    print("Negligible")
=== Cramer's V (Association Strength) ===
Chi-square = 260.7170, p-value = 0.000000
Cramer's V = 0.5410
Interpretation guidelines:
 V < 0.1: Negligible
 0.1 <= V < 0.3: Weak association
 0.3 <= V < 0.5: Moderate association
 V >= 0.5: Strong association
Current: V = 0.5410 ā Strong association ā Gender strongly affected survival!
10. Multiple Testing Correction
Why is correction needed?
When performing multiple tests simultaneously, Type I errors (false positives) accumulate.
Example: Performing 20 tests at α = 0.05
- Probability of at least 1 false positive = 1 - (0.95)^20 ā 64%!
This is called the Multiple Comparison Problem.
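A quick simulation of that accumulation (synthetic data; Hā is true in every one of the 20 tests):
rng = np.random.default_rng(1)
n_experiments, n_tests = 1000, 20
families_with_fp = 0
for _ in range(n_experiments):
    # 20 independent tests, all with no true effect
    p_vals = [ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue for _ in range(n_tests)]
    families_with_fp += min(p_vals) < 0.05
print(f"Experiments with >= 1 false positive: {families_with_fp / n_experiments:.1%}")
print(f"Theoretical: 1 - 0.95**20 = {1 - 0.95**20:.1%}")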
Bonferroni Correction
When to use:
- Analyzing multiple A/B tests simultaneously
- Post-hoc tests after ANOVA comparing multiple pairs
- Genome studies testing thousands of genes
Method: Divide α by number of tests (k). Example: 5 tests ā 0.05/5 = 0.01
Very conservative. May miss real effects (increased Type II error)
from statsmodels.stats.multitest import multipletests
# Multiple A/B test results
# Situation: Testing 5 UI elements simultaneously. Which ones have real effects?
p_values = [0.03, 0.04, 0.01, 0.08, 0.002]
test_names = ['Button Color', 'Headline', 'CTA Position', 'Image', 'Price Display']
# Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, method='bonferroni')
print("=== Multiple Testing Correction (Bonferroni) ===")
print(f"Number of tests: {len(p_values)}")
print(f"Corrected significance level: 0.05 / {len(p_values)} = {0.05/len(p_values):.3f}")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p, rejected):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {min(cp, 1.0):<18.4f} {result}")
=== Multiple Testing Correction (Bonferroni) ===
Number of tests: 5
Corrected significance level: 0.05 / 5 = 0.010
Test            Original p-value   Corrected p-value  Conclusion
----------------------------------------------------------------------
Button Color    0.0300             0.1500             Not significant
Headline        0.0400             0.2000             Not significant
CTA Position    0.0100             0.0500             Not significant
Image           0.0800             0.4000             Not significant
Price Display   0.0020             0.0100             Significant
Benjamini-Hochberg (FDR)
When to use:
- When performing many tests but can tolerate some false positives
- Screening candidates in exploratory analysis
- Genome studies where Bonferroni is too conservative
Method: Controls False Discovery Rate (FDR). Controls āproportion of false positives among those declared significantā to 5%
# Benjamini-Hochberg correction
rejected_bh, corrected_p_bh, _, _ = multipletests(p_values, method='fdr_bh')
print("=== Multiple Testing Correction (Benjamini-Hochberg FDR) ===")
print(f"\n{'Test':<15} {'Original p-value':<18} {'Corrected p-value':<18} {'Conclusion'}")
print("-" * 70)
for name, p, cp, rej in zip(test_names, p_values, corrected_p_bh, rejected_bh):
    result = "Significant" if rej else "Not significant"
    print(f"{name:<15} {p:<18.4f} {cp:<18.4f} {result}")
print(f"\nComparison:")
print(f" Significant with Bonferroni: {sum(rejected)}")
print(f" Significant with FDR(BH): {sum(rejected_bh)}")
print(f"\nā FDR is less conservative, allowing more discoveries")
print(f" However, about 5% of these may be false positives")=== Multiple Testing Correction (Benjamini-Hochberg FDR) === Test Original p-value Corrected p-value Conclusion ---------------------------------------------------------------------- Button Color 0.0300 0.0500 Significant Headline 0.0400 0.0500 Significant CTA Position 0.0100 0.0250 Significant Image 0.0800 0.0800 Not significant Price Display 0.0020 0.0100 Significant Comparison: Significant with Bonferroni: 1 Significant with FDR(BH): 4 ā FDR is less conservative, allowing more discoveries However, about 5% of these may be false positives
11. Power Analysis
When to use?
Use at the experimental design stage to calculate āHow many samples are needed?ā
Too few samples ā Cannot detect real effects even when they exist (Type II error)
Too many samples ā Waste of resources
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# Scenario: Effect size 0.3, Power 80%, Significance level 5%
# Situation: "Expecting the new UI to improve conversion rate moderately (d=0.3).
# How many people are needed to detect this effect with 80% probability?"
effect_size = 0.3
alpha = 0.05
power = 0.8
n = power_analysis.solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
ratio=1.0,
alternative='two-sided')
print("=== Sample Size Calculation ===")
print(f"Target effect size (Cohen's d): {effect_size} (medium effect)")
print(f"Significance level (α): {alpha}")
print(f"Target power (1-β): {power} (80% probability of detecting effect)")
print(f"\nRequired sample size: {n:.0f} per group")
print(f"Total required: {n*2:.0f} people")
# Required sample sizes for various effect sizes
print("\nRequired sample sizes by effect size (Power 80%):")
for es, desc in [(0.2, 'small effect'), (0.3, 'small-to-medium effect'), (0.5, 'medium effect'), (0.8, 'large effect')]:
    n = power_analysis.solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1.0)
    print(f" d = {es} ({desc}): {n:.0f} per group (total {n*2:.0f})")
=== Sample Size Calculation ===
Target effect size (Cohen's d): 0.3 (small-to-medium effect)
Significance level (α): 0.05
Target power (1-β): 0.8 (80% probability of detecting effect)
Required sample size: 176 per group
Total required: 352 people
Required sample sizes by effect size (Power 80%):
 d = 0.2 (small effect): 394 per group (total 787)
 d = 0.3 (small-to-medium effect): 176 per group (total 352)
 d = 0.5 (medium effect): 64 per group (total 128)
 d = 0.8 (large effect): 26 per group (total 51)
12. Test Selection Summary
Test Selection by Data Type
| Situation | Parametric Test | Non-parametric Test |
|---|---|---|
| 1 sample mean vs reference value | One-sample t-test | Wilcoxon signed-rank |
| 2 independent samples comparison | Independent t-test | Mann-Whitney U |
| 2 paired samples comparison | Paired t-test | Wilcoxon signed-rank |
| 3+ independent samples comparison | One-way ANOVA | Kruskal-Wallis H |
| 2Ć2 categories (small sample) | - | Fisherās exact |
| Category independence | - | Chi-square |
| Paired category before/after comparison | - | McNemarās |
| Correlation | Pearson r | Spearman Ļ, Kendall Ļ |
Quick Guide by Situation
Q: Comparing means of two groups?
āāā Same subjects before/after? ā Paired t-test (normal) / Wilcoxon (non-normal)
āāā Different subjects? ā Independent t-test (normal) / Mann-Whitney (non-normal)
Q: Comparing 3+ groups?
āāā Normal distribution? ā One-way ANOVA
āāā Non-normal distribution? ā Kruskal-Wallis
Q: Relationship between two categorical variables?
āāā Any expected frequency < 5? ā Fisher's exact
āāā All expected frequencies >= 5? ā Chi-square
Q: Relationship between two continuous variables?
āāā Linear relationship? ā Pearson r
āāā Monotonic relationship? ā Spearman Ļ
Quiz
Problem 1
Perform an appropriate test to check if there is a significant difference in survival rate by cabin class (pclass) in the Titanic data.
View Answer
# Categorical vs Categorical ā Chi-square test
contingency = pd.crosstab(titanic['pclass'], titanic['survived'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("Contingency Table:")
print(contingency)
print(f"\nChi-square = {chi2:.4f}, p-value = {p_value:.6f}")
print(f"\nConclusion: {'Cabin class and survival rate are associated' if p_value < 0.05 else 'No association'}")
# Effect size
v = cramers_v(contingency)
print(f"Cramer's V = {v:.4f} (moderate association)")Problem 2
Use an appropriate test to check if the tip amount distribution differs between smokers and non-smokers in the Tips data.
View Answer
smoker_tip = tips[tips['smoker'] == 'Yes']['tip']
nonsmoker_tip = tips[tips['smoker'] == 'No']['tip']
# Normality test
_, p_smoker = shapiro(smoker_tip)
_, p_nonsmoker = shapiro(nonsmoker_tip)
print(f"Normality (smoker): p = {p_smoker:.4f}")
print(f"Normality (non-smoker): p = {p_nonsmoker:.4f}")
# Normality not satisfied ā Use Mann-Whitney U
stat, p_value = mannwhitneyu(smoker_tip, nonsmoker_tip)
print(f"\nMann-Whitney U: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"\nConclusion: {'Tip distributions differ' if p_value < 0.05 else 'No difference in tip distributions'}")Next Steps
- Learn how to model relationships between variables in Regression Analysis.
- Learn experimental design in A/B Testing.