01. Hypothesis Testing
1. Overview and Scenario
Situation: There’s a rumor circulating in the team that “male customers spend more money than female customers.” The data team lead asks:
“Is that really true? Or is it just a coincidence because of a few big-spending male customers? Is it statistically significant?”
We need to go beyond simply comparing averages and prove that the difference is not by chance. For this, we use T-test and ANOVA.
2. Data Preparation
Using the src_users, src_orders, and src_order_items tables, we calculate the total purchase amount per customer.
BigQuery (SQL)
from google.cloud import bigquery
from scipy import stats
import pandas as pd
client = bigquery.Client()3. T-Test: Comparing Two Groups (Male vs Female)
This is the most common question: “Is the mean of A different from B?”
❓ Problem 1: Comparing Average Purchase Amount by Gender
Q. Perform an Independent Two-sample T-test to determine whether the difference in average total purchase amount between male and female customers is significant.
(Only include customers with purchase history.)
BigQuery + Python
Strategy: Aggregate data in BigQuery, bring it to Python, then use scipy.stats.
Hint: Use GROUP BY user_id, gender in SQL to get total purchase amount, then fetch it.
View Solution
# 1. Extract data
query = """
SELECT
u.gender,
SUM(oi.sale_price) as total_purchase
FROM `your-project-id.retail_analytics_us.src_users` u
JOIN `your-project-id.retail_analytics_us.src_orders` o ON u.user_id = o.user_id
JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
GROUP BY u.user_id, u.gender
"""
df = client.query(query).to_dataframe()
# 2. Separate groups
group_m = df[df['gender'] == 'M']['total_purchase']
group_f = df[df['gender'] == 'F']['total_purchase']
# 3. Perform T-test (not assuming equal variance -> Welch's t-test)
t_stat, p_val = stats.ttest_ind(group_m, group_f, equal_var=False)
print(f"Male average: ${group_m.mean():.2f}")
print(f"Female average: ${group_f.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
if p_val < 0.05:
print("✅ The difference is statistically significant.")
else:
print("❌ The difference is not significant.")Male average: $130.20 Female average: $114.75 T-statistic: 15.4704 P-value: 0.0000 ✅ The difference is statistically significant.
💡 Interpretation Guide
- P-value < 0.05: Accept the hypothesis that “the two group means are different” (significant difference).
- P-value >= 0.05: There is not enough evidence to say there is a difference (could be by chance).
- If the male average purchase amount is higher and the P-value is 0.0000, we can conclude that “men spend statistically more.”
4. ANOVA: Comparing Three or More Groups
What if there are three or more groups to compare? (e.g., by signup channel, by country) Instead of running multiple T-tests, we use ANOVA (Analysis of Variance).
❓ Problem 2: Purchase Amount Difference by Traffic Source
Q. Test whether there are differences in average total purchase amount among customers by traffic_source (Search, Display, Facebook, etc.).
BigQuery + Python
Hint: Since the src_users table doesn’t have a traffic_source column, we need traffic source data.
Here, let’s assume traffic source info exists in src_users, or for practice, let’s use country.
(Guide modification: Due to dataset limitations, we’ll substitute with Country comparison.)
Alternative Problem: Testing Purchase Amount Difference by Country
View Solution
# 1. Extract data
query = """
SELECT
u.country,
SUM(oi.sale_price) as total_purchase
FROM `your-project-id.retail_analytics_us.src_users` u
JOIN `your-project-id.retail_analytics_us.src_orders` o ON u.user_id = o.user_id
JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
AND u.country IN ('United States', 'China', 'Japan') -- Compare 3 countries
GROUP BY u.user_id, u.country
"""
df_anova = client.query(query).to_dataframe()
# 2. Prepare data by group
groups = [df_anova[df_anova['country'] == c]['total_purchase'] for c in df_anova['country'].unique()]
# 3. Perform ANOVA
f_stat, p_val = stats.f_oneway(*groups)
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_val:.4f}")F-statistic: 2.6464 P-value: 0.0709
5. Post-hoc Analysis
If ANOVA shows “there is a difference (P < 0.05)”, we need to know “specifically, which groups are different from each other?” For this, we perform the Tukey HSD test.
❓ Problem 3: Post-hoc Test by Country
Q. Use pairwise_tukeyhsd from statsmodels to identify which countries have clear differences.
Python (Common)
View Solution
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# df_anova is the dataframe created above
tukey = pairwise_tukeyhsd(endog=df_anova['total_purchase'],
groups=df_anova['country'],
alpha=0.05)
print(tukey)Multiple Comparison of Means - Tukey HSD, FWER=0.05 =========================================================== group1 group2 meandiff p-adj lower upper reject ----------------------------------------------------------- China Japan 6.4094 0.1268 -1.3222 14.141 False China United States -1.217 0.6424 -4.398 1.9641 False Japan United States -7.6263 0.0596 -15.4896 0.2369 False -----------------------------------------------------------
Result Interpretation:
- Rows where
rejectcolumn isTrue: The difference between the two groups is significant. - Example: If US vs China is True, the spending behavior between the two countries is different.
💡 Summary
- T-test: Compare two groups (A vs B)
- ANOVA: Compare three or more groups (A vs B vs C)
- Post-hoc: Find the culprit - who is different from whom
In the next chapter, we will verify marketing performance through A/B Testing.