01. Hypothesis Testing

Advanced2 hours

1. Overview and Scenario

Situation: There’s a rumor circulating in the team that “male customers spend more money than female customers.” The data team lead asks:

“Is that really true? Or is it just a coincidence because of a few big-spending male customers? Is it statistically significant?”

We need to go beyond simply comparing averages and prove that the difference is not by chance. For this, we use T-test and ANOVA.

2. Data Preparation

Using the src_users, src_orders, and src_order_items tables, we calculate the total purchase amount per customer.

BigQuery (SQL)


from google.cloud import bigquery
from scipy import stats
import pandas as pd
 
client = bigquery.Client()

3. T-Test: Comparing Two Groups (Male vs Female)

This is the most common question: “Is the mean of A different from B?”

❓ Problem 1: Comparing Average Purchase Amount by Gender

Q. Perform an Independent Two-sample T-test to determine whether the difference in average total purchase amount between male and female customers is significant. (Only include customers with purchase history.)

BigQuery + Python

Strategy: Aggregate data in BigQuery, bring it to Python, then use scipy.stats.

💡

Hint: Use GROUP BY user_id, gender in SQL to get total purchase amount, then fetch it.

View Solution


# 1. Extract data
query = """
SELECT
    u.gender,
    SUM(oi.sale_price) as total_purchase
FROM `your-project-id.retail_analytics_us.src_users` u
JOIN `your-project-id.retail_analytics_us.src_orders` o ON u.user_id = o.user_id
JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
GROUP BY u.user_id, u.gender
"""
df = client.query(query).to_dataframe()
 
# 2. Separate groups
group_m = df[df['gender'] == 'M']['total_purchase']
group_f = df[df['gender'] == 'F']['total_purchase']
 
# 3. Perform T-test (not assuming equal variance -> Welch's t-test)
t_stat, p_val = stats.ttest_ind(group_m, group_f, equal_var=False)
 
print(f"Male average: ${group_m.mean():.2f}")
print(f"Female average: ${group_f.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
 
if p_val < 0.05:
    print("✅ The difference is statistically significant.")
else:
    print("❌ The difference is not significant.")

실행 결과

Male average: $130.20
Female average: $114.75
T-statistic: 15.4704
P-value: 0.0000
✅ The difference is statistically significant.

💡 Interpretation Guide

P-value < 0.05: Accept the hypothesis that “the two group means are different” (significant difference).
P-value >= 0.05: There is not enough evidence to say there is a difference (could be by chance).
If the male average purchase amount is higher and the P-value is 0.0000, we can conclude that “men spend statistically more.”

4. ANOVA: Comparing Three or More Groups

What if there are three or more groups to compare? (e.g., by signup channel, by country) Instead of running multiple T-tests, we use ANOVA (Analysis of Variance).

❓ Problem 2: Purchase Amount Difference by Traffic Source

Q. Test whether there are differences in average total purchase amount among customers by traffic_source (Search, Display, Facebook, etc.).

BigQuery + Python

💡

Hint: Since the src_users table doesn’t have a traffic_source column, we need traffic source data. Here, let’s assume traffic source info exists in src_users, or for practice, let’s use country. (Guide modification: Due to dataset limitations, we’ll substitute with Country comparison.)

Alternative Problem: Testing Purchase Amount Difference by Country

View Solution


# 1. Extract data
query = """
SELECT
    u.country,
    SUM(oi.sale_price) as total_purchase
FROM `your-project-id.retail_analytics_us.src_users` u
JOIN `your-project-id.retail_analytics_us.src_orders` o ON u.user_id = o.user_id
JOIN `your-project-id.retail_analytics_us.src_order_items` oi ON o.order_id = oi.order_id
WHERE o.status NOT IN ('Cancelled', 'Returned')
AND u.country IN ('United States', 'China', 'Japan')  -- Compare 3 countries
GROUP BY u.user_id, u.country
"""
df_anova = client.query(query).to_dataframe()
 
# 2. Prepare data by group
groups = [df_anova[df_anova['country'] == c]['total_purchase'] for c in df_anova['country'].unique()]
 
# 3. Perform ANOVA
f_stat, p_val = stats.f_oneway(*groups)
 
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_val:.4f}")

실행 결과

F-statistic: 2.6464
P-value: 0.0709

5. Post-hoc Analysis

If ANOVA shows “there is a difference (P < 0.05)”, we need to know “specifically, which groups are different from each other?” For this, we perform the Tukey HSD test.

❓ Problem 3: Post-hoc Test by Country

Q. Use pairwise_tukeyhsd from statsmodels to identify which countries have clear differences.

Python (Common)

View Solution


from statsmodels.stats.multicomp import pairwise_tukeyhsd
 
# df_anova is the dataframe created above
tukey = pairwise_tukeyhsd(endog=df_anova['total_purchase'],
                          groups=df_anova['country'],
                          alpha=0.05)
print(tukey)

실행 결과

Multiple Comparison of Means - Tukey HSD, FWER=0.05
===========================================================
group1     group2    meandiff p-adj   lower   upper  reject
-----------------------------------------------------------
China         Japan   6.4094 0.1268  -1.3222 14.141  False
China United States   -1.217 0.6424   -4.398 1.9641  False
Japan United States  -7.6263 0.0596 -15.4896 0.2369  False
-----------------------------------------------------------

Result Interpretation:

Rows where reject column is True: The difference between the two groups is significant.
Example: If US vs China is True, the spending behavior between the two countries is different.

💡 Summary

T-test: Compare two groups (A vs B)
ANOVA: Compare three or more groups (A vs B vs C)
Post-hoc: Find the culprit - who is different from whom

In the next chapter, we will verify marketing performance through A/B Testing.

01. Hypothesis Testing

1. Overview and Scenario

2. Data Preparation

BigQuery (SQL)

Pandas (CSV)

3. T-Test: Comparing Two Groups (Male vs Female)

❓ Problem 1: Comparing Average Purchase Amount by Gender

BigQuery + Python

View Solution

Pandas

View Solution

💡 Interpretation Guide

4. ANOVA: Comparing Three or More Groups

❓ Problem 2: Purchase Amount Difference by Traffic Source

BigQuery + Python

View Solution

Pandas

View Solution

5. Post-hoc Analysis

❓ Problem 3: Post-hoc Test by Country

Python (Common)

View Solution

💡 Summary