02. A/B Testing (Experiments)

Advanced · 2 hours

1. Overview and Scenario

Situation: The marketing team created a new landing page (Version B). They’re excited, saying “The conversion rate increased by 2% compared to the original page (Version A)!”

But you calmly ask:

“How large was the sample size? If there were no real difference between the pages, how likely is it that a 2% gap like this would appear by chance (the p-value)?”

A/B testing is the crown jewel of business decision-making. In this chapter, we learn the two-proportion Z-test and sample size calculation (power analysis).

2. Data Preparation

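Since we don’t have A/B test log data, we’ll simulate it with existing data by treating gender as the A/B group assignment. Gender is not randomly assigned, so this is practice for the mechanics of the test rather than a true experiment.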

Group A: Male
Group B: Female
Conversion: Purchase status (1 if order history exists, 0 otherwise)

BigQuery (SQL)


from statsmodels.stats.proportion import proportions_ztest
import numpy as np
# ... BigQuery client setup

3. Proportions Z-test

Used when comparing proportions like conversion rates.
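To make the scenario from the overview concrete, here is a minimal from-scratch sketch of the pooled two-proportion Z-test using only the Python standard library. The function name `two_proportion_ztest` and all counts are hypothetical, chosen only for illustration. It applies the same 2-point lift (10% vs 12%) at two different sample sizes.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided Z-test for the difference of two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both groups share one rate, so pool the conversions
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: the same 10% vs 12% rates at two sample sizes
for n in (500, 5000):
    z, p = two_proportion_ztest(int(0.10 * n), n, int(0.12 * n), n)
    print(f"n per group = {n}: z = {z:.3f}, p-value = {p:.4f}")
```

With these made-up numbers, the lift is not significant at 500 users per group but is at 5000, which is why the sample size question comes first.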

❓ Problem 1: Comparing Conversion Rates Between Groups

Q. Calculate the purchase conversion rate (proportion of purchasers among all registered users) for males and females, and test whether the difference in proportions is significant.

BigQuery + Python

💡 Hint: Use COUNT(DISTINCT user_id) to get the total population, and LEFT JOIN orders to count purchasers.

Solution


# 1. Aggregate data
# Assumes `client` is the BigQuery client created in the setup step above
query = """
SELECT
    u.gender,
    COUNT(DISTINCT u.user_id) AS total_users,
    COUNT(DISTINCT o.user_id) AS purchasers
FROM `your-project-id.retail_analytics_us.src_users` u
LEFT JOIN `your-project-id.retail_analytics_us.src_orders` o ON u.user_id = o.user_id
GROUP BY u.gender
ORDER BY u.gender
"""
df = client.query(query).to_dataframe().set_index('gender')

# 2. Extract statistics (number of successes, number of trials)
count = df['purchasers'].values  # purchasers per group, in index order
nobs = df['total_users'].values  # total users per group, in index order

# 3. Z-test
z_stat, p_val = proportions_ztest(count, nobs)

# Report by the gender label in the index instead of assuming row order
for gender, c, n in zip(df.index, count, nobs):
    print(f"{gender} conversion rate: {c / n:.4f}")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("✅ The difference in conversion rates is significant.")
else:
    print("❌ No significant difference at the 5% level.")

Execution Result

Error: name 'client' is not defined

4. Sample Size Calculation (Sample Size & Power)

This is the first question you should ask before running a test:

“How many people do we need to experiment on to get reliable results?”

If the sample is too small, you might miss an effect even if it exists (false negative). If it is too large, you waste time and money.
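The risk of a false negative can be quantified as statistical power. The sketch below is a standard-library approximation based on Cohen's h and the normal distribution. The helper names `cohens_h` and `power_two_sided` and the sample sizes are assumptions for illustration. It estimates the chance of detecting a 10% to 11% improvement at several group sizes.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Effect size for two proportions (arcsine transformation)."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

def power_two_sided(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test with equal group sizes."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected shift of the test statistic under the alternative
    shift = abs(cohens_h(p1, p2)) * math.sqrt(n_per_group / 2)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical group sizes for a 10% -> 11% improvement
for n in (1000, 5000, 15000):
    print(f"n per group = {n}: power = {power_two_sided(0.10, 0.11, n):.3f}")
```

Under these assumptions, 1000 users per group would catch the improvement only a small fraction of the time, so a non-significant result at that size says little about whether the effect exists.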

❓ Problem 2: Calculate Required Sample Size

Q. If the current conversion rate is 10%, how many people per group are needed to detect an improvement to 11% (a 1 percentage point increase)? (Assume significance level $\alpha=0.05$ and power = 0.8.)

Python (Common)

💡 Hint: Calculate the effect size with proportion_effectsize, then pass it to statsmodels.stats.power.NormalIndPower.

Solution


import numpy as np
import statsmodels.stats.api as sms
from statsmodels.stats.proportion import proportion_effectsize
 
# 1. Calculate Effect Size
p1 = 0.10  # baseline
p2 = 0.11  # target
effect_size = proportion_effectsize(p1, p2)
 
# 2. Calculate sample size
required_n = sms.NormalIndPower().solve_power(
    effect_size=effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
)
 
print(f"Required sample size per group: {int(np.ceil(required_n))} people")
print(f"Total required sample size: {int(np.ceil(required_n)) * 2} people")

Execution Result

Required sample size per group: 14745 people
Total required sample size: 29490 people

💡 Parameter Explanation

Alpha ( $\alpha$ ): Type I error probability (usually 0.05). “Probability of saying there’s an effect when there isn’t”
Power ( $1-\beta$ ): Statistical power (usually 0.8). “Probability of finding an effect when it truly exists”
Effect Size: The magnitude of the difference you want to detect (larger = fewer samples needed)
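The three parameters combine in a closed-form approximation: with Cohen's h as the effect size, the per-group sample size is roughly $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$. The sketch below implements this with the standard library only. The function name `required_n_per_group` and the target rates are assumptions for illustration. Because it drops a negligible tail term, its result may differ from the statsmodels solver by a user or so.

```python
import math
from statistics import NormalDist

def required_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided test, equal groups."""
    norm = NormalDist()
    # Cohen's h: effect size for two proportions
    h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = norm.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)

# Hypothetical targets from a 10% baseline: larger effects need fewer users
for target in (0.11, 0.12, 0.15):
    n = required_n_per_group(0.10, target)
    print(f"10% -> {target:.0%}: about {n} people per group")
```

Because h appears squared in the denominator, doubling the detectable effect cuts the required sample to roughly a quarter.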

5. Experiment Design Pitfalls (Common Pitfalls)

Design is just as important as coding.

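Peeking Problem:
- You shouldn’t say “Oh? The p-value is 0.04, let’s stop!” in the middle of an experiment. Checking repeatedly and stopping at the first significant result pushes the false positive rate above $\alpha$.
- Wait until you reach the predetermined sample size (N) before evaluating the result.
SRM (Sample Ratio Mismatch):
- You split 50:50, but the groups come out unbalanced (for example, 1000 vs 950 people)?
- A gap larger than chance allows points to a bug in the traffic allocation system or data loss in a specific group. Check it with a chi-square goodness-of-fit test.
- If the mismatch is significant, the test results are invalid!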
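Whether an observed split counts as an SRM can be checked with a chi-square goodness-of-fit test against the planned allocation. The sketch below uses only the standard library. The function name `srm_check` and the second pair of counts are assumptions for illustration. For two groups the test has one degree of freedom, so the p-value can be computed from the normal distribution via `math.erfc`.

```python
import math

def srm_check(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test for a two-group traffic split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# The same 5% gap at two traffic levels (illustrative counts)
for n_a, n_b in [(1000, 950), (10000, 9500)]:
    chi2, p = srm_check(n_a, n_b)
    print(f"{n_a} vs {n_b}: chi2 = {chi2:.2f}, p-value = {p:.4f}")
```

On these counts, a 1000 vs 950 split alone is still within what chance can produce, while the same ratio at ten times the traffic is very unlikely under a true 50:50 allocation. Teams often apply a strict threshold to this check (0.001 is one assumed choice here) because it runs on every experiment.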

💡 Summary

Proportions Test: Compare 0/1 data like click rates and conversion rates
Power Analysis: Essential step before starting an experiment (“How many do we need?”)
A/B testing is science: Make decisions with data, not feelings.

In the next chapter, we’ll explore hidden relationships between variables through Correlation and Regression Analysis.
