
02. A/B Testing (Experiments)

Advanced · 2 hours

1. Overview and Scenario

Situation: The marketing team created a new landing page (Version B). They’re excited, saying “The conversion rate increased by 2% compared to the original page (Version A)!”

But you calmly ask:

“How large was the sample size? If the two pages actually performed the same, how likely would a 2% gap like this be by chance alone (the p-value)?”

A/B testing is the crown jewel of business decision-making. In this chapter, we learn the two-proportion Z-test and sample size calculation (power analysis).


2. Data Preparation

Since we don’t have A/B test log data, we’ll simulate an experiment with existing data by treating gender as the A/B group assignment.

  • Group A: Male
  • Group B: Female
  • Conversion: Purchase status (1 if order history exists, 0 otherwise)
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

# ... BigQuery client setup

3. Proportions Z-test

Used when comparing proportions like conversion rates.

ā“ Problem 1: Comparing Conversion Rates Between Groups

Q. Calculate the purchase conversion rate (proportion of purchasers among all registered users) for males and females, and test whether the difference in proportions is significant.

💡 Hint: Use COUNT(DISTINCT user_id) to get the total population, and LEFT JOIN orders to count purchasers.

Solution

# 1. Aggregate users and purchasers per gender
query = """
SELECT
  u.gender,
  COUNT(DISTINCT u.user_id) AS total_users,
  COUNT(DISTINCT o.user_id) AS purchasers
FROM `your-project-id.retail_analytics_us.src_users` u
LEFT JOIN `your-project-id.retail_analytics_us.src_orders` o
  ON u.user_id = o.user_id
GROUP BY u.gender
"""
# `client` is the BigQuery client created in the setup step
df = client.query(query).to_dataframe().set_index('gender')

# 2. Extract statistics (number of successes, number of trials)
count = df['purchasers'].values   # purchasers per group, in df.index order
nobs = df['total_users'].values   # total users per group, in df.index order

# 3. Z-test
z_stat, p_val = proportions_ztest(count, nobs)

# Label each rate by its index value: GROUP BY does not guarantee row order
for gender, c, n in zip(df.index, count, nobs):
    print(f"{gender} conversion rate: {c / n:.4f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("✅ The difference in conversion rates is significant.")
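Execution result
Error: name 'client' is not defined

This error means the BigQuery client from the setup step has not been created in the current session. Define client before running the query.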
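For readers without BigQuery access, the calculation behind the test can be sketched with the standard library alone. The counts below are hypothetical (100 of 1,000 users converting in group A, 120 of 1,000 in group B), and the function name two_proportion_ztest is made up for this example. It assumes the pooled-variance, two-sided form of the test, which is believed to match what proportions_ztest computes with its default arguments.

```python
import math

def two_proportion_ztest(count_a, nobs_a, count_b, nobs_b):
    """Pooled two-sided z-test for the difference between two proportions."""
    p_a = count_a / nobs_a
    p_b = count_b / nobs_b
    # Under H0 both groups share one rate, so estimate it from the pooled data
    p_pool = (count_a + count_b) / (nobs_a + nobs_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs_a + 1 / nobs_b))
    z = (p_a - p_b) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not taken from the retail dataset
z_stat, p_val = two_proportion_ztest(100, 1000, 120, 1000)
print(f"z = {z_stat:.3f}, p-value = {p_val:.3f}")
```

With these made-up numbers a 2-point gap (10% vs. 12%) is not significant at the 0.05 level (the p-value is roughly 0.15), which is why the sample size question from the scenario matters.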

4. Sample Size Calculation (Power Analysis)

This is the first question you should ask before running a test:

“How many people do we need to experiment on to get reliable results?”

With too few users, you may miss an effect that really exists (a false negative). With too many, you’re wasting time and money.

ā“ Problem 2: Calculate Required Sample Size

Q. If the current conversion rate is 10%, how many people per group are needed to detect an improvement to 11% (a 1 percentage point increase)? (Use significance level α = 0.05 and power = 0.8.)

💡 Hint: Use statsmodels.stats.power.NormalIndPower together with proportion_effectsize.

Solution

import numpy as np
import statsmodels.stats.api as sms
from statsmodels.stats.proportion import proportion_effectsize

# 1. Calculate effect size (Cohen's h for two proportions)
p1 = 0.10  # baseline
p2 = 0.11  # target
effect_size = proportion_effectsize(p1, p2)

# 2. Calculate sample size
required_n = sms.NormalIndPower().solve_power(
    effect_size=effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1,  # equal group sizes
)

n_per_group = int(np.ceil(required_n))
print(f"Required sample size per group: {n_per_group} people")
print(f"Total required sample size: {n_per_group * 2} people")
Execution result
Required sample size per group: 14745 people
Total required sample size: 29490 people
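As a cross-check, the same number can be approximated by hand with the textbook normal-approximation formula, using only the standard library. This is a sketch rather than the library's implementation: required_n_per_group is a hypothetical helper name, and the formula assumes equal group sizes and a two-sided test.

```python
import math
from statistics import NormalDist

def required_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    # Cohen's h: arcsine-transformed difference between the two proportions
    h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Ignores the negligible chance of rejecting in the wrong direction
    return 2 * ((z_alpha + z_power) / h) ** 2

n = required_n_per_group(0.10, 0.11)
print(f"Required sample size per group: {math.ceil(n)}")
```

For the 10% to 11% case this agrees with the 14,745 per group reported above.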

💡 Parameter Explanation

  • Alpha (α): Type I error probability (usually 0.05). “Probability of saying there’s an effect when there isn’t”
  • Power (1 − β): Statistical power (usually 0.8). “Probability of finding an effect when it truly exists”
  • Effect Size: The magnitude of the difference you want to detect (larger = fewer samples needed)
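To see how these parameters interact, the sketch below turns the question around: given a fixed number of users per group, how much power does the test have? It uses the same normal approximation as the sample size calculation, with a hypothetical helper name approx_power, and assumes equal group sizes and a two-sided test.

```python
import math
from statistics import NormalDist

def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (equal group sizes)."""
    # Effect size (Cohen's h)
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the statistic clears the critical value in the true direction
    return NormalDist().cdf(h * math.sqrt(n_per_group / 2) - z_crit)

for n in (1_000, 5_000, 14_745, 30_000):
    power = approx_power(0.10, 0.11, n)
    print(f"n per group = {n:>6}: power to detect 10% -> 11% is about {power:.2f}")

# A larger effect reaches higher power with the same number of users
print(f"10% -> 12% with 5,000 per group: {approx_power(0.10, 0.12, 5_000):.2f}")
```

Under this approximation, 5,000 users per group give well under a 50% chance of detecting the 1-point lift, while the same 5,000 users detect a 2-point lift far more reliably.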

5. Experiment Design Pitfalls

Design is just as important as coding.

  1. Peeking Problem:

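    • You shouldn’t say “Oh? The p-value is 0.04, let’s stop!” in the middle of an experiment. Every extra look at the data is another chance for noise to cross the threshold, which inflates the false positive rate.
    • Wait until you reach the predetermined sample size (N) before testing for significance.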
  2. SRM (Sample Ratio Mismatch):

    • You split 50:50, but the groups end up at 1,000 vs. 950 people?
    • If the gap is larger than random assignment would explain, there is probably a bug in the traffic allocation system, or data was lost in one group.
    • In that case the test results cannot be trusted!
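Both pitfalls can be illustrated with a small standard-library simulation. This is a sketch under stated assumptions, not a production check: all function names are hypothetical, the peeking part simulates A/A tests (both groups share a 10% conversion rate, so every "significant" result is a false positive) with 10 evenly spaced looks, and the SRM part uses a chi-square goodness-of-fit test with one degree of freedom.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def pooled_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 if there is no variation yet)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_a / n_a - conv_b / n_b) / se

def false_positive_rates(n_sims=300, n_per_group=1000, looks=10, rate=0.10, seed=0):
    """Simulate A/A tests: stop at any significant look vs. one test at full N."""
    rng = random.Random(seed)
    step = n_per_group // looks
    stopped_early = 0  # runs where ANY look showed p < 0.05
    final_only = 0     # runs where the single planned test showed p < 0.05
    for _ in range(n_sims):
        conv_a = conv_b = 0
        peek_hit = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = two_sided_p(pooled_z(conv_a, look * step, conv_b, look * step))
            if p < 0.05:
                peek_hit = True
                if look == looks:
                    final_only += 1
        stopped_early += peek_hit
    return stopped_early / n_sims, final_only / n_sims

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_a + n_b
    exp_a, exp_b = total * expected_share_a, total * (1 - expected_share_a)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom the chi-square tail equals a two-sided normal tail
    return math.erfc(math.sqrt(chi2 / 2))

peeking, planned = false_positive_rates()
print(f"False positive rate with peeking: {peeking:.2f}, with one planned test: {planned:.2f}")
print(f"SRM p-value for 1000 vs 950: {srm_p_value(1000, 950):.3f}")
print(f"SRM p-value for 10000 vs 9500: {srm_p_value(10000, 9500):.5f}")
```

Stopping at the first significant look can only raise the false positive rate relative to a single planned test, since the final look is one of the looks. The SRM check also shows why the size of the experiment matters: a 1,000 vs. 950 split gives a p-value of about 0.26, which is compatible with chance, while the same 5% imbalance at 10,000 vs. 9,500 gives a p-value below 0.001 and points to an allocation or logging problem.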

💡 Summary

  • Proportions Test: Compare 0/1 data like click rates and conversion rates
  • Power Analysis: Essential step before starting an experiment (“How many do we need?”)
  • A/B testing is science: Make decisions with data, not feelings.

In the next chapter, we’ll explore hidden relationships between variables through Correlation and Regression Analysis.
