Statistical Analysis
A collection of statistical analysis recipes for data-driven decision making. Learn statistical techniques frequently used in practice, from descriptive statistics to A/B testing.
Why Do We Need Statistics?
ℹ️
Data Analysis vs Statistical Analysis
- Data Analysis: “Sales increased by 10% last month”
- Statistical Analysis: “Checking whether this increase is due to chance or reflects a meaningful change”
Statistics is a tool for quantifying uncertainty in data and drawing reliable conclusions.
0. Setup
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Dummy Data for Examples
group_a = np.random.normal(100, 10, 100)
group_b = np.random.normal(105, 12, 100)
df = pd.DataFrame({'x1': np.random.rand(100), 'x2': np.random.rand(100), 'y': np.random.rand(100)})
Curriculum
1. Descriptive Statistics
Beginner · Learn basic statistics that summarize data characteristics (see the code sketch below).
- Central Tendency: Mean, Median, Mode
- Dispersion: Standard Deviation, Variance, Range, IQR
- Distribution Shape: Skewness, Kurtosis
- Percentiles and Quartiles
Start Descriptive Statistics →
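As a quick illustration of these statistics, here is a minimal sketch using pandas, assuming the imports and the group_a array from the Setup block above:
# Descriptive statistics for group_a (generated in the Setup block)
s = pd.Series(group_a)
print("Mean:", s.mean())
print("Median:", s.median())
# Mode is mainly useful for discrete/categorical data, so it is omitted for this continuous sample
print("Std / Variance:", s.std(), s.var())
print("Range:", s.max() - s.min())
print("IQR:", s.quantile(0.75) - s.quantile(0.25))
print("Skewness:", s.skew())
print("Kurtosis:", s.kurt())
print("Quartiles:")
print(s.quantile([0.25, 0.5, 0.75]))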
2. Correlation Analysis
Beginner/Intermediate · Learn methods to analyze relationships between two variables (see the code sketch below).
- Pearson Correlation Coefficient (Continuous Variables)
- Spearman Correlation Coefficient (Ordinal/Non-linear)
- Correlation vs Causation
- Correlation Matrix and Heatmap
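A minimal sketch of these correlation measures with scipy and pandas, assuming the df DataFrame from the Setup block above; rendering the correlation matrix as a heatmap would need an extra plotting library such as seaborn:
# Pearson and Spearman correlation between x1 and y
pearson_r, pearson_p = stats.pearsonr(df['x1'], df['y'])
spearman_r, spearman_p = stats.spearmanr(df['x1'], df['y'])
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3f})")
# Correlation matrix across all numeric columns
print(df.corr())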
3. Hypothesis Testing
Intermediate · Learn methods for testing hypotheses with data (see the code sketch below).
- Null Hypothesis and Alternative Hypothesis
- Meaning and Interpretation of p-value
- t-test (One-sample, Independent, Paired)
- Chi-square Test (Categorical Variables)
- Type I/Type II Errors
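A minimal sketch of an independent t-test and a chi-square test with scipy, assuming group_a and group_b from the Setup block above; the contingency table counts are made up for illustration:
# Independent two-sample t-test: do the means of group_a and group_b differ?
alpha = 0.05
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 (significant difference)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
# Chi-square test of independence on an illustrative contingency table
observed = np.array([[30, 70],    # e.g. group A: converted / not converted (made-up counts)
                     [45, 55]])   # e.g. group B: converted / not converted (made-up counts)
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")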
4. Regression Analysis
Intermediate/Advanced · Learn methods to model and predict relationships between variables (see the code sketch below).
- Simple Linear Regression
- Multiple Linear Regression
- Interpreting Regression Coefficients
- Coefficient of Determination (R²) and Model Evaluation
- Multicollinearity Diagnosis
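A minimal regression sketch with statsmodels, assuming the df DataFrame and imports from the Setup block above; the VIF step is one common way to diagnose multicollinearity:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Multiple linear regression on the dummy data
model = ols('y ~ x1 + x2', data=df).fit()
print(model.params)      # regression coefficients
print(model.rsquared)    # coefficient of determination (R^2)
# Variance inflation factors (rule of thumb: VIF > 5-10 signals multicollinearity)
X = sm.add_constant(df[['x1', 'x2']])
vif = pd.DataFrame({
    'variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif)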
5. A/B Testing
Intermediate/Advanced · Learn how to verify causal relationships through experiments (see the code sketch below).
- A/B Test Design
- Sample Size Calculation (Power Analysis)
- Conversion Rate Comparison (Proportion Test)
- Continuous Metric Comparison (t-test)
- Early Stopping and Multiple Comparison Issues
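A minimal sketch of power analysis and a two-proportion z-test with statsmodels; the conversion rates and counts below are made-up numbers for illustration:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest
# Sample size needed per group to detect a lift from 10% to 12% conversion (illustrative rates)
effect = proportion_effectsize(0.10, 0.12)
n_per_group = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"Required sample size per group: {n_per_group:.0f}")
# Two-proportion z-test on illustrative conversion counts
conversions = np.array([120, 145])   # conversions in A and B (made-up)
visitors = np.array([1000, 1000])    # visitors in A and B (made-up)
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")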
6. Time Series Analysis
Advanced · Learn methods to analyze and forecast data patterns over time (see the code sketch below).
- Time Series Decomposition (Trend, Seasonality, Residual)
- Moving Average and Exponential Smoothing
- Autocorrelation (ACF) and Partial Autocorrelation (PACF)
- ARIMA Model Basics
- Using Prophet
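A minimal sketch of decomposition and an ARIMA fit with statsmodels, using a small synthetic monthly series built here for illustration (Prophet is a separate library and is not shown):
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
# Synthetic monthly series with a linear trend and yearly seasonality
idx = pd.date_range('2022-01-01', periods=48, freq='MS')
ts = pd.Series(np.arange(48) * 0.5
               + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
               + np.random.normal(0, 1, 48), index=idx)
# Decompose into trend, seasonal, and residual components
decomposition = seasonal_decompose(ts, model='additive', period=12)
print(decomposition.trend.dropna().head())
# Fit a basic ARIMA(1, 1, 1) model and forecast 6 steps ahead
arima = ARIMA(ts, order=(1, 1, 1)).fit()
print(arima.forecast(steps=6))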
Key Concepts Summary
Probability and Distributions
| Distribution | When to Use | Examples |
|---|---|---|
| Normal Distribution | Continuous data, Sample means | Height, Weight, Test scores |
| Binomial Distribution | Success/Failure counts | Conversions, Clicks |
| Poisson Distribution | Occurrences per unit time | Daily orders |
| t-Distribution | Small sample mean comparison | Testing mean differences between groups |
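A few one-liners with scipy.stats that correspond to the table above; the specific parameters are illustrative:
# Normal: probability that a value from N(100, 10) exceeds 120
print(1 - stats.norm.cdf(120, loc=100, scale=10))
# Binomial: probability of exactly 30 conversions in 1000 trials with p = 0.03
print(stats.binom.pmf(30, n=1000, p=0.03))
# Poisson: probability of 5 or fewer orders when the daily average is 8
print(stats.poisson.cdf(5, mu=8))
# t-distribution: two-sided critical value for a small sample (n = 10, alpha = 0.05)
print(stats.t.ppf(0.975, df=9))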
Hypothesis Testing Decision Flow
Data Collection
↓
Hypothesis Setting (H₀, H₁)
↓
Significance Level (usually α = 0.05)
↓
Calculate Test Statistic
↓
Calculate p-value
↓
p < α → Reject H₀ (Significant difference)
p ≥ α → Fail to reject H₀ (No significant difference)
Python Statistics Libraries
# Basic Statistics
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Example: t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# Example: Regression Analysis
model = ols('y ~ x1 + x2', data=df).fit()
print(model.summary())
Output
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.005
Model: OLS Adj. R-squared: -0.015
Method: Least Squares F-statistic: 0.2639
Date: Sat, 20 Dec 2025 Prob (F-statistic): 0.769
Time: 00:25:05 Log-Likelihood: -13.739
No. Observations: 100 AIC: 33.48
Df Residuals: 97 BIC: 41.29
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5603 0.078 7.139 0.000 0.404 0.716
x1 -0.0050 0.097 -0.052 0.959 -0.197 0.187
x2 -0.0750 0.103 -0.726 0.469 -0.280 0.130
==============================================================================
Omnibus: 19.776 Durbin-Watson: 2.230
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.138
Skew: -0.135 Prob(JB): 0.0766
Kurtosis: 1.923 Cond. No. 5.63
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Practical Tips
💡
Statistical Significance vs Practical Importance
- A result with p < 0.05 is not automatically meaningful in practice
- Consider the effect size alongside the p-value
- Example: the conversion rate increased by 0.01% and p = 0.001
  - Statistically significant, but the practical value may be small
- Always estimate the business impact (e.g., revenue effect) as well
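A minimal sketch of one common effect-size measure (Cohen's d), assuming group_a and group_b from the Setup block above:
# Cohen's d for the mean difference between group_a and group_b (equal sample sizes)
mean_diff = np.mean(group_a) - np.mean(group_b)
pooled_std = np.sqrt((np.var(group_a, ddof=1) + np.var(group_b, ddof=1)) / 2)
cohens_d = mean_diff / pooled_std
print(f"Cohen's d = {cohens_d:.2f}")   # rough guide: ~0.2 small, ~0.5 medium, ~0.8 large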