Descriptive Statistics
Beginner
Learning Objectives
After completing this recipe, you will be able to:
- Calculate measures of central tendency (mean, median, mode)
- Understand measures of dispersion (standard deviation, variance, IQR)
- Identify distribution shapes (skewness, kurtosis)
- Calculate descriptive statistics with Pandas and SQL
0. Setup
Load CSV files for data practice.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
users = pd.read_csv('src_users.csv')
# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id').merge(users, on='user_id')1. Measures of Central Tendency
Theory
Central Tendency indicates where the data is concentrated.
| Measure | Description | Advantages | Disadvantages |
|---|---|---|---|
| Mean | Sum of all values ÷ Count | Reflects all data | Sensitive to outliers |
| Median | Middle value when sorted | Robust to outliers | Ignores distribution tails |
| Mode | Most frequent value | Useful for categorical data | May not be unique |
Calculating with Pandas
import pandas as pd
import numpy as np
# Basic statistics
print("=== Central Tendency ===")
print(f"Mean: ${df['sale_price'].mean():.2f}")
print(f"Median: ${df['sale_price'].median():.2f}")
print(f"Mode: ${df['sale_price'].mode()[0]:.2f}")
# All at once with describe()
print("\n=== describe() ===")
print(df['sale_price'].describe())실행 결과
=== Central Tendency === Mean: $59.73 Median: $39.99 Mode: $25.00 === describe() === count 181026.000000 mean 59.728416 std 67.142661 min 0.020000 25% 24.900000 50% 39.990002 75% 69.949997 max 999.000000 Name: sale_price, dtype: float64
Calculating with SQL
SELECT
AVG(sale_price) as mean_price,
PERCENTILE_CONT(sale_price, 0.5) OVER() as median_price,
MIN(sale_price) as min_price,
MAX(sale_price) as max_price
FROM src_order_items2. Measures of Dispersion
Theory
Dispersion indicates how spread out the data is.
| Measure | Description | Formula |
|---|---|---|
| Range | Maximum - Minimum | max - min |
| Variance | Average of squared differences from mean | Σ(x-μ)² / n |
| Standard Deviation (SD) | Square root of variance | √Variance |
| IQR | Q3 - Q1 | 75% - 25% |
| Coefficient of Variation (CV) | Relative variation | SD / Mean × 100% |
Calculating with Pandas
print("=== Dispersion ===")
print(f"Range: {df['sale_price'].max() - df['sale_price'].min():.2f}")
print(f"Variance: {df['sale_price'].var():.2f}")
print(f"Standard Deviation: {df['sale_price'].std():.2f}")
print(f"IQR: {df['sale_price'].quantile(0.75) - df['sale_price'].quantile(0.25):.2f}")
print(f"Coefficient of Variation: {df['sale_price'].std() / df['sale_price'].mean() * 100:.1f}%")실행 결과
=== Dispersion === Range: 998.98 Variance: 4508.14 Standard Deviation: 67.14 IQR: 45.05 Coefficient of Variation: 112.4%
Using Coefficient of Variation
# Compare variables with different units
cv_price = df['sale_price'].std() / df['sale_price'].mean() * 100
cv_age = df['age'].std() / df['age'].mean() * 100
print(f"Price CV: {cv_price:.1f}%")
print(f"Age CV: {cv_age:.1f}%")
# Higher CV means relatively more spread out실행 결과
Price CV: 112.4% Age CV: 41.6%
3. Distribution Shape
Skewness
Skewness measures the asymmetry of a distribution.
- Skewness = 0: Symmetric distribution
- Skewness > 0: Right tail (positive skew)
- Skewness < 0: Left tail (negative skew)
from scipy import stats
skewness = stats.skew(df['sale_price'].dropna())
print(f"Skewness: {skewness:.3f}")
if skewness > 0.5:
print("→ Skewed right (some high-priced products exist)")
elif skewness < -0.5:
print("→ Skewed left")
else:
print("→ Close to symmetric")실행 결과
Skewness: 5.062 → Skewed right (some high-priced products exist)
Kurtosis
Kurtosis measures how peaked a distribution is.
- Kurtosis = 0: Similar to normal distribution
- Kurtosis > 0: More peaked (more extreme values)
- Kurtosis < 0: Flatter
kurtosis = stats.kurtosis(df['sale_price'].dropna())
print(f"Kurtosis: {kurtosis:.3f}")실행 결과
Kurtosis: 45.596
4. Group-wise Descriptive Statistics
Pandas groupby + agg
# Descriptive statistics by department
dept_stats = df.groupby('department')['sale_price'].agg([
('Count', 'count'),
('Mean', 'mean'),
('Median', 'median'),
('Std Dev', 'std'),
('Min', 'min'),
('Max', 'max')
]).round(2)
print("Price Statistics by Department:")
print(dept_stats)실행 결과
Price Statistics by Department:
Count ... Max
department ...
Men 90612 ... 999.0
Women 90414 ... 903.0
[2 rows x 6 columns]Group Statistics with SQL
SELECT
department,
COUNT(*) as count,
AVG(sale_price) as mean,
STDDEV(sale_price) as std,
MIN(sale_price) as min,
MAX(sale_price) as max
FROM src_order_items oi
JOIN src_products p ON oi.product_id = p.product_id
GROUP BY department
ORDER BY mean DESCQuiz 1: Calculating Descriptive Statistics
Problem
For the number of items per order (num_of_item) in the order data:
- Calculate mean, median, standard deviation
- Interpret the difference between mean and median
- Calculate coefficient of variation
View Answer
# 1. Basic statistics
mean_items = df['num_of_item'].mean()
median_items = df['num_of_item'].median()
std_items = df['num_of_item'].std()
print(f"Mean: {mean_items:.2f}")
print(f"Median: {median_items:.2f}")
print(f"Standard Deviation: {std_items:.2f}")
# 2. Interpretation
if mean_items > median_items:
print("\n→ Mean > Median: Right-skewed distribution (some large orders exist)")
elif mean_items < median_items:
print("\n→ Mean < Median: Left-skewed distribution")
else:
print("\n→ Symmetric distribution")
# 3. Coefficient of variation
cv = std_items / mean_items * 100
print(f"\nCoefficient of Variation: {cv:.1f}%")실행 결과
Mean: 1.89 Median: 2.00 Standard Deviation: 1.06 → Mean < Median: Left-skewed distribution Coefficient of Variation: 55.9%
Summary
Key Functions Summary
| Statistic | Pandas | SQL |
|---|---|---|
| Mean | df['col'].mean() | AVG(col) |
| Median | df['col'].median() | PERCENTILE_CONT(col, 0.5) |
| Standard Deviation | df['col'].std() | STDDEV(col) |
| Variance | df['col'].var() | VARIANCE(col) |
| Min/Max | min(), max() | MIN(), MAX() |
| Percentile | quantile(0.25) | PERCENTILE_CONT(col, 0.25) |
Statistic Selection Guide
| Situation | Recommendation |
|---|---|
| No outliers | Mean, Standard Deviation |
| Outliers present | Median, IQR |
| Distribution comparison | Coefficient of Variation |
| Asymmetry check | Skewness |
Next Steps
You’ve mastered descriptive statistics! Next, learn how to analyze relationships between variables in Correlation Analysis.
Last updated on