Descriptive Statistics

Beginner

Learning Objectives

After completing this recipe, you will be able to:

Calculate measures of central tendency (mean, median, mode)
Understand measures of dispersion (standard deviation, variance, IQR)
Identify distribution shapes (skewness, kurtosis)
Calculate descriptive statistics with Pandas and SQL

0. Setup

Load CSV files for data practice.


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
 
# Load Data
orders = pd.read_csv('src_orders.csv', parse_dates=['created_at'])
items = pd.read_csv('src_order_items.csv')
products = pd.read_csv('src_products.csv')
users = pd.read_csv('src_users.csv')
 
# Merge for Analysis
df = orders.merge(items, on='order_id').merge(products, on='product_id').merge(users, on='user_id')

1. Measures of Central Tendency

Theory

Central Tendency indicates where the data is concentrated.

Measure	Description	Advantages	Disadvantages
Mean	Sum of all values ÷ Count	Reflects all data	Sensitive to outliers
Median	Middle value when sorted	Robust to outliers	Ignores distribution tails
Mode	Most frequent value	Useful for categorical data	May not be unique

Calculating with Pandas


import pandas as pd
import numpy as np
 
# Basic statistics
print("=== Central Tendency ===")
print(f"Mean: ${df['sale_price'].mean():.2f}")
print(f"Median: ${df['sale_price'].median():.2f}")
print(f"Mode: ${df['sale_price'].mode()[0]:.2f}")
 
# All at once with describe()
print("\n=== describe() ===")
print(df['sale_price'].describe())

실행 결과

=== Central Tendency ===
Mean: $59.73
Median: $39.99
Mode: $25.00

=== describe() ===
count    181026.000000
mean         59.728416
std          67.142661
min           0.020000
25%          24.900000
50%          39.990002
75%          69.949997
max         999.000000
Name: sale_price, dtype: float64

Calculating with SQL


SELECT
    AVG(sale_price) as mean_price,
    PERCENTILE_CONT(sale_price, 0.5) OVER() as median_price,
    MIN(sale_price) as min_price,
    MAX(sale_price) as max_price
FROM src_order_items

2. Measures of Dispersion

Theory

Dispersion indicates how spread out the data is.

Measure	Description	Formula
Range	Maximum - Minimum	max - min
Variance	Average of squared differences from mean	Σ(x-μ)² / n
Standard Deviation (SD)	Square root of variance	√Variance
IQR	Q3 - Q1	75% - 25%
Coefficient of Variation (CV)	Relative variation	SD / Mean × 100%

Calculating with Pandas


print("=== Dispersion ===")
print(f"Range: {df['sale_price'].max() - df['sale_price'].min():.2f}")
print(f"Variance: {df['sale_price'].var():.2f}")
print(f"Standard Deviation: {df['sale_price'].std():.2f}")
print(f"IQR: {df['sale_price'].quantile(0.75) - df['sale_price'].quantile(0.25):.2f}")
print(f"Coefficient of Variation: {df['sale_price'].std() / df['sale_price'].mean() * 100:.1f}%")

실행 결과

=== Dispersion ===
Range: 998.98
Variance: 4508.14
Standard Deviation: 67.14
IQR: 45.05
Coefficient of Variation: 112.4%

Using Coefficient of Variation


# Compare variables with different units
cv_price = df['sale_price'].std() / df['sale_price'].mean() * 100
cv_age = df['age'].std() / df['age'].mean() * 100
 
print(f"Price CV: {cv_price:.1f}%")
print(f"Age CV: {cv_age:.1f}%")
# Higher CV means relatively more spread out

실행 결과

Price CV: 112.4%
Age CV: 41.6%

3. Distribution Shape

Skewness

Skewness measures the asymmetry of a distribution.

Skewness = 0: Symmetric distribution
Skewness > 0: Right tail (positive skew)
Skewness < 0: Left tail (negative skew)


from scipy import stats
 
skewness = stats.skew(df['sale_price'].dropna())
print(f"Skewness: {skewness:.3f}")
 
if skewness > 0.5:
    print("→ Skewed right (some high-priced products exist)")
elif skewness < -0.5:
    print("→ Skewed left")
else:
    print("→ Close to symmetric")

실행 결과

Skewness: 5.062
→ Skewed right (some high-priced products exist)

Kurtosis

Kurtosis measures how peaked a distribution is.

Kurtosis = 0: Similar to normal distribution
Kurtosis > 0: More peaked (more extreme values)
Kurtosis < 0: Flatter


kurtosis = stats.kurtosis(df['sale_price'].dropna())
print(f"Kurtosis: {kurtosis:.3f}")

실행 결과

Kurtosis: 45.596

4. Group-wise Descriptive Statistics

Pandas groupby + agg


# Descriptive statistics by department
dept_stats = df.groupby('department')['sale_price'].agg([
    ('Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std Dev', 'std'),
    ('Min', 'min'),
    ('Max', 'max')
]).round(2)
 
print("Price Statistics by Department:")
print(dept_stats)

실행 결과

Price Statistics by Department:
          Count  ...     Max
department         ...
Men         90612  ...  999.0
Women       90414  ...  903.0

[2 rows x 6 columns]

Group Statistics with SQL


SELECT
    department,
    COUNT(*) as count,
    AVG(sale_price) as mean,
    STDDEV(sale_price) as std,
    MIN(sale_price) as min,
    MAX(sale_price) as max
FROM src_order_items oi
JOIN src_products p ON oi.product_id = p.product_id
GROUP BY department
ORDER BY mean DESC

Quiz 1: Calculating Descriptive Statistics

Problem

For the number of items per order (num_of_item) in the order data:

Calculate mean, median, standard deviation
Interpret the difference between mean and median
Calculate coefficient of variation

View Answer


# 1. Basic statistics
mean_items = df['num_of_item'].mean()
median_items = df['num_of_item'].median()
std_items = df['num_of_item'].std()
 
print(f"Mean: {mean_items:.2f}")
print(f"Median: {median_items:.2f}")
print(f"Standard Deviation: {std_items:.2f}")
 
# 2. Interpretation
if mean_items > median_items:
    print("\n→ Mean > Median: Right-skewed distribution (some large orders exist)")
elif mean_items < median_items:
    print("\n→ Mean < Median: Left-skewed distribution")
else:
    print("\n→ Symmetric distribution")
 
# 3. Coefficient of variation
cv = std_items / mean_items * 100
print(f"\nCoefficient of Variation: {cv:.1f}%")

실행 결과

Mean: 1.89
Median: 2.00
Standard Deviation: 1.06

→ Mean < Median: Left-skewed distribution

Coefficient of Variation: 55.9%

Summary

Key Functions Summary

Statistic	Pandas	SQL
Mean	`df['col'].mean()`	`AVG(col)`
Median	`df['col'].median()`	`PERCENTILE_CONT(col, 0.5)`
Standard Deviation	`df['col'].std()`	`STDDEV(col)`
Variance	`df['col'].var()`	`VARIANCE(col)`
Min/Max	`min()`, `max()`	`MIN()`, `MAX()`
Percentile	`quantile(0.25)`	`PERCENTILE_CONT(col, 0.25)`

Statistic Selection Guide

Situation	Recommendation
No outliers	Mean, Standard Deviation
Outliers present	Median, IQR
Distribution comparison	Coefficient of Variation
Asymmetry check	Skewness

Next Steps

You’ve mastered descriptive statistics! Next, learn how to analyze relationships between variables in Correlation Analysis.