Skip to Content
ConceptsStatisticsDescriptive Statistics

Descriptive Statistics

Beginner

Learning Objectives

After completing this recipe, you will be able to:

  • Calculate measures of central tendency (mean, median, mode)
  • Understand measures of dispersion (standard deviation, variance, IQR)
  • Identify distribution shapes (skewness, kurtosis)
  • Calculate descriptive statistics with Pandas and SQL

0. Setup

Load CSV files for data practice.

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt from scipy import stats # Load Data orders = pd.read_csv('src_orders.csv', parse_dates=['created_at']) items = pd.read_csv('src_order_items.csv') products = pd.read_csv('src_products.csv') users = pd.read_csv('src_users.csv') # Merge for Analysis df = orders.merge(items, on='order_id').merge(products, on='product_id').merge(users, on='user_id')

1. Measures of Central Tendency

Theory

Central Tendency indicates where the data is concentrated.

MeasureDescriptionAdvantagesDisadvantages
MeanSum of all values ÷ CountReflects all dataSensitive to outliers
MedianMiddle value when sortedRobust to outliersIgnores distribution tails
ModeMost frequent valueUseful for categorical dataMay not be unique

Calculating with Pandas

import pandas as pd import numpy as np # Basic statistics print("=== Central Tendency ===") print(f"Mean: ${df['sale_price'].mean():.2f}") print(f"Median: ${df['sale_price'].median():.2f}") print(f"Mode: ${df['sale_price'].mode()[0]:.2f}") # All at once with describe() print("\n=== describe() ===") print(df['sale_price'].describe())
실행 결과
=== Central Tendency ===
Mean: $59.73
Median: $39.99
Mode: $25.00

=== describe() ===
count    181026.000000
mean         59.728416
std          67.142661
min           0.020000
25%          24.900000
50%          39.990002
75%          69.949997
max         999.000000
Name: sale_price, dtype: float64

Calculating with SQL

SELECT AVG(sale_price) as mean_price, PERCENTILE_CONT(sale_price, 0.5) OVER() as median_price, MIN(sale_price) as min_price, MAX(sale_price) as max_price FROM src_order_items

2. Measures of Dispersion

Theory

Dispersion indicates how spread out the data is.

MeasureDescriptionFormula
RangeMaximum - Minimummax - min
VarianceAverage of squared differences from meanΣ(x-μ)² / n
Standard Deviation (SD)Square root of variance√Variance
IQRQ3 - Q175% - 25%
Coefficient of Variation (CV)Relative variationSD / Mean × 100%

Calculating with Pandas

print("=== Dispersion ===") print(f"Range: {df['sale_price'].max() - df['sale_price'].min():.2f}") print(f"Variance: {df['sale_price'].var():.2f}") print(f"Standard Deviation: {df['sale_price'].std():.2f}") print(f"IQR: {df['sale_price'].quantile(0.75) - df['sale_price'].quantile(0.25):.2f}") print(f"Coefficient of Variation: {df['sale_price'].std() / df['sale_price'].mean() * 100:.1f}%")
실행 결과
=== Dispersion ===
Range: 998.98
Variance: 4508.14
Standard Deviation: 67.14
IQR: 45.05
Coefficient of Variation: 112.4%

Using Coefficient of Variation

# Compare variables with different units cv_price = df['sale_price'].std() / df['sale_price'].mean() * 100 cv_age = df['age'].std() / df['age'].mean() * 100 print(f"Price CV: {cv_price:.1f}%") print(f"Age CV: {cv_age:.1f}%") # Higher CV means relatively more spread out
실행 결과
Price CV: 112.4%
Age CV: 41.6%

3. Distribution Shape

Skewness

Skewness measures the asymmetry of a distribution.

  • Skewness = 0: Symmetric distribution
  • Skewness > 0: Right tail (positive skew)
  • Skewness < 0: Left tail (negative skew)
from scipy import stats skewness = stats.skew(df['sale_price'].dropna()) print(f"Skewness: {skewness:.3f}") if skewness > 0.5: print("→ Skewed right (some high-priced products exist)") elif skewness < -0.5: print("→ Skewed left") else: print("→ Close to symmetric")
실행 결과
Skewness: 5.062
→ Skewed right (some high-priced products exist)

Kurtosis

Kurtosis measures how peaked a distribution is.

  • Kurtosis = 0: Similar to normal distribution
  • Kurtosis > 0: More peaked (more extreme values)
  • Kurtosis < 0: Flatter
kurtosis = stats.kurtosis(df['sale_price'].dropna()) print(f"Kurtosis: {kurtosis:.3f}")
실행 결과
Kurtosis: 45.596

4. Group-wise Descriptive Statistics

Pandas groupby + agg

# Descriptive statistics by department dept_stats = df.groupby('department')['sale_price'].agg([ ('Count', 'count'), ('Mean', 'mean'), ('Median', 'median'), ('Std Dev', 'std'), ('Min', 'min'), ('Max', 'max') ]).round(2) print("Price Statistics by Department:") print(dept_stats)
실행 결과
Price Statistics by Department:
          Count  ...     Max
department         ...
Men         90612  ...  999.0
Women       90414  ...  903.0

[2 rows x 6 columns]

Group Statistics with SQL

SELECT department, COUNT(*) as count, AVG(sale_price) as mean, STDDEV(sale_price) as std, MIN(sale_price) as min, MAX(sale_price) as max FROM src_order_items oi JOIN src_products p ON oi.product_id = p.product_id GROUP BY department ORDER BY mean DESC

Quiz 1: Calculating Descriptive Statistics

Problem

For the number of items per order (num_of_item) in the order data:

  1. Calculate mean, median, standard deviation
  2. Interpret the difference between mean and median
  3. Calculate coefficient of variation

View Answer

# 1. Basic statistics mean_items = df['num_of_item'].mean() median_items = df['num_of_item'].median() std_items = df['num_of_item'].std() print(f"Mean: {mean_items:.2f}") print(f"Median: {median_items:.2f}") print(f"Standard Deviation: {std_items:.2f}") # 2. Interpretation if mean_items > median_items: print("\n→ Mean > Median: Right-skewed distribution (some large orders exist)") elif mean_items < median_items: print("\n→ Mean < Median: Left-skewed distribution") else: print("\n→ Symmetric distribution") # 3. Coefficient of variation cv = std_items / mean_items * 100 print(f"\nCoefficient of Variation: {cv:.1f}%")
실행 결과
Mean: 1.89
Median: 2.00
Standard Deviation: 1.06

→ Mean < Median: Left-skewed distribution

Coefficient of Variation: 55.9%

Summary

Key Functions Summary

StatisticPandasSQL
Meandf['col'].mean()AVG(col)
Mediandf['col'].median()PERCENTILE_CONT(col, 0.5)
Standard Deviationdf['col'].std()STDDEV(col)
Variancedf['col'].var()VARIANCE(col)
Min/Maxmin(), max()MIN(), MAX()
Percentilequantile(0.25)PERCENTILE_CONT(col, 0.25)

Statistic Selection Guide

SituationRecommendation
No outliersMean, Standard Deviation
Outliers presentMedian, IQR
Distribution comparisonCoefficient of Variation
Asymmetry checkSkewness

Next Steps

You’ve mastered descriptive statistics! Next, learn how to analyze relationships between variables in Correlation Analysis.

Last updated on

🤖AI 모의면접실전처럼 연습하기