Pandas (Python Data Analysis)

This section covers the core concepts and principles of Pandas, the standard library for data analysis in the Python ecosystem.

EssentialInternal

1. Core Structures

Series (1-Dimensional)

Concept: Like a single ‘column’ in Excel.
Characteristic: All data has the same type (dtype). (e.g., all integers, all dates)
Components: Values + Index


import pandas as pd
 
# Creating a Series
price_series = df['sale_price']
 
print(f"Type: {type(price_series)}")
print(f"Data Type (dtype): {price_series.dtype}")
print(f"Size: {len(price_series):,}")

실행 결과

Type: <class 'pandas.core.series.Series'>
Data Type (dtype): float64
Size: 124,892

DataFrame (2-Dimensional)

Concept: Like an Excel ‘sheet’.
Characteristic: A collection of multiple Series.
Column-Major: Data in the same column is stored contiguously in memory. This makes column-wise operations fast.


# Each column of a DataFrame is a Series
print(f"Type of df['category']: {type(df['category'])}")
print(f"Type of df[['category', 'sale_price']]: {type(df[['category', 'sale_price']])}")

실행 결과

Type of df['category']: <class 'pandas.core.series.Series'>
Type of df[['category', 'sale_price']]: <class 'pandas.core.frame.DataFrame'>

2. Vectorization

The most significant feature of Pandas (and NumPy) is that you don’t use loops (for-loops).

Slow Method (Python for-loop)


# With 1 million data points
total = 0
for price in df['price']:
    total += price

The Python interpreter fetches each list element one by one, checks its type, and adds it. (High overhead)

Fast Method (Vectorization)


total = df['price'].sum()

Internally optimized C functions process the entire memory block at once (using CPU SIMD, etc.). Hundreds of times faster.

Execution Result (1 million data points)

For-loop Sum: 0.1892 sec
Vectorized Sum: 0.0004 sec
========================================
Speedup: 473x faster!

Vectorization Examples


# Aggregation operations
print(f"Total Sales: ${df['sale_price'].sum():,.2f}")
print(f"Average Sales: ${df['sale_price'].mean():.2f}")
 
# Vector arithmetic operations
df['profit'] = df['sale_price'] - df['cost']
df['margin_rate'] = (df['profit'] / df['sale_price'] * 100).round(2)

실행 결과

Total Sales: $5,234,892.45
Average Sales: $41.92

3. Indexing (`loc` vs `iloc`)

This is the most confusing part. You must clearly distinguish between them.

Type	Syntax	Description	Example
Label-based	`loc[row_label, col_label]`	Find by name.	`df.loc[0, 'category']`
Position-based	`iloc[row_number, col_number]`	Find by number (position).	`df.iloc[0, 3]` (row 0, column 3)


# loc: Label-based (by name)
sample.loc[0, 'category']  # 'Jeans'
 
# iloc: Position-based (by position)
sample.iloc[0, 1]  # 'Jeans'
 
# Note: loc includes the end, iloc excludes the end
sample.loc[0:2]   # 0, 1, 2 (3 rows)
sample.iloc[0:2]  # 0, 1 (2 rows)

Conditional Filtering


# Boolean Indexing (most commonly used)
high_value = df[df['sale_price'] > 100]
 
# Multiple conditions
filtered = df[(df['sale_price'] > 50) & (df['category'] == 'Jeans')]
 
# query() method (SQL style)
filtered_query = df.query("sale_price > 50 and category == 'Jeans'")

Best Practice: Use explicit loc when possible, or use the query() method for better readability.

Curriculum

1. Data Loading and Exploration

Learn how to load data from various formats including CSV, Excel, JSON, and explore basic methods.

Get Started with Data Loading and Exploration

2. Data Cleaning

Learn data quality management methods including handling missing values, removing duplicates, and detecting outliers.

Get Started with Data Cleaning

3. Advanced Filtering

Learn data filtering techniques using various conditions.

Get Started with Advanced Filtering

4. Grouping and Aggregation

Learn data summarization and analysis methods using groupby() and aggregation functions.

Get Started with Grouping and Aggregation

5. Data Merging

Learn various methods for combining multiple DataFrames.

Get Started with Data Merging

6. Date/Time Handling

Learn how to use datetime types for time series data processing.

Get Started with Date/Time Handling

7. Pivot and Reshaping

Learn pivot, melt, and stack operations for transforming data shapes.

Get Started with Pivot and Reshaping

SQL to Pandas Conversion Guide

SQL	Pandas
`SELECT col1, col2`	`df[['col1', 'col2']]`
`WHERE condition`	`df.query("condition")` or `df[condition]`
`GROUP BY col`	`df.groupby('col')`
`JOIN`	`df.merge(df2, on='key')`
`ORDER BY col`	`df.sort_values('col')`
`LIMIT n`	`df.head(n)`


# SQL: SELECT category, SUM(sale_price)
#      FROM df
#      GROUP BY category
#      ORDER BY SUM(sale_price) DESC
#      LIMIT 5
 
result = (
    df.groupby('category')['sale_price']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

실행 결과

category
Outerwear & Coats    892341.23
Jeans                654892.11
Sweaters             543210.87
Suits & Sport Coats  432109.65
Swim                 321098.43
Name: sale_price, dtype: float64

Key Summary

Series: 1-dimensional array (like an Excel column), same type
DataFrame: 2-dimensional table (like an Excel sheet), combination of multiple Series
Vectorization: Use built-in functions instead of for-loops for 100-1000x performance improvement
loc: Label-based, includes the end
iloc: Position-based, excludes the end

💡

Next Steps

To practice these concepts hands-on, navigate to each section in the curriculum and follow along with the example code!

Pandas (Python Data Analysis)

1. Core Structures

Series (1-Dimensional)

DataFrame (2-Dimensional)

2. Vectorization

Slow Method (Python for-loop)

Fast Method (Vectorization)

Vectorization Examples

3. Indexing (loc vs iloc)

Conditional Filtering

Curriculum

1. Data Loading and Exploration

2. Data Cleaning

3. Advanced Filtering

4. Grouping and Aggregation

5. Data Merging

6. Date/Time Handling

7. Pivot and Reshaping

SQL to Pandas Conversion Guide

Key Summary

3. Indexing (`loc` vs `iloc`)