Skip to Content
ConceptsPandasOverview

Pandas (Python Data Analysis)

This section covers the core concepts and principles of Pandas, the standard library for data analysis in the Python ecosystem.

EssentialInternal

1. Core Structures

Series (1-Dimensional)

  • Concept: Like a single ‘column’ in Excel.
  • Characteristic: All data has the same type (dtype). (e.g., all integers, all dates)
  • Components: Values + Index
import pandas as pd # Creating a Series price_series = df['sale_price'] print(f"Type: {type(price_series)}") print(f"Data Type (dtype): {price_series.dtype}") print(f"Size: {len(price_series):,}")
실행 결과
Type: <class 'pandas.core.series.Series'>
Data Type (dtype): float64
Size: 124,892

DataFrame (2-Dimensional)

  • Concept: Like an Excel ‘sheet’.
  • Characteristic: A collection of multiple Series.
  • Column-Major: Data in the same column is stored contiguously in memory. This makes column-wise operations fast.
# Each column of a DataFrame is a Series print(f"Type of df['category']: {type(df['category'])}") print(f"Type of df[['category', 'sale_price']]: {type(df[['category', 'sale_price']])}")
실행 결과
Type of df['category']: <class 'pandas.core.series.Series'>
Type of df[['category', 'sale_price']]: <class 'pandas.core.frame.DataFrame'>

2. Vectorization

The most significant feature of Pandas (and NumPy) is that you don’t use loops (for-loops).

Slow Method (Python for-loop)

# With 1 million data points total = 0 for price in df['price']: total += price

The Python interpreter fetches each list element one by one, checks its type, and adds it. (High overhead)

Fast Method (Vectorization)

total = df['price'].sum()

Internally optimized C functions process the entire memory block at once (using CPU SIMD, etc.). Hundreds of times faster.

Execution Result (1 million data points)
For-loop Sum: 0.1892 sec
Vectorized Sum: 0.0004 sec
========================================
Speedup: 473x faster!

Vectorization Examples

# Aggregation operations print(f"Total Sales: ${df['sale_price'].sum():,.2f}") print(f"Average Sales: ${df['sale_price'].mean():.2f}") # Vector arithmetic operations df['profit'] = df['sale_price'] - df['cost'] df['margin_rate'] = (df['profit'] / df['sale_price'] * 100).round(2)
실행 결과
Total Sales: $5,234,892.45
Average Sales: $41.92

3. Indexing (loc vs iloc)

This is the most confusing part. You must clearly distinguish between them.

TypeSyntaxDescriptionExample
Label-basedloc[row_label, col_label]Find by name.df.loc[0, 'category']
Position-basediloc[row_number, col_number]Find by number (position).df.iloc[0, 3] (row 0, column 3)
# loc: Label-based (by name) sample.loc[0, 'category'] # 'Jeans' # iloc: Position-based (by position) sample.iloc[0, 1] # 'Jeans' # Note: loc includes the end, iloc excludes the end sample.loc[0:2] # 0, 1, 2 (3 rows) sample.iloc[0:2] # 0, 1 (2 rows)

Conditional Filtering

# Boolean Indexing (most commonly used) high_value = df[df['sale_price'] > 100] # Multiple conditions filtered = df[(df['sale_price'] > 50) & (df['category'] == 'Jeans')] # query() method (SQL style) filtered_query = df.query("sale_price > 50 and category == 'Jeans'")

Best Practice: Use explicit loc when possible, or use the query() method for better readability.


Curriculum

1. Data Loading and Exploration

Learn how to load data from various formats including CSV, Excel, JSON, and explore basic methods.

2. Data Cleaning

Learn data quality management methods including handling missing values, removing duplicates, and detecting outliers.

3. Advanced Filtering

Learn data filtering techniques using various conditions.

4. Grouping and Aggregation

Learn data summarization and analysis methods using groupby() and aggregation functions.

5. Data Merging

Learn various methods for combining multiple DataFrames.

6. Date/Time Handling

Learn how to use datetime types for time series data processing.

7. Pivot and Reshaping

Learn pivot, melt, and stack operations for transforming data shapes.

SQL to Pandas Conversion Guide

SQLPandas
SELECT col1, col2df[['col1', 'col2']]
WHERE conditiondf.query("condition") or df[condition]
GROUP BY coldf.groupby('col')
JOINdf.merge(df2, on='key')
ORDER BY coldf.sort_values('col')
LIMIT ndf.head(n)
# SQL: SELECT category, SUM(sale_price) # FROM df # GROUP BY category # ORDER BY SUM(sale_price) DESC # LIMIT 5 result = ( df.groupby('category')['sale_price'] .sum() .sort_values(ascending=False) .head(5) )
실행 결과
category
Outerwear & Coats    892341.23
Jeans                654892.11
Sweaters             543210.87
Suits & Sport Coats  432109.65
Swim                 321098.43
Name: sale_price, dtype: float64

Key Summary

  1. Series: 1-dimensional array (like an Excel column), same type
  2. DataFrame: 2-dimensional table (like an Excel sheet), combination of multiple Series
  3. Vectorization: Use built-in functions instead of for-loops for 100-1000x performance improvement
  4. loc: Label-based, includes the end
  5. iloc: Position-based, excludes the end
💡
Next Steps

To practice these concepts hands-on, navigate to each section in the curriculum and follow along with the example code!

Last updated on

🤖AI 모의면접실전처럼 연습하기