Pandas (Python Data Analysis)
This section covers the core concepts and principles of Pandas, the standard library for data analysis in the Python ecosystem.
1. Core Structures
Series (1-Dimensional)
- Concept: Like a single ‘column’ in Excel.
- Characteristic: All data has the same type (dtype). (e.g., all integers, all dates)
- Components: Values + Index
import pandas as pd
# Creating a Series
price_series = df['sale_price']
print(f"Type: {type(price_series)}")
print(f"Data Type (dtype): {price_series.dtype}")
print(f"Size: {len(price_series):,}")Type: <class 'pandas.core.series.Series'> Data Type (dtype): float64 Size: 124,892
DataFrame (2-Dimensional)
- Concept: Like an Excel ‘sheet’.
- Characteristic: A collection of multiple Series.
- Column-Major: Data in the same column is stored contiguously in memory. This makes column-wise operations fast.
# Each column of a DataFrame is a Series
print(f"Type of df['category']: {type(df['category'])}")
print(f"Type of df[['category', 'sale_price']]: {type(df[['category', 'sale_price']])}")Type of df['category']: <class 'pandas.core.series.Series'> Type of df[['category', 'sale_price']]: <class 'pandas.core.frame.DataFrame'>
2. Vectorization
The most significant feature of Pandas (and NumPy) is that you don’t use loops (for-loops).
Slow Method (Python for-loop)
# With 1 million data points
total = 0
for price in df['price']:
total += priceThe Python interpreter fetches each list element one by one, checks its type, and adds it. (High overhead)
Fast Method (Vectorization)
total = df['price'].sum()Internally optimized C functions process the entire memory block at once (using CPU SIMD, etc.). Hundreds of times faster.
For-loop Sum: 0.1892 sec Vectorized Sum: 0.0004 sec ======================================== Speedup: 473x faster!
Vectorization Examples
# Aggregation operations
print(f"Total Sales: ${df['sale_price'].sum():,.2f}")
print(f"Average Sales: ${df['sale_price'].mean():.2f}")
# Vector arithmetic operations
df['profit'] = df['sale_price'] - df['cost']
df['margin_rate'] = (df['profit'] / df['sale_price'] * 100).round(2)Total Sales: $5,234,892.45 Average Sales: $41.92
3. Indexing (loc vs iloc)
This is the most confusing part. You must clearly distinguish between them.
| Type | Syntax | Description | Example |
|---|---|---|---|
| Label-based | loc[row_label, col_label] | Find by name. | df.loc[0, 'category'] |
| Position-based | iloc[row_number, col_number] | Find by number (position). | df.iloc[0, 3] (row 0, column 3) |
# loc: Label-based (by name)
sample.loc[0, 'category'] # 'Jeans'
# iloc: Position-based (by position)
sample.iloc[0, 1] # 'Jeans'
# Note: loc includes the end, iloc excludes the end
sample.loc[0:2] # 0, 1, 2 (3 rows)
sample.iloc[0:2] # 0, 1 (2 rows)Conditional Filtering
# Boolean Indexing (most commonly used)
high_value = df[df['sale_price'] > 100]
# Multiple conditions
filtered = df[(df['sale_price'] > 50) & (df['category'] == 'Jeans')]
# query() method (SQL style)
filtered_query = df.query("sale_price > 50 and category == 'Jeans'")Best Practice: Use explicit
locwhen possible, or use thequery()method for better readability.
Curriculum
1. Data Loading and Exploration
Learn how to load data from various formats including CSV, Excel, JSON, and explore basic methods.
2. Data Cleaning
Learn data quality management methods including handling missing values, removing duplicates, and detecting outliers.
3. Advanced Filtering
Learn data filtering techniques using various conditions.
4. Grouping and Aggregation
Learn data summarization and analysis methods using groupby() and aggregation functions.
5. Data Merging
Learn various methods for combining multiple DataFrames.
6. Date/Time Handling
Learn how to use datetime types for time series data processing.
7. Pivot and Reshaping
Learn pivot, melt, and stack operations for transforming data shapes.
SQL to Pandas Conversion Guide
| SQL | Pandas |
|---|---|
SELECT col1, col2 | df[['col1', 'col2']] |
WHERE condition | df.query("condition") or df[condition] |
GROUP BY col | df.groupby('col') |
JOIN | df.merge(df2, on='key') |
ORDER BY col | df.sort_values('col') |
LIMIT n | df.head(n) |
# SQL: SELECT category, SUM(sale_price)
# FROM df
# GROUP BY category
# ORDER BY SUM(sale_price) DESC
# LIMIT 5
result = (
df.groupby('category')['sale_price']
.sum()
.sort_values(ascending=False)
.head(5)
)category Outerwear & Coats 892341.23 Jeans 654892.11 Sweaters 543210.87 Suits & Sport Coats 432109.65 Swim 321098.43 Name: sale_price, dtype: float64
Key Summary
- Series: 1-dimensional array (like an Excel column), same type
- DataFrame: 2-dimensional table (like an Excel sheet), combination of multiple Series
- Vectorization: Use built-in functions instead of for-loops for 100-1000x performance improvement
- loc: Label-based, includes the end
- iloc: Position-based, excludes the end
To practice these concepts hands-on, navigate to each section in the curriculum and follow along with the example code!