Data Loading and Exploration
Beginner
Learning Objectives
After completing this recipe, you will be able to:
- Load data from various formats including CSV, Excel, and JSON
- Understand DataFrame structure
- Explore data with head(), tail(), info(), and describe()
- Check and convert data types
- Apply memory optimization techniques
0. Setup
Load Cookbook sample data for hands-on practice.
import pandas as pd
# Load data
DATA_PATH = '/data/' # Cookbook sample data path
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")실행 결과
Data loaded! - orders: 100,000 rows - order_items: 150,000 rows - products: 1,000 rows - users: 10,000 rows
1. What is a DataFrame?
Theory
DataFrame = A table-like data structure similar to an Excel sheet.
- Row: a horizontal slice of the table (one record)
- Column: a vertical slice (one attribute)
- Index: labels identifying rows (default: 0, 1, 2, …)
Practice: Creating the Simplest DataFrame
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
# Creating a DataFrame from a dictionary (easiest way!)
data = {
'name': ['John', 'Emily', 'Mike', 'Sarah'],
'age': [25, 30, 35, 28],
'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df = pd.DataFrame(data)
print("DataFrame created!")
print(df)
Execution Result:
Pandas version: 2.2.3
DataFrame created!
    name  age      city
0   John   25  New York
1  Emily   30    Boston
2   Mike   35   Chicago
3  Sarah   28  New York
Explanation
- 'name', 'age', 'city' -> column names (column headers)
- ['John', 'Emily', ...] -> values (the data)
- 0, 1, 2, 3 on the left -> index (row numbers, auto-generated)
2. Reading CSV Files
Theory
CSV (Comma-Separated Values) is the most common data format.
Syntax
df = pd.read_csv('filepath.csv')
Key Options
| Option | Description | Example |
|---|---|---|
| encoding | File encoding | encoding='utf-8' |
| sep | Delimiter | sep='\t' (tab) |
| header | Header row number | header=0 (first row) |
| index_col | Column to use as index | index_col='id' |
| usecols | Read specific columns only | usecols=['col1', 'col2'] |
| nrows | Number of rows to read | nrows=1000 |
| dtype | Specify data types | dtype={'col': str} |
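Several of these options can be combined in a single call. A minimal sketch against the same sample file (column names taken from the orders data above; the exact combination is illustrative):
# Combine several read_csv options in one call
df_opts = pd.read_csv(
    '/data/src_orders.csv',
    usecols=['order_id', 'user_id', 'status'],  # read only these columns
    dtype={'status': 'category'},               # parse status directly as category
    index_col='order_id',                       # use order_id as the index
    nrows=500                                   # limit to the first 500 rows
)
print(df_opts.dtypes)
print(df_opts.head())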
Practice: Reading CSV Files
DATA_PATH = '/data/' # Cookbook sample data path
# Basic read
df = pd.read_csv(DATA_PATH + 'src_orders.csv')
print(f"Total data: {len(df):,} rows")
# Read specific columns only
df_cols = pd.read_csv(DATA_PATH + 'src_orders.csv', usecols=['order_id', 'user_id', 'created_at'])
print(f"Specific columns only: {df_cols.columns.tolist()}")
# Read first 1000 rows only
df_sample = pd.read_csv(DATA_PATH + 'src_orders.csv', nrows=1000)
print(f"\nFirst 1000 rows sample:")
print(df_sample.head())
Execution Result:
Total data: 100,000 rows
Specific columns only: ['order_id', 'user_id', 'created_at']

First 1000 rows sample:
   order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1
3. Reading Excel/JSON Files
Excel Files
# Practice with products data as Excel
products = pd.read_csv('/data/src_products.csv')
# Save and read file (requires openpyxl)
try:
    products.head(10).to_excel('products_sample.xlsx', index=False)
    # Basic read
    df_excel = pd.read_excel('products_sample.xlsx')
    print("Excel file read successful:")
    print(df_excel.head(3))
except ImportError:
    print("openpyxl library required: pip install openpyxl")
Execution Result:
Excel file read successful:
   product_id      name   category  price
0           1  Product1  CategoryA   1000
1           2  Product2  CategoryB   2000
2           3  Product3  CategoryC   3000
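If a workbook contains several sheets, read_excel can target one by name or position via the sheet_name parameter. A minimal sketch (the sheet name 'Sheet1' is an assumption about the default name of the file written above):
try:
    df_by_name = pd.read_excel('products_sample.xlsx', sheet_name='Sheet1')  # by sheet name (assumed default)
    df_by_pos = pd.read_excel('products_sample.xlsx', sheet_name=0)          # by sheet position
    print(df_by_name.shape, df_by_pos.shape)
except ImportError:
    print("openpyxl library required: pip install openpyxl")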
JSON Files
# Save DataFrame to JSON then read
products = pd.read_csv('/data/src_products.csv')
products.head(5).to_json('products_sample.json', orient='records', force_ascii=False)
# Read JSON file
df_json = pd.read_json('products_sample.json')
print("JSON file read:")
print(df_json)
# Nested JSON handling example
json_data = {
'company': 'ABC Corp',
'items': [
{'name': 'Product A', 'price': 100},
{'name': 'Product B', 'price': 200}
]
}
df_nested = pd.json_normalize(json_data, record_path='items')
print("\nNested JSON handling:")
print(df_nested)
Execution Result:
JSON file read:
product_id name category price
0 1 Product1 CategoryA 1000
1 2 Product2 CategoryB 2000
2 3 Product3 CategoryC 3000
3 4 Product4 CategoryA 4000
4 5 Product5 CategoryB 5000
Nested JSON handling:
name price
0 Product A 100
1  Product B    200
Reading Directly from BigQuery
# BigQuery integration example (for actual environments)
# from google.cloud import bigquery
#
# client = bigquery.Client(project='your-project')
#
# query = """
# SELECT *
# FROM `project.dataset.src_orders`
# LIMIT 1000
# """
#
# df = client.query(query).to_dataframe()
# print(f"Loaded {len(df)} rows from BigQuery")
# For practice, use local CSV
df = pd.read_csv('/data/src_orders.csv', nrows=1000)
print(f"Data loaded: {len(df)} rows")실행 결과
Data loaded: 1000 rows
4. Basic Data Exploration
Viewing First/Last Few Rows
df = pd.read_csv('/data/src_orders.csv')
# First 5 rows (default)
print("First 5 rows:")
print(df.head())
print("\nFirst 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
Execution Result:
First 5 rows:
order_id user_id created_at status num_of_item
0 1 8372 2023-01-01 00:00:00 Complete 1
1 2 5765 2023-01-01 01:00:00 Shipped 2
2 3 2341 2023-01-01 02:00:00 Complete 1
3 4 9156 2023-01-01 03:00:00 Pending 1
4 5 4489 2023-01-01 04:00:00 Complete 1
First 3 rows:
order_id user_id created_at status num_of_item
0 1 8372 2023-01-01 00:00:00 Complete 1
1 2 5765 2023-01-01 01:00:00 Shipped 2
2 3 2341 2023-01-01 02:00:00 Complete 1
Last 2 rows:
order_id user_id created_at status num_of_item
99998 99999 3847 2024-12-30 22:00:00 Complete 1
99999    100000     6721  2024-12-30 23:00:00   Pending            2
Checking Data Size
print("DataFrame size (rows, columns):", df.shape)
print("Total row count:", len(df))
print("Column names:", df.columns.tolist())실행 결과
DataFrame size (rows, columns): (100000, 5) Total row count: 100000 Column names: ['order_id', 'user_id', 'created_at', 'status', 'num_of_item']
Checking Data Types
print("Data type for each column:")
print(df.dtypes)
Execution Result:
Data type for each column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object
Key Data Types:
| Pandas Type | Description | Example |
|---|---|---|
| object | String (text) | Name, address |
| int64 | Integer | Age, quantity |
| float64 | Decimal | Price, ratio |
| datetime64 | Date/time | Order date, created date |
| bool | Boolean | True/False |
| category | Categorical | Gender, grade |
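A quick way to see which columns fall into each group is select_dtypes, which filters columns by type. A minimal sketch using the orders DataFrame loaded above:
# Group columns by dtype
numeric_cols = df.select_dtypes(include='number').columns.tolist()
text_cols = df.select_dtypes(include='object').columns.tolist()
print("Numeric columns:", numeric_cols)
print("Text columns:", text_cols)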
Overview of All Information
# Summary of all info
df.info()
Output Example:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 100000 non-null int64
1 user_id 100000 non-null int64
2 created_at 100000 non-null object
3 status 100000 non-null object
4 num_of_item 100000 non-null int64
dtypes: int64(3), object(2)
memory usage: 3.8+ MB
Quiz 1: Data Loading and Exploration
Problem
Load order data and:
- Read CSV file
- Check data size (rows, columns)
- Check data type for each column
- Output first 10 rows
View Answer
import pandas as pd
# 1. Read CSV file
df = pd.read_csv('/data/src_orders.csv')
# 2. Check data size
print(f"Data size: {df.shape[0]:,} rows x {df.shape[1]} columns")
# 3. Check data types
print("\nData type by column:")
print(df.dtypes)
# 4. First 10 rows
print("\nFirst 10 rows:")
print(df.head(10))
Execution Result:
Data size: 100,000 rows x 5 columns

Data type by column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object

First 10 rows:
   order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1
5         6     7823  2023-01-01 05:00:00  Complete            1
6         7     1456  2023-01-01 06:00:00   Pending            1
7         8     8901  2023-01-01 07:00:00  Complete            2
8         9     3214  2023-01-01 08:00:00   Shipped            1
9        10     6547  2023-01-01 09:00:00  Complete            1
5. Descriptive Statistics
Summary with describe()
# Statistics for numeric columns
print(df.describe())
Execution Result:
            order_id       user_id   num_of_item
count  100000.000000  100000.00000  100000.00000
mean    50000.500000    5000.50123       1.60000
std     28867.657797    2886.75124       0.84853
min         1.000000       1.00000       1.00000
25%     25000.750000    2500.00000       1.00000
50%     50000.500000    5000.00000       1.00000
75%     75000.250000    7500.00000       2.00000
max    100000.000000   10000.00000       4.00000
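By default, describe() covers only numeric columns. Passing the include argument also summarizes text or categorical columns; a minimal sketch:
# Summarize non-numeric columns as well
print(df.describe(include='object'))  # object (string) columns only
print(df.describe(include='all'))     # every column, numeric and non-numeric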
Key Statistical Functions
# Simple DataFrame for statistics practice
data = {
'name': ['John', 'Emily', 'Mike', 'Sarah'],
'age': [25, 30, 35, 28],
'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df_sample = pd.DataFrame(data)
# Individual statistics
print("Mean:", df_sample['age'].mean())
print("Sum:", df_sample['age'].sum())
print("Min:", df_sample['age'].min())
print("Max:", df_sample['age'].max())
print("Median:", df_sample['age'].median())
print("Std:", df_sample['age'].std())
print("Unique count:", df_sample['city'].nunique())
print("\nValue counts:")
print(df_sample['city'].value_counts())
Execution Result:
Mean: 29.5
Sum: 118
Min: 25
Max: 35
Median: 29.0
Std: 4.203173463227086
Unique count: 3

Value counts:
city
New York    2
Boston      1
Chicago     1
Name: count, dtype: int64
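Several of these statistics can also be computed in one call with agg(); a minimal sketch on the same sample DataFrame:
# Multiple statistics in a single call
print(df_sample['age'].agg(['mean', 'min', 'max', 'std']))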
6. Data Type Conversion
String to Number
# Test DataFrame
df_convert = pd.DataFrame({
'quantity': ['10', '20', '30', '40'],
'price': ['100.5', '200.3', '300.0', 'N/A']
})
print("Before conversion:")
print(df_convert.dtypes)
# String to integer
df_convert['quantity'] = df_convert['quantity'].astype(int)
# Convert ignoring errors (unconvertible -> NaN)
df_convert['price'] = pd.to_numeric(df_convert['price'], errors='coerce')
print("\nAfter conversion:")
print(df_convert.dtypes)
print(df_convert)
Execution Result:
Before conversion:
quantity    object
price       object
dtype: object

After conversion:
quantity      int64
price       float64
dtype: object

   quantity  price
0        10  100.5
1        20  200.3
2        30  300.0
3        40    NaN
String to Date
# Load order data
df = pd.read_csv('/data/src_orders.csv')
print("Before conversion type:", df['created_at'].dtype)
# String to datetime
df['created_at'] = pd.to_datetime(df['created_at'])
print("After conversion type:", df['created_at'].dtype)
print("\nDate column sample:")
print(df['created_at'].head())
Execution Result:
Before conversion type: object
After conversion type: datetime64[ns]

Date column sample:
0   2023-01-01 00:00:00
1   2023-01-01 01:00:00
2   2023-01-01 02:00:00
3   2023-01-01 03:00:00
4   2023-01-01 04:00:00
Name: created_at, dtype: datetime64[ns]
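When the date format is known in advance, or some values may be invalid, the format and errors parameters of to_datetime help. A minimal sketch with made-up strings:
# Explicit format plus errors='coerce' turns unparseable values into NaT
dates = pd.Series(['2023-01-01 00:00:00', '2023-01-02 12:30:00', 'not a date'])
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(parsed)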
Categorical Conversion (Memory Optimization)
df = pd.read_csv('/data/src_orders.csv')
# Memory before conversion
mem_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"Before conversion: {mem_before:.2f} MB")
# object -> category (memory savings)
df['status'] = df['status'].astype('category')
# Memory after conversion
mem_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"After conversion: {mem_after:.2f} MB")
print(f"Savings: {mem_before - mem_after:.2f} MB ({(1 - mem_after/mem_before)*100:.1f}%)")실행 결과
Before conversion: 12.76 MB After conversion: 6.39 MB Savings: 6.37 MB (49.9%)
Quiz 2: Data Type Conversion
Problem
From order data:
- Convert the created_at column to datetime type
- Convert the status column to category type
- Check the data types after conversion
View Answer
import pandas as pd
# Load data
df = pd.read_csv('/data/src_orders.csv')
# 1. datetime conversion
df['created_at'] = pd.to_datetime(df['created_at'])
# 2. category conversion
df['status'] = df['status'].astype('category')
# 3. Check data types
print(df.dtypes)
print("\nstatus category values:")
print(df['status'].cat.categories.tolist())
Execution Result:
order_id               int64
user_id                int64
created_at    datetime64[ns]
status              category
num_of_item            int64
dtype: object

status category values:
['Cancelled', 'Complete', 'Pending', 'Shipped']
7. Saving Data
Saving to CSV
# Basic save (exclude index)
df.to_csv('output.csv', index=False)
print("output.csv saved")
# utf-8-sig encoding keeps special characters readable (e.g., when opened in Excel)
df.head(100).to_csv('orders_saved.csv', index=False, encoding='utf-8-sig')
print("orders_saved.csv saved")실행 결과
output.csv saved orders_saved.csv saved
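For larger outputs, to_csv and read_csv both support compression. A minimal sketch (the file name is illustrative):
# Write and read a gzip-compressed CSV
df.head(1000).to_csv('orders_sample.csv.gz', index=False, compression='gzip')
df_gz = pd.read_csv('orders_sample.csv.gz')  # compression inferred from the .gz extension
print(f"Read back {len(df_gz)} rows from the compressed file")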
Saving to Excel
# Prepare example data
df1 = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [100, 200, 300]})
df2 = pd.DataFrame({'customer_id': [101, 102], 'name': ['John Doe', 'Jane Smith']})
try:
    # Save to multiple sheets
    with pd.ExcelWriter('output.xlsx') as writer:
        df1.to_excel(writer, sheet_name='Orders', index=False)
        df2.to_excel(writer, sheet_name='Customers', index=False)
    print("output.xlsx saved (2 sheets)")
except ImportError:
    print("openpyxl library required")
Execution Result:
output.xlsx saved (2 sheets)
Saving to JSON
# Basic save
df.head(5).to_json('output.json', orient='records')
# force_ascii=False keeps non-ASCII characters as-is
df.head(5).to_json('output_full.json', orient='records', force_ascii=False)
print("JSON saved")실행 결과
JSON saved
8. Memory Optimization
Processing Large Files
# Large file simulation (using src_orders.csv)
# Read in chunks
chunks = pd.read_csv('/data/src_orders.csv', chunksize=10000)
result = []
for i, chunk in enumerate(chunks):
    # Process each chunk: filter Complete status only
    processed = chunk[chunk['status'] == 'Complete']
    result.append(processed)
    if i < 3:  # Print only first 3 chunks
        print(f"Chunk {i+1}: {len(chunk)} rows -> {len(processed)} rows (Complete only)")
df_complete = pd.concat(result)
print(f"\nTotal Complete orders: {len(df_complete):,} rows")
Execution Result:
Chunk 1: 10000 rows -> 7012 rows (Complete only)
Chunk 2: 10000 rows -> 6998 rows (Complete only)
Chunk 3: 10000 rows -> 7023 rows (Complete only)

Total Complete orders: 70,000 rows
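If only an aggregate is needed, each chunk can be reduced immediately so the filtered rows never accumulate in memory. A minimal sketch counting orders per status:
# Aggregate per chunk instead of keeping the rows
from collections import Counter
status_counts = Counter()
for chunk in pd.read_csv('/data/src_orders.csv', chunksize=10000):
    status_counts.update(chunk['status'].value_counts().to_dict())
print(dict(status_counts))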
Data Type Optimization
def optimize_dtypes(df):
    """Convert data types for memory optimization"""
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            # Convert to category if few unique values
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')
        elif col_type == 'int64':
            # Downcast non-negative integers to the smallest unsigned type
            if df[col].min() >= 0:
                if df[col].max() <= 255:
                    df[col] = df[col].astype('uint8')
                elif df[col].max() <= 65535:
                    df[col] = df[col].astype('uint16')
                elif df[col].max() <= 4294967295:
                    df[col] = df[col].astype('uint32')
        elif col_type == 'float64':
            # Convert to float32
            df[col] = df[col].astype('float32')
    return df
# Apply
df = pd.read_csv('/data/src_orders.csv')
mem_before = df.memory_usage(deep=True).sum() / 1024**2
df_optimized = optimize_dtypes(df.copy())
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024**2
print(f"Before optimization: {mem_before:.2f} MB")
print(f"After optimization: {mem_after:.2f} MB")
print(f"Savings: {(1 - mem_after/mem_before)*100:.1f}%")
print("\nOptimized data types:")
print(df_optimized.dtypes)
Execution Result:
Before optimization: 12.76 MB
After optimization: 2.19 MB
Savings: 82.8%

Optimized data types:
order_id        uint32
user_id         uint16
created_at    category
status        category
num_of_item      uint8
dtype: object
Summary
Key Functions Summary
| Task | Function | Example |
|---|---|---|
| Read CSV | pd.read_csv() | pd.read_csv('file.csv') |
| Read Excel | pd.read_excel() | pd.read_excel('file.xlsx') |
| First N rows | df.head(n) | df.head(10) |
| Last N rows | df.tail(n) | df.tail(5) |
| Check size | df.shape | (100, 5) |
| Check types | df.dtypes | Type for each column |
| Info summary | df.info() | All info |
| Stats summary | df.describe() | Descriptive statistics |
| Save to CSV | df.to_csv() | df.to_csv('out.csv') |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| SELECT * FROM table LIMIT 10 | df.head(10) |
| SELECT COUNT(*) FROM table | len(df) |
| SELECT col FROM table | df['col'] |
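The same mapping extends to filtering and simple counting; a minimal sketch using the orders data (the SQL in the comments is illustrative):
# SQL: SELECT * FROM orders WHERE status = 'Complete'
complete = df[df['status'] == 'Complete']
print(len(complete))
# SQL: SELECT status, COUNT(*) FROM orders GROUP BY status
print(df['status'].value_counts())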
Next Steps
You’ve mastered data loading and exploration! Next, learn how to handle missing values and outliers in Data Cleaning.