Data Loading and Exploration
Beginner
Learning Objectives
After completing this recipe, you will be able to:
- Load data from various formats including CSV, Excel, and JSON
- Understand DataFrame structure
- Explore data with head(), tail(), info(), and describe()
- Check and convert data types
- Apply memory optimization techniques
0. Setup
Load Cookbook sample data for hands-on practice.
import pandas as pd
# Load data
DATA_PATH = '/data/' # Cookbook sample data path
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
print(f"Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")실행 결과
Data loaded! - orders: 100,000 rows - order_items: 150,000 rows - products: 1,000 rows - users: 10,000 rows
1. What is a DataFrame?
Theory
DataFrame = A table-like data structure similar to an Excel sheet.
- Row: a horizontal slice of the table (one record)
- Column: a vertical slice (one attribute)
- Index: labels identifying rows (default: 0, 1, 2, …)
Practice: Creating the Simplest DataFrame
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
# Creating a DataFrame from a dictionary (easiest way!)
data = {
'name': ['John', 'Emily', 'Mike', 'Sarah'],
'age': [25, 30, 35, 28],
'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df = pd.DataFrame(data)
print("DataFrame created!")
print(df)
Execution Result:
Pandas version: 2.2.3
DataFrame created!
    name  age      city
0   John   25  New York
1  Emily   30    Boston
2   Mike   35   Chicago
3  Sarah   28  New York
Explanation
- 'name', 'age', 'city' -> column names (column headers)
- ['John', 'Emily', ...] -> values (the data)
- 0, 1, 2, 3 on the left -> index (row numbers, auto-generated)
2. Reading CSV Files
Theory
CSV (Comma-Separated Values) is the most common data format.
Syntax
df = pd.read_csv('filepath.csv')
Key Options
| Option | Description | Example |
|---|---|---|
| encoding | File encoding | encoding='utf-8' |
| sep | Delimiter | sep='\t' (tab) |
| header | Header row number | header=0 (first row) |
| index_col | Column to use as index | index_col='id' |
| usecols | Read specific columns only | usecols=['col1', 'col2'] |
| nrows | Number of rows to read | nrows=1000 |
| dtype | Specify data types | dtype={'col': str} |
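Several of these options can be combined in a single call. A minimal sketch against the same sample file (column names taken from the orders data above; the exact combination is illustrative):
# Combine several read_csv options in one call
df_opts = pd.read_csv(
    '/data/src_orders.csv',
    usecols=['order_id', 'user_id', 'status'],  # read only these columns
    dtype={'status': 'category'},               # parse status directly as category
    index_col='order_id',                       # use order_id as the index
    nrows=500                                   # limit to the first 500 rows
)
print(df_opts.dtypes)
print(df_opts.head())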
Practice: Reading CSV Files
DATA_PATH = '/data/' # Cookbook sample data path
# Basic read
df = pd.read_csv(DATA_PATH + 'src_orders.csv')
print(f"Total data: {len(df):,} rows")
# Read specific columns only
df_cols = pd.read_csv(DATA_PATH + 'src_orders.csv', usecols=['order_id', 'user_id', 'created_at'])
print(f"Specific columns only: {df_cols.columns.tolist()}")
# Read first 1000 rows only
df_sample = pd.read_csv(DATA_PATH + 'src_orders.csv', nrows=1000)
print(f"\nFirst 1000 rows sample:")
print(df_sample.head())
Execution Result:
Total data: 100,000 rows
Specific columns only: ['order_id', 'user_id', 'created_at']

First 1000 rows sample:
   order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1
3. Reading Excel/JSON Files
Excel Files
# Practice with products data as Excel
products = pd.read_csv('/data/src_products.csv')
# Save and read file (requires openpyxl)
try:
    products.head(10).to_excel('products_sample.xlsx', index=False)
    # Basic read
    df_excel = pd.read_excel('products_sample.xlsx')
    print("Excel file read successful:")
    print(df_excel.head(3))
except ImportError:
    print("openpyxl library required: pip install openpyxl")
Execution Result:
Excel file read successful:
   product_id      name   category  price
0           1  Product1  CategoryA   1000
1           2  Product2  CategoryB   2000
2           3  Product3  CategoryC   3000
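If a workbook contains several sheets, read_excel can target one by name or position via the sheet_name parameter. A minimal sketch (the sheet name 'Sheet1' is an assumption about the default name of the file written above):
try:
    df_by_name = pd.read_excel('products_sample.xlsx', sheet_name='Sheet1')  # by sheet name (assumed default)
    df_by_pos = pd.read_excel('products_sample.xlsx', sheet_name=0)          # by sheet position
    print(df_by_name.shape, df_by_pos.shape)
except ImportError:
    print("openpyxl library required: pip install openpyxl")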
JSON Files
# Save DataFrame to JSON then read
products = pd.read_csv('/data/src_products.csv')
products.head(5).to_json('products_sample.json', orient='records', force_ascii=False)
# Read JSON file
df_json = pd.read_json('products_sample.json')
print("JSON file read:")
print(df_json)
# Nested JSON handling example
json_data = {
'company': 'ABC Corp',
'items': [
{'name': 'Product A', 'price': 100},
{'name': 'Product B', 'price': 200}
]
}
df_nested = pd.json_normalize(json_data, record_path='items')
print("\nNested JSON handling:")
print(df_nested)
Execution Result:
JSON file read:
product_id name category price
0 1 Product1 CategoryA 1000
1 2 Product2 CategoryB 2000
2 3 Product3 CategoryC 3000
3 4 Product4 CategoryA 4000
4 5 Product5 CategoryB 5000
Nested JSON handling:
name price
0 Product A 100
1  Product B    200
Reading Directly from BigQuery
# BigQuery integration example (for actual environments)
# from google.cloud import bigquery
#
# client = bigquery.Client(project='your-project')
#
# query = """
# SELECT *
# FROM `project.dataset.src_orders`
# LIMIT 1000
# """
#
# df = client.query(query).to_dataframe()
# print(f"Loaded {len(df)} rows from BigQuery")
# For practice, use local CSV
df = pd.read_csv('/data/src_orders.csv', nrows=1000)
print(f"Data loaded: {len(df)} rows")실행 결과
Data loaded: 1000 rows
4. Basic Data Exploration
Viewing First/Last Few Rows
df = pd.read_csv('/data/src_orders.csv')
# First 5 rows (default)
print("First 5 rows:")
print(df.head())
print("\nFirst 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
Execution Result:
First 5 rows:
order_id user_id created_at status num_of_item
0 1 8372 2023-01-01 00:00:00 Complete 1
1 2 5765 2023-01-01 01:00:00 Shipped 2
2 3 2341 2023-01-01 02:00:00 Complete 1
3 4 9156 2023-01-01 03:00:00 Pending 1
4 5 4489 2023-01-01 04:00:00 Complete 1
First 3 rows:
order_id user_id created_at status num_of_item
0 1 8372 2023-01-01 00:00:00 Complete 1
1 2 5765 2023-01-01 01:00:00 Shipped 2
2 3 2341 2023-01-01 02:00:00 Complete 1
Last 2 rows:
order_id user_id created_at status num_of_item
99998 99999 3847 2024-12-30 22:00:00 Complete 1
99999    100000     6721  2024-12-30 23:00:00   Pending            2
Checking Data Size
print("DataFrame size (rows, columns):", df.shape)
print("Total row count:", len(df))
print("Column names:", df.columns.tolist())실행 결과
DataFrame size (rows, columns): (100000, 5) Total row count: 100000 Column names: ['order_id', 'user_id', 'created_at', 'status', 'num_of_item']
Checking Data Types
print("Data type for each column:")
print(df.dtypes)
Execution Result:
Data type for each column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object
Key Data Types:
| Pandas Type | Description | Example |
|---|---|---|
| object | String (text) | Name, address |
| int64 | Integer | Age, quantity |
| float64 | Decimal | Price, ratio |
| datetime64 | Date/time | Order date, created date |
| bool | Boolean | True/False |
| category | Categorical | Gender, grade |
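A quick way to see which columns fall into each group is select_dtypes, which filters columns by type. A minimal sketch using the orders DataFrame loaded above:
# Group columns by dtype
numeric_cols = df.select_dtypes(include='number').columns.tolist()
text_cols = df.select_dtypes(include='object').columns.tolist()
print("Numeric columns:", numeric_cols)
print("Text columns:", text_cols)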
Overview of All Information
# Summary of all info
df.info()
Output Example:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 100000 non-null int64
1 user_id 100000 non-null int64
2 created_at 100000 non-null object
3 status 100000 non-null object
4 num_of_item 100000 non-null int64
dtypes: int64(3), object(2)
memory usage: 3.8+ MB
Quiz 1: Data Loading and Exploration
Problem
Load order data and:
- Read CSV file
- Check data size (rows, columns)
- Check data type for each column
- Output first 10 rows
View Answer
import pandas as pd
# 1. Read CSV file
df = pd.read_csv('/data/src_orders.csv')
# 2. Check data size
print(f"Data size: {df.shape[0]:,} rows x {df.shape[1]} columns")
# 3. Check data types
print("\nData type by column:")
print(df.dtypes)
# 4. First 10 rows
print("\nFirst 10 rows:")
print(df.head(10))
Execution Result:
Data size: 100,000 rows x 5 columns

Data type by column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object

First 10 rows:
   order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1
5         6     7823  2023-01-01 05:00:00  Complete            1
6         7     1456  2023-01-01 06:00:00   Pending            1
7         8     8901  2023-01-01 07:00:00  Complete            2
8         9     3214  2023-01-01 08:00:00   Shipped            1
9        10     6547  2023-01-01 09:00:00  Complete            1
5. Descriptive Statistics
Summary with describe()
# Statistics for numeric columns
print(df.describe())
Execution Result:
            order_id       user_id   num_of_item
count  100000.000000  100000.00000  100000.00000
mean    50000.500000    5000.50123       1.60000
std     28867.657797    2886.75124       0.84853
min         1.000000       1.00000       1.00000
25%     25000.750000    2500.00000       1.00000
50%     50000.500000    5000.00000       1.00000
75%     75000.250000    7500.00000       2.00000
max    100000.000000   10000.00000       4.00000
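By default, describe() covers only numeric columns. Passing the include argument also summarizes text or categorical columns; a minimal sketch:
# Summarize non-numeric columns as well
print(df.describe(include='object'))  # object (string) columns only
print(df.describe(include='all'))     # every column, numeric and non-numeric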
Key Statistical Functions
# Simple DataFrame for statistics practice
data = {
'name': ['John', 'Emily', 'Mike', 'Sarah'],
'age': [25, 30, 35, 28],
'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df_sample = pd.DataFrame(data)
# Individual statistics
print("Mean:", df_sample['age'].mean())
print("Sum:", df_sample['age'].sum())
print("Min:", df_sample['age'].min())
print("Max:", df_sample['age'].max())
print("Median:", df_sample['age'].median())
print("Std:", df_sample['age'].std())
print("Unique count:", df_sample['city'].nunique())
print("\nValue counts:")
print(df_sample['city'].value_counts())
Execution Result:
Mean: 29.5
Sum: 118
Min: 25
Max: 35
Median: 29.0
Std: 4.203173463227086
Unique count: 3

Value counts:
city
New York    2
Boston      1
Chicago     1
Name: count, dtype: int64
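Several of these statistics can also be computed in one call with agg(); a minimal sketch on the same sample DataFrame:
# Multiple statistics in a single call
print(df_sample['age'].agg(['mean', 'min', 'max', 'std']))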
6. Data Type Conversion
String to Number
# Test DataFrame
df_convert = pd.DataFrame({
'quantity': ['10', '20', '30', '40'],
'price': ['100.5', '200.3', '300.0', 'N/A']
})
print("Before conversion:")
print(df_convert.dtypes)
# String to integer
df_convert['quantity'] = df_convert['quantity'].astype(int)
# Convert ignoring errors (unconvertible -> NaN)
df_convert['price'] = pd.to_numeric(df_convert['price'], errors='coerce')
print("\nAfter conversion:")
print(df_convert.dtypes)
print(df_convert)
Execution Result:
Before conversion:
quantity    object
price       object
dtype: object

After conversion:
quantity      int64
price       float64
dtype: object

   quantity  price
0        10  100.5
1        20  200.3
2        30  300.0
3        40    NaN
String to Date
# Load order data
df = pd.read_csv('/data/src_orders.csv')
print("Before conversion type:", df['created_at'].dtype)
# String to datetime
df['created_at'] = pd.to_datetime(df['created_at'])
print("After conversion type:", df['created_at'].dtype)
print("\nDate column sample:")
print(df['created_at'].head())
Execution Result:
Before conversion type: object
After conversion type: datetime64[ns]

Date column sample:
0   2023-01-01 00:00:00
1   2023-01-01 01:00:00
2   2023-01-01 02:00:00
3   2023-01-01 03:00:00
4   2023-01-01 04:00:00
Name: created_at, dtype: datetime64[ns]
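When the date format is known in advance, or some values may be invalid, the format and errors parameters of to_datetime help. A minimal sketch with made-up strings:
# Explicit format plus errors='coerce' turns unparseable values into NaT
dates = pd.Series(['2023-01-01 00:00:00', '2023-01-02 12:30:00', 'not a date'])
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(parsed)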
Categorical Conversion (Memory Optimization)
df = pd.read_csv('/data/src_orders.csv')
# Memory before conversion
mem_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"Before conversion: {mem_before:.2f} MB")
# object -> category (memory savings)
df['status'] = df['status'].astype('category')
# Memory after conversion
mem_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"After conversion: {mem_after:.2f} MB")
print(f"Savings: {mem_before - mem_after:.2f} MB ({(1 - mem_after/mem_before)*100:.1f}%)")실행 결과
Before conversion: 12.76 MB After conversion: 6.39 MB Savings: 6.37 MB (49.9%)
Quiz 2: Data Type Conversion
Problem
From order data:
- Convert the created_at column to datetime type
- Convert the status column to category type
- Check the data types after conversion
View Answer
import pandas as pd
# Load data
df = pd.read_csv('/data/src_orders.csv')
# 1. datetime conversion
df['created_at'] = pd.to_datetime(df['created_at'])
# 2. category conversion
df['status'] = df['status'].astype('category')
# 3. Check data types
print(df.dtypes)
print("\nstatus category values:")
print(df['status'].cat.categories.tolist())
Execution Result:
order_id               int64
user_id                int64
created_at    datetime64[ns]
status              category
num_of_item            int64
dtype: object

status category values:
['Cancelled', 'Complete', 'Pending', 'Shipped']
7. Saving Data
Saving to CSV
# Basic save (exclude index)
df.to_csv('output.csv', index=False)
print("output.csv saved")
# utf-8-sig encoding keeps special characters readable (e.g., when opened in Excel)
df.head(100).to_csv('orders_saved.csv', index=False, encoding='utf-8-sig')
print("orders_saved.csv saved")실행 결과
output.csv saved orders_saved.csv saved
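For larger outputs, to_csv and read_csv both support compression. A minimal sketch (the file name is illustrative):
# Write and read a gzip-compressed CSV
df.head(1000).to_csv('orders_sample.csv.gz', index=False, compression='gzip')
df_gz = pd.read_csv('orders_sample.csv.gz')  # compression inferred from the .gz extension
print(f"Read back {len(df_gz)} rows from the compressed file")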
Saving to Excel
# Prepare example data
df1 = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [100, 200, 300]})
df2 = pd.DataFrame({'customer_id': [101, 102], 'name': ['John Doe', 'Jane Smith']})
try:
    # Save to multiple sheets
    with pd.ExcelWriter('output.xlsx') as writer:
        df1.to_excel(writer, sheet_name='Orders', index=False)
        df2.to_excel(writer, sheet_name='Customers', index=False)
    print("output.xlsx saved (2 sheets)")
except ImportError:
    print("openpyxl library required")
Execution Result:
output.xlsx saved (2 sheets)
Saving to JSON
# Basic save
df.head(5).to_json('output.json', orient='records')
# force_ascii=False keeps non-ASCII characters as-is
df.head(5).to_json('output_full.json', orient='records', force_ascii=False)
print("JSON saved")실행 결과
JSON saved
8. Memory Optimization
Processing Large Files
# Large file simulation (using src_orders.csv)
# Read in chunks
chunks = pd.read_csv('/data/src_orders.csv', chunksize=10000)
result = []
for i, chunk in enumerate(chunks):
    # Process each chunk: filter Complete status only
    processed = chunk[chunk['status'] == 'Complete']
    result.append(processed)
    if i < 3:  # Print only first 3 chunks
        print(f"Chunk {i+1}: {len(chunk)} rows -> {len(processed)} rows (Complete only)")
df_complete = pd.concat(result)
print(f"\nTotal Complete orders: {len(df_complete):,} rows")
Execution Result:
Chunk 1: 10000 rows -> 7012 rows (Complete only)
Chunk 2: 10000 rows -> 6998 rows (Complete only)
Chunk 3: 10000 rows -> 7023 rows (Complete only)

Total Complete orders: 70,000 rows
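If only an aggregate is needed, each chunk can be reduced immediately so the filtered rows never accumulate in memory. A minimal sketch counting orders per status:
# Aggregate per chunk instead of keeping the rows
from collections import Counter
status_counts = Counter()
for chunk in pd.read_csv('/data/src_orders.csv', chunksize=10000):
    status_counts.update(chunk['status'].value_counts().to_dict())
print(dict(status_counts))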
Data Type Optimization
def optimize_dtypes(df):
    """Convert data types for memory optimization"""
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            # Convert to category if few unique values
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')
        elif col_type == 'int64':
            # Downcast non-negative integers to the smallest unsigned type
            if df[col].min() >= 0:
                if df[col].max() <= 255:
                    df[col] = df[col].astype('uint8')
                elif df[col].max() <= 65535:
                    df[col] = df[col].astype('uint16')
                elif df[col].max() <= 4294967295:
                    df[col] = df[col].astype('uint32')
        elif col_type == 'float64':
            # Convert to float32
            df[col] = df[col].astype('float32')
    return df
# Apply
df = pd.read_csv('/data/src_orders.csv')
mem_before = df.memory_usage(deep=True).sum() / 1024**2
df_optimized = optimize_dtypes(df.copy())
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024**2
print(f"Before optimization: {mem_before:.2f} MB")
print(f"After optimization: {mem_after:.2f} MB")
print(f"Savings: {(1 - mem_after/mem_before)*100:.1f}%")
print("\nOptimized data types:")
print(df_optimized.dtypes)
Execution Result:
Before optimization: 12.76 MB
After optimization: 2.19 MB
Savings: 82.8%

Optimized data types:
order_id        uint32
user_id         uint16
created_at    category
status        category
num_of_item      uint8
dtype: object
Summary
Key Functions Summary
| Task | Function | Example |
|---|---|---|
| Read CSV | pd.read_csv() | pd.read_csv('file.csv') |
| Read Excel | pd.read_excel() | pd.read_excel('file.xlsx') |
| First N rows | df.head(n) | df.head(10) |
| Last N rows | df.tail(n) | df.tail(5) |
| Check size | df.shape | (100, 5) |
| Check types | df.dtypes | Type for each column |
| Info summary | df.info() | All info |
| Stats summary | df.describe() | Descriptive statistics |
| Save to CSV | df.to_csv() | df.to_csv('out.csv') |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
| SELECT * FROM table LIMIT 10 | df.head(10) |
| SELECT COUNT(*) FROM table | len(df) |
| SELECT col FROM table | df['col'] |
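The same mapping extends to filtering and simple counting; a minimal sketch using the orders data (the SQL in the comments is illustrative):
# SQL: SELECT * FROM orders WHERE status = 'Complete'
complete = df[df['status'] == 'Complete']
print(len(complete))
# SQL: SELECT status, COUNT(*) FROM orders GROUP BY status
print(df['status'].value_counts())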
Next Steps
You’ve mastered data loading and exploration! Next, learn how to handle missing values and outliers in Data Cleaning.