
Data Loading and Exploration

Beginner

Learning Objectives

After completing this recipe, you will be able to:

  • Load data from various formats, including CSV, Excel, and JSON
  • Understand DataFrame structure
  • Explore data with head(), tail(), info(), describe()
  • Check and convert data types
  • Apply memory optimization techniques

0. Setup

Load the Cookbook sample data for hands-on practice.

import pandas as pd

# Load data
DATA_PATH = '/data/'  # Cookbook sample data path
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')

print("Data loaded!")
print(f" - orders: {len(orders):,} rows")
print(f" - order_items: {len(order_items):,} rows")
print(f" - products: {len(products):,} rows")
print(f" - users: {len(users):,} rows")

Execution Result
Data loaded!
 - orders: 100,000 rows
 - order_items: 150,000 rows
 - products: 1,000 rows
 - users: 10,000 rows

1. What is a DataFrame?

Theory

DataFrame = A table-like data structure similar to an Excel sheet.

  • Row: a horizontal record (one data item)
  • Column: a vertical field (one attribute)
  • Index: labels identifying rows (default: 0, 1, 2, …)

Practice: Creating the Simplest DataFrame

import pandas as pd
import numpy as np

print("Pandas version:", pd.__version__)

# Creating a DataFrame from a dictionary (easiest way!)
data = {
    'name': ['John', 'Emily', 'Mike', 'Sarah'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df = pd.DataFrame(data)
print("DataFrame created!")
print(df)

Execution Result
Pandas version: 2.2.3
DataFrame created!
 name  age      city
0  John   25  New York
1 Emily   30    Boston
2  Mike   35   Chicago
3 Sarah   28  New York

ℹ️ Explanation
  • 'name', 'age', 'city' -> Column names (column headers)
  • ['John', 'Emily', ...] -> Values (data)
  • 0, 1, 2, 3 on the left -> Index (row numbers, auto-generated)
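Both labels are plain attributes you can read or replace. A minimal sketch, continuing with the df created above (note that set_index/reset_index return new DataFrames rather than modifying in place):

df = df.set_index('name')     # use the 'name' column as the row index
print(df.index.tolist())      # ['John', 'Emily', 'Mike', 'Sarah']
print(df.columns.tolist())    # ['age', 'city']
df = df.reset_index()         # back to the default 0, 1, 2, ... index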

2. Reading CSV Files

Theory

CSV (Comma-Separated Values) is the most common data format.

Syntax

df = pd.read_csv('filepath.csv')

Key Options

| Option | Description | Example |
| --- | --- | --- |
| encoding | File encoding | encoding='utf-8' |
| sep | Delimiter | sep='\t' (tab) |
| header | Header row number | header=0 (first row) |
| index_col | Column to use as index | index_col='id' |
| usecols | Read specific columns only | usecols=['col1', 'col2'] |
| nrows | Number of rows to read | nrows=1000 |
| dtype | Specify data types | dtype={'col': str} |
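These options combine freely. A hedged sketch below reads a hypothetical tab-separated file (data.tsv and its 'id' and 'zipcode' columns are assumed for illustration; they are not part of the Cookbook data):

df = pd.read_csv(
    'data.tsv',               # hypothetical tab-separated file
    sep='\t',                 # tab delimiter instead of commas
    encoding='utf-8',
    index_col='id',           # use the 'id' column as the index
    dtype={'zipcode': str},   # keep zip codes as strings to preserve leading zeros
    nrows=500,                # read only the first 500 rows
)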

Practice: Reading CSV Files

DATA_PATH = '/data/'  # Cookbook sample data path

# Basic read
df = pd.read_csv(DATA_PATH + 'src_orders.csv')
print(f"Total data: {len(df):,} rows")

# Read specific columns only
df_cols = pd.read_csv(DATA_PATH + 'src_orders.csv',
                      usecols=['order_id', 'user_id', 'created_at'])
print(f"Specific columns only: {df_cols.columns.tolist()}")

# Read first 1000 rows only
df_sample = pd.read_csv(DATA_PATH + 'src_orders.csv', nrows=1000)
print("\nFirst 1000 rows sample:")
print(df_sample.head())

Execution Result
Total data: 100,000 rows
Specific columns only: ['order_id', 'user_id', 'created_at']

First 1000 rows sample:
 order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1

3. Reading Excel/JSON Files

Excel Files

# Practice with products data as Excel
products = pd.read_csv('/data/src_products.csv')

# Save and read file (requires openpyxl)
try:
    products.head(10).to_excel('products_sample.xlsx', index=False)

    # Basic read
    df_excel = pd.read_excel('products_sample.xlsx')
    print("Excel file read successful:")
    print(df_excel.head(3))
except ImportError:
    print("openpyxl library required: pip install openpyxl")

Execution Result
Excel file read successful:
 product_id  name  category  price
0           1  Product1    CategoryA   1000
1           2  Product2    CategoryB   2000
2           3  Product3    CategoryC   3000
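By default read_excel loads the first sheet; the sheet_name parameter selects another by name or 0-based position. A small sketch, assuming the products_sample.xlsx file written above:

# sheet_name accepts a sheet name or a 0-based position
df_sheet = pd.read_excel('products_sample.xlsx', sheet_name=0)  # same as the default here
print(df_sheet.head(3))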

JSON Files

# Save DataFrame to JSON then read
products = pd.read_csv('/data/src_products.csv')
products.head(5).to_json('products_sample.json', orient='records', force_ascii=False)

# Read JSON file
df_json = pd.read_json('products_sample.json')
print("JSON file read:")
print(df_json)

# Nested JSON handling example
json_data = {
    'company': 'ABC Corp',
    'items': [
        {'name': 'Product A', 'price': 100},
        {'name': 'Product B', 'price': 200}
    ]
}
df_nested = pd.json_normalize(json_data, record_path='items')
print("\nNested JSON handling:")
print(df_nested)

Execution Result
JSON file read:
 product_id  name  category  price
0           1  Product1    CategoryA   1000
1           2  Product2    CategoryB   2000
2           3  Product3    CategoryC   3000
3           4  Product4    CategoryA   4000
4           5  Product5    CategoryB   5000

Nested JSON handling:
      name  price
0  Product A    100
1  Product B    200
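record_path flattens the list, but it drops the top-level fields. Passing meta= carries them into every row; a minimal sketch reusing the json_data from above:

df_nested = pd.json_normalize(json_data, record_path='items', meta=['company'])
print(df_nested)
#         name  price   company
# 0  Product A    100  ABC Corp
# 1  Product B    200  ABC Corp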

Reading Directly from BigQuery

# BigQuery integration example (for actual environments)
# from google.cloud import bigquery
#
# client = bigquery.Client(project='your-project')
#
# query = """
# SELECT *
# FROM `project.dataset.src_orders`
# LIMIT 1000
# """
#
# df = client.query(query).to_dataframe()
# print(f"Loaded {len(df)} rows from BigQuery")

# For practice, use local CSV
df = pd.read_csv('/data/src_orders.csv', nrows=1000)
print(f"Data loaded: {len(df)} rows")

Execution Result
Data loaded: 1000 rows

4. Basic Data Exploration

Viewing First/Last Few Rows

df = pd.read_csv('/data/src_orders.csv')

# First 5 rows (default)
print("First 5 rows:")
print(df.head())

print("\nFirst 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

Execution Result
First 5 rows:
 order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1

First 3 rows:
 order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1

Last 2 rows:
     order_id  user_id           created_at    status  num_of_item
99998     99999     3847  2024-12-30 22:00:00  Complete            1
99999    100000     6721  2024-12-30 23:00:00   Pending            2

Checking Data Size

print("DataFrame size (rows, columns):", df.shape) print("Total row count:", len(df)) print("Column names:", df.columns.tolist())
실행 결과
DataFrame size (rows, columns): (100000, 5)
Total row count: 100000
Column names: ['order_id', 'user_id', 'created_at', 'status', 'num_of_item']

Checking Data Types

print("Data type for each column:") print(df.dtypes)
실행 결과
Data type for each column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object

Key Data Types:

| Pandas Type | Description | Example |
| --- | --- | --- |
| object | String (text) | Name, address |
| int64 | Integer | Age, quantity |
| float64 | Decimal | Price, ratio |
| datetime64 | Date/time | Order date, created date |
| bool | Boolean | True/False |
| category | Categorical | Gender, grade |

Overview of All Information

# Summary of all info
df.info()

Output Example:

<class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 100000 non-null int64 1 user_id 100000 non-null int64 2 created_at 100000 non-null object 3 status 100000 non-null object 4 num_of_item 100000 non-null int64 dtypes: int64(3), object(2) memory usage: 3.8+ MB

Quiz 1: Data Loading and Exploration

Problem

Load order data and:

  1. Read CSV file
  2. Check data size (rows, columns)
  3. Check data type for each column
  4. Output first 10 rows

View Answer

import pandas as pd

# 1. Read CSV file
df = pd.read_csv('/data/src_orders.csv')

# 2. Check data size
print(f"Data size: {df.shape[0]:,} rows x {df.shape[1]} columns")

# 3. Check data types
print("\nData type by column:")
print(df.dtypes)

# 4. First 10 rows
print("\nFirst 10 rows:")
print(df.head(10))

Execution Result
Data size: 100,000 rows x 5 columns

Data type by column:
order_id        int64
user_id         int64
created_at     object
status         object
num_of_item     int64
dtype: object

First 10 rows:
 order_id  user_id           created_at    status  num_of_item
0         1     8372  2023-01-01 00:00:00  Complete            1
1         2     5765  2023-01-01 01:00:00   Shipped            2
2         3     2341  2023-01-01 02:00:00  Complete            1
3         4     9156  2023-01-01 03:00:00   Pending            1
4         5     4489  2023-01-01 04:00:00  Complete            1
5         6     7823  2023-01-01 05:00:00  Complete            1
6         7     1456  2023-01-01 06:00:00   Pending            1
7         8     8901  2023-01-01 07:00:00  Complete            2
8         9     3214  2023-01-01 08:00:00   Shipped            1
9        10     6547  2023-01-01 09:00:00  Complete            1

5. Descriptive Statistics

Summary with describe()

# Statistics for numeric columns
print(df.describe())

Execution Result
           order_id       user_id   num_of_item
count  100000.000000  100000.00000  100000.00000
mean    50000.500000    5000.50123       1.60000
std     28867.657797    2886.75124       0.84853
min         1.000000       1.00000       1.00000
25%     25000.750000    2500.00000       1.00000
50%     50000.500000    5000.00000       1.00000
75%     75000.250000    7500.00000       2.00000
max    100000.000000   10000.00000       4.00000
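By default, describe() covers numeric columns only. The include parameter widens the scope; a short sketch on the same orders data:

# Text columns: count, unique, top (most frequent value), freq
print(df.describe(include='object'))

# Numeric and non-numeric together; stats that do not apply show as NaN
print(df.describe(include='all'))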


Key Statistical Functions

# Simple DataFrame for statistics practice
data = {
    'name': ['John', 'Emily', 'Mike', 'Sarah'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'Boston', 'Chicago', 'New York']
}
df_sample = pd.DataFrame(data)

# Individual statistics
print("Mean:", df_sample['age'].mean())
print("Sum:", df_sample['age'].sum())
print("Min:", df_sample['age'].min())
print("Max:", df_sample['age'].max())
print("Median:", df_sample['age'].median())
print("Std:", df_sample['age'].std())
print("Unique count:", df_sample['city'].nunique())
print("\nValue counts:")
print(df_sample['city'].value_counts())

Execution Result
Mean: 29.5
Sum: 118
Min: 25
Max: 35
Median: 29.0
Std: 4.203173463227086
Unique count: 3

Value counts:
city
New York    2
Boston      1
Chicago     1
Name: count, dtype: int64
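value_counts can also report shares instead of raw counts via normalize=True, which is handy for spotting imbalance at a glance:

print(df_sample['city'].value_counts(normalize=True))
# New York    0.50
# Boston      0.25
# Chicago     0.25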

6. Data Type Conversion

String to Number

# Test DataFrame
df_convert = pd.DataFrame({
    'quantity': ['10', '20', '30', '40'],
    'price': ['100.5', '200.3', '300.0', 'N/A']
})
print("Before conversion:")
print(df_convert.dtypes)

# String to integer
df_convert['quantity'] = df_convert['quantity'].astype(int)

# errors='coerce': unconvertible values become NaN instead of raising
df_convert['price'] = pd.to_numeric(df_convert['price'], errors='coerce')

print("\nAfter conversion:")
print(df_convert.dtypes)
print(df_convert)

Execution Result
Before conversion:
quantity    object
price       object
dtype: object

After conversion:
quantity      int64
price       float64
dtype: object
 quantity  price
0        10  100.5
1        20  200.3
2        30  300.0
3        40    NaN

String to Date

# Load order data
df = pd.read_csv('/data/src_orders.csv')
print("Before conversion type:", df['created_at'].dtype)

# String to datetime
df['created_at'] = pd.to_datetime(df['created_at'])
print("After conversion type:", df['created_at'].dtype)

print("\nDate column sample:")
print(df['created_at'].head())

Execution Result
Before conversion type: object
After conversion type: datetime64[ns]

Date column sample:
0   2023-01-01 00:00:00
1   2023-01-01 01:00:00
2   2023-01-01 02:00:00
3   2023-01-01 03:00:00
4   2023-01-01 04:00:00
Name: created_at, dtype: datetime64[ns]
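When the timestamp format is known, passing it explicitly makes parsing faster and stricter, and errors='coerce' turns unparseable strings into NaT instead of raising; a hedged sketch on a fresh copy of the raw column:

raw = pd.read_csv('/data/src_orders.csv')
raw['created_at'] = pd.to_datetime(
    raw['created_at'],
    format='%Y-%m-%d %H:%M:%S',  # matches '2023-01-01 00:00:00'
    errors='coerce',             # bad values become NaT rather than an exception
)
print(raw['created_at'].isna().sum(), "rows failed to parse")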

Categorical Conversion (Memory Optimization)

df = pd.read_csv('/data/src_orders.csv')

# Memory before conversion
mem_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"Before conversion: {mem_before:.2f} MB")

# object -> category (memory savings)
df['status'] = df['status'].astype('category')

# Memory after conversion
mem_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"After conversion: {mem_after:.2f} MB")
print(f"Savings: {mem_before - mem_after:.2f} MB ({(1 - mem_after/mem_before)*100:.1f}%)")

Execution Result
Before conversion: 12.76 MB
After conversion: 6.39 MB
Savings: 6.37 MB (49.9%)
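The saving comes from storing each distinct string once and replacing the column with small integer codes, so category only pays off when there are few distinct values relative to the row count; a quick check:

# 4 distinct statuses across 100,000 rows -> an ideal category candidate
print(df['status'].nunique(), "distinct values in", len(df), "rows")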

Quiz 2: Data Type Conversion

Problem

From order data:

  1. Convert created_at column to datetime type
  2. Convert status column to category type
  3. Check data types after conversion

View Answer

import pandas as pd

# Load data
df = pd.read_csv('/data/src_orders.csv')

# 1. datetime conversion
df['created_at'] = pd.to_datetime(df['created_at'])

# 2. category conversion
df['status'] = df['status'].astype('category')

# 3. Check data types
print(df.dtypes)
print("\nstatus category values:")
print(df['status'].cat.categories.tolist())

Execution Result
order_id                int64
user_id                 int64
created_at     datetime64[ns]
status             category
num_of_item             int64
dtype: object

status category values:
['Cancelled', 'Complete', 'Pending', 'Shipped']

7. Saving Data

Saving to CSV

# Basic save (exclude index)
df.to_csv('output.csv', index=False)
print("output.csv saved")

# utf-8-sig adds a BOM so Excel displays non-ASCII text correctly
df.head(100).to_csv('orders_saved.csv', index=False, encoding='utf-8-sig')
print("orders_saved.csv saved")

Execution Result
output.csv saved
orders_saved.csv saved

Saving to Excel

# Prepare example data
df1 = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [100, 200, 300]})
df2 = pd.DataFrame({'customer_id': [101, 102], 'name': ['John Doe', 'Jane Smith']})

try:
    # Save to multiple sheets
    with pd.ExcelWriter('output.xlsx') as writer:
        df1.to_excel(writer, sheet_name='Orders', index=False)
        df2.to_excel(writer, sheet_name='Customers', index=False)
    print("output.xlsx saved (2 sheets)")
except ImportError:
    print("openpyxl library required")

Execution Result
output.xlsx saved (2 sheets)
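sheet_name=None reads every sheet back in a single call, returning a dict of DataFrames keyed by sheet name; a small sketch assuming the output.xlsx written above:

sheets = pd.read_excel('output.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)   # Orders (3, 2) / Customers (2, 2)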

Saving to JSON

# Basic save
df.head(5).to_json('output.json', orient='records')

# force_ascii=False keeps non-ASCII characters readable in the file
df.head(5).to_json('output_full.json', orient='records', force_ascii=False)
print("JSON saved")

Execution Result
JSON saved

8. Memory Optimization

Processing Large Files

# Large file simulation (using src_orders.csv)
# Read in chunks
chunks = pd.read_csv('/data/src_orders.csv', chunksize=10000)

result = []
for i, chunk in enumerate(chunks):
    # Process each chunk: filter Complete status only
    processed = chunk[chunk['status'] == 'Complete']
    result.append(processed)
    if i < 3:  # Print only first 3 chunks
        print(f"Chunk {i+1}: {len(chunk)} rows -> {len(processed)} rows (Complete only)")

df_complete = pd.concat(result)
print(f"\nTotal Complete orders: {len(df_complete):,} rows")

Execution Result
Chunk 1: 10000 rows -> 7012 rows (Complete only)
Chunk 2: 10000 rows -> 6998 rows (Complete only)
Chunk 3: 10000 rows -> 7023 rows (Complete only)

Total Complete orders: 70,000 rows
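The explicit list is easy to read, but the same filter can be streamed straight into pd.concat with a generator expression, so the full unfiltered file is never in memory at once; an equivalent sketch:

df_complete = pd.concat(
    chunk[chunk['status'] == 'Complete']
    for chunk in pd.read_csv('/data/src_orders.csv', chunksize=10000)
)
print(f"Total Complete orders: {len(df_complete):,} rows")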

Data Type Optimization

def optimize_dtypes(df):
    """Convert data types for memory optimization"""
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            # Convert to category if few unique values
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')
        elif col_type == 'int64':
            # Convert to smaller integer type
            if df[col].min() >= 0:
                if df[col].max() < 255:
                    df[col] = df[col].astype('uint8')
                elif df[col].max() < 65535:
                    df[col] = df[col].astype('uint16')
                elif df[col].max() < 4294967295:
                    df[col] = df[col].astype('uint32')
        elif col_type == 'float64':
            # Convert to float32
            df[col] = df[col].astype('float32')
    return df

# Apply
df = pd.read_csv('/data/src_orders.csv')
mem_before = df.memory_usage(deep=True).sum() / 1024**2

df_optimized = optimize_dtypes(df.copy())
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024**2

print(f"Before optimization: {mem_before:.2f} MB")
print(f"After optimization: {mem_after:.2f} MB")
print(f"Savings: {(1 - mem_after/mem_before)*100:.1f}%")
print("\nOptimized data types:")
print(df_optimized.dtypes)

Execution Result
Before optimization: 12.76 MB
After optimization: 2.19 MB
Savings: 82.8%

Optimized data types:
order_id       uint32
user_id        uint16
created_at   category
status       category
num_of_item     uint8
dtype: object
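Once the right types are known, they can be applied at read time so the int64/object versions never materialize; a hedged sketch using the types found above (parse_dates trades a little memory for proper datetime semantics on created_at):

df_lean = pd.read_csv(
    '/data/src_orders.csv',
    dtype={'order_id': 'uint32', 'user_id': 'uint16',
           'status': 'category', 'num_of_item': 'uint8'},
    parse_dates=['created_at'],   # real datetimes instead of category/object
)
print(df_lean.dtypes)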

Summary

Key Functions Summary

| Task | Function | Example |
| --- | --- | --- |
| Read CSV | pd.read_csv() | pd.read_csv('file.csv') |
| Read Excel | pd.read_excel() | pd.read_excel('file.xlsx') |
| First N rows | df.head(n) | df.head(10) |
| Last N rows | df.tail(n) | df.tail(5) |
| Check size | df.shape | (100, 5) |
| Check types | df.dtypes | Type for each column |
| Info summary | df.info() | All info |
| Stats summary | df.describe() | Descriptive statistics |
| Save to CSV | df.to_csv() | df.to_csv('out.csv') |

SQL to Pandas Comparison

| SQL | Pandas |
| --- | --- |
| SELECT * FROM table LIMIT 10 | df.head(10) |
| SELECT COUNT(*) FROM table | len(df) |
| SELECT col FROM table | df['col'] |

Next Steps

You’ve mastered data loading and exploration! Next, learn how to handle missing values and outliers in Data Cleaning.
