Pandas Interview Questions
Pandas problems frequently asked in data analyst coding tests. Each problem is structured as: Problem โ Answer โ Explanation โ Interview Points.
๐ข Sample Questions (3/25)
Question 1. Basic Filtering and Sorting
[Problem] Extract data that satisfies the following conditions:
- Category is โJeansโ or โDressesโ
- Sale price is $50 or more
- Sort by sale price in descending order
import pandas as pd
DATA_PATH = '/data/'
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
df = order_items.merge(products, on='product_id')
# Write your code hereโ
Answer Code
# Method 1: Boolean indexing
result = df[
(df['category'].isin(['Jeans', 'Dresses'])) &
(df['sale_price'] >= 50)
].sort_values('sale_price', ascending=False)
# Method 2: Using query()
result = df.query(
"category in ['Jeans', 'Dresses'] and sale_price >= 50"
).sort_values('sale_price', ascending=False)
print(f"Result: {len(result):,} rows")
print(result[['category', 'sale_price', 'name']].head())๐ Detailed Explanation
query() vs Boolean Indexing:
| Method | Pros | Cons |
|---|---|---|
| Boolean indexing | Most basic, always works | Readability decreases with many conditions |
| query() | SQL-style, better readability | String parsing overhead |
| loc[] | Label-based, supports slicing | Limited condition expression |
isin() Tips:
# โ Repeating OR operators
df[(df['cat'] == 'A') | (df['cat'] == 'B') | (df['cat'] == 'C')]
# โ
Using isin()
df[df['cat'].isin(['A', 'B', 'C'])]Interviewer Point:
โWhatโs the performance difference between query() and boolean indexing?โ โ Mostly similar, choose based on readability
Question 2. GroupBy Basics
[Problem] Calculate total revenue, average price, and product count by brand, then output the top 10 by total revenue.
โ
Answer Code
result = df.groupby('brand').agg(
total_revenue=('sale_price', 'sum'),
avg_price=('sale_price', 'mean'),
product_count=('product_id', 'nunique')
).round(2).sort_values('total_revenue', ascending=False).head(10)
print(result)๐ Detailed Explanation
3 agg() Syntax Options:
# 1. Dictionary method (column: function)
df.groupby('brand')['sale_price'].agg({'total': 'sum', 'average': 'mean'})
# 2. Named Aggregation (recommended, pandas 0.25+)
df.groupby('brand').agg(
total=('sale_price', 'sum'),
average=('sale_price', 'mean')
)
# 3. List method (multiple functions)
df.groupby('brand')['sale_price'].agg(['sum', 'mean', 'count'])nunique vs count:
nunique(): Count of unique values (excluding duplicates)count(): Row count excluding NULL
Question 3. Missing Value Handling
[Problem] Check missing value status, fill numeric columns with median and categorical columns with mode.
โ
Answer Code
# Missing value status
print("=== Missing Value Status ===")
print(df.isnull().sum()[df.isnull().sum() > 0])
# Numeric columns: median
numeric_cols = df.select_dtypes(include=['number']).columns
for col in numeric_cols:
if df[col].isnull().sum() > 0:
median_val = df[col].median()
df[col] = df[col].fillna(median_val)
print(f"{col}: filled with median({median_val:.2f})")
# Categorical columns: mode
object_cols = df.select_dtypes(include=['object']).columns
for col in object_cols:
if df[col].isnull().sum() > 0:
mode_val = df[col].mode()[0]
df[col] = df[col].fillna(mode_val)
print(f"{col}: filled with mode({mode_val})")๐ Detailed Explanation
Missing Value Handling Strategies:
| Situation | Method | Code |
|---|---|---|
| Normal distribution | Mean | fillna(df['col'].mean()) |
| Skewed distribution | Median | fillna(df['col'].median()) |
| Categorical | Mode | fillna(df['col'].mode()[0]) |
| Time series | Forward/backward fill | fillna(method='ffill') |
| Delete | dropna | dropna(subset=['col']) |
Note: inplace vs assignment
# โ Modifying original (not recommended)
df.fillna(0, inplace=True)
# โ
Assigning to new object (recommended)
df = df.fillna(0)Interviewer Point:
โWhy did you choose median over mean?โ โ To minimize the impact of outliers
๐ Premium Questions (22 Questions)
All 25 Questions Breakdown
| Difficulty | Questions | Main Topics |
|---|---|---|
| ๐ข Beginner | 8 questions | Filtering, GroupBy, Missing Values, Merge, String, DateTime |
| ๐ก Intermediate | 9 questions | Pivot, Apply/Transform, Rolling, Conditional Columns, Melt |
| ๐ด Advanced | 8 questions | RFM Analysis, Cohort Retention, Memory Optimization |
What Youโll Learn in Premium
- โ apply vs transform vs agg: When to use which
- โ Rolling Window: 7-day moving average, cumulative calculations
- โ RFM Analysis: Real-world customer segmentation code
- โ Cohort Retention: Analysis used in actual companies
- โ Memory Optimization: Large-scale data processing techniques
- โ Answer Points Interviewers Expect
๐ฏ Purchase All 25 Questions + Explanations
SQL + Pandas + Statistics + Case Study bundle discount
๐ Practice More for Free
If you need more interview preparation, review the concept sections in the Cookbook: