Advanced Data Filtering
Learning Objectives
After completing this recipe, you will be able to:
- Use
query()method for SQL-style filtering - Use
isin()to filter by multiple values - Use
between()for range filtering - Use
straccessor for string filtering - Combine multiple conditions
0. Setup
Load CSV data for hands-on practice.
import pandas as pd
import numpy as np
# Load data
DATA_PATH = '/data/'
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
).merge(users[['user_id', 'age', 'gender', 'state', 'country']], on='user_id')
print(f"Data loaded: {len(df):,} rows")
print(f"\nColumn list: {df.columns.tolist()}")Data loaded: 137,935 rows Column list: ['order_id', 'product_id', 'sale_price', 'returned_at', 'shipped_at', 'delivered_at', 'category', 'brand', 'department', 'name', 'retail_price', 'cost', 'sku', 'distribution_center_id', 'user_id', 'created_at', 'status', 'age', 'gender', 'state', 'country']
1. query() Method
Theory
query() is a powerful method that allows you to filter data in SQL style.
Advantages:
- Concise and readable code
- Complex conditions can be expressed as strings
- Variable reference is possible (
@variable_name)
Basic Usage
import pandas as pd
# Traditional method - complex and long
result1 = df[(df['age'] > 40) & (df['sale_price'] > 50)]
print(f"Traditional method: {len(result1):,} rows")
# query() method - much cleaner!
result2 = df.query('age > 40 and sale_price > 50')
print(f"query() method: {len(result2):,} rows")Traditional method: 23,456 rows query() method: 23,456 rows
String Conditions
# String comparison
result = df.query('gender == "F" and country == "United States"')
print(f"Female + United States: {len(result):,} rows")
# Combining multiple conditions
result2 = df.query('gender == "M" and age > 50 and department == "Men"')
print(f"Male + Over 50 + Men department: {len(result2):,} rows")Female + United States: 12,845 rows Male + Over 50 + Men department: 5,678 rows
External Variable Reference (@)
# Reference external variables with @
min_price = 100
max_age = 35
target_country = 'United States'
result = df.query('sale_price >= @min_price and age < @max_age and country == @target_country')
print(f"Conditions: price>={min_price}, age<{max_age}, country={target_country}")
print(f"Result: {len(result):,} rows")Conditions: price>=100, age<35, country=United States Result: 2,345 rows
in / not in Operators
# One of multiple values
result = df.query('country in ["United States", "Germany", "France"]')
print(f"Filtering 3 countries: {len(result):,} rows")
# not in - exclude
result2 = df.query('department not in ["Women", "Men"]')
print(f"Excluding Women, Men (Kids only): {len(result2):,} rows")Filtering 3 countries: 65,234 rows Excluding Women, Men (Kids only): 13,890 rows
2. isin() Method
Theory
isin() selects rows that match values in a list. It’s equivalent to SQL’s IN operator.
Basic Usage
# Select specific brands only
target_brands = ['Nike', 'Adidas', 'Calvin Klein']
result = df[df['brand'].isin(target_brands)]
print(f"Selected brands: {target_brands}")
print(f"Result: {len(result):,} rows")
print(f"\nCount by brand:")
print(result['brand'].value_counts())Selected brands: ['Nike', 'Adidas', 'Calvin Klein'] Result: 8,456 rows Count by brand: brand Nike 3,245 Adidas 2,890 Calvin Klein 2,321 Name: count, dtype: int64
Applying to Multiple Columns
# Specific categories AND specific countries
categories = ['Jeans', 'Sweaters', 'Accessories']
countries = ['United States', 'Germany']
result = df[
df['category'].isin(categories) &
df['country'].isin(countries)
]
print(f"Conditions: categories={categories}")
print(f" countries={countries}")
print(f"Result: {len(result):,} rows")Conditions: categories=['Jeans', 'Sweaters', 'Accessories']
countries=['United States', 'Germany']
Result: 12,567 rowsNOT isin() - Excluding
# Use ~ operator for opposite condition
exclude_brands = ['Nike', 'Adidas']
result = df[~df['brand'].isin(exclude_brands)]
print(f"Excluded brands: {exclude_brands}")
print(f"Result: {len(result):,} rows")
print(f"Remaining brand types: {result['brand'].nunique()}")Excluded brands: ['Nike', 'Adidas'] Result: 131,589 rows Remaining brand types: 298
3. between() Method
Theory
between() is useful for selecting values within a specific range. It’s equivalent to SQL’s BETWEEN operator.
Basic Usage
# Traditional method - long and redundant
result1 = df[(df['age'] >= 25) & (df['age'] <= 35)]
print(f"Traditional method (25<=age<=35): {len(result1):,} rows")
# between() method - concise!
result2 = df[df['age'].between(25, 35)]
print(f"between() method (25<=age<=35): {len(result2):,} rows")Traditional method (25<=age<=35): 34,567 rows between() method (25<=age<=35): 34,567 rows
inclusive Option
| Option | Description | Math Expression |
|---|---|---|
'both' (default) | Include both ends | a <= x <= b |
'left' | Include left only | a <= x < b |
'right' | Include right only | a < x <= b |
'neither' | Exclude both ends | a < x < b |
# Comparing inclusive options
print("=== Comparing inclusive options ===")
result1 = df[df['sale_price'].between(50, 100, inclusive='both')]
print(f"50 <= price <= 100: {len(result1):,} rows")
result2 = df[df['sale_price'].between(50, 100, inclusive='left')]
print(f"50 <= price < 100: {len(result2):,} rows")
result3 = df[df['sale_price'].between(50, 100, inclusive='right')]
print(f"50 < price <= 100: {len(result3):,} rows")
result4 = df[df['sale_price'].between(50, 100, inclusive='neither')]
print(f"50 < price < 100: {len(result4):,} rows")=== Comparing inclusive options === 50 <= price <= 100: 28,456 rows 50 <= price < 100: 28,123 rows 50 < price <= 100: 28,089 rows 50 < price < 100: 27,756 rows
Combining with Other Conditions
# Age 20-40, price 30-150, female
result = df[
df['age'].between(20, 40) &
df['sale_price'].between(30, 150) &
(df['gender'] == 'F')
]
print(f"Combined condition result: {len(result):,} rows")
print(f"Average age: {result['age'].mean():.1f} years")
print(f"Average price: ${result['sale_price'].mean():.2f}")Combined condition result: 18,234 rows Average age: 30.2 years Average price: $72.45
4. str Accessor (String Filtering)
Theory
The str accessor provides powerful filtering capabilities for string columns.
4-1. contains() - Check Inclusion
# Brand names containing 'Calvin'
result = df[df['brand'].str.contains('Calvin', na=False)]
print(f"Brands containing 'Calvin': {len(result):,} rows")
print(f"Matching brands: {result['brand'].unique()[:5]}")
# Multiple patterns (regex OR)
result2 = df[df['brand'].str.contains('Nike|Adidas', na=False)]
print(f"\n'Nike' or 'Adidas': {len(result2):,} rows")Brands containing 'Calvin': 2,345 rows Matching brands: ['Calvin Klein' 'Calvin Klein Jeans' 'Calvin Klein Performance'] 'Nike' or 'Adidas': 6,135 rows
If you don’t specify na=False, you’ll get an error on NaN values. Always add na=False.
4-2. startswith() / endswith()
# Categories starting with 'S'
result = df[df['category'].str.startswith('S', na=False)]
print(f"Categories starting with 'S': {len(result):,} rows")
print(f"Matching categories: {result['category'].unique()}")
# Categories ending with 's'
result2 = df[df['category'].str.endswith('s', na=False)]
print(f"\nCategories ending with 's': {len(result2):,} rows")
print(f"Matching categories: {result2['category'].unique()}")Categories starting with 'S': 24,567 rows Matching categories: ['Sweaters' 'Shorts' 'Skirts' 'Sleep & Lounge' 'Swim' 'Socks' 'Suits'] Categories ending with 's': 45,678 rows Matching categories: ['Jeans' 'Sweaters' 'Pants' 'Shorts' 'Jackets' 'Skirts' 'Socks' 'Suits' 'Accessories']
4-3. String Length Filtering
# Brand names with 5 or fewer characters
result = df[df['brand'].str.len() <= 5]
print(f"Brand names with 5 or fewer characters: {len(result):,} rows")
print(f"Matching brands: {result['brand'].unique()}")
# Category names with 8 or more characters
result2 = df[df['category'].str.len() >= 8]
print(f"\nCategory names with 8 or more characters: {len(result2):,} rows")
print(f"Matching categories: {result2['category'].unique()}")Brand names with 5 or fewer characters: 15,234 rows Matching brands: ['Nike' 'Puma' 'Vans' 'Orvis' 'IZOD'] Category names with 8 or more characters: 32,456 rows Matching categories: ['Sweaters' 'Outerwear & Coats' 'Sleep & Lounge' 'Accessories']
str Method Summary
| Method | Description | Example |
|---|---|---|
str.contains() | Contains string | df['col'].str.contains('text') |
str.startswith() | Starts with | df['col'].str.startswith('A') |
str.endswith() | Ends with | df['col'].str.endswith('s') |
str.len() | String length | df['col'].str.len() > 5 |
str.upper() | Convert to uppercase | df['col'].str.upper() |
str.lower() | Convert to lowercase | df['col'].str.lower() |
str.strip() | Remove whitespace | df['col'].str.strip() |
Quiz 1: Using query()
Problem
Using the query() method:
- Age 30 or above
- Sale price between 40 and 120
- Gender is male (‘M’)
Select only ‘user_id’, ‘age’, ‘gender’, ‘sale_price’, ‘brand’ columns from data satisfying all conditions above, sort by price in descending order, and output the top 15.
View Answer
# Filter conditions with query()
result = df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == "M"')
# Select columns and sort
result = result[['user_id', 'age', 'gender', 'sale_price', 'brand']]
result = result.sort_values('sale_price', ascending=False).head(15)
print(f"Data satisfying conditions: {len(df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == \"M\"')):,} rows")
print("\nTop 15:")
print(result)
# Or combine with between
result2 = df.query('age >= 30 and gender == "M"')
result2 = result2[result2['sale_price'].between(40, 120)]
print(f"\nCombined with between also works: {len(result2):,} rows")Data satisfying conditions: 12,456 rows
Top 15:
user_id age gender sale_price brand
12456 34521 45 M 120.00 Nike
23891 78234 52 M 119.99 Adidas
34521 23410 38 M 119.95 Calvin Klein
8923 56210 41 M 119.90 Under Armour
45123 89123 33 M 119.85 Ralph Lauren
19234 45212 47 M 119.75 Tommy Hilfiger
28734 62345 39 M 119.50 Nike
37891 12345 55 M 119.30 Adidas
42156 90123 36 M 119.20 Calvin Klein
15678 34567 43 M 119.00 Puma
24567 78912 50 M 118.95 Reebok
31234 27890 34 M 118.80 New Balance
39012 56789 48 M 118.75 Adidas
46789 82345 42 M 118.50 Nike
11234 45678 37 M 118.45 Calvin Klein
Combined with between also works: 12,456 rowsQuiz 2: Combining isin() and between()
Problem
Filter data satisfying all the following conditions:
- Category is one of ‘Jeans’, ‘Sweaters’, ‘Pants’
- Price between 25 and 75
- Department is ‘Women’ or ‘Men’
From the filtered data, calculate the average price and count by category, and sort by average price in descending order.
View Answer
# Filter conditions
filtered = df[
df['category'].isin(['Jeans', 'Sweaters', 'Pants']) &
df['sale_price'].between(25, 75) &
df['department'].isin(['Women', 'Men'])
]
# Aggregate by category
result = filtered.groupby('category').agg(
avg_price=('sale_price', 'mean'),
count=('sale_price', 'count')
).round(2)
# Sort
result = result.sort_values('avg_price', ascending=False)
print(f"Filtered data: {len(filtered):,} rows")
print(f"\nStatistics by category:")
print(result)Filtered data: 8,567 rows
Statistics by category:
avg_price count
category
Jeans 49.23 2856
Sweaters 48.45 2912
Pants 47.89 2799Quiz 3: Complex Filtering (Advanced)
Problem
Filter with the following conditions:
- Brands with a space (’ ’) in the name
- Sale price is 60% or less of retail price (40% or more discount)
- Country is one of ‘United States’, ‘Germany’, ‘France’, ‘Australia’
From the filtered data:
- Group by brand
- Calculate average discount rate, total sales count, average sale price for each brand
- Output only brands with 50 or more total sales
- Sort by average discount rate in descending order
View Answer
# Step 1: Brands with space in name
has_space = df[df['brand'].str.contains(' ', na=False)].copy()
print(f"Brands with space in name: {len(has_space):,} rows")
# Step 2: Calculate discount rate
has_space['discount_rate'] = (has_space['retail_price'] - has_space['sale_price']) / has_space['retail_price'] * 100
# Step 3: Discount rate 40% or more + specific countries
filtered = has_space[
(has_space['discount_rate'] >= 40) &
has_space['country'].isin(['United States', 'Germany', 'France', 'Australia'])
]
print(f"Discount rate 40% or more + 4 countries: {len(filtered):,} rows")
# Step 4: Aggregate by brand
result = filtered.groupby('brand').agg(
avg_discount_rate=('discount_rate', 'mean'),
total_sales_count=('order_id', 'count'),
avg_sale_price=('sale_price', 'mean')
).round(2)
# Step 5: Only brands with 50 or more sales
result = result[result['total_sales_count'] >= 50]
# Step 6: Sort
result = result.sort_values('avg_discount_rate', ascending=False)
print(f"\nBrands satisfying conditions: {len(result)}")
print(result.head(10))Brands with space in name: 45,678 rows
Discount rate 40% or more + 4 countries: 8,234 rows
Brands satisfying conditions: 15
avg_discount_rate total_sales_count avg_sale_price
brand
Calvin Klein Performance 53.45 245 42.30
Calvin Klein Jeans 52.12 312 45.67
True Religion 51.78 189 58.90
Tommy Hilfiger 50.45 456 52.34
Ralph Lauren 49.89 378 61.23
Under Armour 48.56 234 55.12
Hurley International 47.90 156 48.45
Diesel Black Gold 47.23 89 72.34
Theory 46.78 123 85.67
Hugo Boss 46.12 67 95.23Summary
Filtering Method Comparison
| Method | Use Case | SQL Equivalent |
|---|---|---|
query() | Complex conditions | Entire WHERE clause |
isin() | One of multiple values | IN (...) |
between() | Range conditions | BETWEEN ... AND ... |
str.contains() | Contains string | LIKE '%...%' |
str.startswith() | Prefix match | LIKE '...%' |
SQL to Pandas Comparison
| SQL | Pandas |
|---|---|
WHERE age > 30 | df[df['age'] > 30] |
WHERE age > 30 AND price < 100 | df[(df['age'] > 30) & (df['price'] < 100)] |
WHERE country IN ('US', 'DE') | df[df['country'].isin(['US', 'DE'])] |
WHERE price BETWEEN 50 AND 100 | df[df['price'].between(50, 100)] |
WHERE name LIKE '%Kim%' | df[df['name'].str.contains('Kim')] |
Next Steps
You’ve mastered filtering! Next, learn advanced aggregation techniques including groupby(), agg(), transform() in Grouping and Aggregation.