Advanced Data Filtering

Intermediate

Learning Objectives

After completing this recipe, you will be able to:

Use query() method for SQL-style filtering
Use isin() to filter by multiple values
Use between() for range filtering
Use str accessor for string filtering
Combine multiple conditions

0. Setup

Load CSV data for hands-on practice.


import pandas as pd
import numpy as np
 
# Load data
DATA_PATH = '/data/'
 
orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at'])
order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv')
products = pd.read_csv(DATA_PATH + 'src_products.csv')
users = pd.read_csv(DATA_PATH + 'src_users.csv')
 
# Merge data for analysis
df = order_items.merge(products, on='product_id').merge(
    orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id'
).merge(users[['user_id', 'age', 'gender', 'state', 'country']], on='user_id')
 
print(f"Data loaded: {len(df):,} rows")
print(f"\nColumn list: {df.columns.tolist()}")

실행 결과

Data loaded: 137,935 rows

Column list: ['order_id', 'product_id', 'sale_price', 'returned_at', 'shipped_at', 'delivered_at', 'category', 'brand', 'department', 'name', 'retail_price', 'cost', 'sku', 'distribution_center_id', 'user_id', 'created_at', 'status', 'age', 'gender', 'state', 'country']

1. query() Method

Theory

query() is a powerful method that allows you to filter data in SQL style.

Advantages:

Concise and readable code
Complex conditions can be expressed as strings
Variable reference is possible (@variable_name)

Basic Usage


import pandas as pd
 
# Traditional method - complex and long
result1 = df[(df['age'] > 40) & (df['sale_price'] > 50)]
print(f"Traditional method: {len(result1):,} rows")
 
# query() method - much cleaner!
result2 = df.query('age > 40 and sale_price > 50')
print(f"query() method: {len(result2):,} rows")

실행 결과

Traditional method: 23,456 rows
query() method: 23,456 rows

String Conditions


# String comparison
result = df.query('gender == "F" and country == "United States"')
print(f"Female + United States: {len(result):,} rows")
 
# Combining multiple conditions
result2 = df.query('gender == "M" and age > 50 and department == "Men"')
print(f"Male + Over 50 + Men department: {len(result2):,} rows")

실행 결과

Female + United States: 12,845 rows
Male + Over 50 + Men department: 5,678 rows

External Variable Reference (@)


# Reference external variables with @
min_price = 100
max_age = 35
target_country = 'United States'
 
result = df.query('sale_price >= @min_price and age < @max_age and country == @target_country')
print(f"Conditions: price>={min_price}, age<{max_age}, country={target_country}")
print(f"Result: {len(result):,} rows")

실행 결과

Conditions: price>=100, age<35, country=United States
Result: 2,345 rows

in / not in Operators


# One of multiple values
result = df.query('country in ["United States", "Germany", "France"]')
print(f"Filtering 3 countries: {len(result):,} rows")
 
# not in - exclude
result2 = df.query('department not in ["Women", "Men"]')
print(f"Excluding Women, Men (Kids only): {len(result2):,} rows")

실행 결과

Filtering 3 countries: 65,234 rows
Excluding Women, Men (Kids only): 13,890 rows

2. isin() Method

Theory

isin() selects rows that match values in a list. It’s equivalent to SQL’s IN operator.

Basic Usage


# Select specific brands only
target_brands = ['Nike', 'Adidas', 'Calvin Klein']
result = df[df['brand'].isin(target_brands)]
 
print(f"Selected brands: {target_brands}")
print(f"Result: {len(result):,} rows")
print(f"\nCount by brand:")
print(result['brand'].value_counts())

실행 결과

Selected brands: ['Nike', 'Adidas', 'Calvin Klein']
Result: 8,456 rows

Count by brand:
brand
Nike           3,245
Adidas         2,890
Calvin Klein   2,321
Name: count, dtype: int64

Applying to Multiple Columns


# Specific categories AND specific countries
categories = ['Jeans', 'Sweaters', 'Accessories']
countries = ['United States', 'Germany']
 
result = df[
    df['category'].isin(categories) &
    df['country'].isin(countries)
]
print(f"Conditions: categories={categories}")
print(f"           countries={countries}")
print(f"Result: {len(result):,} rows")

실행 결과

Conditions: categories=['Jeans', 'Sweaters', 'Accessories']
         countries=['United States', 'Germany']
Result: 12,567 rows

NOT isin() - Excluding


# Use ~ operator for opposite condition
exclude_brands = ['Nike', 'Adidas']
result = df[~df['brand'].isin(exclude_brands)]
 
print(f"Excluded brands: {exclude_brands}")
print(f"Result: {len(result):,} rows")
print(f"Remaining brand types: {result['brand'].nunique()}")

실행 결과

Excluded brands: ['Nike', 'Adidas']
Result: 131,589 rows
Remaining brand types: 298

3. between() Method

Theory

between() is useful for selecting values within a specific range. It’s equivalent to SQL’s BETWEEN operator.

Basic Usage


# Traditional method - long and redundant
result1 = df[(df['age'] >= 25) & (df['age'] <= 35)]
print(f"Traditional method (25<=age<=35): {len(result1):,} rows")
 
# between() method - concise!
result2 = df[df['age'].between(25, 35)]
print(f"between() method (25<=age<=35): {len(result2):,} rows")

실행 결과

Traditional method (25<=age<=35): 34,567 rows
between() method (25<=age<=35): 34,567 rows

inclusive Option

Option	Description	Math Expression
`'both'` (default)	Include both ends	`a <= x <= b`
`'left'`	Include left only	`a <= x < b`
`'right'`	Include right only	`a < x <= b`
`'neither'`	Exclude both ends	`a < x < b`


# Comparing inclusive options
print("=== Comparing inclusive options ===")
 
result1 = df[df['sale_price'].between(50, 100, inclusive='both')]
print(f"50 <= price <= 100: {len(result1):,} rows")
 
result2 = df[df['sale_price'].between(50, 100, inclusive='left')]
print(f"50 <= price < 100: {len(result2):,} rows")
 
result3 = df[df['sale_price'].between(50, 100, inclusive='right')]
print(f"50 < price <= 100: {len(result3):,} rows")
 
result4 = df[df['sale_price'].between(50, 100, inclusive='neither')]
print(f"50 < price < 100: {len(result4):,} rows")

실행 결과

=== Comparing inclusive options ===
50 <= price <= 100: 28,456 rows
50 <= price < 100: 28,123 rows
50 < price <= 100: 28,089 rows
50 < price < 100: 27,756 rows

Combining with Other Conditions


# Age 20-40, price 30-150, female
result = df[
    df['age'].between(20, 40) &
    df['sale_price'].between(30, 150) &
    (df['gender'] == 'F')
]
 
print(f"Combined condition result: {len(result):,} rows")
print(f"Average age: {result['age'].mean():.1f} years")
print(f"Average price: ${result['sale_price'].mean():.2f}")

실행 결과

Combined condition result: 18,234 rows
Average age: 30.2 years
Average price: $72.45

4. str Accessor (String Filtering)

Theory

The str accessor provides powerful filtering capabilities for string columns.

4-1. contains() - Check Inclusion


# Brand names containing 'Calvin'
result = df[df['brand'].str.contains('Calvin', na=False)]
print(f"Brands containing 'Calvin': {len(result):,} rows")
print(f"Matching brands: {result['brand'].unique()[:5]}")
 
# Multiple patterns (regex OR)
result2 = df[df['brand'].str.contains('Nike|Adidas', na=False)]
print(f"\n'Nike' or 'Adidas': {len(result2):,} rows")

실행 결과

Brands containing 'Calvin': 2,345 rows
Matching brands: ['Calvin Klein' 'Calvin Klein Jeans' 'Calvin Klein Performance']

'Nike' or 'Adidas': 6,135 rows

⚠️

na=False is Important!

If you don’t specify na=False, you’ll get an error on NaN values. Always add na=False.

4-2. startswith() / endswith()


# Categories starting with 'S'
result = df[df['category'].str.startswith('S', na=False)]
print(f"Categories starting with 'S': {len(result):,} rows")
print(f"Matching categories: {result['category'].unique()}")
 
# Categories ending with 's'
result2 = df[df['category'].str.endswith('s', na=False)]
print(f"\nCategories ending with 's': {len(result2):,} rows")
print(f"Matching categories: {result2['category'].unique()}")

실행 결과

Categories starting with 'S': 24,567 rows
Matching categories: ['Sweaters' 'Shorts' 'Skirts' 'Sleep & Lounge' 'Swim' 'Socks' 'Suits']

Categories ending with 's': 45,678 rows
Matching categories: ['Jeans' 'Sweaters' 'Pants' 'Shorts' 'Jackets' 'Skirts' 'Socks' 'Suits' 'Accessories']

4-3. String Length Filtering


# Brand names with 5 or fewer characters
result = df[df['brand'].str.len() <= 5]
print(f"Brand names with 5 or fewer characters: {len(result):,} rows")
print(f"Matching brands: {result['brand'].unique()}")
 
# Category names with 8 or more characters
result2 = df[df['category'].str.len() >= 8]
print(f"\nCategory names with 8 or more characters: {len(result2):,} rows")
print(f"Matching categories: {result2['category'].unique()}")

실행 결과

Brand names with 5 or fewer characters: 15,234 rows
Matching brands: ['Nike' 'Puma' 'Vans' 'Orvis' 'IZOD']

Category names with 8 or more characters: 32,456 rows
Matching categories: ['Sweaters' 'Outerwear & Coats' 'Sleep & Lounge' 'Accessories']

str Method Summary

Method	Description	Example
`str.contains()`	Contains string	`df['col'].str.contains('text')`
`str.startswith()`	Starts with	`df['col'].str.startswith('A')`
`str.endswith()`	Ends with	`df['col'].str.endswith('s')`
`str.len()`	String length	`df['col'].str.len() > 5`
`str.upper()`	Convert to uppercase	`df['col'].str.upper()`
`str.lower()`	Convert to lowercase	`df['col'].str.lower()`
`str.strip()`	Remove whitespace	`df['col'].str.strip()`

Quiz 1: Using query()

Problem

Using the query() method:

Age 30 or above
Sale price between 40 and 120
Gender is male (‘M’)

Select only ‘user_id’, ‘age’, ‘gender’, ‘sale_price’, ‘brand’ columns from data satisfying all conditions above, sort by price in descending order, and output the top 15.

View Answer


# Filter conditions with query()
result = df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == "M"')
 
# Select columns and sort
result = result[['user_id', 'age', 'gender', 'sale_price', 'brand']]
result = result.sort_values('sale_price', ascending=False).head(15)
 
print(f"Data satisfying conditions: {len(df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == \"M\"')):,} rows")
print("\nTop 15:")
print(result)
 
# Or combine with between
result2 = df.query('age >= 30 and gender == "M"')
result2 = result2[result2['sale_price'].between(40, 120)]
print(f"\nCombined with between also works: {len(result2):,} rows")

실행 결과

Data satisfying conditions: 12,456 rows

Top 15:
     user_id  age gender  sale_price           brand
12456    34521   45      M      120.00            Nike
23891    78234   52      M      119.99          Adidas
34521    23410   38      M      119.95    Calvin Klein
8923     56210   41      M      119.90   Under Armour
45123    89123   33      M      119.85    Ralph Lauren
19234    45212   47      M      119.75  Tommy Hilfiger
28734    62345   39      M      119.50            Nike
37891    12345   55      M      119.30          Adidas
42156    90123   36      M      119.20    Calvin Klein
15678    34567   43      M      119.00            Puma
24567    78912   50      M      118.95         Reebok
31234    27890   34      M      118.80     New Balance
39012    56789   48      M      118.75          Adidas
46789    82345   42      M      118.50            Nike
11234    45678   37      M      118.45    Calvin Klein

Combined with between also works: 12,456 rows

Quiz 2: Combining isin() and between()

Problem

Filter data satisfying all the following conditions:

Category is one of ‘Jeans’, ‘Sweaters’, ‘Pants’
Price between 25 and 75
Department is ‘Women’ or ‘Men’

From the filtered data, calculate the average price and count by category, and sort by average price in descending order.

View Answer


# Filter conditions
filtered = df[
    df['category'].isin(['Jeans', 'Sweaters', 'Pants']) &
    df['sale_price'].between(25, 75) &
    df['department'].isin(['Women', 'Men'])
]
 
# Aggregate by category
result = filtered.groupby('category').agg(
    avg_price=('sale_price', 'mean'),
    count=('sale_price', 'count')
).round(2)
 
# Sort
result = result.sort_values('avg_price', ascending=False)
 
print(f"Filtered data: {len(filtered):,} rows")
print(f"\nStatistics by category:")
print(result)

실행 결과

Filtered data: 8,567 rows

Statistics by category:
        avg_price    count
category
Jeans         49.23   2856
Sweaters      48.45   2912
Pants         47.89   2799

Quiz 3: Complex Filtering (Advanced)

Problem

Filter with the following conditions:

Brands with a space (’ ’) in the name
Sale price is 60% or less of retail price (40% or more discount)
Country is one of ‘United States’, ‘Germany’, ‘France’, ‘Australia’

From the filtered data:

Group by brand
Calculate average discount rate, total sales count, average sale price for each brand
Output only brands with 50 or more total sales
Sort by average discount rate in descending order

View Answer


# Step 1: Brands with space in name
has_space = df[df['brand'].str.contains(' ', na=False)].copy()
print(f"Brands with space in name: {len(has_space):,} rows")
 
# Step 2: Calculate discount rate
has_space['discount_rate'] = (has_space['retail_price'] - has_space['sale_price']) / has_space['retail_price'] * 100
 
# Step 3: Discount rate 40% or more + specific countries
filtered = has_space[
    (has_space['discount_rate'] >= 40) &
    has_space['country'].isin(['United States', 'Germany', 'France', 'Australia'])
]
print(f"Discount rate 40% or more + 4 countries: {len(filtered):,} rows")
 
# Step 4: Aggregate by brand
result = filtered.groupby('brand').agg(
    avg_discount_rate=('discount_rate', 'mean'),
    total_sales_count=('order_id', 'count'),
    avg_sale_price=('sale_price', 'mean')
).round(2)
 
# Step 5: Only brands with 50 or more sales
result = result[result['total_sales_count'] >= 50]
 
# Step 6: Sort
result = result.sort_values('avg_discount_rate', ascending=False)
 
print(f"\nBrands satisfying conditions: {len(result)}")
print(result.head(10))

실행 결과

Brands with space in name: 45,678 rows
Discount rate 40% or more + 4 countries: 8,234 rows

Brands satisfying conditions: 15
                        avg_discount_rate  total_sales_count  avg_sale_price
brand
Calvin Klein Performance             53.45                245           42.30
Calvin Klein Jeans                   52.12                312           45.67
True Religion                        51.78                189           58.90
Tommy Hilfiger                       50.45                456           52.34
Ralph Lauren                         49.89                378           61.23
Under Armour                         48.56                234           55.12
Hurley International                 47.90                156           48.45
Diesel Black Gold                    47.23                 89           72.34
Theory                               46.78                123           85.67
Hugo Boss                            46.12                 67           95.23

Summary

Filtering Method Comparison

Method	Use Case	SQL Equivalent
`query()`	Complex conditions	Entire `WHERE` clause
`isin()`	One of multiple values	`IN (...)`
`between()`	Range conditions	`BETWEEN ... AND ...`
`str.contains()`	Contains string	`LIKE '%...%'`
`str.startswith()`	Prefix match	`LIKE '...%'`

SQL to Pandas Comparison

SQL	Pandas
`WHERE age > 30`	`df[df['age'] > 30]`
`WHERE age > 30 AND price < 100`	`df[(df['age'] > 30) & (df['price'] < 100)]`
`WHERE country IN ('US', 'DE')`	`df[df['country'].isin(['US', 'DE'])]`
`WHERE price BETWEEN 50 AND 100`	`df[df['price'].between(50, 100)]`
`WHERE name LIKE '%Kim%'`	`df[df['name'].str.contains('Kim')]`

Next Steps

You’ve mastered filtering! Next, learn advanced aggregation techniques including groupby(), agg(), transform() in Grouping and Aggregation.