Skip to Content
ConceptsPandasAdvanced Filtering

Advanced Data Filtering

Intermediate

Learning Objectives

After completing this recipe, you will be able to:

  • Use query() method for SQL-style filtering
  • Use isin() to filter by multiple values
  • Use between() for range filtering
  • Use str accessor for string filtering
  • Combine multiple conditions

0. Setup

Load CSV data for hands-on practice.

import pandas as pd import numpy as np # Load data DATA_PATH = '/data/' orders = pd.read_csv(DATA_PATH + 'src_orders.csv', parse_dates=['created_at']) order_items = pd.read_csv(DATA_PATH + 'src_order_items.csv') products = pd.read_csv(DATA_PATH + 'src_products.csv') users = pd.read_csv(DATA_PATH + 'src_users.csv') # Merge data for analysis df = order_items.merge(products, on='product_id').merge( orders[['order_id', 'user_id', 'created_at', 'status']], on='order_id' ).merge(users[['user_id', 'age', 'gender', 'state', 'country']], on='user_id') print(f"Data loaded: {len(df):,} rows") print(f"\nColumn list: {df.columns.tolist()}")
실행 결과
Data loaded: 137,935 rows

Column list: ['order_id', 'product_id', 'sale_price', 'returned_at', 'shipped_at', 'delivered_at', 'category', 'brand', 'department', 'name', 'retail_price', 'cost', 'sku', 'distribution_center_id', 'user_id', 'created_at', 'status', 'age', 'gender', 'state', 'country']

1. query() Method

Theory

query() is a powerful method that allows you to filter data in SQL style.

Advantages:

  • Concise and readable code
  • Complex conditions can be expressed as strings
  • Variable reference is possible (@variable_name)

Basic Usage

import pandas as pd # Traditional method - complex and long result1 = df[(df['age'] > 40) & (df['sale_price'] > 50)] print(f"Traditional method: {len(result1):,} rows") # query() method - much cleaner! result2 = df.query('age > 40 and sale_price > 50') print(f"query() method: {len(result2):,} rows")
실행 결과
Traditional method: 23,456 rows
query() method: 23,456 rows

String Conditions

# String comparison result = df.query('gender == "F" and country == "United States"') print(f"Female + United States: {len(result):,} rows") # Combining multiple conditions result2 = df.query('gender == "M" and age > 50 and department == "Men"') print(f"Male + Over 50 + Men department: {len(result2):,} rows")
실행 결과
Female + United States: 12,845 rows
Male + Over 50 + Men department: 5,678 rows

External Variable Reference (@)

# Reference external variables with @ min_price = 100 max_age = 35 target_country = 'United States' result = df.query('sale_price >= @min_price and age < @max_age and country == @target_country') print(f"Conditions: price>={min_price}, age<{max_age}, country={target_country}") print(f"Result: {len(result):,} rows")
실행 결과
Conditions: price>=100, age<35, country=United States
Result: 2,345 rows

in / not in Operators

# One of multiple values result = df.query('country in ["United States", "Germany", "France"]') print(f"Filtering 3 countries: {len(result):,} rows") # not in - exclude result2 = df.query('department not in ["Women", "Men"]') print(f"Excluding Women, Men (Kids only): {len(result2):,} rows")
실행 결과
Filtering 3 countries: 65,234 rows
Excluding Women, Men (Kids only): 13,890 rows

2. isin() Method

Theory

isin() selects rows that match values in a list. It’s equivalent to SQL’s IN operator.

Basic Usage

# Select specific brands only target_brands = ['Nike', 'Adidas', 'Calvin Klein'] result = df[df['brand'].isin(target_brands)] print(f"Selected brands: {target_brands}") print(f"Result: {len(result):,} rows") print(f"\nCount by brand:") print(result['brand'].value_counts())
실행 결과
Selected brands: ['Nike', 'Adidas', 'Calvin Klein']
Result: 8,456 rows

Count by brand:
brand
Nike           3,245
Adidas         2,890
Calvin Klein   2,321
Name: count, dtype: int64

Applying to Multiple Columns

# Specific categories AND specific countries categories = ['Jeans', 'Sweaters', 'Accessories'] countries = ['United States', 'Germany'] result = df[ df['category'].isin(categories) & df['country'].isin(countries) ] print(f"Conditions: categories={categories}") print(f" countries={countries}") print(f"Result: {len(result):,} rows")
실행 결과
Conditions: categories=['Jeans', 'Sweaters', 'Accessories']
         countries=['United States', 'Germany']
Result: 12,567 rows

NOT isin() - Excluding

# Use ~ operator for opposite condition exclude_brands = ['Nike', 'Adidas'] result = df[~df['brand'].isin(exclude_brands)] print(f"Excluded brands: {exclude_brands}") print(f"Result: {len(result):,} rows") print(f"Remaining brand types: {result['brand'].nunique()}")
실행 결과
Excluded brands: ['Nike', 'Adidas']
Result: 131,589 rows
Remaining brand types: 298

3. between() Method

Theory

between() is useful for selecting values within a specific range. It’s equivalent to SQL’s BETWEEN operator.

Basic Usage

# Traditional method - long and redundant result1 = df[(df['age'] >= 25) & (df['age'] <= 35)] print(f"Traditional method (25<=age<=35): {len(result1):,} rows") # between() method - concise! result2 = df[df['age'].between(25, 35)] print(f"between() method (25<=age<=35): {len(result2):,} rows")
실행 결과
Traditional method (25<=age<=35): 34,567 rows
between() method (25<=age<=35): 34,567 rows

inclusive Option

OptionDescriptionMath Expression
'both' (default)Include both endsa <= x <= b
'left'Include left onlya <= x < b
'right'Include right onlya < x <= b
'neither'Exclude both endsa < x < b
# Comparing inclusive options print("=== Comparing inclusive options ===") result1 = df[df['sale_price'].between(50, 100, inclusive='both')] print(f"50 <= price <= 100: {len(result1):,} rows") result2 = df[df['sale_price'].between(50, 100, inclusive='left')] print(f"50 <= price < 100: {len(result2):,} rows") result3 = df[df['sale_price'].between(50, 100, inclusive='right')] print(f"50 < price <= 100: {len(result3):,} rows") result4 = df[df['sale_price'].between(50, 100, inclusive='neither')] print(f"50 < price < 100: {len(result4):,} rows")
실행 결과
=== Comparing inclusive options ===
50 <= price <= 100: 28,456 rows
50 <= price < 100: 28,123 rows
50 < price <= 100: 28,089 rows
50 < price < 100: 27,756 rows

Combining with Other Conditions

# Age 20-40, price 30-150, female result = df[ df['age'].between(20, 40) & df['sale_price'].between(30, 150) & (df['gender'] == 'F') ] print(f"Combined condition result: {len(result):,} rows") print(f"Average age: {result['age'].mean():.1f} years") print(f"Average price: ${result['sale_price'].mean():.2f}")
실행 결과
Combined condition result: 18,234 rows
Average age: 30.2 years
Average price: $72.45

4. str Accessor (String Filtering)

Theory

The str accessor provides powerful filtering capabilities for string columns.

4-1. contains() - Check Inclusion

# Brand names containing 'Calvin' result = df[df['brand'].str.contains('Calvin', na=False)] print(f"Brands containing 'Calvin': {len(result):,} rows") print(f"Matching brands: {result['brand'].unique()[:5]}") # Multiple patterns (regex OR) result2 = df[df['brand'].str.contains('Nike|Adidas', na=False)] print(f"\n'Nike' or 'Adidas': {len(result2):,} rows")
실행 결과
Brands containing 'Calvin': 2,345 rows
Matching brands: ['Calvin Klein' 'Calvin Klein Jeans' 'Calvin Klein Performance']

'Nike' or 'Adidas': 6,135 rows
⚠️
na=False is Important!

If you don’t specify na=False, you’ll get an error on NaN values. Always add na=False.

4-2. startswith() / endswith()

# Categories starting with 'S' result = df[df['category'].str.startswith('S', na=False)] print(f"Categories starting with 'S': {len(result):,} rows") print(f"Matching categories: {result['category'].unique()}") # Categories ending with 's' result2 = df[df['category'].str.endswith('s', na=False)] print(f"\nCategories ending with 's': {len(result2):,} rows") print(f"Matching categories: {result2['category'].unique()}")
실행 결과
Categories starting with 'S': 24,567 rows
Matching categories: ['Sweaters' 'Shorts' 'Skirts' 'Sleep & Lounge' 'Swim' 'Socks' 'Suits']

Categories ending with 's': 45,678 rows
Matching categories: ['Jeans' 'Sweaters' 'Pants' 'Shorts' 'Jackets' 'Skirts' 'Socks' 'Suits' 'Accessories']

4-3. String Length Filtering

# Brand names with 5 or fewer characters result = df[df['brand'].str.len() <= 5] print(f"Brand names with 5 or fewer characters: {len(result):,} rows") print(f"Matching brands: {result['brand'].unique()}") # Category names with 8 or more characters result2 = df[df['category'].str.len() >= 8] print(f"\nCategory names with 8 or more characters: {len(result2):,} rows") print(f"Matching categories: {result2['category'].unique()}")
실행 결과
Brand names with 5 or fewer characters: 15,234 rows
Matching brands: ['Nike' 'Puma' 'Vans' 'Orvis' 'IZOD']

Category names with 8 or more characters: 32,456 rows
Matching categories: ['Sweaters' 'Outerwear & Coats' 'Sleep & Lounge' 'Accessories']

str Method Summary

MethodDescriptionExample
str.contains()Contains stringdf['col'].str.contains('text')
str.startswith()Starts withdf['col'].str.startswith('A')
str.endswith()Ends withdf['col'].str.endswith('s')
str.len()String lengthdf['col'].str.len() > 5
str.upper()Convert to uppercasedf['col'].str.upper()
str.lower()Convert to lowercasedf['col'].str.lower()
str.strip()Remove whitespacedf['col'].str.strip()

Quiz 1: Using query()

Problem

Using the query() method:

  1. Age 30 or above
  2. Sale price between 40 and 120
  3. Gender is male (‘M’)

Select only ‘user_id’, ‘age’, ‘gender’, ‘sale_price’, ‘brand’ columns from data satisfying all conditions above, sort by price in descending order, and output the top 15.

View Answer

# Filter conditions with query() result = df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == "M"') # Select columns and sort result = result[['user_id', 'age', 'gender', 'sale_price', 'brand']] result = result.sort_values('sale_price', ascending=False).head(15) print(f"Data satisfying conditions: {len(df.query('age >= 30 and sale_price >= 40 and sale_price <= 120 and gender == \"M\"')):,} rows") print("\nTop 15:") print(result) # Or combine with between result2 = df.query('age >= 30 and gender == "M"') result2 = result2[result2['sale_price'].between(40, 120)] print(f"\nCombined with between also works: {len(result2):,} rows")
실행 결과
Data satisfying conditions: 12,456 rows

Top 15:
     user_id  age gender  sale_price           brand
12456    34521   45      M      120.00            Nike
23891    78234   52      M      119.99          Adidas
34521    23410   38      M      119.95    Calvin Klein
8923     56210   41      M      119.90   Under Armour
45123    89123   33      M      119.85    Ralph Lauren
19234    45212   47      M      119.75  Tommy Hilfiger
28734    62345   39      M      119.50            Nike
37891    12345   55      M      119.30          Adidas
42156    90123   36      M      119.20    Calvin Klein
15678    34567   43      M      119.00            Puma
24567    78912   50      M      118.95         Reebok
31234    27890   34      M      118.80     New Balance
39012    56789   48      M      118.75          Adidas
46789    82345   42      M      118.50            Nike
11234    45678   37      M      118.45    Calvin Klein

Combined with between also works: 12,456 rows

Quiz 2: Combining isin() and between()

Problem

Filter data satisfying all the following conditions:

  1. Category is one of ‘Jeans’, ‘Sweaters’, ‘Pants’
  2. Price between 25 and 75
  3. Department is ‘Women’ or ‘Men’

From the filtered data, calculate the average price and count by category, and sort by average price in descending order.

View Answer

# Filter conditions filtered = df[ df['category'].isin(['Jeans', 'Sweaters', 'Pants']) & df['sale_price'].between(25, 75) & df['department'].isin(['Women', 'Men']) ] # Aggregate by category result = filtered.groupby('category').agg( avg_price=('sale_price', 'mean'), count=('sale_price', 'count') ).round(2) # Sort result = result.sort_values('avg_price', ascending=False) print(f"Filtered data: {len(filtered):,} rows") print(f"\nStatistics by category:") print(result)
실행 결과
Filtered data: 8,567 rows

Statistics by category:
        avg_price    count
category
Jeans         49.23   2856
Sweaters      48.45   2912
Pants         47.89   2799

Quiz 3: Complex Filtering (Advanced)

Problem

Filter with the following conditions:

  1. Brands with a space (’ ’) in the name
  2. Sale price is 60% or less of retail price (40% or more discount)
  3. Country is one of ‘United States’, ‘Germany’, ‘France’, ‘Australia’

From the filtered data:

  • Group by brand
  • Calculate average discount rate, total sales count, average sale price for each brand
  • Output only brands with 50 or more total sales
  • Sort by average discount rate in descending order

View Answer

# Step 1: Brands with space in name has_space = df[df['brand'].str.contains(' ', na=False)].copy() print(f"Brands with space in name: {len(has_space):,} rows") # Step 2: Calculate discount rate has_space['discount_rate'] = (has_space['retail_price'] - has_space['sale_price']) / has_space['retail_price'] * 100 # Step 3: Discount rate 40% or more + specific countries filtered = has_space[ (has_space['discount_rate'] >= 40) & has_space['country'].isin(['United States', 'Germany', 'France', 'Australia']) ] print(f"Discount rate 40% or more + 4 countries: {len(filtered):,} rows") # Step 4: Aggregate by brand result = filtered.groupby('brand').agg( avg_discount_rate=('discount_rate', 'mean'), total_sales_count=('order_id', 'count'), avg_sale_price=('sale_price', 'mean') ).round(2) # Step 5: Only brands with 50 or more sales result = result[result['total_sales_count'] >= 50] # Step 6: Sort result = result.sort_values('avg_discount_rate', ascending=False) print(f"\nBrands satisfying conditions: {len(result)}") print(result.head(10))
실행 결과
Brands with space in name: 45,678 rows
Discount rate 40% or more + 4 countries: 8,234 rows

Brands satisfying conditions: 15
                        avg_discount_rate  total_sales_count  avg_sale_price
brand
Calvin Klein Performance             53.45                245           42.30
Calvin Klein Jeans                   52.12                312           45.67
True Religion                        51.78                189           58.90
Tommy Hilfiger                       50.45                456           52.34
Ralph Lauren                         49.89                378           61.23
Under Armour                         48.56                234           55.12
Hurley International                 47.90                156           48.45
Diesel Black Gold                    47.23                 89           72.34
Theory                               46.78                123           85.67
Hugo Boss                            46.12                 67           95.23

Summary

Filtering Method Comparison

MethodUse CaseSQL Equivalent
query()Complex conditionsEntire WHERE clause
isin()One of multiple valuesIN (...)
between()Range conditionsBETWEEN ... AND ...
str.contains()Contains stringLIKE '%...%'
str.startswith()Prefix matchLIKE '...%'

SQL to Pandas Comparison

SQLPandas
WHERE age > 30df[df['age'] > 30]
WHERE age > 30 AND price < 100df[(df['age'] > 30) & (df['price'] < 100)]
WHERE country IN ('US', 'DE')df[df['country'].isin(['US', 'DE'])]
WHERE price BETWEEN 50 AND 100df[df['price'].between(50, 100)]
WHERE name LIKE '%Kim%'df[df['name'].str.contains('Kim')]

Next Steps

You’ve mastered filtering! Next, learn advanced aggregation techniques including groupby(), agg(), transform() in Grouping and Aggregation.

Last updated on

🤖AI 모의면접실전처럼 연습하기