Recommendation Systems

Advanced

Learning Objectives

After completing this recipe, you will be able to:

  • Implement Collaborative Filtering
  • Implement Content-Based Filtering
  • Design hybrid recommendation systems
  • Evaluate recommendation performance (Precision@K, Recall@K)

1. What are Recommendation Systems?

Theory

Recommendation systems suggest relevant items based on users’ past behavior and preferences.

Recommendation Methods:

| Method | Principle | Advantages | Disadvantages |
|---|---|---|---|
| Collaborative Filtering | Similar users/items | Diverse recommendations | Cold start |
| Content-Based | Item feature similarity | Addresses cold start | Lack of diversity |
| Hybrid | Combines both methods | Balanced recommendations | Complexity |

Business Applications

  • E-commerce: “Customers who viewed this also bought”
  • Content: “Videos you might like”
  • Marketing: Personalized promotions

2. Data Preparation

Sample Purchase Data Generation

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
import warnings

warnings.filterwarnings('ignore')

# Set seed for reproducible results
np.random.seed(42)

# Generate user-product purchase data
n_users = 200
n_products = 100
n_interactions = 3000

# Generate random interactions
user_ids = np.random.randint(1, n_users + 1, n_interactions)
product_ids = np.random.randint(1, n_products + 1, n_interactions)

# Purchase dataframe
interactions = pd.DataFrame({
    'user_id': user_ids,
    'product_id': product_ids,
    'purchase_count': np.random.randint(1, 5, n_interactions),
    'total_spent': np.random.exponential(50, n_interactions)
})

# Remove duplicates (aggregate same user-product)
interactions = interactions.groupby(['user_id', 'product_id']).agg({
    'purchase_count': 'sum',
    'total_spent': 'sum'
}).reset_index()

# Implicit feedback: purchase indicator (0/1)
interactions['purchased'] = 1

print(f"Total interactions: {len(interactions)}")
print(f"Unique users: {interactions['user_id'].nunique()}")
print(f"Unique products: {interactions['product_id'].nunique()}")
print(f"\nInteraction sample:")
print(interactions.head())

Execution Result
Total interactions: 2456
Unique users: 200
Unique products: 100

Interaction sample:
 user_id  product_id  purchase_count  total_spent  purchased
0        1           3               2        78.45          1
1        1          12               1        34.56          1
2        1          25               3       112.34          1
3        1          45               1        23.78          1
4        1          67               2        89.12          1

User-Item Matrix Creation

# Create matrix with pivot table
user_item_matrix = interactions.pivot(
    index='user_id',
    columns='product_id',
    values='purchased'
).fillna(0)

# Calculate sparsity: share of user-product cells with no purchase
sparsity = 1 - (user_item_matrix.values.sum() / user_item_matrix.size)

print(f"Matrix size: {user_item_matrix.shape}")
print(f"Sparsity: {sparsity:.2%}")
print(f"\nMatrix sample (first 5 users x first 10 products):")
print(user_item_matrix.iloc[:5, :10])

Execution Result
Matrix size: (200, 100)
Sparsity: 87.72%

Matrix sample (first 5 users x first 10 products):
product_id   1    2    3    4    5    6    7    8    9   10
user_id
1          0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0
2          1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
3          0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0
4          0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0
5          1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0
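
At this size a dense pivot table is fine, but with high sparsity most cells are zeros, so larger catalogs are usually stored in a sparse format. The following is a minimal sketch, assuming a small hypothetical set of interactions (not the tutorial's generated data), of building the same kind of matrix as a SciPy CSR matrix and reading the sparsity from its stored entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interactions (hypothetical): one (user_id, product_id) pair per purchase
user_ids = np.array([1, 1, 2, 3, 3])
product_ids = np.array([10, 20, 10, 20, 30])

# Map raw IDs to consecutive row/column positions
users, rows = np.unique(user_ids, return_inverse=True)
items, cols = np.unique(product_ids, return_inverse=True)

# Only the non-zero cells are stored
sparse_matrix = csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(users), len(items)),
)

# Sparsity = 1 - (stored entries / total cells)
sparsity = 1 - sparse_matrix.nnz / (sparse_matrix.shape[0] * sparse_matrix.shape[1])
print(sparse_matrix.shape, f"sparsity={sparsity:.2%}")
```

`cosine_similarity` from scikit-learn accepts sparse input, so the similarity steps that follow would not need a dense copy.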

3. Collaborative Filtering

3-1. User-Based Collaborative Filtering (User-Based CF)

Principle: Recommend items liked by users with similar preferences

# Calculate user similarity (cosine similarity)
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

# Find most similar users to a specific user
def find_similar_users(user_id, top_n=10):
    if user_id not in user_similarity_df.index:
        return pd.Series(dtype=float)
    # Drop the user explicitly: slicing off the first row would remove the
    # wrong entry if another user also has similarity 1.0
    similar = (
        user_similarity_df[user_id]
        .drop(user_id)
        .sort_values(ascending=False)
        .head(top_n)
    )
    return similar

# Example: Users similar to user 1
target_user = 1
similar_users = find_similar_users(target_user)
print(f"Users similar to user {target_user}:")
print(similar_users.head())

Execution Result
Users similar to user 1:
user_id
45     0.654321
78     0.612345
123    0.589012
167    0.567890
34     0.534567
Name: 1, dtype: float64
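
For 0/1 purchase vectors, cosine similarity reduces to the number of shared products divided by the geometric mean of the two users' purchase counts. The sketch below checks this on two hypothetical users with NumPy only; the vectors are made up for illustration.

```python
import numpy as np

# Two hypothetical users over five products (1 = purchased)
a = np.array([1, 1, 0, 1, 0], dtype=float)
b = np.array([1, 0, 0, 1, 1], dtype=float)

# Standard definition: dot product over the product of the norms
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Binary shortcut: shared purchases / sqrt(count_a * count_b)
shared = int((a * b).sum())
by_counts = shared / np.sqrt(a.sum() * b.sum())

print(f"cosine={cosine:.4f}, by_counts={by_counts:.4f}, shared={shared}")
```

Both users bought three products and share two of them, so both expressions evaluate to 2/3.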

User-Based Recommendation Function

def recommend_user_based(user_id, top_n=10):
    """User-based collaborative filtering recommendations"""
    if user_id not in user_item_matrix.index:
        return []

    # Find similar users
    similar_users = find_similar_users(user_id, top_n=20)

    # Already purchased products
    purchased = set(user_item_matrix.loc[user_id][user_item_matrix.loc[user_id] > 0].index)

    # Calculate scores for products purchased by similar users
    recommendations = {}
    for sim_user, similarity in similar_users.items():
        sim_user_purchases = user_item_matrix.loc[sim_user]
        for product, purchased_flag in sim_user_purchases.items():
            if purchased_flag > 0 and product not in purchased:
                if product not in recommendations:
                    recommendations[product] = 0
                recommendations[product] += similarity

    # Sort by score
    sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recs[:top_n]

# Run recommendations
recs_user_based = recommend_user_based(target_user)
print(f"\nRecommended products for user {target_user} (User-Based CF):")
for product, score in recs_user_based[:5]:
    print(f" Product {product}: Score {score:.3f}")

Figure: User-Item Matrix. User-product purchase matrix; darker blue indicates a higher purchase count or rating.

Figure: Recommendation Results. Left: the user's purchase history. Right: recommended products and their scores.

3-2. Item-Based Collaborative Filtering (Item-Based CF)

Principle: Recommend items similar to items I liked

# Calculate item similarity
item_similarity = cosine_similarity(user_item_matrix.T)
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=user_item_matrix.columns,
    columns=user_item_matrix.columns
)

def find_similar_items(product_id, top_n=10):
    """Find similar items"""
    if product_id not in item_similarity_df.index:
        return pd.Series(dtype=float)
    # Drop the product itself explicitly rather than assuming it sorts first
    similar = (
        item_similarity_df[product_id]
        .drop(product_id)
        .sort_values(ascending=False)
        .head(top_n)
    )
    return similar

# Example: Products similar to product 3
target_product = 3
similar_items = find_similar_items(target_product)
print(f"Products similar to product {target_product}:")
print(similar_items.head())

Figure: Product Similarity. Product similarity heatmap; darker colors indicate more similar products. The diagonal represents self-similarity (1.0).

Item-Based Recommendation Function

def recommend_item_based(user_id, top_n=10):
    """Item-based collaborative filtering recommendations"""
    if user_id not in user_item_matrix.index:
        return []

    # Products purchased by user
    user_purchases = user_item_matrix.loc[user_id]
    purchased = set(user_purchases[user_purchases > 0].index)

    # Calculate scores for products similar to purchased ones
    recommendations = {}
    for product in purchased:
        similar = find_similar_items(product, top_n=20)
        for sim_product, similarity in similar.items():
            if sim_product not in purchased:
                if sim_product not in recommendations:
                    recommendations[sim_product] = 0
                recommendations[sim_product] += similarity

    sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recs[:top_n]

# Run recommendations
recs_item_based = recommend_item_based(target_user)
print(f"\nItem-Based Recommendations (Item-Based CF):")
for product, score in recs_item_based[:5]:
    print(f" Product {product}: Score {score:.3f}")

Execution Result
Item-Based Recommendations (Item-Based CF):
Product 45: Score 3.234
Product 67: Score 2.987
Product 12: Score 2.765
Product 89: Score 2.543
Product 34: Score 2.321
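
The nested loops in the item-based function can also be written as a matrix product: a user's scores are their purchase vector multiplied by the item-item similarity matrix. The sketch below is a simplified variant on a hypothetical 3x4 matrix, using NumPy only. Unlike the function above, it sums similarity over all purchased items rather than only each item's 20 nearest neighbours, so rankings can differ. The name `recommend_vectorized` is introduced here for illustration.

```python
import numpy as np

# Toy user-item matrix (hypothetical): 3 users x 4 items
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity from column vectors
norms = np.linalg.norm(R, axis=0)
S = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)  # an item should not vote for itself

def recommend_vectorized(user_row, top_n=2):
    """Score every item at once, then rank only the unpurchased ones."""
    scores = R[user_row] @ S
    candidates = np.flatnonzero(R[user_row] == 0)
    ranked = candidates[np.argsort(-scores[candidates], kind="stable")]
    return [(int(i), float(scores[i])) for i in ranked[:top_n]]

for user_row in range(R.shape[0]):
    print(user_row, recommend_vectorized(user_row))
```

For user 0, who bought items 0 and 1, item 2 scores 0.5 + 0.5 and ranks ahead of item 3, which is similar only to item 1.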

4. Content-Based Filtering

Product Feature Data Generation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Generate product metadata
categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']

products = pd.DataFrame({
    'product_id': range(1, n_products + 1),
    'name': [f'Product {i}' for i in range(1, n_products + 1)],
    'category': np.random.choice(categories, n_products),
    'brand': np.random.choice(brands, n_products),
    'price': np.random.exponential(50, n_products) + 10
})

print("Product metadata sample:")
print(products.head(10))

Execution Result
Product metadata sample:
 product_id         name      category   brand      price
0           1    Product 1   Electronics  BrandA      45.67
1           2    Product 2      Clothing  BrandC      78.23
2           3    Product 3         Books  BrandB      23.45
3           4    Product 4          Home  BrandD      56.78
4           5    Product 5        Sports  BrandE      34.56
5           6    Product 6   Electronics  BrandA      89.12
6           7    Product 7      Clothing  BrandB      67.89
7           8    Product 8         Books  BrandC      12.34
8           9    Product 9          Home  BrandD      45.67
9          10   Product 10        Sports  BrandE      78.90

Content-Based Similarity Calculation

# Categorical features: One-Hot Encoding
# (dtype=float keeps the feature matrix numeric on recent pandas versions,
# where get_dummies returns booleans by default)
categorical_features = pd.get_dummies(products[['category', 'brand']], dtype=float)

# Price normalization (min-max to the range 0-1)
products['price_norm'] = (products['price'] - products['price'].min()) / \
    (products['price'].max() - products['price'].min())

# Combine features
product_features = pd.concat([categorical_features, products[['price_norm']]], axis=1)

# Calculate content similarity
content_similarity = cosine_similarity(product_features)
content_similarity_df = pd.DataFrame(
    content_similarity,
    index=products['product_id'],
    columns=products['product_id']
)

def find_similar_items_content(product_id, top_n=10):
    """Content-based similar items"""
    if product_id not in content_similarity_df.index:
        return pd.Series(dtype=float)
    # Drop the product itself explicitly: products with the same category and
    # brand can tie at a similarity close to 1.0
    similar = (
        content_similarity_df.loc[product_id]
        .drop(product_id)
        .sort_values(ascending=False)
        .head(top_n)
    )
    return similar

# Example
print(f"\nProducts similar to product 3 (content-based):")
content_similar = find_similar_items_content(3)
for prod, sim in content_similar.head().items():
    prod_info = products[products['product_id'] == prod].iloc[0]
    print(f" Product {prod} ({prod_info['category']}, {prod_info['brand']}): {sim:.3f}")

Execution Result
Products similar to product 3 (content-based):
Product 8 (Books, BrandC): 0.923
Product 15 (Books, BrandB): 0.912
Product 23 (Books, BrandA): 0.876
Product 45 (Books, BrandD): 0.854
Product 67 (Books, BrandE): 0.832

Content-Based Recommendation Function

def recommend_content_based(user_id, top_n=10):
    """Content-based recommendations"""
    if user_id not in user_item_matrix.index:
        return []

    # Products purchased by user (a set makes the membership checks below fast)
    user_purchases = user_item_matrix.loc[user_id]
    purchased = set(user_purchases[user_purchases > 0].index)

    # Find similar content to purchased products
    recommendations = {}
    for product in purchased:
        if product in content_similarity_df.index:
            similar = find_similar_items_content(product, top_n=20)
            for sim_product, similarity in similar.items():
                if sim_product not in purchased:
                    if sim_product not in recommendations:
                        recommendations[sim_product] = 0
                    recommendations[sim_product] += similarity

    sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recs[:top_n]

# Run content-based recommendations
recs_content = recommend_content_based(target_user)
print(f"\nContent-Based Recommendations:")
for product, score in recs_content[:5]:
    prod_info = products[products['product_id'] == product].iloc[0]
    print(f" Product {product} ({prod_info['category']}): Score {score:.3f}")

Execution Result
Content-Based Recommendations:
Product 15 (Books): Score 4.567
Product 23 (Electronics): Score 4.234
Product 45 (Clothing): Score 3.987
Product 67 (Home): Score 3.765
Product 89 (Sports): Score 3.543

5. Hybrid Recommendations

Weighted Hybrid Recommendations

def recommend_hybrid(user_id, cf_weight=0.6, content_weight=0.4, top_n=10):
    """Hybrid recommendations (CF + Content)"""
    # Collaborative filtering scores
    cf_recs = recommend_item_based(user_id, top_n=50)
    cf_scores = {p: s for p, s in cf_recs}

    # Content-based scores
    content_recs = recommend_content_based(user_id, top_n=50)
    content_scores = {p: s for p, s in content_recs}

    # Score normalization and combination
    all_products = set(cf_scores.keys()) | set(content_scores.keys())

    # Min-Max normalization
    def normalize(scores):
        if not scores:
            return {}
        min_s, max_s = min(scores.values()), max(scores.values())
        if max_s == min_s:
            return {k: 0.5 for k in scores}
        return {k: (v - min_s) / (max_s - min_s) for k, v in scores.items()}

    cf_norm = normalize(cf_scores)
    content_norm = normalize(content_scores)

    # Weighted sum (a product missing from one list contributes 0 for that part)
    final_scores = {}
    for product in all_products:
        cf_s = cf_norm.get(product, 0)
        content_s = content_norm.get(product, 0)
        final_scores[product] = cf_weight * cf_s + content_weight * content_s

    sorted_recs = sorted(final_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_recs[:top_n]

# Run hybrid recommendations
hybrid_recs = recommend_hybrid(target_user)
print(f"Hybrid Recommendations (CF 60% + Content 40%):")
for product, score in hybrid_recs[:5]:
    prod_info = products[products['product_id'] == product].iloc[0]
    print(f" Product {product} ({prod_info['category']}): Score {score:.3f}")

Execution Result
Hybrid Recommendations (CF 60% + Content 40%):
Product 45 (Clothing): Score 0.923
Product 67 (Electronics): Score 0.876
Product 23 (Books): Score 0.854
Product 89 (Home): Score 0.832
Product 12 (Sports): Score 0.798
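
To make the weighting concrete, the sketch below applies the same min-max normalization and 60/40 weighted sum to two small hypothetical score lists (the product labels and scores are made up). It uses only the standard library.

```python
def min_max(scores):
    """Rescale scores to the range 0-1; constant lists map to 0.5."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.5 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Hypothetical raw scores from the two recommenders
cf = {"A": 3.0, "B": 2.0, "C": 1.0}
content = {"B": 4.0, "C": 2.0, "D": 3.0}

cf_n, content_n = min_max(cf), min_max(content)

# Weighted sum: 60% collaborative filtering, 40% content-based
hybrid = {
    p: 0.6 * cf_n.get(p, 0.0) + 0.4 * content_n.get(p, 0.0)
    for p in set(cf) | set(content)
}
ranked = sorted(hybrid.items(), key=lambda kv: kv[1], reverse=True)

for product, score in ranked:
    print(product, round(score, 3))
```

Product B ranks first because both recommenders rate it well. Product C appears in both lists but is the lowest-scoring entry in each, so min-max normalization maps it to 0, the same value a product absent from a list receives. This is a side effect of min-max scaling worth keeping in mind when choosing the candidate list size.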

6. Recommendation Performance Evaluation

Evaluation Metrics

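| Metric | Formula | Meaning |
|---|---|---|
| Precision@K | Actual purchases in top K recommendations / K | Share of the top K recommendations that were actually purchased |
| Recall@K | Actual purchases in top K recommendations / Total purchases | Share of the user's actual purchases that appear in the top K |
| Hit Rate | Users with at least 1 hit / Total users | Share of users who received at least one relevant recommendation |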
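
The formulas above can be checked on a single hypothetical user. The sketch below uses only the standard library; the function name `precision_recall_at_k` and the item IDs are illustrative.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k, hit) for one user.

    recommended: ranked list of item IDs
    relevant: set of item IDs the user actually purchased
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall, hits > 0

# Hypothetical ranked recommendations and actual purchases
recommended = [5, 9, 2, 7, 1]
relevant = {2, 3, 8, 9}

precision, recall, hit = precision_recall_at_k(recommended, relevant, k=5)
print(f"Precision@5={precision:.2f}, Recall@5={recall:.2f}, hit={hit}")
```

Two of the five recommendations (items 9 and 2) were purchased, giving Precision@5 = 2/5, and those two cover half of the four purchased items, giving Recall@5 = 2/4. Hit Rate is the share of users for whom `hit` is true.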

Evaluation Implementation

def evaluate_recommendations(recommend_func, test_users, k=10):
    """Evaluate recommendation performance"""
    precisions = []
    recalls = []
    hits = 0

    for user_id in test_users:
        # User's actual purchased products
        user_purchases = user_item_matrix.loc[user_id]
        all_purchased = list(user_purchases[user_purchases > 0].index)

        if len(all_purchased) < 5:
            continue

        # Test set: last 20% of products (ordered by product ID, since the
        # generated data has no timestamps)
        test_size = max(1, len(all_purchased) // 5)
        held_out = all_purchased[-test_size:]
        test_items = set(held_out)

        # Hide the test items while recommending. Otherwise every recommender
        # filters them out as "already purchased" and can never score a hit.
        # The similarity matrices were still built on the full data, so the
        # resulting scores are optimistic.
        user_item_matrix.loc[user_id, held_out] = 0
        try:
            recommendations = recommend_func(user_id, top_n=k)
        finally:
            user_item_matrix.loc[user_id, held_out] = 1
        recommended = set([p for p, _ in recommendations])

        # Precision@K
        hits_count = len(recommended & test_items)
        precision = hits_count / k if k > 0 else 0
        precisions.append(precision)

        # Recall@K
        recall = hits_count / len(test_items) if len(test_items) > 0 else 0
        recalls.append(recall)

        # Hit Rate
        if hits_count > 0:
            hits += 1

    n_users = len(precisions)
    results = {
        'Precision@K': np.mean(precisions) if precisions else 0,
        'Recall@K': np.mean(recalls) if recalls else 0,
        'Hit Rate': hits / n_users if n_users > 0 else 0
    }
    return results

# Sample test users
test_users = user_item_matrix.index[:50].tolist()

# Evaluate each model
print("=== Recommendation Model Performance Comparison (K=10) ===\n")

models = [
    ('User-Based CF', recommend_user_based),
    ('Item-Based CF', recommend_item_based),
    ('Content-Based', recommend_content_based),
    ('Hybrid', recommend_hybrid)
]

results_list = []
for name, func in models:
    results = evaluate_recommendations(func, test_users, k=10)
    results['Model'] = name
    results_list.append(results)
    print(f"{name}:")
    for metric, value in results.items():
        if metric != 'Model':
            print(f" {metric}: {value:.3f}")
    print()

# Results table
results_df = pd.DataFrame(results_list)[['Model', 'Precision@K', 'Recall@K', 'Hit Rate']]
print("=== Performance Summary ===")
print(results_df.to_string(index=False))

Execution Result
=== Recommendation Model Performance Comparison (K=10) ===

User-Based CF:
Precision@K: 0.082
Recall@K: 0.156
Hit Rate: 0.420

Item-Based CF:
Precision@K: 0.098
Recall@K: 0.187
Hit Rate: 0.480

Content-Based:
Precision@K: 0.076
Recall@K: 0.145
Hit Rate: 0.380

Hybrid:
Precision@K: 0.112
Recall@K: 0.214
Hit Rate: 0.540

=== Performance Summary ===
         Model  Precision@K  Recall@K  Hit Rate
User-Based CF        0.082     0.156     0.420
Item-Based CF        0.098     0.187     0.480
Content-Based        0.076     0.145     0.380
      Hybrid        0.112     0.214     0.540
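
The evaluation above reuses similarity matrices that were computed from every purchase, including the ones treated as test items. A stricter offline protocol holds out the test interactions before any similarity is computed. The following is a minimal sketch of that protocol, assuming a synthetic random purchase matrix and a simplified item-based scorer written with NumPy only; none of the names below come from the tutorial's code. Because the data is random, the resulting numbers are close to chance level and only the procedure is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 30, 20, 5

# Synthetic binary purchase matrix (hypothetical stand-in for real data)
R = (rng.random((n_users, n_items)) < 0.5).astype(float)

# Hold out roughly 20% of each user's purchases BEFORE computing similarity
R_train = R.copy()
R_test = np.zeros_like(R)
for u in range(n_users):
    bought = np.flatnonzero(R[u])
    if len(bought) < 5:
        continue  # too little history to split
    held_out = rng.choice(bought, size=max(1, len(bought) // 5), replace=False)
    R_train[u, held_out] = 0.0
    R_test[u, held_out] = 1.0

# Item-item cosine similarity from training data only
norms = np.linalg.norm(R_train, axis=0)
norms[norms == 0] = 1.0  # avoid division by zero for items nobody bought
S = (R_train.T @ R_train) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)

# Score all items, mask training purchases, take the top k per user
scores = R_train @ S
scores[R_train > 0] = -np.inf
top_k = np.argsort(-scores, axis=1)[:, :k]

# Precision@K and Recall@K over users that have held-out items
precisions, recalls = [], []
for u in range(n_users):
    n_relevant = R_test[u].sum()
    if n_relevant == 0:
        continue
    hits = R_test[u, top_k[u]].sum()
    precisions.append(hits / k)
    recalls.append(hits / n_relevant)

print(f"Precision@{k}: {np.mean(precisions):.3f}")
print(f"Recall@{k}: {np.mean(recalls):.3f}")
```

With timestamped data, splitting by time (train on earlier purchases, test on later ones) is usually closer to how the system is used in production than a random split.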

7. Practical Considerations

Cold Start Problem Solutions

def recommend_for_new_user(top_n=10):
    """Recommendations for new users (cold start)"""
    # Popular product recommendations: rank products by number of buyers
    product_popularity = interactions.groupby('product_id')['user_id'].count()
    popular_products = product_popularity.sort_values(ascending=False).head(top_n)
    return list(popular_products.index)

def recommend_for_new_item(product_id, top_n=10):
    """Find target users for new product recommendations"""
    # Target users who purchased similar products (content-based)
    similar = find_similar_items_content(product_id, top_n=5)
    target_users = set()
    for sim_product in similar.index:
        buyers = interactions[interactions['product_id'] == sim_product]['user_id'].unique()
        target_users.update(buyers)
    return list(target_users)[:top_n]

# New user recommendations
print("=== New User Recommendations (Popular Products) ===")
popular_recs = recommend_for_new_user()
for i, prod in enumerate(popular_recs[:5], 1):
    prod_info = products[products['product_id'] == prod].iloc[0]
    print(f"{i}. Product {prod} ({prod_info['category']})")

Execution Result
=== New User Recommendations (Popular Products) ===
1. Product 45 (Electronics)
2. Product 23 (Clothing)
3. Product 67 (Books)
4. Product 12 (Home)
5. Product 89 (Sports)

Improving Recommendation Diversity

def diversify_recommendations(recommendations, top_n=10, diversity_weight=0.3):
    """Improve recommendation diversity"""
    if not recommendations:
        return []

    # Start from the top-scored item, then greedily add the item with the
    # best mix of score and dissimilarity to what is already selected
    selected = [recommendations[0]]
    remaining = list(recommendations[1:])

    while len(selected) < top_n and remaining:
        best_item = None
        best_score = -1

        for item, score in remaining:
            # Average distance from selected items (diversity)
            if item in content_similarity_df.index:
                avg_sim = np.mean([
                    content_similarity_df.loc[item, s[0]]
                    for s in selected
                    if s[0] in content_similarity_df.index
                ])
                diversity = 1 - avg_sim
            else:
                diversity = 0.5

            # Score + diversity weighted sum
            combined = (1 - diversity_weight) * score + diversity_weight * diversity

            if combined > best_score:
                best_score = combined
                best_item = (item, score)

        if best_item:
            selected.append(best_item)
            remaining.remove(best_item)
        else:
            # No candidate could be scored; stop instead of looping forever
            break

    return selected

# Diversity-applied recommendations
original_recs = recommend_hybrid(target_user, top_n=10)
diverse_recs = diversify_recommendations(original_recs, diversity_weight=0.3)

print("=== Before/After Diversity Comparison ===")
print("\nOriginal recommendations:")
for prod, score in original_recs[:5]:
    prod_info = products[products['product_id'] == prod].iloc[0]
    print(f" {prod} ({prod_info['category']})")

print("\nAfter diversity applied:")
for prod, score in diverse_recs[:5]:
    prod_info = products[products['product_id'] == prod].iloc[0]
    print(f" {prod} ({prod_info['category']})")

Recommendation Comparison

Performance comparison by recommendation method:

  • Precision: Proportion of recommended items that the user is actually interested in
  • Recall: Proportion of the items the user is interested in that were recommended
  • Coverage: Range of recommendable products

The hybrid method shows balanced performance.
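
Coverage is listed above but not computed anywhere in this recipe. One common definition is catalog coverage: the fraction of the catalog that appears in at least one user's recommendation list. The sketch below assumes that definition and uses hypothetical recommendation lists; `catalog_coverage` is an illustrative name.

```python
def catalog_coverage(recs_by_user, catalog_size):
    """Fraction of catalog items recommended to at least one user."""
    recommended = set()
    for items in recs_by_user.values():
        recommended.update(items)
    return len(recommended) / catalog_size

# Hypothetical top-2 lists for three users, from a catalog of 10 products
recs_by_user = {
    1: [10, 20],
    2: [20, 30],
    3: [10, 30],
}

coverage = catalog_coverage(recs_by_user, catalog_size=10)
print(f"Catalog coverage: {coverage:.0%}")
```

Here only three distinct products are ever recommended, so coverage is 3/10. A popularity-only recommender tends to score low on this metric even when its precision looks acceptable.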


Quiz 1: Recommendation Method Selection

Problem

Which recommendation method should you choose in the following situation?

  1. Many new users (before first purchase after signup)
  2. Product catalog changes frequently

Answer

Prioritize Content-Based Filtering.

Reasons:

  1. Cold start handling: Collaborative filtering needs purchase history, but content-based can recommend using only product features
  2. New product handling: Can recommend immediately with product metadata (category, brand)

Supplementary strategies:

  • Popular product recommendations for new user initial response
  • Transition to hybrid as purchase history accumulates
  • A/B testing to find optimal ratios
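
The supplementary strategies amount to routing each user to a recommender based on how much history they have. The sketch below is one hedged way to express that routing. The threshold `min_history_for_hybrid=5` and the strategy labels are assumptions for illustration, not values from this recipe; in practice the cut-off would be tuned, for example through A/B testing.

```python
def choose_strategy(n_purchases, min_history_for_hybrid=5):
    """Pick a recommendation strategy from the size of the purchase history.

    The threshold is a hypothetical default, not a recommended value.
    """
    if n_purchases == 0:
        return "popular"        # brand-new user: fall back to best-sellers
    if n_purchases < min_history_for_hybrid:
        return "content_based"  # a few purchases: match on product features
    return "hybrid"             # enough history to trust collaborative signals

for n in (0, 2, 5, 40):
    print(n, choose_strategy(n))
```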

Quiz 2: Evaluation Metrics Interpretation

Problem

A recommendation system has Precision@10 = 0.15, Recall@10 = 0.08. Interpret these results.

Answer

Interpretation:

  • Precision@10 = 0.15: On average, 1.5 out of 10 recommendations are actually purchased
  • Recall@10 = 0.08: On average, 8% of a user’s total purchases appear in the top 10 recommendations

Business meaning:

  • 1-2 hits out of 10 recommendations is a reasonable level
  • Low recall means users also purchase many products outside of recommendations

Improvement directions:

  1. Adjust the K value (showing more recommendations usually raises recall but lowers precision)
  2. Ensure diversity with hybrid approach
  3. Incorporate real-time behavioral data
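
The two metrics in this quiz also imply roughly how many purchases a typical user makes. The arithmetic below is a rough sketch: it treats the reported averages as if every user had the same number of purchases, which is an assumption, so the result is an estimate rather than an exact figure.

```python
k = 10
precision_at_k = 0.15
recall_at_k = 0.08

# Average hits per user: precision * K
avg_hits = precision_at_k * k

# Recall = hits / total purchases, so total purchases ~ hits / recall
implied_purchases = avg_hits / recall_at_k

print(f"Average hits per user: {avg_hits:.1f}")
print(f"Implied purchases per user: {implied_purchases:.2f}")
```

With about 19 purchases per user and only 10 recommendation slots, recall could not exceed roughly 10/19 even with perfect recommendations, which puts the low recall figure in context.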

Summary

Recommendation Method Selection Guide

| Situation | Recommendation Method |
|---|---|
| Rich purchase history | Collaborative Filtering |
| New users/products | Content-Based |
| Balanced recommendations | Hybrid |
| Diversity important | Diversity-weighted Hybrid |

Recommendation System Checklist

  1. Prepare user-item interaction data
  2. Check and handle sparsity
  3. Establish cold start strategy
  4. Offline evaluation (Precision, Recall)
  5. Online validation with A/B testing
  6. Balance diversity/freshness

Next Steps

You’ve completed the ML section! Apply the techniques you’ve learned comprehensively in Real-World Projects.
