Sunil M Anirudhan

Project Analysis

Recommender System for Amazon Beauty Products

About Project

E-commerce platforms like Amazon face the challenge of helping users navigate massive product catalogs. Without personalization, users experience choice overload and irrelevant suggestions. Many products lack complete metadata, and cold-start users/items limit collaborative approaches.

Problem Statement

Objective

🎯 Project Objective

To design and evaluate a hybrid recommender system that suggests beauty products on Amazon using multiple recommendation techniques:

Popularity-based Filtering: Recommends top-selling or highest-rated items
Content-Based Filtering: Uses product descriptions, tags, and features to find similar items
Collaborative Filtering: Includes both user-user and item-based approaches using ratings and behavior
Model-Based Filtering: Matrix factorization (e.g., SVD, NMF) for latent user-product relationships
Association Rule Mining: Discovers frequent itemsets and co-purchase rules using algorithms like Apriori

🔍 Goal

To understand the individual performance, interpretability, scalability, and limitations of each method, and propose an ideal hybrid system suitable for large-scale deployment on platforms like Amazon.

Proposed Solution

📊 Phase 1: Data Exploration & Cleaning

Used two datasets: product metadata and customer reviews
Cleaned missing values in key fields like brand, price, and description
Merged datasets using ItemId as the key
Created derived fields (e.g., combined title + description for text modeling)
Applied temporal train-test split to simulate real-world recommendation scenarios
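The cleaning, merge, and derived-field steps above could look roughly like the following. This is a minimal sketch on toy data, not the project's actual code: every column name other than ItemId, and the fill policies (placeholder brand, median price, empty text), are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; column names other than ItemId are assumed.
items = pd.DataFrame({
    "ItemId": ["A1", "A2", "A3"],
    "title": ["Rose Face Cream", "Matte Lipstick", None],
    "description": ["Hydrating day cream", None, "Argan oil shampoo"],
    "brand": ["Glow", None, "Silk"],
    "price": [12.5, None, 8.0],
})
reviews = pd.DataFrame({
    "UserId": ["u1", "u2", "u1", "u3"],
    "ItemId": ["A1", "A1", "A2", "A3"],
    "rating": [5, 4, 3, 5],
})

# Fill missing metadata with neutral placeholders (one possible policy).
items["brand"] = items["brand"].fillna("Unknown")
items["price"] = items["price"].fillna(items["price"].median())
for col in ("title", "description"):
    items[col] = items[col].fillna("")

# Derived text field (title + description) for later text modeling.
items["text"] = (items["title"] + " " + items["description"]).str.strip()

# Attach product metadata to every review through the shared key.
merged = reviews.merge(items, on="ItemId", how="left")
```

A left join keeps every review even when its product has no metadata row, which makes the missing metadata visible instead of silently dropping interactions.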

🧠 Phase 2: Recommender Models

Popularity-Based Recommender: Ranked products by number of reviews and average rating
Content-Based Filtering:
- Used TF-IDF on product title and description
- Encoded categorical features and scaled prices
- Built a cosine similarity matrix to recommend similar products
Collaborative Filtering:
- User-User similarity using cosine similarity on the rating matrix
- NMF (Non-negative Matrix Factorization): Discovered latent user-item preferences
Association Rule Mining:
- Used the Apriori algorithm to identify co-purchase patterns
- Generated interpretable “frequently bought together” rules
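As an illustration of the content-based step, the sketch below applies TF-IDF to a few made-up product texts and ranks neighbours by cosine similarity. It covers only the text part; the categorical and price features are left out, and the item IDs, texts, and the similar_items helper are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product texts (title + description), aligned with item_ids.
item_ids = ["A1", "A2", "A3", "A4"]
texts = [
    "rose face cream hydrating moisturizer",
    "matte lipstick long lasting red",
    "night face cream moisturizer with rose oil",
    "argan oil shampoo for dry hair",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(texts)      # sparse, shape (n_items, n_terms)
similarity = cosine_similarity(matrix)   # dense, shape (n_items, n_items)

def similar_items(item_id, k=2):
    """Return the k items most similar to item_id, excluding the item itself."""
    idx = item_ids.index(item_id)
    ranked = similarity[idx].argsort()[::-1]  # highest similarity first
    return [item_ids[j] for j in ranked if j != idx][:k]
```

For example, similar_items("A1", k=1) returns the other face cream, since the two texts share several terms. A full item-by-item similarity matrix grows quadratically with the catalog, so at larger scale a top-k neighbour search per item is usually preferable.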

Technologies Used

Language: Python
Libraries:
- Pandas, NumPy – Data wrangling and preprocessing
- Scikit-learn – ML models, TF-IDF vectorization, NMF
- Surprise, SciPy – Collaborative filtering algorithms
- MLxtend – Association rule mining with Apriori
- Matplotlib, WordCloud – Data visualization and word clouds
Datasets:
- beauty_amazon_items.csv – Product metadata
- beauty_amazon_reviews.csv – Customer reviews and ratings

Challenges Faced

Very sparse user-item matrix: the majority of users have at most one review
Incomplete product metadata: Missing brand names, prices, and descriptions
Cold-start problems: Collaborative filtering fails for new users or products
Low precision in recommendations due to sparse user interaction data
Latent factors from NMF are hard to interpret and explain
Association rules are biased towards frequently purchased items, not necessarily relevant ones
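The bias of association rules toward frequently purchased items can be seen by comparing confidence with lift. The sketch below is a simplified pair-level count in plain Python, not the Apriori implementation from MLxtend used in the project; the baskets and the rule_metrics helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set holds the items bought by one user.
baskets = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner", "lipstick"},
    {"shampoo", "serum"},
    {"shampoo", "lipstick"},
    {"serum", "face_cream"},
    {"serum", "face_cream", "shampoo"},
]
n = len(baskets)

# Count how often each item and each unordered item pair occurs.
item_count = Counter(item for b in baskets for item in b)
pair_count = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = pair_count[tuple(sorted((antecedent, consequent)))]
    support = both / n
    confidence = both / item_count[antecedent]
    lift = confidence / (item_count[consequent] / n)
    return support, confidence, lift

popular = rule_metrics("lipstick", "shampoo")    # consequent is in most baskets
specific = rule_metrics("serum", "face_cream")   # consequent is comparatively rare
```

In this toy data the rule pointing at the popular item has the higher confidence, while the rule between the two rarer items has the higher lift. Ranking rules by lift instead of confidence alone is one common way to reduce the popularity bias.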

Methodology

🧹 Data Preprocessing

Cleaned missing text and price fields
Filled nulls in brand and title by inference
Formatted timestamps for temporal train-test split

🕒 Temporal Split

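Sorted reviews chronologically, then used the earliest 80% for training and the latest 20% for testing
Ensured no cold-start in test data (filtered out test interactions whose user or item does not appear in the training data)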
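A chronological 80/20 split with a cold-start filter might be sketched as follows. The column names and the toy review log are assumptions, and the filter shown (keeping only test rows whose user and item both occur in training) is one reading of the step described above.

```python
import pandas as pd

# Hypothetical review log; column names are assumed.
reviews = pd.DataFrame({
    "UserId": ["u1", "u2", "u1", "u3", "u2", "u1", "u4", "u2", "u3", "u1"],
    "ItemId": ["A1", "A1", "A2", "A2", "A3", "A3", "A1", "A2", "A1", "A4"],
    "rating": [5, 4, 3, 5, 2, 4, 5, 3, 4, 1],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-05", "2020-02-01", "2020-02-10", "2020-03-01",
        "2020-03-15", "2020-04-01", "2020-04-20", "2020-05-01", "2020-05-10",
    ]),
})

# Oldest 80% of interactions go to training, the newest 20% to testing.
reviews = reviews.sort_values("timestamp").reset_index(drop=True)
cutoff = int(len(reviews) * 0.8)
train, test = reviews.iloc[:cutoff], reviews.iloc[cutoff:]

# Keep only test rows whose user AND item were both seen during training.
known = test["UserId"].isin(train["UserId"]) & test["ItemId"].isin(train["ItemId"])
test = test[known]
```

Here the last review is dropped from the test set because item A4 never appears in training. Filtering this way makes the evaluation fairer to collaborative methods, but it also means the reported metrics say nothing about cold-start behaviour.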

🧠 Model Building

Popularity-Based Recommender – ranked by review count
Content-Based – used TF-IDF text vectors, encoded brand, and scaled price features
User-User Collaborative Filtering – cosine similarity on rating matrix
NMF – matrix factorization on top 5k users and 2k items
Association Rule Mining – applied Apriori for basket-style co-purchase rules
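To make the user-user step concrete, here is a minimal NumPy sketch of cosine similarity on a small rating matrix, followed by a similarity-weighted prediction. The matrix, the neighbour count k, and the cosine_sim and predict helpers are illustrative assumptions, not the project's implementation.

```python
import numpy as np

# Hypothetical rating matrix: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5, 4, 0],
    [5, 4, 2],
    [1, 0, 5],
], dtype=float)

def cosine_sim(matrix):
    """Pairwise cosine similarity between the rows of matrix."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = matrix / norms
    return unit @ unit.T

def predict(ratings, user, item, k=2):
    """Similarity-weighted mean rating from the k most similar users who rated item."""
    sims = cosine_sim(ratings)[user]
    raters = [u for u in np.argsort(sims)[::-1]
              if u != user and ratings[u, item] > 0][:k]
    if not raters or sims[raters].sum() == 0:
        return float("nan")  # no usable neighbours: the cold-start case
    weights = sims[raters]
    return float(weights @ ratings[raters, item] / weights.sum())

estimate = predict(ratings, user=0, item=2)
```

User 0 rates almost exactly like user 1, so the estimate lands much closer to user 1's rating of 2 than to user 2's rating of 5. Treating unrated items as 0 in the similarity is a simplification; mean-centering ratings first is a common refinement.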

📊 Evaluation Metrics

RMSE, MAE – f
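RMSE and MAE, together with the Precision@5 figure reported in the results table, can be computed as in the sketch below. The function names and toy inputs are illustrative. Dividing by k in precision_at_k, instead of by the number of items actually returned, is one common convention and is assumed here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Toy usage: two rating predictions and one ranked recommendation list.
errors = (rmse([4, 2], [3, 4]), mae([4, 2], [3, 4]))
p5 = precision_at_k(["a", "b", "c", "d", "e"], {"b", "e", "z"})
```

RMSE penalises large errors more heavily than MAE, while Precision@5 ignores predicted rating values entirely and only checks whether the top five recommendations were actually relevant. This is why a model can score reasonably on one metric family and poorly on the other.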

Result / Outcome

📈 Model Performance Summary

Model	RMSE	MAE	Precision@5	Notes
Popularity-Based	—	—	—	Works for all users but not personalized
Content-Based Filtering	—	—	0.0028–0.0032	Struggles with relevance, no user signals
User-User Collaborative	0.569	0.348	—	Accurate for dense clusters, cold-start issue
NMF (Model-Based)	2.51	1.44	0.0
