Project Analysis
Recommender System for Amazon Beauty Products
Problem Statement
E-commerce platforms like Amazon face the challenge of helping users navigate massive product catalogs. Without personalization, users experience choice overload and irrelevant suggestions. Many products lack complete metadata, and cold-start users/items limit collaborative approaches.
Objective
To design and evaluate a hybrid recommender system that suggests beauty products on Amazon using multiple recommendation techniques:
- Popularity-based Filtering: Recommends top-selling or highest-rated items
- Content-Based Filtering: Uses product descriptions, tags, and features to find similar items
- Collaborative Filtering: Includes both user-user and item-based approaches using ratings and behavior
- Model-Based Filtering: Matrix factorization (e.g., SVD, NMF) for latent user-product relationships
- Association Rule Mining: Discovers frequent itemsets and co-purchase rules using algorithms like Apriori
To understand the individual performance, interpretability, scalability, and limitations of each method, and propose an ideal hybrid system suitable for large-scale deployment on platforms like Amazon.
Proposed Solution
- Used two datasets: Product metadata and Customer reviews
- Cleaned missing values in key fields like brand, price, and description
- Merged datasets using ItemId as the key
- Created derived fields (e.g., combined title + description for text modeling)
- Applied temporal train-test split to simulate real-world recommendation scenarios
- Popularity-Based Recommender: Ranked products by number of reviews and average rating
- Content-Based Filtering:
  - Used TF-IDF on product title and description
  - Encoded categorical features and scaled prices
  - Built a cosine similarity matrix to recommend similar products
- Collaborative Filtering:
  - User-User similarity using cosine distance
  - NMF (Non-negative Matrix Factorization): Discovered latent user-item preferences
- Association Rule Mining:
  - Used the Apriori algorithm to identify co-purchase patterns
  - Generated interpretable "frequently bought together" rules
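The content-based step above (TF-IDF on title and description, then cosine similarity) can be illustrated with a small dependency-free sketch. This is not the project's Scikit-learn code: the function names, the whitespace tokenizer, and the four sample product texts are assumptions made for the example, and the smoothed IDF is only similar in spirit to what a library vectorizer would compute.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each text into a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(index, vectors, top_n=2):
    """Indices of the top_n products most similar to the one at `index`."""
    scores = [(cosine(vectors[index], vec), i)
              for i, vec in enumerate(vectors) if i != index]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for _, i in ranked[:top_n]]

# Hypothetical "title + description" strings, invented for illustration.
products = [
    "argan oil shampoo for dry hair",
    "argan oil conditioner for dry hair",
    "matte red lipstick long lasting",
    "vitamin c face serum",
]
vectors = tfidf_vectors(products)
similar_to_shampoo = most_similar(0, vectors, top_n=1)
```

With these sample texts, the shampoo's nearest neighbour is the conditioner, because they share most of their terms. The lipstick and serum share none, so their similarity to the shampoo is zero, which hints at why purely text-based recommendations can struggle when descriptions are short or missing.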
Technologies Used
- Language: Python
- Libraries:
  - Pandas, NumPy - Data wrangling and preprocessing
  - Scikit-learn - ML models, TF-IDF vectorization, NMF
  - Surprise, SciPy - Collaborative filtering algorithms
  - MLxtend - Association rule mining with Apriori
  - Matplotlib, WordCloud - Data visualization and word clouds
- Datasets:
  - beauty_amazon_items.csv - Product metadata
  - beauty_amazon_reviews.csv - Customer reviews and ratings
Challenges Faced
- Very sparse user-item matrix: Majority of users have ≤ 1 review
- Incomplete product metadata: Missing brand names, prices, and descriptions
- Cold-start problems: Collaborative filtering fails for new users or products
- Low precision in recommendations due to sparse user interaction data
- Latent factors from NMF are hard to interpret and explain
- Association rules are biased towards frequently purchased items, not necessarily relevant ones
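The last challenge, the bias of association rules toward frequently purchased items, is easier to see with numbers. The sketch below is a dependency-free illustration rather than the project's MLxtend pipeline: it scores item pairs only, and the function name, baskets, product names, and thresholds are all invented for the example. Confidence tends to be high for any rule whose consequent is popular. Lift divides confidence by the consequent's overall frequency, which is one common way to discount that effect.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence, lift))
    return rules

# Hypothetical baskets; shampoo is deliberately the most popular item.
baskets = [
    ["shampoo", "conditioner"],
    ["shampoo", "conditioner", "hair_mask"],
    ["shampoo", "face_wash"],
    ["shampoo", "lipstick"],
    ["conditioner", "hair_mask"],
]
rules = pair_rules(baskets)
by_rule = {(lhs, rhs): (support, confidence, lift)
           for lhs, rhs, support, confidence, lift in rules}
```

In this toy data the rule conditioner -> shampoo passes the confidence threshold but has lift below 1, which suggests it mostly reflects how often shampoo is bought overall. The rule hair_mask -> conditioner has lift above 1 and looks like a more meaningful pairing.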
Methodology
- Cleaned missing text and price fields
- Handled nulls in brand and title by inference
- Formatted timestamps for temporal train-test split
- Used 80% training and 20% testing, sorted chronologically
- Ensured no cold-start in test data (filtered unseen user-item pairs)
- Popularity-Based Recommender - ranked by review count
- Content-Based - used text, brand, and price embeddings
- User-User Collaborative Filtering - cosine similarity on rating matrix
- NMF - matrix factorization on top 5k users and 2k items
- Association Rule Mining - applied Apriori for basket-style co-purchase rules
- RMSE, MAE - f
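The chronological 80/20 split and the error metrics listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the record fields (`timestamp`, `user`, `item`, `rating`), the sample reviews, and the global-mean baseline are hypothetical, and "no cold-start in test data" is interpreted here as dropping test rows whose user or item never appears in training.

```python
import math

def temporal_split(reviews, train_fraction=0.8):
    """Sort by timestamp, train on the earliest share, test on the rest."""
    ordered = sorted(reviews, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    train, test = ordered[:cut], ordered[cut:]
    seen_users = {r["user"] for r in train}
    seen_items = {r["item"] for r in train}
    # Assumption: cold-start filtering = keep only users and items seen in training.
    test = [r for r in test
            if r["user"] in seen_users and r["item"] in seen_items]
    return train, test

def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical reviews as (timestamp, user, item, rating), deliberately unsorted.
raw = [
    (3, "u1", "p2", 3.0), (1, "u1", "p1", 5.0), (2, "u2", "p1", 4.0),
    (4, "u3", "p2", 5.0), (5, "u2", "p3", 2.0), (6, "u3", "p1", 4.0),
    (7, "u1", "p3", 5.0), (8, "u2", "p2", 4.0), (9, "u3", "p3", 3.0),
    (10, "u9", "p1", 5.0),
]
reviews = [dict(timestamp=t, user=u, item=i, rating=r) for t, u, i, r in raw]
train, test = temporal_split(reviews)

# Baseline for illustration only: predict the training mean for every test row.
global_mean = sum(r["rating"] for r in train) / len(train)
actual = [r["rating"] for r in test]
predicted = [global_mean] * len(test)
test_rmse = rmse(actual, predicted)
test_mae = mae(actual, predicted)
```

Splitting by time rather than at random keeps future interactions out of training, which is closer to how a deployed recommender sees data. The trade-off is that filtering cold-start rows can shrink the test set noticeably when most users have very few reviews.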
Result / Outcome
Model Performance Summary

| Model | RMSE | MAE | Precision@5 | Notes |
|---|---|---|---|---|
| Popularity-Based | - | - | - | Works for all users but not personalized |
| Content-Based Filtering | - | - | 0.0028-0.0032 | Struggles with relevance, no user signals |
| User-User Collaborative | 0.569 | 0.348 | - | Accurate for dense clusters, cold-start issue |
| NMF (Model-Based) | 2.51 | 1.44 | 0.0 | Learn |
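For readers unfamiliar with the Precision@5 column, the sketch below shows one common way to compute it: the share of the top five recommended items that the user actually interacted with in the test period, averaged over users. The function names and sample data are hypothetical, and the project may have used a slightly different convention (for example, a different treatment of users with fewer than five recommendations).

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def mean_precision_at_k(recs_by_user, relevant_by_user, k=5):
    """Average Precision@k over users that have at least one relevant item."""
    users = [u for u in recs_by_user if relevant_by_user.get(u)]
    total = sum(precision_at_k(recs_by_user[u], relevant_by_user[u], k)
                for u in users)
    return total / len(users)

# Hypothetical recommendations and held-out interactions.
recs_by_user = {
    "u1": ["p1", "p2", "p3", "p4", "p5"],
    "u2": ["p3", "p4", "p6", "p7", "p8"],
}
relevant_by_user = {
    "u1": {"p2", "p9"},
    "u2": {"p1"},
}
score = mean_precision_at_k(recs_by_user, relevant_by_user, k=5)
```

Under this definition, a value around 0.003 means that roughly 3 in every 1,000 recommended slots matched a held-out interaction, which is consistent with the sparse interaction data described under Challenges Faced.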