Project Analysis

Recommender System for Amazon Beauty Products

About Project

E-commerce platforms like Amazon face the challenge of helping users navigate massive product catalogs. Without personalization, users experience choice overload and irrelevant suggestions. Many products lack complete metadata, and cold-start users/items limit collaborative approaches.

Objective

🎯 Project Objective

To design and evaluate a hybrid recommender system that suggests beauty products on Amazon using multiple recommendation techniques:

  • Popularity-based Filtering: Recommends top-selling or highest-rated items
  • Content-Based Filtering: Uses product descriptions, tags, and features to find similar items
  • Collaborative Filtering: Includes both user-user and item-based approaches using ratings and behavior
  • Model-Based Filtering: Matrix factorization (e.g., SVD, NMF) for latent user-product relationships
  • Association Rule Mining: Discovers frequent itemsets and co-purchase rules using algorithms like Apriori
🔍 Goal

To understand the individual performance, interpretability, scalability, and limitations of each method, and propose an ideal hybrid system suitable for large-scale deployment on platforms like Amazon.

Proposed Solution

📊 Phase 1: Data Exploration & Cleaning
  • Used two datasets: Product metadata and Customer reviews
  • Cleaned missing values in key fields like brand, price, and description
  • Merged datasets using ItemId as the key
  • Created derived fields (e.g., combined title + description for text modeling)
  • Applied temporal train-test split to simulate real-world recommendation scenarios
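The merge and derived-field steps above can be sketched with pandas. This is a minimal illustration on tiny stand-in tables, not the project's actual code: only the ItemId key comes from the write-up, and every other column name (UserId, title, description, brand, rating) is an assumption.

```python
import pandas as pd

# Stand-in tables; the real project reads beauty_amazon_items.csv and
# beauty_amazon_reviews.csv. All column names except ItemId are assumed.
items = pd.DataFrame({
    "ItemId": ["A1", "A2", "A3"],
    "title": ["Rose Face Cream", "Matte Lipstick", None],
    "description": ["Hydrating cream with rose oil", None, "Gentle daily shampoo"],
    "brand": ["Glow", None, "PureHair"],
})
reviews = pd.DataFrame({
    "UserId": ["u1", "u2", "u1", "u3"],
    "ItemId": ["A1", "A1", "A2", "A3"],
    "rating": [5, 4, 3, 5],
})

# Fill missing text so string concatenation does not produce NaN.
items["title"] = items["title"].fillna("")
items["description"] = items["description"].fillna("")
items["brand"] = items["brand"].fillna("Unknown")

# Derived field: title + description as one text column for text modeling.
items["text"] = (items["title"] + " " + items["description"]).str.strip()

# Attach product metadata to every review, using ItemId as the key.
merged = reviews.merge(items, on="ItemId", how="left")
```

A left join keeps every review even when its product has no metadata row, which matters when metadata is incomplete.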
🧠 Phase 2: Recommender Models
  • Popularity-Based Recommender: Ranked products by number of reviews and average rating
  • Content-Based Filtering:
    • Used TF-IDF on product title and description
    • Encoded categorical features and scaled prices
    • Built a cosine similarity matrix to recommend similar products
  • Collaborative Filtering:
    • User-User similarity using cosine similarity on the rating matrix
    • NMF (Non-negative Matrix Factorization): Discovered latent user-item preferences
  • Association Rule Mining:
    • Used the Apriori algorithm to identify co-purchase patterns
    • Generated interpretable “frequently bought together” rules
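As an illustration of the content-based step, the sketch below builds a TF-IDF matrix and a cosine similarity matrix with scikit-learn. It covers the text features only (not the encoded categorical features or scaled prices). The product texts and the helper `similar_products` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product texts (title + description); not taken from the dataset.
product_ids = ["A1", "A2", "A3", "A4"]
texts = [
    "rose face cream hydrating moisturizer",
    "matte lipstick long lasting red",
    "hydrating face moisturizer with aloe",
    "gentle daily shampoo for dry hair",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(texts)      # shape: (n_products, n_terms)
similarity = cosine_similarity(matrix)   # shape: (n_products, n_products)

def similar_products(item_id, top_n=2):
    """Return the top_n most similar product ids, excluding the item itself."""
    idx = product_ids.index(item_id)
    ranked = similarity[idx].argsort()[::-1]  # highest similarity first
    return [product_ids[i] for i in ranked if i != idx][:top_n]
```

Calling `similar_products("A1", top_n=1)` returns the other face moisturizer, because it is the only product sharing terms with A1. A dense similarity matrix grows quadratically with the catalog, so a large catalog would likely need a sparse or approximate nearest-neighbor approach instead.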

Technologies Used

  • Language: Python
  • Libraries:
    • Pandas, NumPy – Data wrangling and preprocessing
    • Scikit-learn – ML models, TF-IDF vectorization, NMF
    • Surprise, SciPy – Collaborative filtering algorithms
    • MLxtend – Association rule mining with Apriori
    • Matplotlib, WordCloud – Data visualization and word clouds
  • Datasets:
    • beauty_amazon_items.csv – Product metadata
    • beauty_amazon_reviews.csv – Customer reviews and ratings

Challenges Faced

  • Very sparse user-item matrix: the majority of users have ≤ 1 review
  • Incomplete product metadata: Missing brand names, prices, and descriptions
  • Cold-start problems: Collaborative filtering fails for new users or products
  • Low precision in recommendations due to sparse user interaction data
  • Latent factors from NMF are hard to interpret and explain
  • Association rules are biased towards frequently purchased items, not necessarily relevant ones
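The bias toward frequently purchased items can be seen by comparing a rule's confidence with its lift. The project used MLxtend's Apriori; the sketch below instead counts item pairs directly in plain Python to keep the arithmetic visible. The baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: the set of items each user bought.
baskets = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner", "lipstick"},
    {"shampoo", "lipstick"},
    {"shampoo", "mascara"},
    {"mascara", "eyeliner"},
]
n = len(baskets)

# How many baskets contain each item, and each unordered item pair.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / n
    confidence = pair_counts[pair] / item_counts[antecedent]
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift
```

Here `rule_metrics("mascara", "shampoo")` gives a confidence of 0.5 only because shampoo appears in four of the five baskets. Its lift is 0.625, below 1, meaning the two items co-occur less often than chance would predict. Ranking rules by lift rather than confidence is one common way to reduce this popularity bias.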

Methodology

🧹 Data Preprocessing
  • Cleaned missing text and price fields
  • Imputed null brand and title values by inference
  • Formatted timestamps for temporal train-test split
🕒 Temporal Split
  • Used 80% training and 20% testing, sorted chronologically
  • Avoided cold-start cases in the test data by removing test rows whose user or item never appears in the training set
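A chronological 80/20 split with a cold-start filter might look like the following. This is a sketch under assumptions: the column names UserId, rating, and timestamp are not given in the write-up, and the review log is made up.

```python
import pandas as pd

# Hypothetical review log; column names other than ItemId are assumptions.
reviews = pd.DataFrame({
    "UserId": ["u1", "u2", "u1", "u3", "u2", "u1", "u4", "u2", "u3", "u1"],
    "ItemId": ["A1", "A1", "A2", "A2", "A3", "A3", "A1", "A2", "A9", "A1"],
    "rating": [5, 4, 3, 5, 4, 2, 5, 3, 4, 5],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-05", "2020-02-01", "2020-02-10", "2020-03-01",
        "2020-03-15", "2020-04-01", "2020-04-20", "2020-05-01", "2020-05-10",
    ]),
})

# Sort chronologically, then cut at 80%: the model never sees the future.
reviews = reviews.sort_values("timestamp").reset_index(drop=True)
cutoff = int(len(reviews) * 0.8)
train = reviews.iloc[:cutoff]
test = reviews.iloc[cutoff:]

# Drop test rows whose user or item was never seen during training.
known_users = set(train["UserId"])
known_items = set(train["ItemId"])
test = test[test["UserId"].isin(known_users) & test["ItemId"].isin(known_items)]
```

The filter removes the test review of item A9, which has no training history. Metrics computed on the filtered set therefore say nothing about cold-start behavior.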
🧠 Model Building
  • Popularity-Based Recommender – ranked by review count
  • Content-Based – combined TF-IDF text features with encoded brand and scaled price
  • User-User Collaborative Filtering – cosine similarity on rating matrix
  • NMF – matrix factorization on top 5k users and 2k items
  • Association Rule Mining – applied Apriori for basket-style co-purchase rules
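To make the user-user step concrete, here is a small NumPy sketch of cosine similarity on a rating matrix followed by a similarity-weighted prediction. The matrix and the weighting scheme are assumptions for illustration. The write-up does not say how neighbors were weighted or how many were used.

```python
import numpy as np

# Hypothetical rating matrix: rows are users, columns are items, 0 = unrated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = m / norms
    return unit @ unit.T

def predict(user, item, sims, m):
    """Similarity-weighted mean of other users' ratings for the item."""
    rated = m[:, item] > 0   # users who rated this item
    rated[user] = False      # never use the target user's own rating
    weights = sims[user, rated]
    if weights.sum() == 0:
        return float("nan")  # no usable neighbors: a cold-start case
    return float(weights @ m[rated, item] / weights.sum())

sims = cosine_sim_matrix(ratings)
prediction = predict(user=0, item=2, sims=sims, m=ratings)
```

User 0 is far more similar to user 1 than to user 2, so the prediction for item 2 (about 1.73) sits much closer to user 1's rating of 1 than to user 2's rating of 5.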
📊 Evaluation Metrics
  • RMSE, MAE – f

Result / Outcome

📈 Model Performance Summary

Model | RMSE | MAE | Precision@5 | Notes
Popularity-Based | - | - | - | Works for all users but not personalized
Content-Based Filtering | - | - | 0.0028–0.0032 | Struggles with relevance, no user signals
User-User Collaborative | 0.569 | 0.348 | - | Accurate for dense clusters, cold-start issue
NMF (Model-Based) | 2.51 | 1.44 | 0.0 | Learn
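For reference, the three metrics in the table can be computed as below. These are the standard textbook definitions, assumed here because the write-up does not spell out the exact variants used. The function names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that the user interacted with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k
```

RMSE and MAE measure rating-prediction error, while Precision@5 measures ranking quality. A model can therefore score well on one and poorly on the other. With most users having at most one review, there is at most one relevant item per user, so a single user's Precision@5 can never exceed 0.2.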


© 2025 Linus Tech. All Rights Reserved.
