Project Analysis

Spam Detection Application

About Project

This project develops a spam detection system based on deep learning. The objective is to classify email and SMS messages as either "spam" or "not spam" with a robust, deployable model. The work proceeds in stages: dataset acquisition, data preprocessing, model training, and evaluation. The best-performing model and its preprocessing pipeline are then saved and integrated into interactive web applications built with Django and Streamlit.

Problem Statement

Email and SMS platforms are plagued with spam, threatening user security and productivity. Traditional rule-based or statistical filters often fail to adapt to new spam tactics, especially when faced with obfuscated language or minimal messages.

Objective

🎯 Objective
  • Develop a deep learning-based binary text classifier
  • Accurately distinguish between spam and ham messages
📁 Dataset Support
  • Capable of handling multiple datasets for training and evaluation
🧹 NLP Preprocessing
  • Integrated custom preprocessing pipeline tailored for spam detection
  • Included steps like lemmatization, stopword removal, and tokenization
🧪 Feature Engineering
  • Generated spam-specific features (e.g., keyword presence, punctuation patterns)
⚙️ Deployment & Real-Time Prediction
  • Enabled real-time message classification through:
    • Streamlit – interactive demo UI
    • FastAPI – API-first architecture
    • Django – full-stack integration for web apps

Proposed Solution

🛠️ Full Pipeline Overview
  • Developed an end-to-end pipeline for spam message classification
🗂️ Dataset Merging
  • Combined 4 datasets (SMS Spam Collection, NUS SMS Corpus, SpamAssassin, others)
  • Total size: 19,314 messages
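
As a rough illustration of the merging step, the sketch below concatenates several corpora into one list of (text, label) rows. The function names, the accepted label spellings, and the duplicate filtering are assumptions made for this example; the source does not describe how the four datasets were actually combined.

```python
def normalize_label(raw):
    """Map label spellings used by different corpora onto 1 (spam) / 0 (ham)."""
    value = str(raw).strip().lower()
    if value in {"spam", "1"}:
        return 1
    if value in {"ham", "0", "not spam"}:
        return 0
    raise ValueError(f"unknown label: {raw!r}")


def merge_datasets(*datasets):
    """Concatenate (text, label) rows, skipping empty and exactly duplicated texts.

    Dropping duplicates is an assumption of this sketch, not a documented step.
    """
    seen, merged = set(), []
    for rows in datasets:
        for text, label in rows:
            text = text.strip()
            if not text or text in seen:
                continue
            seen.add(text)
            merged.append((text, normalize_label(label)))
    return merged
```

Each argument would be one corpus already loaded into memory, for example `merge_datasets(sms_rows, email_rows)`.
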
🧹 NLP Preprocessing
  • Custom preprocessing pipeline included:
    • Stopword removal
    • Stemming
    • Punctuation stripping
    • Spam-word scoring (based on keyword frequency)
🧪 Feature Engineering
  • spam_word_score – Custom metric for spam keyword density
  • len_message – Length of the message
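
The preprocessing and feature steps above could look roughly like the following sketch. The word lists are tiny hypothetical placeholders, `spam_word_score` is assumed to be the fraction of remaining tokens that are spam keywords, and `len_message` is assumed to be the raw character count; the project's exact definitions are not given.

```python
import re
import string

# Hypothetical, deliberately tiny word lists; the project's real lists are not given.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "and", "of", "have"}
SPAM_KEYWORDS = {"free", "win", "winner", "prize", "cash", "urgent", "claim"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(message, stem=lambda word: word):
    """Lowercase, strip punctuation, drop stopwords, then stem each token.

    `stem` defaults to the identity function so the sketch has no dependencies.
    """
    text = message.lower().translate(_PUNCT_TABLE)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [stem(tok) for tok in tokens if tok not in STOPWORDS]


def spam_word_score(tokens):
    """Fraction of tokens that are spam keywords (0.0 for an empty message)."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in SPAM_KEYWORDS)
    return hits / len(tokens)


def engineered_features(message):
    """Return the two engineered features named in the list above."""
    tokens = preprocess(message)
    return {
        "spam_word_score": spam_word_score(tokens),
        "len_message": len(message),  # assumption: length in characters
    }
```

With NLTK installed, stemming could be plugged in by passing `stem=PorterStemmer().stem` (from `nltk.stem`) to `preprocess`.
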
🤖 Model Training
  • Used CNN and LSTM architectures with:
    • Keras Tokenizer and padded sequences
    • Concatenated custom features (e.g., spam_word_score)
    • Keras Tuner for hyperparameter tuning
🚀 Model Deployment
  • Streamlit frontend for interactive spam classification
  • Django backend app deployed on DigitalOcean

Technologies Used

🧰 Tech Stack & Tools
🗣️ Natural Language Processing (NLP)
  • NLTK – for text processing
  • Regular expressions – for pattern-based text cleaning
  • Stemming – to reduce words to their root form
  • Stopword filtering – to remove non-informative words
🧠 Deep Learning
  • TensorFlow – backend engine for model training
  • Keras – used to build CNN and LSTM models
🧪 Supporting Tools
  • Scikit-learn – for evaluation metrics and preprocessing
  • Keras Tuner – for hyperparameter optimization
  • Matplotlib & Seaborn – for data visualization
🚀 Deployment
  • Streamlit – interactive frontend
  • Django – backend framework
  • DigitalOcean – cloud hosting and deployment
📂 Datasets
  • Combined SMS and Email datasets from open-source repositories

Challenges Faced

📉 Dataset Imbalance
  • Ham messages significantly outnumbered spam messages
  • Addressed using oversampling, stratified train-test split, and metrics beyond accuracy
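
A dependency-free sketch of the two imbalance measures mentioned above is shown here. It assumes random oversampling with replacement, applied to the training split only so that the test set keeps the original class ratio; the source does not say which oversampling method or split ratio was used, so `test_fraction` and both function names are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_fraction=0.1, seed=42):
    """Split so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return train, test


def oversample_minority(train, seed=42):
    """Randomly duplicate minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in train:
        by_label[row[1]].append(row)
    target = max(len(rows) for rows in by_label.values())

    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    rng.shuffle(balanced)
    return balanced
```

Oversampling after the split avoids copies of the same spam message appearing in both the training and test sets, which would inflate the reported scores.
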
📛 Overfitting on Small Models
  • CNN and LSTM models overfit quickly due to limited data
  • Used Dropout, L2 regularization, and early stopping to prevent overfitting
🔡 Tokenization & Padding
  • Tokenizer had to be saved and reused consistently across training and inference
  • Ensured no mismatch between training and deployment inputs
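
The consistency requirement can be illustrated with a minimal stand-in for the tokenizer. This is not the Keras Tokenizer the project used; it is a simplified word index written for this example, with hypothetical function names, an assumed JSON file format, and post-padding chosen for readability. The point it shows is that the vocabulary and the sequence length are saved together and reloaded unchanged at inference time.

```python
import json
from collections import Counter

OOV_INDEX = 1  # index 0 is reserved for padding


def fit_word_index(texts, num_words=10_000):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(num_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(text, word_index, max_len=20):
    """Convert text to a fixed-length id sequence (truncated or zero-padded at the end)."""
    ids = [word_index.get(word, OOV_INDEX) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


def save_vocab(word_index, max_len, path):
    """Persist everything inference needs to reproduce the training encoding."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"word_index": word_index, "max_len": max_len}, fh)


def load_vocab(path):
    """Reload the vocabulary and sequence length written by save_vocab."""
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)
    return state["word_index"], state["max_len"]
```

The deployed app would call `load_vocab` once at startup and pass the result to `encode`, rather than fitting a new vocabulary on incoming messages.
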
🔀 Combining Engineered + Learned Features
  • Integrated engineered features (e.g., spam score, message length)
  • Required careful reshaping and concatenation with embedded sequences
💾 Deployment Memory Limits
  • Streamlit was lightweight and easy to set up
  • Django + model + tokenizer + webcam consumed significant memory during deployment

Methodology

🏗️ Model Architecture – CNN
  • Embedding layer – 64 dimensions
  • 1D Convolution – 128 filters
  • Global MaxPooling
  • Dropout – 0.5
  • Dense layers
  • Output: Sigmoid activation
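
One possible Keras rendering of the layer list above is sketched here. Only the embedding size (64), filter count (128), dropout rate (0.5), and sigmoid output come from the source. The vocabulary size, sequence length, kernel size, dense width, L2 factor, and the choice to concatenate the engineered features after pooling are all assumptions made for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(vocab_size=10_000, max_len=100, n_extra=2):
    """Build a two-input CNN: token ids plus engineered features."""
    tokens = keras.Input(shape=(max_len,), dtype="int32", name="tokens")
    extra = keras.Input(shape=(n_extra,), dtype="float32", name="engineered")

    x = layers.Embedding(vocab_size, 64)(tokens)                # 64-dim embeddings
    x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)  # kernel size assumed
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(0.5)(x)

    # Assumed merge point: pooled text vector joined with the engineered features.
    x = layers.Concatenate()([x, extra])
    x = layers.Dense(
        64,                                             # width assumed
        activation="relu",
        kernel_regularizer=keras.regularizers.l2(1e-4),  # factor assumed
    )(x)
    output = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[tokens, extra], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Early stopping, which the project also reports using, would typically be added at fit time through `keras.callbacks.EarlyStopping`; the monitored metric and patience are not stated in the source.
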
🧪 Evaluation Tools
  • Classification Report – precision, recall, F1-score
  • Confusion Matrix
  • Learning Curves – Train vs Validation Accuracy & Loss
  • Evaluation inputs: tokenized and padded sequences plus the engineered features
🌐 Deployment
  • Streamlit App: User enters message → outputs label, confidence, and matched spam keywords
  • Django App: Production-ready backend hosted at:
    🔗 sunilmanirudhan.linustech.in/spam/spam
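
The three outputs shown to the user (label, confidence, matched keywords) could be assembled as in the sketch below. The function name, the 0.5 threshold, and the definition of confidence as the probability of the predicted class are assumptions; `predict_proba` stands in for whatever wraps the saved tokenizer and model.

```python
import re


def classify(message, predict_proba, spam_keywords, threshold=0.5):
    """Return label, confidence, and matched keywords for one message.

    predict_proba is any callable mapping a message to P(spam).
    """
    p_spam = float(predict_proba(message))
    is_spam = p_spam >= threshold
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return {
        "label": "spam" if is_spam else "ham",
        # Confidence is reported for the predicted class, not always for spam.
        "confidence": p_spam if is_spam else 1.0 - p_spam,
        "matched_keywords": sorted(words & set(spam_keywords)),
    }
```

A Streamlit page or a Django view would then only need to collect the message text, call `classify`, and render the returned dictionary.
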
✅ Key Innovations
  • Custom spam_word_score – Enhanced interpretability of model predictions
  • Hybrid NLP + Deep Learning – Combined classical techniques with modern architectures
  • Dual Deployment – Accessible via both Streamlit and Django platforms
  • Keras Tuner – Automated architecture and hyperparameter tuning

Result / Outcome

🏆 Best Performing Model
  • Model: Convolutional Neural Network (CNN)
  • Test Accuracy: 97.83%
📈 Evaluation Metrics (Spam Class)
  • Precision: 94%
  • Recall: 98%
  • F1-Score: 96%
🧮 Confusion Matrix
  • True Positives: 560
  • False Positives: 33
  • False Negatives: 9
  • True Negatives: 1330
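
The reported scores can be checked directly against the confusion matrix. The short calculation below uses only the four counts listed above and the standard definitions of each metric, with spam treated as the positive class.

```python
# Counts from the confusion matrix above (spam is the positive class).
tp, fp, fn, tn = 560, 33, 9, 1330

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.9783
print(f"precision = {precision:.4f}")  # 0.9444
print(f"recall    = {recall:.4f}")     # 0.9842
print(f"f1        = {f1:.4f}")         # 0.9639
```

The four counts sum to 1,932 test messages, and the computed values round to the 97.83% accuracy, 94% precision, 98% recall, and 96% F1-score stated above.
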
