Sunil M Anirudhan

Project Analysis

Spam Detection Application

About Project

This project develops a spam detection system using deep learning. The objective is to classify email and SMS messages as either "spam" or "not spam" (ham) by building and deploying a robust model. The project proceeds in stages: dataset acquisition, data preprocessing, model training, and evaluation. The best-performing model and its preprocessing pipeline are then saved and integrated into interactive web applications built with Streamlit and Django.

Problem Statement

Email and SMS platforms are plagued with spam, which threatens user security and productivity. Traditional rule-based or statistical filters often fail to adapt to new spam tactics, especially when messages use obfuscated language or are very short.

Objective

🎯 Objective

Develop a deep learning-based binary text classifier
Accurately distinguish between spam and ham messages

📁 Dataset Support

Handle multiple datasets for training and evaluation

🧹 NLP Preprocessing

Integrate a custom preprocessing pipeline tailored for spam detection
Include steps such as stemming, stopword removal, and tokenization

🧪 Feature Engineering

Generate spam-specific features (e.g., keyword presence, punctuation patterns)

⚙️ Deployment & Real-Time Prediction

Enable real-time message classification through:

Streamlit – interactive demo UI
FastAPI – API-first architecture
Django – full-stack integration for web apps

Proposed Solution

🛠️ Full Pipeline Overview

Developed an end-to-end pipeline for spam message classification

🗂️ Dataset Merging

Combined four datasets, including SMS Spam Collection, NUS SMS Corpus, and SpamAssassin
Total size: 19,314 messages

🧹 NLP Preprocessing

Custom preprocessing pipeline included:

Stopword removal
Stemming
Punctuation stripping
Spam-word scoring (based on keyword frequency)
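
The first three steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the stopword set is a tiny hypothetical sample (the project lists NLTK for text processing), and `naive_stem` is a crude suffix stripper standing in for a real stemmer.

```python
import string

# Tiny illustrative stopword set (assumption); a real pipeline would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "for", "and", "of"}

def naive_stem(word):
    """Crude suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, then stem each token."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

sample = "Congratulations!!! You are selected for a FREE prize, claiming ends today."
print(preprocess(sample))
```

Stopwords are removed before stemming so that the lookup runs on unmodified words; the order of the steps in the original pipeline is not stated.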

🧪 Feature Engineering

spam_word_score – Custom metric for spam keyword density
len_message – Length of the message
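
One plausible way to compute these two features is shown below. The exact definition of `spam_word_score` is not given here, so this sketch assumes it is the fraction of tokens that match a keyword list; the `SPAM_KEYWORDS` set is hypothetical.

```python
# Hypothetical keyword list; the project's actual spam vocabulary is not given.
SPAM_KEYWORDS = {"free", "win", "winner", "prize", "cash", "urgent", "claim"}

def spam_word_score(message):
    """Assumed definition: share of tokens that appear in the keyword list."""
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,!?:;") in SPAM_KEYWORDS)
    return hits / len(tokens)

def len_message(message):
    """Length of the raw message in characters."""
    return len(message)

def extract_features(message):
    """Bundle both engineered features for one message."""
    return {
        "spam_word_score": spam_word_score(message),
        "len_message": len_message(message),
    }

features = extract_features("WIN a FREE prize now!")
print(features)
```

A density rather than a raw count keeps the score comparable between short SMS texts and long emails.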

🤖 Model Training

Used CNN and LSTM architectures with:

Keras Tokenizer and padded sequences
Concatenated custom features (e.g., spam_word_score)
Keras Tuner for hyperparameter tuning

🚀 Model Deployment

Streamlit frontend for interactive spam classification
Django backend app deployed on DigitalOcean

Technologies Used

🧰 Tech Stack & Tools

🗣️ Natural Language Processing (NLP)

NLTK – for text processing
Regular expressions – for pattern-based text cleaning
Stemming – to reduce words to their root form
Stopword filtering – to remove non-informative words

🧠 Deep Learning

TensorFlow – backend engine for model training
Keras – used to build CNN and LSTM models

🧪 Supporting Tools

Scikit-learn – for evaluation metrics and preprocessing
Keras Tuner – for hyperparameter optimization
Matplotlib & Seaborn – for data visualization

🚀 Deployment

Streamlit – interactive frontend
Django – backend framework
DigitalOcean – cloud hosting and deployment

📂 Datasets

Combined SMS and Email datasets from open-source repositories

Challenges Faced

📉 Dataset Imbalance

Ham messages significantly outnumbered spam messages
Addressed using oversampling, stratified train-test split, and metrics beyond accuracy
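
The two data-side remedies above can be illustrated with the standard library alone. This is a sketch under assumptions: the split fraction, the seeds, and the toy 80/20 label list are invented for the example, and the project may well have used Scikit-learn or another library instead.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

def oversample(indices, labels, seed=0):
    """Randomly repeat minority-class indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced label list (assumption): 80 ham, 20 spam.
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced_train = oversample(train_idx, labels)

print(Counter(labels[i] for i in test_idx))        # test set keeps the original ratio
print(Counter(labels[i] for i in balanced_train))  # training set is balanced
```

Oversampling is applied only to the training indices, so duplicated spam messages never leak into the test set.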

📛 Overfitting on Small Models

CNN and LSTM models overfit quickly due to limited data
Used Dropout, L2 regularization, and early stopping to prevent overfitting
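
Of the three techniques, early stopping is the easiest to show in isolation. The sketch below implements the usual patience rule in plain Python; the loss values and the patience of 3 are made up for illustration, and the project's actual settings are not stated.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) under a patience rule.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the rule.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
history = [0.60, 0.45, 0.40, 0.42, 0.43, 0.44, 0.41]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights from `best_epoch` are the ones worth keeping, since later epochs have started to fit noise in the training data.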

🔡 Tokenization & Padding

Tokenizer had to be saved and reused consistently across training and inference
Ensured no mismatch between training and deployment inputs
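
The consistency requirement can be illustrated without Keras. The sketch below uses a hand-rolled vocabulary as a stand-in for the Keras Tokenizer; the file name, the index scheme (0 for padding, 1 for unseen words), and the maximum length of 6 are all assumptions made for the example.

```python
import json
import os
import tempfile

OOV_INDEX = 1  # index for words not seen during training (0 is reserved for padding)

def fit_vocab(texts):
    """Map each training word to an integer index, starting at 2."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 2)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of indices, padded with 0 on the right."""
    ids = [vocab.get(w, OOV_INDEX) for w in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

# Training time: fit once, then persist the vocabulary together with the padding length.
vocab = fit_vocab(["win a free prize", "see you at lunch"])
path = os.path.join(tempfile.gettempdir(), "tokenizer_sketch.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"vocab": vocab, "max_len": 6}, f)

# Inference time: load the saved artifacts instead of refitting on new text.
with open(path, encoding="utf-8") as f:
    saved = json.load(f)
encoded = encode("free lunch today", saved["vocab"], saved["max_len"])
print(encoded)
```

Storing the padding length next to the vocabulary matters as much as the vocabulary itself: a model trained on one sequence length cannot accept another. Note that this sketch pads on the right for readability, whereas Keras's `pad_sequences` pads on the left by default.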

🔀 Combining Engineered + Learned Features

Integrated engineered features (e.g., spam score, message length)
Required careful reshaping and concatenation with embedded sequences

💾 Deployment Memory Limits

Streamlit was lightweight and easy to set up
Django + model + tokenizer + webcam consumed significant memory during deployment

Methodology

🏗️ Model Architecture – CNN

Embedding layer – 64 dimensions
1D Convolution – 128 filters
Global MaxPooling
Dropout – 0.5
Dense layers
Output: Sigmoid activation
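
A Keras functional-API sketch of the layer stack above, including one possible way to concatenate the engineered features with the text branch, might look like this. Only the embedding size (64), filter count (128), dropout rate (0.5), and sigmoid output come from the list; the vocabulary size, sequence length, kernel size, dense width, L2 strength, and the point at which the features are merged are all assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # assumption: not stated in the write-up
MAX_LEN = 100       # assumption: padded sequence length
NUM_EXTRA = 2       # spam_word_score and len_message

# Two inputs: padded token ids and the engineered feature vector.
text_in = keras.Input(shape=(MAX_LEN,), name="tokens")
extra_in = keras.Input(shape=(NUM_EXTRA,), name="engineered")

# Text branch: embedding -> 1D convolution -> global max pooling -> dropout.
x = layers.Embedding(VOCAB_SIZE, 64)(text_in)
x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)  # kernel size assumed
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)

# After pooling the text branch is a flat vector, so it can be joined
# directly with the engineered features.
merged = layers.Concatenate()([x, extra_in])
h = layers.Dense(
    64, activation="relu", kernel_regularizer=keras.regularizers.l2(1e-4)
)(merged)
out = layers.Dense(1, activation="sigmoid")(h)

model = keras.Model(inputs=[text_in, extra_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Because `len_message` is on a much larger scale than `spam_word_score`, scaling the engineered features before feeding them to the model is usually advisable.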

🧪 Evaluation Tools

Classification Report – precision, recall, F1-score
Confusion Matrix
Learning Curves – Train vs Validation Accuracy & Loss
Evaluation inputs: tokenized and padded sequences plus engineered features

🌐 Deployment

Streamlit App: User enters message → outputs label, confidence, and matched spam keywords
Django App: Production-ready backend hosted at:
🔗 sunilmanirudhan.linustech.in/spam/spam
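
The three values the Streamlit app displays (label, confidence, matched spam keywords) can be derived from the model's sigmoid output as sketched below. The function name, the 0.5 threshold, and the keyword list are hypothetical; the app's real logic is not shown here.

```python
# Hypothetical keyword list used only for this illustration.
SPAM_KEYWORDS = {"free", "win", "prize", "urgent", "claim"}

def describe_prediction(message, spam_probability, threshold=0.5):
    """Turn a sigmoid output into the fields a UI would display."""
    is_spam = spam_probability >= threshold
    # Confidence is the probability of whichever class was chosen.
    confidence = spam_probability if is_spam else 1.0 - spam_probability
    words = {w.strip(".,!?:;").lower() for w in message.split()}
    return {
        "label": "spam" if is_spam else "ham",
        "confidence": round(confidence, 4),
        "matched_keywords": sorted(words & SPAM_KEYWORDS),
    }

result = describe_prediction("Claim your FREE prize now!", 0.93)
print(result)
```

Showing the matched keywords next to the label gives the user a reason for the prediction, which a bare probability does not.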

✅ Key Innovations

Custom spam_word_score – Enhanced interpretability of model predictions
Hybrid NLP + Deep Learning – Combined classical techniques with modern architectures
Dual Deployment – Accessible via both Streamlit and Django platforms
Keras Tuner – Automated architecture and hyperparameter tuning

Result / Outcome

🏆 Best Performing Model

Model: Convolutional Neural Network (CNN)
Test Accuracy: 97.83%

📈 Evaluation Metrics (Spam Class)

Precision: 94%
Recall: 98%
F1-Score: 96%

🧮 Confusion Matrix

True Positives: 560
False Positives: 33
False Negatives: 9
True Negatives: 1330
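
The reported accuracy, precision, recall, and F1-score follow directly from these four counts, which describe a test set of 1,932 messages. A short check:

```python
# Counts taken from the confusion matrix above (spam is the positive class).
tp, fp, fn, tn = 560, 33, 9, 1330

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.2%}")
print(f"precision={precision:.0%} recall={recall:.0%} f1={f1:.0%}")
```

The 33 false positives are legitimate messages flagged as spam, and the 9 false negatives are spam messages that slipped through.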
