Project Analysis
Spam Detection Application
About Project
This project aims to develop an intelligent spam detection system using deep learning techniques. The objective is to accurately classify email and SMS messages as either "spam" or "not spam" by building and deploying a robust model. The project proceeds in stages: dataset acquisition, data preprocessing, model training, and evaluation. The best-performing model and its preprocessing pipeline are then saved and integrated into interactive web applications built with Django and Streamlit.
Problem Statement
Email and SMS platforms are plagued with spam, which threatens user security and productivity. Traditional rule-based or statistical filters often fail to adapt to new spam tactics, especially when messages use obfuscated language or are very short.
Objective
- Develop a deep learning-based binary text classifier
- Accurately distinguish between spam and ham messages
- Handle multiple datasets for training and evaluation
- Integrate a custom preprocessing pipeline tailored for spam detection
  - Include steps like lemmatization, stopword removal, and tokenization
  - Generate spam-specific features (e.g., keyword presence, punctuation patterns)
- Enable real-time message classification through:
  - Streamlit: interactive demo UI
  - FastAPI: API-first architecture
  - Django: full-stack integration for web apps
Proposed Solution
- Developed an end-to-end pipeline for spam message classification
- Combined 4 datasets (SMS Spam Collection, NUS SMS Corpus, SpamAssassin, others)
  - Total size: 19,314 messages
- Custom preprocessing pipeline included:
  - Stopword removal
  - Stemming
  - Punctuation stripping
  - Spam-word scoring (based on keyword frequency)
- spam_word_score: custom metric for spam keyword density
- len_message: length of the message
- Used CNN and LSTM architectures with:
  - Keras Tokenizer and padded sequences
  - Concatenated custom features (e.g., spam_word_score)
  - Keras Tuner for hyperparameter tuning
- Streamlit frontend for interactive spam classification
- Django backend app deployed on DigitalOcean
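The preprocessing steps and the two engineered features listed above can be illustrated with a minimal, standard-library sketch. It is not the project's actual pipeline: the stopword and spam-keyword sets are tiny hypothetical stand-ins, `simple_stem` is a crude substitute for a real stemmer such as NLTK's PorterStemmer, and the formulas for `spam_word_score` (share of tokens that are spam keywords) and `len_message` (character count of the raw message) are assumed definitions.

```python
import re
import string

# Hypothetical, deliberately tiny lists; a real pipeline would likely use
# NLTK's English stopwords and a much larger spam keyword list.
STOPWORDS = {"a", "an", "the", "to", "you", "your", "is", "and", "now", "have"}
SPAM_WORDS = {"free", "win", "prize", "claim", "urgent", "cash"}


def simple_stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, then stem."""
    text = message.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [simple_stem(t) for t in tokens]


def engineered_features(message: str) -> dict[str, float]:
    """Compute the two engineered features named in the text."""
    tokens = preprocess(message)
    hits = sum(1 for t in tokens if t in SPAM_WORDS)
    return {
        # Assumed definition: share of tokens that are spam keywords.
        "spam_word_score": hits / len(tokens) if tokens else 0.0,
        # Assumed definition: character count of the raw message.
        "len_message": float(len(message)),
    }


print(preprocess("Claim your FREE prize now!!!"))
print(engineered_features("Claim your FREE prize now!!!"))
```

Keeping the feature computation in one function makes it easier to call the same code at training time and at inference time, which matters for the consistency issue described under Challenges Faced.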
Technologies Used
- NLTK: text processing
- Regular expressions: pattern-based text cleaning
- Stemming: reduces words to their root form
- Stopword filtering: removes non-informative words
- TensorFlow: backend engine for model training
- Keras: used to build CNN and LSTM models
- Scikit-learn: evaluation metrics and preprocessing
- Keras Tuner: hyperparameter optimization
- Matplotlib & Seaborn: data visualization
- Streamlit: interactive frontend
- Django: backend framework
- DigitalOcean: cloud hosting and deployment
- Combined SMS and Email datasets from open-source repositories
Challenges Faced
- Ham messages significantly outnumbered spam messages
  - Addressed using oversampling, stratified train-test split, and metrics beyond accuracy
- CNN and LSTM models overfit quickly due to limited data
  - Used Dropout, L2 regularization, and early stopping to prevent overfitting
- Tokenizer had to be saved and reused consistently across training and inference
  - Ensured no mismatch between training and deployment inputs
- Integrated engineered features (e.g., spam score, message length)
  - Required careful reshaping and concatenation with embedded sequences
- Streamlit was lightweight and easy to set up
- Django + model + tokenizer + webcam consumed significant memory during deployment
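The class-imbalance handling mentioned above (a stratified train-test split followed by oversampling) can be sketched with the standard library alone. This is an illustration on toy labels, not the project's code, which more plausibly relied on Scikit-learn utilities. The function names, the 20% test fraction, and the seed are all assumptions. The sketch oversamples only the training indices, so the test set keeps its original class ratio.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def oversample(train_idx, labels, seed=42):
    """Duplicate minority-class indices at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    rng.shuffle(balanced)
    return balanced


labels = [0] * 70 + [1] * 30  # toy data: 0 = ham, 1 = spam
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced_idx = oversample(train_idx, labels)
print(Counter(labels[i] for i in test_idx))      # test set keeps the 70/30 ratio
print(Counter(labels[i] for i in balanced_idx))  # training set is balanced
```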
Methodology
- Embedding layer: 64 dimensions
- 1D Convolution: 128 filters
- Global MaxPooling
- Dropout: 0.5
- Dense layers
- Output: Sigmoid activation
- Classification Report: precision, recall, F1-score
- Confusion Matrix
- Learning Curves: Train vs Validation Accuracy & Loss
- Tokenizer, padded sequences, and engineered features
- Streamlit App: the user enters a message and the app outputs a label, a confidence score, and the matched spam keywords
- Django App: production-ready backend hosted at sunilmanirudhan.linustech.in/spam/spam
- Custom spam_word_score: enhanced interpretability of model predictions
- Hybrid NLP + Deep Learning: combined classical techniques with modern architectures
- Dual Deployment: accessible via both Streamlit and Django platforms
- Keras Tuner: automated architecture and hyperparameter tuning
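The layer list at the top of this section (64-dimensional embedding, 128-filter 1D convolution, global max pooling, dropout of 0.5, dense layers, sigmoid output) could be wired together roughly as follows with the Keras functional API. Only those layer types and sizes come from the text. The vocabulary size, sequence length, kernel size, dense width, L2 factor, optimizer, and the point at which the engineered features are concatenated are assumptions made for the sketch; the tuned values found with Keras Tuner are not given in the source.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumed vocabulary size
MAX_LEN = 100        # assumed padded sequence length
NUM_FEATURES = 2     # spam_word_score and len_message


def build_model() -> keras.Model:
    """Two-input CNN: padded token ids plus engineered features."""
    tokens = keras.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
    extras = keras.Input(shape=(NUM_FEATURES,), name="engineered")

    x = layers.Embedding(VOCAB_SIZE, 64)(tokens)                 # 64 dimensions
    x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)  # 128 filters
    x = layers.GlobalMaxPooling1D()(x)

    # Assumption: engineered features join the pooled text vector here.
    x = layers.Concatenate()([x, extras])
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=keras.regularizers.l2(1e-4),  # assumed L2 factor
    )(x)
    output = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[tokens, extras], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model()
dummy_tokens = np.zeros((2, MAX_LEN), dtype="int32")
dummy_extras = np.zeros((2, NUM_FEATURES), dtype="float32")
probs = model.predict([dummy_tokens, dummy_extras], verbose=0)
print(probs.shape)  # one spam probability per message
```

Concatenating after pooling keeps both inputs two-dimensional, which avoids reshaping the engineered features to match the embedded sequence; other arrangements are possible.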
Result / Outcome
- Model: Convolutional Neural Network (CNN)
- Test Accuracy: 97.83%
- Precision: 94%
- Recall: 98%
- F1-Score: 96%
- True Positives: 560
- False Positives: 33
- False Negatives: 9
- True Negatives: 1330
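The reported accuracy, precision, recall, and F1-score follow directly from the four confusion-matrix counts above, treating spam as the positive class. The short check below recomputes them; the variable names are illustrative.

```python
tp, fp, fn, tn = 560, 33, 9, 1330  # counts reported above

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.9783
print(f"precision = {precision:.4f}")  # 0.9444
print(f"recall    = {recall:.4f}")     # 0.9842
print(f"f1        = {f1:.4f}")         # 0.9639
```

The counts sum to 1,932 test messages, about 10% of the 19,314-message dataset.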