Project Analysis

HR Salary Prediction

About Project

This project uses the HR salary dataset from Kaggle, which contains employee demographics, job-related details, and compensation figures. The data can be used to identify factors that influence salary, such as experience, education, and job title, and to assess organisational pay equity.

Problem Statement

TCS HR Dashboard Requirements
  • Predict salaries of new hires or internal transfers
  • Justify salary revisions based on experience, skills, and performance
  • Ensure consistent and fair compensation using analytics

Objective

Project Objectives
  • Create a machine learning model that predicts employee salaries
  • Build an interactive web dashboard that HR managers can access
  • Improve prediction accuracy through feature engineering and model tuning

Proposed Solution

Data Summary
  • Used an HR dataset with 311 entries and 38 features
  • Cleaned and reduced the dataset to 8 key features (e.g., age, experience, department, race, project count, performance score)
Modeling
  • Trained and compared multiple models:
    • Linear Regression
    • Random Forest
    • Gradient Boosting (with and without hyperparameter tuning)
Deployment
  • Built and hosted a Flask-based interactive web dashboard on PythonAnywhere

Technologies Used

Languages
  • Python
Libraries
  • Pandas
  • Scikit-learn (including GridSearchCV for hyperparameter tuning)
  • Matplotlib
Frameworks and Markup
  • Flask
  • HTML (for dashboard)
Deployment
  • PythonAnywhere
Tools
  • Google Colab
  • Spyder

Challenges Faced

  • Small dataset size made overfitting a concern
  • Target variable (salary) was skewed, requiring transformation
  • Feature selection was non-trivial due to missing or redundant fields
  • Age and experience were not in the raw data and had to be derived from date of birth and hire/termination dates

Methodology

Business Understanding
  • Identify salary influencers such as experience, age, and performance
Data Cleaning
  • Filled missing values (e.g., termination date) using the current date
  • Removed uninformative columns (e.g., Employee ID, Manager Name)
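
The cleaning steps above might look roughly like the following sketch. The column names (`employee_id`, `manager_name`, `hire_date`, `termination_date`), the toy rows, and the fixed `today` value are assumptions made for illustration; the real dataset's headers may differ.

```python
import pandas as pd

# Toy rows standing in for the HR data; column names are hypothetical.
raw = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "manager_name": ["A. Rao", "B. Sen", "A. Rao"],
    "hire_date": ["2015-03-01", "2018-07-15", "2020-01-10"],
    "termination_date": ["2019-06-30", None, None],
    "salary": [62000, 58000, 71000],
})


def clean(df: pd.DataFrame, today: pd.Timestamp) -> pd.DataFrame:
    """Fill missing termination dates and drop identifier-like columns."""
    out = df.copy()
    for col in ("hire_date", "termination_date"):
        out[col] = pd.to_datetime(out[col])
    # A missing termination date is replaced with the reference date.
    out["termination_date"] = out["termination_date"].fillna(today)
    # Columns treated as uninformative in the write-up are removed.
    return out.drop(columns=["employee_id", "manager_name"])


# A fixed date keeps the example reproducible; pd.Timestamp.today() would
# give the actual current date.
cleaned = clean(raw, today=pd.Timestamp("2024-01-01"))
print(cleaned)
```
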
Feature Engineering
  • Extracted Age from DOB and Experience from Date of Hire
  • Encoded categorical variables
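
As a rough illustration of these feature engineering steps, the sketch below derives approximate age and experience in years and one-hot encodes a categorical column. The column names, the reference date, the choice to measure experience up to the (filled) termination date, and the use of one-hot encoding are all assumptions; the write-up does not state which encoding was used.

```python
import pandas as pd

# Hypothetical column names and made-up rows, for illustration only.
df = pd.DataFrame({
    "dob": pd.to_datetime(["1985-05-20", "1990-11-02"]),
    "hire_date": pd.to_datetime(["2012-01-15", "2019-06-01"]),
    "termination_date": pd.to_datetime(["2020-01-15", "2024-01-01"]),
    "department": ["Sales", "IT"],
})
reference_date = pd.Timestamp("2024-01-01")

# Approximate age in whole years as of the reference date.
df["age"] = ((reference_date - df["dob"]).dt.days / 365.25).astype(int)

# Approximate experience in years between hire and termination date.
tenure_days = (df["termination_date"] - df["hire_date"]).dt.days
df["experience"] = (tenure_days / 365.25).round(1)

# Drop the raw date columns and one-hot encode the categorical column.
encoded = pd.get_dummies(
    df.drop(columns=["dob", "hire_date", "termination_date"]),
    columns=["department"],
)
print(encoded)
```
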
Exploratory Data Analysis (EDA)
  • Used correlation heatmaps and scatter plots
  • Found that project count and performance score had the strongest relationship with salary
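
One way to rank features by their correlation with salary is sketched below. The data values and column names are invented for illustration and do not reproduce the project's findings.

```python
import pandas as pd

# Toy numeric data; values and column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 41, 29, 50, 37],
    "project_count": [0, 2, 5, 1, 7, 3],
    "performance_score": [2, 3, 4, 3, 4, 3],
    "salary": [48000, 56000, 78000, 52000, 95000, 64000],
})

# Pearson correlation of each feature with the target, strongest first.
corr_with_salary = (
    df.corr()["salary"]
    .drop("salary")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_salary)
```

The full matrix returned by `df.corr()` is what a correlation heatmap visualises; it can be drawn with Matplotlib, for example via `imshow`.
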
Modeling
  • Trained and evaluated models using a train/test split
  • Applied target transformation to handle salary skew
  • Tuned Random Forest and Gradient Boosting using GridSearchCV
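
The modeling steps above can be sketched with scikit-learn as follows. The write-up does not say which target transformation or parameter grid was used, so the log transform (`np.log1p`), the grid values, the split ratio, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for 8 cleaned features and a right-skewed salary.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = np.exp(11 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on log1p(salary) and map predictions back to salary with expm1.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Illustrative grid only; the values actually searched are not documented.
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [2, 3],
    "regressor__learning_rate": [0.05, 0.1],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

# R² on the held-out split, measured in the original salary scale.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

Wrapping the regressor in `TransformedTargetRegressor` keeps the transform and its inverse inside the estimator, so cross-validation scores and predictions are both in salary units.
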
Deployment
  • Created HTML forms and Flask routes
  • Hosted the app on PythonAnywhere
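
A minimal sketch of an HTML form wired to a Flask route is shown below. The form fields, the inline template, and the `predict_salary` helper (a made-up formula standing in for the trained model) are hypothetical and are not taken from the project's code.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline template for brevity; field names are hypothetical.
FORM = """
<form method="post">
  <input name="experience" type="number" step="0.1" placeholder="Years of experience">
  <input name="project_count" type="number" placeholder="Project count">
  <button type="submit">Predict</button>
</form>
{% if prediction is not none %}<p>Predicted salary: {{ prediction }}</p>{% endif %}
"""


def predict_salary(experience: float, project_count: int) -> float:
    """Placeholder for the trained model's predict call (made-up formula)."""
    return round(40000 + 2500 * experience + 1500 * project_count, 2)


@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        # Read the submitted form fields and convert them to numbers.
        experience = float(request.form["experience"])
        project_count = int(request.form["project_count"])
        prediction = predict_salary(experience, project_count)
    return render_template_string(FORM, prediction=prediction)
```

In a real deployment the placeholder would be replaced by a call to the saved model, and the hosting platform's WSGI configuration would import the `app` object.
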

Result / Outcome

Best Model
  • Tuned Gradient Boosting
Performance
  • R² score: 0.76 (considered acceptable for business use)
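
For context, R² is the share of variance in the target that the model's predictions account for, so a score of 0.76 corresponds to roughly 76% of salary variance on the data it was scored on. The sketch below computes it by hand on made-up numbers that are unrelated to the project's results.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up salaries and predictions, purely for illustration.
actual = [50000, 60000, 70000, 80000]
predicted = [52000, 58000, 73000, 77000]
print(round(r2_score(actual, predicted), 3))
```
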
Business Impact
  • TCS HR can now predict salaries with improved accuracy and transparency
