Skip to content

Income classification of census data with support vector machine (SVM), decision trees and random forest including hyper parameter tuning and evaluation

Notifications You must be signed in to change notification settings

lucashoeft/Intelligent-Data-Analysis-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Intelligent-Data-Analysis-Project

The final project for the course 'Intelligent Data Analysis & Machine Learning I' at the University of Potsdam. The given task was to predict the income (under or over 50k) based on interview answers from 30k individuals. The income is known for 5k of the 30k people. The income needs to be predicted for the other 25k people.

As methods to solve the task, I chose Support Vector Machine (SVM), decision trees and random forest. I did hyperparameter tuning for each method. The different models showed similiar results based on the checked metrics (F-Measure and ROC AUC). The random forest slightly outperforms the other models.

For the future:

  • Balance the data set for target variable (e.g. with downsampling)
  • Test binning or removing of heavily skewed/imbalanced variables (e.g. gains, losses, country of birth)
  • Extent hyperparameter search trying out more combinations (e.g. with RandomizedSearchCV)
  • Check if decision trees / random forest perform better without one-hot encoding

Dataset

Kohavi, R. (1996). Census Income [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5GP7S.

About

Income classification of census data with support vector machine (SVM), decision trees and random forest including hyper parameter tuning and evaluation

Topics

Resources

Stars

Watchers

Forks