Predictive model


Published on December 31, 2016

Author: PingYin8

Source: slideshare.net

1. Income Analysis Ping Yin 11/10/2016

2. Contents
• Executive Summary: 3
• Introduction: 4
• Purpose: 5
• Methodology
  • Data Selection: 6
  • Exploration: 7-24
  • Preparation & Transformation: 25-34
  • Model Development & Assessment: 35-44
  • Model Comparison: 45-47
• Options and Recommendations: 48-52
• Summary: 53
• Appendix: 54

3. Executive Summary • After data preparation and partitioning, three models were built: one each in SAS Studio, Enterprise Miner (EM), and DataRobot • The same test dataset was scored by all three models • The model built in EM had the best performance

4. Introduction • Can we predict Income level based on age, gender, education, etc.? • What is my income level after I graduate?

5. Purpose • Identify the best predictive model for the Income dataset • Predict my Income level • Practice data preparation, model building, and model assessment

6. Data Selection • The Income dataset was originally extracted from the 1994 Census Bureau database • Downloaded from Kaggle.com • Reasons for choosing it: • The target variable, Income, is a categorical variable • Medium size: 10+ columns and 30K+ rows • Used in the Macro and DataRobot projects

7. Exploration • Using SAS Studio to explore the data • 32,561 observations • 15 variables: 6 numeric, 9 character • Numeric: Age, Capitalgain, Capitalloss, Weekhour, Edunum, Fnlwgt • Character: Income, Relationship, Education, Occupation, Sex, Marital, Workclass, Race, Nativecountry • Target: Income (“>50K”, “<=50K”)
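
The first exploration pass (observation count, numeric versus character variables, target levels) can be sketched in plain Python. This is only an illustration: the project itself used SAS Studio, and the three rows below are made-up stand-ins for the Income dataset, not real records. The helper name summarize is hypothetical.

```python
from collections import Counter

# Hypothetical toy rows standing in for the Income dataset (not real records).
rows = [
    {"Age": 39, "Weekhour": 40, "Sex": "Male", "Income": "<=50K"},
    {"Age": 52, "Weekhour": 45, "Sex": "Female", "Income": ">50K"},
    {"Age": 28, "Weekhour": 40, "Sex": "Female", "Income": "<=50K"},
]


def summarize(rows, target="Income"):
    """Count observations, split columns by type, and tabulate the target."""
    columns = list(rows[0].keys())
    # A column counts as numeric only if every value in it is a number.
    numeric = [
        c for c in columns
        if all(isinstance(r[c], (int, float)) for r in rows)
    ]
    character = [c for c in columns if c not in numeric]
    return {
        "n_obs": len(rows),
        "numeric": numeric,
        "character": character,
        "target_counts": Counter(r[target] for r in rows),
    }


summary = summarize(rows)
print(summary)
```

On the full dataset, the same kind of summary is what yields the counts reported on this slide.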

8. Exploration

9. Exploration

10. Exploration

11. Exploration

12. Exploration

13. Exploration

14. Exploration

15. Exploration

16. Exploration

17. Exploration

18. Exploration

19. Exploration

20. Exploration

21. Exploration

22. Exploration

23. Exploration

24. Exploration • Data issues: • Missing values: Workclass, Occupation, Nativecountry • Many levels: Education, Marital, Workclass, Nativecountry • Numeric variables: Capitalgain, Capitalloss • Screen variable: Fnlwgt

25. Preparation & Transformations • Solutions: • Imputing missing values using subject-matter knowledge: impute missing values for Workclass and Occupation with “Unemployed” • Imputing missing values using the mode: impute missing values for Nativecountry with “United-States”
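
The two imputation rules on this slide (a fixed value chosen from subject-matter knowledge, and the most frequent value) can be sketched as follows. This is a minimal Python illustration, not the SAS code used in the project. The "?" missing-value marker, the helper names, and the toy rows are assumptions made for the example.

```python
from collections import Counter

MISSING = "?"  # assumption: how a missing value is encoded in the raw file


def impute_constant(rows, column, value):
    """Replace missing entries in one column with a fixed value."""
    return [
        {**r, column: value} if r[column] == MISSING else dict(r)
        for r in rows
    ]


def impute_mode(rows, column):
    """Replace missing entries in one column with its most frequent value."""
    observed = [r[column] for r in rows if r[column] != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return impute_constant(rows, column, mode)


# Hypothetical toy rows (not real records).
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "Private", "Occupation": "Tech-support", "Nativecountry": "?"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "Mexico"},
]

clean = impute_constant(rows, "Workclass", "Unemployed")
clean = impute_constant(clean, "Occupation", "Unemployed")
clean = impute_mode(clean, "Nativecountry")
print(clean[1], clean[2])
```

Each helper returns new rows rather than editing in place, so the raw data stays available for comparison after imputation.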

26. Preparation & Transformations • Solutions: • Converting Capitalgain and Capitalloss from Num to Char • Binning multiple-level variables: Education, Marital, Workclass

27. Preparation & Transformations • Solutions: • Binning Nativecountry and creating a new variable: region
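
The transformations on slides 26 and 27 (numeric to character conversion, level binning, and a new Region variable) can be sketched in Python as below. The slides do not list the actual conversion rule or the actual bins, so the two-level "Zero"/"Nonzero" rule and both lookup tables are hypothetical placeholders, as are the function names.

```python
def amount_to_char(amount):
    """Recode a numeric amount as a character level (hypothetical rule)."""
    return "Nonzero" if amount > 0 else "Zero"


# Hypothetical groupings: the slides do not list the actual bins.
EDUCATION_BINS = {
    "Preschool": "Below-HS",
    "9th": "Below-HS",
    "HS-grad": "HS-grad",
    "Bachelors": "College",
    "Masters": "Graduate",
    "Doctorate": "Graduate",
}
REGION_BINS = {
    "United-States": "North-America",
    "Canada": "North-America",
    "Mexico": "Latin-America",
    "India": "Asia",
    "Germany": "Europe",
}


def bin_level(value, mapping, default="Other"):
    """Map an original level to its bin; unknown levels fall into `default`."""
    return mapping.get(value, default)


def transform(row):
    """Apply the conversions and binning to one row, returning a new row."""
    out = dict(row)
    out["Capitalgain"] = amount_to_char(row["Capitalgain"])
    out["Capitalloss"] = amount_to_char(row["Capitalloss"])
    out["Education"] = bin_level(row["Education"], EDUCATION_BINS)
    # Region is a new variable; Nativecountry is kept alongside it here.
    out["Region"] = bin_level(row["Nativecountry"], REGION_BINS)
    return out


row = {"Capitalgain": 2174, "Capitalloss": 0,
       "Education": "Masters", "Nativecountry": "India"}
new_row = transform(row)
print(new_row)
```

A lookup table with an explicit default keeps the binning rule in one place, which makes it easy to revise the groups if a later model suggests the bins were too coarse.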

28. Preparation & Transformations • Reasons for dropping the variable Fnlwgt: • It is a weight from the Current Population Survey files, not original Census data • It showed near-zero importance in last week's DataRobot project

29. Preparation & Transformations • Reasons for leaving the variable Occupation unchanged: • 15 levels • No sound criterion for grouping them • Reasons for leaving the variables Race and Relationship unchanged: • 5-6 levels • Each level is meaningful

30. Preparation & Transformations After preparation:

31. Preparation & Transformations

32. Preparation & Transformations

33. Preparation & Transformations • Data partitioned using stratified sampling (Strata method)
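
A stratified partition keeps the proportion of each target level the same in the training and test datasets. The sketch below shows the idea in Python. It assumes the strata are the levels of Income, which the slide does not state, and the function name, seed, and generated rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, train_fraction, seed=0):
    """Split rows into train and test sets, sampling within each target level."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)

    train, test = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Hypothetical data: 200 rows, one quarter of them labeled ">50K".
rows = [
    {"id": i, "Income": ">50K" if i % 4 == 0 else "<=50K"}
    for i in range(200)
]
train, test = stratified_split(rows, "Income", train_fraction=0.7)

train_high = sum(r["Income"] == ">50K" for r in train)
test_high = sum(r["Income"] == ">50K" for r in test)
print(len(train), len(test), train_high, test_high)
```

Changing train_fraction is all it takes to compare different training shares, such as the 60% and 70% splits discussed later in the deck.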

34. Now it is ready to go! • Datasets: Training dataset, Test dataset • Tools: SAS Studio, Enterprise Miner, DataRobot

35. Model Development & Assessment: SAS Studio

36. Model Development & Assessment: SAS Studio

37. Model Development & Assessment: SAS Studio

38. Model Development & Assessment: SAS Studio

39. Model Development & Assessment: EM

40. Model Development & Assessment: EM

41. Model Development & Assessment: DataRobot

42. Model Development & Assessment: DataRobot

43. Model Development & Assessment: DataRobot

44. Model Development & Assessment: DataRobot

45. Model Comparison

46. Model Comparison • The best model in this project: EM Studio DataRobot
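
Comparing models on the same test dataset amounts to scoring every model against the same actual values and ranking them by one metric. The Python sketch below uses the misclassification rate as that metric, which is an assumption: the transcript does not say which measure the comparison used. The model names and predictions are made up for illustration and are not the project's results.

```python
def misclassification_rate(actual, predicted):
    """Share of test observations whose predicted level differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical test labels and predictions (not the project's results).
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "model_a": [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    "model_b": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
    "model_c": [">50K", "<=50K", "<=50K", ">50K", "<=50K"],
}

rates = {
    name: misclassification_rate(actual, preds)
    for name, preds in predictions.items()
}
best = min(rates, key=rates.get)  # lowest error rate wins
print(rates, best)
```

Because every model is scored on identical observations, differences in the rate reflect the models and their data preparation, not the sample.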

47. Model Comparison: Predict my Income level Ping Dataset EM Studio DataRobot

48. Options and Recommendations • Using 60% of the data to build a model • Using 70% of the data to build a model

49. Options and Recommendations Macro Project DataRobot Project The overall best model

50. Options and Recommendations • Factors that may cause these differences: • Dropping the variable Fnlwgt • Reducing levels • Variable transformation: Capitalgain, Capitalloss • Increased speed, but decreased model performance

51. Options • Using DataRobot to build models without handling the “data issues” • Keep trying in SAS Studio

52. Summary • We can predict Income level based on these characteristics • For the Income dataset, DataRobot is the most robust tool for building models • Be aware that data preparation can have unexpected outcomes • Iterate back and forth until the result is satisfactory

53. Appendix Link to Data: https://www.kaggle.com/uciml/adult-census-Income

54. Thanks!
