Barga Data Science lecture 6


Published on April 24, 2016

Author: rsbarga

Source: slideshare.net

1. Deriving Knowledge from Data at Scale

2. Deriving Knowledge from Data at Scale Feature extraction and selection are the most important but most underrated steps of machine learning. Better features are better than better algorithms…

3. Deriving Knowledge from Data at Scale

4. Deriving Knowledge from Data at Scale

5. Deriving Knowledge from Data at Scale Lecture Objectives, homework. There is an order or workflow that takes place here; don’t lose sight of the forest for the trees…

6. Deriving Knowledge from Data at Scale Review…

7. Deriving Knowledge from Data at Scale
• Cluster 0: Females with an average age of 37 who live in the inner city and hold both a savings account and a current account. They are unmarried and have neither a mortgage nor a pep. The average monthly income is 23,300.
• Cluster 1: Females with an average age of 44 who live in a rural area and hold both a savings account and a current account. They are married and have neither a mortgage nor a pep. The average monthly income is 27,772.
• Cluster 2: Females with an average age of 48 who live in the inner city and hold a current account but no savings account. They are unmarried and have no mortgage but do have a pep. The average monthly income is 27,668.
• Cluster 3: Females with an average age of 39 who live in a town and hold both a savings account and a current account. They are married and have neither a mortgage nor a pep. The average monthly income is 24,047.
• Cluster 4: Males with an average age of 39 who live in the inner city and hold a current account but no savings account. They are married and have both a mortgage and a pep. The average monthly income is 26,359.
• Cluster 5: Males with an average age of 47 who live in the inner city and hold both a savings account and a current account. They are unmarried and have no mortgage but do have a pep. The average monthly income is 35,419.

8. Deriving Knowledge from Data at Scale

9. Deriving Knowledge from Data at Scale Classifiers  Lazy –> IBk
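IBk is WEKA's instance-based (k-nearest-neighbour) classifier, filed under "Lazy" because it defers all work to prediction time. The slide gives no settings, so the following is only a minimal sketch of the general k-NN idea, not WEKA's implementation; the function name, the Euclidean distance, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    # Lazy learning: no model is built up front, everything happens per query.
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age, income in thousands) -> responded to a campaign?
train = [
    ((25, 20.0), "N"),
    ((30, 22.0), "N"),
    ((35, 24.0), "N"),
    ((45, 34.0), "Y"),
    ((50, 36.0), "Y"),
    ((55, 38.0), "Y"),
]

prediction = knn_predict(train, (48, 35.0), k=3)
print(prediction)
```

In practice the features would be scaled first, since an unscaled distance lets the attribute with the largest numeric range dominate the vote.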

10. Deriving Knowledge from Data at Scale

11. Deriving Knowledge from Data at Scale

12. Deriving Knowledge from Data at Scale

13. Deriving Knowledge from Data at Scale

14. Deriving Knowledge from Data at Scale

15. Deriving Knowledge from Data at Scale

16. Deriving Knowledge from Data at Scale
No    Prob   Target   CustID   Age
1     0.97   Y        1746     …
2     0.95   N        1024     …
3     0.94   Y        2478     …
4     0.93   Y        3820     …
5     0.92   N        4897     …
…     …      …        …        …
99    0.11   N        2734     …
100   0.06   N        2422
• Use a model to assign a score (probability) to each instance
• Sort instances by decreasing score
• Expect more targets (hits) near the top of the list
• 3 hits in the top 5% of the list
• If there are 15 targets overall, then the top 5 contain 3/15 = 20% of the targets
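The arithmetic on this slide can be reproduced with a short sketch. The top five rows below come from the slide; the remaining 95 rows are made up so that the list has 15 targets in total, as the slide supposes. The function name is hypothetical.

```python
def share_of_targets_captured(scored, top_n):
    """Fraction of all targets found in the top_n instances after sorting by score.

    `scored` is a list of (probability, is_target) pairs.
    """
    ranked = sorted(scored, key=lambda row: row[0], reverse=True)
    total_targets = sum(1 for _, is_target in ranked if is_target)
    hits_at_top = sum(1 for _, is_target in ranked[:top_n] if is_target)
    return hits_at_top / total_targets

# Top 5 rows as shown on the slide (Prob, Target == "Y").
top_five = [(0.97, True), (0.95, False), (0.94, True), (0.93, True), (0.92, False)]

# ASSUMPTION: the other 95 rows are invented, with 12 targets spread among them,
# so that the whole list holds 15 targets.
rest = [(0.91 - 0.009 * i, i % 8 == 0) for i in range(95)]

scored = top_five + rest
share = share_of_targets_captured(scored, top_n=5)
print(f"{share:.0%} of targets are in the top 5 of {len(scored)} instances")
```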

17. Deriving Knowledge from Data at Scale
Lift chart (Model vs. Random):
• 40% of responses for 10% of cost: lift factor = 4
• 80% of responses for 40% of cost: lift factor = 2
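The lift factors above are the share of responses captured divided by the share of the list contacted (the cost). A minimal sketch follows; the ranked 0/1 list is invented so that it reproduces the slide's two points, and the function names are hypothetical.

```python
def lift_factor(pct_responses, pct_cost):
    """Lift = percent of responses captured / percent of the list contacted."""
    return pct_responses / pct_cost

def lift_at(targets_ranked, pct_contacted):
    """Lift when contacting the top pct_contacted percent of a ranked 0/1 list."""
    n_top = len(targets_ranked) * pct_contacted // 100
    captured = sum(targets_ranked[:n_top])
    pct_responses = 100 * captured / sum(targets_ranked)
    return lift_factor(pct_responses, pct_contacted)

# ASSUMPTION: 100 customers already sorted by model score, 10 responders in total,
# placed so that 4 fall in the top 10 and 8 fall in the top 40.
ranked = [1] * 4 + [0] * 6 + [1] * 4 + [0] * 26 + [1] * 2 + [0] * 58

for pct in (10, 40, 100):
    print(pct, lift_at(ranked, pct))
```

Contacting the whole list always gives a lift of 1, which is the "Random" baseline on the chart: a random ordering captures responses in proportion to cost.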

18. Deriving Knowledge from Data at Scale

19. Deriving Knowledge from Data at Scale

20. Deriving Knowledge from Data at Scale

21. Deriving Knowledge from Data at Scale

22. Deriving Knowledge from Data at Scale to impact…
1. Build our predictive model in WEKA Explorer;
2. Use our model to score (predict) which new customers to target in our upcoming advertising campaign;
   • ARFF file manipulation (hacking), an all too common pita…
   • Excel manipulation to join model output with our customer list
3. Compute the lift chart to assess the business impact of our predictive model on the advertising campaign
   • How are lift charts built? Of all the charts and performance measures from a model, this one is ‘on you’ to construct;
   • Where is the business ‘bang for the buck’?
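The join in step 2 does not have to be done by hand in Excel. The following is a rough sketch of the same join in Python, assuming the model output has been exported as CSV with CustID and Prob columns; the column names, the customer names, and the function name are illustrative assumptions, not WEKA's export format.

```python
import csv
import io

def join_scores_to_customers(scores_csv, customers_csv, key="CustID"):
    """Attach each customer's model score and return the rows sorted by score."""
    scores = {
        row[key]: float(row["Prob"])
        for row in csv.DictReader(io.StringIO(scores_csv))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(customers_csv)):
        if row[key] in scores:  # customers without a score are skipped
            joined.append({**row, "Prob": scores[row[key]]})
    joined.sort(key=lambda r: r["Prob"], reverse=True)
    return joined

# Hypothetical inputs; in practice these would be read from files.
scores_csv = "CustID,Prob\n1746,0.97\n1024,0.95\n2422,0.06\n"
customers_csv = (
    "CustID,Name\n"
    "2422,Customer C\n"
    "1746,Customer A\n"
    "1024,Customer B\n"
    "9999,Customer D\n"
)

ranked = join_scores_to_customers(scores_csv, customers_csv)
for row in ranked:
    print(row["CustID"], row["Name"], row["Prob"])
```

The sorted result is the ranked list that a lift chart is built from.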

23. Deriving Knowledge from Data at Scale

24. Deriving Knowledge from Data at Scale

25. Deriving Knowledge from Data at Scale

26. Deriving Knowledge from Data at Scale You can’t turn data lead into modeling gold – we’re data scientists, not data alchemists…

27. Deriving Knowledge from Data at Scale Motivation: Real world examples Example (1) Lesson: Correct data transformation is important!

28. Deriving Knowledge from Data at Scale Motivation: Real world examples Example (2): KDD Cup 2001 Lesson: A model that uses lots of features can turn out to be very sub-optimal, however well it is designed!

29. Deriving Knowledge from Data at Scale Motivation: Real world examples Example (3) Lesson: Feature selection can be crucial even when the number of features is small!

30. Deriving Knowledge from Data at Scale Motivation: Real world examples Example (4) Lesson: Variations of the same ML method can give vastly different performances!

31. Deriving Knowledge from Data at Scale

32. Deriving Knowledge from Data at Scale Predictive modeling competitions

33. Deriving Knowledge from Data at Scale Global competitions: Predicting HIV viral load. State of the art: 70%; after 1½ weeks: 70.8%; at competition close: 77% (improved by 10%).

34. Deriving Knowledge from Data at Scale Mismatch between those with data and those with the skills to analyse it Crowdsourcing

35. Deriving Knowledge from Data at Scale Tourism Forecasting Competition [chart: forecast error (MASE) of the existing model compared with entries on Aug 9, 2 weeks later, 1 month later, and at competition end]

36. Deriving Knowledge from Data at Scale Users apply different techniques:
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling

37. Deriving Knowledge from Data at Scale VicRoads has an algorithm they use to forecast travel time on Melbourne freeways (taking into account time, weather, accidents, etc.). Their current model is inaccurate and somewhat useless. They want to do better (or at least find out whether it’s possible to do better).

38. Deriving Knowledge from Data at Scale (1) Upload, (2) Submit, (3) Evaluate & Exchange

39. Deriving Knowledge from Data at Scale Use the wizard to post a competition

40. Deriving Knowledge from Data at Scale Participants make their entries

41. Deriving Knowledge from Data at Scale Competitions are judged based on predictive accuracy

42. Deriving Knowledge from Data at Scale Competition Mechanics Competitions are judged on objective criteria

43. Deriving Knowledge from Data at Scale Kaggle How They Won It…

44. Deriving Knowledge from Data at Scale

45. Deriving Knowledge from Data at Scale

46. Deriving Knowledge from Data at Scale Three Files
• ford_train: 510 trials, ~1,200 observations each, spaced by 0.1 sec -> 604,330 rows
• ford_test: 100 trials, ~1,200 observations per trial -> 120,841 rows
• example_submission.csv

47. Deriving Knowledge from Data at Scale Junpei Komiyama (#4)

48. Deriving Knowledge from Data at Scale Junpei Komiyama (#4)

49. Deriving Knowledge from Data at Scale Mick Wagner (#2)

50. Deriving Knowledge from Data at Scale Mick Wagner (#2)

51. Deriving Knowledge from Data at Scale Inference (#1)

52. Deriving Knowledge from Data at Scale VicRoads has an algorithm they use to forecast travel time on Melbourne freeways (taking into account time, weather, accidents, etc.). Their current model is inaccurate and somewhat useless. They want to do better (or at least find out whether it’s possible to do better).

53. Deriving Knowledge from Data at Scale

54. Deriving Knowledge from Data at Scale François GUILLEM (#14)

55. Deriving Knowledge from Data at Scale #1 used Random Forests

56. Deriving Knowledge from Data at Scale

57. Deriving Knowledge from Data at Scale Homework Week 6, due Monday Sept. 21st. Upload to site… http://blog.kaggle.com/category/dojo/
The content is 10 pages of interviews on how the team(s) built their models; some teams have multiple interviews. You will review at least 10 interviews; bounce around, do not go sequentially. For each, note:
1) what model(s) they used,
2) insights they had that influenced modeling,
3) what feature creation and selection they did,
4) other observations.
I will consolidate all of these and upload them as a shared document on our site.

58. Deriving Knowledge from Data at Scale 5 Minute Break…

59. Deriving Knowledge from Data at Scale Course Project

60. Deriving Knowledge from Data at Scale

61. Deriving Knowledge from Data at Scale https://www.kaggle.com/c/springleaf-marketing-response Determine whether or not to send a direct mail piece to a customer

62. Deriving Knowledge from Data at Scale The Data

63. Deriving Knowledge from Data at Scale The Rules

64. Deriving Knowledge from Data at Scale

65. Deriving Knowledge from Data at Scale

66. Deriving Knowledge from Data at Scale

67. Deriving Knowledge from Data at Scale what is the data telling you

68. Deriving Knowledge from Data at Scale

69. Deriving Knowledge from Data at Scale

70. Deriving Knowledge from Data at Scale Data Wrangling

71. Deriving Knowledge from Data at Scale Data Acquisition -> Data Exploration -> Pre-processing -> Feature and Target construction -> Train/Test split -> Feature selection -> Model training -> Model scoring (per model) -> Evaluation (per model) -> Compare metrics
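The later stages of this workflow can be sketched end to end in a few lines. Everything here is an illustrative assumption: the data are synthetic, the "model" is a one-feature threshold rule, feature selection is a simple gap between class means, and the second model is a majority-class baseline included only so that there are two sets of metrics to compare.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Train/test split: shuffle a copy of the rows and cut it in two."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def class_means(train, feature):
    pos = [x[feature] for x, y in train if y == 1]
    neg = [x[feature] for x, y in train if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def class_mean_gap(train, feature):
    """Feature-selection score: distance between the two class means."""
    mean_pos, mean_neg = class_means(train, feature)
    return abs(mean_pos - mean_neg)

def fit_threshold(train, feature):
    """Model training: split at the midpoint between the two class means."""
    mean_pos, mean_neg = class_means(train, feature)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x[feature] > threshold else 0
    return lambda x: 1 if x[feature] < threshold else 0

def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    ones = sum(y for _, y in train)
    majority = 1 if 2 * ones >= len(train) else 0
    return lambda x: majority

def accuracy(model, rows):
    """Model scoring and evaluation: fraction of correct predictions."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Data acquisition and feature/target construction (synthetic, hypothetical data):
# the target depends on "signal" only; "noise" is irrelevant by construction.
rng = random.Random(42)
rows = []
for _ in range(200):
    features = {"signal": rng.random(), "noise": rng.random()}
    rows.append((features, 1 if features["signal"] > 0.5 else 0))

train, test = train_test_split(rows)
selected = max(["signal", "noise"], key=lambda f: class_mean_gap(train, f))
candidate = fit_threshold(train, selected)
baseline = fit_majority(train)

# Compare metrics on the held-out test set only.
metrics = {
    "threshold model": accuracy(candidate, test),
    "majority baseline": accuracy(baseline, test),
}
print(selected, metrics)
```

Feature selection and training both see only the training rows; the test rows are touched once, at evaluation, which is the point of placing the split before those steps in the workflow.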

72. Deriving Knowledge from Data at Scale • Data preparation is by far the most time-consuming step. [Chart: relative effort (%) of the KDDM steps (Understanding of Domain, Understanding of Data, Preparation of Data, Data Mining, Evaluation of Results, Deployment of Results), as estimated by Cabena et al., Shearer, and Cios and Kurgan]

73. Deriving Knowledge from Data at Scale Out of Class Reading, highly recommended

74. Deriving Knowledge from Data at Scale Out of Class Reading, highly recommended

75. Deriving Knowledge from Data at Scale
1. Do you have domain knowledge?
2. Are your features commensurate?
3. Do you suspect interdependence of features?
4. Do you need to prune the input variables?
5. Do you need to assess features individually?
6. Do you need a predictor?
7. Do you suspect your data is “dirty”?
8. Do you know what to try first?
9. Do you have new ideas, time, computational resources, and enough examples?
10. Do you want a stable solution?
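For question 5, one common way to assess features individually is to rank them by a univariate score against the target. The slide does not name a method, so the following sketch simply assumes absolute Pearson correlation as the score; the feature names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (feature, |correlation with target|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: six customers, three candidate features, a 0/1 target.
columns = {
    "age": [25, 30, 35, 45, 50, 55],
    "shoe_size": [40, 43, 41, 42, 40, 43],
    "constant": [1, 1, 1, 1, 1, 1],
}
target = [0, 0, 0, 1, 1, 1]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A univariate ranking like this cannot see interactions between features (question 3), so a feature that scores poorly alone may still be useful in combination with others.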

76. Deriving Knowledge from Data at Scale

77. Deriving Knowledge from Data at Scale

78. Deriving Knowledge from Data at Scale

79. Deriving Knowledge from Data at Scale
