Published on April 24, 2016
1. Roger S. Barga, Ph.D. General Manager Amazon Web Services Driving Business Value with Data Science
2. Recent Experience
3. Fielded Solutions • Customer Segmentation & Targeting • Which category item will customer buy next? • Azure ML Reference Customer • Predictive Analytics to reduce school dropout rate • Predictive model identifies which students are at risk of dropping out at K12 • Predictive Model shows when JLL can charge above or below the market for a specific deal • Predictive maintenance & Internet of Things • Built model to predict causes of elevator failure • Reference Customer for Azure ML and ISS Reference Customers
4. Predictive Maintenance at ThyssenKrupp ThyssenKrupp partnered with Microsoft to build a new predictive maintenance solution to improve service margins for its elevator business • Great Internet of Things example • Used ISS and Azure Machine Learning • ML model predicts top causes of failure in an elevator – 5M elevators in production, $400 cost savings annually. Key Benefits • Ease of use across skillsets • Ease of deployment • Increased productivity Now we have the ability to use live data to define the needed repair before a breakdown happens, reducing costs for ourselves and our customers. Dr. Rory Smith ThyssenKrupp
5. Customer Targeting With Machine Learning Problem: To leverage the history of a person’s behavior on Microsoft.com to identify their interests and predict future actions Findings: • Opportunity to provide upsell after users hit on Microsoft online products such as Bing, SkyDrive, Xbox Live, Zune • Target messaging on Windows Phone extends the functionality of Microsoft products Methodology: • Big Data Platform – HDP for Windows/Azure HDInsight and Advanced Analytics support • Develop statistical models to determine the probability of users buying a Surface Device
6. Preventing Network Intrusion with Machine Learning Problem: Early detection of suspicious activity on the network servers & eliminate the threat. Methodology: • File system to store massive security data. • Fully automated workflow to drive end-to-end data receiving and transformation process. • Analysis and visualizations of Windows Events to identify pre-defined threat scenarios. • Move from descriptive analytics to a mature predictive archetype.
7. A Sample Project
8. • Create a demo to show the potential of Predictive Analytics • Develop a demo to answer the question “What factors drive our client to charge over or below market rates?” • Create 2 predictive models to predict: • Whether our client can charge over the market average for Landlords, and • Whether our client can charge below the market for Tenants • Develop strategies to explain key factors that drive these outcomes • Visualize results in Power BI.
9. Building Predictive Models [Diagram: five numbered steps leading to Business Insights] Note: This is a variant of the Cross-Industry Standard Process for Data Mining (CRISP-DM)
10. Conceptual Solution Data Pre-processing on Hadoop (Hive queries) Data Preparation and Predictive Models with Machine Learning Source Data #1 Source Data #2 Visualization in Power BI
11. How to Use the Predictive Model Predictive Model Data on a new deal 1 = You can charge above the market average 0 = You can charge below market Broker
12. Data Preparation • Source data: 1 internal and 1 external data source • Internal data source prepared on Hadoop cluster • Both datasets joined in our internal Machine Learning tool • New column created to determine when our client charges above or below market average Data Source #1 Data Source #2
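The join and the new target column described on this slide can be sketched in a few lines. This is a minimal illustration only: the field names (`deal_id`, `market`, `rate_per_sqft`, `avg_rate_per_sqft`) and all values are hypothetical, and the deck does not say which keys or tools were actually used for the join.

```python
# Hypothetical records: field names and values are made up for illustration.
internal_deals = [
    {"deal_id": 1, "market": "Chicago", "rate_per_sqft": 42.0},
    {"deal_id": 2, "market": "Chicago", "rate_per_sqft": 35.5},
    {"deal_id": 3, "market": "Seattle", "rate_per_sqft": 51.0},
]
external_market = [
    {"market": "Chicago", "avg_rate_per_sqft": 38.0},
    {"market": "Seattle", "avg_rate_per_sqft": 55.0},
]


def join_and_label(deals, market_rows):
    """Inner-join deals to market averages and add a 0/1 target column."""
    avg_by_market = {row["market"]: row["avg_rate_per_sqft"] for row in market_rows}
    labeled = []
    for deal in deals:
        avg = avg_by_market.get(deal["market"])
        if avg is None:
            continue  # no matching market row: drop, as an inner join would
        row = dict(deal)
        row["market_avg"] = avg
        # 1 = charged above the market average, 0 = at or below it
        row["above_market"] = 1 if deal["rate_per_sqft"] > avg else 0
        labeled.append(row)
    return labeled


labeled_deals = join_and_label(internal_deals, external_market)
```

The `above_market` column plays the role of the slide's "new column": it is the label the predictive model is later trained to predict.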
13. Predictive Model • Tested several algorithms including Logistic Regression, Boosted Decision Trees, etc. • Models were trained with 10-fold cross validation. • Boosted Decision Trees was the best algorithm – see ROC curve • Area under curve for Boosted Decision Trees was 92.4%!
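For readers unfamiliar with 10-fold cross validation, the sketch below shows the mechanics: the data is split into ten folds, and each fold is held out once while the model is trained on the other nine. The helper names (`k_fold_indices`, `cross_validate`) and the `fit`/`score` callables are assumptions made for illustration; the deck does not show how its tool implements this.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(rows, labels, fit, score, k=10):
    """Average a score over k held-out folds.

    `fit(rows, labels)` returns a model; `score(model, rows, labels)` returns
    a number. Both are supplied by the caller (hypothetical interface).
    """
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        fold_scores.append(
            score(model, [rows[i] for i in test_idx], [labels[i] for i in test_idx])
        )
    return sum(fold_scores) / len(fold_scores)
```

Comparing algorithms then amounts to calling `cross_validate` once per candidate with the same folds and comparing the averaged scores.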
14. Predictive Model for Landlords - Results • Boosted Decision Trees - Area under the curve = 92.4%! • Logistic Regression - Area under the curve = 81.2%!
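Area under the ROC curve can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 92.4% corresponds to 0.924 on the usual 0 to 1 scale. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not the deck's data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine
    for a sketch but slow for large data sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(y_true, y_score)  # 8 of 9 positive/negative pairs are ordered correctly
```

A value of 0.5 means the scores rank positives no better than chance, which is the baseline the two reported results should be judged against.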
15. Visualization in Power BI
16. Industry Overview: Financial Services Banks, Insurance, Real Estate Data Science applied to the Financial Services sector enables insights into: “The opportunity for the Financial sectors are to unlock the potential in their data through analytics and shape the strategy for business through reliable factual insight rather than intuition…” - Deloitte, 2013 Fraud & Financial Crimes • Enterprise fraud and financial crimes • Fraud Detection • Credit Risk Management Analytics • Actuarial analysis, portfolio management and rate making • Forecasting and econometrics • Predictive analytics and data mining • Mathematical optimization and simulations Marketing & Customer Experience • Social media analytics • Customer Segmentation • Customer Targeting Customer Experience Enhancement • Clickstream analysis • Customer lifecycle management • Dynamic profiling and enhanced customer segmentation
17. Industry Overview: Healthcare Providers, Payers, Pharmaceuticals & Biotechnology Data Science applied to the Healthcare sector enables insights into: “Predictive analytics addresses today's pressing challenges in healthcare effectiveness and economics by improving operations across the spectrum of healthcare functions…” - Predictive Analytics World Healthcare, 2014 Quality & Outcomes • Readmissions Avoidance Analysis • Health outcomes • Patient safety Consumer Analytics • Customer acquisition • Health intervention • Member & Population Health • Value-based care and payment models • Membership portfolio optimization Risk & Incentives • A holistic view of patient episodes • Value-based care and payment models Care Delivery • Health care cost analytics • Performance management • Workforce planning Cost Containment • Fraud and improper payments • Eligibility fraud • Enterprise case management
18. Industry Overview: Oil & Gas Oil & Gas Producers, Oil Equipment, Services & Distribution, Alternative Energy Data Science applied to the Oil and Gas sector enables insights into: Oil Field Analytics • Seismic analyses • Reservoir characterization • Drilling optimization • Unconventional completions • Production forecasting Assets & Operations • Facility integrity • Demand forecasting • Integrated operations and logistics • Operational risk/environment, health and safety (EH&S) Data Management • Complex Event Processing • Data Quality • Master Data Management “Access to more information from multiple sources and disciplines and more sophisticated analytics will improve the oil and gas industry's ability to optimize production… Analytics will provide a way to bring optimization from statisticians to the business.” – IDC, 2013
19. How to be Successful
20. How to be successful? 1. Create value 2. Capture some for yourself
21. How to create value (as a data scientist) Extract insights from data for decision support
22. Productive Use of Time Have a bias against writing learning algorithms • Have a bias in favor of leveraging 3rd party implementations…
23. Productive Use of Time Have a bias against writing learning algorithms • Bias in favor of leveraging 3rd party implementations • Add data: more information beats better algorithms
24. Productive Use of Time Have a bias against writing learning algorithms • Bias in favor of leveraging 3rd party implementations • Add data: more information beats better algorithms You will write data manipulation algorithms • Data is surprising enough, need algorithm certainty • Defect count is proportional to line count • Use as high level a language as possible
25. Analysis and Diminishing Returns First few models tend to capture most of the value
26. Analysis and Diminishing Returns First few models tend to capture most of the value Distinguish between: • Marginal improvements important (e.g., search, WalMart); • Marginal improvements unimportant (typical).
27. Analysis and Diminishing Returns First few models tend to capture most of the value Distinguish between: • Marginal improvements important (e.g., search, WalMart); • Marginal improvements unimportant (typical). Latter case: get first 80%, move to new problem
28. The Importance of Starting Small
29. The Importance of Starting Small When you first encounter a data set, you know nothing. • Ergo: first piece of data is very informative. • Think of data set utility as roughly logarithmic in size.
30. The Importance of Starting Small When you first encounter a data set, you know nothing. • Ergo: first piece of data is very informative. • Think of data set utility as roughly logarithmic in size. Don’t require a large data set before starting analysis.
31. The Importance of Starting Small When you first encounter a data set, you know nothing. • Ergo: first piece of data is very informative. • Think of data set utility as roughly logarithmic in size. Don’t require a large data set before starting analysis. Always try things out on small portions of data first.
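One common way to "try things out on small portions of data first" is to draw a uniform random sample in a single pass, without loading the full data set. The sketch below uses reservoir sampling (Algorithm R); the function name `reservoir_sample` and the sample size are assumptions chosen for illustration, not something the deck prescribes.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row  # replace a kept row with probability k / (i + 1)
    return sample


# Stand-in for a large file or query result: any iterable works.
small = reservoir_sample(range(1_000_000), k=1000)
```

Because only `k` rows are held in memory, the same function works on a file handle or a database cursor, which keeps early experiments quick.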
32. Timescales and Failing Fast 1. Immediate zone: less than 60 seconds • 100s per day 2. Bathroom break zone: less than 5 minutes • 10s per day 3. Lunch zone: less than an hour • 5 per day 4. Overnight zone: less than 12 hours • 1 per day
33. Timescales and Failing Slow 1. Immediate zone: less than 60 seconds • 100s per day 2. Bathroom break zone: less than 5 minutes • 10s per day 3. Lunch zone: less than an hour • 5 per day 4. Overnight zone: less than 12 hours • 1 per day
34. Timescales and Failing Fast 1. Immediate zone: less than 60 seconds • 100s per day 2. Bathroom break zone: less than 5 minutes • 10s per day 3. Lunch zone: less than an hour • 5 per day 4. Overnight zone: less than 12 hours • 1 per day
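The per-day counts on these slides follow from simple arithmetic. The sketch below assumes an 8-hour working day with runs executed back to back, which is an assumption of this example rather than something the deck states; the results are upper bounds that match the slide's orders of magnitude.

```python
WORKDAY_SECONDS = 8 * 60 * 60  # assumption: an 8-hour working day

# Upper limit of each zone, in seconds, as given on the slide.
zones = {
    "immediate": 60,
    "bathroom break": 5 * 60,
    "lunch": 60 * 60,
    "overnight": 12 * 60 * 60,
}


def max_runs_per_day(seconds_per_run, day_seconds=WORKDAY_SECONDS):
    """Upper bound on back-to-back runs in a day.

    A run longer than the working day still counts once, because it can be
    left to finish overnight.
    """
    return max(1, day_seconds // seconds_per_run)


runs = {name: max_runs_per_day(secs) for name, secs in zones.items()}
# immediate: 480, bathroom break: 96, lunch: 8, overnight: 1
```

The gap between the first and last zone is more than two orders of magnitude, which is the quantitative case for keeping experiments in the immediate zone.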
35. Failing Fast: Summary 1. Move code to data, not the converse! 2. Do feature engineering with a fast learning algorithm (e.g., linear), then switch to a slower algorithm for the final product (e.g., GBDT, NN). 3. Subsample your data intelligently. 4. Fewer examples (rows), e.g., imbalanced classification. 5. Fewer features (columns), e.g., random projections.
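For imbalanced classification, one way to subsample "intelligently" is to keep every example of the rare class and only a random subset of the common class, recording a weight so the original class balance can be recovered later. The sketch below is one such scheme under stated assumptions: binary 0/1 labels, a hypothetical function name `downsample_majority`, and a `ratio` parameter introduced here for illustration.

```python
import random


def downsample_majority(rows, labels, ratio=1.0, seed=0):
    """Keep every minority-class row and a random subset of the majority class.

    `ratio` is the desired majority:minority ratio in the output.
    Returns (rows, labels, weights); the weight on each kept majority row
    says how many original rows it stands in for.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n_keep = min(len(majority), int(ratio * len(minority)))
    kept_majority = rng.sample(majority, n_keep)
    weight = len(majority) / n_keep if n_keep else 0.0
    keep = sorted(minority + kept_majority)
    majority_set = set(kept_majority)
    return (
        [rows[i] for i in keep],
        [labels[i] for i in keep],
        [weight if i in majority_set else 1.0 for i in keep],
    )
```

With 10 positives among 1,000 rows and the default ratio, training runs on 20 rows instead of 1,000 while the weights still sum to the original row count.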
36. Productivity demands debugging as fast as possible. Stay in the immediate zone
37. Proxy Metrics Proxy Metric: Something you can measure and optimize • Revenue per impression • Clickthrough rate • Reciprocal communication rate • Polling results • Gene expression levels • Value at risk
38. Proxy Metrics Reality Reality: Something you actually care about Revenue per impression Economic Value Created Clickthrough rate User Experience Quality Reciprocal communication rate Match Quality Polling results Election Outcome Gene expression levels Drug Efficacy in Vivo Value at risk Portfolio Quality
39. Proxy Metrics vs. Reality
40. Agree on the OEC (Overall Evaluation Criterion) A concrete goal begets concrete stopping conditions and concrete acceptance criteria. The less specific the goal, the likelier that the project will go unbounded, because no result will be "good enough." If you don't know what you want to achieve, you don't know when to stop trying – or even what to try. When the project eventually terminates – because either time or resources run out – no one will be happy with the outcome…
41. Key Takeaways Think about your data, not about your software. Productivity is about not waiting for answers. Mind the gap (between proxy metrics and reality). Agree upon the OEC with business stakeholders. Best Defense: close collaboration with a business expert.
42. Know Your (re)Sources
43. You can make much stronger inferences about a woman named Brittany. That name was very popular from the mid-1980s through the mid-1990s, but it wasn’t all that common before and hasn’t been since. If you know a Brittany, she is probably of college age or just a bit older. Half of living American Brittanys are between the ages of 19 and 25.
44. Blogs to Follow… • FastML, covering practical applications of machine learning and data science • Hilary Mason blog, from Bitly Chief Scientist, covering Data Science and Machine Learning on Big Data • Hunch.net, by John Langford, a leading applied machine learning researcher; his blog covers the intersection of theory and practice • Kaggle blog No Free Hunch, covering Kaggle data science and machine learning competitions • KDnuggets, news, jobs, software, events, and more in Data Mining and Data Science research and applications • Normal Deviate by Larry Wasserman, CMU Prof. of Statistics and Machine Learning • Statistical Modeling, Causal Inference, and Social Science by Andrew Gelman • Three-Toed Sloth by Cosma Shalizi • FiveThirtyEight Blog by Nate Silver, a very popular and non-technical blog covering analytics applied mainly to politics and sports
45. Blogs to Follow… • Data Mining Research blog by Sandro Saitta • Data Mining: Text Mining, Visualization, and Social Media, by Matthew Hurst, a leading data scientist at Microsoft • DecisionStats, by Ajay Ohri, covering business analytics and R, with practical examples, and interviews of field leaders • Geeking with Greg, by Greg Linden, inventor of Amazon recommendation engine and internet entrepreneur • IA Ventures blog, one of the leading Big Data venture capitalists Roger Ehrenberg and team • Occam's Razor, by Avinash Kaushik, brilliant Digital Marketing Evangelist at Google • R-bloggers, best blogs from the community of R, with code, examples, and visualizations • Smart Data Collective, an aggregation of blogs from many interesting data science people • Steve Miller blog, covering data science, statistics, R, and other topics at Information Management • Tom H. C. Anderson blog, focusing on market research with data and text mining • What's the Big Data, by Gil Press. Gil covers the Big Data space and also writes a column on Big Data and Business in Forbes.
46. THANK YOU!