Published on March 16, 2014
Data Scientist Enablement
DSE 400 - Fast Track to Data Science
Week 8 Roadmap
Advanced Center of Excellence
Modern Renaissance Corporation
In collaboration with the SONO team and others
Content of this document is under Creative Commons Licence CC BY 4.0
Agenda
Week 8 Overview
Mission Statement
Discussions
Learning Path
Activities
Assignment Submission
DSE Program Timeline
Adaptive Learning Options
References
You can always find the latest version of this document at http://bit.ly/1qbXns0
“Charity and personal force are the only investments worth anything.” - Walt Whitman
Mission and Objectives
The mission of our program is to provide free, open and world-class enablement of Data Scientists and to help advance the profession of Data Science and its allied disciplines. We aim to equip participants with analytical and practical skills, emphasizing breadth and depth across relevant disciplines and capabilities in Data/Decision Sciences, Big Data Analytics, Architecture and Systems Engineering.
DSE 400 - Week 8 at a glance
Social Discourse: Discuss ethics around Big Data. Test drive R-COP and Modern Data Platforms-COP.
Learning plan: Read about Data Quality. Watch the on-demand webinar.
Activities: Explore Google datasets. Start a blog on Big Data. Continue your Personal Roadmap.
Assignment 8: Cleanse and visualize the Sensor dataset. Alternatively, do a case study, write a blog post or create a mini-documentary.
Social Engagement - Week 8
Discussion: Read Big Data’s Dangerous New Era of Discrimination. Research and reflect on how Big Data and its associated technologies are misused or applied unethically. Share your views on how this can be rectified.
You can participate in this discussion on LinkedIn, Facebook and Google+. Discussions on SONO will continue as planned on DSE 400 Jump Pad. This gives participants more choice, and we hope it will increase social engagement.
Check out the Language R and Modern Data Platforms Communities of Practice (COPs) to help you increase your competence in R, Machine Learning, the Hadoop ecosystem and other platforms. Reach out to Olivia Ramirez, Ellen Brock or Manju Rupani if you want to contribute to these communities or if you have any suggestions.
Discussion forums: SONO, LinkedIn, Facebook, Google+
Recommended Learning Plan
Read Big Data’s Dangerous New Era of Discrimination
Watch Human Ethical Aspects of Big Data by Grady Booch
<Optional> Read Get a Handle on Big Data Quality
<Optional> Watch Big Data Integration and Governance Use Case - IBM OnDemand Webinar
<Optional> Big Data and the Ethics and Challenges of Living in a Connected Society. O'Reilly Webcast
<Optional> Big Data: Usage, Ethics, Algorithms by Vladislav Shershulsky
Activities
<Practice> Visit the Google Public Data Directory. Check out the Global Competitiveness Report. Compare your country’s GDP per capita with the world average. Also compare your country’s capacity for innovation with that of other countries in your region, and explore other dimensions in this dataset pertinent to your analysis or area of focus.
<Practice> Continue learning and experimenting with R and the Hadoop ecosystem through R-COP and Modern Data Platforms-COP. Seek and share advice, knowledge and resources. Reach out to Ellen Brock, Manju Rupani or Olivia Ramirez if you want to play a more active role in these communities.
<Practice> Write a blog post on the ongoing disruption in the education sector. Explore sites like Stanford Journal of Social Innovation, blogs.hbr.org, Innovation Excellence, Forbes.com or Asoka Foundation to see if you can publish your blog posts on these communities.
Examples of Infographics Source: UNICEF
Activities
<Practice> Infographics are graphic visual representations of information, data or knowledge intended to present complex information quickly and clearly. Read 10 Free Tools for Creating Infographics. Research a cause you seriously care about and produce a one-page infographic on it. Human rights, the environment, poverty elimination, the fight against child labor, corruption, equality, religious harmony and prevention of cruelty to animals are a few examples of causes people around the planet care about.
<Practice> Continue your earlier work on your Personal Career Advancement Roadmap (or start afresh if you haven’t begun). Revise it to take advantage of the certification options available in the DSE program. Read or listen to Malcolm Gladwell’s Outliers: The Story of Success.
<Optional> <Advanced Research> Techniques for Fraud Prevention. Read Improving Credit Card Fraud Prevention Using a Meta Learning Strategy and explore how this framework can be applied to robust solutions for fraud prevention in your industry.
Assignment 8 - Submission Required
Option A - HDP 2.0 | R-sqldf | BigQuery
Download the Hortonworks Sensor Data from Amazon. Using HDP 2.0 (or its equivalent), R-sqldf or Google BigQuery, complete the following:
a) Import the raw data and clean it using the HiveSQL script (see next slide) or an equivalent technique.
b) Download and import the cleansed data (hvac_building) into a spreadsheet such as Google Spreadsheet, Excel or OpenOffice Calc, or into any visualization tool you are familiar with.
c) Visualize the data, showing the geographic distribution pattern based on data from the hvac_building table.
You may reach out to Rachel Fleming <email@example.com> if you have any difficulties with the assignments or are looking for more challenging assignments or activities.
Assignment 8 - Option A HiveSQL Scripts
Extract the raw tables hvac and building from the Sensor.zip file, then execute the following HiveSQL scripts to generate the tables hvac_temperatures and hvac_building.

create table hvac_temperatures as
select *,
  targettemp - actualtemp as temp_diff,
  IF((targettemp - actualtemp) > 5, 'COLD', IF((targettemp - actualtemp) < -5, 'HOT', 'NORMAL')) AS temprange,
  IF((targettemp - actualtemp) > 5, '1', IF((targettemp - actualtemp) < -5, '1', 0)) AS extremetemp
from hvac;

create table if not exists hvac_building as
select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
from building b join hvac_temperatures h on b.buildingid = h.buildingid;

(Source: Hortonworks)
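For participants who want to check their understanding of the cleansing logic before running it on a cluster, the two HiveSQL statements can be mirrored in plain Python. This is only a minimal sketch, not part of the Hortonworks material: the function names classify and join_buildings are hypothetical, rows are assumed to be dictionaries keyed by the column names used in the script, and buildingid is assumed to be unique in the building table.

```python
def classify(targettemp, actualtemp, threshold=5):
    """Mirror the derived columns of hvac_temperatures for one reading.

    Note: the HiveSQL script emits '1' (a string) or 0 for extremetemp;
    this sketch uses the integers 1 and 0 for simplicity.
    """
    temp_diff = targettemp - actualtemp
    if temp_diff > threshold:
        temprange = "COLD"      # actual is more than 5 degrees below target
    elif temp_diff < -threshold:
        temprange = "HOT"       # actual is more than 5 degrees above target
    else:
        temprange = "NORMAL"
    extremetemp = 0 if temprange == "NORMAL" else 1
    return {"temp_diff": temp_diff,
            "temprange": temprange,
            "extremetemp": extremetemp}


def join_buildings(hvac_rows, building_rows):
    """Mirror hvac_building: inner join of classified readings and buildings.

    Assumes each buildingid appears at most once in building_rows.
    """
    buildings = {b["buildingid"]: b for b in building_rows}
    joined = []
    for h in hvac_rows:
        b = buildings.get(h["buildingid"])
        if b is None:
            continue            # inner join: drop readings with no building
        row = dict(h)
        row.update(classify(h["targettemp"], h["actualtemp"]))
        for col in ("country", "hvacproduct", "buildingage", "buildingmgr"):
            row[col] = b[col]
        joined.append(row)
    return joined


if __name__ == "__main__":
    # Made-up sample rows for illustration only, not taken from Sensor.zip.
    hvac = [{"buildingid": 1, "targettemp": 70, "actualtemp": 58},
            {"buildingid": 2, "targettemp": 68, "actualtemp": 69}]
    building = [{"buildingid": 1, "country": "Finland", "hvacproduct": "X1",
                 "buildingage": 12, "buildingmgr": "M1"}]
    for row in join_buildings(hvac, building):
        print(row)
```

A difference of exactly 5 degrees in either direction counts as NORMAL, because the script uses strict comparisons.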
Assignment 8 - Submission Required
Options B, C and D
Option B - Data-Driven Philanthropy
Do a case study on how organizations like Red Cross, UNICEF, Gates Foundation or Oxfam are using data-driven strategies to promote global health and development.
Option C - Ethics and Big Data
Write a blog post or short article on the ethical application of Big Data technologies in the areas or sectors you care about. (Fighting poverty, child labor, illiteracy and ecological degradation are a few examples.)
Option D - Biopic or mini-documentary on Florence Nightingale
Florence Nightingale came to prominence for her outstanding service orientation and for originating modern nursing practices. She also employed statistics and data-driven decision management approaches. Research Florence Nightingale and produce a short biopic or mini-documentary about her.
You may reach out to Rachel Fleming <firstname.lastname@example.org> if you have any difficulties with the assignments or are looking for more challenging assignments or activities.
Submission in PDF format is required
Recommended Deadline: Saturday, 11:59 PM your local time. If you can’t submit your assignment on time, please complete it and turn it in ASAP. While there is no penalty for late submission, turning in assignments on time will help you focus on next week’s lessons.
Mail Assignment 8 to <email@example.com> with DSE 400 > Assignment 8 in the subject line.
Submit a single PDF document showing your queries and result samples. Include screenshots as necessary.
For the sake of consistency, name your document using the convention DSE 400 - Assignment 8 - Your Full Name.
Do not send document links. Only one single document, in PDF format, is accepted.
DSE Program 2014 timeline
Modules: Fast Track to Data Science (DSE 400), Machine Learning with R (DSE 501), Modern Data Platforms (DSE 502), Advanced Techniques in Big Data Analytics (DSE 600)
Module periods: Jan 19 - Mar 15, Mar 30 - May 10, May 25 - July 5, July 20 - Aug 30
Adaptive Learning Options
Data Scientist Enablement program

Maturity   Composite Score *   Proficiency               Certificate
Level 5    > 90                Innovating Capability     Black Belt
Level 4    > 80 and <= 90      Architectural Capability  Green Belt
Level 3    > 70 and <= 80      Solutioning Capability    Yellow Belt
Level 2    > 60 and <= 70      Basic Understanding       Completion
Level 1    <= 60               Basic Familiarity         Audit

* The composite score takes into account participants' performance in assignments, activities, projects, social engagement, collaboration, team development, publications, advanced research and so on, across all 4 modules of the DSE program.
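The score bands above can be read as a simple lookup from composite score to maturity level. The sketch below is illustrative only: the function name maturity_for is hypothetical, and how the program actually computes the composite score is not specified here, so the function takes the score as given.

```python
def maturity_for(score):
    """Map a composite score to (level, proficiency, certificate).

    Each band is open at its lower bound and closed at its upper bound,
    matching the '> 80 and <= 90' style of the table.
    """
    bands = [
        (90, 5, "Innovating Capability", "Black Belt"),
        (80, 4, "Architectural Capability", "Green Belt"),
        (70, 3, "Solutioning Capability", "Yellow Belt"),
        (60, 2, "Basic Understanding", "Completion"),
    ]
    for lower, level, proficiency, certificate in bands:
        if score > lower:
            return level, proficiency, certificate
    # Anything at or below 60 falls into Level 1.
    return 1, "Basic Familiarity", "Audit"


if __name__ == "__main__":
    for s in (95, 90, 75, 60):
        print(s, maturity_for(s))
```

Note that a score of exactly 90 maps to Level 4, not Level 5, because the Level 5 band starts strictly above 90.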
References, Resources and Additional Reading
Ethics of Big Data. Davis and Patterson. O’Reilly Publications. 2012
Outliers: The Story of Success. Malcolm Gladwell. Little, Brown and Company. 2008
SQL Tutorial. W3Schools.com
Improving Credit Card Fraud Prevention Using a Meta Learning Strategy. Joseph King-Fung Pun. 2011
17 short tutorials all Data Scientists should read (and practice). Dr. Granville. Data Science Central
Hadoop Illuminated. Kerzner and Maniyam. Hadoop Illuminated LLC. 2013
Hadoop: The Definitive Guide. 3rd Edition. Tom White. O’Reilly Publications. 2012
MapReduce: Simplified Data Processing on Large Clusters. Dean and Ghemawat. Google. 2004
[MIT OCW] How to Process, Analyze and Visualize Data. Marcus & Wu. 2012
[MIT OCW] Ethical Practice: Professionalism, Social Responsibility… Prof Leigh Hafrey. 2012
Big Data - Hadoop, Hive, Pig and Hbase video collection
Modern Data Platforms-Community of Practice
Language R-Community of Practice
Data Science Enablement playlist
Citation
Content that appears as is, in this document only, is under Creative Commons License CC BY 4.0. This license does not necessarily apply to other material referenced in this document.
The Sensor dataset used in this week’s assignment is attributed to Hortonworks. This dataset is not available under a Creative Commons Licence.
Content from IBM, Hortonworks, Google, YouTube, Data Science Central, O’Reilly Media and others is excluded from the above Creative Commons License.
For More Information
Week 8 discussions take place during this week on the DSE 400 forums on LinkedIn, Facebook, Google+ and SONO. There is also an active Q&A session for everyone's benefit.
Also check out the Language R-Community of Practice if you would like to advance your competence in R or contribute to this community.
<Mentoring On Demand> You may reach out to Ms. Rachel Fleming <firstname.lastname@example.org> if you have any difficulties with the assignments or are looking for more challenging activities. If you need a mentor or someone to help you accelerate along the DSE program, you may reach out to Vishal Kumar <email@example.com> or Ligia Buzan <firstname.lastname@example.org>.
We welcome questions, thoughts and suggestions. Post these in the right forums/discussions or write to us at <email@example.com>.
You can always find the latest version of this document and other DSE 400 roadmaps at http://bitly.com/bundles/o_4ldaljhta4/1