Open Data Science on Hadoop in the Enterprise


Published on October 5, 2016

Author: continuumio

Source: slideshare.net

1. © 2016 Continuum Analytics - Confidential & Proprietary. Open Data Science on Hadoop in the Enterprise: From Sandbox to Production. Peter Wang, CTO, Co-founder

2. Overview
• Open Data Science and Anaconda
• Architecture & challenges of real-world Hadoop
• Anaconda for Open Data Science in the Enterprise

3. Open Data Science is… an inclusive movement that makes the open source tools of data science (data, analytics, & computation) easily work together as a connected ecosystem.

4. Open Data Science means… Availability | Innovation | Interoperability | Transparency, for everyone in the data science team. OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION.

5. Data Science is not just Machine Learning…
• Distributed Systems
• Business Intelligence
• Machine Learning / Statistics
• Web
• Scientific Computing / HPC

6. Data Science is Interdisciplinary…
• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores

7. Open Source Communities Create Powerful Technology for Data Science
• Projects: Numba, dask, xlwings, Airflow, Blaze
• Domains: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics

8. Python is the common language
• Projects: Numba, dask, xlwings, Airflow, Blaze
• Domains: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics

9. Python’s Not the Only One…
• SQL
• Domains: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics

10. But it’s also a Great Glue Language
• SQL
• Domains: Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, Scientific Computing / HPC

11. Anaconda is the Open Data Science Platform bringing technology together…
• Projects: Numba, dask, xlwings, Airflow, Blaze
• Domains: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics

12. Open Data Science: Vibrant and Growing Community
• Python Community: 30M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 60%+
• Anaconda Downloads: 8M+

13. Open Data Science Platform: ACCELERATE. CONNECT. EMPOWER.

14. ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by a high-performance open data science platform
CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models
EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business

15. Common Architectures of Real-world Hadoop Environments

16. Major Components
Hadoop Infrastructure:
• Hadoop Manager
• HDFS NameNode, DataNodes
• Hive, Impala servers
• YARN Resource Manager
• Spark: History server, Gateway server, compute nodes
DW / Analytics Env:
• SQL DB
• ETL systems
• Data Marts
Data Science Sandbox:
• Notebook server
• Big memory nodes
• GPU nodes

17. Anaconda Scale System Architecture

18. Diagram labels: Hadoop / Spark (& existing DW); App 1; HTTP API; Legacy ETL; App 2; Data marts; XLS, CSV; Viz servers

19. Figure. Source: Master Data Management and Data Governance, 2e

20. Figure: Data Science “Sandbox”. Source: Master Data Management and Data Governance, 2e

21. Common Problems
• The Data Science Sandbox is on an isolated network, outside of the “GRC reservation”
• Provides freedom to data scientists
• Protects production ETL, DW, event processing
• … but moving anything from Sandbox to Production is a huge pain
• Multiple orgs / LOBs interface with the Data Science team in the mixed sandbox environment
• Compliance, audit, & risk control?

22. Contrasting Concerns
Data
• Exploration: Fast, unfettered access; ease of introducing new, varied, messy datasets; reproducibility
• Production: Strict, governed access; well-defined schema; provenance & auditability
Compute Infrastructure
• Exploration: High performance; low latency, interactive; individualized & specialized
• Production: Scalable, high-availability; manageable at scale; cost amortization over many machines and users
Organization
• Exploration: Individual high-achievers with lots of context & capability; agile, able to quickly learn new skills and approaches
• Production: Sustain operations at lowest possible cost; robustness against unintended change

23. Core Challenges
• Data exploration generates insight & is required to respond to business challenges
• Production data processing & analytics involve different operational concerns
• Over-engineering for either leads to structural deficiencies
• Modern & future needs will require more agile exploration

24. The Core Challenge of Open Data Science in the Enterprise

25. Conway’s Law: The design of any piece of software reflects the communications structure of the organization that produced it.

26. Peter’s Corollary to Conway’s Law: The architecture of any business data system evolves to reflect the budget structure of the IT groups that maintain it.
… not strategic or operational needs
… not ensuring future analytical agility
… not optimizing for rapid insights

27. “Don’t Starve the Unicorns”
• How businesses are used to buying can actually push power away from exploratory data science capabilities
• Information systems have ossified into “software & hardware”, which is fine for straightforward data processing
• Not suited for human-in-the-loop production of inference, insight, knowledge

28. Data Science != Software Development
• VERY common misconception
• Python is probably the most misunderstood language
• There are “tribes” and ecosystems in Python: web dev, scipy, pydata, embedded, scripting, 3D graphics, etc.
• But businesses tend to pigeonhole it:
– IT/software/data engineering view: competes with Java, C#, Ruby…
– Analytics, stats, data science view: competes with R, SAS, Matlab, SPSS, BI systems
Python done right can be a powerful, unifying force across the business.

29. Anaconda for Open Data Science in Hadoop

30. Modern Data Science Teams Love ANACONDA
Roles: Data Scientist, Biz Analyst, Data Engineer, Developer, DevOps
Tools listed across roles: • Hadoop / Spark • Programming Languages • Analytic Libraries • IDE • Notebooks • Visualization • Spreadsheets • Visualization • Notebooks • Analytic Development Environment • Database / Data Warehouse • ETL • Programming Languages • Analytic Libraries • IDE • Notebooks • Visualization • Database / Data Warehouse • Middleware • Programming Languages

31. Anaconda Powers Teams: EXPLORE & ANALYZE | COLLABORATE & PUBLISH | DEPLOY & OPERATE
• Explore & prepare data
• Build, test, validate data science models with Python & R
• Build simulations & optimizations
• View data lineage & reuse transformations
• Leverage & explore metadata
• Create & share data science notebooks with interactive visualizations
• Identify reusable data science assets easily
• Authorize access to data science projects
• Manage & control data science asset versions
• Build & share data science packages & environments
• Launch & provision distributed environments

32. Write Once, Deploy Anywhere
OPEN DATA SCIENCE: Explore & Analyze | Collaborate & Publish | Deploy & Operate
• Servers: Linux, Windows, OSX
• GPUs & High End Workstations: Linux & Windows; NVIDIA, AMD, X86/ARM
• Clusters: Yarn, Mesos, MPI, Power8, LSF, Sun Grid Engine
• NoSQL: MongoDB, Cassandra / DataStax
• Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
• Files: Microsoft Excel, Trifacta, Import.io
• DW & SQL: Any SQL DB, Any SQL DW, Impala

33. Anaconda Architectures: ON-PREMISE | PRIVATE CLOUD | ANACONDA CLOUD

34. Mirror Anaconda Repository
Multi-Step Process
– Mirror packages from Anaconda’s public repository to a ‘Gateway’ repo.
– Testers (with authorization to access the Gateway) evaluate new packages.
– Approved packages are mirrored to the Production Repo Server.
– Standard end users now have access to updated, approved packages.
Diagram labels: Public Anaconda Repository Cloud; Gateway (Test) Repo Server; Production Repo Server; Mirror; Active Directory/LDAP (optional authentication); “Tester” (has access to Gateway Repo); End Users (have access to Prod Repo); “If an Anaconda repo can function as a gateway”
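
The multi-step process on this slide can be sketched as a small promotion pipeline. This is a toy model only, not Anaconda Repository's actual API: the `Repo` class, the `mirror` and `promote` helpers, the package names, and the approval policy are all hypothetical, chosen to show how packages reach production only after passing the gateway review.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy stand-in for a package repository: a name and a set of package specs."""
    name: str
    packages: set = field(default_factory=set)


def mirror(source, target, only=None):
    """Copy packages from source to target; if `only` is given, copy just that subset.

    Returns the set of packages newly added to the target.
    """
    candidates = source.packages if only is None else source.packages & set(only)
    added = candidates - target.packages
    target.packages |= added
    return added


def promote(public, gateway, production, approve):
    """Public -> gateway mirror, tester review, then gateway -> production mirror."""
    mirror(public, gateway)                                  # step 1: mirror everything to the gateway
    approved = {p for p in gateway.packages if approve(p)}   # step 2: testers evaluate packages
    return mirror(gateway, production, only=approved)        # step 3: mirror approved packages only


# Hypothetical package names and tester policy, for illustration only.
public = Repo("public", {"numpy-1.11.1", "pandas-0.18.1", "untested-0.0.1"})
gateway = Repo("gateway")
production = Repo("production")

newly_available = promote(
    public, gateway, production,
    approve=lambda pkg: not pkg.startswith("untested"),
)
print(sorted(production.packages))
```

End users in this model only ever read from `production`, so a package that testers reject stays visible in the gateway but never reaches them.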

35. Anaconda Repository: Air-Gapped Install
On-site Package Repo and Sharing Platform
– Mirror public repository of packages
– Analysts consume packages from local repo
– Analysts upload and share notebooks & pre-configured computing environments
– Developers create, deploy & share custom packages
Example commands:
    conda install numpy ipython
    conda update ipython
    conda create -n env1 ipython pandas
    conda env upload environment.yml project1
    anaconda notebook upload project1.ipynb
    conda build project2
    anaconda upload project2.bz2
Diagram labels: Public Anaconda Repository Cloud; Firewall; Internal Anaconda Repository (pre-loaded from disk); Active Directory/LDAP (optional authentication); Analyst 1; Analyst 2; Developer
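
To illustrate "analysts consume packages from local repo", here is a minimal sketch of how a script might build conda command lines that search only an internal channel. The channel URL and the `conda_command` helper are hypothetical; `--channel` and `--override-channels` are standard conda options, but how a given site points clients at its internal repository (command-line flags versus configuration) is an assumption here. The sketch only builds the argument list and does not run conda.

```python
import shlex

# Hypothetical URL for the internal repository; replace with your site's address.
INTERNAL_CHANNEL = "http://anaconda-repo.example.internal/conda/main"


def conda_command(action, packages, env_name=None, channel=INTERNAL_CHANNEL):
    """Build an argument list for conda install/update/create restricted to one channel."""
    if action not in ("install", "update", "create"):
        raise ValueError(f"unsupported action: {action}")
    if action == "create" and env_name is None:
        raise ValueError("conda create needs an environment name")
    argv = ["conda", action]
    if env_name is not None:
        argv += ["-n", env_name]
    # --override-channels tells conda to ignore its configured channels,
    # so only the internal channel given here is searched.
    argv += ["--override-channels", "--channel", channel]
    argv += list(packages)
    return argv


cmd = conda_command("create", ["ipython", "pandas"], env_name="env1")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.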

36. Anaconda Scale: Cluster Management
Diagram labels: Client Machine; Internal Anaconda Repository; Package Control; Head Node; Cluster Provisioning; Job Submission; Worker Nodes; Edge Node; State Management; Job Control; Package Control; Cluster

37. Anaconda Enterprise Notebook Computing
Workflow:
– The analyst logs into the Enterprise Notebook server, authenticating against LDAP/AD.
– Based on the project they select, the analyst is redirected to the appropriate project node.
– All notebooks/Python code runs on project nodes; any needed packages are pulled down from your local repository.
Diagram labels: Gateway & Project Nodes, running IPython kernels; Package Control; Internal Anaconda Repository; Authentication; Anaconda Enterprise Notebook Server; Computation; Web Interface; Active Directory/LDAP (optional); User 1; User 2; User 3
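
The workflow on this slide (authenticate, then route by project) can be sketched in a few lines. Everything here is a hypothetical stand-in: the in-memory `DIRECTORY` replaces a real LDAP/AD bind, and `PROJECT_NODES`, the user name, and the node names are invented for illustration. It is not the notebook server's actual implementation.

```python
from dataclasses import dataclass


class AuthenticationError(Exception):
    """Raised when the directory service rejects the credentials."""


@dataclass(frozen=True)
class Session:
    user: str
    project: str
    node: str  # project node that will run the user's kernels


# Hypothetical data; a real deployment would query LDAP/AD and the server's project registry.
DIRECTORY = {"analyst1": "s3cret"}
PROJECT_NODES = {"churn-model": "project-node-01", "fraud-scoring": "project-node-02"}


def authenticate(user, password):
    """Stand-in for an LDAP/AD bind: succeed only for a known user with a matching password."""
    if user not in DIRECTORY or DIRECTORY[user] != password:
        raise AuthenticationError(f"login failed for {user!r}")


def open_project(user, password, project):
    """Authenticate first, then return a session pointing at the node that hosts the project."""
    authenticate(user, password)
    if project not in PROJECT_NODES:
        raise LookupError(f"no project node registered for {project!r}")
    return Session(user=user, project=project, node=PROJECT_NODES[project])


session = open_project("analyst1", "s3cret", "churn-model")
print(f"{session.user} -> {session.node}")
```

Authentication happens before the project lookup, so an unauthenticated caller cannot learn which projects exist.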

38. Diagram labels: Client Machine; Internal Anaconda Repository; Package Control; Head Node; Cluster Provisioning; Job Submission; Hadoop Worker Nodes; Edge Node; State Management; Cluster; Package Control; Authentication; Anaconda Enterprise Notebook Server; Web Interface; Computation; Package Control; Teradata Integrated Environment; User 1; User 2; User 3; Analyst 1; Analyst 2; Developer
Ports: LDAP: TCP 389/636; HTTP: TCP 8080; HTTP: TCP 5002; SSH: TCP 22; SALT: TCP 4505, 4506; HTTP/HTTPS: TCP 80/44; TCP 8080
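
When wiring these components together across firewalls, a quick reachability check of the listed ports can help. The sketch below is an assumption-laden illustration: the slide does not say which component connects to which, so `check_host` simply tests TCP reachability from the machine it runs on, and the service names in `REQUIRED_PORTS` are labels chosen here. The HTTP/HTTPS entry is omitted because its port numbers are unclear on the slide.

```python
import socket

# Ports as listed on the slide; the HTTP/HTTPS entry is omitted (port numbers unclear).
REQUIRED_PORTS = {
    "ldap": [389, 636],
    "http-8080": [8080],
    "http-5002": [5002],
    "ssh": [22],
    "salt": [4505, 4506],
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host, required=REQUIRED_PORTS, timeout=1.0):
    """Map each service name to the list of its ports that could not be reached."""
    unreachable = {}
    for service, ports in required.items():
        closed = [p for p in ports if not port_open(host, p, timeout)]
        if closed:
            unreachable[service] = closed
    return unreachable


# Self-check against a throwaway listener on localhost; no real services are contacted.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
demo_result = check_host("127.0.0.1", {"demo": [demo_port]})
listener.close()
print(demo_result)
```

In practice one would call `check_host` with a real hostname from each machine that needs access, and treat a non-empty result as a list of firewall rules still to be opened.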

39. Anaconda Accelerates Adoption of Open Data Science for Enterprises
• Across all Data, Operating Systems, & Hardware Platforms
• Explore & Visualize complex data easily
• Harness Open Source Python & R Analytics
• Write Once, Deploy Anywhere for Scalable High Performance
• Data Engineering Simplified for All Data
• Collaborate with Your Team anywhere in the World
• Integrate Data from Anywhere
