Published on March 13, 2014
Productionizing Hadoop: 7 Architectural Best Practices
Mike Gualtieri, Principal Analyst
Big Data has momentum
“What best describes your firm's current usage/plans to adopt Big Data technologies and solutions?”
• 20% have implemented some big data technology
• 37% are planning some big data technology project
[Chart: response categories, in extracted order: Implemented, not expanding; Expanding/upgrading implementation; Planning to implement in the next 12 months; Planning to implement in more than 1 year; Interested but no plans. Values, in extracted order: 7%, 13%, 7%, 17%, 31%.]
Base: 634 business intelligence users and planners
Source: Forrsights BI/Big Data Survey, Q3 2012
© 2013 Forrester Research, Inc. Reproduction Prohibited
Forrester definition: “Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all of the data it needs to operate, make decisions, reduce risks, and serve customers.”
Firms seek more value in data, struggle to wrangle it, & seek lower cost solutions
“What are the main business and technical requirements or inadequacies of earlier-generation BI technologies that lead you to consider new BI techniques and technologies?”
41%  We don't know what our entire data universe contains, we need new ways to explore data and discover patterns and…
38%  Data volumes have grown beyond what we can cost effectively manage
36%  We want to access data that was not accessible for us with existing technologies
36%  Analysis requirements change too fast to keep up with
32%  The number of data formats that we must be able to deal with exceeds our ability to cost-effectively integrate
32%  Data changes or becomes available much faster than we can process in support of business decisions
28%  We can achieve (or are achieving) significant cost reductions by changing our data management and analytic architecture
22%  The velocity of data is too high for earlier technologies
21%  Earlier generation technology is too expensive
3%   Don't know
2%   Other
Integrating data from a variety of data sources is a top challenge
Big Data architecture must support three core capabilities (SPA):
• Store: Can you capture and store all your data++?
• Process: Do you have the compute power to cleanse, enrich, & analyze your data++?
• Access: Can you retrieve, search, integrate, and visualize all your data++?
Production
How can you keep your Big Data operations running smoothly?
Productionizing Big Data can be complex because of:
• Integration with heterogeneous infrastructure
• Use of multiple analytical software applications
• Reliance on 3rd-party cloud services
• Always available modeling and visualization sandboxes
• Increasing volume, velocity, variety of data from multiple data sources
• Compute intensive analytics
Production
Big Data production requires sound architecture.
The 7 architectural qualities of Big Data production platforms
#  Quality       What it means
1  Experience    Users’ perceptions of the usefulness, usability, and desirability of the application
2  Availability  The readiness of the service or application to perform its functions when needed
3  Performance   The speed to perform functions to meet business and user expectations
4  Scalability   Handle increasing volumes of data, transactions, services, and applications
5  Adaptability  The ease with which an application or service can be changed or extended
6  Security      Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7  Economy       Minimize cost to build, operate, & change an application or service without compromising its business value
1. Experience
Operational experience is critical to production.
Best practices: User experience
Usefulness, usability, and desirability of applications require ease of use with power.
Developers
• Standard Tools
• Linux Commands
• Direct Access with NFS
Administrators
• Visibility
• Self Healing
• Architectural Simplicity
Easy Workflow Management
Workload Automation with Cisco Tidal Enterprise Scheduler
• Detailed, dependency-driven event execution
• Point-and-click dynamic variables and parameters
• Scalable, extensible architecture
• Granular notification and alerts
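Dependency-driven execution, in general terms, means a job runs only after every job it depends on has finished. The following is a minimal Python sketch of that idea using the standard library's `graphlib` module. It is not the Cisco Tidal Enterprise Scheduler API; the job names and the `execution_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each key may run only after every job in its value set.
dependencies = {
    "ingest_logs": set(),
    "cleanse": {"ingest_logs"},
    "enrich": {"cleanse"},
    "build_report": {"enrich"},
    "send_alert": {"build_report"},
}


def execution_order(deps):
    """Return one valid run order in which every job follows its prerequisites.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, job in enumerate(execution_order(dependencies), start=1):
        print(step, job)
```

A production scheduler adds calendars, retries, variables, and alerting on top of this ordering step; the sketch covers only the ordering.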
2. Availability
High-availability strategy and architecture are often overlooked in proof-of-concepts.
What does high availability mean?
Uptime %*          Downtime per year
99.999% (5 nines)  5.26 minutes
99.99% (4 nines)   52.6 minutes
99.5%              1.83 days
99% (2 nines)      3.65 days
98%                7.30 days
95%                18.25 days
*Uptime calculations assume no scheduled downtime.
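The downtime figures follow directly from the uptime percentage: downtime per year is the unavailable fraction multiplied by the length of a year. Here is a small Python sketch of that arithmetic, assuming a 365-day year and no scheduled downtime. The function name is an illustrative choice, not something from the slides.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumption: 365-day year


def downtime_minutes_per_year(uptime_percent):
    """Minutes of downtime per year implied by a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.5, 99.0, 98.0, 95.0):
        minutes = downtime_minutes_per_year(pct)
        if minutes < MINUTES_PER_DAY:
            print(f"{pct}% uptime: about {minutes:.1f} minutes per year")
        else:
            days = minutes / MINUTES_PER_DAY
            print(f"{pct}% uptime: about {days:.1f} days per year")
```

Up to rounding, this reproduces the table: five nines leaves roughly 5.26 minutes per year, and 99% leaves 3.65 days.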
© MapR Technologies - Confidential
High Availability and Dependability
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99999’s of uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end check summing
• Strong consistency
• Data safe
• Mirror across sites to meet Recovery Time Objectives
3. Performance
Unexpected latencies can emerge from rapid fluctuations in volume, velocity, & variety of data and interactions of the larger Big Data ecosystem.
World Record Performance
New Minute Sort World Record: 1.5 TB in 1 minute on 2103 nodes (previous record: 1.4 TB)

Benchmark                                        MapR 2.1.1     CDH 4.1.1      MapR speed increase
Terasort (1x replication, compression disabled)
  Total                                          13m 35s        26m 6s         1.9x
  Map                                            7m 58s         21m 8s         2.7x
  Reduce                                         13m 32s        23m 37s        1.7x
DFSIO throughput/node
  Read                                           1003 MB/s      656 MB/s       1.5x
  Write                                          924 MB/s       654 MB/s       1.4x
YCSB (50% read, 50% update)
  Throughput                                     36,584.4 op/s  12,500.5 op/s  2.9x
  Runtime                                        3.80 hr        11.11 hr       2.9x
YCSB (95% read, 5% update)
  Throughput                                     24,704.3 op/s  10,776.4 op/s  2.3x
  Runtime                                        0.56 hr        1.29 hr        2.3x
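The "speed increase" column appears to be a simple ratio between the two distributions' results. A short Python sketch of that arithmetic follows, assuming runtimes are compared as baseline over candidate (lower is better) and throughputs as candidate over baseline (higher is better). The helper names are hypothetical.

```python
def parse_duration(text):
    """Convert a duration such as '13m 35s' into seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def runtime_speedup(baseline_seconds, candidate_seconds):
    """Speedup for a lower-is-better metric such as elapsed time."""
    return baseline_seconds / candidate_seconds


def throughput_speedup(baseline_rate, candidate_rate):
    """Speedup for a higher-is-better metric such as MB/s or op/s."""
    return candidate_rate / baseline_rate


if __name__ == "__main__":
    # Figures taken from the benchmark table above.
    terasort_total = runtime_speedup(parse_duration("26m 6s"),
                                     parse_duration("13m 35s"))
    dfsio_read = throughput_speedup(656, 1003)
    print(f"Terasort total: {terasort_total:.1f}x")
    print(f"DFSIO read: {dfsio_read:.1f}x")
```

The same two helpers cover every row of the table; only the direction of the comparison changes between runtime and throughput metrics.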
4. Scalability
Scalability is as much about scaling up as it is about scaling down.
MapR’s Relative Scale
Testing completed on 10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm
[Charts: file creates/s versus files (M), MapR distribution and other distribution]
Scale Advantage: 4600x
5. Adaptability
Firms have barely scratched the surface of what is possible with Big Data analytics. Change is always in the wind.
“I am a data scientist.”
Data scientists will constantly have new requirements
Compress… to accelerate the pace of discovery
Production must address and help compress the full Big Data analytics life cycle
Direct Integration with Existing Applications
• 100% POSIX compliant
• Industry standard APIs - NFS, ODBC, LDAP, REST
• More 3rd party solutions
• Proprietary connectors unnecessary
• Language neutral
6. Security
A breach can devastate an organization's reputation with customers or have legal repercussions.
All, some, or none of these 6 security properties may apply to Big Data
• Confidentiality: Information is available only to the people intended to use it or see it
• Integrity: Information is only changed in appropriate ways by people authorized to change it
• Readiness: Applications are available when needed and perform acceptably
• Authentication: A person’s identity is determined before access is granted if anonymous people are not allowed
• Authorization: People are allowed or denied access to applications or application resources
• Nonrepudiation: A person cannot perform an action and then later deny performing that action
Securing Big Data
Corporate Security Requirements
• Authentication
• Wire-level security
• Authorization (Access Control)
  • Standard: UID, GID based
  • Granular: File, Table, Column Family, Column, Cell
Integration into Existing Environments
• Kerberos or non-Kerberos
• Use existing Directory for credential lookups
• Seamless Access with Single Sign-On
7. Economy
Every architectural decision has an impact on the return on investment for Big Data analytics platforms.
Production Sweet Spot
Beware of pilot programs that don’t scale economically
[Chart labels: Business value of big data; Investment; People-intensive platforms; Technology-intensive platforms]
Maximizing Economic Value
• Analytics
  • Ability to perform broader and deeper analytics
  • Real-time streaming
  • Mission critical SLAs
  • Cloud based analysis
• Ease of Development
• Ease of Administration
• Value of Uptime
• Value of Data Protection
• Hardware Efficiency
• First Class Support
One Platform for Big Data …
99.999% HA • Data Protection • Disaster Recovery • Scalability & Performance • Enterprise Integration • Multi-tenancy
Map Reduce • File-Based Applications • SQL • Database • Search • Stream Processing
Batch • Interactive • Real-time
The 7 qualities of Big Data production platforms
#  Quality       What it means
1  Experience    Users’ perceptions of the usefulness, usability, and desirability of the application
2  Availability  The readiness of the service or application to perform its functions when needed
3  Performance   The speed to perform functions to meet business and user expectations
4  Scalability   Handle increasing or decreasing volumes of transactions, services, and data
5  Adaptability  The ease with which an application or service can be changed or extended
6  Security      Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7  Economy       Minimize cost to build, operate, & change an application or service without compromising its business value
Big Data is about innovation, but not if you don’t productionize it.
Collectors
• Capture
• Store
Journalists
• Reports
• Dashboards
Innovators
• Predictive analytics
Operations • Business Intelligence • Predictive Power
Frontier
Big Data is about pushing limits. Exponential growth in data means the frontier is vast.
Thank you Mike Gualtieri firstname.lastname@example.org Twitter: @mgualtieri