Published on February 21, 2014
BI, Hive or Big Data Analytics? © 2012 Datameer, Inc. All rights reserved. © 2012 Datameer, Inc. All rights reserved.
View the Recording of these Slides! You can view the full recording of this on-demand webinar with slides at: http://info.datameer.com/Slideshare-BI-HiveBig-Data-Analytics.html ! © 2012 Datameer, Inc. All rights reserved.
About our Speaker! Todd Nash! ! Todd is a founding Principal at CBIG Consulting, a professional services ﬁrm that helps clients leverage their data assets to produce timely, effective business strategies and tactical decisions. Todd leads CBIG’s eastern region consulting practice in the development, implementation, and execution of business intelligence and Big Data methodologies, cloud-based analytics strategies, and complex data warehousing solutions.! ! Todd graduated from Clemson University with a Bachelor of Science degree in Management Information Systems.! © 2012 Datameer, Inc. All rights reserved.
About our Speaker! Eduardo Rosas! ! Eduardo Rosas is Vice President of Services at Datameer and brings over 12 years of software implementation experience to the table.! ! In this role, Eduardo is focused on delivering repeatable, high quality level of services and support to help clients achieve their goals. ! ! Prior to Datameer, Eduardo spent 11 years at Trintech where he focused on managing a team of Technical Consultants and implementing global Java web based solutions. Eduardo is originally from San Jose, CA and graduated from Santa Clara University.! ! © 2012 Datameer, Inc. All rights reserved.
Agenda • Problem Statement – Business & Technical • POC Technical Solu;on – High-‐level and Detailed • Results • Lessons Learned Copyright © 2013 CBIG Consul;ng 5
PROBLEM STATEMENT Copyright © 2013 CBIG Consul;ng 6
Business Problem Statement A Real Estate .com business makes money in two ways: 1. Property Owners adver;se proper;es 2. Ancillary businesses adver;se services This site needs the analy;cs to show customers the return on their investment SEARCH IMPRESSIONS CLICK-‐THRU LEAD Breadth: • Searches to Impressions to Click Thru to Leads • Website op;miza;on • Customer op;miza;on & upgrades • Market op;miza;on Depth: • Can the search criteria be op;mized? • Conversion of impressions based on reﬁnement of search? • Which product mix of impressions get the greatest click thru • What is the impact of ameni;es to leads? • What addi;onal features get used to convert to leads? Copyright © 2013 CBIG Consul;ng 7
Source Source • • • • • Web Ac7vity Master Data Search & Impression ODS Lookup Data Data Movement Source Data Movement Source Service Search Data Movement Technical Problem Statement Search & Impression EDW Search Cube Sales Cube Marke7ng Cube Search & Impressions volume too large to build cube and provide deep analytics This has a negative impact on all reporting and performance of the entire system The business is unable to determine the value of all the data; has requests to add more Evaluating options to increase environment or look for alternatives POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge Copyright © 2013 CBIG Consul;ng 8
Technical Problem Statement Source Source • • • • • Web Ac7vity Master Data Search & Impression ODS Lookup Data Data Movement Source Data Movement Source Service Search Data Movement Search EDW Sales Cube Marke7ng Cube Search & Impressions volume too large to build cube and provide deep analytics This has a negative impact on all reporting and performance of the entire system The business is unable to determine the value of all the data; has requests to add more Evaluating options to increase environment or look for alternatives POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge Copyright © 2013 CBIG Consul;ng 9
Problem Statement – Success Criteria Objec7ve: To prove that the Hadoop architecture is an excellent op;on for the business to interact with large data and ﬁnd dataset and rela;onships that require deeper analy;cs. Original Scope & Goals: • Bring in one years worth of data from 6 tables, into the Amazon Cloud Hadoop environment. • IT resources will be able to extract the data from these tables and load them into .CSV ﬁles. • The success criteria for this stream of work will be: ü Amazon Hadoop cloud environment & account is setup. ü Search Analy;cs data loaded into the Amazon Hadoop cloud ü Business is able to execute and perform analy;cs on Search Analy;cs data that is stored in Hadoop with acceptable performance. ü Gain analy;cal insights with new solu;on Copyright © 2013 CBIG Consul;ng 10
POC TECHNICAL SOLUTION Copyright © 2013 CBIG Consul;ng 11
POC Technical Solu;on – High Level Web Ac7vity History Lookup Data AWS S3 Datameer (Data Discovery) Web Portal (Widget Based UI) AWS EMR (Hadoop) Amazon Web Services (Cloud) Copyright © 2013 CBIG Consul;ng 12
POC Technical Solu;on -‐ Detailed Amazon Cloud AllLeads WebClicks WebClicks Web Impressions WebLead Data Movement AllLeads Web Impressions WebLead WebSearch WebSearch WebVisit WebVisit Generic Ac;vity LR Apts IMPS Other Leads EmailLeads Data Movement EmailLeads Phone Leads Generic Ac;vity LR Apts IMPS Other Leads Phone Leads Site SubSite Site Lead Type PageType Lead Type Event Type PhoneType Event Type Email Type Contaniner Type Aﬄiate Product ID Email Type Contaniner Type Aﬄiate Property List SearchType S3 Hadoop SubSite PageType PhoneType Product ID Property List SearchType Data Workbooks AllLeads WebClicks Web Impressions WebLeads WebSearch WebVisits Use Case Workbooks Use Case1 Use C ase 2 Addi7onal Data Workbooks Addi7onal Use Cases
RESULTS Copyright © 2013 CBIG Consul;ng 14
POC Results Success Criteria Hadoop, Amazon, Datameer environment setup Able to load 1 years worth of data – nearly 1.3 TB Business able to execute and perform analy;cs Users provided acceptable performance Gain new insights Results Environment setup within the 1st couple of days Loaded signiﬁcantly more data than planned for more robust analy;cs Business leveraged Datameer to execute use cases; executed ~20 addi;onal without IT help Queries executed to comple;on. Some took seconds, some took minutes and some required overnight. 1st ;me able to run these analy;cs. Found pajerns and rela;onships contrary to assump;ons. Will be upda;ng service oﬀerings & marke;ng plans because of POC Copyright © 2013 CBIG Consul;ng 15
LESSONS LEARNED Copyright © 2013 CBIG Consul;ng 16
Lessons Learned GETTING DATA TO HADOOP Hadoop is ﬁle structure Finding the right delimiter Integra;ng data Requires ETL Data cleansing can be big Several itera;ons required CLOUD Cloud ﬂexible Easy setup and scaling Performance & sizing Sizing the cloud is challenging Cost for performance TBs with support becomes costly HADOOP Hadoop is batch Answers one thing at a ;me Analy;cs Move to database w/ tools PEOPLE Remember change mgmt Educa;on new methods & tools Copyright © 2013 CBIG Consul;ng 17
So what about open source tools like hive? © 2012 Datameer, Inc. All rights reserved. © 2012 Datameer, Inc. All rights reserved.
Hive…! ! Goal of hive! • ! Eases the complexity of writing MapReduce jobs by providing the technical user a set of tools that are more familiar with via sql! Who can use hive?! • SQL Users can pick up hql basics fairly quickly! ! Prerequisites! • • • Must have data in hadoop! The data must be CLEAN! Schema must be applied to the data by creating a hive table! © 2012 Datameer, Inc. All rights reserved.
What is hive really good at?! ! Hive is good in environments where we have clean prepared data that doesn’t change often already in hadoop! ! ! Resembles a language that many IT folks are already familiar with.! ! ! Hive can help a user trying to identify a reporting trend! ! ! User deﬁned ﬁelds (UDFs) can be used to reuse functions! © 2012 Datameer, Inc. All rights reserved.
Some troubles! << - Start of Hive script ->> --Create an TEMP Housing Table CREATE EXTERNAL TABLE MY_TABLE( num_ods string, num_bus_id int, um_ctry_cd int, prod_id string, rng_svc_cd string, rng6 string, bin string, bin_bus_id_enr int, bin_ctry_cd int, cd_fmt_a_2 string, cd_enr string, rsn_us_ind string, x_bus_id int, flg_enr string, my_dt string, user_id string, mthd_cd_enr string, tran_seq_id string, cd_enr2 string, us_amt string, moto_cd string, fee_curr_cd int, fee_desc_num string, fee_sgn_amt string, us_fee_sgn_amt string, mkt_spec string, catg_cd int, city_enr string, ctry_cd_enr int, dba_id int, nm_dscrptr string, geo_id int, geo_phone_num string, tier_cd string, msa string, nrmlzd_id int, pstl_cd string, b_st_cd_enr string, b_store_id string, b_vrfcn_val string, ntwrk_id int, site string, ! ! ! ! ! entry_mode_cd string, term_cpbty_cd string, sub_typ_cd string, dt string, id_num_enr int, prod_num int, prod_ppd_sub_typ_cd string, prod_typ_cd_enr string, prod_typ_ext_enr string, promo_cd string, promo_typ string, rwds_pgm_id_enr string, tran_cd string, tran_gmt_dt string, tran_gmt_tm string, tran_id string, unfrzn_acct_num_bus_id_enr int, unfrzn_arn_bin_bus_id_enr int, usage_cd_enr string, Other_amt string, curr_cd int, dt string, )COMMENT "THIS IS MY TEMP TABLE"; --INSERT DATA INTO MY_TABLE INSERT OVERWRITE MY_TABLE select * , SUM(us_tran_amt) AS SALES_VOL, SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT, COUNT(*) AS TRAN_COUNT, MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT, MAX(SEARCH_DT) AS LAST_SEARCH_DT, MAX(customer_biz_id) AS customer_biz_id, MAX(PGM_ID_ENR) AS PGM_ID_ENR, MAX(CUST_PROD_ID) AS CUST_PROD_ID , MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR, MAX(PROD_TYPE)AS PROD_TYPE, MAX(SUB_TYPE) AS SUB_TYPE, 1 as ID from MY_TABLE WHERE dt like '2012%' GROUP BY customer_biz_id, PGM_ID_ENR, CUST_PROD_ID, eci_moto_cd, catg_cd, city_enr, ctry_cd_enr, pstl_cd, pod_id, prod_num, SUB_TYPE; --CREATE TEMP LOOKUP TABLE CREATE EXTERNAL TABLE TEMP_LOOKUP( acct_num bigint, acct_sta_cd string, acct_zip_cd string, rwrd_pgm_id string, pgm_ref_cd string, acct_prod_id string, bus_id int, bin int, status string, pgm_eff_dt string, dt string, )COMMENT "THIS IS TEMP LOOKUP TABLE"; --INSERT DATA INTO IT INSERT OVERWRITE MY_LOOKUP SELECT *, 1 as cmf_ind FROM LOOKUP WHERE DT = '201211'; --Do a Full Outer Join SELECT * FROM MY_TABLE mt FULL OUTER JOIN MY_LOOKUP ml ON mt.member_id = ml.member_id; No way to get data in hadoop! No data validation / may throw data away! Security ! Sharing code via teams is a challenge! No visualization! © 2012 Datameer, Inc. All rights reserved.
… but it’s free right?! ! ! "Time to create Hive": Any machine-generated data (or anything semi/unstructured) must ﬁrst be parsed by writing !! !MapReduce or Pig/Python programs. Time-to-market disadvantage. Table deﬁnition is a manual effort (though this can be made easier by 3rd party tools). ! "Time to maintain Hive": Hive data models (tables) are most likely static, shared objects maintained and controlled by a few people who own the schema ! Hive is also more of a black box for new employees coming in (so employee churn creates more maintenance effort). ! ! ! Cost to implement Hive: This is mostly down to the human capital (expensive developers), and don't forget the prerequisite cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing MapReduce programs or other programs parsing/loading the data). ! © 2012 Datameer, Inc. All rights reserved.
Business decsion! ! ! Do I train my engineers on a language or eliminate the need from this by taking the problem directly to the business user.! ! © 2012 Datameer, Inc. All rights reserved.
So what would my hive resource need to know?! ! ! ! Hive QL (different dialect than ANSI standard SQL) ! MapReduce TUNING parameters. (to name a few)! • Data block size! • Number of mappers/reducers! • Compression at map out level; result compression; what codec to use! • io.sort.factor !! Access to hive is mainly done via Command line interface! © 2012 Datameer, Inc. All rights reserved.
How does Datameer do it differently ! © 2012 Datameer, Inc. All rights reserved.
Questions and Answers! © 2012 Datameer, Inc. All rights reserved.
Online Resources § § Try Datameer: www.datameer.com! Follow us on Twitter @datameer! ! ! © 2012 Datameer, Inc. All rights reserved.
Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...
In this presentation we will describe our experience developing with a highly dyna...
Presentation to the LITA Forum 7th November 2014 Albuquerque, NM
Un recorrido por los cambios que nos generará el wearabletech en el futuro
Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...
Home > Blogs > SQL Server BI Blog > Big Data Analytics with Hive, Windows & SQL Server. SQL Server BI Blog. ... This is what I call Big Data Analytics.
Big Data Meets Business Intelligence ... Server repository to the Hive server on Oracle Big Data Lite ... the Hive table metadata into the BI ...
Big Data, Analytics and Hadoop ... R and Hive,” said Russom. ... advanced analytics on big data is reduced,” wrote Fern Halper,
Nutzen Sie Big-Data-Quellen und -Plattformen per ... Alteryx Analytics Gallery; ... Gartner 2016 Magic Quadrant for BI and Analytics Platforms. Partner ...
Big Data Analytics: Time For New Tools. ... cited by 48% of respondents using or planning to deploy data analytics, BI, ... big data and analytics.
Learn about big data and how IBM can help you use big data to achieve big results. ... Take the next step with IBM Big Data & Analytics
... UTILIZIN BI DATA ANALYTICS ... with large data sets Both Hive and Pig can ... Eight Considerations for Utilizing Big Data ...
Self-Service BI. Data Sharpening ... Big Data Analytics for Apache Hive. Zoomdata provides interactive visual analytics of data stored in Hadoop by ...
A few key trends within BI relate to the emerging use of big data. ... for big data analytics is ... leverage the data. In some instances, Hadoop Hive is ...
Wandeln Sie Ihren Big Data-Bestand mithilfe von Big Data- und Advanced Analytics-Lösungen von ... für Big Data-Analyseworkloads; Power BI Embedded ...