BI, Hive or Big Data Analytics?

62 %
38 %
Information about BI, Hive or Big Data Analytics?
Technology

Published on February 21, 2014

Author: Datameer

Source: slideshare.net

Description

As more organizations look to Hadoop as the technology solution for big data analytics, common questions arise.

Join us in this case study look at an online services provider's experience with Big Data and how they answered the questions:

*What does big data analytics do that my existing BI software doesn’t?
*Will Hadoop replace my data warehouse?
*What about Hive?

BI, Hive or Big Data Analytics? © 2012 Datameer, Inc. All rights reserved. © 2012 Datameer, Inc. All rights reserved.

View the Recording of these Slides! You can view the full recording of this on-demand webinar with slides at: http://info.datameer.com/Slideshare-BI-HiveBig-Data-Analytics.html ! © 2012 Datameer, Inc. All rights reserved.

About our Speaker! Todd Nash! ! Todd is a founding Principal at CBIG Consulting, a professional services firm that helps clients leverage their data assets to produce timely, effective business strategies and tactical decisions. Todd leads CBIG’s eastern region consulting practice in the development, implementation, and execution of business intelligence and Big Data methodologies, cloud-based analytics strategies, and complex data warehousing solutions.! ! Todd graduated from Clemson University with a Bachelor of Science degree in Management Information Systems.! © 2012 Datameer, Inc. All rights reserved.

About our Speaker! Eduardo Rosas! ! Eduardo Rosas is Vice President of Services at Datameer and brings over 12 years of software implementation experience to the table.! ! In this role, Eduardo is focused on delivering repeatable, high quality level of services and support to help clients achieve their goals. ! ! Prior to Datameer, Eduardo spent 11 years at Trintech where he focused on managing a team of Technical Consultants and implementing global Java web based solutions. Eduardo is originally from San Jose, CA and graduated from Santa Clara University.! ! © 2012 Datameer, Inc. All rights reserved.

Agenda   •  Problem  Statement  –  Business  &  Technical   •  POC  Technical  Solu;on  –  High-­‐level  and  Detailed   •  Results   •  Lessons  Learned   Copyright  ©  2013  CBIG  Consul;ng   5  

PROBLEM STATEMENT Copyright  ©  2013  CBIG  Consul;ng   6  

Business  Problem  Statement   A  Real  Estate  .com  business  makes  money  in  two  ways:   1.  Property  Owners  adver;se  proper;es   2.  Ancillary  businesses  adver;se  services   This  site  needs  the  analy;cs  to  show  customers  the  return  on  their  investment     SEARCH   IMPRESSIONS   CLICK-­‐THRU   LEAD           Breadth:     •  Searches  to  Impressions  to  Click  Thru  to  Leads   •  Website  op;miza;on   •  Customer  op;miza;on  &  upgrades   •  Market  op;miza;on   Depth:   •  Can  the  search  criteria  be  op;mized?   •  Conversion  of  impressions  based  on  refinement  of  search?   •  Which  product  mix  of  impressions  get  the  greatest  click  thru   •  What  is  the  impact  of  ameni;es  to  leads?   •  What  addi;onal  features  get  used  to  convert  to  leads?   Copyright  ©  2013  CBIG  Consul;ng   7  

Source   Source   •  •  •  •  •  Web   Ac7vity   Master   Data   Search  &   Impression   ODS   Lookup   Data   Data  Movement   Source   Data  Movement   Source   Service   Search   Data   Movement   Technical  Problem  Statement   Search  &   Impression   EDW   Search   Cube   Sales  Cube   Marke7ng   Cube   Search & Impressions volume too large to build cube and provide deep analytics This has a negative impact on all reporting and performance of the entire system The business is unable to determine the value of all the data; has requests to add more Evaluating options to increase environment or look for alternatives POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge Copyright  ©  2013  CBIG  Consul;ng   8  

Technical  Problem  Statement   Source   Source   •  •  •  •  •  Web   Ac7vity   Master   Data   Search  &   Impression   ODS   Lookup   Data   Data  Movement   Source   Data  Movement   Source   Service   Search   Data   Movement     Search   EDW   Sales  Cube   Marke7ng   Cube   Search & Impressions volume too large to build cube and provide deep analytics This has a negative impact on all reporting and performance of the entire system The business is unable to determine the value of all the data; has requests to add more Evaluating options to increase environment or look for alternatives POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge Copyright  ©  2013  CBIG  Consul;ng   9  

Problem  Statement  –  Success  Criteria   Objec7ve:   To  prove  that  the  Hadoop  architecture  is  an  excellent  op;on  for  the  business  to   interact  with  large  data  and  find  dataset  and  rela;onships  that  require  deeper   analy;cs.       Original  Scope  &  Goals:   •  Bring  in  one  years  worth  of  data  from  6  tables,  into  the  Amazon  Cloud  Hadoop   environment.   •  IT  resources  will  be  able  to  extract  the  data  from  these  tables  and  load  them   into  .CSV  files.       •  The  success  criteria  for  this  stream  of  work  will  be:     ü  Amazon  Hadoop  cloud  environment  &  account  is  setup.   ü  Search  Analy;cs  data  loaded  into  the  Amazon  Hadoop  cloud     ü  Business  is  able  to  execute  and  perform  analy;cs  on  Search  Analy;cs  data   that  is  stored  in  Hadoop  with  acceptable  performance.     ü  Gain  analy;cal  insights  with  new  solu;on   Copyright  ©  2013  CBIG  Consul;ng   10  

POC TECHNICAL SOLUTION Copyright  ©  2013  CBIG  Consul;ng   11  

POC  Technical  Solu;on  –  High  Level   Web  Ac7vity   History     Lookup  Data   AWS   S3   Datameer   (Data   Discovery)   Web  Portal   (Widget  Based   UI)   AWS   EMR     (Hadoop)   Amazon  Web  Services  (Cloud)   Copyright  ©  2013  CBIG  Consul;ng   12  

POC  Technical  Solu;on  -­‐  Detailed   Amazon  Cloud   AllLeads   WebClicks   WebClicks   Web   Impressions   WebLead   Data  Movement   AllLeads   Web   Impressions   WebLead   WebSearch   WebSearch   WebVisit   WebVisit   Generic  Ac;vity   LR  Apts  IMPS   Other  Leads   EmailLeads   Data  Movement   EmailLeads   Phone  Leads   Generic  Ac;vity   LR  Apts  IMPS   Other  Leads   Phone  Leads   Site   SubSite   Site   Lead  Type   PageType   Lead  Type   Event  Type   PhoneType   Event  Type   Email  Type   Contaniner   Type   Affliate   Product  ID   Email  Type   Contaniner   Type   Affliate   Property  List   SearchType   S3                                                                       Hadoop   SubSite   PageType   PhoneType   Product  ID   Property  List   SearchType   Data  Workbooks     AllLeads     WebClicks     Web  Impressions     WebLeads     WebSearch     WebVisits           Use  Case  Workbooks     Use  Case1   Use  C  ase  2                   Addi7onal  Data  Workbooks                     Addi7onal    Use  Cases                  

RESULTS Copyright  ©  2013  CBIG  Consul;ng   14  

POC  Results                Success  Criteria     Hadoop,  Amazon,  Datameer   environment  setup     Able  to  load  1  years  worth  of   data  –  nearly  1.3  TB     Business  able  to  execute  and   perform  analy;cs       Users  provided  acceptable   performance     Gain  new  insights              Results     Environment  setup  within  the  1st  couple  of  days           Loaded  significantly  more  data  than  planned  for   more  robust  analy;cs     Business  leveraged  Datameer  to  execute  use  cases;   executed  ~20  addi;onal  without  IT  help           Queries  executed  to  comple;on.  Some  took  seconds,   some  took  minutes  and  some  required  overnight.     1st  ;me  able  to  run  these  analy;cs.    Found  pajerns   and  rela;onships  contrary  to  assump;ons.    Will  be   upda;ng  service  offerings  &  marke;ng  plans  because   of  POC   Copyright  ©  2013  CBIG  Consul;ng   15  

LESSONS LEARNED Copyright  ©  2013  CBIG  Consul;ng   16  

Lessons  Learned   GETTING  DATA  TO  HADOOP          Hadoop  is  file  structure              Finding  the  right  delimiter          Integra;ng  data              Requires  ETL          Data  cleansing  can  be  big              Several  itera;ons  required   CLOUD          Cloud  flexible              Easy  setup  and  scaling          Performance  &  sizing              Sizing  the  cloud  is  challenging          Cost  for  performance              TBs  with  support  becomes  costly   HADOOP          Hadoop  is  batch                Answers  one  thing  at  a  ;me          Analy;cs              Move  to  database  w/  tools   PEOPLE          Remember  change  mgmt              Educa;on  new  methods  &  tools   Copyright  ©  2013  CBIG  Consul;ng   17  

So what about open source tools like hive? © 2012 Datameer, Inc. All rights reserved. © 2012 Datameer, Inc. All rights reserved.

Hive…! !  Goal of hive! •  !  Eases the complexity of writing MapReduce jobs by providing the technical user a set of tools that are more familiar with via sql! Who can use hive?! •  SQL Users can pick up hql basics fairly quickly! !  Prerequisites! •  •  •  Must have data in hadoop! The data must be CLEAN! Schema must be applied to the data by creating a hive table! © 2012 Datameer, Inc. All rights reserved.

What is hive really good at?! !  Hive is good in environments where we have clean prepared data that doesn’t change often already in hadoop! ! !  Resembles a language that many IT folks are already familiar with.! ! !  Hive can help a user trying to identify a reporting trend! ! !  User defined fields (UDFs) can be used to reuse functions! © 2012 Datameer, Inc. All rights reserved.

Some troubles! << - Start of Hive script ->> --Create an TEMP Housing Table CREATE EXTERNAL TABLE MY_TABLE( num_ods string, num_bus_id int, um_ctry_cd int, prod_id string, rng_svc_cd string, rng6 string, bin string, bin_bus_id_enr int, bin_ctry_cd int, cd_fmt_a_2 string, cd_enr string, rsn_us_ind string, x_bus_id int, flg_enr string, my_dt string, user_id string, mthd_cd_enr string, tran_seq_id string, cd_enr2 string, us_amt string, moto_cd string, fee_curr_cd int, fee_desc_num string, fee_sgn_amt string, us_fee_sgn_amt string, mkt_spec string, catg_cd int, city_enr string, ctry_cd_enr int, dba_id int, nm_dscrptr string, geo_id int, geo_phone_num string, tier_cd string, msa string, nrmlzd_id int, pstl_cd string, b_st_cd_enr string, b_store_id string, b_vrfcn_val string, ntwrk_id int, site string, !  !  !  !  !  entry_mode_cd string, term_cpbty_cd string, sub_typ_cd string, dt string, id_num_enr int, prod_num int, prod_ppd_sub_typ_cd string, prod_typ_cd_enr string, prod_typ_ext_enr string, promo_cd string, promo_typ string, rwds_pgm_id_enr string, tran_cd string, tran_gmt_dt string, tran_gmt_tm string, tran_id string, unfrzn_acct_num_bus_id_enr int, unfrzn_arn_bin_bus_id_enr int, usage_cd_enr string, Other_amt string, curr_cd int, dt string, )COMMENT "THIS IS MY TEMP TABLE"; --INSERT DATA INTO MY_TABLE INSERT OVERWRITE MY_TABLE select * , SUM(us_tran_amt) AS SALES_VOL, SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT, COUNT(*) AS TRAN_COUNT, MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT, MAX(SEARCH_DT) AS LAST_SEARCH_DT, MAX(customer_biz_id) AS customer_biz_id, MAX(PGM_ID_ENR) AS PGM_ID_ENR, MAX(CUST_PROD_ID) AS CUST_PROD_ID , MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR, MAX(PROD_TYPE)AS PROD_TYPE, MAX(SUB_TYPE) AS SUB_TYPE, 1 as ID from MY_TABLE WHERE dt like '2012%' GROUP BY customer_biz_id, PGM_ID_ENR, CUST_PROD_ID, eci_moto_cd, catg_cd, city_enr, ctry_cd_enr, pstl_cd, pod_id, prod_num, SUB_TYPE; --CREATE TEMP LOOKUP TABLE CREATE EXTERNAL TABLE TEMP_LOOKUP( acct_num bigint, acct_sta_cd string, acct_zip_cd string, rwrd_pgm_id string, pgm_ref_cd string, acct_prod_id string, bus_id int, bin int, status string, pgm_eff_dt string, dt string, )COMMENT "THIS IS TEMP LOOKUP TABLE"; --INSERT DATA INTO IT INSERT OVERWRITE MY_LOOKUP SELECT *, 1 as cmf_ind FROM LOOKUP WHERE DT = '201211'; --Do a Full Outer Join SELECT * FROM MY_TABLE mt FULL OUTER JOIN MY_LOOKUP ml ON mt.member_id = ml.member_id; No way to get data in hadoop! No data validation / may throw data away! Security ! Sharing code via teams is a challenge! No visualization! © 2012 Datameer, Inc. All rights reserved.

… but it’s free right?! !  !  "Time to create Hive":  
 
 Any machine-generated data (or anything semi/unstructured) must first be parsed by writing !! !MapReduce or Pig/Python programs.  Time-to-market disadvantage.
 Table definition is a manual effort (though this can be made easier by 3rd party tools).
 ! "Time to maintain Hive":
 
 Hive data models (tables) are most likely static, shared objects maintained and controlled by a few people who own the schema ! Hive is also more of a black box for new employees coming in (so employee churn creates more maintenance effort). ! ! !  Cost to implement Hive:
 
 This is mostly down to the human capital (expensive developers), and don't forget the prerequisite cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing MapReduce programs or other programs parsing/loading the data).  ! © 2012 Datameer, Inc. All rights reserved.

Business decsion! ! !  Do I train my engineers on a language or eliminate the need from this by taking the problem directly to the business user.! ! © 2012 Datameer, Inc. All rights reserved.

So what would my hive resource need to know?! !  ! !  Hive QL (different dialect than ANSI standard SQL)
 ! MapReduce TUNING parameters.  (to name a few)! •  Data block size! •  Number of mappers/reducers! •  Compression at map out level; result compression; what codec to use! •  io.sort.factor !! Access to hive is mainly done via Command line interface! © 2012 Datameer, Inc. All rights reserved.

How does Datameer do it differently ! © 2012 Datameer, Inc. All rights reserved.

Questions and Answers! © 2012 Datameer, Inc. All rights reserved.

Online Resources §  §  Try Datameer: www.datameer.com! Follow us on Twitter @datameer! ! ! © 2012 Datameer, Inc. All rights reserved.

Add a comment

Related presentations

Related pages

Big Data Analytics with Hive, Windows & SQL Server | SQL ...

Home > Blogs > SQL Server BI Blog > Big Data Analytics with Hive, Windows & SQL Server. SQL Server BI Blog. ... This is what I call Big Data Analytics.
Read more

Business Analytics: Big Data Meets Business Intelligence

Big Data Meets Business Intelligence ... Server repository to the Hive server on Oracle Big Data Lite ... the Hive table metadata into the BI ...
Read more

Big Data, Analytics and Hadoop - SAS

Big Data, Analytics and Hadoop ... R and Hive,” said Russom. ... advanced analytics on big data is reduced,” wrote Fern Halper,
Read more

Big Data Analytics - alteryx.com

Nutzen Sie Big-Data-Quellen und -Plattformen per ... Alteryx Analytics Gallery; ... Gartner 2016 Magic Quadrant for BI and Analytics Platforms. Partner ...
Read more

Big Data Analytics: Time For New Tools - InformationWeek

Big Data Analytics: Time For New Tools. ... cited by 48% of respondents using or planning to deploy data analytics, BI, ... big data and analytics.
Read more

IBM Big Data – What is Big Data – United States

Learn about big data and how IBM can help you use big data to achieve big results. ... Take the next step with IBM Big Data & Analytics
Read more

Eight Considerations for Utilizing Big Data Analytics with ...

... UTILIZIN BI DATA ANALYTICS ... with large data sets Both Hive and Pig can ... Eight Considerations for Utilizing Big Data ...
Read more

Big Data Analytics for Apache Hive | Zoomdata

Self-Service BI. Data Sharpening ... Big Data Analytics for Apache Hive. Zoomdata provides interactive visual analytics of data stored in Hadoop by ...
Read more

SEA EPORT Big Data AnalyticsDeep Dive - IDG Enterprise

A few key trends within BI relate to the emerging use of big data. ... for big data analytics is ... leverage the data. In some instances, Hadoop Hive is ...
Read more

Big Data und Advanced Analytics-Lösungen | Microsoft Azure

Wandeln Sie Ihren Big Data-Bestand mithilfe von Big Data- und Advanced Analytics-Lösungen von ... für Big Data-Analyseworkloads; Power BI Embedded ...
Read more