The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)


Published on February 24, 2014

Author: DataPad

Source: slideshare.net

The Last Mile: Challenges and opportunities in data tools
Strata Santa Clara 2014

Wes McKinney @wesmckinn
• Former quant @ AQR (a hedge fund)
• Creator of pandas
• Author of Python for Data Analysis (O’Reilly)
• Founder and CEO of DataPad
www.datapad.io

• http://datapad.io
• New web-based visual analytics environment
• In private beta, join us!
• Hiring for engineering

Some Problems
• Statistics and ML
• ETL
• Data Visualization
• Workflows + Collaboration
• Business Analytics

Data toolchains
[Diagram. Labels: Data Acquisition; Data Slinging / Management; ETL; SQL / Tidy Form; Code-based Env; UI-based Env; Analysis]

Data toolchains
[Diagram. Labels: Data Acquisition; Maybe HDFS; ETL; ETL; Analytic DBMS; Code-based Env; ETL?; UI-based Env]

Some Trends
• SQL-on-Hadoop
• Spark / Spark ecosystem
• New life in visual ETL / data prep
• Better data manipulation libraries
• Columnar / analytic databases

Crunching data with code
• Python: pandas
• Data frames in Scala, F#, Julia, …
• Spark (Scala/Java)
• R (+ data.table, dplyr)

Some Programmatic Tool Problems
• Awkward / slow DB interactions
• In-process memory management
• Reuse of intermediate results
• Execution speed
• Evaluation semantics

dplyr (R library)
• By Hadley Wickham and Romain Francois
• Uniform R API, SQL and in-memory backends
• Describe complex data manipulation using “chaining”

dplyr (R library)

    final <- crime.by.state %.%
      filter(State=="New York", Year==2005) %.%
      arrange(desc(Count)) %.%
      select(Type.of.Crime, Count) %.%
      mutate(Proportion=Count/sum(Count)) %.%
      group_by(Type.of.Crime) %.%
      summarise(num.types = n(), counts = sum(Count))
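For readers more at home in Python, the chaining idea on the dplyr slide can be sketched with the standard library alone. This is only an illustration, not dplyr's or pandas' API: the `Table` class, its method names, and the sample rows are all hypothetical, and the group_by/summarise steps are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Table:
    """Hypothetical in-memory table: an immutable tuple of row dicts."""
    rows: tuple

    def filter(self, predicate: Callable[[dict], bool]) -> "Table":
        # Keep only the rows for which the predicate is true.
        return Table(tuple(r for r in self.rows if predicate(r)))

    def arrange(self, key: str, descending: bool = False) -> "Table":
        # Sort rows by one column.
        return Table(tuple(sorted(self.rows, key=lambda r: r[key], reverse=descending)))

    def select(self, *columns: str) -> "Table":
        # Keep only the named columns.
        return Table(tuple({c: r[c] for c in columns} for r in self.rows))

    def mutate(self, name: str, fn: Callable[[dict, tuple], object]) -> "Table":
        # Add a column; fn receives the row and all rows, so it can use aggregates.
        return Table(tuple({**r, name: fn(r, self.rows)} for r in self.rows))


# Made-up sample data, for illustration only.
crime_by_state = Table((
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Burglary", "Count": 30},
    {"State": "New York", "Year": 2005, "Type.of.Crime": "Robbery", "Count": 10},
    {"State": "New York", "Year": 2004, "Type.of.Crime": "Robbery", "Count": 99},
    {"State": "Ohio", "Year": 2005, "Type.of.Crime": "Burglary", "Count": 50},
))

# Each step returns a new Table, so the steps read top to bottom like the R chain.
final = (
    crime_by_state
    .filter(lambda r: r["State"] == "New York" and r["Year"] == 2005)
    .arrange("Count", descending=True)
    .select("Type.of.Crime", "Count")
    .mutate("Proportion", lambda r, rows: r["Count"] / sum(x["Count"] for x in rows))
)
```

Because every method returns a fresh `Table`, the original data is never modified and intermediate steps can be inspected or reused.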

Apache Spark
• Broad set of primitive data ops
• Distributed in-memory model scales naturally, high performance
• Build complex computation graphs for analytics
• Applications: Shark, GraphX, …

pandas (Python library)
• Broad traction
• Strong feature: time series analytics
• User-friendly API and community
• Being used in many unexpected ways

badger (DataPad internal)
• A high performance in-memory analytics engine for DataPad
• Addresses many performance and memory management concerns in pandas
• May become an OSS project someday

Standardized machine learning toolkits
• scikit-learn
• PMML
• Mahout
• Cloudera ML

Enterprise data workflows
• Apache Crunch
• Pig
• Cascading (+ Scalding, Cascalog)

Analytic databases
• Powering visual analytics tools on big data
• MPP / in-memory execution model
• Compressed columnar storage
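As a rough illustration of why compressed columnar storage suits analytics, the sketch below pivots a few rows into columns and run-length encodes one of them. Run-length encoding is just one common scheme, chosen here as an assumption; the slide does not name a specific encoding, and the rows and function names are made up.

```python
from itertools import groupby

# Made-up rows: (state, year, count).
rows = [
    ("NY", 2005, 30),
    ("NY", 2005, 10),
    ("NY", 2004, 99),
    ("OH", 2005, 50),
]

# Columnar layout: one list per column instead of one tuple per row.
state, year, count = (list(col) for col in zip(*rows))


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in runs for _ in range(n)]


# A sorted, low-cardinality column compresses well.
encoded_state = run_length_encode(state)

# An aggregate over one column never touches the other columns.
total_count = sum(count)
```

With rows stored together, summing `count` would mean reading every field of every row; with columns stored separately, only the one list is scanned.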

Visual data tools
• Visual Analytics/BI gone mainstream
• New Data Prep products
• Drag-and-drop predictive analytics
• Proliferation of vertical SaaS solutions

Visual tool challenges
• Tend to be less flexible than code
• Multiple tools to get the job done
• Many still dependent on Excel
• Collaboration, versioning, provenance

Collaboration tools
• Discovery and reuse
• Cataloguing insights
• Analytics from ad-hoc to production
• Interesting projects: IPython Notebook, Shiny, Pivotal Chorus

Some ideas

Abstract away the execution model (where possible)
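One way to read this idea: describe the query once and let interchangeable backends decide how to run it. The sketch below is a minimal illustration under that assumption, using only Python's standard library. The `TotalBy` query object, the two `run_*` functions, and the sample data are hypothetical and are not taken from DataPad, dplyr, or any other tool named in the talk.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class TotalBy:
    """Hypothetical logical query: sum `value` per `key` where `where_col == where_val`."""
    table: str
    key: str
    value: str
    where_col: str
    where_val: object


def run_in_memory(query, tables):
    """Backend 1: evaluate the query over a list of dicts in the current process."""
    totals = {}
    for row in tables[query.table]:
        if row[query.where_col] == query.where_val:
            totals[row[query.key]] = totals.get(row[query.key], 0) + row[query.value]
    return totals


def run_sqlite(query, conn):
    """Backend 2: translate the same query to SQL and let the database execute it."""
    # Identifiers are interpolated only because they come from trusted code here;
    # the filter value is passed as a bound parameter.
    sql = (
        f'SELECT "{query.key}", SUM("{query.value}") FROM "{query.table}" '
        f'WHERE "{query.where_col}" = ? GROUP BY "{query.key}"'
    )
    return dict(conn.execute(sql, (query.where_val,)).fetchall())


# Made-up sample data, loaded into both backends.
rows = [
    {"state": "NY", "year": 2005, "crime": "Burglary", "n": 30},
    {"state": "NY", "year": 2005, "crime": "Robbery", "n": 10},
    {"state": "NY", "year": 2005, "crime": "Robbery", "n": 5},
    {"state": "OH", "year": 2005, "crime": "Burglary", "n": 50},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (state TEXT, year INTEGER, crime TEXT, n INTEGER)")
conn.executemany("INSERT INTO crimes VALUES (:state, :year, :crime, :n)", rows)

# The query is written once; the caller picks where it runs.
query = TotalBy(table="crimes", key="crime", value="n", where_col="state", where_val="NY")
in_memory_result = run_in_memory(query, {"crimes": rows})
sqlite_result = run_sqlite(query, conn)
```

The user-facing description stays the same while the execution strategy changes, which is the same separation the dplyr slide points to with its SQL and in-memory backends.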

More integrated environments

Enhance collaboration

Thank you!
