Building a cutting-edge data processing environment on a budget


Published on February 23, 2014

Author: Gaël Varoquaux

Source: slideshare.net

Description

As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.

I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?

Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. This talk is not about rocket science!

Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. Disclaimer: this talk is as much about people and projects as it is about code and algorithms.

Growing up as a penniless academic, I did a PhD in quantum physics: vacuum (leaks), electronics (shorts), lasers (mis-alignment). Best training ever for agile project management. Computers were only one of the many moving parts (Matlab, instrument control). This shaped my vision of computing as a means to an end.

Growing up as a penniless academic: in 2011, tenured researcher in computer science; today, a growing team with data-science rock stars.

1 Using machine learning to understand brain function: link neural activity to thoughts and cognition.

1 Functional MRI: recordings of brain activity over time.

1 Cognitive neuroimaging: learn a bilateral link between brain activity and cognitive function.

1 Encoding models of stimuli: predicting the neural response → a window into brain representations of stimuli. "Feature engineering": a description of the world.

1 Decoding brain activity: "brain reading".

1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)], "brain reading".

"if it's not open and verifiable by others, it's not science, or engineering..." Stodden, 2010

Make it work, make it right, make it boring. A software development challenge:
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
Code, data, ... just works™. http://nilearn.github.io

1 Data accumulation: when data processing is routine, "big data" for rich models of brain function. Accumulation of scientific knowledge and learning formal representations.

"A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." Stephen Hawking, A Brief History of Time.

1 Petty day-to-day technicalities: buggy code, slow code, the lead data scientist leaves, a new intern to train, "I don't understand the code I wrote a year ago".

A lab is no different from a startup. Difficulties: recruitment, limited resources (people & hardware). Risks: bus factor, technical debt.

Our mission is to revolutionize brain data processing on a tight budget.

2 Patterns in data processing

2 The data processing workflow, agile: interaction... → script... → module... → interaction again... Consolidation, progressively. Low tech and short turn-around times.

2 From statistics to statistical learning: a paradigm shift as the dimensionality of data grows. # features, not only # samples; from parameter inference to prediction. Statistical learning is spreading everywhere.

3 Let's just make software to solve all these problems. (c) Theodore W. Gray

3 Design philosophy:
1. Don't solve hard problems: the original problem can be bent.
2. Easy setup, works out of the box: installing software sucks; convention over configuration.
3. Fail gracefully: robust to errors, easy to debug.
4. Quality, quality, quality: what's not excellent won't be used.

Not "one software to rule them all": break down projects by expertise.


Vision: machine learning without learning the machinery. A black box that can be opened. The right trade-off between "just works" and versatility (think Apple vs Linux).

We're not going to solve all the problems for you: I don't solve hard problems; feature engineering and domain-specific cases are yours. Python is a programming language: use it. Cover the 80% of use cases in one package.

3 Performance in high-level programming: high-level programming is what keeps us alive and kicking.

3 Performance in high-level programming. The secret sauce: optimize algorithms, not for loops. Know NumPy and SciPy perfectly: significant data should be arrays/memoryviews; avoid memory copies; rely on BLAS/LAPACK. Profile with line-profiler/memory-profiler (scipy-lectures.github.io). Cython, not C/C++.
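To make the secret sauce concrete, here is a minimal NumPy illustration (my own toy example, not from the slides) of vectorized BLAS calls, in-place updates, and views instead of copies:

import numpy as np

X = np.random.normal(size=(2000, 200))

# Vectorized BLAS call (dgemm) instead of a Python for loop:
gram = X.T.dot(X)

# In-place operations avoid allocating a full-size temporary:
X -= X.mean(axis=0)
X /= X.std(axis=0)

# Slicing returns a view, not a copy:
block = X[:100]
assert np.shares_memory(block, X)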

3 Performance in high-level programming: hierarchical clustering, PR #2199.
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: a sparse distance matrix.
- Keep a heap queue of distances: cheap minimum
- A sparse, growable structure is needed for neighborhoods: a skip-list in Cython! O(log n) insert, remove, access; bind C++ map[int, float] with Cython
- Fast traversal, possibly in Cython, for step 3
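The "cheap minimum" step can be sketched with the standard-library heapq. This is a toy Python version with made-up distances, not the Cython skip-list from PR #2199:

import heapq

# Toy sparse distance matrix: only neighboring cluster pairs are stored.
distances = {(0, 1): 0.5, (1, 2): 0.2, (2, 3): 0.9}

# Heap of (distance, pair) tuples: the minimum pops in O(log n).
heap = [(d, pair) for pair, d in distances.items()]
heapq.heapify(heap)

d, (i, j) = heapq.heappop(heap)  # cheapest merge candidate
print("merge clusters %d and %d at distance %.1f" % (i, j, d))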

3 Architecture of a data-manipulation toolkit: separate data from operations, but keep an imperative-like language (bokeh, chaco, Hadoop, Mayavi, CPUs). [Figure: shuffled number-matrix illustration removed]

3 Architecture of a data-manipulation toolkit: separate data from operations, but keep an imperative-like language. The object API exposes a data-processing language: fit, predict, transform, score, partial_fit. Objects are instantiated without data, but with all the parameters. Objects: pipeline, merging, etc.

The configuration/run pattern (traits, pyre); curry in functional programming (functools.partial); ideas from the MVC pattern.
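A short sketch of this configuration/run pattern using scikit-learn's public API (the pipeline and toy data are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Configuration step: objects hold all the parameters, no data yet.
model = Pipeline([('scale', StandardScaler()),
                  ('clf', LogisticRegression(C=1.0))])

# Run step: data only enters through fit/predict/transform/score.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
model.fit(X, y)
print(model.score(X, y))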

4 Big data on small hardware G Varoquaux 22

4 Biggish data on smallish hardware. "Big data": petabytes... distributed storage, a computing cluster. Mere mortals: gigabytes... Python programming, off-the-shelf computers.

4 On-line algorithms: process the data one sample at a time. Compute the mean of a gazillion numbers. Hard? No: just do a running mean.
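A minimal sketch of such a running mean (the function name is mine), using O(1) memory however long the stream:

def running_mean(stream):
    # Incremental update: mean_n = mean_(n-1) + (x - mean_(n-1)) / n
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n
    return mean

print(running_mean(iter(range(1000000))))  # 499999.5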

4 On-line algorithms converge to expectations. Mini-batch = a bunch of observations, for vectorization. Example: K-Means clustering on X = np.random.normal(size=(10000, 200)):
scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s
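A runnable version of that comparison, assuming current SciPy/scikit-learn APIs; the timings quoted above are from the slide's machine:

import numpy as np
from scipy.cluster.vq import kmeans
from sklearn.cluster import MiniBatchKMeans

X = np.random.normal(size=(10000, 200))

# Full-batch k-means from SciPy:
centroids, distortion = kmeans(X, 10, iter=2)

# Mini-batch k-means: small batches keep the updates vectorized
# while each sample is touched only a few times.
mbk = MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)
print(mbk.cluster_centers_.shape)  # (10, 200)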

4 On-the-fly data reduction: big data is often I/O bound. Layer memory access: CPU caches, RAM, local disks, distant storage. Less data also means less work.

4 On-the-fly data reduction by dropping data:
1. loop: take a random fraction of the data
2. run the algorithm on that fraction
3. aggregate results across sub-samplings
Looks like bagging (bootstrap aggregation); exploits redundancy across observations. Run the loop in parallel.
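A toy sketch of this subsample-and-aggregate loop, here estimating a mean; the fraction and the number of repeats are arbitrary choices:

import numpy as np

rng = np.random.RandomState(42)
data = rng.normal(loc=3.0, size=100000)

estimates = []
for _ in range(10):
    # 1. take a random fraction of the data
    subsample = rng.choice(data, size=len(data) // 10, replace=False)
    # 2. run the algorithm on that fraction
    estimates.append(subsample.mean())

# 3. aggregate results across sub-samplings
print(np.mean(estimates))  # close to 3.0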

4 On-the-fly data reduction:
- Random projections (will average features): sklearn.random_projection, random linear combinations of the features
- Fast clustering of features: sklearn.cluster.WardAgglomeration; on images, a super-pixel strategy
- Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer, stateless, so it can be used in parallel
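Two of these reducers in action, as a sketch; GaussianRandomProjection is scikit-learn's implementation of the random-linear-combinations idea, and the toy inputs are mine:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.random_projection import GaussianRandomProjection

# Random projection: 10000 features squeezed into 100.
X = np.random.normal(size=(1000, 10000))
X_small = GaussianRandomProjection(n_components=100).fit_transform(X)
print(X_small.shape)  # (1000, 100)

# Hashing: stateless, so no fit to synchronize across parallel workers.
vect = HashingVectorizer(n_features=2 ** 10)
X_text = vect.transform(['big data on small hardware',
                         'less data means less work'])
print(X_text.shape)  # (2, 1024)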

4 On-the-fly data reduction. Example: randomized SVD (a random projection), sklearn.utils.extmath.randomized_svd:

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop

linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925

4 Biggish iron. Our new box: 48 cores, 384 GB RAM, 70 TB storage, 15 k€ (SSD cache on the RAID controller). It gets our work done faster than our 800-CPU cluster: it's the access patterns! "Nobody ever got fired for using Hadoop on a cluster", A. Rowstron et al., HotCDP '12.

5 Avoiding the framework: joblib

5 Parallel processing, big picture: focus on embarrassingly parallel for loops; life is too short to worry about deadlocks. Workers compete for data access: the memory bus is a bottleneck. The right grain of parallelism: too fine → overhead; too coarse → memory shortage. Scale by the relevant cache pool.

5 Parallel processing with joblib: focus on embarrassingly parallel for loops; life is too short to worry about deadlocks.

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

5 Parallel processing with joblib. IPython, multiprocessing, celery, MPI? joblib is higher-level: no dependencies, works everywhere; better traceback reporting; memmapping arrays to share memory (O. Grisel); on-the-fly dispatch of jobs, memory-friendly; threads or processes backend.

5 Parallel processing, queues: queues are high-performance and concurrent-friendly. Difficulty: a callback on result arrival means multiple threads in the caller, plus a risk of deadlocks. The dispatch queue should fill up "slowly" → pre_dispatch in joblib → back-and-forth communication → the door is open to race conditions.

5 Parallel processing: what happens where. joblib design: caller, dispatch queue, and collect queue in the same process. Benefit: robustness. Grand Central Dispatch design: the dispatch queue has a process of its own. Benefit: resource management in nested for loops.

5 Caching. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computation is the crux of optimization.

The joblib approach: the memoize pattern.

mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)    # computes b from a using f
c = g(a)    # retrieves the result from the store

Challenges in the context of big data: a & b are big. Design goals: a & b arbitrary Python objects; no dependencies; drop-in, framework-less code.

Lego bricks for out-of-core algorithms coming soon:

>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
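A complete, runnable version of the memoize pattern (the cached function and cache directory are illustrative; the cache directory is passed positionally because its keyword name changed across joblib versions):

import numpy as np
from joblib import Memory

mem = Memory('/tmp/joblib_demo', verbose=0)

@mem.cache
def costly_square(a):
    print('computing...')
    return a ** 2

a = np.arange(5)
b = costly_square(a)  # computes, prints 'computing...'
c = costly_square(a)  # silent: retrieved from the on-disk store
print(np.all(b == c))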

5 Efficient input-argument hashing: joblib.hash. Compute the md5* of the input arguments. A trade-off between features and cost: black-boxy, but robust and completely generic.

Implementation:
1. Create an md5 hash object
2. Subclass the standard-library pickler (= a state machine that walks the object graph)
3. Walk the object graph: for ndarrays, pass the data pointer to the md5 algorithm (its "update" method); for the rest, pickle
4. Update the md5 with the pickle
(* md5 is in the Python standard library)
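A deliberately simplified sketch of that idea (my own function, not joblib.hash: it special-cases ndarrays only at the top level, whereas joblib's pickler subclass catches arrays anywhere in the object graph):

import hashlib
import pickle

import numpy as np

def hash_args(obj):
    h = hashlib.md5()
    if isinstance(obj, np.ndarray):
        # Feed the raw data buffer straight to md5: no pickling copy.
        h.update(np.ascontiguousarray(obj).data)
        # dtype and shape matter for equality too:
        h.update(pickle.dumps((obj.dtype, obj.shape)))
    else:
        h.update(pickle.dumps(obj, protocol=2))
    return h.hexdigest()

print(hash_args(np.arange(10)))
print(hash_args({'param': 3}))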

5 A fast, disk-based, concurrent store: joblib.dump. Persisting arbitrary objects: once again, subclass the pickler; use .npy for large numpy arrays (np.save) and pickle for the rest → multiple files. Store concurrency issues. Strategy: atomic operations + try/except; renaming a directory is atomic, and the directory layout is consistent with remove operations. Good performance, usable on shared disks (clusters).
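A file-level sketch of the atomic-rename strategy (joblib applies the same idea at the directory level; the helper name is mine):

import os
import pickle
import tempfile

def atomic_dump(obj, path):
    # Write to a temporary file in the same directory, then rename:
    # on POSIX the rename is atomic, so concurrent readers never
    # see a half-written store.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f, protocol=2)
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

atomic_dump({'weights': [1, 2, 3]}, 'model.pkl')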

5 Making I/O fast.

Fast compression: the CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory).

Avoiding copies: zlib.compress wants C-contiguous buffers; copyless storage of the raw buffer + meta-information (strides, class...).

Single-file dump coming soon: file opening is slow on a cluster. Challenge: streaming the above to limit memory usage.

What matters on large systems: the number of bytes stored (brings the network/SATA bus down), memory usage (brings compute nodes down), and the number of atomic file accesses (brings shared storage down).
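A sketch of the zlib-with-buffers idea, storing dtype and shape as the meta-information (stride handling omitted):

import zlib

import numpy as np

X = np.random.normal(size=(1000, 1000))

# zlib.compress wants a C-contiguous buffer; compressing the raw
# buffer directly avoids the copies the gzip module would make.
payload = zlib.compress(np.ascontiguousarray(X).data, 1)  # level 1: fast

# Copyless restore from the decompressed buffer + meta-information:
restored = np.frombuffer(zlib.decompress(payload),
                         dtype=X.dtype).reshape(X.shape)
print(np.array_equal(restored, X))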

5 Benchmarking against np.save and PyTables, on NeuroImaging data (MNI atlas). [Figure: benchmark plot; y-axis scale: 1 is np.save]

6 The bigger picture: building an ecosystem. Helping your future self.

6 Community-based development in scikit-learn. A huge feature set: the benefit of a large team. Project growth: more than 200 contributors, ~12 core contributors, 1 full-time INRIA programmer from the start. Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn).

6 The economics of open source. Code maintenance is too expensive to carry alone: scikit-learn ~300 emails/month, nipy ~45 emails/month, joblib ~45 emails/month, mayavi ~30 emails/month.

"Hey Gael, I take it you're too busy. That's okay, I spent a day trying to install XXX and I think I'll succeed myself. Next time though please don't ignore my emails, I really don't like it. You can say, 'sorry, I have no time to help you.' Just don't ignore."

Your "benefits" come from a fraction of the code. Data loading? Maybe. Standard algorithms? Nah. Share the common code... to avoid dying under code. Code becomes less precious with time, and somebody might contribute features.

6 Many eyes make code fast. Bench WiseRF, anybody? (L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer)

6 Six steps to a community-driven project:
1. Focus on quality
2. Build great docs and examples
3. Use GitHub
4. Limit the technicality of your codebase
5. Releasing and packaging matter
6. Focus on your contributors: give them credit and decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire

6 Core project contributors. [Figure: normalized number of commits since 2009-06, per individual committer. Credit: Fernando Perez, Gist 5843625]

6 The tragedy of the commons: "Individuals, acting independently and rationally according to each one's self-interest, behave contrary to the whole group's long-term best interests by depleting some common resource." (Wikipedia) Make it work, make it right, make it boring: core projects (boring) are taken for granted → hard to fund, less excitement. They need citation, in papers & on corporate web pages.

Solving problems that matter. The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code. scikit-learn, joblib, nilearn, ... I hope. @GaelVaroquaux

Cutting-edge ... environment ... on a budget:
1. Set the goals right: don't solve hard problems; what's your original problem?
2. Use the simplest technological solutions possible: be very technically sophisticated, but don't use that sophistication.
3. Don't forget the human factors: with your users (documentation) and with your contributors.
A perfect design? @GaelVaroquaux
