Extreme Scripting July 2009


Published on July 31, 2009

Author: ianfoster

Source: slideshare.net


A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.

Extreme scripting and other adventures in data-intensive computing

Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.

While superficially messy, the flexibility and simplicity of this approach make it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.

We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the POSIX file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.

Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.

Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago.edu/swift.

Extreme scripting and other adventures in data-intensive computing
Ian Foster
Allan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang
Computation Institute, Argonne National Lab & University of Chicago

How data analysis happens at data-intensive computing workshops

How data analysis really happens in scientific laboratories
% foo file1 > file2
% bar file2 > file3
% foo file1 | bar > file3
% foreach f (f1 f2 f3 f4 f5 f6 f7 … f100)
foreach? foo $f.in | bar > $f.out
foreach? end
%
% Now where on earth is f98.out, and how did I generate it again?
Now: command not found.
%
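The question in the session above ("how did I generate it again?") is a provenance problem. As a minimal sketch, assuming Python stand-ins for the placeholder programs, the same loop can write a record of how each output was produced. The functions `foo` and `bar`, the helper `run_all`, and the file name `provenance.json` are all hypothetical, introduced only for illustration; this is not how Swift records provenance.

```python
import json
import tempfile
from pathlib import Path


def foo(text: str) -> str:
    """Hypothetical stand-in for the slide's `foo` program."""
    return text.upper()


def bar(text: str) -> str:
    """Hypothetical stand-in for the slide's `bar` program."""
    return text[::-1]


def process(in_path: Path) -> dict:
    """Equivalent of `foo $f.in | bar > $f.out`, returning a provenance record."""
    out_path = in_path.with_suffix(".out")
    out_path.write_text(bar(foo(in_path.read_text())))
    return {"output": out_path.name, "input": in_path.name, "steps": ["foo", "bar"]}


def run_all(workdir: Path) -> list[dict]:
    """Process every *.in file and save how each output was generated."""
    records = [process(p) for p in sorted(workdir.glob("*.in"))]
    (workdir / "provenance.json").write_text(json.dumps(records, indent=2))
    return records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        for i in range(1, 6):
            (work / f"f{i}.in").write_text(f"sample {i}")
        for record in run_all(work):
            print(record)
```

With such a log next to the data, finding out where a given output came from is a lookup rather than a memory exercise.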

Extreme scripting
Scripts: simple scripts, complex scripts
Computers: small computers, big computers
Swift
Complex scripts: many activities, numerous files, complex data, data dependencies, many programs
Big computers: many processors, storage hierarchy, failure, heterogeneity
Preserving file system semantics, ability to call arbitrary executables

Functional magnetic resonance imaging (fMRI) data analysis

AIRSN program definition

(Run or) reorientRun (Run ir, string direction) {
    foreach Volume iv, i in ir.v {
        or.v[i] = reorient(iv, direction);
    }
}

(Run snr) functional (Run r, NormAnat a, Air shrink) {
    Run yroRun = reorientRun( r , "y" );
    Run roRun = reorientRun( yroRun , "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, 0.1 );
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", "null" );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    Run nr = reslice_warp_run( boldNormWarp, roRun );
    Volume meanAll = strictmean( nr, "y", "null" );
    Volume boldMask = binarize( meanAll, "y" );
    snr = gsmoothRun( nr, boldMask, "6 6 6" );
}
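The `foreach` in `reorientRun` applies one program to every volume of a run, and no iteration depends on another. A loose Python analogy, assuming a thread pool in place of a cluster scheduler, might look as follows. The `reorient` function here is a hypothetical stand-in that only tags a string; the real step is an external imaging program, and this sketch is not Swift's execution model.

```python
from concurrent.futures import ThreadPoolExecutor


def reorient(volume: str, direction: str) -> str:
    """Hypothetical stand-in for the external reorient program."""
    return f"{volume}|{direction}"


def reorient_both(volume: str) -> str:
    """Per-volume chain used in `functional`: reorient along y, then along x."""
    return reorient(reorient(volume, "y"), "x")


def reorient_run(volumes: list[str], workers: int = 4) -> list[str]:
    """Apply the chain to every volume; iterations are independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, mirroring or.v[i].
        return list(pool.map(reorient_both, volumes))


if __name__ == "__main__":
    print(reorient_run(["vol0", "vol1", "vol2"]))
```

The point of the analogy is that the script author states what is computed per element, and the runtime is free to decide how many elements are processed at once.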

Many many tasks: Identifying potential drug targets
2M+ ligands x protein target(s)
Benoit Roux et al.

Drug-target screening workflow (flowchart, start to end)
Inputs: ZINC 3-D structures, 2M structures (6 GB); PDB protein descriptions, 1 protein (1MB)
Manually prep DOCK6 rec file: DOCK6 Receptor (1 per protein: defines pocket to bind to)
Manually prep FRED rec file: FRED Receptor (1 per protein: defines pocket to bind to)
NAB script parameters (defines flexible residues, #MDsteps) and NAB Script Template: BuildNABScript, NAB Script
FRED, DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript
Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript
GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
Report: ligands, complexes
For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years)
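The per-stage cost estimates on this slide follow from tasks x time per task x CPUs per task. A short worked check, using only the numbers quoted above (the stage grouping and the 365-day year are assumptions made for this sketch):

```python
# Stage sizes copied from the slide: (tasks, seconds per task, CPUs per task).
STAGES = {
    "FRED/DOCK6": (4_000_000, 60, 1),
    "Amber": (10_000, 20 * 60, 1),
    "GCMC": (500, 10 * 3600, 100),
}


def cpu_hours(tasks: int, seconds_per_task: int, cpus_per_task: int) -> float:
    """CPU-hours consumed by one stage."""
    return tasks * seconds_per_task * cpus_per_task / 3600


per_stage = {name: cpu_hours(*spec) for name, spec in STAGES.items()}
total_tasks = sum(spec[0] for spec in STAGES.values())
total_cpu_hours = sum(per_stage.values())
total_cpu_years = total_cpu_hours / (24 * 365)  # assumption: 365-day year

if __name__ == "__main__":
    for name, hours in per_stage.items():
        print(f"{name}: {hours:,.0f} CPU-hours")
    print(f"total: {total_tasks:,} tasks, {total_cpu_hours:,.0f} CPU-hours")
```

The stage totals come to roughly 67K, 3.3K, and 500K CPU-hours, so the GCMC stage dominates. The slide's headline figures for one target are rounded more coarsely than this sum.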

IBM BG/P 570 Teraflop/s, 164,000 cores, 80 TB

DOCK on BG/P: ~1M tasks on 119,000 CPUs
118784 cores, 934803 tasks
Elapsed time: 7257 sec
Compute time: 21.43 CPU years
Average task: 667 sec
Relative efficiency: 99.7% (from 16 to 32 racks)
Utilization: 99.6% sustained, 78.3% overall
[Plot; x-axis: Time (sec)]
Ioan Raicu et al.
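The overall utilization figure can be cross-checked from the other numbers on the slide. The sketch below assumes that overall utilization means compute time divided by (cores x elapsed time) and that a CPU year is 365 days; neither definition is stated on the slide.

```python
CORES = 118_784
ELAPSED_SEC = 7_257
COMPUTE_CPU_YEARS = 21.43

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# CPU-seconds the machine could have delivered during the run.
available_cpu_sec = CORES * ELAPSED_SEC
# CPU-seconds actually spent in tasks, per the slide.
used_cpu_sec = COMPUTE_CPU_YEARS * SECONDS_PER_YEAR

overall_utilization = used_cpu_sec / available_cpu_sec

if __name__ == "__main__":
    print(f"overall utilization: {overall_utilization:.1%}")
```

Under these assumptions the ratio lands within a fraction of a percentage point of the 78.3% quoted, which suggests the "overall" figure includes ramp-up and ramp-down while the "sustained" figure does not.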

Managing 160,000 cores
Falkon
High-speed local “disk”
Slower shared storage

Scaling POSIX to petascale
[Architecture diagram. Labels: Global file system; Staging; Chirp (multicast); MosaStore (striping); Torus and tree interconnects; Intermediate: CN-striped intermediate file system (IFS compute nodes, IFS segments); Large dataset; ZOID on I/O node; Local: compute nodes with LFS (local datasets)]

Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors

Provisioning for data-intensive workloads
Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey
[Image: stacking of Sloan data]
Challenges: random data access, much computing, time-varying load
Solution: dynamic acquisition of compute & storage, data diffusion
Ioan Raicu

“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node (Ioan Raicu)

Same scenario, but with dynamic resource provisioning

Data diffusion sine-wave workload: Summary
GPFS: 5.70 hrs, ~8Gb/s, 1138 CPU hrs
DD+SRP: 1.80 hrs, ~25Gb/s, 361 CPU hrs
DD+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
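The three rows are easier to compare as ratios against the GPFS baseline. A small sketch, using only the figures in the summary (the helper name `compare` and the choice of GPFS as baseline are assumptions of this example):

```python
# Figures copied from the summary slide.
RUNS = {
    "GPFS": {"hours": 5.70, "gbps": 8, "cpu_hours": 1138},
    "DD+SRP": {"hours": 1.80, "gbps": 25, "cpu_hours": 361},
    "DD+DRP": {"hours": 1.86, "gbps": 24, "cpu_hours": 253},
}


def compare(runs: dict, baseline: str = "GPFS") -> dict:
    """Speedup and relative CPU-hour cost of each run versus the baseline."""
    base = runs[baseline]
    return {
        name: {
            "speedup": base["hours"] / run["hours"],
            "cpu_hours_ratio": run["cpu_hours"] / base["cpu_hours"],
        }
        for name, run in runs.items()
    }


summary = compare(RUNS)

if __name__ == "__main__":
    for name, row in summary.items():
        print(f"{name}: {row['speedup']:.2f}x faster, "
              f"{row['cpu_hours_ratio']:.0%} of baseline CPU hours")
```

Both data diffusion runs finish about three times faster than GPFS. The DD+DRP run takes slightly longer than DD+SRP but consumes fewer CPU hours, which is the trade-off the dynamic provisioning slide illustrates.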

Data-intensive computing @ Computation Institute: Example applications
Astrophysics, Cognitive science, East Asian studies, Economics, Environmental science, Epidemiology, Genomic medicine, Neuroscience, Political science, Sociology, Solid state physics

Sequencing outpaces Moore’s law
[Plot. Labels: Gigabases (Solexa, 454, next-gen Solexa); BLAST on EC2, US$]
Folker Meyer, Computation Institute

Data-intensive computing @ Computation Institute: Hardware
PADS: Petascale Active Data Store (NSF MRI)
1000 TB tape backup
500 TB reliable storage (data & metadata)
180 TB, 180 GB/s
17 Top/s analysis
[Diagram labels: Diverse data sources, Data ingest, Dynamic provisioning, Parallel analysis, Remote access, Diverse users, Offload to remote data centers]

Data-intensive computing @ Computation Institute: Software
HPC systems software (MPICH, PVFS, ZeptOS)
Collaborative data tagging (GLOSS)
Data integration (XDTM)
HPC data analytics and visualization
Loosely coupled parallelism (Swift, Hadoop)
Dynamic provisioning (Falkon)
Service authoring (Introduce, caGrid, gRAVI)
Provenance recording and query (Swift)
Service composition and workflow (Taverna)
Virtualization management (Workspace Service)
Distributed data management (GridFTP, etc.)

Data-intensive computing is an end-to-end problem
[Diagram. Axes: agreement about outcomes and certainty about outcomes, each labeled Low and High. Regions: Plan and control, Zone of complexity, Chaos]
Ralph Stacey, Complexity and Creativity in Organizations, 1996

We need to function in the zone of complexity
[Diagram. Axes: agreement about outcomes and certainty about outcomes, each labeled Low and High. Regions: Plan and control, Chaos]
Ralph Stacey, Complexity and Creativity in Organizations, 1996

The Grid paradigm
Principles and mechanisms for dynamic virtual organizations
Leverage service oriented architecture
Loose coupling of data and services
Open software, architecture
[Timeline: 1995, 2000, 2005, 2010. Application areas: Engineering, Biomedicine, Computer science, Physics, Healthcare, Astronomy, Biology]

As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical)

Multi-center clinical cancer trials image capture and review (Center for Health Informatics)

Summary
Extreme scripting offers the potential for easy scaling of proven working practices
Interesting technical problems relating to programming and I/O models
Many wonderful applications
Data-intensive computing is an end-to-end problem
Data generation, integration, analysis, etc., is a continuous, loosely coupled process

Thank you!
Computation Institute
www.ci.uchicago.edu
