myHadoop 0.30

Information about myHadoop 0.30
Technology

Published on March 8, 2014

Author: glennklockwood

Source: slideshare.net

Description

Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure, covering how to install it, spin up a Hadoop cluster, and use the new features.

myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop), and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz).

myHadoop 0.30 and Beyond
Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center

Hadoop and HPC

PROBLEM: domain scientists aren't using Hadoop
•  seen as a toy for computer scientists and salespeople
•  Java is not high-performance
•  scientists don't want to learn computer science and Java

SOLUTION: make Hadoop easier for HPC users
•  use existing HPC clusters and software
•  use Perl/Python/C/C++/Fortran instead of Java
•  make starting Hadoop as easy as possible

Compute: Traditional vs. Data-Intensive

Traditional HPC
•  CPU-bound problems
•  Solution: OpenMP- and MPI-based parallelism

Data-Intensive
•  I/O-bound problems
•  Solution: map/reduce-based parallelism

Architecture for Both Workloads

PROs:
•  high-speed interconnect
•  complementary object storage
•  fast CPUs, RAM
•  less faulty

CONs:
•  nodes aren't storage-rich
•  transferring data between HDFS and object storage*

* unless using Lustre, S3, etc. backends

Add Data Analysis to Existing Compute Infrastructure

[Build-slide sequence: diagrams showing a Hadoop analysis layer being added on top of an existing HPC compute and storage architecture]

myHadoop – 3-step Install

1. Download Apache Hadoop 1.x and myHadoop 0.30

   $ wget http://apache.cs.utah.edu/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
   $ wget glennklockwood.com/files/myhadoop-0.30.tar.gz

2. Unpack both Hadoop and myHadoop

   $ tar zxvf hadoop-1.2.1-bin.tar.gz
   $ tar zxvf myhadoop-0.30.tar.gz

3. Apply the myHadoop patch to Hadoop

   $ cd hadoop-1.2.1/conf
   $ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch

myHadoop – 3-step Cluster

1. Set a few environment variables

   # sets HADOOP_HOME, JAVA_HOME, and PATH
   $ module load hadoop
   $ export HADOOP_CONF_DIR=$HOME/mycluster.conf

2. Run myhadoop-configure.sh to set up Hadoop

   $ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID

3. Start the cluster with Hadoop's start-all.sh

   $ start-all.sh
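
The three steps above drop naturally into a batch job. Below is a minimal sketch of a Torque/PBS submission script assembled from those commands; the job name, node/core counts, walltime, and scratch path are illustrative assumptions, and the final call uses myhadoop-cleanup.sh, the teardown script that ships with myHadoop:

    #!/bin/bash
    #PBS -N hadoop-cluster         # illustrative job name
    #PBS -l nodes=4:ppn=16         # illustrative node/core request
    #PBS -l walltime=01:00:00      # illustrative walltime

    # Step 1: environment (sets HADOOP_HOME, JAVA_HOME, and PATH)
    module load hadoop
    export HADOOP_CONF_DIR=$HOME/mycluster.conf

    # Step 2: generate per-job Hadoop configs, with HDFS on node-local scratch
    myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID

    # Step 3: bring up the HDFS and MapReduce daemons
    start-all.sh

    # ... run hadoop jobs here ...

    # shut the cluster down and remove per-job state before the job exits
    stop-all.sh
    myhadoop-cleanup.sh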

Easy Wordcount in Python

mapper.py:

    #!/usr/bin/env python
    import sys

    # emit "word<TAB>1" for every word on every line of input
    for line in sys.stdin:
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            print('%s\t%d' % (key, value))

reducer.py:

    #!/usr/bin/env python
    import sys

    last_key = None
    running_tot = 0

    # input arrives sorted by key, so all counts for a word are contiguous
    for input_line in sys.stdin:
        input_line = input_line.strip()
        this_key, value = input_line.split("\t", 1)
        value = int(value)
        if last_key == this_key:
            running_tot += value
        else:
            if last_key:
                print("%s\t%d" % (last_key, running_tot))
            running_tot = value
            last_key = this_key

    # flush the count for the final key
    if last_key == this_key:
        print("%s\t%d" % (last_key, running_tot))

https://github.com/glennklockwood/hpchadoop/tree/master/wordcount.py
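
Because Hadoop streaming communicates with the mapper and reducer purely through stdin and stdout, the two scripts can be smoke-tested on a login node before any cluster is provisioned; a plain sort stands in for Hadoop's shuffle/sort phase. A minimal check, assuming an input.txt in the working directory:

    $ cat input.txt | python mapper.py | sort | python reducer.py
    # prints one "word<TAB>count" line per distinct word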

Easy Wordcount in Python

$ hadoop dfs -put ./input.txt mobydick.txt

$ hadoop jar \
    /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
    -mapper "python $PWD/mapper.py" \
    -reducer "python $PWD/reducer.py" \
    -input mobydick.txt \
    -output output

$ hadoop dfs -cat output/part-* > ./output.txt

Data-Intensive Performance Scaling

•  8-node Hadoop cluster on Gordon
•  8 GB VCF (Variant Call Format) file
•  9x speedup using 2 mappers/node

Data-Intensive Performance Scaling

•  7.7 GB of chemical evolution data, 270 MB/s processing rate

Advanced Features - Usability

•  System-wide default configurations in myhadoop-0.30/conf/myhadoop.conf:
   •  MH_SCRATCH_DIR – specifies the location of node-local storage for all users
   •  MH_IPOIB_TRANSFORM – specifies a regex to transform node hostnames into IP-over-InfiniBand hostnames
•  Users can remain totally ignorant of scratch disks and InfiniBand: just define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters, and myHadoop figures out everything else
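
As an illustration, a system-wide myhadoop.conf defining the two variables above might look like the following; the scratch path and the .ibnet0 suffix are placeholders, not SDSC's actual settings:

    # myhadoop-0.30/conf/myhadoop.conf -- system-wide defaults (illustrative values)

    # node-local storage used for HDFS and Hadoop temp space for all users
    MH_SCRATCH_DIR=/scratch/$USER/$PBS_JOBID

    # sed-style substitution mapping a node hostname to its IP-over-InfiniBand
    # hostname, e.g. gcn-14-68 -> gcn-14-68.ibnet0
    MH_IPOIB_TRANSFORM='s/$/.ibnet0/'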

Advanced Features - Usability

•  Parallel filesystem support:
   •  HDFS on Lustre via myHadoop persistent mode (-p)
   •  Direct Lustre support (IDH)
   •  No performance loss at smaller scales for HDFS on Lustre
•  Resource managers supported in a unified framework:
   •  Torque 2.x and 4.x – tested on SDSC Gordon
   •  SLURM 2.6 – tested on TACC Stampede
   •  Grid Engine
   •  LSF, PBSPro, and Condor could be supported easily (testbeds needed)
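
For HDFS on Lustre, persistent mode keeps the HDFS data on the parallel filesystem so that it survives after the job ends and can be reused by later jobs. A minimal sketch, assuming -p takes the persistent base directory on Lustre (both paths shown are illustrative):

    # place HDFS on Lustre instead of node-local disk; -s still points at
    # per-job scratch space for logs and temporary files
    $ myhadoop-configure.sh -p /oasis/scratch/$USER/hadoop_persist \
                            -s /scratch/$USER/$PBS_JOBID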

New Features

•  Perl reimplementation for extensibility
•  Scalability improvements:
   •  Separate nodes for the namenode, secondary namenode, and jobtracker
   •  Automatic switchover based on cluster-size best practices
•  Hadoop 2.0 / YARN support (halfway there)
•  Spark integration
