Published on August 28, 2007
Condor Parallel Universe

Overview:
Task vs. Job Parallelism
New Condor support for Task-Parallelism
Other goodies

The Talk in one Slide:
Parallel Universe can run any* task parallel job
Not just MPICH 1.2.4
Not just MPI…

Job vs Task Parallelism:
Condor historically focused on Job Parallelism
Job parallelism either manually or via DAGMan
Rest of talk on task parallelism
Can also get task parallelism via PVM or MW

Parallel Universe:
Adaptation of MPI universe
Modifications based on experience with MPI
User feedback
But, more than just MPI

MPI lifecycle without Condor:
LAM version
lamboot
    lamboot -ssi boot ssh machine_file
mpirun
    mpirun -np 8 exe arg1 arg2...
lamhalt
    lamhalt

Scheduling:
Need 'Dedicated Scheduler'
'Dedicated' has a specific Condor meaning
Nodes running MPI require a dedicated scheduler
A given machine can have many opportunistic schedulers
... but only 1 dedicated scheduler

DedicatedScheduler surprises:
DedicatedScheduler co-opts normal negotiation cycle
Preemption and scheduling work differently than opportunistic
DedicatedScheduler schedules First-Fit, sorted by UserJobPrio
condor_q -analyze mystery!

Job startup:
Same file transfer, etc. as Vanilla
One shadow, many starters
Starter runs sshd on all machines, does key exchange
Starter runs the exe on first machine (head node, Rank 0)

Your Script Here:
Script on the head node has contact file
We provide samples for LAM, MPICH
We try to mimic 'by hand' startup
Use condor_ssh to start remote jobs
When script exits, Condor cleans up

Parallel Example:
[Diagram: a submit machine running the Schedd, and three execute machines, each running a Startd, an Sshd, and a Job]

Example submit file:
    Universe = Parallel
    # executable is a script
    executable = script
    # the real binary
    transfer_input_files = executable
    arguments = arg1 arg2 arg3
    machine_count = 8
    output = out.$(Cluster).$(NODE)
    queue

Example Script:
    chmod 755 simple
    lamboot -ssi boot rsh $MACHINE_FILE
    mpirun -np $NO_MACHINES simple
    lamhalt

Example submit file 2:
    Universe = Parallel
    Requirements = (Hostname == "somemachine")
    queue
    Requirements = (Hostname != "somemachine")
    queue 7

Example Script 2:
    mach1=`sed -n 1p $MACHINE_FILE`
    mach2=`sed -n 2p $MACHINE_FILE`
    ./server &
    ssh $mach1 client_app
    ssh $mach2 client_app
    wait

Summary:
With Parallel Universe in Condor 6.8 comes:
Support for most MPI implementations (some scripting required)
Somewhat better MPI scheduling
Better node placement via Condor matchmaking

Questions?
Thank you!
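
Appendix sketch: Example Script 2 picks exactly two hosts out of the machine file by line number. A minimal, hedged generalization is to loop over every line instead. The sketch below only prints the ssh commands rather than running them, so it can be tried without a cluster. It assumes the machine file holds one hostname per line; the function name build_client_commands and the demo hostnames are hypothetical and not part of Condor or of the talk.

```shell
#!/bin/sh
# Sketch only: list the client start-up command for every host in a machine file.
# Assumption (not from the talk): the machine file has one hostname per line.
# build_client_commands is a hypothetical helper name used for illustration.

build_client_commands() {
    machine_file=$1
    # Read the file line by line; the extra test handles a missing final newline.
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines
        [ -z "$host" ] && continue
        printf 'ssh %s client_app\n' "$host"
    done < "$machine_file"
}

# Demo with a throwaway machine file standing in for $MACHINE_FILE
demo_file=$(mktemp)
printf 'node1\nnode2\nnode3\n' > "$demo_file"
commands=$(build_client_commands "$demo_file")
printf '%s\n' "$commands"
rm -f "$demo_file"
```

In a real head-node script the printed commands would be executed in the background, followed by wait, in the same way Example Script 2 starts its two clients.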