MAPLD2004 Paper198


Published on February 28, 2008

Author: Urania


Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI
Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida, Gainesville, FL

Outline
- Introduction
- N-Queens Solutions
- Backtracking Approach
- N-Queens Parallelization
- Experimental Setup
- Handel-C and Lessons Learned
- Results and Analysis
- Conclusions
- Future Work and Acknowledgements
- References

Introduction
- N-Queens dates back to the 19th century (studied by Gauss)
- A classical combinatorial problem, widely used as a benchmark because of its simple and regular structure
- The problem involves placing N queens on an N x N chessboard such that no queen can attack any other
- Benchmark code versions include finding the first solution and finding all solutions
- Mathematically stated: find a permutation of the Board() vector containing the numbers 1:N such that, for any i != j,
  Board(i) - i != Board(j) - j and Board(i) + i != Board(j) + j

N-Queens Solutions
Various approaches to the problem:
- Brute force [2]
- Local search algorithms [4]
- Backtracking [2], [7], [11], [12], [13]
- Divide and conquer [1]
- Permutation generation [2]
- Mathematical solutions [6]
- Graph theory concepts [2]
- Heuristics and AI [4], [14]

Backtracking Approach
- One of the only approaches that guarantees a solution, though it can be slow
- Can be seen as a form of intelligent depth-first search
- The complexity of backtracking typically rises exponentially with problem size
- A good test case for performance analysis of RC systems, as the problem is complex even for small data sizes: for an 8 x 8 board, 981 moves (876 tests + 105 backtracks) are required for the first solution alone
- Traditional processors provide a suboptimal platform for this iterative application due to the serial nature of their processing pipelines; tremendous speedups are achieved by adding parallelism at the logic level via RC
- Tables give an estimate of the backtracking approach's complexity (tables: number of operations for the first solution [7]; number of solutions [8])
- The problem can be posed as finding the first solution or the total number of solutions; the total number of solutions is the more challenging problem
- Interesting observation: the first solution's complexity (i.e., the number of operations) does not increase monotonically with board size

N-Queens Parallelization
- Different levels of parallelism were added to improve performance: functional-unit replication, parallel column check, and parallel row check
- Assuming the first four queens have been placed and the fifth queen starts from the 1st row: sequential checking takes 11 cycles, a parallel column check takes 3 cycles, and appending multiple row checks takes 1 cycle, an 11x speedup over sequential operation
- (chart: Parallelization Comparison)

Experimental Setup
Experiments were conducted using RC1000 boards from Celoxica, Inc., and Tarari RC boards from Tarari, Inc.
- Each RC1000 board features a Xilinx Virtex-2000 FPGA, 8 MB of on-card SRAM, and PCI Mezzanine Card (PMC) sockets for connecting two daughter cards
- Each Tarari board features two user-programmable Xilinx Virtex-II FPGAs in addition to a controller FPGA, plus 256 MB of DDR SDRAM
- Configurations were designed in Handel-C using Celoxica's application mapping tool DK-2, along with Xilinx ISE for place and route
- Performance was compared against a 2.4 GHz Xeon server and a 1.33 GHz Athlon server

Celoxica RC1000 (figure courtesy of the Celoxica RC1000 manual)
- PCI-based card with one Xilinx FPGA and four memory banks
- The FPGA is configured from the host processor over the PCI bus
- The four memory banks, each of 2 MB, are accessible to both the FPGA and any other device on the PCI bus
- The RC1000 provides three methods of transferring data over the PCI bus between the host processor and the FPGA:
  - Bulk data transfers via the memory banks
  - Two unidirectional 8-bit ports, called the control and status ports, for direct communication between the FPGA and the PCI bus (the method used in our experiments)
  - User I/O pins USER1 and USER0 for single-bit communication with the FPGA
- API-layer calls from the host configure and communicate with the RC board

Tarari Content Processing Platform (figure courtesy of the Tarari CP-DK manual)

Handel-C Programming Paradigm
Handel-C acts as a bridge between VHDL and C.
- Compared with conventional C:
  - Parallelism is provisioned more explicitly within the code
  - Variables are declared with exact bit-lengths to save space
  - More bit-level manipulations are provided beyond shifts and logic operations
  - Limited support for many ANSI C standards and extensions
- Compared with VHDL:
  - Application porting is much faster for experienced coders
  - Similar to VHDL behavioral models
  - Lacks VHDL concurrent signal assignments, which can be suspended until changes on input triggers (Handel-C requires polling)
  - Provides more higher-level routines

Handel-C Design Specifics
The design makes use of the following two approaches:
- Approach 1: an array of binary numbers holds a '1' at a particular bit position to indicate the location of the queen in that column; a 32 x 32 board requires an array of 32 elements of 32 bits each; bit-shift and logical-AND operations check the diagonal and row conditions; this corresponds more closely to the way the operations take place on the RC fabric
- Approach 2: an array of integers instead of binary numbers; the mathematical model of the problem checks the validation conditions; smaller variables yield better device utilization, with occupied slices dropping from about 75% to about 15% for similar performance and parallelism
- Approach 2 was found to be more amenable to Handel-C designs

Lessons Learned with Handel-C
Some interesting observations:
- Code for which place and route did not work finally worked when the function parameters were replaced by global variables
- Less control at the lower level, with place and route a consistent problem even for designs using only 40% of total slices
- Self-referenced operations (e.g., a = a + x) affect the design adversely, so use intermediate variables
- The order of operations and conditional statements can affect the design
- It is useful to reduce wider-bit operations into a sequence of narrower-bit operations
- Balancing "if" with "else" branches leads to better designs
- Comments in the main program sometimes affected synthesis, leading to place-and-route errors in fully commented code
- We are still learning more every day!

Sequential First-Solution Results (RC1000 clock speed: 40 MHz)
- The sequential version does not perform well versus the Xeon and Athlon CPUs
- The algorithm needs an efficient design to minimize resource utilization
- The results do not include the one-time configuration overhead of ~150 ms

Parallel First-Solution Results (RC1000 clock speed: 25 MHz)
- The most parallel algorithm runs about 20x faster than the sequential algorithm on the RC fabric
- The parallel algorithm with two row checks almost duplicates the behavior of the 2.4 GHz Xeon server, while the 6-row check outperforms it by 74%
- Further increasing the number of rows checked is likely to further improve performance for larger problem sizes

Total Number of Solutions Method
- Employs a divide-and-conquer approach, which can be seen as a parallel depth-first search
- Solutions obtained with the queen positioned in any row of the first column are independent of solutions with the queen in other positions
- The technique allows for a high degree of parallelism (DoP)

One-Board Total-Solutions Results (RC1000 and Tarari clock speed: 33 MHz)
- Designs in hardware perform around 1.7x faster than the Xeon server
- Performance on both RC platforms is similar at the same clock rate
- The RC1000 performs a notch better for smaller chessboard sizes, while the Tarari CPP's performance improves with chessboard size
- Almost the entire Virtex-II chip on the Tarari is occupied by one FU

Multiple Functional Units (FUs)
- Additional FUs per chip were used to increase parallelism per chip
- Each FU searches for the number of solutions corresponding to a subset of rows in the first column
- The controller handles communication with the host, invokes all FUs in parallel, and combines all results
- (diagram: host processor connected to an on-board FPGA containing a controller and functional units fu1 through fu10)

Total-Solutions Results with Multiple FUs (RC1000 clock speed: 30 MHz; RC speedup vs. Xeon server for board size of 17)
- The RC1000 with three FUs performs almost 5x faster than the Xeon server
- Speedup increases near-linearly with the number of FUs
- Occupied area scales linearly with the number of FUs

MPI for Inter-Board Communication
- To further increase system speedup (by having more functional units), multiple boards were employed
- Each FU is programmed to search a subset of the solution space
- The servers communicate using the Message Passing Interface (MPI) to start the search in parallel and obtain the final result
- (diagram: two host servers, each hosting an on-board FPGA with one or more FUs, communicating via MPI)

Total-Solutions Results with MPI (Tarari CPP clock speed: 33 MHz; RC speedup vs. Xeon server for board size of 12)
- Results show total execution time including MPI overhead
- Minimal MPI overhead is incurred (high computation-to-communication ratio)
- Communication overhead is bounded to 3 ms regardless of problem size, and initialization overhead is around 750 ms
- Overhead becomes negligible for large problem sizes
- Speedup scales near-linearly with the number of boards
- The 4-board Tarari design performs about 6.5x faster than the Xeon server

Total-Solutions Results with MPI (RC1000 clock speed: 30 MHz; RC speedup vs. Xeon server for board size of 12)
- The same overhead behavior holds: communication overhead bounded to 3 ms, initialization overhead around 750 ms, both negligible for large problem sizes
- Speedup scales near-linearly with the number of boards
- The 4-board RC1000 design performs about 12x faster than the Xeon server

Total-Solutions Results with MPI (RC1000 clock speed: 30 MHz; Tarari clock speed: 33 MHz)
- Communication overheads remain low, while MPI initialization overheads increase with the number of boards (now 1316 ms for 8 boards)
- A heterogeneous mixture of boards, coordinating via MPI, was employed to solve the problem
- A total of 8 boards (4 RC1000 and 4 Tarari boards) allows up to 16 (4 x 3 + 4 x 1) FUs
- The 8 boards perform about 21x faster than the Xeon server for a chessboard size of 16
- What appears to be an unfair comparison really shows how the approach scales to many more FUs per FPGA (on higher-density chips)

Conclusions
- Parallel backtracking for solving the N-Queens problem in RC shows promise for performance
- N-Queens is an important benchmark in the HPC community
- RC devices outperform CPUs for N-Queens due to RC's efficient processing of fine-grained, parallel bit-manipulation operations
- Methods previously inefficient on CPUs, such as backtracking, can be improved by reexamining their design; this approach can be applied to many other applications
- Numerous parallel approaches were developed at several levels
- Handel-C lessons learned: a "C-based" programming model for application mapping provides a degree of higher-level abstraction, yet still requires the programmer to code from a hardware perspective; the solutions produced to date show promise for application mapping

Future Work and Acknowledgements
- Compare application mappers with HDL design in terms of mapping efficiency
- Develop and use direct communication between FPGAs to avoid MPI overhead
- Export the approach featured in this talk to a variety of algorithms and HPC benchmarks for performance analysis and optimization
- Develop a library of application and middleware kernels for RC-based HPC
We wish to thank the following for their support of this research: the Department of Defense, Xilinx, Celoxica, Tarari, and the key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel).

References
[1] B. Abramson and M. M. Yung, "Divide and Conquer under Global Constraints: A Solution to the N-Queens Problem"
[2] C. Erbas, S. Sarkeshik, and M. M. Tanik, "Different Perspectives of the N-Queens Problem", Department of Computer Science and Engineering, Southern Methodist University, Dallas
[3] H. S. Wilf, "Algorithms and Complexity", University of Pennsylvania, Philadelphia
[4] R. Sosic and J. Gu, "Fast Search Algorithms for N-Queens Problem", IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 6, pp. 1572-1576, Nov./Dec. 1991
[5]
[6]
[7]
[8]
[9] "A Polynomial Time Algorithm for N-Queens Problem"
[10]
[11]
[12]
[13]
[14] J. Han, J. Liu, and Q. Cai, "From Alife Agents to a Kingdom of N Queens"
[15]
[16]
