Wawrzynek HERC BEE2v1


Published on January 3, 2008

Author: Denise

Source: authorstream.com

High-End Reconfigurable Computing
Berkeley Wireless Research Center, January 2004
John Wawrzynek, Robert W. Brodersen, Chen Chang, Vason Srini, Brian Richards

Berkeley Emulation Engine
An FPGA-based system for real-time hardware emulation:
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents, corresponding to 600 Gops of 16-bit adds (although it is not a logic-gate emulator)
- 2400 external parallel I/Os providing 192 Gbps of raw bandwidth

Status
- Four BEE processing units built; three in continuous "production" use
- Supported universities: CMU, USC, Tampere, UMass, Stanford
- Successful tapeouts: a 3.2M-transistor PicoRadio chip and a 1.8M-transistor LDPC decoder chip
- Systems emulated: QPSK radio transceiver, BCJR decoder, MPEG IDCT
- Ongoing projects: UWB mixed-signal SoC, MPEG transcoder, PicoRadio multi-node system, Infineon SIMD processor for SDR

Lessons from BEE
- The Simulink-based tool flow is a very effective FPGA programming model in the DSP domain.
- Many system-emulation tasks are significant computations in their own right: high-performance emulation hardware makes for high-performance general computing.
- Is this the right way to build supercomputers?
- BEE could be scaled up with the latest FPGAs and multiple boards: TeraBEE (B2).

High-End Reconfigurable Computer (HERC)
- A machine with supercomputer-level performance, configured on a per-problem basis to match the structure of the task by exploiting spatial parallelism.
- All data paths, control paths, memory ports and controllers, and communication channels and controllers are wired to match the needs of a particular problem.
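The Berkeley Emulation Engine figures quoted above are mutually consistent, and it is instructive to see what they imply. The following sketch (illustrative Python, not part of the original deck) derives the implied degree of parallelism and per-pin signalling rate:

```python
# Cross-check of the BEE figures quoted above: 600 Gops of 16-bit adds
# at a 60 MHz emulation clock, and 192 Gbps over 2400 parallel I/O pins.
emulation_clock_hz = 60e6
total_ops_per_sec = 600e9
io_pins = 2400
raw_bandwidth_bps = 192e9

# Implied number of 16-bit adders active in parallel each cycle:
parallel_adders = total_ops_per_sec / emulation_clock_hz   # 10,000 adders

# Implied signalling rate per I/O pin:
per_pin_mbps = raw_bandwidth_bps / io_pins / 1e6           # 80 Mbps per pin

print(parallel_adders, per_pin_mbps)
```

At roughly 10,000 concurrent adders, the 10M-gate capacity works out to about 1,000 gate-equivalents per adder, which is plausible for a 16-bit datapath with registers; this reading is an inference from the stated numbers, not a claim from the deck.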
Applications of Interest
- High-performance DSP/communication systems: cognitive radio or SDR, hyperspectral imaging, image processing and navigation
- Real-time scientific computation and simulation: E&M simulation, molecular dynamics
- CAD acceleration: FPGA place & route, others
- Bioinformatics

High-performance DSP
- "Stream-based" computation model
- Usually a real-time requirement
- High-bandwidth data I/O
- Low numerical-precision requirements: fixed-point or reduced floating point
- Dominated by data processing, with few control branch points

Scientific Computing
- Computationally demanding; double-precision floating point
- Traditional methods require FFTs, matrix operations, and linear-system solvers (LINPACK)
- Often a regular or adaptive grid structure
- Traditionally not real-time processing, but real-time processing would enable new applications
- Opportunities to innovate on the algorithm/mapping for reconfigurable hardware

CAD Acceleration
- The existing low-level tool flow is currently too slow to be practical for HERC systems; HERC machines should be used to accelerate their own tools.
- Some starting ideas:
  - "Hardware-Assisted Fast Routing." André DeHon, Randy Huang, and John Wawrzynek. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '02), April 22-24, 2002.
  - "Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement." Michael Wrighton and André DeHon. In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 33-42, February 2003.

Bioinformatics
- Implicitly parallel algorithms
- Stream-like data processing
- Integer operations are sufficient
- A history of success with reconfigurable architectures
- High-capacity persistent storage is required for matching against large databases

Conventional High-end Computers
- System performance in the 100s of GFLOPS to 10s of TFLOPS range.
- Using commodity components is a key idea: low-volume production makes it difficult to justify custom silicon, and commodity components ride the technology curve.
- But microprocessors are the wrong component! (clusters of commodity microprocessors)

Computation Density of Processors
- The serial instruction stream limits parallelism
- Power consumption limits performance

Xilinx Platform FPGA Roadmap
- Reconfigurable devices drive the next process step
- Simple performance scaling

FPGA Density & Flexibility
- FPGAs already offer a density advantage
- They offer problem-specific operators (@ 200 MHz = 20 GFLOPS)

Other Characteristics of High-end Microprocessor Systems
- Memory is a problem: serial von Neumann execution places heavy demands on the memory system.
- The processor-memory speed gap widens with Moore's Law, so multiple layers of caches are necessary to keep up.
- Most HEC applications derive little or no benefit from caches, yet caches add power, latency, cost, and unpredictability.
- Real-time processing is impossible because of unpredictable delays in the memory hierarchy and communication network.
- Microprocessors are inherently fault-intolerant and costly.

Characteristics of Reconfigurable Computer Systems
- Increased performance density, lower clock rate, reduced power per node.
- High spatial parallelism and circuit specialization within nodes.
- No cache: computing elements operate at the same speed as memory.
  - Multiple independently addressed memory banks per node
  - Internal FPGA SRAM can serve as a user-controlled cache if needed
- Flexible interconnection network (circuit/packet switching).
- Predictable memory and network latency permit static scheduling in real-time applications.
- FPGAs are inherently manufacturing "fault-tolerant".

B2 Design
- Approach: look at hardware configurations, evaluate them based on programming model and applications, and iterate.
- Starting constraints:
  - Use all COTS components: FPGAs, memory, connectors/cables
  - Highly modular
  - Scalable from a single module to approximately 1K FPGA chips in a system (8 TFLOPS)

Computing Node and Memory
- A single Xilinx Virtex-II Pro 70+ FPGA:
  - ~75K logic cells (4-LUT + FF, ~0.5M logic gates)
  - 1704-pin package with 996 user I/O pins
  - 2 PowerPC cores
  - ~500 dedicated 18-bit multipliers
  - ~700 KB of on-chip SRAM
  - 20 10-Gbit/s serial communication links (MGTs)
- 4 physical DDR-400 banks:
  - Each bank has 72 data bits with ECC
  - Independently addressed, with 16 logical banks in total
  - 12.8 GB/s memory bandwidth, with up to 8 GB capacity

Inter-node Connections
- Module: each group of four nodes shares a "control node" and forms a computational cluster.
- Point-to-point connection between the control node and each processing node: 144-bit 300 MHz DDR, 38.4 Gb/s per link.
- Uplinks connect to other modules to form a 4-ary tree.
- Downlinks serve I/O on leaf nodes and tree connections on switch modules.

4-ary Tree Connection
- 4-ary tree configuration
- High-bandwidth, high-latency links: 12X InfiniBand, 10 Gbps duplex
- Low-bandwidth, low-latency links: 64-pin (32-bit) LVDS @ 200 MHz DDR
- Some B2 modules act as switch nodes and aggregation computing points.

Fat-trees: Balancing Computation and Communication
- Uplink bandwidth can be partitioned to allow a family of fat-tree structures.
- A Rent's-rule-type analysis can be used to characterize an application's connection locality.
- The machine can be built or configured to match the appropriate Rent constant (possibly different at different tree levels).
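The per-node memory bandwidth quoted for the computing node above follows from the bank configuration, assuming each 72-bit interface carries 64 data bits plus 8 ECC bits (a common DDR ECC arrangement; the split is an assumption, not stated in the deck):

```python
# Per-node DRAM bandwidth check: 4 physical DDR-400 banks of 72 wired
# bits each, assumed to be 64 data bits + 8 ECC bits, with the ECC bits
# excluded from usable bandwidth.
banks = 4
data_bits_per_bank = 64          # 72 wired bits minus 8 ECC bits (assumption)
transfers_per_sec = 400e6        # DDR-400 = 400 MT/s
bandwidth_gbytes = banks * (data_bits_per_bank // 8) * transfers_per_sec / 1e9
print(bandwidth_gbytes)          # matches the 12.8 GB/s figure above
```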
Non-fat-tree (64 nodes)
- Constant cross-section bandwidth at each tree level.

Fat-tree Configuration (64 nodes)
- Cross-section bandwidth grows toward the tree root.
- Uses a higher ratio of switch to leaf modules.
- The tree structure is configured by how the modules are wired.

B2 Module: Board Layout
- 4 computing nodes, 1 control node
- Up to 40 GB of DRAM
- 8 SATA connections for up to 8 hard disks

Example B2 System
- Two modules (8 nodes) per 1U rack unit (19" x 27"), or one module plus up to 4 disks.
- Single cabinet: a 256-node tree-connected B2 system, with:
  - Up to 3.4 TB of DDR DRAM
  - >40 TOPS or 2 TFLOPS (not counting tree nodes)

Summary
- Supercomputer-level computation at a fraction of the cost and size:
  - High computational density enables small physical size.
  - Low-level redundancy enables manufacturing-fault tolerance and drastic cost reduction.
- A platform for extending the BEE approach to real-time emulation, and for experimenting with "reconfigurable computing" programming models and application domains.
- Scalable:
  - Computation/memory capacity varies with the number of modules and the FPGA generation.
  - Wiring options vary the computation/communication balance.

Spare Slides

Alternative Switch Scheme
- A specialized crossbar switch implemented as an ASIC (Mellanox)
- 200 ns latency
- Fat-tree organization with constant cross-section bandwidth

Disk Storage Schemes
- Intra-module working storage at each B2 module
- User disk storage schemes:
  - Connection to an existing NAS through Gigabit Ethernet from all B2 modules
  - Direct high-bandwidth storage nodes attached to the main crossbar network
  - A SAN bridge attached to the main crossbar network, adapting to an existing SAN
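The single-cabinet DRAM figure above is consistent with building every level of the 4-ary tree out of B2 modules. A rough check (illustrative Python; the assumption that all switch levels are fully populated modules carrying the full 40 GB is mine, not the deck's):

```python
# Rough check of the single-cabinet figure: a 4-ary tree over 64 leaf
# modules (64 * 4 = 256 compute nodes) plus switch modules at each level,
# assuming every module carries the full 40 GB of DDR DRAM.
ARITY = 4
leaf_modules = 64
compute_nodes = leaf_modules * 4          # 256 nodes, as quoted above

modules = 0
level = leaf_modules
while level >= 1:                          # 64 + 16 + 4 + 1 = 85 modules
    modules += level
    level //= ARITY

total_dram_gb = modules * 40
print(compute_nodes, modules, total_dram_gb)
```

With 85 modules at 40 GB each, the total comes to 3400 GB, i.e. the 3.4 TB quoted on the Example B2 System slide.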
