Scaling Up Next Generation Supercomputers
Monday, May 5
Historically, technology has been the main driver of computer performance. For many system generations, CMOS scaling has been leveraged to increase clock speed and to build increasingly complex microarchitectures. As technology-driven performance gains become increasingly hard to achieve, innovative system architecture must step in. In the context of the design of the Blue Gene/P supercomputer chip, we will discuss how we adopted a holistic approach to optimizing the entire hardware and software stack for a range of metrics: performance, power, power/performance, reliability, and ease of use.
The new Blue Gene/P chip multiprocessor (CMP) scales node performance using a multi-core system-on-a-chip design. While in the past large symmetric multiprocessor (SMP) designs were sized to handle large amounts of coherence traffic, many modern CMP designs find this cost prohibitive in terms of area, power dissipation, and design complexity. As multi-core processors evolve to larger configurations, the performance loss due to handling coherence traffic must be carefully managed. Thus, to ensure high efficiency of each quad-processor node in Blue Gene/P, taming the coherence cost of traditional SMP designs was a key requirement.
The new Blue Gene/P chip multiprocessor exploits a novel way of reducing coherence cost by filtering useless coherence actions. Each processor core is paired with a snoop filter which identifies and discards unnecessary coherence requests before they can reach the processor cores. Removing unnecessary lookups reduces the interference of invalidate requests with L1 data cache accesses, and reduces power by eliminating expensive tag array accesses. This approach results in improved power and performance characteristics.
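To make the filtering idea concrete, the toy C model below sketches one possible filtering technique: a small table of cache-line addresses known to be absent from the local L1, so that repeated invalidations to those lines are discarded without probing the tag array. The table size, line size, and replacement policy are illustrative assumptions for this sketch, not details of the actual Blue Gene/P hardware, which combines several filtering mechanisms.

/* Toy software model of one possible snoop-filtering technique: remember
   cache-line addresses already known to be absent from the local L1 so that
   repeated invalidations to those lines can be dropped without a tag lookup.
   Sizes and replacement policy are illustrative, not the real design. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FILTER_ENTRIES 32          /* illustrative capacity */
#define LINE_SHIFT     5           /* 32-byte cache lines assumed here */

typedef struct {
    uint32_t line[FILTER_ENTRIES]; /* line addresses known absent from the L1 */
    bool     valid[FILTER_ENTRIES];
    unsigned next;                 /* simple round-robin replacement pointer   */
} snoop_filter_t;

/* Returns true if an incoming invalidation must still be forwarded to the L1,
   false if the filter can safely discard it. */
bool snoop_filter_check(snoop_filter_t *f, uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;
    for (unsigned i = 0; i < FILTER_ENTRIES; i++)
        if (f->valid[i] && f->line[i] == line)
            return false;          /* line already absent: filter the snoop */

    /* Unknown line: forward the snoop, and remember that after the resulting
       invalidation this line is no longer present in the L1. */
    f->line[f->next]  = line;
    f->valid[f->next] = true;
    f->next = (f->next + 1) % FILTER_ENTRIES;
    return true;
}

/* On a local L1 refill, any matching filter entry must be cleared: the line is
   present again, so future snoops for it must reach the cache. */
void snoop_filter_local_fill(snoop_filter_t *f, uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;
    for (unsigned i = 0; i < FILTER_ENTRIES; i++)
        if (f->valid[i] && f->line[i] == line)
            f->valid[i] = false;
}

int main(void)
{
    snoop_filter_t f = {0};
    unsigned forwarded = 0;
    for (int k = 0; k < 3; k++)                  /* repeated invalidations     */
        forwarded += snoop_filter_check(&f, 0x1000);
    snoop_filter_local_fill(&f, 0x1000);         /* the line returns to the L1 */
    forwarded += snoop_filter_check(&f, 0x1000);
    printf("snoops forwarded to the L1: %u of 4\n", forwarded);  /* prints 2 */
    return 0;
}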
To optimize application performance, we exploit parallelism at multiple levels: the process level, thread level, data level, and instruction level. Hardware-supported coherence allows applications to efficiently share data between threads on different processors for thread-level parallelism, while the dual floating-point unit and the dual-issue out-of-order PowerPC 450 processor core exploit data-level and instruction-level parallelism, respectively. To exploit process-level parallelism, special emphasis was put on efficient communication primitives, including hardware support for the MPI protocol, such as low-latency barriers, and five highly optimized communication networks. A new high-performance DMA unit supports high-throughput data transfers.
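As a concrete illustration of combining process-level and thread-level parallelism, the short C program below uses one MPI process per node and several OpenMP threads within each process, with an MPI_Allreduce standing in for the optimized network collectives. The use of OpenMP, the toy reduction, and the process/thread mapping are assumptions made for this sketch rather than Blue Gene/P specifics.

/* Hybrid sketch: threads inside a node (OpenMP) plus message passing across
   nodes (MPI).  Compile with an MPI compiler wrapper and OpenMP enabled. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI library does not support funneled threading\n");

    const int n = 1 << 20;          /* per-process slice of a global sum */
    double local = 0.0;

    /* Thread-level parallelism inside the node (e.g. its four cores). */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (1.0 + i + (double)rank * n);

    /* Process-level parallelism across nodes: a collective reduction. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}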
As a result of this deliberate design-for-scalability approach, Blue Gene supercomputers offer unprecedented scalability, in some cases by orders of magnitude, to a wide range of scientific applications. A broad range of scientific applications on Blue Gene supercomputers has advanced scientific discovery, which is the real merit and ultimate measure of success of the Blue Gene system family.
Data-parallel Abstractions for Irregular Programs
Tuesday, May 6
Client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as graphs and trees. In her 2007 Turing Award lecture, Fran Allen raised an important question about such programs: do irregular programs have data parallelism, and if so, how do we exploit it on multicore processors?
In this talk, we demonstrate using concrete examples that irregular programs have a generalized data parallelism that arises from the use of iterative algorithms that manipulate worklists of various sorts. We also argue that optimistic parallelization is required to exploit it, since compile-time parallelization techniques based on points-to and shape analysis cannot expose the parallelism in these programs.
We then describe the approach taken in the Galois project to exploit this generalized data-parallelism. There are three main aspects to the Galois system: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by optimistic computations. We present experimental results that demonstrate that the Galois approach is practical, and discuss ongoing work on this system.
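To illustrate the kind of construct meant in point (1), the small C program below shows the sequential semantics of an unordered worklist iteration: elements are processed in no particular order, and each iteration may add new elements to the set. In the Galois system such a loop would be executed optimistically in parallel, with the runtime detecting unsafe accesses to shared data and rolling back the offending iterations; none of that machinery is shown here, and the graph-reachability example is an illustrative choice, not one taken from the talk.

/* Sequential sketch of the unordered-set iteration pattern that Galois-style
   constructs capture: process active elements in any order; processing one
   element may generate new active elements. */
#include <stdbool.h>
#include <stdio.h>

#define NNODES 6

/* small fixed adjacency list: edges[i] lists the neighbours of node i, -1 ends */
static const int edges[NNODES][NNODES] = {
    {1, 2, -1}, {3, -1}, {3, 4, -1}, {5, -1}, {5, -1}, {-1}
};

int main(void)
{
    bool reached[NNODES] = { false };
    int  worklist[NNODES];            /* unordered set of active nodes */
    int  top = 0;

    worklist[top++] = 0;              /* seed the worklist with the source */
    reached[0] = true;

    /* "for each element of the unordered set": the order in which elements
       are removed does not affect the final result, which is what makes the
       iterations candidates for optimistic parallel execution. */
    while (top > 0) {
        int n = worklist[--top];
        for (int j = 0; edges[n][j] != -1; j++) {
            int m = edges[n][j];
            if (!reached[m]) {        /* shared state: in Galois this access
                                         would be guarded by the runtime */
                reached[m] = true;
                worklist[top++] = m;  /* iterations may create new work */
            }
        }
    }

    for (int i = 0; i < NNODES; i++)
        printf("node %d reached: %s\n", i, reached[i] ? "yes" : "no");
    return 0;
}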
JANUS: Reconfigurable High-Performance Computing for Physics
Wednesday, May 7
Several applications in computational physics, chemistry, and biology are still beyond the reach of state-of-the-art computers, whenever studying the surprisingly complex behavior governed by some simple dynamical equations requires inordinately long execution times.
This is the case, for instance, for spin glasses: apparently simple statistical models used to describe the properties of disordered materials, and theoretically interesting also as a paradigm of complexity. These models use discrete variables (called spins) sitting at the sites of discrete and regular d-dimensional lattices (typically d = 2…4); their properties are usually studied with Monte Carlo simulation techniques. Integration over the configuration space of a three-dimensional lattice of 50³ sites, requiring up to 10¹³ Monte Carlo steps, is still an intractable task. One has to collect statistics for several (≅ 10²) copies of the system, corresponding to 10²⁰ Monte Carlo spin updates. Spin variables take values in a discrete and finite set (just two values, ±1, in several important models), while traditional data-paths use long data words. An established approach to boosting performance uses each bit of the computer data word to store the same spin of several independent configurations, so that (e.g., using the SIMD data-paths present in most architectures) O(10²) copies of the system, more or less all the copies we need to study, are handled in parallel. Using these tricks, an average update rate of one spin every ≅ 1 ns is possible, meaning that on one state-of-the-art processor our simulation program would run for ≅ 10,000 years.
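To make the bit-packing trick concrete, the single-core C sketch below stores 64 independent replicas of a small 3D ±J spin glass in the 64 bits of each word, counts unsatisfied bonds per replica with bit-sliced full-adder logic, and applies a Metropolis acceptance test to all replicas at once. The lattice size, inverse temperature, PRNG, and the simplification of sharing one random number per site update across the replicas are all assumptions of this toy sketch, not features of the JANUS machine or of any production code.

/* Multi-spin coding sketch: bit b of each 64-bit word is the spin of replica b
   at that site, so one pass of bitwise logic advances 64 independent systems. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define L     8                      /* toy lattice: L x L x L, illustrative   */
#define N     (L * L * L)
#define BETA  0.8                    /* inverse temperature, illustrative      */

static uint64_t spin[N];             /* bit b = spin of replica b at this site */
static uint64_t Jx[N], Jy[N], Jz[N]; /* +x/+y/+z bonds: 0 means +J, ~0 means -J */

static uint64_t rng = 0x9E3779B97F4A7C15ULL;
static uint64_t xorshift64(void) { rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17; return rng; }
static double   uniform01(void)  { return (xorshift64() >> 11) * (1.0 / 9007199254740992.0); }

static int idx(int x, int y, int z) { return (z * L + y) * L + x; }

/* bitwise full adder: per-bit sum and carry of three one-bit operands */
static void fulladd(uint64_t a, uint64_t b, uint64_t c, uint64_t *s, uint64_t *cy)
{ *s = a ^ b ^ c; *cy = (a & b) | (c & (a ^ b)); }

static void sweep(void)
{
    for (int z = 0; z < L; z++)
    for (int y = 0; y < L; y++)
    for (int x = 0; x < L; x++) {
        int i  = idx(x, y, z);
        int xp = idx((x + 1) % L, y, z), xm = idx((x + L - 1) % L, y, z);
        int yp = idx(x, (y + 1) % L, z), ym = idx(x, (y + L - 1) % L, z);
        int zp = idx(x, y, (z + 1) % L), zm = idx(x, y, (z + L - 1) % L);

        /* per-replica "unsatisfied bond" masks for the six bonds of site i */
        uint64_t d0 = spin[i] ^ spin[xp] ^ Jx[i];
        uint64_t d1 = spin[i] ^ spin[xm] ^ Jx[xm];
        uint64_t d2 = spin[i] ^ spin[yp] ^ Jy[i];
        uint64_t d3 = spin[i] ^ spin[ym] ^ Jy[ym];
        uint64_t d4 = spin[i] ^ spin[zp] ^ Jz[i];
        uint64_t d5 = spin[i] ^ spin[zm] ^ Jz[zm];

        /* bit-sliced count k (0..6) of unsatisfied bonds, one per replica */
        uint64_t s1, ca, s2, cb, c0, c1, c2;
        fulladd(d0, d1, d2, &s1, &ca);
        fulladd(d3, d4, d5, &s2, &cb);
        c0 = s1 ^ s2;
        fulladd(ca, cb, s1 & s2, &c1, &c2);   /* k = c0 + 2*c1 + 4*c2 */

        /* Metropolis: flipping turns k unsatisfied bonds into 6-k, so
           dE = 12 - 4k; accept where k >= Kmin(r), one shared r per site */
        double r = uniform01();
        uint64_t accept;
        if      (r < exp(-BETA * 12.0)) accept = ~0ULL;          /* k >= 0 */
        else if (r < exp(-BETA *  8.0)) accept = c2 | c1 | c0;   /* k >= 1 */
        else if (r < exp(-BETA *  4.0)) accept = c2 | c1;        /* k >= 2 */
        else                            accept = c2 | (c1 & c0); /* k >= 3 */

        spin[i] ^= accept;                    /* flip accepted replicas only */
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) {             /* random spins and +/-J bonds */
        spin[i] = xorshift64();
        Jx[i] = (xorshift64() & 1) ? ~0ULL : 0;
        Jy[i] = (xorshift64() & 1) ? ~0ULL : 0;
        Jz[i] = (xorshift64() & 1) ? ~0ULL : 0;
    }
    for (int t = 0; t < 100; t++)
        sweep();
    printf("done: %d sweeps of 64 replicas on a %dx%dx%d lattice\n", 100, L, L, L);
    return 0;
}

The same bitwise update logic is exactly what a hardware implementation can replicate many times over within one device, which is the kind of parallelism an architecture like JANUS sets out to exploit.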
Once this embarrassingly parallel approach is used, very little additional space for useful parallelization is left on conventional parallel architectures (Monte Carlo evolution is inherently serial in the time direction), so there are no obvious strategies able to cut simulation time to a figure compatible with human timescales. What is really needed is an architecture able to express the huge parallelism intrinsically available in the evolution of each copy of the system.
We have developed the JANUS system precisely in order to face this problem. We do so by accurately matching its architecture to the structure of the relevant algorithms. Each JANUS processor is able to perform the needed 10¹³ Monte Carlo steps on one copy of the system in just a few months, so the massive system of 256 processors that we plan to deploy in spring 2008 brings the wall-clock time needed for our simulation program down to about one year.
JANUS is based on FPGAs for implementation simplicity and added flexibility, so it is an example of a reconfigurable architecture onto which a high-performance application has been successfully configured. It also turns out that JANUS, once configured for spin-glass simulations, has several commonalities with the many-core architectures that are clearly emerging as a recent trend in computer systems.
JANUS provides performance for spin-glass simulations by exploiting the parallelism available in the application up to the level of the available computing resources. This is done following two rather simple ideas:
These features are rather well supported by recent high-end FPGA architectures. A key feature is the large number of independent memory blocks, which can be configured to provide the huge memory bandwidth that is needed (on the order of several thousand bits per clock cycle).
Other specific features of our Monte Carlo algorithm help boost performance for our system:
I will describe the details of the JANUS platform and discuss how it can deliver outstanding performance for its (admittedly limited) class of target applications. I will then try to draw some lessons that are hopefully applicable to a more general context: