Scaling Up Next Generation Supercomputers
Monday, May 5
Historically, technology has been the main driver of computer performance. For many system generations, CMOS scaling has been leveraged to increase clock speed and to build increasingly complex microarchitectures. As technology-driven performance gains become increasingly hard to achieve, innovative system architecture must step in. In the context of the design of the Blue Gene/P supercomputer chip, we will discuss how we adopted a holistic approach to optimizing the entire hardware and software stack for a range of metrics: performance, power, power/performance, reliability, and ease of use.
The new Blue Gene/P chip multiprocessor (CMP) scales node performance using a multi-core system-on-a-chip design. While in the past large symmetric multiprocessor (SMP) designs were sized to handle large amounts of coherence traffic, many modern CMP designs find this cost prohibitive in terms of area, power dissipation, and design complexity. As multi-core processors evolve to larger configurations, the performance loss due to handling coherence traffic must be carefully managed. Thus, to ensure high efficiency of each quad-processor node in Blue Gene/P, taming the coherence cost of traditional SMP designs was a key requirement.
The new Blue Gene/P chip multiprocessor exploits a novel way of reducing coherence cost by filtering useless coherence actions. Each processor core is paired with a snoop filter which identifies and discards unnecessary coherence requests before they can reach the processor cores. Removing unnecessary lookups reduces the interference of invalidate requests with L1 data cache accesses, and reduces power by eliminating expensive tag array accesses. This approach results in improved power and performance characteristics.
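To make the filtering idea concrete, the toy C model below sketches one possible filtering technique: a small table of cache-line addresses known to be absent from the local L1, so that repeated invalidations to those lines are discarded without probing the tag array. The table size, line size, and replacement policy are illustrative assumptions for this sketch, not details of the actual Blue Gene/P hardware, which combines several filtering mechanisms.

/* Toy software model of one possible snoop-filtering technique: remember
   cache-line addresses already known to be absent from the local L1 so that
   repeated invalidations to those lines can be dropped without a tag lookup.
   Sizes and replacement policy are illustrative, not the real design. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FILTER_ENTRIES 32          /* illustrative capacity */
#define LINE_SHIFT     5           /* 32-byte cache lines assumed here */

typedef struct {
    uint32_t line[FILTER_ENTRIES]; /* line addresses known absent from the L1 */
    bool     valid[FILTER_ENTRIES];
    unsigned next;                 /* simple round-robin replacement pointer   */
} snoop_filter_t;

/* Returns true if an incoming invalidation must still be forwarded to the L1,
   false if the filter can safely discard it. */
bool snoop_filter_check(snoop_filter_t *f, uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;
    for (unsigned i = 0; i < FILTER_ENTRIES; i++)
        if (f->valid[i] && f->line[i] == line)
            return false;          /* line already absent: filter the snoop */

    /* Unknown line: forward the snoop, and remember that after the resulting
       invalidation this line is no longer present in the L1. */
    f->line[f->next]  = line;
    f->valid[f->next] = true;
    f->next = (f->next + 1) % FILTER_ENTRIES;
    return true;
}

/* On a local L1 refill, any matching filter entry must be cleared: the line is
   present again, so future snoops for it must reach the cache. */
void snoop_filter_local_fill(snoop_filter_t *f, uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;
    for (unsigned i = 0; i < FILTER_ENTRIES; i++)
        if (f->valid[i] && f->line[i] == line)
            f->valid[i] = false;
}

int main(void)
{
    snoop_filter_t f = {0};
    unsigned forwarded = 0;
    for (int k = 0; k < 3; k++)                  /* repeated invalidations     */
        forwarded += snoop_filter_check(&f, 0x1000);
    snoop_filter_local_fill(&f, 0x1000);         /* the line returns to the L1 */
    forwarded += snoop_filter_check(&f, 0x1000);
    printf("snoops forwarded to the L1: %u of 4\n", forwarded);  /* prints 2 */
    return 0;
}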
To optimize application performance, we exploit parallelism at multiple levels: the process level, thread level, data level, and instruction level. Hardware-supported coherence allows applications to efficiently share data between threads on different processors for thread-level parallelism, while the dual floating-point unit and the dual-issue out-of-order PowerPC 450 processor core exploit data-level and instruction-level parallelism, respectively. To exploit process-level parallelism, special emphasis was put on efficient communication primitives, including hardware support for the MPI protocol, such as low-latency barriers, and five highly optimized communication networks. A new high-performance DMA unit supports high-throughput data transfers.
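As a concrete illustration of combining process-level and thread-level parallelism, the short C program below uses one MPI process per node and several OpenMP threads within each process, with an MPI_Allreduce standing in for the optimized network collectives. The use of OpenMP, the toy reduction, and the process/thread mapping are assumptions made for this sketch rather than Blue Gene/P specifics.

/* Hybrid sketch: threads inside a node (OpenMP) plus message passing across
   nodes (MPI).  Compile with an MPI compiler wrapper and OpenMP enabled. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI library does not support funneled threading\n");

    const int n = 1 << 20;          /* per-process slice of a global sum */
    double local = 0.0;

    /* Thread-level parallelism inside the node (e.g. its four cores). */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (1.0 + i + (double)rank * n);

    /* Process-level parallelism across nodes: a collective reduction. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}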
As a result of this deliberate design-for-scalability approach, Blue Gene supercomputers offer unprecedented scalability, in some cases by orders of magnitude, to a wide range of scientific applications. A broad range of scientific applications on Blue Gene supercomputers has advanced scientific discovery, which is the real merit and ultimate measure of success of the Blue Gene system family.
Data-parallel Abstractions for Irregular Programs
Tuesday, May 6
Client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as graphs and trees. In her 2007 Turing Award lecture, Fran Allen raised an important question about such programs: do irregular programs have data parallelism, and if so, how do we exploit it on multicore processors?
In this talk, we demonstrate using concrete examples that irregular programs have a generalized data parallelism that arises from the use of iterative algorithms that manipulate worklists of various sorts. We also argue that optimistic parallelization is required to exploit it, since compile-time parallelization techniques based on points-to and shape analysis cannot expose the parallelism in these programs.
We then describe the approach taken in the Galois project to exploit this generalized data-parallelism. There are three main aspects to the Galois system: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by optimistic computations. We present experimental results that demonstrate that the Galois approach is practical, and discuss ongoing work on this system.
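To illustrate the kind of construct meant in point (1), the small C program below shows the sequential semantics of an unordered worklist iteration: elements are processed in no particular order, and each iteration may add new elements to the set. In the Galois system such a loop would be executed optimistically in parallel, with the runtime detecting unsafe accesses to shared data and rolling back the offending iterations; none of that machinery is shown here, and the graph-reachability example is an illustrative choice, not one taken from the talk.

/* Sequential sketch of the unordered-set iteration pattern that Galois-style
   constructs capture: process active elements in any order; processing one
   element may generate new active elements. */
#include <stdbool.h>
#include <stdio.h>

#define NNODES 6

/* small fixed adjacency list: edges[i] lists the neighbours of node i, -1 ends */
static const int edges[NNODES][NNODES] = {
    {1, 2, -1}, {3, -1}, {3, 4, -1}, {5, -1}, {5, -1}, {-1}
};

int main(void)
{
    bool reached[NNODES] = { false };
    int  worklist[NNODES];            /* unordered set of active nodes */
    int  top = 0;

    worklist[top++] = 0;              /* seed the worklist with the source */
    reached[0] = true;

    /* "for each element of the unordered set": the order in which elements
       are removed does not affect the final result, which is what makes the
       iterations candidates for optimistic parallel execution. */
    while (top > 0) {
        int n = worklist[--top];
        for (int j = 0; edges[n][j] != -1; j++) {
            int m = edges[n][j];
            if (!reached[m]) {        /* shared state: in Galois this access
                                         would be guarded by the runtime */
                reached[m] = true;
                worklist[top++] = m;  /* iterations may create new work */
            }
        }
    }

    for (int i = 0; i < NNODES; i++)
        printf("node %d reached: %s\n", i, reached[i] ? "yes" : "no");
    return 0;
}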
JANUS: Reconfigurable High-Performance Computing for Physics
Wednesday, May 7
Several applications in computational physics, chemistry, and biology are still beyond the reach of state-of-the-art computers, whenever studying the surprisingly complex behavior governed by some simple dynamical equations requires inordinately long execution times.
This is the case, for instance, for spin glasses: apparently simple statistical models used to describe the properties of disordered materials, and theoretically interesting also as a paradigm of complexity. These models use discrete variables (called spins) sitting at the sites of discrete and regular d-dimensional lattices (typically d = 2…4); their properties are usually studied with Monte Carlo simulation techniques. Integration over the configuration space of a three-dimensional lattice of 50³ sites, requiring up to 10¹³ Monte Carlo steps, is still an intractable task. One has to collect statistics for several (≅ 10²) copies of the system, corresponding to 10²⁰ Monte Carlo spin updates. Spin variables take values in a discrete and finite set (just two values, ±1, in several important models), while traditional data-paths use long data words. An established approach to boosting performance uses each bit of the computer data word to store the same spin of several independent configurations, so that (e.g., using the SIMD data-paths present in most architectures) O(10²) copies of the system, more or less all the copies we need to study, are handled in parallel. Using these tricks, an average update rate of one spin every ≅ 1 ns is possible, meaning that on one state-of-the-art processor our simulation program would run for ≅ 10,000 years.
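To make the bit-packing trick concrete, the single-core C sketch below stores 64 independent replicas of a small 3D ±J spin glass in the 64 bits of each word, counts unsatisfied bonds per replica with bit-sliced full-adder logic, and applies a Metropolis acceptance test to all replicas at once. The lattice size, inverse temperature, PRNG, and the simplification of sharing one random number per site update across the replicas are all assumptions of this toy sketch, not features of the JANUS machine or of any production code.

/* Multi-spin coding sketch: bit b of each 64-bit word is the spin of replica b
   at that site, so one pass of bitwise logic advances 64 independent systems. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define L     8                      /* toy lattice: L x L x L, illustrative   */
#define N     (L * L * L)
#define BETA  0.8                    /* inverse temperature, illustrative      */

static uint64_t spin[N];             /* bit b = spin of replica b at this site */
static uint64_t Jx[N], Jy[N], Jz[N]; /* +x/+y/+z bonds: 0 means +J, ~0 means -J */

static uint64_t rng = 0x9E3779B97F4A7C15ULL;
static uint64_t xorshift64(void) { rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17; return rng; }
static double   uniform01(void)  { return (xorshift64() >> 11) * (1.0 / 9007199254740992.0); }

static int idx(int x, int y, int z) { return (z * L + y) * L + x; }

/* bitwise full adder: per-bit sum and carry of three one-bit operands */
static void fulladd(uint64_t a, uint64_t b, uint64_t c, uint64_t *s, uint64_t *cy)
{ *s = a ^ b ^ c; *cy = (a & b) | (c & (a ^ b)); }

static void sweep(void)
{
    for (int z = 0; z < L; z++)
    for (int y = 0; y < L; y++)
    for (int x = 0; x < L; x++) {
        int i  = idx(x, y, z);
        int xp = idx((x + 1) % L, y, z), xm = idx((x + L - 1) % L, y, z);
        int yp = idx(x, (y + 1) % L, z), ym = idx(x, (y + L - 1) % L, z);
        int zp = idx(x, y, (z + 1) % L), zm = idx(x, y, (z + L - 1) % L);

        /* per-replica "unsatisfied bond" masks for the six bonds of site i */
        uint64_t d0 = spin[i] ^ spin[xp] ^ Jx[i];
        uint64_t d1 = spin[i] ^ spin[xm] ^ Jx[xm];
        uint64_t d2 = spin[i] ^ spin[yp] ^ Jy[i];
        uint64_t d3 = spin[i] ^ spin[ym] ^ Jy[ym];
        uint64_t d4 = spin[i] ^ spin[zp] ^ Jz[i];
        uint64_t d5 = spin[i] ^ spin[zm] ^ Jz[zm];

        /* bit-sliced count k (0..6) of unsatisfied bonds, one per replica */
        uint64_t s1, ca, s2, cb, c0, c1, c2;
        fulladd(d0, d1, d2, &s1, &ca);
        fulladd(d3, d4, d5, &s2, &cb);
        c0 = s1 ^ s2;
        fulladd(ca, cb, s1 & s2, &c1, &c2);   /* k = c0 + 2*c1 + 4*c2 */

        /* Metropolis: flipping turns k unsatisfied bonds into 6-k, so
           dE = 12 - 4k; accept where k >= Kmin(r), one shared r per site */
        double r = uniform01();
        uint64_t accept;
        if      (r < exp(-BETA * 12.0)) accept = ~0ULL;          /* k >= 0 */
        else if (r < exp(-BETA *  8.0)) accept = c2 | c1 | c0;   /* k >= 1 */
        else if (r < exp(-BETA *  4.0)) accept = c2 | c1;        /* k >= 2 */
        else                            accept = c2 | (c1 & c0); /* k >= 3 */

        spin[i] ^= accept;                    /* flip accepted replicas only */
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) {             /* random spins and +/-J bonds */
        spin[i] = xorshift64();
        Jx[i] = (xorshift64() & 1) ? ~0ULL : 0;
        Jy[i] = (xorshift64() & 1) ? ~0ULL : 0;
        Jz[i] = (xorshift64() & 1) ? ~0ULL : 0;
    }
    for (int t = 0; t < 100; t++)
        sweep();
    printf("done: %d sweeps of 64 replicas on a %dx%dx%d lattice\n", 100, L, L, L);
    return 0;
}

The same bitwise update logic is exactly what a hardware implementation can replicate many times over within one device, which is the kind of parallelism an architecture like JANUS sets out to exploit.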
Once this embarrassingly parallel approach is used, very little additional space for useful parallelization is left on conventional parallel architectures (Monte Carlo evolution is inherently serial in the time direction), so there are no obvious strategies able to cut simulation time to a figure compatible with human timescales. What is really needed is an architecture able to express the huge parallelism intrinsically available in the evolution of each copy of the system.
We have developed the JANUS system precisely in order to face this problem. We do so by accurately matching its architecture to the structure of the relevant algorithms. Each JANUS processor is able to perform the needed 10¹³ Monte Carlo steps on one copy of the system in just a few months, so the massive system of 256 processors that we plan to deploy in spring 2008 brings the wall-clock time needed for our simulation program down to about one year.
JANUS is based on FPGAs for implementation simplicity and added flexibility, so it is an example of a reconfigurable architecture onto which a high-performance application has been successfully configured. It also turns out that JANUS, once configured for spin-glass simulations, has several commonalities with the many-core architectures that are clearly emerging as a recent trend in computer systems.
JANUS provides performance for spin-glass simulations by exploiting the parallelism available in the application up to the level of the available computing resources. This is done following two rather simple ideas:
These features are rather well supported by recent high-end FPGA architectures. A key feature is the large number of independent memory blocks, which can be configured to provide the huge memory bandwidth that is needed (on the order of several thousand bits per clock cycle).
Other specific features of our Monte Carlo algorithm help boost performance for our system:
I will describe the details of the JANUS platform and discuss how it can deliver outstanding performance for its (admittedly limited) class of target applications. I will then try to draw some lessons that are hopefully applicable to a more general context: