The choices we face in mapping out the development of PetaFLOPS computing technology can be brought into focus by considering two development paths that mark the two ends of a spectrum of opportunities. The first path continues the high-performance-computer-as-big-laboratory-instrument approach mentioned above. We may refer to the software technology required for this approach as BLISS (Big Lab Instrument Software System). BLISS is specifically oriented toward scientific computing on ``big iron'' machines. The second path leads to the development of a seamless ``metasystem,'' composed of multiple high-performance (but sub-PetaFLOPS) machines, designed for use by a wide community. This path leverages the commercial software market by providing the software infrastructure needed to facilitate (entice) the entry of third-party independent software vendors into the parallel processing market. Below, we expand on these two diametrically opposed visions, examine their respective impacts on technology, and discuss the risks and rewards associated with each.
BLISS represents the status quo in which the abstract machine presented to the programmer closely matches the underlying physical machine. The system will be response-time oriented and targeted to data-parallel scientific applications. It will be a single-user, batch-style environment, with resource management being performed primarily by hand. With the exception of data-parallel applications, writing software on BLISS will be at least as difficult as it is today on comparable systems. Current tools, or reasonable extensions to these tools (e.g., HPF), will provide the software infrastructure for data-parallel programming. As noted above, the range of applications for which these tools are effective is likely to become narrower and narrower as peak system performance increases, unless software and hardware architects collaborate to reverse this trend.
The construction of a seamless metasystem is a departure from current mainstream trends in high-performance computing. With the exception of regular, data-parallel problems, it is already too difficult to write applications for parallel machines; this problem will severely handicap attempts to broaden the marketplace. Much of the problem stems from the fact that the programmer of massively parallel systems today is fully exposed to the underlying machine and must constantly manage the details of where, when, and how computations are performed. The solution is to hide these details from the programmer, allowing her to concentrate on the application, not on low-level implementation details. This is accomplished by providing various forms of transparency: access, location, and temporal transparency, to name a few. We envision a software system with the following properties:
Continued development of current tools and approaches, but little else, is required to build BLISS. The software technology for a seamless metasystem, however, will depart radically from current HPC software in many ways. Work is needed in the areas of programming models, parallel languages, language interaction, compilers and run-time support, resource management, user interfaces and tools, and I/O and database management. A major effort is needed in operating systems, but the nature and scale of the challenges faced by the operating system for a metasystem will make it quite different from operating systems as we have known them for the last 20 years.
Risks versus rewards: BLISS is a well-understood problem, with well-understood pitfalls, and with comparatively small up-front software development costs. The downside is that software development for all but data-parallel applications will continue to be extremely painful. Software costs under BLISS are not low, though. Instead, they are distributed among the users, recurring, and hidden.
The risks associated with metasystem development are much greater. There is a significant up-front software investment that is not guaranteed to bear fruit. Also, for those applications that would run well under BLISS, execution on the metasystem is likely to entail a performance penalty, further widening the gap between peak (guaranteed never to exceed) performance and realized performance for those applications. On the plus side, the metasystem will be accessible and useful to a larger class of users and applications, not just traditional ``number crunching,'' and it has the potential to better leverage the commercial software market by facilitating a transition to parallel computing by commercial software vendors. Further, the metasystem presents an opportunity to realize a PetaFLOPS system sooner by interconnecting many sub-PetaFLOPS systems. In the final analysis, for parallel computers to become commercially viable, they must be made easier to use and thereby become mainstream. Without moving away from the laboratory-instrument model, that will not happen.
Since both BLISS and the metasystem are extremes on a spectrum, the most likely scenarios fall somewhere between them. However, we believe that the HPC usage models that are likely to be self-sustaining and viable over the long term fall closer to the metasystem end of the spectrum. Moreover, the explosive proliferation of high-performance wide-area networking technology virtually guarantees that any PetaFLOPS machine will be born into a computational world that is already organized into a metasystem of some sort. For a PetaFLOPS system to operate effectively in this world, its design will have to take into account many of the considerations outlined above. Therefore, although we do not altogether abandon the BLISS model in the discussions that follow, we will be particularly mindful of the technology directions required for the metasystem approach.