The Architecture/Systems Working Group conducted a design exercise to create a rough sketch of a PetaFLOPS machine using technology assumed to be available in 2014. The exercise revealed that a PetaFLOPS machine can be built with the projected technology of that period at a cost comparable to that of a supercomputer in 1994. The exercise also showed that a number of assumptions drive designs in particular directions. These assumptions should be examined for validity. If they do not hold for performance in the PetaFLOPS region, then PetaFLOPS machines may be quite different from the panel's projections.
This section starts with an outline of three designs for supercomputers 20 years from now. The designs differ because they are each suited for a different class of problem. After discussion of the designs, the section closes with a discussion of the crucial assumptions. The most critical assumptions are the scaling laws that say that memory bandwidth per GigaFLOPS and total memory size per GigaFLOPS are constant as performance in GigaFLOPS increases. Another important assumption is that device technology will follow historic trends for the next 20 years.
Figures 6.1, 6.2, and 6.3 depict three architectures that are attractive for PetaFLOPS implementations in 20 years. For the purposes of the workshop, the panel assumed that all three machines would be built and that each would have a performance rating of a fraction of a PetaFLOPS. Collectively, the three machines achieve a PetaFLOPS over a large class of applications, but individual applications may do best on specific machines. This analysis assumes that each machine is rated at 400 TeraFLOPS. Applications are likely to be well suited to one or two of the machines, if not to all three of them. Because the cost of building a 400 TeraFLOPS machine may be less than a third of the cost of building a PetaFLOPS machine, the panel expects that all three of the machines can be built for the cost of building a single PetaFLOPS machine, and that the three machines can achieve an aggregate of a PetaFLOPS over a much broader range of applications than can a single PetaFLOPS machine.
The machine architectures are
Some of the characteristics of the respective processor designs are summarized in Table 6.1. The designs are suited for different types of applications. The Category I machine is ideally suited for applications that create streams of accesses at easily predicted addresses. Typical of these applications are vector and matrix computations that access data with a uniform stride. This type of machine represents the evolutionary development of the CRAY-1 architecture. Because this architecture provides high connectivity and uniform latency, address references can range throughout shared memory, provided that streams of references from different machines do not conflict.
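The notion of a reference stream with easily predicted addresses can be illustrated with a minimal sketch. The function names, the row-major layout, and the 8-byte element size are assumptions made for illustration and do not come from the panel's designs.

```python
def strided_addresses(base, stride, count):
    """Return the byte addresses of a uniform-stride reference stream."""
    return [base + i * stride for i in range(count)]


def column_addresses(base, n_cols, col, n_rows, elem_size=8):
    """Addresses of one column of a row-major matrix.

    The layout and elem_size are hypothetical; walking a column of a
    row-major matrix gives a constant stride of n_cols * elem_size bytes.
    """
    first = base + col * elem_size
    return strided_addresses(first, n_cols * elem_size, n_rows)


# Unit-stride vector access: consecutive 8-byte elements.
unit = strided_addresses(base=0, stride=8, count=4)

# Column 2 of a matrix with 1000 columns: stride of 8000 bytes.
column = column_addresses(base=0, n_cols=1000, col=2, n_rows=4)
```

In both cases every address follows from the base, the stride, and the element index, which is what allows memory requests to be issued well ahead of their use.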
Category II machines are multiprocessors that are useful for unstructured computations that exhibit a high degree of locality. Such machines might be useful for query systems and database applications where searches and record accesses tend to be clustered in time or to particular regions of the database. Address references in this architecture can be made anywhere in the hierarchy, but references to local memory are much less costly than references made to remote regions of memory. Consequently, performance depends strongly on how successfully the architecture and software support together direct the majority of references to local data.
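The dependence on locality can be sketched as a weighted average of local and remote access costs. The cost values below are hypothetical, chosen only to show the shape of the effect; they are not figures from the panel's designs.

```python
def mean_access_cost(local_fraction, local_cost, remote_cost):
    """Average cost per reference when a given fraction of references is local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must lie in [0, 1]")
    return local_fraction * local_cost + (1.0 - local_fraction) * remote_cost


# Hypothetical costs in arbitrary units (assumptions, not from the report).
LOCAL_COST = 1.0
REMOTE_COST = 100.0

costs = {
    f: mean_access_cost(f, LOCAL_COST, REMOTE_COST)
    for f in (0.5, 0.9, 0.99)
}
```

With these made-up costs, the average cost per reference falls from about 50.5 units at 50 percent local references to about 10.9 at 90 percent and about 2 at 99 percent, so even a small share of remote references dominates the average.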
Category III machines take advantage of the high bandwidth of memory arrays internal to a memory chip, and provide the bandwidth required to run a 400 TeraFLOPS machine with substantially fewer parts than do either Category I or Category II machines. The performance potential can be achieved provided that the applications make use of the local memory bandwidth and have relatively little need to access data off-chip. For applications that can meet these constraints, Category III machines offer extremely low cost per GigaFLOPS.
Table 6.1 makes clear several important observations:
Category III machines can be used for applications that run effectively with a memory capacity much smaller than that postulated by the rule that says that memory capacity should be approximately one gigabyte per GigaFLOPS of performance. Their success will depend strongly on how widely this rule holds. If many applications need much less memory than prescribed by this rule, then the Category III machine will be an effective machine design. However, the range of applications that can make effective use of Category III machines has not been explored at this writing.
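The memory-capacity rule can be made concrete with a short calculation for a 400 TeraFLOPS machine. This is a minimal sketch: the function names are illustrative, decimal units are assumed (1 TeraFLOPS = 1000 GigaFLOPS), and the 40,000-gigabyte application footprint is a hypothetical value, not a figure from the report.

```python
GB_PER_GFLOPS = 1.0  # rule of thumb quoted in the text


def memory_by_rule_gb(perf_gflops, gb_per_gflops=GB_PER_GFLOPS):
    """Memory capacity, in gigabytes, prescribed by the scaling rule."""
    return perf_gflops * gb_per_gflops


def fraction_of_rule(needed_memory_gb, perf_gflops):
    """Ratio of an application's memory need to the rule's prescription."""
    return needed_memory_gb / memory_by_rule_gb(perf_gflops)


MACHINE_GFLOPS = 400 * 1000  # 400 TeraFLOPS expressed in GigaFLOPS

rule_memory_gb = memory_by_rule_gb(MACHINE_GFLOPS)

# Hypothetical application that needs only 40,000 GB (an assumption).
ratio = fraction_of_rule(40_000, MACHINE_GFLOPS)
```

Under the rule, a 400 TeraFLOPS machine calls for about 400,000 gigabytes of memory; the hypothetical application above would need only a tenth of that, which is the kind of workload the paragraph describes as a good fit.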
By similar reasoning, if memory feeds processors at the rate of [...] operands per second, and the number of concurrent streams is 400, then the latency per stream cannot be more than [...] seconds per memory stream. If the latency is [...] picoseconds per access, then [...] memory streams must run concurrently.
Table 6.2 gives the latency required for a single memory reference stream per processor. If actual latency exceeds the figure in the table by a factor of k, then k accesses must be in progress concurrently.
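The relation between operand rate, latency, and concurrency used in this argument is an instance of Little's law: the number of accesses in flight equals the delivery rate multiplied by the latency. The sketch below uses placeholder values (an operand rate of 10^12 per second and a 2000-picosecond latency) chosen purely for illustration; they are assumptions, not the report's figures, and the function names are likewise hypothetical.

```python
def max_latency_per_stream_s(operand_rate_per_s, n_streams):
    """Largest latency each stream can tolerate if every stream has
    exactly one access in flight and the aggregate rate must be met."""
    return n_streams / operand_rate_per_s


def required_concurrency(operand_rate_per_s, latency_s):
    """Little's law: accesses in flight = delivery rate * latency."""
    return operand_rate_per_s * latency_s


def accesses_in_flight_per_stream(actual_latency_ps, budget_latency_ps):
    """Accesses each stream must overlap when the actual latency exceeds
    the single-access budget (integer picoseconds, rounded up)."""
    return -(-actual_latency_ps // budget_latency_ps)


# Placeholder inputs (assumptions for illustration only).
RATE = 1e12      # operands per second
STREAMS = 400    # concurrent memory streams

budget_s = max_latency_per_stream_s(RATE, STREAMS)   # 4e-10 s, i.e. 400 ps
in_flight = required_concurrency(RATE, 2000e-12)     # about 2000 accesses
per_stream = accesses_in_flight_per_stream(2000, 400)
```

With these placeholder numbers, a latency five times the 400-picosecond budget requires five accesses in progress on each of the 400 streams, or about 2000 accesses in flight across the machine.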