
PetaFLOPS Architectures Design Points

The Architecture/Systems Working Group conducted a design exercise to create a rough sketch of a PetaFLOPS machine using technology assumed to be available in 2014. The exercise revealed that a PetaFLOPS machine can be built with the projected technology of that period at a cost comparable to that of a supercomputer in 1994. The exercise also showed that a number of assumptions drive designs in particular directions. These assumptions should be examined for validity. If they do not hold for performance in the PetaFLOPS region, then PetaFLOPS machines may be quite different from the panel's projections.

This section starts with an outline of three designs for supercomputers 20 years from now. The designs differ because they are each suited for a different class of problem. After discussion of the designs, the section closes with a discussion of the crucial assumptions. The most critical assumptions are the scaling laws that say that memory bandwidth per GigaFLOPS and total memory size per GigaFLOPS are constant as performance in GigaFLOPS increases. Another important assumption is that device technology will follow historic trends for the next 20 years.

Figures 6.1, 6.2, and 6.3 depict three architectures that are attractive for PetaFLOPS implementations in 20 years. For the purposes of the workshop, the panel assumed that all three machines would be built and that each would have a performance rating of a fraction of a PetaFLOPS. Collectively, the three machines achieve a PetaFLOPS over a large class of applications, but individual applications may do best on specific machines. This analysis assumes that each machine is rated at 400 TeraFLOPS, so the three together are rated at 1.2 PetaFLOPS. Applications are likely to be well suited to one or two of the machines, if not to all three. Because the cost of building a 400 TeraFLOPS machine may be less than a third of the cost of building a PetaFLOPS machine, the panel expects that all three machines can be built for the cost of a single PetaFLOPS machine, and that together they can achieve an aggregate PetaFLOPS of performance over a much broader range of applications than a single PetaFLOPS machine can.

The three machine architectures are referred to below as Category I, Category II, and Category III.

Some of the characteristics of the respective processor designs are summarized in Table 6.1. They are suited for different types of applications. The Category I machine is ideally suited for applications that create streams of accesses at easily predicted addresses. Typical of these applications are vector and matrix computations that access data with a uniform stride. This type of machine represents the evolutionary development of a CRAY 1 architecture. Because this architecture provides high connectivity and uniform latency, address references can range throughout shared memory, provided that streams of references from different machines are not conflicting.

Category II machines are multiprocessors that are useful for unstructured computations that exhibit a high degree of locality. Such machines might be useful for query systems and database applications where searches and record accesses tend to be clustered in time or in particular regions of the database. Address references in this architecture can be made anywhere in the hierarchy, but references to local memory are much less costly than references to remote regions of memory. Consequently, performance depends strongly on how successfully the architecture and its software support together direct the majority of references to local data.

Category III machines take advantage of the high bandwidth of memory arrays internal to a memory chip, and provide bandwidth required to run a 400 TeraFLOPS machine with substantially fewer parts than do either Category I or Category II machines. The performance potential can be achieved provided the applications make use of the local memory bandwidth, and have relatively little need to access data off-chip. For applications that can meet these constraints, Category III machines offer extremely low cost per GigaFLOPS.

Table 6.1 makes clear several important observations:

  1. The major cost of Category I and II machines is for the memory. Category III machines have such a low parts count because they achieve the bandwidth needed to sustain an aggregate rate of 400 TeraFLOPS with a fraction of the number of memory parts, by using the internal bandwidth of memory rather than the external bandwidth.

    Such machines can be used for applications that run effectively with a memory capacity much smaller than that postulated by the rule that memory capacity should be approximately one gigabyte per GigaFLOPS of performance. The success of Category III machines will depend strongly on how widely this rule holds. If many applications need much less memory than the rule prescribes, then the Category III machine will be an effective design. However, the range of applications that can make effective use of Category III machines has not been explored at this writing.

  2. For both Category I and II machines, the cost of providing network bandwidth appears to be small compared to the cost of providing 400 terabytes of memory. Consequently, the designs can afford to put various enhancements into interconnections to improve performance because the speed leverage they provide may be greater than proportional to their cost.

  3. The table shows that Category I machines have a concurrency of 400, and produce $4 \times 10^{14}$ outputs per second. Therefore, it follows from the Concurrency Law that the latency must be no greater than $10^{-12}$ seconds (1 picosecond) per operation. If this latency is not achieved, each processor must have internal concurrency to offset the extra latency. Thus, if the output rate of an arithmetic unit is 1 output per nanosecond, each processor must have 1000 arithmetic units running at full speed in order for the system of 400 processors to output at a 400 TeraFLOPS rate.

    By similar reasoning, if memory feeds the processors at an aggregate rate of $R$ operands per second and the number of concurrent streams is 400, then the latency cannot exceed $400/R$ seconds per memory stream. If the actual latency is $k$ times that bound, then $400k$ memory streams must run concurrently.

    Table 6.2 gives the latency required for a single memory reference stream per processor. If actual latency exceeds the figure in the table by a factor $k$, then $k$ accesses must be in progress concurrently.
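The arithmetic behind these observations can be checked with a short sketch. It assumes the Concurrency Law takes the form concurrency = throughput x latency, which is the form consistent with the figures quoted above (400 processors, 400 TeraFLOPS, 1 picosecond). The function and constant names are hypothetical, chosen for illustration only; the numeric inputs are the ones stated in the text.

```python
# Illustrative sketch only: names are hypothetical, figures come from the text.

def required_latency(concurrency: float, throughput: float) -> float:
    """Concurrency Law (assumed form): concurrency = throughput * latency.

    Returns the largest latency, in seconds, that still sustains
    `throughput` results per second with `concurrency` operations in flight.
    """
    return concurrency / throughput


def units_per_processor(system_rate: float, processors: int, unit_rate: float) -> float:
    """Arithmetic units each processor needs when one unit delivers
    `unit_rate` outputs per second."""
    return system_rate / (processors * unit_rate)


def accesses_in_flight(latency_factor: float, streams: int = 1) -> float:
    """If actual latency is `latency_factor` times the single-stream bound,
    this many accesses must overlap to keep the same throughput."""
    return latency_factor * streams


def memory_capacity_bytes(flops: float, bytes_per_flops: float = 1.0) -> float:
    """Scaling rule from the text: about one gigabyte per GigaFLOPS,
    i.e. one byte of memory per FLOPS of performance."""
    return flops * bytes_per_flops


SYSTEM_RATE = 400e12   # 400 TeraFLOPS, outputs per second
PROCESSORS = 400       # Category I concurrency quoted from Table 6.1

latency_s = required_latency(PROCESSORS, SYSTEM_RATE)
units = units_per_processor(SYSTEM_RATE, PROCESSORS, unit_rate=1e9)
memory_tb = memory_capacity_bytes(SYSTEM_RATE) / 1e12

if __name__ == "__main__":
    print(f"latency bound per operation: {latency_s:.1e} s")
    print(f"arithmetic units per processor at 1 ns/output: {units:.0f}")
    print(f"memory under the 1 GB per GigaFLOPS rule: {memory_tb:.0f} TB")
```

Under these assumptions the sketch reproduces the 1 picosecond latency bound, the 1000 arithmetic units per processor, and the 400 terabytes of memory cited in the observations. Calling `accesses_in_flight(k)` with a measured latency factor gives the overlap a single stream would need.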





gcf@npac.syr.edu