To achieve a viable component count for a PetaFLOPS machine, the building block processor chip must provide about 100 GigaFLOPS throughput. Using circuit architectures comparable to those of today's devices, only two to four GigaFLOPS per node can be projected from the expected growth in the speed of fabrication technology. Consequently, viable chip designs should contain 32 or more nodes.
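The arithmetic behind the node count can be sketched as follows. This is a minimal illustration, not part of the original analysis: the function names are hypothetical, and rounding up to a power of two is an assumption made here to show how the low end of the range lands on 32 nodes.

```python
import math

CHIP_TARGET_GFLOPS = 100.0  # per-chip throughput target stated in the text


def nodes_per_chip(chip_gflops, node_gflops):
    """Smallest whole number of nodes that reaches the chip throughput target."""
    return math.ceil(chip_gflops / node_gflops)


def next_power_of_two(n):
    """Round n up to a power of two (an assumed layout convenience, not from the text)."""
    return 1 << (n - 1).bit_length()


# Per-node throughput range projected in the text: two to four GigaFLOPS.
for node_gflops in (2.0, 4.0):
    minimum = nodes_per_chip(CHIP_TARGET_GFLOPS, node_gflops)
    print(f"{node_gflops} GFLOPS/node: at least {minimum} nodes, "
          f"{next_power_of_two(minimum)} if rounded up to a power of two")
```

At four GigaFLOPS per node the minimum is 25 nodes, which rounds up to 32; at two GigaFLOPS per node the minimum is 50 nodes.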
To sustain continuous computation each node must have high-speed access to memory to obtain instructions and data. A design decision must be made as to how much memory should be placed on the processor chip with each node, and how much off-chip memory needs to be available for high-speed access by the multinode chip. The traditional approach has been to use off-chip memory as the primary memory resource, and limit on-chip memory to registers and small amounts of critical cache.
For a given target chip area (based on fabrication economics), the addition of node memory reduces the number of nodes per chip and increases the power per node. On-chip memory requires much less power than the equivalent amount of off-chip memory, so the net system power is probably reduced as on-chip memory increases. If only small amounts of memory are included on-chip, it is likely that at least one memory chip will be required for each processor chip.
The problem of processing-node-to-memory communication is a central machine design issue. With multi-gigahertz clock rates, the interface between the processor and off-chip memory is a difficult problem. Ideally, every on-chip node should have a wide bandwidth connection directly to the off-chip memory resources. Even with flip-chip area array devices, the physical limitations of pin count, electrical connection paths, and power dissipation make this arrangement very difficult. The off-chip memory is forced to be a shared resource, and it will probably be a limiting factor for throughput.
Use of on-chip memory eases the processing-node-to-memory interface problems. Pin count limitations are eased, and wide communication buses are practical. Shorter processor-to-memory connection lengths result in reduced line loading, smaller drivers, and less power dissipation. Because the memory and processor are designed at the same time, the detailed architecture, timing, and control of the memory can be tailored to meet the needs of the processor. In general, higher performance is achievable. To illustrate some of the general features of a potential 0.05 micrometer technology, CMOS-based PetaFLOPS machine, a few parameters are tabulated in Table 5.3.
Machine processor chip count, power, and on-chip memory are listed. Each row corresponds to one of the chip classes presented in the previous processor chip projection table. For example, the last row in the table is based on a 0.05 micrometer technology, 16 megabyte per node, 32-node chip. For these architectural variations, the processor chip count varies from about 1,000 to 8,000. The power varies from about 100 kilowatts to over 600 kilowatts. When the need for external memory chips is included, the power could easily double. Use of devices with less on-chip memory, such as the one megabyte per node option, will reduce the processor chip count (i.e., the same node count but more nodes per chip), and will increase the requirement for off-chip memory.
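The relation between nodes per chip and processor chip count can be sketched as below. This is an illustrative calculation under stated assumptions, not a reproduction of Table 5.3: the four GigaFLOPS per node figure is the upper end of the range projected earlier, the 256-node chip is a hypothetical low-memory configuration chosen only to show the trend, and the function name is invented for this example.

```python
import math

PETAFLOPS = 1.0e15  # target machine throughput in FLOPS


def processor_chip_count(nodes_per_chip, node_gflops, target_flops=PETAFLOPS):
    """Number of processor chips needed to reach the target throughput."""
    chip_flops = nodes_per_chip * node_gflops * 1.0e9
    return math.ceil(target_flops / chip_flops)


# 32-node chip from the text, assuming 4 GFLOPS per node.
print(processor_chip_count(32, 4.0))

# Hypothetical 256-node chip (less memory per node, more nodes per chip).
print(processor_chip_count(256, 4.0))

# "Could easily double": machine power range in kilowatts once external
# memory chips are included, taking a factor of two as the illustration.
processor_power_kw = (100.0, 600.0)
print(tuple(2.0 * p for p in processor_power_kw))
```

Under these assumptions the 32-node chip gives a count close to the 8,000 upper figure, and packing more nodes onto each chip moves the count toward the 1,000 lower figure, which matches the direction of the trade-off described above.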
In light of what has been achieved in manufacturing supercomputers in the past, none of these parameters poses an unmanageable problem. In fact, the assumption of no throughput improvement derived from architectural innovations makes these predictions overly conservative. If devices of this density and level of performance can be produced, the implementation of a PetaFLOPS processor becomes largely an engineering problem. The physical structure and thermal management issues are similar to those associated with past supercomputer designs. The electrical interconnect and clock distribution at gigahertz rates will be difficult, but solutions can be achieved. Problems requiring some research and development will be the distribution of a 0.9 volt supply and ground, and the management of switching noise. For the 0.05 micrometer generation it appears that PetaFLOPS machine construction will be viable with considerable margin. For the 0.1 and 0.07 micrometer generations, architectural features probably can be introduced to provide enhanced node throughput. If that proves to be true, PetaFLOPS machines can be introduced many years earlier.