
Obstacles and Uncertainties

This exercise projected the SIA technology road map into realms where no device technology is known to operate. The projection suggests avenues of study for devices with feature sizes below 0.1 micron and raises issues regarding tunneling and other phenomena that become dominant in this regime.

Two fundamental assumptions in the sizing of Category I and II designs are that memory requirements grow in proportion to performance and that memory bandwidth does likewise. These assumptions created a need for large memories in Category I and II machines, and the cost of the memory dominated the cost of the machine. This memory cost leaves very little room for the designer to lower cost through architectural advances. When these assumptions are abandoned, however, the Category III machine becomes a viable candidate, and it clearly has a cost advantage over the other types of machines.

It is essential to investigate memory requirements of large-scale applications to determine if memory capacity and bandwidth both scale linearly with performance. Markedly less expensive Category I and II designs may be possible if, for a large number of applications, the requirements scale less than linearly, resulting in smaller memories.
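The effect of sublinear scaling on memory size can be sketched numerically. The following is a minimal illustration, not a figure from the panel's analysis: the calibration point (`base_flops`, `base_words`) and the exponent 0.75 are assumptions chosen for illustration. An exponent of 3/4 would arise, for example, in a time-stepped simulation on a three-dimensional grid where memory grows as N^3 while work grows as N^4.

```python
def memory_words(performance_flops, base_flops=1e9, base_words=1e9, exponent=1.0):
    """Memory capacity (in words) if memory scales as performance**exponent.

    base_flops and base_words are hypothetical calibration values,
    not numbers taken from the report.
    """
    return base_words * (performance_flops / base_flops) ** exponent


PETAFLOPS = 1e15

# Assumption under test in the text: memory grows linearly with performance.
linear = memory_words(PETAFLOPS, exponent=1.0)

# Illustrative alternative: memory grows as the 3/4 power of performance.
sublinear = memory_words(PETAFLOPS, exponent=0.75)

print(f"linear scaling:    {linear:.3e} words")
print(f"3/4-power scaling: {sublinear:.3e} words")
print(f"memory reduction:  {linear / sublinear:.1f}x")
```

Under these assumed values, even a modest drop in the scaling exponent shrinks the memory by more than an order of magnitude, which is why measuring the exponent for real applications matters.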

Market forces will also have a dramatic impact on the ability to build a PetaFLOPS machine. The SIA technology curves are based on what is likely to be possible to build as technology improves, but they do not predict what actually will be built. To keep costs per device at a minimum, the PetaFLOPS machine should, to the extent possible, use the same memory and processor chips used in the mainstream. Because memory parts dominate Category I and II designs, the memory parts for these designs should be the same as those used in lower performance machines. Market forces, though, may slow the progress of commercial offerings so that they do not advance as rapidly as the SIA curves indicate might be possible. Slower mainstream progress may delay the viability of the PetaFLOPS machine, increase its cost, or both.

Memory bandwidth offers a particularly interesting challenge. The Category III design shows it is possible to tap existing bandwidth through an appropriate architectural design, but this technique is not unique. Memory manufacturers are starting to introduce memory schemes that raise the bandwidth per pin. Such schemes include synchronous memory devices and memory devices with built-in caches and synchronous block transfer modes.

Technology for the PetaFLOPS machine should draw upon mainstream technology as much as possible. For special niches, however, it may be necessary to develop techniques that have little use beyond the PetaFLOPS machine. As an example of such a niche, consider the Category II and Category III designs, where the number of processors grows very large compared to today's highly parallel machines. Applications and systems software face a great challenge in tapping the power of high-speed machines by using 10,000-way parallelism effectively. It is not clear that today's programming paradigms will support efficient use of such machines. Vector and array codes may be difficult to partition into 10,000 or more concurrent pieces. Writing multiprocessor programs with 10,000-way parallelism for less structured problems is an art that has rarely, if ever, been practiced. Consequently, the panel recommends that software technologists study parallelism techniques that will scale up to thousands and tens of thousands of processors.
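One way to see why 10,000-way parallelism is demanding is Amdahl's law, a standard model that the panel does not invoke here but that illustrates the point. The sketch below assumes a program with a fixed serial fraction and perfect parallelism otherwise; the serial fractions listed are hypothetical.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def efficiency(serial_fraction, processors):
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return amdahl_speedup(serial_fraction, processors) / processors


P = 10_000  # processor count discussed in the text

# Hypothetical serial fractions, chosen only to show the trend.
for s in (1e-2, 1e-3, 1e-4, 1e-5):
    print(
        f"serial fraction {s:.0e}: "
        f"speedup {amdahl_speedup(s, P):8.1f}, "
        f"efficiency {efficiency(s, P):6.1%}"
    )
```

In this simple model, a serial fraction of one part in 10,000 already costs about half of the machine, which suggests how little unpartitioned work such codes can tolerate.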

The PetaFLOPS machine appears destined to have a large diameter, as defined earlier in this section. Hence, it has the potential to have higher latency, and to suffer relatively more from that latency, than machines with a smaller diameter (and lower average latency). For this reason, we need more effective techniques for latency hiding than those known today.
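The amount of latency hiding required can be estimated with a simple occupancy argument: to keep a processor busy, enough independent operations must be in flight to cover the latency. The sketch below assumes a processor that can issue one remote reference every `issue_interval_ns` nanoseconds; the function name and all numeric values are hypothetical and do not come from the panel's designs.

```python
import math


def required_concurrency(latency_ns, issue_interval_ns):
    """Independent references that must be in flight to cover the latency.

    If a processor can issue one reference every issue_interval_ns and each
    takes latency_ns to complete, it needs this many outstanding references
    (or ready threads) to avoid stalling.
    """
    return math.ceil(latency_ns / issue_interval_ns)


# Hypothetical latencies for machines of increasing diameter.
for latency_ns in (100, 1_000, 10_000):
    n = required_concurrency(latency_ns, issue_interval_ns=2)
    print(f"latency {latency_ns:6d} ns -> {n:5d} references in flight")
```

The required concurrency grows in direct proportion to latency, so a machine with a larger diameter needs correspondingly more independent work per processor on top of the parallelism across processors.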

The very high parts count of the PetaFLOPS machine may demand a special packaging and cooling technology that is not necessary for machines with substantially fewer parts.

Input/output requirements have not been addressed in this section because of a general belief that I/O requirements do not scale linearly with computational performance. For many large-scale algorithms, the number of operations grows much faster than linearly in the size of the input data. Consequently, the time it takes to calculate results must necessarily be much larger than the time it takes to load the initial data and write the results.

Nevertheless, peak I/O rates may be quite large. Also, some classes of problems have a computational requirement that scales linearly with the size of the problem. These classes clearly will place very high I/O demands on a PetaFLOPS machine if they themselves scale large enough to demand PetaFLOPS performance.
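The two cases above can be compared with a rough model in which the operation count grows as a power of the input size. This is a sketch under stated assumptions: the I/O rate, the constant factor, and the problem sizes are hypothetical, and the exponent 1.5 is chosen because it matches dense matrix multiplication (n^2 data words, on the order of n^3 operations).

```python
def compute_to_io_ratio(n_words, ops_exponent, flops=1e15,
                        io_words_per_s=1e9, ops_constant=1.0):
    """Ratio of compute time to I/O time for a problem of n_words input words.

    Operation count is modeled as ops_constant * n_words**ops_exponent.
    flops is the PetaFLOPS target; io_words_per_s and ops_constant are
    hypothetical values used only for illustration.
    """
    ops = ops_constant * n_words ** ops_exponent
    compute_time = ops / flops
    io_time = n_words / io_words_per_s
    return compute_time / io_time


for n in (1e10, 1e12, 1e14):
    superlinear = compute_to_io_ratio(n, ops_exponent=1.5)
    linear = compute_to_io_ratio(n, ops_exponent=1.0)
    print(f"n = {n:.0e}: ratio {superlinear:10.3g} (exponent 1.5), "
          f"{linear:10.3g} (exponent 1.0)")
```

With a superlinear exponent, the ratio grows with problem size, so compute time eventually dwarfs I/O time. With a linear exponent, the ratio stays constant no matter how large the problem, so the I/O system must scale along with the processor to avoid becoming the bottleneck.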

In reexamining the three categories of machines independently, the panel made the following observations:





gcf@npac.syr.edu