Figure 9.1: The Loosely Synchronous Problem Class
The significance of loosely synchronous problems and their natural parallelism was an important realization that emerged gradually (perhaps in 1987 as a clear concept) as we accumulated results from CP research. As described in Figure 9.1, fundamental theories often describe phenomena in terms of a set of similar entities obeying a single law. However, one does not usually describe practical problems in terms of their fundamental description in a theory such as QCD in Section 4.3. Rather, we use macroscopic concepts. Looking at society, a particle physicist might view it as a bunch of quarks and gluons; a nuclear physicist as a collection of protons and neutrons; a chemist as a collection of molecules; a biochemist as a set of proteins; a biologist as myriad cells; and a social scientist as a collection of people. Each description is appropriate to answer certain questions, and it is usually clear which description should be used. If we consider a simulation of society, or a part there of, only the QCD description is naturally synchronous. The other fields view society as a set of macroscopic constructs, which are no longer identical and typically have an irregular interconnect. This is caricatured in Figure 9.1 as an irregular network. The simulation is still data-parallel and, further, there is a critical macroscopic synchronization-in a time-stepped simulation at every time step , , . This is an algorithmic synchronization that ensures natural scaling parallelism, that is, that the efficiency of Equations 3.10 and 3.11 is given by
with the parameter of Equation 3.10 equal to zero. is given by Equation 3.10 in terms of the system dimension. The efficiency only depends on the problem grain size and not explicitly on the number of processing nodes. As emphasized in [Gustafson:88a], these problems scale so that if one doubles both the machine and problem size, the speedup will also double with constant efficiency. This situation is summarized in Figure 9.2.
Figure 9.2: Speedup as a Function of Number of Processors
Why is there no synchronization overhead in this problem class? Picturesquely, we can say that the processors ``know'' that they are synchronized at the end of each algorithmic time step. We use time in the generalized complex system language of Section 3.3 and so it would represent, for instance, iteration number in a matrix problem. Operationally, we can describe the loosely synchronous class on a MIMD machine by the communication-calculation sequence in Figure 9.3. The update (calculate) phase can involve very different algorithms and computations for the points stored in different processors. Thus, a MIMD architecture is needed in the general case. Synchronization is provided, as in Figure 9.3 by the internode communication phases at each time step. As described in Chapters 5 and 16, this does not need, but certainly can use, the full asynchronous message-passing capability of a MIMD machine.
Figure 9.3: Communication-Calculation Phases in a Loosely Synchronous Problem
We have split the loosely synchronous problems into two chapters, with those in Chapter 12 showing more irregularities and greater need for MIMD architectures than the applications described in this chapter. There has been no definitive study of which loosely synchronous problems can run well on SIMD machines. Some certainly can, but not all. We have discussed some of these issues in Section 6.1. If, as many expect, SIMD will remain a cost-effective architecture offered commercially, it will be important to better clarify the class of irregular problems that definitely need the full MIMD architecture.
As mentioned above, the applications in this chapter are ``modestly'' loosely synchronous. They include particle simulations (Sections 9.2 and 9.3), solutions of partial differential equation (Sections 9.3, 9.4, 9.5, 9.7), and circuit simulation (Sections 9.5 and 9.6). In Section 9.8, we describe an optimal assignment algorithm that can be used for multiple target Kalman filters and was developed for the large scale battle management simulation of Sections 18.3 and 18.4. Section 9.9 covers the parallelization of learning (``back-propagation'') neural nets with improved learning methods. An interesting CP application not covered in detail in this book was the calculation of an exchange energy in solid at temperatures below [Callahan:88a;88b]. This was our first major use of the nCUBE-1 in production mode and Callahan suffered all the difficulties of a pioneer with the, at the time, decidely unreliable hardware and software. He used 250 hours on our 512-node nCUBE-1, which was equivalent to 1000 hours of a non-vectorized CRAY X-MP implementation. In discussing SIMD versus MIMD, one usually concentrates on the synchronization aspects. However, Callahan's application illustrates another point; namely, commercial SIMD machines typically have many more processors than a comparable MIMD computer. For example, Thinking Machines introduced the 32-node MIMD CM-5 as roughly equivalent (in price) to an SIMD CM-2. The SIMD architecture has 256 times as many nodes. Of course, the SIMD nodes are much simpler, but this still implies that one needs a large enough problem to exploit this extra number of nodes. There are some coarse-grain SIMD machines-especially special-purpose QCD machines [Battista:92a], [Christ:86a], [Fox:93a], [Marinari:93a]-but it is more natural to build fine-grain machines. If the node is large, one might as well add MIMD capability! Full matrix algorithms, such as LU decomposition, (see Chapter 8), are often synchronous, but do not perform very well on SIMD machines due to insufficient parallelism [Fox:92j]. Many of the operations only involve single rows and columns and have severe load imbalance on fine-grain machines. Callahan's application did not exhibit ``massive'' parallelism, and so ``had'' to use a MIMD machine irrespective of his problem's temporal structure. He used 512 nodes on the nCUBE-1 by combining three forms of parallelism: Two came from the problem formulation with spatial and temporal parallelism, the other from running four different parameter values concurrently.
A polymer simulation [Ding:88a;88b] [Ding:88a] [Ding:88b] by Ding and Goddard, using the reptation method, exhibited a similar effect. There is a chain of N chemical units and the algorithm involves special treatment of the two units at the beginning and end of the linear polymer. The MIMD program ran successfully on the Mark III and FPS T-Series, but the problem is too ``small'' (parallelism of ) for this algorithm to run on a SIMD machine even though most of the basic operations can be run synchronously.
This issue of available parallelism also complicates the implementation of multigrid algorithms on SIMD machines [Frederickson:88a;88b;89a;89b].