We wish to distribute the elements among the processors of the machine to minimize both load imbalance (one processor having more elements than another) and communication between elements.
Our approach here is to write down a cost function which is minimized when the total running time of the code is minimized and is reasonably simple and independent of the details of the code. We then minimize this cost function and distribute the elements accordingly.
The load-balancing problem [Fox:88a;88mm], may be stated as a graph-coloring problem: Given an undirected graph of N nodes (finite elements), color these nodes with P colors (processors) to minimize a cost function H which is related to the time taken to execute the program for a given coloring. For DIME applications, it is the finite elements which are to be distributed among the processors, so the graph to be colored is actually the dual graph to the mesh, where each graph node corresponds to an element of the mesh and has (if it is not at a boundary) three neighbors.
We may construct the cost function as the sum of a part that minimizes load imbalance and one that minimizes communication:\
where is the part of the cost function which is minimized when each processor has equal work, is minimal when communication time is minimized, and is a parameter expressing the balance between the two, with related to the number c discussed above. If and were proportional to the times taken for calculation and communication, then should be inversely proportional to c. For programs with a great deal of calculation compared to communication, should be small, and vice versa.
As is increased, the number of processors in use will decrease until eventually the communication is so costly that the entire calculation must be done on a single processor.
Let e, f, label the nodes of the graph, and be the color (or processor assignment) of graph node e. Then the number of graph nodes of color q is:\
and is proportional to the maximum value of , because the whole calculation runs at the speed of the slowest processor, and the slowest processor is the one with the most graph nodes. This ignores node and link (node-to-node communication) contention, which contribute to idle time.
The formulation as a maximum of is, however, not satisfactory when a perturbation is added to the cost function, such as that from the communication cost function. If, for example, we were to add a linear forcing term proportional to , the cost function would be:\
and the minimum of this perturbed cost function is either if is less than , or , if is larger than this. This discontinuous behavior as a result of perturbations is undesirable, so we use a sum of squares instead, whose minima change smoothly with the magnitude of a perturbation:\
where is a scaling constant to be determined.
We now consider the communication part of the cost function. Let us define the matrix
which is the amount of communication between processors q and r, and the notation means that the graph nodes e and f are connected by an edge of the graph.
The cost of communication from processors q to r depends on the machine architecture; for some parallel machines it may be possible to write down this metric explicitly. For example, with the early hypercubes, the cost is the number of bits which are different in the binary representations of the processor numbers q and r. The metric may also depend on the message-passing software, or even on the activities of other users for a shared machine. A truly portable load balancer would have no option but to send sample messages around and measure the machine metric, then distribute the graph appropriately. In this book, however, we shall avoid the question of the machine metric by simply assuming that all pairs of processors are equally far apart, except of course a processor may communicate with itself at no cost.
The cost of sending the quantity of data also depends on the programming: the cost will be much less if it is possible for the messages to be bundled together and sent as one, rather than separately. The major problem is latency: The cost to send a message in any distributed system is the sum of an initial fixed price and one proportional to the size of the message. This is also the case for the pricing of telephone calls, freight shipping, mail service, and many other examples from the everyday world. If the message is large enough, we may ignore latency: For the nCUBE used in Section 11.1.7 of this book, latency may be ignored if the message is longer than a hundred bytes or so. In the tests of Section 11.1.7, most of the messages are indeed long enough to neglect latency, though there is certainly further work needed on load balancing in the presence of this important effect. We also ignore blocking (idling) due to needed resources being unavailable due to contention.
The result of this discussion is that we shall assume that the cost of communicating the quantity of data is proportional to , unless q=r, in which case the cost is zero. This is a good assumption on many new machines, such as the Intel Touchstone series.
We shall now make the assumption that the total communication cost is the sum of the individual communications between processors:\
where is a constant to be determined. Notice that any overlap between calculation and communication is ignored. Here, we have ignored ``global'' contributions to , such as collective communication (global sums or reductions) mentioned in Section 11.1.1.
Substituting the expression for , the expression for the load balance cost function simplifies to
The assumptions made to derive this cost function are significant. The most serious deviation from reality is neglecting the parallelism of communication, so that a minimum of this cost function may have grossly unbalanced communication loads. This turns out not to be the case, however, because when the mesh is equally balanced, there is a lower limit to the amount of boundary, analogous to a bubble having minimal surface area for fixed volume; if we then minimize the sum of surface areas for a set of bubbles of equal volumes, each surface must be minimized and equal.
We may now choose the scaling constants and . A convenient choice is such that the optimal and have contributions of about unit size from each processor; the form of the scaling constant is because the surface area of a compact shape in d dimensions varies as the d-1 power of the size, while volume varies as the d power. The final form for H is
where d is the dimensionality of the mesh from which the graph came.
The formalism of this section has a simple physical interpretation
[Fox:86a;88kk;88mm;88tt;88uu], which we introduce here and discuss further in Section 11.2. The data points (tasks) to be distributed can be thought of as particles moving around in the discrete space formed by the processors. This physical system is controlled by the Hamiltonian (energy function) given in Equation 11.9. The two terms in the Hamiltonian have simple physical meanings illustrated in Figure 11.3. The first term in Equation 11.9 ensures equal work per node and is a short-range repulsive force trying to push particles away if they land in the same node. The second term in Equation 11.9 is a long-range attractive force which links ``particles'' (data points) which communicate with each other. This force tries to pull particles together (into the same node) with a strength proportional to the information needed to be communicated between them. In general, this communication force depends on the architecture of the interconnect of the parallel machine, although Equation 11.9 has assumed a simple form for this. The analogy is preserved in general with the MPP interconnect architecture translating into a topology for the discrete space formed by the processors in the analogy. This topology implies a distance dependence force for the communication term in H. We can also extend the discussion to include the cost of moving data between processors to rebalance a dynamically changing problem. This migration cost becomes a third force attracting each particle to the processor in which it currently resides. Figure 11.3 illustrates these three forces.
Figure: Sixteen Data Points Distributed Optimally on Four Processors, Illustrating the Physical Analogy of Section 11.3. We take a simple two-dimensional mesh connection for the particles.
Note that the load-balancing problem becomes that of finding the equilibrium state of a system of particles with a ``conflict'' between short-range repulsive (hardcore) and long-range attractive forces. This scenario is qualitatively similar to classical atomic physics problems and leads one to expect that the physically based optimization methods could be effective. This physical analogy is extended in Section 11.2 where we show that the physical system exhibits effects that can be associated with temperature and phase transitions. We also indicate how it needs to be extended for problems with microscopic structure in their temporal properties.