\input/home/sashimi2/kea/fonts.tex \overfullrule=0pt \hsize=6.0in \hoffset=0.25in \vsize=8.5in \voffset=0.25in \elevenpoint \footline{\ifnum\pageno=1 {\hfil} \else {\hss\tenrm\folio\hss}\fi} \centerline{\bf Concurrency: Practice and Experience} \vskip 10pt \centerline{\bf Published Abstracts for 1995} \vskip 0.5in \centerline{\bf Experiences with Networked Parallel Computing} \medskip \centerline{\it P. Hoogerbrugge and R. Mirchandaney} \medskip \noindent The performance and proliferation of workstations continues to increase at a rapid rate. However, the practical utilization of workstation networks for parallel computing is still in its infancy. This is due to the relative immaturity of programming tools, low bandwidth networks such as Ethernet, and high message latencies. However, programming tools are becoming more mature and network bandwidths are increasing rapidly. Hence, networks of commodity workstations may prove to be practical for certain classes of parallel applications. This paper describes our experiences with two applications parallelized on a network of Sun workstations. The first application is from Shell's petroleum engineering department. This program quantitatively derives rock and porefill composition from well-log data, using a compute-intensive iterative optimization procedure. The second application is time filtering, which is a fundamental operation performed on seismic traces. Through our experiments we identify the limits of networked parallel computing based on the current state of network technology. We also provide a discussion on the possible impact of future high speed networks on networked parallel computing. \vskip 8pt \noindent{\bf Short Code:\ \ }[Hoogerbrugge:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 1, pp. 1--16 (C174) \vskip 0.25in \centerline{\bf Dynamic Load-balancing for PDE Solvers on Adaptive} \centerline{\bf Unstructured Meshes} \medskip \centerline{\it Chris Walshaw and Martin Berzins} \medskip \noindent Modern PDE solvers written for time-dependent problems increasingly employ adaptive unstructured meshes (Flaherty et al., 1989) in order to both increase efficiency and control the numerical error. If a distributed memory parallel computer is to be used, there arises the significant problem of dividing the domain equally amongst the processors whilst minimizing the inter-subdomain dependencies. A number of graph-based algorithms have recently been proposed for steady-state calculations. The paper considers an extension to such methods which renders them more suitable for time-dependent problems in which the mesh may be changed frequently. \vskip 8pt \noindent{\bf Short Code:\ \ }[Walshaw:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 1, pp. 17--28 (C228) \vskip 0.25in \centerline{\bf Pipeline Optimizations of the Prime Factor Algorithm} \medskip \centerline{\it R. Albrizio, A. Mazzone, N. Veneziani, G. Aloisio, M. Bochicchio, and P. Messina} \medskip \noindent The prime factor algorithm (PFA) is an efficient discrete Fourier transform (DFT) computation algorithm used when the sequence length can be decomposed into mutually prime factors. Following our previous results on PFA decomposition carried out at Caltech on hypercube machines, we present in this paper a pipeline PFA implementation suitable for multiprocessor systems with distributed memory. This implementation achieves high values of efficiency and speedup when processing multiple sequences of data. The paper shows how an optimized structure can be obtained when the concurrency among computation and communications is exploited at each node of the pipe. Experimental results obtained on Transputer-based structures and on the Intel {\it Touchstone\/} Delta system are also reported. \vskip 8pt \noindent{\bf Short Code:\ \ }[Albrizio:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 1, pp. 29--41 (C134) \vskip 0.25in \centerline{\bf Fault Tolerance in the Execution of Remote Jobs on Idling Workstations} \medskip \centerline{\it Cui-Qing Yang, and Yaoshuang Qu} \medskip \noindent Many workstation-based distributed systems allow programs to be executed on remote idling machines for effective utilization of system resources. Usually, the control policies in these systems force a remote job be discontinued by the arrival of local jobs to guarantee the autonomy of individual workstations. Therefore, one special concern in the design of such systems is the fault-tolerant aspects for the execution of remote jobs. In the paper we discuss two control policies of workstation-based distributed systems, checkpointing and non-checkpointing policy, which support fault-tolerant execution of remote jobs on idling workstations. An analytical analysis on the reliability and mean turnaround time of the execution of remote jobs are conducted for both control policies. The optimal time interval between checkpoints in the checkpointing policy is formulated based on the given reliability and overhead of the system. In addition, several sample results derived from these analyses are compared with the outcome of corresponding simulation programs. Some observations of fault-tolerant features of each control policy are thereupon presented as guidelines for the future development of such workstation-based distributed systems. \vskip 8pt \noindent{\bf Short Code:\ \ }[Yang:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 1, pp. 43--60 (C189) \vfill\eject \centerline{\bf Evaluation of Parallelization Strategies for an Incremental} \centerline{\bf Delaunay Triangulator in $E^3$} \medskip \centerline{\it P. Cignoni, D. Laforenza, C. Montani, R. Perego, and R. Scopigno} \medskip \noindent The paper deals with the parallelization of Delaunay triangulation, a widely used space partitioning technique. Two parallel implementations of a three-dimensional {\it incremental construction\/} algorithm are presented. The first is based on the decomposition of the spatial domain, while the second relies on the master-slaves approach. Both parallelization strategies are evaluated, stressing practical issues rather than theoretical complexity. We report on the exploitation of two different parallel environments: a tightly-coupled distributed memory MIMD architecture and a network of workstations co-operating under the Linda environment. Then, a third hybrid solution is proposed, specifically addressed to the exploitation of higher parallelism. It combines the other two solutions by grouping the processing nodes of the multicomputer into clusters and by exploiting parallelism at two different levels. \vskip 8pt \noindent{\bf Short Code:\ \ }[Cignoni:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 1, pp. 61--80 (C212) \vskip 0.25in \centerline{\bf Processing Recursive Queries on Transputers} \medskip \centerline{\it J. Shao, D. A. Bell, and M. E. C. Hull} \medskip \noindent There is a perceived need within the database community to extend the traditional relational database systems so as to accommodate applications which are deductive in nature. One major problem involved in such an extension is the efficient processing of recursive queries. To this end, parallel processing is expected to play an important role. While substantial work has been done in devising strategies for processing recursive queries in parallel, it is perhaps surprising that little has been reported on the implementation and the run-time performance of these strategies. In the paper we report our experience of implementing a pipelined evaluation strategy on transputers. A wide range of queries, database structures and architectural configurations are considered as benchmarks in this study. The performance is studied in terms of both speed-up factors and communication costs. The experimental results show the potential of processing recursive queries in parallel, and provide insight into the usefulness of using transputers for such applications. \vskip 8pt \noindent{\bf Short Code:\ \ }[Shao:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 2, pp. 81--120 (C244) \vskip 0.25in \centerline{\bf SCIDDLE: A Tool for Large Scale Distributed Computing} \medskip \centerline{\it P. Arbenz, C. Sprenger, H. P. L\"uthi, and S. Vogel} \medskip \noindent We report on a portable communication environment, `SCIDDLE', for distributing computations over heterogeneous networks of UNIX computers. SCIDDLE is based on the client-server model. It was designed to support the distribution of large scale numerical computations and to keep its usage as simple as possible. All interprocess communication is done via remote procedure calls. The user defines the interface between communicating processes in a simple declarative language. Parallel programming is supported by asynchronous RPCs. A convenient array handling has been implemented. We demonstrate the usefulness of the system with an application from quantum chemistry on internet-connected workstations and supercomputers. \vskip 8pt \noindent{\bf Short Code:\ \ }[Arbenz:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 2, pp. 121--146 (C248) \vskip 0.25in \centerline{\bf Computational Similarity} \medskip \centerline{\it R. W. Hockney} \medskip \noindent The paper enunciates the principle of computational similarity, whereby calculations with the same values for certain dimensionless ratios are said to be `computationally similar' and as a consequence have the same optimum self-speed-up and optimum number of processors. Based on a three-parameter description of the computer hardware, two dimensionless ratios, which are only a function of the problem size and the hardware parameters, completely determine the scaling. Contours of constant self-speed-up can be drawn on a two-dimensional dimensionless universal scaling diagram (DUSD). This diagram is for a particular class of timing expressions that can be shown to represent approximately the performance of a corresponding class of computer programs or benchmarks, but it applies to all computers describable by the three computer programs or benchmarks, but it applies to all computers describable by the three hardware parameters and to all problem sizes. Thus the dimensionless ratios play a similar role in the study of computer performance, as do the Reynolds and other dimensionless numbers in fluid dynamics. This dimensional analysis of computer performance is illustrated by the case of the FFT1 benchmark from the Southampton `Genesis' distributed-memory benchmarks. \vskip 8pt \noindent{\bf Short Code:\ \ }[Hockney:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 2, pp. 147--166 (C249) \vskip 0.25in \centerline{\bf Partitioning and Mapping in Embedded Multiprocessor Architectures} \centerline{\bf in the Presence of Constraints} \medskip \centerline{\it S. Yalamanchili, L. Te Winkel, D. Perschbacher, and B. Shenoy} \medskip \noindent The paper focuses on the problem of partitioning and mapping parallel programs onto heterogeneous embedded multiprocessor architectures for real-time applications. Such applications present unique constraints and challenges. In addition to heterogeneity, the proposed partitioning and mapping algorithms satisfy memory, task throughput, task placement, intertask communication bandwidth, and co-location constraints. They do so for architectures that utilize circuit-switched (rather than packet-switched) interprocessor communication and optimize latency and throughput in addition to load-balancing. Finally, these mapping algorithms make use of knowledge of the local scheduling discipline to accommodate real-time scheduling constraints. Our focus is on unstructured parallel programs that fall into one of two classes: (i) the class of computations characteristic of control applications in a real-time environment where tasks execute concurrently, periodically exchanging information, and (ii) pipelined computation graphs found in sensor data processing applications. The algorithms are implemented in a set of tools that operate with commercial CASE tools at one end, and present an interface to multiprocessor simulators at the other end. Collectively, the algorithms form a significant component of an interactive design environment for the development and mapping of real-time embedded parallel programs. The paper describes the algorithms, the encapsulating toolset, and presents an example of their application to an existing embedded application---an Autonomous Underwater Vehicle application. \vskip 8pt \noindent{\bf Short Code:\ \ }[Yalamanchili:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 3, pp. 167--189 (C203) \vskip 0.25in \centerline{\bf Instructional Footprinting and Semantic Preservation in Linda} \medskip \centerline{\it K. Landry and J. D. Arthur} \medskip \noindent Linda is a co-ordination language designed to support process creation and interprocess communication within conventional computational languages. Although the Linda paradigm touts architectural and language independence, it often suffers performance penalties, particularly on local area network platforms. Instructional footprinting is an optimization technique with the primary goal of enhancing the execution speed of Linda programs. The two main aspects of instructional footprinting are instructional decomposition and code motion. This paper addresses the semantic issues encountered when the Linda primitives, IN and RD, are decomposed and moved past other Linda operations. Formal semantics are given as well as results showing significant speedup (as high as 100\%) when instructional footprinting is used. \vskip 8pt \noindent{\bf Short Code:\ \ }[Landry:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 3, pp. 191--207 (C182) \vskip 0.25in \centerline{\bf Partitioning Tasks Between a Pair of Interconnected} \centerline{\bf Heterogeneous Processors:\ A Case Study} \medskip \centerline{\it D. J. Lilja} \medskip \noindent With the variety of computer architectures available today, it is often difficult to determine which particular type of architecture will provide the best performance on a given application program. In fact, one type of architecture may be well suited to executing one section of a program while another architecture may be better suited to executing another section of the same program. One potentially promising approach for exploiting the best features of different computer architectures is to partition an application program to simultaneously execute on two or more types of machines interconnected with a high-speed communication network. A fundamental difficulty with this heterogeneous computing, however, is the problem of determining how to partition the application program across the interconnected machines. The goal of this paper is to show how a programmer or a compiler can use a model of a heterogeneous system to determine the machine on which each subtask should be executed. This technique is illustrated with a simple model that relates the relative performance of two heterogeneous machines to the communication time required to transfer partial results across their interconnection network. Experiments with a Connection Machine CM-200 demonstrate how to apply this model to partition two different application programs across the sequential front-end processor and the parallel back-end array. \vskip 8pt \noindent{\bf Short Code:\ \ }[Lilja:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 3, pp. 209--223 (C209) \vskip 0.25in \centerline{\bf $P^3L$: A Structured High-level Parallel Language, and Its Structured Support} \medskip \centerline{\it B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi} \medskip \noindent The paper presents a parallel programming methodology that ensures easy programming, efficiency and portability of programs to different machines belonging to the class of the general-purpose, distributed-memory, MIMD architectures. The methodology is based on the definition of a new, high-level, explicitly parallel language, called $P^3L$, and of a set of static tools that automatically adapt the program features for each target architecture. $P^3L$ does not require programmers to specify process activations, the actual parallelism degree, scheduling, or interprocess communications, i.e. all those features that need to be adjusted to harness each specific target machine. Parallelism is, on the other hand, expressed to a structured and qualitative way, by hierarchical composition of a restricted set of language constructs, corresponding to those forms of parallelism that are frequently encountered in parallel applications, and that can be efficiently implemented. The efficient portability of $P^3L$ applications is guaranteed by the compiler along with the novel structure of the support. The compiler automatically adapts the program features for each specific architecture, using the costs (in terms of performance) of the low-level mechanisms exported by the architecture itself. In our methodology, these costs, along with other features of the architecture, are viewed through an abstract machine, whose interface is used by the compiler to produce the final object code. \vskip 8pt \noindent{\bf Short Code:\ \ }[Bacci:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 3, pp. 225--255 (C247) \vskip 0.25in \centerline{\bf PARADOX---A Heterogeneous Machine for the 3D Vision Algorithm} \medskip \centerline{\it Han Wang and Mike Brady} \medskip \noindent We present in the paper a heterogeneous machine, PARADOX, designed in Oxford for real time image processing. PARADOX consists of a Datacube pipelined image processor, a configurable network of 32 T800 transputers, a Sun-4 workstation and a special-purpose interface connecting the Datacube to the transputer network. Its programming style is a mixture of MIMD paradigm employing a processor farm and interrupts to control the image processing pipeline in the Datacube via a VME bus. A number of vision algorithms including the Canny edge detector have been implemented on this machine at video rate. In particular, a 3D vision algorithm, Droid, has been implemented to provide navigation data for an autonomous guided vehicle. \vskip 8pt \noindent{\bf Short Code:\ \ }[Wang:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 4, pp. 257--272 (C205) \vskip 0.25in \centerline{\bf Implementation of a Fully-balanced Periodic Tridiagonal Solver} \centerline{\bf on a Parallel Distributed Memory Architecture} \medskip \centerline{\it T. M. Eidson and G. Erlebacher} \medskip \noindent While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem---a periodic tridiagonal solver---are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate these strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solutions is independent and thus can be computed in parallel. Thus, a Gaussian-elimination-type algorithm can be used in a parallel computation and more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data are moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm follow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian-elimination-type algorithm into the parallel, chained, load-balanced algorithm. \vskip 8pt \noindent{\bf Short Code:\ \ }[Eidson:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 4, pp 273--302 (C208) \vskip 0.25in \centerline{\bf Partitioning of Unstructured Meshes for Load Balancing} \medskip \centerline{\it Olivier C. Martin and Steve W. Otto} \medskip \noindent Many large-scale engineering and scientific calculations involve repeated updating of variables on an unstructured mesh. To do these types of computations on distributed memory parallel computers, it is necessary to partition the mesh among the processors so that the load balance is maximized and interprocessor communication time is minimized. This can be approximated by the problem of partitioning a graph so as to obtain a minimum cut, a well-studied combinatorial optimization problem. Graph partitioning is NP complete, so for real world applications, one resorts to heuristics, i.e., algorithms that give good but not necessarily optimum solutions. These algorithms include recursive spectral bisection, local search methods such as Kernighan-Lin, and more general purpose methods such as simulated annealing. We show that a general procedure enables us to combine simulating annealing with Kernighan-Lin. The resulting algorithm is both very fast and extremely effective. \vskip 8pt \noindent{\bf Short Code:\ \ }[Martin:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 4, pp. 303--314 (C220) \vskip 0.25in \centerline{\bf Programming Pipelined CAD Applications on Message Passing Architectures} \medskip \centerline{\it P. Kenyon, S. Seth, P. Agrawal, A. Clematis, G. Dodera, and V. Gianuzzi} \medskip \noindent Programming applications in computer-aided design of VLSI are difficult on parallel architectures, especially pipelined implementations derived from their sequential counterparts by algorithmic partitioning. The difficulty is primarily due to lack of good program development environments and tools. Our solution, applicable to message-passing architectures, is based upon a definition of a broad class of non-linear pipeline configurations and an asynchronous data-driven model for pipeline stage interactions. It provides object-oriented definitions of stages and interconnecting channels. These objects are embedded in C++ so that the correctness of application programs can be tested on a workstation in a simulated environment. The simulation is instrumented to provide data useful in assessing relative computational loading and balancing of stages. Thus a good part of program development can take place in the environment of a workstation familiar to the programmer. A non-trivial application is developed to illustrate these ideas. \vskip 8pt \noindent{\bf Short Code:\ \ }[Kenyon:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 4, pp. 315--337 (C204) \vskip 0.25in \centerline{\bf Editorial:\ Resource Management of Parallel and Distributed Systems} \centerline{\bf with Static Scheduling: Challenges, Solutions and New Problems} \medskip \centerline{\it Isfaq Ahmad} \medskip \noindent The true potential of high-performance computing using massively parallel processors or network-based distributed systems can only be realized if application-to-architecture mapping is done properly and if automated tools for performing such tasks are available to the programmers. Extracting parallelism from serial programs is not trivial. However, even when parallelism is readily available from highly compute-intensive applications such as digital signal processing, fluid dynamics, solution of linear and partial differential equations, and computer vision, such applications fail to yield any meaningful speedup in the absence of efficient mapping and scheduling strategies. The research in this area therefore is an integral part of high-performance computing. \vskip 8pt \noindent{\bf Short Code:\ \ }[Ahmad:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 339--347 \vskip 0.25in \centerline{\bf Compile-time Scheduling of Multithread with Data Localities} \centerline{\bf on Multiple Vector Processors} \medskip \centerline{\it Chih-Yung Chang, and Jang-Ping Sheu} \medskip \noindent A large class of loop programs applied in solving differential equations, Fourier transforms, image processing and neural processing can be translated or rewritten into a vector execution form with a $\pi\/$-{\it block\/} dependence graph. In the paper we propose a multithreading strategy to partition such vectorized loops into multithread execution form. Each partitioned thread consists of instances of statements with localities in vector registers. The multithreading scheme gives a novel combination of {\it loop unrolling, statement instances reordering, index shifting, vector register reuse exploiting\/} and {\it multithreading}. For some cases of loop program with $\pi\/$-{\it block\/} dependence graph, experimental results show that our scheme assists vector compilers of the Convex~C38 series to reduce the number of memory accesses and synchronizations among CPUs. \vskip 8pt \noindent{\bf Short Code:\ \ }[Chang:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 349--369 \vskip 0.25in \centerline{\bf Comparative Study of Task Duplication Static Scheduling} \centerline{\bf Versus Clustering and Non-clustering Techniques} \medskip \centerline{\it Behrooz Shirazi, Hsing-Bung Chen, and Jeff Marquis} \medskip \noindent One of the major issues that needs to be addressed in distributed memory multiprocessor (DMM) systems is the program task partitioning and scheduling problems, i.e. mapping of an application program's precedence related task threads among the processing elements of a DMM system. The optimal task partitioning and scheduling problem, with the goal of minimizing the program execution time and interprocessor communication overheads, is known to be an NP-complete problem. The paper addresses the design, development and performance evaluation of a novelstatic task partitioning and scheduling method called linear clustering with task duplication (LCTD). LCTD employs the linear (sequential) execution of tasks and task duplication heuristics in achieving minimized computation and interprocessor communication delays in DMMs. The superiority of the proposed LCTD algorithm is demonstrated through simulation studies and comparison against several of the existing static scheduling schemes, such as heavy node first (HNF) and linear clustering. We show that the proposed method can obtain an average of 33\% improvement in program execution time and 21\% improvement in processor utilization compared to linear clustering and HNF methods. \vskip 8pt \noindent{\bf Short Code:\ \ }[Shirazi:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 371--389 \vfill\eject \centerline{\bf A Fast Recursive Mapping Algorithm} \medskip \centerline{\it Song Chen, and Mary M. Eshaghian} \medskip \noindent The paper presents a generic technique for mapping parallel algorithms onto parallel architectures. The proposed technique is a fast recursive mapping algorithm which is a component of the Cluster-M programming tool. The other components of Cluster-M are the Specification module and the Representation module. In the Specification module, for a given task specified by a high-level machine-independent program, a clustered task graph called Spec graph is generated. In the Representation module, for a given architecture or computing organization, a clustered system graph called Rep graph is generated. Given a task (or system) graph, a Spec (or Rep) graph can be generated using one of the clustering algorithms presented in the paper. The clustering is done only once for a given task graph (system graph) independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering, and therefore it is not repeated for different mappings. The Cluster-M mapping algorithm presented produces a sub-optimal matching of a given Spec graph containing $M$ task modules, onto a Rep graph of $N$ processors, in $O(MN)$ time. This generic algorithm is suitable for both the allocation problem and the scheduling problem. Its performance is compared to other leading techniques. We show that Cluster-M produces better or similar results in significantly less time and using fewer or an equal number of processors as compared to the other known methods. \vskip 8pt \noindent{\bf Short Code:\ \ }[Chen:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 391--409 \vskip 0.25in \centerline{\bf Task Assignment Using a Problem-space Genetic Algorithm} \medskip \centerline{\it Imtiaz Ahmad, and Muhammad K. Dhodhi} \medskip \noindent The task assignment problems is one of assigning tasks of a parallel program among the processors of a distributed computing system in order to reduce the job turnaround time and to increase the throughput of the system. Since the task assignment problem is known to be NP-complete except in a few special situations, satisfactory suboptimal solutions obtainable in a reasonable amount of computation time are generally sought. In the paper we introduce a technique based on the problem-space genetic algorithm (PSGA) for the static task assignment problem in both homogeneous and heterogeneous distributed computing systems to reduce the task turnaround time and to increase the throughput of the system of properly balancing the load and reducing the interprocessor communication time among processors. The PSGA based approach combines the power of genetic algorithms, a global search method, with a simple and fast problem-specific heuristic to search a large solution space efficiently and effectively to find the best possible solution in an acceptable CPU time. Experimental results on test examples from the literature show considerable improvements in both the assignment cost and the CPU times over the previous work. The proposed scheme is also applied to a digital signal processing (DSP) system consisting of 119 tasks to illustrate its balancing properties and computational advantage on a large system. \vfill\eject \noindent The proposed scheme offers 12--30\% improvement in the assignment cost as compared to the previous best known results for the DSP example. \vskip 8pt \noindent{\bf Short Code:\ \ }[Ahmad:95b] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 411--428 \vskip 0.25in \centerline{\bf Run-time Issues in Program Partitioning on Distributed Memory Systems} \medskip \centerline{\it Santosh Pande, and Dharma P. Agrawal} \medskip \noindent Our earlier work reported a {\it Threshold Scheduling Method\/} for compile-time mapping of functional parallelism on distributed-memory systems. The work reported in this paper discusses run-time issues in efficiently supporting the functional parallelism with minimal overheads, through a combination of compile-time and run-time ownership analysis. At compile time, the code generation phase determines whether a local copy of a live definition of a variable needed by a task is available on a given processor, through an ownership analysis. In case ownership cannot be resolved at compile time, an appropriate code is generated to perform analysis at run time. The code generation is carried out so that all the processors carry the same copy of the compiled program with the individual processor's code being isolated and the universally owned code being replicated on all processors to minimize run-time overheads. The run-time system maintains the static and dynamic ownerships at every processor to avoid communication overhead on ownership information. We demonstrate the approach by incorporating it in the compiler for targeting a parallel functional language, Sisal (streams and iterations in single assignment language), to Intel Touchstone i860 systems. Several benchmarks demonstrate the viability of the approach. \vskip 8pt \noindent{\bf Short Code:\ \ }[Pande:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 429--454 \vskip 0.25in \centerline{\bf A Framework for Partitioning Parallel Computations} \centerline{\bf in Heterogeneous Environments} \medskip \centerline{\it Jon B. Weissman, and Andrew S. Grimshaw} \medskip \noindent In the paper we present a framework for partitioning data parallel computations across a heterogeneous metasystem at runtime. The framework is guided by program and resource information which is made available to the system. Three difficult problems are handled by the framework: processor selection, task placement and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced elapsed time. In particular, processor selection determines the best grain size at which to run the computation, task placement reduces communication cost, and data domain decomposition achieves processor load balance. We present results which indicate that excellent performance is achievable using the framework. \vfill\eject \noindent The paper extends our earlier work on partitioning data parallel computations across a single-level network of heterogeneous workstations. \vskip 8pt \noindent{\bf Short Code:\ \ }[Weissman:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 455--478 \vskip 0.25in \centerline{\bf A Partitioning Advisory System for Networked Data-parallel Processing} \medskip \centerline{\it Phyllis E. Crandall, and Michael J. Quinn} \medskip \noindent With the increased performance capabilities of desktop computers, networked computing has become a popular vehicle for using parallelism to solve a variety of computationally intense problems. However, node heterogeneity and high communication costs may limit performance unless the problem space is carefully partitioned across the network in a way that considers both the capabilities of the machines and the high network communication costs. We describe an advisory system that is designed to help the programmer, compiler or run-time environment choose the best decomposition strategy for partitioning specific data-parallel applications across a given collection of machines. The system includes provisions for assessing the capabilities of the participating machines and the network in light of the current workload. Given information about the problem space, the machine speeds and the network, the system provides a ranking of three standard partitioning methods. We test the validity of our system by comparing the observed relative performance with predicted relative performance of different data decompositions on a program with a variable number of floating point operations and a 5-point stencil communication pattern. \vskip 8pt \noindent{\bf Short Code:\ \ }[Crandall:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 479--495 \vskip 0.25in \centerline{\bf A Compendium of Processor Allocation Strategies for} \centerline{\bf Two-dimensional Mesh Connected Systems} \medskip \centerline{\it Bonnie E. Melhart, Craig A. Morgenstern, and Tom Nute} \medskip \noindent Multiple processor systems are an integral part of today's high-performance computing environment. Such systems are often configured as a two-dimensional grid of processors called a mesh. Tasks compete for rectangular submeshes of this mesh. The choice of submesh allocation strategy can significantly affect the level of processor utilization and a task's waiting time. In addition, the execution speed of various allocation algorithms varies widely, which can further affect system performance. This paper describes and categorizes several submesh allocation strategies, including a previously unreported method that is superior to other methods in terms of execution speed. The paper includes results of simulation studies used to compare the performance characteristics of the most efficient allocation strategies in each category. \vskip 8pt \noindent{\bf Short Code:\ \ }[Melhart:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 5, pp. 497--514 \vskip 0.25in \centerline{\bf Distributed Implementations of Communicating Objects} \medskip \centerline{\it Weijia Jia and Gaetan Libert} \medskip \noindent This paper presents the design and implementation of a CSP-based object-oriented system. The system consists of a specification model, Communicating-object, and a prototype system, C-OBJECT, supporting the model. The objects execute in a set of parallel processes called actions. The dynamic communicating objects exchange messages by both data transmissions and function invocations. The C-OBJECT prototype is constructed in a MIMD architecture (32-node transputer) with C++ which is composed of two parts: network configuration and a Communicating-object service subsystem (library) providing various levels of message-passing primitives. The initial prototype with good performance has shown its availability for C and C++ programming. The integrated system facilitates application software with tools of specification, design and implementation. \vskip 8pt \noindent{\bf Short Code:\ \ }[Jia:94a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 6, pp. 515--541 (C191) \vskip 0.25in \centerline{\bf The Genesis Distributed Memory Benchmarks Part~2:\ COMMS1} \centerline{\bf TRANS1, FFT1 and QCD2 Benchmarks on the SUPRENUM} \centerline{\bf and iPSC/860 Computers} \medskip \centerline{\it T. Hey, R. Hockney, V. Getov, I. Wolton, J. Merlin, and J. Allwright} \medskip \noindent The Genesis benchmark suite has been assembled to evaluate the performance of distributed-memory MIMD systems. The problems selected all have a scientific origin (mostly from physics or theoretical chemistry), and range from synthetic code fragments designed to measure the basic hardware properties of the computer (especially communication and synchronization overheads), through commonly used library sub-routines, to full application codes. This is the second of a series of papers on the Genesis distributed-memory benchmarks, which were developed under the European ESPRIT research program. Results are presented for the SUPRENUM and iPSC/860 computers when running the following benchmarks:\ COMMS1 (communications), TRANS1 (matrix transpose), FFT1 (fast Fourier transform), and QCD2 (conjugate gradient kernel). The theoretical predictions are compared with, or fitted to, the measured results, and then used to predict (with due caution) how the performance might scale for larger problems and more processors than were actually available during the benchmarking. \vskip 8pt \noindent{\bf Short Code:\ \ }[Hey:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 6, pp. 543--570 (C184) \vskip 0.25in \centerline{\bf Accelerated Ray Tracing using an nCUBE2 Multicomputer} \medskip \centerline{\it I. J. Grimstead and S. Hurley} \medskip \noindent Acceleration techniques for rendering a dynamic sequence of frames (animations) and static scenes using ray tracing are presented. The first technique discusses temporal acceleration for dynamic scenes which takes advantage of ray coherence, while the second technique discusses acceleration for complex static scenes based on parallelism. Several practical aspects of the parallel implementation on an nCUBE2 hypercube computer are included. \vskip 8pt \noindent{\bf Short Code:\ \ }[Grimstead:94a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 6, pp. 571--586 (C200) \vskip 0.25in \centerline{\bf Editorial: Resource Management in Parallel and Distributed Systems} \centerline{\bf with Dynamic Scheduling: Dynamic Scheduling} \medskip \centerline{\it Isfaq Ahmad} \medskip \noindent This special issue is devoted to dynamic scheduling techniques and includes papers reporting a wide spectrum of related research. The distinctive feature of dynamic scheduling, in contrast to static scheduling, for parallel and distributed systems is that it takes the notion of time into consideration. In other words, dynamic scheduling is management of computing resources according to their time-dependent states. While it is clear that there are a wide range of problems for which static scheduling cannot be applied due to the lack of information about the problem as well as the system, there are a large number of environments in which dynamic scheduling has to be an essential part of the system software. \vskip 8pt \noindent{\bf Short Code:\ \ }[Ahmad:95c] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 587--590 \vskip 0.25in \centerline{\bf The Dynamic Behaviour of Parallel Programs Under Process Migration} \medskip \centerline{\it Rosemary Candlin, and Joseph Phillips} \medskip \noindent We have studied the interaction between process-based parallel programs whose characteristics change in various ways at run time and the operation of load-balancing, as implemented by process migration. In order to do this, we propose a simple performance model, whose parameters represent features of the program's execution such as the frequency and regularity of the changes in computational characteristics, and conduct a series of experiments involving simulated executions of synthetic programs with controlled parameter values. From these we can deduce the relative importance of the parameters from the point of view of their influence on performance. We can explain our observations in terms of a simplified stochastic model that relates local changes in load to global behaviour. We show that the dynamics of load-balancing can be represented approximately by a first-order difference equation, and that the distributed process migration algorithm is consistent with a behaviour on the global scale which can be regarded as that of a traditional feedback controller. \vskip 8pt \noindent{\bf Short Code:\ \ }[Candlin:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 591--613 \vskip 0.25in \centerline{\bf A Parallel Dynamic Load-balancing Algorithm} \centerline{\bf for Solution-adaptive Finite Element Meshes on 2D Tori} \medskip \centerline{\it Yeh-Ching Chung, Yaa-Jyun Yeh, and J.-S. Liu} \medskip \noindent To efficiently execute a finite element program on a 2D torus, we need to map nodes of the corresponding finite element graph to processors of a 2D torus such that each processor has approximately the same amount of computational load and the communication among processors is minimized. If nodes of a finite element graph do not increase discretely due to the refinement of some finite elements during the execution of a program, a dynamic load-balancing algorithm has to be performed many times in order to balance the computational load of processors while keeping the communication cost as low as possible. In the paper we propose a parallel dynamic load-balancing algorithm (LB) to deal with the load-imbalancing problem of a solution-adaptive finite element program on a 2D torus. The algorithm uses an iterative approach to achieve load-balancing. We have implemented the proposed algorithm along with two parallel mapping algorithms, parallel orthogonal recursive bisection (ORB) and parallel recursive mincut bipartitioning (MC), on a simulated 2D torus. Three criteria, the execution time of load-balancing algorithms, the computation time of an application program under different load balancing algorithms, and the total execution time of an application program (under several refinement phases) are used for performance evaluation. Simulation results show that (1) the execution of LB is faster than those of MC and ORB; (2) the mappings of LB are better than those of ORB and MC; and (3) the speedups of LB are better than those of ORB and MC. \vskip 8pt \noindent{\bf Short Code:\ \ }[Chung:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 615--631 \vskip 0.25in \centerline{\bf Dynamic Scheduling Techniques for Heterogeneous Computing Systems} \medskip \centerline{\it Babak Hamidzadeh, Yacine Atif, and David J. Lilja} \medskip \noindent There has been a recent increase of interest in heterogeneous computing systems, due partly to the fact that a single parallel architecture may not be adequate for exploiting all of a program's available parallelism. In some cases, heterogeneous systems have been shown to produce higher performance for lower cost than a single large machine. However, there has been only limited work on developing techniques and frameworks for partitioning and scheduling applications across the components of a heterogeneous system. In this paper we propose a general model for describing and evaluating heterogeneous systems that considers the degree of uniformity in the processing elements and the communication channels as a measure of the heterogeneity in the system. We also propose a class of dynamic scheduling algorithms for a heterogeneous computing system interconnected with an arbitrary communication network. These algorithms execute a novel optimization technique to dynamically compute schedules based on the potentially non-uniform computation and communication costs on the processors of a heterogeneous system. A unique aspect of these algorithms is that they easily adapt to different task granularities, to dynamically varying processor and system loads, and to systems with varying degrees of heterogeneity. Our simulations are designed to facilitate the evaluation of different scheduling algorithms under varying degrees of heterogeneity. The results show improved performance for our algorithms compared to the performance resulting from existing scheduling techniques. \vskip 8pt \noindent{\bf Short Code:\ \ }[Hamidzadeh:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 633--652 \vskip 0.25in \centerline{\bf An Adaptive Algorithm for Resolving Processor Thrashing} \centerline{\bf in Load Distribution} \medskip \centerline{\it Chin Lu, and Sau-Ming Lau} \medskip \noindent Processor thrashing in load distribution refers to the situation when a number of nodes try to negotiate with the same target node simultaneously. The performance of dynamic load-balancing algorithms can be degraded because processor thrashing can lead to receiver node overdrafting, thus causing congestion at a receiver node and reduction of workload distribution. In the paper we present an adaptive algorithm for resolving processor thrashing in load distribution. The algorithm is based on the integration of three components: (1) a batch task assignment policy, which allows a number of tasks to be transferred as a single batch from a sender to a receiver; (2) a negotiation protocol to obtain mutual agreement between a sender and a receiver on the batch size; and (3) an adaptive symmetrically-initiated location policy to select a potential transfer partner. Simulations reveal that our algorithm provides a significant performance improvement at high system loads because the algorithm can avoid processor thrashing so that CPU capacity is more fully utilized. \vskip 8pt \noindent{\bf Short Code:\ \ }[Lu:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 653--670 \vskip 0.25in \centerline{\bf A Load Index and a Transfer Policy for Global} \centerline{\bf Scheduling Tasks with Deadlines} \medskip \centerline{\it Niranjan G. Shivaratri, and Mukesh Singhal} \medskip \noindent Two important components of a global scheduling algorithms are its transfer policy and its location policy. While the transfer policy determines {\it whether\/} a task should be transferred, the location policy determines {\it where\/} it should be transferred. Many global scheduling algorithms have been proposed to schedule tasks with deadline constraints. These algorithms try to transfer tasks only {\it when\/} task's deadlines cannot be met locally or local load is high (i.e. they take only corrective measures). However, a scheduling algorithm that takes preventive measures in addition to corrective measures can reduce potential deadline misses substantially. In this paper we present: (a) a load index which characterizes the system state and is more conducive to preventive and corrective measures; (b) a new transfer policy which takes preventive measures by doing anticipatory task transfers in addition to corrective measures. The proposed transfer policy adapts better to the workload by availing of the accurate system state made available by the proposed load index. \vfill\eject \noindent An algorithm making use of the new transfer policy and the new load index is shown to reduce the number of deadline misses significantly when compared to algorithms taking {\it only\/} corrective measures. \vskip 8pt \noindent{\bf Short Code:\ \ }[Shivaratri:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 671--688 \vskip 0.25in \centerline{\bf Symmetrical Hopping: A Scalable Scheduling Algorithm} \centerline{\bf for Irregular Problems} \medskip \centerline{\it Min-You Wu} \medskip \noindent A run-time support is necessary for parallel computations with irregular and dynamic structures. One important component in the support system is the run-time scheduler which balances the working load in the system. We present a new algorithm, Symmetrical Hopping, for dynamic scheduling of ultra-lightweight processes. It is a dynamic, distributed, adaptive and scalable scheduling algorithm. This algorithm is described and compared to four other algorithms that have been proposed in this context, namely the randomized allocation, the sender-initiated scheduling, the receiver-initiated scheduling, and the gradient model. The performance of these algorithms on Intel Touchstone Delta is presented. The experimental results show that the Symmetrical Hopping algorithm achieves much better performance due to its adaptiveness. \vskip 8pt \noindent{\bf Short Code:\ \ }[Wu:95b] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 689--706 \vskip 0.25in \centerline{\bf Nearest-neighbor Algorithms for Load-balancing in Parallel Computers} \medskip \centerline{\it Chengzhong Xu, Francis C. M. Lau, Burkhard Monien, and Reinhard L\"uling} \medskip \noindent With nearest-neighbor load-balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. The paper compares a couple of fairly well-known nearest-neighbor algorithms, {\it the dimension-exchange\/} (DE) and {\it the diffusion\/} (DF) methods and their several variants---the average-dimension-exchange (ADE), optimally tuned dimension-exchange (ODE), local average diffusion (ADF) and optimally tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability to controlling the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It turns out that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm is best suited for statically synchronous implementations of a load-balancing process regardless of its underlying communication models. The strength of the diffusion method is in asynchronous implementations in the all-port communication model; the ODF algorithm performs best in that case. The underlying communication networks considered assume the most popular topologies, the mesh and the torus and their special cases: the hypercube and the $k$-ary $n$-cube. \vskip 8pt \noindent{\bf Short Code:\ \ }[Xu:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 7, pp. 707--736 \centerline{\bf Beyond Software Performance Visualization} \medskip \centerline{\it M. Abrams, T. J. Lee, H. T. Cadiz, and K. Ganugapati} \medskip \noindent Performance visualization tools of the last decade have yielded new insights into the behavior of sequential, parallel, and distributed programs. However, they have three inherent limitations: (1) they only display what happened in {\it one\/} execution of a program (this is dangerous when analyzing concurrent applications, which are prone to non-deterministic behavior); (2) a human uses one or more bandwidth-limited senses with a visualization tool (this limits the scalability of a visualization tool); (3) the relationship of ``interesting'' program events are often separated in time by other events; thus discerning time dependent behavior often hinges on finding the ``right'' visualization---a possibly time-consuming activity. CHITRA93 complements visualization systems, while alleviating these limitations, and analyzes a set (or ensemble) of traces by combining the visualization of a few traces with a statistical analysis of the entire ensemble (overcoming (1)). It reduces the ensemble to empirical models that capture the time-dependent relationships of ``interesting'' program events through application, programming language and computer architecture independent analysis techniques (addressing (2) and (3)). It also incorporates the following transforms, such as aggregation, that simplify the ensemble and reduce the state-space size of the models generated; a user interface that allows certain transforms to be selected by editing the visualization with a mouse; homogeneity tests that allow partitioning of an ensemble; an efficient semi-Markov model generation algorithm whose computation time is linear in the sum of the lengths of the traces comprising the ensemble; and a CHAID-based model that can fathom non-Markovian relationships among transitions in the traces. The use of CHITRA93 is demonstrated by partitioning ten parallel database traces with nearly 8,000 states into two homogeneous subsets, each modeled by an irreducible, periodic and hierarchical stochastic process with as few as four states. \vskip 8pt \noindent{\bf Short Code:\ \ }[Abrams:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 8, pp. 737--764 (C219) \vskip 0.25in \centerline{\bf A Toolkit for Parallel Functional Programming} \medskip \centerline{\it P. H. Hartel, R. F. H. Hofman, K. G. Langendoen, H. L. Muller, W. G. Vree, and L. O. Hertzgerger} \medskip \noindent Our toolkit for the design and implementation of parallel functional programs supports the stepwise development of parallel programs from a high level sequential specification to an optimized parallel implementation. The toolkit is used as follows: \item{1.} The algorithm to be implemented is specified in a functional language. The program is debugged and tested using an interpreter. \item{2.} The program is compiled for a sequential machine. Its performance is analyzed and improved. \item{3.} Annotation-driven transformations are applied to the program to indicate parallel tasks. Simulations at task level, basic block level and bus transaction level make it possible to analyze the parallel performance of the program at three levels of detail. \item{4.} When the performance is optimized using the simulators, the program is executed on a genuine parallel machine. Several programs have been developed with the toolkit. A program that simulates tidal flow in an estuary of the North sea is presented as a case study to demonstrate the merits of the toolkit when developing complex parallel programs. The toolkit not only supports the design of parallel applications, it also allows the study of important concepts in parallel computer architecture. These include the behavior of cached memory systems, bus protocols, scheduling algorithms and memory management algorithms. \vskip 8pt \noindent{\bf Short Code:\ \ }[Hartel:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 8, pp. 765--793 (C225) \vskip 0.25in \centerline{\bf Parallel Solution of Large-Scale Differential-Algebraic Systems} \medskip \centerline{\it R. S. Maier, L. R. Petzold, and W. Rath} \medskip \noindent DASPK solves large-scale systems of differential-algebraic equations. It is based on the integration method in DASSL, but instead of a direct method for the associated linear system which arise at each time step, the preconditioned GMRES iteration is applied in combination with an inexact Newton method. Two parallel versions of DASPK have been developed: DASPKF90, a Fortran 90 data parallel implementation, and DASPKMP, a message-passing implementation written in Fortran 77 with extended BLAS. The parallel versions have been implemented for the Thinking Machines Corporation (TMC) CM-5, a massively parallel multiprocessor, keeping the user interface relatively simple while allowing for portability to other massively parallel architectures. The codes have been demonstrated on several large-scale test problems, including three-dimensional formulations of the heat equation, the Cahn-Hilliard equation and a multi-species reaction-diffusion problem. The formulations are described, including detail on preconditioning the Krylov iteration, timing results and performance analysis. \vskip 8pt \noindent{\bf Short Code:\ \ }[Maier:95a] \vskip 5pt \noindent{\bf Reference:\ \ \ \ }Vol. 7, No. 8, pp. 795--822 (C229) \vskip 0.25in \vfill\eject \bye