\documentstyle[11pt]{article} \textwidth=6.0in \textheight=8.5in \topmargin=-0.5in \oddsidemargin=0.25in % \def\lea{\mathrel{\raise .4ex\hbox{\rlap{$<$}\lower 1.2ex\hbox{$\sim$}}}} \def\gea{\mathrel{\raise .4ex\hbox{\rlap{$>$}\lower 1.2ex\hbox{$\sim$}}}} % \begin{document} \begin{center} {\Large\bf ESSL Parallel Evaluation} \\ \vspace{12pt} {\Large\em Geoffrey C. Fox} \end{center} \vspace{0.25in} \begin{center} {\bf General Remarks} \end{center} \noindent The following comments are applicable to essentially all routines. \begin{enumerate} \item One has in each case, a choice between \begin{description} \item[\rm a)] data parallelism---take a single instance of an ESSL routine and decompose it over a set of SP1 nodes \item[\rm b)] replication or functional parallelism---have separate instances of ESSL routines running in SPMD (Single Program, Multiple Data) mode on each node of a set of SP1 nodes. \end{description} In many cases, one needs to support a mix of both a) and b) to get efficient implementations. As we describe in detailed comments, some routines are more likely to be used in replicated than data parallel mode. Others, such as LU decomposition, are needed in data parallel mode (typically large matrices), replicated mode (typically small matrices), and mixed mode. The mixed mode is naturally supported in MIMD machines as long as one supports both standard ``node'' ESSL and data parallel versions split over arbitrary groupings of nodes. Improved SP1 software is needed, however, to implement (see 3.). \item Currently, the Thinking Machines library, CMSSL, is the best available parallel library. Study of this is recommended in designing a parallel ESSL. \item One can choose between High Performance Fortran (HPF) or Fortran plus message passing to express parallelism. Of course, on each node, one may also revert to assembly language in key cases. C++ or other object oriented approaches have special advantages for library generations. We have not discussed these language issues in our report because at the moment, Fortran (or C/assembly language on the node) plus message passing is the only effective operating environment on the SP1. It is also the approach to ESSL that will get the highest performance---albeit with more work on the part of the implementor. However, this issue is nontrivial and the ESSL team should work with the SP1 software team. For instance, IBM does not yet have a clear MPP software strategy, and this is needed to properly implement the data parallel/replication issue discussed in 1. \item As described below in detailed comments on each function, the parallel version of an ESSL routine sometimes has different algorithmic/parameter trade-offs than in sequential case. Current ESSL documentation does not give (in my opinion) particularly good advice to users. I would recommend improving user advice in future ESSL releases. \item There are two areas where we consider that sequential ESSL routines use inadequate algorithms---namely, random number generators and quadrature. This comment is then expanded below for many other areas where the best parallel implementation needs a new algorithm. \item We follow with detailed comments on Chapters~8--16. These are still broad based and can be amplified in many cases. \end{enumerate} \vspace{12pt} \begin{center} {\bf Matrix Routines: Chapters 8--11} \end{center} There has been substantial work on parallel algorithms for matrix algebra. The problems can be classified into two broad classes---{\em full\/} matrix cases, where no account is taken of possible zero values for matrix elements, and {\em sparse\/} matrix problems, where the number of nonzero elements is so small that explicit account needs to be taken of this structure. Banded matrices, where the band size is large compared to the number of processors (roughly band size $b\ge 10 \sqrt{N_{\rm proc}}$), are treated by the same type of algorithms as full matrices. Full matrix algorithms are implemented in the SCALAPACK project led by Dongarra and Walker and embodies the best known algorithms. Walker was in my group at Caltech, and has followed parallel matrix algorithms for five years. Dongarra is the clear national leader. IBM's full matrix library should either be based on SCALAPACK or advised by the developers of this system. One niche, but important facility is ``out of core'' solvers, which are needed for important (military) electromagnetic simulations using the method of moments. I would recommend providing out of core versions of those matrix algorithms needed for electromagnetic simulations. Syracuse University has great expertise here as the moment method originated in Professor Harrington's research here. Turning to sparse solvers, we find a more confused situation because there are very many variants of this problem. There are two classes of sparse solvers---``direct'' and ``iterative''. Banded solvers, mentioned above, are one example of a direct method. However, there are important direct methods, which make more use of zeros. We follow with a short review by Koester. Note the ``best'' method is very dependent on application as each application class has a very different zero structure. As key is mapping nonzero elements onto nodes in a clever fashion, the best methods depend critically on both machine and problem. Note that the SPn with a good high performance switch is particularly suited for these (sparse matrix) methods as both switch and RISC node should support irregular data access typical of sparse problems. Sparse matrices come in several areas\@; solution of partial differential equations, network simulations and linear programming are three examples. Solution methods involve a symbolic (decide on parallel strategy) and computation phase. In some problems (e.g., network simulation), the zero structure remains unaltered for many different compute phases. The algorithm trade-offs are very different for the solution of a single partial differential equation where a symbolic phase is followed by a single computation. NPAC has developed an excellent sparse solver for the type of hierarchical networks seen in electric transmission systems (draft paper enclosed). As in Koester's note, others have better solutions for different applications. Note that each of these sparse solvers is very hard and there are many poor papers in this field. IBM should get the very best advice and collaborators before producing parallel direct solvers. This is perhaps the hardest generally applicable parallel algorithm known. The use of direct methods can be expected to be less important on parallel than sequential machines as iterative methods are faster on the large problems one expects to solve on powerful parallel computers. Further, iterative methods are much easier to parallelize and typically get better speedup than direct methods. The cross over between direct and iterative methods is not well understood. Consider a typical sparse matrix with $N$ rows and about 10 nonzero elements in each row and column. Then one could expect that direct methods will be preferable to iterative methods for values of $N\lea 10,000 \rightarrow 100,000$. Further research is needed here. IBM should be very careful in the manual explaining for users the trade-offs between direct and iterative methods. Finally, we turn to iterative methods. These are not usually presented as iterative solutions to matrix equations, but rather as iterative partial different equation (PDE) solvers. There is a good reason for this. Although a discretized PDE is a sparse matrix equation, the origin as a PDE that contains important geometric information, which can be used to improve both solver and its parallelization. Thus, one usually considers libraries for conjugate-gradient multigrid, or ADI methods, for instance. A very good summary of this is given in a recent book by Dongarra and collaborators. This also presents an interesting alternative to libraries---namely, templates. This could be considered for ESSL as the many different application-specific choices make efficient libraries hard to develop. IBM should establish leading collaborators before developing ESSL capabilities for iterative sparse matrix problems. NPAC is quite strong in this area, as are many of the other leading HPCC institutions. \vspace{12pt} \begin{center} {\bf Discussion of Sparse Matrix Solvers in the IBM ESSL} \\ \vspace{10pt} {\em David Koester} \end{center} \begin{enumerate} \item Which existing routines can be parallelized for a distributed MIMD coarse-grain environment? Parallel sparse matrix solvers must be included in any reasonable engineering or scientific library, because solving sparse linear systems of equations practically dominates scientific and engineering computations. Unfortunately, the routines to solve linear systems of equations directly are very general and may not address some significant aspects involved in solving sparse systems of equations. For example---ordering the matrix to minimize the amount of fillin and consequently the overall amount of calculations---the general sparse LU factorization routines use Markovitz counts with threshold pivoting to minimize fillin and calculations, however, the direct symmetric solvers say nothing about techniques to maintain numerical stability (pivoting) and say nothing about techniques to minimize calculations (ordering). Parallel library routines, that supply the capability to perform sparse matrix calculations, should definitely be included in a library for distributed MIMD coarse-grain multiprocessors. It is my opinion that efforts should not be expended to parallelize the existing sparse solvers, rather the functionality and possibly the function call structure should be maintained while utilizing existing research codes. The best general multiprocess codes for direct solvers have been developed by Michael Heath, et al., and Seshadri Venugopal, et al. Other specialized parallel sparse matrix solvers such as my parallel block-diagonal-bordered solver could also be included. One area where research may be warranted would be to develop an input data-sensitive sparse matrix solver that intelligently selects the proper algorithm to minimize calculations after examining the data. It would not be difficult to detect various matrix structures and select the appropriate algorithm to minimize calculations while maximizing multiprocessor performance. \item Application class or classes that would rely on those routines ``solving sparse linear systems of equations practically dominates scientific and engineering computations'' \item Machines for which those routines are currently available Both general and application-dependent research codes exist for many distributed-memory and shared-memory multi-processors. The best general multiprocess codes for direct solvers have been developed by Michael Heath, et al., and Seshadri Venugopal, et al. \item User interface and usability of the routines The user interfaces for these routines are somewhat limited---there are limited available input data formats. It may be advantageous to expand the applicability of these routines by expanding the number of user input formats. Presently, there are separate routines for classes of input formats. Other sparse matrix solver libraries simplify the data input choice process by having a single solver for a class of problems with input preprocessing to modify the format of the data to a single canonical form. \item Which routines would be used by real people to do real problems? ``solving sparse linear systems of equations practically dominates scientific and engineering computations'' \item Distinguish routines which be used on a node only (e.g., calculate Hermite Polynomials) and which would be used in parallel form (FFT) Parallel sparse solvers would be used in parallel form, although the limited scalability in the parallel data often limits the number of processors that can be efficiently utilized. \item Also indicate current state of art (who has done best work) in parallelization The best general multiprocess codes for direct solvers have been developed by Michael Heath, et al. \cite{Heath:91a}, and Seshadri Venugopal, et al. [Venugopal:92a;92b]. For specialized applications where the block-diagonal-bordered form is appropriate, solvers developed in NPAC research may be appropriate to include in a general library. We enclose a paper describing this. \nocite{Venugopal:92a} \nocite{Venugopal:92b} \end{enumerate} \clearpage \begin{center} {\bf Chapter 12, Pages 764--842} \end{center} \noindent These pages discuss a broad range of one- two- and three-dimensional Fast Fourier Transform (FFT) routines. Parallelism of the FFT---for both binary $(2^n)$ and discrete (other carefully chosen lengths) cases---has been studied and no new research is probably needed to find good algorithms for the IBM SPn. Issues in design of parallel library include \begin{enumerate} \item Often multidimensional FFTs are needed and taking a simple two-dimensional example, one would make typically one of two choices \begin{description} \item[\rm a)] distribute both $x$ and $y$; \item[\rm b)] distribute, say $x$, so that all $y$ values for each $x$ are stored inside a node. \end{description} Consider the Fourier transform with respect to $\underline{y}$ variable. Then those for each $x$ value can be performed independently. This is an embarrassingly parallel implementation in b) but not in situation a). For b), the $\underline{x}$ FFT is of course not embarrassingly parallel. However, a simple implementation that is also fastest in some cases is to transform the data before FFTs (for both $\underline{x}$ and $\underline{y}$ in a), for just $\underline{x}$ FFT in b)) so that the series to be FFTed are contained in a single node. Then one can use the existing sequential FFTs. The needed data transformation strategies are well understood and are needed, for instance, in a High Performance Fortran compiler. A recent library developed at JPL (contact H.~Ding) for NASA HPCC program uses this strategy. David~Walker (who started FFT work in my group at Caltech) has continued under DOE sponsorship. Consider now (in items 2--5) the one-dimensional FFT, or the multidimensional case when one wishes to perform FFT ``in place'' without data transpositions. \item The discrete (nonbinary) FFTs require irregular communication patterns for which IBM switch architecture could be very effective compared to other machines. Also, discrete Fourier transforms are harder to vectorize than the binary FFTs. I suspect the SPn should be particularly good on discrete FFTs compared to machines built around vector nodes---simply because the RS~6000 node is more effective on such problems than vector architectures. Although I have written several papers myself on the parallel discrete nonbinary FFT [Aloisio:91a;91b] \nocite{Aloisio:91a} \nocite{Aloisio:91b} and this approach has several vigorous advocates, I believe that discrete (nonbinary) FFTs have, up to now, not been competitive with binary FFTs. Thus, the nonbinary FFTs have advantages due to lower arithmetic cost, both due to algorithm and better matching of data set size to FFT. For example, if your data set has 3000 entries, one might use a $2^{12}=4096$ element binary FFT, wasting 1096 elements. Nonbinary FFTs can match any dataset size more or less exactly as shown on page 757 of the ESSL manual (Figure~9). However, binary FFTs win, as they get better performance from vector or parallel machines. It would be interesting to reexamine the comparison between binary and nonbinary FFT on the SPn whose architecture, as indicated above, is particular favorable to the nonbinary case. However, these considerations are very sensitive to switch performance, and could be that even with the high performance switch option on SP1, nonbinary FFTs will perform poorly. \item All parallel FFT algorithms are particularly demanding on communication bandwidth\@; latency should not be a problem as long as FFT is large. \item The natural efficient FFT algorithm leads to particular $\underline{k}$ (if we go from $\underline{x}\rightarrow\underline{k}$ in transform) decomposition, which is not the obvious spatial decomposition in $\underline{k}$. Permutation is needed (a variant of data transformation mentioned in 1.) to get natural $\underline{k}$ space ordering. Sometimes this $\underline{k}$ space reordering is unnecessary. For example, in solving $\nabla^2 \varphi = 4\pi\zeta$ we get $\varphi^{\,(\underline{k})} \sim \frac{1}{\vert\,\underline{k}\,\vert^2}\,\zeta^{\,(\underline{k})}$. $\zeta$ can be Fourier transformed, divided by $\vert\,\underline{k}\,\vert^2$ and inverse transformed to $\varphi$. The results $\varphi$ will be decomposed in same as $\zeta$. The unnatural intermediate ordering of $\zeta^{\,(\underline{k})}, \varphi^{\,(\underline{k})}$ is irrelevant. \item As explained in \cite{Fox:87b}, there are important similarities between the techniques needed in parallelization and those useful to ensure data locality (good cache utilization) on a single node (sequential code). It would be advisable to combine both optimization when producing new FFT codes. \end{enumerate} \vspace{12pt} \begin{center} {\bf Chapter 12, Pages 843--882} \end{center} \noindent The detailed algorithms used here are fully explained. Given two sequences of length $N_1$ and $N_2$, they calculate correlations by two types of methods \begin{enumerate} \item Direct\@: sequential time complexity $O(N_1,N_2)$ \item Fourier Transform: sequential time complexity $O(N\log N)$ with $N=\max (N_1,N_2)$ \end{enumerate} The parallelization issues in the Fourier Transform case (2.) have already been discussed. The direct methods are easy to parallelize using the so called ``$O(N^2)$'' algorithm described, for instance, in Chapter~9 of \cite{Fox:88a}. There are subtle points for autocorrelations where there may be optimizations that can be used to improve performance. Currently, the user seems to choose between direct or Fourier methods. Notice that direct methods parallelize much more easily than Fourier methods. The communication overhead is, in the language of \cite{Fox:88a}, \begin{eqnarray*} \mbox{Direct}& & \frac{t_{\rm comm}}{t_{\rm calc}}\,\cdot\,\frac{{\rm const}}{n} \\ \mbox{Fourier}& & \frac{t_{\rm comm}}{t_{\rm calc}}\,\cdot\,\frac{{\rm const}\log\,N_{\rm proc}}{\log\,n} \end{eqnarray*} where one has $n=N/N_{\rm proc}$ elements of array stored on each of $N_{\rm proc}$ processors. Thus, the trade-off between direct and Fourier approaches depends in detail on $N$, $N_{\rm proc}$, $t_{\rm comm}$ (communication time), $t_{\rm calc}$ (node calculation time). I would recommend letting program, not user, make this choice with an option for manual override. We saw a similar trade-off change between binary and nonbinary FFTs when we went to parallel machines. \clearpage \begin{center} {\bf Chapter 12, Pages 884--886 SPOLY,DPOLY \\ Chapter 12, Pages 887--889 SIZC,DIZC} \end{center} \noindent Polynomial Evaluation and Zero Crossing---I don't see why you would need to run except in replicated mode, i.e., seemingly no parallel version needed. \vspace{12pt} \begin{center} {\bf Chapter 12, Pages 890--892 STREC,DTREC} \end{center} \noindent I am not familiar with the application of these routines. I do not know if they need to be parallelized, that is, whether they would be used for data stored in a node or distributed data. I don't think these parallelize easily. \vspace{12pt} \begin{center} {\bf Chapter 12, Pages 893--896 SQINT,DQINT} \end{center} \noindent This application (quadratic interpolation) would normally be applied separately in each node (namely, replicated mode). If an application needs parallel interpolation, this would be straightforward to implement. The performance might not be good due to load imbalance, that is, the entires in the argument $t$ of SQINT,DQINT need to be uniformly distributed among the nodes. This may not be natural. \vspace{12pt} \begin{center} {\bf Chapter 12, Pages 897--899 SWLEV,DWLEV} \end{center} \noindent I am not familiar with the use of these routines, and so I do not know if parallel versions would be important. If parallelization was needed, then it should be straightforward. The matrix element calculation is ``embarrassingly parallel''\@; the matrix solution can utilize parallel methods contained in routines of Chapters~8 to 11 of ESSL. \vspace{12pt} \begin{center} {\bf Chapter 13, Sorting and Searching} \end{center} \noindent These algorithms are not used very much in scientific computing. Basic sorting algorithms are fully covered in Chapter~18 of \cite{Fox:88a}. I believe these algorithms are still state of the art. Note that different approaches are appropriate in different parameter domains. One area that uses this algorithm is so-called ``Direct Simulation Monte Carlo'', where particles are sorted into cells. This application has been studied by NASA (see, for instance, article by Dagum \cite{Dagum:92a}). Sorting is also used in some simple load balancing algorithms where you distribute a set of particles $\left( x_i, y_i, z_i \right)$ by sorting in increasing $x_i$ and then dividing sample into equal chunks. This can be repeated in $y_i$ and $z_i$ if needed. This is not the best load balancing algorithm, but it can be used as it is fast. It could be supported by parallel algorithms discussed above. Sorting is, of course, very important in preparation of indices for databases. However, most database systems would not use an ESSL utility as issues of I/O and database access would be involved. The basic algorithm would be covered in references cited above---the implementation would be different. I do not know the important uses of searching. Two example applications come to mind. A set of sophisticated examples come in the ``human genome'' where matching new sequences with existing databases has been studied by many on (sequential and) parallel machines. This would be handled by specialized routines---not ESSL. The other general problem is searching an index for a particular entry. This would not necessarily be a formal database\@; one could be searching a table of interpolation points to find those points needed in interpolation algorithm. As in comments on SQINT,DQINT, I don't expect one to need parallelization of a single search. Databases obtain parallelization in this area by having a multitude of simultaneous searches. One could parallelize a single search, but the speedup would be modest. The above comments apply equally to binary or direct search for, as the parallelism is multiple searches there are no changes in search techniques used on the node. \vspace{12pt} \begin{center} {\bf Chapter 14, Pages 921--940, Interpolation} \end{center} \noindent I have already commented on this in discussing SQINT and DQINT in Chapter~12. I expect that parallelism is likely to occur by replication of several interpolations spread over the SPn nodes. I don't believe one needs to directly parallelize these routines. One example would be Monte Carlo programs where interpolation is often extensively used to look up cross section and reaction data as particles (neutrons, photons, electrons, etc.) travel through a medium. Parallelism is gotten in such applications by simulating simultaneously several particles. This would naturally lead to parallel interpolations as the different particles independently look up in cross section tables. This example is MIMD parallelism as each particle would be at different stages at a given time and so need different tables. This is simple to implement if node memory is of sufficient size that each node can hold all the table data. Even if memory is insufficient, most implementations would not ``parallelize the interpolation routine''. Rather, one would send the particle needing data to the routine holding necessary table. There one would perform conventional sequential interpolation. \vspace{12pt} \begin{center} {\bf Chapter 15, Pages 941--968, Numerical Quadrature} \end{center} \noindent The routines cover one- and two-dimensional integration on domains up to $256\times 256$ in size. The methods are not comprehensive and only cover simple cases. In the two-dimensional case, they do not cover the common case where one is integrating over a nonrectangular domain. Very rarely (in my experience), should one use the supplied methods as high order polynomial approximations are not very reliable. Rather, one typically replicates low order polynomial methods. For instance, the best approach to a 256 point integral is more likely to be 32 replications of 8 point Gauss than the 256 point method indicated. Adaptive methods where number of points is increased until a given tolerance is reached are important, but not provided. Again, the popular Monte Carlo method is not provided even though it is also often better than methods used in this chapter. It would be easy to improve on the numerical methods used in this chapter. I recommend that this be done before parallelization. All methods---those supplied now and those that might be---are simple to parallelize. The resultant parallel codes will be very efficient. Parallel quadrature is used quite often and should be provided, especially for integrations in two or higher dimensions. \vspace{12pt} \begin{center} {\bf Chapter 16, Random Number Generation} \\ \vspace{10pt} {\em Paul D. Coddington} \end{center} Random number generators are widely used for simulation in science and engineering problems. Randomness can often be present in the formulation of the problem, especially for systems where the variation of parameters (perhaps in time or space) is not known exactly, but can be modeled probabilistically (e.g., quantum processes, random noise or perturbations, and random defects in solid state physics). Many problems also use solution techniques which are stochastic, for example, Monte Carlo simulation, and stochastic optimization techniques such as simulated annealing or genetic algorithms. The number of applications that utilize random number generators is therefore quite large, especially in the computational sciences. In some simulations, the quality of the random numbers used is not that important, as long as they are ``fairly random''. However, in many of the problems for which random numbers are most heavily used, such as Monte Carlo simulation, the quality of the random number generator is crucial. A poor generator can produce incorrect results, and this has been seen many times in the scientific literature \cite{Barber:85a}, \cite{Coddington:94a}, \cite{Ferrenberg:92a}, \cite{Filk:85a}, \cite{Grassberger:93a}, \cite{Hoogland:85a}, \cite{James:90a}, \cite{Kalle:84a}, \cite{Milchev:86a},\hfill\break \cite{Parisi:85a}. Also, many generators fail standard statistical tests \cite{Knuth:81a}, \cite{LEcuyer:90a}, \cite{Park:88a}, \cite{Vattulainen:93a}. None of the random number generators provided by the ESSL are good enough for large-scale stochastic simulation. Some recommendations for improved generators are given below. A recent paper on research at NPAC on testing sequential random number generators is appended. Vendors of parallel computers generally provide library routines for random number generators, and these range from reasonable (Thinking Machines) to very poor (Maspar). There has been quite a lot of research on developing algorithms for parallel random number generators, but very little has been done on developing and using methods for testing such generators. There are many statistical and Monte Carlo tests for checking sequential random number generators, some of which can be simply extended to apply to parallel generators (although very few have been). However, new and better tests are required for parallel algorithms, for example, to test for correlations between random numbers produced on different processors. To my knowledge, no good set of tests has been proposed, let alone implemented, for parallel random number generators. This would be a very useful research project, since the lack of results means that it is difficult to compare the different generators and to recommend one as being any better than another. Some work in this area has been done at NPAC, but it is very preliminary and is not being actively continued due to lack of funding and manpower. \begin{enumerate} \item The SURAND Linear Congruential Generator SURAND is a standard 32-bit linear congruential generator. This generator has good randomness properties, but its period is too small. The period of SURAND is $2^{31} \! - \! 1$, or around $2 \! \times \! 10^9$, and can be exhausted on an RS/6000 in a few minutes. This generator is, therefore, completely inadequate for most scientific work. Although in the ``Use Considerations'' section of the documentation the short period is alluded to, the actual value of the period is not stated. I recommend that this generator be kept purely for historical purposes, and users should be strongly warned against using it. A longer period linear congruential generator should be provided. This could be one of the 48-bit generators recommended in Refs.~\cite{Coddington:94a}, \cite{Fishman:90a} (for example the standard C and Unix generator DRAND48), or preferably a generator with even larger modulus, for example the generator with multiplier $13^{13}$ and modulus $2^{59}$ used in the Numerical Algorithms Group (NAG) library. Another generator which could be provided is that of L'Ecuyer, which combines two 32-bit linear congruential generators into a generator which has a period almost as long as a 64-bit generator, and probably better randomness properties \cite{LEcuyer:88a}. On a machine with 64-bit integer arithmetic, these algorithms will be just as fast as the 32-bit generators, but the period and randomness properties are much better. On machines with 32-bit integer arithmetic, multiple-precision arithmetic will be required so the generator will be slower than a 32-bit generator. However for most simulations the proportion of the overall run-time spent calling the random number generator is very small, and good randomness properties are far more important than speed. If users are really worried about speed, then they will presumably be producing enough random numbers to exhaust the period of the 32-bit generator anyway. The syntax for the call to the linear congruential generator is standard and adequate. However for a generator using more than 32-bit integers, the initial seed would have to be specified by an array rather than a single number, as in DRAND48. \item The Normally Distributed Random Number Generator SNRAND For the reasons given above, this generator should not call SURAND, but rather a longer period linear congruential generator. The method used to create normally distributed random numbers, and the syntax for the call to the routine, are standard and adequate. \item The Very Long Period Shift Register Generator I deduce from the quoted period that the major lag of this generator is 1279, however the lags ($p,q$) should be stated explicitly in the documentation. It is pleasing to see such a large lag used, however even so, this generator shows demonstrably nonrandom behavior in some Monte Carlo applications \cite{Coddington:94a}, \cite{Ferrenberg:92a}, \cite{Grassberger:93a}. Also, shift register generators of this kind are known to be inferior to lagged Fibonacci generators with the same lag (and thus the same memory requirement) which use addition, subtraction, or multiplication instead of exclusive OR (XOR) \cite{Coddington:94a}, \cite{Marsaglia:85a}. I, therefore, see no reason to use them, except perhaps for their simpler application to vector and parallel processing (see the following section for more details). I recommend that the XOR operation be replaced by either subtraction modulo 1 on floating point numbers in [0,1), or even better, multiplication on odd integers (modulo $2^{32}$, or some large integer) \cite{Coddington:94a}, \cite{Marsaglia:85a}. This will dramatically improve the quality of the generator for little or no cost in speed. If multiplication is used, a lag of 1279 is adequate. If subtraction is used, a larger lag is necessary to ensure adequate randomness (I would suggest $p=9689, q=4187$). The calling syntax seems OK, although I am not sure why the ``work area'' {\it aux\/} has to be an ``array of (at least) length 10000''. The only values which need to be stored are those in the lag table (the array of previous values which are needed to compute the next value), which has length $p$. \end{enumerate} \vspace{10pt} \noindent{\bf Parallel Algorithms} \vspace{10pt} \noindent Quite a lot of work has been done on vectorizing random number generators, and a number of parallel random number generators have also been proposed. However there does not appear to be any consensus on which is the best method, mainly because very little work has been done on testing and comparing parallel random number generators. Implementing a good parallel random number generator is quite difficult. Clearly it needs to be based on a good sequential generator, however there is the additional problem of possible correlations between random number sequences on different processors. Another difficulty is that it may be beneficial (for example, for debugging purposes) to produce the same sequence of random numbers for different numbers of processors, and in particular, the same sequence as on a sequential machine. This can be done, although generally the types of algorithms which allow this functionality are not the best possible methods. Linear congruential and shift register generators can be parallelized or vectorized in this way, because it is known in these cases how to jump ahead in the sequence. For a sequence $X_{i}$ of random numbers, processor $P$ of an $N$ processor machine can therefore generate the sub-sequence $X_{P}$, $X_{P+N}$, $X_{P+2N}$, $\ldots$ This is known as the ``leapfrog'' technique (see Refs.~\cite{Evans:89a}, Chapter 12 of \cite{Fox:88a} for the linear congruential case, Ref.~\cite{Aluru:92a} for the shift register case). One would expect that a large period (i.e., at least 48-bit) parallel linear congruential generator using the leapfrog method should be fine for most purposes. However there is a problem---linear congruential generators are known to have correlations between elements in the sequence which are a power of 2 apart \cite{Filk:85a}, \cite{Percus:88a}. Since for most parallel machines the number of processors is a power of 2, this means that the numbers generated on a given processor may be more strongly correlated than the sequence on a single processor. In fact this type of leapfrog linear congruential algorithm, when used on a vector machine, has given spurious results in some Monte Carlo calculations \cite{Kalle:84a}, \cite{Filk:85a}. It is possible that this problem could be avoided by leapfrogging by $N+1$ rather than $N$, thus avoiding power of 2 correlations, but this would mean that the parallel algorithm no longer produces the same sequence of numbers as the sequential algorithm. The leapfrog shift register method could work well, but only if the lag is extremely large ($p=9689, q=4187$ is probably adequate). Shift register algorithms probably also suffer from long range correlations, however this is not known theoretically, and good statistical tests are lacking for this generator. Another way of parallelizing linear congruential generators is to split the sequence into non-overlapping parts, each of which is generated by a different processor. This also utilizes the fact that this type of generator allows you to easily calculate an arbitrary element of the sequence. One can therefore simply divide the period of the generator by the number of processors, and jump ahead in the sequence by this amount for each processor \cite{Evans:89a}, \cite{LEcuyer:91a}. This requires that the period of the generator be very large. A 32-bit generator is completely inadequate, and even a 48-bit generator may not be good enough on a fast massively parallel machine. A good 64-bit generator should work well, but perhaps better would be to use the generator of L'Ecuyer \cite{LEcuyer:88a}. A program for a parallel generator of this type using sequence splitting is given in Ref.~\cite{LEcuyer:91a}. The only disadvantages of this type of generator are that it does not produce the same sequence on different numbers of processors, and it is relatively slow on a processor which does not have 64-bit integer arithmetic. The simplest method of using lagged Fibonacci generators in parallel is to just run the same sequential generator on each processor, but with different initial lag tables (or ``seed tables'') \cite{Chiu:88b}, \cite{Peterson:88b}. The lag tables on each processor should be initialized using a {\it different\/} parallel generator, for example a leapfrog or sequence splitting linear congruential generator. This should guarantee (especially for a generator with a long lag) statistical independence between processors. This is probably the best parallel random number generator known. Its only known deficiency is that is does not produce the same sequence for different numbers of processors. It is used in the Connection Machine Scientific Software Library (CMSSL) \cite{TMC:92b}, where the interface to the routine allows the user to specify the lags, so in principle the routine can be extremely good, however in the documentation they suggest using lags which are much too small. There are two methods of parallelizing lagged Fibonacci generators which will produce the same sequence for different numbers of processors, which is useful for debugging. However one uses too much memory, and the other has too much communication. The first method, which is also provided in the CMSSL, is to assign a different lag table to every {\it virtual processor} (i.e. distributed array element), rather than every processor \cite{TMC:92b}. It is thus independent of the number of processors, however for a large lag table the memory requirement may be exorbitant. If multiplication is used in the lagged Fibonacci generator, a small lag will still give good results. This generator could therefore be useful for debugging small scale problems. The second method is straightforward vectorization of the basic lagged Fibonacci algorithm. As long as the vector length is smaller than both $p$ and $q$, a vector of new results can be obtained independently \cite{Anderson:90d}, \cite{Peterson:94a}. This parallelism can also be utilized if the elements of the vector are distributed over processors. The problem with using this method on a distributed memory machine rather than a vector or shared memory machine, is that the results are then not distributed properly to be used for the next update, and substantial communication is required. This should be contrasted with the other algorithms described above, for which {\it no\/} communication is required. Other parallel random number generators have been proposed and implemented, for example, the cellular automata generator provided by Thinking Machines \cite{TMC:92a},\hfill\break \cite{Wolfram:86c}, however they do not have a sound enough theoretical background, or enough testing done on them, to be trusted. In summary, I would recommend providing three parallel generators: \begin{enumerate} \item A linear congruential generator or combined (L'Ecuyer) generator, preferably one which is used in the sequential library, using splitting to break up the sequence of random numbers among processors. \item A parallel lagged Fibonacci generator, the same as provided in the sequential library, but with a different initial lag table on each processor, initialized using the parallel linear congruential generator. \item A parallel generator which produces the same sequence of results for different numbers of processors (and on a sequential machine), for debugging purposes. A leapfrog algorithm (either linear congruential or shift register) is the simplest to implement, and should work fine for debugging purposes, however it should be clearly stated that this algorithm should only be used for debugging, and not for data collection, since their randomness properties are suspect. \end{enumerate} \bibliography{/home/sashimi/kea/bib/doc,/home/sashimi/kea/bib/outside,/home/sashimi/kea/bib/SCCS} \bibliographystyle{gcf_unsrt} \end{document}