\documentstyle[11pt]{book}
\input psfig
%%
\font\twelvebf=cmbx12
\font\fifteenbf=cmbx10 at 15 pt
%%
\hyphenation{semi-con-duc-tor}
\hyphenation{tech-nol-ogy}
\hyphenation{Peta-flops}
\hyphenation{Tera-flops}
%%
\raggedbottom
\tolerance=10000
\newcommand{\medStrut}{{\rule{0cm}{4.0ex}}}
%
\newdimen\digitwidth
\setbox1=\hbox{0}
\digitwidth=\wd1
\catcode`"=\active
\def"{\kern\digitwidth}
%
%\pagestyle{empty}
%
\begin{document}
\setcounter{chapter}{3}
\chapter{Applications Working Group}

\begin{center}
\begin{tabular}{ll}
{\bf Chair:}\\
Geoffrey C. Fox&Syracuse University \\
&gcf@npac.syr.edu \\
{\bf Associate Chair:}\\
Rick Stevens&Argonne National Laboratory \\
&stevens@mcs.anl.gov \\
\\
{\bf Working Group:} \\
Tony Chan&University of California at Los Angeles \\
Dwight Duston&BMDO \\
Walter Ermler&U. S. Department of Energy \\
Jim Fischer&NASA Goddard \\
Bruce Fryxell&NASA Goddard \\
Ed Giorgio&U. S. Department of Defense \\
Jim Glimm&SUNY, Stony Brook \\
Jacob Maizel&National Institutes of Health \\
Rob Schrieber&NASA RIACS \\
Paul Stolorz&Jet Propulsion Laboratory \\
Francis Sullivan&Supercomputing Research Center \\
Richard Zippel&Cornell University \\
\end{tabular}
\end{center}

\section{Introduction}

The Applications Working Group played two major roles in the workshop. First, the needs of important applications are the motivation for designing and building Petaflops machines. This is discussed in general terms in Section~\ref{motive}. Second, the characteristics of potential Petaflops scale applications can be used to guide the other three workshop activities: devices, architectures, and software for Petaflops computers. Our general findings in this area can be found in Section~\ref{issues}. In Section~\ref{apps}, we approach the issues of Sections~\ref{motive} and \ref{issues} from the point of view of particular applications. Section~\ref{chan} describes algorithmic issues.
\section[Applications Motivation]{Applications Motivation of a Petaflops \hfill\break
{\hspace{2em} Computer Program}}
\label{motive}

Table~4.1 lists a wide set of applications that are potential uses of Petaflops machines. We divide these into three major areas:
\begin{enumerate}
\item Large Scale Simulations (grand challenges) extrapolated from Teraflops machines. Two sub-classes can be separated.
\begin{itemize}
\item Problem size naturally increases (an example is turbulence, where more grid points are needed to increase spatial resolution).
\item Problem size is unchanged but there is a need to increase simulated time (an example is protein dynamics with 10,000 atoms, where one needs to reach milliseconds of simulated time with a basic time step of $\sim 10^{-14}$ second).
\end{itemize}
\item Data Intensive Applications that rely on Petabytes to exabytes of primary and secondary storage.
\item Novel Applications.
\end{enumerate}
There is no doubt that these can be used to build a strong case for Petaflops machines. As we discuss in the examples of Section~\ref{apps}, many applications require Petaflops, or in some cases higher, performance for realistic results. The need for this performance level follows directly from the physical structure of the problem in some cases, and from the size of the base dataset in others. The following observations qualify and expand these remarks.
\begin{enumerate}
\item Our working group did not have the broad expertise to establish the Petaflops motivation in full detail.
\item We can give examples, as described in Section~\ref{apps}. However, we recommend that our work be refined by appropriate ``domain expert groups.''
\item We can note generally that computation is, and will increasingly be, important in the economy, society, education, and academia, and that the U.S. needs to be in the lead in the continuing future, just as it is now with HPCC.
\item One cannot predict the critical applications 10--20 years from now.
New national problems will arise, and surely HPCC will be critical in many of them. Most of our exemplars will be important, if not the most important, Petaflops scale problems.
\item Many Petaflops scale applications will involve integration of disparate activities and will require changes in the current modus operandi. For instance, the NII (and applications such as interactive television) will impact society in a nontrivial way. Agile manufacturing requires database (CAD), simulation, design, analysis, manufacturing, and marketing to be integrated. Petaflops computing enables this, but the multidisciplinary character has implications for hardware, software, and, hardest of all, the structure of manufacturing companies.
\item We recommend a program to investigate new algorithms needed by Petaflops scale applications and the special architectural features of Petaflops machines. This is expanded in Section~\ref{chan}.
\item Historically, new algorithms, new difficulties, and indeed new applications have been identified as one increases the power of computers, even by a ``mere'' factor of 10. This is likely to occur in ``all'' application areas. Today's typical achieved maximum performance is a Gigaflops. Thus, we expect a set of revolutions or ``just'' minirevolutions as we extrapolate a factor of $10^6$ to Petaflops performance.
\end{enumerate} \newpage \begin{center} Table 4.1:\ Petaflops Application Areas \\ \vspace{10pt} \begin{tabular}{llp{4.25in}} {\bf A.\ }&\multicolumn{2}{l}{\bf Biology, Biochemistry, and Biomedicine} \\ &A1&Design better drugs \\ &A2&Understand the structure of biological molecules (protein folding) \\ &A3&Genome informatics and phylogeny \\ &A4&Process data from medical instruments \\ &A5&Simulate functions of human body \\ &&-- Blood flow through heart \\ &A6& Neural networks in cortex \\ &A7&Real time three-dimensional biosensor data fusion (the virtual human) \\ &A8&Analysis of integrated medical database to improve quality and cost of health care \\ \end{tabular} \end{center} \begin{center} \begin{tabular}{llp{4.25in}} {\bf B.\ }&\multicolumn{2}{l}{\bf Chemistry, Chemical Engineering} \\ &B1&Design and understand nature of catalysts and enzymes \\ &B2&Simulate new chemical plants and distribution (pipeline) systems \end{tabular} \end{center} \begin{center} \begin{tabular}{llp{4.25in}} {\bf C.\ }&\multicolumn{2}{l}{\bf Physics} \\ &C1&Understand the nature of new materials \\ &C2&Simulate semiconductors used in chips \\ &C3&Design fusion energy system (Numerical Tokamak) \\ &C4&Simulate nuclear explosions \\ &C5&Matter transporter (three-dimensional fax and edit) \\ &C6&Understand properties of fundamental particles (QCD) \end{tabular} \end{center} \begin{center} \begin{tabular}{llp{4.25in}} {\bf D.\ }&\multicolumn{2}{l}{\bf Space Science and Astronomy} \\ &D1&Evolve the structure of the early universe into the epoch of the current observable world (cosmology) \\ &D2&Understand how galaxies are formed \\ &D3&Understand large scale astrophysical systems (stars, gas clouds, globular clusters) \\ &D4&Understand dynamics of Sun \\ &D5&Understand collision of black holes and emission of gravitational waves \\ &D6&Analyze new optical and radio astronomy data to combine data from many telescopes and minimize impact of Earth's atmosphere \end{tabular} \end{center} 
\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf E.\ }&\multicolumn{2}{l}{\bf Artificial Intelligence} \\
&E1&New neural network learning and optimization algorithms \\
&E2&High-level searches of full text databases \\
&E3&Decipher new military coding methods (cryptography) \\
&E4&Deep search of game trees for social models and games such as computer chess
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf F.\ }&\multicolumn{2}{l}{\bf Study of Climate and Weather} \\
&F1&Forecast weather and predict global climate \\
&F2&Forecast severe storms (tornadoes, hurricanes) \\
&F3&Study coupling of atmosphere, ocean, Earth use with economic and political decisions \\
&F4&Integrate models and weather data for optimal interpolation (data assimilation)
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf G.\ }&\multicolumn{2}{l}{\bf Environmental Studies} \\
&G1&Model flow of pollutants and ground water in the Earth (flow in porous media) \\
&G2&Model air and water quality and relation to policy \\
&G3&Model ecological systems \\
&G4&Analyze data from planet Earth to understand nature and use of land (Earth Observing System)
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf H.\ }&\multicolumn{2}{l}{\bf Geophysics and Petroleum Engineering} \\
&H1&Analyze three-dimensional seismic data to obtain better well placement \\
&H2&Model oil reservoir to optimize effectiveness of secondary and tertiary oil extractions \\
&H3&Analyze models of and data from earthquakes to improve predictions of how and when quakes will occur
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf I.\ }&\multicolumn{2}{l}{\bf Aerospace, Mechanical, and Manufacturing Engineering} \\
&I1&Build more energy efficient cars, airplanes, and other complex artifacts using computational fluid dynamics, structural analysis, and multidisciplinary optimization (for a graph of expected performance and memory needs in the aircraft industry, see Figure~\ref{fig1.peta}) \\
&I2&Design new propulsion systems \\
&I3&Simulate new combustion materials \\
&I4&Simulate radar signature of new vehicles (stealth aircraft) \\
&I5&Simulate chips used in new computers \\
&I6&Simulate electromagnetic properties of high-frequency circuits \\
&I7&Simulate manufacturing processes \\
&I8&Optimal scheduling of manufacturing systems
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf J.\ }&\multicolumn{2}{l}{\bf Military Applications} \\
&J1&Simulate new military sensor and communication systems \\
&J2&Control military operations with data fusion and spatial reasoning \\
&J3&Integrate human, online military systems, and computer simulations in exercises (SIMNET)
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf K.\ }&\multicolumn{2}{l}{\bf Business Operations} \\
&K1&Simulations and complex database analysis for advanced decision support in business and politics \\
&K2&Support integrated agile manufacturing system \\
&K3&Dynamic scheduling of air and surface traffic when disrupted by weather and crises \\
&K4&Control and image analysis for advanced robots \\
&K5&Economic modeling on Wall Street \\
&K6&Graphics for digital movies \\
&K7&Linkage analysis to find correlated records in a database indicating anomalies and fraud in health care, securities, credit card operations, and similar areas \\
&K8&Analysis of customer data to optimize marketing (market segmentation)
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{llp{4.25in}}
{\bf L.\ }&\multicolumn{2}{l}{\bf Society} \\
&L1&Large-scale simulations and database searches for education \\
&L2&Electronic shopping and other interactive television \\
&L3&Support world-wide digital library and information systems (text, images, video) \\
&L4&Integrate and update intelligent agents on the NII (Knowbot garage) \\
&L5&Image analysis for large databases to find just the right picture (missing persons,
cover of magazine) \\
&L6&Support of up to a million simultaneous players in a large virtual environment linked to advanced home video games
\end{tabular}
\end{center}

\begin{figure}
\centerline{\psfig{figure=slide28.epsf,clip=true,width=4.5in}}
\caption{Aeronautics Modeling and Simulation}
\label{fig1.peta}
\end{figure}

\section[Issues/Characteristics for Architecture]{Issues/Characteristics for Architecture (Hardware) and Software of Petaflops Machines}
\label{issues}

There were particularly fruitful interactions between the architecture and application groups. We can make the following general remarks on characteristic features of Petaflops scale applications.
\begin{enumerate}
\item We need to establish a common ``language'' (set of terms) to discuss memory hierarchy/parallelism and communication in a hardware/software/algorithmic implementation neutral fashion. We recommend a near-term activity to refine the initial steps begun here to define applications for the architecture and software communities. This implementation neutral description of applications and architectures should also help discussions between the software and architecture communities. The need for this agreed terminology was highlighted by our discussions with the architecture group, where the latter noted that application scientists described the computational structure of their problems inappropriately, using, for instance, the language of MIMD distributed memory machines when the issues were more general and reflected memory hierarchy. As the target Petaflops machine could have a mix of these architectures (a distributed set of hierarchical memory nodes), translating application scientist specifications into Petaflops designs led to vigorous but confused debate.
\item Our discussions with the architecture group isolated several classes of machines: three based on memory hierarchy/processor trade-offs and others based on memory size and I/O requirements.
\item We made a list of architectural features, which are shown in Tables~\ref{tab2.peta}, \ref{tab3.peta}, and \ref{tab4.peta}. These were used as a guide in preparing the exemplar application discussions in Section~\ref{apps}.
\addtocounter{table}{1}
\begin{table}
\begin{center}
\caption{Some General Application Characteristics}
\vspace{10pt}
\label{tab2.peta}
\begin{tabular}{|lp{4.25in}|}
\hline
& \\
1. & Is Petaflops performance needed by this application, and if so, why? \\
2. & Are new algorithms needed? \\
3. & What are the size characteristics of the problem? How does size scale as we increase the performance of the computer? \\
4. & Does this problem have special precision needs? \\
5. & What is the nature of the computation? Does it involve flops or some other sort of OPS? \\
6. & What are the I/O requirements of the problem in terms of bandwidth and (secondary) storage size? \\
& \\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\begin{center}
\caption{Some Architectural Characteristics of Applications}
\vspace{10pt}
\label{tab3.peta}
\begin{tabular}{|lp{4.25in}|}
\hline
& \\
1. & What memory is required for Petaflops performance? \\
2. & What are the secondary and tertiary storage needs? \\
3. & Can this application use a metacomputer (networked computers)? \\
4. & Can this application use unconventional architectures (e.g., neural networks, content addressable memory, associative processor)? \\
5. & What is the expected realized versus peak performance for this application? \\
6. & Can this application use a SIMD architecture or a MIMD collection of SIMD ``nodes''? \\
7. & Is this application sensitive to latency? \\
8. & What degree of local parallelism is present in this application? This is in addition to ``overall'' data parallelism and can be exploited in shared memory multiprocessor nodes. \\
9. & How many nodes does ``overall'' data parallelism support?
\\
& Note: Burton~Smith points out that the product $P$ of performance (Petaflops) and best possible memory access time (nanosecond) is a lower bound on the overall concurrency needed in the application. $P$ is at least $10^6 ( = 10^{15} \times 10^{-9} )$. Characteristics 8 and 9 break up this minimal $P$ for a given problem into the amount that can be supported by coarse-grain data parallelism (characteristic 9) and the amount (characteristic 8) that can be exploited in a finer grain (e.g., shared memory) architecture. \\
10. & Discussions with the architecture group developed three strawman architectures. \\
& $\bullet$ Architecture I:\phantom{II}\ Around 200 Teraflops nodes---shared memory architecture \\
& $\bullet$ Architecture II:\phantom{I}\ Around 10,000 0.1 Teraflops nodes---switch or similar general interconnect \\
& $\bullet$ Architecture III:\ Around $10^6$ 1~Gigaflops nodes---mesh interconnect architecture \\
& Can applications use these architectures? The answer to this question is related to those of the previous characteristics. \\
& \\
\hline
\end{tabular}
\end{center}
\end{table}
\item We recommend that, once a better framework is agreed upon (see item 1), ``domain experts'' be asked to refine our study (as begun in Section~\ref{apps}) in a broader range of potential Petaflops scale applications. Potential application domains are already given in Table~4.1.
\item We identified a general rule for time stepped algorithms (Section~\ref{porous}).
\begin{displaymath}
\mbox{Memory} = \left\lbrack \frac{\mbox{\# flops}}{1\ \mbox{Gigaflops}} \right\rbrack^n \ \mbox{Gigabyte}^*
\end{displaymath}
$^*$Insert here the memory needed at Gigaflops performance. $n=3/4$ for fixed total simulated time, but $n<3/4$ if one needs (as one often does) to increase the total simulated time. This rule predicts that a memory size of 30 Terabytes is appropriate for a Petaflops machine if 1~Gigabyte is appropriate for a Gigaflops machine, since $(10^6)^{3/4} \approx 3 \times 10^4$.
This estimate is consistent with NASA's aerospace predictions in Figure~\ref{fig1.peta}.
\item The current ``rule'' that memory (bytes) $=$ performance (flops) is modified because, up to now, problem solving has been constrained by machine size, and so one scales the problem ``blindly.'' On the other hand, Petaflops machines will solve real problems constrained by the ``physics'' of the situation.
\item Some interesting characteristics of a Petaflops machine are
\begin{itemize}
\item Petaflops machines will do in five minutes what it takes Gigaflops machines 10 years to do
\item $10^{13}\ \mbox{bytes} = 8\ \mbox{(bytes/word)} \times 10\ \mbox{components/grid points} \times 5000^3$ (grid size)
\item $10^{15}\ \mbox{bytes} = 2300$ years video \\
$\phantom{10^{15}\ \mbox{bytes}} = 10^9$ books \\
$\phantom{10^{15}\ \mbox{bytes}} = 3\times 10^8$ Megapixel images \\
$\phantom{10^{15}\ \mbox{bytes}} = 3\times 10^{10}$ compressed images
\end{itemize}
\item Some Petaflops applications are somewhat less demanding on hardware characteristics than today's problems (e.g., applications with large compute needs, which implies a lower ratio of internode communication bandwidth to node compute power).
\item It is interesting to consider real-time applications, where Petaflops performance is required to keep up with the machines and people in the loop (e.g., defense simulation and control).
\item Many real-world simulations need Petaflops because the problems have multiple length scales.
\item All members of the applications working group felt that Petaflops central supercomputers should be accompanied by naturally scaled Teraflops-level workstations distributed among the users.
\end{enumerate}
\begin{table}
\begin{center}
\caption{Software Technology Issues for Each Application}
\vspace{10pt}
\label{tab4.peta}
\begin{tabular}{|lp{4.25in}|}
\hline
& \\
1. & What operating system support does the application need? \\
2. & What compiler and tool support does the application need? \\
3.
& Does the application naturally fit particular programming \\
& paradigms? \\
4. & Are there special user interface issues? \\
& \\
\hline
\end{tabular}
\end{center}
\end{table}
%
\vfill
\section{Exemplar Applications}
\label{apps}

\subsection{Porous Media}
\label{porous}

From today's machines to the Petaflops computer, there is a factor of $10^4$ in speed. How will this produce value in problems of major importance to society? Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
\begin{itemize}
\item under-resolution of solution details, averaging of local variations, and under-representation of physical details
\item rapid solutions to allow efficient exploration of system parameters
\item robust and automated solution, to allow integration of results in high-level decision, design, and control functions
\item inverse problems (history match) to reconstruct missing data require multiple solutions of the direct problem.
\end{itemize}
For PDE-based problems, the computational effort scales as the inverse fourth power of the grid spacing, $h^{-4}$, and often, especially for implicit problems, higher inverse powers such as $h^{-7}$ can occur. For field scale oil reservoir simulation, grids on the order of 100 meter spacing might be common. Geological variation occurs on all length scales, down to the pore size of the rock, about a micron. Fortunately, not all of this variation needs to be simulated. The interwell separation is perhaps 400 meters, and flow between wells is the important variable to be predicted. Variation on the range of 10 to 20 meters is not well represented by averaging methods and is better computed, so there is utility in refining grids by a factor of 5 to 10.
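
As a rough illustration of this scaling, the following sketch assumes a three-dimensional grid of spacing $h$ and a time step that shrinks in proportion to $h$ (as is typical of explicit schemes). The refinement factor $r$ is notation introduced here for illustration and is not taken from the workshop discussion. The number of grid points then grows as $h^{-3}$ and the number of time steps as $h^{-1}$, so that
\begin{displaymath}
\mbox{work} \sim h^{-3} \times h^{-1} = h^{-4}, \qquad
\frac{\mbox{work}(h/r)}{\mbox{work}(h)} = r^4 .
\end{displaymath}
Under these assumptions, refining by $r=5$ costs a factor of $625$ in work and refining by $r=10$ costs a factor of $10^4$, which matches the speed increase quoted above for the Petaflops computer. Memory grows only with the number of grid points, $h^{-3}$, that is, as $\mbox{work}^{3/4}$.
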
On the basis of these considerations, we propose the following simulation, for which the Petaflops machine would be necessary:
\begin{enumerate}
\item $10^3 \times 10^3 \times 10^2 = 10^8$ grid elements
\item 30 species
\item $10^4$ time steps
\item $3 \times 10^9$ words of memory per case
\item 300 cases considered (geostatistical parameters; economic or operating parameters; history matching iterative solutions of the direct problem)
\item $10^{12}$ words of memory (all cases considered in parallel)
\item $3 \times 10^{14}$ grid $\times$ time cells total
\item $10^{19}$ flops
\item $\rm 10^4\,sec \approx 3~hrs.$ Petaflops computational time
\end{enumerate}
At these length scales, geological data is not known, except in a statistical sense, and so statistical ensemble averages will provide average performance as well as a measure of the variability associated with these averages and the possibility or probability of outlier solutions, such as early breakthroughs.

Similar issues apply to ground water remediation sites. Here the sites and well spacings are typically smaller, but the same arguments about the scaling of grid to well spacing apply. Commonly, narrow conduction bands or isolated time events, such as runoff during storms, dominate the total migration of contaminants, so that accurate resolution in space and time is needed for reliable predictive capability. Complicated chemistry, including binding of contaminants to absorption sites or the trapping of contaminant bearing water in semi-isolated micropores, gives rise to the disturbing phenomenon of sites which appear to be remediated by a pump and treat method, only to have the contaminant re-emerge when treatment is terminated. For this reason, physical processes and system variables often need an increased accuracy of description, as well as finer grid resolution.

What are the architectural issues which result from this problem?
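
Before turning to those issues, the figures in the proposed simulation can be cross-checked with simple arithmetic. This is only a consistency sketch built from the numbers listed there. The operation count per cell in the last line is inferred from the listed totals and is not a figure stated by the working group.
\begin{displaymath}
\begin{array}{rcl}
10^8\ \mbox{grid elements} \times 30\ \mbox{species} & = & 3 \times 10^9\ \mbox{words per case} \\
3 \times 10^9\ \mbox{words} \times 300\ \mbox{cases} & = & 9 \times 10^{11} \approx 10^{12}\ \mbox{words} \\
10^8\ \mbox{grid elements} \times 10^4\ \mbox{time steps} \times 300\ \mbox{cases} & = & 3 \times 10^{14}\ \mbox{grid} \times \mbox{time cells} \\
10^{19}\ \mbox{flops} \, / \, 10^{15}\ \mbox{flops per sec} & = & 10^4\ \mbox{sec} \approx 2.8\ \mbox{hrs.} \\
10^{19}\ \mbox{flops} \, / \, (3 \times 10^{14}\ \mbox{cells}) & \approx & 3 \times 10^4\ \mbox{flops per cell per time step}
\end{array}
\end{displaymath}
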
For memory, we see that memory size is determined almost entirely by the application, and is nearly independent of system architecture. The Petaflops machine is mainly justified to solve large problems, rather than to solve problems of a fixed size more rapidly. For the easiest problems, we have
\begin{displaymath}
\mbox{memory} \sim \mbox{speed}^{3/4}
\end{displaymath}
but for many cases, and especially the more computationally difficult ones, the exponent will be smaller, because:
\begin{itemize}
\item some of the extra computational power will be devoted to solution for longer total time, or exploration of more parameter values
\item for implicit problems, or nonlocal force laws, the computational work grows more rapidly than $h^{-4}$.
\end{itemize}
These scaling laws should be developed with known proportionality coefficients coming from today's machines, which appear to be well balanced for a broad mix of problems.

There is a similar scaling law for communication latencies in memory hierarchies. Communication of $n$ bytes takes $an + b$ units of time, where $a$ and $b$ are measured dimensionlessly in units of floating point operations. Here $b$ is the latency and $a$ is the inverse bandwidth (time per byte). The number of floating point operations which can be usefully performed between communication steps is proportional to the local memory size. Consider a two-level hierarchy, with $M$ bytes stored at each of $m$ locations. After $O(mM)$ floating point operations, there will be a need to communicate $O(m)$ domain decomposition boundary information messages of size $O(M^{2/3})$ in the most favorable case, and $O(m^2)$ messages of size $O(M)$ in the worst case. For the more common favorable case, the communication cost is
\begin{displaymath}
m(aM^{2/3} + b)
\end{displaymath}
and the computational cost is
\begin{displaymath}
mM
\end{displaymath}
so we need
\begin{displaymath}
mb + ma M^{2/3} \ll mM
\end{displaymath}
or
\begin{displaymath} b<