
PetaFLOPS from Two Perspectives

Invited talk given at PetaFLOPS Workshop Pasadena, CA, January 1994

Seymour Cray
Cray Computer Corporation

I understand we could characterize our group today as a constructive lunatic fringe group. I would like to start off presenting what I think is today's reality, but then I'll move into the lunatic area a little later. I would like to give you three impressions today. The first one is my view of where we are today in terms of scientific computer technology. The second one is, what's the rate of progress that we, Cray Computer Corporation, are making incrementally? By incrementally I mean in a few years. And thirdly, I'd like to speculate on what I would do if I were going to take a really radical approach to a revolutionary step such as we are talking about in this workshop.

What I want to do is talk about the things I know myself, and I think we are representative of where other companies are as well. In order to have some real numbers to be specific, I'll talk about my own work for a few minutes. The CRAY 4 computer is a current effort and we should complete the machine this year. We should look at the number of GigaFLOPS (this is where I'd like to start) and at what they cost in today's prices. I would like to separate the memory issue for just a moment from the processors because they are somewhat different. If I do that, then the cost per GigaFLOPS in a CRAY 4 is $80,000. Now, I look at the incremental progress and project it four years, and I use four years because that's the kind of step we do in building machines: two years is too fast, but four years is about right. So if I use four years as my increment of time, and I ask what do we expect to do in that time, this gives us a rate of change. I see a factor of four every four years and I have every reason to believe that in the next four years we can continue at that rate. Whether we can continue at that rate forever, I don't know, but it is a rate that has some history and some credibility.

If I look forward four years, we are going to have a conventional vector machine with about $20,000 per GigaFLOPS, for the processor.

What does it cost per TeraFLOPS? We are talking $20,000,000. Now we have to add memory. One of the rules of thumb we have in vector processing is that for every GigaFLOPS in processor you need a gigaword per second of bandwidth to a common memory, and this makes the memory expensive. It's the bandwidth more than memory size that actually determines the cost. The memory cost varies somewhat from a minimum of about the cost of the processors to twice the cost of the processors. So if I pick a number in between, say 1 1/2 times the processor cost for a very big system, we would find we would have a TeraFLOPS conventional vector machine in four years for around $50,000,000. I think that's reality without any special effort apart from normal competition in the business.
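The arithmetic behind the $50,000,000 figure can be sketched as follows. This is only an illustration of the estimate in the talk; the function and constant names are made up for the example, and the 1.5 memory ratio is the midpoint the speaker picks.

```python
# Back-of-the-envelope cost of a TeraFLOPS vector machine, using figures from the talk.
COST_PER_GFLOPS_TODAY = 80_000    # dollars per GigaFLOPS for the CRAY 4 processor
IMPROVEMENT_PER_STEP = 4          # "a factor of four every four years"
MEMORY_TO_PROCESSOR_RATIO = 1.5   # memory cost as a multiple of processor cost
GFLOPS_PER_TFLOPS = 1_000


def teraflops_system_cost(steps=1):
    """Return (processor, memory, total) cost in dollars after `steps` four-year steps."""
    cost_per_gflops = COST_PER_GFLOPS_TODAY / IMPROVEMENT_PER_STEP ** steps
    processor_cost = cost_per_gflops * GFLOPS_PER_TFLOPS
    memory_cost = processor_cost * MEMORY_TO_PROCESSOR_RATIO
    return processor_cost, memory_cost, processor_cost + memory_cost


if __name__ == "__main__":
    processor, memory, total = teraflops_system_cost(steps=1)
    print(f"processors ${processor:,.0f}, memory ${memory:,.0f}, total ${total:,.0f}")
```

Setting the ratio to 1.0 or 2.0 gives the lower and upper ends of the memory cost range mentioned above.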

I'd like to look at the other end of the spectrum because I have been involved in that recently. By the other end of the spectrum, I mean a step from a cost of $80,000 or $20,000 per processor to the other extreme end, about $6 per processor. That, in fact, is another machine we are building at Cray Computer. If we look at SIMD bit-processing, that is the other end of the spectrum, so to speak. Of course, the purpose of building this is not to do GigaFLOPS, TeraFLOPS, or PetaFLOPS but to do image processing, but never mind that for a moment. I want to come up with a cost figure here. What we are building is a 2,000,000-processor SIMD machine and it will cost around $12,000,000 to build. We are planning to make a 32,000,000-processor system in four years, and that will have a peta-operation per second. My point is, if you program bit processors to do floating point, which may not be the most efficient thing in the world, you still come up with a machine that can do around a TeraFLOPS in four years. Whether you take a very large processor or a very small processor, either way we come up with about a TeraFLOPS and about $50,000,000 in four years. I suspect, although I don't really know, that if we try various kinds of processor speeds in between, we're going to be somewhere in the same ballpark. So my conclusion is that in four years we could have a TeraFLOPS and it ought to cost about $50,000,000 and the price ought to drop pretty fast thereafter.
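The SIMD figures can be checked the same way. The sketch below uses only numbers quoted in the talk, with one labeled assumption: that the factor-of-four improvement per four-year step also applies to the cost per bit processor. The talk does not say this explicitly; it is simply one way to arrive near the quoted $50,000,000.

```python
# Figures quoted in the talk for the SIMD bit-processor machine.
CURRENT_PROCESSORS = 2_000_000
CURRENT_COST_DOLLARS = 12_000_000
FUTURE_PROCESSORS = 32_000_000
FUTURE_BIT_OPS_PER_SECOND = 1e15   # "a peta-operation per second"
TARGET_FLOPS = 1e12                # "around a TeraFLOPS"
ASSUMED_COST_IMPROVEMENT = 4       # ASSUMPTION: same factor of four as the vector machine


def cost_per_processor():
    """Dollars per bit processor in the machine being built today."""
    return CURRENT_COST_DOLLARS / CURRENT_PROCESSORS


def ops_per_processor_per_second():
    """Bit operations per second each processor must sustain in the future machine."""
    return FUTURE_BIT_OPS_PER_SECOND / FUTURE_PROCESSORS


def bit_ops_per_flop():
    """Bit-operation budget per floating-point operation implied by the two rates."""
    return FUTURE_BIT_OPS_PER_SECOND / TARGET_FLOPS


def projected_future_cost():
    """Cost of the 32,000,000-processor system under the assumed price improvement."""
    return FUTURE_PROCESSORS * cost_per_processor() / ASSUMED_COST_IMPROVEMENT


if __name__ == "__main__":
    print(f"cost per processor today: ${cost_per_processor():.2f}")
    print(f"per-processor rate: {ops_per_processor_per_second():,.0f} bit ops/s")
    print(f"bit ops available per flop: {bit_ops_per_flop():,.0f}")
    print(f"projected system cost: ${projected_future_cost():,.0f}")
```

The ratio of a peta-operation to a TeraFLOPS leaves roughly a thousand bit operations per floating-point operation, which is the sense in which bit-serial floating point "may not be the most efficient thing in the world" yet still reaches the target.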

So, how do we get another factor of a thousand? Well, if we are able to maintain our current incremental rate, it will take 20 years. Now that might be too slow; I don't know what our goals are in this exercise. I suspect it might take 20 years anyway, but if we'd like to have both belt and suspenders, we could try a revolutionary approach, and so I have a favorite one that I would like to propose. It's probably different from everyone else's.
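The 20-year figure follows from compounding a factor of four every four years until the total reaches a thousand. A minimal sketch of that calculation, with an illustrative function name:

```python
import math


def years_for_factor(target_factor, factor_per_step=4.0, years_per_step=4.0):
    """Years needed to gain `target_factor` when each step multiplies by `factor_per_step`."""
    steps = math.log(target_factor) / math.log(factor_per_step)
    return steps * years_per_step


if __name__ == "__main__":
    # TeraFLOPS to PetaFLOPS is a factor of 1000.
    print(f"{years_for_factor(1000):.1f} years")
```

Five steps of four years give a factor of 1024, so the answer comes out just under 20 years.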

I think in order to get to a PetaFLOPS within a reasonable period of time, say 10 years, we have to somehow reduce the size of our components (see, I am really a device person) from the micron size to the nanometer size. I don't think we can really build a machine that fills room after room after room and costs an equivalent number of dollars. We have to make something roughly the size of present machines, but with a thousand times the components. And, if I understand my physics right, that means we need to be in the nanometer range instead of the micrometer range. Well, that's hard, but there are a lot of exciting things happening in the nanometer-size range right now. During the past year, I have read a number of articles that make my jaw drop. They aren't from our community. They are from the molecular biology community, and I can imagine two ways of riding the coattails of a much bigger revolution than we have. One way would be to attempt to make computing elements out of biological devices. Now, I'm not very comfortable with that because I am one and I feel threatened. I prefer the second course, which is to use biological devices to manufacture non-biological devices: to manufacture only the things that are more familiar to us and are more stable, in the sense that we understand them better. What evidence do we have that this is possible? Two areas really have impressed me, again, almost all during the past year. The rates of understanding in the nanometer world are just astounding. I don't know how many of you are following this area, but I have been attempting to read abstracts of papers, and some of them are just mind-boggling. Let me just digress for a moment with the understanding of the nanometer world as I perceive it with my superficial knowledge from reading abstracts.

First, I once thought of a cell as sort of a chemical engineer's creation. It was a bag filled with fluid, mostly water, with proteins floating around inside doing goodness knows what. Well, my perception in the past year has certainly changed because I understand now they're not full of water at all. And if we look inside, as we are beginning to do with tools that are equally mind-boggling, we see that we have a whole lot of protein factories scattered around, hundreds and thousands of them in a single cell, with a smaller number of power plants scattered around, and a transportation system that interconnects all of these things with railroad tracks. Now, in case any of you think I'm on drugs, I brought some documentation. You can read these government-sponsored reports, which you have to believe are real, because it's our tax dollars that pay for this. But I'm coming to the part that's most interesting to me. Using laser tweezers, which has been the big breakthrough in seeing what's going on in the nanometer world, human researchers have been able to take a section of the railroad track of the cell, put it on a glass slide, and lo and behold there's a train running on it with a locomotive and four cars. We can measure the speed and we did. The track is not smooth. It has indents in it every 8 nanometers. It's a cog railroad. When we measure the locomotive speed, we see it isn't smooth. The locomotive moves in little 8 nanometer jerks. When it does, it burns one unit of power from the power plant, which is an ATP molecule. So it burns one molecule and it moves one step. Well, how fast does it do this? It does it every few milliseconds. In other words, the locomotive moves many times its own length in a second. This is a fast locomotive. I am obviously impressed with the mechanical nature of what we are learning about in the large molecule world.

What evidence is there that we could get anything to make a non-biological device? Or, to come right to the point, how do we train bacteria to make transistors? Well, I don't know how to do that right now, but last spring there was a very interesting experiment in cell replicating of copper wire. It's a nano-tube built with a whole row of copper atoms. The purpose of the experiment was not to make a computer; it was to penetrate the wall of the cell and measure the potentials inside without upsetting the cell's activity. These people are in a different area of concern here. But if indeed we can make copper wire that grows itself (and this copper wire was three nanometers in diameter, insulated), and if we can do that today, isn't it conceivable that we can create bacteria that make something more complicated tomorrow?

So, what course of action might we take to explore nanometer devices that are self-replicating? It seems to me we have to have some cross-fertilization among government agencies here. There are people doing very worthwhile research in the sense of finding the causes and cures for diseases, and more power to them, keep going. But maybe we can fund some research more directed toward making non-biological devices using the same nanometer mechanisms. So, that's my radical proposal for how we might proceed. I don't really know what kind of cross-fertilization we can get in this area, or whether any of you think this is a worthwhile idea, but it's going to be interesting for me to hear your proposals on how we get a factor of a thousand in a short period; this is just one idea. I thank you and am ready to hear your ideas.

Invited talk given at PetaFLOPS Workshop Pasadena, CA, January 1994

SUPERFAST COMPUTATION USING SUPERCONDUCTOR CIRCUITS
Konstantin K. Likharev
State University of New York, Stony Brook, NY
I am grateful to the organizers of this very interesting meeting for inviting me here to speak at this plenary session. I am happy to do that, mostly because I honestly believe that what is happening right now in digital superconductor electronics is a real revolution which deserves the attention of a wide audience. Before I start I should mention my major collaborators at SUNY (M. Bhushan, P. Bunyk, J. Lin, J. Lukens, A. Oliva, S. Polonsky, D. Schneider, P. Shevchenko, V. Semenov, and D. Zinoviev), as well as the organizations with which we are collaborating (HYPRES, IBM, NIST, Tektronix, and Westinghouse), and also the support of our work at SUNY by DoD's University Research Initiative, IBM, and Tektronix. I should also draw your attention to a couple of available reviews of this field [Likharev:93a,94a].

Arnold Silver, who in fact was one of the founding fathers of the field, already gave you some of its flavor, but I believe I should nevertheless repeat some key points. As Table 2.1 shows, superconductor integrated circuits offer several unparalleled advantages over semiconductor transistor circuits. (There are serious problems, too, but I will discuss them later). The advantages, surprisingly enough, start not with active elements. As Steven Pei mentioned earlier today, the real speed of semiconductor VLSI circuits has almost nothing to do with the speed of the transistors employed. It is limited mostly by charging of capacitances of interconnects through output resistances of the transistors. Superconductors have the unique capability to transfer signals (including picosecond waveforms) not in a diffusive way like the RC-type charging, but ballistically with a speed approaching that of light. (When you listen to a talk on opto-electronics like the one earlier today, always remember that it is not necessary to have light if what you need is just the speed of light.) In order to achieve the ballistic transfer in superconductors, it is sufficient to use a simple passive microstrip transmission line, with the thin film strip a few tenths of a micron over a superconducting ground plane. Because of this small distance, the electromagnetic field is well localized within the gap, so that the crosstalk between neighboring transmission lines, parallel or crossing, is very small.

In order to generate picosecond waveforms, we need appropriate generators, and for that Josephson junctions (weak contacts between superconductors) are very convenient. One other good thing about the Josephson junctions is that their impedance can be matched with that of the microstrip lines. This means that the picosecond signal can be in fact injected into the transmission line for ballistic propagation. Finally, superconductor circuits work with very small signals, typically of the order of one millivolt. Therefore, even with the impedance matching, the power dissipation remains low (I will show you some figures later on). Because of this small power dissipation, you can pack devices very close to each other on a chip, and locate chips very close together. This factor again reduces the propagation delays, and increases speed.

Finally, one more advantage: superconductor fabrication technology is extremely simple (if we are speaking about low-$T_c$ superconductors). It is considerably simpler than silicon CMOS technology and much simpler than the gallium arsenide technologies. At SUNY, we are fabricating superconductor integrated circuits. With our facilities we would certainly not be able to run, say, a CMOS process. From the "semiconductor people's" point of view, all we are doing is simply several levels of metallization on the intact silicon substrate. Typically, there are three to four layers of niobium, one layer of a resistive film, and two to three layers of insulation (Josephson junctions are formed by thin aluminum oxide barriers between two niobium films). Several niobium foundries in this country are available to fabricate such circuits for you.

What has been going on in this field and what is going on now? You probably have heard about the large-scale IBM project and the Japanese project, with a goal to develop a superfast computer using Josephson junctions. Unfortunately, both projects were based on the so-called latching circuits, where two DC voltage levels, low and high, were used to represent binary information, just as in semiconductors. The left column of Table 2.2 lists major features of the latching circuits. Unfortunately, their maximum clock frequency was only slightly higher than 1 GHz, and theoretical estimates show that it can hardly go higher than about 3 GHz. In my view, this is too slow to compete with semiconductors, because you have to compensate for the necessity of low-temperature operation.

Is there any other opportunity? Yes, there is one. In superconductors, there is one basic property that we can use for computing. Namely, the magnetic flux through any superconducting loop is quantized: it can only equal an integer number of the fundamental unit $\Phi_0 = h/2e$. Of course it is natural to use this number for coding digital data. Thus, any superconductor ring is quite sufficient for the storage of digital information. But for switching, e.g., for writing the information in or reading it out, we need some device for the rapid transfer of the flux in or out of the loop. In our circuits, we do it by inserting a weak link, the Josephson junction, into the loop. When one flux quantum enters or leaves the loop, a picosecond pulse with the area

$$\int V(t)\,dt = \Phi_0 \approx 2.07\ \mathrm{mV \cdot ps}$$

is generated across the junction according to Faraday's law. This "Single-Flux-Quantum" (SFQ) pulse can, in turn, be used to switch other similar circuits. Thus, if you abandon information coding by voltage levels, but use magnetic flux for this purpose, you can do everything very fast.
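To put a number on the SFQ pulse, the flux quantum $\Phi_0 = h/2e$ can be computed from the SI values of the Planck constant and the elementary charge. The sketch below also converts the fixed pulse area into an average voltage for a given pulse duration. The 2 ps duration in the usage example is an assumed illustrative value, not a figure from the talk, and the function name is hypothetical.

```python
# Exact SI defining constants.
PLANCK_H = 6.62607015e-34            # joule-seconds
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs

# Magnetic flux quantum for a superconducting loop (Cooper pairs carry charge 2e).
PHI_0 = PLANCK_H / (2 * ELEMENTARY_CHARGE)   # webers, i.e. volt-seconds

MV_PS = 1e-3 * 1e-12                 # one millivolt-picosecond in volt-seconds


def sfq_mean_voltage_mv(duration_ps):
    """Average voltage (mV) of a pulse of area PHI_0 spread over `duration_ps` picoseconds."""
    area_mv_ps = PHI_0 / MV_PS
    return area_mv_ps / duration_ps


if __name__ == "__main__":
    print(f"flux quantum: {PHI_0:.4e} Wb = {PHI_0 / MV_PS:.3f} mV*ps")
    # ASSUMPTION: a 2 ps pulse, chosen only to illustrate the scale.
    print(f"mean voltage of a 2 ps pulse: {sfq_mean_voltage_mv(2.0):.2f} mV")
```

Because the pulse area is fixed by $\Phi_0$, a picosecond-scale pulse necessarily has an amplitude on the order of a millivolt, which is consistent with the small signal levels mentioned earlier in the talk.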

The right column in Table 2.2 shows you what we can do using such an approach. We call it RSFQ, which stands for Rapid Single-Flux-Quantum circuits. I believe this table is self-explanatory. We use magnetic flux. Power consumption goes down. But of course you should concentrate on the last line showing speed. It is not pure theory. This figure (300 GHz) comes from experiments, complemented by a little bit of extrapolation to slightly better design rules.

We suggested the RSFQ approach as a whole in 1985. Of course, it was based on a lot of previous work, in particular on some of our preliminary work in the mid-1970s, some ideas of Arnold Silver and his group (then at the Aerospace Corporation) in the late 1970s, and some Japanese ideas (especially from the Tohoku University group). But the real development of the RSFQ circuits started only in 1985, and only since 1991 has it been going really fast. I do not have enough time to show you all the developed circuitry, so I will just give you an idea of how these circuits work.

In the simple circuit for generating SFQ pulses, which uses Josephson junctions, information coded in the usual way (by voltage/current levels) arrives at its input, and an SFQ pulse is generated at its output. A simple logic gate, the inverter, uses just three Josephson junctions. How complex are other gates? Sometimes a little bit more complex than those in silicon, sometimes a little bit simpler, but always comparable in terms of, say, Josephson junction count in comparison with junction count in CMOS. We have designed another gate which is a sort of template: a universal gate, potentially with four inputs and six outputs. A slight modification (typically, a truncation) of this template can give you virtually any basic logic function.

Finally, when you have all your logic done in the form of the magnetic flux (or, equivalently, the picosecond SFQ pulses), and you feel you are tired of this superfast processing, you can always transform these SFQ pulses to a DC voltage level output. This voltage can be picked up by a normal amplifier. We have demonstrated an extremely simple single-bit interface between RSFQ circuits and room-temperature semiconductor electronics at a data rate slightly below one gigabit per second, with the parts costing less than $20 per channel.

Now, what is the current state of the fabrication technology? Though the circuit complexity is still not very exciting, the speed is. For example, consider a very simple digital circuit that we have designed, just a frequency divider by two (in other words, a single stage of a binary counter), which is a modification of a device first conceived by Arnold Silver. We have implemented it using 1.2-micron niobium technology. It is fabricated in the usual university lab for not very much money, and we have made measurements that prove that this circuit can divide the frequency of the input SFQ pulse train for any frequency from 0 to 510 GHz. To be honest, it is not a completely digital device. If you fabricated a regular logic gate, say, with two inputs and two outputs, using this particular technology, the maximum speed would be around 100 GHz.
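Logically, a frequency divider by two is a toggle stage: it flips an internal state on every input pulse and emits an output pulse on every second one. The sketch below is a purely behavioral model of that logic, with hypothetical names; it says nothing about the junction dynamics or the actual RSFQ circuit.

```python
def divide_by_two(input_pulse_times):
    """Return the times of output pulses: one for every second input pulse."""
    state = 0
    output_pulse_times = []
    for t in input_pulse_times:
        state ^= 1                 # each input pulse toggles the stored bit
        if state == 0:             # emit a pulse when the bit returns to 0
            output_pulse_times.append(t)
    return output_pulse_times


if __name__ == "__main__":
    input_ghz = 510.0                          # top measured input rate from the talk
    period_ps = 1000.0 / input_ghz             # spacing of input pulses in picoseconds
    inputs = [i * period_ps for i in range(8)]
    outputs = divide_by_two(inputs)
    print(f"{len(inputs)} input pulses -> {len(outputs)} output pulses")
```

Chaining several such stages, each fed by the output of the previous one, gives the binary counter the talk refers to.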

We are still not doing so well in the terms of complexity, because we started our program at SUNY just two years ago. The most complex circuit which we are testing right now (it was developed by us, but fabricated by HYPRES, Inc.) has 645 Josephson junctions and about 1,000 resistors. Clearly, this is still not very large-scale integration. We have, however, an ambitious two-year plan. Each year we are going to increase the integration scale by a factor of 10.

Now let me just summarize what we have. Table 2.3 shows the performance you can get at a typical computation task, multiplication of two 32-bit operands as fast as possible. If you do it in the silicon technology with one micron minimum feature size, you would need about 100,000 transistors to do the job (in a bit-parallel-pipelined structure to provide the maximum speed). You would get a not very spectacular latency, but relatively high throughput. If you do the same task using the old (latching) Josephson logic, you could have approximately the same circuit complexity, and do it about seven times faster. I don't believe this advantage is very big. But with this new superconductor electronics (RSFQ) you can, for one thing, accomplish the task with approximately the same speed (several times faster than silicon) by an extremely simple bit-serial circuit. This circuit, comprising only 1,500 Josephson junctions, will crunch the numbers bit by bit with this enormous clock frequency which is available on chip. This is the circuit complexity that we may achieve as soon as later this year. Alternatively, you could use the same RSFQ technology to do the same computation with all the bits processed in parallel. That would mean that the complexity of the chip would be much higher, almost the same as in silicon, but look at this throughput (100 GigaFLOPS for a single processor)! In comparison with silicon, I believe, we have at least 2.5 orders of magnitude advantage in speed.

The simple table (Table 2.4), which I have prepared for this workshop, may be even more interesting. Right now, when we are using a niobium foundry with a relatively old 3-micron technology for circuit fabrication, we can do calculations at a clock frequency of about 30 GHz (which would give us almost 30 GigaFLOPS if we do them in parallel), with very low power per processor. Now we at SUNY are in transition to our new 1.25-micron technology, where we hopefully will eventually be able to have about 100 GigaFLOPS per processor, with power consumption of about 0.1 Watts. Finally, when eventually we use, for example, a 0.35-micron process (it will already be a rather complex technology, certainly not of the university caliber, but not much more complex than what the silicon people are doing right now), we would approach the natural speed limit of niobium RSFQ circuits at the level of 300 GHz. Then the power dissipation would be close to the maximum which you can afford in liquid helium. (Unfortunately, at liquid-helium temperature you cannot remove 30 Watts of power from a square centimeter of the chip surface, as you do routinely at room temperature. If you do nothing special, just put your chip into liquid helium, you can generate only about 0.3 Watts without substantial overheating. Probably better helium cooling systems could be developed, but nobody has worked on that problem much, to my knowledge.)

Now, even if we concentrate on the second (1-micron) level of the technology, rather simple from the point of view of the silicon people, we are talking about something like 100 GigaFLOPS per processor. Hence, you would need only some 10,000 processors to reach 1 PetaFLOPS. The total dissipated power would be about 1 kilowatt. Of course, memory would add something to this estimate. But remember that simple storage of information in this technology does not require power; dissipation is only involved in read/write operations. So I do not believe that memory would add much to our estimate, probably a factor of two or three.

Of course, you should remember that this (1 kW) is the figure for dissipation in liquid helium. If you are speaking about the total power consumption at room temperature, you should multiply this number by at least 300. Fortunately, there is another factor. Of that 1 kilowatt I have mentioned, only some 10 Watts would be dissipated in your Josephson junctions. Right now we use very conservative circuits which are, crudely speaking, some analog of n-MOS circuits in the silicon technology. In these circuits, most power dissipation takes place in resistors which do not really play any useful role. We are starting to work on a sort of complementary logic (which should be an analog of CMOS), and I hope we will be able to reduce this 1 kW to about 30 Watts at liquid-helium temperature, which means that 1 PetaFLOPS will cost us some 10 kW at 300 K, incomparably less than silicon or any other technology.
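The power budget above is a chain of simple multiplications, sketched here with the figures from the talk. The function names are illustrative, and the factor of 300 is the speaker's lower bound for the refrigeration overhead, so the room-temperature results are minimum estimates.

```python
TARGET_FLOPS = 1e15             # 1 PetaFLOPS
FLOPS_PER_PROCESSOR = 100e9     # about 100 GigaFLOPS per RSFQ processor
WATTS_PER_PROCESSOR = 0.1       # dissipation per processor in liquid helium
REFRIGERATION_FACTOR = 300      # "multiply this number by at least 300"
COMPLEMENTARY_LOGIC_WATTS = 30  # hoped-for cold dissipation with complementary logic


def processors_needed(target_flops, flops_per_processor):
    """Number of processors required to reach the target rate."""
    return target_flops / flops_per_processor


def room_temperature_watts(cold_watts, factor=REFRIGERATION_FACTOR):
    """Wall-plug power needed to remove `cold_watts` dissipated in liquid helium."""
    return cold_watts * factor


if __name__ == "__main__":
    count = processors_needed(TARGET_FLOPS, FLOPS_PER_PROCESSOR)
    cold = count * WATTS_PER_PROCESSOR
    print(f"processors: {count:,.0f}")
    print(f"dissipation in helium: {cold:,.0f} W")
    print(f"at room temperature: {room_temperature_watts(cold) / 1000:,.0f} kW")
    wall = room_temperature_watts(COMPLEMENTARY_LOGIC_WATTS)
    print(f"with complementary logic: {wall / 1000:,.0f} kW")
```

With the present resistor-loaded circuits the estimate is roughly 300 kW at room temperature; the hoped-for complementary logic brings it to about 9 kW, which the talk rounds to 10 kW.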

Do we have problems? Yes, we certainly do. But as you can see in Table 2.5, all our problems bear the dollar sign. The first problem is refrigeration. Even if you are using a single superconductor chip, a present-day closed-cycle cryocooler would cost about 10-20 thousand dollars. Of course, with introduction to mass applications this figure would go down, but nevertheless cooling with liquid helium will always make even simple superconductor circuits more expensive than silicon. This is why I don't believe that this technology would ever be in PCs or even workstations. It's something to be reserved for the high-performance end of computing.

Next, as far as we know, nothing in this technology prevents big memories. The limitations are essentially the same as in semiconductors. We are presently testing the first RSFQ memory cell, with an area of only 100 lithographic squares (i.e., smaller than semiconductor SRAM cells, though slightly larger than DRAM cells). Of course, to develop a real memory, you would need financial investment much larger than the one we are using now for the development of basic logic circuitry.

Finally, I believe there is what you would call a $60,000,000.00 problem of psychology. People are just not accustomed to these ideas. They are not accustomed to the idea of cooling or to other issues of superfast computation. For example, our circuits can use what is called the local self-timing, in particular, the hand-shaking protocol on the single-bit level. It means that you can use a flexible combination of synchronous and asynchronous computation. The asynchronous computation scares some people to death. This and many other issues should be not only explored, but also implanted into minds of electronic engineers.

Now, let me show you the last transparency (Table 2.6). It is a favorable scenario of the future development in this field. We see several small-scale market niches where I believe this technology would win, because we are far ahead of any competition in performance, and there are people around willing to pay money for that. One example is the famous SQUIDs, which are supersensitive magnetometers. People are using liquid helium to work with these devices now, so they certainly would be willing to do that when we improve the devices radically using some on-chip digital processing. Somewhere below the fourth line of this table I become much less confident. First of all, we should still solve a lot of technical problems. Moreover, somewhere at this point we may come to the situation where we will need much stronger collaboration with architecture people, with potential users, and we will certainly need a much larger investment than we have right now if we want this revolution to continue hopefully all the way down to the PetaFLOPS future.




