# Cray Research Incorporated

### Status:

Active systems manufacturer and subsidiary of Silicon Graphics, Inc. (SGI)

### Overview of Organization:

Cray has a well-established user base in the aerospace, automotive, chemical, electronics and petroleum industries, as well as in research centers and laboratories. A substantial suite of libraries and applications (over 600 quoted) is available for a number of fields, particularly computational chemistry and fluid dynamics. There are currently over 270 Cray installations at over 200 customer sites, and staffing is over 5,000 worldwide.

Cray Research Inc. was founded by Seymour Cray in 1972 to produce the fastest computer in the world: the first supercomputer. The Cray-1, available from 1976 to 1982, was the first vector uniprocessor supercomputer, and it remains the machine against which other computers are often compared. It evolved into the X-MP, the first multiprocessor supercomputer system, which allowed up to 4 vector units to cooperate on large problems; the X-MP was available until 1988, when it was superseded by two separate Cray development streams.

The evolutionary path led from the X-MP to the Y-MP series, improving and repackaging the existing technology and expanding parallelism from a maximum of 4 vector units to a maximum of 8 and now currently 16 CPUs. The Y-MP range has been extended with an 'entry-level' system (using technology acquired from Supertek Computers in 1990) and the current 'top of the range' Y-MP/C90. Plans for the future include provision for more parallelism; indeed, a massively parallel offering is currently being developed.

The Cray-2, on the other hand, offered a revolutionary approach with radically new technology, and was spun off as Cray Computer Corporation under the direction of Seymour Cray himself. Using innovative memory and interconnect systems, and faster processors rather than additional parallel processors, the Cray-2 has essentially matched the Cray Y-MP, but at the expense of a fragmented (non-portable) user base and less reliable machines. The Cray-3, making extensive use of fledgling GaAs technology, has been expected for some time, but problems with the advanced processor and packaging technology have led to extensive delays and lost orders.

Cray computers are used in compute-intensive scientific fields such as high energy physics, weapons research, crash simulation in the automotive industry, aerospace engineering, biomedical research, and petroleum exploration. By geography, 67% of Cray Research's sales are in the US, and 33% in the rest of the world. 58% of machines are sold to government laboratories, 25% to commercial companies, and 17% to universities.

### Platforms Documented:

Cray Research Inc.
1100 Lowater Rd.
Chippewa Falls, Wisconsin 54701
715-726-1211

Cray Research Inc.
1440 Northland Drive
Mendota Heights, MN 55120
612-452-6650

Cray Research
655F Lone Oak Drive
Eagan, Minnesota 55121

Cray Research (UK) Ltd
Cray House
Bracknell
Berkshire RG12 2SY
England
Tel 0344-485971
Telex  848841


Cray Research, Inc. has a WWW server of its own.

## CRAY-1

### Overview of Platform

This machine is no longer being produced, although when first introduced in 1976 (Los Alamos), it was without doubt the fastest processor in the world and is still used as a benchmark for high-speed computing. Since many CRAY customers are currently upgrading their systems to an X-MP, there are opportunities to buy a second-hand CRAY-1S at knockdown prices.

### Compute Hardware:

A uniprocessor vector machine, using pipelining and chaining to gain speed, with a 12.5-nsec clock and fast scalar performance. It uses only four chip types with 2 gates per chip. The word size is 64 bits, with up to 4 Mwords of storage; the CRAY 1-S has bipolar memory (in units of 4K RAM), and the newer (1982) CRAY 1-M has MOS memory (in units of 16K RAM).
• Logic chips: ECL with a gate delay of 0.7 nsec.
• Main memory banked up to 16 ways. The bank busy time is 50 nsec (70 nsec on the 1-M) and the memory access time (latency) is 12 clocks (150 nsec).
• No virtual memory.
• Register-to-register machine.
• 8 vector registers, each holding 64 (64-bit) words.
• No half precision.
• Double precision (128 bits) is implemented in software and is extremely slow (factors of about fifty times single precision (64 bits) are common).


### Memory System:

There is only one pipe from memory to the vector registers, resulting in a major bottleneck for loads and stores between memory and registers. Loads can be chained with arithmetic operations; stores cannot.

### Benchmarks / Compute and data transfer performance:

Low vector start-up times and fast scalar performance make this a very general-purpose machine. Maximum performance is 160 Mflops (64-bit arithmetic), with a maximum attainable sustained performance of 150 Mflops; there are codes for matrix multiplication and the solution of equations which come close to this. The maximum scalar rate is 80 MIPS. It is easy to attain over 100 Mflops for certain problems, even using Fortran.

The LINPACK benchmark for the Cray-1S system (12.5 ns), using the Fortran compiler cf77 2.1 and problem size n = 1000: the TPP (toward peak performance) figure was 110 Mflops, against a theoretical peak of 160 Mflops.
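The quoted peak and TPP figures can be cross-checked from the clock period. The sketch below assumes, as a model, that the 160 Mflops peak corresponds to one chained floating-point add and one multiply completing per 12.5-nsec clock (an interpretation, not a statement from the text):

```python
# Hedged model: peak rate assuming one chained add + multiply per clock.
clock_ns = 12.5
clock_rate_mhz = 1e3 / clock_ns        # 80 MHz clock rate
peak_mflops = 2 * clock_rate_mhz       # 2 flops/clock -> 160 Mflops

tpp_mflops = 110                       # LINPACK n = 1000 figure quoted above
efficiency = tpp_mflops / peak_mflops  # fraction of peak attained (~0.69)
print(peak_mflops, round(efficiency, 3))
```

On this model the n = 1000 LINPACK run reaches roughly 69% of peak.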

### Operating System Software and Environment:

An extensive range of software exists for this machine. Since the instruction set is compatible with the X-MP range, this software will also run on that range.

## CRAY-2

### Overview of Platform:

This is a 4-processor (quadrant) vector machine with pipelining and overlapping but no chaining. There are more segments in the pipes than in the other CRAYs. Multitasking primitives have the same syntax as on the X-MP.

### Compute Hardware:

The system has a 4.1-nsec clock cycle time.

Overheads for vector operations are large:

• 63 cycles for vector load
• 22 cycles for vector multiply
• 22 cycles for vector add
• 63 cycles for vector store
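A simple way to see the cost of these overheads is the standard pipeline model, in which a length-n vector operation takes overhead + n clocks (one result per clock once the pipe is full). This model is an illustration, not a Cray specification:

```python
# Illustrative pipeline model: a length-n vector op costs overhead + n clocks.
def efficiency(overhead_cycles, n):
    """Fraction of the peak (one result per clock) rate achieved."""
    return n / (overhead_cycles + n)

n = 64  # Cray vector register length
for name, overhead in [("load", 63), ("multiply", 22), ("add", 22), ("store", 63)]:
    print(f"{name:8s} {efficiency(overhead, n):.2f}")
```

Even at full register length (n = 64), a load runs at only about half the peak rate on this model, which is one reason the Cray-2 favors long vectors.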

Recent enhancements to the CRAY-2 include a 512 Mword memory and models with 128 Mword static RAM. Other improvements include implementing functional units in VLSI (and cutting latency time by half), a larger instruction buffer, reduced branch time, and faster issue rates for certain sequences of instructions.

The machine is liquid cooled using inert fluorocarbon.

### Memory System:

Memory is 256 Mwords of 256 K DRAM in 128 banks. The bank busy time is 57 clocks, and the scalar memory access time is 59 clocks. Local memory is 16 Kwords, 4 clocks from local memory to vector registers. Vector references from local memory must be with unit stride. There are 8 vector registers each with 64 elements.

### Benchmarks / Compute and data transfer performance:

Peak performance is 488 Mflops per processor. A matrix multiply code has run at 1.7 Gflops on 4 processors.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software" by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray-2/4-256 system, using the Fortran compiler cf77 3.0 and problem size n = 1000:

• For 4 processors (4.1 ns), the TPP (toward peak performance) was 1226 Mflops, while the theoretical peak was 1951 Mflops.

• For 2 processors (4.1 ns), the TPP was 709 Mflops, while the theoretical peak was 976 Mflops.
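The theoretical peaks in these figures follow directly from the 4.1-nsec clock, assuming (as a model) one add and one multiply completing per clock per processor:

```python
# Hedged check: theoretical peak from the 4.1 ns clock, 2 flops/clock/CPU.
clock_ns = 4.1
per_cpu_mflops = 2 * 1e3 / clock_ns    # ~487.8, quoted above as 488 Mflops
peak_4cpu = round(4 * per_cpu_mflops)  # 1951, as in the 4-processor figure
peak_2cpu = round(2 * per_cpu_mflops)  # 976, as in the 2-processor figure
print(round(per_cpu_mflops), peak_4cpu, peak_2cpu)
```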

### Operating System Software and Environment:

Software includes: UNIX-based OS (called UNICOS); C compiler; CFT2 (Fortran compiler); CFT77.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

Cray has an ongoing commitment to high-speed peripherals and fast network links. HSX is a 100 Mbytes/sec link for connecting CRAYs together. CRAYs can be linked to Ultra Corporation's 1.6 Gbit bus, in addition to standard connections with Ethernet (TCP/IP) and VME buses. The DD-40 disks each hold 5 Gbytes and have a transfer rate of 10 Mbytes/sec.

### Notable Applications / Customers / Market Sectors:

Delivered: NMFECC, NASA Ames, University of Minnesota, Harwell Laboratory, Stuttgart, and Ecole Polytechnique (Paris).

## CRAY X-MP

### Overview of Platform

This multiprocessor pipelined vector machine has the same basic architecture as the CRAY-1. The major differences are that there are three paths from memory to the vector registers, and the clock cycle time is 8.5 nsec on all machines shipped after August 1986 (machines built before then have a cycle time of 9.5 nsec).

The current machines come with 1, 2, or 4 processors. Gather/scatter hardware is available on the 2- or 4-processor version of the machine. The gather/scatter can be chained to a load/store operation. Users can control all processors through calls in Fortran. The processors share memory.

### Memory System:

Main memory is ECL 4K RAM with a 25-nsec access time (interleaving to 64 banks is possible). Memory is up to 16 M (64-bit) words:

• X-MP-2: MOS memory; bank busy time is 68 nsec and memory access time is 17 clocks.

• X-MP-4: ECL memory; bank busy time is 34 nsec and memory access time is 14 clocks.

Logic is ECL with a 0.35-0.5 nsec gate delay and 16 gates/chip.

A high-speed connection at 1024 Mbytes/sec per channel (max. 2) links to a CRAY SSD, which comes in various sizes up to 512 Mwords of secondary MOS memory. Data transfer to the high-speed (1200 Mbyte) DD-49 disk runs at 10 Mbytes/sec. Recent peripheral enhancements are as reported under the CRAY-2.

### Benchmarks / Compute and data transfer performance:

Peak of 235 Mflops per processor.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software" by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray X-MP/416 system, using the Fortran compiler CF77 4.0 and problem size n = 1000:

• 4 processors (8.5 ns): TPP best effort is 822 Mflops; theoretical peak is 940 Mflops.

• 2 processors (8.5 ns), using the CF77 5.0 compiler: TPP best effort is 426 Mflops; theoretical peak is 470 Mflops.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

There are many possible front ends including IBM, CDC, VAX, and Apollo.

### Notable Applications / Customers / Market Sectors:

Delivery: Announced in August 1982, first system delivered in June 1983.

## CRAY Y-MP

### Overview of Platform:

Models include: Y-MP 2E, 4E, 8E, 8I

This is a multiprocessor pipelined vector machine with an architecture similar to the CRAY X-MP's. A major difference is the availability of 32-bit as well as 24-bit addressing. The cycle time is 6 nsec, and it is an 8-processor machine. As in the X-MP, there are three paths from memory to the vector registers.

### Compute Hardware:

There are only three module types: CPU (8 modules), memory (32 modules, with 1 Mword/module), and clock (1 module), making 41 modules in all, compared with 144 in the X-MP. Each module is on an 11" by 21.2" board. The logic is 2.5-micron ECL in 2500-gate arrays with a gate delay of 350 picoseconds; there are 312 arrays per processor on four PCBs, with a power dissipation of 9 watts per array.

### Interconnect / Communications System:

The architecture provides up to a maximum of 2 GB of central shared memory (in the largest processor configuration only).

### Memory System:

The processors share a common memory of 32 Mwords of bipolar SRAM with a 15-nsec access time and a bank busy time of 102 nsec. Memory is interleaved across 256 banks. Total bandwidth is 340 Gbytes/sec (32 words per clock period per processor). The Y-MP comes with a 128-Mword SSD as a standard feature, and 2 IOSs can be fitted, each with a 4-Mword buffer memory.

### Benchmarks / Compute and data transfer performance:

Peak performance of 4 Gflops. Overall performance of about 30 times a CRAY-1. Each processor should outperform a single X-MP processor by a factor of 1.4 in vector mode (1.2 in scalar).

Data transfer: up to a maximum of 10.4 GB/s I/O bandwidth (in the largest processor configuration only).

Sustained performance of over 2 Gflops is claimed for the full 8-processor Y-MP/8, with 2.13 Gflops achieved on the SLALOM benchmark. Single-CPU Y-MP/8 SLALOM performance was measured as 0.28 Gflops.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software" by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray Y-MP system, using the CF77 Fortran compiler and problem size n = 1000:

• 832 version (8 proc., 6 ns), CF77 4.0: TPP best effort is 2144 Mflops; theoretical peak is 2667 Mflops.

• 832 version (4 proc., 6 ns), CF77 4.0: TPP best effort is 1159 Mflops; theoretical peak is 1333 Mflops.

• 832 version (2 proc., 6 ns), CF77 5.0: TPP best effort is 604 Mflops; theoretical peak is 667 Mflops.

• 832 version (1 proc., 6 ns), CF77 5.0: TPP best effort is 324 Mflops; theoretical peak is 333 Mflops.

• M98 version (8 proc., 6 ns), CF77 5.0: TPP best effort is 1733 Mflops; theoretical peak is 2666 Mflops.

• M98 version (4 proc., 6 ns), CF77 5.0: TPP best effort is 1114 Mflops; theoretical peak is 1333 Mflops.

• M98 version (2 proc., 6 ns), CF77 5.0: TPP best effort is 596 Mflops; theoretical peak is 666 Mflops.

• M98 version (1 proc., 6 ns), CF77 5.0: TPP best effort is 307 Mflops; theoretical peak is 333 Mflops.

• M92 version (2 proc., 6 ns), CF77 5.0: TPP best effort is 550 Mflops; theoretical peak is 666 Mflops.

• M92 version (1 proc., 6 ns), CF77 5.0: TPP best effort is 332 Mflops; theoretical peak is 333 Mflops.

• 416 version (4 proc., 8.5 ns), CF77 4.0: TPP best effort is 822 Mflops; theoretical peak is 940 Mflops.

• 416 version (2 proc., 8.5 ns), CF77 5.0: TPP best effort is 426 Mflops; theoretical peak is 470 Mflops.
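The 832-model figures above also give the parallel speedup and efficiency. A small sketch (note the 4- and 8-processor runs used CF77 4.0 while the single-CPU baseline used CF77 5.0, so the comparison is approximate):

```python
# Speedup and efficiency implied by the Y-MP 832 LINPACK (n = 1000) figures.
base = 324                                  # 1-processor TPP, Mflops
runs = {1: 324, 2: 604, 4: 1159, 8: 2144}   # processors -> TPP Mflops
for procs, mflops in runs.items():
    speedup = mflops / base
    print(procs, round(speedup, 2), round(speedup / procs, 2))
```

The 8-CPU run achieves a speedup of about 6.6, i.e. roughly 83% parallel efficiency.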

### Operating System Software and Environment:

Programming environment: extensive parallel libraries and kernels, with compilers featuring automatic detection and replacement of BLAS1/2/3, LAPACK and FFTLIB routines. The CDBX debugger, along with tools for performance monitoring, analysis, enhancement and visualization, provides a comprehensive programmer support environment. Network interfacing is provided via TCP/IP, NFS, HiPPI, FDDI, etc., and graphics support covers AVS, Explorer, X11, the Cray Visualization Toolkit, etc. The Autotasking Expert System is a proprietary automatic parallel multitasking tool.

COS and UNICOS are both supported. In addition to CFT77, a new Fortran compiler will be available. Performance tools include dynamic and static analysis, tuning, and debugging aids. The machine can run in X or Y mode, and software can also run (in X mode) on the X-MP.

Software is fully binary-compatible throughout Y-MP range (including C90 and EL systems).

Languages: 'new compiling system' releases of sophisticated vectorizing and parallelizing compilers for Cray Fortran-77, C and Ada (using Cray extensions and directives), together with existing Pascal and Common Lisp products. These support IEEE floating point and Fortran-90 array syntax, along with extensions for variable-length arrays and restricted pointers. Manual multitasking is available through (local) microtasking and (global) macrotasking directives distributed across up to 16 CPUs. Scalar processing is supported via inlining, overlapping instructions and operation grouping; vector processing via pipelined overlap, vector chaining and multiple pipelines.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

There are many possible front ends including IBM, CDC, VAX, and Apollo. There are four VHISP channels each rated at 1250 Mbytes/sec, eight 100 Mbytes/sec HISP channels, and eight LOSP 6 Mbytes/sec channels. A full range of disks, tapes, terminals, workstations, and networks (including TCP/IP) are supported.

Cooled using inert fluorocarbon but not with liquid immersion technology (as on CRAY-2).

The dimensions of the machine are 77" x 30" x 75" with a total footprint of 98 sq ft and weighing 5,000 lb.

Support environment: low-end machines are air-cooled and use normal electricity, but higher-performance, more demanding configurations require special (Freon) cooling and power systems. SMARTE (System Maintenance and Remote Test Environment) provides remote diagnostics.

Scalability: numerous configurations, from uniprocessors to a maximum of 8 custom vector CPUs.

### Notable Applications / Customers / Market Sectors:

1 CPU running in 1987; first deliveries in 1988; nine deliveries in 1989; full production and possible enhancements in 1990.

## CRAY Extended Architecture

### Overview of Platform:

This is a broad product family, ranging from single-processor systems, through two- and four-processor systems, to the 8-processor Y-MP system. The EA series features large central memories (4 million to 64 million 64-bit words). Many of the models are upgradable in the number of CPUs and memory size. CRAY X-MP EA/4 computer systems offer up to ten times the performance of the original CRAY-1 computer.

• One processor with 4 or 16 million words of MOS memory on the X-MP EA/14se and X-MP EA/116se systems.
• One, two, or four processors sharing 16, 32, or 64 million words of MOS memory.
• 8.5-nsec clock cycle on X-MP EA systems; 100-nsec clock cycle on X-MP EA/14se and X-MP EA/116se systems.
• SECDED memory protection.
• Four parallel memory ports per processor.
• Flexible hardware chaining for vector operations.
• Gather/scatter and compressed-index vector support.
• Flexible processor clustering for multitasking applications.
• Dedicated registers for efficient interprocessor communication and control.
• X-mode and Y-mode instruction support.

### Operating System Software and Environment:

Software includes:
• UNIX-based OS (called UNICOS), based on AT&T UNIX System V
• Cray operating system COS
• vectorizing Fortran compiler
• vectorizing C compiler
• optimized Fortran mathematical and I/O subroutine library
• vectorizing ISO Level 1 Pascal compiler
• CAL, the Cray macro assembler
• scientific subroutine library optimized for EA systems
• wide variety of major applications programs

Fortran characteristics: ANSI standard compliance; automatic optimization, including vectorizing of DO loops; portability of applications codes; library routines optimized for the EA systems

### Networkability/ I/O System / Integrability / Reliability / Scalability:

Several hardware interfaces are available to integrate Cray systems into customer environments:

Linkage to workstations is available through Network Systems Corporation or similar adapters, and front-end communication is available with IBM, CDC, DEC, Honeywell, Data General and UNISYS computer systems. The IOS can optionally be configured with a 32-million-word or 128-million-word SSD in the same cabinet. The machine is liquid-cooled using inert fluorocarbon.

### Notable Applications / Customers / Market Sectors:

Applications: advanced graphics; applied math; AI; atmospheric and oceanic research; circuit simulation and design; crashworthiness simulation; economic modeling; energy research; financial modeling; fluid dynamics; genetic engineering; molecular dynamics; weather forecasting; signal and image processing.

## CRAY Y-MP EL

### Overview of Platform:

This is an entry-level version of the Y-MP.

### Compute Hardware:

See the YMP entry above.

### Interconnect / Communications System:

The system can be configured with up to a maximum of 0.5GB of shared-memory.

### Benchmarks / Compute and data transfer performance:

A claimed peak performance of 133 Mflops per CPU.

A maximum of 1.05GB/s I/O bandwidth.

LINPACK benchmark (due to Jack Dongarra): Y-MP EL (4 proc., 30 ns), CF77 5.0: the TPP best effort is 345 Mflops and the theoretical peak is 532 Mflops.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

One to a maximum of four CMOS vector processors.

## CRAY C90

### Overview of Platform

This architecture has a maximum of 2 GB of central shared memory, connected to up to 16 CPUs, each with 2 vector pipes. Memory is protected by SECDED (single-bit error correction, double-bit error detection). The machine offers 64-way vector parallelism, with two vector pipes and two functional pipes for each of the 16 possible CPUs.

### Memory System:

The I/O Subsystem (IOS) supports up to 16 I/O clusters, each consisting of up to 16 channel adapters. An optional Solid-state Storage Device (SSD) provides up to 16 GB of fast storage for code or swap, with up to four 1800-MB/s channels. DD-60 disk drives offer a 20-MB/s sustained transfer rate and up to 2-GB capacity, and support disk striping with 8 disks per IOS channel.

### Benchmarks / Compute and data transfer performance:

A claimed peak performance of 16 Gflops, with 13.6 Gflops achieved on the LINPACK massively parallel benchmark.

Data transfer: 13.6 GB/s I/O bandwidth, and over 250 GB/s total memory bandwidth.

LINPACK Benchmarks: (Due to Jack Dongarra)

• C90 (16 proc., 4.2 ns), CF77 5.0, n = 1000: TPP best effort is 10780 Mflops; theoretical peak is 15238 Mflops.

• C90 (8 proc., 4.2 ns), CF77 5.0, n = 1000: TPP best effort is 6175 Mflops; theoretical peak is 7619 Mflops.

• C90 (4 proc., 4.2 ns), CF77 5.0, n = 1000: TPP best effort is 3275 Mflops; theoretical peak is 3810 Mflops.

• C90 (2 proc., 4.2 ns), CF77 5.0, n = 1000: TPP best effort is 1703 Mflops; theoretical peak is 1905 Mflops.

• C90 (1 proc., 4.2 ns), CF77 5.0, n = 1000: TPP best effort is 902 Mflops; theoretical peak is 952 Mflops.
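These theoretical peaks are consistent with the 4.2-nsec clock under the assumption of 4 flops per clock per CPU (two vector pipes, each completing an add and a multiply); a quick check:

```python
# Hedged check: C90 peaks from the 4.2 ns clock, assuming 4 flops/clock/CPU.
clock_ns = 4.2
per_cpu_mflops = 4 * 1e3 / clock_ns      # ~952.4, quoted above as 952
for procs in (1, 2, 4, 8, 16):
    print(procs, round(procs * per_cpu_mflops))
```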

### Networkability/ I/O System / Integrability / Reliability / Scalability:

One to a maximum of 16 custom ECL vector CPUs.

## CRAY T3E

### Overview of Platform:

Scalable parallel systems with configurations from 6 to 2048 processors and corresponding peak performance levels from gigaflops to teraflops. The T3E builds on the scalable architecture of the T3D, which was introduced in 1993. Each system's PEs are configured with one of six memory sizes: 64, 128, 256, or 512 Mbytes, 1 Gbyte, or 2 Gbytes; within a module, all PEs have the same memory size. The Cray T3E-900 series, introduced later with about 50 percent higher performance than its predecessor, was the first in a series of enhancements to the T3E line. The newest enhancement is the T3E-1200 series, with almost twice the performance of the original T3E; its performance derives mainly from combining ultrafast (600 MHz) microprocessors and fast interprocessor communications with the operating system and high-performing I/O capabilities.

Information Source: Cray Research.

### Compute Hardware:

Liquid-cooled CRAY T3E systems are available in configurations from 64 to 2048 processors, with peak performance levels from 38.4 GFLOPS to 1.2 TFLOPS; air-cooled configurations are also available. The cabinet footprint area is:
• 35.4 sq ft (3.2 m²) for the LC256 model
• 57.4 sq ft (5.2 m²) for the LC512 model
• 114.8 sq ft (10.4 m²) for the LC1024 model
• 229.6 sq ft (20.8 m²) for the LC2048 model

### Interconnect / Communications System:

Each T3E topology is a 3-D torus. The data payload bisection bandwidth for 512 PEs is 22 Gbytes/s.

### Memory System:

The T3E has a physically distributed, globally addressable shared memory built from 16-Mbit or 64-Mbit DRAM, for a total system memory of between 4 Gbytes and 4 Tbytes.

### Benchmarks / Compute and data transfer performance:

Examining the NAS Kernel benchmark on the T3E (300 MHz), the results for kernels with zero source changes (Mflop/s, CRAY T3E 300 MHz) are:

Kernel      f90
=======    =====
MXM         84.8
CFFT2D      21.2
CHOLSKY     19.7
BTRIX       49.3
GMTRY       67.9
EMIT        61.5
VPENTA       4.2

The CRAY T3E network latency is nominally 1 microsecond. With the various message-passing libraries, however, the effective latency can be much larger, due to overhead associated with buffering and with deadlock detection. Network latency and bandwidth for message-passing libraries:

Library                 Latency [microseconds]    Bandwidth [Mbyte/s]
====================    ======================    ===================
shared memory access              1                       350
PVM                              11                       150
MPI                              14                       260
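A common way to use such figures is the linear cost model time = latency + size/bandwidth. The model itself and the message sizes chosen are illustrative assumptions; the latency and bandwidth values are those quoted above:

```python
# Illustrative linear cost model for a message transfer on the T3E.
def transfer_us(nbytes, latency_us, bandwidth_mb_s):
    """Time in microseconds; 1 Mbyte/s is 1 byte per microsecond."""
    return latency_us + nbytes / bandwidth_mb_s

libs = {"shared memory": (1, 350), "PVM": (11, 150), "MPI": (14, 260)}
for name, (lat_us, bw) in libs.items():
    t_1kb = transfer_us(1_000, lat_us, bw)      # latency-dominated
    t_1mb = transfer_us(1_000_000, lat_us, bw)  # bandwidth-dominated
    print(f"{name:14s} 1 KB: {t_1kb:6.1f} us   1 MB: {t_1mb:8.0f} us")
```

For 1-KB messages the library latency dominates, which is why MPI's higher bandwidth is of little help for small transfers.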

### Operating System Software and Environment:

The OS is Cray UNICOS/mk, which provides distributed functionality; it is basically a scalable version of UNICOS. UNICOS/mk is divided into servers distributed among the processors of the T3E system: local servers process OS requests specific to each user PE, while global servers provide system-wide OS capabilities.

This distribution of functions provides a global view of the computing environment: a single-system image that allows administrators to manage system-wide resources. Both T3E-1200 and T3E-900 systems support explicit distributed-memory parallelism, through CF90 and C/C++ with message-passing (MPI, MPI-2, and PVM) and data-passing programming models, and implicit parallelism, through HPF and the Cray CRAFT work-sharing features.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

I/O is based on the GigaRing channel, with sustainable bandwidths of 267 MB/s input and output for every four processors. All I/O channels are accessible and controllable from all PEs.

### Notable Applications / Customers / Market Sectors:

Applications: CFD, 3D oil exploration, weather simulation, 3D seismic processing, and electromagnetics.

Customers: Department of Defense Naval Oceanographic Office, Electronic Data Systems (EDS), Exxon, German Research Center Forschungszentrum Juelich (KFA), Mobil Oil Company's Exploration and Producing Technical Center, NASA's Goddard Space Flight Center, National Energy Research Scientific Computing Center, National Oceanographic and Atmospheric Administration's Geophysical Fluid Dynamics Laboratory, System Engineering Research Institute in Korea, United Kingdom Meteorological Office, and the U.S. Army High-Performance Computing Center.

### Summary:

Here are the T3E basic characteristics:

• It uses the non-uniform memory access (NUMA) design.

• Although memory is accessible to every processor, the distribution of memory across processors is exposed to the programmer.

• Caches are used only to hold data from local memory.

• It scales up to 2048 processors.

• There is no hardware mechanism for cache coherence, unlike the SGI Origin2000.

• The memory controller generates a communication request for nonlocal references.

More details on the T3E (due to Culler et al. and Cray Research):

• Each node contains

1. DEC Alpha processor

2. local memory

3. network interface integrated with the memory controller

4. network switch

• It is organized as a 3-D torus, with each node connected to its 6 neighbors through 480-MB/s point-to-point links.

• Any processor can read or write any memory location; however, the NUMA characteristic is exposed in the communication architecture as well as in its performance characteristics.

• A short sequence of instructions is required to establish addressability to remote memory, which can be accessed by conventional loads and stores.

• The memory controller captures the access to a remote memory and conducts a message transaction with the controller of the remote node on behalf of the local processor.

• The message transaction is automatically routed through intermediate nodes to the desired destination, with a small delay per "hop."

• The remote data is not cached since there is no hardware mechanism to keep it consistent.

• The I/O system is distributed over a collection of nodes on the surface of the cube, which are connected to the external world through an additional I/O network.
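The torus wraparound links mean the longest minimal path in each dimension is only half the ring length. A small sketch of minimal hop counts on a k × k × k torus (an illustration of the topology, not Cray's actual routing logic):

```python
# Minimal hop count between nodes of a k x k x k 3-D torus (illustrative).
def torus_hops(a, b, k):
    """Sum, over the three dimensions, of the shorter way around each ring."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

k = 8  # an 8 x 8 x 8 torus holds 512 PEs
print(torus_hops((0, 0, 0), (7, 7, 7), k))  # wraparound: 1 hop per dimension
print(torus_hops((0, 0, 0), (4, 4, 4), k))  # worst case: k/2 hops per dimension
```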

Product specifications are available from Cray Research.

## CRAY T90

### Overview of Platform:

This series, based on Cray high-speed vector processors, provides nearly 60 gigaflops (billion calculations per second) of peak computing power on a number of supercomputing applications.

As the entry-level system of the T90 series, the T94 system provides up to 4 processors, 1024 Mbytes of central memory, and a peak performance of approximately 8 GFLOPS.

As the mid-range system of the T90 series, the T916 system provides up to 16 processors, 4096 Mbytes of central memory, and a peak performance of approximately 32 GFLOPS.

As the top-end system of the T90 series, the T932 system provides up to 32 processors, 8192 Mbytes of central memory, and a peak performance of more than 60 GFLOPS.

### Compute Hardware:

The T90 is available in three models: the T94, offered in air- or liquid-cooled versions, scaling up to four processors; the T916, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line T932, also liquid-cooled, with up to 32 processors.

### Interconnect / Communications System:

The T94, T916 and T932 systems support multiple ATM, FDDI, and HIPPI connections.

### Benchmarks / Compute and data transfer performance:

The following LINPACK benchmark figures are due to Jack Dongarra, Univ. of Tennessee.

• T932 (32 proc., 2.2 ns), n = 1000: best effort (TPP) is 29360 Mflops; theoretical peak is 57600 Mflops.

• T928 (28 proc., 2.2 ns), n = 10000: best effort (TPP) is 28349 Mflops; theoretical peak is 50400 Mflops.

• T924 (24 proc., 2.2 ns), n = 1000: best effort (TPP) is 26170 Mflops; theoretical peak is 43200 Mflops.

• T916 (16 proc., 2.2 ns), n = 1000: best effort (TPP) is 19980 Mflops; theoretical peak is 28800 Mflops.

• T916 (8 proc., 2.2 ns), n = 1000: best effort (TPP) is 10880 Mflops; theoretical peak is 14400 Mflops.

• T94 (4 proc., 2.2 ns), n = 1000: best effort (TPP) is 5735 Mflops; theoretical peak is 7200 Mflops.

• T94 (3 proc., 2.2 ns), n = 1000: best effort (TPP) is 4387 Mflops; theoretical peak is 5400 Mflops.

• T94 (2 proc., 2.2 ns), n = 1000: best effort (TPP) is 2998 Mflops; theoretical peak is 3600 Mflops.

• T94 (1 proc., 2.2 ns), n = 1000: best effort (TPP) is 1603 Mflops; theoretical peak is 1800 Mflops.

### Memory System:

The T932 model has 1024 to 8192 Mbytes of central memory and a memory bandwidth of over 800 Gbytes/s. The T916 model has 2048 to 4096 Mbytes of central memory and a memory bandwidth of more than 400 Gbytes/s. The T94 model has 512 to 1024 Mbytes of central memory and a memory bandwidth of over 100 Gbytes/s.

### Operating System Software and Environment:

The UNICOS operating system is based on UNIX System V. UNICOS is a standard UNIX environment that has been enhanced to provide parallel processing, production-quality resource management, security, and network connectivity.

### Networkability/ I/O System / Integrability / Reliability / Scalability:

The T932 has an aggregate I/O bandwidth of more than 35 Gbytes/s, the T916 greater than 17 Gbytes/s, and the T94 over 8 Gbytes/s.

### Notable Applications / Customers / Market Sectors:

In the auto industry, companies such as Nissan and Ford use these systems for engineering and manufacturing applications. Another customer is NOAA, which uses them to develop and run weather forecasting models.

### Summary:

Product specifications are available from Cray Research.

hawick@npac.syr.edu
saleh@npac.syr.edu