Cray Research Incorporated

Status:

Active systems manufacturer and subsidiary of Silicon Graphics, Inc. (SGI)

Overview of Organization:

Cray has a well-established user base in the aerospace, automotive, chemical, electronics and petroleum industries, as well as in research centers and laboratories. A substantial suite of libraries and applications is available for a number of fields (over 600 quoted), particularly computational chemistry and fluid dynamics. There are currently over 270 Cray installations at over 200 customer sites. Staffing is over 5000 worldwide.

Cray Research Inc. was founded by Seymour Cray in 1972 to produce the fastest computer in the world, the first supercomputer. The Cray-1, available from 1976 to 1982, was the first vector uniprocessor supercomputer, and other computers are still often compared against it. It evolved into the X-MP, the first multi-processor supercomputer system, which allowed up to 4 vector units to cooperate on large problems. The X-MP was available until 1988, when it was superseded by two separate Cray development streams.

The evolutionary path led from the X-MP to the Y-MP series, improving and repackaging the existing technology, and expanding parallelism from a maximum of 4 vector units to a maximum of 8 and currently 16 CPUs. The Y-MP range has been extended by including an `entry-level' system (using technology acquired from Supertek Computers in 1990) and the current `top of the range' Y-MP/C90. Plans for the future include provision for more parallelism, and a massively parallel offering is currently being developed.

The Cray-2, on the other hand, offered a revolutionary approach with radically new technology, and was spawned off as Cray Computer Corporation, under the direction of Seymour Cray himself. Using innovative memory and interconnect systems, and faster processors rather than additional parallel processors, the Cray-2 has essentially matched the Cray Y-MP, but at the expense of a fragmented (non-portable) user base and less reliable machines. The Cray-3, making extensive use of fledgling GaAs technology, has been expected for some time, but problems with the advanced processor and packaging technology have led to extensive delays and lost orders.

Cray computers are used in compute-intensive scientific fields such as high energy physics, weapons research, crash simulation in the automotive industry, aerospace engineering, biomedical research, and petroleum exploration. By geography, 67% of Cray Research's sales are in the US, and 33% in the rest of the world. 58% of machines are sold to government laboratories, 25% to commercial companies, and 17% to universities.


Diagrams and Figures provided:


Platforms Documented:


Contact Address:

Cray Research Inc.
1100 Lowater Rd.
Chippewa Falls, Wisconsin 54701
715-726-1211

Cray Research Inc.
1440 Northland Drive
Mendota Heights, MN 55120
612-452-6650

Cray Research 
655F Lone Oak Drive 
Eagan, Minnesota 55121

Cray Research (UK) Ltd
Cray House
London Road
Bracknell
Berkshire RG12 2SY
England
Tel 0344-485971
Telex  848841

See Also:

Cray Research, Inc have a WWW server of their own.


CRAY 1

Overview of Platform:

This machine is no longer being produced, although when first introduced in 1976 (Los Alamos), it was without doubt the fastest processor in the world and is still used as a benchmark for high-speed computing. Since many CRAY customers are currently upgrading their systems to an X-MP, there are opportunities to buy a second-hand CRAY-1S at knockdown prices.

Compute Hardware:

A uniprocessor. Vector processor, uses pipelining and chaining to gain speed. 12.5-nsec clock. Fast scalar. Uses only four chip types with 2 gates per chip. 64-bit word size up to 4 Mwords of storage. The CRAY 1-S has bipolar (in units of 4K RAM), and the newer (1982) CRAY 1-M has MOS memory (in units of 16K RAM).

Interconnect / Communications System:

A uniprocessor.

Memory System:

There is only one pipe from memory-to-vector registers, resulting in a major bottleneck with loads and stores to memory from registers. Loads can be chained with arithmetic operations; stores cannot.

Benchmarks / Compute and data transfer performance:

Low vector start-up times and fast scalar performance make this a very general-purpose machine. Max. performance 160 Mflops; 64-bit arithmetic; max. attainable sustained performance 150 Mflops. There are codes for matrix multiplication and the solution of equations which get close to this. Maximum scalar rate is 80 mips. It is easy to attain over 100 Mflops for certain problems, even using Fortran.
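
The peak figures above follow directly from the clock period. The short sketch below shows the arithmetic; it assumes two floating-point results per clock period (one add and one multiply) and one scalar instruction per clock period, which are assumptions chosen because they reproduce the quoted 160 Mflops and 80 mips. The function name is illustrative only.

```python
def peak_mflops(clock_ns: float, results_per_clock: int = 2) -> float:
    """Peak rate in millions of results per second for a given clock period.

    results_per_clock is an assumption (default: one add plus one multiply).
    """
    clocks_per_second = 1e9 / clock_ns
    return clocks_per_second * results_per_clock / 1e6


# CRAY-1: 12.5 nsec clock, as quoted in the text.
cray1_peak = peak_mflops(12.5)
# Assumed: at most one scalar instruction issued per clock period.
cray1_scalar_mips = peak_mflops(12.5, results_per_clock=1)

print(f"CRAY-1 peak vector rate: {cray1_peak:.0f} Mflops")
print(f"CRAY-1 peak scalar rate: {cray1_scalar_mips:.0f} mips")
```

Under the same assumption, the 4.1 nsec clock of the CRAY-2 and the 8.5 nsec clock of the X-MP give roughly 488 and 235 Mflops per processor, matching the peak figures quoted for those machines.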

The LINPACK benchmark for the Cray-1S system (12.5 ns), using the Fortran compiler cf77 2.1 and problem size n = 1000: the TPP (toward peak performance) result was 110 Mflops, against a theoretical peak of 160 Mflops.

Operating System Software and Environment:

An extensive range of software exists for this machine. Since the instruction set is compatible with the X-MP range, this software will also run on that range.


CRAY 2

Overview of Platform:

This is a 4-processor (quadrant) vector machine with pipelining and overlapping but no chaining. There are more segments in the pipes than in the other CRAYs. Multitasking primitives have the same syntax as on the X-MP.

Compute Hardware:

The system has a 4.1-nsec clock cycle time.

Overheads for vector operations are large:

Recent enhancements to the CRAY-2 include a 512 Mword memory and models with 128 Mword static RAM. Other improvements include implementing functional units in VLSI (and cutting latency time by half), a larger instruction buffer, reduced branch time, and faster issue rates for certain sequences of instructions.

The machine is liquid cooled using inert fluorocarbon.

Memory System:

Memory is 256 Mwords of 256 K DRAM in 128 banks. The bank busy time is 57 clocks, and the scalar memory access time is 59 clocks. Local memory is 16 Kwords, 4 clocks from local memory to vector registers. Vector references from local memory must be with unit stride. There are 8 vector registers each with 64 elements.
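
The bank count and bank busy time matter most for strided vector access. The sketch below is a deliberately simplified model, not a description of the actual CRAY-2 memory controller: it assumes a single access stream issuing one word request per clock, with consecutive word addresses interleaved across the banks. Only the values 128 banks and 57 clocks come from the text; the function names are hypothetical.

```python
from math import gcd

BANKS = 128            # number of memory banks (from the text)
BANK_BUSY_CLOCKS = 57  # bank busy time in clocks (from the text)


def banks_touched(stride: int, banks: int = BANKS) -> int:
    """Number of distinct banks visited by a constant-stride access stream."""
    return banks // gcd(banks, stride)


def words_per_clock(stride: int, banks: int = BANKS,
                    busy: int = BANK_BUSY_CLOCKS) -> float:
    """Sustained words per clock in the simplified single-stream model.

    If the stream cycles through fewer banks than the busy time, each bank
    is revisited before it is ready and the stream stalls.
    """
    return min(1.0, banks_touched(stride, banks) / busy)


for stride in (1, 2, 4, 8, 128):
    print(f"stride {stride:3d}: {banks_touched(stride):3d} banks, "
          f"{words_per_clock(stride):.3f} words/clock")
```

In this model, strides that are large powers of two keep returning to the same few banks, which is why such strides are usually avoided on interleaved memories.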

Benchmarks / Compute and data transfer performance:

Peak performance is 488 Mflops per processor. A matrix multiply code has run at 1.7 Gflops on 4 processors.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software", by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray-2/4-256 system, using the Fortran compiler cf77 3.0 and problem size (n = 1000):

Operating System Software and Environment:

Software includes: UNIX-based OS (called UNICOS); C compiler; CFT2 (Fortran compiler); CFT77.

Networkability/ I/O System / Integrability / Reliability / Scalability:

Cray has an ongoing commitment to high-speed peripherals and fast network links. HSX is a 100 Mbytes/sec link for connecting CRAYs together. CRAYs can be linked to Ultra Corporation's 1.6 Gbit Ultra bus in addition to standard connections with Ethernet (TCP/IP) and VME buses. The DD-40 disks each hold 5 Gbytes and have a transfer rate of 10 Mbytes/sec.

Notable Applications / Customers / Market Sectors:

Delivered: NMFECC, NASA Ames, University of Minnesota, Harwell Laboratory, Stuttgart, and Ecole Polytechnique (Paris).


CRAY XMP

Overview of Platform:

This multiprocessor pipelined vector machine has the same architecture as the CRAY-1. The major differences are that there are three paths from memory to the vector registers, and that the clock cycle time is 8.5 nsec on all machines shipped after August 1986 (machines built before then have a cycle time of 9.5 nsec).

The current machines come with 1, 2, or 4 processors. Gather/scatter hardware is available on the 2- or 4-processor version of the machine. The gather/scatter can be chained to a load/store operation. Users can control all processors through calls in Fortran. The processors share memory.

Memory System:

Main memory - ECL 4K RAMs with 25-nsec access time. (Interleaving to 64 banks is possible.)

Memory of up to 16 M (64-bit) words.

X-MP-2: MOS memory. (Bank busy time is 68 nsec and memory access time is 17 clocks.)

X-MP-4: ECL memory. (Bank busy time is 34 nsec and memory access time is 14 clocks.)

ECL logic with 0.35-0.5 nsec gate delay and 16 gates/chip.

High-speed connection at 1024 Mbytes/sec per channel (max. 2) to a CRAY SSD. The SSD comes in various sizes up to 512 Mwords of secondary MOS memory. Data transfer to the high-speed (1200 Mbyte) DD-49 disk runs at 10 Mbytes/sec. Recent peripheral enhancements are as reported under the CRAY-2.

Benchmarks / Compute and data transfer performance:

Peak of 235 Mflops per processor.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software", by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray X-MP/416 system, using the Fortran compiler CF77 4.0 and problem size (n = 1000):

Networkability/ I/O System / Integrability / Reliability / Scalability:

There are many possible front ends including IBM, CDC, VAX, and Apollo.

Notable Applications / Customers / Market Sectors:

Delivery: Announced in August 1982, first system delivered in June 1983.


CRAY YMP

Overview of Platform:

Models include: Y-MP 2E, 4E, 8E, 8I

This is a multiprocessor pipelined vector machine. It has a similar architecture to the CRAY X-MP. A major difference is the availability of 32- as well as 24-bit addressing. The cycle time is 6 nsec, and it is an 8 processor machine. As in the X-MP there are three paths from memory to the vector registers.

Compute Hardware:

There are only three module types, for the CPU (8 modules), the memory (32 modules, with 1 Mword/module), and the clock (1 module), making 41 modules in all compared with 144 in the X-MP. Each module is on an 11" by 21.2" board. The logic is 2.5-micron ECL in 2500-gate arrays with a gate delay of 350 picoseconds. There are 312 arrays per processor on four PCBs, with a power dissipation of 9 Watts per array.

Interconnect / Communications System:

Architecture up to a maximum of 2GB central shared-memory (in largest processor configuration only).

Memory System:

The processors share a common memory of 32 Mword in bipolar SRAM with a 15 nsec access time and a bank busy time of 102 nsec. Memory is interleaved in 256 banks. Total bandwidth is about 340 Gbits/sec, or roughly 43 Gbytes/sec (32 words per clock period across all processors). The Y-MP comes with a 128 Mword SSD as a standard feature. 2 IOSs can be fitted, each with a 4 Mword buffer memory.
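
The total memory bandwidth can be checked from the word size and the cycle time. The sketch below assumes that the 32 words per clock period are moved by the machine as a whole (all processors together) and uses the 6 nsec cycle time and 64-bit word quoted for the Y-MP; the variable names are illustrative.

```python
WORD_BITS = 64          # word size in bits
CLOCK_NS = 6.0          # Y-MP cycle time in nsec (from the platform overview)
WORDS_PER_CLOCK = 32    # assumed to be the aggregate over all processors

# Bits moved per second when every clock period transfers 32 words.
bits_per_second = WORDS_PER_CLOCK * WORD_BITS / (CLOCK_NS * 1e-9)

gbit_per_s = bits_per_second / 1e9
gbyte_per_s = gbit_per_s / 8

print(f"Aggregate memory bandwidth: {gbit_per_s:.1f} Gbit/s "
      f"= {gbyte_per_s:.1f} Gbyte/s")
```

Under these assumptions the result is about 341 Gbit/s, or about 42.7 Gbyte/s.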

Benchmarks / Compute and data transfer performance:

Peak performance of 4 Gflops. Overall performance of about 30 times a CRAY-1. Each processor should outperform a single X-MP processor by a factor of 1.4 in vector mode (1.2 in scalar).

Data Transfer: up to a maximum of 10.4 GB/s I/O bandwidth (in the largest processor configuration only).

Performance: over 2 Gflops sustained performance is claimed for the full 8-processor Y-MP/8, with 2.13 Gflops achieved on the SLALOM benchmark. Single-CPU Y-MP/8 SLALOM performance was measured as 0.28 Gflops.

The following figures were taken from "Performance of Various Computers Using Linear Equations Software", by Jack Dongarra, Univ. of Tennessee. The LINPACK benchmark for the Cray Y-MP system, using the Fortran compiler CF77 4.0 and problem size (n = 1000):

Operating System Software and Environment:

Programming Environment: Extensive parallel libraries and kernels, with compilers featuring auto-detection and replacement of BLAS1/2/3, LAPACK and FFTLIB routines. The CDBX debugger, along with tools for performance monitoring, analysis, enhancement and visualization, provides a comprehensive programmer support environment. Network interfacing is provided via TCP/IP, NFS, HiPPI, FDDI, etc. Graphics support is provided for AVS, Explorer, X11 and the Cray Visualization Toolkit, etc. The Autotasking Expert System is a proprietary automatic parallel multitasking tool.

COS and UNICOS are both supported. In addition to CFT77 a new Fortran compiler will be available. Performance tools include dynamic and static analysis, tuning, and debugging aids. Can run in X or Y mode, and software can also run (in X mode) on the X-MP.

Software is fully binary-compatible throughout Y-MP range (including C90 and EL systems).

Languages: New ``compiling system'' releases of sophisticated vectorizing and parallelizing compilers for Cray Fortran-77, C and Ada (using Cray extensions and directives), together with existing Pascal and Common Lisp products. Supports IEEE floating-point and Fortran-90 array syntax, along with extensions for variable-length arrays and restricted pointers. Manual multitasking is available through (local) microtasking and (global) macrotasking directives distributed across up to 16 CPUs. Scalar processing is supported via inlining, overlapping instructions and operation grouping; vector processing is supported via pipelined overlap, vector chaining and multiple pipelines.

Networkability/ I/O System / Integrability / Reliability / Scalability:

There are many possible front ends including IBM, CDC, VAX, and Apollo. There are four VHISP channels each rated at 1250 Mbytes/sec, eight 100 Mbytes/sec HISP channels, and eight LOSP 6 Mbytes/sec channels. A full range of disks, tapes, terminals, workstations, and networks (including TCP/IP) is supported.

Cooled using inert fluorocarbon but not with liquid immersion technology (as on CRAY-2).

The dimensions of the machine are 77" x 30" x 75" with a total footprint of 98 sq ft and weighing 5,000 lb.

Support Environment: Low-end machines are air-cooled and use `normal' electricity, but higher-performance, more demanding configurations require special (freon) cooling and power systems. SMARTE (System Maintenance and Remote Test Environment) provides remote diagnostics.

Scalability: Numerous configurations, from uniprocessors to a maximum of 8 custom vector CPUs.

Notable Applications / Customers / Market Sectors:

1 CPU running in 1987; first deliveries in 1988; nine deliveries in 1989; full production and possible enhancements in 1990.


CRAY Extended Architecture

Overview of Platform:

This is a broad product family, ranging from single-processor to two and four processor systems to the 8-processor Y-MP system. The EA series systems feature large central memories (4 million to 64 million 64-bit words). Many of the models are upgradable with regard to the number of CPUs and memory size. CRAY X-MP EA/4 computer systems offer up to ten times greater performance than the original CRAY-1 computer.

One processor with 4 or 16 million words of MOS memory on the X-MP EA/14se and X-MP EA/116se systems; one, two, or four processors sharing 16, 32, or 64 million words of MOS memory; 8.5 nsec clock cycle on X-MP EA systems; 100 nsec clock cycle on X-MP EA/14se and X-MP EA/116se systems; SECDED memory protection; four parallel memory ports per processor; flexible hardware chaining for vector operations; gather/scatter and compressed index vector support; flexible processor clustering for multitasking applications; dedicated registers for efficient interprocessor communications and control; X-mode and Y-mode instruction

Operating System Software and Environment:

Software includes:

Fortran characteristics: ANSI standard compliance; automatic optimization, including vectorizing of DO loops; portability of applications codes; library routines optimized for the EA systems

Networkability/ I/O System / Integrability / Reliability / Scalability:

Several hardware interfaces are available to integrate Cray systems into customer environments:

Linkage to workstations through Network Systems Corporation or similar adapters. Front-end system communication with IBM, CDC, DEC, Honeywell, Data General and UNISYS computer systems. IOS optionally configured with a 32-million-word or 128-million-word SSD in the same cabinet. The machine is liquid cooled using inert fluorocarbon.

Notable Applications / Customers / Market Sectors:

Applications: advanced graphics; applied math; AI; atmospheric and oceanic research; circuit simulation and design; crashworthiness simulation; economic modeling; energy research; financial modeling; fluid dynamics; genetic engineering; molecular dynamics; weather forecasting; signal and image processing.


CRAY YMP/EL

Overview of Platform:

This is an entry level version of the YMP.

Compute Hardware:

See the YMP entry above.

Interconnect / Communications System:

The system can be configured with up to a maximum of 0.5GB of shared-memory.

Benchmarks / Compute and data transfer performance:

133Mflops per CPU claimed peak performance.

A maximum of 1.05GB/s I/O bandwidth.

LINPACK Benchmark (due to Jack Dongarra): for the Y-MP EL (4 proc., 30 ns) with CF77 5.0, the TPP best effort is 345 Mflops and the theoretical peak is 532 Mflops.
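
A common way to read LINPACK figures is as a fraction of theoretical peak. The sketch below computes that fraction for the Y-MP EL figures above and, for comparison, for the Cray-1S figures quoted earlier (110 Mflops achieved against a 160 Mflops peak). The function name is illustrative, not part of any benchmark suite.

```python
def efficiency(achieved_mflops: float, peak_mflops: float) -> float:
    """Fraction of theoretical peak actually achieved."""
    return achieved_mflops / peak_mflops


# Y-MP EL, 4 processors: 345 Mflops achieved, 532 Mflops theoretical peak.
el_eff = efficiency(345, 532)
# The 4-processor peak divided by 4 should match the per-CPU peak claim.
el_peak_per_cpu = 532 / 4

# Cray-1S: 110 Mflops achieved, 160 Mflops theoretical peak.
cray1s_eff = efficiency(110, 160)

print(f"Y-MP EL LINPACK efficiency: {el_eff:.1%}")
print(f"Y-MP EL peak per CPU:       {el_peak_per_cpu:.0f} Mflops")
print(f"Cray-1S LINPACK efficiency: {cray1s_eff:.1%}")
```

The per-CPU value of 133 Mflops agrees with the claimed peak per CPU, and both machines reach roughly two thirds of peak on this problem size.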

Networkability/ I/O System / Integrability / Reliability / Scalability:

One to a maximum of four CMOS vector processors.


CRAY C90

Overview of Platform:

This architecture has a maximum of 2GB of central shared memory, connected to the CPUs, each of which has 2 vector pipes. Memory is protected by SECDED (single-error correction, double-error detection). There is 64-way vector parallelism, with two vector pipes and two functional pipes for each of the 16 possible CPUs.

Memory System:

I/O Subsystem (IOS) with up to 16 I/O clusters, each consisting of up to 16 channel adapters. An optional Solid-state Storage Device (SSD) provides up to 16GB of fast storage for code or swap, with up to four 1800MB/s channels. DD-60 disk drives have a 20MB/s sustained transfer rate and up to 2GB capacity, and support disk striping with 8 disks per IOS/channel.

Benchmarks / Compute and data transfer performance:

Performance: 16 Gflops claimed peak performance, with 13.6 Gflops achieved in the LINPACK massively parallel benchmark.

Data Transfer: 13.6GB/s I/O bandwidth, and over 250GB/s total memory bandwidth.

LINPACK Benchmarks: (Due to Jack Dongarra)

Networkability/ I/O System / Integrability / Reliability / Scalability:

One to a maximum of 16 custom ECL vector CPUs.


CRAY T3E

Overview of Platform:

Scalable parallel systems with configurations from 6 to 2048 processors and corresponding peak performance levels from gigaflops to teraflops. The T3E was built on the scalable architecture of the T3D, which was introduced in 1993. Each system's PEs are configured with one of six memory sizes: 64, 128, 256, or 512 Mbytes, 1 Gbyte, or 2 Gbytes. Within a module, all PEs have the same memory size. The Cray T3E-900 series was introduced later, with about 50 percent higher performance than its predecessor; it is the first in a series of enhancements to the T3E line. The newest enhancement is the T3E-1200 series, with almost twice the performance of the original T3E. The T3E-1200 performance derives mainly from combining ultrafast (600 MHz) microprocessors and interprocessor communications with the operating system and high-performing I/O capabilities.

Information Source: Cray Research.

Compute Hardware:

Liquid-cooled CRAY T3E systems are available in configurations from 64 to 2048 processors, with peak performance levels from 38.4 GFLOPS to 1.2 TFLOPS. Air-cooled configurations are also available. The cabinet footprint area
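
The two peak figures for the liquid-cooled range are consistent with simple linear scaling in the number of processors. The sketch below derives the per-PE peak from the 64-processor figure and scales it up; linear scaling with identical PEs is an assumption of the sketch, and the names are illustrative.

```python
SMALLEST_PES = 64
SMALLEST_PEAK_GFLOPS = 38.4   # quoted peak for the 64-processor system

# Per-PE peak implied by the smallest liquid-cooled configuration.
per_pe_gflops = SMALLEST_PEAK_GFLOPS / SMALLEST_PES


def system_peak_gflops(n_pes: int) -> float:
    """Peak of an n_pes system, assuming identical PEs and linear scaling."""
    return n_pes * per_pe_gflops


largest_peak_tflops = system_peak_gflops(2048) / 1000

print(f"Per-PE peak: {per_pe_gflops * 1000:.0f} Mflops")
print(f"2048-PE peak: {largest_peak_tflops:.2f} TFLOPS")
```

The implied 600 Mflops per PE corresponds to two floating-point results per clock at 300 MHz, and 2048 such PEs give about 1.2 TFLOPS.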

Interconnect / Communications System:

Each T3E topology is a 3-D torus. The data payload bisection bandwidth for 512 PEs is 22 Gbytes/s.
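
In a 3-D torus each PE has six nearest neighbours, and links wrap around at the edges. The sketch below illustrates neighbour lookup and minimum hop count. The 8 x 8 x 8 shape is a hypothetical choice that happens to give 512 PEs; the actual dimensions of any given T3E configuration are not stated here, and the function names are illustrative.

```python
DIMS = (8, 8, 8)  # hypothetical torus shape; 8 * 8 * 8 = 512 PEs


def neighbors(coord, dims=DIMS):
    """Return the six nearest neighbours of a PE, with wraparound links."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            result.append(tuple(n))
    return result


def hops(a, b, dims=DIMS):
    """Minimum number of link traversals between two PEs on the torus."""
    total = 0
    for ai, bi, size in zip(a, b, dims):
        delta = abs(ai - bi)
        # Going the other way around the ring may be shorter.
        total += min(delta, size - delta)
    return total


print(neighbors((0, 0, 0)))
print(hops((0, 0, 0), (7, 4, 1)))
print(sum(size // 2 for size in DIMS))  # largest possible hop count
```

The wraparound links are what distinguish a torus from a plain mesh: with this shape the two corner PEs (0, 0, 0) and (7, 7, 7) are only three hops apart.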

Memory System:

The memory is shared, physically distributed, and globally addressable, built from 16-Mbit or 64-Mbit DRAM, for a total system memory of between 4 Gbytes and 4 Tbytes.

Benchmarks / Compute and data transfer performance:

The NAS Kernel benchmark on the CRAY T3E (300 MHz) gives the following results for kernels run with zero changes (Mflop/s):

Kernel         f90
======        =====
MXM            84.8
CFFT2D         21.2
CHOLSKY        19.7
BTRIX          49.3
GMTRY          67.9
EMIT           61.5
VPENTA         4.2

The CRAY T3E network latency is nominally 1 microsecond. With the various message passing libraries, however, the effective latency can be much larger because of overhead associated with buffering and with deadlock detection. Network latency for message passing libraries:

Library                 Network latency       Bandwidth
                        [microseconds]        [Mbyte/s]
====================    ===============       =========
shared memory access          1                  350
PVM                          11                  150
MPI                          14                  260
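
The latency and bandwidth figures can be combined into a simple linear cost model, time = latency + size / bandwidth, to estimate how long a message of a given size takes. The sketch below is only that textbook model applied to the three rows of the table; it assumes 1 Mbyte = 10**6 bytes and ignores contention, and the function names are illustrative.

```python
# (latency in microseconds, bandwidth in Mbyte/s), taken from the table.
LIBRARIES = {
    "shared memory access": (1.0, 350.0),
    "PVM": (11.0, 150.0),
    "MPI": (14.0, 260.0),
}


def transfer_time_us(nbytes: float, latency_us: float,
                     bandwidth_mbyte_s: float) -> float:
    """Estimated transfer time in microseconds.

    With 1 Mbyte = 10**6 bytes, 1 Mbyte/s equals 1 byte per microsecond.
    """
    return latency_us + nbytes / bandwidth_mbyte_s


def half_performance_bytes(latency_us: float,
                           bandwidth_mbyte_s: float) -> float:
    """Message size at which half the quoted bandwidth is achieved."""
    return latency_us * bandwidth_mbyte_s


for name, (latency, bandwidth) in LIBRARIES.items():
    t = transfer_time_us(1024, latency, bandwidth)
    n_half = half_performance_bytes(latency, bandwidth)
    print(f"{name:22s} 1024 bytes: {t:6.2f} us, "
          f"half-bandwidth size: {n_half:.0f} bytes")
```

In this model, short messages are dominated by latency: a message has to be a few kilobytes long before PVM or MPI reaches even half of its quoted bandwidth.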

Operating System Software and Environment:

The OS used is Cray UNICOS/mk providing a distributed functionality. It is basically a scalable version of the UNICOS OS. UNICOS/mk is divided into servers, which are distributed among the processors of the T3E systems. Local servers process OS requests specific to each user PE, while global servers provide system-wide OS capabilities.

This distribution of functions provides a global view of the computing environment, a single-system image that allows administrators to manage system-wide resources. T3E-1200 as well as T3E-900 systems support both explicit distributed-memory parallelism through CF90 and C/C++ with message passing (MPI, MPI-2, and PVM) and data passing programming models, and implicit parallelism through HPF and the Cray CRAFT work-sharing features.

Networkability/ I/O System / Integrability / Reliability / Scalability:

I/O based on the GigaRing channel, with sustainable bandwidths of 267 MB/s input and output for every four processors. All I/O channels are accessible and controllable from all PEs.

Notable Applications / Customers / Market Sectors:

Applications: CFD, 3D oil exploration, weather simulation, 3D seismic processing, and electromagnetics.

Customers: Department of Defense Naval Oceanographic Office, Electronic Data Systems (EDS), Exxon, German Research Center Forschungszentrum Juelich (KFA), Mobil Oil Company's Exploration and Producing Technical Center, NASA's Goddard Space Flight Center, National Energy Research Scientific Computing Center, National Oceanographic and Atmospheric Administration's Geophysical Fluid Dynamics Laboratory, System Engineering Research Institute in Korea, United Kingdom Meteorological Office, and the U.S. Army High-Performance Computing Center.

Summary:

Cray T3E scalable shared address space machine

Here are the T3E basic characteristics:

More details on the T3E: (Due to Culler et al. and Cray Research)

Please take a look at the product specs available from here.


CRAY T90

Overview of Platform:

This series, based on Cray high-speed vector processors, provides nearly 60 gigaflops (billion calculations per second) of peak computing power on a number of supercomputing applications.

As the entry-level system of the T90 series, the T94 system provides up to 4 processors, 1024 Mbytes of central memory, and a peak performance of approximately 8 GFLOPS.

As the mid-range system of the T90 series, the T916 system provides up to 16 processors, 4096 Mbytes of central memory, and a peak performance of approximately 32 GFLOPS.

As the top-end system of the T90 series, the T932 system provides up to 32 processors, 8192 Mbytes of central memory, and a peak performance of more than 60 GFLOPS.

Compute Hardware:

They are available in three models: the T94, offered in air- or liquid-cooled versions, which scales up to four processors; the T916, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line T932, also liquid-cooled, with up to 32 processors.

Interconnect / Communications System:

The T94, T916 and T932 systems support multiple ATM, FDDI, and HIPPI connections.

Benchmarks / Compute and data transfer performance:

The following LINPACK benchmarking is due to Jack Dongarra, Univ. of Tennessee.

Memory System:

The T932 model has 1024 to 8192 Mbytes of central memory and a memory bandwidth of over 800 Gbytes/s. The T916 model has 2048 to 4096 Mbytes of central memory and a memory bandwidth of more than 400 Gbytes/s. The T94 model has 512 to 1024 Mbytes of central memory and a memory bandwidth of over 100 Gbytes/s.
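
Dividing these figures by the maximum processor count of each model shows how evenly the series scales. The sketch below uses the quoted lower bounds ("over" or "more than") as if they were exact values, which is an assumption, and takes the processor counts from the platform overview; the names are illustrative.

```python
# model: (max processors, max memory in Mbytes, memory bandwidth in Gbytes/s)
T90_MODELS = {
    "T94": (4, 1024, 100),
    "T916": (16, 4096, 400),
    "T932": (32, 8192, 800),
}


def per_processor(model: str):
    """Return (Mbytes per processor, Gbytes/s per processor) for a model."""
    cpus, memory_mbytes, bandwidth_gbytes_s = T90_MODELS[model]
    return memory_mbytes / cpus, bandwidth_gbytes_s / cpus


for model in T90_MODELS:
    memory, bandwidth = per_processor(model)
    print(f"{model:5s} {memory:.0f} Mbytes and {bandwidth:.0f} Gbytes/s "
          f"per processor")
```

On these figures every model offers the same 256 Mbytes of memory and about 25 Gbytes/s of memory bandwidth per processor at its largest configuration.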

Operating System Software and Environment:

The UNICOS operating system is based on UNIX System V. UNICOS is a standard UNIX environment that has been enhanced to provide parallel processing, production-quality resource management, security, and network connectivity.

Networkability/ I/O System / Integrability / Reliability / Scalability:

T932 has an aggregate I/O bandwidth of more than 35 Gbytes/s. The T916 has an aggregate I/O bandwidth greater than 17 Gbytes/s, and the T94 has an aggregate I/O bandwidth of over 8 Gbytes/s.

Notable Applications / Customers / Market Sectors:

In the auto industry, companies such as Nissan and Ford use these systems for engineering and manufacturing applications. Another customer is NOAA which uses them to develop and run weather forecasting models.

Summary:

Please take a look at the product specs:


hawick@npac.syr.edu
saleh@npac.syr.edu