Intel Supercomputing Systems Division

Status:

Active hardware manufacturer.

Overview of Organization:

Intel produce a range of MIMD hypercube machines based on their advanced microprocessor technology. In fact Intel's microprocessors are to be found inside the machines of several other designers of multi-processor machines.

Intel are best known for inventing the microprocessor. Their interest in parallelism was born over ten years ago out of the Cosmic Cube project at Caltech, which used Intel 8086/7 microprocessors to build the first hypercube computer. (The term hypercube refers to the way in which processors are connected to each other for communication purposes. The hypercube topology is highly scalable and allows large numbers of processors to be connected in a network with desirable properties such as a low diameter, a low number of total connections, and a regular shape that is well suited to efficient routing algorithms.) They formed iSC (Intel Scientific Computers) in 1984 to design and market parallel computers. In February 1985 the iPSC (Intel Personal Super Computer) range was announced, commencing with the iPSC/1, followed by the iPSC/2 in early 1988 and the iPSC/860 in January 1990. In 1991 the first Touchstone Delta System was shipped; this was a research prototype for the Paragon XP/S system, which Intel claim can be scaled up to TFLOPS level. Intel's iWARP system is also part of the research strategy leading to the Paragon XP/S.
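The hypercube properties mentioned above can be illustrated with a short sketch. It assumes the usual binary labelling, in which node i of a d-dimensional cube (2^d nodes) is linked to every node whose label differs from i in exactly one bit, and it uses dimension-ordered ("e-cube") routing as a representative routing algorithm. The function names are hypothetical and are not part of any Intel software.

```python
def neighbours(node: int, dim: int) -> list[int]:
    """Nodes directly linked to `node` in a dim-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(dim)]


def hops(a: int, b: int) -> int:
    """Shortest path length: the number of bits in which the labels differ."""
    return bin(a ^ b).count("1")


def ecube_route(src: int, dst: int, dim: int) -> list[int]:
    """Dimension-ordered route: fix the differing bits, lowest bit first."""
    path = [src]
    node = src
    for k in range(dim):
        if (node ^ dst) & (1 << k):
            node ^= 1 << k          # cross the link in dimension k
            path.append(node)
    return path


def summary(dim: int) -> dict[str, int]:
    nodes = 1 << dim
    return {
        "nodes": nodes,
        "links_per_node": dim,            # one channel per dimension
        "diameter": dim,                  # worst case: every bit differs
        "total_links": dim * nodes // 2,  # each link counted once
    }


if __name__ == "__main__":
    print(summary(7))                         # a 128-node cube
    print(ecube_route(0b0000000, 0b1010011, 7))
```

For a 128-node cube this prints `{'nodes': 128, 'links_per_node': 7, 'diameter': 7, 'total_links': 448}` and the route `[0, 1, 3, 19, 83]`: the network diameter grows only with the logarithm of the node count, which is the "low diameter" property claimed above.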

The company's latest HPC product is the ASCI Option Red machine, built for the DOE Accelerated Strategic Computing Initiative (ASCI) and installed at Sandia National Laboratories.

Platforms Documented:

Contact Address:

Headquarters:
15201 NW Greenbrier Parkway
Beaverton, Oregon 97006
USA
Tel: (503) 629-7629

European Headquarters:
Pipers Way
Swindon SN3 1RJ
UK
Tel: 0793 696000

See Also:


iPSC/1

Overview of Platform:

The iPSC/1 architecture is a distributed memory MIMD hypercube. Each node is Intel's 16-bit 80286/7 chip with 500 Kbytes of memory, performing at 1 MIPS peak. The NX operating system is supplied as standard. The iLBX-II expansion interface allows memory or vector modules to be added, boosting capacity and performance.

Topology: A hypercube with an extra node, acting as the Cube Manager, connected to all other nodes. The Cube Manager has access to an Ethernet communications link and to disc and tape drives. Intel produce very little in-house software; the bulk of it is written by software companies with some involvement by iSC.

Operating System: MACH, Express and CrOS III are available.

Communication Paradigms: Extensions for explicit message passing are available.

Languages: C and Fortran. Concurrent Common Lisp, Virdex and Ada are also available.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 128 MIPS or 8 MFLOPS (64-bit) for a 128-node machine. A vector extended 64-node machine has a peak performance of 422.4 MFLOPS.

Data Transfer: Each node has 7 communication channels at a peak bandwidth of 20 Mbyte/s.

Scalability: Scales from 32 to 128 nodes, or from 16 to 64 extended nodes.

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


iPSC/2

Overview of Platform:

The iPSC/2 architecture is a distributed memory MIMD hypercube. Each processor is Intel's 32-bit 80386/7 chip, performing at 4 MIPS peak, with 1, 4, 8 and 16 Mbyte memory options. There is also a 64 Kbyte cache on each node. Nodes can be extended with a scalar arithmetic accelerator and/or a vector module. The NX operating system is replaced by its faster successor, NX/2.

Topology: A hypercube with an extra node, acting as the Cube Manager, which communicates with the other nodes via a spanning tree. The Cube Manager has access to an Ethernet communications link and to disc and tape drives. It runs UNIX with a windowing interface. Space sharing is possible because the Cube Manager can split the cube into sub-cubes that are allocated to users separately. The Concurrent File System enables nodes to access disc drives without going through the Cube Manager.
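Space sharing works because any aligned block of 2^k node numbers that share the same high-order bits is itself a k-dimensional sub-cube. The sketch below shows one simple way such a split could be managed, a first-fit search over aligned blocks. It is an illustration under that assumption, not the Cube Manager's actual allocation algorithm, and the class and method names are hypothetical.

```python
from typing import Optional


class SubcubeAllocator:
    """Hypothetical first-fit allocator of aligned sub-cubes in a 2**dim cube."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.owner: list[Optional[str]] = [None] * (1 << dim)

    def allocate(self, k: int, user: str) -> Optional[range]:
        """Give `user` a free k-dimensional sub-cube, or return None."""
        if not 0 <= k <= self.dim:
            raise ValueError("sub-cube dimension out of range")
        size = 1 << k
        # Bases are multiples of 2**k: the low k bits take every value while
        # the high bits stay fixed, so each block is a true sub-cube.
        for base in range(0, len(self.owner), size):
            block = range(base, base + size)
            if all(self.owner[n] is None for n in block):
                for n in block:
                    self.owner[n] = user
                return block
        return None

    def release(self, user: str) -> None:
        """Return every node held by `user` to the free pool."""
        self.owner = [None if o == user else o for o in self.owner]


if __name__ == "__main__":
    cube = SubcubeAllocator(dim=5)       # a 32-node machine
    print(cube.allocate(4, "alice"))     # range(0, 16)
    print(cube.allocate(3, "bob"))       # range(16, 24)
    print(cube.allocate(4, "carol"))     # None: no free aligned 16-node block
```

Restricting requests to aligned power-of-two blocks keeps every allocation a genuine sub-cube, at the cost of some fragmentation: in the example eight nodes are still free, but no 16-node sub-cube can be formed until a user releases theirs.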

Operating System: MACH, Express and CrOS III are available.

Communication Paradigms: Extensions for explicit message passing are available.

Languages: C and Fortran. Concurrent Common Lisp, Virdex and Ada are also available.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 512 MIPS or 27 MFLOPS (64-bit) for the maximum configuration of 128 nodes with 1 Gbyte of memory. The scalar arithmetic expansion trebles this performance, and a vector extended 64-node machine has a peak performance of 422.4 MFLOPS.

Data Transfer: Each node has 8 bi-directional communication channels at a peak bandwidth of 2.8 Mbyte/s with wormhole routing.
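Wormhole routing matters because a message is pipelined through intermediate nodes instead of being stored in full at each hop. The sketch below compares the two approaches using a common first-order latency model, which is assumed here rather than taken from Intel's documentation. The 2.8 Mbyte/s channel bandwidth comes from the text; the per-hop delay is a hypothetical placeholder, and 1 Mbyte is taken as 10^6 bytes.

```python
def store_and_forward_us(length_bytes: int, hops: int,
                         bw_mbyte_s: float, hop_delay_us: float) -> float:
    """Whole message is received at every hop before being forwarded."""
    transfer_us = length_bytes / bw_mbyte_s   # 1 Mbyte/s == 1 byte per us
    return hops * (hop_delay_us + transfer_us)


def wormhole_us(length_bytes: int, hops: int,
                bw_mbyte_s: float, hop_delay_us: float) -> float:
    """Header pays the per-hop delay; the body streams behind it."""
    transfer_us = length_bytes / bw_mbyte_s
    return hops * hop_delay_us + transfer_us


if __name__ == "__main__":
    BW = 2.8          # Mbyte/s per channel, as quoted above
    HOP_DELAY = 1.0   # microseconds per hop: hypothetical value
    for h in (1, 3, 7):
        saf = store_and_forward_us(4096, h, BW, HOP_DELAY)
        wh = wormhole_us(4096, h, BW, HOP_DELAY)
        print(f"{h} hops: store-and-forward {saf:8.0f} us, wormhole {wh:8.0f} us")
```

Under this model the transfer term is paid once with wormhole routing but once per hop with store-and-forward, so wormhole latency is nearly independent of distance for long messages; this is why the number of hops in a hypercube or mesh matters much less on these machines.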

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


Touchstone Gamma

Overview of Platform:

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


iPSC/860

Overview of Platform:

The iPSC/860 architecture is a distributed memory MIMD hypercube. Each node is an Intel 32-bit 80860 (i860) RISC chip, performing at 60 MFLOPS peak, with 8 to 16 Mbyte of memory. Features include pipelining and instruction caching, but these make it difficult to approach peak performance in practice. As on the iPSC/2, the NX/2 kernel runs on each node, but multiple processes per node are not allowed because of the heavy cost of context switching.

Topology: A hypercube with an extra node, acting as the Cube Manager, which communicates with the other nodes via a spanning tree. The Cube Manager has access to an Ethernet communications link and to disc and tape drives. It runs UNIX with a windowing interface. Space sharing is possible because the Cube Manager can split the cube into sub-cubes that are allocated to users separately. The Concurrent File System enables nodes to access disc drives without going through the Cube Manager.

Operating System: MACH, Express and CrOS III are available.

Communication Paradigms: Extensions for explicit message passing are available.

Languages: C and Fortran.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 7.6 GFLOPS (64-bit) for the maximum configuration of a 128-node machine with 2 Gbytes of memory. Up to 165 Gbytes of disc space can be accessed.
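The peak figures quoted for these machines are the per-node figure multiplied by the node count. A quick check for the iPSC/860, using the 60 MFLOPS and 16 Mbyte per-node values given above and assuming 1 Gbyte = 1024 Mbyte (the helper name is hypothetical):

```python
def aggregate(nodes: int, mflops_per_node: float,
              mbyte_per_node: int) -> tuple[float, float]:
    """Return (peak GFLOPS, total memory in Gbyte) for identical nodes."""
    gflops = nodes * mflops_per_node / 1000.0
    gbytes = nodes * mbyte_per_node / 1024.0
    return gflops, gbytes


peak, mem = aggregate(128, 60.0, 16)
print(f"{peak:.2f} GFLOPS, {mem:.0f} Gbyte")  # 7.68 GFLOPS, 2 Gbyte
```

The quoted 7.6 GFLOPS is this 7.68 GFLOPS figure rounded down, and 128 nodes at the largest 16 Mbyte memory option give the quoted 2 Gbytes.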

Data Transfer: Each node has 8 bi-directional communication channels at a peak bandwidth of 2.8 Mbyte/s with wormhole routing.

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


iWARP

Overview of Platform:

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


Touchstone Delta

Overview of Platform:

The Touchstone Delta System

It should be noted that this is a research prototype for the Paragon system and is not intended for commercial production.

Architecture: Distributed memory MIMD with a mesh interconnect.

Node: Intel's 32-bit 80860 RISC chip, performing at 60 MFLOPS peak, with 8 to 16 Mbytes of memory. Features include pipelining and instruction caching, but these make it difficult to approach peak performance in practice.

Topology: A mesh with wormhole routing.
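Wormhole-routed meshes commonly use dimension-order (XY) routing: a message first travels along X until it reaches the destination column, then along Y. Because messages never turn from Y back to X, this order avoids cyclic channel dependencies, which is one reason it is popular for wormhole networks. The sketch below assumes a two-dimensional mesh and XY order as a representative scheme; the text does not say which routing order the Delta's routers use, and the function name is hypothetical.

```python
Coord = tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> list[Coord]:
    """Dimension-order route on a 2-D mesh: move in X first, then in Y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:          # travel along the row
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:          # then along the column
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

The hop count is the Manhattan distance between the two nodes, so unlike a hypercube the worst-case distance grows with the side length of the mesh; wormhole routing keeps that extra distance from dominating message latency.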

Operating System: MACH, Express and CrOS III are available.

Communication Paradigms: Extensions for explicit message passing are available.

Languages: C and Fortran.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 32 GFLOPS for the maximum configuration of 484 nodes. At the time of writing, Delta had achieved the highest LINPACK rating recorded, 13.9 GFLOPS, and until recently held the record for the SLALOM benchmark, with 5750 patches.
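A useful way to read the LINPACK figure is as a fraction of peak (Rmax/Rpeak). Using only the two numbers quoted above, a minimal sketch (the function name is hypothetical):

```python
def linpack_efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """Fraction of theoretical peak achieved on the LINPACK benchmark."""
    return rmax_gflops / rpeak_gflops


print(f"Delta LINPACK efficiency: {linpack_efficiency(13.9, 32.0):.1%}")  # 43.4%
```

An efficiency well below 100% is consistent with the earlier remark that the i860's pipelining and caching make it hard to approach peak performance.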

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


Intel Paragon

Overview of Platform:

The Paragon XP/S architecture is distributed memory MIMD, with the nodes connected in a mesh.

Each node is an Intel i860 XP chip with on-chip 16-Kbyte instruction and data caches, capable of 42 MIPS and 75 MFLOPS (double precision). Floating-point unit to cache bandwidth peaks at 800 Mbyte/s. Each node runs an implementation of UNIX based on MACH.

Topology: A mesh with wormhole routing.

Operating System: MACH, Express and CrOS III are available.

Communication Paradigms: Extensions for explicit message passing are available.

Languages: C and Fortran.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 300 GFLOPS for the maximum configuration of approximately 4000 nodes (at 75 MFLOPS per node).

Compute Hardware:

Interconnect / Communications System:

Memory System:

Benchmarks / Compute and data transfer performance:

Operating System Software and Environment:

Networkability/ I/O System / Integrability / Reliability / Scalability:

Notable Applications / Customers / Market Sectors:

Overall Comments:


ASCI Option Red Supercomputer

All four rows of the entire system are pictured below:

A schematic diagram of the entire system is shown below.

Overview of Platform:

The ASCI Option Red Supercomputer is a Massively Parallel Processor (MPP) with a distributed memory Multiple-Instruction, Multiple Data (MIMD) architecture. All aspects of this system are scalable including the aggregate communication bandwidth, the number of compute nodes, the amount of main memory, disk storage capacity, and I/O bandwidth. The ASCI Option Red has 4536 compute nodes, 596 Gbyte of RAM, and two independent 1 Tbyte disk systems.

Programming Environment: Tools are available for debugging, code parallelizing and profiling.

Performance: Peak performance is 1.8 TFLOPS for the full system of 4536 compute nodes (400 MFLOPS per compute node).

Compute Hardware:

Boards and ICF units, as shown in the figure above, are packaged into cabinets and organized into the full system. Each cabinet contains a power supply, four card cages and a fan unit. A card cage holds a combination of eight Kestrel or Eagle node boards. The overall system has the following characteristics:

Compute nodes: 4536
Service nodes: 32
Disk I/O nodes: 32
System nodes: 2
Network nodes: 10
Processors: 9216
Cabinets: 85
System footprint: 1600 square feet
System RAM: 594 Gbytes
Topology: 38x32x2 mesh
Node-to-node bandwidth: 800 Mbyte/s
Bi-directional cross-section bandwidth: 51.6 Gbyte/s
Processor-to-memory bandwidth: 533 Mbyte/s
Compute node peak performance: 400 MFLOPS
System peak performance: 1.8 TFLOPS
RAID I/O bandwidth: 1.0 Gbyte/s
RAID storage: 1 Tbyte per subsystem
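The peak figures above are consistent with each other: 400 MFLOPS per compute node over 4536 compute nodes gives about 1.8 TFLOPS. The breakdown below additionally assumes two 200 MHz Pentium Pro processors per compute node, each completing at most one 64-bit floating-point operation per cycle. Those per-processor values are assumptions, not figures from this document, though two processors per node matches the 9216 processor total.

```python
# Assumptions (not stated in the text above):
MHZ_PER_PROCESSOR = 200      # Pentium Pro clock rate
FLOPS_PER_CYCLE = 1          # peak 64-bit FP operations per cycle
PROCESSORS_PER_NODE = 2      # consistent with 9216 processors in total

# From the figures above:
COMPUTE_NODES = 4536

node_mflops = MHZ_PER_PROCESSOR * FLOPS_PER_CYCLE * PROCESSORS_PER_NODE
system_tflops = COMPUTE_NODES * node_mflops / 1e6   # MFLOPS -> TFLOPS
print(node_mflops, round(system_tflops, 2))  # 400 1.81
```

The quoted 1.8 TFLOPS system peak counts the compute nodes only; the service, I/O, system and network nodes are not included.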

Interconnect / Communications System:

The system's 9,216 Pentium Pro processors, with 596 Gbytes of RAM, are connected through a 38x32x2 mesh.

The interconnection facility (ICF) shown in the figure above uses a dual-plane mesh to provide better aggregate bandwidth and to support routing around mesh failures. It uses two custom components: the NIC and the MRC. The MRC sits on the system backplane and routes messages across the machine. It supports bi-directional bandwidths of up to 800 Mbytes/s over each of six ports (one in each direction along the X, Y and Z axes). Each port is composed of four virtual lanes that equally share the total bandwidth, so as many as four streams can pass through an MRC on any given port at any given time. The NIC resides on each node and provides an interface between the node's memory bus and the MRC.
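Because the four virtual lanes on a port share its bandwidth equally, each stream's share depends on how many lanes are in use. The sketch below assumes that bandwidth left idle by unused lanes is redistributed among the active ones; if the split is instead fixed, every stream simply receives 200 Mbyte/s. The 800 Mbyte/s port figure and the four lanes come from the text, while the function name is hypothetical.

```python
PORT_BANDWIDTH_MB_S = 800.0   # per MRC port, from the text
LANES_PER_PORT = 4            # virtual lanes per port, from the text


def per_stream_bandwidth(active_streams: int) -> float:
    """Bandwidth (Mbyte/s) each stream gets on one port.

    Assumption: bandwidth is divided equally among the lanes in use.
    """
    if not 1 <= active_streams <= LANES_PER_PORT:
        raise ValueError("a port carries between 1 and 4 concurrent streams")
    return PORT_BANDWIDTH_MB_S / active_streams


if __name__ == "__main__":
    for n in range(1, LANES_PER_PORT + 1):
        print(n, "streams:", per_stream_bandwidth(n), "Mbyte/s each")
```

With all four lanes busy each stream receives 200 Mbyte/s under either interpretation; a fifth stream wanting the same port has to wait for a lane to become free.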

NICs can also be chained: the NIC on one node (the outer node) is connected to the NIC on a second node (the inner node), which in turn is connected to the MRC.

Memory System:

Benchmarks / Compute and data transfer performance:

No benchmark data available.

Operating System Software and Environment:

The OS used on the service and system nodes is the one used on the Paragon: a distributed version of UNIX (POSIX 1003.1 and XPG3, System V.3 and 4.3 BSD Reno VFS), here called TFLOPS OS.

The system uses different operating systems for different parts of the machine. The compute nodes run a small OS called Cougar, while the service nodes (interactive user services) and system nodes (booting services) run the distributed UNIX OS.

Programming Environment: Tools are available for debugging, code parallelizing (using MPI, for example) and profiling.

Languages: Fortran 77, Fortran 90, C and C++ compilers from PGI are available on the system. For data-parallel programming, PGI HPF is also supported.

Networkability/ I/O System / Integrability / Reliability / Scalability:

The system uses a split-plane mesh topology and has four partitions: system, service, I/O and compute. The I/O partitions implement scalable file and network services; each end of the computer has its own I/O subsystem.

Notable Applications / Customers / Market Sectors:

The main customer for this system is the US Department of Energy (DOE), and the reason for developing such a large system is the application it must support: maintaining the US nuclear stockpile without nuclear testing, using science-based testing instead.

DOE scientists have determined that these nuclear simulations require computers capable of 100 TFLOPS. In response, the DOE launched a 5-year, 900 million dollar program in 1995 to accelerate the development of extreme-scale, massively parallel supercomputers, with the goal of having a 100 TFLOPS computer early in the next century. This program is the Accelerated Strategic Computing Initiative (ASCI). ASCI will produce a series of machines leading to the 100 TFLOPS machine, and ASCI Option Red is the first of them.

Overall Comments:


Pentium Pro 4-processor "quad pack"

Overview of the Platform:

A rough sketch of the quad pack organization is as follows:

This is the kind of Pentium Pro motherboard used in many multiprocessor servers nowadays. This diagram shows:


hawick@npac.syr.edu
saleh@npac.syr.edu