Intel are best known for inventing the microprocessor. Their interest
in parallelism was born over ten years ago out of the Cosmic Cube
project at CalTech, which used Intel 8086/7 microprocessors to build
the first hypercube computer. (The term hypercube refers to the way in
which processors are connected to each other for
communication purposes. The hypercube topology is highly scalable and
allows large numbers of processors to be connected in a network with
desirable properties such as a low diameter, a low number of total
connections and a regular shape that is well suited to efficient
routing algorithms.) They formed iSC (Intel Scientific Computers) in
1984 to design and market parallel computers. In February 1985 the
iPSC (Intel Personal Super Computer) range was announced, commencing
with the iPSC/1, followed by the iPSC/2 in early 1988 and the iPSC/860
in January 1990.
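The hypercube properties mentioned above can be illustrated with a short sketch. It assumes the usual labelling in which each node of a d-dimensional cube carries a d-bit number and two nodes are linked when their labels differ in exactly one bit. The function names are illustrative and do not come from any Intel software.

```python
def hypercube_neighbours(node: int, dim: int) -> list[int]:
    """Return the nodes directly linked to `node` in a `dim`-dimensional hypercube."""
    # Two nodes are linked when their binary labels differ in exactly one bit,
    # so flipping each of the `dim` bits in turn enumerates the neighbours.
    return [node ^ (1 << bit) for bit in range(dim)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of links between nodes a and b (their Hamming distance)."""
    return bin(a ^ b).count("1")


dim = 7                      # a 7-dimensional cube
nodes = 1 << dim             # 2**7 = 128 nodes
links = dim * nodes // 2     # each node has `dim` links, each link is shared by two nodes
diameter = max(hop_distance(0, n) for n in range(nodes))

print(nodes, links, diameter)          # 128 448 7
print(hypercube_neighbours(5, 3))      # [4, 7, 1]
```

With 128 nodes the longest route is only 7 hops and each node needs only 7 links, which is what is meant by a low diameter combined with a low number of connections.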
In 1991 the first Touchstone Delta System was shipped, this being a
research prototype for the Paragon XP/S system which Intel claim can be
scaled up to TFLOPS level. Intel's iWARP system is also part of the
research strategy to produce the Paragon XP/S.
The company's latest HPC product is the DOE Accelerated Strategic
Computing Initiative (ASCI) machine, which was installed at Sandia
National Laboratories.
iPSC/1
Overview of Platform:
The iPSC/1 architecture is a distributed memory MIMD hypercube.
Each node is Intel's 16-bit 80286/7 chip with 500 Kbytes of
memory on each node, performing at 1 MIPS peak. The NX Operating System
is supplied as standard. The iLBX-II expansion interface allows memory
or vector modules to be added, boosting capacity and performance.
Topology: A hypercube with an extra node connected to all
other nodes acting as the Cube Manager. The Cube Manager has access to
an Ethernet communications link and disc and tape drives.
Intel produce very little in-house software, the bulk of it being done
by software companies with some involvement by iSC.
Operating System: MACH, Express and CrOS III are available.
Communication Paradigms: Extensions for explicit message passing
are available.
Languages: C and Fortran. Concurrent Common Lisp, Virdex and
Ada are also available.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 128 MIPS or 8 MFLOPS (64-bit)
for a 128-node machine. A vector extended 64-node machine has a peak
performance of 422.4 MFLOPS.
Data Transfer: Each node has 7 communication channels at a
peak bandwidth of 20 Mbyte/s.
Scalability: Scales from 32 to 128 nodes or 16 to 64 extended nodes.
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
iPSC/2
Overview of Platform:
The iPSC/2 architecture is a distributed memory MIMD hypercube.
Each processor is Intel's 32-bit 80386/7 chip, performing at 4
MIPS peak, with 1, 4, 8 and 16 Mbyte memory options. There is also a 64
Kbyte cache on each node. Extension can be by means of a scalar
arithmetic accelerator and/or a vector module. The NX operating system
is replaced with the faster NX/2.
Topology: A hypercube with an extra node, acting as the Cube Manager,
which communicates with the other nodes via a spanning tree. The Cube
Manager has access to an Ethernet communications link and disc and tape
drives. It runs UNIX with a windows interface. Space sharing is
possible because the Cube Manager is able to split the cube into
sub-cubes which can be allocated to users separately. The Concurrent
File System enables nodes to access disc drives without using the Cube
Manager.
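The sub-cube idea behind space sharing can be sketched as follows. This is a minimal illustration, not the Cube Manager's actual allocation method, which the text does not describe. It assumes that a sub-cube is an aligned block of node labels sharing all address bits above the lowest `sub_dim` bits; the names `subcube_nodes` and `partition_cube` are hypothetical.

```python
def subcube_nodes(base: int, sub_dim: int) -> list[int]:
    """Nodes of the sub-cube of dimension `sub_dim` that contains `base`.

    Assumed convention: a sub-cube is the set of nodes that share all
    address bits above the lowest `sub_dim` bits.
    """
    start = (base >> sub_dim) << sub_dim     # clear the low `sub_dim` bits
    return list(range(start, start + (1 << sub_dim)))


def partition_cube(dim: int, sub_dim: int) -> list[list[int]]:
    """Split a `dim`-cube into disjoint sub-cubes of dimension `sub_dim`."""
    if not 0 <= sub_dim <= dim:
        raise ValueError("sub_dim must lie between 0 and dim")
    step = 1 << sub_dim
    return [subcube_nodes(start, sub_dim) for start in range(0, 1 << dim, step)]


parts = partition_cube(dim=4, sub_dim=2)   # a 16-node cube split into 4-node sub-cubes
print(len(parts))      # 4
print(parts[1])        # [4, 5, 6, 7]
```

Because the nodes within each block differ only in their low bits, every block is itself a complete smaller hypercube, so a user allocated one block sees the same topology at a smaller scale.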
Operating System: MACH, Express and CrOS III are available.
Communication Paradigms: Extensions for explicit message passing
are available.
Languages: C and Fortran. Concurrent Common Lisp, Virdex and
Ada are also available.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 512 MIPS or 27 MFLOPS
(64-bit) for a maximum configuration of a 128-node machine with 1 Gbyte
of memory. The scalar arithmetic expansion trebles this performance and a
vector extended 64-node machine has a peak performance of 422.4 MFLOPS.
Data Transfer: Each node has 8 bi-directional communication
channels at a peak bandwidth of 2.8 Mbyte/s with wormhole routing.
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
Touchstone Gamma
Overview of Platform:
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
iPSC/860
Overview of Platform:
The iPSC/860 architecture is a distributed memory MIMD hypercube.
Each node is Intel's 32-bit 80860 RISC chip, performing at
60 MFLOPS peak, with 8 to 16 Mbytes of memory. Features include
pipelining and instruction caching but these make it difficult to
approach peak performance. Again the NX/2 kernel runs on each node but
multiple processes per node are not allowed because of the heavy cost of
context switching.
Topology: A hypercube with an extra node, acting as the Cube Manager,
which communicates with the other nodes via a spanning tree. The Cube
Manager has access to an Ethernet communications link and disc and tape
drives. It runs UNIX with a windows interface. Space sharing is
possible because the Cube Manager is able to split the cube into
sub-cubes which can be allocated to users separately. The Concurrent
File System enables nodes to access disc drives without using the Cube
Manager.
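One common way to build a spanning tree over a hypercube is the binomial tree sketched below. The text does not say which tree the Cube Manager uses, so this is an assumption made for illustration. It also assumes that the message enters the cube at node 0, and the function names are hypothetical.

```python
def tree_children(node: int, dim: int) -> list[int]:
    """Children of `node` in a binomial spanning tree rooted at node 0."""
    # A child is obtained by setting one bit above the node's highest set bit,
    # so every tree edge is also a hypercube link (labels differ in one bit).
    return [node | (1 << bit) for bit in range(node.bit_length(), dim)]


def broadcast_schedule(dim: int) -> list[list[tuple[int, int]]]:
    """List, per round, the (sender, receiver) pairs of a broadcast from node 0."""
    have = [0]                       # nodes that already hold the message
    rounds = []
    for bit in range(dim):
        sends = [(n, n | (1 << bit)) for n in have]
        rounds.append(sends)
        have += [dst for _, dst in sends]
    return rounds


for number, sends in enumerate(broadcast_schedule(3), start=1):
    print(number, sends)
# 1 [(0, 1)]
# 2 [(0, 2), (1, 3)]
# 3 [(0, 4), (1, 5), (2, 6), (3, 7)]
```

The number of informed nodes doubles each round, so a cube of dimension d is covered in d rounds.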
Operating System: MACH, Express and CrOS III are available.
Communication Paradigms: Extensions for explicit message passing
are available.
Languages: C and Fortran.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 7.6 GFLOPS (64-bit) for the
maximum configuration of a 128-node machine with 2 Gbytes of memory. Up
to 165 Gbytes of disc space can be accessed.
Data Transfer: Each node has 8 bi-directional communication
channels at a peak bandwidth of 2.8 Mbyte/s with wormhole routing.
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
iWARP
Overview of Platform:
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
Touchstone Delta
Overview of Platform:
It should be noted that the Touchstone Delta System is a research
prototype for the Paragon system and is not intended for commercial
production.
Architecture: Distributed memory MIMD mesh.
Node: Intel's 32-bit 80860 RISC chip, performing at
60 MFLOPS peak, with 8 to 16 Mbytes of memory. Features include
pipelining and instruction caching but these make it difficult to
approach peak performance.
Topology: A mesh with wormhole routing.
Operating System: MACH, Express and CrOS III are available.
Communication Paradigms: Extensions for explicit message passing
are available.
Languages: C and Fortran.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 32 GFLOPS for the maximum
configuration of 484 nodes. Delta has achieved the highest LINPACK
rating ever, with 13.9 GFLOPS, and until recently held the record for
the SLALOM benchmark, with 5750 patches.
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
Intel Paragon
Overview of Platform:
The Paragon XP/S architecture is a distributed memory MIMD mesh.
Each node is an Intel i860 XP chip with on-chip 16-Kbyte
instruction and data caches, capable of 42 MIPS and 75 MFLOPS (double
precision). Floating point unit to cache bandwidth peaks at 800
Mbytes/s. Each node runs an implementation of UNIX based on MACH.
Topology: A mesh with wormhole routing.
Operating System: MACH, Express and CrOS III are available.
Communication Paradigms: Extensions for explicit message passing
are available.
Languages: C and Fortran.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 300 GFLOPS for the maximum
configuration of approximately 1000 nodes.
Compute Hardware:
Interconnect / Communications System:
Memory System:
Benchmarks / Compute and data transfer performance:
Operating System Software and Environment:
Networkability/ I/O System / Integrability / Reliability / Scalability:
Notable Applications / Customers / Market Sectors:
Overall Comments:
ASCI Option Red Supercomputer
[Figure: photograph showing all four rows of the entire system]
[Figure: schematic diagram of the entire system]
Overview of Platform:
The ASCI Option Red Supercomputer is a Massively Parallel Processor
(MPP) with a distributed memory Multiple-Instruction, Multiple Data
(MIMD) architecture. All aspects of this system are scalable including
the aggregate communication bandwidth, the number of compute nodes,
the amount of main memory, disk storage capacity, and I/O bandwidth.
The ASCI Option Red has 4536 compute nodes, 596 Gbyte of RAM, and two
independent 1 Tbyte disk systems.
- The Intel Pentium Pro processor is used throughout, with two
processors per node. The instruction set for the Pentium Pro processor
is basically the same as the IA-32 instructions used on a Pentium
processor.
- The processors run at 200 MHz, so the peak floating point rate
is 200 MFLOPS per processor.
- Each Kestrel board holds two compute nodes, as shown below:
- The node boards used in the I/O and system partitions are called
Eagle boards, as shown below:
- The nodes are connected into a 38x32x2 topology.
- The OS used is identical to the one used for Paragon. It is a
distributed version of Unix (POSIX 1003.1 and XPG3, System V.3 and 4.3
BSD Reno VFS). It is called TFLOPS OS.
Programming Environment: Tools are available for debugging,
code parallelizing and profiling.
Performance: Peak performance is 1.8 TFLOPS for the full system,
which has 4536 compute nodes and 9216 processors.
Compute Hardware:
Boards and ICF units as shown in the above figure are packaged into
cabinets and organized into a full system. Each cabinet contains a power
supply, four card cages, and a fan unit. A card cage holds a
combination of eight Kestrel or Eagle node boards. The overall system
has:
- 4536 compute nodes, 32 service nodes, 32 disk I/O nodes, 2 system
nodes and 10 network nodes
- 9216 processors
- 85 cabinets and a system footprint of 1600 square feet
- a system RAM of 594 Gbytes
- a 38x32x2 topology
- node to node bandwidth of 800 MB/sec
- 51.6 GB/sec bi-directional cross section bandwidth
- 533 MB/sec processor to memory bandwidth
- 400 MFLOPS compute node peak performance
- 1.8 TFLOPS system peak performance
- 1.0 Gbyte/sec RAID I/O bandwidth
- 1 Tbyte of RAID storage (per subsystem)
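The peak figures quoted above follow from simple multiplication, as the sketch below shows. It assumes one floating point result per processor clock cycle, which is what the quoted 200 MFLOPS at 200 MHz implies, and it counts compute nodes only. The constant names are illustrative.

```python
CLOCK_MHZ = 200            # processor clock rate
FLOPS_PER_CYCLE = 1        # assumption: one floating point result per clock
PROCESSORS_PER_NODE = 2    # two Pentium Pro processors per node
COMPUTE_NODES = 4536       # compute nodes in the full system

processor_mflops = CLOCK_MHZ * FLOPS_PER_CYCLE
node_mflops = processor_mflops * PROCESSORS_PER_NODE
system_tflops = node_mflops * COMPUTE_NODES / 1e6   # MFLOPS to TFLOPS

print(processor_mflops, node_mflops, round(system_tflops, 2))  # 200 400 1.81
```

The result agrees with the quoted 400 MFLOPS per compute node and roughly 1.8 TFLOPS for the system.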
Interconnect / Communications System:
The system's 9,216 Pentium Pro processors with 596 Gbytes of RAM are
connected through a 38x32x2 mesh.
The interconnection facility (ICF) shown in the above figure utilizes
a dual plane mesh to provide better aggregate bandwidth and to support
routing around mesh failures. It uses two custom components: NIC and
MRC. The MRC sits on the system back-plane and routes messages across
the machine. It supports bi-directional bandwidths of up to 800
Mbytes/sec over each of six ports (i.e. two directions for each of the
X, Y, and Z axes). Each port is composed of four virtual lanes that equally
share the total bandwidth. This means that as many as four streams can
pass through an MRC on any given port at any given time. The NIC
resides on each node and provides an interface between the node's
memory bus and the MRC.
The NIC can be connected to another NIC: the NIC on one node, the outer
node, is connected to the NIC on the other node, the inner node, which is
then connected to the MRC.
Memory System:
- The 9216 Pentium Pro processors have 596 Gbytes of RAM.
- Each processor has separate on-chip data and instruction L1 caches
(each of which is 8 Kbytes).
- It also has an L2 cache (256 Kbytes) packaged with the CPU.
- The processor bus supports memory and cache coherency for up to 4
processors. There are two processors per node.
- The memory subsystem on an individual compute node is implemented
using Intel's CCOTS Pentium Pro processor support chip-set. It is
structured as two rows of four, independently-controlled,
sequentially-interleaved banks of DRAM to produce up to 533 MB/sec of
data throughput. Each bank of memory is 72 bits wide, allowing for 64
data bits plus 8 bits of ECC, which provides single bit error correction
and multiple bit error detection.
- The banks are implemented as two 36-bit SIMMs, so industry
standard SIMM modules may be used. Using commonly available 4 and 8
MByte SIMMs (based on 4Mx4 DRAM chips), 32 MB to 256 MB of memory per
node is supported.
- The system was delivered with 128 Mbytes/node.
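The 72-bit bank width is consistent with a standard single-error-correcting, double-error-detecting (SECDED) code, as the sketch below shows. The text does not name the code used by the chip-set, so the Hamming-style construction here is an assumption, and the function names are illustrative.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Check bits for SECDED: the Hamming bits plus one overall parity bit."""
    return sec_check_bits(data_bits) + 1


data_bits = 64
check_bits = secded_check_bits(data_bits)
print(check_bits, data_bits + check_bits)   # 8 72
```

Seven check bits are enough to locate any single flipped bit in a 71-bit word, and the eighth bit (overall parity) allows double-bit errors to be detected rather than miscorrected.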
Benchmarks / Compute and data transfer performance:
No benchmark data available.
Operating System Software and Environment:
The system uses different operating systems for different parts of the
machine. The nodes involved with computation run a small OS called
Cougar. The nodes that support interactive user services (service nodes)
and booting services (system nodes) run a distributed Unix OS.
Programming Environment: Tools are available for debugging,
code parallelizing (using MPI, for example) and profiling.
Programming languages supported: Fortran77, Fortran90, C, and C++
compilers from PGI are available on the system. For data-parallel
programming, PGI HPF is also supported and available on the system.
Networkability/ I/O System / Integrability / Reliability / Scalability:
The system uses a split-plane mesh topology and has 4 partitions:
system, service, I/O and compute. I/O partitions implement scalable
file and network services. Each end of the computer has its own I/O
subsystem.
Notable Applications / Customers / Market Sectors:
The main customer of this system is the US Dept. of Energy (DOE), and
the reason for developing such a large system is the importance of the
application the system needs to work on. That application is the
maintenance of the U.S. nuclear stockpile without nuclear testing;
science-based testing will be used instead.
DOE scientists have determined that they can only run these nuclear
simulations if they have 100 TFLOPS computers. In response to this
need, the DOE launched a 5-year, 900 million dollar program in 1995 to
accelerate the development of extreme scale, massively parallel
supercomputers with the goal of having a 100 TFLOPS computer early
next century. This program is called the Accelerated Strategic
Computing Initiative or ASCI. The ASCI program will produce a series
of machines leading to the 100 TFLOPS machine. The ASCI Option Red is
the first of those machines.
Overall Comments:
Pentium Pro 4-processor "quad pack"
Overview of the Platform:
A rough sketch of the quad pack organization is as follows:
This is the kind of Pentium Pro motherboard used in many multiprocessor
servers nowadays. Its main features are:
- It accommodates up to 4 processor modules, each containing a
Pentium Pro processor, first-level cache, translation lookaside
buffer, a 256-KB second-level cache, an interrupt controller, and a
bus interface in a single chip connecting directly to a 64-bit memory
bus.
- The bus operates at 66 MHz, and memory transactions are pipelined
to give a peak bandwidth of 528 MB/sec.
- A two-chip memory controller and 4-chip memory interleave unit
(MIU) connect the bus to multiple banks of DRAM. Bridges connect the
memory bus to 2 independent PCI buses, which host display, network,
SCSI, and lower-speed I/O connections.
- The Pentium Pro module contains all the logic necessary to
support the multiprocessor communication architecture, including that
required for memory and cache consistency.
- The structure of the Pentium Pro "quad pack" is similar to a
large number of earlier SMP designs; the sketch shows an expanded view
of a typical Pentium Pro SMP, for example the HP NetServer LX series.
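The quoted peak bus bandwidth can be checked from the bus width and clock rate. The sketch below assumes one 64-bit transfer per bus clock and takes 1 MB as one million bytes; the constant names are illustrative.

```python
BUS_WIDTH_BITS = 64        # width of the memory bus
BUS_CLOCK_MHZ = 66         # bus clock rate

bytes_per_transfer = BUS_WIDTH_BITS // 8
# Assumption: the pipelined bus completes one transfer per clock at peak.
peak_mb_per_s = bytes_per_transfer * BUS_CLOCK_MHZ

print(peak_mb_per_s)       # 528
```

This is an upper bound only, since it is shared by all four processors and the I/O bridges on the bus.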
hawick@npac.syr.edu
saleh@npac.syr.edu