HPC Platforms
Fujitsu
Status
One of the world's largest manufacturers of high-performance computers.
Overview of Organization
Fujitsu Limited is a leading provider of
information technology products and
solutions for the global marketplace.
Founded in Japan in 1935, Fujitsu is one of
the world's largest suppliers of high performance computers (HPC)
and information systems solutions,
telecommunications and semiconductor
products, software and services.
- Its HPC products are Vector Parallel Computers, Scalar Parallel
Servers, and Storage Servers. The Vector Parallel Processors
(VPP) are the VX-E, VPP300E, and VPP700E Series.
- The VX-E Series employs 1 to 4 processors (PEs), with peak
performance from 2.4 to 9.6 GFLOPS, memory capacity from
512MB to 8GB (512MB or 2GB per PE), memory throughput from
19.6 to 78.4 GB/s, disk capacity starting at 4GB, and a
crossbar network running at 615MB/s x 2 per PE.
- The VPP300E Series employs 1 to 16 PEs, with peak performance
from 2.4 to 38.4 GFLOPS, memory capacity from 512MB to
32GB (512MB or 2GB per PE), memory throughput from 19.6 to
313.6 GB/s, disk capacity starting at 4GB, and a crossbar
network running at 615MB/s x 2 per PE.
- The VPP700E Series employs 8 to 256 PEs, with peak performance
from 19.2 to 614.4 GFLOPS, memory capacity from 4GB to
512GB (512MB or 2GB per PE), memory throughput from 156.8
to 5017.6 GB/s, disk capacity starting at 4GB, and a crossbar
network running at 615MB/s x 2 per PE.
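The ranges quoted above scale linearly with the number of PEs: each PE contributes 2.4 GFLOPS of peak performance and 19.6 GB/s of memory throughput, and carries 512MB or 2GB of memory. The following Python sketch reproduces that arithmetic. The function and constant names are illustrative only and are not part of any Fujitsu tool.

```python
# Per-PE figures taken from the series descriptions above.
PEAK_GFLOPS_PER_PE = 2.4
MEM_THROUGHPUT_GBS_PER_PE = 19.6
MEMORY_MB_OPTIONS = (512, 2048)  # 512MB or 2GB per PE


def vpp_config(num_pes, memory_mb_per_pe=2048):
    """Return aggregate peak figures for a configuration of num_pes PEs."""
    if memory_mb_per_pe not in MEMORY_MB_OPTIONS:
        raise ValueError("memory per PE must be 512 or 2048 MB")
    return {
        # Rounded to one decimal to hide floating-point noise.
        "peak_gflops": round(num_pes * PEAK_GFLOPS_PER_PE, 1),
        "mem_throughput_gbs": round(num_pes * MEM_THROUGHPUT_GBS_PER_PE, 1),
        "memory_gb": num_pes * memory_mb_per_pe / 1024,
    }


if __name__ == "__main__":
    # Largest configuration of each series.
    for name, pes in (("VX-E", 4), ("VPP300E", 16), ("VPP700E", 256)):
        print(name, vpp_config(pes))
```

Calling vpp_config(8, 512) gives the smallest VPP700E configuration listed above (19.2 GFLOPS, 156.8 GB/s, 4GB).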
Documented Systems
VPP Architecture
Each PE consists of a Scalar Unit (SU) and a
Vector Unit (VU). All PEs are linked via the high-speed crossbar
network, forming a vector parallel architecture for
Multiple-Instruction/Multiple Data stream (MIMD) processing.
- Scalar Unit:
The scalar unit employs a long instruction word (LIW) RISC
architecture that can execute up to three operations
simultaneously in one clock cycle. High scalar performance
is achieved through asynchronous execution of memory access
instructions, floating-point operation instructions, and
vector instructions.
- Vector Unit:
The vector unit consists of seven pipelines, a vector register, and a
mask register. It performs vector operations at a peak speed of
2.4 GFLOPS.
- Crossbar network:
All PEs are interconnected via the crossbar network. A dedicated inter-PE
communication unit, the DTU, allows communication between processors to
proceed at the same time as computation. Data can therefore be sent and
received at 615MB/s in each direction while the PEs are performing
computations.
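To see why overlapping communication with computation matters, consider a simple timing model. The Python sketch below is an illustration under stated assumptions, not a description of how the DTU is programmed: it treats the 615MB/s per-direction rate as the only communication cost (no latency, no contention) and compares a step where transfer and computation run back to back with one where they fully overlap.

```python
LINK_MB_PER_S = 615.0  # per-direction crossbar rate quoted above


def transfer_seconds(megabytes):
    """Time to move `megabytes` in one direction at the quoted link rate."""
    return megabytes / LINK_MB_PER_S


def step_seconds(compute_s, megabytes, overlapped):
    """Duration of one step under the idealized model (assumes zero latency)."""
    comm_s = transfer_seconds(megabytes)
    if overlapped:
        # Transfer proceeds while the PE computes: the longer one dominates.
        return max(compute_s, comm_s)
    # Without overlap the PE waits for the transfer to finish.
    return compute_s + comm_s
```

Under this model, a step with 1 second of computation and 615MB to send takes 2 seconds without overlap and 1 second with it.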
VPP Hardware
Each PE can have up to 2 GB of main memory and can achieve a
maximum vector performance of 2.4 GFLOPS. System performance, main
memory size, and I/O capacity grow as PEs are added to the
configuration. The maximum 256-PE configuration offers a peak
vector performance of 614.4 GFLOPS together with 512 GB of main
memory.
VPP Software
VPP employs the UXP/V operating system, which is based on
UNIX System V Release 4. In UXP/V, functions have been enhanced
for the vector parallel architecture.
Development Tools
- Fortran90/VP:
In addition to general optimizations, this vectorizing
compiler provides optimizations such as instruction
scheduling, which maximize the performance of the
LIW RISC architecture of the scalar processor. Automatic
vectorization identifies the portions of a program where
vectorized execution is possible. Advanced vector
optimizations are also available (including vectorization
of IF statements, partial vectorization, and array
expression vectorization).
- C/VP:
C/VP features vectorization functions that are specialized for C based
on those of the Fortran compiler, including vectorization of
loops containing pointers or structures/unions.
- Vector Scientific Subroutine Library SSLII/VP:
This Fortran subroutine library provides mathematical routines that
are frequently used in scientific computations. The library is
vectorized to provide optimum performance.
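Vectorization of IF statements, mentioned under Fortran90/VP above, is commonly done by evaluating the condition for every element into a mask and then applying the operation only where the mask is set, which is the role a mask register plays in a vector unit. The Python sketch below emulates that idea on plain lists. It is a conceptual illustration with hypothetical function names, not the output or the internal method of the Fujitsu compilers.

```python
def scalar_loop(a, b):
    """Original form: a branch inside the loop body."""
    out = list(a)
    for i in range(len(a)):
        if b[i] > 0.0:
            out[i] = a[i] / b[i]
    return out


def masked_form(a, b):
    """Mask-based form: the branch becomes a mask plus a masked operation."""
    # Step 1: one "vector compare" produces the mask for all elements.
    mask = [bi > 0.0 for bi in b]
    # Step 2: one "masked vector divide"; elements whose mask bit is
    # cleared keep their original value and are never divided.
    return [ai / bi if m else ai for ai, bi, m in zip(a, b, mask)]
```

Both functions return the same result for the same inputs; the second form has no data-dependent branch in its main loop, which is what allows the whole loop to be issued as vector operations.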
Contact Address
HPC Group
FUJITSU LIMITED
Tokyo, Japan.
The company's
HPC Homepage.
AP3000
Overview of the Platform
Fujitsu's Parallel Server AP3000 series is a distributed memory
parallel server based on 64-bit UltraSPARC technology.
The system's architecture is shown in the following figure:
Compute Hardware
-
Each node is of type U170, U200, or U300, with an UltraSPARC processor
running at 167 MHz, 200 MHz, or 300 MHz, respectively.
-
The U170 model has one processor; the U200 and U300 models each have
1 or 2 processors.
-
Each node also has an internal cache and an external cache. The
internal cache is 32KB for all three types; the external
cache is 512KB for the U170, 1MB for the U200, and 2MB
for the U300.
The following figure illustrates the node's architecture:
Interconnect / Communications System
-
The system's topology is a 2-D torus (up to 32 x 32).
-
For processor connectivity, it employs the AP-Net (Advanced Parallel
System Network) with a data transfer rate of 200 MB/s x 2
(bi-directional) per port and 4 ports per node.
-
The hardware uses wormhole routing, and synchronization is
message-based.
-
Each port has 2 transmission channels. The node interface is one SBus
slot per node.
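The 2-D torus described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it assumes that the four ports of a node connect to its four torus neighbors, which the text does not state explicitly. The bandwidth figure is simple arithmetic on the quoted port rate, not a measured value.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of node (x, y) on a width x height torus."""
    return [
        ((x - 1) % width, y),   # wraps from column 0 to the last column
        ((x + 1) % width, y),
        (x, (y - 1) % height),  # wraps from row 0 to the last row
        (x, (y + 1) % height),
    ]


def torus_hops(src, dst, width, height):
    """Minimum hop count between two nodes, using the wraparound links."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


PORTS_PER_NODE = 4
MB_PER_S_PER_DIRECTION = 200


def node_link_bandwidth_mb_s():
    """Aggregate per-node link rate: 4 ports x 200 MB/s x 2 directions."""
    return PORTS_PER_NODE * MB_PER_S_PER_DIRECTION * 2
```

On the maximum 32 x 32 torus, the wraparound links put opposite corners only 2 hops apart, and no pair of nodes is more than 32 hops apart.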
A clearer picture of the above connectivity is as
follows:
Memory System
-
Memory capacity of the system ranges from 256MB to 2TB.
-
Per-node memory ranges from 64MB to 1GB for the U170, from 128MB to
2GB for the U200, and from 128MB to 2GB for the U300.
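The system-wide range follows from the per-node figures and the node counts given elsewhere in this document (4 to 1,024 nodes): four U170 nodes at 64MB each give 256MB, and 1,024 nodes at 2GB each give 2TB. A small Python sketch of that calculation, with illustrative names:

```python
# (minimum, maximum) memory per node in MB, from the list above.
NODE_MEMORY_MB = {
    "U170": (64, 1024),
    "U200": (128, 2048),
    "U300": (128, 2048),
}


def system_memory_mb(node_type, num_nodes, use_max=True):
    """Total memory in MB for num_nodes identical nodes of node_type."""
    low, high = NODE_MEMORY_MB[node_type]
    return num_nodes * (high if use_max else low)
```

For example, system_memory_mb("U170", 4, use_max=False) returns 256, and system_memory_mb("U300", 1024) returns 2097152 (2TB expressed in MB).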
Benchmarks / Compute and data transfer performance
Benchmarking per node type:
           U170   U200                            U300
---------  -----  ------------------------------  ----------------------------
SPECint95  5.56   7.72 (1 CPU) / 7.88 (2 CPUs)    12.1 (1 CPU) / 12.3 (2 CPUs)
SPECfp95   9.06   11.40 (1 CPU) / 14.70 (2 CPUs)  15.5 (1 CPU) / 20.2 (2 CPUs)
Operating System Software and Environment
-
The Solaris operating system runs on each node. All applications
written under Solaris will run on the
AP3000 with no change.
-
Operation and system management software:
- Partitioned operation.
- Automatic load distribution.
- System backup.
- System monitoring.
- Checkpoint/restart.
-
Software is provided to support a variety of usage patterns.
-
In addition to the libraries that run on Solaris, the AP3000 supports
HPF and the PVM and MPI libraries for parallel
application development.
-
The system comes with a workbench for browsing, profiling, etc.
Networkability/ I/O System / Integrability / Reliability /
Scalability
-
AP3000 nodes are interconnected via the high-speed AP-Net (max. 200MB/s x
2). AP-Net
enables high-speed, bi-directional communication and low latency through hardware
implementation of the message handling features and the use of
wormhole routing.
-
The system scales from a 4-node system up to a 1,024-node
configuration.
-
As the number of processors scales up, so do the memory, I/O, network,
and operating system.
-
External interfaces are supported, including Fast SCSI-2, Fast/Wide SCSI-2,
Ethernet, Fast Ethernet, FDDI, ATM, and Fibre Channel.
Comments
Here is a 1994 paper on the
architecture of the VPP500 parallel supercomputer.
Benchmarks
- Fujitsu VP2600/10 (3.2 ns), using Fortran77 EX/VP V11L10, for problem
size n=100 it gives about 249 Mflop/sec, for n=1000 it gives 4009 Mflop/sec
best effort TPP. Theoretical peak is 5000 Mflop/sec.
- Fujitsu VPP500/1 (1 processor, 10 ns), using Fortran77 EX/VP V12L20,
for problem size n=100 it gives about 206 Mflop/sec, for n=1000 it
gives about 1159 Mflop/sec. Theoretical peak is 1333 Mflop/sec.
- Fujitsu VX/1 (1 processor, 7ns), using Fortran90/VP V10L10, for
problem size n=100 it gives about 203 Mflop/sec, for n=1000 it runs
at 1936 Mflop/sec best effort TPP, and theoretical peak of 2200 Mflop/sec.
- Fujitsu VPP300/1 (1 processor, 7ns), using Fortran90/VP V10L10, for
problem size n=100 it runs at 203 Mflop/sec, for n=1000 it runs
at 1936 Mflop/sec best effort TPP, and theoretical peak of 2200 Mflop/sec.
- Fujitsu VPP700/1 (1 processor, 7ns), using Fortran90/VP V10L10, for
problem size n=100 it runs at 203 Mflop/sec, for n=1000 it runs
at 1936 Mflop/sec best effort TPP, and theoretical peak of 2200 Mflop/sec.
- Fujitsu VP2200/10 (3.2 ns), using Fortran77 EX/VP V12L10,
for problem size n=100 it runs at 203 Mflop/sec, for n=1000 it runs
at 1048 Mflop/sec best effort TPP, and theoretical peak of 1250 Mflop/sec.
- Fujitsu VP2400/10 (4 ns), using Fortran77 EX/VP V11L10,
for problem size n=100 it runs at 170 Mflop/sec, for n=1000 it runs
at 1688 Mflop/sec best effort TPP, and theoretical peak of 2000 Mflop/sec.
- Fujitsu VP2100/10 (4 ns), using Fortran77 EX/VP V11L10,
for problem size n=100 it runs at 112 Mflop/sec, for n=1000 it runs
at 445 Mflop/sec best effort TPP, and theoretical peak of 500 Mflop/sec.
- Fujitsu VP-400, using Fortran 77 V10L30, for problem size n=100 it
runs at 20 Mflop/sec, and a theoretical peak of 1142 Mflop/sec.
- Fujitsu VP-100, using Fortran 77, for problem size n=100 it runs at
16 Mflop/sec, and a theoretical peak of 267 Mflop/sec.
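One way to compare the entries above is to express the n=1000 result as a fraction of theoretical peak. The Python sketch below does this for a few of the listed systems; the figures are copied from the list, while the function names and the choice of systems are illustrative.

```python
# (system, n=1000 best-effort Mflop/sec, theoretical peak Mflop/sec)
RESULTS = [
    ("VP2600/10", 4009, 5000),
    ("VPP500/1", 1159, 1333),
    ("VX/1", 1936, 2200),
    ("VP2400/10", 1688, 2000),
    ("VP2100/10", 445, 500),
]


def efficiency_percent(achieved, peak):
    """Achieved rate as a percentage of theoretical peak, to one decimal."""
    return round(100.0 * achieved / peak, 1)


def report(results=RESULTS):
    """Return (system, efficiency %) pairs in the order given."""
    return [(name, efficiency_percent(a, p)) for name, a, p in results]


if __name__ == "__main__":
    for name, pct in report():
        print(f"{name:12s} {pct:5.1f}% of peak")
```

All five systems reach between roughly 80% and 89% of peak at n=1000, whereas the n=100 results in the list are a much smaller fraction of peak.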
saleh@npac.syr.edu