Sun Microsystems
Status:
Active manufacturer of high performance computing systems.
Overview of Organization:
Founded in 1982 in cramped quarters in Santa Clara, California
(it later moved to Palo Alto), Sun Microsystems, Inc. has emerged as the
world leader in enterprise network computing, with more than $9 billion in
revenues and operations in 150 countries.
Beginning with the Sun-1, it has been, and continues to be, highly
successful in building high-performance UNIX-based
computer systems.
Platforms Documented: Enterprise 10000, Enterprise 6000/6500.
Contact Address:
Sun Microsystems Inc.
901 San Antonio Rd,
Palo Alto, CA 94303 USA.
The HPC Division
See Also:
The company's homepage.
Enterprise 10000
Overview of Platform
Enterprise servers (including the 10000) are based on symmetric
multiprocessing (SMP) architectures. The design is essentially
bus-based, modified in two ways to increase bandwidth: the data bus
is made wider, and multiple buses are used, including multiple
address buses. Following this approach, the Enterprise 10000
combines multiple address buses with a data crossbar to achieve
scalability.
Compute Hardware
Summary of the main hardware features:
- Each board holds 4 CPUs running at about 336 MHz, along
with 4 banks of memory (up to 1 GB each) and two independent SBus
I/O buses.
- Up to 16 of these boards are connected by a 16 x 16 data crossbar
with 144-bit-wide paths, as well as by 4 address buses associated
with the 4 memory banks on each board.
- Collectively, this provides about 12.6 GB/s of data bandwidth
and a relatively high snoop rate of 250 MHz.
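To illustrate why several address buses raise the aggregate snoop rate, here is a minimal Python sketch that spreads coherence requests across 4 buses. The mapping is a hypothetical assumption, not Sun's documented scheme: it assumes 64-byte coherence units and picks a bus from the low-order bits of the line address, so consecutive lines alternate among the buses.

```python
# Hypothetical sketch: distribute coherence requests over 4 address buses.
# Assumptions (not from Sun documentation): 64-byte coherence units, and the
# bus is chosen from the low-order bits of the line address.
LINE_BYTES = 64
NUM_ADDRESS_BUSES = 4
BOARDS = 16
CPUS_PER_BOARD = 4


def address_bus_for(phys_addr: int) -> int:
    """Return the index (0-3) of the address bus that would carry this request."""
    line_number = phys_addr // LINE_BYTES
    return line_number % NUM_ADDRESS_BUSES


def max_cpus() -> int:
    """Maximum processor count: 16 boards x 4 CPUs per board."""
    return BOARDS * CPUS_PER_BOARD


if __name__ == "__main__":
    for addr in (0x000, 0x040, 0x080, 0x0C0, 0x100):
        print(f"address {addr:#05x} -> address bus {address_bus_for(addr)}")
    print("maximum CPUs:", max_cpus())
```

Under this assumed mapping, a stream of misses to consecutive lines keeps all four buses busy at once, which is the intuition behind a snoop rate higher than a single bus could sustain.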
Here are the overall Enterprise 10000 specifications (according to the
manufacturer):
- Processor
- The number of processors ranges from 4 up to 64.
- The processors are superscalar UltraSPARC chips implementing the
SPARC Version 9 architecture.
- Cache per processor: 16KB instruction and 16KB data primary
caches on the chip, plus 4MB of external secondary cache.
- Each CPU is interfaced with a 128-bit Ultra Port Architecture (UPA).
- System Boards
- Number of boards: minimum of 4 and maximum of 16 system
boards.
- Each board holds up to 4 processors, up to 4 SBus cards,
and a memory module with 4 banks of 8 SIMMs each.
- Standard Interfaces:
Provided via SBus, a 64-bit-wide data bus clocked at
25 MHz.
- Internal Mass Storage:
The system has a SPARCstorage RSM disk tray holding up to seven
4.2 GB or 9.1 GB disks.
- Power Supplies:
Fully redundant power and cooling are standard.
- Environment
- AC Power: 200-240 VAC single phase, 47-63 Hz, 24 A per line cord
(up to 5 redundant line cords).
- Operating: 10C to 30C (50F to 86F) and 30% to 70% relative
humidity, noncondensing.
- Dimensions and Weight:
Height: 178 cm (70 in.), Width: 127 cm (50 in.), Depth: 99 cm (39 in.)
Weight, main cabinet: about 635 kg (1400 lb) when fully configured.
Power cord: 4.6 m (15 ft.)
Memory System
System memory ranges from a minimum of 2 GB to a maximum of 64 GB.
Memory is added in groups of 16 SIMMs, each group providing between
512 MB and 2 GB, and each system board accepts up to 2 such memory
expansion options. Memory bandwidth is about 12.8
Gbytes/sec.
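The memory figures are consistent with one another, as this short Python sketch shows. It assumes every board is fully populated with two 2 GB expansion groups and that all SIMMs in a group are the same size; the SIMM sizes it derives are inferred, not quoted by Sun.

```python
# Memory-capacity arithmetic for the Enterprise 10000, using the figures
# quoted in the text. Sizes are in MB.
BOARDS = 16
OPTIONS_PER_BOARD = 2      # up to 2 memory expansion options per board
SIMMS_PER_OPTION = 16      # each option is a group of 16 SIMMs
OPTION_MIN_MB = 512
OPTION_MAX_MB = 2 * 1024


def max_system_memory_mb() -> int:
    """16 boards x 2 options x 2 GB per option."""
    return BOARDS * OPTIONS_PER_BOARD * OPTION_MAX_MB


def simm_size_range_mb() -> tuple:
    """Implied smallest and largest SIMM, assuming identical SIMMs per group."""
    return OPTION_MIN_MB // SIMMS_PER_OPTION, OPTION_MAX_MB // SIMMS_PER_OPTION


if __name__ == "__main__":
    # 2 options x 16 SIMMs = 32 SIMMs per board, matching 4 banks of 8 SIMMs.
    assert OPTIONS_PER_BOARD * SIMMS_PER_OPTION == 4 * 8
    print("max memory:", max_system_memory_mb() // 1024, "GB")
    print("implied SIMM sizes (MB):", simm_size_range_mb())
```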
The Ultra architecture uses a 2-level cache hierarchy. The first
level consists of two independent caches, the data cache and the
instruction cache. The instruction cache (I-Cache) is a physically
indexed, 16 Kbyte, 2-way set-associative cache with 32-byte
blocks. The data cache (D-Cache) is a virtually indexed,
write-through, 16 Kbyte direct-mapped cache with two 16-byte
subblocks per line. The second level, the so-called external
cache (E-Cache), is a write-back, 1 Mbyte direct-mapped cache
with a 64-byte line size.
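A quick way to see what these parameters imply is to compute each cache's number of sets and how an address splits into offset and index bits. The Python sketch below does this, treating a D-Cache line as two 16-byte subblocks (32 bytes) and assuming all sizes are powers of two.

```python
# Cache geometry derived from the parameters quoted above (sizes in bytes).
def geometry(size: int, line: int, ways: int) -> dict:
    """Return sets, offset bits and index bits for a power-of-two cache."""
    sets = size // (line * ways)
    return {
        "sets": sets,
        "offset_bits": line.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,   # log2(number of sets)
    }


CACHES = {
    "I-Cache": geometry(16 * 1024, 32, 2),      # 2-way, 32-byte blocks
    "D-Cache": geometry(16 * 1024, 2 * 16, 1),  # direct-mapped, two 16-byte subblocks
    "E-Cache": geometry(1024 * 1024, 64, 1),    # direct-mapped, 64-byte lines
}

if __name__ == "__main__":
    for name, g in CACHES.items():
        print(name, g)
```

For example, the D-Cache comes out at 512 lines, addressed with 9 index bits above a 5-bit line offset, while the 1 MB E-Cache needs 14 index bits above a 6-bit offset.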
Benchmarks / Compute and data transfer performance
According to Sun, the system has a constant memory latency of 500
nanoseconds, a memory bandwidth of 12.8 Gbytes/sec, and a peak I/O
bandwidth of 6.4 Gbytes/sec.
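Latency and bandwidth together determine how much memory traffic must be outstanding to reach the quoted bandwidth. Applying Little's law (data in flight = bandwidth x latency) to Sun's figures gives a rough estimate. The Python sketch assumes decimal gigabytes and 64-byte transfers (the E-Cache line size); it is a back-of-the-envelope calculation, not a Sun figure.

```python
# Little's law: data in flight = bandwidth x latency.
# Assumptions: 12.8 GB/s is decimal (12.8e9 bytes/s), transfers are 64-byte lines.
BANDWIDTH_BYTES_PER_S = 12_800_000_000
LATENCY_NS = 500
LINE_BYTES = 64


def bytes_in_flight() -> int:
    """Bytes that must be outstanding to sustain the bandwidth at this latency."""
    return BANDWIDTH_BYTES_PER_S * LATENCY_NS // 1_000_000_000


def lines_in_flight() -> int:
    """The same quantity expressed in 64-byte cache lines."""
    return bytes_in_flight() // LINE_BYTES


if __name__ == "__main__":
    print(bytes_in_flight(), "bytes, or", lines_in_flight(), "cache lines, in flight")
```

Under these assumptions, roughly 100 outstanding 64-byte misses system-wide (about 1.6 per processor in a 64-CPU system) would be needed to sustain 12.8 Gbytes/sec.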
Samson Cheung of NAS at NASA Ames recently carried out an
interesting comparison of the bandwidth and latency of a
4-processor Enterprise 10000 against those of Origin2000 systems;
the results are documented in a NAS test report.
Operating System Software and Environment
- Operating System:
Solaris 2.5.1/2.6 operating environment
- Languages: The following languages are supported: C, C++,
Pascal, Fortran, and Java.
- Windowing system: OpenWindows Version 3 (optional).
- System monitoring: Hostview.
- System and network management: Solstice Site Manager and
Solstice Domain Manager, as well as network management software
for Solaris, Solstice DiskSuite, and SPARCstorage Volume Manager.
Networkability / I/O System / Integrability / Reliability / Scalability
The system supports up to 32 SBuses and up to 64 I/O slots. For
external storage, it supports more than 60 TBytes of online disk, with
an I/O bandwidth of about 6.4 Gbytes/sec.
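The 6.4 Gbytes/sec I/O figure follows from the SBus parameters in the specifications: 16 boards with 2 SBuses each gives 32 SBuses, each moving 8 bytes per 25 MHz cycle. This Python sketch assumes decimal units and one transfer per clock, so it gives a raw peak; sustained SBus throughput is typically lower once arbitration and protocol overhead are included.

```python
# Peak I/O bandwidth from SBus parameters (raw bus rate, no protocol overhead).
BOARDS = 16
SBUSES_PER_BOARD = 2
SBUS_WIDTH_BYTES = 64 // 8     # 64-bit data path
SBUS_CLOCK_HZ = 25_000_000     # 25 MHz


def sbus_peak_bytes_per_s() -> int:
    """One 8-byte transfer per SBus clock (assumed)."""
    return SBUS_WIDTH_BYTES * SBUS_CLOCK_HZ


def system_io_peak_bytes_per_s() -> int:
    """All 32 SBuses transferring at once."""
    return BOARDS * SBUSES_PER_BOARD * sbus_peak_bytes_per_s()


if __name__ == "__main__":
    print("per SBus:", sbus_peak_bytes_per_s() / 1e6, "MB/s")
    print("system:", system_io_peak_bytes_per_s() / 1e9, "GB/s")
```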
- Networking: the following options are supported:
ONC, NFS, TCP/IP, SunNet, OSI, MHS, X.25, DCE, and NetWare.
- IBM Connectivity is done through SunLink connectivity products.
- SBus options:
The following SBus options are available: SunFastEthernet adapter, SunATM
adapter, SBus quad Ethernet controller, differential fast/wide
intelligent SCSI-2 SBus (DWIS/S), single-ended fast/wide intelligent
SCSI-2 SBus (SWIS/S) host adapter, HSI, Token Ring interface, FDDI,
Fast SCSI-2/buffered Ethernet, Fibre Channel, and Sun ISDN.
On reliability, it provides error correction (ECC) throughout memory,
optionally redundant control boards and System Service Processor (SSP),
fault-tolerant power and cooling, data protection through mirroring and
RAID-5 redundancy, and multipathing of controllers. It also
supports multipathing of network controllers to protect against a
controller failure, as well as automatic reconfiguration.
On scalability, the Enterprise 10000 scales up to 64 processors
by combining data crossbars with the use of multiple address buses.
Notable Applications / Customers / Market Sectors
The Enterprise 10000 is used in a number of fields: education,
financial services, government, health care, retail, telecommunications,
transportation, and the oil and petrochemical industry.
- University of Utah uses it for Data Warehousing and
PeopleSoft.
- Abbey National (UK) uses the Enterprise 10000 for Data
Warehousing.
- Guardian Computer Services uses it for
Consolidation/Disaster Recovery.
- Tokyo Mitsubishi International uses it for
Consolidation.
- Florida Communities Network uses it for
Consolidation/Web Server.
- Western Provident Association uses it for
Consolidation.
- Littlewoods uses the 10000 for Data Warehousing.
- GTE uses it for Data Warehousing.
- Telecom Italia Mobile uses Enterprise 10000 for
Enterprise Resource Planning (ERP).
- Alaska Airlines uses its Enterprise 10000 for
Mainframe Affinity.
- Many other customers use it to provide ERP solutions.
Online retailers such as Amazon.com are acquiring the
Enterprise 10000 to manage their e-commerce. In the oil
industry, the Starfire (as the Enterprise 10000 is also known)
is used for large ERP projects.
Enterprise 6000/6500
Overview of Platform
Compute Hardware
The following diagram shows the logical
organization of the Enterprise multiprocessor.
The boxes labeled with "$" and "$2" stand for primary and
secondary cache, respectively. This design uses a hierarchical
structure in which each card is either a complete dual-processor
system with memory or a complete I/O system.
- The full system configuration provides 16 bus slots, each of which
can hold either a processor board or an I/O board; there must be
at least one of each.
- Each processing board contains 2 CPU modules and 2 (512-bit wide)
banks of memory of up to 1GB each, which are uniformly accessible to
all boards.
- The I/O board provides connectors for multiple independent
peripheral buses and appears like any other cache controller
on the system bus.
- Each processor is an UltraSPARC with a 16KB level-1 cache and a
512KB level-2 cache.
- The split-transaction bus allows up to 112 transactions at a time.
- The maximum number of processors supported is 30, with a peak
performance of 9 GFLOPS.
- Both memory capacity and bandwidth scale up with the number of processors.
- The Gigaplane system bus provides a peak bandwidth of 2.67 GB/s.
- The system has a nonmultiplexed, split-transaction bus (Sun
Gigaplane) with 256-bit data lines and 41-bit physical
address. Gigaplane is clocked at 83.5 MHz.
- Gigaplane can support up to 112 outstanding transactions,
including up to 7 from each board.
- The bus consists of a total of 388 signals: 256 data, 32 ECC
(error correcting code), 43 address (with parity), 7 ID tag,
18 for arbitration, and a number of configuration signals.
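Two of the Gigaplane figures above can be checked against each other: the 2.67 GB/s peak follows from 256 data bits per cycle at 83.5 MHz, and the 112-transaction limit equals 16 board slots times 7 transactions per board. A minimal Python sketch, assuming one data transfer per bus cycle and decimal gigabytes:

```python
# Gigaplane peak bandwidth and transaction limit from the figures above.
DATA_BITS = 256
CLOCK_HZ = 83_500_000            # 83.5 MHz
SLOTS = 16
MAX_OUTSTANDING_PER_BOARD = 7


def peak_bandwidth_bytes_per_s() -> int:
    """One 256-bit (32-byte) data transfer per bus clock (assumed)."""
    return (DATA_BITS // 8) * CLOCK_HZ


def max_outstanding() -> int:
    """Every slot issuing its maximum number of outstanding transactions."""
    return SLOTS * MAX_OUTSTANDING_PER_BOARD


if __name__ == "__main__":
    print("peak bandwidth:", peak_bandwidth_bytes_per_s() / 1e9, "GB/s")
    print("outstanding transactions:", max_outstanding())
```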
Memory System
- Since the system supports up to 30 processors at 2 per board
(that is, up to 15 processor boards), and each processing board
holds up to 2 GB, the overall system supports up to 30 GB of
up to 16-way interleaved memory.
- As the diagram shows, every processor has its own first-level
and second-level cache. Arrays of 16 KB or less fit entirely in
the first-level cache.
- Second-level cache accesses take about 40 ns, and each transfer
between the two cache levels moves about 16 bytes. The overall
memory access time is about 300 ns, of which about 130 ns is spent
in the bus protocol; the bus is clocked at 83.5 MHz.
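The 30-processor and 30 GB limits follow from the slot rules given under Compute Hardware: with 16 slots and at least one I/O board, at most 15 slots can hold processor boards. This Python sketch assumes exactly one I/O board and fully populated 1 GB memory banks.

```python
# Maximum Enterprise 6000 configuration from the slot and board figures.
BUS_SLOTS = 16
MIN_IO_BOARDS = 1      # at least one slot must hold an I/O board
CPUS_PER_BOARD = 2
BANKS_PER_BOARD = 2
MAX_BANK_GB = 1


def max_cpu_boards() -> int:
    """Slots left for processor boards after the mandatory I/O board."""
    return BUS_SLOTS - MIN_IO_BOARDS


def max_cpus() -> int:
    return max_cpu_boards() * CPUS_PER_BOARD


def max_memory_gb() -> int:
    return max_cpu_boards() * BANKS_PER_BOARD * MAX_BANK_GB


if __name__ == "__main__":
    print(max_cpu_boards(), "CPU boards,", max_cpus(), "CPUs,", max_memory_gb(), "GB")
```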
Operating System Software and Environment
- Operating System:
Solaris 2.5.1/2.6 operating environment
- Languages: The following languages are supported: C, C++,
Pascal, Fortran, and Java.
- Windowing system: OpenWindows Version 3 (optional).
- System monitoring: Hostview.
- System and network management: Solstice Site Manager and
Solstice Domain Manager, as well as network management software
for Solaris, Solstice DiskSuite, and SPARCstorage Volume Manager.
Networkability / I/O System / Integrability / Reliability / Scalability
- The Enterprise I/O board uses the same bus interface as the processing
board, but the internal bus is only half as wide and there is no memory path.
- Externally, the I/O boards only do cache-block-sized transactions,
just like the processing boards.
- Internally, 2 independent 64-bit, 25-MHz SBuses are supported. One of these
supports 2 dedicated Fibre Channel modules providing redundant, high-bandwidth
interconnect to large disk storage arrays. The other provides dedicated
Ethernet and fast wide SCSI connections. In addition, 3 SBus interface cards can
be plugged into the 2 buses to support arbitrary peripherals, including
a 622-Mb/s ATM interface.
- The I/O bandwidth, the connectivity to peripherals, and the cost
of the I/O system scales with the number of I/O cards.
- As the figure shows, each I/O card provides two independent 64-bit,
25-MHz SBus I/O buses, so the I/O bandwidth scales with the number of I/O cards.
- Total disk storage can reach tens of terabytes.
The system scales up to 30 processors.
Benchmarks / Compute and data transfer performance
Performance on the LINPACK benchmark (n=1000; TPP = Toward Peak Performance)
for the Enterprise 6000 with processors running at 250 MHz:
- For a 4-processor system the TPP performance was 1126 Mflops/sec and the
theoretical peak was 2000 Mflops/sec.
- For a 6-processor system the TPP performance was 1607 Mflops/sec and the
theoretical peak was 3000 Mflops/sec.
- For an 8-processor system the TPP performance was 2038 Mflops/sec and the
theoretical peak was 4000 Mflops/sec.
- For a 14-processor system the TPP performance was 3112 Mflops/sec and the
theoretical peak was 7000 Mflops/sec.
- For a 16-processor system the TPP performance was 3493 Mflops/sec and the
theoretical peak was 8000 Mflops/sec.
- For a 24-processor system the TPP performance was 4389 Mflops/sec and the
theoretical peak was 12000 Mflops/sec.
- For a 30-processor system the TPP performance was 4755 Mflops/sec and the
theoretical peak was 15000 Mflops/sec.
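The quoted theoretical peaks all equal processors x 250 MHz x 2 floating-point operations per cycle, so the fraction of peak reached by each run can be computed directly. In this Python sketch the factor of 2 flops per cycle is inferred from the quoted peaks rather than taken from a Sun document.

```python
# LINPACK n=1000 results for the 250 MHz Enterprise 6000 (Mflops/sec), from the list above.
TPP_250MHZ = {4: 1126, 6: 1607, 8: 2038, 14: 3112, 16: 3493, 24: 4389, 30: 4755}
FLOPS_PER_CYCLE = 2  # inferred from the quoted peaks, e.g. 4 x 250 MHz x 2 = 2000


def peak_mflops(cpus: int, clock_mhz: int = 250) -> int:
    """Theoretical peak: processors x clock x flops per cycle."""
    return cpus * clock_mhz * FLOPS_PER_CYCLE


def efficiency(cpus: int) -> float:
    """Fraction of theoretical peak achieved by the TPP run."""
    return TPP_250MHZ[cpus] / peak_mflops(cpus)


if __name__ == "__main__":
    for cpus, tpp in TPP_250MHZ.items():
        print(f"{cpus:2d} CPUs: {tpp:5d} of {peak_mflops(cpus):5d} Mflops/sec "
              f"({efficiency(cpus):.0%})")
```

Efficiency falls from about 56% at 4 processors to about 32% at 30 processors, which suggests that the shared bus and memory system, rather than raw processor speed, limit scaling at this problem size.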
Using the 336 MHz UltraSPARC II, the LINPACK benchmark for n=1000
is as follows:
- For a 2-processor system the TPP best effort was 843 Mflops/sec and the
theoretical peak was 1344 Mflops/sec.
- For a 4-processor system the TPP best effort was 1438 Mflops/sec and the
theoretical peak was 2688 Mflops/sec.
- For a 6-processor system the TPP best effort was 1990 Mflops/sec and the
theoretical peak was 4032 Mflops/sec.
- For an 8-processor system the TPP best effort was 2481 Mflops/sec and the
theoretical peak was 5376 Mflops/sec.
- For a 14-processor system the TPP best effort was 3721 Mflops/sec and the
theoretical peak was 9408 Mflops/sec.
- For a 16-processor system the TPP best effort was 3981 Mflops/sec and the
theoretical peak was 10752 Mflops/sec.
- For a 24-processor system the TPP best effort was 4755 Mflops/sec and the
theoretical peak was 16128 Mflops/sec.
- For a 30-processor system the TPP best effort was 5187 Mflops/sec and the
theoretical peak was 20160 Mflops/sec.
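Because this list starts at 2 processors, it can also be read as a scaling study. The Python sketch below checks the quoted peaks (2 flops per cycle, as implied by 2 x 336 MHz x 2 = 1344) and computes speedup relative to the 2-processor run. Treating the 2-processor result as the baseline is an assumption of this sketch, not part of the published benchmark.

```python
# 336 MHz UltraSPARC II LINPACK n=1000 results (Mflops/sec), from the list above.
TPP_336MHZ = {2: 843, 4: 1438, 6: 1990, 8: 2481, 14: 3721, 16: 3981, 24: 4755, 30: 5187}
CLOCK_MHZ = 336
FLOPS_PER_CYCLE = 2  # inferred from the quoted peaks


def peak_mflops(cpus: int) -> int:
    """Theoretical peak: processors x 336 MHz x 2 flops per cycle."""
    return cpus * CLOCK_MHZ * FLOPS_PER_CYCLE


def relative_speedup(cpus: int, base: int = 2) -> float:
    """Speedup over the smallest (2-processor) run."""
    return TPP_336MHZ[cpus] / TPP_336MHZ[base]


def relative_efficiency(cpus: int, base: int = 2) -> float:
    """Speedup divided by the ideal speedup cpus / base."""
    return relative_speedup(cpus, base) / (cpus / base)


if __name__ == "__main__":
    for cpus in TPP_336MHZ:
        print(f"{cpus:2d} CPUs: speedup {relative_speedup(cpus):.2f}, "
              f"relative efficiency {relative_efficiency(cpus):.0%}")
```

For example, the 30-processor run is about 6.2 times faster than the 2-processor run, against an ideal factor of 15.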
Using the 167 MHz UltraSPARC I, the LINPACK benchmark for n=1000
is as follows:
- For a 2-processor system the TPP best effort was 456 Mflops/sec and the
theoretical peak was 667 Mflops/sec.
- For a 4-processor system the TPP best effort was 871 Mflops/sec and the
theoretical peak was 1333 Mflops/sec.
- For an 8-processor system the TPP best effort was 1607 Mflops/sec and the
theoretical peak was 2667 Mflops/sec.
- For a 12-processor system the TPP best effort was 2238 Mflops/sec and the
theoretical peak was 4000 Mflops/sec.
- For a 16-processor system the TPP best effort was 2761 Mflops/sec and the
theoretical peak was 5333 Mflops/sec.
- For a 20-processor system the TPP best effort was 3170 Mflops/sec and the
theoretical peak was 6667 Mflops/sec.
- For a 24-processor system the TPP best effort was 3566 Mflops/sec and the
theoretical peak was 8000 Mflops/sec.
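Comparing the two processor generations at equal processor counts shows how much of the roughly 2x clock increase carries through to LINPACK performance. In this Python sketch the nominal 167 MHz clock is modeled as 500/3 MHz (about 166.7 MHz), an assumption that reproduces every quoted peak after rounding; only processor counts present in both lists are compared.

```python
from fractions import Fraction

# Nominal "167 MHz" modeled as 500/3 MHz; this reproduces the quoted peaks after rounding.
CLOCK_167 = Fraction(500, 3)
CLOCK_336 = 336
FLOPS_PER_CYCLE = 2  # inferred from the quoted peaks

TPP_167 = {2: 456, 4: 871, 8: 1607, 12: 2238, 16: 2761, 20: 3170, 24: 3566}
TPP_336 = {2: 843, 4: 1438, 6: 1990, 8: 2481, 14: 3721, 16: 3981, 24: 4755, 30: 5187}


def peak_167(cpus: int) -> int:
    """Theoretical peak of the UltraSPARC I system, rounded to whole Mflops/sec."""
    return round(cpus * CLOCK_167 * FLOPS_PER_CYCLE)


def tpp_ratio(cpus: int) -> float:
    """UltraSPARC II (336 MHz) TPP divided by UltraSPARC I (167 MHz) TPP."""
    return TPP_336[cpus] / TPP_167[cpus]


if __name__ == "__main__":
    print("clock ratio:", float(CLOCK_336 / CLOCK_167))
    for cpus in sorted(TPP_167.keys() & TPP_336.keys()):
        print(f"{cpus:2d} CPUs: TPP ratio {tpp_ratio(cpus):.2f}")
```

The TPP ratio drops from about 1.85 at 2 processors to about 1.33 at 24, always below the clock ratio of about 2.02. This is consistent with the memory system, rather than the clock rate, increasingly limiting the larger runs.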