Tera
Tera Computer Company
Status:
Active manufacturer of parallel computers.
Overview of Organization:
Tera Computer Company was founded by Burton Smith in 199? and has
been, and continues to be, funded by DARPA and NSF.
In 1998 the company introduced its first general-purpose parallel
computer, the Tera MTA (Multi-Threaded Architecture). It was
designed specifically for large-scale applications such as
reservoir simulation, seismic exploration, 3-D computer-aided
design, and molecular modeling.
The first MTA has been installed at the San Diego Supercomputer
Center and, as stated by the company, the system will be used to
implement, optimize, and evaluate a wide range of applications.
Overview of the MTA:
The Tera computer system is a shared-memory multiprocessor. According
to its specification, it implements a true shared-memory programming
model, in which the performance of the system does not depend
on the placement of data in memory.
Interconnect / Communications System:
The system is a multiprocessor accommodating up to 256 processors.
The system runs stand-alone and requires no front end.
Network connection to workstations and other computer
systems is accomplished via 32- or 64-bit HIPPI channels.
All data path widths are 64 bits, including the
processor-network interface.
Processors:
- Each processor is multithreaded and switches context every
cycle among as many as 128 hardware threads (streams),
thereby hiding up to 128 cycles (384 ns) of memory latency.
- Each stream can issue as many as eight memory
references without waiting for earlier ones to finish,
further augmenting the memory latency tolerance of the
processor.
- The clock speed is nominally 333 MHz, giving each
processor a data path bandwidth of one billion 64-bit
results per second and a peak performance of one
gigaflops.
- The peak memory bandwidth is 2.67 gigabytes
per second.
- A stream implements a load-store architecture with three
addressing modes and 31 general-purpose 64-bit
registers.
- The processors implement IEEE Standard 754 arithmetic
using the 64-bit double basic format.
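The latency and bandwidth figures in the list above follow from the cycle time and the 64-bit data path. A minimal sketch of that arithmetic, assuming the 3 ns cycle quoted for the network applies to the processor as well and that one 64-bit word moves per cycle (the constant and function names are illustrative only):

```python
# Back-of-the-envelope check of the processor figures quoted above.
# Assumptions: a 3 ns cycle (nominal 333 MHz) and one 64-bit word per cycle.
CYCLE_NS = 3        # nanoseconds per clock cycle
STREAMS = 128       # hardware streams per processor
WORD_BYTES = 8      # one 64-bit word


def hidden_latency_ns(streams: int = STREAMS, cycle_ns: int = CYCLE_NS) -> int:
    """Latency covered when each of `streams` streams issues once per rotation."""
    return streams * cycle_ns


def peak_memory_bandwidth_gbs(word_bytes: int = WORD_BYTES,
                              cycle_ns: int = CYCLE_NS) -> float:
    """Bytes per nanosecond, which is numerically equal to GB/s."""
    return word_bytes / cycle_ns


if __name__ == "__main__":
    print(f"latency hidden: {hidden_latency_ns()} ns")
    print(f"peak memory bandwidth: {peak_memory_bandwidth_gbs():.2f} GB/s")
```

With these inputs the sketch reproduces the 384 ns and 2.67 gigabytes per second figures given in the list.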
64-processor performance estimates (by the manufacturer):
Kernel                                       Estimated Time
-------------------------------------------  --------------
Matrix multiply, 1K x 1K                     50 ms
3D FFT, 256 x 256 x 256                      63 ms
Sparse matrix times vector, 400M nonzeros    50 ms
Integer sort, 100M keys                      36 ms
Interconnect / Communications System:
The interconnection network is a 3-D packet-switched network containing
p^(3/2) nodes, where p is the number of processors.
These nodes are toroidally connected in three dimensions to form a
p^(1/2)-ary three-cube, and processor and memory resources are
attached to some of the nodes.
The latency of a node is three cycles: a message spends two cycles
in the node logic proper and one on the wire that connects the
node to its neighbors.
A p-processor system has worst-case one-way latency of 4.5p^(1/2)
cycles.
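The node count and worst-case latency can be checked with a short calculation. This is a sketch under one assumption not stated in the text: that the worst case corresponds to travelling halfway around the ring in each of the three dimensions, which gives 1.5p^(1/2) hops at three cycles each. The function names are illustrative.

```python
import math


def torus_side(p: int) -> int:
    """Side length p^(1/2) of the three-cube; the sketch needs a perfect square."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return side


def network_nodes(p: int) -> int:
    """Number of routing nodes, p^(3/2)."""
    return torus_side(p) ** 3


def worst_case_latency_cycles(p: int, cycles_per_node: int = 3) -> float:
    """One-way latency, assuming half a ring is crossed in each dimension."""
    max_hops = 3 * torus_side(p) / 2
    return cycles_per_node * max_hops


if __name__ == "__main__":
    for p in (16, 64, 256):
        print(p, network_nodes(p), worst_case_latency_cycles(p))
```

For p = 256 this gives 4096 nodes and 72 cycles, matching 4.5p^(1/2).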
A node has four ports (five if a resource is attached).
Each port simultaneously transmits and receives an entire
164-bit packet every 3 ns clock cycle.
Of the 164 bits, 64 are data, so the data bandwidth per port is
2.67 GB/s in each direction.
The network bisection bandwidth is 2.67p GB/s.
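The port and bisection bandwidths follow directly from the packet format and the cycle time. A minimal sketch of the arithmetic, using only the numbers quoted above (the names are illustrative):

```python
# Per-port and bisection bandwidth from the packet format quoted above.
PACKET_BITS = 164   # total bits per packet
DATA_BITS = 64      # payload bits per packet
CYCLE_NS = 3        # one packet per port per 3 ns cycle


def port_data_bandwidth_gbs() -> float:
    """Payload bytes per nanosecond per direction (numerically GB/s)."""
    return (DATA_BITS / 8) / CYCLE_NS


def port_raw_bandwidth_gbs() -> float:
    """All 164 bits counted, including the non-data bits of each packet."""
    return (PACKET_BITS / 8) / CYCLE_NS


def bisection_bandwidth_gbs(p: int) -> float:
    """Bisection bandwidth of a p-processor system, 2.67p GB/s."""
    return p * port_data_bandwidth_gbs()


if __name__ == "__main__":
    print(f"data per port: {port_data_bandwidth_gbs():.2f} GB/s")
    print(f"raw per port:  {port_raw_bandwidth_gbs():.2f} GB/s")
    print(f"bisection, p=256: {bisection_bandwidth_gbs(256):.0f} GB/s")
```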
The network routing nodes contain no buffers other than those
required for the pipeline. Instead, all messages are
immediately routed to an output port.
Messages are assigned random priorities and then routed
in priority order. Under heavy load, some messages are
derouted by this process. The randomization at each node
ensures that each packet eventually reaches its
destination.
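The bufferless routing described above can be illustrated with a toy model of a single node during one cycle. This is only a sketch of the general idea, not the actual Tera routing logic: it assumes each message has exactly one preferred output port, and the function name and data layout are invented for illustration.

```python
import random


def route_one_cycle(messages, num_ports, rng):
    """Assign every message to a distinct output port in one cycle.

    messages: list of (msg_id, preferred_port) pairs, at most num_ports long.
    Returns a dict mapping msg_id to the port it actually leaves on.
    """
    if len(messages) > num_ports:
        raise ValueError("a bufferless node cannot hold more messages than ports")

    # Random priorities: a shuffled order stands in for random priority values.
    order = list(messages)
    rng.shuffle(order)

    free_ports = set(range(num_ports))
    assignment = {}
    losers = []

    # First pass: in priority order, grant the preferred port if still free.
    for msg_id, preferred in order:
        if preferred in free_ports:
            free_ports.remove(preferred)
            assignment[msg_id] = preferred
        else:
            losers.append(msg_id)

    # Second pass: losers are derouted onto whatever ports remain.
    for msg_id in losers:
        port = min(free_ports)
        free_ports.remove(port)
        assignment[msg_id] = port

    return assignment


if __name__ == "__main__":
    rng = random.Random(0)
    print(route_one_cycle([("a", 0), ("b", 0), ("c", 2)], num_ports=4, rng=rng))
```

Because nothing is buffered, every message leaves on some port each cycle; a derouted message simply takes a longer path and competes again, with a fresh random priority, at the next node.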
The overall hardware configuration of the system:
Processors          16       64       256
Peak Gflops         16       64       256
Memory, Gbytes      16-32    64-128   256-512
HIPPI channels      32       128      512
I/O, Gbytes/s       6.2      25       102
Memory of the System:
- The memory system is implemented as either 2p or 4p
memory units distributed around the network.
- Memory is implemented using 16-megabit DRAM chips. The memory
units are interleaved 64 ways.
- The memory units can be addressed by 8-bit byte, 16-bit
quarterword, 32-bit halfword or 64-bit word. A
fetch-and-add operation is provided on words.
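Fetch-and-add atomically returns the old contents of a word and stores the sum back. The sketch below models only those semantics in software, with a lock standing in for the hardware's atomicity; the `Word` class is hypothetical and says nothing about how the MTA memory units implement the operation.

```python
import threading


class Word:
    """A memory word with an atomic fetch-and-add (software model only)."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_add(self, increment: int) -> int:
        """Add `increment` to the word and return the value it held before."""
        with self._lock:
            old = self._value
            self._value = old + increment
            return old


def claim_indices(counter: Word, count: int, out: list) -> None:
    """Each call to fetch_and_add(1) hands the caller a unique index."""
    for _ in range(count):
        out.append(counter.fetch_and_add(1))


if __name__ == "__main__":
    counter = Word()
    results = [[] for _ in range(4)]
    threads = [threading.Thread(target=claim_indices, args=(counter, 100, r))
               for r in results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    claimed = sorted(i for r in results for i in r)
    print(claimed == list(range(400)))
```

A typical use is the one shown: many threads share a counter and each obtains a distinct slot or loop index without any other synchronization.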
I/O System:
- The maximum bandwidth in a p-processor system is 200p
megabytes per second in each direction via p duplex HIPPI
channels.
- Maximum Strategy Gen5 XL RAIDs are used, with a sustained
bandwidth of about 130 megabytes per second
each.
- At least p/16 disk arrays must be configured in a
p-processor system.
- The maximum capacity per disk array is about 360 gigabytes, so
system disk capacity can approach 300p gigabytes.
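The I/O figures above scale with the processor count p. A minimal sketch of the sizing arithmetic, using only the numbers quoted in the list; rounding p/16 up to a whole array is an assumption, and the function name is illustrative.

```python
import math

HIPPI_MB_PER_S = 200        # per channel, each direction
RAID_SUSTAINED_MB_PER_S = 130
RAID_CAPACITY_GB = 360
PROCESSORS_PER_ARRAY = 16   # at least p/16 disk arrays


def io_summary(p: int) -> dict:
    """Peak HIPPI bandwidth and the minimum disk configuration for p processors."""
    min_arrays = math.ceil(p / PROCESSORS_PER_ARRAY)  # assumed to round up
    return {
        "hippi_mb_per_s_each_direction": HIPPI_MB_PER_S * p,
        "min_disk_arrays": min_arrays,
        "min_config_sustained_disk_mb_per_s": min_arrays * RAID_SUSTAINED_MB_PER_S,
        "min_config_capacity_gb": min_arrays * RAID_CAPACITY_GB,
    }


if __name__ == "__main__":
    for p in (16, 64, 256):
        print(p, io_summary(p))
```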
Contact Address:
Tera Computer Company
2815 Eastlake Avenue East
Seattle, Washington 98102
USA
For more information on the Tera platforms, take a look at this or at the
company's homepage.
Saleh Elmohamed, saleh@npac.syr.edu