MasPar Computer Corporation

Status:

No longer active.

Overview of Organization:

MasPar was formed in 1988 by a Digital Equipment Corporation Vice-President, and the company retains an association with DEC. The company is quite small, with an installed base of around 30 machines and a staff of about 100. It produces a single range of SIMD machines, the MP-1 series, which consists of five models. The range currently supports a UNIX operating system, C and Fortran compilers, an advanced graphical programming environment and other tools.

Platforms Documented:


MasPar MP-1

Overview of Platform:

Architecture: The machine consists of Processing Elements (PEs) connected in a 2-D lattice. The machine is driven by a front-end computer (typically a VAX). High-speed I/O devices can be attached, and direct access to the DEC memory bus is possible.

Processors: The PEs are custom designed by MasPar. They are RISC-like and grouped into clusters of 16 on the chips. Each cluster has the PE memories and connections to the communications network. Instructions are issued by the Array Control Unit, which is a RISC-like processor based on standard chips from Texas Instruments.

Topology: Grid connections allow communication to 8 nearest neighbors.

Operating System: Supplied with UNIX front-end.

Languages: The languages supported are an ANSI-compatible C and MasPar Fortran (MPF), which is an in-house version of Fortran 90.

Programming Environment: MasPar has licensed a version of the Fortran conversion package VAST-2 from Pacific-Sierra Research Corporation. This product converts from scalar Fortran 77 source code to parallel MPF source. The conversion can also be done in reverse.

Performance: 1.2 GFLOPS (2.6 GIPS) for a 16384 PE machine.

Data Transfer: Nearest neighbor 18 Gbyte/sec for a 16384-PE machine, and 1300 Mbyte/sec using the global router (manufacturer's figures).

Scalability: Scales from 1024 to 16384 processing elements.

Fault Tolerance: The manufacturer claims a mean time between failures of over 8,000 hours. No fault-tolerant features.

Price Performance: With an estimated £500,000 for a 16384-processor system this gives £450,000 per GFLOPS (using figures from PPP).

User base: The machine is marketed as a Grand Challenge machine due to its high reliability. The DAP 610c has a lower FLOPS rating by a factor of two for a machine with four times fewer processors. The installed base is small. Typical applications are DNA sequence matching and image deblurring.

It is not clear whether the DECmpp is exactly the same product as the MP-1. It will be interesting to see what DEC does if the SIMD market takes off.


MasPar MP-2

A picture of the MasPar MP-2/MP-1 cluster at NASA ESS

Overview

The diagram below shows how the MasPar architecture is organized. The two main parts to be concerned with are the front-end and the DPU (Data Parallel Unit).


Diagram from MasPar System Overview and MPPE Manual, MasPar Computer Corporation

MasPar front-end

Since the computational engine of the MasPar does not have an operating system of its own, a UNIX-based workstation is used to provide the programmer with a "friendly" interface to the MasPar. The MasPar front-end at our department is a DEC 3100 workstation named beauty. Beauty runs DEC's version of UNIX, called ULTRIX, and provides users with a windowing programming environment, networking capability, I/O device access, etc. When MasPar programs are executed, the user process runs on beauty while parallel code is automatically passed to the DPU for execution. Programs can be compiled and debugged on beauty using MPPE (MasPar Programming Environment).

DPU (Data Parallel Unit)

The DPU executes the parallel portions of a program and consists of two parts:

ACU

The ACU has two tasks:

  1. Execute instructions that operate on singular data.
  2. Simultaneously feed instructions which operate on parallel data (known as "plural" data in MPL) to each PE (Processor Element).

Programs written in standard C and Fortran are executed on the front-end machine. These programs can contain procedures written in MPL (MasPar Programming Language) or MPF. When these procedures are called, they are executed entirely inside the DPU. Executing entirely in the DPU has the advantage that the code is slightly simpler in design. However, sequential code segments will probably perform poorly because of the limited processing capability of the processor inside the ACU. Depending on the amount of sequential code in the entire program, it may or may not pay off to run it entirely in the DPU.

Parallel operations on parallel (plural) data are executed in the DPU as follows. The ACU broadcasts each instruction to all PEs in the PE Array. Each PE in the array then executes the instruction simultaneously, manipulating its own copy of the plural (parallel) data.
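The broadcast model described above can be imitated in a few lines of ordinary Python. This is only a toy sketch, not MPL and not how the hardware is implemented: the names `broadcast`, `NPROC`, and the `active` mask are hypothetical, the PE count is shrunk to 16 for readability, and the idea of an "active set" is borrowed from the communication sections later in this guide.

```python
# Toy model of SIMD execution: one instruction stream, many data copies.
# Plain Python for illustration only; all names here are hypothetical.

NPROC = 16  # assumed small PE count; the machine described here has 4096


def broadcast(instruction, plural, active):
    """Apply one instruction to every active PE's copy of a plural value."""
    return [instruction(v) if on else v for v, on in zip(plural, active)]


# Each PE holds its own copy of the plural variable x (here, its PE index).
x = list(range(NPROC))
all_active = [True] * NPROC

# The "ACU" issues x = x * 2 once; every PE applies it to its own copy.
x = broadcast(lambda v: v * 2, x, all_active)

# Restricting the active set leaves the other PEs' copies untouched.
even_only = [i % 2 == 0 for i in range(NPROC)]
y = broadcast(lambda v: v + 1, x, even_only)
```

The point of the sketch is that the instruction is written once, while the data it touches exists in as many copies as there are PEs.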

Programs written using MPL are executed as above except for the following. The ACU fetches and decodes all program instructions during its execution. When an instruction that operates on singular data is encountered, the ACU simply executes the instruction locally on its own processor. When an instruction operating on plural data is decoded, it is processed as described above. The front-end processor does not execute any code in this case.

PE Array

The PE Array is a 2D mesh of relatively simple processors (PEs). Our MasPar at UO consists of a 64 x 64 array of PEs, for a total of 4096 processors. Each processor is connected to all eight of its neighbors as shown in the diagram below. The connections at the edges of the mesh wrap around to form a torus shaped network.


Diagram from MasPar System Overview and MPPE Manual, MasPar Computer Corporation

Each PE is capable of reading and writing memory and performing arithmetic operations. The PEs are not able to fetch or decode instructions; they can only execute them. Each PE has 16K of RAM and forty 32-bit registers.

The 2D mesh of PEs is divided up into 4 x 4 clusters of processors. Since our MasPar here at UO has 4096 processors, we have a 16 x 16 mesh of clusters. The diagram below illustrates how clusters and PEs are related.
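As a rough illustration of the arithmetic above, the following Python sketch maps a PE's mesh coordinates to the cluster that contains it. The function and constant names are hypothetical, and the sketch assumes clusters are aligned to multiples of four in both directions, which the text implies but does not state outright.

```python
# Map a PE's (row, col) position to its 4 x 4 cluster.
# Illustrative names; assumes clusters are aligned on multiples of 4.

NYPROC = 64      # rows of PEs on the 4096-PE machine described here
NXPROC = 64      # columns of PEs
CLUSTER_DIM = 4  # each cluster is 4 x 4 PEs


def cluster_of(row, col):
    """Return the (cluster_row, cluster_col) that holds PE (row, col)."""
    return (row // CLUSTER_DIM, col // CLUSTER_DIM)


clusters_per_side = NXPROC // CLUSTER_DIM  # 64 / 4 = 16
total_clusters = clusters_per_side * (NYPROC // CLUSTER_DIM)  # 16 x 16
```

For example, `cluster_of(5, 10)` places that PE in cluster row 1, cluster column 2.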


Diagram from MasPar System Overview and MPPE Manual, MasPar Computer Corporation

An important point to remember is that all the PEs in a cluster share a common global communications channel (a crossbar switch). This is important because bottlenecks can arise when large amounts of inter-cluster global routing are used for communication.
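To get a feel for why the shared channel matters, here is a deliberately simplified Python estimate. It assumes that each cluster can inject only one message into the global router at a time, so the busiest cluster determines how many rounds are needed. That one-message-per-round model is an assumption made for illustration, not a documented property of the hardware, and `estimated_rounds` is a hypothetical helper.

```python
# Rough contention estimate for the shared per-cluster router channel.
# Assumption (not from the manual): one message per cluster per round.

from collections import Counter


def estimated_rounds(sender_coords, cluster_dim=4):
    """Return the largest number of senders found in any single cluster."""
    per_cluster = Counter(
        (row // cluster_dim, col // cluster_dim) for row, col in sender_coords
    )
    return max(per_cluster.values(), default=0)


# Sixteen senders packed into one cluster all compete for one channel.
packed = [(r, c) for r in range(4) for c in range(4)]

# Sixteen senders spread over sixteen clusters each get their own channel.
spread = [(4 * i, 0) for i in range(16)]

packed_rounds = estimated_rounds(packed)
spread_rounds = estimated_rounds(spread)
```

Under this model the packed pattern needs sixteen rounds and the spread pattern needs one, which is the kind of difference a data mapping can make.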


Processor communication

There are three constructs for communicating between PEs in the PE Array.

proc

The proc construct is used to access or modify data which resides on a particular PE in the DPU. This form of communication is useful for situations where parallel (plural) data needs to be accessed or modified at a single PE. The proc construct can be used in two ways:

  1. proc[PE_index_expression].expression

    and

  2. proc[PE_row_index][PE_column_index].expression

PE_index_expression (the absolute index) ranges from 0 to 4095
PE_row_index (the y-coordinate of the PE) ranges from 0 to 63
PE_column_index (the x-coordinate of the PE) ranges from 0 to 63
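The two indexing forms refer to the same 4096 PEs, so one can be converted to the other. The Python sketch below assumes a row-major numbering (absolute index = row * 64 + column). That ordering is an assumption made for illustration; check the MPL documentation before relying on it. The helper names are hypothetical.

```python
# Convert between an absolute PE index and (row, column) coordinates.
# Assumes row-major numbering; helper names are illustrative only.

NXPROC = 64  # PEs per row on the 64 x 64 machine described here
NYPROC = 64  # number of rows


def to_row_col(index):
    """Split an absolute index (0..4095) into (row, col)."""
    return divmod(index, NXPROC)


def to_index(row, col):
    """Combine (row, col), each 0..63, into an absolute index."""
    return row * NXPROC + col
```

With this numbering, index 0 is the PE at row 0, column 0, and index 4095 is the PE at row 63, column 63.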

The expressions, PE_index_expression, PE_row_index, and PE_column_index must be singular expressions. This means that they may not reference parallel (plural) variables. These expressions are used to index the PE array in order to uniquely specify a single PE.

The expression, expression, is used to specify one of two things, depending on which side of the '=' the proc expression is on.

  1. Which parallel (plural) variable is to be modified on the selected PE.

    or

  2. An expression computed from plural data located at the selected PE.
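The two roles of a proc expression can be mimicked in Python by treating a plural variable as a 64 x 64 grid with one value per PE. This is a sketch of the semantics only: `proc_write` and `proc_read` are hypothetical helpers, and the MPL statements in the comments are paraphrases of the forms listed above rather than tested MPL code.

```python
# Simulate reading and writing one PE's copy of a plural variable.
# Hypothetical helpers; the grid stands in for the PE Array.

NYPROC, NXPROC = 64, 64

# A plural variable x: one value per PE, initially 0 everywhere.
x = [[0] * NXPROC for _ in range(NYPROC)]


def proc_write(plural, row, col, value):
    """Roughly: proc[row][col].plural = value (proc on the left of '=')."""
    plural[row][col] = value


def proc_read(plural, row, col):
    """Roughly: s = proc[row][col].plural (proc on the right of '=')."""
    return plural[row][col]


proc_write(x, 2, 3, 99)  # only the PE at row 2, column 3 is modified
s = proc_read(x, 2, 3)   # the result is a single (singular) value
```

Note that the row and column arguments are ordinary scalars, matching the requirement that the index expressions be singular.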

xnet

Xnet is generally used for communicating between PEs when the communication pattern among PEs exhibits a uniform direction and distance. Typically, xnet communication is relatively short range. With the xnet construct, data may be accessed or modified at PEs lying at a specified radius in any of eight directions from an active PE. Also, since the connections at the edges of the PE Array wrap around to form a toroid, xnet communication wraps around as well.

The xnet construct has eight forms:

  1. xnetN[distance_expr].expression
  2. xnetNE[distance_expr].expression
  3. xnetE[distance_expr].expression
  4. xnetSE[distance_expr].expression
  5. xnetS[distance_expr].expression
  6. xnetSW[distance_expr].expression
  7. xnetW[distance_expr].expression
  8. xnetNW[distance_expr].expression

The expression, distance_expr, must be a singular expression. This means that it cannot reference plural (parallel) variables. This expression is used to compute the distance, measured in number of PEs, between the communicating PEs.

The expression, expression, is used to specify one of two things, depending on which side of the '=' the xnet expression is on.

  1. Which parallel (plural) variable is to be modified on the target PEs.

    or

  2. An expression computed from plural data residing at PEs located at a radius of distance_expr in the specified direction.
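The fetch case (xnet on the right of the '=') can be sketched in Python as a shifted copy of a grid with wrap-around at the edges. The sketch assumes that north means a smaller row number and west a smaller column number, which is a convention chosen here for illustration. `xnet_fetch` and `OFFSETS` are hypothetical names, and a 4 x 4 grid stands in for the full PE Array.

```python
# Simulate "a = xnetDIR[distance].b" on a small toroidal grid.
# Assumes N = row - 1 and W = col - 1; names are illustrative only.

OFFSETS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def xnet_fetch(plural, direction, distance):
    """Each PE reads the value held by the PE `distance` steps away."""
    rows, cols = len(plural), len(plural[0])
    dr, dc = OFFSETS[direction]
    return [
        [
            plural[(r + dr * distance) % rows][(c + dc * distance) % cols]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# b holds each PE's own absolute index on a 4 x 4 grid.
b = [[r * 4 + c for c in range(4)] for r in range(4)]

from_west = xnet_fetch(b, "W", 1)   # every PE reads its western neighbor
from_north = xnet_fetch(b, "N", 1)  # every PE reads its northern neighbor
```

The modulo arithmetic is what models the torus: the PE in column 0 reads from column 3 of the same row when it looks one step west.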

router

Router is used for point-to-point communication from PEs in the active set to any other PEs. This is useful in situations where the PE communication pattern is non-uniform in direction and/or distance. The router construct has the form:

router[PE_index_expression].expression

PE_index_expression (the absolute index) ranges from 0 to 4095

For the router construct, PE_index_expression must evaluate to a plural (parallel) integer. This means that each PE calculates the index of the PE it wants to communicate with using its own copy of parallel data.

The expression, expression, is used to specify one of two things, depending on which side of the '=' the router expression is on.

  1. Which parallel (plural) variable is to be modified on the target PEs.

    or

  2. An expression computed using the plural data residing at the PEs indexed by PE_index_expression.
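Because the index is plural, every PE can name a different partner. The Python sketch below imitates both sides of the '=' using flat lists of eight PEs. The helper names are hypothetical, and the handling of two PEs writing to the same target (the later one wins here) is an arbitrary choice of this sketch, since the text above does not say how such collisions are resolved.

```python
# Simulate router communication with a per-PE (plural) index.
# Illustrative helpers; collision handling is an arbitrary choice.

NPROC = 8  # small stand-in for the 4096 PEs of the real array


def router_fetch(plural, index):
    """Roughly: a = router[index].plural (each PE reads from its target)."""
    return [plural[i] for i in index]


def router_send(target, index, values):
    """Roughly: router[index].target = values (each PE writes to its target)."""
    out = list(target)
    for i, v in zip(index, values):
        out[i] = v
    return out


b = [10 * i for i in range(NPROC)]

# Each PE computes its own partner: here, the mirror-image PE.
mirror = [NPROC - 1 - i for i in range(NPROC)]
fetched = router_fetch(b, mirror)

# Each PE sends its value three places along, wrapping at the end.
shifted = [(i + 3) % NPROC for i in range(NPROC)]
sent = router_send([0] * NPROC, shifted, b)
```

Unlike xnet, nothing here requires the partners to lie in a common direction or at a common distance.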


CSEP MP-2 Guide:
Their chapter on the MasPar mainly contains information about the architecture of the MasPar, data mapping strategies, and processor communication issues.


saleh@npac.syr.edu