We've all coded matrix multiply at least once in our lives (and soon at least twice). Here is the simplest algorithm to compute C =index.html C + A*B, where all matrices are n-by-n:
for i=1 to n
for j=1 to n
for k=1 to n
C(i,j) =index.html C(i,j) + A(i,k)*B(k,j)
end for
end for
end for
Unfortunately, this algorithm is oblivious to the computer's memory hierarchy,
and optimizing compilers provide limited solace. Consider the
If you didn't want to pay for ESSL, you might look around on the WWW in the
Netlib Repository of public-domain
software, and discover a public-domain matrix multiply optimized for the RS6000
(
dmr).
Using f77 -O2 on this routine yields 152 Mflops, and using f77 -O3 -qhot
yields 186 Mflops, still a ways from ESSL's 240 Mflops. Not only that, since
the careful programmer always tests to see if the answer is correct, he or she
will soon discover that the answer using dmr with f77 -O3 -qhot is completely
wrong for n>64 matrices (well, the compiler did issue a warning message about
possibly changing the semantics of the program).
For this assignment, we will be coding an optimized memory-hierarchy-cognizant matrix-matrix multiply routine in C. For simplicity, we will only require square matrices (worrying about the non-square case, essential for a good library code, can be a bit time-consuming). The goal of the assignment is to:
For extra credit, you may try to optimize your routine for another architecture different form IBM RS6000 or implement one more BLAS3 routine.
We will be using an IBM RS6000/590, accounts to be made available
soon. Note that the IBM compiler supports a wide variety of compiler
optimizations
which you should read about
(see
http://www.austin.ibm.com/tech/options.html, or type 'xlc'
once you are logged on to the RS6000).
Here is some basic RS6000 architectural information:
System CPU ClkMHz Cache SPEC SPEC Info Source Name Type ext/in Ext+I/D 92Int 92 FP Date Obtained ------------ --------- ------ ------------ ----- ----- ----- ------------------ IBM 590 POWER2 66.6 32/256 121.6 259.7 Jul94 comp.benchmarksMore architectural information can be found at
To judge the relative qualities of your routines, we will have a contest with a non-negligible 1st prize and a ..., well, let's just say a 2nd prize. We will work in mixed teams of three people. Each team will create a single routine and submit it to the matrix appropriations committee (MAC) consisting of the Professor and TA. We'll all gather on Tuesday following the due date during the 6:00PM time-slot with each team making a short 2-minute presentation about their routine, and why it should run fast. Then we will time the routine on an dedicated RS6000, as well as make sure you get the right answer. The test program and method for measuring performance will be posted later. The winning team will then be applauded gratuitously, and presented with the non-negligible prize.
You are welcome to discuss your optimization strategies with other teams. You should, however, keep in mind that this is a competition -- any hints might distort the contest results.
Here is the routine interface we will use.
/*
** mul_mfmf_mf:
** Optimized matrix multiply routine.
**
** Parameters:
** int matdim: the matrix dimension
** double * A: source1 matrix of size (matdim)x(matdim)
** double * B: source2 matrix of size (matdim)x(matdim)
** double * C: destination matrix of size (matdim)x(matdim)
**
** Operation:
** C =index.html C + A*B;
**
*/
void
mul_mfmf_mf(
int matdim,
const double *A,
const double *B,
double *C);
Assume an ANSI compliant C compiler and that the arrays are stored in
row-major order. This means that the following loop will access
memory locations in increasing consecutive order:
for (i=0;i < matdim;i++)
for (j=0;j < matdim;j++)
access(A[i*matdim+j]);
This document is available
at
/assignment1.html.
For reference,
The BLAS directory on Netlib is located at
http://www.netlib.org/blas/index.html. A
paper describing the interface in that directory is
located at
http://www.netlib.org/blas/blasqr.ps.
And a BLAS matrix multiply in fortran
can be found at
http://www.netlib.org/blas/dgemm.f.
A good set of papers on the general issues of blocking algorithms for
memory hierarchies is
here. Look at the
README
file for a list of titles of each paper.
tile.ps is also interesting.