Do this laboratory on a MathLAN workstation in 3819 or 3815.
Determine some basic facts about your machine, like how much memory
it has
$ free -m
and how big its pages are
$ getconf PAGESIZE
Theory-based predictions
Consider the following loop.
int i;
long array[ARRAY_LEN];
int stride=??;
for (i=0 ; i<ARRAY_LEN ; i+=stride)
  array[i]++;
Alternatively, we might write the same loop using pointers.
int i;
long array[ARRAY_LEN];
int stride=??;
long *p_array_val = array;
for (i=0 ; i<ARRAY_LEN ; i+=stride, p_array_val+=stride)
  (*p_array_val)++;
Given the page size you found above, what is the smallest value of
stride that would guarantee each iteration of the loop touches
a different page? (Hint: You also need to know how many bytes
a long occupies!)
What do you expect to happen to the average time for each loop iteration
when stride reaches the value you found? (Note: We must measure
average time to control for differing numbers of iterations
as stride changes.) Give both qualitative and quantitative
answers; for the latter, ratios or percentages might be appropriate.
Your answers to these questions are the motivation for this lab; you
should confirm both with another group.
Browse along with the starter code in your editor as you read the
commentary below.
matrix.h
This header file defines the type matrix_t. Note the fields
are simply a pointer to the matrix entries, *data,
and the dimensions of the matrix (rows and cols).
We will assume that the matrix is stored linearly in memory in row-major
order. This means that rows of the matrix are stored contiguously.
Thus, as we walk along the linear indices of the array, the column
index changes the fastest, while the row index changes slowest. For
instance, a 4×3 matrix has the following linear indices:
 0  1  2
 3  4  5
 6  7  8
 9 10 11
which means it is actually stored in memory as
0 1 2 3 4 5 6 7 8 9 10 11
The following macro converts row and column matrix subscript indices
to a single linear index of the array:
We could change the type of the data stored in the matrix at compile
time by changing the value defined by MTX_TYPE. To use plenty
of memory for our experiments, we'll use the unsigned long
type given here.
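Putting those pieces together, the declarations in matrix.h look roughly like the sketch below. The field names and the roles of the macro and MTX_TYPE come from the commentary above, but the macro's name (MTX_INDEX here) and its exact parameter order are guesses; check the actual header for the real definitions.

```c
#define MTX_TYPE unsigned long   /* change at compile time to experiment */

typedef struct {
    MTX_TYPE *data;   /* matrix entries, stored linearly in row-major order */
    int rows;
    int cols;
} matrix_t;

/* Convert (row, col) subscripts to a single linear index: moving down
   one row skips a full row of `cols` entries. */
#define MTX_INDEX(row, col, cols) ((row) * (cols) + (col))
```

For the 4×3 example above, entry (3, 2) maps to linear index 3*3 + 2 = 11, the last element of the array.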
matrixop.c
This file implements one function, mtxAdd, that takes pointers
to two matrices (the parameters are type matrix_t*) and
stores their computed sum in a third matrix. There is also a helper
function, mtxCheckAddDim, that verifies the two matrices
have dimensions that are compatible for addition.
To compute this quickly, without doing any array indexing arithmetic,
pointers to the matrix data are used. Rather than addition, you will
write functions that compute and store matrix products. To maximize
efficiency, you will also be asked to use pointers to the matrix data.
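As an illustration of that pointer-walking style, here is a sketch of entrywise addition under an assumed matrix_t layout. This is not the actual mtxAdd from matrixop.c; the function and field names are stand-ins.

```c
#include <stddef.h>

typedef struct {
    unsigned long *data;  /* row-major entries */
    int rows, cols;
} matrix_t;   /* assumed layout; see matrix.h for the real definition */

/* Add a and b entrywise into c, advancing raw pointers instead of
   computing i*cols + j on every access.  All three matrices are
   assumed to be allocated with matching dimensions. */
static void add_sketch(const matrix_t *a, const matrix_t *b, matrix_t *c) {
    size_t n = (size_t)a->rows * a->cols;
    const unsigned long *pa = a->data, *pb = b->data;
    unsigned long *pc = c->data;
    while (n-- > 0)
        *pc++ = *pa++ + *pb++;
}
```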
matrixutil.h
You don't need to understand the implementation of these routines,
but the functions mtxAlloc and mtxRand will be useful.
You can read the comments in the header for descriptions.
testadd.c
A simple test program for the given matrix addition function mtxAdd.
Exercises
Part A: Multiplication and Locality
Read the Makefile, then make and run the testadd
program to verify its correctness.
Pseudocode for the matrix addition algorithm
in testadd.c looks like the following:
for i=0 to M-1
  for j=0 to N-1
    c[i,j] = a[i,j] + b[i,j]
Note that switching the order of the loops does not change the correctness
of the algorithm.
Explain what would happen at the hardware level (i.e., with respect
to the TLB and memory cache) if the order of the two loops were reversed.
(Hint: Consider how the data is organized in memory and how
close in memory the algorithmically sequential reads and writes are.)
Write similar pseudocode to calculate a matrix
product so that the temporally sequential reads and writes are as
contiguous as possible; you will need at least three for loops.
Using your ordering of the loops,
explain which memory accesses (reads from a and b,
writes to c) are contiguous and why it is not possible for
any other ordering to exhibit greater contiguity.
Write another matrix product pseudocode so
that the temporally sequential reads and writes are as disparate
as possible. Explain which memory accesses are NOT contiguous and
why it is not possible for any other ordering to exhibit lesser contiguity.
In matrixop.c, implement and document the function
that verifies whether a and b are valid factors
for matrix product c (i.e., a must be M×N,
b must be N×P and c must be M×P
for any positive values of M,N,P). Your function should return
zero if the arguments are valid and -1 otherwise.
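One way to realize that specification is sketched below. This is only an illustration under an assumed matrix_t layout; your function's name and signature should match the ones declared in the starter code.

```c
typedef struct {
    unsigned long *data;
    int rows, cols;
} matrix_t;   /* assumed layout; see matrix.h */

/* Return 0 if a (MxN), b (NxP), and c (MxP) have dimensions
   compatible for the product c = a * b, and -1 otherwise. */
static int check_multiply_dim(const matrix_t *a, const matrix_t *b,
                              const matrix_t *c) {
    if (a->rows <= 0 || a->cols <= 0 || b->cols <= 0)
        return -1;                                 /* M, N, P must be positive */
    if (a->cols != b->rows)                        /* inner dimensions must agree */
        return -1;
    if (c->rows != a->rows || c->cols != b->cols)  /* c must be MxP */
        return -1;
    return 0;
}
```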
In matrixop.c (and the header) implement and document two
functions to calculate the product of two matrices:
mtxMultiplyMin (for minimizing cache misses, according to A.3)
and a similar function mtxMultiplyMax (for maximizing misses,
according to A.4), both of which should be based
on your responses to the pseudocode exercises above.
The procedures should return zero upon success and a negative value
if the dimensions of the factors and product do not match appropriately
using mtxCheckMultiplyDim.
Evaluation
While it is no easier to implement matrix multiplication using array
indexing, the efficiency effects you profile in Part B will be more
pronounced if all matrix data references use pointers. Therefore, more
credit will be given for solutions that use pointers for all
matrix data references (both reading and storing), as mtxAdd does.
Tips
Keep the following things in mind as you plan your implementations:
Unless you and your partner are very sure of your pointer
programming (and debugging!) skills, you should write these functions
using array indices first, to make sure you have got the right idea.
(With citation, you can use the macro above to convert subscripts
to linear indices.) Once you have done that, it should be somewhat
easier to write a version using pointers.
You do not need to initialize the pointers in the header of the for
loop; you can initialize some using your counter variables inside
loop bodies.
Write a small program (analogous to testadd.c) to test your
implementation, and use printf liberally to verify you are
accessing the values you mean to.
Part B: Performance Profiling
The program timemultiply.c takes N (matrix size)
and a method (min or max)
as arguments. It then takes the product of two N×N matrices
three times. To maximize the chance of getting a cold cache each time,
new matrices are allocated for each multiplication. To isolate resources,
it uses a GNU/Linux-specific extension of the system call getrusage(2)
to query the resource usage caused by your matrix multiplication routine
in a child thread.
Because we don't want to include the set-up (memory allocation and
matrix initialization) in our time, it creates a separate thread that
calls the appropriate multiplication function after all the initialization
is done in the master thread. After the worker thread performs the
multiplication, it calls getrusage(2) to query its own resource usage
before terminating. Using timeradd(3), the thread computes, for each run,
the sum of
total user time (CPU time spent in user mode, in seconds)
total system time (CPU time spent in kernel mode, in seconds)
and prints the three sums tab-separated on a single line.
$ ./timemultiply 128 min
0.044002 0.044002 0.044002
$ ./timemultiply 512 max
1.764110 1.756109 1.772110
Of course, timemultiply.c requires you to have implemented
mtxMultiplyMin and mtxMultiplyMax.
However, a precompiled version is already provided for you should
you want to benchmark the (non)efficiency of your pointer arithmetic
or use a trusted version for your experiments.
You can build your own from your modified matrixop.c with
$ make timemultiply
In any case, be sure your report (below) documents the provenance
of the program used to generate your data.
Using both min and max options, run the timemultiply
program for values of N = [128, 192, 256, 384, 512, 768, 1024,
1536]. Collect the times and order them nicely in a text file or
spreadsheet.
Note: N=1536 may take one to two minutes.
Extra: If you implemented your functions efficiently
with pointers, you should be able to use N=2048 as an additional
data point. (My mtxMultiplyMax took about 12 minutes
for one run.)
Here is an example of a (bash) shell command that will generate the
file mtxMultiplyProfile with lines giving the measurements
in the format
N run1 run2 run3
If your shell is something other than bash, you may need to modify
this slightly to meet the syntax of your shell. You can augment the
following command to include the matrix sizes you must measure times
for.
for n in 2 4 8 16 32 64
do
  echo -ne "$n\t" >> mtxMultiplyProfile
  ./timemultiply $n min >> mtxMultiplyProfile
done
Use a spreadsheet (or another tool of your choice,
such as R, gnuplot, or MATLAB) to create the following
plots:
Matrix size N versus average total time for both
the min and max methods. Show the two plots on the same
axes with clear axis labels, a title, and a legend.
Matrix size N versus the ratio of the paired
average times Tmax(N) / Tmin(N)
for the same N. Title your plot and label your axes.
Note: Because the size N increases exponentially, you
may find your graphs more informative with a logarithmic scale on the
x-axis.
Analyze your results by answering the following
questions:
Give relevant details about the program and machine you used to perform
your experiments, so that your reader can understand the context for
them (cf. Preliminaries above)
Using B.2a, does the expected O(N^3)
trend for matrix multiplication seem to hold for the CPU times you
found?
(Hint: Measure the ratio T(N) / T(N/2)
for varying N.)
Compare and contrast the trends between the two methods, explaining
the results.
Using B.2b, characterize and explain the difference
between the two times you plotted.
The preliminaries suggest an important performance impact of taking
jumps in an array that straddle page boundaries. How big does N
need to be for entries in the same column but different rows to be
in different pages? Put another way, if an unsigned long
is 8 bytes on your machine, what value of N makes an entire row
of the matrix exactly one page? Show your work.
What to turn in
Your source code in matrixop.h, matrixop.c, and
any other files developed for testing
A single PDF containing (merged)
Your enscripted source code files (see above; do not include any extra
files)
A transcript of your programs' compilations and test run(s) demonstrating
correctness
How do you expect the results to change with other matrix types? In
particular, what happens as you keep the matrix size constant but
the data type gets smaller or bigger (e.g., short, int, long long)?
Explore the performance relationships to one another and explain your
results.