Lab: Locality in Matrix Multiplication

CSC 213 - Operating Systems and Parallel Algorithms - Weinman



Summary: Discover just how useful the principle of locality is by experimenting with the effects of (non-)caching.
Assigned: Tuesday 30 September
Due: 11:30 PM Monday 6 October
Objectives:
Reading: A primer on matrix arithmetic.

Preliminaries

Find a machine

Theory-based predictions

Consider the following loop.
int i;
long array[ARRAY_LEN];
int stride=??;
for (i=0 ; i<ARRAY_LEN ; i+=stride)
  array[i]++;
Alternatively, we might write the same loop using pointers.
int i;
long array[ARRAY_LEN];
int stride=??;
long *p_array_val = array;
for (i=0 ; i<ARRAY_LEN ; i+=stride, p_array_val+=stride)
  (*p_array_val)++;
Your answers to these questions are the motivation for this lab; you should confirm both with another group.

Starter code

Copy the starter files to your own directory:
cp -R ~weinman/public_html/courses/CSC213/2014F/labs/code/locality ~/somewhere
Browse along with the starter code in your editor as you read the commentary below.

matrix.h

This header file defines the type matrix_t. Note that the fields are simply an array *data (in fact, a pointer to the matrix data in memory) and the dimensions of the matrix (rows and cols).
We will assume that the matrix is stored linearly in memory in row-major order. This means that rows of the matrix are stored contiguously. Thus, as we walk along the linear indices of the array, the column index changes the fastest, while the row index changes slowest. For instance, a 4×3 matrix has the following linear indices:
 0  1  2
 3  4  5
 6  7  8
 9 10 11
which means it is actually stored in memory as
 0  1  2  3  4  5  6  7  8  9 10 11
The following macro converts row and column matrix subscript indices to a single linear index of the array:
#define SUB2IND(row,col,numcols) (row) * (numcols) + (col)
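As a quick sanity check of this convention (not part of the starter code), the following standalone program repeats the macro locally and prints the linear index of each entry of a 4×3 matrix; its output matches the row-major layout shown above.

#include <stdio.h>

/* Same conversion as the SUB2IND macro in matrix.h */
#define SUB2IND(row,col,numcols) (row) * (numcols) + (col)

int main(void)
{
  int row, col;
  int rows = 4, cols = 3;

  for (row = 0; row < rows; row++) {
    for (col = 0; col < cols; col++)
      printf("%3d", SUB2IND(row, col, cols));  /* linear index of (row,col) */
    printf("\n");
  }
  return 0;
}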
We could change the type of the data stored in the matrix at compile time by changing the value defined by MTX_TYPE. To use plenty of memory for our experiments, we'll use the unsigned long type given here.

matrixop.c

This file implements one function, mtxAdd, that takes pointers to two matrices (the parameters are type matrix_t*) and stores their computed sum in a third matrix. There is also a helper function, mtxCheckAddDim, that verifies the two matrices have dimensions that are compatible for addition.
To compute this quickly, without doing any array indexing arithmetic, pointers to the matrix data are used. Rather than addition, you will write functions that compute and store matrix products. To maximize efficiency, you will also be asked to use pointers to the matrix data.
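For illustration only, an element-wise operation written in this pointer style might look like the sketch below. It is not the starter code's mtxAdd, and it assumes the matrix_t fields *data (of element type MTX_TYPE), rows, and cols described above.

#include "matrix.h"   /* struct matrix_t, MTX_TYPE */

/* Sketch of a pointer-based element-wise sum, c = a + b, assuming the
 * dimensions have already been verified (cf. mtxCheckAddDim). */
void addSketch(const struct matrix_t *a, const struct matrix_t *b,
               struct matrix_t *c)
{
  const MTX_TYPE *pa  = a->data;                      /* walks a linearly */
  const MTX_TYPE *pb  = b->data;                      /* walks b linearly */
  MTX_TYPE       *pc  = c->data;                      /* walks c linearly */
  const MTX_TYPE *end = a->data + a->rows * a->cols;  /* one past a's end */

  while (pa != end)            /* no SUB2IND arithmetic inside the loop */
    *pc++ = *pa++ + *pb++;
}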

matrixutil.h

You don't need to understand the implementation of these routines, but the functions mtxAlloc and mtxRand will be useful. You can read the comments in the header for descriptions.

testadd.c

A simple test program for the given matrix addition function mtxAdd.

Exercises

Part A: Multiplication and Locality

  1. Read the Makefile, then make and run the testadd program to verify its correctness.
  2. Pseudocode for the matrix addition algorithm in testadd.c looks like the following:
    for i=0 to M-1
        for j=0 to N-1
            c[i,j] = a[i,j] + b[i,j]
    Note that switching the order of the loops does not change the correctness of the algorithm.
    Explain what would happen at the hardware level (i.e., with respect to the TLB and memory cache) if the order of the two loops were reversed. (Hint: Consider how the data is organized in memory and how close in memory the algorithmically sequential reads and writes are.)
  3. Write similar pseudocode to calculate a matrix product (you will need at least three nested for loops) so that the temporally sequential reads and writes are as contiguous as possible. Using your ordering of the loops, explain which memory accesses (reads from a and b, writes to c) are contiguous and why it is not possible for any other ordering to exhibit greater contiguity.
  4. Write another matrix product pseudocode so that the temporally sequential reads and writes are as disparate as possible. Explain which memory accesses are NOT contiguous and why it is not possible for any other ordering to exhibit lesser contiguity.
  5. In matrixop.c, implement and document the function
    int mtxCheckMultiplyDim( const struct matrix_t *a, 
                             const struct matrix_t *b,
                             const struct matrix_t *c)
    that verifies whether a and b are valid factors for matrix product c (i.e., a must be M×N, b must be N×P and c must be M×P for any positive values of M,N,P). Your function should return zero if the arguments are valid and -1 otherwise.
  6. In matrixop.c (and the header) implement and document two functions to calculate the product of two matrices:
    int mtxMultiplyMin( const struct matrix_t *a, 
                        const struct matrix_t *b,
                        struct matrix_t *c)
    (for minimizing cache misses, according to A.3) and a similar function mtxMultiplyMax (for maximizing misses, according to A.4), both of which should be based on your pseudocode responses above.
    The procedures should return zero upon success and a negative value if the dimensions of the factors and product do not match, as verified by mtxCheckMultiplyDim.

Evaluation

While implementing matrix multiplication with array indexing is no easier than using pointers, the efficiency effects you profile in Part B will be more pronounced if all matrix data references use pointers. Therefore, more credit will be given for solutions that use pointers for all matrix data references (both reading and storing), as mtxAdd does.
Tips
Keep the following things in mind as you plan your implementations:

Part B: Performance Profiling

The program timemultiply.c takes N (matrix size) and a method (min or max) as arguments. It then takes the product of two N×N matrices three times. To maximize the chance of getting a cold cache each time, new matrices are allocated for each multiplication. To isolate the measurement, it uses a GNU/Linux-specific extension of the system call getrusage(2) to query the resource usage caused by your matrix multiplication routine in a child thread.
Because we don't want to include the set-up (memory allocation and matrix initialization) in our timing, the program creates a separate thread that calls the appropriate multiplication function after all the initialization is done in the master thread. After performing the multiplication, the worker thread calls getrusage(2) to query its own resource usage. Using timeradd(3), it prints the sum of the user and system CPU times for each of the three runs, tab separated, on a single line.
./timemultiply 128 min
  0.044002   0.044002   0.044002
./timemultiply 512 max
  1.764110   1.756109   1.772110
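For reference, the per-thread measurement described above can be sketched as follows. This is only an illustration of the relevant calls, not the actual timemultiply.c source; it assumes a GNU/Linux system, where RUSAGE_THREAD is available when _GNU_SOURCE is defined.

#define _GNU_SOURCE            /* exposes RUSAGE_THREAD (Linux-specific) */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Print the calling thread's total CPU time (user + system) in seconds.
 * In timemultiply, something like this would run in the worker thread
 * right after the multiplication finishes. */
static void print_thread_cpu_time(void)
{
  struct rusage ru;
  struct timeval total;

  if (getrusage(RUSAGE_THREAD, &ru) != 0) {      /* per-thread usage only */
    perror("getrusage");
    return;
  }
  timeradd(&ru.ru_utime, &ru.ru_stime, &total);  /* user + system time */
  printf("  %ld.%06ld", (long) total.tv_sec, (long) total.tv_usec);
}

int main(void)
{
  print_thread_cpu_time();
  printf("\n");
  return 0;
}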
Of course, timemultiply.c requires you to have implemented mtxMultiplyMin and mtxMultiplyMax. However, a precompiled version is already provided for you should you want to benchmark the (non)efficiency of your pointer arithmetic or use a trusted version for your experiments.
You can build your own from your modified matrixop.c with make:
make timemultiply
In any case, be sure your report (below) documents the provenance of the program used to generate your data.
  1. Using both min and max options, run the timemultiply program for values of N = [128, 192, 256, 384, 512, 768, 1024, 1536]. Collect the times and order them nicely in a text file or spreadsheet.
    Note: N=1536 may take one to two minutes.
    Extra: If you implemented your functions efficiently with pointers, you should be able to use N=2048 as an additional data point. (My mtxMultiplyMax took about 12 minutes for one run.)
    Here is an example of a (bash) shell command that will generate the file mtxMultiplyProfile with lines giving the measurements in the format
    N run1 run2 run3
    If your shell is something other than bash, you may need to modify this slightly to match your shell's syntax. Adjust the following command to include the matrix sizes you must measure.
    for n in 2 4 8 16 32 64
      do
        echo -ne "$n\t" >> mtxMultiplyProfile
        ./timemultiply $n min >> mtxMultiplyProfile
      done
  2. Use a spreadsheet (or another tool of your choice, such as R, gnuplot, or MATLAB) to create the following plots:
    1. Matrix size N versus average total time for both the min and max methods. Show the two plots on the same axes with clear axis labels, a title, and a legend.
    2. Matrix size N versus the ratio of the paired average times, T_max(N) / T_min(N), for the same N. Title your plot and label your axes.
    Note:
    Because the size N increases exponentially, you may find your graphs more informative if you use a log scale on the x-axis.
  3. Analyze your results by answering the following questions:
    1. Give relevant details about the program and machine you used to perform your experiments, so that your reader can understand the context for them (cf. Preliminaries above).
    2. Using B.2a, does the expected O(N^3) trend for matrix multiplication seem to hold for the CPU times you found?
      (Hint: Measure the ratio T(N) / T(N/2) for varying N.)
      Compare and contrast the trends between the two methods, explaining the results.
    3. Using B.2b, characterize and explain the difference between the two times you plotted.
    4. The preliminaries suggest an important performance impact of taking jumps in an array that straddle page boundaries. How big does N need to be for entries in the same column but different rows to be in different pages? Put another way, if an unsigned long is 8 bytes on your machine, what value of N makes an entire row of the matrix exactly one page? Show your work.

What to turn in

Extra credit

Opportunity 1

How do you expect the results to change with other matrix data types? In particular, what happens as you keep the matrix size constant but the data type gets smaller or bigger (e.g., short, int, long long)? Explore how the performance results relate to one another and explain them.

Opportunity 2

Two earlier versions of this lab (one, two) included a second problem that explores graph connectivity via the Small World Phenomenon to find the six degrees of Kevin Bacon using IMDB data. Complete the original (2008) lab's "Part B".
Copyright © 2012, 2014 Jerod Weinman.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Acknowledgment

This lab was inspired by a 2008 conversation with Emery Berger. Janet Davis contributed improvements to the text.