Lab: Locality in Matrix Multiplication

CSC 213 - Operating Systems and Parallel Algorithms - Weinman



Summary:
Discover just how powerful the principle of locality is by experimenting with the performance effects of cache-friendly and cache-hostile memory access patterns
Assigned:
Tuesday 2 October
Due:
10:30 PM Monday 8 October
Objectives:
Reading:
A primer on matrix arithmetic.

Preliminaries

cp -R ~weinman/public_html/courses/CSC213/2012F/labs/code/locality ~/somewhere

matrix.h

This header file defines the type matrix_t. Note the fields are simply an array (actually, a pointer to a memory address) *data, and the dimensions of the matrix (rows and cols).
We will assume that the matrix is stored linearly in memory in row-major order. This means that rows of the matrix are stored contiguously. Thus, as we walk along the linear indices of the array, the column index changes the fastest, while the row index changes slowest. For instance, a 4×3 matrix has the following linear indices:
     0   1   2
     3   4   5
     6   7   8
     9  10  11
which means it is actually stored in memory as
    0  1  2  3  4  5  6  7  8  9  10  11
The following macro converts row and column matrix subscript indices to a single linear index of the array:
#define SUB2IND(row,col,numcols) ((row) * (numcols) + (col))
We could change the type of the data stored in the matrix at compile time by changing the value defined by MTX_TYPE. To use plenty of memory for our experiments, we'll use the unsigned long type given here.

matrixop.c

This file implements one function, mtxAdd, that takes pointers to two matrices (the parameters are type matrix_t*) and stores their computed sum in a third matrix. There is also a helper function, mtxCheckAddDim, that verifies the two matrices have dimensions that are compatible for addition.
To compute this quickly, without doing any array indexing arithmetic, pointers to the matrix data are used. Rather than addition, you will write functions that compute and store matrix products. To maximize efficiency, you will also be asked to use pointers to the matrix data.

matrixutil.h

You don't need to understand the implementation of these routines, but the functions mtxAlloc and mtxRand will be useful. You can read the comments in the header for descriptions.

testadd.c

A simple test program for the given matrix addition function mtxAdd.

Exercises

Part A: Multiplication and Locality

  1. Read the Makefile, then make and run the testadd program to verify its correctness.
  2. Pseudocode for the matrix addition algorithm in testadd.c looks like the following:
    for i=0 to M-1
        for j=0 to N-1
            c[i,j] = a[i,j] + b[i,j]
    Note that switching the order of the loops does not change the correctness of the algorithm. Explain what would happen at the hardware level (i.e., with respect to the TLB and memory cache) if the order of the two loops were reversed. (Hint: Consider how the data is organized in memory and how close in memory the algorithmically sequential reads and writes are.)
  3. Write similar pseudocode to calculate a matrix product so that the sequential reads and writes are as contiguous as possible. Explain which memory accesses (reads or writes) are contiguous and why it is not possible for any other ordering to exhibit greater contiguity.
  4. Write another matrix product pseudocode so that the sequential reads and writes are as disparate as possible. Explain which memory accesses are NOT contiguous and why it is not possible for any other ordering to exhibit lesser contiguity.
  5. Implement two functions in matrixop.c (and the header): mtxMultiplyMin (for minimizing misses) and mtxMultiplyMax (for maximizing misses), both of which should calculate the product of two matrices based on the responses to your pseudocode above. Like mtxAdd, both functions should use pointers for all matrix data references (both reading and storing).
    Tips
    Keep the following things in mind as you plan your implementations:
    • Unless you and your partner are very sure of your pointer programming (and debugging!) skills, you may want to write at least one of these functions using array indices first, to make sure you've got the right idea. (You can use the macro above to convert subscripts to linear indices.) Once you've done that, it should be somewhat easier to write a version using pointers.
    • You do not need to initialize the pointers in the header of the for loop; you can initialize some using your counter variables inside loop bodies.
    • Write a small program to test your implementation, and use printf liberally to verify you are accessing the values you mean to.

Part B: Performance Profiling

  1. Write a program testmultiply.c that takes N (matrix size) and method (min or max) as arguments. Your program should create two N×N matrices filled with random values (see the function mtxRand in matrixutil.h) and multiply them using the given method. Do not print the result.
    Add this program to the Makefile.
  2. Next, modify your program to perform the multiplication in a child process and use the system call getrusage(2) to query the resource usage caused by your matrix multiplication routine.
    Because we don't want to include the set-up (memory allocation and matrix initialization) in our time, use fork(2) to create a child process that calls the appropriate multiplication function (be sure any initialization is done in the parent process before the fork). After the child performs the multiplication and terminates, the parent process should use getrusage to query the resource usage for the child. See the manual for timeradd(3) to get details on the timeval struct used in the rusage struct.
    Your program should report the child's user CPU time, system CPU time, and their sum (the total CPU time),
    tab separated on a single line. For example:
    ./testmultiply 128 min
    0.012 0 0.012
    ./testmultiply 512 max
    1.76011 0.004 1.76411
    Note that the times are reported as integer second and microsecond values. Use printf to print these as decimal seconds. Be sure to print a sufficient number of digits.
    Tip: One nice way to print struct timeval data is
    printf("%3ld.%06ld", (long) tv.tv_sec, (long) tv.tv_usec);
  3. Using both min and max options, run your program for values of N = [128, 192, 256, 384, 512, 768, 1024, 1792]. Use each value of N three times. Collect the total times and order them nicely in a text file or spreadsheet.
    Note: N=1792 may take one to two minutes.
    Extra: If you implemented your functions efficiently with pointers, you should be able to use N=2048 as an additional data point. (My mtxMultiplyMax took about 4 minutes for one run.)
    If your program produces output on a single line, here is an example of a (bash) shell command that will generate the file mtxMultiplyProfile with lines giving the measurements in the format
    N your-output
    If your shell is something other than bash, you may need to modify this slightly to meet the syntax of your shell. You can augment the following command to include the matrix sizes you must measure times for.
    for n in 2 4 8 16 32 64
    do
      for ((t=0 ; t<3 ; t++))
      do
        echo -n $n  >> mtxMultiplyProfile
        ./testmultiply $n min >> mtxMultiplyProfile
      done
    done
    The leading spaces are not required; they are only helpful for visualization.
  4. Use a spreadsheet (or another tool of your choice, such as R, gnuplot, or MATLAB) to create the following plots:
    1. Matrix size N versus average total time for both the min and max methods. Show the two plots on the same axes with clear axis labels and a title.
    2. Matrix size N versus the ratio of the paired average times, Tmax(N) / Tmin(N), for the same N. Title your plot and label your axes.
  5. Analyze your results by answering the following questions:
    1. Does the expected O(N³) trend for matrix multiplication seem to hold for the CPU times you found? Explain your answer.
    2. Characterize and explain the difference (if any) between the two times you plotted.
    Be sure to include any relevant details about the machine on which you performed your experiments, so that your reader can understand the context for them.

What to turn in

Extra credit

Opportunity 1

How do you expect the results to change with other matrix types? In particular, what happens as you keep the matrix size constant but the data type gets smaller or bigger (i.e., short, int, long long)? Explore the performance relationships to one another and explain your results.

Opportunity 2

Two earlier versions of this lab (one, two) included a second problem that explores graph connectivity via the Small World Phenomenon to find the six degrees of Kevin Bacon using IMDB data. Complete the original lab's "Part B".
Copyright © 2012 Jerod Weinman.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Acknowledgment

This lab was inspired by a conversation with Emery Berger. Janet Davis contributed improvements to the text.