Lab: Locality in Matrix Multiplication

CSC 213 - Operating Systems and Parallel Algorithms - Weinman



Summary: Discover just how useful the principle of locality is by experimenting with the effects of (non-)caching.
Assigned: Tuesday 30 September
Due: 11:30 PM Monday 6 October
Objectives:
Reading: A primer on matrix arithmetic.

Preliminaries

Find a machine

Theory-based predictions

Consider the following loop.
int i;
long array[ARRAY_LEN];
int stride=??;
for (i=0 ; i<ARRAY_LEN ; i+=stride)
  array[i]++;
Alternatively, we might write the same loop using pointers.
int i;
long array[ARRAY_LEN];
int stride=??;
long *p_array_val = array;
for (i=0 ; i<ARRAY_LEN ; i+=stride, p_array_val+=stride)
  (*p_array_val)++;
Your answers to these questions are the motivation for this lab; you should confirm both with another group.

Starter code

Copy the starter files to your own directory:
cp -R ~weinman/public_html/courses/CSC213/2014F/labs/code/locality ~/somewhere
Browse along with the starter code in your editor as you read the commentary below.

matrix.h

This header file defines the type matrix_t. Note that the fields are simply an array *data (in fact, a pointer to the matrix data in memory) and the dimensions of the matrix (rows and cols).
We will assume that the matrix is stored linearly in memory in row-major order. This means that rows of the matrix are stored contiguously. Thus, as we walk along the linear indices of the array, the column index changes the fastest, while the row index changes slowest. For instance, a 4×3 matrix has the following linear indices:
 0  1  2
 3  4  5
 6  7  8
 9 10 11
which means it is actually stored in memory as
 0  1  2  3  4  5  6  7  8  9 10 11
The following macro converts row and column matrix subscript indices to a single linear index of the array:
#define SUB2IND(row,col,numcols) (row) * (numcols) + (col)
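As a quick sanity check of this convention (not part of the starter code), the following standalone program repeats the macro locally and prints the linear index of each entry of a 4×3 matrix; its output matches the row-major layout shown above.

#include <stdio.h>

/* Same conversion as the SUB2IND macro in matrix.h */
#define SUB2IND(row,col,numcols) (row) * (numcols) + (col)

int main(void)
{
  int row, col;
  int rows = 4, cols = 3;

  for (row = 0; row < rows; row++) {
    for (col = 0; col < cols; col++)
      printf("%3d", SUB2IND(row, col, cols));  /* linear index of (row,col) */
    printf("\n");
  }
  return 0;
}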
We could change the type of the data stored in the matrix at compile time by changing the value defined by MTX_TYPE. To use plenty of memory for our experiments, we'll use the unsigned long type given here.

matrixop.c

This file implements one function, mtxAdd, that takes pointers to two matrices (the parameters are type matrix_t*) and stores their computed sum in a third matrix. There is also a helper function, mtxCheckAddDim, that verifies the two matrices have dimensions that are compatible for addition.
To compute this quickly, without doing any array indexing arithmetic, pointers to the matrix data are used. Rather than addition, you will write functions that compute and store matrix products. To maximize efficiency, you will also be asked to use pointers to the matrix data.
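For illustration only, an element-wise operation written in this pointer style might look like the sketch below. It is not the starter code's mtxAdd, and it assumes the matrix_t fields *data (of element type MTX_TYPE), rows, and cols described above.

#include "matrix.h"   /* struct matrix_t, MTX_TYPE */

/* Sketch of a pointer-based element-wise sum, c = a + b, assuming the
 * dimensions have already been verified (cf. mtxCheckAddDim). */
void addSketch(const struct matrix_t *a, const struct matrix_t *b,
               struct matrix_t *c)
{
  const MTX_TYPE *pa  = a->data;                      /* walks a linearly */
  const MTX_TYPE *pb  = b->data;                      /* walks b linearly */
  MTX_TYPE       *pc  = c->data;                      /* walks c linearly */
  const MTX_TYPE *end = a->data + a->rows * a->cols;  /* one past a's end */

  while (pa != end)            /* no SUB2IND arithmetic inside the loop */
    *pc++ = *pa++ + *pb++;
}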

matrixutil.h

You don't need to understand the implementation of these routines, but the functions mtxAlloc and mtxRand will be useful. You can read the comments in the header for descriptions.

testadd.c

A simple test program for the given matrix addition function mtxAdd.

Exercises

Part A: Multiplication and Locality

  1. Read the Makefile, then make and run the testadd program to verify its correctness.
  2. Pseudocode for the matrix addition algorithm in testadd.c looks like the following:
    for i=0 to M-1
        for j=0 to N-1
            c[i,j] = a[i,j] + b[i,j]
    Note that switching the order of the loops does not change the correctness of the algorithm.
    Explain what would happen at the hardware level (i.e., with respect to the TLB and memory cache) if the order of the two loops were reversed. (Hint: Consider how the data is organized in memory and how close in memory the algorithmically sequential reads and writes are.)
  3. Write similar pseudocode to calculate a matrix product (you will need at least three nested for loops) so that the temporally sequential reads and writes are as contiguous as possible. Using your ordering of the loops, explain which memory accesses (reads from a and b, writes to c) are contiguous and why it is not possible for any other ordering to exhibit greater contiguity.
  4. Write another matrix product pseudocode so that the temporally sequential reads and writes are as disparate as possible. Explain which memory accesses are NOT contiguous and why it is not possible for any other ordering to exhibit lesser contiguity.
  5. In matrixop.c, implement and document the function
    int mtxCheckMultiplyDim( const struct matrix_t *a, 
                             const struct matrix_t *b,
                             const struct matrix_t *c)
    that verifies whether a and b are valid factors for matrix product c (i.e., a must be M×N, b must be N×P and c must be M×P for any positive values of M,N,P). Your function should return zero if the arguments are valid and -1 otherwise.
  6. In matrixop.c (and the header) implement and document two functions to calculate the product of two matrices:
    int mtxMultiplyMin( const struct matrix_t *a, 
                        const struct matrix_t *b,
                        struct matrix_t *c)
    (for minimizing cache misses, according to A.3) and a similar function mtxMultiplyMax (for maximizing misses, according to A.4), both of which should be based on your pseudocode responses above.
    The procedures should return zero upon success and a negative value if the dimensions of the factors and product do not match, as verified by mtxCheckMultiplyDim.

Evaluation

While implementing matrix multiplication with array indexing is no easier than using pointers, the efficiency effects you profile in Part B will be more pronounced if all matrix data references use pointers. Therefore, more credit will be given for solutions that use pointers for all matrix data references (both reading and storing), as mtxAdd does.
Tips
Keep the following things in mind as you plan your implementations:

Part B: Performance Profiling

The program timemultiply.c takes N (matrix size) and a method (min or max) as arguments. It then takes the product of two N×N matrices three times. To maximize the chance of getting a cold cache each time, new matrices are allocated for each multiplication. To isolate the measurement, it uses a GNU/Linux-specific extension of the system call getrusage(2) to query the resource usage caused by your matrix multiplication routine in a child thread.
Because we don't want to include the set-up (memory allocation and matrix initialization) in our timing, the program creates a separate thread that calls the appropriate multiplication function after all the initialization is done in the master thread. After performing the multiplication, the worker thread calls getrusage(2) to query its own resource usage. Using timeradd(3), it prints the sum of the user and system CPU times for each of the three runs, tab separated, on a single line.
./timemultiply 128 min
  0.044002   0.044002   0.044002
./timemultiply 512 max
  1.764110   1.756109   1.772110
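For reference, the per-thread measurement described above can be sketched as follows. This is only an illustration of the relevant calls, not the actual timemultiply.c source; it assumes a GNU/Linux system, where RUSAGE_THREAD is available when _GNU_SOURCE is defined.

#define _GNU_SOURCE            /* exposes RUSAGE_THREAD (Linux-specific) */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Print the calling thread's total CPU time (user + system) in seconds.
 * In timemultiply, something like this would run in the worker thread
 * right after the multiplication finishes. */
static void print_thread_cpu_time(void)
{
  struct rusage ru;
  struct timeval total;

  if (getrusage(RUSAGE_THREAD, &ru) != 0) {      /* per-thread usage only */
    perror("getrusage");
    return;
  }
  timeradd(&ru.ru_utime, &ru.ru_stime, &total);  /* user + system time */
  printf("  %ld.%06ld", (long) total.tv_sec, (long) total.tv_usec);
}

int main(void)
{
  print_thread_cpu_time();
  printf("\n");
  return 0;
}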
Of course, timemultiply.c requires you to have implemented mtxMultiplyMin and mtxMultiplyMax. However, a precompiled version is already provided for you should you want to benchmark the (non)efficiency of your pointer arithmetic or use a trusted version for your experiments.
You can build your own from your modified matrixop.c with make:
make timemultiply
In any case, be sure your report (below) documents the provenance of the program used to generate your data.
  1. Using both min and max options, run the timemultiply program for values of N = [128, 192, 256, 384, 512, 768, 1024, 1536]. Collect the times and order them nicely in a text file or spreadsheet.
    Note: N=1536 may take one to two minutes.
    Extra: If you implemented your functions efficiently with pointers, you should be able to use N=2048 as an additional data point. (My mtxMultiplyMax took about 12 minutes for one run.)
    Here is an example of a (bash) shell command that will generate the file mtxMultiplyProfile with lines giving the measurements in the format
    N run1 run2 run3
    If your shell is something other than bash, you may need to modify this slightly to match your shell's syntax. Adjust the following command to include the matrix sizes you must measure.
    for n in 2 4 8 16 32 64
      do
        echo -ne "$n\t" >> mtxMultiplyProfile
        ./timemultiply $n min >> mtxMultiplyProfile
      done
  2. Use a spreadsheet (or another tool of your choice, such as R, gnuplot, or MATLAB) to create the following plots:
    1. Matrix size N versus average total time for both the min and max methods. Show the two plots on the same axes with clear axis labels, a title, and a legend.
    2. Matrix size N versus the ratio of the paired average times, T_max(N) / T_min(N), for the same N. Title your plot and label your axes.
    Note:
    Because the size N increases exponentially, you may find your graphs more informative if you use a log scale on the x-axis.
  3. Analyze your results by answering the following questions:
    1. Give relevant details about the program and machine you used to perform your experiments, so that your reader can understand the context for them (cf. Preliminaries above).
    2. Using B.2a, does the expected O(N^3) trend for matrix multiplication seem to hold for the CPU times you found?
      (Hint: Measure the ratio T(N) / T(N/2) for varying N.)
      Compare and contrast the trends between the two methods, explaining the results.
    3. Using B.2b, characterize and explain the difference between the two times you plotted.
    4. The preliminaries suggest an important performance impact of taking jumps in an array that straddle page boundaries. How big does N need to be for entries in the same column but different rows to be in different pages? Put another way, if an unsigned long is 8 bytes on your machine, what value of N makes an entire row of the matrix exactly one page? Show your work.

What to turn in

Extra credit

Opportunity 1

How do you expect the results to change with other matrix data types? In particular, what happens as you keep the matrix size constant but the data type gets smaller or bigger (e.g., short, int, long long)? Explore how the performance results relate to one another and explain them.

Opportunity 2

Two earlier versions of this lab (one, two) included a second problem that explores graph connectivity via the Small World Phenomenon to find the six degrees of Kevin Bacon using IMDB data. Complete the original (2008) lab's "Part B".
Copyright © 2012, 2014 Jerod Weinman.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Acknowledgment

This lab was inspired by a 2008 conversation with Emery Berger. Janet Davis contributed improvements to the text.