Lab: Threads

CSC 213 - Operating Systems and Parallel Algorithms - Weinman



Summary:
Parallelize a multi-threaded matrix multiplication algorithm.
Assigned:
Tuesday 7 October
Due:
11:30 PM Tuesday 14 October
Objectives:
 
Resource:
Blaise Barney's POSIX Threads Programming, sections 1-5.
Collaboration:
You will complete this lab in teams as assigned by the instructor.

Preliminaries

Exercises

Part A: Threading Multiplication

  1. In matrixop.h, define a struct called matrixThreadParam_t that will contain the parameters needed for a threaded matrix product. The struct should include pointers to the two factors and the product, as well as a particular thread number and the total number of threads over which the product is distributed:
    struct matrixThreadParam_t {
      const struct matrix_t *a;
      const struct matrix_t *b;
      struct matrix_t *c;
      int numThreads;
      int threadId;
    };
    Now that you have copied and pasted this code into your file, be sure to cite and attribute its origin properly.
  2. In matrixop.c, add the function
    void* threadMtxMultiplyMin(void *multParam)
    which performs the multiplication work for a single thread; it is the analog of mtxMultiplyMin. The function will take struct matrixThreadParam_t* as an argument and return nothing (NULL).
    While there are several ways to divvy up the work, the easiest approach is to start the outermost loop (whether for a row or column) at the particular thread index and set the loop stride to the number of threads. For example, if there are a total of two threads, each will do the work for every other column (row, respectively); if three threads, each will do every third column (row, respectively). The inner loops would use a unit stride, as in the unthreaded versions.
  3. Add a function parMtxMultiply that computes the threaded, concurrent (and hopefully parallel) matrix product. In addition to the matrix pointer arguments, the function should take the number of threads to use:
    int parMtxMultiply( const struct matrix_t *a, 
                        const struct matrix_t *b,
                        struct matrix_t *c,
                        int numThreads)
    Your function should create the threads, passing each one its own parameter struct, and then join them all before returning.
    For good additional examples of how to do these things, see Example 2 - Thread Argument Passing and Example Code - Pthread Joining in the POSIX Threads Programming tutorial.

Evaluation

As before, more credit will be given for solutions that are careful to distribute all stages of the computation and that efficiently use pointers for all matrix data references (both reading and storing).

Part B: Performance Profiling

The program ptimemultiply.c takes N (matrix size) and T (number of threads) as arguments. It then takes the multi-threaded product of two N×N matrices three times, using your parMtxMultiply function. To simplify matters, the measurement excludes the set-up (memory allocation and matrix initialization); the program measures and prints only the wall clock time of the products, using gettimeofday(2) and timersub(3). Because this puts your measurements somewhat at the mercy of the scheduler, performing multiple test runs is even more important; the times for the three runs are printed tab-separated on a single line:
./ptimemultiply 512 4
  0.098640   0.098599   0.100816
./ptimemultiply 256 2
  0.032886   0.029050   0.028003
Of course, ptimemultiply.c requires you to have implemented parMtxMultiply. However, a precompiled version is already provided for you should you want to benchmark the (non)efficiency of your pointer arithmetic or use a trusted version for your experiments.
You can build your own from your modified matrixop.c with make:
make ptimemultiply
In any case, be sure your report (below) documents the provenance of the program used to generate your data.
  1. Run the ptimemultiply program for values of N = [128, 192, 256, 384, 512, 768, 1024, 1536] with 1, 2, 3, and 4 threads. Collect the times and order them nicely in a text file or spreadsheet.
    Hint: Use bash for loops for easy data generation.
    If you implemented your function efficiently with pointers, you should be able (and are encouraged) to use N=2048 as well.
  2. Use a spreadsheet (or another tool of your choice, such as R, gnuplot, or MATLAB) to create the following plots:
    1. Matrix size N versus average total time with varying numbers of threads. You may collect the plots into axes in the way you deem most sensible for analysis. Be sure to label your axes (giving units), title your plot, and (if appropriate) provide a legend.
    2. Matrix size N versus the speedup1
      S = T_1(N) / T_P(N)
      where T is the total time and P > 1 is the number of threads. Show the plots on the same axes. Title your plot, label your axes, and provide a legend.
    3. Matrix size N versus the efficiency
      E = T_1(N) / (P × T_P(N)).
      Show the plots for P > 1 on the same axes. Title your plot, label your axes, and provide a legend.
    4. Number of threads (P > 1) versus speedup using only the largest N you ran.
    5. Number of threads (P > 1) versus efficiency using only the largest N you ran.
  3. Analyze your results by answering the following questions:
    1. Describe and explain how the times vary (Plot 2a) with the number of threads.
    2. Describe and explain the trend in the speedup as N increases as well as any deviations from the trend when N is smaller.
    3. Describe and explain the trend in the efficiency as N increases as well as any deviations from the trend when N is smaller.
    4. Describe and explain the trend in the speedup as P increases.
    5. Describe and explain the trend in the efficiency as P increases.
    Be sure to include any relevant details about the machine on which you performed your experiments, so that your reader can understand the context for them.

What to turn in

Copyright © 2012, 2014 Jerod Weinman.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Footnotes:

1See Foster, Efficiency and Speedup, or Wikipedia, Speedup