Lab: Threads and Multicore Locality
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- Test how the principle of locality applies to parallel
programs.
- Assigned:
- Tuesday 9 October
- Due:
- 10:30 PM Monday 15 October
- Objectives:
-
- Introduce data-based problem decomposition
- Become familiar with launching threads in the POSIX pthreads
environment
- Explore locality effects in multi-threaded programs on multicore processors
- Resource:
- Blaise Barney's POSIX
Threads Programming, sections 1-5.
- Collaboration:
- You will complete this lab in teams
as assigned by the instructor.
Preliminaries
- You will build on your mtxMultiplyMin and mtxMultiplyMax
implementations from the previous lab. If your previous partner has
the files you wish to derive your solutions from, make arrangements
to copy them before the lab session. Make sure you attribute the
source of any code from which your solutions to this lab are derived.
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- Determine the number of processor cores in your machine (as well as
any other details you find interesting):
-
$ cat /proc/cpuinfo
Exercises
Part A: Threading Multiplication
- In matrixop.h, create a struct, called mtxThreadParam_t
that will contain the parameters needed for a threaded matrix product.
It should include pointers to the two multiplicands, the product,
as well as the particular thread number and the total number of threads
over which the product is distributed.
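For reference, a minimal sketch of such a struct follows; the double **
matrix type and the explicit size field are assumptions, so adapt them to
whatever representation your mtxMultiplyMin and mtxMultiplyMax use.
-
/* Sketch only: assumes square matrices stored as double **; change the
 * pointer types (and drop or keep the size field) to match your own
 * matrix representation from the previous lab. */
typedef struct {
  double **a;          /* first multiplicand */
  double **b;          /* second multiplicand */
  double **c;          /* product */
  int      n;          /* matrix dimension (N x N); an assumption */
  int      thread;     /* this thread's index, 0 .. numThreads-1 */
  int      numThreads; /* total number of threads sharing the work */
} mtxThreadParam_t;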
- In matrixop.c, add the function
-
void* threadMtxMultiplyMin(void *multParam)
which performs the (faster) multiplication work for a single thread;
it is the analog of mtxMultiplyMin. It will take
a pointer to an mtxThreadParam_t as its argument and need not return
anything meaningful (returning NULL suffices).
While there are several ways to divvy up the work, the easiest approach
is to start the outermost loop (whether for a row or column) at the
particular thread index and set the loop stride to the number of threads.
For example, if there are a total of two threads, each will do the
work for every other column (row, respectively); if three threads,
each will do every third column (row, respectively). The inner loops
would use a unit stride, as in the unthreaded versions.
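For illustration, a rough sketch of this decomposition follows; it assumes
the struct sketched above, a zero-initialized product matrix, and an i-k-j
loop order, so adjust the loop nest to whatever order your own
mtxMultiplyMin actually uses.
-
/* Sketch only: assumes the mtxThreadParam_t sketched earlier, a
 * zero-initialized product, and a cache-friendly i-k-j loop order;
 * substitute your own loop order and matrix types. */
void* threadMtxMultiplyMin(void *multParam)
{
  mtxThreadParam_t *p = (mtxThreadParam_t*) multParam;

  /* The outermost loop starts at this thread's index and strides by the
   * total number of threads; the inner loops keep unit stride. */
  for (int i = p->thread; i < p->n; i += p->numThreads)
    for (int k = 0; k < p->n; k++)
      for (int j = 0; j < p->n; j++)
        p->c[i][j] += p->a[i][k] * p->b[k][j];

  return NULL;
}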
- Add the corresponding function threadMtxMultiplyMax,
which performs the slower multiplication work; it is the analog of
mtxMultiplyMax.
- Add a function parMtxMultiply that computes the
threaded, concurrent (and hopefully parallel) matrix product. In addition
to the matrix pointer arguments, the function should take the number
of threads to use and which method (max or min). Your function should
- allocate a parameter struct for each thread,
- initialize each thread and explicitly make it joinable,
- print an appropriate message to stderr and exit with an appropriate
status if any thread cannot be created, and
- wait for all threads to complete before returning.
For good additional examples, see Example
2 - Thread Argument Passing and Example
Code - Pthread Joining in the POSIX Threads Programming
tutorial.
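As a point of comparison for your own design, here is a hedged sketch of
the launch-and-join structure; the parameter list (separate double **
matrices, an explicit size, and an integer method flag) is an assumption,
so use whatever interface fits your matrixop library.
-
/* Sketch only: the parameter list (double ** matrices, an explicit size,
 * and an integer "useMin" method flag) is an assumption. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void parMtxMultiply(double **a, double **b, double **c, int n,
                    int numThreads, int useMin)
{
  pthread_t        *threads = malloc(numThreads * sizeof(pthread_t));
  mtxThreadParam_t *params  = malloc(numThreads * sizeof(mtxThreadParam_t));
  pthread_attr_t    attr;

  /* Explicitly request joinable threads */
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

  for (int t = 0; t < numThreads; t++)
  {
    /* One parameter struct per thread */
    params[t] = (mtxThreadParam_t){ a, b, c, n, t, numThreads };

    int rc = pthread_create(&threads[t], &attr,
                            useMin ? threadMtxMultiplyMin
                                   : threadMtxMultiplyMax,
                            &params[t]);
    if (rc)  /* report and bail out if any thread cannot be created */
    {
      fprintf(stderr, "ERROR: pthread_create returned %d\n", rc);
      exit(EXIT_FAILURE);
    }
  }

  /* Wait for every worker to finish before returning */
  for (int t = 0; t < numThreads; t++)
    pthread_join(threads[t], NULL);

  pthread_attr_destroy(&attr);
  free(params);
  free(threads);
}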
- Write a program ptestmultiply.c that takes N (matrix
size), method (min or max),
and a positive number of threads as arguments. Your program should
create two N×N matrices filled with random values, and multiply
them using the given method with the specified number of threads.
The final program should not print the result, but you may want to do
so in an intermediate version for testing.
Add this program to your Makefile.
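One possible outline of this program appears below; mtxAllocate and
mtxFillRandom are hypothetical helper names standing in for whatever
allocation and random-fill routines your matrixop library actually
provides.
-
/* Sketch only: mtxAllocate and mtxFillRandom are hypothetical helpers;
 * substitute your own matrix creation and initialization routines. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "matrixop.h"

int main(int argc, char *argv[])
{
  if (argc != 4)
  {
    fprintf(stderr, "Usage: %s N min|max numThreads\n", argv[0]);
    return EXIT_FAILURE;
  }

  int n          = atoi(argv[1]);
  int useMin     = (strcmp(argv[2], "min") == 0);
  int numThreads = atoi(argv[3]);

  if (n <= 0 || numThreads <= 0)
  {
    fprintf(stderr, "N and numThreads must be positive\n");
    return EXIT_FAILURE;
  }

  /* Two random N x N multiplicands and a product (assumed zeroed
   * by the hypothetical mtxAllocate) */
  double **a = mtxAllocate(n), **b = mtxAllocate(n), **c = mtxAllocate(n);
  mtxFillRandom(a, n);
  mtxFillRandom(b, n);

  parMtxMultiply(a, b, c, n, numThreads, useMin);

  return EXIT_SUCCESS;
}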
Part B: Performance Profiling
- Finally, modify your program to print the elapsed wall-clock time
(in decimal seconds) for only the call to parMtxMultiply
(that is, exclude any initialization from the timing). You can do
this easily using a combination of gettimeofday(2)
and timersub(3).
For example:
-
$ ./ptestmultiply 128 min 2
0.008150
$ ./ptestmultiply 2048 max 4
58.491341
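For reference, the timed call inside your main might look like the
following fragment; it assumes the hypothetical variable names used in
the ptestmultiply sketch in Part A and requires <sys/time.h>.
-
/* Fragment only: time just the parMtxMultiply call, excluding all
 * matrix allocation and initialization. */
struct timeval start, end, elapsed;

gettimeofday(&start, NULL);
parMtxMultiply(a, b, c, n, numThreads, useMin);
gettimeofday(&end, NULL);

/* elapsed = end - start, then print as decimal seconds */
timersub(&end, &start, &elapsed);
printf("%ld.%06ld\n", (long) elapsed.tv_sec, (long) elapsed.tv_usec);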
- Using both min and max options, run your program
for values of N = [128, 192, 256, 384, 512, 768, 1024, 1792]
with 1-4 threads. Collect the total times and order them nicely in
a text file or spreadsheet.
If you implemented your functions efficiently with pointers, you should
be able to use N=2048 as well.
-
Use a spreadsheet (or another tool of your choice,
such as R, gnuplot, or MATLAB) to create the following
plots:
-
Matrix size N versus total time for both
the min and max methods and varying numbers of threads.
You may collect the various plots into axes in the way you deem most
sensible for analysis.
-
Matrix size N versus the speedup1 S_P = T_1 / T_P,
where T_1 is the total time using one thread, T_P is the total time
using P threads, and P is the number of threads. Show
the plots for P > 1 on the same axes. (A short worked example of speedup
and efficiency follows this list of plots.)
- Matrix size N versus the efficiency E_P = S_P / P.
Show the plots for P > 1 on the same axes.
- Number of threads (P > 1) versus speedup for both the min
and max methods using only the largest N you ran.
- Number of threads (P > 1) versus efficiency for both the min
and max methods using only the largest N you ran.
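To make the two measures concrete, suppose (purely hypothetically) that
the single-threaded min run at N = 1024 took T_1 = 8.0 seconds and the
four-thread run took T_4 = 2.5 seconds. The speedup would be
S_4 = T_1 / T_4 = 8.0 / 2.5 = 3.2, and the efficiency would be
E_4 = S_4 / 4 = 0.8; that is, each thread delivers 80% of the ideal
per-thread performance.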
-
Analyze your results by answering the following
questions:
- How do the times vary (Plot 3a) between the min
and max methods and the number of threads? Explain any similarities
and differences you see.
- Describe and explain what trends (if any) you see in the speedup as
N increases.
- Describe and explain what trends (if any) you see in the efficiency
as N increases.
- Describe and explain what trends (if any) you see in the speedup as
P increases.
- Describe and explain what trends (if any) you see in the efficiency
as P increases.
Be sure to include any relevant details about the machine on which
you performed your experiments, so that your reader can understand
the context for them.
What to turn in
- Your source code in matrixop.h, matrixop.c, and
ptestmultiply.c
- A single PDF containing (merged)
- Your enscripted source code files
- A transcript of your program's compilation and test run(s)
- A unified writeup with plots (B.3) and analysis (B.4)
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
Footnotes:
1 See Foster, Efficiency
and Speedup, or Wikipedia, Speedup.