Lab: Parallelized Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a synchronization strategy for a parallel
multi-threaded implementation of an algorithm for cleaning up noisy
images.
- Assigned:
- Tuesday 30 October
- Due:
- This assignment will be due in three parts
- Part A
- 10:30 PM Friday 2 November
- Part B
- 10:30 PM Friday 9 November
- Part C
- 10:30 PM Monday 12 November
- Objectives:
-
- Experience parallelizing sequential code, in a simple but realistic
application.
- Practice using Foster's PCAM framework.
- Develop a new semaphore pattern to refine synchronization skills.
- Improve analytical writing through revision; practice quantitative
writing.
- Reading:
- A Brief Introduction to Iterated
Conditional Modes (ICM), which provides background on the algorithm
we'll be using.
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- Copy the starter files to your own directory:
-
$ cp -R ~weinman/public_html/courses/CSC213/2012F/labs/code/icm ~/somewhere
- Read icm.c and be sure you understand how it works.
- Compile the icm program using the Makefile.
-
$ make icm
- Investigate the contents of the data directory using an image
viewer such as eog. The filename imageNN.pbm indicates
that NN% noise has been added to the original image.pbm.
- Run icm on some of these (2*5=10) noisy images and inspect
the output. You can also experiment with different values
of alpha and beta to see how they affect the results.
Exercises
Part A: Algorithm design
Now you will devise a parallel implementation of the ICM algorithm.
With sufficient processing power, each iteration could take constant
time, rather than time proportional to the size of the image.
Assume that your algorithm will run as threads on a shared memory
multiprocessor (e.g., a MathLAN workstation). There are several ways
to partition the problem for a parallel implementation (the example
in pinvert.c, which is likely similar to your prior parMtxMultiply,
is one such partitioning), but your implementation must meet the
following requirements:
- You may create worker threads only once, at the outset of the
algorithm.
- The previous requirement will necessitate synchronization between
the threads. Mutex variables and semaphores are probably the most
straightforward way to accomplish this, though you may propose
alternative mechanisms to the instructor, who may or may not approve them.
- The complexity of your synchronization strategy should not depend
on the number of threads. In other words, the number of semaphores
required should be constant.
Write a design paper that addresses the following:
- Explain what type of partitioning strategy is most appropriate:
domain or functional decomposition.
- Explain what communication is required, bearing in mind the
partitioning and the execution environment.
- Give a strategy for agglomeration and mapping. What
issues should be considered?
- Write a complete description of a parallel version of the algorithm
in pseudocode, with appropriate English explanations. Clearly describe
what each thread does and any shared data or synchronization constructs.
Note: You do not need to explain the ICM algorithm; just explain
your parallelization.
- Explain and justify (i.e., prove) why your algorithm will exhibit
proper synchronization (threads wait for data they depend on) and
will not deadlock. You should explain what variables are used and
what the purpose of each is, as well as the interleaving properties
of the threads as they relate to the synchronizations. (Consult Downey,
your prior lab, and any feedback on it for examples.)
Hint: One nice solution to this problem extends the
barrier pattern so that it is reusable.
In other words, the barrier can be used within a loop. The key is
to create, in addition to a turnstile that ensures all threads rendezvous
at the beginning of a section of code, a second turnstile that synchronizes
them at the end of the looped code. This prevents one thread
from lapping another by getting a full iteration ahead.
While the first ("in") turnstile is unlocked only once all threads
have arrived (by incrementing and checking a variable), the second
("out") turnstile is unlocked only when all threads have completed
(by decrementing and checking the same variable).
Part B: Implementation
Implement the multi-threaded version (using POSIX threads) of your
parallel algorithm by copying icm.c to picm.c and
modifying the existing functions and introducing any necessary new
functions. Like icm, your program picm should accept
the filenames of an input and an output image. It should also take
as a command-line parameter the number of threads to create.
Preserve the wall clock timer to measure time for the algorithm. (You'll
find that I/O dominates the overall time.)
The parallel and single-threaded versions should produce the same results
on all images. To verify this, the diff(1) command
will tell you whether two files are identical. If you want to create
an image that shows you the difference between two images (rather
than just whether there is a difference), you can use this handy program:
-
$ compare image1.pbm image2.pbm differences.png
which overlays the differing pixels in red in the output image
differences.png on top of a matted version of image1.pbm.
For a simple binary output, use differences.pbm, which
will be black where pixels differ and white where they're the same.
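As a concrete check with diff, using hypothetical output filenames
out-serial.pbm and out-parallel.pbm (a tiny PBM is fabricated here just
so the commands run):

```shell
# Hypothetical stand-ins for the serial and parallel program outputs
printf 'P1\n2 2\n0 1\n1 0\n' > out-serial.pbm
cp out-serial.pbm out-parallel.pbm
# diff prints nothing and exits 0 when the two files are byte-identical
diff out-serial.pbm out-parallel.pbm && echo "outputs match"
```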
Part C: Performance analysis and evaluation
Revise your design from Part A to reflect any changes you made
during implementation and to address any feedback you received.
Integrate your revised design within a paper that also analyzes the
performance of your algorithm, including the following elements:
- Setup
- Describe the experimental setup, including the number
of cores on your machine, image size, algorithm parameters, etc.
- Experiment
- Do the experiment, reporting the number of ICM update
iterations required for convergence. Use the bash time
command to measure the times used by both the single- and multi-threaded
programs (with 1, 2, 3, 4, 8, 16, and 512 threads) for one of the
noisy palisade images. For each run, record the user and system times
from time and the wall clock time from main.
- Report
- Use a spreadsheet (or another tool of your choice) to
create plots of the number of threads versus the three times, as well
as the speedup and efficiency (using only the wall clock time).
- Summarize
- Give a brief, objective overall picture of the performance:
- How does the elapsed (wall clock) time compare between the single-threaded
and multi-threaded programs, when the number of threads is less than
or equal to the number of cores? When it exceeds the number of cores?
- How does the user time compare between the single-threaded and multi-threaded
programs, when the number of threads is less than or equal to the
number of cores? When it exceeds the number of cores?
- What happens to all three times as the number of threads grows much
larger than the number of cores?
- What is the relative change of the three time categories? Does one
category change proportionally more than the others?
- Infer
- Draw conclusions. Write a paragraph explaining the results.
Why do you see the patterns in your results?
- Recommend
- Make recommendations. What lessons have you learned
about parallel programming on a multiprocessor?
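For the speedup and efficiency plots requested in the Report step, the
standard definitions apply, where \(T_1\) is the single-threaded
wall-clock time and \(T_p\) the wall-clock time with \(p\) threads:

```latex
S(p) = \frac{T_1}{T_p} \qquad \text{(speedup)}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p} \qquad \text{(efficiency)}
```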
What to turn in
- Part A
- A single PDF of your algorithm design.
- Part B
-
- Your source code in picm.c
- A single PDF containing (merged)
- Your enscripted source code
- A transcript of your program's compilation and test run(s)
- Part C
- A single PDF of your (revised) design from Part A and
your analysis.
Acknowledgments
Janet Davis contributed notable improvements to the text. The public
domain Palisade
image is by the Pearson Scott Foresman company. The Dilbert
image is Copyright 2008 Scott Adams, Inc.
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.