Lab: Parallelized Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a synchronization strategy for a parallel
multi-threaded implementation of an algorithm for cleaning up noisy
images.
- Assigned:
- Tuesday 30 October
- Due:
- This assignment will be due in three parts
- Part A
- 10:30 PM Friday 2 November
- Part B
- 10:30 PM Friday 9 November
- Part C
- 10:30 PM Monday 12 November
- Objectives:
-
- Experience parallelizing sequential code, in a simple but realistic
application.
- Practice using Foster's PCAM framework.
- Develop a new semaphore pattern to refine synchronization skills.
- Improve analytical writing through revision; practice quantitative
writing.
- Reading:
- A Brief Introduction to Iterated
Conditional Modes (ICM), which provides background on the algorithm
we'll be using.
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- Copy the starter files to your own directory:
-
$ cp -R ~weinman/public_html/courses/CSC213/2012F/labs/code/icm ~/somewhere
- Read icm.c and be sure you understand how it works.
- Compile the icm program using the Makefile.
-
$ make icm
- Investigate the contents of the data directory using an image
viewer such as eog. The filename imageNN.pbm indicates
that NN% noise has been added to the original image.pbm.
- Run icm on some of these (2*5=10) noisy images and inspect
the output. You can also experiment with different values
of alpha and beta to see how they affect the results.
Exercises
Part A: Algorithm design
Now you will devise a parallel implementation of the ICM algorithm.
With sufficient processing power, each iteration could take constant
time, rather than time proportional to the size of the image.
Assume that your algorithm will run as threads on a shared memory
multiprocessor (e.g., a MathLAN workstation). There are several ways
to partition the problem for a parallel implementation (the example
in pinvert.c, which is likely similar to your prior parMtxMultiply,
is one such partitioning), but your implementation must meet the
following requirements:
- You may create worker threads only once, at the outset of the
algorithm.
- The previous requirement will necessitate synchronization between
the threads. Mutex variables and semaphores are probably the most
straightforward way to accomplish this, though you may propose
alternative mechanisms to the instructor, who may or may not approve them.
- The complexity of your synchronization strategy should not depend
on the number of threads. In other words, the number of semaphores
required should be constant.
Write a design paper that addresses the following:
- Explain what type of partitioning strategy is most appropriate:
domain or functional decomposition.
- Explain what communication is required, bearing in mind the
partitioning and the execution environment.
- Give a strategy for agglomeration and mapping. What
issues should be considered?
- Write a complete description of a parallel version of the algorithm
in pseudocode, with appropriate English explanations. Clearly describe
what each thread does and any shared data or synchronization constructs.
Note: You do not need to explain the ICM algorithm; just explain
your parallelization.
- Explain and justify (i.e., prove) why your algorithm will exhibit
proper synchronization (threads wait for data they depend on) and
will not deadlock. You should explain what variables are used and
what the purpose of each is, as well as the interleaving properties
of the threads as they relate to the synchronizations. (Consult Downey,
your prior lab, and any feedback on it for examples.)
Hint: One nice solution to this problem extends the
barrier pattern so that it is reusable.
In other words, the barrier can be used within a loop. The key is
to create, in addition to a turnstile that ensures all threads rendezvous
at the beginning of a section of code, a second turnstile that synchronizes
them at the end of the looped code. This prevents one thread
from lapping another by getting a full iteration ahead.
While the first ("in") turnstile is unlocked only once all threads
have arrived (by incrementing and checking a variable), the second
("out") turnstile is unlocked only when all threads have completed
(by decrementing and checking the same variable).
Part B: Implementation
Implement the multi-threaded version (using POSIX threads) of your
parallel algorithm by copying icm.c to picm.c and
modifying the existing functions and introducing any necessary new
functions. Like icm, your program picm should accept
the filenames of an input and an output image. It should also take
as a command-line parameter the number of threads to create.
Preserve the wall clock timer to measure time for the algorithm. (You'll
find that I/O dominates the overall time.)
The parallel and single-threaded versions should produce the same results
on all images. To verify this, the diff(1) command
will tell you whether two files are identical. If you want to create
an image that shows you the difference between two images (rather
than just whether there is a difference), you can use this handy program:
-
$ compare image1.pbm image2.pbm differences.png
which overlays the differing pixels in red in the output image
differences.png on top of a matted version of image1.pbm.
For a simple binary output, use differences.pbm, which
will be black where pixels differ and white where they're the same.
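As a concrete check with diff, using hypothetical output filenames
out-serial.pbm and out-parallel.pbm (a tiny PBM is fabricated here just
so the commands run):

```shell
# Hypothetical stand-ins for the serial and parallel program outputs
printf 'P1\n2 2\n0 1\n1 0\n' > out-serial.pbm
cp out-serial.pbm out-parallel.pbm
# diff prints nothing and exits 0 when the two files are byte-identical
diff out-serial.pbm out-parallel.pbm && echo "outputs match"
```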
Part C: Performance analysis and evaluation
Revise your design from Part A to reflect any changes you made
during implementation and to address any feedback you received.
Integrate your revised design within a paper that also analyzes the
performance of your algorithm, including the following elements:
- Setup
- Describe the experimental setup, including the number
of cores on your machine, image size, algorithm parameters, etc.
- Experiment
- Do the experiment, reporting the number of ICM update
iterations required for convergence. Use the bash time
command to measure the times used by both the single- and multi-threaded
programs (with 1, 2, 3, 4, 8, 16, and 512 threads) for one of the
noisy palisade images. For each run, record the user and system times
from time and the wall clock time from main.
- Report
- Use a spreadsheet (or another tool of your choice) to
create plots of the number of threads versus the three times, as well
as the speedup and efficiency (using only the wall clock time).
- Summarize
- Give a brief, objective overall picture of the performance:
- How does the elapsed (wall clock) time compare between the single-threaded
and multi-threaded programs, when the number of threads is less than
or equal to the number of cores? When it exceeds the number of cores?
- How does the user time compare between the single-threaded and multi-threaded
programs, when the number of threads is less than or equal to the
number of cores? When it exceeds the number of cores?
- What happens to all three times as the number of threads grows much
larger than the number of cores?
- What is the relative change of the three time categories? Does one
category change proportionally more than the others?
- Infer
- Draw conclusions. Write a paragraph explaining the results.
Why do you see the patterns in your results?
- Recommend
- Make recommendations. What lessons have you learned
about parallel programming on a multiprocessor?
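For the speedup and efficiency plots requested in the Report step, the
standard definitions apply, where \(T_1\) is the single-threaded
wall-clock time and \(T_p\) the wall-clock time with \(p\) threads:

```latex
S(p) = \frac{T_1}{T_p} \qquad \text{(speedup)}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p} \qquad \text{(efficiency)}
```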
What to turn in
- Part A
- A single PDF of your algorithm design.
- Part B
-
- Your source code in picm.c
- A single PDF containing (merged)
- Your enscripted source code
- A transcript of your program's compilation and test run(s)
- Part C
- A single PDF of your (revised) design from Part A and
your analysis.
Acknowledgments
Janet Davis contributed notable improvements to the text. The public
domain Palisade
image is by the Pearson Scott Foresman company. The Dilbert
image is Copyright 2008 Scott Adams, Inc.
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.