Lab: GPU Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a CUDA-based strategy for a GPU implementation
of an algorithm for cleaning up noisy images.
- Assigned:
- Tuesday 13 November
- Due:
- This assignment will be due in two parts.
- Parts A and B
- 11:30 PM Monday 24 November
- Parts C and D
- 11:30 PM Monday 1 December
- Objectives:
-
- Experience CUDA-based GPU programming.
- Practice informal benchmarking and system analysis.
- Refine memory management and locality considerations for parallel
programming.
- Resources
-
- Important Note:
- The CUDA devices in our labs have what is known
as compute capability 2.0, and our CUDA driver is the rather old
version 5.0. Be aware that you may hear about many new (and useful)
features that have been introduced into newer drivers and devices
with higher compute capabilities, but our versions may not support them.
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- The starter files are the same as for the prior lab, but there are some
we did not look at before (and that you may need to copy).
- Read cubitmap.h and be sure you understand what each of the
functions does and what it is for.
In particular, note that our two-dimensional images will no longer
be wrapped in a bm_t but must now be explicitly represented
on the CUDA device as linear arrays, i.e., a bit* pointer, for
which we will need to know the dimensions (analogous to the matrix
data used in Kirk and Hwu).
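For example, a pixel at (row, col) of a width-by-height image would be
accessed with a row-major offset, as in the following minimal sketch
(getPixel is illustrative, not part of the starter code; verify the actual
layout against cubitmap.h):
-
__device__ bit getPixel( bit* image, int width, int row, int col )
{
  /* Row-major: skip `row` full rows of `width` pixels, then `col` more */
  return image[row * width + col];
}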
- Read over the example in cuinvert.cu and be sure you understand
how it works.
- Read the Makefile target cuinvert; be sure you understand
the build process.
- Compile and test the CUDA-based image inversion program.
-
$ make cuinvert
$ ./cuinvert infile outfile
- Read the capabilities of
your CUDA device. In particular, you should note how many threads
per SM and threads per block it can handle, as these could influence
your design decisions.
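If a device-query utility is not handy, something like the following
standalone sketch (illustrative, not part of the starter code) prints the
relevant limits using cudaGetDeviceProperties:
-
#include <stdio.h>

int main()
{
  cudaDeviceProp prop;
  cudaError_t rc = cudaGetDeviceProperties( &prop, 0 );  /* device 0 */
  if (rc != cudaSuccess) {
    fprintf( stderr, "Unable to query device: %s\n",
             cudaGetErrorString(rc) );
    return 1;
  }
  printf( "Compute capability:    %d.%d\n", prop.major, prop.minor );
  printf( "Max threads per block: %d\n", prop.maxThreadsPerBlock );
  printf( "Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor );
  return 0;
}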
- Copy the original icm.c to a new CUDA program.
-
$ cp icm.c cuicm.cu
Cite and attribute the source of your new file. As you proceed, clearly
separate which portions are exclusively yours, which are faithful
to the original, and which you have modified.
- Compile the cuicm program using the Makefile (which
should already have a line for it).
-
$ make cuicm
Exercises
Part A: Design sketch
Sketch a design for a simple parallel, CUDA-based implementation of
the ICM algorithm. Consider the following questions and briefly record
your answers.
- What partitioning will you use? That is, how will you
arrange/divide the blocks and grids? What will each thread be responsible
for? What is the host responsible for?
- What data needs to be transferred from the host to the device? What
data (if any) is only needed on the device?
- What data is local to a thread? What data (if any) can be shared among
threads in a block?
- What data or results need to be transferred from the device to the
host?
Part B: First implementation
Change the body of the runicm function in cuicm.cu
so that it uses CUDA and the GPU to perform the ICM algorithm, adding
and modifying any other functions and variables that may be necessary
in the process.
Start with the simplest, most straightforward adaptation possible.
That is, just write a kernel and launch grid that work; do
not yet worry about optimizing performance in any way.
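As a rough sketch of that flow (the names image_h, icm_kernel, width, and
height are hypothetical, and error checking is omitted for brevity):
-
bit* image_d;
size_t size = width * height * sizeof(bit);

cudaMalloc( (void**)&image_d, size );                         /* device copy */
cudaMemcpy( image_d, image_h, size, cudaMemcpyHostToDevice );

/* One thread per pixel: 16x16 blocks, enough blocks to cover the image */
dim3 block(16, 16);
dim3 grid( (width + block.x - 1) / block.x,
           (height + block.y - 1) / block.y );
icm_kernel<<<grid, block>>>( image_d, width, height );

cudaMemcpy( image_h, image_d, size, cudaMemcpyDeviceToHost );
cudaFree( image_d );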
Important Tip: cudaMemcpy only works on cudaMalloc'd
memory. If you wish to transfer global device variables or constants
between device and host, you'll need the cudaMemcpyFromSymbol
and cudaMemcpyToSymbol functions. For example:
-
#include <stdio.h>

__device__ int flag_d;  /* Device variable */
int flag_h;             /* Host variable */

__host__ void doSomething()
{
  cudaError_t rc;

  // Copy value of flag_h on host into flag_d on device
  rc = cudaMemcpyToSymbol( flag_d, &flag_h, sizeof(int), 0,
                           cudaMemcpyHostToDevice );
  if (rc != cudaSuccess)
    fprintf( stderr, "Unable to copy value to device: %s\n",
             cudaGetErrorString(rc) );

  // Copy value of flag_d on device into flag_h on host
  rc = cudaMemcpyFromSymbol( &flag_h, flag_d, sizeof(int), 0,
                             cudaMemcpyDeviceToHost );
  if (rc != cudaSuccess)
    fprintf( stderr, "Unable to copy value from device: %s\n",
             cudaGetErrorString(rc) );
}
Part C: Revised implementation
Tweak your solution incrementally to test different approaches and
their effectiveness. Some design implications to consider:
- Grid and block dimensions
- Accesses to registers, local memory, and global memory
- Thread divergence
- Tip:
- Unless you rearrange the timing mechanism to time only the
ICM iterations (rather than all the other startup overhead), make
sure you take several measurements to get a sense of how effective
your changes are; otherwise, highly variable I/O time may give you
false readings.
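One way to time only the ICM portion is with CUDA events (a sketch; adapt
to your own code):
-
cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
/* ... launch the ICM iterations here ... */
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );  /* wait until the recorded work finishes */

cudaEventElapsedTime( &elapsed_ms, start, stop );
printf( "ICM iterations took %.3f ms\n", elapsed_ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );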
Through a combination of strategies (I tried at least six approaches
in various combinations, but did not keep all of them), I was able
to achieve a speedup of about 7.2× over a poorly tuned basic
adaptation. (The simplest single tuning gave a 5.2× speedup.)
Part D: Design paper
In light of your sketch and final implementation, write a design
paper for your program that clearly describes what each thread does,
where its data resides, and (for both) why. In particular, you should
explain:
- what communication is required among the threads and blocks;
- your agglomeration strategy;
- your kernel's CGMA (compute to global memory access) ratio, as in the
example following this list;
- the values of any configuration parameters you have chosen; and
- (if applicable) other approaches or variations you tried and how they
compare.
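(As a reminder, the CGMA ratio counts arithmetic operations per global
memory access. With purely illustrative numbers: a kernel that loads a
pixel and its four neighbors (five global loads), performs ten arithmetic
operations, and stores one result (one global store) has a CGMA ratio of
10/6 ≈ 1.7.)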
You should also report the measured speedup your program (and perhaps
variants) achieves over both the provided single-threaded version (icm.c)
and your first version from Part B.
Evaluation and Extra Credit
The programming portion of your grade will largely be based on how
well you decompose the task, leverage locality, and optimize your
execution configuration for the given hardware.
A revision including only the basic, functioning adaptation would
be considered Good ("B"), while reports of additional strategies,
or of one strategy that effectively reduced runtime, would be considered
Very Good ("B+"). Reports of several additional strategies
that collectively reduced runtime would be considered Excellent
("A-"), and reports of several strategies that collectively reduced
runtime, along with others that did not, would be considered Truly
Excellent ("A+").
The top-performing programs may receive substantial extra credit if
they utilize additional performance considerations. Although we have
not covered these in class, Kirk and Hwu discuss some other powerful
performance considerations in chapter 6, including:
- Data prefetching
- Memory coalescence
- Thread granularity
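For a taste of what memory coalescence involves (an illustrative sketch
only, not a requirement): consecutive threads should access consecutive
addresses, so with row-major image data, let threadIdx.x index columns:
-
__global__ void kernel( bit* image, int width, int height )
{
  int col = blockIdx.x * blockDim.x + threadIdx.x;  /* varies fastest */
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (col < width && row < height) {
    /* Adjacent threads in a warp read adjacent addresses, so the
       hardware can coalesce their loads into fewer transactions */
    bit value = image[row * width + col];
    /* ... work with value ... */
  }
}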
What to turn in
- Parts A and B
-
- Your source code in cuicm.cu (from Part B)
- A single PDF containing (merged)
- Your brief answers from Part A
- Your enscripted source code in cuicm.cu (from Part B)
- A transcript of your program's compilation and test run, i.e.,
-
$ make clean icm cuicm ; rm out*.pbm
$ ./icm in.pbm out.pbm
$ ./cuicm in.pbm cout.pbm
$ diff out.pbm cout.pbm
- Parts C and D
-
- Your source code in cuicm.cu (from Part C)
- A single PDF containing (merged)
- Your enscripted source code in cuicm.cu (from Part C)
- A transcript of your program's compilation and test run, i.e.,
-
$ make clean icm cuicm ; rm out*.pbm
$ ./icm in.pbm out.pbm
$ ./cuicm in.pbm cout.pbm
$ diff out.pbm cout.pbm
- Your final design paper and performance report (from Part D)
Copyright © 2012, 2014 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.