Lab: GPU Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a CUDA-based strategy for a GPU implementation
of an algorithm for cleaning up noisy images.
- Assigned:
- Tuesday 13 November
- Due:
- 10:30 PM Monday 26 November
- Objectives:
- Experience CUDA-based GPU programming.
- Practice informal benchmarking and system analysis.
- Refine memory management and locality considerations for parallel,
distributed programming.
- Resources
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- The starter files are the same as in the prior lab, but there are some
files we did not use before (which you may need to copy).
- Read cupbmio.h and be sure you understand what each of the
functions does and is for.
In particular, note that our two-dimensional images are no longer of
type pbm_t; they will now be represented on the CUDA device
as linear arrays (a bit* pointer) whose dimensions we must
track separately. This is analogous to the matrix data used
in Kirk and Hwu.
- Read over the example in cuinvert.cu and be sure you understand
how it works.
- Compile and test the CUDA-based image inversion program.
$ make cuinvert
$ ./cuinvert infile outfile
- Investigate the capabilities of your CUDA device.
$ /usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
In particular, you should note how many threads per SM and threads
per block it can handle, as these could influence your design decisions.
- Copy the original icm.c to a new CUDA program file.
$ cp icm.c cuicm.cu
- Compile the cuicm program using the Makefile (which
should already have a line for it).
$ make cuicm
Exercises
Part A: Design sketch
Sketch a design for a simple parallel, CUDA-based implementation of
the ICM algorithm. Consider the following questions and briefly record
your answers.
- What partitioning will you use? That is, how will you
arrange/divide the blocks and grids? What will each thread be responsible
for? What is the host responsible for?
- What data needs to be transferred from the host to the device? What
data (if any) is only needed on the device?
- What data is local to a thread? What data (if any) can be shared among
threads in a block?
- What data or results need to be transferred from the device to the
host?
Part B: Implementation
Change the body of the runicm function in cuicm.cu
so that it uses CUDA and the GPU to perform the ICM algorithm, adding
and/or modifying any other functions and/or variables that may be
necessary in the process.
I strongly recommend you start with the simplest, most straightforward
adaptation possible. After that, you can tweak your solution incrementally
to test different approaches and their effectiveness. Some design
implications to consider:
- Grid and block dimensions
- Accesses to registers, local memory, and global memory
- Thread divergence
Important Tip: cudaMemcpy only works on cudaMalloc'd
memory. If you wish to transfer global device variables or constants
to/from the host, you'll need the cudaMemcpyFromSymbol and
cudaMemcpyToSymbol functions. For example:
__device__ int flag_d;   /* Device variable */
int flag_h;              /* Host variable */

__host__ void doSomething()
{
  cudaError_t rc;

  rc = cudaMemcpyToSymbol( flag_d, &flag_h, sizeof(int), 0,
                           cudaMemcpyHostToDevice );
  if (rc != cudaSuccess)
    fprintf(stderr, "Unable to copy value to device: %s\n",
            cudaGetErrorString(rc) );

  rc = cudaMemcpyFromSymbol( &flag_h, flag_d, sizeof(int), 0,
                             cudaMemcpyDeviceToHost );
  if (rc != cudaSuccess)
    fprintf(stderr, "Unable to copy value from device: %s\n",
            cudaGetErrorString(rc) );
}
Part C: Design paper
In light of your sketch and final implementation, write a design paper
for your program that clearly describes what each thread does, where
its data resides, and (for both) why. In particular, you should explain:
- what communication is required among the threads and blocks;
- your agglomeration strategy;
- your kernel's CGMA (compute to global memory access) ratio;
- the values of any configuration parameters you have chosen; and
- (if applicable) other approaches or variations you tried and how they
compare.
You should also report the measured speedup your program (and perhaps
variants) achieves over the single-threaded version given. Optionally,
you can report measured speedup over your parallelized pthread
version.
Evaluation and Extra Credit
Your grade will largely be based on how well you decompose the task,
leverage locality, and optimize your execution configuration for the
given hardware. I am likely to give a substantial amount of extra
credit to the top performing programs. Although we have not covered
these in class, Kirk and Hwu discuss some other powerful performance
considerations in chapter 6, including:
- Data prefetching
- Memory coalescing
- Thread granularity
What to turn in
- Your source code in cuicm.cu
- A single PDF containing (merged)
- Your design paper (Part C)
- Your enscripted source code (Part B)
- A transcript of your program's compilation and test run(s)
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.