Lab: GPU Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a CUDA-based strategy for a GPU implementation
of an algorithm for cleaning up noisy images.
- Assigned:
- Tuesday 13 November
- Due:
- 10:30 PM Monday 26 November
- Objectives:
- Experience CUDA-based GPU programming.
- Practice informal benchmarking and system analysis.
- Refine memory management and locality considerations for parallel,
distributed programming.
- Resources
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- The starter files are the same as in the prior lab, but there are some
files we did not use before (which you may need to copy).
- Read cupbmio.h and be sure you understand what each of the
functions does and is for.
In particular, note that our two-dimensional images are no longer of
type pbm_t; they will now be represented on the CUDA device
as linear arrays (a bit* pointer) whose dimensions we must
track separately. This is analogous to the matrix data used
in Kirk and Hwu.
- Read over the example in cuinvert.cu and be sure you understand
how it works.
- Compile and test the CUDA-based image inversion program.
$ make cuinvert
$ ./cuinvert infile outfile
- Investigate the capabilities of your CUDA device.
$ /usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
In particular, you should note how many threads per SM and threads
per block it can handle, as these could influence your design decisions.
- Copy the original icm.c to a new CUDA program file.
$ cp icm.c cuicm.cu
- Compile the cuicm program using the Makefile (which
should already have a line for it).
$ make cuicm
Exercises
Part A: Design sketch
Sketch a design for a simple parallel, CUDA-based implementation of
the ICM algorithm. Consider the following questions and briefly record
your answers.
- What partitioning will you use? That is, how will you
arrange/divide the blocks and grids? What will each thread be responsible
for? What is the host responsible for?
- What data needs to be transferred from the host to the device? What
data (if any) is only needed on the device?
- What data is local to a thread? What data (if any) can be shared among
threads in a block?
- What data or results need to be transferred from the device to the
host?
Part B: Implementation
Change the body of the runicm function in cuicm.cu
so that it uses CUDA and the GPU to perform the ICM algorithm, adding
and/or modifying any other functions and/or variables that may be
necessary in the process.
I strongly recommend you start with the simplest, most straightforward
adaptation possible. After that, you can tweak your solution incrementally
to test different approaches and their effectiveness. Some design
implications to consider:
- Grid and block dimensions
- Accesses to registers, local memory, and global memory
- Thread divergence
Important Tip: cudaMemcpy only works on cudaMalloc'd
memory. If you wish to transfer global device variables or constants
to/from the host, you'll need the cudaMemcpyFromSymbol and
cudaMemcpyToSymbol functions. For example:
__device__ int flag_d;   /* Device variable */
int flag_h;              /* Host variable */

__host__ void doSomething()
{
  cudaError_t rc;

  rc = cudaMemcpyToSymbol( flag_d, &flag_h, sizeof(int), 0,
                           cudaMemcpyHostToDevice );
  if (rc != cudaSuccess)
    fprintf(stderr, "Unable to copy value to device: %s\n",
            cudaGetErrorString(rc) );

  rc = cudaMemcpyFromSymbol( &flag_h, flag_d, sizeof(int), 0,
                             cudaMemcpyDeviceToHost );
  if (rc != cudaSuccess)
    fprintf(stderr, "Unable to copy value from device: %s\n",
            cudaGetErrorString(rc) );
}
Part C: Design paper
In light of your sketch and final implementation, write a design paper
for your program that clearly describes what each thread does, where
its data resides, and (for both) why. In particular, you should explain:
- what communication is required among the threads and blocks;
- your agglomeration strategy;
- your kernel's CGMA (compute to global memory access) ratio;
- the values of any configuration parameters you have chosen; and
- (if applicable) other approaches or variations you tried and how they
compare.
You should also report the measured speedup your program (and perhaps
variants) achieves over the single-threaded version given. Optionally,
you can report measured speedup over your parallelized pthread
version.
Evaluation and Extra Credit
Your grade will largely be based on how well you decompose the task,
leverage locality, and optimize your execution configuration for the
given hardware. I am likely to give a substantial amount of extra
credit to the top performing programs. Although we have not covered
these in class, Kirk and Hwu discuss some other powerful performance
considerations in chapter 6, including:
- Data prefetching
- Memory coalescing
- Thread granularity
What to turn in
- Your source code in cuicm.cu
- A single PDF containing (merged)
- Your design paper (Part C)
- Your enscripted source code (Part B)
- A transcript of your program's compilation and test run(s)
Copyright © 2012 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.