Lab: GPU Image Restoration
CSC 213 - Operating Systems and Parallel Algorithms - Weinman
- Summary:
- You will devise a CUDA-based strategy for a GPU implementation
of an algorithm for cleaning up noisy images.
- Assigned:
- Tuesday 13 November
- Due:
- This assignment will be due in two parts.
- Parts A and B
- 11:30 PM Monday 24 November
- Parts C and D
- 11:30 PM Monday 1 December
- Objectives:
-
- Experience CUDA-based GPU programming.
- Practice informal benchmarking and system analysis.
- Refine memory management and locality considerations for parallel
programming.
- Resources
-
- Important Note:
- The CUDA devices in our labs have what is known
as compute capability 2.0, and our CUDA driver is the rather old
version 5.0. Be aware that you may hear about many new (and useful)
features that have been introduced into newer drivers and devices
with higher compute capabilities, but our versions may not support them.
Preliminaries
- Do this laboratory on a MathLAN workstation in 3819 or 3815.
- The starter files are the same as for the prior lab, but there are some
we did not look at before (and that you may need to copy).
- Read cubitmap.h and be sure you understand what each of the
functions does and what it is for.
In particular, note that our two-dimensional images will no longer
be wrapped in a bm_t but must now be explicitly represented
on the CUDA device as linear arrays, i.e., a bit* pointer, for
which we will need to know the dimensions (analogous to the matrix
data used in Kirk and Hwu).
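For example, a pixel at (row, col) of a width-by-height image would be
accessed with a row-major offset, as in the following minimal sketch
(getPixel is illustrative, not part of the starter code; verify the actual
layout against cubitmap.h):
-
__device__ bit getPixel( bit* image, int width, int row, int col )
{
  /* Row-major: skip `row` full rows of `width` pixels, then `col` more */
  return image[row * width + col];
}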
- Read over the example in cuinvert.cu and be sure you understand
how it works.
- Read the Makefile target cuinvert; be sure you understand
the build process.
- Compile and test the CUDA-based image inversion program.
-
$ make cuinvert
$ ./cuinvert infile outfile
- Read the capabilities of
your CUDA device. In particular, you should note how many threads
per SM and threads per block it can handle, as these could influence
your design decisions.
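If a device-query utility is not handy, something like the following
standalone sketch (illustrative, not part of the starter code) prints the
relevant limits using cudaGetDeviceProperties:
-
#include <stdio.h>

int main()
{
  cudaDeviceProp prop;
  cudaError_t rc = cudaGetDeviceProperties( &prop, 0 );  /* device 0 */
  if (rc != cudaSuccess) {
    fprintf( stderr, "Unable to query device: %s\n",
             cudaGetErrorString(rc) );
    return 1;
  }
  printf( "Compute capability:    %d.%d\n", prop.major, prop.minor );
  printf( "Max threads per block: %d\n", prop.maxThreadsPerBlock );
  printf( "Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor );
  return 0;
}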
- Copy the original icm.c to a new CUDA program.
-
$ cp icm.c cuicm.cu
Cite and attribute the source of your new file. As you proceed, clearly
separate which portions are exclusively yours, which are faithful
to the original, and which you have modified.
- Compile the cuicm program using the Makefile (which
should already have a line for it).
-
$ make cuicm
Exercises
Part A: Design sketch
Sketch a design for a simple parallel, CUDA-based implementation of
the ICM algorithm. Consider the following questions and briefly record
your answers.
- What partitioning will you use? That is, how will you
arrange/divide the blocks and grids? What will each thread be responsible
for? What is the host responsible for?
- What data needs to be transferred from the host to the device? What
data (if any) is only needed on the device?
- What data is local to a thread? What data (if any) can be shared among
threads in a block?
- What data or results need to be transferred from the device to the
host?
Part B: First implementation
Change the body of the runicm function in cuicm.cu
so that it uses CUDA and the GPU to perform the ICM algorithm, adding
and modifying any other functions and variables that may be necessary
in the process.
Start with the simplest, most straightforward adaptation possible.
That is, just write a kernel and launch grid that work; do
not yet worry about optimizing performance in any way.
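As a rough sketch of that flow (the names image_h, icm_kernel, width, and
height are hypothetical, and error checking is omitted for brevity):
-
bit* image_d;
size_t size = width * height * sizeof(bit);

cudaMalloc( (void**)&image_d, size );                         /* device copy */
cudaMemcpy( image_d, image_h, size, cudaMemcpyHostToDevice );

/* One thread per pixel: 16x16 blocks, enough blocks to cover the image */
dim3 block(16, 16);
dim3 grid( (width + block.x - 1) / block.x,
           (height + block.y - 1) / block.y );
icm_kernel<<<grid, block>>>( image_d, width, height );

cudaMemcpy( image_h, image_d, size, cudaMemcpyDeviceToHost );
cudaFree( image_d );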
Important Tip: cudaMemcpy only works on cudaMalloc'd
memory. If you wish to transfer global device variables or constants
between device and host, you'll need the cudaMemcpyFromSymbol
and cudaMemcpyToSymbol functions. For example:
-
#include <stdio.h>

__device__ int flag_d;  /* Device variable */
int flag_h;             /* Host variable */

__host__ void doSomething()
{
  cudaError_t rc;

  // Copy value of flag_h on host into flag_d on device
  rc = cudaMemcpyToSymbol( flag_d, &flag_h, sizeof(int), 0,
                           cudaMemcpyHostToDevice );
  if (rc != cudaSuccess)
    fprintf( stderr, "Unable to copy value to device: %s\n",
             cudaGetErrorString(rc) );

  // Copy value of flag_d on device into flag_h on host
  rc = cudaMemcpyFromSymbol( &flag_h, flag_d, sizeof(int), 0,
                             cudaMemcpyDeviceToHost );
  if (rc != cudaSuccess)
    fprintf( stderr, "Unable to copy value from device: %s\n",
             cudaGetErrorString(rc) );
}
Part C: Revised implementation
Tweak your solution incrementally to test different approaches and
their effectiveness. Some design implications to consider:
- Grid and block dimensions
- Accesses to registers, local memory, and global memory
- Thread divergence
- Tip:
- Unless you rearrange the timing mechanism to time only the
ICM iterations (rather than all the other startup overhead), make
sure you take several measurements to get a sense of how effective
your changes are; otherwise, highly variable I/O time may give you
false readings.
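One way to time only the ICM portion is with CUDA events (a sketch; adapt
to your own code):
-
cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
/* ... launch the ICM iterations here ... */
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );  /* wait until the recorded work finishes */

cudaEventElapsedTime( &elapsed_ms, start, stop );
printf( "ICM iterations took %.3f ms\n", elapsed_ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );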
Through a combination of strategies (I tried at least six approaches
in various combinations, but did not keep all of them), I was able
to achieve a speedup of about 7.2× over a poorly tuned basic
adaptation. (The simplest single tuning gave a 5.2× speedup.)
Part D: Design paper
In light of your sketch and final implementation, write a design
paper for your program that clearly describes what each thread does,
where its data resides, and (for both) why. In particular, you should
explain:
- what communication is required among the threads and blocks;
- your agglomeration strategy;
- your kernel's CGMA (compute to global memory access) ratio, as in the
example following this list;
- the values of any configuration parameters you have chosen; and
- (if applicable) other approaches or variations you tried and how they
compare.
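(As a reminder, the CGMA ratio counts arithmetic operations per global
memory access. With purely illustrative numbers: a kernel that loads a
pixel and its four neighbors (five global loads), performs ten arithmetic
operations, and stores one result (one global store) has a CGMA ratio of
10/6 ≈ 1.7.)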
You should also report the measured speedup your program (and perhaps
variants) achieves over both the provided single-threaded version (icm.c)
and your first version from Part B.
Evaluation and Extra Credit
The programming portion of your grade will largely be based on how
well you decompose the task, leverage locality, and optimize your
execution configuration for the given hardware.
A revision including only the basic, functioning adaptation would
be considered Good ("B"), while reports of additional strategies,
or of one strategy that effectively reduced runtime, would be considered
Very Good ("B+"). Reports of several additional strategies
that collectively reduced runtime would be considered Excellent
("A-"), and reports of several strategies that collectively reduced
runtime, along with others that did not, would be considered Truly
Excellent ("A+").
The top-performing programs may receive substantial extra credit if
they utilize additional performance considerations. Although we have
not covered these in class, Kirk and Hwu discuss some other powerful
performance considerations in chapter 6, including:
- Data prefetching
- Memory coalescence
- Thread granularity
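For a taste of what memory coalescence involves (an illustrative sketch
only, not a requirement): consecutive threads should access consecutive
addresses, so with row-major image data, let threadIdx.x index columns:
-
__global__ void kernel( bit* image, int width, int height )
{
  int col = blockIdx.x * blockDim.x + threadIdx.x;  /* varies fastest */
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (col < width && row < height) {
    /* Adjacent threads in a warp read adjacent addresses, so the
       hardware can coalesce their loads into fewer transactions */
    bit value = image[row * width + col];
    /* ... work with value ... */
  }
}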
What to turn in
- Parts A and B
-
- Your source code in cuicm.cu (from Part B)
- A single PDF containing (merged)
- Your brief answers from Part A
- Your enscripted source code in cuicm.cu (from Part B)
- A transcript of your program's compilation and test run, i.e.,
-
$ make clean icm cuicm ; rm out*.pbm
$ ./icm in.pbm out.pbm
$ ./cuicm in.pbm cout.pbm
$ diff out.pbm cout.pbm
- Parts C and D
-
- Your source code in cuicm.cu (from Part C)
- A single PDF containing (merged)
- Your enscripted source code in cuicm.cu (from Part C)
- A transcript of your program's compilation and test run, i.e.,
-
$ make clean icm cuicm ; rm out*.pbm
$ ./icm in.pbm out.pbm
$ ./cuicm in.pbm cout.pbm
$ diff out.pbm cout.pbm
- Your final design paper and performance report (from Part D)
Copyright © 2012, 2014 Jerod
Weinman.
This work is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License.