



An Honors Thesis Presented to The Faculty of the Department of Computer Science Washington and Lee University

In Partial Fulfillment Of the Requirements for Honors in Computer Science

by Alexander Lee Jackson May 2009

To my Mother and Father. . .

1 Introduction
1.1 General Purpose GPU Programming
1.2 Image Processing
1.3 Motivation
1.4 Thesis Layout

2 Background
2.1 Parallel Computing
2.2 Graphics Processing Units
2.2.1 General Purpose Graphics Processing Unit
2.2.2 Compute Unified Device Architecture
2.3 Image Processing

3 GPU Edge Detection Algorithms
3.1 One Pixel Per Thread Algorithm
3.2 Multiple Pixels Per Thread Algorithm

4 Evaluation and Results
4.1 Method
4.2 Results

5 Conclusions
5.1 Conclusion
5.2 Future Work




Acknowledgments

I would like to thank my adviser, Rance Necaise, for assisting me through this long and, at times, frustrating process. Thanks also to Tania S. Douglas and her research team at the University of Cape Town in South Africa for inspiring this project and supplying a number of sample images. A special thank you to Ryleigh for talking me down from many ledges when it got overwhelming.



Abstract

Often, it is a race against time to make a proper diagnosis of a disease. In areas of the world where qualified medical personnel are scarce, work is being done on the automated diagnosis of illnesses. Automated diagnosis involves several stages of image processing on lab samples in search of abnormalities that may indicate the presence of diseases such as tuberculosis. These image processing tasks are good candidates for migration to parallelism, which would significantly speed up the process. However, a traditional parallel computer is not a very accessible piece of hardware for many. The graphics processing unit (GPU) has evolved into a highly parallel component that has recently gained the ability to be utilized by developers for non-graphical computations. This paper demonstrates the parallel computing power of the GPU in the area of medical image processing. We present a new algorithm for performing edge detection on images using NVIDIA's CUDA programming model to program the GPU in C. We evaluated our algorithm on a number of sample images and compared it to two other implementations: one sequential and one parallel. The new algorithm produces impressive speedup in the edge detection process.



Chapter 1

Introduction

Graphics processing units (GPUs) have evolved over the past decade into highly parallel, multicore processors[2]. Until recently, these extremely powerful pieces of hardware were typically used only for processing graphical data. Unless the user is playing a graphics-intensive computer game or executing some other application of a graphical nature, these high-powered GPUs are often underutilized. In recent years, researchers have begun to investigate the viability of using this highly parallel, highly efficient processing unit for computations of a non-graphical nature. This project focused on using GPUs for image processing. We implemented two different load-balancing algorithms for use with the GPU and showed the advantages of programming the GPU rather than the central processing unit (CPU) for such problems.


General Purpose GPU Programming

Parallel computing has become the go-to method for dealing with problems that have large data sets and computationally intense calculations. Examples of such problems include scientific modeling simulations, weather forecasting, and modeling the interactions of many heavenly bodies. However, there are limitations to the use of the high-powered computers that are necessary for executing these parallel applications. These super-computers


are often very large in size and are typically quite pricey.


Working with the GPU on computationally intensive problems has several advantages over the alternative options for parallelism. The GPU is a much more physically and financially manageable piece of machinery, with a top-of-the-line unit going for at most a few thousand dollars. It also has a growing community of enthusiasts who have demonstrated impressive speed-ups through GPU utilization. NVIDIA has become the industry's leading proponent of GPGPU (general-purpose graphics processing unit) programming through its release and support of CUDA. CUDA is an extension to the C programming language that has helped make GPGPU programming more accessible to developers.


Image Processing

Image processing “refers to the manipulation and analysis of pictorial information”[3]. The use of image processing has become an important part of society. It is utilized in the scientific and entertainment communities to do such things as convert photographs to black and white or increase the sharpness of an image. In relation to this paper, manipulating an image is an important step in the implementation of computer vision. Computer vision involves the automation of such important tasks as the diagnosis of illnesses or facial recognition. The general idea behind image processing involves examining image pixels and manipulating them as defined by the type of image processing desired. Image processing can be a time-consuming task and, luckily, lends itself nicely to conversion to a parallel algorithm.



Motivation

Performing research under the R.E. Lee Scholar program initially sparked our interest in utilizing parallel algorithms for high-performance computing. The summers of 2007 and 2008 were spent working with this concept. Preliminary research for this thesis was done prior to the beginning of classes in the fall. We spent this time becoming proficient in programming with CUDA and converted a simple physics heat-diffusion algorithm to a GPU-based solution.

Inspiration for this project was taken from Professor Tania S. Douglas and her research group, the MRC/UCT Medical Imaging Research Unit, Department of Human Biology, University of Cape Town in South Africa. This research group has been developing an automated process for the diagnosis of tuberculosis, particularly for low-income and developing countries[4]. Currently, the diagnosis of tuberculosis is a time-consuming task that requires a highly trained technician to examine a sputum smear underneath a microscope. Examining sputum smears (smears of matter taken from the respiratory tract) under a microscope is the primary method for diagnosing tuberculosis, according to the World Health Organization (WHO)[1]. The problem with the current method for TB diagnosis is the human element inherent to it. Each slide must be closely examined by a medical technician with a level of competency that is not necessarily guaranteed. This problem is compounded by the fact that in developing countries, where TB is still a major health risk, there is usually a shortage of senior pathologists to verify manual screening, a requirement of the WHO. Additionally, slide examination is a tedious and time-consuming task. On average, a technician will examine each sputum slide for five minutes and examine around 25 slides per day[4]. Automating TB diagnosis will help alleviate the need for highly trained medical technicians to perform these tedious tasks while at the same time increasing the accuracy of diagnosis and the number of diagnoses that can be made in a given period of time. Our work on the GPU with CUDA is significant for the University of Cape Town group because of the aforementioned advantages of utilizing the GPU in parallel applications. Using CUDA has the potential for speedups that are orders of magnitude greater than those of its sequential counterparts.
Additionally, the small space requirement and relatively low cost of implementing a GPU for scientific applications allow for a degree of portability unavailable to a typical high-powered parallel computer. Creating a cost-effective method for medical computing in undeveloped areas has the potential to help improve the conditions in places that do not

have the level of health care enjoyed in other countries.


The automated diagnosis of TB through the examination of sputum smears requires several steps, one of which is the recognition of abnormal smears. This recognition requires that the image of the smear go through some form of image processing involving edge detection. There are a number of well-documented edge detection algorithms; the one we chose to implement for this research was the Laplacian method of pixel classification[3].


Thesis Layout

Our intention in this thesis is to show the advantages of utilizing CUDA for image processing. In the second chapter we provide some background information on parallel computing in general and GPGPU programming in addition to a more in-depth description of image processing. The third chapter describes our two GPU algorithms. Chapter four reports our method of experimentation and the results of our work, and we make our conclusions in chapter five. We show that the edge detection algorithm on the GPU is considerably faster and more efficient than that of the sequential version and discuss how a GPU implementation of the entire diagnostic process could be achieved.

Chapter 2

Background

This thesis makes reference to the area of parallel computing. We used this area of computational science as a foundation for our work with GPGPU programming in CUDA. Typical parallel computing has a number of advantages and disadvantages, which we discuss in this section. We also lay out how GPGPU programming compares to traditional parallel computing.


Parallel Computing

Parallel computing is a simple idea. The human brain is well versed in performing tasks in parallel; a single computer processor, however, can only do one thing at a time, in sequential order. A parallel computer, which can perform multiple computations at once, can solve problems in a matter of minutes that would normally take hours or days on a single processor. The most basic concept behind parallel computing is the distribution of the workload between the individual processors that are working together to perform computations. For example, consider the problem of matrix addition, which is an embarrassingly parallel application. Embarrassingly parallel problems are those that require no effort beyond dividing up the work and having each processor operate on its portion as though it were a sequential algorithm. Suppose we have a parallel machine with p processors, and we want




to add two matrices with p elements each. This problem only requires that we distribute the two corresponding elements from each matrix to their own processor, let the processor add the elements, and then collect the results into a solution matrix, as illustrated in Figure 2.1. Other problems, such as matrix multiplication, are more complicated due to their need for processor cooperation or a more careful and efficient distribution of data.

Figure 2.1: Parallel matrix addition. Each processor, Pn, gets an element from each matrix.
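The distribution just described can be sketched in plain C (our own illustration; the names are not from the thesis). Each loop iteration plays the role of one processor Pn; on a real parallel machine, all p iterations would execute simultaneously.

```c
#include <stddef.h>

/* Embarrassingly parallel matrix addition: element n of the result
 * depends only on element n of each input, so each "processor" Pn
 * can work independently. Matrices are stored flattened, row-major. */
void matrix_add(const int *a, const int *b, int *out, size_t p)
{
    for (size_t n = 0; n < p; n++)   /* one iteration == one processor Pn */
        out[n] = a[n] + b[n];
}
```

Because no iteration reads another iteration's output, the loop can be cut into p independent pieces with no communication at all, which is what makes the problem embarrassingly parallel.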

The ideal theoretical parallel computer consists of p processors with an infinite amount of memory. With such resources available, we could divide the workload between processors to the point where each processor works with the smallest possible piece of data. In practice, however, no such computer exists. Efficient and proper use of a parallel computer involves careful workload distribution between processors. There are two basic architectures used with parallel computers: SIMD and MIMD. Single Instruction Multiple Data (SIMD) architecture is defined by the simultaneous execution of a single instruction on multiple pieces of data[9]. For example, referring back to the matrix addition case, every processor has access to each element in both matrices, and each processor executes the same instructions, adding the corresponding elements based on its processor id. In Multiple Instruction Multiple Data (MIMD) architecture, each processor runs independently using its own set of instructions[9]. A Beowulf cluster, a common type of parallel computer, is implemented using the MIMD architecture[9]. Inter-processor communications allow for the transmission of data between processors



and the sending of signals reporting their status. In many cases this allows for the synchronization of the individual processors among their group, preventing some from moving forward before the others are ready. Often, synchronization is crucial for reliable execution of applications due to sharing of memory space. A race condition may occur if two processors require read/write access to the same data register. When this happens, which processor gains access to the data first cannot be predicted, so the integrity of the data cannot be guaranteed[8]. Race conditions can be avoided through inter-processor communications. How the processors communicate usually depends on the type of processor relationship being implemented. Washington and Lee's Beowulf cluster, The Inferno, is a 64-processor cluster that uses the MPI (Message Passing Interface) protocol for communication between processors. In clusters such as this one, there is no central shared memory repository for data; instead, data is often distributed by a central processor and stored in each processor's local memory. Early parallel computers made use of a central shared memory pool, but as time has progressed, it has become “difficult and expensive” to make larger machines with this form of memory[6]. MPI is used to distribute data between processors and allow for inter-processor communication. With the message-passing model, we are able to work with very large sets of data without being restricted by the size of a global shared memory pool[6]. The type of relationship between processors working in parallel determines the structure of the program itself. The master/slave relationship is a common paradigm for working in parallel. This consists of a sole “master” processor that presides over a group of “slave” processors. The master is responsible for organizing and distributing the data while the slaves operate on the data[6].
Typically, the slaves, upon starting, signal the master that they are ready and waiting. As these signals are received, the master processor distributes the data among the slave processors. The master is only responsible for managing the data. In some cases, the master processor may clean up loose ends, but generally the slave processors do the majority of the computing, as illustrated in Figure 2.2. Once they have finished, the master collects the resulting data set from the slaves. The work-pool method



for processor communication, as illustrated in Figure 2.3, is similar to the master/slave method, except that the slave processors make requests for smaller chunks of data from a “pool” managed and distributed by the master processor [8].

Figure 2.2: Master/Slave workload distribution.

Figure 2.3: Work pool workload distribution.
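A minimal sequential model of the work pool (our own sketch; the names are hypothetical) captures the master's role: it owns the data and hands out the next chunk each time a slave asks, until the pool is empty.

```c
#include <stddef.h>

/* A minimal work pool: the master owns a cursor into the data.
 * Each request returns the index of the next unprocessed chunk,
 * or -1 when the pool is exhausted. Slaves would call this,
 * process the chunk, and ask again; faster slaves naturally
 * end up handling more chunks. */
struct work_pool {
    size_t next;    /* index of the next chunk to hand out */
    size_t total;   /* total number of chunks in the pool   */
};

long pool_request(struct work_pool *pool)
{
    if (pool->next >= pool->total)
        return -1;                 /* no work left */
    return (long)pool->next++;     /* hand out the next chunk index */
}
```

In a real cluster the request would be an MPI message rather than a function call, but the load-balancing behavior is the same: work flows to whichever slave is idle.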

When the data set is large or the computation complex, utilizing a parallel computer and an algorithm that employs an effective load-balancing scheme can result in a dramatic increase in performance. With enough processors working together, the only performance

limitation is the time for inter-processor communication.



Graphics Processing Units

A GPU is quite different from a CPU. The GPU is concerned with one thing: the processing of graphics data. The CPU, by contrast, is responsible for general computations and system administration. GPUs process graphical data in the form of vertices in a geometric space. This data is converted into a 2-dimensional image for display on the monitor through a process known as the graphics pipeline[5]. The pipeline consists of a number of stages, as illustrated in Figure 2.4. At each stage, the input consists of a set of vertices that are manipulated or transformed. Data can be streamed into the graphics pipeline since vertices can be processed independently of each other. Thus, the basic operations of the graphics pipeline can be performed in parallel.

Figure 2.4: The OpenGL graphics pipeline.

Over time, the GPU has evolved into a very different component. The first generation of graphics processors were not themselves programmable[5]. Their instructions were



hard-coded in the chip-set, with data transmitted from the CPU. However, this changed with the development of the programmable GPU. The original programmable GPUs were utilized through graphics processing APIs such as OpenGL and DirectX[5]. Developers gained more control over the GPU in the next generation with graphics-specific languages such as NVIDIA's Cg[5]. Due in large part to increasing consumer demand over the years for computer graphics that continue to dazzle the eye, graphics processing units have “evolved into highly parallel, multithreaded, many-core processors with tremendous computational horsepower and very high memory bandwidth”[2]. Today's GPUs are capable of performing the necessary calculations in real time without skipping a beat, so that gamers and researchers can become immersed in the newest state-of-the-art computer games and scientific imaging applications. The tremendous power and level of control the developer has over the current generation of GPUs has made it attractive to utilize the GPU for general-purpose applications in addition to graphics processing, a practice known as general-purpose GPU (GPGPU) programming. The rising popularity of GPGPU programming has led to the newest generation of GPUs being constructed with the idea that they might not be used exclusively for graphics processing.


General Purpose Graphics Processing Unit

In recent years, a new area of parallel computing has begun to garner a good deal of attention due to its affordability and power. GPGPU programming takes the highly parallel nature of the GPU and applies it to computationally expensive algorithms. The area of GPGPU programming focuses on using programmable graphics cards for more than their original purpose; that is, for general-purpose, high-end computations. GPGPU programming came about because of the powerful nature of the GPU. Graphics processing involves a heavy volume of mathematically intensive operations to create and transform objects within a geometric space. Since the GPU traditionally only handles one aspect of the computer, there is considerably more space on the chip for data processing



and storage, to the point where the number of transistors on the GPU has, over the past five years or so, greatly surpassed the number of transistors on the CPU, as illustrated in Figure 2.5[2]. Furthermore, since Moore's Law of increasing computational power also applies to the GPU, we can expect the number of transistors on a state-of-the-art graphics processor to double approximately every two years[2]. Additionally, a powerful GPU simply resides in the computer tower along with the other pieces of hardware, as opposed to a parallel cluster, which can take up an entire room.

Figure 2.5: GPU vs. CPU speeds over time.

Taking advantage of the GPU's speed is not that simple. Early GPGPU programming methods were tedious, with high learning curves, because data had to be represented in ways completely different from typical programming methods on the CPU. Data in the GPU must be stored in the form of vertices. This adds complication to programming the GPU for tasks other than those of a graphical nature. When considering the use of GPGPU programming, it is important to weigh the advantages and disadvantages. As stated earlier, the highly parallel nature of the GPU is capable of far superior performance on certain types of computations when compared to the CPU. The price of a graphics card is another huge draw. A top-of-the-line GPU sells for only a few thousand dollars, whereas a traditional parallel computer of any reasonable

size goes for many times that.



Compute Unified Device Architecture

NVIDIA, the most prevalent graphics card producer in the computing industry, released its Compute Unified Device Architecture (CUDA) in 2006, one of the first programming models meant specifically for GPGPU programming. CUDA is an extension of the C programming language that adds syntax for working with the GPU. NVIDIA's newest generation of graphics cards are CUDA-capable, allowing GPU programming with this simple extension to C. CUDA-based graphics cards use a SIMD-like architecture referred to by NVIDIA as Single Instruction Multiple Thread (SIMT)[2]. As an application executes, each thread is mapped to one of the multiprocessor cores (8 to 128 cores per multiprocessor and up to 30 multiprocessors, depending on the card). Each thread has an id that is used to distribute the workload among them. Each core is able to run in parallel, resulting in as many as 3840 threads running simultaneously. When there are too many threads for a totally parallel execution, the scheduling of threads on cores is handled by the hardware. Because scheduling is handled on the GPU, context switches are very fast, giving the illusion of a fully parallel execution. The section of code executed by the GPU is known as the device kernel. Before this kernel can execute, data must be transferred to an allocated memory space on the GPU. To make the most of CUDA, the programmer must distribute data throughout the GPU's memory and determine what type of memory to utilize[7]. Allocating memory correctly requires a basic understanding of how the different memory types are accessed and which type is most efficient for the task at hand; if memory is not utilized correctly, performance can be even worse than if the problem were simply solved on the CPU.
When allocating memory on the device, there are primarily three kinds of memory that can be accessed: global memory, shared memory, and local registers[2]. Figure 2.6 illustrates



how the different types of memory are arranged with respect to each other. Data stored in global memory is available to all threads at once. Global data is loaded into the device from the CPU during a memory allocation stage that occurs prior to kernel execution. Memory locations that are accessed consecutively are most efficiently allocated to global memory; however, care must be taken because if different thread blocks on the GPU require read/write access to the same global memory location, there is no guarantee the value in that location is correct, due to race conditions[2]. Data stored in shared memory, by contrast, is only accessible by threads in a common block. If the data can be divided into chunks, loading them into shared memory allows for extremely fast retrieval within the thread block. Much like global memory, the developer must be careful using shared memory to avoid race conditions and the resulting data corruption. Shared memory on the most recent generation of GPUs is limited to 16 KB per block[2].

Figure 2.6: CUDA memory organization and access.

CUDA takes care of organizing the threads in the GPU. The user simply specifies how the threads are divided amongst a collection of blocks; the number of these blocks and



how they relate to each other in a grid of up to three dimensions is also specified by the user. Thread blocks have three dimensions. Taking advantage of these dimensions is useful for working with data of various sizes. There are restrictions, however: a thread block has a limit on the number of threads that can be allocated to it. CUDA allows a maximum of 512 threads per block[2]. How these threads are organized is up to the user, so a one-dimensional block of 512 threads and a two-dimensional block of 32 x 16 are both valid. Some possible thread blocks are shown in Figure 2.7. Additionally, behind the scenes, threads within a block are grouped together into what is referred to as a warp[2]. These warps are made up of at most 32 threads and always contain consecutive threads of increasing ids. The order of execution of warps is undefined; however, threads within the same block are able to synchronize with each other for safe global and shared memory access.
Figure 2.7: Some possible thread block allocations.
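The block-size arithmetic above can be checked with two small helpers (our own sketch, not code from the thesis): a block's total thread count must not exceed 512, and the hardware partitions those threads into warps of 32.

```c
/* Total threads in a block with the given dimensions. Valid CUDA
 * blocks on this generation of hardware require this to be <= 512. */
int block_threads(int x, int y, int z)
{
    return x * y * z;
}

/* Number of 32-thread warps the hardware forms from a block;
 * the last warp may be only partly filled. */
int block_warps(int x, int y, int z)
{
    int n = block_threads(x, y, z);
    return (n + 31) / 32;          /* round up to whole warps */
}
```

Both example blocks from the text, 512 x 1 and 32 x 16, total exactly 512 threads and therefore occupy 16 full warps.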

Each thread within a block is assigned a unique thread id that is determined by its



placement within the block dimensions. Typically, these thread ids play a part in the division of labor between the threads of a block. For instance, the thread with id one may be responsible for all data points in the first column of some collection. Additionally, each block of threads is assigned a unique block id that, like the thread id, determines the part of the workload the block is responsible for[2]. Blocks of threads execute a CUDA kernel. A kernel is a globally defined function that is run by all threads. The threads within a block execute the kernel independently of each other[2]. This independent execution creates the need for a way to synchronize the threads when data is shared, to ensure reliable data retrieval. Luckily, CUDA has a built-in __syncthreads() function that, when called within the kernel, forces each thread to wait at the call until all threads within the same block reach that point in the kernel[2]. Typically, __syncthreads() is needed after the threads have loaded data into shared memory and before they begin retrieving and performing computations on it. This synchronization process is illustrated in Figure 2.8.
Figure 2.8: Threads loading data into memory and synchronizing.
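The thread-id-based division of labor described in this section can be sketched in plain C (on the device, threadIdx, blockIdx, and blockDim are CUDA built-ins; this host-side helper merely mirrors the arithmetic for a one-dimensional launch):

```c
/* Global index of a thread in a 1-D CUDA launch: blocks of blockDim
 * threads are laid out one after another, so a thread's position is
 * its block's offset plus its index within the block. Each thread
 * would then use this index to select its share of the data. */
int global_thread_id(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}
```

In a kernel, thread 5 of block 2 with 256-thread blocks would operate on element 517 of the data array, with no two threads ever computing the same index.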




Image Processing

Image processing is defined as the manipulation of images. Operations on images that are considered forms of image processing include zooming, converting to gray scale, increasing or decreasing image brightness, red-eye reduction in photographs, and, in the case of this study, edge detection, as illustrated in Figures 2.9 and 2.10. These operations typically involve an exhaustive iteration over each individual pixel in an image.

Figure 2.9: Test image before edge detection.

Figure 2.10: Test image after edge detection.
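One of the simpler operations listed above, gray-scale conversion, can be sketched in a few lines (a simple channel average; the thesis does not specify the exact formula its implementation used):

```c
/* Convert one RGB pixel to gray scale by replacing each channel
 * with the average intensity of the three. The intermediate sum is
 * computed in int to avoid overflowing unsigned char. */
unsigned char to_gray(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(((int)r + (int)g + (int)b) / 3);
}
```

Like edge detection, this touches every pixel exactly once and each pixel independently, which is why such operations parallelize so cleanly.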



A common method for image processing is pixel classification. Pixel classification defines a pixel's class based on one of its features; in the case of edge detection, the feature examined is its intensity versus the intensity of its neighboring pixels. Pixel classification is not limited to edge detection alone; it is also used for converting an image to gray scale. (Gray-scale conversion is also used in our work, since we found that the edges of images run through an edge-detection algorithm were easier to discern if the images had been converted to gray scale first.) Pixel classification works as follows: for each pixel in an image, the desired feature is examined, and the pixel is modified as specified. For the Laplacian edge detection method, this process is defined by an image kernel. (Note: this kernel is unrelated to the CUDA kernel executed on the GPU.) The Laplacian image kernel is a 3 x 3 two-dimensional array, as shown in Figure 2.11. This kernel is applied to each pixel in the image and takes into account the pixel's neighbors in a 3 x 3 area around it. Given the pixel identified as x_{i,j} and the kernel k, the formula for the new value of x_{i,j} is as follows:

\[
\mathrm{out}_{i,j} \;=\; \sum_{u=-1}^{1}\;\sum_{v=-1}^{1} x_{i+u,\,j+v}\; k_{i+u,\,j+v}
\]

where k_{i+u,j+v} denotes the kernel entry aligned with neighbor x_{i+u,j+v}.
This algorithm is non-trivial for large images, as the calculation must be performed three times for each pixel in the image, once each for the red, green, and blue values. The number of computations that must be performed, along with the ability to represent the data as a two-dimensional array, indicated that edge detection would greatly benefit from a parallel implementation on the GPU with CUDA.
Kernel:

    -1 -1 -1
    -1  8 -1
    -1 -1 -1

Pixel neighborhood:

    x_{i-1,j-1}  x_{i,j-1}  x_{i+1,j-1}
    x_{i-1,j}    x_{i,j}    x_{i+1,j}
    x_{i-1,j+1}  x_{i,j+1}  x_{i+1,j+1}

Figure 2.11: Laplacian edge detection.
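As a sequential sketch (plain C, single channel; the thesis applies the same formula once per red, green, and blue channel), applying the kernel at one interior pixel looks like this:

```c
/* The 3x3 Laplacian kernel from Figure 2.11. */
static const int laplacian[3][3] = {
    { -1, -1, -1 },
    { -1,  8, -1 },
    { -1, -1, -1 },
};

/* Apply the kernel at interior pixel (i, j) of a single-channel image
 * stored row-major with the given width, clamping the result to
 * [0, 255]. Border pixels would need separate handling. */
int laplacian_at(const unsigned char *img, int width, int i, int j)
{
    int out = 0;
    for (int u = -1; u <= 1; u++)
        for (int v = -1; v <= 1; v++)
            out += img[(j + v) * width + (i + u)] * laplacian[v + 1][u + 1];
    if (out > 255) out = 255;      /* clamp rather than wrap mod 256 */
    if (out < 0)   out = 0;
    return out;
}
```

A uniform region yields 0 (no edge), while a pixel that differs sharply from its neighbors produces a large magnitude, which is exactly the behavior an edge detector needs.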

Chapter 3

GPU Edge Detection Algorithms
The organization of multidimensional thread blocks into multidimensional grids makes CUDA development well suited to the processing of arrays of data. As a result, data that can easily be represented in these forms is usually best suited for migration to CUDA for processing. Such data types include images, which can be represented as a two-dimensional matrix where each entry corresponds to a single pixel in the image. An image pixel consists of discrete red, green, and blue components in the range [0 . . . 255]. To develop a CUDA parallel algorithm for Laplacian edge detection, we took two approaches. The first was straightforward in its data distribution scheme and organization of thread blocks, while the second took a new approach in an attempt to increase efficiency within thread blocks.


One Pixel Per Thread Algorithm

Our first implementation of the Laplacian edge detection algorithm using CUDA is fairly straightforward. We create a two-dimensional grid that is overlaid on the image, segmenting it into several rectangular sections, as illustrated in Figure 3.1. For simplicity, we assume the image can be evenly divided into full-sized segments. Processing images whose dimensions do not divide evenly would not be an overly complicated addition to the application.




Each thread within the thread block corresponds to a single pixel within the image. However, each thread is not necessarily responsible for loading only one pixel entry into shared memory. The nature of the Laplacian pixel-group processing method for edge detection requires that the 3x3 area surrounding the target pixel be analyzed to calculate the output. Therefore, threads on the edge of a thread block must examine pixels that are outside the dimensions of the thread block. In order to ensure the accuracy of the output image, these threads are responsible for loading into shared memory the adjacent pixels that do not have a mapping in the thread block. That is, the threads on the edge of the block load the boundary pixels into shared memory. This extra step is performed after the initial shared memory load that all threads perform. To accommodate the required extra space, the two-dimensional shared memory array is allocated with dimensions of (blockDim.x + 2, blockDim.y + 2). This allocates two additional rows and two additional columns of shared memory.



Figure 3.1: Thread blocks for single pixel per thread method.

Once the block has loaded its respective section of the target image into shared memory, __syncthreads() is called so that the threads can regroup before proceeding. With the integrity of the data verified, the kernel then proceeds with the convolution of the image.



__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    // Shared tile with a one-pixel halo on every side. Shared arrays
    // require compile-time sizes, so BLOCK_DIM_X and BLOCK_DIM_Y are
    // constants matching the launch configuration (blockDim.x, blockDim.y).
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][BLOCK_DIM_Y + 2];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);

    // Load this block's section of the image, including boundary pixels.
    loadImageBlock(sData, pixelsIn, ndx);
    __syncthreads();

    /***** solve *****/
    int3 value = make_int3(0, 0, 0);
    for (int u = -1; u < 2; u++) {
        for (int v = -1; v < 2; v++)
            convolve(value, devKernel[(u + 1) * 3 + (v + 1)].x *
                            sData[tx + 1 + u][ty + 1 + v]);
    }

    /***** clamp RGB values *****/
    clampRGB(value);
    pixelsOut[ndx] = value;
}

For each thread in the block we iterate through the convolution kernel and the 3x3 pixel group in which the target pixel is the center element. After applying the convolution formula, the pixel's red, green, and blue values may fall outside the range [0 . . . 255]. We fix this by clamping the RGB values: any value greater than 255 is set to 255, and any value less than 0 is set to 0. Without this clamping step, the stored color would be the value mod 256, which would produce an incorrect pixel value. As pixels are calculated, they are stored in the out-pixel array that belongs to the designated output image. Once the CUDA kernel has finished executing, the allocated memory within the GPU is freed and the program exits.


Multiple Pixels Per Thread Algorithm

Our second implementation of the edge detection algorithm takes a different approach to the data distribution aspect of the problem. Instead of creating two-dimensional thread blocks that map directly onto the target image, we create a series of one-dimensional thread blocks, each responsible for a two-dimensional shared memory space, as illustrated in Figure 3.2.



Figure 3.2: Thread blocks for multiple pixels per thread method.

The basic flow of the second GPU implementation works the same as the initial one. After allocating GPU memory and copying the source image to the GPU, the threads are responsible for copying the global memory into shared memory. Because each block is responsible for more pixels than it has threads, iteration through the image is necessary. A number of for loops are used within the CUDA kernel to cycle through the portions of the image each block is responsible for. To maximize efficiency there is one for loop per stage in the algorithm, which allows each stage to complete before moving on rather than alternating between loading data into shared memory and solving. Loading data into the 2-D shared array works similarly to the first implementation. An initial load is done of the pixel that maps directly to a thread in the block. The kernel then iterates down the image segment within a for loop, incrementing the index by width in each iteration, where width is the width of the image. The first and last threads in the block are responsible for loading the left and right boundaries of the shared memory block, and all threads load the pixels they correspond to in the upper and lower boundaries.

Following the successful transfer of data to shared memory, the computation is nearly identical to the first implementation. The main difference, because far fewer CUDA threads are executing, is that the calculations must be done for one row of the shared memory space at a time. After each pixel is determined through the convolution process, it is stored temporarily in the shared 2-D array until the entire convolution algorithm is completed. We can store the end-pixel in the shared memory space without corrupting the data needed for the next computation because the algorithm iterates through the shared array one row at a time. The convolution process only requires knowledge of a pixel's immediate neighbors in a 3x3 region; therefore, once one row is completed, the kernel no longer needs the row above it in any future calculations. This allows the kernel to store the newly determined pixels in the previous row of shared memory with impunity, so long as __syncthreads() is invoked to ensure thread synchronization. Once all threads have finished convolution for the entire array of shared memory, the end-pixels stored in shared space are copied to the output pixel array.

__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][SHARED_SIZE_Y + 2];

    int tx = threadIdx.x;
    int3 value = make_int3(0, 0, 0);
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);

    /***** load shared data *****/
    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadImageBlock(sData, pixelsIn, ndx, i);

    /***** sync after shared memory load *****/
    __syncthreads();

    for (int i = 1; i < SHARED_SIZE_Y - 1; i++) {
        value.x = value.y = value.z = 0;
        for (int u = -1; u < 2; u++) {
            for (int v = -1; v < 2; v++)
                convolve(value, devKernel[(u + 1) * 3 + (v + 1)].x *
                                sData[tx + 1 + u][i + v]);
        }

        /***** clamp RGB values *****/
        clampRGB(value);

        /**** make sure all threads are done with this round of data ****/
        __syncthreads();

        /**** store calculated pixel values in shared memory that ****/
        /**** will no longer be used                              ****/
        sData[tx + 1][i - 1] = value;
    }

    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadOutImage(sData, pixelsOut, ndx, i);
}


Chapter 4

Evaluation and Results
To evaluate the parallel GPU algorithms, we performed several tests on images and compared the times to those of the sequential implementation. The before and after sputum smear images can be seen in Figures 4.1 and 4.2. The results of these tests showed an impressive speedup of our parallel algorithms over the sequential version. In addition, we investigated the ideal number of threads per block to maximize the efficiency of our algorithms.



Method

We evaluated our algorithms on three different GPUs, whose specifications are given in Table 4.1. All machines run Fedora 9 Linux and use CUDA 2.0. Each implementation has a C++ driver whose main method processes the arguments from the user, opens or creates the image objects to be manipulated, and initializes the convolution kernel. This driver then invokes the CUDA source code file, which in turn calls the CUDA kernel (the GPU device function) that executes the respective algorithm described above. Our edge detection algorithms were evaluated on a collection of example images of sputum smears provided by the University of Cape Town research group. Each image had dimensions of 1280x968 pixels, and the block dimensions were adjusted accordingly so that they would fit the image evenly. To make the detected edges in the output image as stark as possible, it was usually best to first convert the image to gray-scale. Since edge detection works by detecting sudden changes in pixel brightness levels, we observed that a gray-scale image works best for an edge detection algorithm. Converting the image to gray-scale helped bring out the contrast at object edges, making them more distinguishable than in color images.

                                   GeForce 8400GS   GeForce 8500GT   GeForce 9800GTX
    Total graphics memory          512 MB           512 MB           512 MB
    Number of multiprocessors      1                2                8
    Number of cores                8                16               128
    Clock rate                     1.4 GHz          0.92 GHz         1.89 GHz
    Concurrent copy and execution  No               Yes              Yes

Table 4.1: Specifications of the NVIDIA graphics cards.



Results

Results differed, as expected, depending on the hardware being utilized. On average, the speedup of the GPU algorithm over the sequential algorithm was about an order of magnitude on the fastest GPU running the single pixel per thread algorithm, and two to three orders of magnitude with the multiple pixels per thread algorithm, as illustrated in Figures 4.3 and 4.4. Running on the slowest machine, the 8400GS, times averaged only about 20 ms faster than the sequential algorithm under the single pixel per thread algorithm, with a 2.5x speedup under the multiple pixels per thread algorithm. As can be seen in Figure 4.5, the multiple pixels per thread algorithm was consistently and considerably faster than the single pixel per thread algorithm. After showing that these implementations were far superior to the sequential one, we wanted to investigate how much impact different-sized thread blocks have on execution. We ran each algorithm a number of times with thread block sizes starting at 32 and doubling each time, stopping at 256, as illustrated in Figures 4.6 and 4.7.

Figure 4.1: Original sputum smear image.



Figure 4.2: Sputum smear after being run through edge detection algorithm.






Figure 4.3: Sequential time vs one pixel per thread GPU algorithm; 32 threads.








Figure 4.4: Sequential time vs multiple pixels per thread GPU algorithm; 32 threads.



Figure 4.5: Comparison of GPU algorithms on different machines; 64 threads.






Figure 4.6: Differences in computation time with different sized thread blocks; one pixel per thread algorithm.


Figure 4.7: Differences in computation time with different sized thread blocks; multiple pixels per thread algorithm.

Chapter 5

Conclusions

The GPU is an impressively powerful piece of hardware that has become well suited to parallel applications. We have shown that edge detection, especially where speed is of the essence, is an excellent candidate for one of these applications. There are also a number of possible projects that can be taken up as future work with this thesis as a foundation.



GPGPU programming is a powerful tool that, when applied correctly, can give impressive results. In this thesis we have described two possible load-balancing algorithms for performing edge detection and compared them to their sequential counterpart. Our findings indicate that our second implementation, the multiple pixels per thread method, was significantly more efficient than the single pixel per thread method.


Future Work

Edge detection is only one part of the auto-diagnosis project that the South African research group is undertaking. Further investigation of aspects of their research that could be implemented in GPGPU programming with CUDA has the potential to make automatic diagnosis an even more attractive project. Parts of their research with potential for benefit include the auto-focus algorithm for the microscope and the actual diagnosis of the image after it has gone through the edge detection process.

GPUs are intended for working with data streams. Future work could involve investigating the possibility of streaming multiple images into the GPU for edge detection. Creating data streams would enable copying of data to the device while the prior image is being processed [2]. Working with multiple images over the course of a single execution would likely make working with CUDA even more efficient than it already is.

GPU programming has the potential to be just as effective at performing fast computations as traditional parallel computing. Further investigation into this process could produce a very fast, relatively inexpensive method for efficient medical computations and image processing. With this technology, it is feasible that medical care could be improved in areas without access to expensive hospitals or medical experts.

[1] Tuberculosis Fact Sheets, 2007.

[2] CUDA Programming Guide. NVIDIA Corporation, Santa Clara, CA, 2009.

[3] Gregory A. Baxes. Digital Image Processing: Principles and Applications. John Wiley and Sons, Inc., New York, NY, 1994.

[4] Tania S. Douglas, Rethabile Khutlang, Sriram Krishnan, Andrew Whitelaw, and Genevieve Learmonth. Image segmentation for automatic detection of tuberculosis in sputum smears, 2008.

[5] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley, New York, NY, 2003.

[6] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition. The MIT Press, Cambridge, MA, 1999.

[7] Tom R. Halfhill. Parallel processing with CUDA. Microprocessor Report, 2008.

[8] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts, sixth edition. John Wiley and Sons, Inc., New York, NY, 2002.

[9] Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, Upper Saddle River, NJ, 2005.

