Jan 06

Having fun with CUDA... again

Hello there, today we are talking about CUDA again. Some time ago you saw me talking about PyCUDA in this post; today I did something similar, but with a more serious approach.

I decided to get better at C++, parallel computation and Linux, and this led me to start this learning project about image processing.

The first step was of course being able to read an image and convert the data into something that can be easily processed by both the CPU and the GPU. It also gave me an excuse to learn how to read a binary file.

I decided to go with the BMP image format because it is one of the simplest around: the pixel data is usually stored with no compression, behind just a simple header.

Now don't get me wrong, I am not trying to re-invent the wheel at all. I just made a simple BMP class to learn the process. The class itself is quite limited feature-wise: it just lets me load and save an image or, if needed, generate an empty buffer and then save it. As a reference I used this awesome article, be sure to have a look at it.
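To give a rough idea of what "reading the header" involves, here is a minimal sketch of parsing the fields needed to locate the pixels. It is not the class from this post: `BmpInfo`, `read_u16`, `read_u32` and `parse_bmp_header` are names made up for illustration, and it assumes the common layout of a 14-byte file header followed by a 40-byte `BITMAPINFOHEADER`, with no compression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical struct for illustration; not the class described in the post.
struct BmpInfo {
    std::uint32_t pixel_offset;    // byte offset of the pixel data in the file
    std::int32_t width;            // in pixels
    std::int32_t height;           // positive means rows are stored bottom-up
    std::uint16_t bits_per_pixel;  // 24 for plain BGR images
    std::uint32_t row_stride;      // bytes per row, padded to a multiple of 4
};

// BMP stores integers little-endian. Reading them byte by byte avoids any
// dependency on struct packing or on the endianness of the host.
inline std::uint16_t read_u16(const std::vector<std::uint8_t>& b, std::size_t at) {
    assert(at + 2 <= b.size());
    return static_cast<std::uint16_t>(b[at] | (b[at + 1] << 8));
}

inline std::uint32_t read_u32(const std::vector<std::uint8_t>& b, std::size_t at) {
    assert(at + 4 <= b.size());
    return static_cast<std::uint32_t>(b[at]) |
           (static_cast<std::uint32_t>(b[at + 1]) << 8) |
           (static_cast<std::uint32_t>(b[at + 2]) << 16) |
           (static_cast<std::uint32_t>(b[at + 3]) << 24);
}

inline BmpInfo parse_bmp_header(const std::vector<std::uint8_t>& bytes) {
    // 14-byte file header + 40-byte info header = 54 bytes minimum.
    if (bytes.size() < 54 || bytes[0] != 'B' || bytes[1] != 'M') {
        throw std::runtime_error("not a BMP file");
    }
    if (read_u32(bytes, 30) != 0) {
        throw std::runtime_error("compressed BMP is not handled in this sketch");
    }
    BmpInfo info{};
    info.pixel_offset = read_u32(bytes, 10);
    info.width = static_cast<std::int32_t>(read_u32(bytes, 18));
    info.height = static_cast<std::int32_t>(read_u32(bytes, 22));
    info.bits_per_pixel = read_u16(bytes, 28);
    if (info.width <= 0) {
        throw std::runtime_error("invalid width");
    }
    // Each row is padded so that its length is a multiple of 4 bytes.
    info.row_stride =
        ((static_cast<std::uint32_t>(info.width) * info.bits_per_pixel + 31) / 32) * 4;
    return info;
}
```

The bytes can come from a `std::ifstream` opened with `std::ios::binary`. The row padding is the detail that is easiest to miss: a 5 pixel wide, 24-bit image has 15 bytes of data per row but occupies 16 in the file.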

Once the BMP was out of the way I could finally start working on the image processing itself. The first step was the most basic and simple filter, black and white. I first implemented the serial version, then the TBB version and finally the CUDA version. I am not sharing the code yet, just because I don't believe it is decent enough to be shown, but I will do that soon and you will see it popping up. On to the timings.


Let's start with the black and white filter. The first thing to notice is that there is no big difference between CUDA and TBB: such a small computation is hardly worth all the data management that CUDA needs (allocating and copying memory, etc.), and we only start getting more speedup as we move to heavier images. Another interesting thing was that when I tried to run 8 threads with TBB (i.e. using hyper-threading), I actually got lower performance than when running just 4 threads. Unluckily I don't know enough about hyper-threading to say why, but I plan on looking into it a little.
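To see how little work this filter does per pixel, here is a minimal serial sketch. It is not the code behind these timings: the function name `to_grayscale`, the tightly packed B, G, R buffer and the Rec. 601 luma weights are all assumptions, since the post does not say which formula was used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved 3-channel, 8-bit buffer to grayscale in place.
// Assumptions of this sketch: channels are ordered B, G, R (as in 24-bit BMP),
// rows carry no padding, and the weights are the common Rec. 601 coefficients.
inline void to_grayscale(std::vector<std::uint8_t>& pixels) {
    assert(pixels.size() % 3 == 0);
    for (std::size_t i = 0; i < pixels.size(); i += 3) {
        const float b = pixels[i];
        const float g = pixels[i + 1];
        const float r = pixels[i + 2];
        // Weighted sum, rounded to the nearest integer.
        const auto gray =
            static_cast<std::uint8_t>(0.299f * r + 0.587f * g + 0.114f * b + 0.5f);
        pixels[i] = pixels[i + 1] = pixels[i + 2] = gray;
    }
}
```

Every pixel is read once, gets three multiplications and is written back, and no pixel depends on any other. That independence is what makes the loop easy to split across threads or GPU blocks, but it also means the amount of arithmetic is tiny compared to the bytes that have to be moved around.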

Here is a chart of the running times:





Everything looks fairly standard and the increase in time is quite linear, even though the gap between 4k and 5k is not that big. We will see later that this particular spot holds some surprises for the blur as well.

Speaking of blur, since it involves a bit more computation we start getting way better performance out of CUDA. Let's see the charts for the timings.





As you can see, here we start to get some nice performance out of the GTX 780M: a ~76x speedup compared to the serial code and a ~21x speedup compared to TBB on a 9k image.
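The post does not say which blur was timed, so the following is only a sketch of the general idea, assuming a 3x3 box blur on a single-channel image with clamped borders. The name `box_blur_3x3` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 box blur on a single-channel image stored row by row.
// Pixels outside the image are replaced by the nearest pixel inside it
// (clamping), which is one of several possible border policies.
inline std::vector<std::uint8_t> box_blur_3x3(const std::vector<std::uint8_t>& src,
                                               int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    // A separate output buffer is needed: every pixel reads its neighbours,
    // so writing in place would mix blurred and unblurred values.
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = std::max(0, std::min(x + dx, width - 1));
                    const int sy = std::max(0, std::min(y + dy, height - 1));
                    sum += src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            // Average of the 9 samples, rounded to the nearest integer.
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>((sum + 4) / 9);
        }
    }
    return dst;
}
```

Compared to a black and white filter, each output pixel now needs nine reads instead of one, while the data copied to and from the GPU stays the same size. More work per byte transferred is the kind of ratio where a GPU starts to pay off.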

It is all good and nice, except for the serial execution between 4k and 5k: you can see that the 5k image actually runs faster than the 4k one. Don't ask me why, because I have no clue. At the beginning I thought it was an error, but I ran the program two more times for both the 4k and the 5k image and the results were the same. My only guess is that something in how the data sits in memory made the 4k image particularly slow for some reason.

Other than that, I was quite pleased to see the GPU starting to crunch as it is supposed to, but I am not stopping here. The plans for the future are the following. First I want to implement a Gaussian blur with a radius parameter, which will force me to implement a dynamically sized stencil and a generic convolution kernel that uses it. I also came across this page, where they show the effects of different stencils, like sharpening or edge detection. Once the generic kernel is implemented it will be fairly easy to play with those.
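As a sketch of what that plan could look like on the CPU side (this is not code from the project), the stencil can be a small weight table whose size follows from the radius, and the convolution a loop that works with any such table. The names `Stencil`, `make_gaussian_stencil` and `convolve` are hypothetical, and tying sigma to half the radius is just a convention picked for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Square stencil of side 2 * radius + 1, weights stored row by row.
struct Stencil {
    int radius;
    std::vector<float> weights;
};

// Build a normalized Gaussian stencil for the given radius.
// sigma = radius / 2 (with a floor of 0.5) is an assumption of this sketch.
inline Stencil make_gaussian_stencil(int radius) {
    assert(radius >= 0);
    const int side = 2 * radius + 1;
    const float sigma = std::max(radius / 2.0f, 0.5f);
    Stencil s{radius, std::vector<float>(static_cast<std::size_t>(side) * side)};
    float total = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
            s.weights[static_cast<std::size_t>(dy + radius) * side + (dx + radius)] = w;
            total += w;
        }
    }
    for (float& w : s.weights) w /= total;  // weights sum to 1, brightness is kept
    return s;
}

// Apply any stencil to a single-channel float image, clamping at the borders.
// The stencil is not flipped, which makes no difference for symmetric ones.
inline std::vector<float> convolve(const std::vector<float>& src, int width, int height,
                                   const Stencil& s) {
    const int side = 2 * s.radius + 1;
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(s.weights.size() == static_cast<std::size_t>(side) * side);
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int dy = -s.radius; dy <= s.radius; ++dy) {
                for (int dx = -s.radius; dx <= s.radius; ++dx) {
                    const int sx = std::max(0, std::min(x + dx, width - 1));
                    const int sy = std::max(0, std::min(y + dy, height - 1));
                    acc += src[static_cast<std::size_t>(sy) * width + sx] *
                           s.weights[static_cast<std::size_t>(dy + s.radius) * side +
                                     (dx + s.radius)];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = acc;
        }
    }
    return dst;
}
```

With this shape, switching effects is only a matter of passing a different table. For instance `Stencil{1, {0, -1, 0, -1, 5, -1, 0, -1, 0}}` is a common 3x3 sharpening stencil, and the same `convolve` call handles it unchanged.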

The road map also includes some more complex filters. Or better, I hope to find some filters that are more computationally expensive but not necessarily super complex math-wise, since I don't want to dive too deep into image processing yet. The last step will be to add a nice UI where the user can load an image, display it using OpenGL, then stack different filters on top of each other and see the result (hopefully) in real time.

Once that is done I will be ready to jump back on my OpenGL sandbox, make it way more robust and start implementing some more cool stuff.

That is it guys, here below is a small video demo of me running the program.

Parallel Image: simple image processing program from Marco Giordano on Vimeo.



