Jan 31



Hello everyone today we are talking about pycuda , I will tell you my experience and will show you some tests! Let s jump into it!

Everything started a couple of weeks ago with the start of an image processing course on Coursera.com , I actually decided to just had a quick look around on the course rather than really follow it, but it is really interesting

so I am taking the course in order to get the certificate. I hope I can manage it with the other 3 courses I am taking , but so far looks good.

The first week we just did some theory of image processing so I decided to have a quick look and mess around , you can actually read what I did in my previous post here.

It was quite a straight forward task , but the computation was talking like 5-6 seconds for a 1440×900 picture, so I decided to try PyCuda and doing the processing on my GPU.

It was not too easy to make Pycuda up and running but with some googling I got it done , I will actually make a post about that in the near future.

Pycuda is deeply boundwith Numpy , in order to send stuff on the GPU you can create GPU data like gpu_array and so on or create Numpy data that have straight conversion to C data.

That was the source of all my problems and frustration . It took me a while to figure out how the data was moved to the GPU and back and also understand the way Numpy handles

the arrays /matrices.

In the way I am used to work with matrices , I would have loved to parse the image by going left to right and top to bottom (row major style) , instead Numpy was converting straight from a PIL

image object to a column major matrix, which is ok , I am just not so used to, when you copy the array on the gpu memory you don’t get the same matrix shape.

I was using a 3 dimensional matrix on the host , something like matrixImage[column][row][rgb], instead  on the gpu I got a matrixImage[column*row*rgb] , which is a 1D matrix, took me a while to figure

out that actually , luckily I found a guy explaining it in a random stackoverflow post!

Once all those small technicalities were settled I could finally work on the kernel which was uber simple for a black and withe effect.

Here is the code I used , in order to make it run you need to have :

– pillow , or PIL

– numpy

– Pycuda (and its dependencies


import PIL
from PIL import Image
import time

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

def blackWhite(inPath , outPath , mode = "luminosity",log = 0):

    if log == 1 :
        print ("----------> SERIAL CONVERSION")
    totalT0 = time.time()

    im = Image.open(inPath)
    px = numpy.array(im)

    getDataT1 = time.time()

    print ("-----> Opening path :" , inPath)

    processT0 =  time.time()
    for x in range(im.size[1]):
        for y in range(im.size[0]):

            r = px[x][y][0]
            g = px[x][y][1]
            b = px[x][y][2]
            if mode == "luminosity" :
                val =  int(0.21 *float(r)  + 0.71*float(g)  + 0.07 * float(b))

            else :
                val = int((r +g + b) /3)

            px[x][y][0] = val
            px[x][y][1] = val
            px[x][y][2] = val

    processT1= time.time()
    #px = numpy.array(im.getdata())
    im = Image.fromarray(px)

    print ("-----> Saving path :" , outPath)
    totalT1 = time.time()

    if log == 1 :
        print ("Image size : ",im.size)
        print ("get and convert Image data  : " ,getDataT1-totalT0 )
        print ("Processing data : " , processT1 - processT0 )
        print ("Save image time : " , totalT1-processT1)
        print ("total  Execution time : " ,totalT1-totalT0 )
        print ("\n")

def CudablackWhite(inPath , outPath , mode = "luminosity" , log = 0):

    if log == 1 :
        print ("----------> CUDA CONVERSION")

    totalT0 = time.time()

    im = Image.open(inPath)
    px = numpy.array(im)
    px = px.astype(numpy.float32)

    getAndConvertT1 = time.time()

    allocT0 = time.time()
    d_px = cuda.mem_alloc(px.nbytes)
    cuda.memcpy_htod(d_px, px)

    allocT1 = time.time()

    #Kernel declaration
    kernelT0 = time.time()

    #Kernel grid and block size
    BLOCK_SIZE = 1024
    block = (1024,1,1)
    checkSize = numpy.int32(im.size[0]*im.size[1])
    grid = (int(im.size[0]*im.size[1]/BLOCK_SIZE)+1,1,1)

    #Kernel text
    kernel = """

    __global__ void bw( float *inIm, int check ){

        int idx = (threadIdx.x ) + blockDim.x * blockIdx.x ;

        if(idx *3 < check*3)
        int val = 0.21 *inIm[idx*3] + 0.71*inIm[idx*3+1] + 0.07 * inIm[idx*3+2];

        inIm[idx*3]= val;
        inIm[idx*3+1]= val;
        inIm[idx*3+2]= val;

    #Compile and get kernel function
    mod = SourceModule(kernel)
    func = mod.get_function("bw")
    func(d_px,checkSize, block=block,grid = grid)

    kernelT1 = time.time()

    #Get back data from gpu
    backDataT0 = time.time()

    bwPx = numpy.empty_like(px)
    cuda.memcpy_dtoh(bwPx, d_px)
    bwPx = (numpy.uint8(bwPx))

    backDataT1 = time.time()

    #Save image
    storeImageT0 = time.time()
    pil_im = Image.fromarray(bwPx,mode ="RGB")

    print ("-----> Saving path :" , outPath)

    totalT1 = time.time()

    getAndConvertTime = getAndConvertT1 - totalT0
    allocTime = allocT1 - allocT0
    kernelTime = kernelT1 - kernelT0
    backDataTime = backDataT1 - backDataT0
    storeImageTime =totalT1 - storeImageT0
    totalTime = totalT1-totalT0

    if log == 1 :
        print ("Image size : ",im.size)
        print ("get and convert Image data to gpu ready : " ,getAndConvertTime )
        print ("allocate mem to gpu: " , allocTime )
        print ("Kernel execution time : " , kernelTime)
        print ("Get data from gpu and convert : " , backDataTime)
        print ("Save image time : " , storeImageTime)
        print ("total  Execution time : " ,totalTime )
        print ("\n")

As you can see there is not too much stuff going on, but I know you guys want to see some benchmarsk 😀 and I am not going to disappoint you!

Before starting here is a couple of stats on my hardware


i7 4900MQ  full specks here : http://ark.intel.com/products/75131


NVidia GTX 780 M  : full specs here


I started by working on a 1K images , here down you can see the source , the serial result and the cuda result , for full image click on it.




Here is some nice debug prints outline the performance :


-----> Opening path : C:/Users/Marco/Desktop/jessTest.jpg
-----> Saving path : C:/Users/Marco/Desktop/jessTestLuminosity.jpg
Image size : (1440, 900)
get and convert Image data : 0.04000210762023926
Processing data : 4.67326807975769
Save image time : 0.03500199317932129
total Execution time : 4.748272180557251

-----> Saving path : C:/Users/Marco/Desktop/jessTestLuminosityCuda.jpg
Image size : (1440, 900)
get and convert Image data to gpu ready : 0.042001962661743164
allocate mem to gpu: 0.006000041961669922
Kernel execution time : 0.04200291633605957
Get data from gpu and convert : 0.010999917984008789
Save image time : 0.03200197219848633
total Execution time : 0.13300681114196777

OOOOOH YEAAAH that s the good stuff 😀 a nice ~40X performance speed up , that s quite impressive considering that the computing part of the black and white filter is so small that is not really
taking full advantage of the GPU.

3K IMAGE image3K


-----> Opening path : C:/Users/Marco/Desktop/image3K.jpg
-----> Saving path : C:/Users/Marco/Desktop/image3KBW.jpg
Image size : (2494, 1663)
get and convert Image data : 0.07900404930114746
Processing data : 14.747843980789185
Save image time : 0.10500597953796387
total Execution time : 14.931854009628296

-----> Saving path : C:/Users/Marco/Desktop/image3KBWCuda.jpg
Image size : (2494, 1663)
get and convert Image data to gpu ready : 0.09000515937805176
allocate mem to gpu: 0.014000892639160156
Kernel execution time : 0.03500199317932129
Get data from gpu and convert : 0.03200197219848633
Save image time : 0.10600614547729492
total Execution time : 0.27701616287231445

I am gonna shoot out the rest of the test and then we can discuss about it 😀

Remember the order of the pictures is : source , serial , cuda.

Harry Potter and the Deathly Hallows Part 1


-----> Opening path : C:/Users/Marco/Desktop/image4K.jpg
-----> Saving path : C:/Users/Marco/Desktop/image4KBW.jpg
Image size : (4096, 2440)
get and convert Image data : 0.17701005935668945
Processing data : 35.46102809906006
Save image time : 0.2350139617919922
total Execution time : 35.87305212020874

-----> Saving path : C:/Users/Marco/Desktop/image4KBWCuda.jpg
Image size : (4096, 2440)
get and convert Image data to gpu ready : 0.21601200103759766
allocate mem to gpu: 0.029001951217651367
Kernel execution time : 0.03500223159790039
Get data from gpu and convert : 0.07800483703613281
Save image time : 0.23401308059692383
total Execution time : 0.592034101486206

Lets move forward!



-----> Opening path : C:/Users/Marco/Desktop/image5K.jpg
-----> Saving path : C:/Users/Marco/Desktop/image5KBW.jpg
Image size : (5184, 3456)
get and convert Image data : 0.27501511573791504
Processing data : 67.43985795974731
Save image time : 0.39202213287353516
total Execution time : 68.10689520835876
-----> Saving path : C:/Users/Marco/Desktop/image5KBWCuda.jpg
Image size : (5184, 3456)
get and convert Image data to gpu ready : 0.33801889419555664
allocate mem to gpu: 0.0760049819946289
Kernel execution time : 0.22401213645935059
Get data from gpu and convert : 0.18601083755493164
Save image time : 0.40102314949035645
total Execution time : 1.2250699996948242

Hell yeah 😀 I found a 9K image and I am not afraid to use it ! The original png I downloaded was  ~160MB!
The actual JPG i used in the code was 50 mb and the site doesn’t let me upload that size , so I had to compress it to a 70% but the processed Image is coming out
from the HD version aaand I actually lost the one converted serially so I will just add the cuda one .


And here the LOG for the fat 9k resolution.

-----> Opening path : C:/Users/Marco/Desktop/image9K.png
-----> Saving path : C:/Users/Marco/Desktop/image9KBW.png
Image size : (9000, 4637)
get and convert Image data : 0.6430368423461914
Processing data : 144.94929099082947
Save image time : 11.3296480178833
total Execution time : 156.92197585105896
-----> Saving path : C:/Users/Marco/Desktop/image9KBWCuda.jpg
Image size : (9000, 4637)
get and convert Image data to gpu ready : 2.886164903640747
allocate mem to gpu: 0.12500715255737305
Kernel execution time : 0.037001848220825195
Get data from gpu and convert : 0.3350191116333008
Save image time : 1.0140578746795654
total Execution time : 4.3972508907318115

Alright , as you guys can see the GPU totally nailed the CPU and of course that was expcected , even though not on that extend , I expected  the conversion and memory “travels”
back and forth from host and device memory was going to be a huge factor, but if we see the log actually the major part of the time is spent in coverting the data in a way
that first the gpu can understand it and then in a way PIL can save the image.

Here is a plot of the performances of CPU vs GPU for this task


Looks like from those simple data the gain in performance is not completely linear , but again those are simple data , I should have taken like 20 -30 runs of the same code and take an average
of it so what you see here might not be 100% true , but I can confidently say a good 90% is .

Here a plot representing the time spent in data conversion :

The time spent in converting the data for the 5K compared to the 9k is quite bigger , that might also be due to the implementation of the numpy arrays.

Here another plot of the memory allocation time :

Here everything looks pretty normal no surprises .
The surprises actually starts here in the kernel execution time plot :

chart_kernelAgain this data is not completely accurate but if we look the deltas between each sample we can say that the kernel pretty much took the same time on all the pictures,
Now that might be due to the python clock not being able to have enough fine time sampling? or simply because I did not reach full usage of the GPU yet so It can still easily handle that amount
of pixels without breaking a sweat.

Last thing I wanted to check was if I actually got the same exact result out of the CPU and GPU , so I subtracted the 1k images and that s the result :


No , your monitor is not dirty , the thing you see in the image is a small delta between the two images , now beware that I actually bumped up the delta for a factor of 10 in order to make it more visible ,

so It s pretty much nothing, now if I have to say where this delta is coming from I would say most likely from the many conversions back and fort that happened to the memory during the whole process?

If you have an idea don’t be shy and shoot it out!

That s it folks ! I hope you enjoyed that as much as I did (the good and bad times of the development XD).

I am still a noob in this huge and wild world of massive parallel computing so any feedback is more than welcome !



Leave a Reply