Sunday, 28 March 2010

CUDA 3.0 cubin Files

It appears that NVidia has changed the format of cubin files in CUDA 3.0 to a standard binary ELF format. Here's what they say in the release notes:

  • CUDA C/C++ kernels are now compiled to standard ELF format
You can find out about ELF files at the Wikipedia entry. In previous releases the partially compiled .cubin files were plain, readable text and could be added into a library as a string resource. If you opened an old cubin file in Visual Studio, it looked something like this:

architecture {sm_10}
abiversion   {1}
modname      {cubin}
code {
    name = cuFunction_Laser
    lmem = 0
    smem = 44
    reg  = 6
    bar  = 0
    bincode {
        0x10004209 0x0023c780 0x40024c09 0x00200780
        0xa000000d 0x04000780 0x20000411 0x0400c780
        0x3004d1fd 0x642107c8 0x30000003 0x00000500
        ...
blah..blah..blah

    }
}


Rather than ship cubin files alongside my libraries, I have always built them into the library binary as a string resource and then used the Windows API functions such as FindResource and LoadResource to get a pointer to the string. This pointer is then passed to the CUDA cuModuleLoadDataEx function for final compilation into GPU code.
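For anyone doing something similar, here is a minimal sketch of that approach. The resource name and type ("CUBIN_RESOURCE", RT_RCDATA) are made up for illustration, error handling is kept to a bare minimum, and a CUDA driver API context is assumed to have been created already - it is not the exact code from my library.

#include <windows.h>
#include <cuda.h>

// Load a cubin embedded as a resource and hand it to the CUDA driver API.
CUmodule LoadCubinModule(HMODULE hModule)
{
    // Locate the resource that was embedded in the DLL/EXE at build time.
    HRSRC hRes = FindResource(hModule, TEXT("CUBIN_RESOURCE"), RT_RCDATA);
    if (!hRes) return NULL;

    HGLOBAL hData = LoadResource(hModule, hRes);
    if (!hData) return NULL;

    // LockResource returns a pointer to the raw resource bytes.
    const void* pCubin = LockResource(hData);

    // Final step: the driver turns the cubin image into a loadable module.
    CUmodule module = NULL;
    CUresult err = cuModuleLoadDataEx(&module, pCubin, 0, NULL, NULL);
    return (err == CUDA_SUCCESS) ? module : NULL;
}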


With CUDA 3.0 and this new ELF format, cubin files look rather different: they are now binary files and are no longer readable in a text editor.


When I compiled some old projects against CUDA 3.0, everything went very wrong due to this change.
 
The problem was that my old method copied the cubin resource string into another memory location using strcpy, and also appended a final \0 character for good measure at the end of the string. With the new binary format the string copy no longer works - strcpy stops at the first embedded zero byte in the ELF data - so a partially mangled buffer ended up being passed to the CUDA compiler, which promptly fell over.


So if anybody else out there is using string resources to include and manipulate cubin files, this may catch you out too. The fix is easy: simply treat the new cubin files as binary data, not strings.
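Concretely, assuming the same hModule, hRes and pCubin as in the sketch above, the change amounts to something like this:

// Before (text cubins): strcpy stops at the first embedded zero byte, so an
// ELF cubin gets truncated and mangled.
//   char* buffer = new char[size + 1];
//   strcpy(buffer, (const char*)pCubin);          // broken for binary ELF data
//
// After: copy exactly the number of bytes reported by the resource system.
DWORD size   = SizeofResource(hModule, hRes);      // hRes from FindResource
void* buffer = malloc(size);
memcpy(buffer, pCubin, size);                      // binary-safe copy
// ...then pass 'buffer' (or simply the original LockResource pointer) to
// cuModuleLoadDataEx; no terminating '\0' is needed for the ELF image.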


One final point: if you really want to stick with the previous cubin string format, then apparently (I haven't confirmed this) you can direct nvcc to emit string cubin files by changing the CUBINS_ARE_ELF flag in nvcc.profile.


Vision Experts

Friday, 26 March 2010

NVPP Performance Benchmarks

In my last post, I cast some doubt on the performance and utility of GPUs for small image processing functions. Today I had a look at how NVidia's own image processing library - NVPP - stacked up against the latest Intel Performance Primitives (IPPI v6.0) for some basic arithmetic on one of my dev machines. This development PC has a mid-performance quad-core Intel Q8400 @ 2.66GHz and a mid-performance NVidia GTX260 with 216 cores @ 1.1GHz.


The results are interesting and pretty much what I expected. As an example, here are the results for a simple addition of two images to produce one output image (averaged over 1000 iterations):


512x512 Pixels:
GPU-Transfer and Processing = 0.72 milliseconds
CPU = 0.16 milliseconds


2048x2048 Pixels:
GPU-Transfer and Processing = 6.78 milliseconds
CPU  = 2.81 milliseconds

The CPU wins easily - so what's happening here? The transfer overheads to and from the GPU over a PCIe x16 bus are by far the dominant factor, taking approximately 2ms per image transfer for the 2048x2048 images (two input images plus one output image = approximately 6ms). Whilst transfer times can be significantly improved (perhaps halved) by putting the input and output images in page-locked memory, the conclusion would not change: performing individual simple image operations on the GPU does not significantly accelerate image processing.
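For anyone wanting to reproduce this kind of test, a rough sketch of the GPU side of the benchmark is shown below. It is illustrative rather than the exact code used for the timings above, and it assumes 8-bit single-channel images and that the nppiAdd_8u_C1RSfs function is available in your version of NVPP; error checking is omitted.

#include <cuda_runtime.h>
#include <npp.h>

// One iteration of the "upload, add, download" test. For 8-bit single-channel
// images the width in pixels equals the row width in bytes, which keeps the
// cudaMemcpy2D calls simple.
void GpuAddOnce(const Npp8u* hostA, const Npp8u* hostB, Npp8u* hostDst,
                int width, int height, int hostPitch)
{
    int stepA = 0, stepB = 0, stepDst = 0;
    Npp8u* dA   = nppiMalloc_8u_C1(width, height, &stepA);
    Npp8u* dB   = nppiMalloc_8u_C1(width, height, &stepB);
    Npp8u* dDst = nppiMalloc_8u_C1(width, height, &stepDst);

    // Two uploads and one download - at 2048x2048 these dominate the total time.
    cudaMemcpy2D(dA, stepA, hostA, hostPitch, width, height, cudaMemcpyHostToDevice);
    cudaMemcpy2D(dB, stepB, hostB, hostPitch, width, height, cudaMemcpyHostToDevice);

    NppiSize roi = { width, height };
    nppiAdd_8u_C1RSfs(dA, stepA, dB, stepB, dDst, stepDst, roi, 0); // saturated add, no scaling

    // cudaMemcpy2D blocks until the preceding work has finished, so wall-clock
    // timing around this function captures transfer plus processing.
    cudaMemcpy2D(hostDst, hostPitch, dDst, stepDst, width, height, cudaMemcpyDeviceToHost);

    nppiFree(dA); nppiFree(dB); nppiFree(dDst);
}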


So what happens if we emulate a compute-intensive algorithm on the GPU? When we perform only one transfer but replace the single addition with 1000 compounded additions, the total time for the GPU operation becomes:

2048x2048 Pixels:
GPU-1xTransfer and 1000xImAdds = 0.29 milliseconds

So for a compute-intensive operation which transfers the data once and then re-uses the image data multiple times, the GPU can easily be 10x faster than the CPU.
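Re-using the device buffers from the sketch above, the compounded version looks something like the fragment below (assuming the in-place variant nppiAdd_8u_C1IRSfs is available in your NVPP version; accumulating in place saturates the pixel values quickly, but that does not matter for timing purposes):

// One upload per source image, then 1000 additions entirely on the device,
// then a single download - the transfer cost is now amortised.
cudaMemcpy2D(dA, stepA, hostA, hostPitch, width, height, cudaMemcpyHostToDevice);
cudaMemcpy2D(dB, stepB, hostB, hostPitch, width, height, cudaMemcpyHostToDevice);

NppiSize roi = { width, height };
for (int i = 0; i < 1000; ++i)
    nppiAdd_8u_C1IRSfs(dB, stepB, dA, stepA, roi, 0);   // dA += dB (saturated)

cudaMemcpy2D(hostDst, hostPitch, dA, stepA, width, height, cudaMemcpyDeviceToHost);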

This means that algorithms such as deconvolution, optic flow, deformable registration, FFTs, iterative segmentation etc. are all good candidates for GPU acceleration. Now, if you look at the NVidia community showcase, these are the sorts of algorithms that you will see making use of the GPU for imaging. When the new Fermi architecture hits the shelves, with its larger L1 cache and new L2 cache, GPU imaging performance should make a real jump.

It's worth mentioning a minor technical problem with NVPP and Visual Studio 2008 - NVPP 1.0 doesn't link properly in MSVC2008 unless you disable whole program optimisation (option /GL). It's also worth noting that NVPP is built on the runtime API, which is not suitable for real-time multi-threaded applications. If you really need some of the NVPP functionality for a real-world application, then we would suggest you get a custom library developed using the driver API.


Vision Experts

Saturday, 13 March 2010

A GPU is not Always Fastest

There has been a huge amount of interest in GPU computing (GPGPU) over the last couple of years. Unsurprisingly, a number of image processing algorithms have been implemented using this technology. In most cases, large performance gains are reported. However, whilst I have been writing image processing algorithms that leverage GPU performance for some time now, I have often found that the GPU is not the best solution. As a rule of thumb, I aim for a x10 increase in speed to justify the development; if I can't achieve a x4 increase in speed then it's just not worth the effort.

Sometimes the reported performance gains are misleading for practical applications. NVidia themselves are guilty of this in their SDK image processing examples. For instance, in many of their SDK demonstration applications they use the SDK functions to load an 8-bit image and then pre-convert it on the host to a packed floating point format before uploading it to the GPU. They then show large gains in speed, but ignore the huge time penalty of the CPU-side format conversion. In another example they have to unpack 24-bit RGB data into 128-bit packed quads of floating point data on the host before they can process it. In the real world this is not practical. I do wonder how many other people have used some creative accounting in their reported acceleration factors.

So, despite generally being a GPU evangelist for accelerating image processing, I wanted to write a bit about the downsides to provide a balanced view.

Architecture constraints.  You need to be doing a lot of work on the image data to make the architecture work for you.  Many (most?) practical algorithms just don't fit into a GPU very well.  For example, it may be the case that a GPU can do a brute-force template correlation faster than a quad-core CPU, but brute-force correlation isn't the method of choice for pattern matching these days.  Contemporary vision libraries have extremely sophisticated algorithms that do a far better job of pattern matching than correlation, and they are highly optimised for multi-threading on the CPU.  These algorithms simply do not fit into the GPU 'brute force' computational model.


By way of a painful example, I have been developing a complete JPG conversion library for NVidia GPUs.  It is blazingly fast at RGB-YUV conversion, DCT and quantisation, but falls down on the Huffman coding, which is an inherently sequential algorithm.  Add in the transfer overheads and it gets slower still.  At the time of writing, a hand-optimised multi-threaded CPU version is almost as fast.  All is not lost on this development, but it's a tough sell at this point.

Multi-threading.  Whilst a GPU is massively parallel internally, it cannot run multiple algorithms (kernels) in parallel*.  So if your application currently does multiple operations in parallel, e.g. processing the images from multiple sensors simultaneously, then it will have to change and serialise the images into GPU work chunks.  Whilst your quad-core CPU could be doing four images at once, the GPU is doing them in serial.  This means the GPU has to process images at least four times faster than a single CPU core in order to break even.

*I believe the new NVidia Fermi architecture can run multiple kernels simultaneously, but current cards don't.

Transfer Overheads.  It takes time to transfer data across the PCIe bus to and from the GPU.  If the algorithm already runs quickly on the CPU (e.g. a few milliseconds) then GPU acceleration is usually a non-starter.
 

Algorithm development time.  It takes longer to write and debug a massively parallel GPU algorithm than it does to parallelize the algorithm on the CPU to make use of a fast quad-core.  Development time is expensive.

Hardware cost.  You do get a lot of horsepower for your money with a GPU, and a good performance card can be purchased for £150, but that still has to be factored into the system cost.

Hardware obsolescence.  Whilst NVidia have confirmed that CUDA will be available in every new GPU they produce, any particular GPU card quickly becomes obsolete.  Code should be forward compatible, but I don't think this has really been put to the test yet.




Of course, there are still lots of good things about this new technology, and it really can accelerate the big number-crunching algorithms like optic flow, deconvolution and FFTs.  But you have to choose carefully.


Vision Experts