Friday 26 March 2010

NVPP Performance Benchmarks

In my last post, I cast some doubt on the performance and utility of GPUs for small image processing functions. Today I had a look at how NVidia's own image processing library - NVPP - stacked up against the latest Intel Performance Primitives (IPPI v6.0) for some basic arithmetic on one of my dev machines. This development PC has a mid-performance quad-core Intel Q8400 @ 2.66GHz and a mid-performance NVidia GTX260 with 216 cores @ 1.1GHz.


The results are interesting and pretty much what I expected. As an example, here are the results for a simple addition of two images to produce one output image (averaged over 1000 iterations):


512x512 Pixels:
GPU-Transfer and Processing = 0.72 milliseconds
CPU = 0.16 milliseconds


2048x2048 Pixels:
GPU-Transfer and Processing = 6.78 milliseconds
CPU  = 2.81 milliseconds
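
For reference, figures like these are typically obtained by timing many repetitions and dividing by the iteration count. Here is a minimal sketch of that measurement pattern in pure Python. The function names are hypothetical, and the saturating 8-bit add is only a stand-in for the IPP and NVPP calls, so the number it prints is not comparable with the results above.

```python
import time

def add_saturate_8u(a, b):
    """Pixel-wise addition of two 8-bit images stored as bytes, clamped to 255."""
    return bytes(min(x + y, 255) for x, y in zip(a, b))

def mean_time_ms(fn, iterations=1000):
    """Average wall-clock time of fn() over a number of iterations, in milliseconds."""
    fn()  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iterations

# Small test images (ASSUMPTION: 8-bit, single channel); kept tiny because
# pure Python is far slower than an optimised library.
width = height = 64
img_a = bytes([100]) * (width * height)
img_b = bytes([200]) * (width * height)

elapsed = mean_time_ms(lambda: add_saturate_8u(img_a, img_b), iterations=100)
print(f"{elapsed:.3f} ms per addition")
```

The warm-up call keeps one-off start-up costs out of the average, which matters most for GPU timings where the first call often includes initialisation.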

The CPU wins easily - so what's happening here? The transfer overheads to and from the GPU over a PCIe x16 bus are by far the dominant factor, taking approx 2ms per image transfer for the 2048x2048 images (two input images, one output image = approx 6ms). Whilst transfer times could be significantly improved (perhaps halved) if the input and output images were put into page-locked memory, the conclusion would not change: performing individual simple image operations on the GPU does not significantly accelerate image processing.
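
The arithmetic behind that estimate can be sketched in a few lines of Python. The timings are the ones quoted above. The assumption that the images are 8-bit single-channel (one byte per pixel) is mine, and is only used to turn the approx 2ms transfer time into an implied bus throughput.

```python
# Timings quoted in this post for the 2048x2048 case, in milliseconds.
TRANSFER_MS = 2.0    # approximate cost of moving one image over the bus
GPU_TOTAL_MS = 6.78  # measured: transfers plus processing
CPU_TOTAL_MS = 2.81  # measured: addition on the CPU

def transfer_overhead_ms(n_inputs, n_outputs, per_image_ms=TRANSFER_MS):
    """Total bus time for uploading the inputs and downloading the outputs."""
    return (n_inputs + n_outputs) * per_image_ms

def implied_throughput_gb_s(width, height, bytes_per_pixel, per_image_ms=TRANSFER_MS):
    """Effective throughput implied by one image transfer, in GB/s (1e9 bytes)."""
    image_bytes = width * height * bytes_per_pixel
    return image_bytes / (per_image_ms * 1e-3) / 1e9

overhead = transfer_overhead_ms(n_inputs=2, n_outputs=1)
print(f"transfer overhead : {overhead:.2f} ms")
print(f"left for GPU math : {GPU_TOTAL_MS - overhead:.2f} ms")
# ASSUMPTION: 1 byte per pixel (8-bit, single channel).
print(f"implied throughput: {implied_throughput_gb_s(2048, 2048, 1):.2f} GB/s")
```

On these figures the transfers alone take longer than the whole CPU operation (approx 6ms against 2.81ms), so the GPU cannot win a single addition however fast its arithmetic is.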


So what happens if we emulate a compute-intensive algorithm on the GPU? When we perform only one transfer but then replace the single addition with 1000 compounded additions, the total time for the GPU operation becomes:

2048x2048 Pixels:
GPU-1xTransfer and 1000xImAdds = 0.29 milliseconds

So for compute-intensive operations that transfer the data once and then re-use the image data multiple times, the GPU can easily be 10x faster than the CPU.
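
A simple cost model shows how quickly the transfer overhead is amortised. This is only a sketch under stated assumptions: the function names are hypothetical, the approx 6ms of transfers is treated as a one-off cost, and the 0.29ms figure is treated as the GPU cost per addition, which is my reading of the result rather than something the measurement states.

```python
import math

def gpu_time_ms(n_ops, transfer_ms, gpu_op_ms):
    """One round of transfers, then n_ops operations on data resident on the GPU."""
    return transfer_ms + n_ops * gpu_op_ms

def cpu_time_ms(n_ops, cpu_op_ms):
    """n_ops operations on the CPU, with no transfer cost."""
    return n_ops * cpu_op_ms

def break_even_ops(transfer_ms, cpu_op_ms, gpu_op_ms):
    """Smallest operation count at which the GPU is strictly faster, or None if never."""
    if gpu_op_ms >= cpu_op_ms:
        return None
    # GPU wins when transfer + k*gpu < k*cpu, i.e. k > transfer / (cpu - gpu).
    return math.floor(transfer_ms / (cpu_op_ms - gpu_op_ms)) + 1

TRANSFER_MS = 6.0  # approx: two uploads and one download, from the figures above
CPU_OP_MS = 2.81   # measured CPU time per 2048x2048 addition
GPU_OP_MS = 0.29   # ASSUMPTION: read as GPU time per addition

print("break-even operations:", break_even_ops(TRANSFER_MS, CPU_OP_MS, GPU_OP_MS))
for n in (1, 10, 1000):
    speedup = cpu_time_ms(n, CPU_OP_MS) / gpu_time_ms(n, TRANSFER_MS, GPU_OP_MS)
    print(f"{n:5d} ops: GPU speed-up {speedup:.2f}x")
```

Under these assumptions the speed-up tends towards the ratio of the per-operation costs as the number of operations grows, which is why algorithms that re-use data on the GPU benefit and single operations do not.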

This means that algorithms such as deconvolution, optic flow, deformable registration, FFTs, iterative segmentation etc. are all good candidates for GPU acceleration. Indeed, if you look at the NVidia community showcase, these are the sorts of algorithms that you will see making use of the GPU for imaging. When the new Fermi architecture hits the shelves, with its larger L1 cache and new L2 cache, GPU imaging performance should make a real jump.

It's worth mentioning a minor technical problem with NVPP and Visual Studio 2008 - NVPP 1.0 doesn't link properly in MSVC2008 unless you disable whole program optimisation (option /GL). It's also worth noting that NVPP is built on the runtime API, which is not suitable for real-time multi-threaded applications. If you really need some of the NVPP functionality for a real-world application, then we would suggest you get a custom library developed using the driver API.


Vision Experts
