Saturday 13 March 2010

A GPU is not Always Fastest

There has been a huge amount of interest in GPU computing (GPGPU) over the last couple of years. Unsurprisingly, a number of image processing algorithms have been implemented using this technology. In most cases, large performance gains are reported. However, whilst I have been writing image processing algorithms that leverage GPU performance for some time now, I have often found that the GPU is not the best solution. As a rule of thumb, I aim for a x10 increase in speed to justify the development; if I can't achieve a x4 increase in speed then it's just not worth the effort.

Sometimes, the performance gains are misleading for practical applications. NVidia themselves are guilty of this in their SDK image processing examples. For instance, in many of their SDK demonstration applications they use the SDK functions to load an 8-bit image and then pre-convert it on the host to a packed floating point format before uploading it to the GPU. They then show large gains in speed, but ignore the huge time penalty of the CPU-side format conversion. In another example they have to unpack 24-bit RGB data into 128-bit packed quads of floating point data on the host before they can process it. In the real world this is not practical. I do wonder how many other people have used some creative accounting in their reported acceleration factors.
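To make the point concrete, here is a minimal sketch (my own illustration, not the SDK code) of the two routes: pre-converting 8-bit pixels to floats on the host before upload, versus uploading the raw bytes and letting a trivial kernel do the conversion on the device. If you only time the kernel, the first route looks wonderful; time the whole pipeline and the host-side loop plus the 4x larger transfer dominate.

    // Sketch only: host-side unpacking versus on-device conversion.
    // Error checking omitted for brevity.
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void convertToFloat(const unsigned char* src, float* dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i] * (1.0f / 255.0f);
    }

    // The 'benchmark-friendly' route: convert on the CPU, then upload 4x the data.
    void uploadPreconverted(const unsigned char* host8, float* deviceF, int n)
    {
        std::vector<float> hostF(n);
        for (int i = 0; i < n; ++i) hostF[i] = host8[i] * (1.0f / 255.0f);
        cudaMemcpy(deviceF, &hostF[0], n * sizeof(float), cudaMemcpyHostToDevice);
    }

    // The realistic route: upload the 8-bit data as-is and convert on the GPU.
    void uploadRaw(const unsigned char* host8, unsigned char* device8,
                   float* deviceF, int n)
    {
        cudaMemcpy(device8, host8, n, cudaMemcpyHostToDevice);
        convertToFloat<<<(n + 255) / 256, 256>>>(device8, deviceF, n);
    }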

So, despite generally being a GPU evangelist for accelerating image processing, I wanted to write a bit about the downsides to provide a balanced view.

Architecture constraints. You need to be doing a lot of work on the image data to make the architecture work for you. Many (most?) practical algorithms just don't fit into a GPU very well. For example, it may be the case that a GPU can do a brute-force template correlation faster than a quad-core CPU, but brute-force correlation for pattern matching isn't the method of choice these days. Contemporary vision libraries have extremely sophisticated algorithms that do a far better job of pattern matching than correlation, and they are highly optimised for multi-threading on the CPU. These algorithms simply do not fit into the GPU 'brute force' computational model.
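For illustration, this is the kind of brute-force kernel that does map nicely onto a GPU: one thread per candidate position, every thread doing the same dense arithmetic. A sketch only, using a simple sum-of-absolute-differences score; a serious implementation would at least use shared memory and an image pyramid.

    // Sketch: brute-force template matching, one thread per candidate offset.
    // scores must hold (imgW - tW + 1) * (imgH - tH + 1) floats.
    __global__ void sadMatch(const unsigned char* image, int imgW, int imgH,
                             const unsigned char* tmpl, int tW, int tH,
                             float* scores)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x > imgW - tW || y > imgH - tH) return;

        float sad = 0.0f;
        for (int ty = 0; ty < tH; ++ty)
            for (int tx = 0; tx < tW; ++tx)
                sad += fabsf((float)image[(y + ty) * imgW + (x + tx)]
                             - (float)tmpl[ty * tW + tx]);

        scores[y * (imgW - tW + 1) + x] = sad;
    }

The sophisticated matchers in modern vision libraries do nothing like this regular sweep over every pixel, which is precisely why they do not fit the same mould.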

By way of a painful example, I have been developing a complete JPEG compression library for NVidia GPUs. This is blazingly fast at RGB-to-YUV conversion, the DCT and quantisation, but falls down on the Huffman coding, which is an inherently sequential algorithm. Add in the transfer overheads and it gets slower still. At the time of writing, a hand-optimised multi-threaded CPU version is almost as fast. All is not lost on this development, but it's a tough sell at this point.
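The reason the Huffman stage resists the GPU is easy to see in code: the bit position of every symbol depends on the combined length of all the codes emitted before it, so the natural implementation is one serial loop. A minimal sketch (plain host code with hypothetical names; the output buffer is assumed zero-initialised):

    // Sketch: serial Huffman-style bit packing. Each symbol's output position
    // depends on the running bit count, so iterations cannot simply be farmed
    // out to thousands of threads without extra passes (e.g. a prefix sum of
    // code lengths) to pre-compute where each one should write.
    typedef struct { unsigned int code; int length; } HuffCode;

    void packBits(const HuffCode* codes, int numSymbols, unsigned char* out)
    {
        unsigned long long bitPos = 0;               // the serial dependency
        for (int i = 0; i < numSymbols; ++i)
        {
            for (int b = codes[i].length - 1; b >= 0; --b)
            {
                int bit = (codes[i].code >> b) & 1;
                out[bitPos >> 3] |= (unsigned char)(bit << (7 - (bitPos & 7)));
                ++bitPos;
            }
        }
    }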

Multi-threading. Whilst a GPU is massively parallel internally, it cannot run multiple algorithms (kernels) in parallel*. So if your application is used to doing multiple operations concurrently, e.g. processing the images from multiple sensors at the same time, then it will have to change and serialize the images into GPU work chunks (sketched below). So whilst your quad-core CPU could be doing four images at once, the GPU is doing them one after another. This means the GPU has to process each image at least four times faster than a single CPU core just to break even.

*I believe the new NVidia Fermi architecture can run multiple kernels simultaneously, but most current cards cannot.
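To illustrate the difference in shape, here is a rough sketch of a four-sensor frame being processed both ways. The processImageCPU and processImageGPU functions are hypothetical placeholders, not real library calls.

    // Sketch: four sensor images per cycle. On a quad-core CPU the images are
    // processed concurrently; on the GPU their kernels queue up one after another.
    void processImageCPU(const unsigned char* img) { /* hypothetical CPU routine */ }
    void processImageGPU(const unsigned char* img) { /* hypothetical: launches CUDA kernels */ }

    void processFrameOnCPU(const unsigned char* imgs[4])
    {
        #pragma omp parallel for                     // one image per core
        for (int i = 0; i < 4; ++i)
            processImageCPU(imgs[i]);
    }

    void processFrameOnGPU(const unsigned char* imgs[4])
    {
        for (int i = 0; i < 4; ++i)                  // kernel launches serialise on the device
            processImageGPU(imgs[i]);
    }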

Transfer overheads. It takes time to transfer data across the PCIe bus to and from the GPU. If the algorithm already runs quickly on the CPU (e.g. a few milliseconds) then GPU acceleration is usually a non-starter.
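It is worth measuring that overhead rather than guessing. A minimal timing sketch using CUDA events; the 1 MB 8-bit image size is just an example:

    // Sketch: time the host-to-device transfer of one 1024x1024 8-bit image.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const int bytes = 1024 * 1024;
        unsigned char *hostBuf, *devBuf;
        cudaMallocHost((void**)&hostBuf, bytes);     // pinned memory for a fast upload
        cudaMalloc((void**)&devBuf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Upload took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        return 0;
    }

If the round trip alone eats a meaningful slice of the few milliseconds the CPU version needs, the case for the GPU evaporates before any computation has happened.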

Algorithm development time.  It takes longer to write and debug a massively parallel GPU algorithm than it does to parallelize the algorithm on the CPU to make use of a fast quad-core.  Development time is expensive.

Hardware cost. You do get a lot of horsepower for your money with a GPU, and a good performance card can be purchased for £150. That still has to be factored into the system cost.

Hardware obsolescence. Whilst NVidia have confirmed that CUDA will be available in every new GPU they produce, any particular GPU card quickly becomes obsolete. Code should be forward compatible, but I don't think this has really been put to the test yet.

Of course, there are still lots of good things about this new technology, and it really can accelerate the big number-crunching algorithms like optic flow, deconvolution and FFTs. But you have to choose carefully.
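For the workloads that do fit, the vendor libraries make life easy. A minimal cuFFT sketch for an in-place 2D single-precision transform, the sort of step that sits at the heart of FFT-based deconvolution (sizes and the function name are mine):

    // Sketch: 2D complex-to-complex FFT with cuFFT; the kind of dense,
    // regular number crunching where the GPU earns its keep.
    #include <cufft.h>

    void forwardFFT2D(cufftComplex* deviceData, int width, int height)
    {
        cufftHandle plan;
        cufftPlan2d(&plan, height, width, CUFFT_C2C);               // rows, then columns
        cufftExecC2C(plan, deviceData, deviceData, CUFFT_FORWARD);  // in-place forward transform
        cufftDestroy(plan);
    }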


Vision Experts

3 comments:

  1. I have been trying to implement JPEG compression with CUDA. But the result was disappointing. Would you please share your JPG conversion library with me? It will help me. Thank you. My Email is rockypengpp@yahoo.com.cn

  2. Fast CUDA JPEG encoder
    http://www.fastvideo.ru/english/products/software/cuda-jpeg-encoder.htm
    Huffman coding is not a bottleneck for JPEG compression.

  3. JPEG is a great application. I've been using this application.

    Medical Product Design
