Thursday, 8 July 2010

Debunking the x100 GPU Myth - Intel Fights Back

Intel recently published this paper titled 'Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU', which compares a number of GPU kernels against algorithms that are highly optimised for Intel architectures. The authors concluded that, for the right problems, the GPU was up to x14 faster than an equivalent optimised CPU implementation; on average, a x2.5 increase in speed was seen.

I am all in favour of using GPUs to accelerate image processing when it is appropriate, but the hype has got out of control over the last year, so I am very pleased to see Intel try to put their case forward and bring some balance to the argument.

What I liked about the paper was that, for once, significant effort was expended to optimise BOTH the CPU and the GPU implementations.  Too many biased comparisons are made between highly optimised GPU implementations and naive, plain vanilla single-threaded 'C' versions.  When a x100 increase in speed is cited, I always suspect that the author was being highly selective about which parts of the overall system were timed, that the algorithm was unrealistically well mapped to GPU hardware and not representative of a real problem, or even that the CPU implementation was simply not optimised at all.  The NVidia showcase website has made publishing an impressive acceleration factor in the author's best interest.

I certainly have not come across any imaging systems that have achieved anything like a x100 increase in throughput by employing GPU technology.  There may be some algorithms that map superbly well to GPUs and can achieve a x100 performance increase in a single algorithm stage, but the numbers published by Intel are much more in line with the total throughput increase I have seen when using GPUs for image processing in real-world applications, compared against the optimised CPU algorithms that are readily available.

An example of disingenuous performance metrics would be the image processing blur demo in the NVidia SDK - here the image is loaded from file, pre-processed and converted into a 512x512 floating-point greyscale image, transferred to the GPU once, and THEN processed repeatedly at high speed to show how fast the GPU is.  The CPU conversion to floating-point format is omitted from the GPU compute time.
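
To make that concrete, here is a rough sketch of how the reported numbers change depending on what you put inside the timed region. This is not the SDK source, just an illustrative stand-in: a trivial 3x3 blur kernel, a 512x512 image and a made-up iteration count.

// Sketch only: shows how the quoted speed depends on where the timer starts.
// Kernel, image size and iteration count are hypothetical stand-ins.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

__global__ void blurKernel(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;
    // Simple 3x3 box blur, purely illustrative.
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += in[(y + dy) * w + (x + dx)];
    out[y * w + x] = sum / 9.0f;
}

int main()
{
    const int w = 512, h = 512, iterations = 1000;
    std::vector<unsigned char> src(w * h, 128);   // stand-in for the loaded image
    std::vector<float> grey(w * h);

    auto t0 = std::chrono::high_resolution_clock::now();

    // CPU-side preparation: 8-bit to 32-bit float conversion.
    for (int i = 0; i < w * h; ++i)
        grey[i] = src[i] / 255.0f;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in,  w * h * sizeof(float));
    cudaMalloc(&d_out, w * h * sizeof(float));
    cudaMemcpy(d_in, &grey[0], w * h * sizeof(float), cudaMemcpyHostToDevice);

    auto t1 = std::chrono::high_resolution_clock::now();   // the "demo" timer starts here

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    for (int i = 0; i < iterations; ++i)
        blurKernel<<<grid, block>>>(d_in, d_out, w, h);
    cudaDeviceSynchronize();

    auto t2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> kernelsOnly = t2 - t1;
    std::chrono::duration<double, std::milli> endToEnd   = t2 - t0;
    printf("kernel-only: %.3f ms per frame\n", kernelsOnly.count() / iterations);
    printf("end-to-end:  %.3f ms including conversion and transfer\n", endToEnd.count());

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The kernel-only figure is the one that makes the headlines; time the whole pipeline, including the conversion and the host-to-device transfer, and the picture is rather less dramatic.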

I would also agree with Intel that, in practice, optimising an algorithm to use multiple cores, maximise cache usage and exploit SSE instructions is most often easier, faster and ultimately more portable than developing a CUDA replacement algorithm.  I would also agree with the GPU evangelists that the hardware cost of upgrading to a top-end Intel-based PC system is significantly higher than the investment in a GTX280, and with the tools improving all the time it is becoming easier to code and deploy GPU-enhanced algorithms.
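
For reference, the sort of CPU-side optimisation I mean is nothing exotic. Here is a minimal sketch, assuming a simple gain/offset adjustment on a float image (the function name and parameters are illustrative), using OpenMP across rows and SSE intrinsics within each row:

// Minimal sketch of the CPU-side optimisation discussed above: gain/offset on a
// float image, OpenMP across rows, SSE (4 floats at a time) within each row.
// Compile with OpenMP enabled (e.g. -fopenmp or /openmp).
#include <xmmintrin.h>   // SSE intrinsics

void gainOffset(const float* in, float* out, int width, int height,
                float gain, float offset)
{
    const __m128 g = _mm_set1_ps(gain);
    const __m128 o = _mm_set1_ps(offset);

    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        const float* src = in  + y * width;
        float*       dst = out + y * width;

        int x = 0;
        for (; x + 4 <= width; x += 4)
        {
            // Four pixels per iteration: dst = src * gain + offset
            __m128 v = _mm_loadu_ps(src + x);
            v = _mm_add_ps(_mm_mul_ps(v, g), o);
            _mm_storeu_ps(dst + x, v);
        }
        for (; x < width; ++x)          // scalar tail for odd widths
            dst[x] = src[x] * gain + offset;
    }
}

A handful of pragmas and intrinsics like these will usually get most of the way to a fully-used multi-core CPU, and the same source builds and runs on any x86 machine.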


The conclusion is that, for the time being, we must take a balanced view of the technology available and choose the right processing method for the application.  And be realistic.

Vision Experts

2 comments:

  1. I totally agree. There are a whole bunch of considerations when implementing on the GPU, such as segmenting data to fit into shared memory, streaming large-scale data in and out of GPU RAM (considerably smaller than what's available to CPUs nowadays), optimizing memory access patterns (coalescing), making algorithms SIMD-friendly, etc. It's been fun, but I'd prefer 100-core systems (if they become cheap enough) even if the theoretical performance gain may be slightly less than 1000-core GPUs.

    Optimizing for multi-core is no picnic either, but it almost seems like a walk in the park compared to the effort of implementing things on the GPU.

  2. Interesting blog. It would be great if you could provide more details about it. Thank you

    Image Processing Company in Chennai
