Sunday, 2 May 2010

Faster Memory Transfers

NVidia provides a mechanism to allocate non-paged ('pinned') memory on the host, which can significantly improve host-to-GPU transfer performance.  But does it help in practice?

The main bottleneck in GPU processing is the PCIe bus, which has a relatively low bandwidth.  For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU.  For normal host-to-GPU data transfers using the cuMemcpy function, a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI Express bus.  This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms of overhead to transfer a 1920x1080 greyscale image.


Figure 1.  A normal cuMemcpy from host to device runs at about 2GB/sec.

If we use the NVidia cuMemAllocHost function to allocate non-paged memory on the host, we can almost double the bandwidth when copying this buffer to GPU device memory, achieving nearer 4GB/sec on most systems.  If you are able to write your capture code so that the frame grabber driver will DMA image data directly into one of these page-locked buffers, then that is a worthwhile thing to do.  Unfortunately, that's not always possible.
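
As a rough illustration, a minimal driver-API sketch of the two transfer paths might look like this (error checking omitted; the frame size is arbitrary):

#include <cuda.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 1920 * 1080;      /* one 8-bit greyscale frame */
    CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
    void *pinned, *pageable = malloc(bytes);

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, bytes);

    /* Pageable host-to-device copy: typically ~2GB/sec on PCIe v1.1 x16 */
    cuMemcpyHtoD(dptr, pageable, bytes);

    /* Page-locked ('pinned') host allocation: the same copy typically
       runs nearer 4GB/sec because the driver can DMA directly */
    cuMemAllocHost(&pinned, bytes);
    cuMemcpyHtoD(dptr, pinned, bytes);

    cuMemFreeHost(pinned);
    cuMemFree(dptr);
    free(pageable);
    cuCtxDestroy(ctx);
    return 0;
}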

Page-Locked Intermediate Buffer
Sometimes the frame grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA-allocated page-locked memory.  In this situation, we can either copy our captured image directly to device memory as in Figure 1, or memcpy into a page-locked buffer prior to the transfer across the PCIe bus as in Figure 2.  Since a host memcpy takes time, this erodes some of the benefit of using the page-locked buffer.



Figure 2.  Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.

Using a page-locked transfer buffer as shown in Figure 2 is only worthwhile when the cost of the host memcpy operation is low - which requires a relatively high-performance chipset (e.g. ICH10) with fast DDR2 (6.4GB/sec) or DDR3 (8.5GB/sec) memory.  At a minimum, the host-to-host copy must execute faster than 4GB/sec, otherwise the direct copy in Figure 1 is usually faster.  As an example, the approximate time taken to transfer 1GB using paged memory is:

1GB / (2GB/sec) = 500ms

When using the scheme in figure2, the total time taken to transfer 1GB from host to the page-locked buffer and then onto the GPU is approximately:
  1GB / (8GB/sec) = 125ms
+ 1GB / (4GB/sec) = 250ms
                  = 375ms


This is an improvement over the straight copy, so it would appear that non-paged memory helps even in this non-ideal situation.  When using a newer P45 chipset with PCI Express v2.0, the maximum achievable transfer bandwidth is higher.  In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but transfers are limited by main memory bandwidth, reducing host-to-GPU bandwidth to something between 5 and 6GB/sec.


The conclusion is that, if at all possible, acquire directly into pinned, page-locked memory.  If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance are good.
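
For the intermediate-buffer case, a minimal sketch looks something like this (acquireBuf stands in for the frame grabber's own buffer; a CUDA context is assumed current, and error checking is omitted):

#include <cuda.h>
#include <string.h>

void uploadViaStaging(CUdeviceptr dptr, const void *acquireBuf, size_t bytes)
{
    /* Allocate the page-locked staging buffer once and re-use it
       per frame (assumes a fixed frame size across calls) */
    static void *staging = NULL;
    if (staging == NULL)
        cuMemAllocHost(&staging, bytes);

    memcpy(staging, acquireBuf, bytes);   /* host-to-host; must beat 4GB/sec to pay off */
    cuMemcpyHtoD(dptr, staging, bytes);   /* pinned host-to-GPU, nearer 4GB/sec */
}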

Direct FrameGrabber-to-GPU DMA
It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible.  It may be achieved using driver-level transfers akin to DirectShow drivers, but it is not currently possible to get a physical address of GPU device memory using CUDA.

Simon Green, from NVidia, says:
"A lot of people have asked for this. It is technically possible for other PCI-E devices to DMA directly into GPU memory, but we don't have a solution yet. We'll keep you posted." - Sep 2009
This is a capability worth waiting for, but don't hold your breath.



Vision Experts

Sunday, 28 March 2010

CUDA3.0 cubin Files

It appears that NVidia has changed the format of cubin files in CUDA 3.0 to a standard binary ELF format.  Here's what they say in the release notes:

  • CUDA C/C++ kernels are now compiled to standard ELF format
You can find out about ELF files at the Wikipedia entry.  In previous releases the partially compiled .cubin files were plain, readable text and could be added into a library as a string resource.  If you open an old cubin file in Visual Studio, it looks something like this:

architecture {sm_10}
abiversion   {1}
modname      {cubin}
code {
    name = cuFunction_Laser
    lmem = 0
    smem = 44
    reg  = 6
    bar  = 0
    bincode {
        0x10004209 0x0023c780 0x40024c09 0x00200780
        0xa000000d 0x04000780 0x20000411 0x0400c780
        0x3004d1fd 0x642107c8 0x30000003 0x00000500
        ...
blah..blah..blah

    }
}


Rather than ship cubin files with libraries, I have always built them into the file as a string resource and then used the Windows API functions such as FindResource and LoadResource to get a pointer to the string.  This is then passed to the CUDA cuModuleLoadDataEx function for final compilation into GPU code.


With CUDA 3.0 and this new ELF format, cubin files look quite different, since they are now binary files.


When I compiled some old projects against CUDA 3.0, everything went very wrong due to this change.

The problem was that my old method copied the cubin resource string into another memory location using strcpy, appending a final \0 character for good measure.  With the new binary format the string copy does not work - an ELF image contains embedded zero bytes, which terminate strcpy early - so a partially mangled buffer ended up being passed to the CUDA compiler, which promptly fell over.


So if anybody else out there is using string resources to include and manipulate cubin files, this may catch you out too.  The fix is easy: simply treat the new cubin files as binary data, not strings.
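
A rough sketch of the binary-safe version of this approach is below (the resource ID IDR_CUBIN1 is hypothetical, and error checking is omitted):

#include <windows.h>
#include <cuda.h>
#include <stdlib.h>
#include <string.h>

CUmodule loadEmbeddedCubin(void)
{
    /* Locate the cubin embedded as an RCDATA resource (IDR_CUBIN1 is
       a hypothetical ID defined in the project's resource script) */
    HMODULE hModule = GetModuleHandle(NULL);  /* or the DLL holding the resource */
    HRSRC   hRes    = FindResource(hModule, MAKEINTRESOURCE(IDR_CUBIN1), RT_RCDATA);
    HGLOBAL hData   = LoadResource(hModule, hRes);
    DWORD   size    = SizeofResource(hModule, hRes);
    void   *pCubin  = LockResource(hData);

    /* Copy with memcpy and the true resource size.  No strcpy and no
       terminating '\0' - the ELF image contains embedded zero bytes. */
    void *image = malloc(size);
    memcpy(image, pCubin, size);

    CUmodule module;
    cuModuleLoadDataEx(&module, image, 0, NULL, NULL);
    free(image);
    return module;
}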


One final point: if you really want to stick with the previous cubin string format, then apparently (I haven't confirmed this) you can direct nvcc to emit string cubin files by changing the CUBINS_ARE_ELF flag in nvcc.profile.


Vision Experts

Friday, 26 March 2010

NVPP Performance Benchmarks

In my last post, I cast some doubt on the performance and utility of GPUs for small image processing functions.   Today I had a look at how NVidia's own image processing library - NVPP - stacked up against the latest Intel Performance Primitives (IPPI v6.0) for some basic arithmetic on one of my dev machines.  This development PC has a mid-performance quad-core Intel Q8400 @ 2.66GHz and a mid-performance NVidia GTX260 with 216 cores @ 1.1GHz.


The results are interesting and pretty much what I expected.  As an example, here are the results for a simple addition of two images to produce one output image (averaged over 1000 iterations):


512x512 Pixels:
GPU-Transfer and Processing = 0.72 milliseconds
CPU = 0.16 milliseconds


2048x2048 Pixels:
GPU-Transfer and Processing = 6.78 milliseconds
CPU  = 2.81 milliseconds

The CPU wins easily - so what's happening here?  The transfer overheads to and from the GPU over a PCIe x16 bus are by far the dominant factor, taking approximately 2ms per image transfer for the 2048x2048 images (two input images plus one output image = approximately 6ms).  Whilst transfer times can be significantly improved (perhaps halved) if the input and output images are put into page-locked memory, the conclusion would not change: performing individual simple image operations on the GPU does not significantly accelerate image processing.


So what happens if we emulate a compute-intensive algorithm on the GPU?  When we perform only one transfer but replace the single addition with 1000 compounded additions, the total time for the GPU operation becomes:

2048x2048 Pixels:
GPU-1xTransfer and 1000xImAdds = 0.29 milliseconds

So for a compute-intensive operation which transfers the data once, then re-uses the image data multiple times, the GPU can easily be 10x faster than the CPU.
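
To give a flavour of the kind of test used here, the sketch below stands in for the NVPP call with a trivial add kernel: the images cross the PCIe bus once, and the arithmetic is then repeated entirely in device memory (a hypothetical harness; error checking omitted):

#include <cuda_runtime.h>

__global__ void imAdd(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = (unsigned char)min(a[i] + b[i], 255);  /* saturated 8-bit add */
}

void benchmark(const unsigned char *hA, const unsigned char *hB,
               unsigned char *hDst, int n)
{
    unsigned char *dA, *dB, *dDst;
    cudaMalloc((void**)&dA, n);
    cudaMalloc((void**)&dB, n);
    cudaMalloc((void**)&dDst, n);

    /* Transfer once... */
    cudaMemcpy(dA, hA, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, n, cudaMemcpyHostToDevice);

    /* ...then re-use the data on the GPU; the PCIe cost is
       amortised over all 1000 launches */
    for (int k = 0; k < 1000; ++k)
        imAdd<<<(n + 255) / 256, 256>>>(dDst, dA, dB, n);

    cudaMemcpy(hDst, dDst, n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dDst);
}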

This means that algorithms such as deconvolution, optic flow, deformable registration, FFTs and iterative segmentation are all good candidates for GPU acceleration.  Indeed, if you look at the NVidia community showcase, these are the sorts of algorithms that you will see making use of the GPU for imaging.  When the new Fermi architecture hits the shelves, with its larger L1 cache and new L2 cache, GPU imaging performance should make a real jump.

It's worth mentioning a minor technical problem with NVPP and Visual Studio 2008 - NVPP 1.0 doesn't link properly in MSVC 2008 unless you disable whole program optimisation (option /GL).  It's also worth noting that NVPP is built on the runtime API, which is not suitable for real-time multi-threaded applications.  If you really need some of the NVPP functionality for a real-world application, then we would suggest you get a custom library developed using the driver API.


Vision Experts

Saturday, 13 March 2010

A GPU is not Always Fastest

There has been a huge amount of interest in GPU computing (GPGPU) over the last couple of years.  Unsurprisingly, a number of image processing algorithms have been implemented using this technology.  In most cases, large performance gains are reported.  However, whilst I have been writing image processing algorithms that leverage GPU performance for some time now, I have often found that the GPU is not the best solution.  As a rule of thumb, I aim for a x10 increase in speed to justify the development; if I can't achieve a x4 increase in speed then it's just not worth the effort.

Sometimes, the performance gains are misleading for practical applications.  NVidia themselves are guilty of this in their SDK with their image processing examples.  For instance, in many of their SDK demonstration applications they use the SDK functions to load an 8-bit image and then pre-convert it on the host to a packed floating point format before uploading to the GPU.  They then show large gains in speed, but ignore the huge time penalty of the CPU-side format conversion.  In another example they have to unpack 24-bit RGB data into 128-bit packed quads of floating point data on the host before they can process it.  In the real world this is not practical.  I do wonder how many other people have used some creative accounting in their reported acceleration factors.

So, despite generally being a GPU evangelist for accelerating image processing, I wanted to write a bit about the downsides to provide a balanced view.

Architecture constraints.   You need to be doing a lot of work on the image data to make the architecture work for you.  Many (Most?) practical algorithms just don't fit into a GPU very well.  For example, it may be the case that a GPU can do a brute-force template correlation faster than a quad-core CPU, but brute-force correlation for pattern matching isn't the method of choice these days.  Contemporary vision libraries have extremely sophisticated algorithms that do a far superior job of pattern matching than correlation, plus they are highly optimised for multi-threading on the CPU.  These algorithms simply do not fit into the GPU 'brute force' computational model. 


By way of a painful example, I have been developing a complete JPG conversion library for NVidia GPUs.  This is blazingly fast at RGB-YUV conversion, DCT and quantisation, but falls down on the Huffman coding, which is a sequential algorithm.  Add in the transfer overheads and it gets slower.  At the time of writing, a hand-optimised multi-threaded CPU version is almost as fast.  All is not lost on this development, but it's a tough sell at this point.

Multi-threading.  Whilst a GPU is massively parallel internally, it cannot run multiple algorithms (kernels) in parallel*.  So if your application is used to doing multiple operations in parallel, e.g. processing the images from multiple sensors, then it will have to change to serialize the images into GPU work chunks.  So whilst your quad-core CPU could be doing four images at once, the GPU is doing them in serial.  This means the GPU has to process at least four times faster than a single CPU core in order to break even.

*I believe the new NVidia Fermi architecture can run multiple kernels simultaneously, but current GPUs can't.

Transfer Overheads.  It takes time to transfer data across the PCIe bus to and from the GPU.  If the algorithm already runs quickly on the CPU (e.g. a few milliseconds) then GPU acceleration is usually a non-starter.
 

Algorithm development time.  It takes longer to write and debug a massively parallel GPU algorithm than it does to parallelize the algorithm on the CPU to make use of a fast quad-core.  Development time is expensive.

Hardware cost.  You do get a lot of horsepower for your money with a GPU, and a good performance card can be purchased for £150.  That still has to be factored into the system cost.   
Hardware obsolescence.  Whilst NVidia have confirmed that CUDA will be available in every new GPU they produce, the exact same GPU card quickly becomes obsolete.  Code should be forward compatible, but I don't think this has really been put to the test yet.




Of course, there are still lots of good things about this new technology and it really can accelerate the big number crunching algorithms like optic flow and deconvolution and FFTs.  But you have to choose carefully.


Vision Experts

Wednesday, 24 February 2010

High Throughput for High Resolution

We've been using the ProSilica/AVT GE4900 recently to get super-high-resolution 16-megapixel images at about 3Hz.  It's a nice camera, but that resolution tends to demand high performance from the processor.

That gives us about 45MB/sec of raw image data to process.  In order to chew through all this data we've been pushing the raw Bayer mosaic images onto an NVidia GTX260 GPU and performing colour conversion, gamma correction and even the sensor's flat-field correction on the GPU at high speed.  We also use the GPU to produce reduced-size greyscale images for processing and analysis alongside the regular colour-converted image for display.  The ability to process such high resolution images using the GPU has really made the difference for this application, which would not be possible without this capability.
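
To give a flavour of the per-pixel work involved, a flat-field correction kernel can be as simple as the sketch below (assuming 8-bit raw pixels and per-pixel gain and dark-offset calibration arrays already resident in device memory; the names are illustrative):

__global__ void flatField(unsigned char *img, const float *gain,
                          const float *dark, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* Classic flat-field correction: subtract the dark frame,
           then scale by the per-pixel gain map */
        float v = ((float)img[i] - dark[i]) * gain[i];
        img[i] = (unsigned char)fminf(fmaxf(v, 0.0f), 255.0f);
    }
}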
Vision Experts

Wednesday, 10 February 2010

Interface Acceleration

Machine vision sensors are getting big, and cameras are increasingly available with pixel counts that are truly enormous by historical standards.  Cameras in the 10+ megapixel range seem to be increasing in popularity for industrial inspection, possibly driven by the consumer market in which such large sensors are now the norm, partly due to price decreases, and possibly because processing and storing the data is just about feasible these days.

The bandwidth between camera and computer is also increasing, which it needs to.  Already, it seems that a single GigE connection just isn't enough bandwidth for tomorrow's applications.  For example, AVT have a dual GigE output on a camera to give 2Gbits/sec of bandwidth.  The CoaXPress digital interface is capable of 6.25Gbits/sec over 50m of pretty much bog-standard coax cable, a capability I find incredible.  Likewise, the HSLINK standard, proposed by DALSA, uses InfiniBand to achieve 2100Mbytes/sec.  Most of these standards even permit using multiple connections to double or quadruple the bandwidth.  With all this data flying around, trying to process it on a PC is going to be like taking a drink from a hose pipe.  Or two, or four.

Think about it: at 2Gbits/sec, the computational demand will be 250Mpix/sec (assuming 8-bit pixels).  Using a 3GHz processor core, that's 12 clock cycles available per pixel.  You can't do a whole lot of processing with that.  Even if you scale up to quad-core and make sure you use as many SSE SIMD instructions as you can, you still aren't going to be doing anything sophisticated with that data.  It could be like machine vision development 15 years ago, when I remember the only realistic goal was to count the number of pixels above threshold to take a measurement!


I feel that the new generation of ultra-high resolution cameras streaming data at ultra-high bandwidths is going to require a new generation of processing solutions.  I suspect this will come in the form of massively parallel processors - such as GPUs and perhaps Intel's Larrabee processor (when it finally materialises).


In the meantime, I'm plugging away writing GPU-accelerated algorithms just for format conversion, so that we can even display and store this stuff.


Vision Experts

Friday, 5 February 2010

GPU Supercluster

I was interested to see this GPU system doing some biologically inspired processing at Harvard.  Whilst I doubt that there will be any practical industrial applications to emerge from this, it does show how inexpensive it can be to build a minor supercomputer. To quote from their website...


...With peak performance around 4 TFLOPS (4 trillion floating point operations per second), this little 18”x18”x18” cube is perhaps one of the world’s most compact and inexpensive supercomputers....




Vision Experts