Thursday 26 November 2009

RGB Images and CUDA

When using CUDA, storing colour images in a 32-bit RGBX format pays massive performance dividends compared to the more commonly found 24-bit RGB format. If you have the choice, avoid 24-bit colour.

NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos. Every demo just happens to load 32-bit RGBX images for its colour image processing routines. They effectively do the conversion from 24-bit to 32-bit during the file load and so hide this cost when running the demo. Good for them, but back here in the real world my image capture device is throwing out 24-bit RGB images at 60 Hz.

For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:

texture<uchar4, 2, cudaReadModeNormalizedFloat> tex;

which works fine. You can then access pixels using tex2D and all is well. However, if you have a 24-bit RGB image and try this:


texture<uchar3, 2, cudaReadModeNormalizedFloat> tex;

it just won't work. There is no version of tex2D that can fetch 24-bit RGB pixels. In fact, you cannot even allocate a CUDA array with three channels - if you try this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 3;
desc.Width = m_dWidth;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );


then the array creation will fail. CUDA only lets you declare array memory and access textures with NumChannels equal to 1, 2 or 4.

Furthermore, it is not possible to convert from 24-bit to 32-bit during a cuMemcpy2D call - whilst this will pad each line to align the pitch (to 256 bytes), it will not pad each pixel to match the destination array format.

The only solution is to declare your input array as 1 channel but three times as wide, like this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 1;          // a single 8-bit channel...
desc.Width = m_dWidth*3;       // ...but three bytes for every pixel
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );
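
Once the array is declared like that, getting the 24-bit frame into it is just a straight byte copy. Something along these lines should do it (a sketch only - pHostFrame is my stand-in for wherever your capture buffer lives, and I'm assuming its lines are tightly packed):

CUDA_MEMCPY2D cpy;
memset( &cpy, 0, sizeof(cpy) );

cpy.srcMemoryType = CU_MEMORYTYPE_HOST;
cpy.srcHost       = pHostFrame;        // tightly packed 24-bit RGB frame (assumption)
cpy.srcPitch      = m_dWidth * 3;      // bytes per source line

cpy.dstMemoryType = CU_MEMORYTYPE_ARRAY;
cpy.dstArray      = m_dInputArray;

cpy.WidthInBytes  = m_dWidth * 3;      // note: bytes, not pixels
cpy.Height        = m_dHeight;

cuMemcpy2D( &cpy );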


You can then access pixels in groups of three in your kernel. Unfortunately, don't expect coalesced memory accesses when each thread has to read three separate bytes....
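
For illustration, a greyscale conversion kernel over that one-channel, triple-width texture might look something like this (the kernel name, the texture reference and the float output buffer are all just placeholders for the sketch):

texture<unsigned char, 2, cudaReadModeNormalizedFloat> texSrc;   // bound to the 1-channel array

__global__ void Rgb24ToGrey( float* pDst, int width, int height )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if( x >= width || y >= height )
        return;

    // three separate fetches per pixel - this is where the tidy
    // access pattern goes out of the window
    float r = tex2D( texSrc, 3.0f * x + 0.5f, y + 0.5f );
    float g = tex2D( texSrc, 3.0f * x + 1.5f, y + 0.5f );
    float b = tex2D( texSrc, 3.0f * x + 2.5f, y + 0.5f );

    pDst[ y * width + x ] = 0.299f * r + 0.587f * g + 0.114f * b;
}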

Sunday 8 November 2009

Real-time CUDA Video Processing

This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application. The problem relates to the way most image acquisition devices work under Windows - that is, via interrupt-driven capture callback functions.

Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor-specific programming API. This API typically uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing: events or callbacks. Either your application starts a thread which waits for a sync event (using WaitForSingleObject) to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready. It's pretty easy to swap between the two methods, but that's the subject of another blog entry.
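
For what it's worth, the two flavours boil down to roughly this (a sketch only - no real vendor SDK looks exactly like this, and the globals and ProcessFrame are just placeholders for your own code):

#include <windows.h>

extern volatile bool   g_bRunning;          // set false to stop capturing
extern HANDLE          g_hFrameReadyEvent;  // signalled by the vendor API
extern unsigned char*  g_pLatestFrame;      // filled in by the vendor API
void ProcessFrame( unsigned char* pData );  // our own processing entry point

// Method 1: a thread of our own waits on a sync event
DWORD WINAPI CaptureThread( LPVOID /*pParam*/ )
{
    while( g_bRunning )
    {
        if( WaitForSingleObject( g_hFrameReadyEvent, INFINITE ) == WAIT_OBJECT_0 )
            ProcessFrame( g_pLatestFrame );
    }
    return 0;
}

// Method 2: a callback called directly by the vendor API, on its thread
void __stdcall OnFrameCaptured( unsigned char* pData, void* /*pUserData*/ )
{
    ProcessFrame( pData );
}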

The problem I have come across relates to using the CUDA runtime API with capture callbacks. You see, the runtime API is pretty much hard-wired to be single threaded: you have to allocate, use and deallocate all CUDA runtime resources on the same thread. That's why all the runtime API demos in the NVidia SDK are single-threaded console apps. The runtime appears to set up its own private context and attach it to the calling thread the first time any function is called; from then on, only that thread is able to use the runtime API. If any other thread tries to touch device memory or launch a kernel, you get an exception without explanation.
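
To make that concrete, here is roughly the shape of the failure (a stripped-down sketch using Win32 threads; with the runtime of the time, the cudaFree on the second thread fails because that thread ends up with a different implicit context):

#include <windows.h>
#include <cuda_runtime.h>

float* g_devBuffer = 0;

DWORD WINAPI WorkerThread( LPVOID )
{
    // Different thread from the one that called cudaMalloc - the pointer
    // belongs to a context bound to the other thread, so this call fails.
    cudaError_t err = cudaFree( g_devBuffer );
    return (DWORD)err;
}

int main()
{
    // First runtime call: the runtime silently creates a context and
    // binds it to this thread.
    cudaMalloc( (void**)&g_devBuffer, 1024 * sizeof(float) );

    HANDLE h = CreateThread( 0, 0, WorkerThread, 0, 0, 0 );
    WaitForSingleObject( h, INFINITE );
    CloseHandle( h );
    return 0;
}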

So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.

Now, for an imaging application we know we have to use CUDA in the capture callback thread to process each new image as it arrives. We'll probably want to copy the newly captured image to the GPU and then run a kernel on it to do something useful. Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in that same capture callback thread. But we don't really want to allocate and deallocate device memory on every new frame, as that is very inefficient, so we put in a little check so that we only allocate the first time the callback runs. Everything seems great: our callback runs a bit slow the first time, but it runs. It seems great, that is, until you realise that you don't know when your callback will be called for the last time, so you don't know when to deallocate anything. And you can't do it when the application exits, because only the capture callback thread is allowed to touch the runtime API. Now that's a problem.
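
That allocate-on-first-frame pattern ends up looking something like this (sketch only - the callback signature, frame size and kernel are placeholders - and notice there is nowhere sensible to put the cudaFree):

#include <cuda_runtime.h>

static unsigned char* g_devFrame = 0;               // device copy of the incoming frame
static const int      FRAME_BYTES = 640 * 480 * 3;  // placeholder frame size

// hypothetical vendor callback - runs on the vendor's capture thread
void __stdcall OnFrameCaptured( unsigned char* pData, void* /*pUserData*/ )
{
    // first time through: allocate on this thread, because the runtime
    // has bound its context to the capture thread
    if( g_devFrame == 0 )
        cudaMalloc( (void**)&g_devFrame, FRAME_BYTES );

    cudaMemcpy( g_devFrame, pData, FRAME_BYTES, cudaMemcpyHostToDevice );

    // ... launch a kernel on g_devFrame here ...

    // but when do we cudaFree( g_devFrame )? We never know which call is
    // the last one, and no other thread is allowed to do it for us.
}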

There are also problems with OpenGL interop. A cudaGLRegisterBufferObject call really needs a matching cudaGLUnregisterBufferObject call before the callback thread terminates. If you don't unregister, CUDA barfs an exception from deep in the bowels of nvcuda.dll when the callback thread terminates. But if you register and unregister every time a new frame arrives, that is really inefficient. So it's all getting sticky with the CUDA runtime API.
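
For reference, the wasteful register-every-frame version looks roughly like this (sketch only; the pixel buffer object handle and the function name are assumptions, not part of any SDK):

#include <windows.h>
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// an already-created OpenGL pixel buffer object (assumption)
extern GLuint gTexturePBO;

void UploadFrameToPBO( const unsigned char* pFrame, size_t frameBytes )
{
    void* pDevPtr = 0;

    cudaGLRegisterBufferObject( gTexturePBO );       // register...
    cudaGLMapBufferObject( &pDevPtr, gTexturePBO );  // ...and map into CUDA space

    cudaMemcpy( pDevPtr, pFrame, frameBytes, cudaMemcpyHostToDevice );
    // ...or run a kernel writing into pDevPtr...

    cudaGLUnmapBufferObject( gTexturePBO );          // unmap...
    cudaGLUnregisterBufferObject( gTexturePBO );     // ...and unregister again
}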

The solution is to start out with the CUDA driver API for any real-time, multi-threaded imaging application. Lesson learnt.
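
The driver API gives you an explicit context handle that you can push onto and pop off whichever thread needs it. One common pattern at the time, sketched here on the assumption of a single capture callback thread, is to create the context up front, leave it floating, and push/pop it around the per-frame work:

#include <cuda.h>

static CUcontext g_cuContext = 0;

// main/init thread: create the context, then detach it so it is free
// to be attached to whichever thread needs it next
void InitCuda()
{
    CUdevice dev;
    cuInit( 0 );
    cuDeviceGet( &dev, 0 );
    cuCtxCreate( &g_cuContext, 0, dev );
    cuCtxPopCurrent( 0 );               // leave the context floating
}

// capture callback thread: attach the context, do the work, detach again
void __stdcall OnFrameCaptured( unsigned char* pData, void* /*pUserData*/ )
{
    cuCtxPushCurrent( g_cuContext );
    // ...cuMemcpy2D into the array, launch kernels, and so on...
    cuCtxPopCurrent( 0 );
}

// application shutdown, from any thread
void ShutdownCuda()
{
    cuCtxPushCurrent( g_cuContext );
    // ...free arrays and device memory here...
    cuCtxDestroy( g_cuContext );
}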