Accelerated Image Processing

Friday, 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty of accessing 24-bit RGB data in CUDA, caused by the restriction of optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and here's the best plan I have so far.

The kernel is launched with a 64x1 block size, and the image data is passed in as int* so that the 32-bit accesses are coalesced. The input is read into shared memory as 48 ints using the first 48 threads; the output is written as 64 ints using all 64 threads. Since 64 RGB triplets occupy 192 bytes, which is exactly 48 ints, each block converts 64 pixels. During the read into shared memory, 16 threads are idle. This is exactly one half-warp, so it should not waste much time, as I believe the entire idle half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
    //pitchin and pitchout are row pitches measured in 32-bit ints, not bytes
    unsigned int xin = blockIdx.x*48 + threadIdx.x;
    unsigned int xout = blockIdx.x*64 + threadIdx.x;
    unsigned int y = blockIdx.y;

    //Shared memory for 48 input ints (192 bytes = 64 packed triplets)
    __shared__ unsigned int rgbTriplets[48];
    unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];

    //Global memory read into shared memory
    //Read in 48 x 32bit ints which will load 64 packed rgb triplets.
    //Only 48 of the 64 threads are active during this read.
    //48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
    //but leaves one half-warp doing nothing
    if (threadIdx.x<48)
    {
        rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin);
    }
    __syncthreads();

    //Each of the 64 threads picks its own triplet out of shared memory
    unsigned int tidrgb = threadIdx.x*3;
    float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]);

    //Convert to YUV
    uchar3 yuvpix;
    yuvpix.x = 0.2990f*rgbpix.x + 0.5870f*rgbpix.y + 0.1140f*rgbpix.z;
    yuvpix.y = -0.1687f*rgbpix.x - 0.3313f*rgbpix.y + 0.5000f*rgbpix.z + 128.0f;
    yuvpix.z = 0.5000f*rgbpix.x - 0.4187f*rgbpix.y - 0.0813f*rgbpix.z + 128.0f;

    //Write out 64 ints which are 64 32bit YUVX quads
    //make_color_rgb packs three bytes into one 32-bit int; it is not defined in this listing
    *(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

    return;
}
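A host-side reference is handy for checking a kernel like this one block at a time. The sketch below repeats the same arithmetic on the CPU for a single block of 48 input ints. The function name rgb24_block_to_yuv32 is made up for illustration, and the output byte order (Y in the lowest byte, then U, then V, with X left at zero) is an assumption, since the packing done by make_color_rgb is not shown in the listing above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// CPU reference for one 64-pixel block (hypothetical helper, not from the post).
// in48  : 48 packed 32-bit words holding 64 BGR triplets (192 bytes)
// out64 : 64 output words, one YUVX quad per pixel
// ASSUMPTION: Y, U, V are packed from the lowest byte upward and X is zero.
void rgb24_block_to_yuv32(const std::uint32_t* in48, std::uint32_t* out64)
{
    assert(in48 != nullptr && out64 != nullptr);

    // Same reinterpretation as the pBGR pointer into shared memory.
    unsigned char bgr[192];
    std::memcpy(bgr, in48, sizeof(bgr));

    for (int t = 0; t < 64; ++t) {          // t plays the role of threadIdx.x
        const float b = bgr[t * 3 + 0];
        const float g = bgr[t * 3 + 1];
        const float r = bgr[t * 3 + 2];

        // Same coefficients as the kernel; the cast truncates like the uchar3 store.
        const std::uint32_t Y = static_cast<unsigned char>( 0.2990f * r + 0.5870f * g + 0.1140f * b);
        const std::uint32_t U = static_cast<unsigned char>(-0.1687f * r - 0.3313f * g + 0.5000f * b + 128.0f);
        const std::uint32_t V = static_cast<unsigned char>( 0.5000f * r - 0.4187f * g - 0.0813f * b + 128.0f);

        out64[t] = Y | (U << 8) | (V << 16);
    }
}
```

Feeding the same 48 ints to both the kernel and this function, then comparing the 64 outputs, should show up any indexing mistake in the shared memory read.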

Thursday, 26 November 2009

RGB Images and CUDA

When using CUDA, storing colour images in a 32-bit RGBX format pays massive performance dividends compared to the more commonly found 24-bit RGB image format. If you have the choice, avoid 24-bit colour.

NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos. All these demos just happen to load 32-bit RGBX images for any of their colour image processing routines. They effectively do the transformation from 24-bit to 32-bit during the file load and then hide this cost when running the demo. Good for them, but back here in the real world my image capture device is throwing out 24-bit RGB images at 60Hz.
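For reference, the load-time transformation amounts to something like the host-side sketch below. The function name and the choice of zero for the padding byte are assumptions for illustration. Doing this on the CPU for every captured frame is exactly the cost that is hard to afford at 60Hz.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand packed 24-bit RGB rows into tightly packed 32-bit RGBX.
// srcPitchBytes is the distance in bytes between the starts of consecutive source rows.
// ASSUMPTION: the unused X byte is set to 0.
std::vector<std::uint8_t> expand_rgb24_to_rgbx32(const std::uint8_t* src,
                                                 std::size_t width,
                                                 std::size_t height,
                                                 std::size_t srcPitchBytes)
{
    assert(src != nullptr && srcPitchBytes >= width * 3);

    std::vector<std::uint8_t> dst(width * height * 4);
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* s = src + y * srcPitchBytes;
        std::uint8_t* d = dst.data() + y * width * 4;
        for (std::size_t x = 0; x < width; ++x) {
            d[4 * x + 0] = s[3 * x + 0];
            d[4 * x + 1] = s[3 * x + 1];
            d[4 * x + 2] = s[3 * x + 2];
            d[4 * x + 3] = 0;               // padding byte
        }
    }
    return dst;
}
```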

For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:

texture < uchar4, 2, cudaReadModeNormalizedFloat > tex;

which works fine. You can then access pixels using tex2D and all is well. However, if you have an RGB24 image and try this:

texture < uchar3, 2, cudaReadModeNormalizedFloat > tex;

It just won't work. There is no version of tex2D able to fetch 24-bit RGB pixels. In fact, you cannot even allocate a CUDA array with 3 channels - if you try this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 3;
desc.Width = m_dWidth;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );

then the array creation will fail. With CUDA you can only access textures and declare array memory with NumChannels equal to 1, 2 or 4.

Furthermore, it is not possible to convert from 24-bit to 32-bit during a cuMemcpy2D call. Whilst the copy will pad the line length to match the aligned pitch (256 bytes), it will not pad each pixel to match the destination array format.

The only solution is to declare your input array as 1 channel but three times as wide, like this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 1;
desc.Width = m_dWidth*3;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );

You can then access pixels in groups of three in your kernel. Unfortunately, don't expect coalesced memory accesses when each thread has to read three bytes.
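The addressing in the triple-width layout can be sketched on the host as follows. The names fetch_rgb24 and straddles_word are hypothetical, and the R, G, B byte order is an assumption. The second helper shows why the access pattern is awkward: pixel x occupies bytes 3x to 3x+2, so two out of every four pixels cross a 32-bit word boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

// Read pixel (x, y) from a single-channel image that is 3*width bytes wide.
// ASSUMPTION: bytes are stored in R, G, B order.
inline Rgb fetch_rgb24(const std::uint8_t* image, std::size_t pitchBytes,
                       std::size_t x, std::size_t y)
{
    assert(image != nullptr);
    const std::uint8_t* p = image + y * pitchBytes + x * 3;
    return Rgb{ p[0], p[1], p[2] };
}

// True if the three bytes of pixel x (within a row) span two 32-bit words.
inline bool straddles_word(std::size_t x)
{
    return (3 * x) / 4 != (3 * x + 2) / 4;
}
```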

Sunday, 8 November 2009

Real-time CUDA Video Processing

This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application. The problem is related to the way most image acquisition devices work under Windows - that is, via interrupt driven capture callback functions.

Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor-specific programming API. This API typically uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing: events or callbacks. Either your application starts a thread which waits for a sync Event (using WaitForSingleObject) to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready. It's pretty easy to swap between the two methods, but that's the subject of another blog entry.

The problem I have come across relates to using the CUDA runtime API with capture callbacks. You see, the runtime API is pretty hard-wired to be single threaded. You have to allocate, use, and deallocate all CUDA runtime resources on the same thread. That's why all the runtime API demos with the NVidia SDK are single-threaded console apps. The runtime appears to set up its own private context and attach it to the calling thread the first time any function is called; from then on only that thread is able to use functions in the runtime API. If any other thread tries to do something with device memory or call a kernel, you get an exception without explanation.

So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.

Now, for an imaging application we know we have to use CUDA in the capture callback thread to process each new image as it arrives. We'll probably want to copy the newly captured image to the GPU device and then run a kernel on it to do something useful. Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in the same capture callback thread. But we don't really want to allocate and deallocate device memory every new frame as that is very inefficient, so we put in a little catch to only allocate the first time the callback runs. Everything seems great; our callback runs a bit slow the first time, but it runs. It seems great until you realise that you don't know when your callback will be called for the last time, so you don't know when to deallocate anything. And you can't do it when the application exits, because only the capture callback thread is allowed to touch the runtime API. Now that's a problem.

There are also problems with OpenGL interop. The cudaGLRegisterBufferObject function really needs a matching cudaGLUnregisterBufferObject call before the callback thread terminates. If you don't unregister then CUDA barfs an exception from deep in the bowels of nvcuda.dll when the callback thread terminates. But if you register/unregister every time a new frame arrives, that is really inefficient. So it's all getting sticky with the CUDA runtime API.

The solution is to start out with the CUDA driver API for any real-time multi-threaded imaging applications. Lesson learnt.
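Whichever API is used, the underlying requirement is that allocation, processing and cleanup all happen on a thread whose lifetime the application controls. One way to arrange that, sketched below in plain C++ with no CUDA calls, is a dedicated worker thread that the capture callback merely hands frames to. This is not the design described in this post. The class name GpuWorker and the init, process and cleanup hooks are hypothetical placeholders for wherever the real allocation, kernel launch and deallocation would go.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Owns one thread; init, process and cleanup all run on that thread only.
class GpuWorker {
public:
    using Frame = std::vector<std::uint8_t>;

    GpuWorker(std::function<void()> init,
              std::function<void(const Frame&)> process,
              std::function<void()> cleanup)
    {
        assert(init && process && cleanup);
        thread_ = std::thread([this, init, process, cleanup] {
            init();                                  // e.g. allocate device memory
            for (;;) {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) break;           // stop requested, nothing pending
                Frame frame = std::move(queue_.front());
                queue_.pop();
                lock.unlock();
                process(frame);                      // e.g. upload and run a kernel
            }
            cleanup();                               // e.g. free device memory
        });
    }

    // Drains any pending frames, runs cleanup on the worker, then joins.
    ~GpuWorker()
    {
        { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }

    // Called from the capture callback: queue a copy of the frame and return.
    void submit(Frame frame)
    {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(frame)); }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Frame> queue_;
    bool stop_ = false;
    std::thread thread_;
};
```

The application creates the worker at start-up and destroys it at shutdown, so the question of when the last callback arrives no longer matters. The price is one extra copy of each frame into the queue.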

Sunday, 25 October 2009

Proper Work

I can barely believe that it's nearly that time of year when I pack up my laptop and go to Stuttgart for the Vision show to catch up with the latest news and technology in the Machine Vision industry. Last year I returned feeling enthusiastic and determined to produce some quality software algorithm of my own using my newly found CUDA skills. Now that an entire year has passed I have been struggling to remember what exactly I have achieved towards that goal. It appears that I have been doing a lot of WORK, which pays the bills, but hardly any work.

Emanuel Derman spoke about WORK and work, which captured the essence of what I think many talented engineers feel every day. There is always a lot of WORK to be done: things like paperwork and meetings and bug fixes and presentations and little tasks which get you through the day and pay your bills. But the number of days in which real work gets done, stuff which will last more than a day and which feels like rewarding constructive activity... well, that doesn't seem to happen enough.

The only solution I have found is to do some work after a full day of WORK. Unfortunately, an hour or two of typing in the evening, propped up by strong coffee and hungry for food, is not the environment which germinates really good blue-sky development.

So I'm taking out a few days next week, turning off the mobile and shutting down Outlook, and doing some work.

Saturday, 12 September 2009

OpenGL interop Woes

Writing real-world multi-threaded apps to capture, process and display video data in real-time is probably, in fairness, a slightly advanced topic. But after a fair amount of experimentation and frustration I think I can offer a piece of advice to other would-be image processing engineers embarking on a CUDA project:

Basically, don't use the CUDA runtime API in a real-time multi-threaded imaging application. And definitely don't use the CUDA runtime API with OpenGL interop in a real-time multi-threaded imaging application.

You can just about get away with using the runtime API in a multi-threaded app IF you restrict your app so that only one host thread ever touches CUDA. That's not usually possible in a real-time imaging system with interrupt driven capture callbacks and an asynchronous processing and display architecture. If you persist with the runtime API then...

Bad Things Can Happen

Under the hood, the runtime API is creating a CUDA context and attaching it to the first thread that touched CUDA. From then on, only that host thread should touch any CUDA API function, and if that thread terminates before CUDA resources are deallocated then bad things can happen. Alternatively, if you allocate some device memory in your application start-up, but then try to access or process that memory in a capture callback thread, then once again bad things can happen. Worst of all, if OpenGL interop tries to do something on one thread whilst your capture callback is doing something on another, then some CUDA operations may work, but sometimes very bad things can happen. For instance, I was quite successfully and repeatably able to instantly reboot my PC by running a badly coded piece of multi-threaded CUDA code with OpenGL interop. It was probably my fault, but that is a difficult one to debug.

This is what led me to use the Driver API in all subsequent imaging applications and really take control over which host thread owned and used the CUDA context. So far, no problems. No crashes.

Vision Experts

Saturday, 4 July 2009

CUDA function overheads

Whilst working on my CUDA accelerated JPEG algorithm I found a problem with my design, which demanded launching a large number of small kernels followed by many thousands of small memory copy operations. I was launching kernels to compress a fixed number of image blocks, many hundreds in all. The result was a set of compressed image blocks whose output sizes were only known at runtime after the algorithm had finished, and retrieving them required many thousands of memory copy operations. The design was bad, but I was trying things out to see what would happen.

On a CPU, a function call will typically take a few nanoseconds to push parameters on the stack and jump the program pointer to the function address. On the GPU however, much more work has to be performed via the driver. So kernel launches and CUDA memory copy operations take at least three orders of magnitude longer to set up than a CPU call - several microseconds in all.

This means that if you want to perform many hundreds or thousands of calls then the function calls themselves can start to add up much more quickly than the equivalent CPU calls. This effect can then become significant - so make your kernels big!
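To put rough numbers on this, the sketch below multiplies a per-call overhead by a call count. The figures in the usage note (5 microseconds per GPU call, 5 nanoseconds per CPU call, 10,000 calls) are illustrative assumptions in line with the orders of magnitude quoted above, not measurements.

```cpp
#include <cassert>

// Total time in milliseconds spent purely on call overhead.
inline double call_overhead_ms(long calls, double perCallMicroseconds)
{
    assert(calls >= 0 && perCallMicroseconds >= 0.0);
    return calls * perCallMicroseconds / 1000.0;
}
```

With those assumed figures, call_overhead_ms(10000, 5.0) comes to 50 ms for the GPU calls, against roughly 0.05 ms for call_overhead_ms(10000, 0.005) on the CPU. At 60Hz there are only about 16.7 ms available per frame, so the overhead alone would already exceed the budget.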

Thursday, 11 June 2009

How to Display YUV420 Video

Recently, I came across YUV420 image data whilst working with a hardware H264 compression card. The image data was planar, and arranged like the image below. This is standard YUV420 planar format, with the U and V components being at 1/2 resolution of the Y component.

Now, when using MS Windows, you have to display images using bitmaps packed as RGB888 or RGBA8888 colour format. Even when using OpenGL you need basically RGB or 8bit grey images. I believe MacOS might be nicer (anybody?) and certainly supports YUV422, but under Windows, you have a problem with YUV. DirectX might help out too (anybody?) - but I live in the OpenGL world here.

So how do we display YUV420 video in real-time?

The first thing I tried was a CPU conversion to RGB888, followed by a transfer of the RGB data to OpenGL for display. It was easy enough to code in C++ and took about an hour to optimise. But it still took about 8ms per frame to convert (on 768x576 frames) and really hit the CPU loading, which felt like a real waste of clock cycles for just displaying an image.

The solution we ended up with was to transfer YUV420 image data raw as GL_LUMINANCE image data, essentially just transferring the whole image (as above) as if it were a 768x864 greyscale image. We then wrote a Cg fragment shader to do the YUV to RGB conversion and display on the graphics unit. This worked a treat and even the Intel embedded graphics on the motherboard was able to handle the shader. This reduced the time to 1.4ms per frame, without any CPU loading.
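The lookup that such a shader has to perform can be sketched on the host as below. For a 768x576 frame the Y plane fills rows 0 to 575 of the 768x864 greyscale image, and each quarter-size chroma plane fills 144 further rows. Two things here are assumptions: the plane order (Y, then U, then V, as in the common I420 layout) and the full-range BT.601 conversion coefficients, which are the inverse of the ones used for RGB to YUV elsewhere on this blog. A capture card may well use a different order or range. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct TexelPos { std::size_t col, row; };

// Where the chroma sample for pixel (x, y) sits in the width x (height*3/2)
// greyscale image. plane is 0 for U and 1 for V.
// ASSUMPTION: planes are stored in Y, U, V order.
inline TexelPos chroma_pos(std::size_t x, std::size_t y,
                           std::size_t width, std::size_t height, int plane)
{
    assert(width % 2 == 0 && height % 2 == 0 && (plane == 0 || plane == 1));
    const std::size_t cw = width / 2;
    const std::size_t ch = height / 2;
    const std::size_t index  = (y / 2) * cw + (x / 2);           // within the plane
    const std::size_t offset = width * height + plane * cw * ch + index;
    return TexelPos{ offset % width, offset / width };
}

struct Rgb8 { std::uint8_t r, g, b; };

inline std::uint8_t clamp8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return static_cast<std::uint8_t>(v + 0.5f);                  // round to nearest
}

// ASSUMPTION: full-range BT.601 (JFIF style) coefficients.
inline Rgb8 yuv_to_rgb(std::uint8_t Y, std::uint8_t U, std::uint8_t V)
{
    const float u = U - 128.0f;
    const float v = V - 128.0f;
    return Rgb8{ clamp8(Y + 1.402f * v),
                 clamp8(Y - 0.344136f * u - 0.714136f * v),
                 clamp8(Y + 1.772f * u) };
}
```

Note that each 768-wide row of the greyscale image holds two 384-wide chroma rows side by side, which is why the position is computed from the byte offset instead of by simply halving the coordinates.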

To finish up, we wrapped up the entire functionality as a stand-alone DLL with just a few simple function calls. Now anybody here can display YUV420 images in a window without any CPU overhead and without having to be concerned about how it happens. NVidia Cg requires two additional DLLs to be supplied with the package, but that's it.

You can get the DLL from us at http://www.vision4ce.com