Accelerated Image Processing: Real-time CUDA Video Processing

Sunday, 8 November 2009

Real-time CUDA Video Processing

This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application. The problem is related to the way most image acquisition devices work under Windows: via interrupt-driven capture callback functions.

Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor-specific programming API. This API typically uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing: events or callbacks. Either your application starts a thread which waits for a sync Event (using WaitForSingleObject) to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready. It's pretty easy to swap between the two methods, but that's the subject of another blog entry.
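As a rough illustration of swapping a callback for an event-style wait, here is a minimal C++ sketch. It is not taken from any vendor API: FrameEvent and runCapture are hypothetical names, std::condition_variable stands in for a Win32 Event, and the "vendor callbacks" are simulated by a plain loop. Unlike an auto-reset Event, this version counts signals so that back-to-back frames are not lost.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a vendor's "frame ready" notification.
// A counting signal built on std::condition_variable plays the role of
// the Win32 Event that WaitForSingleObject would wait on.
class FrameEvent {
public:
    // Called from the capture callback: note that one more frame is ready.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++pending_;
        }
        cv_.notify_one();
    }

    // Called from the processing thread: block until a frame is ready or
    // the timeout expires. Returns true if a frame was consumed.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!cv_.wait_for(lock, timeout, [this] { return pending_ > 0; }))
            return false;
        --pending_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int pending_ = 0;
};

// Simulates `frames` capture callbacks and returns how many frames the
// processing thread handled. All processing runs on a thread we own.
int runCapture(int frames) {
    assert(frames >= 0);
    FrameEvent event;
    int processed = 0;

    std::thread worker([&] {
        for (int i = 0; i < frames; ++i) {
            if (!event.wait(std::chrono::seconds(5))) break;
            ++processed;  // real code would process the frame here
        }
    });

    // Simulated vendor callbacks: each one only signals and returns.
    for (int i = 0; i < frames; ++i) event.signal();

    worker.join();  // join makes `processed` safe to read here
    return processed;
}
```

The point of the design is that the callback does nothing except signal, so all the real work happens on a thread whose lifetime the application controls.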

The problem I have come across relates to using the CUDA runtime API with capture callbacks. You see, the runtime API is pretty hard-wired to be single threaded. You have to allocate, use, and deallocate all CUDA runtime resources on the same thread. That's why all the runtime API demos in the NVIDIA SDK are single-threaded console apps. The runtime appears to set up its own private context and attach it to the calling thread the first time any function is called; from then on, only that thread is able to use functions in the runtime API. If any other thread tries to do something with device memory or call a kernel, you get an exception without explanation.

So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.

Now, for an imaging application we know we have to use CUDA in the capture callback thread to process each new image as it arrives. We'll probably want to copy the newly captured image to the GPU device and then run a kernel on it to do something useful. Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in the same capture callback thread. But we don't really want to allocate and deallocate device memory on every new frame, as that is very inefficient, so we put in a little catch to only allocate the first time the callback runs. Everything seems great: our callback runs a bit slowly the first time, but it runs. It seems great until you realise that you don't know when your callback will be called for the last time, so you don't know when to deallocate anything. And you can't do it when the application exits, because only the capture callback thread is allowed to touch the runtime API. Now that's a problem.
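The trap described above can be modelled in a few lines of plain C++. This is only a sketch under stated assumptions: ThreadBoundBuffer is a hypothetical class that mimics the "first thread owns it" rule, it is not the CUDA runtime, and onFrame stands in for a vendor capture callback with the allocate-on-first-call catch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical model of a thread-bound resource. It is NOT the CUDA
// runtime; it only mimics the rule that the thread which allocates the
// resource is the only one allowed to release it.
class ThreadBoundBuffer {
public:
    void allocate(std::size_t bytes) {
        owner_ = std::this_thread::get_id();
        data_.assign(bytes, 0);
        allocated_ = true;
    }

    void release() {
        if (std::this_thread::get_id() != owner_)
            throw std::runtime_error("resource used from the wrong thread");
        data_.clear();
        allocated_ = false;
    }

    bool allocated() const { return allocated_; }

private:
    std::thread::id owner_;
    std::vector<unsigned char> data_;
    bool allocated_ = false;
};

// Capture callback with the "allocate on first call" catch.
void onFrame(ThreadBoundBuffer& buf, std::size_t frameBytes) {
    assert(frameBytes > 0);
    if (!buf.allocated()) buf.allocate(frameBytes);  // slow path, first frame only
    // ... copy the frame in and process it ...
}

// Runs `frames` callbacks on a capture thread, then tries to release the
// buffer from the calling thread. Returns true if that release is rejected.
bool releaseFromOtherThreadFails(int frames) {
    ThreadBoundBuffer buf;
    std::thread capture([&] {
        for (int i = 0; i < frames; ++i) onFrame(buf, 640 * 480);
    });
    capture.join();  // the capture thread is gone; nobody told us it was the last frame

    try {
        buf.release();  // cleanup attempted at "application exit"
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

In this model the cleanup at exit fails for the same reason as in the post: the only thread allowed to free the buffer has already finished, and it never knew which frame was its last.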

There are also problems with OpenGL interop. The cudaGLRegisterBufferObject function really needs a matching cudaGLUnregisterBufferObject call before the callback thread terminates. If you don't unregister, then CUDA barfs an exception from deep in the bowels of nvcuda.dll when the callback thread terminates. But registering and unregistering every time a new frame arrives is really inefficient. So it's all getting sticky with the CUDA runtime API.

The solution is to start out with the CUDA driver API for any real-time, multi-threaded imaging application. Lesson learnt.


Welcome

Practical software & algorithm development in the machine vision industry.

Often a pretty technical blog, often just observations on being a developer, leader and human being in the Machine Vision Industry.

I'm a professional software engineer, tech lead and company director at Vision Experts and Red Engine.

I spend most of my time managing a small team of world class engineers, inventing IP and consulting for industry in the computer vision space.

Websites can be found at www.flightclubdarts.com and www.visionexperts.co.uk

The Parallel Revolution

“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
- Paul Otellini, President, Intel (2005)

“Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices.”
- Bill Gates, Microsoft

“When we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.”
- John Hennessy, President of Stanford
