Accelerated Image Processing

Friday, 22 January 2010

LoaderLock MDA

This post isn't really about accelerated image processing, but the topic is related to deployment of DLLs of any type. I hope this helps somebody save some time if they encounter this issue. Whilst developing a C# demo app for one of my CUDA libraries, I encountered a strange error:

LoaderLock was detected
Message: DLL 'Cephalon.dll' is attempting managed execution inside OS Loader lock. Do not attempt to run managed code inside a DllMain or image initialization function since doing so can cause the application to hang.

It took me a while to figure out what was going on, and it was related to how I build CUDA libraries.

When I make a CUDA enabled library, I wrap up the compiled kernel cubin files as a resource compiled into the DLL itself. An alternative simpler method is to supply a cubin text file along with each dll and load it directly using the CUDA function:

cuModuleLoad(&cuModule, pszModulePath)

but having two files can lead to version control and maintainance issues. Plus, I try and make a living doing this, and dont really want people reading my precious kernel code that took four months to write too easily.

So I wrap up the compiled code string neatly inside the DLL as a resource, then get that resource string and compile it on-the-fly (or just-in-time) using the alternative CUDA function:

cuModuleLoadDataEx( &cuModule,pCubinStr,3,&jitOptions[0],&jitOptVals[0]);

This is great but in order to get the string resource from inside the DLL I need to call a varient of LoadResource. And I need to call FindResource to find that resource first. And I need to call GetModuleHandle("LibraryName.dll") before any of those. The problem is that GetModuleHandle is a prohibited function to call even indirectly from LoadLibrary when the DLL is first loaded and mapped into the process address space.

The C# application was loading the DLL when it first encountered one of the functions, this then tried to initialise the CUDA module and load the resource automatically from the dll entry point. Ultimately, the call to GetModuleHandle raised an alarm back in the managed code. Not easy to spot.

More on the LoaderLock MDA can be found here

Vision Experts

Saturday, 12 December 2009

OpenGL Interop

Looks Matter

Never ignore the display capabilities of the GPU you have. After all, with the rendering capabililties of a meaty CUDA enabled GPU and the software capabilities of OpenGL you should easily be able to exploit the images and data you've just processed in interesting and engaging ways. Granted, it should be the result that matters. But when it's human engineers there comparing your system against the competition and the look-and-feel of your application is slick and polished, it can make a difference. The key to doing making your CUDA app look slick is OpenGL interop.

Process vs Display

Image processing and Image generation are two sides of the same coin. Image processing seeks to take an image and extract information (size, shape, depth, region, motion, identity etc.), image generation seeks to take information (vertices, texture, algorithms) and turn that data into an image. Looking at a software level, the two sides of the coin can be addressed by using two GPU technologies - image processing using CUDA and image generation using OpenGL. By using CUDA we can turn an NVidia GPU into a powerful image processor, by using OpenGL we can use the same GPU hardware to generate new images. For example, with CUDA image processing algorithm we could extract the motion and depth from a scene in real-time, then with OpenGL image generation we could re-generate camera stabilized video or generate a panorama or even completely re-render the scene from a novel and augmented perspective. It is when we combine image processing with image rendering this way that things get really interesting.

Interop

In my option, CUDA OpenGL interop seems to be under-documented as well as being a bit more complex that it should be. In sequence - here's how I use CUDA-OpenGL interop:

At program initialisation:

Allocate an OpenGL texture that will be compatible with your results (not always easy)
Allocate an OpenGL Pixel Buffer Object (PBO) using glGenBuffers
Register the PBO with CUDA using cuGLRegisterBufferObject

Note that (in CUDAv2.2) OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do an requested operations on the buffer. Deleting a buffer while it is mapped for CUDA results in undefined behaviour (bad things). Also, always use the same context to map/unmap as the context used to register the buffer. This can be difficult with the Runtime API in a multi-threaded app and results in strange behaviour.

A Digression on Pitch
There is a complication with texture allocation and cuda device memory allocation. With CUDA, you really must allocate pitched device memory (using cuMemAllocPitch) for image processing usage. This is in order to meet strict alignment requirements for fast coalseced memory access. You dont have control over the pitch that CUDA will use, but cuMemAllocPitch returns the actual pitch of the device mem that was allocated, which is anything up to 256bytes. When you allocate a texture in OpenGL, you cannot specify a texture pitch, only width, height and format. This means that your OpenGL texture buffer may not be pitch-compatible with your CUDA device memory layout. You can use GL_UNPACK_ALIGNMENT and GL_UNPACK_ROW_LENGTH to help out here, but there are still some fairly common situations when this wont quite give you the control you need. A symptom of mis-matched texture and device memory pitch is when the image data looks like its made it across the interop but is weirdly diagonally scewed or of the wrong aspect ratio. Usually, through a combination of modification to your texture width, packing alignment and/or format you can achieve something compatible.

For now, I'll assume you have managed to allocate a compatible texture, then;

At run-time:

Run the CUDA kernel putting the results into device memory (cuDevicePtr)
Map the PBO using cuGLMapBufferObject, which returns the device pointer of the texture memory (another cuDevicePtr)
Use cuMemcpy2D to copy from the device memory to the mapped PBO memory. These are device-to-device copies.
Unmap the PBO (cuGLUnmapBufferObject)
Update the texture from the PBO
Use OpenGL to draw with your new texture

Develop a Generic Interop Class

In most of the NVidia examples, CUDA results are written straight to the mapped texture memory during kernel execution. In reality I found it much more efficient (from a productivity perspective) to write the code for the above operations once, and package that up as a little interop utility class. Now, I can always copy from any CUDA device buffer into a suitable OpenGL texture without having to write similar code for every type of kernel launch. Not writing results directly into mapped opengl memory means that you incur an additional copy afterwards, but device-to-device copies are relatively fast in the scheme of things here.

Im putting together some tutorials on interop - they'll be along soon.

In the mean time - take a look at the Interop Release notes from CUDA 2.2...

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped*. If the buffer is registered but not mapped, OpenGL can do any
    requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was
    current during the buffer register operation will generally result in a
    program error and should thus be avoided.
  - Interoperability will use a software path on SLI
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. WinXP dualview).

Vision Experts

Friday, 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.

The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;

//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];

//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;

//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

return;
}

Thursday, 26 November 2009

RGB Images and CUDA

When using CUDA, using a 32-bit RGBX format to store colour images pays massive performance dividends when compared to the more commonly found 24-bit RGB image format. If you have the choice, avoid 24 bit colour.

NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos. All these demos just happen to load 32-bit RGBX images for any of their colour image processing routines. They effectively do the transformation from 24bit to 32 during the file load and then hide this cost when running the demo. Good for them, but back here in the real world my image capture device is throwing out 24bit RGB images at 60Hz.

For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:

texture < unsigned char, 2, cudaReadModeNormalizedFloat > tex;

which works fine. You can then access pixels using tex2D and all is well. However, if you have an RGB24 image and try this:

texture < uchar3, 2, cudaReadModeNormalizedFloat > tex;

It just wont work. There is no version of tex2D able to fetch 24 bit RGB pixels. In fact, you cannot even allocate a CUDA array with 3 channels - if you try this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 3;
desc.Width = m_dWidth;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );

then the array creation will fail. With CUDA you can only access textures and declare array memory with NumChannels equal to 1,2 or 4 elements.

Furthermore, it is not possible to convert from 24bit to 32bit during a cuMemcpy2D call - whilst this will pad line length to align the pitch (to 256bytes) it will not pad each pixel to match the destination array format.

The only solution is to declare your input array as 1 channel but three times as wide, like this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 1;
desc.Width = m_dWidth*3;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );

You can then access pixels in groups of three in your kernel. Unfortunately, dont expect coalesced memory acesses when each thread has to read three bytes....

Sunday, 8 November 2009

Real-time CUDA Video Processing

This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application. The problem is related to the way most image acquisition devices work under Windows - that is, via interrupt driven capture callback functions.

Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor specific programming API. This API typcially uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing; events or callbacks. Either your application starts a thread which waits for a sync Event (using WaitForSingleObject) to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready. Its pretty easy to swap between the two methods, but thats the subject of another blog entry.

The problem I have come across relates to using the CUDA runtime API with capture Callbacks. You see, the runtime API is pretty hard wired to be single threaded. You have to allocate, use, and deallocate all CUDA runtie resources on the same thread. Thats why all the runtime API demos with the NVidia SDK are single threaded console apps. The runtime appears to setup its own private context and attach it to the calling thread the first time any function is called, from then on only that thread is able to use functions in the runtime api. If any other thread tries to do something with device memory or call a kernel, you get an exception without explanation.

So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.

Now, for an imaging application we know we have to use CUDA in the capture callback thread to process each new image as it arrives. Well probably want to copy the newly captured image to the GPU device and then run a kernel on it to do something useful. Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in the same capture callback thread. But we dont really want to allocate and deallocate device memory every new frame as that is very inefficient, so we put a little catch to only allocate the first time the callback runs. Everything seems great, our callback runs a bit slow the first time, but it runs. It seems great until you realise that you dont know when the last time your callback will be called, so you dont know when to deallocate anything. And you cant do it when the application exits, because only the capture callback thread is allowed to touch the runtime API. Now thats a problem.

There are also problems with OpenGL interop. The cudaRegisterBuffer function really needs an opposite cudaUnregisterBuffer call before the callback thread terminates. If you dont unregister then CUDA barfs an exception from deep in the bowels of the nvcuda.dll when the callback thread terminates. But if you register/unregister every time a new frame arrives, that is really inefficient. So its all getting sticky with the CUDA runtime API.

The solution is to start out with the CUDA driver API for any real-time multi-threaded imaging applications. Lesson learnt.

Sunday, 25 October 2009

Proper Work

I can barely believe that it's nearly that time of year when I pack up my Laptop and go to Stuttgart for the Vision show to catch up with the latest news and technology in the Machine Vision industry. Last year I returned feeling enthusiastic and determined to produce some quality software algorithm of my own using my newly found CUDA skills. Now that an entire year has passed I have been struggling to remember what exactly I have achieved towards that goal. It appears that I have been doing alot of WORK, which pays the bills, but hardly any work.

Emanuel Derman spoke about WORK and work which captured the essence of what I think many talented engineers feel every day. There is always alot of WORK to be done, things like paperwork and meetings and bug fixes and presentations and little tasks which get you through the day and pay your bills. But the number of days in which real work gets done, stuff which will last more than a day and which feels like rewarding constructive activity... well that doesn't seem to happen enough.

The only solution I have found is to do some work after a full day of WORK. Unfortunatley an hour or two of typing in the evening, propped up by strong coffee and hungry for food is not the environment which germinates really good blue-sky development.

So I'm taking out a few days next week, turning off the mobile and shutting down outlook, and doing some work.

Saturday, 12 September 2009

OpenGL interop Woes

Writing real-world multi-threaded apps to capture, process and display video data in real-time is probably, in fairness, a slightly advanced topic. But after a fair amount of experimentation and frustration I think i can offer a piece of advise to other would-be image processing engineers embarking on a CUDA project:

Basically, don't use the CUDA runtime API in a real-time multi-threaded imaging application. And definately dont use the CUDA runtime API with OpenGL interop in a real-time multi-threaded imaging application.

You can just about get away with using the runtime API in a multi-threaded app IF you restrict your app so that only one host thread ever touches CUDA. Thats not usually possible in a real-time imaging system with interrupt driven capture callbacks and an asynchronous processing and display architecture. If you persist with the runtime API then...

Bad Things Can Happen

Under the hood, the runtime API is creating a CUDA context and attaching it to the first thread that touched CUDA. From then on, only that host thread should touch and CUDA API function, and if that thread terminates before CUDA resources are deallocated then bad things can happen. Alternatively, if you allocate some device memory in your application start-up, but then try and access or process that memory in a capture callback thread, then once again bad things can happen. Worst of all, if OpenGL interop tries to do something on a separate thread, whilst your capture callback is doing something on another thread then some CUDA operations may work, but sometimes very bad things can happen. For instance, I was quite successfully and repeatably able to instantly reboot my PC by running a badly coded piece of multi-threaded CUDA code with OpenGL interop. It was probably my fault, but that is difficult one to debug.

This is what led me to use the Driver API in all subsequent imaging applications and really take control over which host thread owned and used the CUDA context. So far, no problems. No crashes.

Vision Experts