Saturday 12 December 2009

OpenGL Interop

Looks Matter

Never ignore the display capabilities of the GPU you have.   After all, with the rendering capabililties of a meaty CUDA enabled GPU and the software capabilities of OpenGL you should easily be able to exploit the images and data you've just processed in interesting and engaging ways.  Granted, it should be the result that matters.  But when it's human engineers there comparing your system against the competition and the look-and-feel of your application is slick and polished, it can make a difference.  The key to doing making your CUDA app look slick is OpenGL interop.  

Process vs Display

Image processing and Image generation are two sides of the same coin.   Image processing seeks to take an image and extract information (size, shape, depth, region, motion, identity etc.), image generation seeks to take information (vertices, texture, algorithms) and turn that data into an image.  Looking at a software level, the two sides of the coin can be addressed by using two GPU technologies - image processing using CUDA and image generation using OpenGL.  By using CUDA we can turn an NVidia GPU into a powerful image processor, by using OpenGL we can use the same GPU hardware to generate new images. For example, with CUDA image processing algorithm we could extract the motion and depth from a scene in real-time, then with OpenGL image generation we could re-generate camera stabilized video or generate a panorama or even completely re-render the scene from a novel and augmented perspective.  It is when we combine image processing with image rendering this way that things get really interesting.  

Interop

In my option, CUDA OpenGL interop seems to be under-documented as well as being a bit more complex that it should be.  In sequence - here's how I use CUDA-OpenGL interop:


At program initialisation:

  1. Allocate an OpenGL texture that will be compatible with your results (not always easy)
  2. Allocate an OpenGL Pixel Buffer Object (PBO) using glGenBuffers
  3. Register the PBO with CUDA using cuGLRegisterBufferObject

Note that (in CUDAv2.2) OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do an requested operations on the buffer. Deleting a buffer while it is mapped for CUDA results in undefined behaviour (bad things).  Also, always use the same context to map/unmap as the context used to register the buffer.  This can be difficult with the Runtime API in a multi-threaded app and results in strange behaviour.



A Digression on Pitch
There is a complication with texture allocation and cuda device memory allocation.  With CUDA, you really must allocate pitched device memory (using cuMemAllocPitch) for image processing usage.   This is in order to meet strict alignment requirements for fast coalseced memory access.  You dont have control over the pitch that CUDA will use, but cuMemAllocPitch returns the actual pitch of the device mem that was allocated, which is anything up to 256bytes.  When you allocate a texture in OpenGL, you cannot specify a texture pitch, only width, height and format.  This means that your OpenGL texture buffer may not be pitch-compatible with your CUDA device memory layout.  You can use GL_UNPACK_ALIGNMENT and GL_UNPACK_ROW_LENGTH to help out here, but there are still some fairly common situations when this wont quite give you the control you need.  A symptom of mis-matched texture and device memory pitch is when the image data looks like its made it across the interop but is weirdly diagonally scewed or of the wrong aspect ratio.  Usually, through a combination of modification to your texture width, packing alignment and/or format you can achieve something compatible. 


For now, I'll assume you have managed to allocate a compatible texture, then;



At run-time:
  1. Run the CUDA kernel putting the results into device memory (cuDevicePtr)
  2. Map the PBO using cuGLMapBufferObject, which returns the device pointer of the texture memory (another cuDevicePtr)
  3. Use cuMemcpy2D to copy from the device memory to the mapped PBO memory.  These are device-to-device copies.
  4. Unmap the PBO (cuGLUnmapBufferObject)
  5. Update the texture from the PBO
  6. Use OpenGL to draw with your new texture
Develop a Generic Interop Class

In most of the NVidia examples, CUDA results are written straight to the mapped texture memory during kernel execution.  In reality I found it much more efficient (from a productivity perspective) to write the code for the above operations once, and package that up as a little interop utility class.  Now, I can always copy from any CUDA device buffer into a suitable OpenGL texture without having to write similar code for every type of kernel launch. Not writing results directly into mapped opengl memory means that you incur an additional copy afterwards, but device-to-device copies are relatively fast in the scheme of things here.


Im putting together some tutorials on interop - they'll be along soon.


In the mean time - take a look at the Interop Release notes from CUDA 2.2...

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped*. If the buffer is registered but not mapped, OpenGL can do any
    requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was
    current during the buffer register operation will generally result in a
    program error and should thus be avoided.
  - Interoperability will use a software path on SLI
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. WinXP dualview).



Vision Experts

Friday 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.

The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;

//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];


//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;

//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

return;
}

Thursday 26 November 2009

RGB Images and CUDA

When using CUDA, using a 32-bit RGBX format to store colour images pays massive performance dividends when compared to the more commonly found 24-bit RGB image format. If you have the choice, avoid 24 bit colour.

NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos. All these demos just happen to load 32-bit RGBX images for any of their colour image processing routines. They effectively do the transformation from 24bit to 32 during the file load and then hide this cost when running the demo. Good for them, but back here in the real world my image capture device is throwing out 24bit RGB images at 60Hz.

For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:



texture < unsigned char, 2, cudaReadModeNormalizedFloat > tex;


which works fine. You can then access pixels using tex2D and all is well. However, if you have an RGB24 image and try this:


texture < uchar3, 2, cudaReadModeNormalizedFloat > tex;




It just wont work. There is no version of tex2D able to fetch 24 bit RGB pixels. In fact, you cannot even allocate a CUDA array with 3 channels - if you try this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 3;
desc.Width = m_dWidth;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );


then the array creation will fail. With CUDA you can only access textures and declare array memory with NumChannels equal to 1,2 or 4 elements.

Furthermore, it is not possible to convert from 24bit to 32bit during a cuMemcpy2D call - whilst this will pad line length to align the pitch (to 256bytes) it will not pad each pixel to match the destination array format.

The only solution is to declare your input array as 1 channel but three times as wide, like this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 1;
desc.Width = m_dWidth*3;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );


You can then access pixels in groups of three in your kernel. Unfortunately, dont expect coalesced memory acesses when each thread has to read three bytes....

Sunday 8 November 2009

Real-time CUDA Video Processing

This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application. The problem is related to the way most image acquisition devices work under Windows - that is, via interrupt driven capture callback functions.

Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor specific programming API. This API typcially uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing; events or callbacks. Either your application starts a thread which waits for a sync Event (using WaitForSingleObject) to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready. Its pretty easy to swap between the two methods, but thats the subject of another blog entry.

The problem I have come across relates to using the CUDA runtime API with capture Callbacks. You see, the runtime API is pretty hard wired to be single threaded. You have to allocate, use, and deallocate all CUDA runtie resources on the same thread. Thats why all the runtime API demos with the NVidia SDK are single threaded console apps. The runtime appears to setup its own private context and attach it to the calling thread the first time any function is called, from then on only that thread is able to use functions in the runtime api. If any other thread tries to do something with device memory or call a kernel, you get an exception without explanation.

So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.

Now, for an imaging application we know we have to use CUDA in the capture callback thread to process each new image as it arrives. Well probably want to copy the newly captured image to the GPU device and then run a kernel on it to do something useful. Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in the same capture callback thread. But we dont really want to allocate and deallocate device memory every new frame as that is very inefficient, so we put a little catch to only allocate the first time the callback runs. Everything seems great, our callback runs a bit slow the first time, but it runs. It seems great until you realise that you dont know when the last time your callback will be called, so you dont know when to deallocate anything. And you cant do it when the application exits, because only the capture callback thread is allowed to touch the runtime API. Now thats a problem.

There are also problems with OpenGL interop. The cudaRegisterBuffer function really needs an opposite cudaUnregisterBuffer call before the callback thread terminates. If you dont unregister then CUDA barfs an exception from deep in the bowels of the nvcuda.dll when the callback thread terminates. But if you register/unregister every time a new frame arrives, that is really inefficient. So its all getting sticky with the CUDA runtime API.

The solution is to start out with the CUDA driver API for any real-time multi-threaded imaging applications. Lesson learnt.

Sunday 25 October 2009

Proper Work

I can barely believe that it's nearly that time of year when I pack up my Laptop and go to Stuttgart for the Vision show to catch up with the latest news and technology in the Machine Vision industry. Last year I returned feeling enthusiastic and determined to produce some quality software algorithm of my own using my newly found CUDA skills. Now that an entire year has passed I have been struggling to remember what exactly I have achieved towards that goal. It appears that I have been doing alot of WORK, which pays the bills, but hardly any work.

Emanuel Derman spoke about WORK and work which captured the essence of what I think many talented engineers feel every day. There is always alot of WORK to be done, things like paperwork and meetings and bug fixes and presentations and little tasks which get you through the day and pay your bills. But the number of days in which real work gets done, stuff which will last more than a day and which feels like rewarding constructive activity... well that doesn't seem to happen enough.

The only solution I have found is to do some work after a full day of WORK. Unfortunatley an hour or two of typing in the evening, propped up by strong coffee and hungry for food is not the environment which germinates really good blue-sky development.

So I'm taking out a few days next week, turning off the mobile and shutting down outlook, and doing some work.

Saturday 12 September 2009

OpenGL interop Woes

Writing real-world multi-threaded apps to capture, process and display video data in real-time is probably, in fairness, a slightly advanced topic.  But after a fair amount of experimentation and frustration I think i can offer a piece of advise to other would-be image processing engineers embarking on a CUDA project:


Basically, don't use the CUDA runtime API in a real-time multi-threaded imaging application.  And definately dont use the CUDA runtime API with OpenGL interop in a real-time multi-threaded imaging application.

You can just about get away with using the runtime API in a multi-threaded app IF you restrict your app so that only one host thread ever touches CUDA.  Thats not usually possible in a real-time imaging system with interrupt driven capture callbacks and an asynchronous processing and display architecture.  If you persist with the runtime API then...


Bad Things Can Happen

Under the hood, the runtime API is creating a CUDA context and attaching it to the first thread that touched CUDA.  From then on, only that host thread should touch and CUDA API function, and if that thread terminates before CUDA resources are deallocated then bad things can happen.  Alternatively, if you allocate some device memory in your application start-up, but then try and access or process that memory in a capture callback thread, then once again bad things can happen.  Worst of all, if OpenGL interop tries to do something on a separate thread, whilst your capture callback is doing something on another thread then some CUDA operations may work, but sometimes very bad things can happen.  For instance, I was quite successfully and repeatably able to instantly reboot my PC by running a badly coded piece of multi-threaded CUDA code with OpenGL interop.  It was probably my fault, but that is difficult one to debug.


This is what led me to use the Driver API in all subsequent imaging applications and really take control over which host thread owned and used the CUDA context.  So far, no problems. No crashes.





Vision Experts

Saturday 4 July 2009

CUDA function overheads

Whilst working on my CUDA accelerated JPEG algorithm I found a problem with my design which demanded launching a large number of small kernels followed by many thousands of small memcopy operations. I was launching kernels to compress a fixed number of image blocks, many hundreds in all. The result was compressed image blocks, and the output size was only known at runtime after the algorithm was finished, but required many thousands of mem copy operations. The design was bad, but I was trying things out to see what would happen.

On a CPU, a function call will typically take a few nanoseconds to push parameters on the stack and jump the program pointer to the function address. On the GPU however, much more work has to be performed via the driver. So kernel launches and cuda mem copy operations take at least three orders of magnitude more to setup than a CPU call - several microseconds in all.

This means that if you want to perform many hundreds or thousands of calls then the function calls themselves can start to add up much more quickly than the equivalent CPU calls. This effect can then become significant - so make your kernels big!

Thursday 11 June 2009

How to Display YUV420 Video

Recently, I came across YUV420 image data whilst working with a hardware H264 compression card. The image data was planar, and arranged like the image below. This is standard YUV420 planar format, with the U and V components being at 1/2 resolution of the Y component.

Now, when using MS Windows, you have to display images using bitmaps packed as RGB888 or RGBA8888 colour format. Even when using OpenGL you need basically RGB or 8bit grey images. I believe MacOS might be nicer (anybody?) and certainly supports YUV422, but under Windows, you have a problem with YUV. DirectX might help out too (anybody?) - but I live in the OpenGL world here.

So how do we display YUV420 video in real-time?

The first thing I tried was a CPU conversion to RGB888, then transferred the RGB data to OpenGL for display. Easy enough to code in C++ and took about an hour to optimise. But it still took about 8ms per frame to convert (on 768x576 frames) and really hit the CPU loading, which felt like a real waste of clock cycles for just displaying an image.

The solution we ended up with was to transfer YUV420 image data raw as GL_LUMINANCE image data, essentially just transferring the whole image (as above) as if it were a 768x864 greyscale image. We then wrote a Cg fragment shader to do the YUV to RGB conversion and display on the graphics unit. This worked a treat and even the Intel embedded graphics on the motherboard was able to handle the shader. This reduced the time to 1.4ms per frame, without any CPU loading.


To finish up, we wrapped up the entire functionality as a stand-alone DLL with just a few simple function calls. Now anybody here can display YUV420 images in a Window without any CPU overhead and without having to be concerned about how it happens. NVidia Cg requires two additional DLL's to be supplied with the package, but thats it.

You can get the DLL from us at http://www.vision4ce.com

Wednesday 11 March 2009

Multi-core Momentum


We've been looking at the new Intel Core i7 processors today. Obviously, with the Nehalem architecture Intel are really getting ready for massively multi-threaded multi-core chips in the near future. This is good news for image processing as a great many useful algorithms can be implemented using a parallel architecture, so the more mainstream parallel computing gets, the better. At Vision4ce , we've had a lot of success accelerating algorithms using the GPU and NVidia CUDA, so that experience should help us deploy onto other multi-core architectures, probably using OpenCL.

Another bit from Intel's blurb that interests me is the QuickPath technology:

'Intel QuickPath technology is a point-to-point connection—there is no single bus that all the processors must use and contend with to reach memory and I/O. This improves scalability and eliminates the competition between processors for bus bandwidth.'.

Now that sounds like Intel is really laying good groundwork for connecting lots of cores together in a parallel architecture - lets see what happens with the Intel Nehalem architecture in the near future.

Monday 9 March 2009

Vcd on XP

Here is a useful little utility from MS for XP that allows you to mount ISO cd's without having to have some huge software bundle installed. Its old, but at 60kb its still useful.

Download XP Virtual Control Panel

http://download.microsoft.com/download/7/b/6/7b6abd84-7841-4978-96f5-bd58df02efa2/winxpvirtualcdcontrolpanel_21.exe



Installation instructions
=========================
1. Copy VCdRom.sys to your %systemroot%\system32\drivers folder.
2. Execute VCdControlTool.exe
3. Click “Driver control”
4. If the “Install Driver” button is available, click it. Navigate to the %systemroot%\system32\drivers folder, select VCdRom.sys, and click Open.
5. Click “Start”
6. Click OK
7. Click “Add Drive” to add a drive to the drive list. Ensure that the drive added is not a local drive. If it is, continue to click “Add Drive” until an unused drive letter is available.
8. Select an unused drive letter from the drive list and click “Mount”.
9. Navigate to the image file, select it, and click “OK”. UNC naming conventions should not be used, however mapped network drives should be OK.

Thursday 8 January 2009

Sub-Pixel Maximum

Ever needed to find the position of the maximum of a peak in an image to sub-pixel precision? Examples are finding the peak energy in Fourier space, or the location of a peak in Hough space. One simple method that I have used with some success is a simple quadratic interpolation, which is performed once for each axis.

Sub-Pixel Peak finding recipe...
Firstly, you take the value of your highest point and its neighbours (Pa, Pb and Pc). Assuming the three points are equally spaced and bracket the top of the peak, you can solve for the position of the 'true' maximum using:

x = 0.5*(Pa - Pc) / (Pa-2Pb+Pc)

The value of x will range between -0.5 and +0.5, which is relative to your maximum pixel.

Jason Dale. www.visionexperts.co.uk

Friday 2 January 2009

New Year. New Blog

In an effort to pay back some of my huge debt to the internet, I've decided to put a little something back into the web-o-sphere. And seeing as I'd like it to be something useful, I'm going to share the stuff that I find useful, for the benefit of image processing people everywhere.