Saturday 12 December 2009

OpenGL Interop

Looks Matter

Never ignore the display capabilities of the GPU you have.   After all, with the rendering capabililties of a meaty CUDA enabled GPU and the software capabilities of OpenGL you should easily be able to exploit the images and data you've just processed in interesting and engaging ways.  Granted, it should be the result that matters.  But when it's human engineers there comparing your system against the competition and the look-and-feel of your application is slick and polished, it can make a difference.  The key to doing making your CUDA app look slick is OpenGL interop.  

Process vs Display

Image processing and Image generation are two sides of the same coin.   Image processing seeks to take an image and extract information (size, shape, depth, region, motion, identity etc.), image generation seeks to take information (vertices, texture, algorithms) and turn that data into an image.  Looking at a software level, the two sides of the coin can be addressed by using two GPU technologies - image processing using CUDA and image generation using OpenGL.  By using CUDA we can turn an NVidia GPU into a powerful image processor, by using OpenGL we can use the same GPU hardware to generate new images. For example, with CUDA image processing algorithm we could extract the motion and depth from a scene in real-time, then with OpenGL image generation we could re-generate camera stabilized video or generate a panorama or even completely re-render the scene from a novel and augmented perspective.  It is when we combine image processing with image rendering this way that things get really interesting.  

Interop

In my option, CUDA OpenGL interop seems to be under-documented as well as being a bit more complex that it should be.  In sequence - here's how I use CUDA-OpenGL interop:


At program initialisation:

  1. Allocate an OpenGL texture that will be compatible with your results (not always easy)
  2. Allocate an OpenGL Pixel Buffer Object (PBO) using glGenBuffers
  3. Register the PBO with CUDA using cuGLRegisterBufferObject

Note that (in CUDAv2.2) OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do an requested operations on the buffer. Deleting a buffer while it is mapped for CUDA results in undefined behaviour (bad things).  Also, always use the same context to map/unmap as the context used to register the buffer.  This can be difficult with the Runtime API in a multi-threaded app and results in strange behaviour.



A Digression on Pitch
There is a complication with texture allocation and cuda device memory allocation.  With CUDA, you really must allocate pitched device memory (using cuMemAllocPitch) for image processing usage.   This is in order to meet strict alignment requirements for fast coalseced memory access.  You dont have control over the pitch that CUDA will use, but cuMemAllocPitch returns the actual pitch of the device mem that was allocated, which is anything up to 256bytes.  When you allocate a texture in OpenGL, you cannot specify a texture pitch, only width, height and format.  This means that your OpenGL texture buffer may not be pitch-compatible with your CUDA device memory layout.  You can use GL_UNPACK_ALIGNMENT and GL_UNPACK_ROW_LENGTH to help out here, but there are still some fairly common situations when this wont quite give you the control you need.  A symptom of mis-matched texture and device memory pitch is when the image data looks like its made it across the interop but is weirdly diagonally scewed or of the wrong aspect ratio.  Usually, through a combination of modification to your texture width, packing alignment and/or format you can achieve something compatible. 


For now, I'll assume you have managed to allocate a compatible texture, then;



At run-time:
  1. Run the CUDA kernel putting the results into device memory (cuDevicePtr)
  2. Map the PBO using cuGLMapBufferObject, which returns the device pointer of the texture memory (another cuDevicePtr)
  3. Use cuMemcpy2D to copy from the device memory to the mapped PBO memory.  These are device-to-device copies.
  4. Unmap the PBO (cuGLUnmapBufferObject)
  5. Update the texture from the PBO
  6. Use OpenGL to draw with your new texture
Develop a Generic Interop Class

In most of the NVidia examples, CUDA results are written straight to the mapped texture memory during kernel execution.  In reality I found it much more efficient (from a productivity perspective) to write the code for the above operations once, and package that up as a little interop utility class.  Now, I can always copy from any CUDA device buffer into a suitable OpenGL texture without having to write similar code for every type of kernel launch. Not writing results directly into mapped opengl memory means that you incur an additional copy afterwards, but device-to-device copies are relatively fast in the scheme of things here.


Im putting together some tutorials on interop - they'll be along soon.


In the mean time - take a look at the Interop Release notes from CUDA 2.2...

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped*. If the buffer is registered but not mapped, OpenGL can do any
    requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was
    current during the buffer register operation will generally result in a
    program error and should thus be avoided.
  - Interoperability will use a software path on SLI
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. WinXP dualview).



Vision Experts

Friday 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.

The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;

//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];


//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;

//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

return;
}