Never ignore the display capabilities of the GPU you have. After all, with the rendering capabilities of a meaty CUDA-enabled GPU and the software capabilities of OpenGL, you should easily be able to exploit the images and data you've just processed in interesting and engaging ways. Granted, it should be the result that matters. But when human engineers are comparing your system against the competition, a slick and polished look-and-feel can make a difference. The key to making your CUDA app look slick is OpenGL interop.
Process vs Display
Image processing and image generation are two sides of the same coin. Image processing takes an image and extracts information (size, shape, depth, region, motion, identity etc.); image generation takes information (vertices, textures, algorithms) and turns it into an image. At the software level, the two sides of the coin can be addressed with two GPU technologies: image processing using CUDA and image generation using OpenGL. With CUDA we can turn an NVIDIA GPU into a powerful image processor; with OpenGL we can use the same GPU hardware to generate new images. For example, with a CUDA image processing algorithm we could extract the motion and depth from a scene in real time, then with OpenGL image generation we could regenerate camera-stabilized video, generate a panorama, or even completely re-render the scene from a novel and augmented perspective. It is when we combine image processing with image rendering in this way that things get really interesting.
Interop
In my opinion, CUDA-OpenGL interop is under-documented and a bit more complex than it should be. Here, in sequence, is how I use it:
At program initialisation:
- Allocate an OpenGL texture that will be compatible with your results (not always easy)
- Allocate an OpenGL Pixel Buffer Object (PBO) using glGenBuffers
- Register the PBO with CUDA using cuGLRegisterBufferObject
Note that (in CUDA 2.2) OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer. Deleting a buffer while it is mapped for CUDA results in undefined behaviour (bad things). Also, always map and unmap in the same context that was used to register the buffer. This can be difficult with the Runtime API in a multi-threaded app, and getting it wrong results in strange behaviour.
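The rules above amount to a small state machine, which can be sketched on the host. This is only an illustrative model: the class name, its methods and the integer context tag are all hypothetical, and it makes no CUDA or OpenGL calls. Something like it could sit inside an interop wrapper to catch misuse before it reaches the driver.

```cpp
#include <cassert>

// Host-side model of the CUDA 2.2 buffer rules described above.
// Hypothetical bookkeeping only; it makes no CUDA or OpenGL calls.
class InteropBufferState {
public:
    // Record which context registered the buffer (contextId is an arbitrary tag).
    void registerBuffer(int contextId) {
        assert(!mapped_ && "cannot re-register a buffer that is still mapped");
        registered_ = true;
        ownerContext_ = contextId;
    }

    // Mapping is allowed only when registered, unmapped, and in the owner context.
    bool map(int contextId) {
        if (!registered_ || mapped_ || contextId != ownerContext_) return false;
        mapped_ = true;
        return true;
    }

    // Unmapping must also happen in the context that registered the buffer.
    bool unmap(int contextId) {
        if (!mapped_ || contextId != ownerContext_) return false;
        mapped_ = false;
        return true;
    }

    // OpenGL may use the buffer whenever it is not mapped for CUDA.
    bool openGLMayAccess() const { return !mapped_; }

    // Deleting while mapped is undefined behaviour, so only allow it when unmapped.
    bool safeToDelete() const { return !mapped_; }

private:
    bool registered_ = false;
    bool mapped_ = false;
    int ownerContext_ = -1;
};
```

In use, a wrapper would check `map()` before handing the buffer to CUDA and check `openGLMayAccess()` before any texture update.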
A Digression on Pitch
There is a complication with texture allocation and CUDA device memory allocation. With CUDA, you really must allocate pitched device memory (using cuMemAllocPitch) for image processing usage. This is in order to meet the strict alignment requirements for fast coalesced memory access. You don't have control over the pitch that CUDA will use, but cuMemAllocPitch returns the actual pitch of the device memory that was allocated, which can pad each row by anything up to 256 bytes. When you allocate a texture in OpenGL, you cannot specify a texture pitch, only width, height and format. This means that your OpenGL texture buffer may not be pitch-compatible with your CUDA device memory layout. You can use GL_UNPACK_ALIGNMENT and GL_UNPACK_ROW_LENGTH to help out here, but there are still some fairly common situations when this won't quite give you the control you need. A symptom of mismatched texture and device memory pitch is image data that looks like it has made it across the interop but is weirdly diagonally skewed or has the wrong aspect ratio. Usually, through a combination of changes to your texture width, packing alignment and/or format, you can achieve something compatible.
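To make the pitch matching concrete, here is a minimal host-side sketch that searches for GL_UNPACK_ROW_LENGTH and GL_UNPACK_ALIGNMENT values whose row stride equals a given pitch. The struct and function names are hypothetical. It assumes pixel components of 1, 2 or 4 bytes, for which OpenGL's row stride is the row length in bytes rounded up to the unpack alignment. It only does the arithmetic and does not call OpenGL.

```cpp
#include <cassert>
#include <cstddef>

// Pixel-store settings that make OpenGL step through source rows using a
// given pitch. All names here are hypothetical, not part of CUDA or OpenGL.
struct UnpackParams {
    bool ok;        // true if some combination reproduces the pitch exactly
    int rowLength;  // value for GL_UNPACK_ROW_LENGTH (in pixels)
    int alignment;  // value for GL_UNPACK_ALIGNMENT (1, 2, 4 or 8 bytes)
};

// Round value up to the next multiple of alignment.
inline std::size_t roundUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Search for a row length and alignment whose resulting row stride equals
// pitchBytes. Assumes 1-, 2- or 4-byte components (see the text above).
inline UnpackParams findUnpackParams(std::size_t pitchBytes,
                                     std::size_t bytesPerPixel,
                                     std::size_t widthPixels) {
    assert(bytesPerPixel > 0);
    UnpackParams result = {false, 0, 0};
    const std::size_t rowLength = pitchBytes / bytesPerPixel;  // whole pixels per row
    if (rowLength < widthPixels) {
        return result;  // pitch is too small to hold one image row
    }
    const std::size_t alignments[] = {1, 2, 4, 8};
    for (std::size_t a : alignments) {
        if (roundUp(rowLength * bytesPerPixel, a) == pitchBytes) {
            result.ok = true;
            result.rowLength = static_cast<int>(rowLength);
            result.alignment = static_cast<int>(a);
            return result;
        }
    }
    return result;  // no exact match; adjust the texture width or format instead
}
```

For example, a 512-byte pitch with 4-byte RGBA pixels gives a row length of 128 with alignment 1, and the same pitch with 3-byte RGB pixels gives a row length of 170 with alignment 4. With 12-byte pixels (three floats) no combination matches, which is the kind of case where the texture width or format has to change.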
For now, I'll assume you have managed to allocate a compatible texture.
At run-time:
- Run the CUDA kernel, putting the results into device memory (a CUdeviceptr)
- Map the PBO using cuGLMapBufferObject, which returns a device pointer to the PBO memory (another CUdeviceptr)
- Use cuMemcpy2D to copy from the device memory to the mapped PBO memory. These are device-to-device copies.
- Unmap the PBO (cuGLUnmapBufferObject)
- Update the texture from the PBO (bind the PBO to GL_PIXEL_UNPACK_BUFFER and call glTexSubImage2D)
- Use OpenGL to draw with your new texture
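The copy step is where differing pitches get reconciled, so it helps to see what a pitched 2D copy does. The sketch below is a CPU model only, not the CUDA call: the function name is hypothetical, and the parameter names merely mirror the srcPitch, dstPitch, WidthInBytes and Height fields of the CUDA_MEMCPY2D structure passed to cuMemcpy2D.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side model of a pitched 2D copy. Parameter names mirror the srcPitch,
// dstPitch, WidthInBytes and Height fields of CUDA_MEMCPY2D, but this function
// is illustrative only and runs entirely on the CPU.
inline void copyPitched2D(unsigned char* dst, std::size_t dstPitch,
                          const unsigned char* src, std::size_t srcPitch,
                          std::size_t widthInBytes, std::size_t height) {
    // Each row's payload must fit inside both pitches.
    assert(widthInBytes <= srcPitch && widthInBytes <= dstPitch);
    for (std::size_t row = 0; row < height; ++row) {
        // Advance by the pitch, not the width, so padding bytes are skipped.
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthInBytes);
    }
}
```

Copying the whole buffer as one flat block would carry the source padding into the destination rows, and that produces the diagonal skew described in the pitch digression.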
In most of the NVIDIA examples, CUDA results are written straight to the mapped texture memory during kernel execution. In reality I found it much more efficient (from a productivity perspective) to write the code for the above operations once and package it up as a little interop utility class. Now I can always copy from any CUDA device buffer into a suitable OpenGL texture without having to write similar code for every type of kernel launch. Not writing results directly into mapped OpenGL memory means that you incur an additional copy afterwards, but device-to-device copies are relatively fast in the scheme of things here.
I'm putting together some tutorials on interop - they'll be along soon.
In the meantime - take a look at the interop release notes from CUDA 2.2...
o OpenGL interoperability
- OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer.
- Deleting a buffer while it is mapped for CUDA results in undefined behavior.
- Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided.
- Interoperability will use a software path on SLI
- Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview).
Vision Experts