Friday, 23 July 2010

Blazingly Fast Image Warping

Want to achieve over 1 Gigapixel/sec warping throughput?  Then leverage your GPU texture units using CUDA.


Image warping is a very useful and important image processing function that we use all the time.  It is often used, when calibrated, to remove distortions such as perspective projection and lens distortion.  Many pattern matching libraries make use of affine image warps to compute image alignment.  Almost all imaging libraries have a warping tool in their toolbox.  In this post I will say a little about how we make use of the texture hardware in a GPU using CUDA, plus we show some benchmarks for polar unwrapping - and it is fast.

If there is one thing that the GPU is excellent at, it is image warping.  We can thank the gamers for their insatiable appetite for speed in warping image data or 'textures' onto polygons. Fortunately, even when using CUDA to program a GPU as a general purpose co-pro, the fast texture hardware is still available to us for accelerated warping.


  • There are several good reasons to use the texture hardware from CUDA when image processing:The ordering of Texture fetches are generally less proscriptive than the strict requirements for coalescing global memory reads.  When the order of your data reads does not fit in with a coalesced memory access pattern, consider texture fetches.
  • Texture fetches are cached.  For CUDA array memory, the texture cache has a high level of 2-D locality.  Texture fetches from linear memory are also cached.
  • Texture fetches perform bilinear interpolation in hardware.
  • Texture fetches can clamp or wrap at image boundaries, so you don't have to have careful bounds checking yourself.

Linear Memory vs Array Memory
When using a GPU to write a graphics application using an API like OpenGL or DirectX, the texture images were transferred and stored on the GPU in a way that optimized the cache for 2-D spatial locality.  With CUDA, a type of memory called a CUDA Array is available to serve this purpose, and CUDA Array memory stores 2-D image data in a bespoke way to enhance 2-D throughput.  CUDA array memory is managed separately from CUDA linear device memory and has its own memory allocation and copy functions.   

Dedicated CUDA Array memory meant that in the early days of CUDA (going waaay back maybe three whole years), the developer had to manage copying between host, linear device memory and CUDA array memory.  When using the texture hardware, the data had to be in the right place at the right time, forcing many additional copies to array memory.  

Fortunately, from CUDA 2.0 onwards, it became possible to use texture fetch hardware with normal linear device memory.  As far as I can tell, this innovation obviated the need for Array memory entirely.  If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know.






Textures - Kernel Code

Very little code is required in a CUDA kernel in order to use the texture hardware.  A texture unit for accessing the pixels of a regular 8-bit, 2-dimensional image is created in the kernel code (the .cu file) using the code:

 
texture<unsigned char, 2, cudaReadModeElementType> tex;

The data can then be fetched using the 2d texture fetch hardware using 'tex2d' as below:

unsigned char pix = tex2D( tex ,fPosX ,fPosY );

The really neat thing here is that the position to sample the input image is specified by floating point coordinates (fPosX and fPosY).  The texture reference can be set to perform either nearest-neighbor or bi-linear interpolation in hardware without any additional overhead.  It's not often you get something as useful as floating point bi-linear interpolation for free - thank NVidia.

It is also possible for the texture fetch hardware to return normalized floating point values, which is beneficial in many circumstances.  For example, in many cases the GPU is faster with floating point arithmetic operations than it is with integer operations.  Integer division is rarely a good idea.  For this reason I usually declare a float texture object using the following:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;


then access the pixels as floating point values:


float pix = tex2D( tex ,fPosX, fPosY );

Of course, I have to convert the float pixels back to bytes when I have finished playing around, but that's no big overhead and the hardware provides a fast saturation function to limit the float to the unit range for us:

*pPixOut = 255 * __saturatef(pix);


Textures - Initialization Host Code (Driver API)
A few lines of additional code are required in your host code during initialization in order to setup the kernel texture object.  I tend to do this once during a setup phase of the application, typically just after loading the cubin file and getting function handles.


Firstly, you will need to get a handle to your kernels texture object for the host to use.  This is similar to getting a handle to a device constant variable since the reference is retrieved from the kernel cubin by name. In our example above we declared a texture object in the kernel named 'tex'.  The host code when using the driver API is therefore:

CUtexref m_cuTexref;
cuModuleGetTexRef(&m_cuTexref, m_cuModule, "tex")



Where m_cuModule is the kernel module handle previously loaded/compiled using cuModuleLoadDataEx.  Now we need to set up how the texture unit will access the data.  Firstly, I tell the texture fetch to clamp to the boundary in both dimensions:


    cuTexRefSetAddressMode(m_cuTexref, 0, CU_TR_ADDRESS_MODE_CLAMP);
    cuTexRefSetAddressMode(m_cuTexref, 1, CU_TR_ADDRESS_MODE_CLAMP);

Then we can tell the hardware to fetch image data using nearest neighbour interpolation (point):


    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_POINT);

Or bilinear interpolation mode:

    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_LINEAR);

Finally, we tell the texture reference about the linear memory we are going to use as a texture.  Assuming that there is some device memory (CUdeviceptr m_dPtr) allocated during initialization that will contain the image data of dimensions Width and Height with a byte pitch of m_dPitch.
 
    // Bind texture reference to linear memory
    CUDA_ARRAY_DESCRIPTOR cad;
    cad.Format = CU_AD_FORMAT_UNSIGNED_INT8;    // Input linear memory is 8-bit
    cad.NumChannels = 1;                        // Input is greyscal
    cad.Width = Width;                    // Input Width
    cad.Height = Height;                   // Input Height

    cuTexRefSetAddress2D(m_cuTexref, &cad, m_dPtr , m_dPitch);
The actual image data can be copied into the device memory at a later time, or repeatedly every time a new image is available for video processing.  The texture reference 'tex' in the kernel has now been connected to the linear device memory.


Textures - Kernel Call Host Code (Driver API)
There is very little left to do by the time it comes to call a kernel.  We have to activate a hardware texture unit and tell it which texture it will be using.  On the host side, the texture reference was called m_cuTexref, we have already connected this reference to the texture object named 'tex' in the kernel during setup (using cuModuleGetTexRef)One additional line is required to tell the kernel function which texture is active in the default texture unit.
 
cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);

So, the kernel will now be able to use the hardware texture fetch functions (tex2d) to fetch data from the texture object named 'tex'.  It is interesting that the texture unit MUST be CU_PARAM_TR_DEFAULT.  A CUDA enabled GPU will almost certainly have multiple texture units, so in theory it should be possible to read from multiple texture units simultaneously in a kernel to achieve image blending/fusion effects.  Unfortunately, this is not made available to us in CUDA at the time of writing (with CUDA 3.1). 

To launch the kernel, proceed as normal.  For example:

cuFuncSetBlockShape( cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1 );
cuLaunchGridAsync( cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream ))

Note that I use async calls and multiple streams in order to overlap computation and PCI transfers, thus hiding some of the transfer overhead (a subject for another post).  This can all be hidden from the user by maintaining a rolling buffer internally in the library, making the warp algorithm appear to run faster.
Performance
In order to test the performance I have developed a general purpose warping library that uses our GPU framework to hide all of the CUDA code, JIT compilation, transfers, contexts, streams and threads behind a few simple function calls.  A commonly used useful warp function for polar unwrap has been implemented using the texture fetching method described above and the results look very good.

The input images we chose were from Opto-Engineering who have a range of lenses that produce polar images of the sides of product.  It is possible to capture high resolution images of the sides of containers as a polar image (below) but in order to accelerate any analysis, a fast polar unwrap is needed.



The output images look good when using the hardware bi-linear interpolation (below):


As expected, when nearest-neighbour interpolation is used, the image quality is degraded with aliasing problems (below).  Whilst this would be faster on a CPU, the GPU is able to perform the bilinear interpolation mode at the same speed.



The performance depends on the size of the output image, but typically achieves well over 1GB/sec in transform bandwidth, including all the transfer overheads (Core2Quad Q8400@2.66GHz & GTX260 216cores).  For these input images (1024x768), the average total transform time to produce the output (1280x384) was under 400 microseconds.  That works out at over 1.2 Gigapixels/sec
A quick comparison to a third party software polar unwrap tool showed that this was at least an order of magnitude faster.


The algorithm to perform the polar coordinate conversion is computed on-the-fly.  Any number of complex transform functions can be implemented in the library very quickly to achieve this performance.  So far, affine, perspective and polar transforms are done.  Any requests?




vxGWarp Interfaces
Just FYI - the interface to these polar warp functions are pretty trivial, all the GPU expertise is hidden from the end user in the DLL.  The key functions in the header file are:

vxGWarpCreate(VXGWARPHANDLE *hGW, int W, int H);
vxGWarpDestroy(VXGWARPHANDLE hGW);
vxGWarpAccessXferBufIn(VXGWARPHANDLE hGW, unsigned char **pInput, int *nW, int *nP, int *nH);
vxGWarpAccessXferBufOut(VXGWARPHANDLE hGW, unsigned char **pOutput, int *nW, int *nP, int *nH);
vxGWarpPolar(VXGWARPHANDLE hGW, POLARPARAMS PP);



Vision Experts

4 comments:

  1. Hello,

    Interesting indeed.

    Would it also work with a general displacement vector field? (With a range limit on the displacement vectors maybe) ..

    Would that have about the same performance?

    That would be really nice :)

    Regards, Martin

    ReplyDelete
  2. It would work with a general displacement field but would be a little slower since the GPU would have to fetch both the image data and the x&y vector field data from global memory.
    It would be a useful addition to the library and could be used in optic flow or motion compensation applications.

    ReplyDelete
  3. Nice article; very helpful! Although it isn't relevant to your application, here's one reason why CUarray is still a relevant data structure: Opengl interoperability. When you want to render something in OpenGL and send the pixels to cuda, you can map the color buffer in OpenGL directly to a CUarray in Cuda, without any copying. The alternative is reading the pixels into main memory using glReadPixels and then calling cuMemcpyHtoD(), but glReadPixels is *REALLY* slow, even with PBO's.

    ReplyDelete
  4. "If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know."

    In theory texture fetches from CUDA arrays can be faster than fetches from 2D linear memory, since arrays are arranged for 2D locality.

    ReplyDelete