Friday 23 July 2010

Blazingly Fast Image Warping

Want to achieve over 1 Gigapixel/sec warping throughput?  Then leverage your GPU texture units using CUDA.


Image warping is a useful and important image processing function that we use all the time.  When calibrated, it is often used to remove distortions such as perspective projection and lens distortion.  Many pattern matching libraries make use of affine image warps to compute image alignment, and almost all imaging libraries have a warping tool in their toolbox.  In this post I will say a little about how we make use of the texture hardware in a GPU using CUDA, plus show some benchmarks for polar unwrapping - and it is fast.

If there is one thing that the GPU is excellent at, it is image warping.  We can thank the gamers for their insatiable appetite for speed in warping image data or 'textures' onto polygons. Fortunately, even when using CUDA to program a GPU as a general purpose co-pro, the fast texture hardware is still available to us for accelerated warping.


There are several good reasons to use the texture hardware from CUDA when image processing:
  • The ordering of texture fetches is generally less restrictive than the strict requirements for coalescing global memory reads.  When the order of your data reads does not fit a coalesced memory access pattern, consider texture fetches.
  • Texture fetches are cached.  For CUDA array memory, the texture cache has a high level of 2-D locality.  Texture fetches from linear memory are also cached.
  • Texture fetches perform bilinear interpolation in hardware.
  • Texture fetches can clamp or wrap at image boundaries, so you don't have to do careful bounds checking yourself.

Linear Memory vs Array Memory
When using a GPU to write a graphics application with an API like OpenGL or DirectX, texture images are transferred and stored on the GPU in a way that optimizes the cache for 2-D spatial locality.  With CUDA, a type of memory called a CUDA Array is available to serve this purpose: CUDA Array memory stores 2-D image data in a bespoke layout to enhance 2-D throughput.  CUDA Array memory is managed separately from CUDA linear device memory and has its own memory allocation and copy functions.

Dedicated CUDA Array memory meant that in the early days of CUDA (going waaay back maybe three whole years), the developer had to manage copying between host, linear device memory and CUDA array memory.  When using the texture hardware, the data had to be in the right place at the right time, forcing many additional copies to array memory.  

Fortunately, from CUDA 2.0 onwards, it became possible to use texture fetch hardware with normal linear device memory.  As far as I can tell, this innovation obviated the need for Array memory entirely.  If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know.
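For reference, the two kinds of memory are allocated quite differently through the driver API.  The sketch below is illustrative only (my own variable names, error checking omitted):

#include <cuda.h>

// Illustrative sketch: the two kinds of device memory available to the driver API.
void AllocateImageMemory(int Width, int Height)
{
    // (a) CUDA Array memory - an opaque layout optimised for 2-D locality,
    //     managed with its own allocation and copy functions (e.g. cuMemcpy2D).
    CUDA_ARRAY_DESCRIPTOR desc = {};
    desc.Format      = CU_AD_FORMAT_UNSIGNED_INT8;
    desc.NumChannels = 1;
    desc.Width       = Width;
    desc.Height      = Height;
    CUarray cuArray;
    cuArrayCreate(&cuArray, &desc);

    // (b) Pitched linear device memory - since CUDA 2.0 this can be bound
    //     directly to a texture reference with cuTexRefSetAddress2D.
    CUdeviceptr dPtr;
    size_t      dPitch;
    cuMemAllocPitch(&dPtr, &dPitch, Width, Height, 4);
}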






Textures - Kernel Code

Very little code is required in a CUDA kernel in order to use the texture hardware.  A texture reference for accessing the pixels of a regular 8-bit, 2-dimensional image is declared in the kernel code (the .cu file) using:

 
texture<unsigned char, 2, cudaReadModeElementType> tex;

The data can then be fetched through the 2-D texture hardware using 'tex2D', as below:

unsigned char pix = tex2D(tex, fPosX, fPosY);

The really neat thing here is that the position to sample the input image is specified by floating point coordinates (fPosX and fPosY).  The texture reference can be set to perform either nearest-neighbor or bi-linear interpolation in hardware without any additional overhead.  It's not often you get something as useful as floating point bi-linear interpolation for free - thank NVidia.

It is also possible for the texture fetch hardware to return normalized floating point values, which is beneficial in many circumstances.  For example, the GPU is often faster with floating point arithmetic than with integer operations - integer division is rarely a good idea.  For this reason I usually declare a float texture reference using the following:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;


then access the pixels as floating point values:


float pix = tex2D(tex, fPosX, fPosY);

Of course, I have to convert the float pixels back to bytes when I have finished playing around, but that's no big overhead and the hardware provides a fast saturation function to limit the float to the unit range for us:

*pPixOut = 255 * __saturatef(pix);
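
Putting those pieces together, a complete warp kernel is only a few lines long.  The code below is an illustrative sketch rather than the production kernel - it applies an arbitrary mapping (here a simple scale) to compute the source coordinates for each output pixel:

// Illustrative warp kernel sketch - replace the mapping with affine,
// perspective or polar coordinate equations as required.
texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

extern "C" __global__ void WarpKernel(unsigned char *pOut, int outWidth,
                                      int outHeight, int outPitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outWidth || y >= outHeight)
        return;

    // Example mapping: sample the input at half the output coordinates.
    float fPosX = 0.5f * x;
    float fPosY = 0.5f * y;

    // Hardware bilinear fetch, clamped at the image boundary by the host setup.
    float pix = tex2D(tex, fPosX, fPosY);

    // Convert the normalized float back to 8-bit using the saturate intrinsic.
    pOut[y * outPitch + x] = (unsigned char)(255.0f * __saturatef(pix));
}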


Textures - Initialization Host Code (Driver API)
A few lines of additional code are required in your host code during initialization in order to set up the kernel texture reference.  I tend to do this once during a setup phase of the application, typically just after loading the cubin file and getting function handles.


Firstly, you will need to get a handle to your kernel's texture reference for the host to use.  This is similar to getting a handle to a device constant variable, since the reference is retrieved from the kernel cubin by name.  In our example above we declared a texture reference in the kernel named 'tex', so the host code when using the driver API is:

CUtexref m_cuTexref;
cuModuleGetTexRef(&m_cuTexref, m_cuModule, "tex");



Where m_cuModule is the kernel module handle previously loaded/compiled using cuModuleLoadDataEx.  Now we need to set up how the texture unit will access the data.  Firstly, I tell the texture fetch to clamp to the boundary in both dimensions:


    cuTexRefSetAddressMode(m_cuTexref, 0, CU_TR_ADDRESS_MODE_CLAMP);
    cuTexRefSetAddressMode(m_cuTexref, 1, CU_TR_ADDRESS_MODE_CLAMP);

Then we can tell the hardware to fetch image data using nearest neighbour interpolation (point):


    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_POINT);

Or bilinear interpolation mode:

    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_LINEAR);

Finally, we tell the texture reference about the linear memory we are going to use as a texture.  Assume that there is some device memory (CUdeviceptr m_dPtr), allocated during initialization, that will contain the image data with dimensions Width and Height and a byte pitch of m_dPitch:
 
    // Bind texture reference to linear memory
    CUDA_ARRAY_DESCRIPTOR cad;
    cad.Format      = CU_AD_FORMAT_UNSIGNED_INT8;   // Input linear memory is 8-bit
    cad.NumChannels = 1;                            // Input is greyscale
    cad.Width       = Width;                        // Input width
    cad.Height      = Height;                       // Input height

    cuTexRefSetAddress2D(m_cuTexref, &cad, m_dPtr, m_dPitch);

The actual image data can be copied into the device memory at a later time, or repeatedly every time a new image is available for video processing.  The texture reference 'tex' in the kernel has now been connected to the linear device memory.
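
For completeness, the per-frame copy of a host image into that bound linear memory might look like the sketch below (pHostImage is an illustrative name for a tightly packed 8-bit host buffer; error checking omitted):

    // Copy a host greyscale image into the linear memory bound to the texture.
    CUDA_MEMCPY2D cpy = {};
    cpy.srcMemoryType = CU_MEMORYTYPE_HOST;
    cpy.srcHost       = pHostImage;       // host image pointer (assumed)
    cpy.srcPitch      = Width;            // tightly packed 8-bit rows
    cpy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    cpy.dstDevice     = m_dPtr;           // memory bound with cuTexRefSetAddress2D
    cpy.dstPitch      = m_dPitch;
    cpy.WidthInBytes  = Width;
    cpy.Height        = Height;
    cuMemcpy2D(&cpy);                     // or cuMemcpy2DAsync(&cpy, stream)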


Textures - Kernel Call Host Code (Driver API)
There is very little left to do by the time it comes to call a kernel.  We have to activate a hardware texture unit and tell it which texture it will be using.  On the host side the texture reference is called m_cuTexref; we have already connected this reference to the texture declared as 'tex' in the kernel during setup (using cuModuleGetTexRef).  One additional line is required to tell the kernel function which texture is active in the default texture unit.
 
cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);

So the kernel will now be able to use the hardware texture fetch functions (tex2D) to fetch data from the texture named 'tex'.  It is interesting that the texture unit MUST be CU_PARAM_TR_DEFAULT.  A CUDA-enabled GPU will almost certainly have multiple texture units, so in theory it should be possible to read from multiple texture units simultaneously in a kernel to achieve image blending/fusion effects.  Unfortunately, this is not made available to us in CUDA at the time of writing (CUDA 3.1).

To launch the kernel, proceed as normal.  For example:

cuFuncSetBlockShape(cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1);
cuLaunchGridAsync(cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream);
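
Putting the parameter setup, texture activation and launch together, the whole per-frame host sequence looks roughly like the sketch below.  The output image variables (m_dOutPtr, OutWidth, OutHeight, m_dOutPitch) are illustrative names of mine, and the parameter passing rules are covered in the CUDA Parameter Alignment post further down:

// Sketch of the per-frame host sequence (driver API, error checks omitted).
int offset = 0;

void *pOut = (void*)m_dOutPtr;                       // output image device pointer
cuParamSetv(cuFunction_Handle, offset, &pOut, sizeof(pOut));
offset += sizeof(pOut);

cuParamSeti(cuFunction_Handle, offset, OutWidth);                   offset += 4;
cuParamSeti(cuFunction_Handle, offset, OutHeight);                  offset += 4;
cuParamSeti(cuFunction_Handle, offset, (unsigned int)m_dOutPitch);  offset += 4;
cuParamSetSize(cuFunction_Handle, offset);

// Activate the texture in the default unit and launch on a stream.
cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);
cuFuncSetBlockShape(cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1);
cuLaunchGridAsync(cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream);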

Note that I use async calls and multiple streams in order to overlap computation and PCI transfers, thus hiding some of the transfer overhead (a subject for another post).  This can all be hidden from the user by maintaining a rolling buffer internally in the library, making the warp algorithm appear to run faster.

Performance
In order to test the performance I have developed a general purpose warping library that uses our GPU framework to hide all of the CUDA code, JIT compilation, transfers, contexts, streams and threads behind a few simple function calls.  A commonly used warp function, the polar unwrap, has been implemented using the texture fetching method described above and the results look very good.

The input images we chose were from Opto-Engineering, who have a range of lenses that produce polar images of the sides of products.  It is possible to capture high resolution images of the sides of containers as a polar image (below), but in order to accelerate any analysis a fast polar unwrap is needed.



The output images look good when using the hardware bi-linear interpolation (below):


As expected, when nearest-neighbour interpolation is used the image quality is degraded by aliasing problems (below).  Whilst nearest-neighbour would be faster than bilinear on a CPU, the GPU performs the bilinear interpolation at the same speed.



The performance depends on the size of the output image, but typically achieves well over 1GB/sec in transform bandwidth, including all the transfer overheads (Core2Quad Q8400 @ 2.66GHz and a GTX260 with 216 cores).  For these input images (1024x768), the average total transform time to produce the output (1280x384) was under 400 microseconds.  That works out at over 1.2 Gigapixels/sec.
A quick comparison to a third party software polar unwrap tool showed that this was at least an order of magnitude faster.


The algorithm to perform the polar coordinate conversion is computed on-the-fly.  Any number of complex transform functions can be implemented in the library very quickly to achieve this performance.  So far, affine, perspective and polar transforms are done.  Any requests?




vxGWarp Interfaces
Just FYI - the interface to these polar warp functions is pretty trivial; all the GPU expertise is hidden from the end user in the DLL.  The key functions in the header file are:

vxGWarpCreate(VXGWARPHANDLE *hGW, int W, int H);
vxGWarpDestroy(VXGWARPHANDLE hGW);
vxGWarpAccessXferBufIn(VXGWARPHANDLE hGW, unsigned char **pInput, int *nW, int *nP, int *nH);
vxGWarpAccessXferBufOut(VXGWARPHANDLE hGW, unsigned char **pOutput, int *nW, int *nP, int *nH);
vxGWarpPolar(VXGWARPHANDLE hGW, POLARPARAMS PP);
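
A typical calling sequence might look something like the sketch below (illustrative only - the POLARPARAMS fields and the return codes are not shown here):

VXGWARPHANDLE hGW;
vxGWarpCreate(&hGW, 1024, 768);                     // input image size

unsigned char *pIn, *pOut;
int inW, inP, inH, outW, outP, outH;
vxGWarpAccessXferBufIn(hGW, &pIn, &inW, &inP, &inH);
vxGWarpAccessXferBufOut(hGW, &pOut, &outW, &outP, &outH);

// ... fill pIn with the polar source image, respecting the pitch inP ...

POLARPARAMS PP;                                     // centre, radii etc.
vxGWarpPolar(hGW, PP);                              // unwrapped result appears in pOut

vxGWarpDestroy(hGW);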



Vision Experts

Thursday 8 July 2010

Debunking the x100 GPU Myth - Intel Fights Back

Intel recently published this paper, titled 'Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU', which compares a number of GPU kernels with algorithms that are highly optimised for Intel architectures.  The authors concluded that for the right problems the GPU was up to 14x faster than an equivalent optimised CPU implementation; on average a x2.5 increase in speed was seen.

I am all in favour of using GPUs to accelerate image processing when it is appropriate, but the hype has gotten out of control over the last year, so I am very pleased to see Intel put their case forward and bring some balance to the argument.

What I liked about the paper was that, for once, significant effort was expended to optimise BOTH the CPU and the GPU implementations.  Too many biased comparisons are made between highly optimised GPU implementations and naive, plain vanilla, single threaded 'C' versions.  When a x100 increase in speed is cited, I always suspect that the author was highly selective about which parts of the overall system were being timed, that the algorithm was unrealistically well mapped to GPU hardware and not representative of a real problem, or that the CPU implementation was simply not optimised at all.  The NVidia showcase website has made publishing an impressive acceleration factor in the author's best interest.

I certainly have not come across any imaging system that has achieved anything like a x100 increase in throughput by employing GPU technology.  There may be some algorithms that map superbly well to GPUs and can achieve a x100 performance increase in a single algorithm stage, but the numbers published by Intel are much more in line with the total throughput increase I have seen when using GPUs for image processing in real-world applications, compared to the optimised CPU algorithms that are readily available.

An example of disingenuous performance metrics is the image processing blur demo in the NVidia SDK - here the image is loaded from file, pre-processed and converted into a 512x512 floating point greyscale image, transferred to the GPU once, and THEN processed repeatedly at high speed to show how fast the GPU is.  The CPU conversion to floating point format is omitted from the GPU compute time.



I would also agree with Intel that most often, in practice, optimising an algorithm to use multiple cores, maximise cache usage and exploit SSE instructions is easier, faster and ultimately more portable than developing a CUDA replacement algorithm.  I would also agree with the GPU evangelists that the hardware cost of an upgrade to a top-end Intel-based PC system is significantly higher than the investment in a GTX280.  With the tools improving all the time, it is becoming easier to code and deploy GPU-enhanced algorithms.


The conclusion is that, for the time being, we must take a balanced view of the technology available and choose the right processing method to suit the application.  And be realistic.

Vision Experts

Saturday 3 July 2010

CUDA Parameter Alignment

When executing a CUDA kernel, it is almost always necessary to pass some parameters into the kernel function.  For image processing, the parameters are usually at least a pointer to the image data to be processed, plus the width, height, pitch etc. that describe the image.  The GPU kernel can then access the input parameters when it runs.  For this to happen, the parameters passed into the kernel function call have to be copied from host memory to the device code running on the GPU.  The mechanism for passing parameters to kernels at execution is different from the majority of host-to-device data copies, which use explicit function calls such as cuMemcpyHtoD().  Kernel function parameters, much like regular function call parameters, are passed using a parameter stack.

When using the CUDA Runtime API, parameter passing is taken care of transparently and no additional work is required on the part of the programmer.  The Runtime API hides the details of copying parameters from host memory into a parameter stack in GPU device memory, which the kernel can then access as its input parameters.  The Driver API is somewhat lower level.
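
For comparison, a Runtime API launch simply lists the parameters in the <<< >>> syntax and the marshalling happens behind the scenes - something like the sketch below, where the kernel and argument names are illustrative only:

// Runtime API sketch: parameter marshalling is handled by the launch syntax.
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
warpKernel<<<grid, block>>>(dImage, width, height, pitch);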

The CUDA Driver API does not hide as much of the detail and the programmer must manage the process correctly, pushing variables onto a parameter stack in the correct order and with the correct alignment and size.  In my experience, and judging from the number of questions out there on newsgroups, parameter passing can be a source of trouble.

In the Driver API, function parameters are all passed to the kernel parameter space using the functions: 
  • cuParamSeti(CUfunction hFunc, int offset, unsigned int value) - Pass an integer
  • cuParamSetf(CUfunction hFunc, int offset, float value)  - Pass a float
  • cuParamSetv(CUfunction hFunc, int offset, void*, unsigned int numbytes) - Pass data
These functions place data residing in the calling host memory onto the kernel parameter stack at the position specified by offset.  It is crucial that offset is correct: it must account for the total size of all the previous items placed on the stack, as well as their alignment requirements.


A few of the common causes of problems are:
  • Differences between the host alignment and device alignment of some data types.  Sometimes, additional alignment bytes must be added to offset to give the correct alignment.
  • Differences between the host size and device size of some data types, leading to incorrect value for numbytes or incorrect offset accumulation.
  • 32-bit and 64-bit memory addressing when passing device pointers to cuParamSetv.
Standard Data Types
CUDA uses the same size and alignment as the host for all standard types, so using sizeof() and __alignof() in host code will yield the correct numbers to put parameters on the kernel stack.  The exception is that the host compiler can choose to align double, long long and 64-bit long (on a 64-bit OS) on a 4-byte boundary, but the kernel will always expect these to be aligned on an 8-byte boundary on the stack.
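
For example, a double must be placed on an 8-byte boundary in the parameter stack even if the host compiler reports a smaller alignment.  A sketch (my own variable names):

// A double must sit on an 8-byte boundary in the kernel parameter stack.
int offset = 0;
int    myInt    = 42;
double myDouble = 3.14159;

cuParamSeti(hMyFunc, offset, myInt);
offset += sizeof(myInt);                    // offset is now 4

offset = (offset + 7) & ~7;                 // round up to 8 for the double
cuParamSetv(hMyFunc, offset, &myDouble, sizeof(myDouble));
offset += sizeof(myDouble);

cuParamSetSize(hMyFunc, offset);            // total parameter stack size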

A common mistake is to push a small data type onto the stack, followed by a larger data type with larger alignment requirements, but forgetting to increment offset to meet the alignment of the larger type.  For example, in the code below a 2-byte short is pushed onto the stack followed by a four-byte int. 


WRONG: the byte alignment of int is 4 bytes, but offset is only incremented by the size of the short.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 2;   // wrong: offset is not aligned for the int that follows
cuParamSetv(hMyFunc, offset, &myint32, 4);

RIGHT: the byte alignment of int is 4 bytes, so offset must be advanced to a 4-byte boundary.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 4;   // advance past the short and round up to the int's 4-byte alignment
cuParamSetv(hMyFunc, offset, &myint32, 4);

In order to ensure you have the right number for offset, NVidia provide a macro called ALIGN_UP that should be used to adjust the offset prior to calling the next cuParamSetx function.
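
The macro, as it appears in the SDK samples, is essentially the following (reproduced here as a sketch):

// Round 'offset' up to the next multiple of 'alignment' (a power of two).
#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

// Usage: align before pushing a parameter with a larger alignment requirement.
ALIGN_UP(offset, __alignof(myint32));
cuParamSetv(hMyFunc, offset, &myint32, sizeof(myint32));
offset += sizeof(myint32);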


Built-In Vector Types
CUDA provides some built-in vector types, listed in Table B.1 in section B.3.1 of the CUDA Programming Guide 3.1.  This means that the kernel can interpret some of the parameters on its input parameter stack as one of these vector types.  The host code does not have equivalent vector types built in, so again, care must be taken to use the right offset and alignment.  Most alignments are obvious, but there are exceptions: for example, float2 and int2 have 8-byte alignment, whilst float3 and int3 have 4-byte alignment.
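
As an illustration, pushing a parameter that the kernel will read as a float2 means rounding offset up to a multiple of 8 first (a sketch, reusing the ALIGN_UP macro above):

// A kernel parameter declared as float2 needs 8-byte alignment on the stack.
struct { float x, y; } myFloat2 = { 1.0f, 2.0f };   // host-side stand-in for float2
ALIGN_UP(offset, 8);                                // float2 alignment is 8 bytes
cuParamSetv(hMyFunc, offset, &myFloat2, sizeof(myFloat2));
offset += sizeof(myFloat2);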


Device Pointers

This starts to get a bit more complicated.  There used to be only two possibilities: the GPU always used 32-bit pointers, and the calling OS was either a 32-bit OS or a 64-bit OS.  With the arrival of Fermi, support for 64-bit addressing is possible, meaning there are now three valid combinations.


32-bit OS
This covers probably the most common scenario.  For all devices except Fermi, a CUdeviceptr can be safely cast into a 32-bit void* without issue.  On 32-bit operating systems, the address-of operator & will result in a 32-bit pointer, so CUDA-allocated device pointers can be passed as (void*) parameters.  For example:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

64-bit OS, 32-bit GPU
For 64-bit operating systems, there is a difference in size between a 32-bit CUdeviceptr and a 64-bit (void*).


So THIS LINE BELOW WILL NOT WORK:

cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

The line above will not work since sizeof(CUdeviceptr) = 4, but the address of MyDevicePtr will be a 64-bit (8-byte) pointer.  Using the code above will cause bad things to happen.  The correct code is:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(void*));

or - even better (more portable)
void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must be taken to make sure offset is always a multiple of 8 bytes before calling this function, since these 64-bit pointers have 8-byte alignment requirements.

64-bit OS, 64-bit Fermi GPU addressing
When using nvcc to compile 64-bit code for Fermi, both host and GPU code will use 64-bit addressing. The pointer size for both host and GPU will now be the same, so the call used above will still work:

void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must still be taken since these 64-bit pointers have 8-byte alignment requirements. 

So the key points to remember are:
  1. Check that the size is right.  Be aware of (void*) size differences.  Be aware of double, long long, and long (64-bit) differences in size.
  2. Increment the stack offset by the right amount.  Then:
  3. Check that the stack offset is aligned ready for the requirements of the parameter to be added next.  
  4. Repeat from 1.
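
One way to reduce the scope for mistakes is to wrap the bookkeeping in a small helper, something along these lines (my own sketch, not part of the CUDA API):

// Hypothetical helper: align the offset, push one parameter, return the new offset.
// 'alignment' is the device-side alignment requirement of the type being pushed.
static int PushParam(CUfunction hFunc, int offset, void *pData,
                     unsigned int numbytes, unsigned int alignment)
{
    offset = (offset + alignment - 1) & ~(alignment - 1);   // align up
    cuParamSetv(hFunc, offset, pData, numbytes);            // copy onto the stack
    return offset + numbytes;                               // running offset
}

// Usage sketch:
//   int offset = 0;
//   void *ptr = (void*)MyDevicePtr;
//   offset = PushParam(hMyFunc, offset, &ptr, sizeof(ptr), sizeof(ptr));
//   offset = PushParam(hMyFunc, offset, &Width, sizeof(Width), sizeof(Width));
//   cuParamSetSize(hMyFunc, offset);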

 


Vision Experts