Friday, 23 July 2010

Blazingly Fast Image Warping

Want to achieve over 1 Gigapixel/sec warping throughput?  Then leverage your GPU texture units using CUDA.


Image warping is a very useful and important image processing function that we use all the time.  It is often used, when calibrated, to remove distortions such as perspective projection and lens distortion.  Many pattern matching libraries make use of affine image warps to compute image alignment.  Almost all imaging libraries have a warping tool in their toolbox.  In this post I will say a little about how we make use of the texture hardware in a GPU using CUDA, plus we show some benchmarks for polar unwrapping - and it is fast.

If there is one thing that the GPU is excellent at, it is image warping.  We can thank the gamers for their insatiable appetite for speed in warping image data or 'textures' onto polygons. Fortunately, even when using CUDA to program a GPU as a general purpose co-pro, the fast texture hardware is still available to us for accelerated warping.


There are several good reasons to use the texture hardware from CUDA when image processing:
  • The ordering of texture fetches is generally less restrictive than the strict requirements for coalescing global memory reads.  When the order of your data reads does not fit a coalesced memory access pattern, consider texture fetches.
  • Texture fetches are cached.  For CUDA array memory, the texture cache has a high level of 2-D locality.  Texture fetches from linear memory are also cached.
  • Texture fetches perform bilinear interpolation in hardware.
  • Texture fetches can clamp or wrap at image boundaries, so you don't have to do careful bounds checking yourself.

Linear Memory vs Array Memory
When writing a graphics application with an API like OpenGL or DirectX, texture images are transferred to and stored on the GPU in a way that optimizes the cache for 2-D spatial locality.  With CUDA, a type of memory called a CUDA Array is available to serve this purpose, and CUDA Array memory stores 2-D image data in a bespoke layout to enhance 2-D throughput.  CUDA Array memory is managed separately from CUDA linear device memory and has its own memory allocation and copy functions.

Dedicated CUDA Array memory meant that in the early days of CUDA (going waaay back maybe three whole years), the developer had to manage copying between host, linear device memory and CUDA array memory.  When using the texture hardware, the data had to be in the right place at the right time, forcing many additional copies to array memory.  

Fortunately, from CUDA 2.0 onwards, it became possible to use texture fetch hardware with normal linear device memory.  As far as I can tell, this innovation obviated the need for Array memory entirely.  If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know.






Textures - Kernel Code

Very little code is required in a CUDA kernel in order to use the texture hardware.  A texture reference for accessing the pixels of a regular 8-bit, 2-dimensional image is declared in the kernel code (the .cu file) as follows:

 
texture<unsigned char, 2, cudaReadModeElementType> tex;

The data can then be fetched through the 2-D texture fetch hardware using 'tex2D' as below:

unsigned char pix = tex2D( tex, fPosX, fPosY );

The really neat thing here is that the position to sample the input image is specified by floating point coordinates (fPosX and fPosY).  The texture reference can be set to perform either nearest-neighbor or bi-linear interpolation in hardware without any additional overhead.  It's not often you get something as useful as floating point bi-linear interpolation for free - thank NVidia.

It is also possible for the texture fetch hardware to return normalized floating point values, which is beneficial in many circumstances.  For example, in many cases the GPU is faster with floating point arithmetic operations than it is with integer operations.  Integer division is rarely a good idea.  For this reason I usually declare a float texture object using the following:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;


then access the pixels as floating point values:


float pix = tex2D( tex, fPosX, fPosY );

Of course, I have to convert the float pixels back to bytes when I have finished playing around, but that's no big overhead and the hardware provides a fast saturation function to limit the float to the unit range for us:

*pPixOut = 255 * __saturatef(pix);
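
Putting those pieces together, here is a minimal warp kernel sketch - not our library code - showing how the texture fetch and the saturation fit into a complete kernel.  The MapX/MapY functions and their hard-coded affine coefficients are purely illustrative; a real warp would take its transform parameters from kernel arguments or constant memory.

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

// Illustrative mapping only: a hard-coded affine transform.
__device__ float MapX(int x, int y) { return  0.9f * x + 0.1f * y + 4.0f; }
__device__ float MapY(int x, int y) { return -0.1f * x + 0.9f * y + 2.0f; }

__global__ void WarpKernel(unsigned char *pOut, int outPitch, int outWidth, int outHeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outWidth || y >= outHeight) return;

    // Floating point source position for this output pixel.
    float fPosX = MapX(x, y);
    float fPosY = MapY(x, y);

    // Hardware (bi-linear) fetch, returned as a normalized float in [0,1].
    float pix = tex2D(tex, fPosX, fPosY);

    // Saturate and convert back to 8-bit.
    pOut[y * outPitch + x] = (unsigned char)(255.0f * __saturatef(pix));
}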


Textures - Initialization Host Code (Driver API)
A few lines of additional code are required in your host code during initialization in order to set up the kernel texture reference.  I tend to do this once during a setup phase of the application, typically just after loading the cubin file and getting function handles.


Firstly, you will need to get a handle to your kernel's texture reference for the host to use.  This is similar to getting a handle to a device constant variable, since the reference is retrieved from the kernel cubin by name.  In our example above we declared a texture reference in the kernel named 'tex'.  The host code when using the driver API is therefore:

CUtexref m_cuTexref;
cuModuleGetTexRef(&m_cuTexref, m_cuModule, "tex");



Where m_cuModule is the kernel module handle previously loaded/compiled using cuModuleLoadDataEx.  Now we need to set up how the texture unit will access the data.  Firstly, I tell the texture fetch to clamp to the boundary in both dimensions:


    cuTexRefSetAddressMode(m_cuTexref, 0, CU_TR_ADDRESS_MODE_CLAMP);
    cuTexRefSetAddressMode(m_cuTexref, 1, CU_TR_ADDRESS_MODE_CLAMP);

Then we can tell the hardware to fetch image data using nearest neighbour interpolation (point):


    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_POINT);

Or bilinear interpolation mode:

    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_LINEAR);

Finally, we tell the texture reference about the linear memory we are going to use as a texture.  Assume that some device memory (CUdeviceptr m_dPtr) has been allocated during initialization, which will contain the image data of dimensions Width and Height with a byte pitch of m_dPitch.
 
    // Bind texture reference to linear memory
    CUDA_ARRAY_DESCRIPTOR cad;
    cad.Format = CU_AD_FORMAT_UNSIGNED_INT8;    // Input linear memory is 8-bit
    cad.NumChannels = 1;                        // Input is greyscale
    cad.Width = Width;                          // Input width
    cad.Height = Height;                        // Input height

    cuTexRefSetAddress2D(m_cuTexref, &cad, m_dPtr , m_dPitch);
The actual image data can be copied into the device memory at a later time, or repeatedly every time a new image is available for video processing.  The texture reference 'tex' in the kernel has now been connected to the linear device memory.


Textures - Kernel Call Host Code (Driver API)
There is very little left to do by the time it comes to call a kernel.  We have to activate a hardware texture unit and tell it which texture it will be using.  On the host side, the texture reference was called m_cuTexref, and we have already connected this reference to the texture reference named 'tex' in the kernel during setup (using cuModuleGetTexRef).  One additional line is required to tell the kernel function which texture is active in the default texture unit.
 
cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);

So, the kernel will now be able to use the hardware texture fetch function (tex2D) to fetch data from the texture reference named 'tex'.  It is interesting that the texture unit MUST be CU_PARAM_TR_DEFAULT.  A CUDA enabled GPU will almost certainly have multiple texture units, so in theory it should be possible to read from multiple texture units simultaneously in a kernel to achieve image blending/fusion effects.  Unfortunately, this is not made available to us in CUDA at the time of writing (CUDA 3.1).

To launch the kernel, proceed as normal.  For example:

cuFuncSetBlockShape( cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1 );
cuLaunchGridAsync( cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream );

Note that I use async calls and multiple streams in order to overlap computation and PCI transfers, thus hiding some of the transfer overhead (a subject for another post).  This can all be hidden from the user by maintaining a rolling buffer internally in the library, making the warp algorithm appear to run faster.
Performance
In order to test the performance I have developed a general purpose warping library that uses our GPU framework to hide all of the CUDA code, JIT compilation, transfers, contexts, streams and threads behind a few simple function calls.  A commonly needed warp function, the polar unwrap, has been implemented using the texture fetching method described above, and the results look very good.

The input images we chose were from Opto-Engineering, who have a range of lenses that produce polar images of the sides of products.  It is possible to capture high resolution images of the sides of containers as a polar image (below), but in order to accelerate any analysis, a fast polar unwrap is needed.



The output images look good when using the hardware bi-linear interpolation (below):


As expected, when nearest-neighbour interpolation is used, the image quality is degraded by aliasing (below).  Whilst nearest-neighbour would be the faster option on a CPU, the GPU performs bilinear interpolation at the same speed.



The performance depends on the size of the output image, but typically achieves well over 1GB/sec in transform bandwidth, including all the transfer overheads (Core2Quad Q8400@2.66GHz & GTX260 216 cores).  For these input images (1024x768), the average total transform time to produce the output (1280x384) was under 400 microseconds.  That works out at over 1.2 Gigapixels/sec.
A quick comparison to a third party software polar unwrap tool showed that this was at least an order of magnitude faster.


The algorithm to perform the polar coordinate conversion is computed on-the-fly.  Any number of complex transform functions can be implemented in the library very quickly to achieve this performance.  So far, affine, perspective and polar transforms are done.  Any requests?




vxGWarp Interfaces
Just FYI - the interface to these polar warp functions is pretty trivial; all the GPU expertise is hidden from the end user in the DLL.  The key functions in the header file are:

vxGWarpCreate(VXGWARPHANDLE *hGW, int W, int H);
vxGWarpDestroy(VXGWARPHANDLE hGW);
vxGWarpAccessXferBufIn(VXGWARPHANDLE hGW, unsigned char **pInput, int *nW, int *nP, int *nH);
vxGWarpAccessXferBufOut(VXGWARPHANDLE hGW, unsigned char **pOutput, int *nW, int *nP, int *nH);
vxGWarpPolar(VXGWARPHANDLE hGW, POLARPARAMS PP);
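
For illustration, here is a sketch of how the interface might be driven from client code.  The exact semantics are assumptions on my part: whether the create call takes the input or output size, the contents of POLARPARAMS and the return codes are not documented here, and error checking is omitted.

VXGWARPHANDLE hGW;
unsigned char *pIn, *pOut;
int inW, inP, inH, outW, outP, outH;

vxGWarpCreate(&hGW, 1024, 768);                          // image size (assumed to be the input)
vxGWarpAccessXferBufIn(hGW, &pIn, &inW, &inP, &inH);     // get the transfer buffers
vxGWarpAccessXferBufOut(hGW, &pOut, &outW, &outP, &outH);

// ... acquire or copy the source image into pIn, one row per pitch inP ...

POLARPARAMS PP;                                          // centre, radii, angular range etc. (assumed fields)
// ... fill in PP for the lens being used ...
vxGWarpPolar(hGW, PP);                                   // polar unwrap on the GPU

// pOut now holds the unwrapped image (outW x outH, pitch outP).
vxGWarpDestroy(hGW);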



Vision Experts

Thursday, 8 July 2010

Debunking the x100 GPU Myth - Intel Fights Back

Intel recently published this paper titled 'Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU' that makes an attempt to compare a number of GPU kernels with algorithms that are highly optimised for Intel architectures.  The authors concluded that for the right problems, the GPU was up to 14x faster than an equivalent optimised CPU implementation. On average a x2.5 increase in speed was seen.  

I am all in favour of using GPUs to accelerate image processing when it is appropriate, but the hype has gotten out of control over the last year, so I am very pleased to see Intel put their case forward and bring some balance to the arguments.

What I liked about the paper was that for once, significant effort was expended to optimise BOTH the CPU and the GPU implementations.  Too many biased comparisons are made between highly optimised GPU implementations and naive, plain vanilla single threaded 'C' versions.  When a x100 increase in speed is cited, I always suspect that the author was being highly selective in what parts of the overall system were being timed, or that the algorithm was unrealistically well mapped to GPU hardware and not representative of a real problem, or even that the CPU implementation was simply not optimised at all.  The NVidia showcase website has made publishing an impressive acceleration factor in the author's best interest.

I certainly have not come across any imaging systems that have achieved anything like a x100 increase in throughput by employing GPU technology.  There may be some algorithms that map superbly well to GPUs and can achieve a x100 performance increase in a single algorithm stage, but the numbers published by Intel are much more in line with the total throughput increase I have seen when using GPUs for image processing in real-world applications, compared to the optimised CPU algorithms that are readily available.

An example of disingenuous performance metrics is the image processing blur demo in the NVidia SDK - here the image is loaded from file, pre-processed and converted into a 512x512 floating point greyscale image, transferred to the GPU once, and THEN processed repeatedly at high speed to show how fast the GPU is.  The CPU conversion to floating point format is omitted from the GPU compute time.



I would also agree with Intel that most often, in practice, optimising an algorithm to use multiple cores, maximise cache usage and SSE instructions is easier, faster and ultimately more portable than developing a CUDA replacement algorithm.  I would also agree with the GPU evangelists that the hardware cost of upgrading to a top-end Intel based PC system is significantly higher than the investment in a GTX280.  With the tools improving all the time, it is becoming easier to code and deploy GPU enhanced algorithms.


The conclusion is, for the time being, we must take a balanced view of the technology available and choose the right processing method to suit the application.  And be realistic.

Vision Experts

Saturday, 3 July 2010

CUDA Parameter Alignment

When executing a CUDA kernel, it is almost always necessary to pass some parameters into the kernel function.  For image processing, the parameters are usually at least a pointer to the image data to be processed, plus the width, height, pitch etc. that describe the image.  The GPU kernel can then access the input parameters when it runs.  For this to happen, the parameters passed into the Kernel function call have to be copied from the host memory to the device code running on the GPU.  The mechanism for passing parameters to Kernels at execution is different to the majority of the host-to-device data copies, which use explicit function calls such as cuMemcpy().  Kernel function parameters, similarly to regular function calls, are passed using a parameter stack.

When using the CUDA Runtime API, parameter passing is taken care of transparently and no additional work is required on the part of the programmer.  The Runtime API hides the details of copying host parameters from host memory into a parameter stack in the GPU device memory, which the kernel can then access as its input parameters.  The Driver API is somewhat lower level.

The CUDA Driver API does not hide as much of the detail and the programmer must manage the process correctly, pushing variables onto a parameter stack in the correct order and with the correct alignment and size.  In my experience, and judging from the number of questions out there on newsgroups, parameter passing can be a source of trouble.

In the Driver API, function parameters are all passed to the kernel parameter space using the functions: 
  • cuParamSeti(CUfunction hFunc, int offset, unsigned int value) - Pass an integer
  • cuParamSetf(CUfunction hFunc, int offset, float value)  - Pass a float
  • cuParamSetv(CUfunction hFunc, int offset, void*, unsigned int numbytes) - Pass data
These functions place data residing in host memory onto the kernel parameter stack at the position specified by offset.  It is crucial that offset is correct: it must account for the total size of all the previous items placed on the stack, including their alignment.


A few of the common causes of problems are:
  • Differences between the host alignment and device alignment of some data types.  Sometimes, additional alignment bytes must be added to offset to give the correct alignment.
  • Differences between the host size and device size of some data types, leading to incorrect value for numbytes or incorrect offset accumulation.
  • 32-bit and 64-bit memory addressing when passing device pointers to cuParamSetv
Standard Data Types
CUDA uses the same size and alignment for all standard types, so using sizeof() and __alignof() in host code will yield the correct numbers to put parameters on the kernel stack.  The exception is that the host compiler can choose to align double, long long and 64-bit long (on a 64-bit OS) on a one-word (4-byte) boundary, but the kernel will always expect these to be aligned on a two-word (8-byte) boundary on the stack.

A common mistake is to push a small data type onto the stack, followed by a larger data type with larger alignment requirements, but forgetting to increment offset to meet the alignment of the larger type.  For example, in the code below a 2-byte short is pushed onto the stack followed by a four-byte int. 


WRONG: Byte alignment of int is 4 bytes, but offset is only incremented by the size of the short.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 2;  // wrong: leaves offset misaligned for the int that follows
cuParamSetv(hMyFunc, offset, &myint32, 4);

RIGHT: Byte alignment of int is 4 bytes, so offset has to be rounded up to a multiple of 4.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 4;  // pad up to the 4-byte alignment of the next parameter
cuParamSetv(hMyFunc, offset, &myint32, 4);

In order to ensure you have the right value for offset, NVidia provide a macro called ALIGN_UP in the SDK examples that should be used to adjust the offset prior to calling the next cuParamSet* function.
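
For reference, here is the macro as it appears in the SDK examples, together with the short-then-int example above rewritten to use it (the alignment argument is the device-side alignment of the next parameter):

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

offset = 0;

ALIGN_UP(offset, __alignof(myshort16));
cuParamSetv(hMyFunc, offset, &myshort16, sizeof(myshort16));
offset += sizeof(myshort16);

ALIGN_UP(offset, __alignof(myint32));                 // rounds offset up from 2 to 4
cuParamSetv(hMyFunc, offset, &myint32, sizeof(myint32));
offset += sizeof(myint32);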


Built-In Vector Types
CUDA provides some built-in vector types, listed in Table B.1 in section B.3.1 of the CUDA programming guide 3.1.  This means that the kernel can interpret some of the parameters on its input parameter stack as one of these vector types.  The host code does not have equivalent vector types, so again, care must be taken to use the right offset and alignment.  Most alignments are obvious, but there are exceptions: for example, float2 and int2 have 8-byte alignment, while float3 and int3 have 4-byte alignment.


Device Pointers

This starts to get a bit more complicated.  There used to be only two possibilities: the GPU always used 32-bit pointers, and the calling OS was either 32-bit or 64-bit.  With the arrival of Fermi, 64-bit device addressing is possible, meaning there are now three valid combinations.


32-bit OS
This covers probably the most common scenario.  For all devices except Fermi, a CUdeviceptr can be safely cast to a 32-bit void* without issue.  On 32-bit operating systems, the address-of operator & will result in a 32-bit pointer, so CUDA allocated device pointers can be passed as (void*) parameters.  For example:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

64-bit OS, 32-bit GPU
For 64-bit operating systems, there is a difference in size between a 32-bit CUdeviceptr and a 64-bit (void*).


So THIS LINE BELOW WILL NOT WORK:

cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

The line above will not work since sizeof(CUdeviceptr) = 4, but the address of MyDevicePtr will be a 64-bit (8-byte) pointer.  Using the code above will cause bad things to happen.  The correct code is:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(void*));

or - even better (more portable)
void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must be taken to make sure offset is always a multiple of 8 bytes before calling this function, since these 64-bit pointers have 8-byte alignment requirements.

64-bit OS, 64-bit Fermi GPU addressing
When using nvcc to compile 64-bit code for Fermi, both host and GPU code will use 64-bit addressing. The pointer size for both host and GPU will now be the same, so the call used above will still work:

void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must still be taken since these 64-bit pointers have 8-byte alignment requirements. 

So the key points to remember are:
  1. Check that the size is right.  Be aware of (void*) size differences.  Be aware of double, long long, and long (64-bit) differences in size.
  2. Increment the stack offset by the right amount.  Then:
  3. Check that the stack offset is aligned ready for the requirements of the parameter to be added next.  
  4. Repeat from 1.
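
Putting points 1 to 4 together, here is a minimal driver-API sketch that pushes a device pointer, an int and a float onto the parameter stack and launches the kernel.  The names (dSrc, width, scale, cuFunc, gridW, gridH) are illustrative, ALIGN_UP is the macro shown earlier, and error checking is omitted.

int offset = 0;

void *ptr = (void*)dSrc;                          // CUdeviceptr passed as a void*
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(cuFunc, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);

ALIGN_UP(offset, __alignof(width));
cuParamSeti(cuFunc, offset, width);               // 32-bit int
offset += sizeof(width);

ALIGN_UP(offset, __alignof(scale));
cuParamSetf(cuFunc, offset, scale);               // 32-bit float
offset += sizeof(scale);

cuParamSetSize(cuFunc, offset);                   // total size of the parameter stack
cuFuncSetBlockShape(cuFunc, 16, 16, 1);
cuLaunchGrid(cuFunc, gridW, gridH);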

 


Vision Experts

Friday, 4 June 2010

GPU 5x5 Bayer Conversion

The standard Bayer mosaic conversion algorithms used in machine vision typically employ a fast bilinear interpolation method in order to reconstruct a full 24-bit per pixel colour image.  Numerous other, more sophisticated algorithms exist in the public domain, but very few are implemented in a sensible machine vision library.  Since most industrial vision tasks are not aimed at recovering very high-fidelity images for human consumption, Bayer conversion quality does not seem to have been a priority.  It's strange, really, given that it is easy to spend $10k on a colour machine vision camera, capture device and lens, only to put the captured images through the basic Bayer de-mosaic algorithm at the last moment.

In order to try and improve the situation, we've implemented various Bayer algorithms, including our own adaptive version of the 5x5 Malvar-He-Cutler interpolation algorithm.  Our implementation of the Malvar algorithm (which we call Ultra Mode) is noticeably sharper and has less colour fringing than the standard method.

The 2-frame gif below shows the difference on a long-range image taken with a well-known machine vision camera.  OK - granted, it's not a drastic difference and the gif encoding doesn't help, but sometimes this fidelity change can be important.  Given that our implementation runs on any CUDA enabled GPU faster than a basic CPU bilinear algorithm, there isn't really a down-side to using the better method.





Vision Experts

Sunday, 16 May 2010

GPU Accelerated Laser Profiling

Laser Profiling extracts a dense set of 3D coordinates of a target object by measuring the deviation of a straight laser line as it is swept across the target.  Many of these systems make use of custom hardware (e.g. Sick IVC3D) and an FPGA to achieve high line profile rates, often achieving multiple thousands of profiles per second. 

It is also possible to assemble a laser profiling system using any high speed camera and a laser line.  Partial-scan cameras can be useful to get high frame-rates, but some fast software is also required to find and measure the laser line position to sub-pixel accuracy in every image for every profile.  These positions are then converted to world coordinates using a calibrated projection and lens distortion correction - which requires some floating point operations.  The hardware solutions typically manage several thousand profiles/sec; software is normally slower.

Recently, I've been experimenting with GPU accelerated line profiling - and it's looking fast.  The GPU turns out to be pretty well suited to measuring the laser lines in parallel, since we can launch a single thread per column of the input image.  In fact, for memory access efficiency, it is better for each thread to read a 32-bit int that packs four 8-bit pixels.  A block of 16 threads therefore computes the laser positions for 64 pixels in parallel.  With multiple blocks of 64 pixels running concurrently (Figure 1), the processing rate is pretty much only limited by GPU-host transfers.  On my test rig, the GTX260 GPU has 216 cores, so it can execute 3,456 threads in parallel - way more than are actually needed, so many are idle in my current implementation.
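
As an illustration of the approach (not the library code itself), here is a kernel sketch in which each thread walks four packed columns and computes an intensity-weighted centroid per column as the sub-pixel line position.  The real implementation may well use a different sub-pixel estimator and thresholding; the image width is assumed to be a multiple of four and the grid sized so that every packed group of columns is covered.

__global__ void FindLaserLine(const unsigned int *pImg,   // four packed 8-bit pixels per word
                              int pitchInWords,           // row pitch in 32-bit words
                              int height,
                              float *pSubPixRow)          // one sub-pixel row per column
{
    int word = blockIdx.x * blockDim.x + threadIdx.x;     // which group of 4 columns

    float sumI[4]  = { 0.0f, 0.0f, 0.0f, 0.0f };
    float sumIy[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    for (int y = 0; y < height; ++y)
    {
        unsigned int packed = pImg[y * pitchInWords + word];
        for (int c = 0; c < 4; ++c)
        {
            float pix = (float)((packed >> (8 * c)) & 0xFF);
            sumI[c]  += pix;
            sumIy[c] += pix * (float)y;
        }
    }

    for (int c = 0; c < 4; ++c)
    {
        int x = word * 4 + c;
        // Intensity-weighted centroid gives a sub-pixel estimate of the laser row.
        pSubPixRow[x] = (sumI[c] > 0.0f) ? (sumIy[c] / sumI[c]) : -1.0f;
    }
}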


Figure1.  Each thread scans four columns in order to compute the position of the laser line in that column.  With each GPU core executing 64 threads in parallel, this can be very fast.


Figure2. The C# test application (using our own OpenGL 3D display library) was able to achieve over 200MPix/sec throughput using Common Vision Blox images.  The lower-level C interface was double that speed when using RAW image data.   

My initial results show that the C# interface is able to achieve about 200MPix/sec throughput (Figure 2) - but that uses Common Vision Blox images which must be unwrapped and marshalled to the 'C' dll, which slows things down.
The low-level 'C' dll was achieving >600MPix/sec throughput (Figure 3) - that's many kHz for a range of resolutions.  It may be that this GPU accelerated algorithm is able to provide line rates that previously only hardware could achieve.

Figure3. The low level DOS test application with 'C' dll interface was able to achieve over 600MPix/sec throughput using pre-loaded raw images.  That was 2.5KHz profile rate on 1280x200 laser images, or 390fps for 1280x1280 scans.


Vision Experts

Sunday, 2 May 2010

Faster Memory Transfers

NVidia provide a mechanism to allocate non-paged ('pinned') memory on the host, which can significantly improve host-to-GPU transfer performance.  But does it help in practice?

The main bottleneck in GPU processing is the PCIe bus, which has a relatively low bandwidth.  For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU.  For normal host-to-GPU data transfers using cuMemcpyHtoD, a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI Express bus.  This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms of overhead to transfer a 1920x1080 greyscale image.


Figure 1.  A normal cuMemcpy from host to device runs at about 2GB/sec.

If we use the NVidia cuMemAllocHost function to allocate non-paged memory on the host, we can almost double the bandwidth when copying this buffer to GPU device memory, achieving nearer 4GB/sec on most systems.  If you are able to write your capture code so that the frame grabber driver will DMA image data directly into one of these page-locked buffers, then that is a worthwhile thing to do.  Unfortunately, that's not always possible.
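
A minimal driver-API sketch of the idea is shown below.  It assumes a context is current, that d_Image has already been allocated with cuMemAlloc and that stream is an existing CUstream; error checking is omitted.

void  *pPinned = 0;
size_t nBytes  = 1920 * 1080;                 // one 8-bit 1920x1080 frame

cuMemAllocHost(&pPinned, nBytes);             // page-locked ('pinned') host buffer

// Ideally the frame grabber DMAs straight into pPinned; otherwise memcpy into it.
// The host-to-device copy from pinned memory then runs at the higher bandwidth:
cuMemcpyHtoD(d_Image, pPinned, nBytes);

// Pinned memory also allows asynchronous copies that overlap with kernel execution:
cuMemcpyHtoDAsync(d_Image, pPinned, nBytes, stream);

cuMemFreeHost(pPinned);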

Page-Locked Intermediate Buffer
Sometimes, the frame-grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA allocated page-locked memory.  In this situation, we can either copy our captured image directly to the device memory as in Figure1, or choose to memcpy into a page-locked buffer prior to transfer across the PCIe bus as in Figure2.  Since a host memcpy takes time, this erodes some of the benefit of using the page-locked buffer. 



Figure 2.  Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.

Using a page-locked transfer buffer as shown in figure2 is only worthwhile when the cost of the host memcpy operation is low - which requires a relatively high performance chipset (e.g. ICH10) with fast DDR2 (6.4GB/sec) or DDR3 (8.5GB/sec) memory.  At a minimum, the host-to-host copy must execute faster than 4GB/sec otherwise the direct copy in figure1 is usually faster.  As an example, the approximate time taken to transfer 1GB using paged memory is:

1GB / (2GB/sec) = 500ms

When using the scheme in figure2, the total time taken to transfer 1GB from host to the page-locked buffer and then onto the GPU is approximately:
  1GB / (8GB/sec) = 125ms
+ 1GB / (4GB/sec) = 250ms
                  = 375ms


This is an improvement over the straight copy, so it would appear that non-paged memory does help even in this non-ideal situation.  When using a newer chipset with PCI Express v2.0, the maximum achievable transfer bandwidth is higher.  In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but they are limited by main memory bandwidth, reducing host-to-GPU bandwidth to something between 5 and 6GB/sec.


The conclusion is that, if at all possible, acquire directly into pinned, page-locked memory.  If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance is good.

Direct FrameGrabber-to-GPU DMA
It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible.  It may be achieved using driver-level transfers akin to DirectShow drivers, but it is not currently possible to get a physical address of GPU device memory using CUDA.

Simon Green, from NVidia says:
"A lot of people have asked for this. It is technically possible for other PCI-E devices to DMA directly into GPU memory, but we don't have a solution yet. We'll keep you posted." - Sep 2009
This is a capability worth waiting for, but don't hold your breath.



Vision Experts

Sunday, 28 March 2010

CUDA3.0 cubin Files

It appears that NVidia has changed the format of cubin files with CUDA 3.0 to a standard binary ELF format.  Here's what they say in the release notes:

  • CUDA C/C++ kernels are now compiled to standard ELF format
You can find out about ELF files at the Wikipedia entry.  In previous releases the partially compiled .cubin files were plain-text readable and could be added into a library as a string resource.  If you open old cubin files in Visual Studio, they look something like this:

architecture {sm_10}
abiversion   {1}
modname      {cubin}
code {
    name = cuFunction_Laser
    lmem = 0
    smem = 44
    reg  = 6
    bar  = 0
    bincode {
        0x10004209 0x0023c780 0x40024c09 0x00200780
        0xa000000d 0x04000780 0x20000411 0x0400c780
        0x3004d1fd 0x642107c8 0x30000003 0x00000500
        ...
blah..blah..blah

    }
}


Rather than ship cubin files alongside libraries, I have always built them into the DLL as a string resource and then used the Windows API functions such as FindResource and LoadResource to get a pointer to the string.  This is then passed to the CUDA cuModuleLoadDataEx function for final compilation into GPU code.


With CUDA3.0 and this new ELF format, cubin files look slightly different since they are now a binary file:


When I compiled some old projects against CUDA3.0, everything went very wrong due to this change.
 
The problem was that my old method copied the cubin resource string into another memory location using strcpy, and also added a final \0 character for good measure at the end of the string.  With the new binary format, the string copy does not work, and a partially mangled buffer ended up being passed to the CUDA JIT compiler, which promptly fell over.


So if anybody else out there is using string resources to include and manipulate cubin files, this may catch you out too.  The fix is easy, simply treat the new cubin files as binary data not strings. 
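
For illustration, here is a binary-safe sketch of the resource route.  The resource name and type ("CUBIN_KERNELS", RT_RCDATA) and the module name are made up for the example, and error checking is omitted.

#include <windows.h>
#include <cuda.h>

HMODULE  hMod   = GetModuleHandle(TEXT("MyLibrary.dll"));
HRSRC    hRes   = FindResource(hMod, TEXT("CUBIN_KERNELS"), RT_RCDATA);
HGLOBAL  hData  = LoadResource(hMod, hRes);
void    *pCubin = LockResource(hData);
// If the data needs copying, use SizeofResource(hMod, hRes) and memcpy -
// never strcpy, and never append a '\0'.

CUmodule cuModule;
cuModuleLoadDataEx(&cuModule, pCubin, 0, NULL, NULL);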


One final point, if you really want to stick with the previous cubin string format, then apparently (I haven't confirmed this) you can direct nvcc to emit string cubin files by changing the nvcc.profile and the CUBINS_ARE_ELF flag. 


Vision Experts

Friday, 26 March 2010

NVPP Performance Benchmarks

In my last post, I cast some doubt on the performance and utility of GPUs for small image processing functions.  Today I had a look at how NVidia's own image processing library - NVPP - stacked up against the latest Intel Performance Primitives (IPPI v6.0) for some basic arithmetic on one of my dev machines.  This development PC has a mid-performance quad-core Intel Q8400@2.66GHz and a mid-performance NVidia GTX260 with 216 cores@1.1GHz.


The results are interesting and pretty much what I expected.  As an example, here are the results for a simple image addition of two images to produce one output image (average 1000 iterations):


512x512 Pixels:
GPU-Transfer and Processing = 0.72 milliseconds
CPU = 0.16 milliseconds


2048x2048 Pixels:
GPU-Transfer and Processing = 6.78 milliseconds
CPU  = 2.81 milliseconds

The CPU wins easily - so what's happening here?  The transfer overheads to and from the GPU over a PCIe x16 bus are by far the dominant factor, taking approx 2ms per image transfer for the 2048x2048 images (two input images, one output image = approx 6ms).  Whilst transfer times can be significantly improved (perhaps halved) if the input and output images were put into page-locked memory, the conclusion would not change; performing individual simple image operations on the GPU does not significantly accelerate image processing.
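
For reference, the GPU side of such a benchmark looks roughly like the sketch below.  It uses today's NPP naming (nppiMalloc_8u_C1, nppiAdd_8u_C1RSfs), which may differ slightly from the NVPP 1.0 names; the host buffers pHost1/pHost2/pHostDst are assumed to be contiguous 8-bit images, and the two uploads plus one download are what dominate the timings above.

#include <npp.h>
#include <cuda_runtime.h>

int      step1, step2, stepDst;
Npp8u   *d_src1 = nppiMalloc_8u_C1(2048, 2048, &step1);
Npp8u   *d_src2 = nppiMalloc_8u_C1(2048, 2048, &step2);
Npp8u   *d_dst  = nppiMalloc_8u_C1(2048, 2048, &stepDst);
NppiSize roi    = { 2048, 2048 };

cudaMemcpy2D(d_src1, step1, pHost1, 2048, 2048, 2048, cudaMemcpyHostToDevice);
cudaMemcpy2D(d_src2, step2, pHost2, 2048, 2048, 2048, cudaMemcpyHostToDevice);

// Saturated 8-bit add, no scaling (scale factor 0).
nppiAdd_8u_C1RSfs(d_src1, step1, d_src2, step2, d_dst, stepDst, roi, 0);

cudaMemcpy2D(pHostDst, 2048, d_dst, stepDst, 2048, 2048, cudaMemcpyDeviceToHost);

nppiFree(d_src1);
nppiFree(d_src2);
nppiFree(d_dst);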


So what happens if we emulate a compute-intensive algorithm on the GPU?  When we perform only one transfer but then replace the single addition with 1000 compounded additions, the total time for the GPU operation becomes:

2048x2048 Pixels:
GPU-1xTransfer and 1000xImAdds = 0.29 milliseconds

So for compute-intensive operations which transfer the data once, then re-use the image data multiple times, the GPU can easily be 10x faster than the CPU.

This means that algorithms such as deconvolution, optic flow, deformable registration, FFTs, iterative segmentation etc are all good candidates for GPU acceleration.  Now, if you look at the NVidia community showcase then these are the sorts of algorithms that you will see making use of the GPU for imaging.  When the new Fermi architecture hits the shelves, with its larger L1 cache and new L2 cache, then the GPU imaging performance should make a real jump. 

It's worth mentioning a minor technical problem with NVPP and Visual Studio 2008 - NVPP 1.0 doesn't link properly in MSVC2008 unless you disable whole program optimisation (option /GL).  It's also worth noting that NVPP is built on the runtime API, which is not suitable for real-time multi-threaded applications.  If you really need some of the NVPP functionality for a real-world application, then we would suggest you get a custom library developed using the driver API.


Vision Experts

Saturday, 13 March 2010

A GPU is not Always Fastest

There has been a huge amount of interest in GPU computing (GPGPU) over the last couple of years.  Unsurprisingly, a number of image processing algorithms have been implemented using this technology.  In most cases, large performance gains are reported.  However, whilst I have been writing image processing algorithms that leverage GPU performance for some time now, I have often found that the GPU is not the best solution.  As a rule of thumb, I aim for a x10 increase in speed to justify the development; if I can't achieve a x4 increase in speed then it's just not worth the effort.

Sometimes, the performance gains are misleading for practical applications.  NVidia themselves are guilty of this in their SDK with their image processing examples.  For instance, in many of their SDK demonstration applications they use the SDK functions to load an 8-bit image and then pre-convert it on the host to a packed floating point format before uploading to the GPU.  They then show large gains in speed, but ignore the huge time penalty of the CPU-side format conversion.  In another example they have to unpack 24-bit RGB data into 128-bit packed quads of floating point data on the host before they can process it.  In the real world this is not practical.  I do wonder how many other people have used some creative accounting in their reported acceleration factors.

So, despite generally being a GPU evangelist for accelerating image processing, I wanted to write a bit about the downsides to provide a balanced view.

Architecture constraints.   You need to be doing a lot of work on the image data to make the architecture work for you.  Many (Most?) practical algorithms just don't fit into a GPU very well.  For example, it may be the case that a GPU can do a brute-force template correlation faster than a quad-core CPU, but brute-force correlation for pattern matching isn't the method of choice these days.  Contemporary vision libraries have extremely sophisticated algorithms that do a far superior job of pattern matching than correlation, plus they are highly optimised for multi-threading on the CPU.  These algorithms simply do not fit into the GPU 'brute force' computational model. 


By way of a painful example, I have been developing a complete JPG conversion library for NVidia GPUs.  This is blazingly fast at RGB-YUV conversion, DCT and quantisation, but falls down on the Huffman coding, which is a sequential algorithm.  Add in the transfer overheads and it gets slower.  At the time of writing, a hand-optimised multi-threaded CPU version is almost as fast.  All is not lost on this development, but it's a tough sell at this point.

Multi-threading.  Whilst a GPU is massively parallel internally, it cannot run multiple algorithms (kernels) in parallel*.  So if your application is used to doing multiple operations in parallel, e.g. processing the images from multiple sensors at once, then it will have to change and serialize the images into GPU work chunks.  So whilst your quad-core CPU could be doing four images at once, the GPU is doing them in serial.  This means the GPU has to process at least four times the rate of a single CPU core in order to break even.

*I believe the new NVidia Fermi architecture can run multiple kernels simultaneously, but most current GPUs can't.

Transfer Overheads.  It takes time to transfer data across the PCIe bus to and from the GPU.  If the algorithm already runs quickly on the CPU (e.g. a few milliseconds) then GPU acceleration is usually a non-starter.
 

Algorithm development time.  It takes longer to write and debug a massively parallel GPU algorithm than it does to parallelize the algorithm on the CPU to make use of a fast quad-core.  Development time is expensive.

Hardware cost.  You do get a lot of horsepower for your money with a GPU, and a good performance card can be purchased for £150.  That still has to be factored into the system cost.   
Hardware obsolescence.  Whilst NVidia have confirmed that CUDA will be available in every new GPU they produce, the exact same GPU card quickly becomes obsolete.  Code should be forward compatible, but I don't think this has really been put to the test yet.




Of course, there are still lots of good things about this new technology and it really can accelerate the big number crunching algorithms like optic flow and deconvolution and FFTs.  But you have to choose carefully.


Vision Experts

Wednesday, 24 February 2010

High Throughput for High Resolution

We've been using the ProSilica/AVT GE4900 recently to get super high resolution 16megapixel images at about 3Hz.  It's a nice camera, but that resolution tends to demand high performance from the processor.

We have about 45MB/sec of raw image data to process.  In order to chew through all this data we've been pushing the raw Bayer mosaic images onto an NVidia GTX260 GPU and performing colour conversion, gamma correction and even the sensor's flat field correction on the GPU at high speed.  We also use the GPU to produce reduced size greyscale images for processing and analysis alongside the regular colour converted image for display.  The ability to process such high resolution images using the GPU has really made the difference for this application, and it would not be possible without this capability.
Vision Experts

Wednesday, 10 February 2010

Interface Acceleration

Machine vision sensors are getting big, and cameras are increasingly available with a number of pixels that is truly enormous by historical standards.  Cameras in the 10+ Megapixel range seem to be increasing in popularity for industrial inspection, possibly driven by the consumer market in which such large sensors are now the norm, partly due to price decreases, and possibly because processing and storing the data is just about feasible these days.

The bandwidth between cameras and computers is also increasing, which it needs to.  Already, it seems that a single GigE connection just isn't enough bandwidth for tomorrow's applications.  For example, AVT have a dual GigE output on a camera to give 2Gbits/sec of bandwidth.  The CoaXPress digital interface is capable of 6.25 Gbits/sec over 50m of pretty much bog-standard coax cable, a capability I find incredible.  Likewise, the HSLINK standard, proposed by DALSA, uses InfiniBand to achieve 2100Mbytes/sec.  Most of these standards even permit using multiple connections to double or quadruple the bandwidth.  With all this data flying around, trying to process it on a PC is going to be like taking a drink from a hose pipe.  Or two, or four.

Think about it: at 2Gbits/sec, the computational demand will be 250Mpix/sec (assuming 8-bit pixels).  Using a 3GHz processor core, that's 12 clock cycles available per pixel.  You can't do a whole lot of processing with that.  Even if you scale up to a quad-core and make sure you use as many SSE SIMD instructions as you can, you still aren't going to be doing anything sophisticated with that data.  It could be like machine vision development 15 years ago, when I remember the only realistic goal was to count the number of pixels above threshold to take a measurement!


I feel that the new generation of ultra-high resolution cameras streaming data at ultra-high bandwidths is going to require a new generation of processing solutions.  I suspect this will be in the form of massively parallel processors - such as GPUs and perhaps Intel's Larrabee processor (when it finally materialises).


In the mean time, I'm plugging away writing GPU accelerated algorithms just for format conversion so that we can even display and store this stuff.


Vision Experts

Friday, 5 February 2010

GPU Supercluster

I was interested to see this GPU system doing some biologically inspired processing at Harvard.  Whilst I doubt that there will be any practical industrial applications to emerge from this, it does show how inexpensive it can be to build a minor supercomputer. To quote from their website...


...With peak performance around 4 TFLOPS (4 trillion floating point operations per second), this little 18”x18”x18” cube is perhaps one of the world’s most compact and inexpensive supercomputers....




Vision Experts

Tuesday, 26 January 2010

C# Interop

I write image processing algorithms for a (so-called) living, which explains why my posts are so badly written and incomprehensible.  I write quite a few libraries in C/C++ for various industrial tasks and supply them as a trusty old windows dll.   Providing an interface for calling my libraries from a C# application front-end is something I have to do quite a bit of.  C++ delivers the performance needed for image processing and C# gives the quick and easy GUI. 



In order to call a regular 'C' dll from C#, you need to use something in .NET called P/Invoke.  This mechanism defines a function that is callable from C# and maps to a dll function call.  In the definition of the dll function, you can specify things like character sets for string passing, calling conventions etc.  As an example, if you wanted to import the windows kernel32 function Beep into C# using P/Invoke, it looks something like:



[DllImport("kernel32.dll")]
public static extern bool Beep(int frequency, int duration);



In the Beep example, the integer values passed from managed code to the unmanaged dll are so-called blittable types and will be passed directly.  Passing arrays is not quite as simple, since they have to be converted (marshalled) by the framework before they are passed to the unmanaged dll.  Normally, you don't use unsafe code in your C# GUI, so you don't have pointers to data lying around handy.  Of course, most C libraries for image processing expect a pointer to some image data to be passed in somewhere, not a managed array object.  So somebody has to do some work to turn a managed array into a pointer, without totally screwing up the safe part of C# and all the other stuff going on, like the garbage collector.  This is the job of P/Invoke.


For example, if we have a C# array declared as:

 float[] CalTgtX = { 58,198,340, 58,198,340, 58,198,340};


which we want to pass into a C++ function that looks like this:

extern "C" __declspec(dllexport) void __stdcall CalibrateProjection(float *pTargetX)

Then we need to carefully define how C# should carry out this conversion.  Here's how to define the function in C# so that we can pass (marshal) that float array object from C# to the C dll function:


[DllImport("MyLib.dll", EntryPoint = "CalibrateProjection", CallingConvention = CallingConvention.StdCall)]
public static extern RETCODE CalibrateProjection([MarshalAs(UnmanagedType.LPArray, SizeConst = 9)] float[] pTargetX);

The first line tells the C# compiler to import a function.  DllImport tells C# we are importing a function from a DLL.  EntryPoint tells C# what the function stub is named.  CallingConvention should match that used by the DLL - here it was __stdcall.
The second line defines the function as it will appear to C#.  The key to converting the float array object to a pointer is in the MarshalAs attribute.  This involves a copy to an unsafe array on the heap, so it can be slow for large arrays... very slow.


All the different flavours of managed types and structs can be marshalled this way.  More information on PInvoke can be found at http://msdn.microsoft.com/en-us/library/aa288468%28VS.71%29.aspx

 

Vision Experts

Friday, 22 January 2010

LoaderLock MDA

This post isn't really about accelerated image processing, but the topic is related to deployment of DLLs of any type.  I hope this helps somebody save some time if they encounter this issue.  Whilst developing a C# demo app for one of my CUDA libraries, I encountered a strange error:

LoaderLock was detected
Message: DLL 'Cephalon.dll' is attempting managed execution inside OS Loader lock. Do not attempt to run managed code inside a DllMain or image initialization function since doing so can cause the application to hang.

It took me a while to figure out what was going on, and it was related to how I build CUDA libraries.  

When I make a CUDA enabled library, I wrap up the compiled kernel cubin files as a resource compiled into the DLL itself.  An alternative simpler method is to supply a cubin text file along with each dll and load it directly using the CUDA function:

cuModuleLoad(&cuModule, pszModulePath) 
 

but having two files can lead to version control and maintenance issues.  Plus, I try and make a living doing this, and don't really want people reading my precious kernel code that took four months to write too easily.



So I wrap up the compiled code string neatly inside the DLL as a resource, then get that resource string and compile it on-the-fly (or just-in-time) using the alternative CUDA function:


cuModuleLoadDataEx(&cuModule, pCubinStr, 3, &jitOptions[0], &jitOptVals[0]);



This is great, but in order to get the string resource from inside the DLL I need to call a variant of LoadResource.  And I need to call FindResource to find that resource first.  And I need to call GetModuleHandle("LibraryName.dll") before any of those.  The problem is that GetModuleHandle must not be called, even indirectly, from DllMain while the DLL is first being loaded by LoadLibrary and mapped into the process address space.

The C# application was loading the DLL when it first encountered one of its functions; this then tried to initialise the CUDA module and load the resource automatically from the dll entry point.  Ultimately, the call to GetModuleHandle raised an alarm back in the managed code.  Not easy to spot.
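
One way to avoid this (a sketch only, with illustrative names such as InitialiseCudaModule) is to keep DllMain empty and perform the resource loading and CUDA module compilation lazily, on the first real call into the library:

#include <windows.h>

void InitialiseCudaModule();      // illustrative: does the FindResource / cuModuleLoadDataEx work

static bool g_bInitialised = false;

BOOL APIENTRY DllMain(HMODULE hModule, DWORD dwReason, LPVOID lpReserved)
{
    // Deliberately empty - no GetModuleHandle, FindResource or CUDA calls here.
    return TRUE;
}

static void EnsureInitialised()
{
    if (!g_bInitialised)
    {
        InitialiseCudaModule();   // loads the cubin resource, JIT compiles, gets handles
        g_bInitialised = true;
    }
}

extern "C" __declspec(dllexport) int __stdcall ProcessImage(unsigned char *pImage)
{
    EnsureInitialised();          // safe: we are no longer inside the OS loader lock
    // ... run the CUDA kernels ...
    return 0;
}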


More on the LoaderLock MDA can be found here





Vision Experts