Friday, 26 April 2013

Multicore MJPG

Over the years, I've spent quite a lot of time developing and tweaking my own highly optimised JPG compressor.  It's fast.  I've made heavy use of the Intel Performance Primitives (IPP) to leverage SSE instructions and get impressive performance boost over other third-party libraries I know of.

However, my implementation was limited to a single thread per image, which limited performance when encoding video.  Yes, I know there are ways to parallise the JPG algorithm on a single frame using restart intervals, but that was going to be complex to implement in my existing code.  

For video encoding, I could see a simpler way forward; encode sequential frames concurrently on separate threads.

My new VXJPG library is able to compress each image on its own thread.  So sequential frames are encoded concurrently.  This means that when processing sequences of images, the overall performance is vastly improved.  In fact, on a quad core processor the library pretty much achieves a 4x speedup over the single core version.

Multi-Threaded Compression

The problem with running multiple threads, each encoding a separate frame on a seperate core, is ensuring that the frames remain in sequence when added to the output video file.  Without proper synchronisation, it is entirely possible for the frames to end up out-of-order.

Each thread in a thread pool is launched when the library is initialised, then enters a loop waiting for new data to process.  When a new frame arrives, the next thread in sequence is selected from the pool. The scheduler waits until a 'Ready' event is fired by the thread, to signal that the thread is waiting for new data.  After Ready is signaled, the thread is passed the data and a 'DoProcess' event fired by the scheduler, which allows the thread to proceed onto compressing the frame.  

Figure 1. A single encoding thread waits until a Ready event before processing.  It must then wait for the previous thread to complete before adding its compressed frame to the file.  It can then signal the next thread before returning to wait for more data.
At this point, the scheduler is done and ready to receive a new frame, whilst the thread performs its time-consuming compression task.
Inside the thread, after compression completes, each thread must wait on a 'Frame Completed' event from the previous thread in the pool before it is permitted to write to the file.  This ensures that even when all threads are processing concurrently, the frame sequence remains correct.  

Finally, when the frame is written, the thread can signal 'Ready' for when the scheduler next visits.  When the data stops arriving, the threads naturally finish writing frames in sequence and simply all enter the wait state.  

Termination is achieved by signalling a terminate flag in each thread prior to firing the ready event.

Benchmark Compression

I tested the compressor using both single and multi-threaded modes using high def 1920x1080 colour bitmaps, two input frames, repeated 250 times.  

That's 1920x1080 RGB x 500 frames = 3.1GB input data to chew through.  I'm using an Intel Core2 Quad Q8400@2.66GHz to do this, with 4GB DDR3 1066MHz RAM on Win7 32bit.
Single Threaded:  
Total time 11.66 seconds
Avg time per frame = 23.2ms
Avg data consumption = 268 MB/sec

Multi Threaded:  
Total time 3.23 seconds
Avg time per frame = 6.5ms
Avg data consumption = 963 MB/sec

I am aware that the input images will be cached, so this is probably a higher performance than I would get with a camera, but its still impressive.  Recording multiple live HD color streams to MJPG is no problem at all even on a modest embedded PC.  MJPG does fill up drives pretty quickly, but is far better than RAW, which requires RAID and SSDs to achieve. 

At present, I'm using vxMJPG for my (sorry, shameless plug) Gecko GigE recorder with Stemmer Imaging cameras for high speed video recording projects.  The output (*.mjpg) files are my own thin custom container format but will play and can be transcoded in VLC.  Unfortunately, they don't play in Windows media player.  That's the price of speed.  To help portability, I also have my own MJPG player for the files which has some features like slow-mo, frame step, save frame etc to help.

My next project is to wrap this up in a DirectShow encoder, so that I can also make AVI files. However, that might have to be single threaded.  I cannot currently see a way of making the encoder in a DS filter graph safely asynchronous in the manner I am employing.  Fortunately, even single threaded, my version should be vastly faster then the codecs I am currently using - and I shall report those results when I have them.
Vision Experts

Friday, 11 January 2013

Fast Pattern Features

Inventing really fast pattern matching algorithms is good fun. 

I've recently been looking at methods to accelerate one of my pattern matching algorithms and thought I'd share some of my techniques with you.  It takes alot of precomputation, but then alignment is really fast.  The algorithm is intended for registration of a high resolution template to a similar target image, for applications when some alignment is required but not a full image search.

Feature Patches

Performing image correlation is slow, but it can be accelerated by not using every pixel.  The algorithm I'm using splits a high resolution trained template image into multiple small feature patches. Only the best patches that have the highest structure tensor are then used for alignment (Figure 1).  The reduction of the image to its key features reduces the amount of correlation required significantly.

Patches that are close together are pruned, leaving the more widely spaced patches. We don't want all our patches clustered together since that would make determination of global rotation or scaling more sensitive to local distortions.  This patch feature extraction dramatically cuts down the number of pixels actually required to find a good match using any correlation based approach.  Using small patches (in my case between 16x16 and 32x32 pixels) has the advantage that each patch can usually be matched using a straightforward translation-only search.  After all (or most) of the patches are matched, a full affine deformation can still be computed from the individual match locations.

Figure1. High structure patch regions can be located using the structure tensor.  These can be used to reduce the number of pixels required to make a match.

Fast Feature Vectors

Even searching for a small number (e.g. 16) small patches over a reasonable search region takes too much time for most applications.   Fortunately, a significant accleration can be achieved by rapidly rejecting very poor quality matches using a cadidate search routine.  Most of the possible match positions will be very different from the correct match position.  We can rapidly discriminate between a candidate match and a non-match by comparing a few carefully selected pixels in each patch (Figure.2)

Figure 2.  A high structure patch can be further reduced into a feature pixel representation.  Key pixels can be used to rapdily reject non-matches during search.  The problem is determining which pixels are optimal to use.

The interesting part of the algorithm must decide which combination (and how many) pixels inside each patch are required to provide good discrimination between a possible match and a probable non-match.  I decide that a measure of 'goodness' for a set of pixels is found by auto-correlating the set with the patch.  Highly discriminative sets will have the fewest match locations, with a really good set having only one match location.

Clearly, there are an awful lot of potential combinations of pixels that could be used.  A  brute force test of all possible combinations is not really feasible.  We must cut down the search space.  We can guess that discriminating pixels lie on or near an edge that divides areas of different intensity.  The intensity of a pixel direclty on a strong edge is not very robust to sub-pixel translations, so those should probably not be used (unless we pre-blur).  We can guess that descriptive pixels do not lie on plain featureless regions either, since these pixels are not uniquely matchable.  In all, we can assume that the most discriminative pixels do not lie on the high gradient pixels (edges), and not on the lowest gradient pixels (too plain).  So if we restrict possible discriminative to those ones on a local contrast maximum we can reduce the optimisation considerably.

There are still a large number of possible combinations, but now few enough to enumerate and test in a feasible amount of time.  I limit this to the top 32 such pixels.

Every possible combination of four of these candidate pixels is tested as a potential discriminating arrangement.  This is a total of 32^4, or about a million possible combinations to be tested.  Every combination of four pixels is auto-correlated with the entire patch search region and the number of matches counted.  The combination that yields the fewest false matches is chosen as the most discriminating set.  This pixel set can be used at run time to rapidly reject false match locations with only the locations that pass this quick test going forward to full patch correlation.

This process cuts a multi-megapixel image correlation down to a very limited set key patches, with a very limited number of pixels in each patch required to find a match location.  I have used this algorithm successfully for very rapid alignment of printed material without recourse to pyramid formation or more complex feature matching. 

Vision Experts

Wednesday, 8 August 2012

Multicore or SIMD?

I've recently been optimizing one of my image processing libraries and wanted to share my results with you.  Two acceleration methods which are relatively straightforward for me to implement and therefore have a high return-on-investment are:
  1. Using multi-core parallelism via OpenMP
  2. Using SIMD instructions via the Intel IPP
This post shows my results.  Which helped more?


I always design my image processing libraries so that all the high level complexity makes use of a separate low level toolbox of processing functions.  In this particular algorithm, the image was divided into blocks.  Each independent block was encapsulated by a simple block class.  The block class contained implementations of all the basic arithmetic processing from which all the high level functions were built from.  From the outset, I had in mind that every block could eventually be processed in parallel and the individual arithmetic functions could be accelerated using SIMD instructions.

I started with vanilla c++, single threaded implementations of the arithmetic and functions.  When everything was working and debugged, I could add parallelism using OMP and SIMD using IntelIPP without a huge effort.


I love OMP.  Its so simple to use and so powerful it allows me to leverage multi-core processors with almost no effort.  What's more, if you have VisualStudio2010 then you have OpenMP already.  You just need to switch it on in the project properties under the C/C++ language tab, as below:

Adding OpenMP support to a C++ project in VS2010

 Using OpenMP was easyfor my project, since I already designed the algorithm to process the image in a series independent blocks.  This is my original block loop:

      for (int c=0; c<BLOCKS; c++)
         val = CB[c]->ComputeMean();
Using OpenMP is as easy as switching it on (Figure1) and then using the #pragma omp parallel compiler instruction before the for loop:
      #pragma omp parallel for
      for (int c=0; c<BLOCKS; c++)
         val = CB[c]->ComputeMean();
No code changes required - it really couldn't be easier.  It just requires some thought at the outset of how to partition the algorithm so that the parallism can be leveraged.  It is indeed faster.

Note: In VS2008 OpenMP is not available in the standard edition, but if you have VS2010, you can still find and use vcomp.lib and omp.h with VS2008.  I guess you could use the libraries with any version of visual studio, even the free express versions, although I'm not sure what the licensing/distribution restrictions are when doing that.

Intel IPP 

I own a developer license for the Intel Integrated Performance Primitives.  Since the processing already used my own vector processing functions, swapping them for equivalent IPP versions was straightforward.  Take for example this very simple loop to compute the average of a vector:

   int n;
   float Sum=0.0f;
   for (n=0;n<m_nE;n++)
      Sum += *(m_pFloatData+n);
   m_fMean = (Sum / (float)m_nE);

This has a direct equivalent in the IPP:
   ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);  

This single function performs the same vector average, puts the result in the same output variable, but uses the SSE instructions to pack the floats (yes, its float image data here, but thats another story) so that multiple values are processed in parallel.  It is indeed faster. 


So what happened and which method provided the most bang-for-buck? Perhaps unsurprisingly, it depends whats going on. 

Simple Functions
Block Average Normal (x1)
Block Average OMP (x1.15)
Block Average IPP (x2.42)
Block Average Combined (x2.67)

Block Scalar Mult Normal (1)
Block Scalar Mult OMP (x2.93)
Block Scalar Mult IPP (x5.45)
Block Scalar Mult Combined (x6.21)

More Complex Functions
Block Alpha Blend Normal (1)
Block Alpha Blend OMP (x2.06)
Block Alpha Blend IPP (x1.49)
Block Alpha Blend Combined (x1.48)

All functions performed on a Quad Core i3, Win7 x64


The granularity of the parallelism matters. 
  1. IPP accelerates the simple vector processing functions more than OMP
  2. OpenMP accelerates more complicated high level functions more than IPP
  3. Combining IPP and OpenMP only makes a small difference for these functions
Since IPP already uses OpenMP internally, perhaps it is not surprising that an additional higher level of OpenMP paralellism does not yield a large speed increase.  However, for higher level functions that combine tens of low-level functions I have found OMP to add considerable value.  I'm sure the reasons are a complex combination of memory bandwidth, cache and processor loading, but the general rule of using OMP for high level parallelism and IPP for low level parallelism seems sensible.

Vision Experts

Friday, 23 July 2010

Blazingly Fast Image Warping

Want to achieve over 1 Gigapixel/sec warping throughput?  Then leverage your GPU texture units using CUDA.

Image warping is a very useful and important image processing function that we use all the time.  It is often used, when calibrated, to remove distortions such as perspective projection and lens distortion.  Many pattern matching libraries make use of affine image warps to compute image alignment.  Almost all imaging libraries have a warping tool in their toolbox.  In this post I will say a little about how we make use of the texture hardware in a GPU using CUDA, plus we show some benchmarks for polar unwrapping - and it is fast.

If there is one thing that the GPU is excellent at, it is image warping.  We can thank the gamers for their insatiable appetite for speed in warping image data or 'textures' onto polygons. Fortunately, even when using CUDA to program a GPU as a general purpose co-pro, the fast texture hardware is still available to us for accelerated warping.

  • There are several good reasons to use the texture hardware from CUDA when image processing:The ordering of Texture fetches are generally less proscriptive than the strict requirements for coalescing global memory reads.  When the order of your data reads does not fit in with a coalesced memory access pattern, consider texture fetches.
  • Texture fetches are cached.  For CUDA array memory, the texture cache has a high level of 2-D locality.  Texture fetches from linear memory are also cached.
  • Texture fetches perform bilinear interpolation in hardware.
  • Texture fetches can clamp or wrap at image boundaries, so you don't have to have careful bounds checking yourself.

Linear Memory vs Array Memory
When using a GPU to write a graphics application using an API like OpenGL or DirectX, the texture images were transferred and stored on the GPU in a way that optimized the cache for 2-D spatial locality.  With CUDA, a type of memory called a CUDA Array is available to serve this purpose, and CUDA Array memory stores 2-D image data in a bespoke way to enhance 2-D throughput.  CUDA array memory is managed separately from CUDA linear device memory and has its own memory allocation and copy functions.   

Dedicated CUDA Array memory meant that in the early days of CUDA (going waaay back maybe three whole years), the developer had to manage copying between host, linear device memory and CUDA array memory.  When using the texture hardware, the data had to be in the right place at the right time, forcing many additional copies to array memory.  

Fortunately, from CUDA 2.0 onwards, it became possible to use texture fetch hardware with normal linear device memory.  As far as I can tell, this innovation obviated the need for Array memory entirely.  If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know.

Textures - Kernel Code

Very little code is required in a CUDA kernel in order to use the texture hardware.  A texture unit for accessing the pixels of a regular 8-bit, 2-dimensional image is created in the kernel code (the .cu file) using the code:

texture<unsigned char, 2, cudaReadModeElementType> tex;

The data can then be fetched using the 2d texture fetch hardware using 'tex2d' as below:

unsigned char pix = tex2D( tex ,fPosX ,fPosY );

The really neat thing here is that the position to sample the input image is specified by floating point coordinates (fPosX and fPosY).  The texture reference can be set to perform either nearest-neighbor or bi-linear interpolation in hardware without any additional overhead.  It's not often you get something as useful as floating point bi-linear interpolation for free - thank NVidia.

It is also possible for the texture fetch hardware to return normalized floating point values, which is beneficial in many circumstances.  For example, in many cases the GPU is faster with floating point arithmetic operations than it is with integer operations.  Integer division is rarely a good idea.  For this reason I usually declare a float texture object using the following:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

then access the pixels as floating point values:

float pix = tex2D( tex ,fPosX, fPosY );

Of course, I have to convert the float pixels back to bytes when I have finished playing around, but that's no big overhead and the hardware provides a fast saturation function to limit the float to the unit range for us:

*pPixOut = 255 * __saturatef(pix);

Textures - Initialization Host Code (Driver API)
A few lines of additional code are required in your host code during initialization in order to setup the kernel texture object.  I tend to do this once during a setup phase of the application, typically just after loading the cubin file and getting function handles.

Firstly, you will need to get a handle to your kernels texture object for the host to use.  This is similar to getting a handle to a device constant variable since the reference is retrieved from the kernel cubin by name. In our example above we declared a texture object in the kernel named 'tex'.  The host code when using the driver API is therefore:

CUtexref m_cuTexref;
cuModuleGetTexRef(&m_cuTexref, m_cuModule, "tex")

Where m_cuModule is the kernel module handle previously loaded/compiled using cuModuleLoadDataEx.  Now we need to set up how the texture unit will access the data.  Firstly, I tell the texture fetch to clamp to the boundary in both dimensions:

    cuTexRefSetAddressMode(m_cuTexref, 0, CU_TR_ADDRESS_MODE_CLAMP);
    cuTexRefSetAddressMode(m_cuTexref, 1, CU_TR_ADDRESS_MODE_CLAMP);

Then we can tell the hardware to fetch image data using nearest neighbour interpolation (point):

    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_POINT);

Or bilinear interpolation mode:

    cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_LINEAR);

Finally, we tell the texture reference about the linear memory we are going to use as a texture.  Assuming that there is some device memory (CUdeviceptr m_dPtr) allocated during initialization that will contain the image data of dimensions Width and Height with a byte pitch of m_dPitch.
    // Bind texture reference to linear memory
    cad.Format = CU_AD_FORMAT_UNSIGNED_INT8;    // Input linear memory is 8-bit
    cad.NumChannels = 1;                        // Input is greyscal
    cad.Width = Width;                    // Input Width
    cad.Height = Height;                   // Input Height

    cuTexRefSetAddress2D(m_cuTexref, &cad, m_dPtr , m_dPitch);
The actual image data can be copied into the device memory at a later time, or repeatedly every time a new image is available for video processing.  The texture reference 'tex' in the kernel has now been connected to the linear device memory.

Textures - Kernel Call Host Code (Driver API)
There is very little left to do by the time it comes to call a kernel.  We have to activate a hardware texture unit and tell it which texture it will be using.  On the host side, the texture reference was called m_cuTexref, we have already connected this reference to the texture object named 'tex' in the kernel during setup (using cuModuleGetTexRef)One additional line is required to tell the kernel function which texture is active in the default texture unit.
cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);

So, the kernel will now be able to use the hardware texture fetch functions (tex2d) to fetch data from the texture object named 'tex'.  It is interesting that the texture unit MUST be CU_PARAM_TR_DEFAULT.  A CUDA enabled GPU will almost certainly have multiple texture units, so in theory it should be possible to read from multiple texture units simultaneously in a kernel to achieve image blending/fusion effects.  Unfortunately, this is not made available to us in CUDA at the time of writing (with CUDA 3.1). 

To launch the kernel, proceed as normal.  For example:

cuFuncSetBlockShape( cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1 );
cuLaunchGridAsync( cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream ))

Note that I use async calls and multiple streams in order to overlap computation and PCI transfers, thus hiding some of the transfer overhead (a subject for another post).  This can all be hidden from the user by maintaining a rolling buffer internally in the library, making the warp algorithm appear to run faster.
In order to test the performance I have developed a general purpose warping library that uses our GPU framework to hide all of the CUDA code, JIT compilation, transfers, contexts, streams and threads behind a few simple function calls.  A commonly used useful warp function for polar unwrap has been implemented using the texture fetching method described above and the results look very good.

The input images we chose were from Opto-Engineering who have a range of lenses that produce polar images of the sides of product.  It is possible to capture high resolution images of the sides of containers as a polar image (below) but in order to accelerate any analysis, a fast polar unwrap is needed.

The output images look good when using the hardware bi-linear interpolation (below):

As expected, when nearest-neighbour interpolation is used, the image quality is degraded with aliasing problems (below).  Whilst this would be faster on a CPU, the GPU is able to perform the bilinear interpolation mode at the same speed.

The performance depends on the size of the output image, but typically achieves well over 1GB/sec in transform bandwidth, including all the transfer overheads (Core2Quad Q8400@2.66GHz & GTX260 216cores).  For these input images (1024x768), the average total transform time to produce the output (1280x384) was under 400 microseconds.  That works out at over 1.2 Gigapixels/sec
A quick comparison to a third party software polar unwrap tool showed that this was at least an order of magnitude faster.

The algorithm to perform the polar coordinate conversion is computed on-the-fly.  Any number of complex transform functions can be implemented in the library very quickly to achieve this performance.  So far, affine, perspective and polar transforms are done.  Any requests?

vxGWarp Interfaces
Just FYI - the interface to these polar warp functions are pretty trivial, all the GPU expertise is hidden from the end user in the DLL.  The key functions in the header file are:

vxGWarpCreate(VXGWARPHANDLE *hGW, int W, int H);
vxGWarpAccessXferBufIn(VXGWARPHANDLE hGW, unsigned char **pInput, int *nW, int *nP, int *nH);
vxGWarpAccessXferBufOut(VXGWARPHANDLE hGW, unsigned char **pOutput, int *nW, int *nP, int *nH);

Vision Experts

Thursday, 8 July 2010

Debunking the x100 GPU Myth - Intel Fights Back

Intel recently published this paper titled 'Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU' that makes an attempt to compare a number of GPU kernels with algorithms that are highly optimised for Intel architectures.  The authors concluded that for the right problems, the GPU was up to 14x faster than an equivalent optimised CPU implementation. On average a x2.5 increase in speed was seen.  

I am all in favour of using GPU's to accelerate image processing when it is appropriate but the hype has gotten out of control over the last year, so I am very pleased to see Intel try and put their case forward and bring some balance to the arguments.

What I liked about the paper was that for once, significant effort was expended to optimise BOTH the CPU and the GPU implementations.  Too many biased comparisons are made between highly optimised GPU implementations and the naive, plain vanilla single threaded 'C' versions.  When a x100 increase in speed is cited, I always suspect that the author was being either highly selective in what parts of the overall system were being timed, or that the algorithm was unrealistically well mapped to GPU hardware and not representative of a real problem, or even that the CPU implementation was simply not optimised at all.  The NVidia showcase website has made publishing an impressive acceleration factor in the authors best interest.

I certainly have not come across any imaging systems that have achieved anything like x100 accelerations in throughput by employing GPU technology.  There may be some algorithms that map superbly well to GPUs and can achieve x100 performance increase in a single algorithm stage, but these numbers published by Intel are much more in line with the total throughput increase I have seen when using GPU's to do image processing in real-world applications, when compared to the optimised CPU algorithms that are readily available.  

An example of disengenuous performance metrics would be the image processing blur demo in the NVidia SDK - here the image is loaded from file, pre-processed and converted into a 512x512 floating point greyscale image, transferred to the GPU once, and THEN processed repeatedly at high speed to show how fast the GPU is.  The CPU conversion to floating point format is omitted from the GPU compute time.

I would also agree with Intel that most often, in practice, optimisation of an algorithm to use multiple cores, maximize cache usage and SSE instructions is easier, faster and ultimately more portable than developing a CUDA replacement algorithm.  I would also agree with the GPU evangelists that the hardware cost of an upgrade to a top-end Intel based PC system, vs the investement in a GTX280 is significantly higher.  With the tools improving all the time, it is becoming easier to code and deploy GPU enhanced algorithms.  

The conclusion is, for the time being, we must take a balanced view of the technology available and choose the right processing method to suit the application.  And be realistic.

Vision Experts

Saturday, 3 July 2010

CUDA Parameter Alignment

When executing a CUDA kernel, it is almost always necessary to pass some parameters into the kernel function.  For image processing, the parameters are usually at least a pointer to the image data to be processed, plus the width, height, pitch etc. that describe the image.  The GPU kernel can then access the input parameters when it runs.  For this to happen, the parameters passed into the Kernel function call have to be copied from the host memory to the device code running on the GPU.  The mechanism for passing parameters to Kernels at execution is different to the majority of the host-to-device data copies, which use explicit function calls such as cuMemcpy().  Kernel function parameters, similarly to regular function calls, are passed using a parameter stack.

When using the CUDA Runtime API then parameter passing is taken care of transparently, and no additional work is required on the part of the programmer.  The Runtime API hides the details of copying host parameters from host memory into a parameter stack in the GPU device memory which the kernel can then go on to access as input parameters.  The Driver API is somewhat lower level.

The CUDA Driver API does not hide as much of the detail and the programmer must manage the process correctly, pushing variables onto a parameter stack in the correct order and with the correct alignment and size.  In my experience, and judging from the number of questions out there on newsgroups, parameter passing can be a source of trouble.

In the Driver API, function parameters are all passed to the kernel parameter space using the functions: 
  • cuParamSeti(CUfunction hFunc, int offset, unsigned int value) - Pass an integer
  • cuParamSetf(CUfunction hFunc, int offset, float value)  - Pass a float
  • cuParamSetv(CUfunction hFunc, int offset, void*, unsigned int numbytes) - Pass data
These functions place data residing in the calling host memory onto the kernel parameter stack at the position specified by offset.  It is crucial to make sure that offset is correct and must take into account the total size of all the previous items placed on the stack, taking their alignment into account. 

A few of the common causes of problems are:
  • Differences between the host alignment and device alignment of some data types.  Sometimes, additional alignment bytes must be added to offset to give the correct alignment.
  • Differences between the host size and device size of some data types, leading to incorrect value for numbytes or incorrect offset accumulation.
  • 32-bit and 64-bit memory addressing when passing device pointers to cuParamSetv
Standard Data Types
CUDA uses the same size and alignment for all standard types, so using sizeof() and __alignof() in host code will yield the correct numbers to put parameters on the kernel stack.  The exception is that the host compiler can choose to align double, long long and 64 bit long (on 64-bit OS) on WORD (2byte) boundary, but the kernel will always expect these to be aligned on a DWORD (4Byte) boundary on the stack.  

A common mistake is to push a small data type onto the stack, followed by a larger data type with larger alignment requirements, but forgetting to increment offset to meet the alignment of the larger type.  For example, in the code below a 2-byte short is pushed onto the stack followed by a four-byte int. 

WRONG: Byte alignment of int is 4-bytes but offset is only accumulated by the size of short.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16 , 2)
offset+= 2;  /// wrong
cuParamSetv(hMyFunc, offset, &myint32, 4)

RIGHT: Byte alignment of int is 4-bytes so offset has to be aligned correctly
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16 , 2)
cuParamSetv(hMyFunc, offset, &myint32, 4)

In order to ensure you have the right number for offset, NVidia provide a macro called ALIGN_UP that should be called to adjust the offset, prior to calling the next cuSetParamx function.   

Built-In Vector Types
CUDA provides some built-in vector types, listed in Table B.1 in section B.3.1 of the CUDA programming guide 3.1.  This means that the kernel can interpret some of the parameters on its input parameters stack as one of these vector types.  The host code does not have equivalent vector types, so again, care must be taken to use the right offset and alignment.  Most alignments are obvious, but there are exceptions, for example float2 and int2 have 8byte alignment, float3 and int3 have 4byte alignment.

Device Pointers

This starts to get a bit more complicated.  There used to be only two possibilities, the GPU always used always 32-bit pointers but the calling OS was either a 32-bit OS or a 64-bit OS.  With the arrival of Fermi, support for 64-bit addressing is possible, meaning we have three valid possibilities.

32-bit OS
This covers probably the most common scenario.  For all devices except Fermi, a cuDevicePtr can be safely cast into a 32bit void* without issue.  On 32-bit operating systems, the address of operator & will result in a 32-bit pointer, so CUDA allocated device pointers can be passed as (void*) parameters.  For example

cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

64-bit OS, 32-bit GPU
For 64-bit operating systems, there is a difference in size between a 32-bit cuDevicePtr and a 64-bit (void*).


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

The line above will not work since sizeof(cuDevicePtr)=4 but the address of MyDevicePtr will be a 64bit (8byte) pointer.  Using the code above will cause bad things to happen. The correct code is:

cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(void*));

or - even better (more portable)
void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr , sizeof(ptr ));

Care must be taken to make sure offset is always a multiple of 8 bytes before calling this function, since these 64-bit pointers have 8-byte alignment requirements.

64-bit OS, 64-bit Fermi GPU addressing
When using nvcc to compile 64-bit code for Fermi, both host and GPU code will use 64-bit addressing. The pointer size for both host and GPU will now be the same, so the call used above will still work:

void *ptr = (void*)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr , sizeof(ptr ));

Care must still be taken since these 64-bit pointers have 8-byte alignment requirements. 

So the key points to remember are:
  1. Check that the size is right.  Be aware of (void*) size differences.  Be aware of double, long long, and long (64-bit) differences in size.
  2. Increment the stack offset by the right amount.  Then:
  3. Check that the stack offset is aligned ready for the requirements of the parameter to be added next.  
  4. Repeat from 1.


Vision Experts

Friday, 4 June 2010

GPU 5x5 Bayer Conversion

The standard Bayer mosaic conversion algorithms used in machine vision typically employ fast bi-linear interpolation method in order to reconstruct a full 24-bit per pixel colour image.  Numerous other, more sophisticated algorithms exist in the public domain, but very few are implemented in a sensible machine vision library.  Since most industrial vision tasks are not aimed at recovering very high-fidelity images for human consumption, the Bayer conversion quality does not seem to have been a priority.  Its strange really, given that it is easy to spend $10k on a color machine vision camera, capture device and lens, only to put the captured images through the basic Bayer de-mosaic algorithm at the last moment.

In order to try and improve the situation, we've implemented various Bayer algorithms, including our own adaptive version of the 5x5 Malvar-He-Cutler interpolation algorithm.  Our implementation of the Malvar algorithm (we call Ultra Mode) is noticeably sharper and has less color fringing than the standard method.

The 2-frame gif below shows the difference on a long-range image taken with a well-known machine vision camera.  OK - granted its not a drastic difference and the gif encoding doesn't help, but sometimes this fidelity change can be important.  Given that our implementation runs on any CUDA enabled GPU faster than a basic CPU bi-linear algorithm, there isn't really a down-side to using the better method. 

Vision Experts