Sunday 16 May 2010

GPU Accelerated Laser Profiling

Laser profiling extracts a dense set of 3D coordinates from a target object by measuring the deviation of a straight laser line as it is swept across the target.  Many of these systems use custom hardware (e.g. the Sick IVC3D) with an FPGA to reach high line-profile rates, often several thousand profiles per second.

It is also possible to assemble a laser profiling system from any high-speed camera and a laser line.  Partial-scan cameras are useful for reaching high frame rates, but fast software is also required to find and measure the laser line position to sub-pixel accuracy in every image, for every profile.  These positions are then converted to world coordinates using a calibrated projection and lens distortion correction, which requires some floating-point work.  The hardware solutions typically manage several thousand profiles per second; software is normally slower.
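The post doesn't describe the calibration model, but to illustrate the world-coordinate step, one common approach maps each measured (column, sub-pixel row) position into the laser plane with a 3x3 homography once lens distortion has been corrected.  A minimal sketch, assuming such a homography H has already been obtained from calibration - the function name and the homography model are illustrative assumptions, not necessarily the method used here:

/* Illustrative only: map a measured laser-line position (u = column,
 * v = sub-pixel row) into coordinates (x, z) in the laser plane using a
 * 3x3 homography H from calibration.  The homography model and the
 * function name are assumptions for illustration. */
static void image_to_laser_plane(const double H[9], double u, double v,
                                 double *x, double *z)
{
    double w = H[6] * u + H[7] * v + H[8];
    *x = (H[0] * u + H[1] * v + H[2]) / w;
    *z = (H[3] * u + H[4] * v + H[5]) / w;
}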

Recently, I've been experimenting with GPU accelerated line profiling - and it's looking fast.  The GPU turns out to be well suited to measuring the laser lines in parallel, since we can launch a single thread per column of the input image.  In fact, for memory access efficiency, it is better for each thread to read a 32-bit int that packs four 8-bit pixels.  A block of 16 threads therefore computes the laser positions for 64 pixels in parallel.  With multiple blocks of 64 pixels running concurrently (Figure 1), the processing rate is pretty much limited only by GPU-host transfers.  On my test rig, the GTX260 GPU has 216 cores, so it can execute 3,456 threads in parallel - far more than are actually needed, so many sit idle in my current implementation.


Figure 1.  Each thread scans four columns and computes the position of the laser line in each of them.  With each block of 16 threads covering 64 columns in parallel, this can be very fast.
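A minimal sketch of what such a kernel might look like is below, assuming one thread per packed group of four 8-bit columns and a simple intensity-weighted centroid for the sub-pixel position; the kernel name, parameters and the centroid method are illustrative assumptions, not the actual Vision Experts implementation.

// Illustrative sketch: each thread reads one 32-bit word (four packed 8-bit
// pixels) per row and accumulates an intensity-weighted centroid for each of
// its four columns.  With blockDim.x = 16, one block covers 64 columns.
__global__ void laserProfileKernel(const unsigned int *image,  // packed 8-bit pixels
                                   int widthWords,             // image width / 4
                                   int height,
                                   float *profile)             // one result per column
{
    int word = blockIdx.x * blockDim.x + threadIdx.x;          // which 32-bit column group
    if (word >= widthWords) return;

    float sum[4]      = {0.0f, 0.0f, 0.0f, 0.0f};
    float weighted[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int y = 0; y < height; ++y)
    {
        unsigned int packed = image[y * widthWords + word];    // coalesced 32-bit read
        for (int c = 0; c < 4; ++c)
        {
            float v = (float)((packed >> (8 * c)) & 0xFF);
            sum[c]      += v;
            weighted[c] += v * (float)y;
        }
    }

    for (int c = 0; c < 4; ++c)
    {
        int col = word * 4 + c;
        profile[col] = (sum[c] > 0.0f) ? (weighted[c] / sum[c]) : -1.0f;
    }
}

The packed 32-bit reads are coalesced across the 16 threads of a block, which is the memory-access efficiency mentioned above.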


Figure 2.  The C# test application (using our own OpenGL 3D display library) was able to achieve over 200MPix/sec throughput using Common Vision Blox images.  The lower-level C interface was more than double that speed when using raw image data.

My initial results show that the C# interface achieves about 200MPix/sec throughput (Figure 2) - but that uses Common Vision Blox images, which must be unwrapped and marshaled to the 'C' dll, and that slows things down.
The low-level 'C' dll library was achieving over 600MPix/sec throughput (Figure 3) - that's a profile rate of several kHz across a range of resolutions.  It may be that this GPU accelerated algorithm can provide line rates that previously only hardware could achieve.

Figure 3.  The low-level DOS test application with the 'C' dll interface was able to achieve over 600MPix/sec throughput using pre-loaded raw images.  That is a 2.5kHz profile rate on 1280x200 laser images, or 390fps for 1280x1280 scans.


Vision Experts

Sunday 2 May 2010

Faster Memory Transfers

NVidia provide a mechanism to allocate non-paged ('pinned') memory on the host, which can significantly improve host-to-GPU transfer performance.  But does it help in practice?

The main bottleneck in GPU processing is the PCIe bus, which has relatively low bandwidth.  For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU.  For a normal host-to-GPU transfer using the cuMemcpyHtoD function, a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI Express bus.  This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms of overhead to transfer a 1920x1080 greyscale image.


Figure 1.  A normal cuMemcpyHtoD from host to device runs at about 2GB/sec.

If we use the NVidia cuMemAllocHost function to allocate non-paged memory on the host, we can almost double the bandwidth when copying that buffer to GPU device memory, achieving nearer 4GB/sec on most systems.  If you are able to write your capture code so that the frame grabber driver DMAs image data directly into one of these page-locked buffers, then that is a worthwhile thing to do.  Unfortunately, that's not always possible.
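The difference is easy to measure with a small driver-API test.  Here is a minimal sketch along those lines; the buffer size, names and the use of CUDA events for timing are my own choices for illustration, not taken from the post.

/* Illustrative sketch: compare host-to-device bandwidth from pageable
 * memory (malloc) and page-locked memory (cuMemAllocHost) using the
 * CUDA driver API. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

static float timed_copy(CUdeviceptr dst, const void *src, size_t bytes)
{
    CUevent start, stop;
    float ms = 0.0f;
    cuEventCreate(&start, CU_EVENT_DEFAULT);
    cuEventCreate(&stop,  CU_EVENT_DEFAULT);
    cuEventRecord(start, 0);
    cuMemcpyHtoD(dst, src, bytes);           /* host-to-device transfer */
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);
    cuEventElapsedTime(&ms, start, stop);
    cuEventDestroy(start);
    cuEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   /* 64MB test buffer */
    CUdevice dev; CUcontext ctx; CUdeviceptr dBuf;
    void *paged, *pinned;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dBuf, bytes);

    paged = malloc(bytes);                   /* ordinary pageable host memory       */
    cuMemAllocHost(&pinned, bytes);          /* page-locked ('pinned') host memory  */

    printf("pageable: %.1f ms\n", timed_copy(dBuf, paged, bytes));
    printf("pinned:   %.1f ms\n", timed_copy(dBuf, pinned, bytes));

    free(paged);
    cuMemFreeHost(pinned);
    cuMemFree(dBuf);
    cuCtxDestroy(ctx);
    return 0;
}

On a PCIe v1.1 x16 system you would expect the pinned copy to take roughly half the time, in line with the 2GB/sec versus 4GB/sec figures above.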

Page-Locked Intermediate Buffer
Sometimes the frame grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA-allocated page-locked memory.  In this situation, we can either copy the captured image directly to device memory as in Figure 1, or memcpy it into a page-locked buffer prior to the transfer across the PCIe bus as in Figure 2.  Since a host memcpy takes time, this erodes some of the benefit of using the page-locked buffer.



Figure 2.  Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.
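In code, the staging scheme amounts to a host-to-host copy into the pinned buffer followed by the usual host-to-device copy.  A minimal sketch, with illustrative buffer and function names; the pinned staging buffer is assumed to have been allocated once up front with cuMemAllocHost.

#include <string.h>
#include <cuda.h>

/* Illustrative sketch of the Figure 2 staging scheme: the frame grabber has
 * filled 'acquireBuf' (ordinary pageable memory), so we memcpy it into a
 * page-locked staging buffer and then copy that to the device. */
CUresult copy_frame_via_pinned(CUdeviceptr dDst,
                               void *pinnedStage,       /* from cuMemAllocHost  */
                               const void *acquireBuf,  /* frame grabber buffer */
                               size_t bytes)
{
    memcpy(pinnedStage, acquireBuf, bytes);        /* host-to-host copy      */
    return cuMemcpyHtoD(dDst, pinnedStage, bytes); /* pinned host-to-device  */
}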

Using a page-locked transfer buffer as shown in Figure 2 is only worthwhile when the cost of the host memcpy operation is low - which requires a relatively high-performance chipset (e.g. ICH10) with fast DDR2 (6.4GB/sec) or DDR3 (8.5GB/sec) memory.  At a minimum, the host-to-host copy must run faster than 4GB/sec, otherwise the direct copy in Figure 1 is usually faster.  As an example, the approximate time taken to transfer 1GB using paged memory is:

1GB / (2GB/sec) = 500ms

When using the scheme in Figure 2, the total time taken to transfer 1GB from host to the page-locked buffer and then on to the GPU is approximately:

  1GB / (8GB/sec) = 125ms
+ 1GB / (4GB/sec) = 250ms
                  = 375ms


This is an improvement over the straight copy, so it would appear that non-paged memory helps even in this non-ideal situation.  When using a newer P45 chipset with PCI Express v2.0, the maximum achievable transfer bandwidth is higher.  In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but in practice transfers are limited by main memory bandwidth, which reduces host-to-GPU bandwidth to somewhere between 5 and 6GB/sec.


The conclusion is that, if at all possible, you should acquire directly into pinned, page-locked memory.  If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance are good.

Direct FrameGrabber-to-GPU DMA
It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible at present.  It might be achievable using driver-level transfers akin to DirectShow drivers, but it is not currently possible to obtain a physical address for GPU device memory using CUDA.

Simon Green, from NVidia, says:
"A lot of people have asked for this. It is technically possible for other PCI-E devices to DMA directly into GPU memory, but we don't have a solution yet. We'll keep you posted." - Sep 2009
This is a capability worth waiting for, but don't hold your breath.



Vision Experts