The main bottleneck in GPU processing is the PCIe bus, which has a relatively low bandwidth. For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU. For normal host-to-GPU data transfers using the cuMemcpy function, a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI Express bus. This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms of overhead to transfer a 1920x1080 greyscale image.
Figure 1. A normal cuMemcpy from host-to-device runs at about 2GB/sec
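The 1ms figure can be checked with a quick back-of-envelope calculation. This is only a sketch: the helper name transfer_time_ms is made up for illustration, and it assumes an 8-bit greyscale image (one byte per pixel), 1GB = 10^9 bytes, and the lower 2.0GB/sec end of the range quoted above.

```python
def transfer_time_ms(num_bytes, bandwidth_gb_per_sec):
    """Milliseconds needed to move num_bytes at the given bandwidth.

    Hypothetical helper: ignores latency and per-call driver overhead,
    and takes 1 GB to mean 1e9 bytes.
    """
    return num_bytes / (bandwidth_gb_per_sec * 1e9) * 1e3


# Assumption: 8-bit greyscale, so one byte per pixel.
frame_bytes = 1920 * 1080

# Normal (pageable) host-to-device copy at roughly 2.0 GB/sec.
pageable_ms = transfer_time_ms(frame_bytes, 2.0)
print(f"{frame_bytes} bytes at 2.0 GB/sec: {pageable_ms:.2f} ms")
```

At 30 frames per second that is only a few percent of the frame period, but it is paid on every frame and again for any result copied back to the host.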
If we use the NVidia cuMemAllocHost function to allocate non-paged memory on the host, we can almost double the bandwidth when copying this buffer to the GPU device memory, achieving nearer 4GB/sec on most systems. If you are able to write your capture code so that the frame grabber driver will DMA image data directly into one of these page-locked buffers, then that is a worthwhile thing to do. Unfortunately, that's not always possible.
Page-Locked Intermediate Buffer
Sometimes, the frame grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA-allocated page-locked memory. In this situation, we can either copy our captured image directly to the device memory as in Figure 1, or choose to memcpy into a page-locked buffer prior to transfer across the PCIe bus as in Figure 2. Since a host memcpy takes time, this erodes some of the benefit of using the page-locked buffer.
Figure 2. Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.
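When using the straight copy in Figure 1, the time taken to transfer 1GB from host memory to the GPU is approximately:
1GB/(2GB/sec) = 500ms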
When using the scheme in Figure 2, and assuming a host memcpy bandwidth of 8GB/sec, the total time taken to copy 1GB from the acquire buffer to the page-locked buffer and then on to the GPU is approximately:
1GB/(8GB/sec) = 125ms
+1GB/(4GB/sec) = 250ms
= 375ms
This is an improvement over the straight copy, so it would appear that non-paged memory does help even in this non-ideal situation. When using a newer P45 chipset with PCI Express v2.0, the maximum achievable transfer bandwidth is higher. In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but in practice transfers are limited by main memory bandwidth, which reduces host-to-GPU bandwidth to something between 5 and 6GB/sec.
The conclusion is that, if at all possible, you should acquire directly into pinned, page-locked memory. If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance are good.
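How good the host memory performance needs to be can be estimated from the same arithmetic. The sketch below is a simplified model rather than a measurement: the function names are hypothetical, the two stages of the staged copy are assumed to run one after the other with no overlap, and the bandwidths are the round figures used in this article (2GB/sec pageable, 4GB/sec page-locked, 8GB/sec host memcpy).

```python
def direct_ms(size_gb, pageable_gb_s):
    """Straight copy from a pageable host buffer to the GPU."""
    return size_gb / pageable_gb_s * 1000


def staged_ms(size_gb, memcpy_gb_s, pinned_gb_s):
    """Host memcpy into a page-locked buffer, then copy to the GPU.

    Assumes the two stages are strictly sequential (no overlap).
    """
    return size_gb / memcpy_gb_s * 1000 + size_gb / pinned_gb_s * 1000


def break_even_memcpy_gb_s(pageable_gb_s, pinned_gb_s):
    """Host memcpy bandwidth above which staging beats the straight copy.

    Staging wins when 1/memcpy + 1/pinned < 1/pageable, which rearranges
    to memcpy > pageable * pinned / (pinned - pageable).
    """
    return pageable_gb_s * pinned_gb_s / (pinned_gb_s - pageable_gb_s)


straight = direct_ms(1, 2)           # 500 ms
staged = staged_ms(1, 8, 4)          # 125 ms + 250 ms = 375 ms
threshold = break_even_memcpy_gb_s(2, 4)

print(f"straight copy: {straight:.0f} ms")
print(f"staged copy:   {staged:.0f} ms")
print(f"staging pays off when host memcpy exceeds {threshold:.0f} GB/sec")
```

Under these assumptions the intermediate buffer only helps if the host memcpy sustains more than 4GB/sec; on a system with slower main memory the straight copy would be the better choice.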
Direct FrameGrabber-to-GPU DMA
It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible. It may be achieved using driver-level transfers akin to DirectShow drivers, but it is not currently possible to get a physical address of GPU device memory using CUDA.
Simon Green, from NVidia, says:
"A lot of people have asked for this. It is technically possible for other PCI-E devices to DMA directly into GPU memory, but we don't have a solution yet. We'll keep you posted." - Sep 2009
This is a capability worth waiting for, but don't hold your breath.
Vision Experts


