Saturday 4 July 2009

CUDA function overheads

Whilst working on my CUDA-accelerated JPEG algorithm I found a problem with my design: it demanded launching a large number of small kernels, followed by many thousands of small memory copy operations. I was launching kernels to compress a fixed number of image blocks, many hundreds in all. The result was a set of compressed image blocks whose output size was only known at runtime, after the algorithm had finished, so retrieving them required many thousands of separate memory copies. The design was bad, but I was trying things out to see what would happen.

On a CPU, a function call will typically take a few nanoseconds to push parameters onto the stack and jump the program counter to the function address. On the GPU, however, much more work has to be performed via the driver. So kernel launches and CUDA memory copy operations take at least three orders of magnitude longer to set up than a CPU call: several microseconds in all.
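To put rough numbers on that comparison, here is a minimal back-of-envelope sketch in C++. The two per-call costs are illustrative assumptions picked from the ranges mentioned above (5 ns for a CPU call, 5 µs for a GPU launch or copy), not measurements, and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-call setup costs. These are assumptions taken from the
// ranges described in the text, not measured values.
constexpr double kCpuCallNs   = 5.0;     // "a few nanoseconds"
constexpr double kGpuLaunchNs = 5000.0;  // "several microseconds"

// Total time spent purely on call setup, in microseconds,
// for `calls` calls that each cost `per_call_ns` nanoseconds.
double call_overhead_us(std::int64_t calls, double per_call_ns) {
    assert(calls >= 0 && per_call_ns >= 0.0);
    return static_cast<double>(calls) * per_call_ns / 1000.0;
}
```

Under these assumed figures, `call_overhead_us(10000, kCpuCallNs)` comes to 50 µs, whereas `call_overhead_us(10000, kGpuLaunchNs)` comes to 50,000 µs, i.e. 50 ms spent on setup alone before any useful work is counted.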

This means that if you want to perform many hundreds or thousands of calls, the overhead of the calls themselves adds up much more quickly than it would for the equivalent CPU calls. This effect can become significant - so make your kernels big!
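As a rough illustration of why bigger kernels help, the sketch below models a fixed amount of GPU work split across a number of launches and reports what fraction of the total time goes on launch setup. It is a simplified model rather than a benchmark: it assumes the launches run one after another with none of the setup cost hidden by asynchronous execution, and the function name and figures are hypothetical.

```cpp
#include <cassert>

// Fraction of total time spent on launch setup when a fixed amount of work
// (total_work_us) is split across `launches` kernel launches, each paying
// launch_overhead_us of setup. Assumes launches are fully serialized.
double launch_overhead_fraction(int launches,
                                double launch_overhead_us,
                                double total_work_us) {
    assert(launches > 0 && launch_overhead_us >= 0.0 && total_work_us > 0.0);
    const double overhead_us = launches * launch_overhead_us;
    return overhead_us / (overhead_us + total_work_us);
}
```

With an assumed 5 µs per launch and 1,000 µs of actual work, splitting the work over 1,000 small launches puts about 83% of the time into setup (`launch_overhead_fraction(1000, 5.0, 1000.0)`), while a single large launch brings that down to roughly 0.5% (`launch_overhead_fraction(1, 5.0, 1000.0)`).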