Saturday, 3 July 2010

CUDA Parameter Alignment

When executing a CUDA kernel, it is almost always necessary to pass some parameters into the kernel function.  For image processing, the parameters are usually at least a pointer to the image data to be processed, plus the width, height, pitch, etc. that describe the image.  The GPU kernel can then access these input parameters when it runs.  For this to happen, the parameters passed into the kernel function call have to be copied from host memory into device memory where the kernel can read them.  The mechanism for passing parameters to kernels at launch time is different from the majority of host-to-device data copies, which use explicit function calls such as cuMemcpyHtoD().  Kernel function parameters, just like the arguments of a regular function call, are passed on a parameter stack.

When using the CUDA Runtime API, parameter passing is taken care of transparently and no additional work is required on the part of the programmer.  The Runtime API hides the details of copying host parameters from host memory into a parameter stack in GPU device memory, which the kernel then accesses as its input parameters.  The Driver API is somewhat lower level.
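For example, with the Runtime API the kernel arguments are simply written like ordinary C function arguments and the runtime copies them to the device behind the scenes (the kernel and variables below are hypothetical, for illustration only):

// Hypothetical image-processing kernel declaration.
__global__ void brighten(unsigned char* img, int width, int height, size_t pitch, float gain);

// Host-side launch: d_img, width, height, pitch and gain are assumed to exist.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
brighten<<<grid, block>>>(d_img, width, height, pitch, gain);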

The CUDA Driver API does not hide as much of the detail and the programmer must manage the process correctly, pushing variables onto a parameter stack in the correct order and with the correct alignment and size.  In my experience, and judging from the number of questions out there on newsgroups, parameter passing can be a source of trouble.

In the Driver API, function parameters are all passed to the kernel parameter space using the functions: 
  • cuParamSeti(CUfunction hFunc, int offset, unsigned int value) - Pass an integer
  • cuParamSetf(CUfunction hFunc, int offset, float value)  - Pass a float
  • cuParamSetv(CUfunction hFunc, int offset, void* ptr, unsigned int numbytes) - Pass arbitrary data
These functions copy data from the calling host's memory onto the kernel parameter stack at the position specified by offset.  It is crucial that offset is correct: it must account for the total size of all the items previously placed on the stack, as well as their alignment.
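As a minimal sketch of the overall pattern, the fragment below pushes an int and a float and then launches the kernel; hMyFunc is assumed to be a CUfunction obtained earlier with cuModuleGetFunction(), and error checking is omitted:

int offset = 0;
int width = 640;
float gain = 1.5f;

cuParamSeti(hMyFunc, offset, width);       // 4-byte int at offset 0
offset += sizeof(width);

cuParamSetf(hMyFunc, offset, gain);        // 4-byte float at offset 4
offset += sizeof(gain);

cuParamSetSize(hMyFunc, offset);           // tell the driver the total parameter size
cuFuncSetBlockShape(hMyFunc, 256, 1, 1);
cuLaunchGrid(hMyFunc, 16, 1);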


A few of the common causes of problems are:
  • Differences between the host alignment and device alignment of some data types.  Sometimes, additional alignment bytes must be added to offset to give the correct alignment.
  • Differences between the host size and device size of some data types, leading to an incorrect value for numbytes or incorrect accumulation of offset.
  • Mixing 32-bit and 64-bit memory addressing when passing device pointers to cuParamSetv().
Standard Data Types
CUDA uses the same size and alignment as the host for all standard scalar types, so using sizeof() and __alignof() in host code will yield the correct numbers for placing parameters on the kernel stack.  The one exception is that the host compiler may choose to align double, long long and 64-bit long (on a 64-bit OS) on a 4-byte boundary (for example, when gcc is given -mno-align-double), whereas device code always expects these types to be aligned on an 8-byte boundary.
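For example, the sketch below pushes a float followed by a double; offset must be rounded up to 8 bytes before the double is added, whatever the host compiler's own preference (hMyFunc is a hypothetical CUfunction handle):

int offset = 0;
float scale = 0.5f;
double threshold = 1.0e-6;

cuParamSetf(hMyFunc, offset, scale);
offset += sizeof(scale);                        // offset is now 4

offset = (offset + 7) & ~7;                     // round up to the 8-byte alignment the kernel expects
cuParamSetv(hMyFunc, offset, &threshold, sizeof(threshold));
offset += sizeof(threshold);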

A common mistake is to push a small data type onto the stack, followed by a larger data type with larger alignment requirements, but forgetting to increment offset to meet the alignment of the larger type.  For example, in the code below a 2-byte short is pushed onto the stack followed by a four-byte int. 


WRONG: The alignment of int is 4 bytes, but offset is only advanced by the size of the short.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 2;  // wrong: leaves offset at 2, which is not 4-byte aligned
cuParamSetv(hMyFunc, offset, &myint32, 4);

RIGHT: The alignment of int is 4 bytes, so offset must be advanced to a 4-byte boundary before the int is added.
offset = 0;
short myshort16 = 5434;
int myint32 = 643826;
cuParamSetv(hMyFunc, offset, &myshort16, 2);
offset += 4;  // size of the short (2) rounded up to the int's 4-byte alignment
cuParamSetv(hMyFunc, offset, &myint32, 4);

To ensure offset always ends up with the right value, NVIDIA's programming guide and SDK samples provide a macro called ALIGN_UP that should be used to round offset up to the required alignment before the next cuParamSet*() call.
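The macro, as it appears in the programming guide and SDK samples, rounds offset up to the next multiple of the required alignment.  Applied to the short-followed-by-int example above, the pattern looks like this:

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

offset = 0;
cuParamSetv(hMyFunc, offset, &myshort16, sizeof(myshort16));
offset += sizeof(myshort16);

ALIGN_UP(offset, __alignof(myint32));   // rounds offset from 2 up to 4
cuParamSetv(hMyFunc, offset, &myint32, sizeof(myint32));
offset += sizeof(myint32);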


Built-In Vector Types
CUDA provides some built-in vector types, listed in Table B-1 in section B.3.1 of the CUDA Programming Guide 3.1.  This means the kernel can interpret some of the parameters on its input parameter stack as one of these vector types.  The host code does not have equivalent vector types, so again, care must be taken to use the right offset and alignment.  Most alignments are obvious, but there are exceptions: for example, float2 and int2 have 8-byte alignment, while float3 and int3 have 4-byte alignment.
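As an illustration, a float2 parameter could be pushed as sketched below; the struct is a hypothetical host-side stand-in for the device float2 type (if the CUDA vector type headers are included in the host build, float2 itself can be used instead):

typedef struct { float x, y; } hostFloat2;   // same 8-byte size as the device float2
hostFloat2 centre = { 128.0f, 96.0f };

ALIGN_UP(offset, 8);                         // float2 requires 8-byte alignment in device code
cuParamSetv(hMyFunc, offset, &centre, sizeof(centre));
offset += sizeof(centre);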


Device Pointers

This starts to get a bit more complicated.  There used to be only two possibilities: the GPU always used 32-bit pointers, and the calling OS was either a 32-bit OS or a 64-bit OS.  With the arrival of Fermi, 64-bit device addressing is also supported, meaning there are now three valid combinations.


32-bit OS
This covers probably the most common scenario.  For all devices except Fermi, a CUdeviceptr can be safely cast to a 32-bit void* without issue.  On a 32-bit operating system, the address-of operator & yields a 32-bit pointer, so CUDA-allocated device pointers can be passed as (void*) parameters.  For example:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

64-bit OS, 32-bit GPU
For 64-bit operating systems, there is a difference in size between a 32-bit CUdeviceptr and a 64-bit (void*).


So THIS LINE BELOW WILL NOT WORK:

cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(MyDevicePtr));

The line above will not work: sizeof(CUdeviceptr) is only 4 bytes here, but the kernel, having been compiled for a 64-bit host, expects an 8-byte pointer parameter, so only half of the pointer's slot on the parameter stack would be written.  Using the code above will cause bad things to happen, typically crashes or corrupted addresses.  The correct code is:


cuParamSetv(MycuFunction, offset, &MyDevicePtr, sizeof(void*));

or, better still (and more portable):
void *ptr = (void*)(size_t)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must be taken to make sure offset is always a multiple of 8 bytes before calling this function, since these 64-bit pointers have 8-byte alignment requirements.
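Putting the size and alignment requirements together, a sketch of passing a device pointer on a 64-bit host (MyDevicePtr being a CUdeviceptr returned by cuMemAlloc(), and ALIGN_UP as defined earlier) might look like:

void *ptr = (void*)(size_t)MyDevicePtr;   // widen the CUdeviceptr to a host-sized pointer
ALIGN_UP(offset, __alignof(ptr));         // 8-byte alignment for the pointer parameter
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);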

64-bit OS, 64-bit Fermi GPU addressing
When using nvcc to compile 64-bit code for Fermi, both host and GPU code will use 64-bit addressing. The pointer size for both host and GPU will now be the same, so the call used above will still work:

void *ptr = (void*)(size_t)MyDevicePtr;
cuParamSetv(MycuFunction, offset, &ptr, sizeof(ptr));

Care must still be taken since these 64-bit pointers have 8-byte alignment requirements. 

So the key points to remember are:
  1. Check that the size is right.  Be aware of (void*) size differences, and of the size differences of double, long long, and 64-bit long.
  2. Increment the stack offset by the right amount.
  3. Check that the stack offset meets the alignment requirement of the next parameter to be added.
  4. Repeat from 1 (a full sketch of this pattern is given below).
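Putting all of the above together, here is a complete sketch of the parameter setup for a hypothetical image-processing kernel that takes a device pointer, a width, a height and a float gain.  hMyFunc, d_image, width, height and gain are assumed to have been created elsewhere, and error checking is omitted:

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

int offset = 0;

// Device pointer first: widen to a host-sized pointer and align to 8 bytes on a 64-bit host.
void *imgPtr = (void*)(size_t)d_image;
ALIGN_UP(offset, __alignof(imgPtr));
cuParamSetv(hMyFunc, offset, &imgPtr, sizeof(imgPtr));
offset += sizeof(imgPtr);

// Two 4-byte ints.
ALIGN_UP(offset, __alignof(int));
cuParamSeti(hMyFunc, offset, width);
offset += sizeof(int);

ALIGN_UP(offset, __alignof(int));
cuParamSeti(hMyFunc, offset, height);
offset += sizeof(int);

// One 4-byte float.
ALIGN_UP(offset, __alignof(float));
cuParamSetf(hMyFunc, offset, gain);
offset += sizeof(float);

// Declare the total parameter size, set the launch configuration and launch.
cuParamSetSize(hMyFunc, offset);
cuFuncSetBlockShape(hMyFunc, 16, 16, 1);
cuLaunchGrid(hMyFunc, (width + 15) / 16, (height + 15) / 16);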

 


Vision Experts
