Thursday 26 November 2009

RGB Images and CUDA

When using CUDA, using a 32-bit RGBX format to store colour images pays massive performance dividends when compared to the more commonly found 24-bit RGB image format. If you have the choice, avoid 24 bit colour.

NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos. All these demos just happen to load 32-bit RGBX images for any of their colour image processing routines. They effectively do the transformation from 24bit to 32 during the file load and then hide this cost when running the demo. Good for them, but back here in the real world my image capture device is throwing out 24bit RGB images at 60Hz.

For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:



texture < unsigned char, 2, cudaReadModeNormalizedFloat > tex;


which works fine. You can then access pixels using tex2D and all is well. However, if you have an RGB24 image and try this:


texture < uchar3, 2, cudaReadModeNormalizedFloat > tex;




It just wont work. There is no version of tex2D able to fetch 24 bit RGB pixels. In fact, you cannot even allocate a CUDA array with 3 channels - if you try this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 3;
desc.Width = m_dWidth;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );


then the array creation will fail. With CUDA you can only access textures and declare array memory with NumChannels equal to 1,2 or 4 elements.

Furthermore, it is not possible to convert from 24bit to 32bit during a cuMemcpy2D call - whilst this will pad line length to align the pitch (to 256bytes) it will not pad each pixel to match the destination array format.

The only solution is to declare your input array as 1 channel but three times as wide, like this:

CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
desc.NumChannels = 1;
desc.Width = m_dWidth*3;
desc.Height = m_dHeight;
cuArrayCreate( &m_dInputArray, &desc );


You can then access pixels in groups of three in your kernel. Unfortunately, dont expect coalesced memory acesses when each thread has to read three bytes....

No comments:

Post a Comment