In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.
The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.
__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;
//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];
//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;
//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);
return;
}
Job for a rule-loving engineer?
10 years ago