Friday, 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.

The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;

//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];


//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;

//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

return;
}

8 comments:

  1. Hi,

    First of all thank you for your blog. It is very useful. I've a question concerning the 24-bit RGB to 32-bit YUVX conversion. I was reading your code and I couldn't find where the function make_color_rgb() come from. Could you explain how it works ? I'm trying to find a solution to replace it with bitwise operations.

    Thank you.

    Greg

    ReplyDelete
  2. __device__ unsigned int pack_rgb(float r, float g, float b)
    {
    return
    ((int)r << 24) |
    ((int)g << 16) |
    ((int)b << 8);
    }

    ReplyDelete
  3. Many thanks. This is what I used finally :
    #define make_color_rgb(r, g, b) ((r << 24) | (g << 16) | (b<<8))

    Is yours better ?

    ReplyDelete
  4. Differences are probably just programming style - its the old argument of Macros vs functions. The device function has type checking at compile time, but is essentially the same.

    ReplyDelete
  5. Hi. I ran into this post while searching for a way to handle RGB bitmaps with CUDA.
    I want to suggest another mechanism (I wanted to post code but in the comment it comes out messy)

    The principal is this - You use each kernel iteration to read 3 integer which are 4 RGB tuples. Then you write back integer values.

    For instance in my implementation of RGB to Gray
    Each iteration I read 3 integers into a local array. I cast the array to unsigned char* and compute the unsigned char value of 4 output pixels.
    These are then written to the gray scale output as 1 integer.
    To be able to do this you have to define the kernel dimensions in the following way:

    //(nIntegers in row)
    int n4PixelsGroup = (step + 3)/4;

    //3 ints is 12 bytes is 4 rgb pixels
    n4PixelsGroup = (n4PixelsGroup + 2)/3;

    dim3 dimBlock(min(512,n4PixelsGroup), 1, 1);
    dim3 dimGrid
    (
    (2*n4PixelsGroup - 1)/dimBlock.x,
    imageHeight
    , 1
    );

    In the kernel you compute the following:
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    //intStep is the input rgb pitch devided by 4
    unsigned int color[3] =
    {
    rgbImage[y*intStep + 3*x],
    rgbImage[y*intStep + 3*x + 1], rgbImage[y*intStep + 3*x + 2]
    };

    unsigned char* pRGB = (unsigned char*)color;

    /*
    manipulate the values of pRGB - I write back the output to color[0]
    */
    /*
    For grayscale output (opitch is the pitch of the grayscale output) :
    */
    unsigned int* pOut =
    (unsigned int*)(g_odata + y*opitch + 4*x);

    *pOut = color[0];

    ReplyDelete
  6. Sorry just a small fix:
    dim3 dimGrid((n4PixelsGroup + dimBlock.x - 1)/dimBlock.x, imageHeight, 1);

    Also it should be noted that for small images there is no speedup using this method - all the time is taken by memory transfers.

    Please let me know if your method performs better.

    ReplyDelete
  7. when you say "small images"? what dimensions are you taking about?

    ReplyDelete
  8. Would you have any recommandation on how I could acheive good performance on the opposite transformation? I have YV12 formatted pictures coming out from my codecs, but I need to convert them to RGBA32, and I can't seem to get good performance with CUDA.

    My CPU algorithm is faster but I'm pretty sure its only because I'm not using CUDA as I should be using it...

    Could anyone help me with this please?

    ReplyDelete