Accelerated Image Processing: 24-bit RGB in CUDA

Friday 4 December 2009

24-bit RGB in CUDA

In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data. I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.

The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced. The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads. During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction.

__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)
{
unsigned int xin = blockIdx.x*48 + threadIdx.x;
unsigned int xout = blockIdx.x*64 + threadIdx.x;
unsigned int y = blockIdx.y;

//Shared memory for 48 input ints
__shared__ unsigned int rgbTriplets[64];
unsigned char *pBGR = (unsigned char*) &rgbTriplets[0];

//Global memory read into shared memory
//Read in 48 x 32bit ints which will load 64 packed rgb triplets.
//Only 48 of the 64 threads are active during this read.
//48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps
//but leaves one half-warp doing nothing
if (threadIdx.x<48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads(); unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]); //Make YUV floats
uchar3 yuvpix;
yuvpix.x = 0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;
yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;
yuvpix.z = 0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;

//Write out 64 ints which are 64 32bit YUVX quads
*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);

return;
}

8 comments:

Twz26 May 2010 at 17:03
Hi,

First of all thank you for your blog. It is very useful. I've a question concerning the 24-bit RGB to 32-bit YUVX conversion. I was reading your code and I couldn't find where the function make_color_rgb() come from. Could you explain how it works ? I'm trying to find a solution to replace it with bitwise operations.

Thank you.

Greg
ReplyDelete
Replies
Jason28 May 2010 at 13:11
__device__ unsigned int pack_rgb(float r, float g, float b)
{
return
((int)r << 24) |
((int)g << 16) |
((int)b << 8);
}
ReplyDelete
Replies
Twz28 May 2010 at 15:09
Many thanks. This is what I used finally :
#define make_color_rgb(r, g, b) ((r << 24) | (g << 16) | (b<<8))

Is yours better ?
ReplyDelete
Replies
Jason28 May 2010 at 15:28
Differences are probably just programming style - its the old argument of Macros vs functions. The device function has type checking at compile time, but is essentially the same.
ReplyDelete
Replies
Unknown22 July 2010 at 10:39
Hi. I ran into this post while searching for a way to handle RGB bitmaps with CUDA.
I want to suggest another mechanism (I wanted to post code but in the comment it comes out messy)

The principal is this - You use each kernel iteration to read 3 integer which are 4 RGB tuples. Then you write back integer values.

For instance in my implementation of RGB to Gray
Each iteration I read 3 integers into a local array. I cast the array to unsigned char* and compute the unsigned char value of 4 output pixels.
These are then written to the gray scale output as 1 integer.
To be able to do this you have to define the kernel dimensions in the following way:

//(nIntegers in row)
int n4PixelsGroup = (step + 3)/4;

//3 ints is 12 bytes is 4 rgb pixels
n4PixelsGroup = (n4PixelsGroup + 2)/3;

dim3 dimBlock(min(512,n4PixelsGroup), 1, 1);
dim3 dimGrid
(
(2*n4PixelsGroup - 1)/dimBlock.x,
imageHeight
, 1
);

In the kernel you compute the following:
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

//intStep is the input rgb pitch devided by 4
unsigned int color[3] =
{
rgbImage[y*intStep + 3*x],
rgbImage[y*intStep + 3*x + 1], rgbImage[y*intStep + 3*x + 2]
};

unsigned char* pRGB = (unsigned char*)color;

/*
manipulate the values of pRGB - I write back the output to color[0]
*/
/*
For grayscale output (opitch is the pitch of the grayscale output) :
*/
unsigned int* pOut =
(unsigned int*)(g_odata + y*opitch + 4*x);

*pOut = color[0];
ReplyDelete
Replies
Unknown22 July 2010 at 15:06
Sorry just a small fix:
dim3 dimGrid((n4PixelsGroup + dimBlock.x - 1)/dimBlock.x, imageHeight, 1);

Also it should be noted that for small images there is no speedup using this method - all the time is taken by memory transfers.

Please let me know if your method performs better.
ReplyDelete
Replies
Carlos11 August 2010 at 22:59
when you say "small images"? what dimensions are you taking about?
ReplyDelete
Replies
Unknown28 June 2011 at 20:51
Would you have any recommandation on how I could acheive good performance on the opposite transformation? I have YV12 formatted pictures coming out from my codecs, but I need to convert them to RGBA32, and I can't seem to get good performance with CUDA.

My CPU algorithm is faster but I'm pretty sure its only because I'm not using CUDA as I should be using it...

Could anyone help me with this please?
ReplyDelete
Replies

Add comment

Welcome

Practical software & algorithm development in the machine vision industry.

Often a pretty technical blog, often just observations on being a developer, leader and human being in the Machine Vision Industry.

I'm professional software engineer, tech lead and company director at Vision Experts and Red Engine.

I spend most of my time managing a small team of world class engineers, inventing IP and consulting for industry in the computer vision space.

Websites can be found at www.flightclubdarts.com and www.visionexperts.co.uk

The Parallel Revolution

“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
- Paul Otellini, President, Intel (2005)

“Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices.”
- Bill Gates, Microsoft

“When we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.”
- John Hennessy, President of Stanford

Accelerated Image Processing

Friday 4 December 2009

24-bit RGB in CUDA

8 comments:

Welcome

Blog Archive

The Parallel Revolution

Keywords

About Me

Other Vision Blogs