<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2714105800710449447</id><updated>2011-10-17T20:23:47.172+01:00</updated><category term='GPU'/><category term='crash'/><category term='laser'/><category term='processing'/><category term='interop'/><category term='tools'/><category term='Performance'/><category term='maths'/><category term='Colour conversion'/><category term='polar'/><category term='callback'/><category term='interfaces'/><category term='misc'/><category term='YUV'/><category term='cameras'/><category term='RGB'/><category term='GPGPU'/><category term='3D'/><category term='warp'/><category term='opengl'/><category term='intel'/><category term='texture'/><category term='runtime'/><category term='kernel'/><category term='CUDA'/><category term='format conversion'/><category term='parallel'/><category term='NVPP'/><category term='launch'/><category term='image'/><category term='unwrap'/><category term='profile'/><title type='text'>Accelerated Image Processing</title><subtitle type='html'>Experience with Machine Vision, Image Processing and Computer Vision Algorithms using NVidia CUDA</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>26</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-201110637374946452</id><published>2010-07-23T12:08:00.007+01:00</published><updated>2010-07-23T16:44:26.319+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='warp'/><category scheme='http://www.blogger.com/atom/ns#' term='texture'/><category scheme='http://www.blogger.com/atom/ns#' term='polar'/><category scheme='http://www.blogger.com/atom/ns#' term='unwrap'/><title type='text'>Blazingly Fast Image Warping</title><content type='html'>&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Want to achieve over 1 Gigapixel/sec warping throughput?&amp;nbsp; Then leverage your GPU texture units using CUDA. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Image warping is a very useful and important image processing function that we use all the time.&amp;nbsp; It is often used, when calibrated, to remove distortions such as perspective projection and lens distortion.&amp;nbsp; Many pattern matching libraries make use of affine image warps to compute image alignment.&amp;nbsp; Almost all imaging libraries have a warping tool in their toolbox.&amp;nbsp; In this post I will say a little about how we make use of the texture hardware in a GPU using CUDA, plus we show some benchmarks for polar unwrapping - and it is &lt;i&gt;fast&lt;/i&gt;.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;If there is one thing that the GPU is excellent at, it is image warping.&amp;nbsp; We can thank the gamers for their insatiable appetite for speed in warping image data or 'textures' onto polygons. Fortunately, even when using CUDA to program a GPU as a general purpose co-pro, the fast texture hardware is still available to us for accelerated warping.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;There are several good reasons to use the texture hardware from CUDA when image processing:&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The ordering of Texture fetches are generally less proscriptive than the strict requirements for coalescing global memory reads.&amp;nbsp; When the order of your data reads does not fit in with a coalesced memory access pattern, consider texture fetches.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Texture fetches are cached.&amp;nbsp; For CUDA array memory, the texture cache has a high level of 2-D locality.&amp;nbsp; Texture fetches from linear memory are also cached.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Texture fetches perform bilinear interpolation in hardware.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Texture fetches can clamp or wrap at image boundaries, so you don't have to have careful bounds checking yourself.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;Linear Memory vs Array Memory&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;When using a GPU to write a graphics application using an API like OpenGL or DirectX, the texture images were transferred and stored on the GPU in a way that optimized the cache for 2-D spatial locality.&amp;nbsp; With CUDA, a type of memory called a CUDA Array is available to serve this purpose, and CUDA Array memory stores 2-D image data in a bespoke way to enhance 2-D throughput.&amp;nbsp; CUDA array memory is managed separately from CUDA linear device memory and has its own memory allocation and copy functions.&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Dedicated CUDA Array memory meant that in the early days of CUDA (&lt;span style="font-size: xx-small;"&gt;going waaay back maybe three whole years&lt;/span&gt;), the developer had to manage copying between host, linear device memory and CUDA array memory.&amp;nbsp; When using the texture hardware, the data had to be in the right place at the right time, forcing many additional copies to array memory.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Fortunately, from CUDA 2.0 onwards, it became possible to use texture fetch hardware with normal linear device memory.&amp;nbsp; As far as I can tell, this innovation obviated the need for Array memory entirely.&amp;nbsp; If there is a good reason to still be using CUDA Array memory then please - post a comment and let us all know.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;&lt;b&gt;Textures - Kernel Code&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Very little code is required in a CUDA kernel in order to use the texture hardware.&amp;nbsp; A texture unit for accessing the pixels of a regular &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;8-bit,&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; 2-dimensional image is created in the kernel code (the .cu file) using the code:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp; &lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;texture&amp;lt;&lt;span style="color: blue;"&gt;unsigned&lt;/span&gt; &lt;span style="color: blue;"&gt;char&lt;/span&gt;, 2, cudaReadModeElementType&amp;gt; tex;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The data can then be fetched using the 2d texture fetch hardware using 'tex2d' as below:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="color: blue;"&gt;unsigned char &lt;/span&gt;pix = tex2D( tex ,fPosX ,fPosY );&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The really neat thing here is that the position to sample the input image is specified by floating point coordinates (&lt;/span&gt;&lt;/span&gt;fPosX &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;and &lt;/span&gt;&lt;/span&gt;fPosY&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;).&amp;nbsp; The texture reference can be set to perform either nearest-neighbor or bi-linear interpolation in hardware without any additional overhead.&amp;nbsp; It's not often you get something as useful as floating point bi-linear interpolation for free - thank NVidia.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;It is also possible for the texture fetch hardware to return normalized floating point values, which is beneficial in many circumstances.&amp;nbsp; For example, in many cases the GPU is faster with floating point arithmetic operations than it is with integer operations.&amp;nbsp; Integer division is rarely a good idea.&amp;nbsp; For this reason I usually declare a float texture object using the following:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;texture&amp;lt;&lt;span style="color: blue;"&gt;unsigned&lt;/span&gt; &lt;span style="color: blue;"&gt;char&lt;/span&gt;, 2, cudaReadModeNormalizedFloat&amp;gt; tex;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;then access the pixels as floating point values:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="color: blue;"&gt;float&lt;/span&gt; pix = tex2D( tex ,fPosX, fPosY );&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Of course, I have to convert the float pixels back to bytes when I have finished playing around, but that's no big overhead and the hardware provides a fast saturation function to limit the float to the unit range for us:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;*pPixOut = 255 * __saturatef(pix);&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;Textures - Initialization Host Code (Driver API)&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;A few lines of additional code are required in your host code during initialization in order to setup the kernel texture object.&amp;nbsp; I tend to do this once during a setup phase of the application, typically just after loading the cubin file and getting function handles.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Firstly, you will need to get a handle to your kernels texture object for the host to use.&amp;nbsp; This is similar to getting a handle to a device constant variable since the reference is retrieved from the kernel cubin by name. In our example above we declared a texture object in the kernel named 'tex'.&amp;nbsp; The host code when using the driver API is therefore:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;CUtexref m_cuTexref;&lt;/div&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;cuModuleGetTexRef(&amp;amp;m_cuTexref, m_cuModule, &lt;span style="color: #a31515;"&gt;"tex"&lt;/span&gt;)&lt;/div&gt;&lt;/div&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Where m_cuModule is the kernel module handle previously loaded/compiled using cuModuleLoadDataEx.&amp;nbsp; Now we need to set up how the texture unit will access the data.&amp;nbsp; Firstly, I tell the texture fetch to clamp to the boundary in both dimensions:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cuTexRefSetAddressMode(m_cuTexref, 0, CU_TR_ADDRESS_MODE_CLAMP);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cuTexRefSetAddressMode(m_cuTexref, 1, CU_TR_ADDRESS_MODE_CLAMP);&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Then we can tell the hardware to fetch image data using nearest neighbour interpolation (point):&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_POINT);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Or bilinear interpolation mode: &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cuTexRefSetFilterMode(m_cuTexref, CU_TR_FILTER_MODE_LINEAR);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Finally, we tell the texture reference about the linear memory we are going to use as a texture.&amp;nbsp; Assuming&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; that there is some device memory &lt;/span&gt;&lt;/span&gt;(CUdeviceptr m_dPtr) &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;allocated during initialization that will contain the image data of dimensions &lt;/span&gt;&lt;/span&gt;Width &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;and &lt;/span&gt;&lt;/span&gt;Height &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;with a byte pitch of &lt;/span&gt;&lt;/span&gt;m_dPitch.&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;span style="color: green;"&gt;// Bind texture reference to linear memory&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; CUDA_ARRAY_DESCRIPTOR cad;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cad.Format = CU_AD_FORMAT_UNSIGNED_INT8;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;span style="color: green;"&gt;// Input linear memory is 8-bit&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cad.NumChannels = 1;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &lt;span style="color: green;"&gt;// Input is greyscal &lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cad.Width = Width;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; &lt;span style="color: green;"&gt;// Input Width&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cad.Height = Height;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;span style="color: green;"&gt;// Input Height&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; cuTexRefSetAddress2D(m_cuTexref, &amp;amp;cad, m_dPtr , m_dPitch);&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The actual image data can be copied into the device memory at a later time, or repeatedly every time a new image is available for video processing.&amp;nbsp; The texture reference 'tex' in the kernel has now been connected to the linear device memory.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;Textures - Kernel Call Host Code (Driver API)&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;There is very little left to do by the time it comes to call a kernel.&amp;nbsp; We have to activate a hardware texture unit and tell it which texture it will be using.&amp;nbsp; On the host side, the texture reference was called &lt;/span&gt;&lt;/span&gt;m_cuTexref, &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;we have already connected this reference to the texture object named 'tex' in the kernel during setup (using &lt;/span&gt;&lt;/span&gt;cuModuleGetTexRef)&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana;"&gt;One additional line is required to tell the kernel function which texture is active in the default texture unit.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt; &lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;cuParamSetTexRef(cuFunction_Handle, CU_PARAM_TR_DEFAULT, m_cuTexref);&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;So, the kernel will now be able to use the hardware texture fetch functions (tex2d) to fetch data from the texture object named 'tex'.&amp;nbsp; It is interesting that the texture unit MUST be &lt;/span&gt;&lt;/span&gt;CU_PARAM_TR_DEFAULT.&amp;nbsp; &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;A CUDA enabled GPU will almost certainly have multiple texture units, so in theory it should be possible to read from multiple texture units simultaneously in a kernel to achieve image blending/fusion effects.&amp;nbsp; Unfortunately, this is not made available to us in CUDA at the time of writing (with CUDA 3.1).&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;To launch the kernel, proceed as normal.&amp;nbsp; For example:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;cuFuncSetBlockShape( cuFunction_Handle, BLOCK_SIZE_X, BLOCK_SIZE_Y, 1 );&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;cuLaunchGridAsync( cuFunction_Handle, GRIDWIDTH, GRIDHEIGHT, stream ))&lt;/div&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Note that I use async calls and multiple streams in order to overlap computation and PCI transfers, thus hiding some of the transfer overhead (&lt;i&gt;a subject for another post&lt;/i&gt;).&amp;nbsp; This can all be hidden from the user by maintaining a rolling buffer internally in the library, making the warp algorithm appear to run faster.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: small;"&gt;&lt;b&gt;&lt;span style="font-family: verdana;"&gt;Performance&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;In order to test the performance I have developed a general purpose warping library that uses our GPU framework to hide all of the CUDA code, JIT compilation, transfers, contexts, streams and threads behind a few simple function calls.&amp;nbsp; A commonly used useful warp function for polar unwrap has been implemented using the texture fetching method described above and the results look &lt;i&gt;very &lt;/i&gt;good.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The input images we chose were from &lt;a href="http://www.opto-engineering.com/"&gt;Opto-Engineering&lt;/a&gt; who have a range of lenses that produce polar images of the sides of product.&amp;nbsp; It is possible to capture high resolution images of the sides of containers as a polar image (below) but in order to accelerate any analysis, a fast polar unwrap is needed.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_LVn-8-fmLeI/TEl1PzW2LuI/AAAAAAAAAE0/xlyKOZOgW70/s1600/opto.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_LVn-8-fmLeI/TEl1PzW2LuI/AAAAAAAAAE0/xlyKOZOgW70/s320/opto.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The output images look good when using the hardware bi-linear interpolation (below):&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_LVn-8-fmLeI/TEl17ZRwVsI/AAAAAAAAAE8/QIKhVo-r3rI/s1600/opto_out.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_LVn-8-fmLeI/TEl17ZRwVsI/AAAAAAAAAE8/QIKhVo-r3rI/s320/opto_out.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;As expected, when nearest-neighbour interpolation is used, the image quality is degraded with aliasing problems (below).&amp;nbsp; Whilst this would be faster on a CPU, the GPU is able to perform the bilinear interpolation mode at the same speed.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_LVn-8-fmLeI/TEl1_HhDGKI/AAAAAAAAAFE/GbDJ7XLWdig/s1600/opto_out_nn.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_LVn-8-fmLeI/TEl1_HhDGKI/AAAAAAAAAFE/GbDJ7XLWdig/s320/opto_out_nn.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The performance depends on the size of the output image, but typically achieves well over 1GB/sec in transform bandwidth, including all the transfer overheads (&lt;span style="font-size: xx-small;"&gt;Core2Quad Q8400@2.66GHz &amp;amp; GTX260 216cores&lt;/span&gt;).&amp;nbsp; For these input images (1024x768), the average total transform time to produce the output (1280x384) was &lt;i&gt;&lt;b&gt;under 400 microseconds&lt;/b&gt;&lt;/i&gt;.&amp;nbsp; That works out at &lt;i&gt;&lt;b&gt;over 1.2 Gigapixels/sec&lt;/b&gt;&lt;/i&gt;.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;A quick comparison to a third party software polar unwrap tool showed that this was at least an order of magnitude faster.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_LVn-8-fmLeI/TEl3jEmvskI/AAAAAAAAAFM/Ram8vajjiu0/s1600/benchmark.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_LVn-8-fmLeI/TEl3jEmvskI/AAAAAAAAAFM/Ram8vajjiu0/s320/benchmark.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The algorithm to perform the polar coordinate conversion is computed on-the-fly.&amp;nbsp; Any number of complex transform functions can be implemented in the library very quickly to achieve this performance.&amp;nbsp; So far, affine, perspective and polar transforms are done.&amp;nbsp; Any requests?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;vxGWarp Interfaces&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Just FYI - the interface to these polar warp functions are pretty trivial, all the GPU expertise is hidden from the end user in the DLL.&amp;nbsp; The key functions in the header file are:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt; &lt;br /&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;vxGWarpCreate(VXGWARPHANDLE *hGW, &lt;span style="color: blue;"&gt;int&lt;/span&gt; W, &lt;span style="color: blue;"&gt;int&lt;/span&gt; H);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;vxGWarpDestroy(VXGWARPHANDLE hGW);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;vxGWarpAccessXferBufIn(VXGWARPHANDLE hGW, &lt;span style="color: blue;"&gt;unsigned&lt;/span&gt; &lt;span style="color: blue;"&gt;char&lt;/span&gt; **pInput, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nW, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nP, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nH);&lt;/div&gt;&lt;div style="margin: 0px;"&gt;vxGWarpAccessXferBufOut(VXGWARPHANDLE hGW, &lt;span style="color: blue;"&gt;unsigned&lt;/span&gt; &lt;span style="color: blue;"&gt;char&lt;/span&gt; **pOutput, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nW, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nP, &lt;span style="color: blue;"&gt;int&lt;/span&gt; *nH);&lt;/div&gt;vxGWarpPolar(VXGWARPHANDLE hGW, POLARPARAMS PP);&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="background: none repeat scroll 0% 0% white; color: black; font-family: Courier New; font-size: 10pt;"&gt;&lt;div style="margin: 0px;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-201110637374946452?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/201110637374946452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/07/image-warping-using-texture-fetches.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/201110637374946452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/201110637374946452'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/07/image-warping-using-texture-fetches.html' title='Blazingly Fast Image Warping'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_LVn-8-fmLeI/TEl1PzW2LuI/AAAAAAAAAE0/xlyKOZOgW70/s72-c/opto.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-1943856813463288680</id><published>2010-07-08T11:29:00.000+01:00</published><updated>2010-07-08T11:29:17.199+01:00</updated><title type='text'>Debunking the x100 GPU Myth - Intel Fights Back</title><content type='html'>&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Intel recently published &lt;a href="http://portal.acm.org/citation.cfm?id=1816021&amp;amp;coll=GUIDE&amp;amp;dl=GUIDE&amp;amp;CFID=11111111&amp;amp;CFTOKEN=2222222&amp;amp;ret=1"&gt;this paper&lt;/a&gt; titled 'Debunking the 100X GPU vs. CPU myth:&amp;nbsp;an evaluation of throughput computing on CPU and GPU' that makes an attempt to compare a number of GPU kernels with algorithms that are highly optimised for Intel architectures.&amp;nbsp; The authors concluded that for the right problems, the GPU was up to 14x faster than an equivalent optimised CPU implementation. On average a &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;x2.5 increase in speed was seen.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;I am all in favour of using GPU's to accelerate image processing when it is appropriate but the hype has gotten out of control over the last year, so I am very pleased to see Intel try and put their case forward and bring some balance to the arguments. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;What I liked about the paper was that for once, significant effort was expended to optimise &lt;i&gt;BOTH&lt;/i&gt; the CPU and the GPU implementations.&amp;nbsp; Too many biased comparisons are made between highly optimised GPU implementations and the naive, plain vanilla single threaded 'C' versions.&amp;nbsp; When a x100 increase in speed is cited, I always suspect that the author was being either highly selective in what parts of the overall system were being timed, or that the algorithm was unrealistically well mapped to GPU hardware and not representative of a real problem, or even that the CPU implementation was simply not optimised at all.&amp;nbsp; The NVidia showcase &lt;a href="http://www.nvidia.com/object/cuda_home_new.html"&gt;website &lt;/a&gt;has made publishing an impressive acceleration factor in the authors best interest. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;I certainly have not come across any imaging &lt;i&gt;systems &lt;/i&gt;that have achieved anything like x100 accelerations in throughput by employing GPU technology.&amp;nbsp; There may be some algorithms that map superbly well to GPUs and can achieve x100 performance increase in a single algorithm stage, but these numbers published by Intel are much more in line with the total throughput increase I have seen when using GPU's to do image processing &lt;i&gt;in real-world applications&lt;/i&gt;, when compared to the optimised CPU algorithms that are readily available.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;An example of disengenuous performance metrics would be the image processing blur demo in the NVidia SDK - here the image is loaded from file, pre-processed and converted into a 512x512 floating point greyscale image, transferred to the GPU once, and &lt;i&gt;THEN &lt;/i&gt;processed repeatedly at high speed to show how fast the GPU is.&amp;nbsp; The CPU conversion to floating point format is omitted from the GPU compute time.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;I would also agree with Intel that most often, in practice, optimisation of an algorithm to use multiple cores, maximize cache usage and SSE instructions is easier, faster and ultimately more portable than developing a CUDA replacement algorithm.&amp;nbsp; I would also agree with the GPU evangelists that the hardware cost of an upgrade to a top-end Intel based PC system, vs the investement in a GTX280 is significantly higher.&amp;nbsp; With the tools improving all the time, it is becoming easier to code and deploy GPU enhanced algorithms.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The conclusion is, for the time being, we must take a balanced view of the technology available and choose the right processing method to suit the application.&amp;nbsp; And be realistic.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-1943856813463288680?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/1943856813463288680/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/07/debunking-x100-gpu-myth-intel-fights.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1943856813463288680'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1943856813463288680'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/07/debunking-x100-gpu-myth-intel-fights.html' title='Debunking the x100 GPU Myth - Intel Fights Back'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-7950308546023505312</id><published>2010-07-03T18:29:00.001+01:00</published><updated>2010-07-03T20:49:55.702+01:00</updated><title type='text'>CUDA Parameter Alignment</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;When executing a CUDA kernel, it is almost always necessary to pass some parameters into the kernel function.&amp;nbsp; For image processing, the parameters are usually at least a pointer to the image data to be processed, plus the width, height, pitch etc. that describe the image.&amp;nbsp; The GPU kernel can then access the input parameters when it runs.&amp;nbsp; For this to happen, the parameters passed into the Kernel function call have to be copied from the host memory to the device code running on the GPU.&amp;nbsp; The mechanism for passing parameters to Kernels at execution is different to the majority of the host-to-device data copies, which use explicit function calls such as &lt;span style="color: #38761d;"&gt;cuMemcpy&lt;/span&gt;().&amp;nbsp; Kernel function parameters, similarly to regular function calls, are passed using a parameter stack.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;When using the CUDA Runtime API then parameter passing is taken care of transparently, and no additional work is required on the part of the programmer.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; The Runtime API hides the details of  copying host parameters from host memory into a parameter stack in the GPU device memory which the kernel can then go on to access as input parameters.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The Driver API is somewhat lower level.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The CUDA Driver API does not hide as much of the detail and the programmer must manage the process correctly, pushing variables onto a parameter stack in the correct order and with the correct alignment and size.&amp;nbsp; In my experience, and judging from the number of questions out there on newsgroups, parameter passing can be a source of trouble.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In the Driver API, function parameters are all passed to the kernel parameter space using the functions:&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;cuParamSeti(CUfunction hFunc, int offset, unsigned int value)&lt;span style="color: black;"&gt; - Pass an integer&lt;/span&gt;&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;cuParamSetf(&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;CUfunction  hFunc, int offset, &lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;float value)&lt;span style="color: black;"&gt;&lt;/span&gt;&lt;/span&gt;&amp;nbsp; - Pass a float&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;cuParamSetv(&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;CUfunction  hFunc, int offset, void*, unsigned int numbytes&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;)&lt;span style="color: black;"&gt; - Pass data&lt;/span&gt;&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;These functions place data residing in the calling host memory onto the kernel parameter stack at the position specified by &lt;span style="color: #38761d;"&gt;offset&lt;/span&gt;.&amp;nbsp; It is crucial to make sure that offset is correct and must take into account the total size of all the previous items placed on the stack, taking their alignment into account.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;A few of the common causes of problems are:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Differences between the host alignment and device alignment of some data types.&amp;nbsp; Sometimes, additional alignment bytes must be added to offset to give the correct alignment.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Differences between the host size and device size of some data types, leading to incorrect value for numbytes or incorrect offset accumulation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;32-bit and 64-bit memory addressing when passing device pointers to &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;cuParamSetv&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Standard Data Types&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;CUDA uses the same size and alignment for all standard types, so using &lt;span style="color: #351c75;"&gt;sizeof()&lt;/span&gt; and &lt;span style="color: #351c75;"&gt;__alignof() &lt;span style="color: black;"&gt;in host code will yield the correct numbers to put parameters on the kernel stack.&lt;/span&gt;&lt;/span&gt;&amp;nbsp; The exception is that the host compiler can choose to align double, long long and 64 bit long (on 64-bit OS) on WORD (2byte) boundary, but the kernel will always expect these to be aligned on a DWORD (4Byte) boundary on the stack.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;A common mistake is to push a small data type onto the stack, followed by a larger data type with larger alignment requirements, but forgetting to increment offset to meet the alignment of the larger type.&amp;nbsp; For example, in the code below a 2-byte short is pushed onto the stack followed by a four-byte int.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #cc0000;"&gt;&lt;b&gt;WRONG&lt;/b&gt;: Byte alignment of int is 4-bytes but offset is only accumulated by the size of short.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;offset = 0;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;short myshort16 = 5434;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;int myint32 = 643826; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;hMyFunc, offset, &amp;amp;myshort16 , 2&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="color: #cc0000; font-family: verdana; font-size: x-small;"&gt;offset+= 2;&amp;nbsp; /// wrong&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;hMyFunc, offset,  &amp;amp;myint32, 4&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;b&gt;RIGHT&lt;/b&gt;: Byte alignment of int is 4-bytes so offset has to be aligned correctly&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;offset =  0;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;short myshort16 = 5434;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;int myint32 = 643826; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;hMyFunc, offset, &amp;amp;myshort16 , 2&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;offset+=4; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;hMyFunc, offset,  &amp;amp;myint32, 4&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #351c75;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: #351c75;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: black;"&gt;In order to ensure you have the right number for offset, NVidia provide a macro called ALIGN_UP that should be called to adjust the offset, prior to calling the next &lt;span style="color: #38761d;"&gt;cuSetParamx &lt;/span&gt;function.&amp;nbsp; &amp;nbsp;&lt;/span&gt; &lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Built-In Vector Types&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;CUDA provides some built-in vector types, listed in Table B.1 in section B.3.1 of the CUDA programming guide 3.1.&amp;nbsp; This means that the kernel can interpret some of the parameters on its input parameters stack as one of these vector types.&amp;nbsp; The host code does not have equivalent vector types, so again, care must be taken to use the right offset and alignment.&amp;nbsp; Most alignments are obvious, but there are exceptions, for example float2 and int2 have 8byte alignment, float3 and int3 have 4byte alignment.&lt;b&gt; &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Device Pointers&lt;/span&gt;&lt;/b&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;This starts to get a bit more complicated.&amp;nbsp; There used to be only two possibilities, the GPU always used always 32-bit pointers but the calling OS was either a 32-bit OS or a 64-bit OS.&amp;nbsp; With the arrival of Fermi, support for 64-bit addressing is possible, meaning we have three valid possibilities.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;32-bit OS &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;This covers probably the most common scenario.&lt;/span&gt;&amp;nbsp; &lt;span style="color: black;"&gt;For all devices except Fermi, a &lt;span style="color: #38761d;"&gt;cuDevicePtr &lt;/span&gt;can be safely cast into a 32bit &lt;span style="color: #351c75;"&gt;void* &lt;span style="color: black;"&gt;without issue.&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;On 32-bit operating systems, the address of operator &amp;amp; will result in a 32-bit pointer, so CUDA allocated device pointers can be passed as &lt;span style="color: #351c75;"&gt;(void*)&lt;/span&gt; parameters.&amp;nbsp; For example&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;cuParamSetv(MycuFunction, offset, &amp;amp;MyDevicePtr, sizeof(&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;MyDevicePtr&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;));&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;b&gt;&lt;span style="color: black;"&gt;64-bit OS, 32-bit GPU&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;span style="color: black;"&gt;For 64-bit operating systems, there is a difference in size between a 32-bit &lt;span style="color: #38761d;"&gt;cuDevicePtr &lt;/span&gt;and a 64-bit &lt;span style="color: #351c75;"&gt;(void*&lt;/span&gt;).&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;span style="color: black;"&gt;So THIS LINE BELOW WILL NOT WORK:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="color: #cc0000;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(MycuFunction,  offset, &amp;amp;MyDevicePtr, sizeof(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;MyDevicePtr&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;));&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #cc0000;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #38761d;"&gt;&lt;span style="color: black;"&gt;The line above will not work since sizeof(cuDevicePtr)=4 but the address of&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; MyDevicePtr will be a 64bit (8byte) pointer.&amp;nbsp; Using the code above will cause bad things to happen. The correct code is:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(MycuFunction,   offset, &amp;amp;MyDevicePtr, sizeof(void*&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;));&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;or - even better (more portable)&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;void *ptr = (void*)MyDevicePtr;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(MycuFunction,    offset, &amp;amp;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;ptr &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;, sizeof(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;ptr &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;));&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="color: black; font-family: verdana; font-size: x-small;"&gt;Care must be taken to make sure offset is always a multiple of 8 bytes before calling this function, since these  64-bit pointers have 8-byte alignment requirements.&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;b&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;64-bit OS, 64-bit Fermi GPU addressing&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="color: black;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;When using &lt;i&gt;nvcc&lt;/i&gt; to compile 64-bit code for Fermi, both host and GPU code will use 64-bit addressing. The pointer size for both host and GPU will now be the same, so the &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="color: black; font-family: verdana; font-size: x-small;"&gt;call used above will still work:&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;void  *ptr = (void*)MyDevicePtr;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;cuParamSetv(MycuFunction,     offset, &amp;amp;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;ptr &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;, sizeof(&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;ptr  &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;));&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Care must still be taken since these 64-bit pointers have 8-byte alignment requirements.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So the key points to remember are:&lt;/span&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Check that the size is right.&amp;nbsp; Be aware of (void*) size differences.&amp;nbsp; Be aware of double, long long, and long (64-bit) differences in size.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Increment the stack offset by the right amount.&amp;nbsp; Then: &lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Check that the stack offset is aligned ready for the requirements of the parameter to be added next.&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Repeat from 1.&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div style="color: #38761d;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="color: black;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-7950308546023505312?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/7950308546023505312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/07/cuda-parameter-alignment.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7950308546023505312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7950308546023505312'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/07/cuda-parameter-alignment.html' title='CUDA Parameter Alignment'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-8949494887678916464</id><published>2010-06-04T21:13:00.002+01:00</published><updated>2010-06-04T21:24:08.197+01:00</updated><title type='text'>GPU 5x5 Bayer Conversion</title><content type='html'>The standard Bayer mosaic conversion algorithms used in machine vision typically employ fast bi-linear interpolation method in order to reconstruct a full 24-bit per pixel colour image.&amp;nbsp; Numerous other, more sophisticated algorithms exist in the public domain, but very few are implemented in a sensible machine vision library.&amp;nbsp; Since most industrial vision tasks are not aimed at recovering very high-fidelity images for human consumption, the Bayer conversion quality does not seem to have been a priority.&amp;nbsp; Its strange really, given that it is easy to spend $10k on a color machine vision camera, capture device and lens, only to put the captured images through the basic Bayer de-mosaic algorithm at the last moment.&lt;br /&gt;&lt;br /&gt;In order to try and improve the situation, we've implemented various Bayer algorithms, including our own adaptive version of the 5x5 Malvar-He-Cutler interpolation algorithm.&amp;nbsp; Our implementation of the Malvar algorithm (we call Ultra Mode) is noticeably sharper and has less color fringing than the standard method.&lt;br /&gt;&lt;br /&gt;The 2-frame gif below shows the difference on a long-range image taken with a well-known machine vision camera.&amp;nbsp; OK - granted its not a drastic difference and the gif encoding doesn't help, but sometimes this fidelity change can be important.&amp;nbsp; Given that our implementation runs on any CUDA enabled GPU faster than a basic CPU bi-linear algorithm, there isn't really a down-side to using the better method.&amp;nbsp;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.visionexperts.co.uk/news/images/ultra.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="293" src="http://www.visionexperts.co.uk/news/images/ultra.gif" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-8949494887678916464?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/8949494887678916464/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/06/gpu-5x5-bayer-conversion.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/8949494887678916464'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/8949494887678916464'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/06/gpu-5x5-bayer-conversion.html' title='GPU 5x5 Bayer Conversion'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3810229875420716423</id><published>2010-05-16T19:37:00.001+01:00</published><updated>2010-05-16T19:41:56.752+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='laser'/><category scheme='http://www.blogger.com/atom/ns#' term='3D'/><category scheme='http://www.blogger.com/atom/ns#' term='profile'/><title type='text'>GPU Accelerated Laser Profiling</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Laser Profiling extracts a dense set of 3D coordinates of a target object by measuring the deviation of a straight laser line as it is swept across the target.&amp;nbsp; Many of these systems make use of custom hardware (e.g. &lt;a href="http://www.sickivp.com/sickivp/products/smart_cameras/ivc/en.html"&gt;Sick IVC3D&lt;/a&gt;) and an FPGA to achieve high line profile rates, often achieving multiple thousands of profiles per second.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;It is also possible to assemble a laser profiling system using any high speed camera and a laser line.&amp;nbsp; Partial-scan cameras can be useful to get high frame-rates but some fast software is also required to find and measure the laser line position in every image to sub-pixel accuracy for every profile.&amp;nbsp; These positions are then converted to world coordinates using a calibrated projection and lens distortion correction - which requires some floating point operations.&amp;nbsp; The hardware solutions typically manage several thousand profiles/sec, software is normally slower.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Recently, I've been experimenting with GPU accelerated line profiling - and its looking fast.&amp;nbsp; The GPU turns out pretty well suited for measuring the laser lines in parallel since we can launch a single thread per column of the input image.&amp;nbsp; In fact, for memory access efficiency, it is better for each thread to read a 32-bit int that packs four 8-bit pixels.&amp;nbsp; A block of 16 threads therefore computes the laser positions for 64 pixels in &lt;i&gt;parallel&lt;/i&gt;.&amp;nbsp; With multiple blocks of 64 pixels running concurrently &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;(Figure1)&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;, the processing rate is pretty much only limited by GPU-host transfers.&amp;nbsp; On my test rig, the GTX260 GPU has 216 cores, so can execute 3,456 threads in parallel, way more than are actually needed and so many are idle in my current implementation. &lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_LVn-8-fmLeI/S_A2z9ZIExI/AAAAAAAAAEM/z4_WWAEhghM/s1600/Diagram.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="165" src="http://2.bp.blogspot.com/_LVn-8-fmLeI/S_A2z9ZIExI/AAAAAAAAAEM/z4_WWAEhghM/s400/Diagram.gif" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;Figure1.&amp;nbsp; Each thread scans four columns in order to compute the position of the laser line in that column.&amp;nbsp; With each GPU core executing 64 threads in parallel, this can be very fast.&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_LVn-8-fmLeI/S_A3ka_q1sI/AAAAAAAAAEU/NiybJQp2WmE/s1600/LaserProfile.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="151" src="http://2.bp.blogspot.com/_LVn-8-fmLeI/S_A3ka_q1sI/AAAAAAAAAEU/NiybJQp2WmE/s200/LaserProfile.jpg" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;Figure2. The C# test application (using our own OpenGL 3D display library) was able to achieve over 200MPix/sec throughput using Common Vision Blox images.&amp;nbsp; The lower-level C interface was &lt;i&gt;double &lt;/i&gt;that speed when using RAW image data.&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;My initial results show that the C# interface is able to achieve about 200MPix throughput (Figure2) - but that uses Common Vision Blox images which must be unwrapped and marshaled to the 'C' dll and slows things down. &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The low-level 'C' dll library was achieving &amp;gt;600MPix/sec throughput (Figure3)- thats many KHz for a range of resolutions.&amp;nbsp; It may be that this GPU accelerated algorithm is able to provide line rates that previously only hardware could achieve.&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_LVn-8-fmLeI/S_A6MGAh8aI/AAAAAAAAAEc/lF3rAiHncLQ/s1600/DOSResult.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_LVn-8-fmLeI/S_A6MGAh8aI/AAAAAAAAAEc/lF3rAiHncLQ/s320/DOSResult.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;Figure3. The low level DOS test application with 'C' dll interface was able to achieve over 600MPix/sec  throughput using pre-loaded raw images.&amp;nbsp; That was 2.5KHz profile rate on 1280x200 laser images, or 390fps for 1280x1280 scans.&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3810229875420716423?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3810229875420716423/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/05/gpu-accelerated-laser-profiling.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3810229875420716423'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3810229875420716423'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/05/gpu-accelerated-laser-profiling.html' title='GPU Accelerated Laser Profiling'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_LVn-8-fmLeI/S_A2z9ZIExI/AAAAAAAAAEM/z4_WWAEhghM/s72-c/Diagram.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-5365855417831244498</id><published>2010-05-02T10:20:00.004+01:00</published><updated>2010-05-02T17:24:17.493+01:00</updated><title type='text'>Faster Memory Transfers</title><content type='html'>&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;NVidia provide a mechanism to to allocate non-paged ('pinned') memory on the host, which can significantly improve host-to-GPU transfer performance.&amp;nbsp; But does it help in practice?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The main bottleneck in GPU processing is the PCIe bus, which has a relatively low bandwidth.&amp;nbsp; For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU.&amp;nbsp; For normal host-to-GPU data transfers using the &lt;span style="color: #38761d;"&gt;cuMemcpy &lt;/span&gt;function a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI express bus.&amp;nbsp; This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms overhead to transfer an 1920x1080 greyscale image.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_LVn-8-fmLeI/S903hDMJ2WI/AAAAAAAAADw/vb2jBP6RBvQ/s1600/Transfera.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="76" src="http://1.bp.blogspot.com/_LVn-8-fmLeI/S903hDMJ2WI/AAAAAAAAADw/vb2jBP6RBvQ/s400/Transfera.gif" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;b&gt;Figure 1&lt;/b&gt;&lt;/span&gt;.&amp;nbsp; &lt;span style="font-size: xx-small;"&gt;&lt;b&gt;A normal cuMemcpy from host-to-device runs at about 2GB/sec&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;If we use the NVidia &lt;span style="color: #38761d;"&gt;cuMemAllocHost&lt;/span&gt; function to allocate non-paged memory on the host, we can almost double the bandwidth when copying this buffer to the GPU device memory, achieving nearer 4GB/sec on most systems.&amp;nbsp; If you are able to write your capture code so that the frame grabber driver will DMA image data directly into one of these page-locked buffers then that is a worthwhile thing to do.&amp;nbsp; Unfortunately, thats not always possible. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;Page-Locked Intermediate Buffer &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Sometimes, the frame-grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA allocated page-locked memory.&amp;nbsp; In this situation, we can either copy our captured image directly to the device memory as in Figure1, or choose to &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;span style="color: blue;"&gt;memcpy &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;into a page-locked buffer prior to transfer across the PCIe bus as in Figure2.&amp;nbsp; Since a host &lt;span style="color: blue;"&gt;memcpy &lt;span style="color: black;"&gt;takes time, this erodes some of the benefit of using the page-locked buffer.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_LVn-8-fmLeI/S90_qL9zjWI/AAAAAAAAAD4/V5aEF8KB4cg/s1600/Transferb.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="81" src="http://3.bp.blogspot.com/_LVn-8-fmLeI/S90_qL9zjWI/AAAAAAAAAD4/V5aEF8KB4cg/s640/Transferb.gif" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;b&gt;&lt;span style="font-family: verdana;"&gt;Figure 2.&amp;nbsp; Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Using a page-locked transfer buffer as shown in figure2 is only worthwhile when the cost of the host memcpy operation is low - which requires a relatively high performance chipset (e.g. ICH10) with fast DDR2 (6.4GB/sec) or DDR3 (8.5GB/sec) memory.&amp;nbsp; At a minimum, the host-to-host copy must execute faster than 4GB/sec otherwise the direct copy in figure1 is usually faster.&amp;nbsp; As an example, the approximate time taken to transfer 1GB using paged memory is:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;1GB / (2GB/sec) =500ms&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;When using the scheme in figure2, the total time taken to transfer 1GB from host to the page-locked buffer and then onto the GPU is approximately:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&amp;nbsp; 1GB/(8GB/sec) = 125ms&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;+1GB/(4GB/sec) = 250ms&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;= 375ms&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;This is an improvement over the straight copy, so it would appear that non-paged memory does help even in this non-ideal situation.&amp;nbsp; When using a newer P45 chipset with PCIexpress v2.0 the maximum achievable transfer bandwidth is higher.&amp;nbsp; In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but are limited by main memory bandwidth, reducing host-to-GPU bandwidth to something between 5 and 6GB/sec. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The conclusion is that if at all possible, acquire directly into a pinned, page-locked memory.&amp;nbsp; If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance is good.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;b&gt;Direct FrameGrabber-to-GPU DMA &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible.&amp;nbsp; It may be achieved using driver-level transfers akin to DirectShow drivers, but it is not currently possible to get a physical address of GPU device memory using CUDA. &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;Simon Green, from NVidia says: &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: Verdana,sans-serif;"&gt;"A lot of people have asked for this. It is technically possible for  other PCI-E devices to DMA directly into GPU memory, but we don't have a  solution yet. We'll keep you posted.&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;" - &lt;i&gt;Sep 2009&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;This is a capability worth waiting for, but don't hold your breath.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-5365855417831244498?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/5365855417831244498/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/05/faster-memory-transfers.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/5365855417831244498'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/5365855417831244498'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/05/faster-memory-transfers.html' title='Faster Memory Transfers'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_LVn-8-fmLeI/S903hDMJ2WI/AAAAAAAAADw/vb2jBP6RBvQ/s72-c/Transfera.gif' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-7518580199769175853</id><published>2010-03-28T17:57:00.000+01:00</published><updated>2010-03-28T17:57:50.811+01:00</updated><title type='text'>CUDA3.0 cubin Files</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;It appears that NVidia has changed the format of CUBIN files with CUDA 3.0 into a standard binary ELF format.&amp;nbsp; Heres what they say in the release notes:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size: x-small;"&gt;CUDA C/C++ kernels are now compiled to standard ELF format&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;You can find out about ELF files at the &lt;a href="http://en.wikipedia.org/wiki/Executable_and_Linkable_Format"&gt;wikipedia entry&lt;/a&gt;.&amp;nbsp; In previous releases the partially compiled .cubin files were plain text readable and could be added into a library as a string resource.&amp;nbsp; If you open old cubin files in Visual studio, they looked something like this:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="color: blue;"&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;architecture {sm_10}&lt;br /&gt;abiversion&amp;nbsp;&amp;nbsp; {1}&lt;br /&gt;modname&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {cubin}&lt;br /&gt;code {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; name = cuFunction_Laser&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; lmem = 0&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; smem = 44&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; reg&amp;nbsp; = 6&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; bar&amp;nbsp; = 0&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; bincode {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; 0x10004209 0x0023c780 0x40024c09 0x00200780 &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; 0xa000000d 0x04000780 0x20000411 0x0400c780 &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; 0x3004d1fd 0x642107c8 0x30000003 0x00000500 &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; ...&lt;/span&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;span style="font-family: verdana;"&gt;blah..blah..blah&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: xx-small;"&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;}&lt;/span&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Rather than ship cubin files with libraries, I have always built them into the file as a string resource and then use the windows API functions such as &lt;span style="color: blue;"&gt;FindResource &lt;/span&gt;and&lt;span style="background-color: white; color: blue;"&gt; LoadResource &lt;span style="color: black;"&gt;to get a pointer to the string.&amp;nbsp; This is then passed to the CUDA &lt;span style="color: #274e13;"&gt;cuModuleLoadDataEx &lt;span style="color: black;"&gt;function&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; for final compilation into the GPU code.&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;With CUDA3.0 and this new ELF format, cubin files look slightly different since they are now a binary file:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_LVn-8-fmLeI/S6-CdCmfQjI/AAAAAAAAADk/NnwFmvplXxk/s1600/cubin3.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="155" src="http://1.bp.blogspot.com/_LVn-8-fmLeI/S6-CdCmfQjI/AAAAAAAAADk/NnwFmvplXxk/s200/cubin3.gif" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;When I compiled some old projects against  CUDA3.0, everything went very wrong due to this change.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The problem was that my old method used to copy the cubin resource string into another memory location using &lt;span style="color: blue;"&gt;strcpy &lt;/span&gt;and add also a final \0 character for good measure at the end of the string.&amp;nbsp; With the new binary format, the string copy does not work and a partially mangled buffer ended up being passed to the CUDA compiler, which promptly fell over.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So if anybody else out there is using string resources to include and manipulate cubin files, this may catch you out too.&amp;nbsp; The fix is easy, simply treat the new cubin files as binary data not strings.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;One final point, if you really want to stick with the previous cubin string format, then apparently (I haven't confirmed this) you can direct nvcc to emit string cubin files by changing the nvcc.profile and the &lt;/span&gt;CUBINS_ARE_ELF &lt;span style="font-family: Verdana,sans-serif; font-size: x-small;"&gt;flag.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-7518580199769175853?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/7518580199769175853/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/03/cuda30-cubin-files.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7518580199769175853'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7518580199769175853'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/03/cuda30-cubin-files.html' title='CUDA3.0 cubin Files'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_LVn-8-fmLeI/S6-CdCmfQjI/AAAAAAAAADk/NnwFmvplXxk/s72-c/cubin3.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-1504122745912731844</id><published>2010-03-26T01:32:00.002Z</published><updated>2010-03-26T19:11:32.180Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='NVPP'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>NVPP Performance Benchmarks</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt; In my last post, &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;I cast some doubt on the performance and utility of GPU's for small image processing functions.&amp;nbsp;&amp;nbsp; Today I had a look at how NVidias own Image processing library - NVPP - stacked up against the latest Intel Performance Primitives (IPPI v6.0) for some basic arithmetic on one of my Dev machines.&amp;nbsp; This development PC has a mid-performance Quad-core Intel Q8400@2.66GHz and a mid-performance NVidia GTX260 with 216cores@1.1GHz.&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The results are interesting and pretty much what I expected.&amp;nbsp; As an example, here are the results for a simple image addition of two images to produce one output image (average 1000 iterations):&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;512x512&lt;/b&gt; Pixels:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;GPU&lt;/b&gt;-&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Transfer &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;and Processing = &lt;b&gt;0.72&lt;/b&gt; milliseconds&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;b&gt;CPU &lt;/b&gt;= &lt;b&gt;0.16&lt;/b&gt; milliseconds&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;2048x2048&lt;/b&gt; Pixels:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;GPU&lt;/b&gt;-&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Transfer&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; and Processing = &lt;b&gt;6.78&lt;/b&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;milliseconds&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;b&gt;CPU&amp;nbsp; &lt;/b&gt;= &lt;b&gt;2.81&lt;/b&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;milliseconds&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The CPU wins easily - so whats happening here?&amp;nbsp; The transfer overheads to-and-from the GPU over a PCIex16 bus are by far the dominant factor, taking approx 2ms per image transfer for the 2048x2048 images (two input images, one image output = approx 6ms).&amp;nbsp; Whilst transfer times can be significantly improved (perhaps halved) if the input and output images were put into page-locked memory, the conclusion would not change; performing individual simple image operations on the GPU does not significantly accelerate image processing.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So what happens if we emulate a compute-intensive algorithms on the GPU?&amp;nbsp; When we perform only one transfer but then replace the single addition with 1000 compounded additions, the total time for the GPU operation becomes:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;2048x2048&lt;/b&gt; Pixels: &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;GPU&lt;/b&gt;-1xTransfer and &lt;i&gt;1000xImAdds&lt;/i&gt; = &lt;b&gt;0.29&lt;/b&gt;  milliseconds&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So for a compute intensive operations which transfers the data once, then re-uses the image data multiple times, the GPU can easily be 10x faster than the CPU.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;This means that algorithms such as deconvolution, optic flow, deformable registration, FFTs, iterative segmentation etc are all good candidates for GPU acceleration.&amp;nbsp; Now, if you look at the &lt;a href="http://www.nvidia.com/object/cuda_apps_flash_new.html"&gt;NVidia community showcase&lt;/a&gt; then these are the sorts of algorithms that you will see making use of the GPU for imaging.&amp;nbsp; When the new Fermi architecture hits the shelves, with its larger L1 cache and new L2 cache, then the GPU imaging performance should make a real jump.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;Its worth mentioning a minor technical problem with NVPP and Visual Studio 2008 - NVPP1.0 doesn't link properly in MSVC2008 unless you disable whole program optimisation (option /GL). Its also worth noting that the NVPP is built on the runtime API, which is not suitable for real-time multi-threaded applications.&amp;nbsp; If you really need some of the NVPP functionality for a real-world application, then we would suggest you get a custom library developed using the driver API.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-1504122745912731844?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/1504122745912731844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/03/nvpp-performance-benchmarks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1504122745912731844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1504122745912731844'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/03/nvpp-performance-benchmarks.html' title='NVPP Performance Benchmarks'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-2849809851808586606</id><published>2010-03-13T19:35:00.003Z</published><updated>2010-03-13T19:57:14.429Z</updated><title type='text'>A GPU is not Always Fastest</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;There has been a huge amount of interest in GPU computing (&lt;a href="http://www.gpgpu.org/"&gt;GPGPU&lt;/a&gt;) over the last couple of years.&amp;nbsp; Unsurprisingly, a number of image processing algorithms have been implemented using this technology.&amp;nbsp; In most cases, large performance gains are reported.&amp;nbsp; However, whilst I have been writing image processing algorithms that leverage the GPU performance for some time now, I have often found that the GPU is not the best solution.&amp;nbsp; As a rule of thumb, I aim for a x10 increase in speed to justify the development, if I can't achieve a x4 increase in speed then its just not worth the effort.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Sometimes, the performance gains are misleading for practical applications.&amp;nbsp; NVidia themselves are guilty of this in their SDK with their image processing examples.&amp;nbsp; For instance, in many of their SDK demonstration applications they use the SDK functions to load an 8-bit image and then pre-convert it on the host to a packed floating point format before uploading to the GPU.&amp;nbsp; They then show large gains in speed, but ignore the huge time penalty of the CPU-side format conversion.&amp;nbsp; In another example they have to unpack 24-bit RGB data into 128bit packed quads of floating point data on the host before they can process it.&amp;nbsp; In the real-world this is not practical.&amp;nbsp; I do wonder how many other people have used some constructive accounting in their reported acceleration factors.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So, despite generally being a GPU evangelist for accelerating image processing, I wanted to write a bit about the downsides to provide a balanced view.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Architecture constraints.&lt;/b&gt; &amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;You need to be doing a lot of work on the  image data to make the architecture work for you.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Many (Most?) practical algorithms just don't fit into a GPU very well.&amp;nbsp; For example, it may be the case that a GPU can do a brute-force template correlation faster than a quad-core CPU, but brute-force correlation for pattern matching isn't the method of choice these days.&amp;nbsp; Contemporary vision libraries have extremely sophisticated algorithms that do a far superior job of pattern matching than correlation, plus they are highly optimised for multi-threading on the CPU.&amp;nbsp; These algorithms simply do not fit into the GPU 'brute force' computational model.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;By way of a painful example, I have been developing a complete JPG conversion library for NVidia GPUs.&amp;nbsp; This is blazingly fast at RGB-YUV conversion, DCT and Quantisation, but falls down on the Huffman coding which is a sequential algorithm.&amp;nbsp; Add in the transfer overheads and it gets slower.&amp;nbsp; At the time of writing, hand-optimised multi-threaded CPU version is almost as fast.&amp;nbsp; All is not lost on this development, but its a tough sell at this point.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Multi-threading.&amp;nbsp; &lt;/b&gt;Whilst a GPU is massively parallel internally, it cannot run multiple algorithms (kernels) in parallel*.&amp;nbsp; So if your application is used to doing multiple operations in parallel, e.g. processing the images from multiple sensors in parallel, then it will have to change and serialize the images into GPU work chunks.&amp;nbsp; So whilst your quad-core CPU could be doing four images at once, the GPU is doing them in serial.&amp;nbsp; This means the GPU has to process at least four times the rate than a single CPU core in order to break even.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="font-size: xx-small;"&gt;*I believe the new NVidia  Fermi architecture can run multiple Kernels simultaneously but most  don't.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Transfer Overheads.&amp;nbsp; &lt;/b&gt;It takes time to transfer data across the PCIe bus to and from the GPU.&amp;nbsp; If the algorithm already runs quickly on the CPU (e.g. a few milliseconds) then GPU acceleration is usually a non-starter. &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Algorithm development time&lt;/b&gt;.&amp;nbsp; It takes longer to write and debug a massively parallel GPU algorithm than it does to parallelize the algorithm on the CPU to make use of a fast quad-core.&amp;nbsp; Development time is expensive.&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Hardware cost&lt;/b&gt;.&amp;nbsp; You do get a lot of horsepower for your money with a GPU, and a good performance card can be purchased for £150.&amp;nbsp; That still has to be factored into the system cost.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Hardware obsolescence&lt;/b&gt;.&amp;nbsp; Whilst NVidia have confirmed that CUDA will be available in every new GPU they produce, the exact same GPU card quickly becomes obsolete.&amp;nbsp; Code should be forward compatible, but I don't think this has really been put to the test yet.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Of course, there are still lots of good things about this new technology and it really can accelerate the big number crunching algorithms like optic flow and deconvolution and FFTs.&amp;nbsp; But you have to choose carefully.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-2849809851808586606?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/2849809851808586606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/03/gpu-is-not-often-fastest.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/2849809851808586606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/2849809851808586606'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/03/gpu-is-not-often-fastest.html' title='A GPU is not Always Fastest'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-4536953904029062578</id><published>2010-02-24T20:44:00.001Z</published><updated>2010-03-13T20:17:59.496Z</updated><title type='text'>High Throughput for High Resolution</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: left;"&gt;We've been using the ProSilica/AVT GE4900 recently to get super high resolution 16megapixel images at about 3Hz.&amp;nbsp; It's a nice camera, but that resolution tends to demand high performance from the processor.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.machinevisiononline.org/userAssets/productImage/AVT_Prosilica_GE4900.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://www.machinevisiononline.org/userAssets/productImage/AVT_Prosilica_GE4900.jpg" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;We have about 45MB/sec of raw image data we have to process.&amp;nbsp; In order  to chew through all this data we've been pushing the raw bayer mosaic  images onto an NVidia GTX260 GPU and performing colour conversion, gamma  correction and even the sensors flat field correction on the GPU at  high speed.&amp;nbsp; We also use the GPU to produce reduced size greyscale  images for processing and alalysis alongside the regular colour  converted image for display.&amp;nbsp; The ability to process such high  resolution images using the GPU has really made the difference for this  application and it would not be possible without this capability.&amp;nbsp; &lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-4536953904029062578?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/4536953904029062578/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/02/high-throughput-for-high-resolution.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/4536953904029062578'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/4536953904029062578'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/02/high-throughput-for-high-resolution.html' title='High Throughput for High Resolution'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-6056516783549532480</id><published>2010-02-10T12:03:00.001Z</published><updated>2010-02-10T12:29:28.159Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='interfaces'/><category scheme='http://www.blogger.com/atom/ns#' term='cameras'/><category scheme='http://www.blogger.com/atom/ns#' term='processing'/><title type='text'>Interface Acceleration</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;Machine vision sensors are getting big, and Camera are increasingly available with a number of pixels that is truly enormous by historical standards.&amp;nbsp; Cameras in the 10+ Megapixel range seem to be increasing in popularity for industrial inspection, possibly driven by the consumer market in which such large sensors are now the norm, partly due to price decreases, and possibly because processing and storing the data is just about feasible these days.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The bandwidth between cameras and computer is also increasing, which it needs to.&amp;nbsp; Already, it seems that a single GigE connection just isnt enough bandwidth for tomorrows applications.&amp;nbsp; For example, &lt;a href="http://www.alliedvisiontec.com/"&gt;AVT&lt;/a&gt; have a dual GigE output on a camera to give 2Gbits/sec of bandwidth.&amp;nbsp; The &lt;a href="http://www.coaxpress.com/coaxpress.php"&gt;CoaXPress&lt;/a&gt; digital interface is capable of 6.25 Gbits/sec over 50m of pretty much bog-standard coax cable, a capability I find incredible.&amp;nbsp; Likewise, the HSLINK standard, proposed by DALSA, uses InfiniBand to achieve 2100Mbytes/sec.&amp;nbsp;&amp;nbsp; Most of these standards even permit using multiple connections to double, or quadruple the bandwidth.&amp;nbsp; With all this data flying around, trying to process this on a PC is going to be like taking a drink from a hose pipe.&amp;nbsp; Or two, or four.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Think about it, at 2Gbits/sec, the computational demand will be 250Mpix/sec (&lt;span style="font-size: xx-small;"&gt;assuming 8bit pixels&lt;/span&gt;).&amp;nbsp; Using a 3GHz processor core, thats 12 clock cycles availble per pixel.&amp;nbsp; You can't do a whole lot of processing with that.&amp;nbsp; Even if you scale up to Quad-core and make sure you use as many SSE SIMD instructions as you can, you still aren't going to be doing anything sophisticated with that data.&amp;nbsp; It could be like machine vision development 15 years ago, when I remember the only realistic goal was to count the number pixels above threshold to take a measurement!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;I feel that the new generation of ultra-high resolution cameras streaming data at ultra-high bandwidths are going to require a new generation of processing solutions.&amp;nbsp; I suspect this will be in the form of massivley parallel processors - such as GPU's and perhaps Intel's Larrabee processor (&lt;i&gt;when it finally materialises&lt;/i&gt;).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In the mean time, I'm plugging away writing GPU accelerated algorithms just for format conversion so that we can even display and store this stuff.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-6056516783549532480?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/6056516783549532480/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/02/interface-acceleration.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6056516783549532480'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6056516783549532480'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/02/interface-acceleration.html' title='Interface Acceleration'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-7686034262658343776</id><published>2010-02-05T12:28:00.000Z</published><updated>2010-02-10T12:30:26.177Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='GPU'/><title type='text'>GPU Supercluster</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;I was interested to see &lt;a href="http://webmac.rowland.org/rjf/cox/Projects/Computation/GPUs/16GPUs.html"&gt;this&lt;/a&gt; GPU system doing some biologically inspired processing at Harvard.&amp;nbsp; Whilst I doubt that there will be any practical industrial applications to emerge from this, it does show how inexpensive it can be to build a minor supercomputer. To quote from their website...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;...With peak performance around 4 TFLOPS (4 trillion floating point operations per second), this little 18”x18”x18” cube is perhaps one of the world’s most compact and inexpensive supercomputers....&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-7686034262658343776?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/7686034262658343776/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/02/gpu-supercluster.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7686034262658343776'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7686034262658343776'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/02/gpu-supercluster.html' title='GPU Supercluster'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3592091638494492672</id><published>2010-01-26T18:36:00.002Z</published><updated>2010-02-02T18:09:27.347Z</updated><title type='text'>C# Interop</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;I write image processing algorithms for a (so-called) living, which explains why my posts are so badly written and incomprehensible.&amp;nbsp; I write quite a few libraries in C/C++ for various industrial tasks and supply them as a trusty old windows dll.&amp;nbsp;&amp;nbsp; Providing an interface for calling my libraries from a C# application front-end is something I have to do quite a bit of.&amp;nbsp; C++ delivers the performance needed for image processing and C# gives the quick and easy GUI.&amp;nbsp; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In order to call a regular 'C' dll from C#, you need to use something in .NET called P/Invoke.&amp;nbsp; This mechanism defines a function that is callable from C# which maps to a dll function call.&amp;nbsp; In the definition of the dll function, you can specify things like character sets for string passing, calling conversions etc.&amp;nbsp; As an example, if you wanted to import the windows kernel32 function Beep into C# using P/Invoke it looks something like:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span id="ctl00_MTCS_main_ctl12"&gt;&lt;/span&gt;&lt;br /&gt;&lt;pre class="libCScode" id="ctl00_MTCS_main_ctl12_code" space="preserve"&gt;&lt;span id="ctl00_MTCS_main_ctl12"&gt;&lt;br /&gt;&lt;span style="color: #38761d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[DllImport("kernel32.dll")]&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #38761d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;public static extern bool Beep(int frequency, int duration);&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/pre&gt;&lt;span id="ctl00_MTCS_main_ctl12"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In the beep example, the integer values passed from managed to unmanaged dll are so-called blittable types and will be passed directly.&amp;nbsp; Passing arrays is not quite as simple, since they have to be converted (&lt;i&gt;marshalled&lt;/i&gt;) by the framework before they are passed to the unmanged dll.&amp;nbsp; Normally, you don't use &lt;span style="color: #0b5394;"&gt;unsafe &lt;span style="color: black;"&gt;code in your C# GUI, so you don't have pointers to data lying around handy.&amp;nbsp; Of course, most C libraries for image processing expect a pointer to some image data to be passed in somewhere, not a managed array object.&amp;nbsp; So somebody has to do some work to turn a managed array into a pointer, without totally screwing up the safe part of C# and all the other stuff going on like the garbage collector.&lt;/span&gt;&lt;/span&gt;&amp;nbsp; This is job of P/Invoke.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;For example, if we have a C# array declared as:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="color: #274e13; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; text-align: center;"&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;float[] CalTgtX = { 58,198,340, 58,198,340, 58,198,340};&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;which we want to pass into a C++ function that looks like this:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="color: #38761d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;extern "C" __declspec(dllexport) void __stdcall CalibrateProjection(float *pTargetX)&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Then we need to carefully define how C# should carry out this conversion.&amp;nbsp; Heres how to define the function in C# so that we can pass (marshal) that float array object from C# to the C dll function:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="color: #38761d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; text-align: left;"&gt;&lt;span style="font-size: x-small;"&gt;[DllImport("MyLib.dll", EntryPoint = "CalibrateProjection", CallingConvention = CallingConvention.StdCall)]&lt;br /&gt;public static extern RETCODE CalibrateProjection([MarshalAs(UnmanagedType.LPArray, SizeConst = 9)] float[] pTargetX)&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The first line tells the C# compiler to import a function.&amp;nbsp; DllImport tells c# we are importing a function from a Dll.&amp;nbsp; EntryPoint tells c# what the function stub is named. CallingConvention should match that used by the Dll - here it was __stdcall.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;The second line defines the function as it will appear to C#.&amp;nbsp; The key to converting the float array object to a pointer is in the &lt;span style="color: #38761d;"&gt;MarshalAs &lt;span style="color: black;"&gt;attribute&lt;/span&gt;&lt;/span&gt;.&amp;nbsp; This will involve a copy to an unsafe array on the heap so can be slow for large arrays... very slow.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;All the different flavours of managed types and structs can be marshalled this way.&amp;nbsp; More information on PInvoke can be found at &lt;a href="http://msdn.microsoft.com/en-us/library/aa288468%28VS.71%29.aspx"&gt;http://msdn.microsoft.com/en-us/library/aa288468%28VS.71%29.aspx&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #274e13; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace; font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3592091638494492672?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3592091638494492672/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/01/c-interop.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3592091638494492672'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3592091638494492672'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/01/c-interop.html' title='C# Interop'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-6423162511440014702</id><published>2010-01-22T19:37:00.001Z</published><updated>2010-01-26T18:30:14.016Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='misc'/><title type='text'>LoaderLock MDA</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;This post isn't really about accelerated image processing, but the topic is related to deployment of DLLs of any type.&amp;nbsp; I hope this helps somebody save some time if they encounter this issue.&amp;nbsp; Whilst developing a C# demo app for one of my CUDA libraries, I encountered a strange error: &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-size: x-small;"&gt;LoaderLock was detected&lt;br /&gt;Message: DLL 'Cephalon.dll' is attempting managed execution inside OS Loader lock. Do not attempt to run managed code inside a DllMain or image initialization function since doing so can cause the application to hang.&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;It took me a while to figure out what was going on, and it was related to how I build CUDA libraries.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;When I make a CUDA enabled library, I wrap up the compiled kernel &lt;i&gt;cubin &lt;/i&gt;files as a resource compiled into the DLL itself.&amp;nbsp; An alternative simpler method is to supply a cubin text file along with each dll and load it directly using the CUDA function:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #274e13; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;cuModuleLoad(&amp;amp;cuModule, pszModulePath)&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #274e13; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;but having two files can lead to version control and maintainance issues. Plus, I try and make a living doing this, and dont really want people reading my precious kernel code that took four months to write too easily.&amp;nbsp; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;So I wrap up the compiled code string neatly inside the DLL as a resource, then get that resource string and compile it on-the-fly (or &lt;i&gt;just-in-time&lt;/i&gt;) using the alternative CUDA function:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #274e13; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;cuModuleLoadDataEx( &amp;amp;cuModule,pCubinStr,3,&amp;amp;jitOptions[0],&amp;amp;jitOptVals[0]);&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;This is great but in order to get the string resource from inside the DLL I need to call a varient of &lt;span style="color: #20124d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;LoadResource&lt;/span&gt;. And I need to call &lt;span style="color: #20124d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;FindResource &lt;/span&gt;to find that resource first.&amp;nbsp; And I need to call &lt;span style="color: #20124d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;GetModuleHandle("LibraryName.dll") &lt;/span&gt;before any of those.&amp;nbsp; The problem is that &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #20124d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;GetModuleHandle &lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;is a prohibited function to call &lt;i&gt;even indirectly&lt;/i&gt; from LoadLibrary when the DLL is first loaded and mapped into the process address space.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;The C# application was loading the DLL when it first encountered one of the functions, this then tried to initialise the CUDA module and load the resource automatically from the dll entry point.&amp;nbsp; Ultimately, the call to &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: #20124d; font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;GetModuleHandle&lt;/span&gt;&lt;/span&gt; &lt;span style="font-size: x-small;"&gt;&lt;span style="font-family: verdana;"&gt;raised an alarm back in the managed code.&amp;nbsp; Not easy to spot.&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;More on the LoaderLock MDA can be found &lt;a href="http://msdn.microsoft.com/en-us/library/ms172219.aspx"&gt;here &lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-6423162511440014702?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/6423162511440014702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2010/01/loaderlock-mda.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6423162511440014702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6423162511440014702'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2010/01/loaderlock-mda.html' title='LoaderLock MDA'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-2780125523175623709</id><published>2009-12-12T18:19:00.003Z</published><updated>2009-12-30T14:55:35.835Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='warp'/><category scheme='http://www.blogger.com/atom/ns#' term='interop'/><category scheme='http://www.blogger.com/atom/ns#' term='opengl'/><title type='text'>OpenGL Interop</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Looks Matter &lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Never ignore the display capabilities of the GPU you have.&amp;nbsp;&amp;nbsp; After all, with the rendering capabililties of a meaty CUDA enabled GPU and the software capabilities of OpenGL you should easily be able to exploit the images and data you've just processed in interesting and engaging ways.&amp;nbsp; Granted, it should be the result that matters.&amp;nbsp; But when it's human engineers there comparing your system against the competition and the look-and-feel of your application is slick and polished, it can make a difference.&amp;nbsp; The key to doing making your CUDA app look slick is OpenGL interop.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Process vs Display&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Image processing and Image generation are two sides of the same coin.&amp;nbsp;&amp;nbsp; Image processing seeks to take an image and extract information (size, shape, depth, region, motion, identity etc.), image generation seeks to take information (vertices, texture, algorithms) and turn that data into an image.&amp;nbsp; Looking at a software level, the two sides of the coin can be addressed by using two GPU technologies - image processing using CUDA and image generation using OpenGL.&amp;nbsp; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;By using CUDA we can turn an NVidia GPU into a powerful image processor, by using OpenGL we can use the same GPU hardware to generate new images. &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;For example, with CUDA image processing algorithm we could extract the motion and depth from a scene in real-time, then with OpenGL image generation we could re-generate camera stabilized video or generate a panorama or even completely re-render the scene from a novel and augmented perspective.&amp;nbsp; It is when we combine image processing with image rendering this way that things get really interesting.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Interop&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In my option, CUDA OpenGL interop seems to be under-documented as well as being a bit more complex that it should be.&amp;nbsp; In sequence - here's how I use CUDA-OpenGL interop:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;At program initialisation: &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Allocate an OpenGL texture that will be compatible with your results (not always easy)&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; Allocate an OpenGL Pixel Buffer Object (PBO) using &lt;span style="color: #38761d;"&gt;glGenBuffers&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Register the PBO with CUDA using &lt;span style="color: blue;"&gt;cuGLRegisterBufferObject&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;Note that (in CUDAv2.2) OpenGL cannot access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do an requested operations on the buffer. Deleting a buffer while it is mapped for CUDA results in undefined behaviour (bad things).&amp;nbsp; Also, always use the same context to map/unmap as the context used to register the buffer.&amp;nbsp; This can be difficult with the Runtime API in a multi-threaded app and results in strange behaviour.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;b&gt;A Digression on Pitch &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;There is a complication with texture allocation and cuda device memory allocation.&amp;nbsp; With CUDA, you really must allocate pitched device memory (using &lt;span style="color: blue;"&gt;cuMemAllocPitch&lt;/span&gt;) for image processing usage.&amp;nbsp;&amp;nbsp; This is in order to meet strict alignment requirements for fast coalseced memory access.&amp;nbsp; You dont have control over the pitch that CUDA will use, but &lt;/span&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt;&lt;span style="color: blue;"&gt;cuMemAllocPitch&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: xx-small;"&gt; returns the actual pitch of the device mem that was allocated, which is anything up to 256bytes.&amp;nbsp; When you allocate a texture in OpenGL, you cannot specify a texture pitch, only width, height and format.&amp;nbsp; This means that your OpenGL texture buffer may not be pitch-compatible with your CUDA device memory layout.&amp;nbsp; You &lt;i&gt;can &lt;/i&gt;use &lt;span style="color: #38761d;"&gt;GL_UNPACK_ALIGNMENT&lt;/span&gt; and &lt;span style="color: #38761d;"&gt;GL_UNPACK_ROW_LENGTH&lt;/span&gt; to help out here, but there are still some fairly common situations when this wont quite give you the control you need.&amp;nbsp; A symptom of mis-matched texture and device memory pitch is when the image data looks like its made it across the interop but is weirdly diagonally scewed or of the wrong aspect ratio.&amp;nbsp; Usually, through a combination of modification to your texture width, packing alignment and/or format you can achieve something compatible.&amp;nbsp; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;For now, I'll assume you have managed to allocate a compatible texture, then;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;At run-time:&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Run the CUDA kernel putting the results into device memory (&lt;span style="color: blue;"&gt;cuDevicePtr&lt;/span&gt;)&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Map the PBO using &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;span style="color: blue;"&gt;cuGLMapBufferObject&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;, which returns the device pointer of the texture memory (another &lt;span style="color: blue;"&gt;cuDevicePtr&lt;/span&gt;)&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Use &lt;span style="color: blue;"&gt;cuMemcpy2D &lt;/span&gt;to copy from the device memory to the mapped PBO memory.&amp;nbsp; These are device-to-device copies.&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Unmap the PBO (&lt;span style="color: blue;"&gt;cuGLUnmapBufferObject&lt;/span&gt;)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Update the texture from the PBO&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Use OpenGL to draw with your new texture&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Develop a Generic Interop Class&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In most of the NVidia examples, CUDA results are written straight to the mapped texture memory during kernel execution.&amp;nbsp; In reality I found it much more efficient (from a productivity perspective) to write the code for the above operations once, and package that up as a little interop utility class.&amp;nbsp; Now, I can always copy from any CUDA device buffer into a suitable OpenGL texture without having to write similar code for every type of kernel launch. Not writing results directly into mapped opengl memory means that you incur an additional copy afterwards, but device-to-device copies are relatively fast in the scheme of things here.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Im putting together some tutorials on interop - they'll be along soon.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;In the mean time - take a look at the Interop Release notes from CUDA 2.2...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;o OpenGL interoperability&lt;br /&gt;  - OpenGL cannot access a buffer that is currently&lt;br /&gt;    *mapped*. If the buffer is registered but not mapped, OpenGL can do any&lt;br /&gt;    requested operations on the buffer.&lt;br /&gt;  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.&lt;br /&gt;  - Attempting to map or unmap while a different context is bound than was&lt;br /&gt;    current during the buffer register operation will generally result in a&lt;br /&gt;    program error and should thus be avoided.&lt;br /&gt;  - Interoperability will use a software path on SLI&lt;br /&gt;  - Interoperability will use a software path if monitors are attached to&lt;br /&gt;    multiple GPUs and a single desktop spans more than one GPU&lt;br /&gt;    (i.e. WinXP dualview).&lt;br /&gt;&lt;/pre&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;/span&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-2780125523175623709?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/2780125523175623709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/12/opengl-interop.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/2780125523175623709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/2780125523175623709'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/12/opengl-interop.html' title='OpenGL Interop'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-6063621608344332534</id><published>2009-12-04T08:18:00.003Z</published><updated>2009-12-12T17:54:08.553Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='YUV'/><category scheme='http://www.blogger.com/atom/ns#' term='RGB'/><title type='text'>24-bit RGB in CUDA</title><content type='html'>In an earlier post I wrote about the difficulty in accessing 24-bit RGB data in CUDA, caused by restrictions on optimised global memory access patterns to 32/64/128-bit data.  I've been working on this problem for RGB to YUV conversion and heres the best plan I have so far.&lt;br /&gt;&lt;br /&gt;The kernel is launched using a 64x1 block size and the image data is passed in as int* so that 32-bit access is coalesced.   The input is read into shared memory as 48ints using the first 48 threads, the output is written as 64 ints using all 64 threads.   During the read into shared memory, 16 threads are idle - this is a half-warp size so should not waste much time as I believe the entire half-warp will execute in a single instruction. &lt;br /&gt;&lt;br /&gt;&lt;span style="color: #000099; font-size: 78%;"&gt;__global__ void kernel_RGB24_to_YUV32(unsigned int *pInput, unsigned int pitchin, unsigned int *pOutput, unsigned int pitchout)&lt;br /&gt;{&lt;br /&gt;unsigned int xin  = blockIdx.x*48 + threadIdx.x;&lt;br /&gt;unsigned int xout = blockIdx.x*64 + threadIdx.x;&lt;br /&gt;unsigned int y = blockIdx.y;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //Shared memory for 48 input ints&lt;/span&gt;&lt;br /&gt;__shared__ unsigned int rgbTriplets[64];&lt;br /&gt;unsigned char *pBGR = (unsigned char*) &amp;amp;rgbTriplets[0];&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //Global memory read into shared memory&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //Read in 48 x 32bit ints which will load 64 packed rgb triplets.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //Only 48 of the 64 threads are active during this read.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //48 is divisible by the 16 thread half-warp size so fully utilises three entire half-warps&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //but leaves one half-warp doing nothing&lt;/span&gt;&lt;br /&gt;if (threadIdx.x&amp;lt;48) { rgbTriplets[threadIdx.x] = *(pInput + xin + y*pitchin); } __syncthreads();   unsigned int tidrgb = threadIdx.x*3; float3 rgbpix = make_float3(pBGR[tidrgb+2],pBGR[tidrgb+1],pBGR[tidrgb+0]);  &lt;span style="color: #009900;"&gt;   //Make YUV floats&lt;/span&gt;&lt;br /&gt;uchar3 yuvpix;&lt;br /&gt;yuvpix.x =  0.2990*rgbpix.x + 0.5870*rgbpix.y + 0.1140*rgbpix.z;&lt;br /&gt;yuvpix.y = -0.1687*rgbpix.x - 0.3313*rgbpix.y + 0.5000*rgbpix.z + 128;&lt;br /&gt;yuvpix.z =  0.5000*rgbpix.x - 0.4187*rgbpix.y - 0.0813*rgbpix.z + 128;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #009900;"&gt;   //Write out 64 ints which are 64 32bit YUVX quads &lt;/span&gt;&lt;br /&gt;*(pOutput+xout+y*pitchout) = make_color_rgb(yuvpix.x,yuvpix.y,yuvpix.z);&lt;br /&gt;&lt;br /&gt;return;&lt;br /&gt;}&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-6063621608344332534?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/6063621608344332534/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/12/24-bit-rgb-in-cuda.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6063621608344332534'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6063621608344332534'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/12/24-bit-rgb-in-cuda.html' title='24-bit RGB in CUDA'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3584919584101554220</id><published>2009-11-26T15:41:00.009Z</published><updated>2009-12-12T17:54:48.589Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='RGB'/><title type='text'>RGB Images and CUDA</title><content type='html'>When using CUDA, using a 32-bit RGBX format to store colour images pays massive performance dividends when compared to the more commonly found 24-bit RGB image format.   If you have the choice, avoid 24 bit colour.&lt;br /&gt;&lt;br /&gt;NVidia neatly skirt around this problem by simply ignoring it in all of their SDK demos.  All these demos just happen to load 32-bit RGBX images for any of their colour image processing routines.  They effectively do the transformation from 24bit to 32 during the file load and then hide this cost when running the demo.  Good for them, but back here in the real world my image capture device is throwing out 24bit RGB images at 60Hz.&lt;br /&gt;&lt;br /&gt;For 32-bit RGBX, the texture references in the kernel code files (*.cu) look like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #000099; font-size: 85%;"&gt;&lt;span style="font-family: courier new;"&gt;&lt;br /&gt;&lt;br /&gt;texture &amp;lt; unsigned char, 2, cudaReadModeNormalizedFloat &amp;gt; tex;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;which works fine.  You can then access pixels using &lt;span style="color: #000099; font-family: courier new;"&gt;tex2D &lt;/span&gt;and all is well.  However, if you have an RGB24 image and try this:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #000099; font-size: 85%;"&gt;&lt;span style="font-family: courier new;"&gt;&lt;br /&gt;texture &amp;lt; uchar3, 2, cudaReadModeNormalizedFloat &amp;gt; tex;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It just wont work.  There is no version of &lt;span style="color: #000099; font-family: courier new;"&gt;tex2D &lt;/span&gt;able to fetch 24 bit RGB pixels.   In fact, you cannot even allocate a CUDA array with 3 channels - if you try this:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #000099; font-size: 85%;"&gt;CUDA_ARRAY_DESCRIPTOR desc;&lt;br /&gt;desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;&lt;br /&gt;desc.NumChannels = 3;&lt;br /&gt;desc.Width = m_dWidth;&lt;br /&gt;desc.Height = m_dHeight;&lt;br /&gt;cuArrayCreate( &amp;amp;m_dInputArray, &amp;amp;desc );&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;then the array creation will fail.  With CUDA you can only access textures and declare array memory with NumChannels equal to 1,2 or 4 elements.&lt;br /&gt;&lt;br /&gt;Furthermore, it is not possible to convert from 24bit to 32bit during a &lt;span style="color: #000099;"&gt;cuMemcpy2D &lt;/span&gt;call - whilst this will pad line length to align the pitch (to 256bytes) it will not pad each pixel to match the destination array format.&lt;br /&gt;&lt;br /&gt;The only solution is to declare your input array as 1 channel but three times as wide, like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #000099; font-size: 85%;"&gt;CUDA_ARRAY_DESCRIPTOR desc;&lt;br /&gt;desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;&lt;br /&gt;desc.NumChannels = 1;&lt;br /&gt;desc.Width = m_dWidth*3;&lt;br /&gt;desc.Height = m_dHeight;&lt;br /&gt;cuArrayCreate( &amp;amp;m_dInputArray, &amp;amp;desc );&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You can then access pixels in groups of three in your kernel.  Unfortunately,  dont expect coalesced memory acesses when each thread has to read three bytes....&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3584919584101554220?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3584919584101554220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/11/rgb-images-and-cuda.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3584919584101554220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3584919584101554220'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/11/rgb-images-and-cuda.html' title='RGB Images and CUDA'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3361691819529703572</id><published>2009-11-08T23:16:00.003Z</published><updated>2009-12-04T08:29:11.620Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='image'/><category scheme='http://www.blogger.com/atom/ns#' term='runtime'/><category scheme='http://www.blogger.com/atom/ns#' term='callback'/><title type='text'>Real-time CUDA Video Processing</title><content type='html'>This week I have learnt some hard lessons about integrating CUDA into a real-time video processing application.  The problem is related to the way most image acquisition devices work under Windows - that is, via interrupt driven capture callback functions.&lt;br /&gt;&lt;br /&gt;Hardware capture devices, be they analogue, digital, CameraLink, GigE or anything else, always come with a vendor specific programming API.  This API typcially uses one of two methods to let your program know when a new video frame has been acquired and is ready for processing; &lt;span style="font-style: italic;"&gt;events &lt;/span&gt;or &lt;span style="font-style: italic;"&gt;callbacks&lt;/span&gt;.   Either your application starts a thread which waits for a sync Event (using&lt;span style="font-style: italic;"&gt; WaitForSingleObject&lt;/span&gt;)  to be signalled by the vendor API when the frame is ready, or you register a callback function which gets called directly by the vendor API when the frame is ready.  Its pretty easy to swap between the two methods, but thats the subject of another blog entry.&lt;br /&gt;&lt;br /&gt;The problem I have come across relates to using the CUDA &lt;span style="font-style: italic;"&gt;runtime &lt;/span&gt;API with capture Callbacks.  You see, the runtime API is pretty hard wired to be single threaded.   You have to allocate, use, and deallocate all CUDA runtie resources on the same thread.  Thats why all the runtime API demos with the NVidia SDK are single threaded console apps.   The runtime appears to setup its own private context and attach it to the calling thread the first time &lt;span style="font-style: italic;"&gt;any &lt;/span&gt;function is called, from then on only that thread is able to use functions in the runtime api.  If any other thread tries to do something with device memory or call a kernel, you get an exception without explanation.&lt;br /&gt;&lt;br /&gt;So you have to be pretty careful about which thread you are going to make that first CUDA runtime API call from.&lt;br /&gt;&lt;br /&gt;Now, for an imaging application we know we have to &lt;span style="font-style: italic;"&gt;use &lt;/span&gt;CUDA in the capture callback thread to process each new image as it arrives.   Well probably want to copy the newly captured image to the GPU device and then run a kernel on it to do something useful.  Since we are using the runtime API, that means we have no choice but to allocate and deallocate all CUDA resources in the same capture callback thread.   But we dont really want to allocate and deallocate device memory every new frame as that is very inefficient, so we put a little catch to only allocate the first time the callback runs.  Everything seems great, our callback runs a bit slow the first time, but it runs.  It seems great until you realise that you dont know when the last time your callback will be called, so you dont know when to deallocate anything.  And you cant do it when the application exits, because only the capture callback thread is allowed to touch the runtime API.  Now thats a problem.&lt;br /&gt;&lt;br /&gt;There are also problems with OpenGL interop.  The &lt;span style="font-style: italic;"&gt;cudaRegisterBuffer &lt;/span&gt;function really needs an opposite &lt;span style="font-style: italic;"&gt;cudaUnregisterBuffer &lt;/span&gt;call before the callback thread terminates.  If you dont unregister then CUDA barfs an exception from deep in the bowels of the nvcuda.dll when the callback thread terminates.  But if you register/unregister every time a new frame arrives, that is really inefficient.  So its all getting sticky with the CUDA runtime API.&lt;br /&gt;&lt;br /&gt;The solution is to start out with the CUDA driver API for any real-time multi-threaded imaging applications.  Lesson learnt.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3361691819529703572?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3361691819529703572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/11/cuda-video-processing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3361691819529703572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3361691819529703572'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/11/cuda-video-processing.html' title='Real-time CUDA Video Processing'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-1186050108704757012</id><published>2009-10-25T10:46:00.004Z</published><updated>2010-01-26T18:30:36.040Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='misc'/><title type='text'>Proper Work</title><content type='html'>I can barely believe that it's nearly that time of year when I pack up my Laptop and go to Stuttgart for the Vision show to catch up with the latest news and technology in the Machine Vision industry. Last year I returned feeling enthusiastic and determined to produce some quality software algorithm of my own using my newly found CUDA skills.  Now that an entire year has passed I have been struggling to remember what exactly I have achieved towards that goal.  It appears that I have been doing alot of &lt;span style="font-weight: bold;"&gt;WORK&lt;/span&gt;, which pays the bills, but hardly any &lt;span style="font-style: italic;"&gt;work&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Emanuel Derman spoke about &lt;span style="font-weight: bold;"&gt;WORK &lt;/span&gt;and &lt;span style="font-style: italic;"&gt;work&lt;/span&gt; which captured the essence of what I think many talented engineers feel every day.  There is always alot of &lt;span style="font-weight: bold;"&gt;WORK &lt;/span&gt;to be done, things like paperwork and meetings and bug fixes and presentations and little tasks which get you through the day and pay your bills.  But the number of days in which real &lt;span style="font-style: italic;"&gt;work &lt;/span&gt;gets done, stuff which will last more than a day and which feels like rewarding constructive activity... well that doesn't seem to happen enough.&lt;br /&gt;&lt;br /&gt;The only solution I have found is to do some &lt;span style="font-style: italic;"&gt;work &lt;/span&gt;after a full day of &lt;span style="font-weight: bold;"&gt;WORK&lt;/span&gt;.  Unfortunatley an hour or two of typing in the evening, propped up by strong coffee and hungry for food is not the environment which germinates really good blue-sky development.&lt;br /&gt;&lt;br /&gt;So I'm taking out a few days next week, turning off the mobile and shutting down outlook, and doing some &lt;span style="font-style: italic;"&gt;work&lt;/span&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-1186050108704757012?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/1186050108704757012/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/10/proper-work.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1186050108704757012'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1186050108704757012'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/10/proper-work.html' title='Proper Work'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-8567729510952402471</id><published>2009-09-12T23:08:00.000+01:00</published><updated>2009-12-12T23:12:27.021Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='interop'/><category scheme='http://www.blogger.com/atom/ns#' term='opengl'/><category scheme='http://www.blogger.com/atom/ns#' term='crash'/><title type='text'>OpenGL interop Woes</title><content type='html'>&lt;span style="font-family: verdana; font-size: x-small;"&gt;Writing real-world multi-threaded apps to capture, process and display video data in real-time is probably, in fairness, a slightly advanced topic.&amp;nbsp; But after a fair amount of experimentation and frustration I think i can offer a piece of advise to other would-be image processing engineers embarking on a CUDA project:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Basically, don't use the CUDA runtime API in a real-time multi-threaded imaging application.&amp;nbsp; And definately dont use the CUDA runtime API with OpenGL interop in &lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;a real-time multi-threaded imaging application.&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;You can just about get away with using the runtime API in a multi-threaded app &lt;i&gt;IF &lt;/i&gt;you restrict your app so that only one host thread ever touches CUDA.&amp;nbsp; Thats not usually possible in a real-time imaging system with interrupt driven capture callbacks and an asynchronous processing and display architecture.&amp;nbsp; If you persist with the runtime API then...&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;b&gt;Bad Things Can Happen &lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;Under the hood, the runtime API is creating a CUDA context and attaching it to the first thread that touched CUDA.&amp;nbsp; From then on, only that host thread should touch and CUDA API function, and if that thread terminates before CUDA resources are deallocated then bad things can happen.&amp;nbsp; Alternatively, if you allocate some device memory in your application start-up, but then try and access or process that memory in a capture callback thread, then once again bad things can happen.&amp;nbsp; Worst of all, if OpenGL interop tries to do something on a separate thread, whilst your capture callback is doing something on another thread then some CUDA operations may work, but sometimes &lt;i&gt;very &lt;/i&gt;bad things can happen.&amp;nbsp; For instance, I was quite successfully and repeatably able to &lt;i&gt;instantly&lt;/i&gt; reboot my PC by running a badly coded piece of multi-threaded CUDA code with OpenGL interop.&amp;nbsp; It was probably my fault, but that is difficult one to debug.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;This is what led me to use the Driver API in all subsequent imaging applications and really take control over which host thread owned and used the CUDA context.&amp;nbsp; So far, no problems. No crashes.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt; &lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: verdana; font-size: x-small;"&gt;&lt;a href="http://www.visionexperts.co.uk/"&gt;Vision Experts&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-8567729510952402471?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/8567729510952402471/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/12/opengl-interop-woes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/8567729510952402471'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/8567729510952402471'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/12/opengl-interop-woes.html' title='OpenGL interop Woes'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3889314276663629368</id><published>2009-07-04T18:03:00.002+01:00</published><updated>2009-12-04T18:16:35.566Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='launch'/><category scheme='http://www.blogger.com/atom/ns#' term='image'/><category scheme='http://www.blogger.com/atom/ns#' term='kernel'/><title type='text'>CUDA function overheads</title><content type='html'>Whilst working on my CUDA accelerated JPEG algorithm I found a problem with my design which demanded launching a large number of small kernels followed by many thousands of small memcopy operations.  I was launching kernels to compress a fixed number of image blocks, many hundreds in all.  The result was compressed image blocks, and the output size was only known at runtime after the algorithm was finished, but required many thousands of mem copy operations.  The design was bad, but I was trying things out to see what would happen.&lt;br /&gt;&lt;br /&gt;On a CPU, a function call will typically take a few nanoseconds to push parameters on the stack and jump the program pointer to the function address.  On the GPU however, much more work has to be performed via the driver.  So kernel launches and cuda mem copy operations take at least three orders of magnitude more to setup than a CPU call - several microseconds in all.&lt;br /&gt;&lt;br /&gt;This means that if you want to perform many hundreds or thousands of calls then the function calls themselves can start to add up much more quickly than the equivalent CPU calls.   This effect can then become significant - so make your kernels big!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3889314276663629368?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3889314276663629368/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/07/cuda-function-overheads.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3889314276663629368'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3889314276663629368'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/07/cuda-function-overheads.html' title='CUDA function overheads'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-3805423215735781569</id><published>2009-06-11T14:25:00.008+01:00</published><updated>2009-06-12T11:43:40.874+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Colour conversion'/><category scheme='http://www.blogger.com/atom/ns#' term='YUV'/><category scheme='http://www.blogger.com/atom/ns#' term='format conversion'/><category scheme='http://www.blogger.com/atom/ns#' term='RGB'/><category scheme='http://www.blogger.com/atom/ns#' term='GPU'/><title type='text'>How to Display YUV420 Video</title><content type='html'>Recently, I came across YUV420 image data whilst working with a hardware H264 compression card.    The image data was planar, and arranged like the image below.  This is standard YUV420 planar format, with the U and V components being at 1/2 resolution of the Y component.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_LVn-8-fmLeI/SjEHlf6gV5I/AAAAAAAAAAs/OCNhE1ptato/s1600-h/YUV420RAW_720x576+Origsml.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 167px; height: 200px;" src="http://1.bp.blogspot.com/_LVn-8-fmLeI/SjEHlf6gV5I/AAAAAAAAAAs/OCNhE1ptato/s200/YUV420RAW_720x576+Origsml.jpg" alt="" id="BLOGGER_PHOTO_ID_5346062573468931986" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Now, when using MS Windows, you have to display images using bitmaps packed as RGB888 or RGBA8888 colour format.    Even when using OpenGL you need basically RGB or 8bit grey images.  I believe MacOS might be nicer (anybody?) and certainly supports YUV422, but under Windows, you have a problem with YUV.  DirectX might help out too (anybody?) - but I live in the OpenGL world here.&lt;br /&gt;&lt;br /&gt;So how do we display YUV420 video in real-time?&lt;br /&gt;&lt;br /&gt;The first thing I tried was a CPU conversion to RGB888, then transferred the RGB data to OpenGL for display.  Easy enough to code in C++ and took about an hour to optimise.  But it still took about 8ms per frame to convert (on 768x576 frames) and really hit the CPU loading, which felt like a real waste of clock cycles for just displaying an image.&lt;br /&gt;&lt;br /&gt;The solution we ended up with was to transfer YUV420 image data raw as GL_LUMINANCE image data, essentially just transferring the whole image (as above) as if it were a 768x864 greyscale image.  We then wrote a Cg fragment shader to do the YUV to RGB conversion and display on the graphics unit.  This worked a treat and even the Intel embedded graphics on the motherboard was able to handle the shader.  This reduced the time to 1.4ms per frame, without any CPU loading.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_LVn-8-fmLeI/SjEJx81LoRI/AAAAAAAAAA0/FeaXhNTG3EQ/s1600-h/DiagramSml.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 127px;" src="http://2.bp.blogspot.com/_LVn-8-fmLeI/SjEJx81LoRI/AAAAAAAAAA0/FeaXhNTG3EQ/s200/DiagramSml.jpg" alt="" id="BLOGGER_PHOTO_ID_5346064986412917010" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;To finish up, we wrapped up the entire functionality as a stand-alone DLL with just a few simple function calls.  Now anybody here can display YUV420 images in a Window without any CPU overhead and without having to be concerned about how it happens.  NVidia Cg requires two additional DLL's to be supplied with the package, but thats it.&lt;br /&gt;&lt;br /&gt;You can get the DLL from us at http://www.vision4ce.com&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-3805423215735781569?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/3805423215735781569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/06/how-to-display-yuv420-video.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3805423215735781569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/3805423215735781569'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/06/how-to-display-yuv420-video.html' title='How to Display YUV420 Video'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_LVn-8-fmLeI/SjEHlf6gV5I/AAAAAAAAAAs/OCNhE1ptato/s72-c/YUV420RAW_720x576+Origsml.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-1305105693360769580</id><published>2009-03-11T09:47:00.005Z</published><updated>2009-06-12T11:34:31.389+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CUDA'/><category scheme='http://www.blogger.com/atom/ns#' term='intel'/><category scheme='http://www.blogger.com/atom/ns#' term='GPGPU'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><title type='text'>Multi-core Momentum</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.intel.com/sites/sitewide/pix/badges/core/ci7_62.gif"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 62px; height: 76px;" src="http://www.intel.com/sites/sitewide/pix/badges/core/ci7_62.gif" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We've been looking at the new Intel Core i7 processors today.   Obviously, with the Nehalem architecture Intel are really getting ready for massively multi-threaded multi-core chips in the near future.   This is good news for image processing as a great many useful algorithms can be implemented using a parallel architecture, so the more mainstream parallel computing gets, the better.    At &lt;a href="http://www.vision4ce.com/"&gt;Vision4ce &lt;/a&gt;, we've  had a lot of success accelerating algorithms using the &lt;a href="http://gpgpu.org/"&gt;GPU &lt;/a&gt;and &lt;a href="http://www.nvidia.com/object/cuda_learn.html"&gt;NVidia CUDA&lt;/a&gt;, so that experience should help us deploy onto other multi-core architectures, probably using&lt;a href="http://www.nvidia.com/object/cuda_opencl.html"&gt; OpenCL&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Another bit from Intel's blurb that interests me is the &lt;a href="http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect"&gt;QuickPath &lt;/a&gt;technology:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-style: italic;"&gt;'Intel QuickPath technology is a point-to-point connection—there is no single bus that all the processors must use and contend with to reach memory and I/O. This improves scalability and eliminates the competition between processors for bus bandwidth.'&lt;/span&gt;&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Now that sounds like Intel is really laying good groundwork for connecting lots of cores together in a parallel architecture - lets see what happens with the Intel Nehalem architecture in the near future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-1305105693360769580?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/1305105693360769580/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/03/weve-been-looking-at-new-intel-core-i7.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1305105693360769580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/1305105693360769580'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/03/weve-been-looking-at-new-intel-core-i7.html' title='Multi-core Momentum'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-7346473339462769555</id><published>2009-03-09T12:38:00.005Z</published><updated>2009-03-09T14:59:28.383Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='tools'/><title type='text'>Vcd on XP</title><content type='html'>Here is a useful little utility from MS for XP that allows you to mount ISO cd's without having to have some huge software bundle installed. Its old, but at 60kb its still useful.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://download.microsoft.com/download/7/b/6/7b6abd84-7841-4978-96f5-bd58df02efa2/winxpvirtualcdcontrolpanel_21.exe"&gt;&lt;span style="font-size:78%;"&gt;Download XP Virtual Control Panel&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;&lt;br /&gt;http://download.microsoft.com/download/7/b/6/7b6abd84-7841-4978-96f5-bd58df02efa2/winxpvirtualcdcontrolpanel_21.exe&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;&lt;br /&gt;Installation instructions&lt;br /&gt;=========================&lt;br /&gt;1. Copy VCdRom.sys to your %systemroot%\system32\drivers folder.&lt;br /&gt;2. Execute VCdControlTool.exe&lt;br /&gt;3. Click “Driver control”&lt;br /&gt;4. If the “Install Driver” button is available, click it. Navigate to the %systemroot%\system32\drivers folder, select VCdRom.sys, and click Open.&lt;br /&gt;5. Click “Start”&lt;br /&gt;6. Click OK&lt;br /&gt;7. Click “Add Drive” to add a drive to the drive list. Ensure that the drive added is not a local drive. If it is, continue to click “Add Drive” until an unused drive letter is available.&lt;br /&gt;8. Select an unused drive letter from the drive list and click “Mount”.&lt;br /&gt;9. Navigate to the image file, select it, and click “OK”. UNC naming conventions should not be used, however mapped network drives should be OK.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-7346473339462769555?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/7346473339462769555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/03/vcd-on-xp.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7346473339462769555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/7346473339462769555'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/03/vcd-on-xp.html' title='Vcd on XP'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-355721234927777545</id><published>2009-01-08T14:07:00.003Z</published><updated>2009-03-09T14:58:51.082Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='maths'/><title type='text'>Sub-Pixel Maximum</title><content type='html'>&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:verdana;"&gt;&lt;span style="font-size:100%;"&gt;Ever needed to find the position of the maximum of a peak in an image to sub-pixel precision?  Examples are finding the peak energy in Fourier space, or the location of a peak in Hough space.  One simple method that I have used with some success is a simple quadratic interpolation, which is performed once for each axis.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a style="font-family: verdana;" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_LVn-8-fmLeI/SbUpob8-UVI/AAAAAAAAAAk/VzIn0NsbO8E/s1600-h/quadratic.gif"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 162px;" src="http://3.bp.blogspot.com/_LVn-8-fmLeI/SbUpob8-UVI/AAAAAAAAAAk/VzIn0NsbO8E/s200/quadratic.gif" alt="" id="BLOGGER_PHOTO_ID_5311197110228373842" border="0" /&gt;&lt;/a&gt;&lt;span style="font-family: verdana;"&gt;Sub-Pixel Peak finding recipe...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:verdana;"&gt;Firstly, you take the value of your highest point and its neighbours (Pa, Pb and Pc). Assuming the three points are equally spaced and bracket the top of the peak, you can solve for the position of the 'true' maximum using:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 0, 204);font-family:verdana;" &gt;x = 0.5*(Pa - Pc) / (Pa-2Pb+Pc)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:verdana;"&gt;The value of x will range between -0.5 and +0.5, which is relative to your maximum pixel.  &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;&lt;span style="font-family:verdana;"&gt;Jason Dale.   www.visionexperts.co.uk&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-355721234927777545?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/355721234927777545/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/03/sub-pixel-maximum.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/355721234927777545'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/355721234927777545'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/03/sub-pixel-maximum.html' title='Sub-Pixel Maximum'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_LVn-8-fmLeI/SbUpob8-UVI/AAAAAAAAAAk/VzIn0NsbO8E/s72-c/quadratic.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2714105800710449447.post-6195384535728013350</id><published>2009-01-02T10:44:00.000Z</published><updated>2009-03-09T14:47:25.028Z</updated><title type='text'>New Year.  New Blog</title><content type='html'>In an effort to pay back some of my huge debt to the internet, I've decided to put a little something back into the web-o-sphere.  And seeing as I'd like it to be something useful, I'm going to share the stuff that I find useful, for the benefit of image processing people everywhere.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2714105800710449447-6195384535728013350?l=visionexperts.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://visionexperts.blogspot.com/feeds/6195384535728013350/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://visionexperts.blogspot.com/2009/01/new-year-new-blog.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6195384535728013350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2714105800710449447/posts/default/6195384535728013350'/><link rel='alternate' type='text/html' href='http://visionexperts.blogspot.com/2009/01/new-year-new-blog.html' title='New Year.  New Blog'/><author><name>Jason</name><uri>http://www.blogger.com/profile/10730990257448904133</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
