Wednesday, 8 August 2012

Multicore or SIMD?


I've recently been optimizing one of my image processing libraries and wanted to share my results with you.  Two acceleration methods that are relatively straightforward for me to implement, and therefore have a high return-on-investment, are:
  1. Using multi-core parallelism via OpenMP
  2. Using SIMD instructions via the Intel IPP
This post shows my results.  Which helped more?

Preface:

I always design my image processing libraries so that all the high level complexity makes use of a separate low level toolbox of processing functions.  In this particular algorithm, the image was divided into blocks, each encapsulated by a simple block class.  The block class contained implementations of all the basic arithmetic processing from which all the high level functions were built.  From the outset, I had in mind that every block could eventually be processed in parallel and that the individual arithmetic functions could be accelerated using SIMD instructions.
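
To make that concrete, here is a minimal sketch of the kind of block class I mean.  The class itself is illustrative rather than the actual library code, although m_pFloatData, m_nE and m_fMean match the snippets later in this post:

   // Illustrative sketch of a per-block class; each block owns a
   // slice of the image and the low-level arithmetic primitives
   // from which the high level functions are built
   class CBlock
   {
   public:
      CBlock(float* pData, int nElements)
         : m_pFloatData(pData), m_nE(nElements), m_fMean(0.0f) {}

      // Each primitive is a candidate for a SIMD implementation later
      float ComputeMean();
      void  ScalarMultiply(float fScale);

      float* m_pFloatData;  // this block's slice of the image data
      int    m_nE;          // number of elements in the block
      float  m_fMean;       // cached result of ComputeMean()
   };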

I started with vanilla C++, single-threaded implementations of the arithmetic functions.  When everything was working and debugged, I could add parallelism using OMP and SIMD using the Intel IPP without a huge effort.

OpenMP

I love OMP.  It's so simple to use, and so powerful, that it allows me to leverage multi-core processors with almost no effort.  What's more, if you have Visual Studio 2010 then you already have OpenMP.  You just need to switch it on in the project properties under the C/C++ language tab, as below:

Figure 1: Adding OpenMP support to a C++ project in VS2010



Using OpenMP was easy for my project, since I had already designed the algorithm to process the image as a series of independent blocks.  This is my original block loop:

      // CB holds pointers to the image blocks; one result per block
      for (int c = 0; c < BLOCKS; c++)
      {
         vals[c] = CB[c]->ComputeMean();
      }
 
Using OpenMP is as easy as switching it on (Figure 1) and then adding the #pragma omp parallel for compiler directive before the loop:
 
      #pragma omp parallel for
      for (int c = 0; c < BLOCKS; c++)
      {
         // each iteration is independent, so writing to vals[c]
         // is safe across threads
         vals[c] = CB[c]->ComputeMean();
      }
 
No code changes required - it really couldn't be easier.  It just requires some thought at the outset about how to partition the algorithm so that the parallelism can be leveraged.  It is indeed faster.
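
One caveat: the pattern above works because each iteration writes to its own element of vals.  If the loop instead accumulated into a single shared variable, it would need OpenMP's reduction clause to stay correct.  A minimal sketch:

      // Summing per-block means into one total; without the
      // reduction clause the threads would race on 'total'
      float total = 0.0f;
      #pragma omp parallel for reduction(+:total)
      for (int c = 0; c < BLOCKS; c++)
      {
         total += CB[c]->ComputeMean();
      }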

Note: OpenMP is not available in the Standard edition of VS2008, but if you also have VS2010 you can still find and use its vcomp.lib and omp.h with VS2008.  I guess you could use the libraries with any version of Visual Studio, even the free Express editions, although I'm not sure what the licensing/distribution restrictions are when doing that.
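
A quick way to confirm that OpenMP is actually active in a given build is to check the standard _OPENMP macro and query the runtime; a minimal sketch:

   #include <omp.h>
   #include <cstdio>

   int main()
   {
   #ifdef _OPENMP
      // _OPENMP is only defined when the compiler's OpenMP support
      // is switched on (the /openmp option in Visual Studio)
      printf("OpenMP enabled, %d processors available\n", omp_get_num_procs());
   #else
      printf("OpenMP is not enabled\n");
   #endif
      return 0;
   }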

Intel IPP 

I own a developer license for the Intel Integrated Performance Primitives.  Since the processing already used my own vector processing functions, swapping them for equivalent IPP versions was straightforward.  Take for example this very simple loop to compute the average of a vector:

   // Naive single-threaded mean of m_nE floats
   float Sum = 0.0f;
   for (int n = 0; n < m_nE; n++)
   {
      Sum += m_pFloatData[n];
   }
   m_fMean = Sum / (float)m_nE;

This has a direct equivalent in the IPP:
 
   ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);  

This single function performs the same vector average and puts the result in the same output variable, but uses SSE instructions to pack the floats (yes, it's float image data here, but that's another story) so that multiple values are processed in parallel.  It is indeed faster.
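
The 'Combined' rows in the results below simply stack the two techniques: ComputeMean delegates to IPP internally, while the block loop stays parallelised with OpenMP.  A sketch, reusing the illustrative CBlock class from earlier:

      // Inside the block class, the mean now delegates to IPP/SSE
      float CBlock::ComputeMean()
      {
         ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);
         return m_fMean;
      }

      // ...and the block loop is still split across cores by OpenMP
      #pragma omp parallel for
      for (int c = 0; c < BLOCKS; c++)
      {
         vals[c] = CB[c]->ComputeMean();
      }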

Results

So what happened, and which method provided the most bang for the buck? Perhaps unsurprisingly, it depends on what's going on.

Simple Functions
Block Average:     Normal (x1.00), OMP (x1.15), IPP (x2.42), Combined (x2.67)
Block Scalar Mult: Normal (x1.00), OMP (x2.93), IPP (x5.45), Combined (x6.21)

More Complex Functions
Block Alpha Blend: Normal (x1.00), OMP (x2.06), IPP (x1.49), Combined (x1.48)

All functions were timed on a quad-core i3, Win7 x64



Conclusion 

The granularity of the parallelism matters. 
  1. IPP accelerates the simple vector processing functions more than OMP
  2. OpenMP accelerates more complicated high level functions more than IPP
  3. Combining IPP and OpenMP only makes a small difference for these functions
Since IPP already uses OpenMP internally, perhaps it is not surprising that an additional, higher level of OpenMP parallelism does not yield a large speed increase.  However, for higher level functions that combine tens of low-level functions I have found OMP to add considerable value.  I'm sure the reasons are a complex combination of memory bandwidth, cache and processor loading, but the general rule of using OMP for high level parallelism and IPP for low level parallelism seems sensible.
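
If you do wrap IPP calls in your own OpenMP loop, it may be worth restricting IPP's internal threading so the two levels don't compete for cores.  The threaded IPP libraries of this era expose ippSetNumThreads for that (check the documentation for your IPP version); a minimal sketch:

   #include <ipp.h>

   // Initialise IPP, then keep it single-threaded internally so
   // that our own high-level OpenMP loops own the cores
   ippInit();
   ippSetNumThreads(1);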

Vision Experts

6 comments:

  1. If you can predict the usage pattern of your vision library, you can, with very little effort, get a lot of bang for the buck with OpenMP - as you mention.

    The problem is the prediction step. Let us take the example of simple Gaussian filtering. If all your application is ever doing is Gaussian filtering, you might as well #pragma omp the convolution step across all the cores available.

    But what if your fast OpenMP'ed Gaussian algorithm is to be used in the context of a more advanced algorithm, e.g. as a prefiltering step to SURF/SIFT multi-view matching?

    Here you probably want to filter all images in parallel and possibly start extracting and classifying descriptors as they appear. Do you still want X Gaussian filtering algorithms to each try to max out all cores with an #omp pragma? Probably not.

    Controlling the affinity is the hard part about OpenMP, as is predicting how your users (or you yourself) are going to use your newly created building block in a future parallel context.

    Having worked with IPP, I assume you also know Intel TBB (Threading Building Blocks). If not, it would be worth taking a look!

    http://threadingbuildingblocks.org/

    1. Good points Jakob. These tools make it relatively easy to add parallelism at every level of the implementation, and third party libraries often hide what parallelism is used internally. I wonder if the nested parallelism you talk about could ultimately reduce performance!

  2. You may want to check this post:
    http://www.briancbecker.com/blog/2010/analysis-of-jpeg-decoding-speeds/

    According to the post:
    1. Manually optimized single-core libjpeg-simd is 20-30% faster than the multi-core IPP library.
    2. Doubling the number of cores makes libjpeg-simd twice as fast as IPP.

    Some speculations:
    IPP uses multi-core optimizations internally. Therefore, using IPP on a single-core CPU will probably make IPP twice as slow as libjpeg-simd.

    My conclusions are:
    1. IPP is no match for proper manual optimization by a programmer
    2. You have more flexibility with OpenMP, and you'll get close results even without assembly-level modifications
    3. IPP produces slightly faster code than VS or GCC; however, there is no difference on AMD CPUs

    YMMV

  3. That's a great analysis - thanks for the link.

    IPPv6 does not split an image across cores when (de)compressing. I think that capability was introduced in IPPv7 and requires use of the restart markers in the JPG spec. Individual function calls (quantize, DCT etc.) do use OpenMP internally, so they must split the work on individual JPG blocks across cores. That fine granularity is probably not quite as efficient.

    My implementation pre-allocates a thread pool so that all cores are kept busy (de)compressing entire images, plus it uses the IPP to exploit SIMD on each JPG block in each image. Overall, this is pretty darn fast on a multi-core machine and I do see a close-to-linear speed-up as cores are added.
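
    The shape of that loop is roughly the following (DecompressJpeg here is a stand-in for my IPP-based per-image decode routine, not a real IPP API):

      // Coarse-grained parallelism: one whole image per thread,
      // with IPP/SIMD doing the per-block work inside the decoder
      #pragma omp parallel for
      for (int i = 0; i < nImages; i++)
      {
         DecompressJpeg(jpegData[i], decodedImages[i]);
      }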

    Thanks again for your comments

    1. IPP is great for a single process, yet in a multi-process environment it gets unexpected slowdowns. I use a dual Xeon with each CPU having 6 cores. I wrote code that splits the data across threads bound to cores and merges the results back. The code itself is written in SIMD (SSE2-SSE4.2) and the speed-ups are amazing - as long as there is no memory saturation.
