Wednesday 8 August 2012

Multicore or SIMD?


I've recently been optimizing one of my image processing libraries and wanted to share my results with you.  Two acceleration methods which are relatively straightforward for me to implement and therefore have a high return-on-investment are:
  1. Using multi-core parallelism via OpenMP
  2. Using SIMD instructions via the Intel IPP
This post shows my results.  Which helped more?

Preface:

I always design my image processing libraries so that all the high-level complexity builds on a separate low-level toolbox of processing functions.  In this particular algorithm, the image was divided into blocks, and each independent block was encapsulated by a simple block class.  The block class contained implementations of all the basic arithmetic processing from which the high-level functions were built.  From the outset, I had in mind that every block could eventually be processed in parallel and that the individual arithmetic functions could be accelerated using SIMD instructions.
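
As a concrete illustration, the skeleton below shows the shape of that design.  The class name and layout are hypothetical stand-ins for my actual block class, but they use the same member names that appear in the snippets later in this post:

   // Hypothetical sketch of the block-based design described above.
   // Each block owns a contiguous run of float pixel data and exposes
   // the low-level arithmetic from which high-level functions are built.
   class CImageBlock
   {
   public:
      CImageBlock(float* pData, int nElements)
         : m_pFloatData(pData), m_nE(nElements), m_fMean(0.0f) {}

      // One of the low-level toolbox functions; high-level algorithms
      // call these per block, so blocks can later be processed in parallel.
      float ComputeMean();

   private:
      float* m_pFloatData;  // pointer to this block's pixel data
      int    m_nE;          // number of elements in the block
      float  m_fMean;       // cached result of the last ComputeMean()
   };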

I started with vanilla C++, single-threaded implementations of the arithmetic functions.  When everything was working and debugged, I could add parallelism using OpenMP and SIMD using the Intel IPP without a huge effort.

OpenMP

I love OMP.  It's so simple to use and so powerful that it allows me to leverage multi-core processors with almost no effort.  What's more, if you have Visual Studio 2010 then you already have OpenMP.  You just need to switch it on in the project properties under the C/C++ language tab, as below:

Adding OpenMP support to a C++ project in VS2010



Using OpenMP was easy for my project, since I had already designed the algorithm to process the image as a series of independent blocks.  This is my original block loop:

      for (int c=0; c<BLOCKS; c++)
      {
         float val = CB[c]->ComputeMean();
      }
 
Using OpenMP is as easy as switching it on (Figure 1) and then adding the #pragma omp parallel for compiler directive before the for loop:
 
      #pragma omp parallel for
      for (int c=0; c<BLOCKS; c++)
      {
         // val is local to the loop body, so each thread gets its own copy
         float val = CB[c]->ComputeMean();
      }
 
No code changes required - it really couldn't be easier.  It just requires some thought at the outset about how to partition the algorithm so that the parallelism can be leveraged.  It is indeed faster.
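
One caveat: if the loop needs to accumulate a single result across all blocks, rather than just storing a result inside each block, the shared accumulator must go in a reduction clause or the threads will race on it.  A minimal sketch, using the same hypothetical CB array of block pointers:

      float total = 0.0f;
      #pragma omp parallel for reduction(+:total)
      for (int c=0; c<BLOCKS; c++)
      {
         // each thread accumulates into its own private copy of 'total';
         // OpenMP sums the private copies when the loop finishes
         total += CB[c]->ComputeMean();
      }
      float meanOfMeans = total / (float)BLOCKS;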

Note: In VS2008, OpenMP is not available in the Standard edition, but if you have VS2010 you can still find and use vcomp.lib and omp.h with VS2008.  I guess you could use the libraries with any version of Visual Studio, even the free Express editions, although I'm not sure what the licensing/distribution restrictions are when doing that.
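
A quick way to check that OpenMP is actually enabled in whatever configuration you end up with is the _OPENMP macro, which the compiler defines only when the OpenMP switch is on.  A minimal sketch:

   #include <omp.h>
   #include <cstdio>

   int main()
   {
   #ifdef _OPENMP
      // defined by the compiler only when OpenMP support is switched on
      printf("OpenMP enabled, max threads: %d\n", omp_get_max_threads());
   #else
      printf("OpenMP not enabled\n");
   #endif
      return 0;
   }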

Intel IPP 

I own a developer license for the Intel Integrated Performance Primitives.  Since the processing already used my own vector processing functions, swapping them for equivalent IPP versions was straightforward.  Take for example this very simple loop to compute the average of a vector:

   float Sum = 0.0f;
   for (int n=0; n<m_nE; n++)
   {
      Sum += m_pFloatData[n];   // accumulate every element in the block
   }
   m_fMean = Sum / (float)m_nE;

This has a direct equivalent in the IPP:
 
   ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);  

This single function performs the same vector average and puts the result in the same output variable, but uses SSE instructions to pack the floats (yes, it's float image data here, but that's another story) so that multiple values are processed in parallel.  It is indeed faster.
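
The one-liner above ignores the IppStatus return code, which is fine for a blog post but worth checking in production code.  A minimal standalone sketch, with an arbitrary buffer length and the data allocated through the IPP allocator so it gets SIMD-friendly alignment (assuming the usual ipps.h/ippcore.h headers):

   #include <stdio.h>
   #include <ippcore.h>
   #include <ipps.h>

   int main()
   {
      const int nE = 4096;                      // arbitrary element count
      Ipp32f* pData = ippsMalloc_32f(nE);       // aligned allocation
      ippsSet_32f(1.5f, pData, nE);             // fill with a test value

      Ipp32f mean = 0.0f;
      IppStatus st = ippsMean_32f(pData, nE, &mean, ippAlgHintFast);
      if (st != ippStsNoErr)
         printf("ippsMean_32f failed: %s\n", ippGetStatusString(st));
      else
         printf("mean = %f\n", mean);

      ippsFree(pData);
      return 0;
   }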

Results

So what happened, and which method provided the most bang-for-buck?  Perhaps unsurprisingly, it depends on what's going on.  The figures below are speedup factors relative to the plain single-threaded version.

Simple Functions
Block Average Normal (x1)
Block Average OMP (x1.15)
Block Average IPP (x2.42)
Block Average Combined (x2.67)

Block Scalar Mult Normal (x1)
Block Scalar Mult OMP (x2.93)
Block Scalar Mult IPP (x5.45)
Block Scalar Mult Combined (x6.21)

More Complex Functions
Block Alpha Blend Normal (x1)
Block Alpha Blend OMP (x2.06)
Block Alpha Blend IPP (x1.49)
Block Alpha Blend Combined (x1.48)

All functions were timed on a quad-core i3 machine running Win7 x64



Conclusion 

The granularity of the parallelism matters. 
  1. IPP accelerates the simple vector processing functions more than OMP
  2. OpenMP accelerates more complicated high level functions more than IPP
  3. Combining IPP and OpenMP only makes a small difference for these functions
Since IPP already uses OpenMP internally, perhaps it is not surprising that an additional, higher level of OpenMP parallelism does not yield a large speed increase.  However, for higher-level functions that combine tens of low-level functions I have found OMP to add considerable value.  I'm sure the reasons are a complex combination of memory bandwidth, cache and processor loading, but the general rule of using OMP for high-level parallelism and IPP for low-level parallelism seems sensible, as the combined sketch below illustrates.
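
Putting that rule into code, the combined version is just the OpenMP block loop from earlier with IPP doing the per-block work, along these lines (same hypothetical CB array; ComputeMean internally calls ippsMean_32f as shown above):

      // High-level parallelism: OpenMP hands blocks out to threads
      #pragma omp parallel for
      for (int c=0; c<BLOCKS; c++)
      {
         // Low-level parallelism: SIMD inside each block via IPP
         CB[c]->ComputeMean();
      }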

Vision Experts