Wednesday 8 August 2012

Multicore or SIMD?


I've recently been optimizing one of my image processing libraries and wanted to share my results with you.  Two acceleration methods that are relatively straightforward for me to implement, and therefore have a high return on investment, are:
  1. Using multi-core parallelism via OpenMP
  2. Using SIMD instructions via the Intel IPP
This post shows my results.  Which helped more?

Preface:

I always design my image processing libraries so that all the high-level complexity draws on a separate low-level toolbox of processing functions.  In this particular algorithm, the image was divided into blocks.  Each independent block was encapsulated by a simple block class.  The block class contained implementations of all the basic arithmetic processing from which the high-level functions were built.  From the outset, I had in mind that every block could eventually be processed in parallel and that the individual arithmetic functions could be accelerated using SIMD instructions.
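To make the design concrete, here is a minimal sketch of the kind of block class described above. The names (`Block`, `ComputeMean`) are illustrative stand-ins, not my actual classes: each block owns its span of float pixels and exposes the low-level arithmetic that the high-level functions are built from.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Illustrative sketch of a per-block unit of work. Each block owns its
// own pixel data, so blocks can later be processed fully in parallel.
class Block
{
public:
    explicit Block(std::vector<float> pixels) : m_data(std::move(pixels)) {}

    // Plain single-threaded mean; this is exactly the kind of low-level
    // function that can later be swapped for a SIMD (IPP) equivalent.
    float ComputeMean() const
    {
        float sum = std::accumulate(m_data.begin(), m_data.end(), 0.0f);
        return sum / static_cast<float>(m_data.size());
    }

private:
    std::vector<float> m_data;
};
```

Because each block touches only its own data, no locking is needed when the blocks are later farmed out to threads.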

I started with vanilla C++, single-threaded implementations of the arithmetic functions.  Once everything was working and debugged, I could add parallelism using OpenMP and SIMD using the Intel IPP without a huge effort.

OpenMP

I love OpenMP.  It's so simple to use and so powerful that it lets me leverage multi-core processors with almost no effort.  What's more, if you have Visual Studio 2010 then you already have OpenMP.  You just need to switch it on in the project properties under the C/C++ language tab, as below:

Adding OpenMP support to a C++ project in VS2010



Using OpenMP was easy for my project, since I had already designed the algorithm to process the image as a series of independent blocks.  This is my original block loop:

      for (int c=0; c<BLOCKS; c++)
      {
         vals[c] = CB[c]->ComputeMean();   // one result per block
      }
 
Using OpenMP is as easy as switching it on (Figure 1) and then placing the #pragma omp parallel for compiler directive before the loop:
 
      #pragma omp parallel for
      for (int c=0; c<BLOCKS; c++)
      {
         vals[c] = CB[c]->ComputeMean();   // independent writes, so no data race
      }
 
No changes to the loop body are required - it really couldn't be easier.  Note that each iteration writes to its own element of vals; assigning every result to a single shared variable would create a data race once the loop runs in parallel.  It just requires some thought at the outset about how to partition the algorithm so that the parallelism can be leveraged.  It is indeed faster.

Note: In VS2008, OpenMP is not available in the Standard edition, but if you have VS2010 you can still find and use vcomp.lib and omp.h with VS2008.  I guess you could use the libraries with any version of Visual Studio, even the free Express editions, although I'm not sure what the licensing/distribution restrictions are when doing that.
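One related pattern worth mentioning: when the per-block results must be combined into a single number (say, a whole-image mean), OpenMP's reduction clause gives each thread a private accumulator and merges them at the end, again avoiding any data race. This is a hedged sketch under assumed names (GlobalMean, blocks), not code from my library:

```cpp
#include <cstddef>
#include <vector>

// Combine per-block sums into one global mean. The reduction clause
// gives every thread private copies of 'total' and 'count' and adds
// them together when the parallel region ends. Without OpenMP enabled
// the pragma is ignored and the loop simply runs serially.
float GlobalMean(const std::vector<std::vector<float> >& blocks)
{
    float total = 0.0f;
    long  count = 0;
    #pragma omp parallel for reduction(+:total,count)
    for (int c = 0; c < static_cast<int>(blocks.size()); c++)
    {
        for (std::size_t i = 0; i < blocks[c].size(); i++)
            total += blocks[c][i];
        count += static_cast<long>(blocks[c].size());
    }
    return total / static_cast<float>(count);
}
```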

Intel IPP 

I own a developer license for the Intel Integrated Performance Primitives.  Since the processing already used my own vector processing functions, swapping them for equivalent IPP versions was straightforward.  Take for example this very simple loop to compute the average of a vector:

   float Sum = 0.0f;
   for (int n = 0; n < m_nE; n++)
   {
      Sum += m_pFloatData[n];
   }
   m_fMean = Sum / (float)m_nE;

This has a direct equivalent in the IPP:
 
   ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);  

This single function performs the same vector average and puts the result in the same output variable, but uses SSE instructions to pack the floats (yes, it's float image data here, but that's another story) so that multiple values are processed in parallel.  It is indeed faster.
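To give a feel for what a SIMD mean does under the hood, here is a rough illustration using raw SSE intrinsics. This is not Intel's actual implementation, just a sketch of the idea: four floats are packed into one 128-bit register and accumulated in parallel, with a scalar loop mopping up the remainder.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Illustrative SIMD mean: process four floats per iteration, then
// horizontally add the four partial sums and handle any leftover
// elements with plain scalar code.
float SseMean(const float* data, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; ++i)   // scalar remainder
        sum += data[i];

    return sum / static_cast<float>(n);
}
```

In practice the IPP call is preferable to hand-rolled intrinsics like these: it is already tuned per CPU generation and dispatches the best code path at runtime.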

Results

So what happened, and which method provided the most bang for the buck?  Perhaps unsurprisingly, it depends on what's going on.

Simple Functions
Block Average Normal (x1)
Block Average OMP (x1.15)
Block Average IPP (x2.42)
Block Average Combined (x2.67)

Block Scalar Mult Normal (x1)
Block Scalar Mult OMP (x2.93)
Block Scalar Mult IPP (x5.45)
Block Scalar Mult Combined (x6.21)

More Complex Functions
Block Alpha Blend Normal (x1)
Block Alpha Blend OMP (x2.06)
Block Alpha Blend IPP (x1.49)
Block Alpha Blend Combined (x1.48)

All functions were run on a quad-core i3, Win7 x64



Conclusion 

The granularity of the parallelism matters. 
  1. IPP accelerates the simple vector processing functions more than OMP
  2. OpenMP accelerates more complicated high level functions more than IPP
  3. Combining IPP and OpenMP only makes a small difference for these functions
Since IPP already uses OpenMP internally, perhaps it is not surprising that an additional, higher level of OpenMP parallelism does not yield a large speed increase.  However, for higher-level functions that combine tens of low-level functions, I have found OpenMP to add considerable value.  I'm sure the reasons are a complex combination of memory bandwidth, cache and processor loading, but the general rule of using OpenMP for high-level parallelism and IPP for low-level parallelism seems sensible.
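The "OpenMP at the top, SIMD at the bottom" rule can be sketched like this. ProcessBlock is a hypothetical stand-in for a low-level SIMD/IPP routine; the high-level loop requests exactly one thread per logical core via num_threads, so the SIMD layer underneath is not fighting a second, nested pool of OpenMP threads for the same cores.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for a low-level, SIMD-accelerated function.
float ProcessBlock(int c)
{
    return static_cast<float>(c) * 2.0f;
}

// High-level parallelism only: one thread per logical core, each
// thread handling whole blocks and calling the SIMD routine inside.
void ProcessAllBlocks(std::vector<float>& results)
{
#ifdef _OPENMP
    const int nThreads = omp_get_num_procs();
#else
    const int nThreads = 1;
#endif
    #pragma omp parallel for num_threads(nThreads)
    for (int c = 0; c < static_cast<int>(results.size()); c++)
    {
        results[c] = ProcessBlock(c);   // independent per-block writes
    }
}
```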

Vision Experts

6 comments:

  1. If you can predict the usage pattern of your vision library, you can with very little effort get a lot of bang-for-the-buck with openMP - as you mention.

    The problem is the prediction step. Let us take the example of simple Gaussian filtering. If all your application is ever doing is Gaussian filtering, you might as well #pragma omp the convolution step across all the cores available.

    But what if your fast OpenMP'ed Gaussian algorithm is to be used in the context of a more advanced algorithm, e.g. as a prefiltering step in SURF/SIFT multi-view matching?

    Here you probably want to filter all images in parallel and possibly start extracting and classifying descriptors as they appear. Do you still want X Gaussian filtering algorithms to each try to max out all cores with an #omp pragma? Probably not.

    Controlling the affinity is the hard part about OpenMP, as is predicting how your users (or you yourself) are going to use your newly created building block in a future parallel context.

    Having worked with IPP, I assume you also know Intel TBB (Thread Building Blocks). If not, it would be worth taking a look!

    http://threadingbuildingblocks.org/

    1. Good points Jakob. These tools make it relatively easy to add parallelism at every level of the implementation, and third party libraries often hide what parallelism is used internally. I wonder if the nested parallelism you talk about could ultimately reduce performance!

  2. You may want to check this post:
    http://www.briancbecker.com/blog/2010/analysis-of-jpeg-decoding-speeds/

    According to the post:
    1. A manually optimized, single-core libjpeg-simd is 20-30% faster than the multi-core IPP library.
    2. Doubling the number of cores makes libjpeg-simd twice as fast as IPP.

    Some speculation:
    IPP uses multi-core optimizations internally. Therefore, using IPP on a single-core CPU will probably make IPP twice as slow as libjpeg-simd.

    My conclusions are:
    1. IPP is no match for proper manual optimization by a programmer
    2. You have more flexibility with OpenMP and you'll get close results even without assembly-level modifications
    3. IPP produces slightly faster code than VS or GCC; however, there is no difference on AMD CPUs

    YMMV

  3. That's a great analysis - thanks for the link.

    IPP v6 does not split an image across cores when (de)compressing. I think that capability was introduced in IPP v7 and requires use of the restart markers in the JPEG spec. Individual function calls (quantize, DCT, etc.) do use OpenMP internally, so they must split the work on individual JPEG blocks across cores. That fine granularity is probably not quite as efficient.

    My implementation pre-allocates a thread pool so that all cores are kept busy (de)compressing an entire image, plus it uses the IPP to apply SIMD to each JPEG block in each image. Overall, this is pretty darn fast on a multi-core machine, and I do see a close-to-linear speed-up on multi-core machines.

    Thanks again for your comments

    1. IPP is great for a single process, yet in a multi-process environment it suffers unexpected slowdowns. I use a dual Xeon with each CPU having 6 cores. I wrote code that splits the data across bound threads and merges back the results. The code itself is written in SIMD (SSE2-SSE4.2) and the speed-ups are amazing, as long as there is no memory saturation.
