I've recently been optimizing one of my image processing libraries and wanted to share the results. Two acceleration methods that are relatively straightforward for me to implement, and therefore offer a high return on investment, are:
- Using multi-core parallelism via OpenMP
- Using SIMD instructions via the Intel IPP
Preface:
I always design my image processing libraries so that all the high-level complexity makes use of a separate low-level toolbox of processing functions. In this particular algorithm, the image was divided into blocks, each encapsulated by a simple block class. The block class contained implementations of all the basic arithmetic from which the high-level functions were built. From the outset, I had in mind that every block could eventually be processed in parallel, and that the individual arithmetic functions could be accelerated using SIMD instructions.
I started with vanilla C++, single-threaded implementations of the arithmetic functions. Once everything was working and debugged, I could add parallelism using OpenMP and SIMD using Intel IPP without a huge effort.
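As a sketch of that structure (the class and function names here are my illustration, not the actual library), the block class might look something like this:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical sketch of the "block" design described above: each block owns
// its pixels and exposes the low-level arithmetic toolbox from which the
// high-level functions are built. Names are illustrative, not the real code.
class Block {
public:
    explicit Block(std::vector<float> pixels) : m_data(std::move(pixels)) {}

    // Low-level toolbox functions; each is a candidate for SIMD acceleration.
    float ComputeMean() const {
        float sum = std::accumulate(m_data.begin(), m_data.end(), 0.0f);
        return sum / static_cast<float>(m_data.size());
    }

    void ScalarMult(float k) {
        for (float& v : m_data) v *= k;
    }

private:
    std::vector<float> m_data;
};
```

Because the high-level algorithm only ever talks to blocks through this small interface, each function body can later be swapped for an IPP call or wrapped in an OpenMP loop without touching the callers.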
OpenMP
I love OMP. It's so simple to use, and so powerful: it lets me leverage multi-core processors with almost no effort. What's more, if you have Visual Studio 2010 then you already have OpenMP. You just need to switch it on in the project properties under the C/C++ language tab, as below:
[Figure 1: Adding OpenMP support to a C++ project in VS2010]
Using OpenMP was easy for my project, since I had already designed the algorithm to process the image as a series of independent blocks. This is my original block loop:
for (int c = 0; c < BLOCKS; c++) {
    val = CB[c]->ComputeMean();
}
Using OpenMP is as easy as switching it on (Figure 1) and then adding the #pragma omp parallel for directive before the for loop:
#pragma omp parallel for
for (int c = 0; c < BLOCKS; c++) {
    val = CB[c]->ComputeMean();
}
No code changes required - it really couldn't be easier. It just requires some thought at the outset about how to partition the algorithm so that the parallelism can be leveraged. It is indeed faster.
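One caveat worth noting with the loop above: if val is a single variable shared by all iterations, every thread races to write it once the loop goes parallel. Storing one result per block sidesteps the race entirely. A minimal sketch (ComputeMeanOfBlock is my stand-in for the real CB[c]->ComputeMean() call):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the real per-block mean; illustrative only.
static float ComputeMeanOfBlock(const std::vector<float>& block) {
    float sum = 0.0f;
    for (float v : block) sum += v;
    return sum / static_cast<float>(block.size());
}

std::vector<float> MeanPerBlock(const std::vector<std::vector<float>>& blocks) {
    std::vector<float> means(blocks.size());
    // Each iteration writes only its own slot, so no thread ever touches
    // another thread's result and no synchronization is needed.
    #pragma omp parallel for
    for (int c = 0; c < static_cast<int>(blocks.size()); c++) {
        means[c] = ComputeMeanOfBlock(blocks[c]);
    }
    return means;
}
```

The pragma is simply ignored by compilers without OpenMP enabled, so the serial and parallel builds share one code path.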
Note: In VS2008, OpenMP is not available in the Standard edition, but if you have VS2010 you can still find and use vcomp.lib and omp.h with VS2008. I guess you could use the libraries with any version of Visual Studio, even the free Express editions, although I'm not sure what the licensing/distribution restrictions are when doing that.
Intel IPP
I own a developer license for the Intel Integrated Performance Primitives. Since the processing already used my own vector processing functions, swapping them for equivalent IPP versions was straightforward. Take, for example, this very simple loop to compute the average of a vector:
int n;
float Sum = 0.0f;
for (n = 0; n < m_nE; n++) {
    Sum += *(m_pFloatData + n);
}
m_fMean = (Sum / (float)m_nE);
This has a direct equivalent in the IPP:
ippsMean_32f(m_pFloatData, m_nE, &m_fMean, ippAlgHintFast);
This single function performs the same vector average and puts the result in the same output variable, but uses SSE instructions to pack the floats (yes, it's float image data here, but that's another story) so that multiple values are processed in parallel. It is indeed faster.
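To give a feel for what a call like ippsMean_32f is doing under the hood, here is a rough SSE sketch of the same vector mean. This is my illustration, not Intel's implementation; it assumes an x86 target and handles any leftover elements with a scalar tail loop:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Rough sketch of an IPP-style packed mean: accumulate four floats per
// instruction, then reduce the four partial sums and finish the tail in
// scalar code. Illustrative only - not the actual IPP implementation.
float SseMean(const float* data, int n) {
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  // 4 adds at once
    }
    // Horizontal reduction of the 4 partial sums.
    float partial[4];
    _mm_storeu_ps(partial, acc);
    float sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < n; i++) sum += data[i];  // scalar tail for n % 4 elements
    return sum / static_cast<float>(n);
}
```

The library version will be considerably more sophisticated (alignment handling, wider instruction sets, internal threading), but the packing idea is the same.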
Results
So what happened, and which method provided the most bang for the buck? Perhaps unsurprisingly, it depends on what's going on.
Simple Functions
Block Average Normal (x1)
Block Average OMP (x1.15)
Block Average IPP (x2.42)
Block Average Combined (x2.67)
Block Scalar Mult Normal (x1)
Block Scalar Mult OMP (x2.93)
Block Scalar Mult IPP (x5.45)
Block Scalar Mult Combined (x6.21)
More Complex Functions
Block Alpha Blend Normal (x1)
Block Alpha Blend OMP (x2.06)
Block Alpha Blend IPP (x1.49)
Block Alpha Blend Combined (x1.48)
All functions were timed on a quad-core i3, Win7 x64.
Conclusion
The granularity of the parallelism matters:
- IPP accelerates the simple vector processing functions more than OMP
- OpenMP accelerates more complicated high level functions more than IPP
- Combining IPP and OpenMP only makes a small difference for these functions
Since IPP already uses OpenMP internally, perhaps it is not surprising that an additional, higher level of OpenMP parallelism does not yield a large speed increase. However, for higher-level functions that combine tens of low-level functions, I have found OMP to add considerable value. I'm sure the reasons are a complex combination of memory bandwidth, cache and processor loading, but the general rule of using OMP for high-level parallelism and IPP for low-level parallelism seems sensible.
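That rule of thumb can be sketched without IPP at all: an OpenMP loop supplies the coarse, per-block parallelism, while packed SSE code (standing in here for an IPP call - the names are my illustration) supplies the fine, per-element parallelism inside each block:

```cpp
#include <cassert>
#include <vector>
#include <xmmintrin.h>

// Fine-grained level: scale a block's floats four at a time with SSE.
// This stands in for an IPP primitive such as a scalar multiply.
static void SseScale(float* data, int n, float k) {
    const __m128 kk = _mm_set1_ps(k);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), kk));
    for (; i < n; i++) data[i] *= k;  // scalar tail
}

// Coarse-grained level: one OpenMP iteration per independent block.
void ScaleBlocks(std::vector<std::vector<float>>& blocks, float k) {
    #pragma omp parallel for
    for (int c = 0; c < static_cast<int>(blocks.size()); c++)
        SseScale(blocks[c].data(), static_cast<int>(blocks[c].size()), k);
}
```

Each level exploits a different kind of hardware parallelism (cores vs. SIMD lanes), which is why the two combine at all - and why the combination saturates once memory bandwidth becomes the bottleneck.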
Vision Experts
If you can predict the usage pattern of your vision library, you can, with very little effort, get a lot of bang for the buck with OpenMP - as you mention.
The problem is the prediction step. Take the example of simple Gaussian filtering. If all your application ever does is Gaussian filtering, you might as well #pragma omp the convolution step across all the available cores.
But what if your fast OpenMP'ed Gaussian algorithm is to be used in the context of a more advanced algorithm, e.g. as a prefiltering step in SURF/SIFT multi-view matching? Here you probably want to filter all images in parallel and possibly start extracting and classifying descriptors as they appear. Do you still want X Gaussian filtering instances each trying to max out all the cores with an OMP pragma? Probably not.
Controlling the affinity is the hard part of OpenMP - along with predicting how your users (or you yourself) are going to use your newly created building block in a future parallel context.
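One common guard against the oversubscription described above is to let a low-level loop go parallel only when it is not already inside a parallel region, so an outer per-image loop keeps the inner filter serial. A minimal sketch (GaussianRow is a hypothetical placeholder, not a real filter kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_in_parallel() { return 0; }  // serial-build fallback
#endif

// Hypothetical stand-in for a filter kernel: a 3-tap box average,
// illustrative only (a real Gaussian would use weighted taps).
static void GaussianRow(std::vector<float>& row) {
    std::vector<float> src = row;
    for (std::size_t i = 1; i + 1 < src.size(); i++)
        row[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
}

void FilterImage(std::vector<std::vector<float>>& rows) {
    // The if() clause disables this level of parallelism whenever a caller
    // has already parallelized (e.g. over whole images), avoiding N filters
    // each trying to claim every core.
    #pragma omp parallel for if(!omp_in_parallel())
    for (int r = 0; r < static_cast<int>(rows.size()); r++)
        GaussianRow(rows[r]);
}
```

This keeps the building block safe to drop into either a serial pipeline or an already-parallel one without recompiling two variants.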
Having worked with IPP, I assume you also know Intel TBB (Threading Building Blocks). If not, it would be worth taking a look!
http://threadingbuildingblocks.org/
Good points, Jakob. These tools make it relatively easy to add parallelism at every level of the implementation, and third-party libraries often hide what parallelism is used internally. I wonder if the nested parallelism you describe could ultimately reduce performance!
You may want to check this post:
http://www.briancbecker.com/blog/2010/analysis-of-jpeg-decoding-speeds/
According to the post:
1. Manually optimized single core libjpeg-simd is 20-30% faster than multi-core IPP library.
2. Doubling the number of cores makes libjpeg-simd twice as fast as IPP
Some speculations:
IPP uses multi-core optimizations internally. Therefore, using IPP on a single-core CPU will probably make IPP twice as slow as libjpeg-simd.
My conclusions are:
1. IPP is no match for proper manual optimization by a programmer
2. You have more flexibility with OpenMP, and you'll get close results even without assembly-level modifications
3. IPP produces slightly faster code than VS or GCC, but there is no difference on AMD CPUs
YMMV
That's a great analysis - thanks for the link.
IPP v6 does not split an image across cores when (de)compressing. I think that capability was introduced in IPP v7 and requires use of the restart markers in the JPEG spec. Individual function calls (quantize, DCT, etc.) do use OpenMP internally, so they must split the work on individual JPEG blocks across cores. That fine granularity is probably not quite as efficient.
My implementation pre-allocates a thread pool so that all cores are kept busy (de)compressing entire images, plus it uses IPP to apply SIMD to each JPEG block in each image. Overall, this is pretty darn fast, and I do see a close-to-linear speed-up on multi-core machines.
Thanks again for your comments
IPP is great for a single process, yet in a multi-process environment it suffers unexpected slowdowns. I use a dual Xeon with each CPU having 6 cores. I wrote code that splits the data across bound threads and merges the results back. The code itself is written in SIMD (SSE2-SSE4.2) and the speed-ups are amazing - as long as there is no memory saturation.