Friday, 26 April 2013

Multicore MJPG

Over the years, I've spent quite a lot of time developing and tweaking my own highly optimised JPG compressor.  It's fast.  I've made heavy use of the Intel Integrated Performance Primitives (IPP) to leverage SSE instructions and get an impressive performance boost over the other third-party libraries I know of.

However, my implementation was restricted to a single thread per image, which capped performance when encoding video.  Yes, I know there are ways to parallelise the JPG algorithm within a single frame using restart intervals, but that would have been complex to retrofit into my existing code.

For video encoding, I could see a simpler way forward: encode sequential frames concurrently on separate threads.

My new vxMJPG library compresses each image on its own thread, so sequential frames are encoded concurrently.  This means that when processing sequences of images, the overall performance is vastly improved.  In fact, on a quad-core processor the library achieves close to a 4x speedup over the single-threaded version.

Multi-Threaded Compression

The problem with running multiple threads, each encoding a separate frame on a separate core, is ensuring that the frames remain in sequence when added to the output video file.  Without proper synchronisation, it is entirely possible for the frames to end up out of order.

Each thread in a thread pool is launched when the library is initialised, then enters a loop waiting for new data to process.  When a new frame arrives, the next thread in sequence is selected from the pool.  The scheduler waits until that thread fires its 'Ready' event, signalling that it is idle and waiting for new data.  Once Ready is signalled, the thread is handed the frame data and the scheduler fires a 'DoProcess' event, which allows the thread to proceed with compressing the frame.


Figure 1. A single encoding thread signals 'Ready', then waits for 'DoProcess' before compressing.  It must then wait for the previous thread to complete before adding its compressed frame to the file, and signals the next thread before returning to wait for more data.
At this point, the scheduler is done and ready to receive a new frame, whilst the thread performs its time-consuming compression task.
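
To make the hand-off concrete, here is a minimal sketch of the scheduler side using Win32 auto-reset events (all created with CreateEvent(NULL, FALSE, FALSE, NULL)).  The names - Worker, Scheduler, SubmitFrame - are mine for illustration, not the actual vxMJPG API:

    #include <windows.h>
    #include <vector>

    struct Worker
    {
        HANDLE hReady;          // fired by the worker: idle, waiting for data
        HANDLE hDoProcess;      // fired by the scheduler: a frame has been handed over
        HANDLE hFrameComplete;  // fired by the worker once its frame is written
        volatile LONG terminate;
        const void* pendingFrame;
        size_t pendingSize;
    };

    class Scheduler
    {
    public:
        Scheduler() : m_next(0) {}   // workers and their events are created at init

        // Hand the next raw frame to the next worker in strict rotation.
        void SubmitFrame(const void* frame, size_t size)
        {
            Worker& w = m_workers[m_next];
            m_next = (m_next + 1) % m_workers.size();

            // Block only until this worker is idle; with one worker per core
            // this wait is normally zero unless every core is still busy.
            WaitForSingleObject(w.hReady, INFINITE);

            w.pendingFrame = frame;
            w.pendingSize  = size;
            SetEvent(w.hDoProcess);   // release the worker and return immediately
        }

        void Shutdown();   // see the termination sketch below

    private:
        std::vector<Worker> m_workers;
        size_t m_next;
    };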
 
After compression completes, each thread must wait on a 'Frame Completed' event from the previous thread in the pool before it is permitted to write to the file.  This ensures that even when all threads are compressing concurrently, the frames are written in the correct order.
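
A minimal sketch of the matching worker loop; Compress() and WriteToFile() are hypothetical stand-ins for the real IPP-based encoder and the container writer:

    #include <windows.h>

    struct CompressedFrame { /* encoded JPG bytes */ };
    CompressedFrame Compress(const void* pixels, size_t size);  // stand-in for the encoder
    void WriteToFile(const CompressedFrame& jpg);               // appends to the .mjpg file

    struct WorkerContext
    {
        HANDLE hReady;             // this thread is idle, waiting for data
        HANDLE hDoProcess;         // the scheduler has handed this thread a frame
        HANDLE hPrevFrameComplete; // the previous thread in the rotation has written
        HANDLE hFrameComplete;     // this thread has written (the next thread waits on it)
        volatile LONG terminate;   // set by the scheduler before the final DoProcess
        const void* pendingFrame;
        size_t pendingSize;
    };

    DWORD WINAPI WorkerThread(LPVOID param)
    {
        WorkerContext* ctx = static_cast<WorkerContext*>(param);
        for (;;)
        {
            SetEvent(ctx->hReady);                           // signal: waiting for data
            WaitForSingleObject(ctx->hDoProcess, INFINITE);  // block until a frame arrives
            if (ctx->terminate)
                return 0;

            CompressedFrame jpg = Compress(ctx->pendingFrame, ctx->pendingSize);

            // Ordering guarantee: do not touch the file until the previous thread
            // has finished writing its (earlier) frame.  The previous-frame event
            // for the thread that receives frame 0 must start in the signalled state.
            WaitForSingleObject(ctx->hPrevFrameComplete, INFINITE);
            WriteToFile(jpg);
            SetEvent(ctx->hFrameComplete);  // unblock the next thread's write
        }
    }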



Finally, when the frame is written, the thread can signal 'Ready' ready for the scheduler's next visit.  When the data stops arriving, the threads naturally finish writing their frames in sequence and simply all enter the wait state.

Termination is achieved by setting a terminate flag in each thread and then firing its 'DoProcess' event; the thread wakes, sees the flag and exits instead of waiting for another frame.
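
Extending the scheduler sketch above (again an illustration rather than the real API):

    void Scheduler::Shutdown()
    {
        for (size_t i = 0; i < m_workers.size(); ++i)
        {
            // Wait until the worker goes idle so no frame is dropped, then set
            // the flag and wake it; the worker loop sees the flag and returns.
            WaitForSingleObject(m_workers[i].hReady, INFINITE);
            InterlockedExchange(&m_workers[i].terminate, 1);
            SetEvent(m_workers[i].hDoProcess);
        }
    }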


Compression Benchmark

I tested the compressor in both single- and multi-threaded modes using high-def 1920x1080 colour bitmaps: two input frames, repeated 250 times.

That's 1920x1080 RGB x 500 frames = 3.1GB of input data to chew through.  I'm using an Intel Core 2 Quad Q8400 @ 2.66GHz, with 4GB of DDR3 1066MHz RAM, on 32-bit Windows 7.
 
Single Threaded:  
Total time 11.66 seconds
Avg time per frame = 23.2ms
Avg data consumption = 268 MB/sec

Multi Threaded:  
Total time 3.23 seconds
Avg time per frame = 6.5ms
Avg data consumption = 963 MB/sec
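
As a rough sanity check on those numbers: 1920 x 1080 x 3 bytes is about 6.2MB per frame, so 500 frames is about 3.1GB.  3.1GB in 11.66 seconds is roughly 267MB/sec, 3.1GB in 3.23 seconds is roughly 963MB/sec, and 11.66 / 3.23 is a 3.6x speedup - close to the ideal 4x on four cores.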

I am aware that the input images will be cached, so this is probably higher performance than I would see with a live camera, but it's still impressive.  Recording multiple live HD colour streams to MJPG is no problem at all, even on a modest embedded PC.  MJPG does fill up drives pretty quickly, but it is far better than recording RAW, which needs RAID arrays or SSDs to sustain.

At present, I'm using vxMJPG in my (sorry, shameless plug) Gecko GigE recorder with Stemmer Imaging cameras for high-speed video recording projects.  The output (*.mjpg) files use my own thin custom container format, but they play in VLC and can be transcoded from there.  Unfortunately, they don't play in Windows Media Player; that's the price of speed.  To help portability, I also have my own MJPG player for these files, with features like slow-mo, frame stepping and frame saving.

My next project is to wrap this up in a DirectShow encoder so that I can also produce AVI files.  However, that might have to be single-threaded: I cannot currently see a way of making the encoder in a DirectShow filter graph safely asynchronous in the manner I am employing here.  Fortunately, even single-threaded, my version should be vastly faster than the codecs I am currently using - and I shall report those results when I have them.
 
Vision Experts

Friday, 11 January 2013

Fast Pattern Features

Inventing really fast pattern matching algorithms is good fun. 

I've recently been looking at methods to accelerate one of my pattern matching algorithms and thought I'd share some of the techniques with you.  It takes a lot of precomputation, but alignment is then really fast.  The algorithm is intended for registering a high-resolution template to a similar target image, for applications where some alignment is required but not a full image search.

Feature Patches

Performing image correlation is slow, but it can be accelerated by not using every pixel.  The algorithm I'm using splits a high-resolution trained template image into multiple small feature patches.  Only the best patches - those with the strongest structure-tensor response - are then used for alignment (Figure 1).  Reducing the image to its key features cuts the amount of correlation required significantly.
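
As a sketch of that selection step (plain C++ for clarity; the function name and image layout are my own assumptions), each candidate patch can be scored by the smaller eigenvalue of its accumulated 2x2 structure tensor, which is large only when the patch contains strong gradients in two different directions:

    #include <cmath>

    // Score one w x h patch at (x0, y0) of an 8-bit greyscale, row-major image.
    double PatchScore(const unsigned char* img, int stride,
                      int x0, int y0, int w, int h)
    {
        double Jxx = 0, Jxy = 0, Jyy = 0;
        for (int y = y0 + 1; y < y0 + h - 1; ++y)
        {
            for (int x = x0 + 1; x < x0 + w - 1; ++x)
            {
                const unsigned char* p = img + y * stride + x;
                double gx = 0.5 * (p[1] - p[-1]);            // central differences
                double gy = 0.5 * (p[stride] - p[-stride]);
                Jxx += gx * gx;  Jxy += gx * gy;  Jyy += gy * gy;
            }
        }
        // Smaller eigenvalue of [Jxx Jxy; Jxy Jyy].
        double tr  = Jxx + Jyy;
        double det = Jxx * Jyy - Jxy * Jxy;
        return 0.5 * (tr - std::sqrt(tr * tr - 4.0 * det));
    }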

Patches that are close together are pruned, leaving the more widely spaced ones; we don't want all the patches clustered together, since that would make estimating global rotation or scale more sensitive to local distortions.  This feature extraction dramatically cuts down the number of pixels actually required to find a good match using any correlation-based approach.  Using small patches (in my case between 16x16 and 32x32 pixels) has the advantage that each patch can usually be matched using a straightforward translation-only search.  After all (or most) of the patches are matched, a full affine deformation can still be computed from the individual match locations.
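
That last step can be done with a least-squares fit over the matched patch centres.  A minimal sketch (my own illustration, not the production code) of recovering u = a*x + b*y + c, v = d*x + e*y + f:

    #include <cmath>
    #include <cstddef>

    struct Pt { double x, y; };

    // Least-squares affine fit: src[i] in the template maps to dst[i] in the
    // target.  Outputs {a,b,c,d,e,f}; returns false if the fit is degenerate.
    bool FitAffine(const Pt* src, const Pt* dst, size_t n, double out[6])
    {
        if (n < 3) return false;                 // an affine needs 3+ matches
        double M[3][3] = {}, ru[3] = {}, rv[3] = {};
        for (size_t i = 0; i < n; ++i)
        {
            double r[3] = { src[i].x, src[i].y, 1.0 };
            for (int a = 0; a < 3; ++a)
            {
                for (int b = 0; b < 3; ++b) M[a][b] += r[a] * r[b];
                ru[a] += r[a] * dst[i].x;        // right-hand sides for u and v
                rv[a] += r[a] * dst[i].y;
            }
        }
        // Invert the symmetric 3x3 normal matrix via its adjugate.
        double det = M[0][0]*(M[1][1]*M[2][2]-M[1][2]*M[2][1])
                   - M[0][1]*(M[1][0]*M[2][2]-M[1][2]*M[2][0])
                   + M[0][2]*(M[1][0]*M[2][1]-M[1][1]*M[2][0]);
        if (std::fabs(det) < 1e-12) return false;   // patches laid out degenerately
        double inv[3][3] = {
            { (M[1][1]*M[2][2]-M[1][2]*M[2][1])/det, (M[0][2]*M[2][1]-M[0][1]*M[2][2])/det, (M[0][1]*M[1][2]-M[0][2]*M[1][1])/det },
            { (M[1][2]*M[2][0]-M[1][0]*M[2][2])/det, (M[0][0]*M[2][2]-M[0][2]*M[2][0])/det, (M[0][2]*M[1][0]-M[0][0]*M[1][2])/det },
            { (M[1][0]*M[2][1]-M[1][1]*M[2][0])/det, (M[0][1]*M[2][0]-M[0][0]*M[2][1])/det, (M[0][0]*M[1][1]-M[0][1]*M[1][0])/det } };
        for (int a = 0; a < 3; ++a)
        {
            out[a]     = inv[a][0]*ru[0] + inv[a][1]*ru[1] + inv[a][2]*ru[2];  // a, b, c
            out[3 + a] = inv[a][0]*rv[0] + inv[a][1]*rv[1] + inv[a][2]*rv[2];  // d, e, f
        }
        return true;
    }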

Figure 1. High-structure patch regions can be located using the structure tensor.  These can be used to reduce the number of pixels required to make a match.

Fast Feature Vectors

Even searching for a small number (e.g. 16) of small patches over a reasonable search region takes too much time for most applications.  Fortunately, a significant acceleration can be achieved by rapidly rejecting very poor matches using a candidate search routine.  Most of the possible match positions will be very different from the correct one, so we can rapidly discriminate between a candidate match and a non-match by comparing just a few carefully selected pixels in each patch (Figure 2).
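
In code, the quick-reject test amounts to something like the following sketch (FeaturePixel, the tolerance and the calling convention are all illustrative assumptions):

    #include <cstdlib>

    struct FeaturePixel { int dx, dy; int value; };  // offset within patch + trained grey level

    // Compare the few pre-selected pixels before committing to a full patch
    // correlation; most candidate positions fail on the first comparison or two.
    bool QuickAccept(const unsigned char* img, int stride, int cx, int cy,
                     const FeaturePixel* fp, int nFp, int tol)
    {
        for (int i = 0; i < nFp; ++i)
        {
            int v = img[(cy + fp[i].dy) * stride + (cx + fp[i].dx)];
            if (std::abs(v - fp[i].value) > tol)
                return false;   // cheap rejection - skip full correlation here
        }
        return true;            // promising - run the full patch correlation
    }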


Figure 2.  A high-structure patch can be further reduced to a feature-pixel representation.  Key pixels can be used to rapidly reject non-matches during the search.  The problem is determining which pixels are optimal to use.

The interesting part of the algorithm is deciding which combination of pixels (and how many) inside each patch is required to provide good discrimination between a possible match and a probable non-match.  I decided that a measure of 'goodness' for a set of pixels can be found by auto-correlating the set with the patch: highly discriminative sets will have the fewest match locations, with a really good set having only one.

Clearly, there are an awful lot of potential combinations of pixels that could be used, and a brute-force test of all of them is not feasible, so we must cut down the search space.  We can guess that discriminating pixels lie on or near an edge dividing areas of different intensity.  However, the intensity of a pixel directly on a strong edge is not very robust to sub-pixel translations, so those should probably not be used (unless we pre-blur).  We can also guess that descriptive pixels do not lie in plain, featureless regions, since such pixels are not uniquely matchable.  In short, the most discriminative pixels lie neither on the highest-gradient pixels (directly on edges) nor on the lowest-gradient pixels (too plain); restricting the candidates to pixels at a local contrast maximum reduces the optimisation considerably, as sketched below.
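
One way to implement that restriction (my illustration; the 3x3 contrast measure and both thresholds are invented for the example) is to keep pixels whose local contrast is a local maximum and falls between a 'too plain' and a 'too edgy' threshold:

    #include <algorithm>
    #include <vector>

    struct Candidate { int x, y; double score; };

    // 3x3 max-min grey-level range: a cheap local contrast measure.
    static int Contrast3x3(const unsigned char* img, int stride, int x, int y)
    {
        int lo = 255, hi = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
            {
                int v = img[(y + dy) * stride + (x + dx)];
                lo = std::min(lo, v);  hi = std::max(hi, v);
            }
        return hi - lo;
    }

    static bool ByScoreDesc(const Candidate& a, const Candidate& b)
    { return a.score > b.score; }

    // Keep up to maxN pixels whose contrast is a local maximum and falls
    // between the two thresholds (neither hard edge nor featureless).
    std::vector<Candidate> SelectCandidates(const unsigned char* img, int stride,
                                            int w, int h, int tooLow, int tooHigh,
                                            size_t maxN = 32)
    {
        std::vector<Candidate> out;
        for (int y = 2; y < h - 2; ++y)
            for (int x = 2; x < w - 2; ++x)
            {
                int c = Contrast3x3(img, stride, x, y);
                if (c <= tooLow || c >= tooHigh) continue;
                bool localMax = true;
                for (int dy = -1; dy <= 1 && localMax; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        if ((dx || dy) && Contrast3x3(img, stride, x + dx, y + dy) > c)
                            { localMax = false; break; }
                if (localMax) { Candidate k = { x, y, (double)c }; out.push_back(k); }
            }
        // Strongest first, then truncate to the candidate pool size.
        std::sort(out.begin(), out.end(), ByScoreDesc);
        if (out.size() > maxN) out.resize(maxN);
        return out;
    }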

There are still a large number of possible combinations, but now few enough to enumerate and test in a feasible amount of time; I limit the candidate pool to the top 32 such pixels.

Every possible combination of four of these candidate pixels is then tested as a potential discriminating arrangement.  That is C(32,4) = 35,960 distinct sets to test (enumerating ordered selections instead would give 32^4, about a million).  Each set of four pixels is auto-correlated with the entire patch search region and the number of matches counted.  The combination that yields the fewest false matches is chosen as the most discriminating set.  This pixel set can be used at run time to rapidly reject false match locations, with only the locations that pass this quick test going forward to full patch correlation.
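
The offline selection loop is then a straightforward enumeration.  A sketch, where CountMatches() is a hypothetical stand-in for the auto-correlation test described above:

    #include <climits>

    // Hypothetical: run the four-pixel test at every position in the patch's
    // search region and count how many positions pass.
    int CountMatches(const int set[4]);

    // Keep the 4-pixel subset of the 32 candidates with the fewest false matches.
    void SelectBestSet(int best[4])
    {
        int fewest = INT_MAX;
        for (int a = 0; a < 32; ++a)                 // C(32,4) = 35,960 subsets
          for (int b = a + 1; b < 32; ++b)
            for (int c = b + 1; c < 32; ++c)
              for (int d = c + 1; d < 32; ++d)
              {
                  int set[4] = { a, b, c, d };
                  int m = CountMatches(set);
                  if (m < fewest)
                  {
                      fewest = m;
                      best[0] = a; best[1] = b; best[2] = c; best[3] = d;
                  }
              }
    }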

This process cuts a multi-megapixel image correlation down to a very limited set of key patches, with a very limited number of pixels in each patch required to find a match location.  I have used this algorithm successfully for very rapid alignment of printed material, without recourse to pyramid formation or more complex feature matching.


Vision Experts