Friday, 26 April 2013

Multicore MJPG

Over the years, I've spent quite a lot of time developing and tweaking my own highly optimised JPG compressor.  It's fast.  I've made heavy use of the Intel Performance Primitives (IPP) to leverage SSE instructions and get impressive performance boost over other third-party libraries I know of.

However, my implementation was limited to a single thread per image, which limited performance when encoding video.  Yes, I know there are ways to parallise the JPG algorithm on a single frame using restart intervals, but that was going to be complex to implement in my existing code.  

For video encoding, I could see a simpler way forward; encode sequential frames concurrently on separate threads.

My new VXJPG library is able to compress each image on its own thread.  So sequential frames are encoded concurrently.  This means that when processing sequences of images, the overall performance is vastly improved.  In fact, on a quad core processor the library pretty much achieves a 4x speedup over the single core version.

Multi-Threaded Compression

The problem with running multiple threads, each encoding a separate frame on a seperate core, is ensuring that the frames remain in sequence when added to the output video file.  Without proper synchronisation, it is entirely possible for the frames to end up out-of-order.

Each thread in a thread pool is launched when the library is initialised, then enters a loop waiting for new data to process.  When a new frame arrives, the next thread in sequence is selected from the pool. The scheduler waits until a 'Ready' event is fired by the thread, to signal that the thread is waiting for new data.  After Ready is signaled, the thread is passed the data and a 'DoProcess' event fired by the scheduler, which allows the thread to proceed onto compressing the frame.  


Figure 1. A single encoding thread waits until a Ready event before processing.  It must then wait for the previous thread to complete before adding its compressed frame to the file.  It can then signal the next thread before returning to wait for more data.
At this point, the scheduler is done and ready to receive a new frame, whilst the thread performs its time-consuming compression task.
 
Inside the thread, after compression completes, each thread must wait on a 'Frame Completed' event from the previous thread in the pool before it is permitted to write to the file.  This ensures that even when all threads are processing concurrently, the frame sequence remains correct.  



Finally, when the frame is written, the thread can signal 'Ready' for when the scheduler next visits.  When the data stops arriving, the threads naturally finish writing frames in sequence and simply all enter the wait state.  

Termination is achieved by signalling a terminate flag in each thread prior to firing the ready event.


Benchmark Compression

I tested the compressor using both single and multi-threaded modes using high def 1920x1080 colour bitmaps, two input frames, repeated 250 times.  

That's 1920x1080 RGB x 500 frames = 3.1GB input data to chew through.  I'm using an Intel Core2 Quad Q8400@2.66GHz to do this, with 4GB DDR3 1066MHz RAM on Win7 32bit.
 
Single Threaded:  
Total time 11.66 seconds
Avg time per frame = 23.2ms
Avg data consumption = 268 MB/sec

Multi Threaded:  
Total time 3.23 seconds
Avg time per frame = 6.5ms
Avg data consumption = 963 MB/sec

I am aware that the input images will be cached, so this is probably a higher performance than I would get with a camera, but its still impressive.  Recording multiple live HD color streams to MJPG is no problem at all even on a modest embedded PC.  MJPG does fill up drives pretty quickly, but is far better than RAW, which requires RAID and SSDs to achieve. 

At present, I'm using vxMJPG for my (sorry, shameless plug) Gecko GigE recorder with Stemmer Imaging cameras for high speed video recording projects.  The output (*.mjpg) files are my own thin custom container format but will play and can be transcoded in VLC.  Unfortunately, they don't play in Windows media player.  That's the price of speed.  To help portability, I also have my own MJPG player for the files which has some features like slow-mo, frame step, save frame etc to help.

My next project is to wrap this up in a DirectShow encoder, so that I can also make AVI files. However, that might have to be single threaded.  I cannot currently see a way of making the encoder in a DS filter graph safely asynchronous in the manner I am employing.  Fortunately, even single threaded, my version should be vastly faster then the codecs I am currently using - and I shall report those results when I have them.
 
Vision Experts