Tuning your invariants

I’m currently building a pretty complex parallel computer that presently runs across multiple nodes in-process. The idea is to push the design to work across a grid of collaborating nodes, either as separate threads, processes or networked machines. I am currently working on two key areas:

  1. Limiting the number of garbage collections that are necessary.
  2. Achieving high performance in the multi-threaded core engine.

This week has been focused on both axis of performance, but recently I have been pushing my design to improve its scalability. Today I have managed to get the system to run an extreme number of networking downloads with a CPU consumption envelope remaining below 3%, which is pretty impressive. Most of this has come from harnessing IO completion ports as well as being extremely savvy with the use of threads provided by the .Net thread pool. So far I am doing well riding on the back of the .Net thread pool, but I may need to consider dedicated threads at some point – the thread pool scheduler can thrash a little in certain scenarios, but I’m not noticing that yet.

Due to the nature of this system, which I’ll explain more of soon, there can be multiple active downloaders retrieving online data from n sources. Downloaded data comes into the system and is processed via a sophisticated logical pipeline. An earlier naive implementation resulted in a saturation of the thread pool, or a profound number of context switches due to equally primitive locking strategies.

To limit or reduce garbage collections I have also needed to improve the types of data structures I am using, as well as understand better the invariants I need to uphold during runtime. For example, I have a pipeline that needs to have data live and in transit as much as possible, to utilize resources as best as can be achieved, whilst not saturating the CPU or increasing the working set pointlessly. Processors on the pipeline work at different rates, but this means that at various points in time, there could be too much waiting in the pipeline, which stresses the memory footprint.

Part of the solution comes from throttling inputs to the pipeline, and using non-collectable data structures (buffers) to handle the task items waiitng to be processed. The head of my pipeline now uses a ring buffer and a semaphore to help avoid creating thousands of network requests to fill the pipeline. The nature of the downstream elements of the pipeline is to process large volumes of data per item, which naturally results in their execution running slower than the source of the pipeline (hence the possibility of the active objects at the source filling up the buffers and killing the memory footprint.

Through careful steps I have managed to produce a highly performant parallel data pipeline without incurring much resources at all. My original design worked well in every way apart from the two key areas above. I could execute multiple concurrent searches and download and process a lot of data, but CPU utilization was around 80%. By breaking the design down, understanding the finer low-level elements of real-time programming, I have arrived at present invariants that allow my system to download the same data as before at nearly 3x the speed, using no more than 3% CPU. I’m still working on the GC issues, but thats not bad going.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: