Looking at Arrays and Lists

Most of this ramble relates to .NET 2.0 and C#.

Building or enhancing a high performance managed application comes with a set of constraints, three of which are data structure choice, algorithm choice and the garbage collection envelope. An interesting discussion arises when talking about Array vs List<T>, but it can be quickly settled by understanding the mechanics of both and where they differ. I use List<T> as the comparator simply because it is the common replacement of choice for most C# programmers I have come across.

Some think data locality is not so important in a managed host environment and you can ignore it, and largely you can, but data locality is just as important as any other axis when building a high performance application. The simple question is: does poor data locality eat into CPU cycles? Yes. Then it deserves just as much respect as the choice of algorithm. Data locality is so important that there is research into enhancing the garbage collector so that it improves locality of reference as part of its role. With that said, an array becomes a very good choice because its allocation creates one contiguous block of memory. Working with an array in our programs results in the complete block of memory being brought into the cache pipeline (caveats ignored).

So an array is created and stored contiguously in memory. From this point on you get the fastest kind of performance you can expect, because an index into the data structure is a direct address (as good as can be in a managed world), and this gives you O(1) access time anywhere in the structure. Furthermore, arrays are a fundamental type in IL, meaning their implementation is as raw as you can get, with minimal levels of indirection.

Consider:

int[] a = new int[100];
a[0] = 0x10;

Disassembly:

Listing 1

At memory location 0x013B1BF4 in my debugger I see the following:

Listing 2

We are basically looking at the heap. The first red value, 0x791240f0, is the type pointer, followed by the size of the array (0x64 == 100 decimal). At assembly line 0x00000031, 0x10 is assigned to the first element, which results in the following changes:

Listing 3

So the resulting JIT compiled code is as direct as you could get with an array, and the memory is all kept together as shown above, which results in performance and perfect data locality for value types. Note that an array of reference types requires some extra care to ensure good locality of reference, which you don't get for free as you do with value types.
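To illustrate that distinction, here is a small sketch; PointStruct and PointClass are made-up types for the example. An array of a value type holds the element data contiguously inside the array block itself, whereas an array of a reference type holds only the references, and each object lands wherever the allocator put it.

struct PointStruct { public int X; public int Y; }   // value type
class  PointClass  { public int X; public int Y; }   // reference type

// 100 structs: the X/Y data for all 100 elements lives inside the array block itself.
PointStruct[] structs = new PointStruct[100];

// 100 references: the array block holds 100 references; every object is a separate
// heap allocation, so locality depends on when and where each one was allocated.
PointClass[] classes = new PointClass[100];
for (int i = 0; i < classes.Length; i++)
{
    classes[i] = new PointClass();
}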

The issue that arises with arrays is that they are fixed – once created, that's it. Resizing an array involves scrapping the current block of memory and requesting another. The CLR may provide some performance improvements, but largely this means discarding the original array to be garbage collected and requesting a new block of memory.

Figure 1

 

Above, the array a is created on the heap. Creating a new array will result in this being lost:

Figure 2


The difference in the location of the arrays would occur if, in between creating the first array of 100 and the second of 200, more memory was allocated on the heap. For your information, if this were a non-managed situation and you had much more control over memory, you could re-establish the 200 element array at the original address of a, avoiding fragmentation. But in a managed world, creating a new array and assigning it to the existing variable does not release the memory of the 100 element array straight away – that will only happen when the GC runs at some non-predetermined time.

Figure 3

For those not interested in the details you can skip this next piece.

Listing 4

The red text marks where the previous instance of a was (a = 100), and the blue text marks where the new array of 200 elements was created, showing how the GC does indeed allocate memory contiguously.
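To make that concrete, here is a minimal sketch of what a resize amounts to: allocate a new block, copy the old contents across, and leave the old array behind for the GC. GrowArray is my own helper name for illustration; the framework's Array.Resize<T> in .NET 2.0 does essentially the same job.

// A minimal sketch of a manual array "resize": allocate, copy, abandon the old block.
static int[] GrowArray(int[] source, int newLength)
{
    int[] destination = new int[newLength];          // new contiguous block on the heap
    Array.Copy(source, destination, source.Length);  // copy the old contents across
    return destination;                              // the old block is now garbage
}

int[] a = new int[100];
a = GrowArray(a, 200);   // the 100 element block is unreachable, but not yet collected
// Array.Resize(ref a, 400) would do the same job using the framework.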

 

So if the size of an array needs to change, should we use an array, or should we choose a different data structure such as List<T>?

Consider this:

int[] intArray = new int[100];

List<int> intList = new List<int>();

The difference between the two is that getting something into each structure is programmatically different: with the array an element is a direct memory address, whereas the list requires at least a first call to Add(). With the array we can immediately do something like intArray[99] = 99, but for the list you would need to call Add() a hundred times before being able to achieve the same. You'd be forgiven for thinking you could do this:

List<int> intList = new List<int>(100);

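// this throws an ArgumentOutOfRangeException – Capacity is 100 but Count is still 0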
intList[99] = 99;

The Capacity of a List is simply a statement to the implementation that the initial capacity will be 100, and thereafter it will double its capacity once exceeded. So if you add 101 items into the list its capacity immediately becomes 200; add a 201st, and the capacity increases to 400. If you understood a little more about how the List is implemented you might get frustrated about why the above doesn't work, but there is a very good reason why it doesn't.

While I am going to look at the internals of List from this point, let it be said that good engineering is about abstraction and encapsulation, and relying on implementation details is a form of very bad coupling. At the same time it is important to understand how something works, especially in a managed environment that comes with the price of the garbage collector.

So the reason you can't immediately gain access to the 100th element in the list is because the Capacity of a List is not the same as its Count. You can't grumble at that. Internally, List<T> is implemented using a plain T[] array. When you add the first element, a block of memory will be created the size of Capacity. If you create the list without specifying a Capacity, your default array size will be 4 in .NET 2.0.
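A quick sketch of that distinction, based on the .NET 2.0 behaviour described above:

List<int> intList = new List<int>(100);

Console.WriteLine(intList.Capacity);   // 100 – the size of the internal array
Console.WriteLine(intList.Count);      // 0   – nothing has been added yet

intList.Add(42);                       // Count becomes 1, Capacity stays 100
// intList[99] = 99;                   // still throws – the indexer is bounded by Count, not Capacity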

The interesting stuff in List happens when you call Add:

public void Add(T item) 
{ 
    if (this._size == this._items.Length) 
    { 
        this.EnsureCapacity(this._size + 1); 
    } 
    this._items[this._size++] = item; 
    this._version++; 
}


If the current size has reached Capacity, the first thing to happen is to create a new array inside EnsureCapacity:

private void EnsureCapacity(int min) 
{ 
    if (this._items.Length < min) 
    { 
        int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2); 
        if (num < min) 
        { 
            num = min; 
        } 
        this.Capacity = num; 
    } 
}

This private method deals with working out the next size to expand the array to – notice this._items.Length * 2. The Capacity assignment is where the re-allocation occurs:

public int Capacity 
{ 
    get 
    { 
        return this._items.Length; 
    } 
    set 
    { 
        if (value != this._items.Length) 
        { 
            if (value < this._size) 
            { 
                ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.value, ExceptionResource.ArgumentOutOfRange_SmallCapacity); 
            } 
            if (value > 0) 
            { 
                T[] destinationArray = new T[value]; 
                if (this._size > 0) 
                { 
                    Array.Copy(this._items, 0, destinationArray, 0, this._size); 
                } 
                this._items = destinationArray; 
            } 
            else 
            { 
                this._items = List<T>._emptyArray; 
            } 
        } 
    } 
}

Because the List uses an Array underneath, all that I said earlier about arrays is largely applicable here. The issue with a List is that even though the full array is created underneath, you are limited to reaching into the List no further than Count, which can be less than the size of the internal array (as can be seen above). There is a performance hit with the call to Add because it does some sanity checking on the array – and potentially resizes it – as well as the cost of the method call itself (a List could be used in a hard realtime scenario as long as the programmer ensured the capacity was such that it obviated a resize).

One last comment about the Capacity of a List. Most people know the impact of not creating a list with a capacity, but I'll explain it anyway. If you don't create a list with a capacity, after the first Add() your underlying array will have a size of 4. When you exceed 4 it will double to 8, then to 16, and so on. So clearly, if you know you are going to load the list with 1250 items, you should do this: list = new List<int>(1250). What happens if you know your list will grow to 1250, but you are doing lots of async calls which means it could take some time to fill the list? Still create the list with 1250 – remember, if you decide not to create the list with a Capacity you will incur 10 memory allocations (4, 8, 16, …, 2048) before arriving at a structure that supports your needs. If you are not in a realtime situation, you must still respect the GC, but you can be less aggressive.
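A small sketch of the difference; the capacities shown simply follow from the doubling behaviour described above:

// Without a capacity: the internal array is reallocated as it doubles –
// 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 – ten allocations, with the contents copied across at each resize.
List<int> grown = new List<int>();
for (int i = 0; i < 1250; i++)
{
    grown.Add(i);
}
Console.WriteLine(grown.Capacity);      // 2048

// With a capacity: one allocation up front, no copying, no garbage for the GC to chase.
List<int> presized = new List<int>(1250);
for (int i = 0; i < 1250; i++)
{
    presized.Add(i);
}
Console.WriteLine(presized.Capacity);   // 1250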

From a class role perspective, both Array and List offer the same rich set of interfaces such as IList, ICollection and IEnumerable, allowing them to be interchanged with each other in your code, which is useful.
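For example, a method written against IList<int> will happily accept either structure; Sum below is just an illustrative helper, not a framework method.

// A method written against the interface accepts both an array and a List<T>.
static int Sum(IList<int> values)
{
    int total = 0;
    for (int i = 0; i < values.Count; i++)
    {
        total += values[i];
    }
    return total;
}

int[] intArray = new int[] { 1, 2, 3 };
List<int> intList = new List<int>(new int[] { 1, 2, 3 });

Console.WriteLine(Sum(intArray));   // 6
Console.WriteLine(Sum(intList));    // 6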

Both List and Array are highly performant data structures, and if you use the List correctly you can achieve performance comparable to an Array. Data locality will be good with both, with the same caveats when creating data structures, associative or otherwise, that cannot be established 100% in place within the structure – remember that if the type being stored is a reference type, all you will hold within your data structure is a reference to some place else on the heap.

If you know you will need to handle expansion of an array manually, don't bother reinventing the wheel – just use the List; it's as fast as anything you could write yourself. The List falls short, however, as soon as you add another dimension to the underlying data structure – a List is a one-dimensional concept and will offer little, unless again you opt for creating jagged array style structures. As soon as you move into the world of holding 2 or more dimensions of data, you must resort back to arrays for all the reasons already mentioned. If you need an n-dimensional structure that may need to expand, consider the algorithm for a List – you could use a lot from its implementation in creating an expandable 2D matrix.
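As a rough sketch of that idea, GrowableMatrix below is my own illustrative class, simply applying the List-style doubling policy to each dimension of a rectangular array; it is nothing more than a starting point.

// A minimal growable 2D matrix using the List-style doubling strategy on each dimension.
class GrowableMatrix
{
    private int[,] _items = new int[4, 4];   // start small, like List<T>'s default of 4

    public int this[int row, int col]
    {
        get { return _items[row, col]; }
        set
        {
            EnsureCapacity(row + 1, col + 1);
            _items[row, col] = value;
        }
    }

    private void EnsureCapacity(int minRows, int minCols)
    {
        int rows = _items.GetLength(0);
        int cols = _items.GetLength(1);
        if (minRows <= rows && minCols <= cols)
        {
            return;
        }

        // Double each dimension until it is big enough – the same amortisation trick List<T> uses.
        while (rows < minRows) { rows *= 2; }
        while (cols < minCols) { cols *= 2; }

        int[,] bigger = new int[rows, cols];
        for (int r = 0; r < _items.GetLength(0); r++)
        {
            for (int c = 0; c < _items.GetLength(1); c++)
            {
                bigger[r, c] = _items[r, c];
            }
        }
        _items = bigger;   // the old block is left for the GC, exactly as with List<T>
    }
}

Writing matrix[10, 20] = 1 on a fresh instance grows the backing block from 4x4 to 16x32 in a single reallocation, rather than element by element.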

6c 00 61 00 74 00 65 00 72 00 7a 00 2e.

Comments

Great analysis Nick.

It's tricky to reason about the performance characteristics of programs written on top of complex libraries like .NET or the STL. Knowing what the original author's implementation options are (e.g. how memory is managed generically, and how compacting garbage collectors automate the process) can give people a good basis for making educated guesses about the best way to use such libraries (even if they haven't read all of the source code).

• damien morton
• July 9th, 2007

Amortised analysis – doubling the list is O(1) in an amortised sense.

• Nick Robinson
• July 9th, 2007

Yeah. It won't be quite so deterministic if a GC is needed.
