CLR Inside Out: Collections Best Practices

[アーティクル]
10/02/2019

CLR Inside Out

Collections Best Practices

Inbar Gazit

Contents

Most introductory textbooks on computer science and programming include a chapter about collections. They may be called arrays or data structures, but the concept remains the same. The ability to tie a set of elements to one another in a formal data object is essential to modern programming techniques.

In the Microsoft® .NET Framework, a lot of effort went into creating collection classes that are powerful and that address a variety of needs and styles. These collections are simple to use, intuitive, and have adequate performance—all very important characteristics. In this month’s installment, I’ll look at collections in .NET, how they work, when to use them, and some best practices.

When to Use a Collection

The first decision a developer faces seems quite simple: should I use a collection? Sometimes the answer is obvious. Sometimes it’s a matter of preference. The basic purpose of collections is to tie certain objects together as a group because they have some logical connection to each other. As an example, you may have a Book class that represents a book and you may have a BookCollection class that represents a library or a shelf of books in a library. By using a collection in this case, you can quickly add or remove books from the shelf, move books around, find a book based on title or ISBN or some other descriptive data, count the number of books, iterate over the books, and much more. I’ll cover the details later, but in this scenario using a collection is an obvious choice.

Now, let’s take a look at a case where you would almost certainly choose not to use collections—rational numbers. A rational number is a number that can be expressed as the ratio of two integers. In my description of collections, I proposed that a collection should be used to tie together a set of elements, and in this scenario, you could use a collection with two elements—one for the numerator and one for the denominator. However, the purpose of the numerator and denominator are very different (unlike in the books example, where all elements served a similar purpose). One of the primary reasons for storing items in a collection is to enable easy iteration through all elements so as to allow an operation to be performed on each individual element, a task that would rarely make sense for the integers composing a rational number. Moreover, a rational number has a fixed, very small number of elements. As a result, it doesn’t make sense to use a collection to store the elements of a rational number; here you should use two separate integral fields.

Now, consider a scenario in which the use of a collection could be debated. Imagine a data structure used to represent the teams in a sporting event, such as baseball. There are operations that would need to be performed on each team participating in the game, and each team serves a similar purpose (unlike the numerator and denominator in a rational number). However, in a game like baseball, there are only two teams, so while iteration support might be beneficial, the number of items to be enumerated is very small (and in prototypical implementation, fixed). Some might choose to use a TeamCollection or an array of Team instances to represent both sides, and some might choose to use two individual Team fields. There are pros and cons to both implementations.

Nevertheless, there are some simple guidelines to keep in mind when considering a collection. You should use a collection when at least one of following statements holds true:

Individual elements serve similar purposes and are of equal importance.
The number of elements is unknown or is not fixed at compile time.
You need to support iteration over all elements.
You need to support sorting of the elements.
You need to expose the elements from a library where a consumer will expect a collection type.

Note that for the purposes of this column, I consider arrays (System.Array) to be a special kind of collection.

When to Create a Custom Collection

The .NET Framework includes a variety of collection types, the most general of which are under the System.Collections namespace and subnamespaces (except for System.Array). I would strongly suggest reviewing what’s available there before writing a custom collection as the existing types will cover the large majority of your collection needs.

A case where you might need to create a custom collection is when you are trying to implement a special data structure that does not yet exist in the Framework. Graphs, trees, and other special data structures come to mind. They can be very useful in many situations. When you want a variant of an existing collection type, you must also create a custom collection. It could be that you want the collection to host only certain type(s) and that the specific implementation of this collection is dependent on these types. You may need a custom collection for performance; if you have special performance needs that cannot be addressed with one of the provided general-purpose collections, writing a custom collection may be your only option. For example, you may need a list where you can insert items into the middle very quickly while at the same time you can find items just as quickly without having to iterate over the list.

Existing Collections

If you use an existing collection, you’ll still need to choose the right one. Depending on what you are building, there are a variety of different collections that you can choose from. First, you need to determine what is going to be stored in the collections and then decide how the collection is going to be used. Are you going to be adding all the elements at once? Are you going to be dynamically adding and removing elements? Do you only need to iterate over the entire collection, or do you also need to locate and use specific elements? Are you exposing your collection through a public API? What are the performance characteristics and requirements of your application? Asking these questions will help you decide which collection to use, and if the options don’t suit your needs, consider extending the collection rather than starting from scratch.

Generic or Non-Generic

The .NET Framework 2.0 introduced the notion of generics. In the .NET Framework 1.x, most of the collection classes stored System.Object references. This meant that every time you extracted an item out of a collection you would have to cast it back to its original type (and if you developed a library, the consumer of that library was required to know what those types were). This cast had a performance impact, especially when storing value types, as adding one to a collection required it first to be boxed into a System.Object and later unboxed when pulled out of the collection and cast back to its original type.

By using generic collection types (found under the System.Collections.Generic namespace), you can avoid these issues. For example, if you need a list of strings in your C# application, you can use System.Collections.Generic.List<String> (List(Of String) in Visual Basic® and List<String^> in C++). This ensures that when you extract elements, they only come in the System.String type (meaning you don’t need to cast from Object to String). It also allows the compiler to enforce that behavior by providing a compile-time error if you try to store anything else into the list. Moreover, if you are developing a library to be consumed by someone else and you don’t know the contained type he’ll be using, you can continue to use a generic type T that the consumer of the library will later define. You can also constrain this T so that you know something about the types that will be used (that they derive from a specific set of interfaces, that they implement a parameterless constructor, and so forth).

If you’re using the .NET Framework 2.0 or later, using the old non-generic collections is strongly discouraged. The only reason you may need to keep them is for legacy purposes or if you need your code to work with older versions of the Framework. Using the old non-generic collections may slow down your code and also make it less readable. Even if you need to store more than one type in a collection, you can still use generic collections. You just need to find a common base-type for the objects you are about to store; in the worst case, you can use System.Object as the generic type parameter to be able to store anything (such as List<Object>), since every type derives from System.Object.

If you have legacy code using non-generic collections, I strongly recommend you consider changing your code to use generic collections. The table in Figure 1 shows recommended types and interfaces you should use. In addition, the .NET Framework 2.0 includes new generic collections that don’t have direct counterparts in the .NET Framework 1.x—for example, LinkedList<T> and SortedDictionary<TKey, TValue>. The .NET Framework 3.5 also includes a HashSet<T> class that provides new standard set functionality (in previous versions, the need for a set was typically satisfied using a Hashtable or Dictionary<TKey,TValue>).

Performance Implications

Designs that require collections frequently come from the need to work with a large number of items, and the number of items you work with will affect the performance of your application. To make sure you choose the right collection type for your needs, it is important to understand the usage patterns that are expected from your collection. For example, you may be inserting items to the collection only on initialization of your application, or you may be doing so throughout the lifetime of your program. Your only interaction with the collection may be based on iterating through all of the elements, or you may need random access capabilities. Understanding your usage profile allows you to find the best collection that fits your needs.

When you want additions, removals, and lookups to be very quick, and when you are not concerned about the order of the items in the collections, your first inclination should be to use a System.Collections.Generic.Dictionary<TKey, TValue> (or if you’re using the .NET Framework 1.x, a Hashtable). The three basic operations (Add, Remove, and Contains) all operate quickly even if the collection contains millions of items. On the other hand, with a List<T> (or ArrayList in the .NET Framework 1.x), inserting and removing items can take a variable amount of time. (Both List<T> and ArrayList store items in an underlying array that maintains order. Adding items may require existing items in the underlying array be moved to make room. Adding items at the end will not require any moves and will be very quick).

If your usage pattern requires few deletions and mostly additions, and if it is important for you to keep the collection in order, you may still want to choose a List<T>. Lookup could be slow (since the underlying array will need to be traversed while searching for the target item), but you are guaranteed that your collection maintains a specific order. Alternatively, you can choose Queue<T> to implement a first-in-first-out (FIFO) order or a Stack<T> to implement a last-in-first-out (LIFO) order. While both Queue<T> and Stack<T> support enumeration of all items in the collection, the former only supports insertion at the end and removal from the beginning, while the latter only supports insertion and removal from the beginning.

Using the new LinkedList<T> collection can potentially help you improve performance in scenarios where you need to maintain order yet still achieve fast inserts. Unlike List<T>, LinkedList<T> is implemented as a chain of dynamically allocated objects. In comparison to List<T>, inserting an object in the middle only requires updating two connections and adding the new item. The downside of a linked list from a performance point of view is increased activity by the garbage collector as it has to traverse the entire list to make sure objects were not disposed of. Additionally, performance problems can arise with large linked lists due to the overhead associated with each node as well as where in memory each node lives. And while the actual act of inserting an item into a LinkedList<T> is much faster than doing so into a List<T>, finding the particular location at which you want to insert a new value still requires traversing the list to find the correct location.

As mentioned, for scenarios where you need fast insertions and lookups, a Dictionary<TKey, TValue> may be most appropriate. However, lookup is only fast in the average case, and certain datasets could cause that performance to degrade quickly. A different implementation, SortedDictionary<TKey,TValue>, uses a balanced tree implementation as the underlying data store; this provides relatively fast lookups and maintains items in a sorted order, but insertions will most likely be slower (and will vary based on the number of items in the collection). Alternatively, you can also use SortedList<Tkey,TValue>, which uses two separate arrays to maintain the keys and the values separately and to maintain them both in order (in the worse case you would have to shift all the keys and values).

A Weather Almanac Using a List<T>

The following example shows a simple application to calculate the average temperature and humidity for a list of weather reports from various cities (see Figure 2). The WeatherReport class is used to store a single weather event for a selected city. The List<WeatherReport> collection is used to store the information that I’ll be averaging. I chose a List<T> collection in this case because the pattern is that I only add items and never remove them. Also, I do not care about sorting or inserting items in the middle, nor do I care about looking up individual items in the list. The only other thing I do with the list is iterate over the items. If you so choose, you can extend this program to find the hottest or coldest city and many additional weather almanac type operations.

Custom Collections

In certain situations, you may not find that all of your requirements are implemented in any of the existing collection types in the Framework. It may be that you have a unique set of requirements or that you can still use an existing collection with minor changes. If you decide you want to write a custom collection, you should first look into extending an existing collection type.

Start by looking at the System.Collections.ObjectModel namespace. These collections already implement the most needed interfaces and give you the baseline functionality you expect. It’s easy to add methods to inherited classes that enable your specific needs.

As an example, let’s say you want to implement a simple collection of System.Double values, but you also want to support an Add(string) method that uses Double.TryParse, only adding the result if the value can be parsed to a Double successfully. The System.Collections.ObjectModel.Collection<T> class implements all of the standard collection interfaces, and your custom collection type can simply derive from Collection<Double> in order to provide the additional overload of Add:

class DoubleCollection : Collection<double>
{
    public void Add(string st)
    {
        Double d;
        if (Double.TryParse(st, out d)) base.Add(d);
        else throw new ArgumentException(
            “Cannot parse string to a double. “ +
            “Item was not added to collection.”);
    }
}

The .NET Framework 3.5 with its introduction of extension methods provides an additional mechanism for creating methods that extend existing types. For example, you could rewrite this method as an extension method in C# 3.0 as follows:

class CollectionExtensions
{
    public static void Add(this Collection<Double> c, string st)
    {
        Double d;
        if (Double.TryParse(st, out d)) c.Add(d);
        else throw new ArgumentException(
            “Cannot parse string to a double. “ +
            “Item was not added to collection.”);
    }
}

Given a Collection<Double> that holds named values, this static method can be used to add a new value to the collection using traditional syntax:

CollectionExtensions.Add(values, “3.14”);

However, as this is an extension method (evident from the "this" keyword attributing the first parameter to the static Add method), you can rewrite this as follows:

values.Add(“3.14”);

This is the exact same method call you would write if you’d extended Collection<Double> with a new type, but you can now use this new Add method with any instance of Collection<Double> (or any type that derives from it), rather than just with your custom type. Alternatively, you can start building your own collection without inheriting from any of the existing ones. In that case you can take advantage of the appropriate interfaces that are provided in the .NET Framework.

Implementing Collection Interfaces

If you do decide that you want to write your own custom collection from scratch, you should implement all of the appropriate interfaces provided with the .NET Framework. These interfaces expose common ways to interact with collections that most people find very intuitive and easy to use. Here is a short overview of the various interfaces you can choose from.

IEnumerable This interface allows the collection to be enumerated by returning an enumerator for the collection. The interface exposes a single method, GetEnumerator, that returns an IEnumerator object. The IEnumerator object allows for quick traversal of the collection. Any collection that implements this interface can be accessed using the foreach language construct in C# (For Each in Visual Basic).

IEnumerable<T> This interface extends the non-generic IEnumerable by adding an additional GetEnumerator method that returns an IEnumerator<T> rather than IEnumerator. This means you must also implement IEnumerable if you plan to implement IEnumerable<T>. The difference is that this additional GetEnumerator returns the generic version of IEnumerator (IEnumerator<T>) instead of the non-generic one. It is a common practice to provide your collection’s enumeration logic in IEnumerable<T> and then have the non-generic GetEnumerator method delegate to the generic GetEnumerator method to assure that both of them return the same enumerator (and to prevent you from needing to implement the same logic twice).

IEnumerator This interface is not typically implemented by the collection itself. I mention it here as it is a useful interface to keep in mind in the context of building a custom collection. This interface is used to allow the enumeration of the items in a collection. It exposes three methods: Reset, Current, and MoveNext. The Reset method is used to initialize the enumeration. The Current method is used to get the current item, and MoveNext is used to move to the next item. If you can get an enumerator for the collection using GetEnumerator, then you can implement a foreach loop. For example, if Words is an ArrayList of strings, the following would do the same:

foreach (string word in Words)  Console.WriteLine (word);

IEnumerator enumerator = Words.GetEnumerator();
while(enumerator.MoveNext())    
    Console.WriteLine(enumerator.Current);

IEnumerator <T> The generic version of the IEnumerator interface extends the non-generic version by adding an overload of the Current property that returns a T. This overload makes enumerating the collection easier by removing the need to cast.

ICollection This interface is an extension of IEnumerable. It also exposes three additional properties and one additional method: Count is a read-only property that returns the number of elements in the collection; IsSynchronized indicates that this collection is thread safe so that you can use it in a multithreaded application; SyncRoot is used to allow implementers to implement a thread-safe collection by providing an object that can be used for this purpose; and CopyTo is used to copy items from the collection to an array at a given index. For the most part, I would suggest avoiding the non-generic ICollection interface as the generic one is much more useful.

ICollection<T> This interface is an extension of IEnumerable<T> (and thus also of IEnumerable). It also exposes five new methods and two new properties. The methods are Add, which is used to add a new element to a collection; Remove, which is used to remove an element; Clear, which is used to remove all elements; Contains, which tells us if an item is in the collection or not; and CopyTo, which is used to copy items to an array. The two read-only properties are Count (as in the non-generic ICollection) and IsReadOnly, which tells us if the collection is read-only, meaning that items cannot be removed or added (mutable items in the collection can still be modified even if the collection is read-only).

This interface is very useful as a starting point for any custom collection. If you don’t know the type of the items you want to store in your collection (or you want to store more than one type), just use ICollection<Object> and you essentially get a non-generic extended ICollection interface.

IList and IList<T> IList is an extension of ICollection. In addition to everything that is in ICollection (the generic or non-generic version), it has IndexOf, Insert, and RemoveAt methods as well as an indexer that allows you to get and set any item in the collection, just as you can when using arrays. The non-generic IList also includes the methods that exist in the generic ICollection and not in the non-generic one (Add, Remove, Clear, and Contains). The IList interface is very useful when creating custom collections that allow you to add and remove items at will and get and set values of any item. The RemoveAt method allows you to remove an item based on its index (whereas the regular Remove only removes based on the value). The Insert method allows you to add an item at any position in the list, whereas the regular Add only adds it last.

IDictionary and IDictionary<TKey, TValue> IDictionary is an extension of ICollection. It supports the concept of using a key to store and locate items in the collection. In addition to all the methods that ICollection supports, it also exposes another Add overload that associates a key with the item that is added. The ContainsKey method will check for the existence of a particular key. There’s a Keys and Values properties that returns an ICollection of all the keys or values in the IDictionary, a remove overload based on the key instead of the value, and an indexer using the key. The generic version of this interface is a little more complex, as it allows you to specify the type for the value and the type for the key separately.

Generic or Non-Generic Again

When creating a custom collection, it may not be necessary to make it generic. If you know the types of objects that your collection is going to store, and you know that these types will not change in the future, using a fixed-type collection makes sense. You can still extend a generic framework collection or use generic interfaces, but your class will always store objects of a particular type.

If you are building a library, or a more general-purpose collection such as a tree or a graph, it’s probably a good idea to make it generic. This will enable the consumer of this type to use it with various types without the penalty of using an Object and casting it back and forth. Here is what a general-purpose directional graph class might look like:

class DirectionalGraph<T> : 
    IEnumerable<T>, ICollection<T>, IList<T>
{
    public void Clear() { }
    public void Add(T item, IEnumerable<T> inNodes, 
        IEnumerable<T> outNodes) { }
    public void Remove(T item) { }
    public bool Contains(T item) { }
    static public DirectionalGraph<T> EmptyGraph { get{}}
 
    // implement the interfaces...
}

A Family Tree

In the next example, I’ll implement a custom collection to store a family tree for genealogical purposes (see Figure 3). For simplicity, I’ll assume the following: each person has a sex and an age, and each person has two biological parents. Let’s call them Mother and Father.

The tree is a special type of collection that consists of nodes. Each node (the Person class, as in Figure 4) stores information about other nodes in the tree. There’s a special node called Root, which is the base of the tree. All other nodes could be assessed from the root traversing the tree. I also implemented IEnumerable<T> to allow enumerating all the members in the FamilyTree in a foreach loop. Once all persons have been entered into the tree you should be able to answer the question: what is the relationship between person A and B?

Note that not all the possible relationships are implemented in this example, but you can add more relationships to the FindRelationship method yourself. Also note that for clarity of the example, not all methods have adequate parameter checking. It is recommended that you make sure that every method only accepts valid values for its arguments before implementing it.

When you run the example in Figure 5, you should see the output that is shown in Figure 6.

Figure 6 Output from the Family Tree Sample

Figure 6** Output from the Family Tree Sample **(Click the image for a larger view)

By now you should have a pretty good understanding of collections in the .NET Framework, what they are and when and why you would want to use them. Collections can be a powerful tool in your programming arsenal so it’s important to know when they will benefit you and how to use them wisely. If you’re looking for more information, check out the resources in the "Further Reading" sidebar.

Send your questions and comments to clrinout@microsoft.com.

Inbar Gazit is a Program Manager focusing on the base class libraries for the CLR team. He can be reached at inbar@microsoft.com.

その他のリソース