# An Extensive Examination of Data Structures

**Visual Studio .NET 2003**

Scott Mitchell

4GuysFromRolla.com

April 2004

**Summary: **Scott Mitchell discusses data structures for implementing general and disjoint sets. A set is an unordered collection of unique items that can be enumerated and compared to other sets in a variety of ways. (20 printed pages)

Download the Sets.msi sample file.

#### Contents

Introduction

The Fundamentals of Sets

Implementing an Efficient Set Data Structure

Maintaining a Collection of Disjoint Sets

References

Related Books

## Introduction

One of the most basic mathematical constructs is a *set*, which is an unordered collection of unique objects. The objects contained within a set are referred to as the set's *elements*. Formally, a set is denoted as a capital, italic letter, with its elements appearing within curly braces (`{...}`

). Examples of this notation can be seen below:

S = { 1, 3, 5, 7, 9 } T = { Scott, Jisun, Sam } U = { -4, 3.14159, Todd, x }

In mathematics, sets are typically comprised of numbers, such as set *S* above, which contains the odd positive integers less than 10. However, notice that the elements of a set can be anything—numbers, people, strings, letters, variables, and so on. Set *T*, for example, contains peoples' names, and set *U* contains a mix of numbers, names, and variables.

In this article, we'll start with a basic introduction of sets, including common notation and the operations that can be performed on sets. Following that, we'll examine how to efficiently implement a set data structure with a defined universe. The article concludes with an examination of disjoint sets, and the best data structures to use.

## The Fundamentals of Sets

Recall that a set is simply a collection of elements. The "element of" operator, denoted *x* Î *S*, implies that *x* is an element in the set *S*. For example, if set *S* contains the odd positive integers less than 10, then 1Î *S*. When reading such notation, you'd say, "1 is an element of *S*." In addition to 1 being an element of *S*, we have 3 Î *S*, 5 Î *S*, 7 Î *S*, and 9 Î *S*. The "not an element of" operator, denoted *x* Ï *S*, means that *x* is not an element of set *S*.

The number of unique elements in a set is the set's *cardinality*. The set {1, 2, 3} has cardinality 3, just as does the set {1, 1, 1, 1, 1, 1, 1, 2, 3} (because it only has three unique elements). A set may have no elements in it at all. Such a set is called the *empty set*, and is denoted as {} or Æ, and has a cardinality of 0.

When first learning about sets, many developers assume they are the same as collections, like an ArrayList. However, there are some subtle differences. An ArrayList is an *ordered* collection of elements—each element in an ArrayList has an associated ordinal index, which implies order. Too, there can be duplicate elements in an ArrayList.

A set, on the other hand, is *unordered* and contains *unique* items. Since sets are unordered, the elements of a set may be listed in any order. That is, the sets {1, 2, 3} and {3, 1, 2} are considered equivalent. Also, any duplicates in a set are considered redundant. The set {1, 1, 1, 2, 3} and the set {1, 2, 3} are equivalent. Two sets are equivalent if they have the same elements. (Equivalence is denoted with the = sign; if S and T are equivalent they are written as *S* = *T*.)

NoteIn mathematics, anorderedcollection of elements that allows duplicates is referred to as alist. Two lists,L_{1}andL_{2}are considered equal if and only if foriranging from 1 to the number of elements in the list theith element inL_{1}equals theith element inL_{2}.

Typically the elements that can appear in a set are restricted to some *universe*. The universe is the set of all possible values that can appear in a set. For example, we might only be interested in working with sets whose universe are integers. By restricting the universe to integers, we can't have a set that has a non-integer element, like 8.125, or Sam. (The universe is denoted as the set *U*.)

### Relational Operators of Sets

There are a bevy of relational operators that are commonly used with numbers. Some of the more often used ones, especially in programming languages, include `<`

, `<=`

, `=`

, `!=`

, `>`

, and `>=`

. A relational operator determines if the operand on the left hand side is related to the operand on the right hand side based on criteria defined by the relational operator. Relational operators return a "true" or "false" value, indicating whether or not the relationship holds between the operands. For example, *x* < *y* returns true if *x* is less than *y*, and false otherwise. (Of course the meaning of "less than" depends on the data type of *x* and *y*.)

Relational operators like `<`

, `<=`

, `=`

, `!=`

, `>`

, and `>=`

are typically used with numbers. Sets, as we've seen, use the `=`

relational operator to indicate that two sets are equivalent (and can likewise use `!=`

to denote that two sets are not equivalent), but relational operators `<`

, `<=`

, `>`

, and `>=`

are not defined for sets. After all, how is one to determine if the set {1, 2, 3} is less than the set {Scott, 3.14159}?

Instead of notions of `<`

and `<=`

, sets use the relational operators *subset* and *proper subset*, denoted Í and Ì, respectively. (Some older texts will use Ì for subset and Í for proper subset.) *S* is a subset of *T*—denoted *S* Í *T*—if every element in *S* is in *T*. That is, *S* is a subset of *T* if it is *contained* within *T*. If *S* = {1, 2, 3}, and *T* = {0, 1, 2, 3, 4, 5}, then *S* Í *T* since every element in *S*—1, 2 and 3—is an element in *T*. *S* is a proper subset of *T*—denoted *S* Ì *T*—if *S* Í *T* and *S* ¹ *T*. That is, if *S* = {1, 2, 3} and *T* = {1, 2, 3}, then *S* Í *T* since every element in *S* is an element in *T*, but *S* *T* since *S* = *T*. (Notice that there is a similarity between the relational operators `<`

and `<=`

for numbers and the relational operators Í and Ì for sets.)

Using the new subset operator, we can more formally define set equality. Given sets *S* and *T*, *S* = *T* if and only if *S* Í *T *and *T* Í *S*. In English, *S* and *T* are equivalent if and only if every element in *S* is in *T*, and every element in *T* is in *S*.

NoteSince Í is analogous to`<=`

, it would make sense that there exists a set relational operator analogous to`>=`

. This relational operator is calledsuperset, and is denoted Ê; aproper supersetis denoted . Like with`<=`

and`>=`

,SÊTif and only ifTÍS.

### Set Operations

As with the relational operators, many operations defined for numbers don't translate well to sets. Common operations on numbers include addition, multiplication, subtraction, exponentiation, and so on. For sets, there are four basic operations:

**Union**– the union of two sets, denoted*S*È*T*, is akin to addition for numbers. The union operator returns a set that contains all of the elements in*S*and all of the elements in*T*. For example, {1, 2, 3} È {2, 4, 6} equals {1, 2, 3, 2, 4, 6}. (The duplicate 2 can be removed to provide a more concise answer, yielding {1, 2, 3, 4, 6}.) Formally,*S*È*T*= {*x*:*x*Î*S*or*x*Î*T*}. In English, this translates to*S*union*T*results in the set that contains an element*x*if*x*is in*S*or in*T*.**Intersection**–the intersection of two sets, denoted*S*Ç*T*, is the set of elements that*S*and*T*have in common. For example, {1, 2, 3} Ç {2, 4, 6} equals {2}, since that's the only element both {1, 2, 3} and {2, 4, 6} share in common. Formally,*S*Ç*T*= {*x*:*x*Î*S*and*x*Î*T*}. In English, this translates to*S*intersect*T*results in the set that contains an element*x*if*x*is both in*S***and**in*T*.**Difference**– the difference of two sets, denoted*S*-*T*, are all of the elements in*S*that are**not**in*T*. For example, {1, 2, 3} - {2, 4, 6} equals {1, 3}, since 1 and 3 are the elements in*S*that are not in*T*. Formally,*S - T*= {*x*:*x*Î*S*and*x*Ï*T*}. In English, this translates to*S*set difference*T*results in the set that contains an element*x*if*x*is in*S*and**not**in*T*.**Complement**– Earlier we discussed how typically sets are limited to a known universe of possible values, such as the integers. The complement of a set, denoted*S*', is*U*-*S*. (Recall that*U*is the universe set.) If our universe is the integers 1 through 10, and*S*= {1, 4, 9, 10}, then*S*' = {2, 3, 5, 6, 7, 8}. (Complementing a set is akin to negating a number. Just like negating a number twice will give you the original number back—that is, --*x*=*x*—complementing a set twice will give you the original set back—*S*'' =*S*.)

When examining new operations, it is always important to get a solid grasp on the nature of the operations. Some questions to ask yourself when learning about any operation—be it one defined for numbers or one defined for sets—are:

- Is the operation commutative? An operator
*op*is commutative if*x**op**y*is equivalent to*y**op x*. In the realm of numbers, addition is an example of a commutative operator, while division is not commutative. - Is the operation associative? That is, does the order of operations matter. If an operator
*op*is associative, then*x op*(*y op z*) is equivalent to (*x op y*)*op z*. Again, in the realm of numbers addition is associative, but division is not.

For sets, the union and intersection operations are both commutative and associative. *S* È *T* is equivalent to *T* È *S*, and *S *È (*T* È *V*) is equivalent to (*S* È *T*) È *V*. Set difference, however, is neither commutative nor associative. (To see that set difference is not commutative, consider that {1, 2, 3} - {3, 4, 5} = {1, 2}, but {3, 4, 5} - {1, 2, 3} = {4, 5}.)

### Finite Sets and Infinite Sets

All of the set examples we've looked at thus far have dealt with finite sets. A finite set is a set that has a finite number of elements. While it may seem counterintuitive at first, a set can contain an infinite number of elements. The set of positive integers, for example, is an infinite set since there are no bounds to the number of elements in the set.

In mathematics, there are a couple infinite sets that are used so often that they are given a special symbol to represent them. These include:

- N = {0, 1, 2, …}
- Z = {…, -2, -1, 0, 1, 2, …}
- Q = {a/b: a Î Z, b Î Z, and b ¹ 0}
- R =
*set of real numbers*

N is the set of *natural numbers*, or positive integers greater than or equal to 0. Z is the set of *integers*. Q is the set of *rational numbers*, which are numbers that can be expressed as a fraction of two integers. Finally, R is the set of *real numbers*, which are all rational numbers, plus irrational numbers as well (numbers that cannot be expressed as a fraction of two integers, such as *pi*, and the square root of 2).

Infinite sets, of course, can't be written down in their entirety, as you'd never finish jotting down the elements, but instead are expressed more tersely using mathematical notation like so:

S = {x : xÎ N and x > 100}

Here *S* would be the set of all natural numbers greater than 100.

In this article we will be looking at data structures for representing finite sets. While infinite sets definitely have their place in mathematics, rarely will we need to work with infinite sets in a computer program. Too, there are unique challenges with representing and operating upon infinite sets, since an infinite set's contents cannot be completely stored in a data structure or enumerated.

NoteComputing the cardinality of finite sets is simple—just count up the number of elements. But how does one compute the cardinality of an infinite set? This discussion is beyond the scope of this article, but realize that there are different types of cardinality for infinite sets. Interestingly, there is the same "number" of positive integers, as there areallintegers, but there are more real numbers than there are integers.

### Sets in Programming Languages

C++, C#, Visual Basic .NET, and Java don't provide inherent language features for working with sets. If you want to use sets, you need to create your own set class with the appropriate methods, properties, and logic. (We'll do precisely this in the next section!) There have been programming languages in the past, though, that have offered sets as a fundamental building block in the language. Pascal, for example, provides a set construct that can be used to create sets with an explicitly defined universe. To work with sets, Pascal provides the `in`

operator to determine if an element is in a particular set. The operators `+`

, `*`

, and – are used for union, intersection, and set difference. The following Pascal code illustrates the syntax used to work with sets:

/* declares a variable named possibleNumbers, a set whose universe is the set of integers between 1 and 100... */ var possibleNumbers = set of 1..100; ... /* Assigns the set {1, 45, 23, 87, 14} to possibleNumbers */ possibleNumbers := [1, 45, 23, 87, 14]; /* Sets possibleNumbers to the union of possibleNumbers and {3, 98} */ possibleNumbers := possibleNumbers + [3, 98]; /* Checks to see if 4 is an element of possible numbers... */ if 4 in possibleNumbers then write("4 is in the set!");

Other previous languages have allowed for more powerful set semantics and syntax. A language called SETL (an acronym for SET Language) was created in the 70s and offered sets as a first-class citizen. Unlike Pascal, when using sets in SETL you are not restricted to specifying the set's universe.

## Implementing an Efficient Set Data Structure

In this section we'll look at creating a class that provides the functionality and features of a set. When creating such a data structure, one of the first things we need to decide is how to store the elements of the set. This decision can greatly affect the asymptotic efficiency of the operations performed on the set data structure. (Keep in mind that the operations we'll need to perform on the set data structure include: union, intersection, set difference, subset, and element of.)

To illustrate how storing the set's elements can affect the run time, imagine that we created a set class that used an underlying ArrayList to hold the elements of the set. If we had two sets, *S*_{1} and *S*_{2} that we wanted to union (where *S*_{1} had *m* elements and *S*_{2} had *n* elements), we'd have to perform the following steps:

- Create a new set class,
*T*, that holds the union of*S*_{1}and*S*_{2} - Iterate through the elements of
*S*_{1}, adding it to*T*. - Iterate through the elements of
*S*_{2}. If the element does not already exist in*T*, then add it to*T*.

How many steps would performing the union take? Step (2) would require *m* steps through *S*_{1}'s *m* elements. Step (3) would take *n* steps, and for each element in *S*_{2}, we'd have to determine if the element was in *T*. Using unsorted ArrayLists, to determine if an element is in an ArrayList the entire ArrayList must be enumerated linearly. So, for each of the *n* elements in *S*_{2} we might have to search through the *m* elements in *T*. This would lead to a quadratic running time for union of *O*(*m * n*).

The reason a union with an ArrayList takes quadratic time is because determining if an element exists within a set takes linear time. That is, to determine if an element exists in a set, the set's ArrayList must be exhaustively searched. If we could reduce the running time for the "element of" operation to a constant, we could improve the union's running time to a linear *O*(*m* + *n*). Recall from Part 2 of this article series that a Hashtable provides constant running time to determine if an item resides within the Hashtable. Hence, a Hashtable would be a better choice for storing the set's elements than an ArrayList.

If we require that the set's universe be known, we can implement an even more efficient set data structure using a bit array. Assume that the universe consists of elements *e*_{1}, *e*_{2}, …, *e*_{k}. Then we can denote a set with a *k*-element bit array; if the *i*th bit is 1, then the element *e*_{i} is in the set; if, on the other hand, the *i*th bit is 0, then the element *e*_{i} is not in the set. Representing sets as a bit array not only provides tremendous space savings, but also enables efficient set operations, as these set-based operations can be performed using simple bit-wise instructions. For example, determining if element *e*_{i} exists in a set takes constant time since only the *i*th bit in the bit array needs to be checked. The union of two sets is simply the bit-wise OR of the sets' bit arrays; the intersection of two sets is the bit-wise AND of the sets' bit arrays. Set difference and subset can be reduced down to bit-wise operations as well.

NoteA bit array is a compact array composed of 1s and 0s, typically implemented as an integer array. Since an integer in the Microsoft .NET Framework has 32 bits, a bit array can store 32 bit values in one element of an integer array (rather than requiring 32 array elements).

Bit-wise operations are ones that are performed on the individual bits of an integer. There are both binary bit-wise operators and unary bit-wise operators. The bit-wise AND and bit-wise OR operators are binary, taking in two bits each, and returning a single bit. Bit-wise AND returns 1 only if both inputs are 1, otherwise it returns 0. Bit-wise OR returns 0 only if both inputs are 0, otherwise it returns 1.

For a more in-depth look at bit-wise operations in C# be sure to read: Bit-Wise Operators in C#.

Let's look at how to implement a set class that uses C#'s bit-wise operations.

### Creating the PascalSet Class

Understand that to implement a set class that uses the efficient bit-wise operators the set's universe must be known. This is akin to the way Pascal uses sets, so in honor of the Pascal programming language I have decided to name this set class the **PascalSet** class. **PascalSet** restricts the universe to a range of integers or characters (just like the Pascal programming language). This range can be specified in the **PascalSet**'s constructor.

public class PascalSet : ICloneable, ICollection { // Private member variables private int lowerBound, upperBound; private BitArray data; public PascalSet(int lowerBound, int upperBound) { // make sure lowerbound is less than or equal to upperbound if (lowerBound > upperBound) throw new ArgumentException("The set's lower bound cannot be greater than its upper bound."); this.lowerBound = lowerBound; this.upperBound = upperBound; // Create the BitArray data = new BitArray(upperBound - lowerBound + 1); } ... }

So, to create a **PascalSet** whose universe is the set of integers between -100 and 250, the following syntax could be used:

PascalSet mySet = new PascalSet(-100, 250);

#### Implementing the Set Operations

**PascalSet** implements the standard set operations—union, intersection, and set difference—as well as the standard relational operators—subset, proper subset, superset, and proper superset. The set operations union, intersection, and set difference, all return a new **PascalSet** instance, which contains the result of unioning, intersecting, or set differencing. The following code for the **Union(PascalSet)** method illustrates this behavior:

public virtual PascalSet Union(PascalSet s) { if (!AreSimilar(s)) throw new ArgumentException("Attempting to union two dissimilar sets. Union can only occur between two sets with the same universe."); // do a bit-wise OR to union together this.data and s.data PascalSet result = new PascalSet(this.lowerBound, this.upperBound); result.data = this.data.Or(s.data); return result; } public static PascalSet operator +(PascalSet s, PascalSet t) { return s.Union(t); }

The **AreSimilar(PascalSet)** method determines if the **PascalSet** passed has the same lower and upper bounds as the **PascalSet** instance. Therefore, union (and intersection and set difference) can only be applied to two sets with the same universe. (You could make a modification to the code here to have the returned **PascalSet**'s universe be the union of the two universe sets, thereby allowing sets with non-disjoint universes to be unioned.) If the two **PascalSet**s have the same universe, then a new **PascalSet**—**result**—is created with the same universe and its **BitArray** member variable—**data**—is assigned to the bit-wise OR of the two **PascalSet**s' **BitArray**s. Notice that the **PascalSet** class also overloads the `+`

operator for union (just like the Pascal programming language).

#### Enumerating the PascalSet's Members

Since sets are an *unordered* collection of elements, it would not make sense to have **PascalSet** implement **IList**, as collections that implement **IList** imply that the list has some ordinal order. Since **PascalSet** is a collection of elements, though, it makes sense to have it implement **ICollection**. Since **ICollection** implements **IEnumerable**, **PascalSet** needs to provide a **GetEnumerator()** method that returns an **IEnumerator** instance allowing a developer to iterate through the set's elements.

Oftentimes when creating a specialized collection class that uses some other underlying collection class to hold the data, the **GetEnumerator()** method for the specialized class can simply return the **IEnumerator** from the underlying collection's **GetEnumerator()** method. Since the **PascalSet** uses a **BitArray** to represent what elements are in the set, it might at first seem plausible to have **PascalSet**'s **GetEnumerator()** method return the **IEnumerator** from the internal **BitArray**'s **GetEnumerator()** method. However, the **BitArray**'s **GetEnumerator()** returns an **IEnumerator** that enumerates *all* of the bits in the **BitArray**, returning a Boolean value for each bit—true if the bit is 1, false if the bit is 0.

The elements in the **PascalSet**, however, are only those elements where the **BitArray**'s bit is 1. Therefore we need to create a custom class that implements **IEnumerator** and that intelligently walks the **BitArray**, only returning those elements whose corresponding bit in the **BitArray** is 1. To handle this I created a class within the **PascalSet** class called **PascalSetEnumerator**. This class's constructor takes in the current **PascalSet** instance as a sole input parameter. In the **MoveNext()** method it steps through the **BitArray** until it finds a bit with a value of 1.

class PascalSetEnumerator : IEnumerator { private PascalSet pSet; private int position; public PascalSetEnumerator(PascalSet pSet) { this.pSet = pSet; position = -1; } ... public bool MoveNext() { // increment position position++; // see if there is another element greater than position for (int i = position; i < pSet.data.Length; i++) { if (pSet.data.Get(i)) { position = i; return true; } } // no element found return false; } }

The complete code for the **PascalSet** class is included as a download with this article. Along with the class, there is an interactive WinForms testing application, **SetTester**, from which you can create a **PascalSet** instance and perform various set operations, viewing the resulting set.

## Maintaining a Collection of Disjoint Sets

Next time you do a search at Google notice that with each result there's a link titled "Similar Pages." If you click this link, Google displays a list of URLs that are related to the item whose "Similar Pages" link you clicked. While I don't know how Google particularly determines how pages are related, one approach would be the following:

- Let
*x*be the Web page we are interested in finding related pages for. - Let
*S*_{1}be the set of Web pages that*x*links to. - Let
*S*_{2}be the set of Web pages that the Web pages in*S*_{1}link to. - Let
*S*_{3}be the set of Web pages that the Web pages in*S*_{2}link to. - …
- Let
*S*_{k}be the set of Web pages that the Web pages in*S*_{k-1}link to.

All of the Web pages in *S*_{1}, *S*_{2}, up to *S*_{k} are the related pages for *x*. Rather than compute the related Web pages on demand, we might opt to create the set of related pages for *all* Web pages once, and to store this relation in a database or some other permanent store. Then, when a user clicks on the "Similar Pages" link for a search term, we simply query the display to get the links related to this page.

Google has some sort of database with all of the Web pages it knows about. Each of these Web pages has a set of links. We can compute the set of related Web pages using the following algorithm:

- For each Web page in the database create a set, placing the single Web page in the set. (After this step completes, if we have
*n*Web pages in the database, we'll have*n*one-element sets.) - For a Web page
*x*in the database, find all of those Web pages it directly links to. Call these linked-to pages*S*. For each element*p*in*S*, union the set containing*p*with*x*'s set. - Repeat step 2 for all Web pages in the database.

After step 3 completes, the Web pages in the database will be partitioned out into related groups. To view a graphical representation of this algorithm in action, consult Figure 1.

**Figure 1. A graphical representation of an algorithm for grouping linked web pages.**

Examining Figure 1, notice that in the end, there are three related partitions:

- w0
- w1, w2, w3, and w4
- w5 and w6

So, when a user clicks the "Similar Pages" link for w2, they would see links to w1, w3, and w4; clicking the "Similar Pages" link for w6 would show only a link to w6.

Notice that with this particular problem only one set operation is being performed—union. Furthermore, all of the Web pages fall into *disjoint sets*. Given an arbitrary number of sets, these sets are said to be *disjoint* if they share no elements in common. {1,2,3} and {4,5,6} are disjoint, for example, while {1,2,3} and {2,4,6} are not, since they share the common element {2}. In all stages shown in Figure 1, each of the sets containing Web pages are disjoint. That is, it's never the case that one Web page exists in more than one set at a time.

When working with disjoint sets in this manner, we often need to know what particular disjoint set a given element belongs to. To identify each set we arbitrarily pick a *representative*. A representative is an element from the disjoint set that uniquely identifies that entire disjoint set. With the notion of a representative, I can determine if two given elements are in the same set by checking to see if they have the same representative.

A disjoint set data structure needs to provide two methods:

**GetRepresentative(element)**– this method accepts an element as an input parameter and returns the element's representative element.**Union(element, element)**– this method takes in two elements. If the elements are from the same disjoint set, then**Union()**does nothing. If, however, the two elements are from different disjoint sets, then**Union()**combines the two disjoint sets into one set.

The challenge that faces us now is how to efficiently maintain a number of disjoint sets, where these disjoint sets are often merged from two sets into one? There are two basic data structures that can be used to tackle this problem: one uses a series of linked lists, the other a collection of trees.

### Maintaining Disjoint Sets with Linked Lists

In Part 4 of this article series we took a moment to look at a primer on linked lists. Recall that linked lists are a set of nodes that typically have a single reference to their next neighbor. Figure 2 shows a linked list with four elements.

**Figure 2. A linked list with four elements**

For the disjoint set data structure, a set is represented using a modified linked list. Rather than just having a reference to its neighbor, each node in the disjoint set linked list has a reference to the set's representative. As Figure 3 illustrates, *all* nodes in the linked list point to the *same* node as their representative, which is, by convention, the head of the linked list. (Figure 3 shows the linked list representation of the disjoint sets from the final stage of the algorithm dissected in Figure 1. Notice that for each disjoint set there exists a linked list, and the nodes of the linked list contain the elements of that particular disjoint set.)

**Figure 3. A linked list representation of the disjoint sets from the final stage of the algorithm dissected in Figure 1.**

Since each element in a set has a direct reference back to the set's representative, the **GetRepresentative(element)** method takes constant time. (To understand why, consider that regardless of how many elements a set has, it will always take one operation to find a given element's representative, since it involves just checking the element's representative reference.)

Using the linked list approach, combining two disjoint sets into one involves adding one linked list to the end of another, and updating the representative reference in each of the appended nodes. The process of joining two disjoint sets is depicted in Figure 4.

**Figure 4. The process of joining two disjoint sets**

When unioning together two disjoint sets, the correctness of the algorithm is not affected by which of the two sets is appended to the other. However, the running time can be. Imagine that our union algorithm randomly chose one of the two linked lists to be appended to the other. By a stroke of bad luck, imagine that we always chose the longer of the two linked lists to append. This can negatively impact the running time of the union operation since we have to enumerate all of the nodes in the appended linked list to update their representative reference. That is, imagine we make *n* disjoint sets, *S*_{1} to *S*_{n}. Each set would have one element. We could then do *n *- 1 unions, joining all *n* sets into one big set with *n* elements. Imagine the first union joined *S*_{1} and *S*_{2}, having *S*_{1} be the representative for this two element unioned set. Since *S*_{2} only has one element, only one representative reference would need to be updated. Now, imagine *S*_{1}—which has two elements—is unioned with *S*_{3}, and *S*_{3} is made the representative. This time two representative references—*S*_{1}'s and *S*_{2}'s—will need to be updated. Similarly, when joining *S*_{3} with *S*_{4}, if *S*_{4} is made the representative of the new set, three representative references will need to be updated (*S*_{1}, *S*_{2}, and *S*_{3}). In the (*n*-1)th union, *n*-2 representative references will need to be updated.

Summing up the number of operations that must be done for each step, we find that the entire sequence of steps—*n* make set operations and *n*-1 unions—takes quadratic time—*O*(*n*^{2}).

This worst-case running time can transpire because it is possible that union will choose the longer set to append to the shorter set. Appending the longer set requires that more nodes' representative references need to be updated. A better approach is to keep track of the size of each set, and then, when joining two sets, to append the smaller of the two linked lists. The running time when using this improved approach is reduced to *O*(n log_{2} *n*). A thorough time analysis is a bit beyond the scope of this article, and is omitted for brevity. Refer to the readings in the References section for a formal proof of the time analysis.

To appreciate the improvement of *O*(n log_{2} *n*) from *O*(*n*^{2}), observe Figure 5, which shows the growth rate of *n*^{2} in blue, and the growth rate of n log_{2} *n* in pink. For small values of *n*, these two are comparable, but as *n* exceeds 32, the n log_{2} *n* grows much slower than *n*^{2}. For example, performing 64 unions would require over 4,000 operations using the naïve linked list implementation, while it would take only 384 operations for the optimized linked list implementation. These differences become even more profound as *n* gets larger.

**Figure 5. Growth rates of n**^{2}** and n log**_{2}

### Maintaining Disjoint Sets with a Forest

Disjoint sets can also be maintained using a *forest*. A forest is a set of trees (get it? ). Recall that with the linked list implementation, the set's representative was the head of the list. With the forest implementation, each set is implemented as a tree, and the set's representative is the root of the tree. (If you are unfamiliar with what trees are, consider reading Part 3 of this article series, where we discussed trees, binary trees, and binary search trees.)

With the linked list approach, given an element, finding its set's representative was fast since each node had a direct reference to its representative. However, with the linked list approach unioning took longer because it involved appending one linked list to another, which required that the appended nodes' representative references be updated. The forest approach aims at making unions fast, at the expense of finding a set's representative given an element in the set.

The forest approach implements each disjoint set as a tree, with the root as the representative. To union together two sets, one tree is appended as a child of the other. Figure 6 illustrates this concept graphically.

**Figure 6. The union of two sets**

To union two sets together requires constant time, as only one node needs to have its representative reference updated. (In Figure 6, to union together the w1 and w3 sets, all we had to do was have w3 update its reference to w1—nodes w4 and w5 didn't need any modification.)

Compared to the linked list implementation, the forest approach has improved the time required for unioning two disjoint sets, but has worsened the time for finding the representative for a set. The only way we can determine a set's representative, given an element, is to walk up the set's tree until we find the root. Imagine that we wanted to find the representative for w5 (after sets w1 and w3 had been unioned). We'd walk up the tree until we reached the root—first to w3, and then to w1. Hence, finding the set's representative takes time relative to the depth of the tree, and not constant time as it does with the linked list representation.

The forest approach offers two optimizations that, when both employed, yield a linear running time for performing *n* disjoint set operations, meaning that each single operation has an average constant running time. These two optimizations are called union by rank and path compression. What we are trying to avoid with these two optimizations is having a sequence of unions generate a tall, skinny tree. As discussed in Part 3 of this article series, the ratio of a tree's height to breadth typically impacts its running time. Ideally, a tree is fanned out as much as possible, rather than being tall and narrow.

#### The Union by Rank Optimization

Union by rank is akin to the linked list's optimization of appending the shorter list to the longer one. Specifically, union by rank maintains a rank for each sets' root, which provides an upper-bound on the height of the tree. When unioning two sets, the set with the smaller rank is appended as a child of the root with the larger rank. Union by rank helps ensure that our trees will be broad. However, even with union by rank we might still end up with tall, albeit wide, trees. Figure 7 shows a picture of a tree that might be formed by a series of unions that adhere only to the union by rank optimization. The problem is that leaf nodes on the right hand side still must perform a number of operations to find their set's representative.

**Figure 7. A tree that might be formed by a series of unions that adhere only to the union by rank optimization**

NoteThe forest approach, when implementing just the union by rank optimization, has the same running time as the optimized link list implementation.

#### The Path Compression Optimization

Since a tall tree makes finding a set's representative expensive, ideally we'd like our trees to be broad and flat. The path compression optimization works to flatten out a tree. As we discussed earlier, whenever an element is queried for its set's representative, the algorithm walks up the tree to the root. The way the path compression optimization works is in this algorithm; the nodes that are visited in the walk up to the root have their parent reference updated to the root.

To understand how this flattening works, consider the tree in Figure 7. Now, imagine that we need to find the set representative for w13. The algorithm will start at w13, walk up to w12, then to w8, and finally to w1, returning w1 as the representative. Using path compression, this algorithm will also have the side effect of updating w13 and w12's parents to the root—w1. Figure 8 shows a screenshot of the tree after this path compression has occurred.

**Figure 8. A tree after path compression**

Path compression pays a slight overhead the first time when finding a representative, but benefits future representative lookups. That is, after this path compression has occurred, finding the set representative for w13 takes one step, since w13 is a child of the root. In Figure 7, prior to path compression, finding the representative for w13 would have taken three steps. The idea here is that you pay for the improvement once, and then benefit from the improvement each time the check is performed in the future.

When employing both the union by rank and path compression algorithms, the time it takes to perform *n* operations on disjoint sets is linear. That is, the forest approach, utilizing both optimizations, has a running time of *O*(*n*). You'll have to take my word on this, as the formal proof for the time complexity is quite lengthy and involved, and could easily fill several printed pages. If you are interested, though, in reading this multi-page time analysis, refer to the "Introduction to Algorithms" text listed in the references.

## References

- Alur, Rajeev. "Disjoint Sets"
- Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. "Introduction to Algorithms." MIT Press. 1990.
- Devroye, Luc. "Disjoint Set Structures".

#### Related Books

- Introduction to Algorithms by Thomas H. Cormen

**Scott Mitchell**, author of five books and founder of 4GuysFromRolla.com, has been working with Microsoft Web technologies for the past five years. Scott works as an independent consultant, trainer, and writer, and recently completed his Masters degree in Computer Science at the University of California, San Diego. He can be reached at mitchell@4guysfromrolla.com or through his blog at http://www.ScottOnWriting.NET.