September 2009

Volume 24 Number 09

CLR Inside Out - What’s New in the Base Class Libraries in .NET Framework 4

By Justin Van Patten | September 2009

Just about everyone who uses Microsoft .NET uses the Base Class Libraries (BCL). When we make the BCL better, almost every managed developer benefits. This column focuses on the new additions to the BCL in .NET Framework 4 beta 1.

Three of the additions have already been covered in previous articles -- I'll start with a brief review of these:

  • Support for code contracts
  • Parallel extensions (tasks, concurrent collections, and coordination data structures)
  • Support for tuples

Then, in the main part of the article, I'll talk about three new additions we haven't written about yet:

  • File IO improvements
  • Support for memory-mapped files
  • A sorted set collection

There isn't room to describe all of the new BCL improvements in this article, but you can read about them in upcoming posts on the BCL team blog at blogs.msdn.com/bclteam. These include:

  • Support for arbitrarily large integers
  • Generic variance annotations on interfaces and delegates
  • Support for accessing 32-bit and 64-bit registry views and creating volatile registry keys
  • Globalization data updates
  • Improved System.Resources resource lookup fallback logic
  • Compression improvements

We're also planning to include some additional features and improvements for beta 2 that you will be able to read about on the BCL team blog around the time when beta 2 ships.

Code Contracts

One of the major new features added to the BCL in the .NET Framework 4 is code contracts. This new library provides a language-agnostic way to specify pre-conditions, post-conditions and object invariants in your code. You'll find more information on code contracts in Melitta Andersen's August 2009 MSDN Magazine CLR Inside Out column. You should also take a look at the Code Contracts DevLabs site at msdn.microsoft.com/en-us/devlabs/dd491992.aspx and on the BCL team blog at blogs.msdn.com/bclteam.

Parallel Extensions

As multi-core processors have become more important on the client, and massively parallel servers have become more widespread, it's more important than ever to make it easy for programmers to use all of these processors. Another major new addition to the BCL in .NET 4 is the Parallel Extensions (PFX) feature, which is being delivered by the Parallel Computing Platform team. PFX includes the Task Parallel Library (TPL), Coordination Data Structures, Concurrent Collections and Parallel LINQ (PLINQ) -- all of which will make it easier to write code that can take advantage of multi-core machines. More background on PFX can be found in Stephen Toub and Hazim Shafi's article "Improved Support For Parallelism In The Next Version Of Visual Studio" in the October 2008 issue of MSDN Magazine, available online at msdn.microsoft.com/en-us/magazine/cc817396.aspx. The PFX team blog at blogs.msdn.com/pfxteam is also a great source of information on PFX.

Tuples

Another addition to the BCL in .NET 4 is support for tuples, which are similar to anonymous classes that you can create on the fly. A tuple is a data structure used in many functional and dynamic languages, such as F# and IronPython. By providing common tuple types in the BCL, we are helping to better facilitate language interoperability. Many programmers find tuples to be convenient, particularly for returning multiple values from methods -- so even C# or Visual Basic developers will find them useful. Matt Ellis's July 2009 MSDN Magazine CLR Inside Out column discusses the new support for tuples in .NET 4.
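
For example, here's a minimal sketch (not from the original column; the MinMax helper and its values are hypothetical) showing how a method can return two values as a single Tuple<int, int> instead of defining a one-off type or using out parameters:

// Hypothetical helper that returns both the minimum and maximum of an array as one tuple.
static Tuple<int, int> MinMax(int[] values) {
    int min = values[0], max = values[0];
    foreach (var v in values) {
        if (v < min) min = v;
        if (v > max) max = v;
    }
    return Tuple.Create(min, max);
}

var result = MinMax(new[] { 3, 7, 2, 9, 4 });
Console.WriteLine("Min={0}, Max={1}", result.Item1, result.Item2);
// Output: Min=2, Max=9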

File IO Improvements

One of the new additions to .NET 4 that we haven't written about in detail is the new methods on System.IO.File for reading and writing text files. Since .NET 2.0, if you needed to read the lines of a text file, you could call the File.ReadAllLines method, which returns a string array of all the lines in the file. The following code uses File.ReadAllLines to read the lines of a text file and writes the length of the line along with the line itself to the console:

string[] lines = File.ReadAllLines("file.txt");
foreach (var line in lines) {
    Console.WriteLine("Length={0}, Line={1}", line.Length, line);
}

Unfortunately, there is a subtle issue with this code. The issue stems from the fact that ReadAllLines returns an array. Before ReadAllLines can return, it must read all lines and allocate an array to return. This isn’t too bad for relatively small files, but it can be problematic for large text files that have millions of lines. Imagine opening a text file of the phone book or a 900-page novel. ReadAllLines will block until all lines are loaded into memory. Not only is this an inefficient use of memory, but it also delays the processing of the lines, because you can’t access the first line until all lines have been read into memory.

To work around the issue, you could use a TextReader to open the file and then read the lines of the file into memory, one line at a time. This works, but it’s not as simple as calling File.ReadAllLines:

using (TextReader reader = new StreamReader("file.txt")) {
    string line;
    while ((line = reader.ReadLine()) != null) {
        Console.WriteLine("Length={0}, Line={1}", line.Length, line);
    }
}

In .NET 4, we've added a new method to File named ReadLines (as opposed to ReadAllLines) that returns IEnumerable<string> instead of string[]. This new method is much more efficient because it does not load all of the lines into memory at once; instead, it reads the lines one at a time. The following code uses File.ReadLines to efficiently read the lines of a file and is just as easy to use as the less efficient File.ReadAllLines:

IEnumerable<string> lines = File.ReadLines(@"verylargefile.txt");
foreach (var line in lines) {
    Console.WriteLine("Length={0}, Line={1}", line.Length, line);
}

Note that the call to File.ReadLines returns immediately. You no longer have to wait until all of the lines are read into memory before you can iterate through them. The iteration of the foreach loop actually drives the reading of the file. Not only does this significantly improve the perceived performance of the code, because you can start processing the lines as they’re being read, but it’s also much more efficient because the lines are read one at a time. Using File.ReadLines has an added benefit in that it allows you to break out of the loop early if necessary, without wasting time reading additional lines you don’t care about.
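
For example, here's a minimal sketch (the file name and search string are hypothetical) that stops reading as soon as a matching line is found, so the rest of the file is never read from disk:

foreach (var line in File.ReadLines("verylargefile.txt")) {
    if (line.Contains("needle")) {
        Console.WriteLine("Found: {0}", line);
        break; // stop enumerating; the remaining lines are never read
    }
}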

We've also added new overloads to File.WriteAllLines that take an IEnumerable<string> parameter similar to the existing overloads that take a string[] parameter. And we added a new method called AppendAllLines that takes an IEnumerable<string> for appending lines to a file. These new methods allow you to easily write or append lines to a file without having to pass in an array. This means if you have a collection of strings, you can pass it directly to these methods without having to first convert it into a string array.
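
As a quick sketch (the file name and collection contents here are hypothetical), any collection of strings can be passed straight to the new overloads:

List<string> pendingLines = new List<string> { "first", "second", "third" };
// No ToArray call is needed; the overloads accept any IEnumerable<string>.
File.WriteAllLines("output.txt", pendingLines);
File.AppendAllLines("output.txt", new[] { "fourth" });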

There are other places in the BCL that would benefit from the use of IEnumerable<T> over arrays. Take the file system enumeration APIs, for example. In previous versions of the framework, to get the files in a directory, you would call a method like DirectoryInfo.GetFiles, which returns an array of FileInfo objects. You could then iterate through the FileInfo objects to get information about the files, such as the Name and Length of each file. The code to do that might look like this:

DirectoryInfo directory = new DirectoryInfo(@"\\share\symbols");
FileInfo[] files = directory.GetFiles();
foreach (var file in files) {
    Console.WriteLine("Name={0}, Length={1}", file.Name, file.Length);
}

There are two issues with this code. The first issue shouldn’t surprise you; just as with File.ReadAllLines, it stems from the fact that GetFiles returns an array. Before GetFiles can return, it must retrieve the complete list of files in the directory from the file system and then allocate an array to return. This means you have to wait for all files to be retrieved before getting the first results, and it’s an inefficient use of memory. If the directory contains a million files, you have to wait until all one million files have been retrieved and a million-element array has been allocated.

The second issue with the above code is a bit more subtle. FileInfo instances are created by passing a file path to FileInfo’s constructor. Properties on FileInfo such as Length and CreationTime are initialized the first time one of the properties is accessed. When a property is first accessed, FileInfo.Refresh is called, which calls into the operating system to retrieve the file’s properties from the file system. This avoids a call to retrieve the data if the properties are never used, and when they are used, it helps to ensure the data isn’t stale when first accessed. This works great for one-off instances of FileInfo, but it can be problematic when enumerating the contents of a directory, because it means additional calls to the file system will be made to get the file properties. These additional calls can hinder performance when looping through the results. This is especially problematic if you are enumerating the contents of a remote file share, because each one means an additional round-trip to the remote machine across the network.

In .NET 4, we have addressed both of these issues. To address the first issue, we have added new methods on Directory and DirectoryInfo that return IEnumerable<T> instead of arrays.

Like File.ReadLines, these new IEnumerable<T>-based methods are much more efficient than the older array-based equivalents. Consider the following code that has been updated to use .NET 4's DirectoryInfo.EnumerateFiles method instead of DirectoryInfo.GetFiles:

DirectoryInfo directory = new DirectoryInfo(@"\\share\symbols");
IEnumerable<FileInfo> files = directory.EnumerateFiles();
foreach (var file in files) {
    Console.WriteLine("Name={0}, Length={1}", file.Name, file.Length);
}

Unlike GetFiles, EnumerateFiles does not have to block until all of the files are retrieved, nor does it have to allocate an array. Instead, it returns immediately, and you can process each file as it is returned from the file system.

To address the second issue, DirectoryInfo now makes use of data that the operating system already provides from the file system during enumeration. The underlying Win32 functions that DirectoryInfo calls to get the contents of the file system during enumeration actually include data about each file, such as the length and creation time. We now use this data when initializing the FileInfo and DirectoryInfo instances returned from both the older array-based and new IEnumerable<T>-based methods on DirectoryInfo. This means that in the preceding code, there are no additional underlying calls to the file system to retrieve the length of the file when file.Length is called, since this data has already been initialized.

Together, the new IEnumerable<T>-based methods on File and Directory enable some interesting scenarios. Consider the following code:

var errorlines =
    from file in Directory.EnumerateFiles(@"C:\logs", "*.log")
    from line in File.ReadLines(file)
    where line.StartsWith("Error:", StringComparison.OrdinalIgnoreCase)
    select string.Format("File={0}, Line={1}", file, line);

File.WriteAllLines(@"C:\errorlines.log", errorlines);

This uses the new methods on Directory and File, along with LINQ, to efficiently find files that have a .log extension and, within those files, the lines that start with "Error:". The query then projects the results into a new sequence of strings, with each string formatted to show the path to the file and the error line. Finally, File.WriteAllLines is used to write the error lines to a new file named "errorlines.log," without having to convert the error lines into an array. The great thing about this code is that it is very efficient. At no time have we pulled the entire list of files into memory, nor have we pulled the entire contents of a file into memory. Whether C:\logs contains 10 files or a million, and whether each file contains 10 lines or a million, the code above is just as efficient and uses a minimal amount of memory.

Memory-Mapped Files

Support for memory-mapped files is another new feature in the .NET Framework 4. Memory-mapped files can be used to edit large files or create shared memory for inter-process communication (IPC). Memory-mapped files allow you to map a file into the address space of a process. Once mapped, an application can simply read or write to memory to access or modify the contents of the file. Since the file is accessed through the operating system’s memory manager, the file is automatically partitioned into a number of pages that are paged in and paged out of memory as needed. This makes working with large files easier, as you don’t have to handle the memory management yourself. It also allows for complete random access to a file without the need for seeking.

Memory-mapped files can be created without a backing file. Such memory-mapped files are backed by the system paging file (but only if one exists and the contents need to be paged out of memory). Memory-mapped files can be shared across multiple processes, which means they’re a great way to create shared memory for inter-process communication. Each mapping can have a name associated with it that other processes can use to open the same memory-mapped file.

To use memory-mapped files, you must first create a MemoryMappedFile instance using one of the following static factory methods on the System.IO.MemoryMappedFiles.MemoryMappedFile class:

  • CreateFromFile
  • CreateNew
  • CreateOrOpen
  • OpenExisting

After this, you can create one or more views that actually map the file into the process’s address space. Each view can map all or part of the memory-mapped file, and views can overlap.

Using more than one view may be necessary if the file is greater than the size of the process’s logical memory space available for mapping (2GB on a 32-bit machine). You can create a view by calling either the CreateViewStream or CreateViewAccessor method on the MemoryMappedFile object. CreateViewStream returns an instance of MemoryMappedFileViewStream, which inherits from System.IO.UnmanagedMemoryStream. This can be used like any other Stream in the framework. CreateViewAccessor, on the other hand, returns an instance of MemoryMappedFileViewAccessor, which inherits from the new System.IO.UnmanagedMemoryAccessor class. UnmanagedMemoryAccessor enables random access, whereas UnmanagedMemoryStream enables sequential access.
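
As a rough sketch of the accessor-based approach (the file name and offset are hypothetical, and the exact view type names may differ slightly between beta builds), random access through a view accessor might look like this:

using (var mmf = MemoryMappedFile.CreateFromFile("data.bin"))
using (var accessor = mmf.CreateViewAccessor()) {
    // Read the 32-bit integer at byte offset 16, increment it, and write it back in place.
    int value = accessor.ReadInt32(16);
    accessor.Write(16, value + 1);
}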

New Methods ...

... on System.IO.File

  • public static IEnumerable<string> ReadLines(string path)
  • public static void WriteAllLines(string path, IEnumerable<string> contents)
  • public static void AppendAllLines(string path, IEnumerable<string> contents)

... on System.IO.Directory

  • public static IEnumerable<string> EnumerateDirectories(string path)
  • public static IEnumerable<string> EnumerateFiles(string path)
  • public static IEnumerable<string> EnumerateFileSystemEntries(string path)

... on System.IO.DirectoryInfo

  • public IEnumerable<DirectoryInfo> EnumerateDirectories()
  • public IEnumerable<FileInfo> EnumerateFiles()
  • public IEnumerable<FileSystemInfo> EnumerateFileSystemInfos()

The following sample shows how to use memory-mapped files to create shared memory for IPC. Process 1, as shown in Figure 1, creates a new MemoryMappedFile instance using the CreateNew method, specifying the name of the memory-mapped file along with the capacity in bytes. This will create a memory-mapped file backed by the system paging file. Note that internally, the specified capacity is rounded up to the next multiple of the system’s page size (if you’re curious, you can get the system page size from Environment.SystemPageSize, which is new in .NET 4). Next, a view stream is created using CreateViewStream and “Hello World!” is written to the stream using an instance of BinaryWriter. Then the second process is started.

Process 2, as shown in Figure 2, opens the existing memory-mapped file using the OpenExisting method, specifying the appropriate name of the memory-mapped file. From there, a view stream is created and the string is read using an instance of BinaryReader.

Figure 1 Process 1

using (var mmf = MemoryMappedFile.CreateNew("mymappedfile", 1000))
using (var stream = mmf.CreateViewStream()) {
    var writer = new BinaryWriter(stream);
    writer.Write("Hello World!");
    var startInfo = new ProcessStartInfo("process2.exe");
    startInfo.UseShellExecute = false;
    Process.Start(startInfo).WaitForExit();
}

Figure 2 Process 2

using (var mmf = MemoryMappedFile.OpenExisting("mymappedfile"))
using (var stream = mmf.CreateViewStream()) {
    var reader = new BinaryReader(stream);
    Console.WriteLine(reader.ReadString());
}

SortedSet<T>

In addition to the new collections in System.Collections.Concurrent (part of PFX), the .NET Framework 4 includes a new set collection in System.Collections.Generic, called SortedSet<T>. Like HashSet<T>, which was added in .NET 3.5, SortedSet<T> is a collection of unique elements, but unlike HashSet<T>, SortedSet<T> keeps the elements in sorted order.

SortedSet<T> is implemented using a self-balancing red-black tree that gives a performance complexity of O(log n) for insert, delete, and lookup. HashSet<T>, on the other hand, provides slightly better performance of O(1) for insert, delete, and lookup. If you just need a general-purpose set, in most cases you should use HashSet<T>. However, if you need to keep the elements in sorted order, get the subset of elements in a particular range, or get the min or max element, you'll want to use SortedSet<T>. The following code demonstrates the use of SortedSet<T> with integers:

var set1 = new SortedSet<int>() { 2, 5, 6, 2, 1, 4, 8 };
bool first = true;
foreach (var i in set1) {
    if (first) {
        first = false;
    }
    else {
        Console.Write(",");
    }
    Console.Write(i);
}
// Output: 1,2,4,5,6,8

The set is created and initialized using C#'s collection initializer syntax. Note that the integers are added to the set in no particular order. Also note that 2 is added twice. It should be no surprise that when looping through set1's elements, we see that the integers are in sorted order and that the set contains only one 2. Like HashSet<T>, SortedSet<T>'s Add method has a return type of bool that can be used to determine whether the item was successfully added (true) or if it wasn't added because the set already contains the item (false).
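
Here's a small sketch illustrating the return value of Add:

var set2 = new SortedSet<int>();
bool added = set2.Add(5);       // true: 5 was not yet in the set
bool addedAgain = set2.Add(5);  // false: the set already contains 5
Console.WriteLine("{0}, {1}", added, addedAgain);
// Output: True, False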

Figure 3 shows how to get the max and min elements within the set and get a subset of elements in a particular range.

Figure 3 Getting Max, Min and Subset View of Elements

var set1 = new SortedSet<int>() { 2, 5, 6, 2, 1, 4, 8 };
Console.WriteLine("Min: {0}", set1.Min);
Console.WriteLine("Max: {0}", set1.Max);
var subset1 = set1.GetViewBetween(2, 6);
Console.Write("Subset View: ");
bool first = true;
foreach (var i in subset1) {
    if (first) {
        first = false;
    }
    else {
        Console.Write(",");
    }
    Console.Write(i);
}
// Output:
// Min: 1
// Max: 8
// Subset View: 2,4,5,6

The GetViewBetween method returns a view of the original set. This means that any changes made to the view will be reflected in the original. For example, if 3 is added to subset1 in the code above, it's really added to set1. Note that you cannot add items to a view outside of the specified bounds. For example, attempting to add a 9 to subset1 in the code above will result in an ArgumentException because the view is between 2 and 6.
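
To make this concrete, here's a brief sketch that continues the Figure 3 example, showing that adds through the view write through to the original set, while an out-of-range add throws:

subset1.Add(3);                        // 3 is added to set1 as well
Console.WriteLine(set1.Contains(3));   // Output: True
try {
    subset1.Add(9);                    // outside the view's bounds of 2 and 6
}
catch (ArgumentException) {
    Console.WriteLine("Cannot add 9 to a view between 2 and 6");
}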

Try It Out

The new BCL features discussed in this column are a sampling of the new functionality available in the .NET Framework 4. These features are available in preview form as part of the .NET Framework 4 beta 1, which is available for download along with Visual Studio 2010 beta 1 at msdn.microsoft.com/en-us/netframework/dd819232.aspx. Download the beta, try out the new functionality, and let us know what you think at connect.microsoft.com/VisualStudio/content/content.aspx?ContentID=12362. Be sure to also keep an eye on the BCL team blog for upcoming posts on some of the other BCL additions and an announcement on what’s new in beta 2.


Post your questions and comments on the CLR Team Blog at https://blogs.msdn.com/clrteam/archive/tags/CLR+Inside+Out/default.aspx.

 

Justin Van Patten is a program manager on the CLR team at Microsoft, where he works on the Base Class Libraries. You can reach him via the BCL team blog at blogs.msdn.com/bclteam.