.NET Matters: NamedGZipStream, Covariance and Contravariance

Article
10/18/2019

Stephen Toub

Code download available at: NETMatters0510.exe (150 KB)

Q I'm using the System.IO.Compression.GZipStream class in the Microsoft® .NET Framework 2.0 to compress data into a .gz file. When I open the archive in WinZip, it appears that the name of the file I compressed was not stored as I had expected it to be. How can I add the name to the .gz file?

A The GZipStream class implements the GZIP file format as defined in RFC 1952. The GZIP file format makes use of the DEFLATE compression algorithm, detailed in RFC 1951, and simply defines a header and footer that are used to encapsulate the compressed data, providing some metadata for it. In fact, in the .NET Framework 2.0, GZipStream is a wrapper around the DeflateStream class, also in the System.IO.Compression namespace, that exists purely to inform the DeflateStream instance that GZIP headers and footers should be used.

The problem you're running into is that the GZIP header format declares the name of the compressed file as optional, most likely due to its UNIX heritage where the data to be compressed is frequently supplied through a stdio pipe rather than with file names as command-line arguments:

% gzip < input.txt > output.gz

In this case, the input data is purely a sequence of bytes without a name. This model is very similar to the world of managed streams, where the data being compressed isn't necessarily coming from a FileStream, but might be a NetworkStream or a MemoryStream, or any other type of readable stream. As such, GZipStream and DeflateStream provide no means of specifying a name for the compressed contents, choosing to leave the compressed file's name out of the header. You can compensate for this, but it means foregoing DeflateStream's support for the GZIP file format and writing the code yourself to generate the header and footer.

Figure 1 shows my implementation of NamedGZipStream, a class very similar in purpose to GZipStream that accepts as a parameter to the constructor the name of the file being compressed. Unlike GZipStream, which acts as a decorator to the DeflateStream class (to find more information on stream decorators, see my .NET Matters column in the September 2005 issue of MSDN® Magazine online), NamedGZipStream simply derives from DeflateStream and overrides a few key methods, to be discussed shortly. Note that this implementation is write-only, with the constructor always specifying CompressionMode.Compress. It's possible to write a bidirectional NamedGZipStream, though that requires a bit more work since it needs to be able to parse arbitrary GZIP headers. Modifying NamedGZipStream as such would probably include implementing a read-only Name property to expose the parsed name from the header as well as implementing a second constructor to allow the specification of the compression mode to be used.

Figure 1 NamedGZipStream

public class NamedGZipStream : DeflateStream
{
    private long _size;
    private uint _crc;
    private bool _leaveOpen;
    private Stream _output;

    public NamedGZipStream(Stream output, string name, bool leaveOpen) :
        base(output, CompressionMode.Compress, true)
    {
        if (string.IsNullOrEmpty(name)) 
            throw new ArgumentNullException("name");
        _output = output;
        _leaveOpen = leaveOpen;

        byte [] header = { 0x1f, 0x8b, 8, 8, 0, 0, 0, 0, 4, 0 };
        _output.Write(header, 0, header.Length);

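        // RFC 1952 specifies ISO 8859-1 for the name; UTF-8 produces the
        // same bytes for plain ASCII names.
        byte[] data = Encoding.UTF8.GetBytes(name);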
        _output.Write(data, 0, data.Length);
        _output.WriteByte(0);
    }

    public override IAsyncResult BeginWrite(byte[] array, int offset, 
        int count, AsyncCallback asyncCallback, object asyncState)
    {
        IAsyncResult result = base.BeginWrite(array, offset, count, 
            asyncCallback, asyncState);
        _size += count;
        _crc = UpdateCrc(_crc, array, offset, count);
        return result;
    }

    public override void Write(byte[] array, int offset, int count)
    {
        base.Write(array, offset, count);
        _size += count;
        _crc = UpdateCrc(_crc, array, offset, count);
    }

    protected override void Dispose(bool disposing)
    {
        base.Dispose(disposing);
        if (disposing && _output != null)
        {
            _output.WriteByte((byte)(_crc & 0xff));
            _output.WriteByte((byte)((_crc >> 8) & 0xff));
            _output.WriteByte((byte)((_crc >> 16) & 0xff));
            _output.WriteByte((byte)((_crc >> 24) & 0xff));
            _output.WriteByte((byte)(_size & 0xff));
            _output.WriteByte((byte)((_size >> 8) & 0xff));
            _output.WriteByte((byte)((_size >> 16) & 0xff));
            _output.WriteByte((byte)((_size >> 24) & 0xff));
            if (!_leaveOpen) _output.Close();
            _output = null;
        }
    }

    private static uint UpdateCrc(uint crc, 
        byte[] array, int offset, int count)
    {
        crc ^= uint.MaxValue;
        while (--count >= 0)
        {
            crc = _crcTable[(crc ^ array[offset++]) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ uint.MaxValue;
    }

    private static uint[] _crcTable = new uint[256]
    {  
        0x00000000u, 0x77073096u, 0xee0e612cu, 0x990951bau, 
        0x076dc419u, ...// the other 251 values omitted for brevity
    };
}

The constructor passes the supplied target stream to the base DeflateStream class's constructor in addition to storing it in a private field for future use. It then proceeds to write out the GZIP header block. The first 2 bytes of the GZIP header (0x1F and 0x8B) are a "magic number" used to indicate that this is indeed a GZIP file (many file formats begin with magic numbers or unique identifiers so that consuming applications can quickly bail out if they detect a discrepancy). Following that is a byte (0x8) that indicates this GZIP file is using the DEFLATE compression algorithm. The next byte is a bit flag that contains some information about the format of this GZIP header and the compressed data: the fourth least-significant bit when set indicates that the header contains the optional file name, and thus I've set the flag to be 0x8. The next 4 bytes are used to store a timestamp, which, as detailed in RFC 1952, can be omitted by using all zeroes. The following 2 bytes (0x4 and 0x0) contain information about the level of compression used and the operating system on which the compression was performed. Finally, the bytes that make up the file name are retrieved and written out to the header, null-terminated.
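To make the header layout concrete, here is a minimal sketch of going the other way: reading the optional file name back out of a GZIP header. The GZipHeaderSketch class and its ReadName method are hypothetical names introduced for illustration, not part of the code download. The sketch decodes the name as ISO 8859-1, which is what RFC 1952 specifies; a non-ASCII name written as UTF-8 (as in Figure 1) would decode differently.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class GZipHeaderSketch
{
    // Flag bits from RFC 1952.
    private const int FEXTRA = 0x04; // optional extra field present
    private const int FNAME = 0x08;  // optional file name present

    // Returns the stored file name, or null if the header has none.
    public static string ReadName(Stream input)
    {
        // Fixed part: magic (2), method (1), flags (1),
        // timestamp (4), extra flags (1), operating system (1).
        byte[] fixedPart = new byte[10];
        ReadExactly(input, fixedPart, fixedPart.Length);
        if (fixedPart[0] != 0x1f || fixedPart[1] != 0x8b)
            throw new InvalidDataException("Not a GZIP stream.");
        if (fixedPart[2] != 8)
            throw new InvalidDataException("Not DEFLATE-compressed.");

        int flags = fixedPart[3];
        if ((flags & FEXTRA) != 0)
        {
            // The extra field precedes the name: a 2-byte
            // little-endian length followed by that many bytes.
            byte[] len = new byte[2];
            ReadExactly(input, len, len.Length);
            int extraLength = len[0] | (len[1] << 8);
            ReadExactly(input, new byte[extraLength], extraLength);
        }
        if ((flags & FNAME) == 0) return null;

        // The name is a zero-terminated byte string.
        MemoryStream name = new MemoryStream();
        int b;
        while ((b = input.ReadByte()) > 0) name.WriteByte((byte)b);
        if (b < 0) throw new EndOfStreamException();
        return Encoding.GetEncoding("iso-8859-1").GetString(name.ToArray());
    }

    private static void ReadExactly(Stream s, byte[] buffer, int count)
    {
        int offset = 0;
        while (offset < count)
        {
            int read = s.Read(buffer, offset, count - offset);
            if (read <= 0) throw new EndOfStreamException();
            offset += read;
        }
    }
}
```

Calling GZipHeaderSketch.ReadName on a stream opened over a file produced by NamedGZipStream should return the name passed to the constructor, at least for ASCII names.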

After the header comes the compressed data. Since the header is completely configured in the constructor of NamedGZipStream, users of the class can use the Write and BeginWrite methods as they would if they were simply using a GZipStream or a DeflateStream. However, you'll notice that I've overridden both BeginWrite and Write. This is because the GZIP footer contains two pieces of information that rely on the data written to the stream: the size of the uncompressed data and a CRC-32 checksum of the uncompressed data. In order to obtain these values, I override Write and BeginWrite, updating my internal _size and _crc fields based on the arrays of data supplied (the checksum implementation is a straight port of the sample code at the end of RFC 1952).
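Figure 1 omits most of the 256-entry lookup table for brevity. As a sketch of where those values come from, the table can be computed at startup in the manner of the make_crc_table routine in the RFC 1952 sample code, using the reflected CRC-32 polynomial 0xEDB88320. The Crc32Sketch class name is an assumption for illustration; the Update method follows the same logic as UpdateCrc in Figure 1.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class Crc32Sketch
{
    private static readonly uint[] Table = MakeTable();

    // Builds the 256-entry table: entry n is the CRC of the single byte n.
    private static uint[] MakeTable()
    {
        uint[] table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
            {
                // Shift out one bit; fold in the polynomial if it was set.
                c = ((c & 1) != 0) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
            }
            table[n] = c;
        }
        return table;
    }

    // Pass 0 as the crc for the first chunk and the previous
    // result for each chunk after that.
    public static uint Update(uint crc, byte[] array, int offset, int count)
    {
        crc ^= uint.MaxValue;
        for (int i = 0; i < count; i++)
        {
            crc = Table[(crc ^ array[offset + i]) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ uint.MaxValue;
    }
}
```

Because the running value is un-inverted on entry and re-inverted on exit, feeding the data in several chunks gives the same result as feeding it all at once, which is what lets Write and BeginWrite update the checksum incrementally.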

When the NamedGZipStream is eventually Disposed, it first calls the DeflateStream.Dispose method to ensure that any unflushed data is written out to the output stream. It then writes the footer, which consists of the CRC-32 value followed by the low 32 bits of the uncompressed size, each in little-endian byte order.
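The eight WriteByte calls in Dispose can be folded into a small helper. The following is a sketch only; GZipFooterSketch and its method names are hypothetical. It writes the CRC-32 first and then the size, matching the order used in Figure 1, and keeps only the low 32 bits of the size.

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class GZipFooterSketch
{
    // Writes a 32-bit value least-significant byte first.
    public static void WriteUInt32LittleEndian(Stream output, uint value)
    {
        for (int shift = 0; shift < 32; shift += 8)
        {
            output.WriteByte((byte)((value >> shift) & 0xFF));
        }
    }

    // Footer layout: CRC-32 of the uncompressed data, then its size.
    public static void WriteFooter(Stream output, uint crc, long size)
    {
        WriteUInt32LittleEndian(output, crc);
        // Only the low 32 bits of the size are stored.
        WriteUInt32LittleEndian(output, unchecked((uint)size));
    }
}
```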

An example application using NamedGZipStream to compress a file is shown in Figure 2. This program accepts as a command-line argument the name of the file to be compressed and compresses it to a new file with the same name as the original, except with a .gz extension. One interesting point to note about the CopyStream static method I've used in Figure 2 is that it uses a fairly large buffer for reading in data from the input stream and for writing it out to the output stream. This makes little performance difference for reading in from an input FileStream, since FileStream has buffering code built into it. However, DeflateStream (and thus both GZipStream and my NamedGZipStream) does not have its own internal buffering mechanisms for written data. That's not a problem when writing out significantly sized chunks (such as I'm doing here, with chunks 4KB in size). However, if you write to a DeflateStream with significantly smaller array sizes (such as when using WriteByte, which ends up calling Write with an array of size one), you might notice a significant increase in the compressed file size as well as a significant decrease in compression performance.

Figure 2 Compressing a File

static class Program
{
    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            string filename = args[0];
            using (Stream input = File.OpenRead(filename))
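            // File.Create truncates an existing file; File.OpenWrite would
            // leave stale bytes behind if the old file was longer.
            using (Stream output = File.Create(
                Path.ChangeExtension(filename, ".gz")))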
            using (Stream gz = new NamedGZipStream(output, 
                Path.GetFileName(filename), true))
            {
                CopyStream(input, gz);
            }
        }
    }

    static void CopyStream(Stream input, Stream output)
    {
        const int bufferSize = 4096;
        int read;
        byte[] buffer = new byte[bufferSize];
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
        }
    }
}

If you can't increase the size of the chunks you write to the DeflateStream, an easy fix is to take advantage of the buffering capabilities built into BufferedStream, a Stream-derived class in the System.IO namespace. You can create a new BufferedStream around your DeflateStream and let BufferedStream take care of buffering any data written to it, only writing to the underlying DeflateStream when it has acquired enough data. So, rather than creating a compression stream as shown here:

using(Stream compressStream = new DeflateStream(
    outputStream, CompressionMode.Compress))
{...}

I would strongly urge you to consider implementing it as follows:

using(Stream compressStream = new BufferedStream(new DeflateStream(
    outputStream, CompressionMode.Compress)))
{...}

A very simple addition like that could buy you some significant performance improvements. Of course, only by measuring will you be able to know for sure.
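One way to do that measuring is sketched below. BufferingSketch and its methods are hypothetical names; the 4096-byte buffer size is an arbitrary choice. The helper compresses a byte array one WriteByte call at a time, with or without a BufferedStream in front of the DeflateStream, so that the resulting sizes (and, with a Stopwatch around the call, the timings) can be compared. Results will vary with the runtime version, since a given DeflateStream implementation may or may not buffer internally.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class BufferingSketch
{
    // Compresses data one byte per write, optionally through a BufferedStream.
    public static byte[] CompressByteByByte(byte[] data, bool buffered)
    {
        MemoryStream output = new MemoryStream();
        Stream target = new DeflateStream(output, CompressionMode.Compress, true);
        if (buffered) target = new BufferedStream(target, 4096);
        using (target)
        {
            foreach (byte b in data) target.WriteByte(b);
        }
        // Disposing the outer stream flushes it and closes the DeflateStream.
        return output.ToArray();
    }

    // Round-trips compressed bytes so the output can be verified.
    public static byte[] Decompress(byte[] compressed)
    {
        using (DeflateStream inflate = new DeflateStream(
            new MemoryStream(compressed), CompressionMode.Decompress))
        {
            MemoryStream result = new MemoryStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = inflate.Read(buffer, 0, buffer.Length)) > 0)
            {
                result.Write(buffer, 0, read);
            }
            return result.ToArray();
        }
    }
}
```

Comparing CompressByteByByte(data, false).Length with CompressByteByByte(data, true).Length for representative data shows whether the extra buffering pays off in a particular environment.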

Q I've heard that delegates in C# 2.0 support covariance and contravariance. What do these terms mean?

A The terms "covariance" and "contravariance" are object-oriented lingo used to describe how types relate. You probably already have some degree of familiarity with the concepts represented, just without the fancy names. As covariance shows up much more frequently, I'll start there.

In mathematics, covariance is a measure of the degree to which two variables move up or down together. The term was co-opted by the OO crowd to describe the situation in which a derived type is used where its base type was expected. Let's first consider covariance as it applies to arrays, which has been supported in the .NET Framework since its inception. I've defined the following types:

class BaseClass {}
class DerivedClass : BaseClass {}

If I declare a one-dimensional array of type BaseClass, then because DerivedClass derives from BaseClass, I can actually store to that array variable a reference to a one-dimensional array of type DerivedClass, as shown here:

BaseClass[] arr = new DerivedClass[someSize];

This capability is referred to as array covariance. Specifically, for any two reference types A and B, if A derives from B, a conversion exists from an array of type A to an array of type B with the same rank (meaning the same number of dimensions). Note that array covariance in .NET does not apply to arrays of value types. For example, the following code in C# will fail to compile:

// error: Cannot implicitly convert type 'int[]' to 'object[]'
object[] arr = new int[someSize];

This is due to how reference types and value types are stored in arrays. Arrays of reference types are actually arrays of pointers, where each element in the array references the allocated reference type on the managed heap. Since all pointers are the same size (on the same platform) regardless of what they point to, the same representation can be used for an array of a base type and an array of a derived type. For value types, each element in the array stores the data for the value type rather than a pointer, and thus arrays of value types have different in-memory representations than arrays of reference types.

Of course, while array covariance for reference types is a very cool feature to have, it does come at a price. Consider the following:

BaseClass[] arr = new DerivedClass[10];
arr[4] = new BaseClass();

I've created an array of type DerivedClass and stored it to an array of type BaseClass, which is perfectly legal due to the principles of array covariance. I then store a new instance of BaseClass into an element of this array, which shouldn't be legal as it's really an array of DerivedClass. Does the compiler complain? Nope. As far as the C# compiler is concerned, this is an array of BaseClass. However, at run time an ArrayTypeMismatchException will be thrown: "Attempted to access an element as a type incompatible with the array." The exception is quite appropriate for this situation, but in order to be able to generate an exception like that, the runtime has to add type checks in most situations where objects are stored into arrays of reference types.
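The run-time check is easy to observe. In this minimal sketch, the ArrayCovarianceSketch wrapper and its method name are assumptions for illustration, with the two classes nested inside it only to keep the example self-contained:

```csharp
using System;

// Hypothetical wrapper, for illustration only.
public static class ArrayCovarianceSketch
{
    public class BaseClass { }
    public class DerivedClass : BaseClass { }

    // Returns true if storing a BaseClass into the array throws.
    public static bool StoreThrows()
    {
        // Legal at compile time thanks to array covariance.
        BaseClass[] arr = new DerivedClass[10];

        // Storing a DerivedClass is fine: it matches the real element type.
        arr[3] = new DerivedClass();

        try
        {
            // Compiles, but the runtime's type check rejects the store.
            arr[4] = new BaseClass();
            return false;
        }
        catch (ArrayTypeMismatchException)
        {
            return true;
        }
    }
}
```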

Another type of covariance available in some languages (for example, C++ and Eiffel) is return type covariance of method overrides. Consider the following base class that implements a Clone method in order to instantiate a copy of itself:

class BaseClass
{
    public virtual BaseClass Clone() { ... }
}

With invariant return types, a derived class that overrides the Clone method would need to use the exact same return type (hence, invariant) as the base class:

class DerivedClass : BaseClass
{
    public override BaseClass Clone() { ... }
}

But with covariant return types, DerivedClass could actually be declared as follows:

class DerivedClass : BaseClass
{
    public override DerivedClass Clone() { ... } // won't compile
}

The idea here is that any clients using a variable of type BaseClass will expect the Clone method to return an object of type BaseClass. However, as DerivedClass derives from BaseClass, getting an instance of DerivedClass back from Clone would be a perfectly reasonable situation. Covariant return types on method overrides make this explicit, allowing the use of a type that is derived from the return type of the method being overridden as the return type of the method doing the overriding. Unfortunately, covariant return types are not supported in the common language runtime (CLR) and they're not supported in C# or Visual Basic®, at least not yet (Suggestion Details: Need covariant return types in C# / all .Net langage).
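In the absence of covariant return types, a similar effect can be approximated by hand. The following is a sketch of one common workaround rather than anything the language provides for you: the public Clone method is non-virtual and is hidden in the derived class with the new modifier, while a protected virtual method (here called CloneCore, a hypothetical name) does the real work.

```csharp
// Hypothetical wrapper, for illustration only.
public static class CloneSketch
{
    public class BaseClass
    {
        // Non-virtual entry point; the virtual call happens in CloneCore.
        public BaseClass Clone() { return CloneCore(); }

        protected virtual BaseClass CloneCore() { return new BaseClass(); }
    }

    public class DerivedClass : BaseClass
    {
        // 'new' hides BaseClass.Clone and narrows the return type.
        public new DerivedClass Clone() { return (DerivedClass)CloneCore(); }

        // The override keeps the invariant return type the compiler requires.
        protected override BaseClass CloneCore() { return new DerivedClass(); }
    }
}
```

A caller holding a DerivedClass reference gets a DerivedClass back without a cast, and a caller holding a BaseClass reference still receives a DerivedClass instance, typed as BaseClass.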

Now, to your original question. A delegate declaration in C# defines a reference type that can be used to encapsulate a method with a specific signature. Take the following delegate declaration:

public delegate BaseClass MyDelegate();

In C# 1.x, delegates are invariant, meaning that in order to wrap an instance of this delegate around a method, that method's signature has to exactly match the signature of this delegate; it has to be parameterless and return a BaseClass. Thus, in this C# 1.x example, the first snippet is valid while the second snippet is invalid:

// Valid: return types match
public BaseClass MyMethod1() { return new BaseClass(); }
...
MyDelegate del = new MyDelegate(MyMethod1);

// Invalid in C# 1.x: return types don't match
public DerivedClass MyMethod2() { return new DerivedClass(); }
...
MyDelegate del = new MyDelegate(MyMethod2); // compiler error

DerivedClass derives from BaseClass and thus, from a typing perspective, there shouldn't be any problem using a method that returns a DerivedClass where one that returns a BaseClass is expected. DerivedClass can be returned to a caller that expects a BaseClass, since an object can be cast implicitly to a parent type. New to the .NET Framework 2.0 and to C# 2.0, delegates are covariant on their return types. This means that you can use a delegate with a method whose return type is derived from the return type of the delegate. In other words, in C# 2.0, the previous invalid snippet is now valid.
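Put together as a complete unit, return type covariance looks like the following sketch. The DelegateCovarianceSketch wrapper and its Run method are hypothetical; the types are nested inside it only to keep the example self-contained.

```csharp
using System;

// Hypothetical wrapper, for illustration only.
public static class DelegateCovarianceSketch
{
    public class BaseClass { }
    public class DerivedClass : BaseClass { }

    public delegate BaseClass MyDelegate();

    // Returns a more derived type than the delegate declares.
    public static DerivedClass MyMethod2() { return new DerivedClass(); }

    public static BaseClass Run()
    {
        // Accepted from C# 2.0 on: the return type only has to be
        // convertible to BaseClass by an implicit reference conversion.
        MyDelegate del = new MyDelegate(MyMethod2);
        return del();
    }
}
```

The caller still sees a BaseClass reference, but the object behind it is the DerivedClass instance that MyMethod2 created.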

Enough about covariance. Contravariance is, in a sense, the opposite of covariance. Consider the following delegate:

public delegate void FooHandler(object sender, FooEventArgs e);

Due to invariant delegates in C# 1.x, you can only use this delegate with methods that return void and that have the exact same parameter types. However, that's a bit restrictive. The client using this delegate must pass in an instance of FooEventArgs or of some type deriving from FooEventArgs (or null), which means that for a method to be compatible with this delegate, the real requirement is that it must be able to correctly store a FooEventArgs instance or a derived class in the parameter variable. In this case, that means that any method that has FooEventArgs or an ancestor of FooEventArgs (namely EventArgs or Object) as the second parameter would be valid. This concept is referred to as contravariance, and delegates in C# 2.0 are contravariant on their parameter types. This means that you can use a delegate with a method whose parameter types are ancestors of the corresponding parameter types in the delegate's signature:

// Valid: parameter types match
public void MyMethod1(object sender, FooEventArgs e) {}
...
FooHandler del = new FooHandler(MyMethod1);

// Invalid in C# 1.x, but valid in C# 2.0: parameter types don't match
public void MyMethod2(object sender, EventArgs e) {}
...
FooHandler del = new FooHandler(MyMethod2);

This is most useful when it comes to event handlers. While events don't require any particular method signature, in practice most events are of a delegate type that returns void and that accepts two parameters, an object representing the instance or type raising the event and an EventArgs or EventArgs-derived instance containing information about the event that took place. Due to parameter contravariance support in C# 2.0, this means a method that returns void and takes two parameters, one of type object and one of type EventArgs, can be used with any delegate that follows the standard event handler pattern, even if that delegate expects a type derived from EventArgs, rather than explicitly an EventArgs, as the second parameter.
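As a sketch of that scenario, a single handler taking (object, EventArgs) can be attached to events of different delegate types. The Publisher class, its events, and the other names below are hypothetical, introduced only for illustration.

```csharp
using System;

// Hypothetical wrapper, for illustration only.
public static class DelegateContravarianceSketch
{
    public class FooEventArgs : EventArgs { }
    public delegate void FooHandler(object sender, FooEventArgs e);

    public class Publisher
    {
        public event FooHandler Foo;   // expects a FooEventArgs
        public event EventHandler Bar; // expects a plain EventArgs

        public void RaiseAll()
        {
            if (Foo != null) Foo(this, new FooEventArgs());
            if (Bar != null) Bar(this, EventArgs.Empty);
        }
    }

    public static int Calls;

    // One handler for both events: its second parameter is an
    // ancestor of every EventArgs-derived type.
    public static void OnAnything(object sender, EventArgs e) { Calls++; }

    public static int Run()
    {
        Calls = 0;
        Publisher p = new Publisher();
        p.Foo += OnAnything; // relies on parameter contravariance
        p.Bar += OnAnything; // exact match
        p.RaiseAll();
        return Calls;
    }
}
```

Run returns the number of times the shared handler fired, one call per event raised.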

In his May 2004 column, Paul DiLascia showed how Reflection.Emit and code generation could be used to dynamically generate the event handlers necessary to spy on all events on a Form in a Windows® Forms application. With delegate parameter type contravariance, the code in Figure 3 is all that is necessary (most of which is just verifying that the event's handler type does in fact have a compatible signature). For example, I might use the following code to trace all events on a Form (barring those that aren't compatible with the standard signature):

private void Form1_Load(object s, EventArgs ev)
{
    EventTracer.TraceEvents(this);
}

Figure 3 Tracing Events on an Object

public class EventTracer
{
    private string _name;

    private EventTracer(string name) { _name = name; }

    private void Handler(object sender, EventArgs e)
    {
        System.Diagnostics.Trace.WriteLine(_name + ": " + e);
    }

    public static void TraceEvents(object toTrace)
    {
        if (toTrace == null) throw new ArgumentNullException("toTrace");
        MethodInfo mi = typeof(EventTracer).GetMethod("Handler",
            BindingFlags.NonPublic | BindingFlags.Instance);
        foreach (EventInfo ei in toTrace.GetType().GetEvents())
        {
            MethodInfo invoke = ei.EventHandlerType.GetMethod("Invoke");
            ParameterInfo[] pars = invoke.GetParameters();
            if (invoke.ReturnType.Equals(typeof(void)) &&
                pars.Length == 2 && 
                (pars[1].ParameterType.IsSubclassOf(typeof(EventArgs)) ||
                 pars[1].ParameterType.Equals(typeof(EventArgs))))
            {
                Delegate d = Delegate.CreateDelegate(ei.EventHandlerType,
                    new EventTracer(ei.Name), mi);
                ei.AddEventHandler(toTrace, d);
            }
        }
    }
}

Or I might use EventTracer in order to trace all events on the current application domain, as shown in the following line:

EventTracer.TraceEvents(AppDomain.CurrentDomain);

Putting it all together, whereas in C# 1.x, a delegate and corresponding method signature have to match exactly (they are invariant), delegates in C# 2.0 support covariant return types and contravariant parameter types, so you can now use a delegate with a method whose return type is derived from the return type of the delegate and whose parameter types are ancestors of the corresponding parameters in the delegate's signature. Very cool.

Send your questions and comments to netqa@microsoft.com.

Stephen Toub is the Technical Editor for MSDN Magazine.
