System.IO FAQ
Frequently asked questions about the System.IO classes, such as File, Directory, FileInfo, DirectoryInfo, FileStream, StreamReader, and StreamWriter.
How do I convert between bytes and chars? How do encodings work?
Is there a good way to detect the encoding of a text file?
I've noticed that the IO APIs treat spaces in paths in an interesting way. Is there a pattern and reason to this?
Why is there no capability to view Drive Information in V1.0/V1.1?
We love to hear about any ideas, issues, or requests you have. You might want to know why something was designed a particular way, how best to solve a problem, or why your code isn't working the way you expect. Feel free also to send the BCL team suggestions for future releases of the Framework, such as additional classes to include (it always helps to outline the scenario you need to support) or new APIs for existing classes.
To submit your question, simply send an e-mail to The BCL Team (bclpub@microsoft.com).
How do I convert between bytes and chars? How do encodings work?
As you may know, a String is a sequence of Unicode characters, and those characters can be represented as bytes in many different encodings.
Internally, our String class uses UTF-16 LE (Unicode Transformation Format, 16-bit, little endian). When you convert Strings or characters into bytes to write to a file, you can choose among many encodings, each a different way of representing a character as bytes. One of the first was ASCII, which defines 128 character values. Later, people added various ANSI code pages, each of which maps the upper 128 values of a byte differently, and some of which use multiple bytes to represent a single Unicode code point. This is what many people think of when they talk about text files: the legacy C language decision that a char is 1 byte shaped how everyone conceptualized a text file.
However, these ANSI code pages cannot represent the full range of Unicode characters. Also, not all ANSI code pages are installed on all machines, and each localized version of the operating system (OS) may have a different default ANSI code page. So a file that Notepad saves on a Japanese machine in the default ANSI code page, for example, will be readable on all Japanese machines but may be illegible on US English machines.
The solution is to use an encoding such as UTF-16 (System.Text.UnicodeEncoding) or UTF-8 (System.Text.UTF8Encoding). These convert your characters into one or more bytes and vice versa, and they can represent every Unicode code point, including the roughly one million supplementary code points outside the Basic Multilingual Plane, which UTF-16 represents as surrogate pairs.
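As a rough illustration of the conversion described above, the following sketch encodes a short string with several encodings and decodes it again. The class name EncodingRoundTrip and the method RoundTrip are hypothetical names chosen for this example; only the System.Text.Encoding members are Framework APIs.

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    // Hypothetical helper: encodes text with the given encoding,
    // then decodes the resulting bytes with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        string text = "caf\u00E9"; // three ASCII letters plus U+00E9

        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length);    // 5: U+00E9 takes two bytes
        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length); // 8: two bytes per char
        Console.WriteLine(RoundTrip(text, Encoding.UTF8) == text); // True
        Console.WriteLine(RoundTrip(text, Encoding.ASCII));        // caf? (U+00E9 is lost)

        string clef = "\U0001D11E"; // a code point outside the Basic Multilingual Plane
        Console.WriteLine(clef.Length);                            // 2: one surrogate pair
        Console.WriteLine(Encoding.UTF8.GetBytes(clef).Length);    // 4
    }
}
```

The ASCII line shows why a 128-value encoding is lossy: the character it cannot represent comes back as a question mark, while UTF-8 and UTF-16 round-trip the string unchanged.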
Is there a good way to detect the encoding of a text file?
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
Instead, the Unicode standard defines one code point (U+FEFF) as a Unicode byte order mark (BOM). If the first two bytes of a file are FE FF (big endian) or FF FE (little endian), you can be relatively certain that the file is UTF-16 text, and you know its byte order. While UTF-8 doesn't suffer from the byte order problem that UTF-16 has, we can still use the BOM, which UTF-8 encodes as the three bytes EF BB BF, as a strong hint that the file is UTF-8 text. Other encodings could in theory define a well-known prefix of their own; they can express this by making Encoding::GetPreamble() return a non-empty byte[].
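To see the preambles discussed above, a small sketch can print what GetPreamble returns for a few encodings. The class PreambleDemo and the method PreambleHex are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: formats an encoding's preamble as hex,
    // for example "FF-FE". An empty preamble yields an empty string.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF-BB-BF

        // A UTF8Encoding constructed with false emits no byte order mark.
        Console.WriteLine(PreambleHex(new UTF8Encoding(false)).Length); // 0
    }
}
```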
StreamReader will automatically look for these byte sequences and adjust the encoding when you try to read the first character in the file. (This is why StreamReader has a property named CurrentEncoding, not Encoding.) You can disable this by calling an appropriate constructor, passing in false for detectEncodingFromByteOrderMarks.
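The detection described above can be observed with a sketch along the following lines. It writes a small UTF-16 file, opens it with a StreamReader that was told to expect UTF-8, and reports CurrentEncoding after the first read. The class BomDetection and its methods are hypothetical names, and the temporary file is only an assumption made so the sample is self-contained.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetection
{
    // Hypothetical helper: opens the file as UTF-8 and reports the encoding
    // the reader has settled on after looking at the first bytes.
    public static string DetectedEncodingName(string path, bool detectFromBom)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.UTF8, detectFromBom))
        {
            reader.Peek(); // forces the reader to fill its buffer and check for a BOM
            return reader.CurrentEncoding.WebName;
        }
    }

    // Writes "hi" as UTF-16 LE, preceded by the FF FE byte order mark.
    public static void WriteUtf16Sample(string path)
    {
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.Unicode))
        {
            writer.Write("hi");
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            WriteUtf16Sample(path);
            Console.WriteLine(DetectedEncodingName(path, true));  // utf-16
            Console.WriteLine(DetectedEncodingName(path, false)); // utf-8
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

With detection turned off, the reader keeps the encoding it was given and would misread the UTF-16 bytes as UTF-8, which is why the default is to look for a byte order mark.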
Note that StreamWriter will also call Encoding::GetPreamble() and write out those bytes at the beginning of a text file. This is great: it means you can unambiguously detect the encoding of a text file. However, a lot of our developers came from a C background and found the Unicode byte order mark at the beginning of their UTF-8 text files very confusing, and sometimes inconvenient for interoperability with Unicode-ignorant text editors like vi or older versions of Emacs. So StreamWriter, by default, uses a version of UTF8Encoding whose GetPreamble method returns an empty byte[]. If you want the Unicode byte order mark written to UTF-8 files, explicitly pass Encoding.UTF8 to the StreamWriter constructor.
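The difference between the default StreamWriter and one given Encoding.UTF8 shows up in the bytes on disk. The sketch below writes the same two characters both ways and dumps the file contents. BomWriting and WriteAndReadBack are hypothetical names, and the temporary file is an assumption of the example.

```csharp
using System;
using System.IO;
using System.Text;

static class BomWriting
{
    // Hypothetical helper: writes "hi" to the path and returns the file bytes.
    // A null encoding means "use the StreamWriter default".
    public static byte[] WriteAndReadBack(string path, Encoding encoding)
    {
        using (StreamWriter writer = encoding == null
            ? new StreamWriter(path)
            : new StreamWriter(path, false, encoding))
        {
            writer.Write("hi");
        }
        return File.ReadAllBytes(path);
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Default writer: UTF-8 without a byte order mark.
            Console.WriteLine(BitConverter.ToString(WriteAndReadBack(path, null)));          // 68-69
            // Encoding.UTF8: the three-byte BOM precedes the text.
            Console.WriteLine(BitConverter.ToString(WriteAndReadBack(path, Encoding.UTF8))); // EF-BB-BF-68-69
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```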
I've noticed that the IO APIs treat spaces in paths in an interesting way. Is there a pattern and reason to this?
Certain whitespace characters are trimmed from the beginning of a path, from the end, and from a few other places where a space cannot make sense. Whitespace at the very beginning or end of a fully qualified path is never valid, and the same holds in a few other specific positions. To make the API more usable, whitespace is simply trimmed (removed) in these situations.
The characters that are trimmed are:
- (char) 0x9, (char) 0xA, (char) 0xB, (char) 0xC, (char) 0xD, (char) 0x20, (char) 0xA0, (char) 0x2000, (char) 0x2001, (char) 0x2002, (char) 0x2003, (char) 0x2004, (char) 0x2005, (char) 0x2006, (char) 0x2007, (char) 0x2008, (char) 0x2009, (char) 0x200A, (char) 0x200B, (char) 0x3000, (char) 0xFEFF
The code sample after the rules demonstrates this trimming by showing that looking up a path such as C:\Windows\system32 still succeeds when whitespace appears in the trimmed positions. The rules are:
- Whitespace at the beginning of the specified path is trimmed.
- Whitespace at the end of the specified path is trimmed.
- Whitespace after a colon, but before the path separator character, is trimmed.
- Whitespace at the end of a directory or file name is trimmed.
Note: Whitespace at the beginning of a directory name or file name is allowed, so it is not trimmed.
[C#]
using System;
using System.IO;

class Test {
    public static void Main() {
        // Whitespace after the colon, before the separator: trimmed.
        Console.WriteLine( Directory.Exists(@"C: \Windows\system32") ); // prints True
        // Whitespace at the beginning of the path: trimmed.
        Console.WriteLine( Directory.Exists(@" C:\Windows\system32") ); // prints True
        Console.WriteLine( Directory.Exists(@"C: \Windows\system32") ); // prints True
        // Whitespace at the beginning of a directory name: kept, so the lookup fails.
        Console.WriteLine( Directory.Exists(@"C:\ Windows\system32") ); // prints False
        // Whitespace at the end of a directory name: trimmed.
        Console.WriteLine( Directory.Exists(@"C:\Windows \system32") ); // prints True
        Console.WriteLine( Directory.Exists(@"C:\Windows\ system32") ); // prints False
        // Whitespace at the end of the path: trimmed.
        Console.WriteLine( Directory.Exists(@"C:\Windows\system32 ") ); // prints True
    }
}
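If you would rather normalize a path yourself than rely on this implicit trimming, the first two rules can be approximated with String.Trim and the character list given earlier. This is only a sketch: PathWhitespace and TrimEnds are hypothetical names, not Framework APIs, and the helper does not attempt the rules that apply inside a path.

```csharp
using System;

static class PathWhitespace
{
    // The characters listed in this FAQ as being trimmed from paths.
    static readonly char[] Trimmed = {
        (char)0x9, (char)0xA, (char)0xB, (char)0xC, (char)0xD, (char)0x20, (char)0xA0,
        (char)0x2000, (char)0x2001, (char)0x2002, (char)0x2003, (char)0x2004,
        (char)0x2005, (char)0x2006, (char)0x2007, (char)0x2008, (char)0x2009,
        (char)0x200A, (char)0x200B, (char)0x3000, (char)0xFEFF
    };

    // Hypothetical helper: removes the listed characters from both ends of a path.
    // Whitespace inside the path is left alone.
    public static string TrimEnds(string path)
    {
        return path.Trim(Trimmed);
    }

    public static void Main()
    {
        Console.WriteLine("[" + TrimEnds(" C:\\Windows\\system32 ") + "]"); // [C:\Windows\system32]
        Console.WriteLine("[" + TrimEnds("C:\\ Windows") + "]");            // [C:\ Windows]
    }
}
```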
Why is there no capability to view Drive Information in V1.0/V1.1?
Because of scheduling constraints on V1.0 and V1.1, some features simply didn't make it into the product. You can expect to see this capability in the upcoming Visual Studio 2005 (Whidbey) release of the Framework.
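For readers on .NET Framework 2.0 or later, drive information is exposed through the System.IO.DriveInfo class. A minimal sketch of listing drives might look like the following; DriveListing and Describe are hypothetical names, and the sample deliberately reads only properties that do not require the drive to be ready.

```csharp
using System;
using System.IO;

static class DriveListing
{
    // Hypothetical helper: builds a one-line summary of a drive without
    // reading size properties, which can throw when a drive is not ready.
    public static string Describe(DriveInfo drive)
    {
        return drive.Name + " (" + drive.DriveType + ", ready: " + drive.IsReady + ")";
    }

    public static void Main()
    {
        foreach (DriveInfo drive in DriveInfo.GetDrives())
        {
            Console.WriteLine(Describe(drive));
        }
    }
}
```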