System.IO FAQ
Frequently asked questions about System.IO classes, which include classes such as File, Directory, FileInfo, DirectoryInfo, FileStream, StreamReader, and StreamWriter.
We love to hear about any ideas, issues, or requests you have. You may want to know why something was designed a particular way, how best to solve a problem, or why your code isn't working the way you expect. In addition, feel free to send the BCL team any suggestions for future releases of the Framework, such as additional classes to include (it always helps to outline a scenario you need to support), or new APIs for existing classes. To submit your question, simply send an e-mail to The BCL Team (bclpub@microsoft.com).
How do I convert between bytes and chars? How do encodings work?
Internally, our String class uses UTF-16 LE (Unicode Transformation Format, 16-bit, little endian). When converting Strings or characters into bytes to write to a file, you can choose among numerous encodings, or ways of representing a character in bytes. One of the first was ASCII, which defines 128 character values. Eventually people added various ANSI code pages, each of which maps the upper 128 values of a byte differently, and some of which use multiple bytes to represent a single character. This is what many people think of when they talk about text files, because the legacy C language decision that a char is 1 byte shaped how everyone conceptualized a text file. However, no ANSI code page can represent the full range of Unicode characters. Also, not all ANSI code pages are installed on all machines, and each localized version of the operating system (OS) may have a different default ANSI code page. So a file saved by Notepad on a Japanese machine, for example, will be readable on all Japanese machines but may be illegible on US English machines.
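As a rough illustration of converting between chars and bytes, the sketch below encodes the same string with Encoding.Unicode (UTF-16 LE) and Encoding.ASCII and then decodes it again. The class name EncodingDemo and the helper RoundTrip are made up for this example; only the System.Text calls are real APIs.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encode text to bytes, then decode the bytes back.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);   // chars -> bytes
        return encoding.GetString(bytes);         // bytes -> chars
    }

    public static void Main()
    {
        string text = "caf\u00E9"; // U+00E9 lies outside ASCII's 128 values

        // UTF-16 LE: two bytes per char here, low byte first.
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(utf16)); // 63-00-61-00-66-00-E9-00

        // ASCII cannot represent U+00E9, so the encoder substitutes '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(ascii)); // 63-61-66-3F

        Console.WriteLine(RoundTrip(text, Encoding.Unicode) == text); // True
        Console.WriteLine(RoundTrip(text, Encoding.ASCII) == text);   // False
    }
}
```

The ASCII round trip comes back as "caf?", which is the kind of silent data loss that a too-narrow encoding causes.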
The solution is to use an encoding such as UTF-16 (System.Text.UnicodeEncoding) or UTF-8 (System.Text.UTF8Encoding). These convert your characters into one or more bytes, and vice versa, and can represent every Unicode code point, including the roughly one million supplementary code points outside the Basic Multilingual Plane, which UTF-16 encodes as surrogate pairs.
Is there a good way to detect the encoding of a text file?
Instead, the Unicode standard defines one code point (U+FEFF) as a Unicode byte order mark (BOM). If you have files stored in UTF-16 in either big or little endian byte order, and the first two bytes in the file are either FE FF (big endian) or FF FE (little endian), you can be relatively certain the file is UTF-16 text and you can determine its byte order. While UTF-8 doesn't suffer from the byte order problem of UTF-16, we can still use the Unicode BOM (encoded in UTF-8 as the bytes EF BB BF) as a strong hint that the file is UTF-8 text. Other encodings may theoretically define some well-known prefix as well; they can express this by making Encoding.GetPreamble() return an interesting byte[].
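The byte sequences above can be read straight from Encoding.GetPreamble(). The following is a minimal sketch of BOM sniffing built on that method; BomSniffer and GuessFromBom are hypothetical names, not part of the Framework, and the helper deliberately checks only the three encodings discussed here.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns true when data begins with the given encoding's preamble.
    static bool StartsWithPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        if (preamble.Length == 0 || data.Length < preamble.Length) return false;
        for (int i = 0; i < preamble.Length; i++)
            if (data[i] != preamble[i]) return false;
        return true;
    }

    // Hypothetical helper: guesses an encoding from a leading BOM.
    // Returns null when no known BOM is present.
    // Caveat: UTF-32 LE's preamble also begins with FF FE; that case is not handled.
    public static Encoding GuessFromBom(byte[] data)
    {
        Encoding[] candidates = { Encoding.UTF8, Encoding.Unicode, Encoding.BigEndianUnicode };
        foreach (Encoding candidate in candidates)
            if (StartsWithPreamble(data, candidate)) return candidate;
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));             // EF-BB-BF
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        byte[] sample = { 0xFE, 0xFF, 0x00, 0x41 }; // big endian BOM followed by 'A'
        Console.WriteLine(GuessFromBom(sample).WebName); // utf-16BE
    }
}
```

A null result only means no BOM was found; the file may still be text in any encoding.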
StreamReader will automatically look for these byte sequences and adjust the encoding when you try to read the first character in the file. (This is why StreamReader has a property named CurrentEncoding, not Encoding.) You can disable this by calling an appropriate constructor, passing in false for detectEncodingFromByteOrderMarks.
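To make the detection behavior concrete, here is a small sketch that opens the same UTF-16 LE file twice, once with detectEncodingFromByteOrderMarks set to true and once with false, and reports CurrentEncoding after reading. ReaderDetectionDemo and ReadAndReport are names invented for this example, and the temporary file exists only for the demonstration.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDetectionDemo
{
    // Hypothetical helper: reads the whole file as text and reports the
    // encoding the reader ended up using. UTF-8 is only the starting guess.
    public static string ReadAndReport(string path, bool detect, out string encodingName)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, detect))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            encodingName = reader.CurrentEncoding.WebName;
            return text;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // "Hi" as UTF-16 LE, preceded by the FF FE byte order mark.
            File.WriteAllBytes(path, new byte[] { 0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00 });

            string name;
            string text = ReadAndReport(path, true, out name);
            Console.WriteLine(name + ": " + text); // utf-16: Hi

            // With detection off the reader keeps UTF-8 and the text is garbled.
            ReadAndReport(path, false, out name);
            Console.WriteLine(name); // utf-8
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```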
Note that StreamWriter will also call Encoding.GetPreamble() and write out those bytes at the beginning of a text file. This is great: it means you can unambiguously detect the encoding of a text file. However, many of our developers came from a C background and found the Unicode byte order mark at the beginning of their UTF-8 text files very confusing, and sometimes inconvenient for interoperability with Unicode-ignorant text editors like vi or older versions of emacs. So StreamWriter, by default, uses a version of UTF8Encoding whose GetPreamble method returns an empty byte[]. You can explicitly specify Encoding.UTF8 in your code if you want to write out the Unicode byte order mark in UTF-8 files.
I've noticed that the IO APIs treat spaces in paths in an interesting way. Is there a pattern and reason to this?
The characters that are trimmed are:
This code demonstrates this trimming, or removal of whitespace, by showing that looking for a path such as c:\Windows\System32 will be successful.
The rules are:
Note: Whitespace at the beginning of a directory name, or file name, is allowed.
Why is there no capability to view Drive Information in V1.0/V1.1?