Unicode Data Conversion

Article
02/06/2008

Glossary

Boundary: The point of interaction between systems or applications that use different character encodings.

In the foreseeable future, programs must be ready for a world in which some data is encoded using Unicode and some data is encoded using existing character sets. This section describes a function that converts text files to Unicode or Unicode files to ANSI-based text. The code follows the guidelines for generic coding presented at the end of this chapter. It can be used both in programs that are compiled for Unicode and in programs that are not. At the heart of the program is a function called Convert that takes three arguments: a code-page ID and two filenames. Assuming that the file UNICODE.TXT contained Unicode text, the following line of code would convert text in the file UNICODE.TXT to the current ANSI character set (designated as CP_ACP):

Convert( CP_ACP, TEXT("UNICODE.TXT"), TEXT("ANSI.TXT") );

If the file ANSI.TXT contained ANSI text, the following line of code would do the reverse:

Convert( CP_ACP, TEXT("ANSI.TXT"), TEXT("UNICODE.TXT") );

Convert uses the Win32 function IsTextUnicode to determine whether the input file is in Unicode. The easiest way to convert a file is to place both the input and output files into simple buffers in memory, a technique called memory mapping. The C++ class shown below wraps the few calls needed to set this up under Win32. A class is a convenient way to keep the file handle, the mapping handle, and the view pointer together.

class CMapFile
{
public:
    CMapFile(LPTSTR lpFilename,   // opens/creates a file
             BOOL fWriteable,     // and maps a view
             DWORD &dwSize);
    ~CMapFile();                  // unmaps the view and
                                  // closes the file

    void Truncate(DWORD dwSize)   // file is truncated to the
        { _dwSize = dwSize; }     // new size when it is closed

    PBYTE _Buff;                  // pointer to the buffer to which the
                                  // file is mapped in memory
private:
    // These variables contain the handles and pointers
    // associated with the file and a mapped view of the file.
    HANDLE _hFile;                // file handle
    HANDLE _hMapFile;             // handle for the file-mapping object
    BOOL   _fWriteable;           // TRUE for output file
    DWORD  _dwSize;               // size of file
};

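To create a pointer to a buffer that contains the entire file, create an object of class CMapFile. Because the constructor takes the size by reference, pass a variable rather than a literal:

DWORD dwSize = 0L;   // 0 means map the whole file
CMapFile mapFile(TEXT("UNICODE.TXT"), FALSE, dwSize);

and access its pointer variable _Buff via mapFile._Buff.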

The constructor for CMapFile, shown below, takes care of opening the file, creating a file mapping of the desired size (a size of 0 means the whole file), and mapping a view. It updates the dwSize parameter with the actual size of the file.

CMapFile :: CMapFile(LPTSTR lpFilename, BOOL fWriteable,
                     DWORD &dwSize)
{
    // Start from a known state so that the destructor is safe
    // even if one of the steps below fails.
    _Buff = NULL;
    _hFile = INVALID_HANDLE_VALUE;
    _hMapFile = NULL;
    _dwSize = 0;
    _fWriteable = fWriteable;

    // Open a file for read or create a file for read/write.

    _hFile = CreateFile(lpFilename,
                 _fWriteable ?
                     GENERIC_READ | GENERIC_WRITE :
                     GENERIC_READ,
                 0,                      // exclusive access
                 NULL,                   // no security
                 _fWriteable ?
                     CREATE_ALWAYS : OPEN_EXISTING,
                 FILE_ATTRIBUTE_NORMAL,
                 NULL);                  // no template file

    // Compare against INVALID_HANDLE_VALUE; casting the handle
    // to DWORD truncates it on 64-bit Windows.
    if (_hFile == INVALID_HANDLE_VALUE) goto error;

    // Create a file mapping that is either read or read/write.

    _hMapFile = CreateFileMapping(_hFile,
                    NULL,                // no security attributes
                    _fWriteable ? PAGE_READWRITE : PAGE_READONLY,
                    0L,                  // high size DWORD
                    dwSize,              // low size DWORD
                    NULL);               // no name

    if (!_hMapFile) goto error;

    // Create a mapped view that is either read or read/write.

    _Buff = (PBYTE) MapViewOfFile(_hMapFile,
                        _fWriteable ?
                            FILE_MAP_WRITE : FILE_MAP_READ,
                        0L,              // high offset DWORD
                        0L,              // low offset DWORD
                        dwSize);         // bytes to map (0 = whole file)

    if (!_Buff) goto error;

    // Obtain and return the size of the file opened.

    if ((!dwSize) &&
        INVALID_FILE_SIZE == (dwSize = GetFileSize(_hFile, NULL)))
        goto error;

    _dwSize = dwSize;
    return;

error:
    ReportError();
    return;
}

When you are finished with the mapped view, the destructor will unmap the view, close the handle, truncate the file to the requested size (if it was writeable), and close the file. If mapFile is created as an automatic object on the stack, the destructor for CMapFile will be called automatically when the variable mapFile goes out of scope. This way you can't forget to clean up.

CMapFile :: ~CMapFile()
{
    if (_Buff)
        UnmapViewOfFile(_Buff);     // Unmap the view.
    if (_hMapFile)
        CloseHandle(_hMapFile);     // Close the handle to the file mapping.
    if (_hFile == INVALID_HANDLE_VALUE)
        return;                     // The file was never opened.
    if (_fWriteable)
    {                               // Attempt to truncate the file.
        if (_dwSize != SetFilePointer(_hFile,
                           _dwSize, // low 32 bits
                           NULL,    // no high 32 bits
                           FILE_BEGIN))
            ReportError();

        if (!SetEndOfFile(_hFile))
            ReportError();
    }
    CloseHandle(_hFile);            // Close the file.
}
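The scope-based cleanup described above can be seen in isolation with a small portable sketch. ScopedResource and RunScopes are hypothetical names introduced only for illustration; the class records open and close events in a log instead of calling Win32, so it shows the ordering of constructor and destructor calls rather than real file handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for CMapFile: it records when its "resource"
// is acquired and released instead of calling Win32.
class ScopedResource
{
public:
    ScopedResource(const std::string &name, std::vector<std::string> &log)
        : _name(name), _log(log)
    {
        _log.push_back("open " + _name);
    }

    ~ScopedResource()
    {
        _log.push_back("close " + _name);   // runs automatically at scope exit
    }

    // Non-copyable, so the resource is released exactly once.
    ScopedResource(const ScopedResource &) = delete;
    ScopedResource &operator=(const ScopedResource &) = delete;

private:
    std::string _name;
    std::vector<std::string> &_log;
};

// Mirrors the shape of a function that maps an input file and then,
// in a nested scope, an output file. The inner object is destroyed first.
std::vector<std::string> RunScopes()
{
    std::vector<std::string> log;
    {
        ScopedResource in("in", log);
        {
            ScopedResource out("out", log);
        }
    }
    return log;
}
```

Calling RunScopes yields the events in the order open in, open out, close out, close in. Deleting the copy operations is a design choice of this sketch; CMapFile as listed does not do this, so copying a CMapFile object would unmap and close the same handles twice.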

A nifty way to handle all of the error reporting, and one that works in any language, is to use FormatMessage to retrieve a localized message from the system message table, as in this error-reporting function:

void ReportError()
{
    LPTSTR lpMessage = NULL;
    DWORD dwErrCode = GetLastError();
    if (FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER |
                      FORMAT_MESSAGE_FROM_SYSTEM |
                      FORMAT_MESSAGE_IGNORE_INSERTS,
                      NULL,                // no source buffer needed
                      dwErrCode,           // error code for this message
                      0,                   // default language ID
                      (LPTSTR)&lpMessage,  // allocated by fcn
                      0,                   // minimum size of buffer
                      NULL))               // no inserts
    {
        MessageBox(NULL, lpMessage, TEXT("File Error"),
                   MB_ICONSTOP | MB_OK);
        LocalFree(lpMessage);              // free the buffer FormatMessage allocated
    }
}
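For readers who want to try the same idea outside a Windows message box, the following is a minimal portable sketch. It asks the C++ standard library for the system's text for an error code through std::system_category(). FormatErrorReport is a hypothetical helper, not part of the article's code. The message text depends on the platform and locale, and treating the code as a GetLastError value is an assumption that holds only for Windows implementations that map system_category to Win32 error codes.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>

// Builds a one-line report such as "File Error: <system text> (code N)".
// The system text comes from the platform, so its wording and language vary.
std::string FormatErrorReport(const std::string &title, int code)
{
    return title + ": " + std::system_category().message(code) +
           " (code " + std::to_string(code) + ")";
}
```

Returning a string instead of displaying it keeps the reporting policy (message box, log file, console) with the caller.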

IsTextUnicode tests whether a text file is prefixed by a BOM, but because the BYTE_ORDER_MARK is not part of the data to be converted, the following function steps over it before translation begins:

#define BYTE_ORDER_MARK 0xFEFF

LPVOID IsFileMarkedUnicode( LPWSTR lp, DWORD &dwSize )
{
    if ( dwSize > sizeof(WCHAR) )
    {
        if ( *lp == BYTE_ORDER_MARK )
        {
            dwSize -= sizeof(WCHAR);
            lp++;
        }
    }
    return lp;
}
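The function above compares the first WCHAR with 0xFEFF, which on a little-endian machine matches a file that starts with the bytes FF FE. A file written in the opposite byte order starts with FE FF and would read as 0xFFFE. The sketch below works on raw bytes and reports which of the two marks, if any, is present. BomKind, BomResult, and SkipUtf16Bom are hypothetical names introduced for this illustration; the article's Convert does not handle byte-swapped input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class BomKind { None, Utf16LE, Utf16BE };

struct BomResult
{
    BomKind kind;               // which mark was found
    const std::uint8_t *text;   // first byte after the mark, if any
    std::size_t size;           // bytes remaining after the mark
};

// Checks the first two bytes of the buffer for a UTF-16 byte order mark
// and steps over it, adjusting the pointer and the remaining size.
BomResult SkipUtf16Bom(const std::uint8_t *buf, std::size_t size)
{
    BomResult r = { BomKind::None, buf, size };
    if (size >= 2)
    {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            r.kind = BomKind::Utf16LE;
        else if (buf[0] == 0xFE && buf[1] == 0xFF)
            r.kind = BomKind::Utf16BE;

        if (r.kind != BomKind::None)
        {
            r.text += 2;
            r.size -= 2;
        }
    }
    return r;
}
```

A caller that receives Utf16BE would need to swap each pair of bytes before handing the text to a routine that expects native little-endian WCHARs.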

The next function does the reverse—it marks a Unicode output file with a BYTE_ORDER_MARK. It initializes the buffer with 0xFEFF and adjusts the buffer size and pointer.

LPVOID MarkFileUnicode( LPWSTR lp, DWORD &dwSize )
{
    if ( dwSize > sizeof(WCHAR) )
    {
        *lp = BYTE_ORDER_MARK;
        dwSize -= sizeof(WCHAR);   // size of buffer is reduced by one
                                   // character
        lp++;
    }
    return lp;
}

Now you are ready to pull it all together in the function Convert, shown below. Convert performs the following tasks:

Opens and maps the input file by creating a CMapFile object as a local variable
Applies a heuristic check by calling IsTextUnicode
Estimates an upper bound on the size of the output file
Creates and maps the output file by creating a CMapFile object as a local variable
Converts the data
Truncates the output file
Cleans up implicitly by calling the destructors for the CMapFile objects when they go out of scope

void Convert(UINT iCodePage, LPTSTR lpInFile, LPTSTR lpOutFile)
{
    DWORD dwSizeIn, dwSizeOut;

    dwSizeIn = 0L;   // Use the full size for input file.

    // Memory map the input file and obtain its actual size in bytes.

    CMapFile mapIn( lpInFile, FALSE, dwSizeIn );

    if ( mapIn._Buff && dwSizeIn )
    {

        /* Analyze the input buffer for presence of BOM marker and
         * step over it if Unicode is the input file or
         * prepend BOM marker if Unicode is the output file.
         * Size is reduced and pointers are adjusted as needed.
         */

        dwSizeOut = dwSizeIn;

        if (IsTextUnicode(mapIn._Buff, dwSizeIn, NULL))
        {
            LPWSTR lpInBuff =
                (LPWSTR)
                IsFileMarkedUnicode((LPWSTR)mapIn._Buff, dwSizeIn);

            // Create an output file of the maximum size needed.

            CMapFile mapOut( lpOutFile, TRUE, dwSizeOut );

            if (mapOut._Buff)
            {
                // Perform the actual conversion.

                BOOL bUsedDef = FALSE;

                dwSizeOut = WideCharToMultiByte(iCodePage,
                    WC_COMPOSITECHECK | WC_DEFAULTCHAR, // default mapping
                    lpInBuff,
                    dwSizeIn/sizeof(WCHAR), // number of wide characters
                    (LPSTR) mapOut._Buff,
                    dwSizeOut,              // size of output buffer in bytes
                    "\x7f",                 // use DEL as default char
                    &bUsedDef);             // was default char used
                // Truncate the output file to its actual length.
                mapOut.Truncate(dwSizeOut);
            }
        }
        else
        {

            // Create an output file of the maximum size needed.
            dwSizeOut = (dwSizeIn+1) * sizeof(WCHAR);
            CMapFile mapOut( lpOutFile, TRUE, dwSizeOut );

            if (mapOut._Buff)
            {
                // Write a Byte Order Mark.
                LPWSTR lpOutBuff = (LPWSTR)
                    MarkFileUnicode((LPWSTR) mapOut._Buff, dwSizeOut);
                // Perform the actual conversion. dwSizeOut is in bytes,
                // so divide to get the buffer capacity in wide characters.
                dwSizeOut = MultiByteToWideChar(iCodePage,
                    MB_COMPOSITE | MB_USEGLYPHCHARS,
                    (LPSTR) mapIn._Buff,          // input buffer
                    dwSizeIn,                     // length of input in bytes
                    lpOutBuff,                    // output buffer
                    dwSizeOut/sizeof(WCHAR));     // max number of wchars to write
                // Truncate the output file to its actual length
                // (converted characters plus the Byte Order Mark).
                mapOut.Truncate((dwSizeOut+1) * sizeof(WCHAR));
            }
        }

        if ( !dwSizeOut )
        {
            ReportError();
        }
    }
}
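The buffer-size arithmetic inside Convert is easy to lose among the API calls, so here it is on its own as a small portable sketch. The function names and the kWideCharSize constant are hypothetical and stand in for the inline expressions in the listing. The estimates assume at most one wide character per input byte and at most two output bytes per wide character, which fits single-byte and double-byte code pages without decomposition; they are not upper bounds for a code page such as CP_UTF8.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kWideCharSize = 2;   // sizeof(WCHAR) on Win32

// ANSI -> Unicode: one wide character per input byte in the worst case,
// plus one wide character for the Byte Order Mark.
std::size_t MaxUnicodeBytes(std::size_t ansiBytes)
{
    return (ansiBytes + 1) * kWideCharSize;
}

// Unicode -> ANSI: the output is allocated at the size of the input,
// which allows up to two bytes per wide character.
std::size_t MaxAnsiBytes(std::size_t unicodeBytes)
{
    return unicodeBytes;
}

// Final size of a Unicode output file: the wide characters actually
// written, plus the Byte Order Mark, expressed in bytes.
std::size_t FinalUnicodeBytes(std::size_t wideCharsWritten)
{
    return (wideCharsWritten + 1) * kWideCharSize;
}
```

For a 10-byte ANSI input, the output file is created at 22 bytes; if all 10 characters convert one-to-one, the final truncated size is also 22 bytes.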

If there is no BOM in the file, IsTextUnicode can still recognize Unicode data by performing several heuristic checks. Since the checks aren't perfect, it's usually best to let the user confirm the choice made by the function. Even if the file type is recognized correctly, the data might not convert cleanly from Unicode if the input file contains characters that cannot be expressed in the code page of the output file. In this situation, WideCharToMultiByte can substitute a default character. Convert uses this option, but substituting the default character is not always the right answer. For example, if a filename entered in Unicode cannot be translated to the code page used in the directory structure of a FAT file system floppy disk, you should instead raise an error condition. Using the Windows default character ? in a filename might result in a file named ????????.???, which is a bad idea.
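The "raise an error instead of substituting" policy can be sketched without Win32 as follows. ToLatin1Strict is a hypothetical function, and ISO 8859-1 is used here only as a simple stand-in for an ANSI code page because its mapping from Unicode is trivial. In Win32 code, the comparable signal is the flag that WideCharToMultiByte sets through its last parameter (bUsedDef in Convert).

```cpp
#include <cassert>
#include <string>

// Converts UTF-16 text to ISO 8859-1 (a stand-in for an ANSI code page).
// Returns false and leaves 'out' empty if any character has no mapping,
// rather than replacing it with a default character.
bool ToLatin1Strict(const std::u16string &in, std::string &out)
{
    out.clear();
    for (char16_t ch : in)
    {
        if (ch > 0xFF)
        {
            out.clear();     // do not return a partially converted name
            return false;
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(ch)));
    }
    return true;
}
```

A caller handling filenames would check the return value and report the failure to the user, instead of creating a file whose name is made of substituted characters.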
