Handling Uniform Resource Locators

A Uniform Resource Locator (URL) is a compact representation of the location and access method for a resource located on the Internet. Each URL consists of a scheme (HTTP, HTTPS, or FTP) and a scheme-specific string. This string can also include a combination of a directory path, search string, or name of the resource. The WinINet functions provide the ability to create, combine, break down, and canonicalize URLs. For more information on URLs, see RFC-1738 on Uniform Resource Locators (URL).

The URL functions operate in a task-oriented manner. The content and format of the URL that is given to a function are not verified. The calling application should track the use of these functions to ensure that the data is in the intended format. For example, when called with no flags, the InternetCanonicalizeUrl function converts the character "%" into the escape sequence "%25". If InternetCanonicalizeUrl is then called again on the canonicalized URL, the escape sequence "%25" is converted into "%2525", which no longer represents the original character.
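The double-encoding pitfall can be illustrated without calling WinINet at all. The following is a minimal sketch in portable C++. EncodePercentOnly is a hypothetical stand-in that escapes only the "%" character, which is just one part of what InternetCanonicalizeUrl does when no flags are used; it is not the WinINet implementation.

```cpp
#include <string>

// Hypothetical stand-in for one part of the default encoding step:
// every '%' becomes the escape sequence "%25". Like the URL functions,
// it does not check whether the input has already been encoded, so the
// caller has to keep track of that.
std::string EncodePercentOnly(const std::string& url)
{
    std::string out;
    for (char ch : url)
    {
        if (ch == '%')
            out += "%25";
        else
            out += ch;
    }
    return out;
}
```

Calling EncodePercentOnly once on "http://example.com/100%" gives "http://example.com/100%25". Calling it again on that result gives "http://example.com/100%2525", which is the situation described above.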

What Is a Canonicalized URL?

The format of all URLs must follow the accepted syntax and semantics in order to access resources through the Internet. Canonicalization is the process of formatting a URL to follow this accepted syntax and semantics.

Characters that must be encoded include any characters that have no corresponding graphic character in the US-ASCII coded character set (hexadecimal 80-FF, which are not used in the US-ASCII coded character set, and hexadecimal 00-1F and 7F, which are control characters), blank spaces, "%" (which is used to encode other characters), and unsafe characters (<, >, ", #, {, }, |, \, ^, ~, [, ], and ').
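The character classes listed above can be expressed as a small predicate. This is a sketch in portable C++ that mirrors the list in this topic; the function name MustEncode is an assumption made for illustration, and the table WinINet uses internally might differ in detail.

```cpp
#include <cstring>

// Returns true if the character belongs to one of the categories listed
// above: outside US-ASCII (0x80-0xFF), a control character (0x00-0x1F or
// 0x7F), a blank space, '%', or one of the unsafe characters.
bool MustEncode(unsigned char ch)
{
    if (ch >= 0x80)
        return true;                       // no US-ASCII graphic character
    if (ch <= 0x1F || ch == 0x7F)
        return true;                       // control characters
    if (ch == ' ' || ch == '%')
        return true;                       // blank space and the escape marker

    // Unsafe characters, exactly as listed in the paragraph above.
    static const char kUnsafe[] = "<>\"#{}|\\^~[]'";
    return std::strchr(kUnsafe, ch) != nullptr;
}
```

For example, MustEncode(' ') and MustEncode('<') return true, while MustEncode('a') and MustEncode('/') return false.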

Using the WinINet Functions to Handle URLs

The following table summarizes the URL functions.

Function                  Description
InternetCanonicalizeUrl   Canonicalizes the URL.
InternetCombineUrl        Combines base and relative URLs.
InternetCrackUrl          Parses a URL string into components.
InternetCreateUrl         Creates a URL string from components.
InternetOpenUrl           Begins retrieving an FTP, HTTP, or HTTPS resource.

Canonicalizing URLs

Canonicalizing a URL is the process that converts a URL, which might contain unsafe characters such as blank spaces, reserved characters, and so on, into an accepted format.

The InternetCanonicalizeUrl function can be used to canonicalize URLs. This function is very task-oriented, so the application should track its use carefully. InternetCanonicalizeUrl does not verify whether the URL passed to it is already canonicalized, nor does it verify that the URL it returns is valid.

The following five flags control how InternetCanonicalizeUrl handles a particular URL. The flags can be used in combination. If no flags are used, the function encodes the URL by default.

Value                    Meaning
ICU_BROWSER_MODE         Do not encode or decode characters after "#" or "?", and do not remove trailing white space after "?". If this value is not specified, the entire URL is encoded, and trailing white space is removed.
ICU_DECODE               Convert all %XX sequences to characters, including escape sequences, before the URL is parsed.
ICU_ENCODE_SPACES_ONLY   Encode spaces only.
ICU_NO_ENCODE            Do not convert unsafe characters to escape sequences.
ICU_NO_META              Do not remove meta sequences (such as "." and "..") from the URL.

The ICU_DECODE flag should be used only on canonicalized URLs, because it assumes that all %XX sequences are escape codes and converts them into the characters indicated by the code. If the URL has a "%" symbol in it that is not part of an escape code, ICU_DECODE still treats it as one. This characteristic might cause InternetCanonicalizeUrl to create an invalid URL.

To use InternetCanonicalizeUrl to return a completely decoded URL, the ICU_DECODE and ICU_NO_ENCODE flags must be specified. This setup assumes that the URL being passed to InternetCanonicalizeUrl has been previously canonicalized.

Combining Base and Relative URLs

A relative URL is a compact representation of the location of a resource relative to an absolute base URL. The base URL must be known to the parser and usually includes the scheme, network location, and parts of the URL path. An application can call InternetCombineUrl to combine the relative URL with its base URL. InternetCombineUrl also canonicalizes the resultant URL.

Cracking URLs

The InternetCrackUrl function separates a URL into its component parts and returns the components indicated by the URL_COMPONENTS structure that is passed to the function.

The components that make up the URL_COMPONENTS structure are the scheme number, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the scheme and port numbers, has a string member that holds the information, and a member that holds the length of the string member. The scheme and port numbers have only a member that stores the corresponding value; they are both returned on all successful calls to InternetCrackUrl.

To get the value of a particular component in the URL_COMPONENTS structure, the member that stores the string length of that component must be set to a nonzero value. The string member can be either the address of a buffer or NULL.

If the pointer member contains the address of a buffer, the string length member must contain the size of that buffer. InternetCrackUrl returns the component information as a string in the buffer and stores the string length in the string length member.

If the pointer member is NULL, the string length member can be set to any nonzero value. InternetCrackUrl stores the address of the first character of the URL string that contains the component information and sets the string length to the number of characters in the remaining part of the URL string that pertains to the component.

All pointer members set to NULL with a nonzero length member point to the appropriate starting point in the URL string. The length stored in the length member must be used to determine the end of the individual component's information.

To finish initializing the URL_COMPONENTS structure properly, the dwStructSize member must be set to the size of the URL_COMPONENTS structure, in bytes.

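The following example cracks the URL entered in an edit box (for example, IDC_PreOpen1) and lists its components in a list box (for example, IDC_PreOpenList). Because the pointer members are set to NULL, InternetCrackUrl returns pointers into the original URL string rather than separate zero-terminated strings. To display only the information for an individual component, the listURLpart helper copies just the number of characters given by that component's length member into an output buffer before adding it to the list box.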

#include <windows.h>
#include <tchar.h>
#include <strsafe.h>
#include <wininet.h>
#include <stdlib.h>

#pragma comment(lib, "wininet.lib")
#pragma comment(lib, "user32.lib")

#define  CRACKER_BUFFER_SIZE           MAX_PATH

// For sample source code implementing the InternetErrorOut( ) 
// function referenced below, see the "Handling Errors" topic  
// under "Using WinInet"
extern BOOL WINAPI InternetErrorOut( HWND hWnd, DWORD dwError,
                                     LPCTSTR szFailingFunctionName );

// Forward declarations of the two listURLpart helper overloads:
BOOL listURLpart( HWND hDlg, int nListBoxID, 
                  LPTSTR szPartName, LPTSTR part, DWORD partLength );
BOOL listURLpart( HWND hDlg, int nListBoxID, 
                  LPTSTR szPartName, int partValue );

// Static list describing the URL Scheme types 
// enumerated in INTERNET_SCHEME:
TCHAR* schemeType[] =
{
  TEXT( "[Partial URL]" ),                //  0
  TEXT( "[Unknown scheme]" ),             //  1
  TEXT( "[Default scheme]" ),             //  2
  TEXT( "FTP" ),                          //  3
  TEXT( "Gopher" ),                       //  4
  TEXT( "HTTP" ),                         //  5
  TEXT( "HTTPS" ),                        //  6
  TEXT( "File" ),                         //  7
  TEXT( "News" ),                         //  8
  TEXT( "MailTo" ),                       //  9
  TEXT( "Socks" ),                        // 10
  TEXT( "JavaScript" ),                   // 11
  TEXT( "VBScript" )                      // 12
};
#define  CRACKER_SCHEME_TYPE_ARRAY_SIZE      13

BOOL WINAPI Cracker( HWND hDlg, int nURLtextBoxId, int nListBoxId )
{
   int i, j;
   TCHAR* failedFunctionName;
   TCHAR URL_buffer[CRACKER_BUFFER_SIZE];

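   URL_COMPONENTS URLparts;

   // Zero the structure so that no member starts out with stack garbage
   ZeroMemory( &URLparts, sizeof( URLparts ) );
   URLparts.dwStructSize = sizeof( URLparts );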

   // The following elements determine which components are displayed
   URLparts.dwSchemeLength    = 1;
   URLparts.dwHostNameLength  = 1;
   URLparts.dwUserNameLength  = 1;
   URLparts.dwPasswordLength  = 1;
   URLparts.dwUrlPathLength   = 1;
   URLparts.dwExtraInfoLength = 1;

   URLparts.lpszScheme     = NULL;
   URLparts.lpszHostName   = NULL;
   URLparts.lpszUserName   = NULL;
   URLparts.lpszPassword   = NULL;
   URLparts.lpszUrlPath    = NULL;
   URLparts.lpszExtraInfo  = NULL;

   SendDlgItemMessage( hDlg, nListBoxId, LB_RESETCONTENT, 0, 0 );
   if( !GetDlgItemText( hDlg, nURLtextBoxId, 
                        URL_buffer, CRACKER_BUFFER_SIZE ) )
   {
       failedFunctionName = TEXT( "GetDlgItemText" );
       goto CrackerError_01;
   }

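   {
       // StringCchLength writes a size_t. Passing the address of an int
       // cast to size_t* would overrun the int on 64-bit builds.
       size_t urlLength = 0;
       if( FAILED( StringCchLength( URL_buffer, CRACKER_BUFFER_SIZE, 
                                    &urlLength ) ) )
       {
           failedFunctionName = TEXT( "StringCchLength" );
           goto CrackerError_01;
       }
   }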

   if( !InternetCrackUrl( URL_buffer, (DWORD)_tcslen( URL_buffer ), 0, 
                          &URLparts ) )
   {
       failedFunctionName = TEXT( "InternetCrackUrl" );
       goto CrackerError_01;
   }

   failedFunctionName = TEXT( "listURLpart" );

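   // INTERNET_SCHEME values start at -2, so shift them to index the table
   i = URLparts.nScheme + 2;
   if( ( i >= 0 ) && ( i < CRACKER_SCHEME_TYPE_ARRAY_SIZE ) )
   {
       // Receive the length in a size_t; casting an int* to size_t*
       // would overrun the int on 64-bit builds.
       size_t schemeNameLength = 0;
       if( FAILED( StringCchLength( schemeType[i], 
                                    CRACKER_BUFFER_SIZE, 
                                    &schemeNameLength ) ) ||
           !listURLpart( hDlg, nListBoxId, 
                         TEXT("Scheme type"), 
                         schemeType[i], (DWORD) schemeNameLength ) )
           goto CrackerError_01;
   }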

   if( !listURLpart( hDlg, nListBoxId, TEXT( "Scheme text" ), 
                     URLparts.lpszScheme, 
                     URLparts.dwSchemeLength ) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "Host name" ), 
                     URLparts.lpszHostName, 
                     URLparts.dwHostNameLength) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "Port number" ), 
                     (int) URLparts.nPort ) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "User name" ), 
                     URLparts.lpszUserName, 
                     URLparts.dwUserNameLength) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "Password" ), 
                     URLparts.lpszPassword, 
                     URLparts.dwPasswordLength) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "Path" ), 
                     URLparts.lpszUrlPath, 
                     URLparts.dwUrlPathLength) ||
       !listURLpart( hDlg, nListBoxId, TEXT( "Extra information"), 
                     URLparts.lpszExtraInfo, 
                     URLparts.dwExtraInfoLength))
           goto CrackerError_01;

   return( TRUE );

CrackerError_01:
// For sample source code of the InternetErrorOut( ) function 
// referenced below, see the "Handling Errors" 
// topic under "Using WinInet"
   InternetErrorOut( hDlg, GetLastError( ), failedFunctionName );
   return FALSE;
}

// listURLpart( ) helper function for string parts
BOOL listURLpart( HWND hDlg, int nListBoxId, 
                  LPTSTR szPartName, LPTSTR part, DWORD partLength )
{
  TCHAR outputBuffer[CRACKER_BUFFER_SIZE];
  LPTSTR nextStart;
  size_t nextSize;

  if( partLength == 0 )  // Just skip empty ones
    return( TRUE );

  if( FAILED( StringCchCopyEx( outputBuffer, 
                              (size_t) CRACKER_BUFFER_SIZE,
                               szPartName, &nextStart, 
                               &nextSize, 0 ) ) ||
      FAILED( StringCchCopyEx( nextStart, nextSize, TEXT( ": " ), 
                               &nextStart, &nextSize, 0 ) ) ||
      FAILED( StringCchCopyNEx( nextStart, nextSize, part, 
                                (size_t) partLength,
                                &nextStart, &nextSize, 0 ) ) )
    return( FALSE );

  *nextStart = 0;
  if( SendDlgItemMessage( hDlg, nListBoxId, LB_ADDSTRING, 0, 
                          (LPARAM)outputBuffer ) < 0 )
    return( FALSE );
  return( TRUE );
}

// listURLpart( ) helper function for numeric parts
BOOL listURLpart( HWND hDlg, int nListBoxId, 
                  LPTSTR szPartName, int partValue )
{
  TCHAR outputBuffer[CRACKER_BUFFER_SIZE];

  if( FAILED( StringCchPrintf( outputBuffer, 
                               (size_t) CRACKER_BUFFER_SIZE,
                               TEXT( "%s: %d" ), szPartName, 
                               partValue ) ) ||
      ( SendDlgItemMessage( hDlg, nListBoxId, LB_ADDSTRING, 0, 
                            (LPARAM)outputBuffer ) < 0 ) )
    return( FALSE );
  return( TRUE );
}

Creating URLs

The InternetCreateUrl function uses the information in the URL_COMPONENTS structure to create a Uniform Resource Locator.

The components that make up the URL_COMPONENTS structure are the scheme, host name, port number, user name, password, URL path, and additional information (such as search parameters). Each component, except the port number, has a string member that holds the information, and a member that holds the length of the string member.

For each required component, the pointer member should contain the address of the buffer holding the information. The length member should be set to zero if the pointer member contains the address of a zero-terminated string; the length member should be set to the string length if the pointer member contains the address of a string that is not zero-terminated. The pointer member of any components that are not required must be NULL.

Accessing URLs Directly

FTP and HTTP resources on the Internet can be accessed directly by using the InternetOpenUrl, InternetReadFile, and InternetFindNextFile functions. InternetOpenUrl opens a connection to the resource at the URL passed to the function. After this connection is made, the next step depends on the type of resource. If the resource is a file, InternetReadFile can download it. If the resource is a directory, InternetFindNextFile can enumerate the files within the directory (except when using CERN proxies). For more information on InternetReadFile, see Reading Files. For more information on InternetFindNextFile, see Finding the Next File.

For applications that need to operate through a CERN proxy, InternetOpenUrl can be used to access FTP directories and files. The FTP requests are packaged to appear like an HTTP request, which the CERN proxy would accept.

InternetOpenUrl uses the HINTERNET handle created by the InternetOpen function and the URL of the resource. The URL must include the scheme (http:, ftp:, file: [for a local file], or https: [for Hypertext Transfer Protocol Secure]) and the network location (such as www.microsoft.com). The URL can also include a path (for example, /isapi/gomscom.asp?TARGET=/windows/feature/) and a resource name (for example, default.htm). For HTTP or HTTPS requests, additional headers can be included.

InternetQueryDataAvailable, InternetFindNextFile, InternetReadFile, and InternetSetFilePointer (HTTP or HTTPS URLs only) can use the handle that is created by InternetOpenUrl to download the resource.

The following diagram illustrates which handles to use with each function.

[Diagram: handles to use with functions]

The root HINTERNET handle created by InternetOpen is used by InternetOpenUrl. The HINTERNET handle created by InternetOpenUrl can be used by InternetQueryDataAvailable, InternetReadFile, InternetFindNextFile (not shown here), and InternetSetFilePointer (HTTP or HTTPS URLs only).

For more information, see HINTERNET Handles.

Note

WinINet does not support server implementations. In addition, it should not be used from a service. For server implementations or services, use Microsoft Windows HTTP Services (WinHTTP).