December 2012

Volume 27 Number 12

Windows 8 Networking - Windows 8 and the WebSocket Protocol

By Kenny Kerr | December 2012

The WebSocket protocol aims to provide bidirectional communication in a Web-saturated world dominated by clients solely responsible for establishing connections and initiating request/response pairs. It finally allows applications to enjoy many more of the benefits of TCP, but in a Web-friendly way. Considering that the WebSocket protocol was only standardized by the Internet Engineering Task Force in December 2011—and as I write this is still under consideration by the World Wide Web Consortium—it’s perhaps surprising just how comprehensively Windows 8 has embraced this new Internet technology.

In this article I’ll first show you how the WebSocket protocol works and explain its relationship to the larger TCP/IP suite. I’ll then explore the various ways in which Windows 8 enables programmers to easily adopt this new technology from within their applications.

Why WebSocket?

The main goal of this protocol is to provide a standard and efficient way for browser-based applications to communicate with servers freely outside of request/response pairs. A few years back, Web developers were all aflutter talking about Asynchronous JavaScript and XML (AJAX) and how it enables dynamic and interactive scenarios—and certainly it did, but the XMLHttpRequest object that inspired it all still only allowed the browser to make HTTP requests. What if the server wanted to send a message to the client out-of-band? That’s where the WebSocket protocol comes in. It not only allows the server to send messages to the client, but it does so without the overhead of HTTP, providing bidirectional communication that’s close to the speed of a raw TCP connection. Without the WebSocket protocol, Web developers have had to abuse HTTP by polling the server for updates, using Comet-style programming techniques, and employing many HTTP connections with a great deal of protocol overhead just to keep applications up-to-date. Servers are overloaded, bandwidth is wasted and Web applications are overly complicated. The WebSocket protocol solves these problems in a surprisingly simple and efficient way, but before I can describe how it works, I need to provide some foundational and historical context.

The TCP/IP Suite

TCP/IP is a protocol suite, or collection of interrelated protocols, that implements the Internet architecture. It has evolved into its current form over many years. The world has changed dramatically since the 1960s, when the concept of packet-switching networks first developed. Computers have become much faster, software has grown more demanding and the Internet has exploded into an all-encompassing web of information, communication and interaction, and is the backbone of so much of the software in popular use today.

The TCP/IP suite consists of a number of layers loosely modeled after the Open System Interconnection (OSI) layering model. Although the protocols at the different layers aren’t particularly well delineated, TCP/IP has clearly proven its effectiveness, and the layering problems have been overcome by a clever combination of hardware and software designs. Separating TCP/IP into layers, however vague they might be, has helped it evolve over time as hardware and technology have changed, and has allowed programmers with different skills to work at different levels of abstraction, either helping to build the protocol stack itself or to build applications making use of its various facilities.

At the lowest layers are the physical protocols, including the likes of wired media access control and Wi-Fi, providing physical connectivity as well as local addressing and error detection. Most programmers don’t think too much about these protocols.

Moving up the stack, the Internet Protocol (IP) itself resides in the networking layer and allows TCP/IP to become interoperable across different physical layers. It takes care of mapping computer addresses to physical addresses and routing packets from computer to computer.

Then there are ancillary protocols, and we could debate about which layer they reside on, but they really provide a necessary supporting role for things such as auto-configuration, name resolution, discovery, routing optimizations and diagnostics.

As we move further up the layering stack, the transport and application protocols come into view. The transport protocols take care of multiplexing and de-multiplexing the packets from the lower layers so that, even though there might only be a single physical and networking layer, many different applications can share the communication channel. The transport layer also typically provides further error detection, reliable delivery and even performance-related features such as congestion and flow control. The application layer has traditionally been the home of protocols such as HTTP (implemented by Web browsers and servers) and SMTP (implemented by e-mail clients and servers). As the world has started relying more heavily on protocols such as HTTP, their implementations have been pushed down into the depths of the OS, both to improve performance as well as to share the implementation among different applications.

TCP and HTTP

Of the protocols in the TCP/IP suite, the TCP and User Datagram Protocol (UDP) found at the transport layer are perhaps the most well known to the average programmer. Both define a “port” abstraction that these protocols use in combination with IP addresses to multiplex and de-multiplex packets as they arrive and when they’re sent.

Although UDP is used heavily for other TCP/IP protocols such as Dynamic Host Configuration Protocol and DNS, and has been adopted widely for private network applications, its adoption in the Internet at large hasn’t been as far-reaching as that of its sibling. TCP, on the other hand, has seen widespread adoption across the board, thanks in large part to HTTP. Although TCP is far more complex than UDP, much of this complexity is hidden from the application layer where the application enjoys the benefits of TCP without being subject to its complexity.

TCP provides a reliable flow of data between computers, the implementation of which is hugely complex. It concerns itself with packet ordering and data reconstruction, error detection and recovery, congestion control and performance, timeouts, retransmissions, and much more. The application, however, only sees a bidirectional connection between ports and assumes that data sent and received will transfer correctly and in order.

Contemporary HTTP presupposes a reliable connection-oriented protocol, and TCP is clearly the obvious and ubiquitous choice. In this model, HTTP functions as a client-server protocol. The client opens a TCP connection to a server. It then sends a request, which the server evaluates and responds to. This is repeated countless times every second of every day around the world.

Of course, this is a simplification or restriction of the functionality that TCP provides. TCP allows both parties to send data simultaneously. One end doesn’t need to wait for the other to send a request before it can respond. This simplification did, however, allow server-side caching of responses, which has had a huge impact on the Web’s ability to scale. But the popularity of HTTP was undoubtedly aided by its initial simplicity. Whereas TCP provides a bidirectional channel for binary data—a pair of streams, if you like—HTTP provides a request message preceding a response message, both consisting of ASCII characters, although the message bodies, if any, may be encoded in some other way. A simple request might look as follows:

GET /resource HTTP/1.1\r\n
host: example.com\r\n
\r\n

Each line concludes with a carriage return (\r) and line feed (\n). The first line, called a request line, specifies the method by which a resource is to be accessed (in this case GET), the path of the resource and finally the version of HTTP to be used. Similar to the lower-layer protocols, HTTP provides multiplexing and de-multiplexing via this resource path. Following this request line are one or more header lines. Headers consist of a name and value as illustrated in the preceding example. Some headers are required, such as host, while most are not and merely help browsers and servers communicate more efficiently or negotiate features and functionality.

A response might look like this:

HTTP/1.1 200 OK\r\n
content-type: text/html\r\n
content-length: 1307\r\n
\r\n
<!DOCTYPE HTML><html> ... </html>

The format is basically the same, but instead of a request line, the response line affirms the version of HTTP to be used, a status code (200) and a description of the status code. The 200 status code indicates to the client that the request was processed successfully and any result is included immediately following any header lines. The server might, for example, indicate that the requested resource doesn’t exist by returning a 404 status code. The headers take the same form as those in the request. In this case the content-type header informs the browser that the requested resource in the message body is to be interpreted as HTML and the content-length header tells the browser how many bytes the message body contains. This is important because, as you’ll recall, HTTP messages flow over TCP, which doesn’t provide message boundaries. Without a content length, HTTP applications need to use various heuristics to determine the length of any message body.
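To make the role of content-length concrete, here is a minimal JavaScript sketch, assuming Node.js and its Buffer type. The parseResponse function and the sample bytes are hypothetical. They only illustrate how a receiver can find where a message body ends on a TCP stream that has no message boundaries of its own.

```javascript
// Sketch only: split an HTTP/1.1 response into status line, headers and body.
// parseResponse is a hypothetical helper, not part of any library.
function parseResponse(bytes) {
  const text = bytes.toString('latin1');        // one character per byte keeps offsets exact
  const headEnd = text.indexOf('\r\n\r\n');     // a blank line ends the header block
  if (headEnd === -1) throw new Error('incomplete header block');

  const lines = text.slice(0, headEnd).split('\r\n');
  const [version, code, ...reason] = lines[0].split(' ');

  const headers = {};
  for (const line of lines.slice(1)) {
    const colon = line.indexOf(':');
    headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
  }

  // content-length says how many bytes after the blank line belong to this message.
  const bodyStart = headEnd + 4;
  const length = Number(headers['content-length'] || 0);
  const body = bytes.subarray(bodyStart, bodyStart + length);

  return { version, status: Number(code), reason: reason.join(' '), headers, body };
}

// Two responses back to back on the same stream; only the first is parsed here.
const sample = Buffer.from(
  'HTTP/1.1 200 OK\r\n' +
  'content-type: text/html\r\n' +
  'content-length: 5\r\n' +
  '\r\n' +
  'helloHTTP/1.1 404 Not Found\r\n', 'latin1');

const response = parseResponse(sample);
console.log(response.status, response.body.toString('latin1')); // prints "200 hello"
```

The body stops after five bytes even though more data follows on the stream, which is exactly the boundary information that TCP alone doesn't supply.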

This is all pretty simple, a testament to the straightforward design of HTTP. But HTTP isn’t simple anymore. Today’s Web browsers and servers are state-of-the-art programs with thousands of interrelated features, and HTTP is the workhorse that needs to keep up with it all. Much of the complexity was born out of a need for speed. There are now headers to negotiate compression of the message body, caching and expiration headers to avoid transmitting a message body at all, and much more. Techniques have been developed to reduce the number of HTTP requests by combining different resources. Content delivery networks (CDNs) have even been distributed around the world in an attempt to host commonly accessed resources closer to the Web browsers accessing them.

Despite all of these advances, many Web applications could achieve greater scalability and even simplicity if there were some way to occasionally break out of HTTP and return to the streaming model of TCP. This is exactly what the WebSocket protocol delivers.

The WebSocket Handshake

The WebSocket protocol fits somewhat neatly into the TCP/IP suite above TCP and alongside HTTP. One of the challenges with introducing a new protocol to the Internet is in somehow making the countless routers, proxies and firewalls think that nothing has changed under the sun. The WebSocket protocol achieves this goal by masquerading as HTTP before switching to its own WebSocket data transfer on the same underlying TCP connection. In this way, many unsuspecting intermediaries don’t have to be upgraded in order to allow WebSocket communication to traverse their network connections. In practice this doesn’t always work quite so smoothly because some overly zealous routers fiddle with the HTTP requests and responses, attempting to rewrite them to suit their own ends, such as proxy caching or address or resource translation. An effective solution in the short term is to use the WebSocket protocol over a secure channel—Transport Layer Security (TLS)—because this tends to keep the tampering to a minimum.

The WebSocket protocol borrows ideas from a variety of sources, including IP, UDP, TCP and HTTP, and makes those concepts available to Web browsers and other applications in a simpler form. It all starts with a handshake that’s designed to look and operate just like an HTTP request/response pair. This isn’t done so that clients or servers can somehow fool each other into using WebSockets, but rather to fool the various intermediaries into thinking it’s just another TCP connection serving up HTTP. In fact, the WebSocket protocol is specifically designed to prevent any party from being duped into accepting a connection accidentally. It begins with a client sending a handshake that is, for all intents and purposes, an HTTP request, and might look as follows:

GET /resource HTTP/1.1\r\n
host: example.com\r\n
upgrade: websocket\r\n
connection: upgrade\r\n
sec-websocket-version: 13\r\n
sec-websocket-key: E4WSEcseoWr4csPLS2QJHA==\r\n
\r\n

As you can see, nothing precludes this from being a perfectly valid HTTP request. An unsuspecting intermediary should simply pass this request along to the server, which may even be an HTTP server doubling as a WebSocket server. The request line in this example specifies a standard GET request. This also means that a WebSocket server might allow multiple endpoints to be serviced by a single server in the same way that most HTTP servers do. The host header is required by HTTP 1.1 and serves the same purpose—to ensure both parties agree on the hosting domain in shared hosting scenarios. The upgrade and connection headers are also standard HTTP headers used by clients to request an upgrade of the protocol used in the connection. This technique is sometimes used by HTTP clients to transition to a secure TLS connection, although that’s rare. These headers are, however, required by the WebSocket protocol. Specifically, the upgrade header indicates that the connection should be upgraded to the WebSocket protocol and the connection header specifies that this upgrade header is connection-specific, meaning that it must not be communicated by proxies over further connections.

The sec-websocket-version header must be included and its value must be 13. If the server is a WebSocket server but doesn’t support this version, it will abort the handshake, returning an appropriate HTTP status code. As you’ll see in a moment, even if the server knows nothing of the WebSocket protocol and happily returns a success response, the client is designed to abort the connection.

The sec-websocket-key header really is the key to the WebSocket handshake. The designers of the WebSocket protocol wanted to ensure that a server couldn’t possibly accept a connection from a client that was not in fact a WebSocket client. They didn’t want a malicious script to construct a form submission or use the XMLHttpRequest object to fake a WebSocket connection by adding the sec-* headers. To prove to both parties that a legitimate connection is being established, the sec-websocket-key header must also be present in the client handshake. The value must be a randomly selected (ideally cryptographically random) 16-byte number, known as a nonce in security parlance, which is then base64-encoded for this header value.
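As a rough illustration of how a client might assemble this handshake, here is a short JavaScript sketch, assuming Node.js and its built-in crypto module. The buildClientHandshake helper and the host and path values are hypothetical. A real client would normally leave this work to its WebSocket library.

```javascript
const crypto = require('crypto');

// Sketch only: buildClientHandshake is a hypothetical helper, and the host and
// path passed in below are placeholder values.
function buildClientHandshake(host, path) {
  // 16 cryptographically random bytes, base64-encoded, form the nonce.
  const key = crypto.randomBytes(16).toString('base64');
  const request =
    'GET ' + path + ' HTTP/1.1\r\n' +
    'host: ' + host + '\r\n' +
    'upgrade: websocket\r\n' +
    'connection: upgrade\r\n' +
    'sec-websocket-version: 13\r\n' +
    'sec-websocket-key: ' + key + '\r\n' +
    '\r\n';
  return { key, request };
}

const handshake = buildClientHandshake('example.com', '/resource');
console.log(handshake.request);
```

The function returns the key alongside the request text because the client needs the same value again when it checks the server's reply.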

Once the client handshake is sent, the client waits for a response to validate that the server is indeed willing and able to establish a WebSocket connection. Assuming the server doesn’t object, it might send a server handshake as an HTTP response as follows:

HTTP/1.1 101 OK\r\n
upgrade: websocket\r\n
connection: upgrade\r\n
sec-websocket-accept: 7eQChgCtQMnVILefJAO6dK5JwPc=\r\n
\r\n

Again, this is a perfectly valid HTTP response. The response line includes the HTTP version followed by the status code, but instead of the regular 200 code indicating success, the server must respond with the standard 101 code indicating that the server understands the upgrade request and is willing to switch protocols. The English description of the status code makes absolutely no difference. It might be “OK” or “Switching to WebSocket” or even a random Mark Twain quote. The important thing is the status code and the client must ensure that it’s 101. The server could, for example, reject the request and ask the client to authenticate using a 401 status code before accepting a WebSocket client handshake. A successful response must, however, include the upgrade and connection headers to acknowledge that the 101 status code specifically refers to a switch to the WebSocket protocol, again to avoid anyone being duped.

Finally, to validate the handshake, the client ensures that the sec-websocket-accept header is present in the response and its value is correct. The server needn’t decode the base64-encoded value sent by the client. It merely takes this string, concatenates the string representation of a well-known GUID and hashes the combination with the SHA-1 algorithm to produce a 20-byte value that’s then base64-encoded and used as the value for the sec-websocket-accept header. The client can then easily validate that the server did indeed do as required and there’s then no doubt that both parties are consenting to a WebSocket connection.
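The calculation is small enough to sketch in a few lines of JavaScript, assuming Node.js and its crypto module. The GUID is the one fixed by RFC 6455, and the sample key is the one used in the RFC's own example. The computeAccept and isValidServerHandshake names are hypothetical, and the headers object with lowercase names is an assumption of this sketch.

```javascript
const crypto = require('crypto');

// GUID fixed by RFC 6455 for the opening handshake.
const WEBSOCKET_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Server side: hash the key string exactly as received (no base64 decoding),
// followed by the GUID, and base64-encode the 20-byte SHA-1 digest.
function computeAccept(secWebSocketKey) {
  return crypto.createHash('sha1')
    .update(secWebSocketKey + WEBSOCKET_GUID, 'latin1')
    .digest('base64');
}

// Client side: hypothetical check of the server handshake. `headers` is assumed
// to be a plain object keyed by lowercase header name.
function isValidServerHandshake(status, headers, sentKey) {
  const connectionTokens = (headers['connection'] || '')
    .toLowerCase().split(',').map(token => token.trim());
  return status === 101 &&
    (headers['upgrade'] || '').toLowerCase() === 'websocket' &&
    connectionTokens.includes('upgrade') &&
    headers['sec-websocket-accept'] === computeAccept(sentKey);
}

const rfcKey = 'dGhlIHNhbXBsZSBub25jZQ==';   // sample key from RFC 6455
console.log(computeAccept(rfcKey));          // prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

A server that simply echoed a success response without understanding the protocol couldn't produce the matching value, so the check fails and the client aborts.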

If all goes well, at this point a valid WebSocket connection is established and both parties can communicate freely and simultaneously in both directions using WebSocket data frames. It’s clear from studying the WebSocket protocol that it was designed after the Web insecurity apocalypse. Unlike most of its predecessors, the WebSocket protocol was designed with security in mind. The protocol also requires that the client include the origin header if the client is in fact a Web browser. This allows browsers to provide protection against cross-origin attacks. Of course, this only makes sense in the context of a trusted hosting environment such as that of a browser.

WebSocket Data Transfer

The WebSocket protocol is all about getting the Web back to the relatively high-performance, low-overhead model of communication provided by IP and TCP, not adding further layers of complexity and overhead. For this reason, once the handshake completes, the WebSocket overhead is kept to a minimum. It provides a packet-framing mechanism on top of TCP reminiscent of the IP packetization that TCP itself is built on and for which UDP is so popular, but without the packet size limitations with which those protocols are encumbered. Whereas TCP provides a stream-based abstraction, WebSocket provides a message-based abstraction to the application. And while TCP streams are transmitted via segments, WebSocket messages are transported as a sequence of frames. These frames are transmitted over the same TCP connection and thus naturally assume reliable and sequential delivery. The framing protocol is somewhat elaborate but is specifically designed to be extremely small, requiring in many cases only a few additional bytes of framing overhead. Data frames may be transmitted by either client or server at any time after the opening handshake has completed.

Each frame includes an opcode describing the frame type as well as the size of the payload. This payload represents the actual data the application may want to communicate as well as any prearranged extension data. Interestingly, the protocol allows for messages to be fragmented. If you come from a hardcore networking background, you might be reminded of the performance implications of IP-level fragmentation and the pains to which TCP goes to avoid fragmentation. But the WebSocket concept of fragmentation is quite different. The idea here is to allow the WebSocket protocol to provide the convenience of network packets but without the size limits. If the sender doesn’t know the exact length of a message being sent, it may be fragmented, with each frame indicating how much data it provides and whether or not it’s the last fragment. Beyond that, the frame merely indicates whether it contains binary data or UTF-8-encoded text.
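The frame header described above can be decoded with very little code. The following JavaScript sketch, assuming Node.js, reads the final-fragment flag, opcode and payload length, then reassembles a two-fragment text message. The parseFrameHeader helper is hypothetical, the bit layout follows RFC 6455, and the sample bytes are the fragmented "Hello" example given in that RFC.

```javascript
// Opcode values defined by RFC 6455.
const OPCODES = { 0x0: 'continuation', 0x1: 'text', 0x2: 'binary',
                  0x8: 'close', 0x9: 'ping', 0xA: 'pong' };

// Sketch only: decode the header of a single frame held in a Buffer.
function parseFrameHeader(buffer) {
  const fin = (buffer[0] & 0x80) !== 0;       // set on the last fragment of a message
  const opcode = buffer[0] & 0x0F;
  const masked = (buffer[1] & 0x80) !== 0;
  let payloadLength = buffer[1] & 0x7F;
  let headerLength = 2;
  if (payloadLength === 126) {                // 16-bit extended length follows
    payloadLength = buffer.readUInt16BE(2);
    headerLength = 4;
  } else if (payloadLength === 127) {         // 64-bit extended length follows
    payloadLength = Number(buffer.readBigUInt64BE(2));
    headerLength = 10;
  }
  if (masked) headerLength += 4;              // the masking key precedes the payload
  return { fin, opcode, type: OPCODES[opcode], masked, payloadLength, headerLength };
}

// An unmasked text message split into two fragments.
const frames = [
  Buffer.from([0x01, 0x03, 0x48, 0x65, 0x6c]),   // FIN clear, text opcode, "Hel"
  Buffer.from([0x80, 0x02, 0x6c, 0x6f]),         // FIN set, continuation opcode, "lo"
];

// Collect the payload bytes first and decode once, because a multi-byte UTF-8
// character may straddle a fragment boundary.
const chunks = [];
for (const frame of frames) {
  const header = parseFrameHeader(frame);
  chunks.push(frame.subarray(header.headerLength, header.headerLength + header.payloadLength));
}
const message = Buffer.concat(chunks).toString('utf8');
console.log(message);   // prints "Hello"
```

Each of these frames carries only two bytes of header, which is the kind of small overhead the framing protocol was designed for.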

Control frames are also defined and primarily used to close a connection but can also be used as a heartbeat to ping the other endpoint to ensure it’s still responsive or to assist in keeping the TCP connection alive. Finally, I should point out that if you happen to poke at a WebSocket frame sent by a client using a network protocol analyzer such as Wireshark, you might notice that the data frames appear to contain encoded data. The WebSocket protocol requires that all data frames sent from the client to the server be masked. Masking involves a simple algorithm “XOR’ing” the data bytes with a masking key. The masking key is contained within the frame, so this isn’t meant to be some sort of ridiculous security feature, although it does relate to security. As mentioned, the designers of the WebSocket protocol spent a great deal of effort working through various security-related scenarios to try to anticipate the various ways in which the protocol might be attacked. One such attack vector that was analyzed involved attacking the WebSocket protocol indirectly by compromising other parts of the Internet’s infrastructure, in this case proxy servers. Unsuspecting proxy servers that may not be aware of the WebSocket handshake’s likeness to a GET request could be fooled into caching data for a fake GET request initiated by an attacker, in effect poisoning the cache for some users. Masking each frame with a new key mitigates this particular threat by ensuring that frames aren’t predictable and thus can’t be misconstrued on the wire. There’s quite a bit more to this attack, and undoubtedly researchers will uncover further possible exploits in time. Still, it’s impressive to see the lengths to which the designers have gone to try to anticipate many forms of attack.
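Masking itself is a simple operation. This JavaScript sketch, assuming Node.js, applies the XOR step and builds one short masked text frame. The applyMask and buildMaskedTextFrame helpers are hypothetical. The masking key is fixed here only so that the output is reproducible, using the values from the masked "Hello" example in RFC 6455. A real client would choose a fresh random key for every frame.

```javascript
// Sketch only: XOR each payload byte with the 4-byte masking key, cycling
// through the key. Applying the same key a second time restores the data.
function applyMask(payload, maskingKey) {
  const out = Buffer.alloc(payload.length);
  for (let i = 0; i < payload.length; i++) {
    out[i] = payload[i] ^ maskingKey[i % 4];
  }
  return out;
}

// Build a single, unfragmented masked text frame. To stay short, this sketch
// only handles payloads that fit the 7-bit length field (up to 125 bytes).
function buildMaskedTextFrame(text, maskingKey) {
  const payload = Buffer.from(text, 'utf8');
  if (payload.length > 125) throw new Error('sketch handles short payloads only');
  const header = Buffer.from([0x81, 0x80 | payload.length]);  // FIN + text, MASK + length
  return Buffer.concat([header, maskingKey, applyMask(payload, maskingKey)]);
}

const maskingKey = Buffer.from([0x37, 0xfa, 0x21, 0x3d]);
const maskedFrame = buildMaskedTextFrame('Hello', maskingKey);
console.log(maskedFrame.toString('hex'));   // prints "818537fa213d7f9f4d5158"
```

Because the key travels in the frame, anyone can reverse the masking. The point is only that the bytes on the wire can't be chosen in advance by script running in the client.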

Windows 8 and the WebSocket Protocol

As helpful as it is to have a deep understanding of the WebSocket protocol, it also helps a great deal to work on a platform with such wide-ranging support, and Windows 8 certainly delivers. Let’s take a look at some of the ways in which you can use the WebSocket protocol without actually having to implement the protocol yourself.

Windows 8 supports the WebSocket protocol in the Microsoft .NET Framework for both clients and servers, supports clients through the Windows Runtime for both native and managed code and lets you create WebSocket clients using the Windows HTTP Services (WinHTTP) API in C++. Finally, IIS 8 provides a native WebSocket module, and of course Internet Explorer provides native support for the WebSocket protocol. That’s quite a mix of different environments, but what might be even more surprising is that Windows 8 only includes a single WebSocket implementation, which is shared among all of these. The WebSocket Protocol Component API implements all of the protocol rules for handshaking and framing without ever actually creating a network connection of any kind. The different platforms and runtimes can then use this common implementation and hook it into the networking stack of their choice.

.NET Clients and Servers

The .NET Framework provides extensions to ASP.NET and provides HttpListener—which is itself based on the native HTTP Server API used by IIS—to provide server support for the WebSocket protocol. In the case of ASP.NET, you can simply write an HTTP handler that calls the new HttpContext.AcceptWebSocketRequest method to accept a WebSocket request on a particular endpoint. You can validate that the request is indeed a WebSocket client handshake using the HttpContext.IsWebSocketRequest property. Outside of ASP.NET, you can host a WebSocket server by simply using the HttpListener class. The implementation is also mostly shared between the two. Figure 1 provides a simple example of such a server.

Figure 1 WebSocket Server Using HttpListener

static async Task Run()
{
  HttpListener s = new HttpListener();
  s.Prefixes.Add("http://localhost:8000/ws/");
  s.Start();
  var hc = await s.GetContextAsync();
  if (!hc.Request.IsWebSocketRequest)
  {
    hc.Response.StatusCode = 400;
    hc.Response.Close();
    return;
  }
  var wsc = await hc.AcceptWebSocketAsync(null);
  var ws = wsc.WebSocket;
  for (int i = 0; i != 10; ++i)
  {
    await Task.Delay(2000);
    var time = DateTime.Now.ToLongTimeString();
    var buffer = Encoding.UTF8.GetBytes(time);
    var segment = new ArraySegment<byte>(buffer);
    await ws.SendAsync(segment, WebSocketMessageType.Text,
      true, CancellationToken.None);
  }
  await ws.CloseAsync(WebSocketCloseStatus.NormalClosure,
    "Done", CancellationToken.None);
}

Here I’m using a C# async method to keep the code sequential and coherent, but in fact it’s all asynchronous. I start by registering the endpoint and waiting for an incoming request. I then check whether the request does in fact qualify as a WebSocket handshake and return a 400 “bad request” status code if it doesn’t. I then call AcceptWebSocketAsync to accept the client handshake and wait for the handshake to complete. At this point, I can freely communicate using the WebSocket object. In this example the server sends 10 UTF-8 frames, each containing the time, after a short delay. Each frame is sent asynchronously using the SendAsync method. This method is quite powerful and can send UTF-8 or binary frames either as a whole or in fragments. The third parameter (in this case, true) indicates whether this call to SendAsync represents the end of the message. Thus, you can use this method repeatedly to send long messages that will be fragmented for you. Finally, the CloseAsync method is used to perform a clean closure of the WebSocket connection, sending a close control frame and waiting for the client to acknowledge with its own close frame.

On the client side, the new ClientWebSocket class uses an HttpWebRequest object internally to provide the ability to connect to a WebSocket server. Figure 2 provides a simple example of a client that can be used to connect to the server in Figure 1.

Figure 2 WebSocket Client Using ClientWebSocket

static async Task Client()
{
  ClientWebSocket ws = new ClientWebSocket();
  var uri = new Uri("ws://localhost:8000/ws/");
  await ws.ConnectAsync(uri, CancellationToken.None);
  var buffer = new byte[1024];
  while (true)
  {
    var segment = new ArraySegment<byte>(buffer);
    var result =
      await ws.ReceiveAsync(segment, CancellationToken.None);
    if (result.MessageType == WebSocketMessageType.Close)
    {
      await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "OK",
        CancellationToken.None);
      return;
    }
    if (result.MessageType == WebSocketMessageType.Binary)
    {
      await ws.CloseAsync(WebSocketCloseStatus.InvalidMessageType,
        "I don't do binary", CancellationToken.None);
      return;
    }
    int count = result.Count;
    while (!result.EndOfMessage)
    {
      if (count >= buffer.Length)
      {
        await ws.CloseAsync(WebSocketCloseStatus.InvalidPayloadData,
          "That's too long", CancellationToken.None);
        return;
      }
      segment =
        new ArraySegment<byte>(buffer, count, buffer.Length - count);
      result = await ws.ReceiveAsync(segment, CancellationToken.None);
      count += result.Count;
    }
    var message = Encoding.UTF8.GetString(buffer, 0, count);
    Console.WriteLine("> " + message);
  }
}

Here I’m using the ConnectAsync method to establish a connection and perform the WebSocket handshake. Notice that the URL uses the new “ws” URI scheme to identify this as a WebSocket endpoint. As with HTTP, the default port for ws is port 80. The “wss” scheme is also defined to represent a secure TLS connection and uses the corresponding port 443. The client then calls ReceiveAsync in a loop to receive as many frames as the server is willing to send. Once received, the frame is first checked to see whether it represents a close control frame. In this case the client responds by sending its own close frame, allowing the server to close the connection promptly. The client then checks whether the frame contains binary data, in which case it closes the connection with an error indicating that this frame type is unsupported. Finally, the frame data can be read. To accommodate fragmented messages, a while loop waits until the final fragment is received. The ArraySegment structure is used to manage the buffer offset so the fragments are reassembled properly.
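The default ports for the two schemes can be illustrated with a small JavaScript sketch, assuming a runtime that provides the standard URL class, such as Node.js or a browser. The effectivePort helper is hypothetical and simply applies the defaults described above when the URI doesn't name a port.

```javascript
// Sketch only: work out which TCP port a ws or wss URI refers to.
function effectivePort(uri) {
  const url = new URL(uri);
  if (url.protocol !== 'ws:' && url.protocol !== 'wss:') {
    throw new Error('not a WebSocket URI: ' + uri);
  }
  // url.port is an empty string when the URI relies on the scheme's default.
  if (url.port !== '') return Number(url.port);
  return url.protocol === 'wss:' ? 443 : 80;
}

console.log(effectivePort('ws://localhost:8000/ws/'));   // prints 8000
console.log(effectivePort('ws://example.com/ws/'));      // prints 80
console.log(effectivePort('wss://example.com/ws/'));     // prints 443
```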

The WinRT Client

The Windows Runtime support for the WebSocket protocol is a little more restrictive. Only clients are supported, and fragmented UTF-8 messages must be completely buffered before they can be read. Only binary messages can be streamed with this API. Figure 3 provides a simple example of a client that can also be used to connect to the server in Figure 1.

Figure 3 WebSocket Client Using the Windows Runtime

static async Task Client()
{
  MessageWebSocket ws = new MessageWebSocket();
  ws.Control.MessageType = SocketMessageType.Utf8;
  ws.MessageReceived += (sender, args) =>
  {
    var reader = args.GetDataReader();
    var message = reader.ReadString(reader.UnconsumedBufferLength);
    Debug.WriteLine(message);
  };
  ws.Closed += (sender, args) =>
  {
    ws.Dispose();
  };
  var uri = new Uri("ws://localhost:8000/ws/");
  await ws.ConnectAsync(uri);
}

This example, although also written in C#, relies on event handlers for the most part, and the C# async method is of little utility, merely able to allow the MessageWebSocket object to connect asynchronously. The code is fairly simple, however, if a little quirky. The MessageReceived event handler is called once the entire (possibly fragmented) message is received and ready to read. Even though the entire message has been received and it can only ever be a UTF-8 string, it’s stored in a stream, and a DataReader object must be used to read the contents and return a string. Finally, the Closed event handler lets you know that the server has sent a close control frame, but as with the .NET ClientWebSocket class, you’re still responsible for sending a close control frame back to the server. The MessageWebSocket class, however, only sends this frame just before the object is itself destroyed. To make this happen promptly in C#, I need to call the Dispose method.

The Prototypical JavaScript Client

There’s little doubt that JavaScript is the environment in which the WebSocket protocol will make the most impact, and the API is impressively simple. Here’s all it takes to connect to the server in Figure 1:

var ws = new WebSocket("ws://localhost:8000/ws/");
ws.onmessage = function (args)
{
  var time = args.data;
  ...
};

Unlike the other APIs on Windows, the browser takes care of closing the WebSocket connection automatically when it receives a close control frame. You can, of course, explicitly close a connection or handle the onclose event, but no further action is required on your part to complete the closing handshake.

The WinHTTP Client for C++

Of course, the WinRT WebSocket client API can be used from native C++ as well, but if you’re looking for a bit more control, then WinHTTP is just the thing for you. Figure 4 provides a simple example of using WinHTTP to connect to the server in Figure 1. This example is using the WinHTTP API in synchronous mode for conciseness, but this would work equally well asynchronously.

Figure 4 WebSocket Client Using WinHTTP

auto s = WinHttpOpen( ... );
auto c = WinHttpConnect(s, L"localhost", 8000, 0);
auto r = WinHttpOpenRequest(c, nullptr, L"/ws/", ... );
WinHttpSetOption(r, WINHTTP_OPTION_UPGRADE_TO_WEB_SOCKET, nullptr, 0);
WinHttpSendRequest(r, ... );
VERIFY(WinHttpReceiveResponse(r, nullptr));
DWORD status;
DWORD size = sizeof(DWORD);
WinHttpQueryHeaders(r,
  WINHTTP_QUERY_STATUS_CODE | WINHTTP_QUERY_FLAG_NUMBER,
  WINHTTP_HEADER_NAME_BY_INDEX,
  &status,
  &size,
  WINHTTP_NO_HEADER_INDEX);
ASSERT(HTTP_STATUS_SWITCH_PROTOCOLS == status);
auto ws = WinHttpWebSocketCompleteUpgrade(r, 0);
char buffer[1024];
DWORD count;
WINHTTP_WEB_SOCKET_BUFFER_TYPE type;
while (NO_ERROR ==
  WinHttpWebSocketReceive(ws, buffer, sizeof(buffer), &count, &type))
{
  if (WINHTTP_WEB_SOCKET_CLOSE_BUFFER_TYPE == type)
  {
    WinHttpWebSocketClose(
      ws, WINHTTP_WEB_SOCKET_SUCCESS_CLOSE_STATUS, nullptr, 0);
    break;
  }
  if (WINHTTP_WEB_SOCKET_BINARY_MESSAGE_BUFFER_TYPE == type ||
    WINHTTP_WEB_SOCKET_BINARY_FRAGMENT_BUFFER_TYPE == type)
  {
    WinHttpWebSocketClose(
      ws, WINHTTP_WEB_SOCKET_INVALID_DATA_TYPE_CLOSE_STATUS, nullptr, 0);
    break;
  }
  std::string message(buffer, count);
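  while (WINHTTP_WEB_SOCKET_UTF8_FRAGMENT_BUFFER_TYPE == type)
  {
    // Stop reassembling if a receive fails; otherwise type would never change
    if (NO_ERROR !=
      WinHttpWebSocketReceive(ws, buffer, sizeof(buffer), &count, &type))
    {
      break;
    }
    message.append(buffer, count);
  }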
  printf("> %s\n", message.c_str());
}

As with all WinHTTP clients, you need to create a WinHTTP session, connection and request object. There’s nothing new here so I’ve elided some of the details. Before actually sending the request, you need to set the new WINHTTP_OPTION_UPGRADE_TO_WEB_SOCKET option on the request to instruct WinHTTP to perform a WebSocket handshake. The request is then ready to be sent with the WinHttpSendRequest function. The regular WinHttpReceiveResponse function is then used to wait for the response, which in this case will include the result of the WebSocket handshake. As always, to determine the result of a request, the WinHttpQueryHeaders function is called specifically to read the status code returned from the server. At this point, the WebSocket connection has been established and you can begin to use it directly. The WinHTTP API naturally handles the framing for you, and this functionality is exposed through a new WinHTTP WebSocket object that’s retrieved by calling the WinHttpWebSocketCompleteUpgrade function on the request object.

Receiving the messages from the server is done, at least conceptually, in much the same way as the example in Figure 2. The WinHttpWebSocketReceive function waits to receive the next data frame. It also lets you read fragments of any kind of WebSocket message, and the example in Figure 4 illustrates how this might be done in a loop. If a close control frame is received, then a matching close frame is sent to the server using the WinHttpWebSocketClose function. If a binary data frame is received, then the connection is similarly closed. Keep in mind that this only closes the WebSocket connection. You still need to call WinHttpCloseHandle to release the WinHTTP WebSocket object, as you have to do for all WinHTTP objects in your possession. A handle wrapper class such as the one I described in my July 2011 column, “C++ and the Windows API” (msdn.microsoft.com/magazine/hh288076), will do the trick.

The WebSocket protocol is a major new innovation in the world of Web applications and, despite its relative simplicity, is a welcome addition to the larger TCP/IP suite of protocols. I’ve little doubt that the WebSocket protocol will soon be almost as ubiquitous as HTTP itself, helping applications and connected systems of all kinds to communicate more easily and efficiently. Windows 8 has done its part to provide a comprehensive set of APIs for building both WebSocket clients and servers.


Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical experts for reviewing this article: Piotr Kulaga and Henri-Charles Machalani