Online Carpentry: Crafting a New MSDN Table of Contents

Bryn Waibel and John Boylan
Microsoft Corporation

Download Toccode.exe.

Contents

The Situation
Making Sense of the Situation
The Sample Application
Taking It from Here

The MSDN Library is the central repository of information on MSDN, with well over 250,000 files, and the table of contents (TOC) is a main access point for this data. However, the TOC in the MSDN Library was working poorly. It needed to be scrapped and replaced—and the sooner, the better.

An MSDN Web design team, including Bryn Waibel and Mike Barta, was commissioned to create the new TOC. Fortunately, the solution was fairly straightforward: a new TOC using JavaScript with an overlay of XML as the data layer. In this article, we'll show you how we did it, and we'll include a small sample TOC that you can use as a design template.

The Situation

If the problem was simple, it was also severe—and was likely to get worse. Fortunately, the answers were clear. We knew the new TOC would need to deal with large data sets and with data arriving from many varied sources. We wanted a TOC that would provide cross-platform access, improved maintainability (with a standard data storage format), and the extended character set necessary for globalization. We also knew that we wanted to be able to link directly from anywhere in the Library into any node in the TOC.

The old TOC was difficult to maintain, because the data sources for the Java applet weren't well linked. The data sources were needed to describe the association of more than 200 disparate data sets, and the only way to link those data sets together without requiring that they be simultaneously shipped to the client was to copy given data sets into the hierarchy by hand.

Built when worldwide applications for the Web were in their infancy, the applet also didn't provide extended character support. It wasn't written to handle subsequent advances in browser technologies that accommodate an increasingly broad range of language-operating system pairs. For example, with the Java applet, the shift-jis character set could be viewed only in a browser whose operating system had the shift-jis set as the default ANSI character set. The Java applet also would not handle Unicode character sets, such as UTF-8.

What's more, user interaction with the original TOC was slow and awkward. And the TOC couldn't be reached from a Macintosh machine, effectively limiting access for Macintosh users.

Finally, the TOC ran up against a proprietary data structure. The sources for the TOC could not be reused without manufacturing a new implementation for the Java applet's proprietary data structure.

Making Sense of the Situation

With this model for what we didn't have, we began to work on what we needed to have. We began to explore some key design issues:

  • An XML structure. We knew we needed an XML data layer. Whatever individual solution we were considering, the answer always came back to XML.
  • Chunking the data. The data had to be pieced into chunks small enough to send to a client on any connection. The Windows 2000 team had a similar problem when trying to put the Windows 2000 Help files online, so their code was an obvious starting point.
  • Syncing to a node. Each node in the TOC had to be discoverable, knowing only the .htm page to which it pointed. This data had to be updateable and independent of unrelated changes made in the TOC.
  • Separating providers. MSDN Library data comes from hundreds of providers in many formats: .rtf, .chm, .htm, and so forth. We needed a way to both map them together and maintain their independence.
  • Reusability. The data needed to be reusable. We didn't want to find ourselves forced to re-author a section of the TOC in order to display that data in a different view.

As we move through the production of the TOC, we'll come back to these issues and show how they affected the final design.

Producing the Table of Contents

Now we had a framework for beginning to hammer together our TOC. We took all of the pieces—XML, JavaScript, XSL, and .asp files—and began to fit them into a whole.

The TOC XML Structure

The key issue we faced in creating the TOC was that of structure size. The entire MSDN TOC is too large to send to the client in one chunk. Currently it contains 250,000 nodes; at 150 bytes per node, that's more than 37 MB, and we couldn't send that to the client.

Because size was a major consideration, we tried to make our document type definition (DTD) as concise as possible without sacrificing readability. We also went with an attribute-centric, rather than an element-centric, approach. That way, we could consolidate the data and give the XML a structure that closely followed the type of data represented in the TOC. An attribute-centric approach would provide a node-node relationship in the TOC, while an element-centric approach would have resulted in many nodes of XML representing one node of TOC.

For our model, we started with a DTD from the newest version of the .hxt file in HTML Help 2.0, and embarked on creating a scheme that would work for us. We named each of our nodes MTN (MSDN TOC Node), and each node could contain zero or more child MTN nodes. We added a number of attributes that were not absolutely necessary for our MSDN Library implementation, but were useful for describing TOCs of all types on our site.

Library Attributes

Here are the attributes that are directly relevant to the MSDN Library TOC:

The nodeType attribute has three possible values:

  • node. This node has children, and they are present in this file.
  • leaf. This is a terminal node; it has no children.
  • collection. This node has children, but they are located in a different file.

The prePartum attribute references a file in which child nodes can be found. The value of this attribute is the path to the file containing an exact copy of the node that points to it, at its root—except that its nodeType value is "node", and its children are present. In terms of the XML Document Object Model (DOM), the operation is: Load the file pointed to by prePartum and replace the current node with the top node in the file.
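The replace operation described above can be sketched roughly as follows. This is a minimal illustration, not the sample's code: TOC nodes are modelled as plain JavaScript objects instead of XML DOM nodes, and the names expandCollection, loadFile, and child.xml are hypothetical.

```javascript
// Hypothetical sketch: TOC nodes are plain objects, not XML DOM nodes.
// loadFile(path) is an assumed callback returning the root node stored in that file.
function expandCollection(parent, index, loadFile) {
    var node = parent.children[index];
    if (node.nodeType !== "collection") {
        return node; // children are already present, or the node is a leaf
    }
    // The file's root is a copy of the collection node with nodeType="node"
    // and its children present, so it simply replaces the placeholder.
    var loaded = loadFile(node.prePartum);
    parent.children[index] = loaded;
    return loaded;
}

// Usage with an in-memory stand-in for the file system.
var files = {
    "child.xml": {
        title: "Section", nodeType: "node",
        children: [{ title: "Topic", nodeType: "leaf", children: [] }]
    }
};
var root = {
    title: "Top", nodeType: "node",
    children: [{ title: "Section", nodeType: "collection", prePartum: "child.xml", children: [] }]
};
var expanded = expandCollection(root, 0, function (path) { return files[path]; });
console.log(expanded.nodeType, expanded.children.length);
```

Because the loaded root carries the same title and position as the placeholder, the rest of the tree does not need to know that the children came from a different file.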

The tocPath attribute contains the zero-indexed path to a given node in the TOC. Each node index is separated from the others by dashes, so that it can be easily taken from string form and plugged directly into the XML DOM. The first node in the TOC would have a tocPath value of 0, its second child's tocPath would be 0-1, and so on—forming a direct path to any node from the top of the TOC. We also added a character token to the beginning of the tocPath value to indicate the area to which the path relates. At the top of the Library, this token was "lib." We'll talk more about this later, when we describe how we maintain the independence of each of our content providers.
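To make the path format concrete, here is a small sketch of how a tocPath string might be split and followed. The helper names parseTocPath and walkIndices are hypothetical, the tree is modelled with plain objects rather than the XML DOM, and the sketch assumes the leading token is separated from the indices by a dash.

```javascript
// Hypothetical helpers; the names are illustrative and not taken from the sample code.
// Splits a tocPath such as "lib-0-1" into its leading token and numeric indices.
function parseTocPath(tocPath) {
    var parts = tocPath.split("-");
    return {
        token: parts[0],
        indices: parts.slice(1).map(function (p) { return parseInt(p, 10); })
    };
}

// Follows zero-based child indices down from a starting node.
// Returns null if the path runs off the tree.
function walkIndices(start, indices) {
    var node = start;
    for (var i = 0; i < indices.length && node; i++) {
        node = (node.children || [])[indices[i]] || null;
    }
    return node;
}

// Usage: "lib-0-1" is the second child of the first node in the TOC.
var toc = { children: [
    { title: "First", children: [
        { title: "A", children: [] },
        { title: "B", children: [] }
    ] }
] };
var parsed = parseTocPath("lib-0-1");
console.log(parsed.token, walkIndices(toc, parsed.indices).title);
```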

The state attribute is a display-driven attribute; if there is no state, the node is treated according to its default state. Nodes can have two states:

  • sel. This is the current node in the tree. If this node points to content, it will be bold; if it has nodeType="node", it will be displayed open.
  • open. This state indicates that this node is in the path to the current node. Only a node with nodeType="node" can have this attribute.

The ref attribute contains the path to the data to which the relevant node refers. In the Library, this path is a URL to an .htm or .asp file.

The title attribute contains the display name of the current node.

Chunking the Data

We started with the work by the Windows 2000 team to put its Help files online. The Windows 2000 team used XML to persist its data, and was building the files offline into an HTML solution—doing most of the build logic before the data was sent to the user.

In the Windows 2000 solution, what the user sees is stored exactly in the file system, with both a Netscape version and an Internet Explorer version. This approach has obvious drawbacks—including the need to build multiple solutions, each catering to a different client platform.

Performance was one of the key reasons for going with an offline build for the Help files, especially before the final release of Windows 2000. Windows 2000 has made remarkable improvements in both XSL transformations and XML DOM manipulation, so we were willing to take another look at the data handling. The result is that our TOC stores XML data on the file system, which is then transformed based on what the user needs to see, on the server.

Syncing to a Node

Syncing to a node was the trickiest problem we faced in designing this TOC. The Windows 2000 team was referencing its nodes using consecutive numeric indices, but this was a problem for us because we didn't want to have to change the entire syncing scheme every time a node was added or replaced. Also, with more than 250,000 nodes in our set, we didn't want to store sync information in one place; we needed some sort of hashing mechanism. An obvious choice was to store the data in the path of a given file, indexed by the filename in the folder. We decided to store the information in an XML file called a map file.

<!DOCTYPE MsdnTocMap [
<!ELEMENT MsdnTocMap (L+)>
<!ATTLIST MsdnTocMap
    rootToc CDATA #IMPLIED
>

<!ELEMENT L EMPTY>
<!ATTLIST L
    url ID #REQUIRED
    pth CDATA #REQUIRED
>
]>

As you can see, map files are fairly simple, and are also quite useful. A map file is just a way to create name/value pairs using XML. Each L element pairs a file name (the name) with a tocPath (the value): the name goes into the ID attribute url, and the value goes into the CDATA attribute pth.

Given only the path to a file and its name, we could now open the appropriate map.xml file in that directory and use the file name to look up the path to that node, using nodeFromID on the XMLDOMDocument object in which we loaded the file. Because ID values must be valid XML names, the file name was first escaped using the FormatFileName( sFileName ) function defined in the file toc.asp, as in the code below.

function FormatFileName( sFileName )
{
    if( "string" == typeof( sFileName ) )
    {
        // Remove anything that's not a word character (which includes
        // underscore), period, colon, or dash (required for ID attributes).
        // The dash goes last in the class so it is a literal, not a range.
        sFileName = sFileName.replace( /[^\w.:-]/g , "" );
        // An ID also can't start with a digit, period, or dash
        if ( /^[.\d-]/.test( sFileName ) ) sFileName = "f" + sFileName;
        return sFileName;
    }
}
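The lookup itself amounts to a keyed search over the name/value pairs in a folder's map file. The following sketch shows the idea with a JavaScript Map standing in for the DOM's ID lookup. The function names and the sample entries are hypothetical, and the file name is assumed to have been escaped already.

```javascript
// Hypothetical sketch: a map file's <L url="..." pth="..."/> entries are modelled
// as an array of objects, and a Map stands in for the DOM's ID lookup.
function buildMapIndex(entries) {
    var index = new Map();
    entries.forEach(function (entry) {
        index.set(entry.url, entry.pth);
    });
    return index;
}

// Returns the tocPath for a file name, or null when the folder's map has no entry.
function lookupTocPath(index, fileName) {
    return index.has(fileName) ? index.get(fileName) : null;
}

// Usage with made-up entries for a single folder.
var folderMap = buildMapIndex([
    { url: "overview.htm", pth: "lib0-0" },
    { url: "reference.htm", pth: "lib0-1" }
]);
console.log(lookupTocPath(folderMap, "reference.htm"));
console.log(lookupTocPath(folderMap, "missing.htm"));
```

Keeping one small map per folder means no single file has to index every node in the Library.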

Separating Providers

To maintain independence among our content sets, we used a token to represent each set. Here is where the token that we talked about in the tocPath description became useful.

We created a name for each provider, then named the topmost nodes in the provider's tree by appending a unique integer to the end of each token. For simplicity, we used consecutive integers, starting with zero. The tocPaths were referenced in map files from the topmost nodes in that provider's content. We then had a file, submap.xml, that mapped each token to its respective position from the top of the TOC. The file in which we stored this data was just another map file, in which the url attribute contained the token name and pth was the path to that token. Now we could move content pieces around with respect to other pieces, needing to update only the master token map file. We could also move content around within a content piece without disrupting the order of other nodes outside of that piece. An added bonus was that we could make as many different views of the data as we liked, referencing the same data differently from the top level.

The Sample Application

Now that we've looked at how the XML played a part in our solution, it's time to talk about how all of this came together to make an application. We'll do that with a walk-through of the attached sample application. If we wanted to make a node that documented the TOC technology, representing a sample application that contained the TOC code, we would want it to look something like this:

Nodes and subnodes in the TOC

Figure 1. Sample TOC: the MSDN TOC node in the Code Center

Let's say that you want to be able to treat the node under "MSDN Code Examples" as an individual element. You need to have two XML files representing this structure: msdnce.xml and ltoc0.xml.

msdnce.xml

<MsdnToc>
    <MTN title="MSDN Code Examples" nodeType="node" type="none" tocPath="msdnce-0" hal="en-us">
        <MTN title="The MSDN TOC" nodeType="collection" type="none" tocPath="ltoc0" hal="en-us" prePartum="ltoc0.xml"/>
    </MTN>
</MsdnToc>

The msdnce.xml file is fairly simple; it's really just a node that harnesses all of the MSDN code examples. This harness contains a single collection node pointing to ltoc0.xml, our sample.

ltoc0.xml

<MTN title="The MSDN TOC" nodeType="node" type="none" tocPath="ltoc0" hal="en-us">
    <MTN title="Revamping MSDN TOC" nodeType="leaf" type="file" ref="library.asp" tocPath="ltoc0-0"/>
    <MTN title="ASP Files" nodeType="node" type="none" tocPath="ltoc0-1">
        <MTN title="loadtree.asp" nodeType="leaf" type="file" ref="loadtree.asp.txt" tocPath="ltoc0-1-0"/>
        <MTN title="toc.asp" nodeType="leaf" type="file" ref="toc.asp.txt" tocPath="ltoc0-1-1"/>
        <MTN title="default.asp" nodeType="leaf" type="file" ref="default.asp.txt" tocPath="ltoc0-1-2"/>
    </MTN>
    <MTN title="CSS Files" nodeType="node" type="none" tocPath="ltoc0-2">
        <MTN title="toc.css" nodeType="leaf" type="file" ref="toc.css.txt" tocPath="ltoc0-2-0"/>
    </MTN>
    <MTN title="JS Files" nodeType="node" type="none" tocPath="ltoc0-3">
        <MTN title="toc.js" nodeType="leaf" type="file" ref="toc.js.txt" tocPath="ltoc0-3-0"/>
    </MTN>
    <MTN title="Include Files" nodeType="node" type="none" tocPath="ltoc0-4">
        <MTN title="locals.inc" nodeType="leaf" type="file" ref="locals.inc" tocPath="ltoc0-4-0"/>
    </MTN>
    <MTN title="XML Files" nodeType="node" type="none" tocPath="ltoc0-5">
        <MTN title="msdnce.xml" nodeType="leaf" type="file" ref="msdnce.xml" tocPath="ltoc0-5-0"/>
        <MTN title="ltoc0.xml" nodeType="leaf" type="file" ref="ltoc0.xml" tocPath="ltoc0-5-1"/>
        <MTN title="map.xml" nodeType="leaf" type="file" ref="map.xml" tocPath="ltoc0-5-2"/>
        <MTN title="submap.xml" nodeType="leaf" type="file" ref="submap.xml" tocPath="ltoc0-5-3"/>
    </MTN>
    <MTN title="XSL Files" nodeType="node" type="none" tocPath="ltoc0-6">
        <MTN title="toc.xsl" nodeType="leaf" type="file" ref="tocdown.xsl" tocPath="ltoc0-6-0"/>
    </MTN>
    <MTN title="Graphics Files" nodeType="node" type="none" tocPath="ltoc0-7">
        <MTN title="bo.gif" nodeType="leaf" type="file" ref="bo.gif" tocPath="ltoc0-7-0"/>
        <MTN title="bs.gif" nodeType="leaf" type="file" ref="bs.gif" tocPath="ltoc0-7-1"/>
        <MTN title="dc.gif" nodeType="leaf" type="file" ref="dc.gif" tocPath="ltoc0-7-2"/>
    </MTN>
</MTN>

We've taken the liberty of naming both sets: ltoc is one of the provider tokens mentioned under "Separating Providers," and msdnce is the top-level name token that represents the name of this view of the data. Ltoc0.xml is a little more meaningful than msdnce.xml; it is the representation of the TOC for this sample. Remember that in our numbering system, ltoc0 is the first node in the ltoc collection.

At this point, we can readily see how reusability comes into play. We have the ability to point to ltoc0.xml from any other TOC on our site, and all we need to do is make a node just like the second-level node in the msdnce file. We also have the ability to add as many samples under the MSDN Code Examples node as we like, and they each maintain complete individuality.

map.xml

The map.xml file maps the content to a path within its subtree. The especially interesting part about this is that every node is referenced via the ltoc0 subtoken. This enables any application that needs to use the ltoc0 data to point to the same ltoc0.xml file. All the XML consumer has to do is to provide its own map to ltoc0, and the map.xml file will take it from there.

<?xml version="1.0"?>
<!DOCTYPE MsdnTocMap [
<!ELEMENT MsdnTocMap (L+)>
<!ATTLIST MsdnTocMap
    rootToc CDATA #IMPLIED
>

<!ELEMENT L EMPTY>
<!ATTLIST L
    url ID #REQUIRED
    pth CDATA #REQUIRED
>
]>
<MsdnTocMap>
    <L url="library.asp" pth="ltoc0-0"/>
    <L url="loadtree.asp.txt" pth="ltoc0-1-0"/>
    <L url="toc.asp.txt" pth="ltoc0-1-1"/>
    <L url="default.asp.txt" pth="ltoc0-1-2"/>
    <L url="toc.css.txt" pth="ltoc0-2-0"/>
    <L url="toc.js.txt" pth="ltoc0-3-0"/>
    <L url="locals.inc" pth="ltoc0-4-0"/>
    <L url="msdnce.xml" pth="ltoc0-5-0"/>
    <L url="ltoc0.xml" pth="ltoc0-5-1"/>
    <L url="map.xml" pth="ltoc0-5-2"/>
    <L url="submap.xml" pth="ltoc0-5-3"/>
    <L url="tocdown.xsl" pth="ltoc0-6-0"/>
    <L url="bo.gif" pth="ltoc0-7-0"/>
    <L url="bs.gif" pth="ltoc0-7-1"/>
    <L url="dc.gif" pth="ltoc0-7-2"/>
</MsdnTocMap>

submap.xml

The submap.xml file is the map that each XML consumer maintains; it provides a relationship among pieces of data. Since this view points to only one set of data (ltoc0), there is only one node. It is especially important to note the possibility of defining subareas within ltoc0 as well; their paths would simply be referenced relative to ltoc0 in submap.xml. If, say, you wanted to include the same .css node in another view, you could call that node toccss, and reference it in submap.xml as ltoc0-2—which from this view can be translated to the true path, msdnce-0-0-2.

<?xml version="1.0"?>
<!DOCTYPE MsdnTocMap [
<!ELEMENT MsdnTocMap (L+)>
<!ATTLIST MsdnTocMap
    rootToc CDATA #IMPLIED
>

<!ELEMENT L EMPTY>
<!ATTLIST L
    url ID #REQUIRED
    pth CDATA #REQUIRED
>
]>
<MsdnTocMap>
    <L url="ltoc0" pth="msdnce-0-0"/>
</MsdnTocMap>
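The translation from a provider-relative path to a path in the current view is essentially a prefix substitution. A minimal sketch, assuming the submap has already been read into a plain object of token/path pairs (resolveTocPath is a hypothetical name, not a function in the sample):

```javascript
// Hypothetical helper: rewrites a provider-relative tocPath into a path from the
// top of the current view, using token -> path pairs like those in submap.xml.
function resolveTocPath(relativePath, submap) {
    var dash = relativePath.indexOf("-");
    var token = dash === -1 ? relativePath : relativePath.slice(0, dash);
    var rest = dash === -1 ? "" : relativePath.slice(dash); // keeps the leading dash
    if (!Object.prototype.hasOwnProperty.call(submap, token)) {
        return null; // unknown token in this view
    }
    return submap[token] + rest;
}

// Usage with the single entry from the sample's submap.xml.
var submap = { "ltoc0": "msdnce-0-0" };
console.log(resolveTocPath("ltoc0-2", submap)); // msdnce-0-0-2
console.log(resolveTocPath("ltoc0", submap));   // msdnce-0-0
```

A different view of the same data would supply its own submap, and the paths stored in map.xml would not need to change.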

locals.inc

The locals.inc file contains all of the information about this view, and is referenced by every .asp file in the application. For our sample application, it looks like:

<%
// title of default.asp
var L_HomePageTitle_HTMLText = "MSDN TOC Code Example";

// Text direction setting ( Right To Left would be "RTL" )
var L_strBiDiMode_HTMLText = "LTR";

// Loading message text
var L_LoadingMsg_HTMLText = "Loading, click to cancel... ";

// Source Information
var RootDir = "msdnce.xml";
var DefaultTopic = "library.asp";
var SubMapPath = "subMap.xml";
var RootTocToken = "msdnce";
%>

ASP

There are two .asp workhorses in the TOC that do the XML processing and get the data ready for display to the client: loadtree.asp and toc.asp.

loadtree.asp

The loadtree.asp file takes prePartum and tocPath as parameters, and outputs the children of the node, which reside at the tocPath specified. There's an easy way and a hard way to make this happen. The easy way applies when the node you want is at the top of the file referenced by prePartum. In that case, it's as simple as loading the file into a DOMDocument object, transforming it with tocDown.xsl, and putting the results out to the page. Once the results are on the page, the user interface can begin to deal with them.

The second, more difficult case occurs when loadtree.asp has to find a node somewhere inside the file referenced by prePartum. This happens when the tocPath parameter passed to loadtree.asp differs from the tocPath attribute on the top node of the loaded document. In that case, loadtree.asp strips the top node's tocPath from the front of the tocPath passed in the query string, and uses the remaining fragment as a sequence of child indices to walk. When the fragment is used up, it has found the node it is looking for. It then clones that node (the style sheet needs to run from the root of a document), transforms the clone, and writes the results to the page.
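The prefix-stripping walk can be sketched as follows. This is an illustration only: findInLoadedFile is a hypothetical name, nodes are plain objects rather than XML DOM nodes, and the cloning and XSL transformation steps are left out.

```javascript
// Hypothetical sketch of the lookup inside a loaded file: strip the file root's
// tocPath from the requested tocPath, then walk the remaining indices.
function findInLoadedFile(fileRoot, requestedPath) {
    if (requestedPath === fileRoot.tocPath) {
        return fileRoot; // easy case: the wanted node is the top of the file
    }
    var prefix = fileRoot.tocPath + "-";
    if (requestedPath.indexOf(prefix) !== 0) {
        return null; // the requested node does not live in this file
    }
    var fragment = requestedPath.slice(prefix.length).split("-");
    var node = fileRoot;
    for (var i = 0; i < fragment.length && node; i++) {
        node = (node.children || [])[parseInt(fragment[i], 10)] || null;
    }
    return node;
}

// Usage with a cut-down version of the sample's ltoc0.xml.
var ltoc0 = {
    title: "The MSDN TOC", tocPath: "ltoc0", children: [
        { title: "Revamping MSDN TOC", tocPath: "ltoc0-0", children: [] },
        { title: "ASP Files", tocPath: "ltoc0-1", children: [
            { title: "loadtree.asp", tocPath: "ltoc0-1-0", children: [] },
            { title: "toc.asp", tocPath: "ltoc0-1-1", children: [] }
        ] }
    ]
};
console.log(findInLoadedFile(ltoc0, "ltoc0-1-1").title);
```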

toc.asp

The toc.asp file takes both ref and tocPath as parameters, loads the top-level TOC, and outputs everything from the top level down to the node referenced by tocPath. If it can find the path to the node referenced by ref, it walks the tree, cutting off unwanted children until it gets to that node. If it can't, it outputs only the top level.
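One way to picture the pruning walk is the sketch below, which copies the tree and keeps children only for nodes on the path to the target. It is a simplified model rather than the sample's implementation: pruneToPath is a hypothetical name, nodes are plain objects, and the sketch assumes that the selected node's own children are not included in the output.

```javascript
// Hypothetical sketch: returns a pruned copy of the tree in which only nodes on
// the path to the target keep their children. Nodes on the path are marked
// "open" and the target is marked "sel", following the state attribute.
function pruneToPath(node, indices) {
    var copy = { title: node.title, nodeType: node.nodeType, children: [] };
    if (indices.length === 0) {
        copy.state = "sel"; // reached the target node
        return copy;
    }
    copy.state = "open";
    copy.children = (node.children || []).map(function (child, i) {
        if (i === indices[0]) {
            return pruneToPath(child, indices.slice(1));
        }
        // Siblings off the path are kept, but their children are cut.
        return { title: child.title, nodeType: child.nodeType, children: [] };
    });
    return copy;
}

// Usage: select the first child of the second top-level branch.
var tree = {
    title: "Root", nodeType: "node", children: [
        { title: "A", nodeType: "node", children: [
            { title: "A1", nodeType: "leaf", children: [] }
        ] },
        { title: "B", nodeType: "node", children: [
            { title: "B1", nodeType: "leaf", children: [] },
            { title: "B2", nodeType: "leaf", children: [] }
        ] }
    ]
};
var pruned = pruneToPath(tree, [1, 0]);
console.log(JSON.stringify(pruned, null, 2));
```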

XSL

The XSL is then used to display the XML in the browser, by transforming it to XHTML. The server can choose a different .xsl file for each client configuration. The XSL files would be similar to the extent that each would have a recursive template mechanism, which transforms the XML into HTML that is sent to the client. That mechanism looks something like this:

<xsl:template match="MTN[@nodeType='node']">
    <!-- HTML for node -->
    <xsl:apply-templates select="MTN"/>
</xsl:template>
<xsl:template match="MTN[@nodeType='collection']">
    <!-- HTML for collection -->
</xsl:template>
<xsl:template match="MTN[@nodeType='leaf']">
    <!-- HTML for leaf -->
</xsl:template>
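For readers less familiar with XSL, the same recursion can be expressed as an ordinary function. The JavaScript below is only an analogue of the templates, not a replacement for them. The function names and the list markup it emits are hypothetical, since the article does not show the actual HTML produced for each node type.

```javascript
// Illustrative JavaScript analogue of the recursive templates (not the XSL itself).
function escapeHtml(text) {
    return String(text).replace(/&/g, "&amp;").replace(/</g, "&lt;")
        .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

function renderNode(node) {
    var title = escapeHtml(node.title);
    switch (node.nodeType) {
    case "node": // children are present, so recurse into them
        return "<li>" + title + "<ul>" +
            (node.children || []).map(function (child) {
                return renderNode(child);
            }).join("") + "</ul></li>";
    case "collection": // children live in another file; emit a placeholder only
        return "<li class=\"collection\">" + title + "</li>";
    default: // leaf
        return "<li class=\"leaf\">" + title + "</li>";
    }
}

var html = renderNode({
    title: "ASP Files", nodeType: "node", children: [
        { title: "toc.asp", nodeType: "leaf" },
        { title: "More samples", nodeType: "collection" }
    ]
});
console.log(html);
```

As in the templates, only the "node" case recurses; a collection stops the recursion because its children have not been loaded yet.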

JavaScript and CSS

The JavaScript and cascading style sheets (CSS) can also be different for each browser. The version in the attached sample runs the TOC only in Internet Explorer versions 4.0 and later. The HTML produced by the .xsl file does all of the messaging work for the .asp files, which take care of navigating and piecing the TOC together in other browsers.

Taking It from Here

That's the MSDN Library Table of Contents. It's a starting point for you to build your own TOC, especially if you need to connect a diverse group of files and have easy access to those files. We've designed the TOC so that it can provide a versatile model for any situation in which you want to process hierarchical data with XML—from a table of contents to organizational charts and schematics.

Read part two, Online Carpentry: Refinishing Your Table of Contents.

*Bryn Waibel is a Web design engineer on the MSDN team.

John Boylan is a developmental editor for the MSDN Online Library.*