Working with Office HTML

 

David Shank
Microsoft Corporation

December 9, 1999

Using the terms "Office" and "HTML" in the same sentence used to be a rarity. Occasionally you'd hear someone complain, "I wish I could save my Office document as an HTML file," but that was about it.

The combination of the widespread use of Microsoft Office and the widespread use of the Internet brought requests that Office documents satisfy the best of both worlds. It really wasn't until the release of Microsoft Office 2000 that we got used to the idea that an Office document and an HTML document could be one and the same. Office documents now support HTML as a native file format and use XML to preserve information about a document necessary to re-create it as an Office document once it has been saved as HTML. There really isn't anything special about the HTML in an Office document, so working with it is just like working with any HTML. By using HTML and XML, Office documents and data can be stored, distributed, and viewed using most Web browsers, while retaining the functionality of Office documents.

Certain HTML and XML tags help "round-trip" the document for editing purposes. For example, if you create the document in Word 2000 and save it as HTML, the code embedded in the document allows you to re-open the document in Word 2000. (Read the reference for more information on the Office HTML and XML file formats.)

But alas, every silver lining contains a dark cloud (or something like that). The bad news, to some folks at least, was that these HTML files could get quite large. If you don't need to "round-trip" the document, there is no need to preserve the Office-specific HTML and XML. You can learn more about Office HTML and XML and download a tool (http://office.microsoft.com/downloads/2000/Msohtmf2.aspx) for removing Office-specific markup tags embedded in Office 2000 documents.

This month, I'm going to talk about working with the HTML in Office documents and show you how you can remove unwanted HTML or XML from an Office document before programmatically saving a document as an HTML page. We'll use the Word object model and the FrontPage object model to work with HTML in each application.

The Scenario: Using VBA to Work with Office HTML in Multiple Office Applications

The scenario I will use is designed to illustrate how to take portions of the HTML from a Word document and create a Web page based on that HTML. To be more specific, imagine you are working on a Word document and you want to run a Visual Basic for Applications procedure in Word to save your document as a page in a FrontPage Web site.

To accomplish this goal, I'll write Visual Basic for Applications code to:

  • Extract certain portions of the HTML from the active Word document.
  • Create a new Web page (or modify an existing page) in a FrontPage Web site.
  • Strip certain HTML tags from a preexisting Web page.
  • Insert portions of the Word-generated HTML into the specified Web page.

To follow along with me in this example, first open a new Word document, add some text, and assign styles and formatting to the text. The following graphic shows a Word document formatted in this manner.

Next, press ALT-F11 to open the Visual Basic Editor, and then choose Module from the Insert menu. Set the module Name property in the lower-left pane to modOfficeHTML.

Next, choose References from the Tools menu and set a reference to these two type libraries:

  • Microsoft FrontPage 4.0 Page Object Reference Library
  • Microsoft FrontPage 4.0 Web Object Reference Library

Setting these references to the FrontPage Page and Web object models allows you to work with the objects in those type libraries as they are used in the sample code below.

Extract HTML From a Word Document

Short of actually saving an Office document as an HTML page, the only way to see its HTML representation is to view the document using the Microsoft Script Editor. (You can press ALT-SHIFT-F11 to open the script editor, but you don't have to do so to make this example work.)

But we don't want to look at the HTML; we want to work with it programmatically. To do this, use the HTMLProjectItem object. In this case, I will use the Text property of the HTMLProjectItem object to get the HTML for the active Word document. What I want to do is extract only the <BODY> and <STYLE> portions of the HTML so I can insert them into an existing Web page. I use the GetHTMLPart function to do this work.

Function GetHTMLPart(wdDoc As Word.Document, strStartTag As String, _
                        strEndTag As String) As String
    Dim strText As String
    Dim lngStartPos As Long
    Dim lngEndPos As Long

    ' This procedure returns the portion of the HTML in a Word document
    ' beginning with the HTML tag in the strStartTag variable and
    ' ending with the HTML tag in the strEndTag variable. It returns an
    ' empty string if either tag is not found.

    ' First get all of the HTML in the document.
    strText = wdDoc.HTMLProject.HTMLProjectItems(wdDoc.Name).Text

    ' Locate the position of the starting tag.
    lngStartPos = InStr(strText, strStartTag)
    If lngStartPos = 0 Then Exit Function

    ' Locate the ending tag, searching only after the starting tag.
    lngEndPos = InStr(lngStartPos, strText, strEndTag)
    If lngEndPos = 0 Then Exit Function
    lngEndPos = lngEndPos + Len(strEndTag)

    ' Return the HTML from the starting tag through the ending tag.
    GetHTMLPart = Mid$(strText, lngStartPos, lngEndPos - lngStartPos)
End Function

An example will help illustrate how this function works. The following code returns all the HTML within the document's <BODY></BODY> tag pair to the strBodyHTML variable:

strBodyHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")

The procedure works because there is only one <BODY></BODY> tag pair in any document. Note that the closing bracket is left off the start tag argument. The opening tag can contain other text, such as attributes, and we can't be sure what that text is or where the tag ends; all we really care about is where the tag begins. Also note that the tag name in the code is lowercase. The tags in an Office document's HTML are all lowercase, and the InStr function is case-sensitive by default, so if you passed "<BODY" as the first argument, the GetHTMLPart function would not find the tag and would not return any HTML.
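A quick way to see the effect of tag case is to call InStr directly on a sample string. The following is only a sketch: the DemoTagCase name and the sample opening tag are made up for illustration and are not part of the article's sample code.

```vba
Sub DemoTagCase()
    Dim strHTML As String

    ' Hypothetical opening tag, similar in shape to what Word writes.
    strHTML = "<body lang=EN-US>"

    ' Binary (case-sensitive) comparison: the uppercase tag is not found.
    Debug.Print InStr(1, strHTML, "<BODY", vbBinaryCompare)   ' prints 0

    ' Text (case-insensitive) comparison: the tag is found at position 1.
    Debug.Print InStr(1, strHTML, "<BODY", vbTextCompare)     ' prints 1
End Sub
```

If you ever need to match tags regardless of case, passing vbTextCompare to InStr is one option, at the cost of changing how GetHTMLPart behaves for every caller.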

The GetHTMLPart procedure locates the starting and ending points for this tag pair and then uses the Mid$ function to return only the HTML between the starting and ending points. This same function can be used to extract the HTML style sheet information represented by the <STYLE></STYLE> tag pair:

strStyleHTML = GetHTMLPart(ActiveDocument, "<style", "</style>")

That is really all there is to it. By extracting the <BODY> and <STYLE> HTML, we leave behind any Office-specific HTML or XML. Now all we need to do is insert the extracted HTML into a Web page.
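If you want to experiment with the position arithmetic without opening a Word document, the same logic can be applied to an ordinary string. The sketch below assumes nothing beyond the VBA string functions; ExtractPart, DemoExtractPart, and the sample HTML are hypothetical names and data introduced for illustration.

```vba
Option Explicit

' Hypothetical helper: the same position arithmetic as GetHTMLPart,
' applied to a plain string instead of a Word document's HTML.
Function ExtractPart(strText As String, strStartTag As String, _
                     strEndTag As String) As String
    Dim lngStartPos As Long
    Dim lngEndPos As Long

    lngStartPos = InStr(strText, strStartTag)
    If lngStartPos = 0 Then Exit Function      ' start tag not found

    ' Search for the end tag only after the start tag.
    lngEndPos = InStr(lngStartPos, strText, strEndTag)
    If lngEndPos = 0 Then Exit Function        ' end tag not found

    ' Move past the end tag so that it is included in the result.
    lngEndPos = lngEndPos + Len(strEndTag)
    ExtractPart = Mid$(strText, lngStartPos, lngEndPos - lngStartPos)
End Function

Sub DemoExtractPart()
    Dim strSample As String

    ' Made-up HTML, far simpler than what Word actually generates.
    strSample = "<html><head><style>p {margin:0}</style></head>" & _
                "<body lang=EN-US><p>Hello</p></body></html>"

    Debug.Print ExtractPart(strSample, "<style", "</style>")
    Debug.Print ExtractPart(strSample, "<body", "</body>")
End Sub
```

Running DemoExtractPart from the Immediate window should print the <style> element followed by the <body> element, each including its opening and closing tags, which is the shape of the strings we insert into the Web page later on.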

Creating a Page on a FrontPage Web Site

In our scenario, we want to add a new Web page to a FrontPage Web site. We will need a function that checks a URL to see whether it represents an existing FrontPage Web site. If it does, we will add a new page to that site. If it does not, we will create a new FrontPage Web site, and then we'll add a new page to it.

The following function accepts a single argument representing the URL to a FrontPage Web site.

Function CreateNewWeb(strPath As String) As FrontPage.Web
    Dim wbNewWeb As FrontPage.Web
    On Error Resume Next
    ' Check to see if strPath represents an existing FrontPage web.
    If FrontPage.Webs.Count > 0 Then
        For Each wbNewWeb In FrontPage.Webs
            If UCase(wbNewWeb.Url) = UCase(strPath) Then
                ' Return pointer to the existing FrontPage web
                ' and exit this function.
                Set CreateNewWeb = wbNewWeb
                Exit Function
            End If
        Next wbNewWeb
    End If
    ' The URL in strPath does not reference an existing web so
    ' add this web to the Webs collection.
    Set CreateNewWeb = FrontPage.Webs.Add(strPath)
End Function

The CreateNewWeb function steps through each Web object in the Webs collection, comparing the URL in the strPath argument with the URL for each Web object. If a match is found, the function returns the matching Web object to the calling procedure. If no match is found, the function uses the Webs collection Add method to create a new Web site using the folder path contained in strPath. The following code illustrates how to call the CreateNewWeb function using a URL to a folder on the C: drive (note that all the variables needed for code in the remainder of this article are dimensioned here as well):

   Dim strWebPath As String
   Dim strNewPage As String
   Dim wbWeb As FrontPage.Web
   Dim wbFile As FrontPage.WebFile
   Dim wbFiles As FrontPage.WebFiles
   Dim pwWindow As FrontPage.PageWindow
   Dim fpDoc As FrontPageEditor.FPHTMLDocument
   Dim fpBody As FrontPageEditor.FPHTMLBody

   strWebPath = "file:///C:/Example Webs/WebSampleOne"
   Set wbWeb = CreateNewWeb(strWebPath)

Now that we have a variable representing a FrontPage Web site on disk, we can add our new Web page to that site. However, because the CreateNewWeb function can return either a new Web site or a pointer to an existing site, we should ensure that our new page does not already exist before we try to create it.

You can use the Web object's LocateFile method to find an existing Web page. If the Web page we want to create already exists, we want to display a message box asking whether the file should be overwritten. If the Web page does not exist, we want to create it.

   On Error Resume Next
   strNewPage = "WordDoc.htm"
   Set wbFile = wbWeb.LocateFile(strNewPage)
   If Not wbFile Is Nothing Then
      If MsgBox("The file named '" & strNewPage & "' " _
         & "already exists. Do you want to replace it " _
         & "with the active document?", vbOKCancel, _
         "Replace existing file?") = vbCancel Then
            Exit Function
      End If
   Else
      Set wbFiles = wbWeb.RootFolder.Files
      Set wbFile = wbFiles.Add(strNewPage, False)
   End If

At this point, the wbFile object variable represents the page on the FrontPage Web site into which we will insert the HTML from the active Word document.

Strip HTML Tags from a Preexisting Page

Since we may be working with a preexisting page, and since we are going to add <STYLE> information from our active Word document, we have to strip out any existing <STYLE> tags in the page.

To do this work, I've written the StripTags function, which is designed to remove any specific collection of HTML tags in a Web page. You pass in the FrontPage Document object you want to work with and the name of the tag you want to remove.

You can get a Document object by using the Document property of a FrontPage PageWindow object. You can get the PageWindow object by using the Edit method of the FrontPage WebFile object.

   ' Get the PageWindow object.
   Set pwWindow = wbFile.Edit(fpPageViewNormal)
   ' Get the Document object.
   Set fpDoc = pwWindow.Document
   ' Remove preexisting <style> tags.
   If fpDoc.all.tags("style").Length > 0 Then
      StripTags fpDoc, "style"
   End If

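Function StripTags(fpDoc As FrontPageEditor.FPHTMLDocument, _
                    strTagName As String) As Boolean
    Dim intTag As Integer
    ' Step backward through the collection so that removing an element
    ' does not shift the index of the elements still to be removed.
    For intTag = fpDoc.all.tags(strTagName).Length - 1 To 0 Step -1
        fpDoc.all.tags(strTagName).Item(intTag).outerHTML = ""
    Next intTag
    ' Report success to the calling procedure.
    StripTags = True
End Function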

Now that we have cleaned up the Web page, all we have left to do is to add the <STYLE> and <BODY> portions of the active Word document to the Web page.

Insert Word HTML Into a Web Page

Most of the work required to get the Word HTML out of the active document and into the FrontPage Web page is done by the GetHTMLPart function discussed at the start of this article. Remember that the GetHTMLPart function returns a portion of the HTML from the Word document. In this case, we need it to return the <STYLE> and the <BODY> portions of the Word document. Once we have that HTML from the Word document, we need only insert the <STYLE> information within the <HEAD></HEAD> tag pair and replace the existing <BODY> HTML with the <BODY> HTML from the Word document.

The following code illustrates how to replace the <BODY> portion of the page with the <BODY> HTML from the Word document:

    Set fpBody = fpDoc.body
    fpBody.outerHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")

The last step shows how to insert the <STYLE> HTML from the Word document at the end of the page's <HEAD> section (any existing <STYLE> tags have already been stripped out):

    fpDoc.all.tags("head").Item(0).insertAdjacentHTML "BeforeEnd", _
        GetHTMLPart(ActiveDocument, "<style", "</style>")

Finally, to save the changes to the Web page, call the Close method of the PageWindow object and pass True as its argument:

    pwWindow.Close True
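The fragments shown in this article can be assembled into a single procedure along the following lines. Treat this as a sketch rather than finished code: SaveActiveDocToWeb is a hypothetical name, the procedure assumes that GetHTMLPart, CreateNewWeb, and StripTags are in the same module, that the two FrontPage references are set, and that the example path is one you are willing to create on disk. I have not added error handling beyond what the fragments already use.

```vba
Sub SaveActiveDocToWeb()
    ' Hypothetical wrapper that strings together the article's fragments.
    Dim strWebPath As String
    Dim strNewPage As String
    Dim wbWeb As FrontPage.Web
    Dim wbFile As FrontPage.WebFile
    Dim pwWindow As FrontPage.PageWindow
    Dim fpDoc As FrontPageEditor.FPHTMLDocument

    strWebPath = "file:///C:/Example Webs/WebSampleOne"
    strNewPage = "WordDoc.htm"

    ' Open the existing web or create a new one.
    Set wbWeb = CreateNewWeb(strWebPath)
    If wbWeb Is Nothing Then Exit Sub

    ' Look for the page; wbFile stays Nothing if it cannot be located.
    On Error Resume Next
    Set wbFile = wbWeb.LocateFile(strNewPage)
    On Error GoTo 0

    If wbFile Is Nothing Then
        ' The page does not exist yet, so create it.
        Set wbFile = wbWeb.RootFolder.Files.Add(strNewPage, False)
    ElseIf MsgBox("The file named '" & strNewPage & "' " _
        & "already exists. Do you want to replace it " _
        & "with the active document?", vbOKCancel, _
        "Replace existing file?") = vbCancel Then
        Exit Sub
    End If

    ' Open the page for editing and get its Document object.
    Set pwWindow = wbFile.Edit(fpPageViewNormal)
    Set fpDoc = pwWindow.Document

    ' Remove preexisting <style> tags (does nothing if there are none).
    StripTags fpDoc, "style"

    ' Replace the body, then add the Word styles to the end of <head>.
    fpDoc.body.outerHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")
    fpDoc.all.tags("head").Item(0).insertAdjacentHTML "BeforeEnd", _
        GetHTMLPart(ActiveDocument, "<style", "</style>")

    ' Save the changes and close the page window.
    pwWindow.Close True
End Sub
```

With a formatted document open in Word, you could run SaveActiveDocToWeb from the Visual Basic Editor. Using Exit Sub here plays the same role as the Exit Function statement in the fragment above, which assumed the code lived inside a function.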

If we ran this code against the Word document shown in the first section of this article, we would create a Web page that looks remarkably similar to the original Word document, but without the unnecessary HTML or XML that Word inserts in its documents:

Where to Get More Info?

Check out the following links for more information on working with FrontPage:
