This documentation is archived and is not being maintained.

Creating Markup Text in Visual Basic .NET

Visual Studio .NET 2003

Darren Neimke

September 2, 2003

Summary: Read about a tool that allows you to mark up code in any language, is customizable for personal color schemes, supports drag-and-drop operations, produces output as HTML or XML tagged text, and renders HTML markup. (13 printed pages)

Download the MarkUp.msi sample file.


I'm always working with code. I write code for a living as well as for fun, and on top of that I regularly post snippets of code for my friends or to newsgroups. When sharing code, I'm fussy about what it looks like. For example, I like to ensure that I've used meaningful variable names and that I haven't declared any unused variables. I also like to make sure that the code snippet is syntax colored according to my editor settings: strings in red, comments in green, keywords in blue, fonts in Lucida Console, and so on. There are several ways to do this, and the method that I often fall back on is to use Microsoft Word and Microsoft Visual Studio® like this:

  1. Copy some code from the Code or HTML view of a document in Visual Studio .NET.
  2. Paste it into a Word document as Paste Special | Formatted Text (RTF) {the styles are persisted}.
  3. Copy it from the Word document and paste it onto the Design View of a new HTML document.

The styles are transferred; however, you end up with the verbose Office formatting syntax that looks like this:

    <P class=MsoNormal style="mso-layout-grid-align: none">
        <SPAN lang=EN-US style="FONT-SIZE: 8pt; COLOR: blue; 
            FONT-FAMILY: 'Courier New'; mso-ansi-language: EN-US">using</SPAN>
        <SPAN lang=EN-US style="FONT-SIZE: 8pt; 
            FONT-FAMILY: 'Courier New'; mso-ansi-language: EN-US"> System;
            <?xml:namespace prefix = o />

When all that you really needed was this:

    <font color="Blue">using</font> System ;

The second sample is obviously much easier to maintain, but also consider the extra text that is being sent across the wire in the first example. This leads to slower page rendering times.

It would be easy to write a macro that used regular expressions to parse the Microsoft Office formatting and strip it back to a much less verbose format. However, that still wouldn't solve every problem, and it would mean relying on having Visual Studio, Word, and a decent working knowledge of regular expressions every time that you wanted to mark up code. On top of that, you would also be relying on Visual Studio .NET to provide syntax coloring for your language of choice.

So, I decided that I needed a tool that would:

  • Allow me to mark up code that is written in any language.
  • Be highly configurable to support not only my color scheme, but also anybody else's color scheme.
  • Support drag and drop operations.
  • Output as either HTML markup or XML tagged text so that it can be given to someone else who could then mark it up in their color scheme.
  • Render HTML markup as either FONT tags or SPANs with CSS style class names, as shown in Figure 1.

    Figure 1. Screenshot of rendered HTML markup

Here's an example of what this application would allow me to do. Given the following code snippet:

    ' This is a comment
    Dim foo As New Bar()

It could produce output in any of the following three formats:

Raw tokens

<Comment>' This is a comment</Comment>
<Keyword>Dim</Keyword> foo <Keyword>As</Keyword> 
<Keyword>New</Keyword> Bar()   

HTML format

<font color="green">' This is a comment</font>
<font color="blue">Dim</font> foo <font color="blue">As</font> 
<font color="blue">New</font> Bar()

CSS format

<span class="Comment">' This is a comment</span>
<span class="Keyword">Dim</span> foo <span class="Keyword">As</span> 
<span class="Keyword">New</span> Bar()

Even better, if the CSS format is chosen, the tool also emits a stylesheet definition for you.

Figure 2. Tool-generated CSS stylesheet
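The transformation from raw tokens to either rendered format is mechanical. Here is a minimal sketch in Python (the tool itself is written in Visual Basic .NET; the function names and the two-entry color map are illustrative assumptions, not the tool's actual code):

```python
import re

# Minimal sketch of the three output formats shown above.
KEYWORDS = r"\b(Dim|As|New)\b"
COLORS = {"Comment": "green", "Keyword": "blue"}

def tokenize(line):
    """Wrap comments and keywords in <Name>...</Name> raw tokens."""
    if line.lstrip().startswith("'"):
        return "<Comment>%s</Comment>" % line
    return re.sub(KEYWORDS, r"<Keyword>\1</Keyword>", line)

def to_html(tokenized):
    """Render raw tokens as FONT tags."""
    def repl(m):
        return '<font color="%s">%s</font>' % (COLORS[m.group(1)], m.group(2))
    return re.sub(r"<(\w+)>(.*?)</\1>", repl, tokenized)

def to_css(tokenized):
    """Render raw tokens as SPANs with CSS class names."""
    def repl(m):
        return '<span class="%s">%s</span>' % (m.group(1), m.group(2))
    return re.sub(r"<(\w+)>(.*?)</\1>", repl, tokenized)

for line in ["' This is a comment", "Dim foo As New Bar()"]:
    raw = tokenize(line)
    print(raw)
    print(to_html(raw))
    print(to_css(raw))
```

The same raw-token intermediate feeds both renderers, which is exactly why the tool emits raw tokens as a shareable format.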

Defining What Code Is

Thankfully, while the actual words used to form a language (the syntax) differ, all computer languages share common elements such as keywords, operators, functions, and single-line or multi-line statements. This is because, regardless of the syntax used by a language, the operations (or semantics) that languages allow you to perform are essentially the same: repetition statements, control statements, functions, and variable declarations, to name a few. It might help to highlight some of these syntactic differences by showing an abbreviated table of operations.

Operation            Visual Basic                  C#
Control Statements   Do, For...Each, While         do, foreach, while
Strings              "..."                         "..."
Comments             '..., Rem                     //..., /*...*/
Classes/Functions    Function, Class, Sub          function, class, void
Data Types           Integer, String, Boolean      int, string, bool

More importantly, in addition to the operational groupings, language elements can also be divided at a higher level into two distinct types:

  1. Types that are defined by beginning and ending characters.
  2. Types that are defined by a collection of characters within word boundaries.

To understand what I mean, consider Strings and Data Types from the operational table above. Strings are of the first type: those that are defined by beginning and ending characters. That is, whenever I find a beginning String character, I know that everything following it is part of the String until I hit the ending String character. This is highlighted in the following example, where the String is not dependent on anything other than its beginning (") and its ending (") characters. The beginning/ending characters can also be non-visible characters, such as the beginning of a line or an end-of-line character. Visual Basic® comments, for example, begin with an apostrophe and end with an end-of-line character.

This is not in the string "but this is, and so is this", but not this.
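A sketch of this in Python (illustrative only; the tool itself builds equivalent .NET regular expressions): once the beginning quote is found, everything up to the ending quote belongs to the String.

```python
import re

# A block-style element: BeginChar, then any text, then EndChar.
text = 'This is not in the string "but this is, and so is this", but not this.'
string_pattern = r'"[^"]*"'  # begin quote, any non-quote text, end quote

match = re.search(string_pattern, text)
print(match.group(0))  # "but this is, and so is this"
```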

Data types, on the other hand, are of the second type: those that are defined by word boundaries and contain a collection of characters. Unlike the first type, you cannot predict which characters will appear at the beginning or the end, only which characters are needed in the middle. A word boundary can be defined as the position at which there is a word character on one side and a non-word character on the other. This ensures that matches occur only on full words and not on text embedded within larger words; for example, the "Rem" in "Remember" is clearly not a complete word. The character next to a word boundary might be a parenthesis "(", a hyphen "-", a colon ":", or even the beginning of a line.
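The word-boundary behavior can be seen in a short Python sketch (illustrative; .NET's \b behaves the same way):

```python
import re

# \b matches the position between a word character and a non-word character,
# so "Rem" inside "Remember" is never matched.
text = "Remember: Rem starts a comment (Rem), but Remember does not."
print(re.findall(r"\bRem\b", text))  # ['Rem', 'Rem']
```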

For the remainder of this article, I'll refer to these Types of elements as follows:

Non-Word or Block types

  • MultiLine Block
  • SingleLine Block (defined by begin/end characters but constrained to a single line, such as Visual Basic Strings)

Word type

  • Keyword
  • Operator

Marking-up Text

It's now clear that the trick to creating a single tool flexible enough to mark up any language lies in producing a definition for these Types based on the points covered in the previous section. To do this, I'll create an XML-based configuration file that defines the language and the Types of elements it contains, and that describes each of the different operations so that different colors can be applied in a manner similar to my favorite editor, Visual Studio .NET.

Each language will be defined by a Language element that will contain multiple Pattern elements. These elements will be responsible for describing each of the syntax features. At a minimum, each Pattern element should contain the following:

  • Type: MultiLine, SingleLine, Keyword, Operator
  • Name: The name of the operation (String, Keyword, Comment)
  • BeginChar: The beginning character for NonWord Types
  • EndChar: The ending character for NonWord Types
  • Words: A words collection for Word Types
  • FontInfo: color, font-family, font-size, and so on

Another quirky yet important language feature is the ability to have escape characters within NonWord types, which allow the character that would normally be the end delimiter to be embedded within the NonWord type. For example, to embed double quotes within a string, C# provides the "\" escape character, used as follows:

    string myString = "Norman says \"Hello\"." ;   // Prints:  Norman says "Hello".
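The escape character changes the shape of the block pattern: an escaped character must be consumed before the end delimiter is tested. A Python sketch of this (illustrative; the tool builds the equivalent .NET pattern from the language definition):

```python
import re

# "\\." consumes an escaped character before "[^"\\]" can see the quote,
# so escaped quotes never terminate the string early.
text = r'string myString = "Norman says \"Hello\"." ;'
csharp_string = r'"(?:\\.|[^"\\])*"'

print(re.search(csharp_string, text).group(0))  # "Norman says \"Hello\"."
```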

Finally, certain languages need to support case sensitivity, as is the case with C#. Here's a very abbreviated version of the C# language definition:

    <Language name="CSharp" caseSensitive="true">
        <Pattern type="MultiLine" name="BlockComment" beginChar="/*" endChar="*/">
            <FontSettings name="Lucida Console" color="Green" size="10" />
        </Pattern>
        <Pattern type="SingleLine" name="InlineComment" beginChar="//" endChar="\n">
            <FontSettings name="Lucida Console" color="Green" size="10" />
        </Pattern>
        <Pattern type="SingleLine" name="XmlComment" beginChar="///" endChar="\n">
            <FontSettings name="Lucida Console" color="LightGrey" size="10" />
        </Pattern>
        <Pattern type="MultiLine" name="String" beginChar="&quot;" endChar="&quot;"
                escapeChar="\">
            <FontSettings name="Lucida Console" color="Red" size="10" />
        </Pattern>
        <Pattern type="Keyword" name="ReferenceType">
            <FontSettings name="Lucida Console" color="Blue" size="10" />
            <!-- abbreviated word list -->
            <Words>class,delegate,object,string</Words>
        </Pattern>
    </Language>

As you can see, MultiLine and SingleLine (NonWord) syntax types have the beginChar/endChar attributes, while Word types contain a Words collection. The language node also has a caseSensitive attribute value of true while the String Pattern type allows for an escapeChar of "\".
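Loading such a definition is straightforward. Here is a sketch using Python's ElementTree (the tool itself is Visual Basic .NET; the two-pattern sample document and the dictionary shape below are illustrative assumptions):

```python
import xml.etree.ElementTree as ET

# An abbreviated language definition, following the element and attribute
# names shown in the listing above.
LANG_XML = """
<Language name="CSharp" caseSensitive="true">
    <Pattern type="SingleLine" name="InlineComment" beginChar="//" endChar="\\n">
        <FontSettings name="Lucida Console" color="Green" size="10" />
    </Pattern>
    <Pattern type="Keyword" name="ReferenceType">
        <FontSettings name="Lucida Console" color="Blue" size="10" />
        <Words>class,string,void</Words>
    </Pattern>
</Language>
"""

root = ET.fromstring(LANG_XML)
case_sensitive = root.get("caseSensitive") == "true"

patterns = []
for p in root.findall("Pattern"):
    font = p.find("FontSettings")
    entry = {
        "type": p.get("type"),
        "name": p.get("name"),
        "beginChar": p.get("beginChar"),  # None for Word types
        "endChar": p.get("endChar"),
        "color": font.get("color"),
    }
    words = p.findtext("Words")
    if words:
        entry["words"] = words.split(",")
    patterns.append(entry)
```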

Creating the Tool

The tool itself is responsible for loading the language configuration file and applying a set of rules to derive regular expressions, and those regular expressions will then be used to locate the various elements within loaded code snippets.

Internally, the design of the tool is reasonably straightforward. I have a Language class and a Pattern class to abstract the data in my configuration file and an HtmlFormatting module that provides a set of routines for formatting and coloring the code snippets.

When creating the Pattern classes, I noticed that this was an ideal opportunity to use inheritance, so I created a base class to handle the common features and extended it with specialized features, such as the BeginChar/EndChar properties or the Words collection, depending on the type of Pattern. Properties common to all patterns include Name, FontSettings, Type, and RegexPattern.

The specialized classes are named WordPattern and NonWordPattern, and they extend the base Pattern class. The NonWordPattern class provides the BeginChar and EndChar properties, while the WordPattern class exposes a collection of Words. Each class provides its own implementation for exposing the regular expression that defines it.
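The hierarchy can be sketched as follows (Python, for illustration; the class and property names come from the article, while the method bodies are my own assumptions):

```python
import re

class Pattern:
    """Common features: Name, Type; RegexPattern comes from subclasses."""
    def __init__(self, name, pattern_type):
        self.name = name
        self.type = pattern_type

    @property
    def regex_pattern(self):
        raise NotImplementedError  # each subclass supplies its own expression

class NonWordPattern(Pattern):
    """Block-style element: BeginChar <any text until> EndChar."""
    def __init__(self, name, pattern_type, begin_char, end_char):
        super().__init__(name, pattern_type)
        self.begin_char = begin_char
        self.end_char = end_char

    @property
    def regex_pattern(self):
        return re.escape(self.begin_char) + ".*?" + re.escape(self.end_char)

class WordPattern(Pattern):
    """Word-style element: WordBoundary <any word in collection> WordBoundary."""
    def __init__(self, name, pattern_type, words):
        super().__init__(name, pattern_type)
        self.words = list(words)

    @property
    def regex_pattern(self):
        return r"\b(?:%s)\b" % "|".join(re.escape(w) for w in self.words)

kw = WordPattern("Keyword", "Keyword", ["Dim", "As", "New"])
print(kw.regex_pattern)  # \b(?:Dim|As|New)\b
```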

In a nutshell, the semantics for creating those regex patterns look like this:

Non-Word types

BeginChar<Any Text Until>EndChar

In the case of single-line comments in Visual Basic, the pattern matches an apostrophe (the BeginChar) and then consumes everything up to the end of the line (the EndChar).

Word types

WordBoundary<Any Single Word In Words Collection>WordBoundary

In the case of TSQL functions, the pattern matches any one of the function names in the Words collection, bounded on both sides by a word boundary.

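The two shapes can be tried out directly (a Python sketch; the Visual Basic comment pattern follows the tool's own "[^\n\r]*" construction, while the TSQL function list here is an illustrative assumption):

```python
import re

# Non-Word type: Visual Basic single-line comment (' ... end of line)
vb_comment = r"'[^\n\r]*"

# Word type: a few TSQL function names between word boundaries (illustrative list)
tsql_funcs = r"\b(?:COUNT|SUM|AVG|MIN|MAX)\b"

code = "Dim n As Integer ' counts rows"
sql = "SELECT COUNT(*), SUM(total) FROM Orders"

print(re.search(vb_comment, code).group(0))  # ' counts rows
print(re.findall(tsql_funcs, sql))           # ['COUNT', 'SUM']
```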
Problems Faced and Solved

At this point, the Pattern classes each provide a regular expression pattern so that they can be located within a body of text. What needs to be done now is to orchestrate the overall pattern-matching process in such a manner as to highlight all items within the code snippet. Some of the challenges here include:

  • Not marking-up the same piece of text more than once.
  • Defining WORD types that have unusual characters on their boundaries.
  • Allowing for escaped characters.

The first of these three generally arises where you have Keywords embedded within BLOCK elements, such as Strings or Comments. For example, in the following commented-out line of code, you wouldn't want the keywords to be recognized and colored separately. You want the entire line to be colored as a comment and only marked-up once.

   ' Dim foo As New Bar( whatever )

A basic algorithm would simply mark up each item as it came along. Thus, if the task at hand was to find and mark up comments, that is what would be done. If it were to mark up keywords, that too would be done. I needed something a little smarter than that. I needed a way to leave behind a proverbial trail of breadcrumbs that would indicate to the parser that a section had already been marked up.


To get around the problem of repeated markup, I implemented a tokenizing phase, which wraps identified chunks of text in identifier tokens. In addition, the entire code snippet is wrapped in <available>...</available> tokens. When a language item is located within the code snippet, it is enclosed in its token, and a closing and a reopening <available> tag are emitted around it.

Presume that you want to mark up the following chunk of SQL code:

   SELECT *
   FROM dbo.Customers
   WHERE dateCreated 
      BETWEEN @startDate 
      AND @endDate

The first step is to wrap the entire snippet in <available> tokens to indicate to the parser that everything is available:

   <available>SELECT *
   FROM dbo.Customers
   WHERE dateCreated 
      BETWEEN @startDate 
      AND @endDate</available>

Now presume that during tokenizing the words are found and matched in the following order—BETWEEN, WHERE, SELECT, FROM, and AND. Because of the closing-tokenizing-opening sequence of my parser, this results in the following changes to the snippet at the end of each match in the sequence:

After finding "BETWEEN"

   <available>SELECT *
   FROM dbo.Customers
   WHERE dateCreated 
      </available><token>BETWEEN</token><available> @startDate 
      AND @endDate</available>

After finding "WHERE"

   <available>SELECT *
   FROM dbo.Customers
   </available><token>WHERE</token><available> dateCreated 
      </available><token>BETWEEN</token><available> @startDate 
      AND @endDate</available>

After finding "SELECT"

   <available></available><token>SELECT</token><available> *
   FROM dbo.Customers
   </available><token>WHERE</token><available> dateCreated 
      </available><token>BETWEEN</token><available> @startDate 
      AND @endDate</available>

After finding "FROM"

   <available></available><token>SELECT</token><available> *
   </available><token>FROM</token><available> dbo.Customers
   </available><token>WHERE</token><available> dateCreated 
      </available><token>BETWEEN</token><available> @startDate 
      AND @endDate</available>

After finding "AND"

   <available></available><token>SELECT</token><available> *
   </available><token>FROM</token><available> dbo.Customers
   </available><token>WHERE</token><available> dateCreated 
      </available><token>BETWEEN</token><available> @startDate 
      </available><token>AND</token><available> @endDate</available>

As you can see, with each match the available text is split apart and the matched text enclosed within a <token> element. With this mechanism in place, it's simply a matter of ensuring that, when looking for text to mark up, the search only occurs in text that sits between <available> tokens. Once the matching process is finished, I remove whatever <available> or </available> markers remain, leaving only the tokens that identify syntax elements.

   codeSnippet = Regex.Replace(codeSnippet, "</?available>", "")

   <token>SELECT</token> *
   <token>FROM</token> dbo.Customers
   <token>WHERE</token> dateCreated 
      <token>BETWEEN</token> @startDate 
      <token>AND</token> @endDate
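The close-tokenize-reopen sequence can be sketched compactly in Python (illustrative; the real tool is Visual Basic .NET and works the same way via a MatchEvaluator):

```python
import re

def tokenize(snippet, word_pattern, token_name):
    """Search only inside <available>...</available>; each hit closes and
    reopens the available region around a <token> element."""
    def handle_available(m):
        return re.sub(word_pattern,
                      lambda w: "</available><%s>%s</%s><available>"
                                % (token_name, w.group(0), token_name),
                      m.group(0))
    return re.sub(r"<available>.*?</available>", handle_available,
                  snippet, flags=re.S)

snippet = "<available>SELECT * FROM dbo.Customers</available>"
snippet = tokenize(snippet, r"\bFROM\b", "token")
snippet = tokenize(snippet, r"\bSELECT\b", "token")

# Finally, strip whatever <available> markers remain.
result = re.sub(r"</?available>", "", snippet)
print(result)  # <token>SELECT</token> * <token>FROM</token> dbo.Customers
```

Note that the second pass cannot re-match FROM, because FROM now sits outside every <available> region.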

In the days before .NET, I would have been stuck with creating a pattern that captured both the Available and Non-Available sections, enumerating the resulting Matches collection, marking up the Available captures, and then re-assembling them by concatenation to ensure that I only matched within certain sections of the text. That method was fraught with danger: it was clumsy (especially without named captures) and you ended up touching much more text than needed.

The .NET flavor of regular expressions allows a MatchEvaluator delegate to be wired up to the Regex class's Replace method. This has the effect of passing only the matched text to a handler, which returns the replacement text; all of the concatenation is done behind the scenes. This ensures that I touch the minimum amount of text, thereby reducing the likelihood of error.
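Python's re.sub offers the same idea (a function in place of a replacement string), which may help if the delegate mechanics are unfamiliar:

```python
import re

# The analogue of wiring a MatchEvaluator to Regex.Replace: re.sub accepts
# a function that receives each Match and returns its replacement.
def evaluator(match):
    return match.group(0).upper()

print(re.sub(r"\bselect\b", evaluator, "select * from t"))  # SELECT * from t
```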

Passing matches to a MatchEvaluator delegate

    ' Important: only tokenize parts of the snippet that haven't yet been
    ' tokenized. This pattern matches a single <available>...</available> section.
    Private m_AvailablePartPattern As String = _
        "<available>(.|\n)*?</available>"

    ' Matches the "available" text and hands it off to a delegate for
    ' further inspection.
    Public Sub TokenizeWordElements( _
        ByVal patternName As String, _
        ByVal regexString As String, _
        ByVal _caseSensitive As Boolean, _
        ByRef codeSnippet As String)

        m_CaseSensitive = _caseSensitive
        m_Name = patternName
        m_REString = regexString

        Dim _delegate As New MatchEvaluator(AddressOf WordElementMatchHandler)

        Dim r As New Regex(m_AvailablePartPattern, _
                RegexOptions.IgnoreCase Or _
                RegexOptions.Compiled)

        codeSnippet = r.Replace(codeSnippet, _delegate)
    End Sub

    Private Function WordElementMatchHandler(ByVal _match As Match) As String
        ' If, for some reason, no pattern name was supplied... bail out.
        If m_Name = String.Empty Then Return _match.Value

        Dim opts As RegexOptions = RegexOptions.Multiline
        If Not m_CaseSensitive Then
            opts = opts Or RegexOptions.IgnoreCase
        End If
        Dim re As New Regex(m_REString, opts)

        ' Tokenize the match. (regexString is expected to capture the
        ' matched word in group 1.)
        Return re.Replace(_match.Value, "</available><" & m_Name & ">$1</" & _
            m_Name & "><available>")
    End Function

The Importance of Order

One thing that I mentioned earlier but haven't explained is the order in which to do the matching. You should ensure that items are matched from those with the widest scope down to those with the narrowest scope. This means MultiLine BLOCK types are matched before SingleLine pattern types, and both are matched before the WORD types. The reason is that BLOCK elements can contain items that would otherwise qualify as WORD types, but not the other way around.
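A small Python sketch of widest-scope-first ordering (illustrative; the tool achieves the same effect with its <available> mechanism): mark up the block comment first, then apply the keyword pass only to the unmarked segments.

```python
import re

code = "Dim x ' Dim foo As New Bar()"

# Split so the comment (widest scope) is captured as its own segment.
parts = re.split(r"('[^\n\r]*)", code)
out = []
for part in parts:
    if part.startswith("'"):
        out.append("<Comment>%s</Comment>" % part)  # block pass
    else:
        out.append(re.sub(r"\b(Dim|As|New)\b",
                          r"<Keyword>\1</Keyword>", part))  # word pass
print("".join(out))
# <Keyword>Dim</Keyword> x <Comment>' Dim foo As New Bar()</Comment>
```

Reversing the order would tokenize the Dim, As, and New inside the comment before the comment pattern ever ran.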

Matching BLOCK elements is reasonably straightforward. You find the BeginChar character and then keep matching until you find the EndChar character, or until you pass the point beyond which it could no longer appear. In the case of a Visual Basic String, a failure means hitting the end of the line without finding the closing double-quote character ("). Here is the code that's responsible for building that pattern string:

    Public Function GetBlockPattern() As String
        Dim rePattern As String = "("
        Dim ptrn As PatternBase
        For Each ptrn In Me.Patterns
            If TypeOf ptrn Is NonWordPattern Then
                If rePattern.Length > 1 Then
                    rePattern &= "|"
                End If
                ' Each block pattern becomes a named group in the alternation.
                rePattern &= "(?'" & ptrn.Name & "'" & _
                    ptrn.PatternString() & ")"
            End If
        Next
        rePattern &= ")+"
        Return rePattern
    End Function

As you can see, it's just an Or expression of each Block Pattern Type's individual expression. Looking at the logic for building the PatternString, you can see that the implementation depends on whether or not there is an EscapeChar present for the individual pattern.

    ' Class-specific implementation of CreatePatternString.
    ' Logic for a non-Word type is: BeginChar <Any Text Until> EndChar
    Private Function CreatePatternString() As String
        Dim retVal As String
        If Me.EndChar = "\n" Then
            ' Patterns that end at the end of the line are simple.
            retVal = Regex.Escape(Me.BeginChar) & "[^\n\r]*"
        ElseIf Me.HasEscapeChar Then
            ' Consume escaped characters before testing for the end delimiter.
            retVal = String.Format("{0}(?>{1}.|[^{2}]|.)*?{3}", _
                Regex.Escape(Me.BeginChar), Regex.Escape(Me.EscapeChar), _
                Regex.Escape(Left(Me.EndChar, 1)), Regex.Escape(Me.EndChar))
        Else
            retVal = String.Format("{0}[^{1}]*(?>[^{1}]|.)*?{2}", _
                Regex.Escape(Me.BeginChar), _
                Regex.Escape(Left(Me.EndChar, 1)), Regex.Escape(Me.EndChar))
        End If

        Return retVal
    End Function

Preserving States

When building the individual patterns for BLOCK types, I've used atomic, nonbacktracking groups, written (?>...), to minimize the amount of backtracking that is available. Basically, I can use an atomic group because I know that if I find the BeginChar for a pattern, I want to keep consuming text until I either find the EndChar or fail; I never want to give anything up. Throwing away saved states in this manner allows a failure to occur more quickly than if the states were held onto and retried later in the match attempt.

Possible Future Enhancements

I've attempted to create the individual regular expressions in a manner that knows very little about specific language implementations. One possible future enhancement would be to introduce known, language-specific semantic checks in an endeavor to speed things up a bit. An example of this, in the case of TSql, would be to hard-code the opening word-boundary marker as either "\b" or "@@", because it is known that "@@" can begin a word even though the regex engine doesn't look at it that way. Building this into the TSql language definition would remove the extra tolerance on the BeginChar for languages that don't require it, creating an overall speed gain.


Conclusion

In this article you've seen how it is possible to create a small parser using regular expressions to separate language keywords and apply syntax coloring. You also saw that the features added to regular expressions in .NET provide a lot more power and flexibility when it comes to working with patterns in text.

Darren Neimke is a senior applications developer and has been building ASP applications since the heady days of the Internet in the late '90s. Prior to that, he was developing accounting programs using Access 2 and VBA. In the past couple of years a quirky character change led Darren into the dark world of regular expressions, where he now spends much of his time maintaining and writing about regular expressions in his blog.