Turn Your Log Files into Searchable Data Using Regex and the XML Classes

 

Roy Osherove

January 2004

Summary: Discover how to use the built-in regular expression functionality in the .NET Framework to convert delimited log files into XML for later processing. (12 printed pages)

Applies to
   Microsoft® ASP.NET
   Microsoft® .NET Framework
   Microsoft® Visual C#®

Download the source code for this article.

Contents

What You'll Need
Introduction
Parsing Text Easily
Transforming to XML
A More Generic Approach
Putting It All Together
Letting the User Search for Data
Summary

What You'll Need

Before diving into this article, you'll want to:

Introduction

"What?! You want me to transform all this into something searchable?" I asked with dismay. It was an early morning meeting with the company's CTO. The topic was, "We've got a bunch of log files from our legacy application, and we need to provide a good way to collect and search data from those logs." The "logs" were a bunch of year-old text files in which our legacy application saved all its logs for a specific task it performed. Now the customer wanted to have some statistics about all the operations that were written in those files.

CTO: "Yeah, and we need it right away. Can you do it?"

"Sure", I replied. I had to think about how to do this, though. Surely, this would be a challenge.

Log File Spec

Taking a look at one of the log files, I saw the following:

25/05/2002   21:49   Search   Dozer   Anita1
25/05/2002   21:51   Update   Dozer   Anita1
26/05/2002   11:02   Search   Manda   Gerry2k
26/05/2002   11:12   Update   Manda   Gerry2k
27/05/2002   15:34   Search   Anka   Anita1
12/08/2002   10:14   Search   Amber   Huarez

Each line was built of the following columns, delimited by tabs:

  • Date (day/month/year)
  • Time (HH:MM)
  • Action Type
  • Record Name
  • User Name

The Game Plan

So, I needed to transform this blob of text and lines into something a little more structured, something we could search through. You're thinking, "Hmm, why not import the file into Microsoft® Access using the Tab-Delimited Wizard?" That solution would be totally okay if we had just one file, or a small bunch of files. The solution here required a little more automation. Plus, had the log files been written in a different format (for example, several lines per log entry), we'd be in trouble.

What we needed here was a raw searchable data format. Something like XML...

Thinking more about XML, I could see several benefits. Once these files were in XML, we could:

  • Import them into Microsoft® Excel and, more importantly, Access—thus allowing data to be searched.
  • Directly load a DataSet object from this XML, and perform searches on that DataSet in memory (which I'll show later).
  • Create reports using XSLT.
  • Pretty much do anything we want with this data, since it would be pure XML.

But how to perform this magical act? What tools are in the Microsoft® .NET Framework that would allow us to:

  • Parse text easily.
  • Write XML files easily from the parsed text.

Parsing Text Easily

If you've worked with regular expressions before, you know that using them is one of the fastest and easiest ways of parsing text. In the .NET Framework, the main class to be used in this area is the System.Text.RegularExpressions.Regex class. One of the most powerful features of this class is the ability to specify, within the search pattern, groups that will allow parsing and retrieval of parts of the text easily. For example:

Given the date 17/08/1975, and this regular expression: (?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})), you can write code to retrieve any part of the text in the date by name, like so:

const string pattern = 
 @"(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))";
string GivenDate = @"17/08/1975";
Match match = Regex.Match(GivenDate,pattern);
if(match.Success)
{
   Console.WriteLine(string.Format("Day:{0},Month:{1},Year:{2}",
                              match.Groups["day"].Value,
                              match.Groups["month"].Value,
                              match.Groups["year"].Value));
}

The result is: Day:17,Month:08,Year:1975.

Note If you don't understand the code above, you should refer to the two articles mentioned at the beginning of this article.

Transforming to XML

Once you have a bunch of data, like the date example above, you can easily transform it into XML. The date from the previous example could be represented in XML like so:

<Date>
   <day>17</day>
   <month>08</month>
   <year>1975</year>
</Date>

Outputting this kind of XML used to be a pretty easy task, yet also a pretty error-prone one. Sure, you could just slap string upon string of XML content into a memory buffer, but the number of errors you could run into was horrible. The XmlTextWriter class in the System.Xml namespace rids you of a lot of details here, and very conveniently abstracts away all the boilerplate code you need to write, allowing you to concentrate on the content you wish to write in your XML document.

To show just how easy it is to use this class, here's a class that takes in the Match object from the last example, writes an XML document with this data into a memory buffer, and returns this XML output.

public class XMLUtil
{
   public static string ToXML(Match regexMatch)
   {
      StringBuilder output = new StringBuilder();
      //we write the XML into an in-memory string buffer
      XmlTextWriter writer = 
         new XmlTextWriter(new StringWriter(output));
      //make the XML human-readable
      writer.Formatting=Formatting.Indented; 
      //write the start of a standard XML document
      writer.WriteStartDocument();
      //create the opening node for our date element
      writer.WriteStartElement("Date");
      //write out each date element value as a separate node
      writer.WriteElementString("day", 
         regexMatch.Groups["day"].Value );
      writer.WriteElementString("month", 
         regexMatch.Groups["month"].Value );
      writer.WriteElementString("year", 
         regexMatch.Groups["year"].Value );
      
      //finish the document; WriteEndDocument closes any
      //open elements automatically, so an explicit
      //writer.WriteEndElement() is not needed here
      writer.WriteEndDocument(); 
      writer.Close();
      return output.ToString();   
   }
}

Here's the output:

<?xml version="1.0" encoding="utf-16"?>
<Date>
  <day>17</day>
  <month>08</month>
  <year>1975</year>
</Date>

As you can see, it's a very easy job to write XML using this class.

First you create an in-memory StringBuilder that will house the created XML. Then hand it off to the constructor of a StringWriter, which is used to construct your XmlTextWriter object. By the way, you could just as easily pass any System.IO.TextWriter-derived object (a StreamWriter, for example), which gives you the flexibility of writing to pretty much anything you want. You can then call the WriteStartDocument method, which creates the <?xml version="1.0" ...?> declaration at the beginning of the XML text.

Now open a new element tag, which will contain sub-elements, using the WriteStartElement method. Proceed to write the actual values as child nodes of the open element using the WriteElementString method, passing in the name of the node and the value inside it. To finish it all off, call the WriteEndDocument method, which closes all open elements in the XML. If you want to close just the current "Date" element, call the WriteEndElement method instead, and continue writing more elements.
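To make the difference between WriteEndElement and WriteEndDocument concrete, here's a minimal sketch that writes several dates into one document. The DateListWriter class and the "Dates" root element are names I made up for this illustration; they are not part of the article's download.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class DateListWriter
{
    // Hypothetical helper: writes each "day/month/year" string as its
    // own <Date> element under an assumed <Dates> root element.
    public static string ToXml(string[] dates)
    {
        StringBuilder output = new StringBuilder();
        XmlTextWriter writer = new XmlTextWriter(new StringWriter(output));
        writer.WriteStartElement("Dates");
        foreach (string date in dates)
        {
            string[] parts = date.Split('/');
            writer.WriteStartElement("Date");
            writer.WriteElementString("day", parts[0]);
            writer.WriteElementString("month", parts[1]);
            writer.WriteElementString("year", parts[2]);
            // Closes only the current <Date>, so the loop can
            // go on to open the next one.
            writer.WriteEndElement();
        }
        // Closes the <Dates> root.
        writer.WriteEndElement();
        writer.Close();
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToXml(new string[] { "17/08/1975", "25/05/2002" }));
    }
}
```

The sketch skips WriteStartDocument and indentation to keep the output short, so what you get back is a bare XML fragment on one line.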

Note If you need to use strings containing characters that are illegal in XML as the names of elements, see the XmlConvert class's EncodeName and DecodeName methods. XmlConvert is a great helper, and you'll need it in many situations.
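As a quick illustration of that note, the sketch below encodes a name containing a space, which is not legal in an element name, and then decodes it again. The name "Record Name" is just a sample value I picked.

```csharp
using System;
using System.Xml;

public static class NameEncodingDemo
{
    public static void Main()
    {
        // A space is not allowed in an XML element name, so EncodeName
        // replaces it with an _xHHHH_ escape sequence.
        string encoded = XmlConvert.EncodeName("Record Name");
        Console.WriteLine(encoded);   // Record_x0020_Name

        // DecodeName reverses the escaping.
        Console.WriteLine(XmlConvert.DecodeName(encoded));   // Record Name
    }
}
```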

A More Generic Approach

Actually, if you want to, you can make the writing function much more generic by automatically going through all the groups of a given match and writing their names and values as XML. The following bit of code shows how to do this:

//write out all the groups of a match to XML
Regex reg = new Regex(pattern);
Match regexMatch = reg.Match(inputString);
if(regexMatch.Success)
{
   //group 0 is the entire match, so start from group 1
   for (int i=1;i<regexMatch.Groups.Count;i++)
   {
      writer.WriteElementString(reg.GroupNameFromNumber(i),
         regexMatch.Groups[i].Value);
   }
}

In order to achieve this, you need to have an instance of the Regex class to play with. You have to use this same instance to receive the Match object. Then you can use the Regex instance to retrieve the name of a group based on its number:

reg.GroupNameFromNumber(i)

This means that for this functionality to work, you can't use the static Match method of the Regex class. This makes things a bit more cumbersome for us. That's why, for the remainder of the code samples, I'll use the earlier version of the code, although it's less generic. You can then implement this method in your programs if you wish to.
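If you do want to try the generic version, here's one way it might be packaged as a self-contained helper. This is only a sketch: the GenericMatchWriter class, its ToXml method, and the elementName parameter are hypothetical names of my own, and it assumes every group in the pattern is a named group.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class GenericMatchWriter
{
    // Hypothetical helper: writes every capture group of the first match
    // as a child element of <elementName>. Returns an empty string when
    // the input does not match the pattern.
    public static string ToXml(Regex reg, string input, string elementName)
    {
        Match regexMatch = reg.Match(input);
        if (!regexMatch.Success)
        {
            return string.Empty;
        }

        StringBuilder output = new StringBuilder();
        XmlTextWriter writer = new XmlTextWriter(new StringWriter(output));
        writer.WriteStartElement(elementName);
        // Group 0 is the entire match, so start at group 1.
        for (int i = 1; i < regexMatch.Groups.Count; i++)
        {
            writer.WriteElementString(reg.GroupNameFromNumber(i),
                regexMatch.Groups[i].Value);
        }
        writer.WriteEndElement();
        writer.Close();
        return output.ToString();
    }

    public static void Main()
    {
        Regex reg = new Regex(
            @"(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))");
        Console.WriteLine(ToXml(reg, "17/08/1975", "Date"));
    }
}
```

One caveat with this approach: unnamed groups get names like "1" and "2", which are not valid XML element names, so you would probably want to pass the group name through XmlConvert.EncodeName if your patterns contain them.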

Putting It All Together

Okay, so we know how to parse, and we know how to output to XML. Let's try to wrap this up using a class that takes in a single log file and transforms it into an XML file. This class should receive the name of the log file to read, parse it line by line, and generate a [logFileName].xml file:

public class LogConverter
{
public static void ConvertLogFile(string FileName)   
{
string Pattern = 
@"(?<date>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})))\t(?<time>(?<hour>\d{2}):(?<minutes>\d{2}))\t(?<action>.*)\t(?<record>.*)\t(?<user>\w*)";
   string line=string.Empty;
      
   //open the Log file for reading
   TextReader reader = new StreamReader(  File.OpenRead(FileName));

   //Create an XML textWriter object instance that
   //will write to in-memory String Buffer named 'output'
   StringBuilder output = new StringBuilder();
   XmlTextWriter xmlFile = new XmlTextWriter(new StringWriter(output));

   //initialize the xml writer
   xmlFile.Formatting= Formatting.Indented;
   xmlFile.WriteStartDocument();
   xmlFile.WriteStartElement("Entries");
         
      
   //read each line in the file
   while((line=reader.ReadLine())!=null)
   {
      //try to macth the line using regular expressions
      Match parsed = Regex.Match(line, Pattern);
      if (parsed.Success)
      {
         //If we get a regex Match, we pass
         //the XML writer off to a method that will
         //use the Match groups to generate XML data
         //inside our XML document
         WriteAsXML(parsed,xmlFile);
      }
   }
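   //done reading the log file
   reader.Close();
   //finish off any open elements 
   xmlFile.WriteEndDocument();
   xmlFile.Close();
         
   //write the xml log to a file; the in-memory writer declares
   //encoding="utf-16", so save the file as UTF-16 to match
   StreamWriter fs = new StreamWriter(FileName + ".xml", false, Encoding.Unicode);
   fs.Write(output.ToString());
   fs.Close();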
}

private static void WriteAsXML(Match regexMatch,XmlTextWriter writer)
{
   //open a new 'Entry' element
   writer.WriteStartElement("Entry");

   //write out each date element value as a separate node
         
   //date: Full format, and separated to day,month,year
   writer.WriteElementString("date", regexMatch.Groups["date"].Value );
   writer.WriteElementString("day", regexMatch.Groups["day"].Value );
   writer.WriteElementString("month", regexMatch.Groups["month"].Value );
   writer.WriteElementString("year", regexMatch.Groups["year"].Value );
         
   //Time: Full format, hours, and minutes
   writer.WriteElementString("time", regexMatch.Groups["time"].Value );
   writer.WriteElementString("hour", regexMatch.Groups["hour"].Value );
   writer.WriteElementString("minutes", regexMatch.Groups["minutes"].Value );
      
   //record ,actions and users
   writer.WriteElementString("action", regexMatch.Groups["action"].Value );
   writer.WriteElementString("record", regexMatch.Groups["record"].Value );
   writer.WriteElementString("user", regexMatch.Groups["user"].Value );

   writer.WriteEndElement();
}
      
}

This class is pretty straightforward. Here's what's taking place:

The class receives a file name to parse. It creates an in-memory XML text writer object and initializes it to the proper settings. It then opens an Entries element, into which all the child Entry elements (one for each line) will be written. It then goes through each line in the log file, and uses the Regex.Match method on that line, using a pattern that matches each of the columns I identified at the beginning of this article.

If the match was a success, it passes both the XML writer instance and the Match object to a separate method, which writes the group names and values into the XML writer instance. After going through all the lines in the log file, it closes all the elements in the XML document and writes it all to a file named the same as the original log file with the addition of ".xml" at the end.

If you now open the generated XML log file, you will see the following:

<?xml version="1.0" encoding="utf-16"?>
<Entries>
  <Entry>
    <date>25/05/2002</date>
.
.
.
.
</Entries>
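One thing worth noticing is that ConvertLogFile silently skips any line the pattern doesn't match. Before converting a big batch of files, it can help to check a single line by hand. The sketch below does that with a pattern based on the one in ConvertLogFile. The LogLineCheck class and its Describe method are hypothetical names I made up for this check.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineCheck
{
    // Based on the pattern used in ConvertLogFile: the five columns
    // must be separated by real tab characters.
    public const string Pattern =
        @"(?<date>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})))\t" +
        @"(?<time>(?<hour>\d{2}):(?<minutes>\d{2}))\t" +
        @"(?<action>.*)\t(?<record>.*)\t(?<user>\w*)";

    // Returns the captured columns joined by '|', or "no match".
    public static string Describe(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success)
        {
            return "no match";
        }
        return string.Join("|", new string[] {
            m.Groups["date"].Value,
            m.Groups["time"].Value,
            m.Groups["action"].Value,
            m.Groups["record"].Value,
            m.Groups["user"].Value });
    }

    public static void Main()
    {
        // Tab-delimited line: every column is captured.
        Console.WriteLine(Describe("25/05/2002\t21:49\tSearch\tDozer\tAnita1"));
        // Same line delimited by spaces: the pattern does not match,
        // so ConvertLogFile would skip it.
        Console.WriteLine(Describe("25/05/2002   21:49   Search   Dozer   Anita1"));
    }
}
```

If your own log files turn out to use spaces or a mix of delimiters, the \t parts of the pattern are the place to adjust.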

Letting the User Search for Data

Now that you have your data stored as structured XML, you can use it to let the user easily search through it. To do this, you can use a very easy technique already provided by the .NET Framework: use a DataSet object to load your XML data, then "select" data from the DataSet using a filter that can be specified by the user. You can then display the resulting DataRows to the user.

The DataSet object has a ReadXml method, which allows you to pass it a file name and have it automatically load the data into a table structure inside the DataSet. For our purposes, we can send in the file name without any additional parameters. What will be generated inside the DataSet's memory will be a table that contains a collection of DataRows, each one holding a set of columns that corresponds to the elements you wrote for each entry in the XML file: date, time, hour, action, and so on. Once you have this table in place, you can use the DataTable's Select method to retrieve any number of DataRow objects that match the filter you provide.

Here's the code to do this:

private void LoadXMLFile()
{
   //load the XML file into the dataset
   m_ds.ReadXml(txtFileName.Text);
   
   //Show all log entry rows at first load
   //by passing in a 'true' filter
   //this is just like specifying 
   //SELECT * FROM ENTRIES WHERE true
   RefreshResults("true");
}

private void RefreshResults(string filter)
{
   try
   {
      //Clear the result list view
      lvResults.Items.Clear();
      //get the first datatable inside the dataset
      //we know this one contains the data we need
      DataTable table = m_ds.Tables[0];
      //Get the datarows that match the user's filter
      //the filter can be any valid SQL filter
      DataRow[] rows = table.Select(filter);
      foreach(DataRow row in rows)
      {
         //Add an item to the list view
         ListViewItem item = new ListViewItem(row["date"].ToString());
         item.SubItems.Add(row["time"].ToString());
         item.SubItems.Add(row["record"].ToString());
         item.SubItems.Add(row["action"].ToString());
         item.SubItems.Add(row["user"].ToString());
      
         lvResults.Items.Add(item);
      }
   }
   catch(Exception e)
   {
      //the user might pass invalid filter expressions, 
      //in which case we get an exception notifying
      //the filter parsing error in question
      MessageBox.Show(e.Message);
   }
}

Using this straightforward code, you can let the user load any XML file, and filter its contents based on a SQL-style filter expression. Most simple conditions that you would write in the WHERE clause of a SQL query can be specified here.
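To give a feel for what such filters look like, here's a small console sketch that loads a hand-written sample (in the same shape as the generated file) from a string and counts the rows each filter returns. The FilterDemo class, the CountMatches method, and the sample data are assumptions for illustration only. The filters follow the DataColumn expression syntax, which looks like SQL but is not a full SQL implementation.

```csharp
using System;
using System.Data;
using System.IO;

public static class FilterDemo
{
    // Hand-written sample in the same shape as the generated XML log.
    private const string SampleXml =
        "<Entries>" +
        "<Entry><date>25/05/2002</date><time>21:49</time>" +
        "<action>Search</action><record>Dozer</record><user>Anita1</user></Entry>" +
        "<Entry><date>25/05/2002</date><time>21:51</time>" +
        "<action>Update</action><record>Dozer</record><user>Anita1</user></Entry>" +
        "<Entry><date>26/05/2002</date><time>11:02</time>" +
        "<action>Search</action><record>Manda</record><user>Gerry2k</user></Entry>" +
        "</Entries>";

    // Hypothetical helper: loads the sample and returns how many
    // rows match the given filter expression.
    public static int CountMatches(string filter)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(SampleXml));
        return ds.Tables[0].Select(filter).Length;
    }

    public static void Main()
    {
        string[] filters = new string[] {
            "user = 'Anita1'",
            "action = 'Search'",
            "action = 'Update' AND record LIKE 'D%'"
        };
        foreach (string filter in filters)
        {
            Console.WriteLine("{0} -> {1} row(s)", filter, CountMatches(filter));
        }
    }
}
```

Keep in mind that every column loaded this way is a string, so a filter such as date > '01/06/2002' compares text, not dates.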

You receive an array of DataRows, and since you know beforehand the names of the columns for each DataRow (same as the XML elements in your log file), you can just display the values for each column.

You could just as easily have looped through all the available columns and displayed each one's value to the user, without even knowing what kind of data was inside your DataRow. You could dynamically add columns to your ListView corresponding to the name of each DataColumn in the DataRow, and voila—you have yourself a more generic searching mechanism for practically any simple XML file.
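Here's roughly what that generic loop could look like. To keep the sketch self-contained, it builds plain text instead of filling a ListView, and it creates a tiny table in code rather than loading a file. GenericRowPrinter and FormatRows are hypothetical names of my own.

```csharp
using System;
using System.Data;
using System.Text;

public static class GenericRowPrinter
{
    // Hypothetical helper: returns a header line followed by one line
    // per matching row, without knowing the column names in advance.
    public static string FormatRows(DataTable table, string filter)
    {
        StringBuilder text = new StringBuilder();

        // Header: one entry per column, whatever the columns are.
        for (int c = 0; c < table.Columns.Count; c++)
        {
            if (c > 0) text.Append('\t');
            text.Append(table.Columns[c].ColumnName);
        }
        text.Append('\n');

        // Body: every column value of every row that passes the filter.
        foreach (DataRow row in table.Select(filter))
        {
            for (int c = 0; c < table.Columns.Count; c++)
            {
                if (c > 0) text.Append('\t');
                text.Append(row[c].ToString());
            }
            text.Append('\n');
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Stand-in for a table loaded with DataSet.ReadXml.
        DataTable table = new DataTable("Entry");
        table.Columns.Add("record");
        table.Columns.Add("user");
        table.Rows.Add(new object[] { "Dozer", "Anita1" });
        table.Rows.Add(new object[] { "Manda", "Gerry2k" });

        Console.Write(FormatRows(table, "user = 'Anita1'"));
    }
}
```

In a Windows Forms version, the header loop would add a column to the ListView for each DataColumn, and the body loop would add one sub-item per column.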

Summary

Parsing log files is easy. Writing XML files is easy. Searching XML files is easy. Therefore, generating XML log files and searching them should be pretty darn easy!

About the Author

Roy Osherove has spent the past 5+ years developing data-driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics (most of which can be found on his weblog), and loves discovering new things every day. Roy is also the author of the Feedable service and of the free regular expression tool, The Regulator.

© Microsoft Corporation. All rights reserved.