February 2015

Volume 30 Number 2


.NET Framework - Building an Enterprise Search for .NET

By Damian Zapart

The emergence of cloud computing in recent years has been a boon for organizations and users alike. Organizations can know their customers like never before, and target them with personalized communications. Users can get to their data from almost anywhere, making it far more accessible and useful. Huge datacenters have been built all around the world to store all of that data. But Big Data leads to big challenges.

The well-known quote by John Naisbitt, “We are drowning in data but starving for information,” in his book, “Megatrends: Ten New Directions Transforming Our Lives” (Warner Books, 1982), perfectly describes the current situation in the Big Data market. Companies are able to store petabytes of data, but being able to make sense of that data, and making it searchable, is far more difficult, especially because most data warehouses store data in a non-structured way (NoSQL) across multiple collections inside particular Big Data stores or even in a distributed form between different warehouses. Moreover, there are a variety of data formats, such as JSON documents, Microsoft Office files and so forth. Searching through a single unstructured collection typically isn’t a problem, but it’s much harder to search all unstructured data across multiple collections to find just a particular small subset of results when the user has no idea where it might be. This is where enterprise search comes into play.

Here’s the essential challenge of enterprise search: How can a large organization with a lot of data sources provide internal and external users with the ability to search all public company data sources through one interface? That single interface might be an API, a company Web site, or even a simple textbox with autocomplete functionality implemented under the hood. No matter which interface a company chooses, it must provide the ability to search through its entire data universe, which might include structured and unstructured databases, intranet documents in different formats, other APIs and other kinds of data sources.

Because searching through multiple datasets is quite complex, there are only a few recognized enterprise search solutions—and the bar is high. An enterprise search solution must include the following features:

  • Content awareness: Know where particular types of data can be located.
  • Real-time indexing: Keep all data indexed.
  • Content processing: Make different data sources accessible.

One of the most popular enterprise search solutions is the open source Elasticsearch (elasticsearch.org). This Java-based server built on top of Apache Lucene (lucene.apache.org) provides scalable, full-text search over multiple data sources, with JSON support and a REST Web interface, as well as high availability, conflict management and real-time analytics. Visit bit.ly/1vzoUrR to see its full feature set.

From a high level, the way Elasticsearch stores data is very simple. The topmost element of the structure within a server is called an index, and multiple indices can exist in the same data store. The index itself is just a container for documents (one or many), and each document is a collection of one or more fields (with no structures defined). Each index can contain data aggregated into units called types, which represent logical groups of data within a particular index.

It might be useful to think of an Elasticsearch index as similar to a table in a relational database. The same correlation exists between a table’s rows and columns and an index’s documents and fields: a document corresponds to a row and a field to a column. However, with Elasticsearch, there’s no fixed data structure or database schema.
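
For example, a single document in a hypothetical clients index might look like this (the field names are purely illustrative):

{
  "id": 1,
  "name": "John",
  "surname": "Doe",
  "email": "john.doe@contoso.com"
}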

As I noted, developers can communicate with the Elasticsearch server via a REST Web interface. This means they can query indices, types, data or other system information just by sending REST Web requests from a browser or any other type of Web client. Here are some examples of GET requests:

  • Query for all indices:
           https://localhost:9200/_cat/indices/?v
  • Query for index metadata:
           https://localhost:9200/clients/_stats
  • Query for all index data: 
           https://localhost:9200/clients/_search?q=*:*
  • Search for a specific field value within the index: 
           https://localhost:9200/clients/_search?q=field:value
  • Query for all data within a mapping type of the index:
           https://localhost:9200/clients/orders/_search?q=*:*
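
Because the interface is plain HTTP, the same queries are easy to issue from .NET code. Here’s a minimal sketch using HttpClient; the index name, field and value are placeholders, and the server address assumes the default local setup described later:

public static async Task<string> SearchAsync(string field, string value)
{
  using (var http = new HttpClient())
  {
    // Search the clients index for documents where field matches value.
    var url = "https://localhost:9200/clients/_search?q=" +
      Uri.EscapeDataString(field + ":" + value);
    var response = await http.GetAsync(url);
    response.EnsureSuccessStatusCode();
    // The body is a JSON envelope whose hits array contains the results.
    return await response.Content.ReadAsStringAsync();
  }
}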

To demonstrate how to create a simple, multi-source solution, I’m going to use Elasticsearch 1.3.4 with JSON documents, PDF documents and a SQL Server database. To start, I’ll briefly describe Elasticsearch setup and then demonstrate how to plug in each data source to make the data searchable. To keep things simple, I’ll present a close-to-real-life example that uses data sources from the well-known Contoso company.

I’ll use a SQL Server 2014 database with multiple tables in it, though I’ll only be using one, dbo.Orders. As the table name suggests, it stores records about the company’s client orders—a huge number of records, yet easy to manage:

CREATE TABLE [dbo].[Orders]
(
  [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
  [Date] [datetime] NOT NULL,
  [ProductName] [nvarchar](100) NOT NULL,
  [Amount] [int] NOT NULL,
  [UnitPrice] [money] NOT NULL
);

I also have a network share with multiple company documents organized in a folder hierarchy. The documents are related to different product marketing campaigns the company organized in the past and are stored in multiple formats, including PDF and Microsoft Office Word. The average document size is approximately 1MB.

Finally, I have an internal company API that exposes the company’s client information in JSON format; because I know the structure of the response, I’m able to easily deserialize it to an object of type Client. My goal is to make all of these data sources searchable using the Elasticsearch engine. Moreover, I want to create a Web API 2-based Web service that, under the hood, will perform an enterprise query across all indices by making a single call to the Elasticsearch server. (Find more information on Web API 2 at bit.ly/1ae6uya.) The Web service will return the results as a list of suggestions with potential hints to the end user; such a list can later be consumed by an autocomplete control embedded in an ASP.NET MVC application, or by any other kind of Web site.
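
For illustration, deserializing such a response with JSON.NET might look like the following sketch (the endpoint URL is hypothetical; the Client class itself is shown later in the article):

// Hypothetical company API endpoint returning a JSON array of clients.
using (var webClient = new WebClient())
{
  string json = webClient.DownloadString("https://api.contoso.com/clients");
  List<Client> clients = JsonConvert.DeserializeObject<List<Client>>(json);
}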

Setting Up

The first thing I need to do is install the Elasticsearch server. For Windows, you can do this either automatically or manually, with the same result: a running Windows service hosting the Elasticsearch server. The automatic installation is very quick and easy; all you need to do is download and run the Elasticsearch MSI installer (bit.ly/12RkHDz). Unfortunately, it doesn’t let you choose a Java version or, more importantly, an Elasticsearch version. The manual installation process, in contrast, requires a little more effort, but it allows much more control over the components, so it’s more suitable in this case.

Setting up Elasticsearch as a Windows service manually requires the following steps:

  1. Download and install the most recent Java SE Runtime Environment (bit.ly/1m1oKlp).
  2. Add the environment variable called JAVA_HOME. Its value will be the folder path you’ve installed Java into (for example, C:\Program Files\Java\jre7), as shown in Figure 1.
  3. Download the Elasticsearch file (bit.ly/1upadla) and unzip it.
  4. Move unzipped sources to Program Files | Elasticsearch (optional).
  5. Run the command prompt as an administrator and execute service.bat with the install parameter:
           C:\Program Files\Elasticsearch\elasticsearch-1.3.4\bin>service.bat install

Figure 1 Setting the Java_Home Environment Variable

That’s all it takes to get the Windows service up and running, with the Elasticsearch server accessible on localhost on port 9200. Now I can make a Web request to https://localhost:9200/ via any Web browser and get a response that looks like this:

{
  "status" : 200,
  "name" : "Washout",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Now my local instance of Elasticsearch is ready to go; however, the raw version doesn’t give me the ability to connect to SQL Server or run a full-text search through data files. To make these features available, I must also install several plug-ins.

Extending Elasticsearch

As I mentioned, the raw version of Elasticsearch doesn’t let you index an external data source like SQL Server, Office or even a PDF. To make all these data sources searchable I need to install a few plug-ins, which is quite easy.

My first goal is to provide full-text search support for attachments. By attachment, I mean a base64-encoded representation of the source file that has been uploaded to an Elasticsearch data store as a JSON document. (See bit.ly/12RGmvg for information on the attachment type.) The plug-in I need for this purpose is the Mapper Attachments Type for Elasticsearch version 2.3.2, available at bit.ly/1Alj8sy. This is the Elasticsearch extension that enables full-text search for documents, and it’s based on the Apache Tika project (tika.apache.org), which detects and extracts metadata and text content from various types of documents and provides support for the file formats listed at bit.ly/1qEyVmr.

As with most plug-ins for Elasticsearch, this installation is extremely straightforward and all I need to do is run the command prompt as an Administrator and execute the following command:

bin>plugin --install elasticsearch/elasticsearch-mapper-attachments/2.3.2

After downloading and extracting the plug-in, I need to restart the Elasticsearch Windows service.

When that’s done, I need to configure SQL Server support. Of course, there’s also a plug-in for this. It’s called JDBC River (bit.ly/12CK8Zu) and allows fetching data from a JDBC source, such as SQL Server, for indexing into Elasticsearch. The plug-in is easily installed and configured, though the installation process has three stages that need to be completed.

  1. First, I install the Microsoft JDBC Driver 4.0, a Java-based data provider for SQL Server that can be downloaded from bit.ly/1maiM2j. The important thing to remember is I need to extract the content of the downloaded file into the folder called Microsoft JDBC Driver 4.0 for SQL Server (which needs to be created if it doesn’t exist), directly under the Program Files folder, so the resulting path looks like: C:\Program Files\Microsoft JDBC Driver 4.0 for SQL Server.
  2. Next, I install the plug-in using the following command:
           bin> plugin --install jdbc --url "https://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip"
  3. Finally, I copy the SQLJDBC4.jar file extracted in the first step (C:\Program Files\Microsoft JDBC DRIVER 4.0 for SQL Server\sqljdbc_4.0\enu\SQLJDBC4.jar) into the lib folder in the Elasticsearch directory (C:\Program Files\Elasticsearch\lib). When I’m done, I need to remember to restart the Windows service.

Now all of the required plug-ins have been installed. However, to verify the installations finished correctly, I need to send the following command as an HTTP GET request:

https://localhost:9200/_nodes/_all/plugins

In the response, I expect to see both installed plug-ins listed, as shown in Figure 2.

Figure 2 The Installed Plug-Ins

Setting Up SQL Server

To use JDBC River with SQL Server, the SQL Server instance needs to be accessible via TCP/IP, which is disabled by default. However, enabling it is simple: I open SQL Server Configuration Manager and, under SQL Server Network Configuration, for the SQL Server instance I want to connect to, change the status of the TCP/IP protocol to Enabled, as shown in Figure 3. After doing this, I should be able to log into my SQL Server instance via Management Studio by using localhost,1433 as the Server Name (port 1433 is the default port for accessing SQL Server via TCP/IP).

Figure 3 Enabling TCP/IP for the SQL Server Instance
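
A quick way to confirm the TCP/IP configuration works is to open a connection from .NET using the host,port syntax; this is just a sanity check, and the credentials are the same placeholder ones the river will use later:

// Requires System.Data.SqlClient; throws if TCP/IP access isn't configured.
using (var connection = new SqlConnection(
  "Data Source=localhost,1433;Initial Catalog=Contoso;User ID=elastic;Password=asd"))
{
  connection.Open();
}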

Making Sources Searchable

All required plug-ins have been installed, so now it’s time to load the data. As mentioned earlier, I have three different data sources (JSON documents, files and an order table from a SQL Server database) that I wish to have indexed on Elasticsearch. I can index these data sources in many different ways, but I want to demonstrate how easy it is using a .NET-based application I’ve implemented. Therefore, as a pre-condition for indexing, I need to install an external library called NEST (bit.ly/1vZjtCf) for my project, which is nothing but a managed wrapper around an Elasticsearch Web interface. Because this library is available on NuGet, making it part of my project is as simple as executing a single command in the Package Manager Console:

PM> Install-Package NEST

Now, with the NEST library available in my solution, I can create a new class library project called ElasticSearchRepository. I used this name because I decided to keep all function calls from the ElasticClient class (which is part of the NEST library) separate from the rest of the solution. By doing this, the project becomes similar to the repository design pattern widely applied in Entity Framework-based applications, so it should be easy to understand. Also, in this project I have just three classes: the BaseRepository class, which initializes an instance of ElasticClient and exposes it to the classes that inherit from it, and two other repository classes:

  • IndexRepository—a read/write class I’ll use to manipulate indices, set mappings and upload documents.
  • DiscoveryRepository—a read-only class I’ll use during API-based search operations.
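
To make that structure concrete, here’s a minimal sketch of the two derived classes; their method bodies are filled in over the rest of the article:

public class IndexRepository : BaseRepository
{
  public IndexRepository(Uri elastiSearchServerUrl = null)
    : base(elastiSearchServerUrl) { }
  // IndexData<T> and the index setup functions are shown later.
}

public class DiscoveryRepository : BaseRepository
{
  public DiscoveryRepository(Uri elastiSearchServerUrl = null)
    : base(elastiSearchServerUrl) { }
  // SearchAll and FuzzySearch are shown in Figure 6.
}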

Figure 4 shows the structure of the BaseRepository class with a protected property of the ElasticClient type. This type, provided as part of the NEST library, centralizes communication between the Elasticsearch server and the client application. To create an instance of it, I pass the URL of the Elasticsearch server as an optional constructor parameter. If the parameter is null, the default value of https://localhost:9200 will be used.

Figure 4 The BaseRepository Class

namespace ElasticSearchRepository
{
  using System;
  using Nest;
  public class BaseRepository
  {
    protected ElasticClient client;
    public BaseRepository(Uri elastiSearchServerUrl = null)
    {
      this.client = elastiSearchServerUrl != null ?
        new ElasticClient(new ConnectionSettings(elastiSearchServerUrl)) :
        new ElasticClient();
    }
  }
}

With the client ready, first I’ll index the Client data; this is the easiest scenario because no additional plug-ins are required:

public class Client
{
  public int Id { get; set; }
  public string Name { get; set; }
  public string Surname { get; set; }
  public string Email { get; set; }
}

To index this type of data, I call the Index&lt;T&gt; function on the instance of my Elasticsearch client, where T is my Client class, which represents serialized data returned from the API. This generic function takes three parameters: the instance of the T object class, a target index name and a mapping name within the index:

public bool IndexData<T>(T data, string indexName = null,
  string mappingType = null)
  where T : class, new()
{
  if (data == null)
  {
    throw new ArgumentNullException("data");
  }
  var result = this.client.Index<T>(data,
    c => c.Index(indexName).Type(mappingType));
  return result.IsValid;
}

The last two parameters are optional because NEST will apply its default logic for creating a target index name based on the generic type.
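
For example, indexing the client data pulled from the company API might look like this; GetClientsFromApi is a hypothetical helper that wraps the API call, and with the optional parameters omitted NEST picks the index and mapping names:

List<Client> clients = GetClientsFromApi(); // Hypothetical helper.
clients.ForEach(c => indexRepository.IndexData(c));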

Now I want to index the marketing documents, which are related to the company products. As I have these files stored in a network share, I can wrap information about each particular document into a simple MarketingDocument class. It’s worth noting here that if I want a document to be indexed in Elasticsearch, I need to upload it as a Base64-encoded string:

public class MarketingDocument
{
  public int Id { get; set; }
  public string Title { get; set; }
  public string ProductName { get; set; }
  // Base64-encoded file content.
  public string Document { get; set; }
}
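
Populating that field is straightforward. Here’s a minimal sketch, assuming the file lives on the network share (the path and metadata values are placeholders):

var fileBytes = File.ReadAllBytes(@"\\fileserver\marketing\campaign.pdf");
var marketingDocument = new MarketingDocument
{
  Id = 1,
  Title = "Campaign",
  ProductName = "Product",
  // Elasticsearch expects the attachment content as a Base64 string.
  Document = Convert.ToBase64String(fileBytes)
};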

The class is ready, so I can use the ElasticClient to mark a particular field in my MarketingDocument class as an attachment. I can achieve this by creating a new index called “products” and adding a new marketing mapping to it (for simplicity, the class name will be the mapping name):

private void CreateMarketingIndex()
{
  client.CreateIndex("products", c =>
    c.AddMapping<MarketingDocument>(m =>
      m.Properties(ps => ps.Attachment(a =>
        a.Name(o => o.Document)
         .TitleField(t => t
           .Name(x => x.Title)
           .TermVector(TermVectorOption.WithPositionsOffsets))))));
}

Now that I have both the .NET type and the mapping for the marketing data, as well as the definition for an attachment, I can start indexing my files in the same way I indexed the client data:

var documents = GetMarketingDocumentsMock();
documents.ForEach((document) =>
{
  indexRepository.IndexData<MarketingDocument>(document, "products");
});

The final step is the JDBC River setup on Elasticsearch. Unfortunately, NEST doesn’t support JDBC River yet. In theory, I can create a JDBC River mapping by using a Raw function to send a raw request with JSON, but I don’t want to overcomplicate things. Therefore, to complete the mapping creation process, I’m going to specify the following parameters:

  • A connection string to the SQL Server database
  • The SQL query that will be used to fetch the data
  • The update schedule
  • The target index name and mapping type (optional)

(You’ll find a full list of configurable parameters at bit.ly/12CK8Zu.)

To create a new JDBC River mapping, I need to send a PUT request with a request body specified in it to the following URL: 

https://localhost:9200/_river/{river_name}/_meta

The example in Figure 5 shows the request body I PUT to create a new JDBC River mapping that connects to the Contoso database hosted on the local SQL Server instance, accessible on port 1433 via TCP/IP.

Figure 5 HTTP PUT Request to Create a New JDBC River Mapping

PUT https://localhost:9200/_river/orders_river/_meta
{
  "type": "jdbc",
  "jdbc": {
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url": "jdbc:sqlserver://127.0.0.1:1433;databaseName=Contoso",
    "user": "elastic",
    "password": "asd",
    "sql": "SELECT * FROM dbo.Orders",
    "index": "clients",
    "type": "orders",
    "schedule": "0/30 0-59 0-23 ? * *"
  }
}

It uses the login “elastic” and the password “asd” to authenticate the user and execute the following SQL command:

SELECT * FROM dbo.Orders

Each row of data returned by this SQL query will be indexed under the clients index in the orders mapping type, and indexing will take place every 30 seconds (the schedule is expressed in Cron notation; see bit.ly/1hCcmnN for more information).
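
NEST can’t help with this step, but sending the PUT request from the console application is simple enough. Here’s a minimal sketch, assuming riverBody holds the JSON request body from Figure 5:

using (var http = new HttpClient())
{
  var content = new StringContent(riverBody, Encoding.UTF8, "application/json");
  var response = await http.PutAsync(
    "https://localhost:9200/_river/orders_river/_meta", content);
  response.EnsureSuccessStatusCode();
}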

When this process completes, you should see information similar to the following in the Elasticsearch log file (/logs/elasticsearch.log):

[2014-10-24 18:39:52,190][INFO][river.jdbc.RiverMetrics]
pipeline org.xbib.elasticsearch.plugin.jdbc.RiverPipeline@70f0a80d
complete: river jdbc/orders_river metrics: 34553 rows, 6.229481683638776 mean,
  (0.0 0.0 0.0), ingest metrics: elapsed 2 seconds,
  364432.0 bytes, 1438.0 bytes avg, 0.1 MB/s

If something is wrong with the river configuration, the error message will also be in the log.

Searching the Data

Once all the data has been indexed in the Elasticsearch engine, I can start querying. Of course, I can send simple requests to the Elasticsearch server to query one or multiple indices and mapping types at the same time, but I want to build something more useful and closer to a real-life scenario. So I’m going to split my project into three different components. The first component, which I already presented, is Elasticsearch, which is available via https://localhost:9200/. The second component is an API I’m going to build using Web API 2 technology. The last component is a console application I’ll use to set up my indices on Elasticsearch, as well as to feed it with data.

To create my new Web API 2 project, first I need to create an empty ASP.NET Web Application project and then, from the Package Manager Console, run the following install command:

Install-Package Microsoft.AspNet.WebApi

After the project is created, the next step is to add a new controller, which I’ll use to process query requests from the client and pass them to Elasticsearch. Adding a new controller named DiscoveryController involves nothing more than adding a new item, a Web API ApiController class (v2.1). Then I need to implement a Search function, which will be exposed via the URL https://website/api/discovery/search?searchTerm=user_input:

[RoutePrefix("api/discovery")]
public class DiscoveryController : ApiController
{
  [HttpGet]
  [ActionName("search")]
  [Route("search")]
  public IHttpActionResult Search(string searchTerm)
  {
    var discoveryRepository = new DiscoveryRepository();
    var result = discoveryRepository.SearchAll(searchTerm);
    return this.Ok(result);
  }
}

If the Web API 2 engine can’t serialize a response because of a self-referencing loop, you’ll have to add the following in the WebApiConfig.cs file, which is located in the App_Start folder:

GlobalConfiguration.Configuration
  .Formatters
  .JsonFormatter
  .SerializerSettings
  .ReferenceLoopHandling = ReferenceLoopHandling.Ignore;

As Figure 6 shows, in the body of the controller I created, I instantiated a class of the DiscoveryRepository type, which is just a wrapper around the ElasticClient type from the NEST library. Inside this non-generic, read-only repository I implemented two types of search functions, and both of them return a dynamic type. This part is important because, by doing this in both function bodies, I’m not limiting my queries to just one index; instead, I’m querying all indices and all types at the same time. This means my results can have different structures (they’ll be of different types). The only difference between the functions is the query method. In the first function I use the QueryString method (bit.ly/1mQEEg7), which is an exact match search, and in the second one, the Fuzzy method (bit.ly/1uCk7Ba), which performs a fuzzy search across indices.

Figure 6 Implementation of Two Types of Search

namespace ElasticSearchRepository
{
  using System;
  using System.Collections.Generic;
  using System.Linq;
  public class DiscoveryRepository : BaseRepository
  {
    public DiscoveryRepository(Uri elastiSearchServerUrl = null)
      : base(elastiSearchServerUrl)
    {
    }
    /// <summary>Exact search across all indices and types.</summary>
    public List<Tuple<string, string>> SearchAll(string queryTerm)
    {
      var queryResult = this.client.Search<dynamic>(d =>
        d.AllIndices()
        .AllTypes()
        .QueryString(queryTerm));
      return queryResult
        .Hits
        .Select(c => new Tuple<string, string>(
          c.Index, c.Source.Name.Value))
        .Distinct()
        .ToList();
    }
    /// <summary>Fuzzy search across all indices and types.</summary>
    public dynamic FuzzySearch(string queryTerm)
    {
      return this.client.Search<dynamic>(d =>
        d.AllIndices()
        .AllTypes()
        .Query(q => q.Fuzzy(f =>
          f.Value(queryTerm))));
    }
  }
}

Now that my API is ready, I can run it and start testing just by sending GET requests to https://website:port/api/discovery/search?searchTerm=user_input, passing user input as the value of the searchTerm query parameter. Figure 7 shows the results my API generates for the search term “scrum.” As highlighted in the screenshot, the search function performed a query over all indices in the data store and returned hits from multiple indices at the same time.

Figure 7 API Search Results for the Term “scrum”

By implementing the API layer, I’ve made it possible to build multiple clients (like a Web site or mobile app) that consume it, bringing enterprise search functionality to end users. You can find an example implementation of an autocomplete control for an ASP.NET MVC 4-based Web client on my blog at bit.ly/1yThHiZ.

Wrapping Up

Big Data has brought a lot to the technology marketplace, in terms of both opportunities and challenges. One of the challenges, which is also potentially a great opportunity, is the possibility of implementing fast search across petabytes of data without having to know the exact location of the data in the data universe. In this article I described how to implement enterprise search and demonstrated how to do it in the .NET Framework, using Elasticsearch and the NEST library.


Damian Zapart is a development lead at Citigroup Inc. and focuses mainly on enterprise solutions. At the same time he’s a programming geek interested in cutting-edge technologies, design patterns and Big Data. Visit his blog at bit.ly/1suoKN0 to read more about him.

Thanks to the following technical experts for reviewing this article: Evren Onem (D&B) and Bruno Terkaly (Microsoft)
Evren Onem is a principal software engineer at D&B. His job is to design and build massively scalable REST APIs. He’s also a nighttime researcher on cognitive radio ad hoc networks.

Bruno Terkaly is a principal software engineer at Microsoft with the objective of enabling development of industry-leading applications and services across devices. He’s responsible for driving the top cloud and mobile opportunities across the United States and beyond from a technology-enablement perspective. He helps partners bring their applications to market by providing architectural guidance and deep technical engagement during the ISV’s evaluation, development and deployment. He also works closely with the cloud and mobile engineering groups, providing feedback and influencing the roadmap.