Integrating On-premises Search Engines with Windows Azure Web Applications
Author: Tejaswi Redkar, Solution Architect, Microsoft Consulting Services, Worldwide Services Community Lead – Windows Azure
Last updated: September 28, 2011
In a recent project, my customer wanted to move their entire suite of public-facing web applications to Windows Azure. All the web applications were built in ASP.NET and were running in the datacenter of a Fortune 50 company that was charging them a fortune. The Windows Azure platform fit right into the customer’s cost-saving and long-term pay-by-demand strategy.
There was a caveat, though. The customer was using Google Search Appliance (GSA) as the search service for their web sites, but they also wanted everything in the cloud, or at least out of their datacenter. GSA is not available in the cloud. The customer was willing to migrate to any Microsoft-supported search service, provided it was available in the cloud. Surprisingly, we don’t have an enterprise-class search service in the cloud from any of the leading search vendors or partners. There are some open-source alternatives like Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory), but they would require significant rework to replace a commercial-grade on-premises search engine.
As an Architect, I was assigned the task of finding a solution that would fit this requirement. I had never used the GSA, but I knew SharePoint Search and FAST Search fairly well, so I understood how search engines work in general. Instead of going into the details of the GSA, let’s look at those general concepts. The figure below illustrates the high-level workings of a search engine.
There are 4 important concepts in a search engine:
Content Sources: These are the locations that the search engine crawler visits to gather information
Crawler: The crawler gathers the content from the content sources and stores it in the index
Index: The index is the heart of the search engine. It is a complex system of its own that includes ranking of the content based on usage, user models, geographic models, etc. (let’s keep it simple here)
Query: Query servers are the front end of a search engine that deliver the results of a search query
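The four concepts above can be sketched as a toy search engine. This is only an illustration of the general model, not how the GSA or any commercial engine is implemented; the sample pages and the function names (crawl, build_index, query) are hypothetical, and ranking is left out entirely.

```python
from collections import defaultdict

# Content sources: hypothetical URL -> text pairs standing in for web pages
# or database rows that the crawler is allowed to visit.
CONTENT_SOURCES = {
    "/products": "Cloud hosting products and pricing",
    "/support": "Support for hosting and search questions",
    "/about": "About our company",
}


def crawl(sources):
    """Crawler: visit every content source and yield (url, text) pairs."""
    for url, text in sources.items():
        yield url, text


def build_index(documents):
    """Index: map each lowercase term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents:
        for term in text.lower().split():
            index[term].add(url)
    return index


def query(index, search_text):
    """Query: return the URLs that contain every term in the search text."""
    terms = search_text.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


index = build_index(crawl(CONTENT_SOURCES))
print(query(index, "hosting"))
```

A real index also stores ranking signals, which this sketch ignores; here a query simply returns every page that contains all of the search terms.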
In our scenario, the web front ends will run in Windows Azure and call the GSA query service. But we still need to find a home for the GSA: the customer does not want it onsite, and Windows Azure does not support hosting search appliances.
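To make the front-end call concrete, here is a minimal sketch of how a web front end might build a request URL for a remotely hosted query service. The host name gsa.example.com is hypothetical, and the parameter names (q, site, client, output, num) are assumptions based on the GSA's HTTP search protocol that should be verified against the appliance's documentation. The sketch is in Python for brevity; the project's front ends were ASP.NET.

```python
from urllib.parse import urlencode


def build_search_url(host, terms, collection="default_collection",
                     frontend="default_frontend", max_results=10):
    """Build the query URL a web front end would request from the appliance.

    All parameter names below are assumptions to check against the
    appliance's search protocol documentation.
    """
    params = {
        "q": terms,              # the user's search terms
        "site": collection,      # which collection (index subset) to search
        "client": frontend,      # which front-end configuration to apply
        "output": "xml_no_dtd",  # request raw XML results
        "num": max_results,      # maximum number of results to return
    }
    return f"https://{host}/search?{urlencode(params)}"


# gsa.example.com is a placeholder for the hosted appliance's address.
print(build_search_url("gsa.example.com", "windows azure"))
```

Keeping the URL construction in one function means the appliance's address can change, for example when it moves to a hosting provider, without touching the rest of the web application.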
Because Microsoft is a partner-centric company, the first thing that came to my mind was to find a Microsoft partner that could host the GSA. One of my colleagues found a Microsoft partner, Hostway, that could host the appliance, and I found a GSA partner, InTechnic. Both offered competitive pricing for hosting. Hostway was attractive because its hosting facilities were in the same town as one of the Windows Azure datacenters, which we assumed would reduce network latency between the GSA and the web applications. I thought we had our solution figured out for presenting to the customer.
The GSA was also used for crawling a SQL Server database that hosted the web site’s data. In the new solution, the SQL Server database would be moved to SQL Azure, and the GSA would then crawl the SQL Azure database directly. Because the GSA runs Java on Linux, we found during initial testing that it had trouble crawling SQL Azure because of a known JDBC time-to-live issue.
Because GSA is an appliance product, we didn’t have access to the operating system to fix the issue.
To overcome this, I had to go back to the hosting provider and ask for their SQL Server hosting prices. The architecture that was now taking shape had the ASP.NET web sites and the database hosted on the Windows Azure platform, and the GSA hosted alongside a virtualized instance of SQL Server at another hosting provider. The virtualized SQL Server instance would be used for crawling purposes only. The next task was to make sure that the SQL Server and SQL Azure databases stayed in sync. We recommended that the customer implement one of the following for data synchronization:
SQL Server Integration Services (running with the virtualized SQL Server)
SQL Azure Data Sync client application, to synchronize the SQL Server database with the SQL Azure database
We ran a two-week proof-of-concept and proved that the system worked end-to-end.
The next concern was the security of the connection between web front-ends and the GSA. We had the following options:
Keep the appliance in the DMZ with datacenter-level IP filtering on the search appliance
Add a Windows Azure AppFabric Service Bus wrapper
Windows Azure Connect
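As an illustration of the first option, IP filtering boils down to an allowlist check on the caller's source address. In practice this would be enforced by the datacenter's firewall rather than in application code; the sketch below only shows the logic, and the address range is a hypothetical placeholder taken from the documentation-reserved 203.0.113.0/24 block, not a real Windows Azure range.

```python
from ipaddress import ip_address, ip_network

# Hypothetical allowlist: the address range the web front ends call from.
# 203.0.113.0/24 is reserved for documentation and is used as a placeholder.
ALLOWED_NETWORKS = [ip_network("203.0.113.0/24")]


def is_allowed(source_ip):
    """Return True if the caller's address falls inside an allowed network."""
    address = ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("203.0.113.25"))  # inside the allowlisted range
print(is_allowed("198.51.100.7"))  # outside the allowlisted range
```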
In the POC, I tried the Service Bus wrapper. Technically it returned results, but I had to build an additional caching layer in the compute nodes to reduce the search latency. The caching options under consideration were memcached and Windows Azure Caching.
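The idea behind such a caching layer can be sketched as a small time-to-live cache in front of the remote search call. This is an in-process illustration under assumed names (SearchCache, slow_search), not the POC's code and not the memcached or Windows Azure Caching API; a distributed cache would replace the dictionary used here.

```python
import time


class SearchCache:
    """Cache search results for a fixed time so repeated queries skip the remote call."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch      # function that performs the remote search
        self._ttl = ttl_seconds  # how long a cached result stays valid
        self._clock = clock      # injectable clock, which makes expiry testable
        self._entries = {}       # normalized query -> (expiry time, results)

    def search(self, terms):
        # Normalize case and whitespace so equivalent queries share one entry.
        key = " ".join(terms.lower().split())
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: no remote call
        results = self._fetch(key)  # cache miss or expired: call the backend
        self._entries[key] = (now + self._ttl, results)
        return results


# Hypothetical stand-in for the call that goes out to the search appliance.
calls = []


def slow_search(terms):
    calls.append(terms)
    return [f"result for {terms}"]


cache = SearchCache(slow_search, ttl_seconds=60)
cache.search("Windows  Azure")
cache.search("windows azure")
print(len(calls))
```

Both queries normalize to the same key, so the backend is called only once. The trade-off is freshness: results can be up to ttl_seconds old, which is usually acceptable for site search over content that is re-crawled on a schedule anyway.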
The new architecture saves the customer 50% or more in operating costs, even after accounting for all the data transfer and search appliance hosting costs.
The final recommended solution looked like the figure below.
My next design session and POC will use FAST Search with the Windows Azure platform in a very similar setup. I will blog about that experience once I finish the POC. Stay tuned.
Port Bridge by Clemens Vasters: http://vasters.com/clemensv/2009/11/18/Port+Bridge.aspx