Indexing Windows Azure Storage from an on-premises Fast ESP 5.3
Author: Nagendra Mishr, Senior Consultant, Microsoft FAST Services
Contributors: Tejaswi Redkar (Windows Azure Solution Architect) and Ann Skobodzinski (Microsoft Technology Center)
Last updated: December 9, 2011
Recently I participated in a two-week Windows Azure proof-of-concept (POC). The goal was to port the customer's production application to Windows Azure within that timeframe. We were able to port the customer's front-end application, the content database, the antivirus checker, and the local data store to Windows Azure within the two-week project timeline. The customer would have preferred that everything run in Windows Azure; however, two additional components, FAST Enterprise Search Platform (ESP) and the application authentication service, could not be ported to run in the cloud. The content in the local SQL Server database was stored in both SQL Azure and Windows Azure Storage, and was indexed by the FAST ESP system. The web front-end authenticated users against the on-premises datacenter, allowed them to upload files, contacts, timesheets, tasks, and so on, and provided each user with their own virtual data space. When the POC was complete, the authenticated user could use the web front-end search box to find their content.
This article describes the implementation of one piece of the system, indexing the content in Windows Azure Storage using FAST ESP 5.3.
FAST ESP 5.3 was used in the proof-of-concept, as that is the version the customer was already running in the production version of their application. The techniques presented in this field note can be used as a starting point for integrating FAST search with Windows Azure and apply to both FAST ESP 5.3 and FAST Search Server 2010 for SharePoint.
At a high-level, the integration between Windows Azure and FAST ESP involved three areas:
Fetching content from SQL Azure.
Deciding if we need to get more content from Windows Azure Storage.
Allowing the Windows Azure front-end to connect to the Query API.
We used the FAST JDBC connector to connect to SQL Azure. The connector is an add-on to FAST ESP that pulls content from a relational database and feeds it to FAST ESP using the FAST ESP content API. Inside the data processing pipeline, we added a new stage to integrate with Windows Azure Storage. Finally, the last part involved using the Windows Azure Service Bus so that queries from the front-end could be routed correctly to the local index. Figure 1 shows the architecture of the system. The sections below describe each area in more detail.
Figure 1: FAST ESP Architecture
Getting the relational data from SQL Azure into FAST ESP was the simplest part of the migration, primarily because SQL Azure presents a standard SQL interface to any on-premises application. After updating the version of the JDBC driver used by the connector, we edited the hostname, username, and password for the account in the connection script and we were done. There is a separate script for each collection, but it was straightforward to edit each one. The FAST JDBC connector read the scripts and was able to download the information from SQL Azure quickly. The one potential issue here is that the connection between the JDBC connector and the SQL Azure database should run over a secure protocol. This was highlighted in the POC and left as an exercise to be resolved at implementation time; for the POC, we used the standard port 1433 TCP connection.
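For reference, the connection settings for SQL Azure differ from an on-premises SQL Server mainly in the hostname and the user@server login format. Shown here schematically (the server name, database, and credentials are placeholders, and the exact script format is the connector's own):

```
url      = jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=ContentDB;encrypt=true
user     = espconnector@myserver
password = ********
```

Setting encrypt=true in the JDBC URL is one way to address the secure-protocol concern noted above, since the SQL Server JDBC driver then negotiates TLS on the same port 1433 connection.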
FAST ESP Document Pipeline
The customer database data consists of several content types, including documents, contacts, tasks, and calendar entries. Of these, the document content type is recorded in both SQL Server and Windows Azure Storage, while the document itself is stored only in Windows Azure Storage. The FAST document pipeline needed to be updated to pull the document from Windows Azure Storage based on the document's content type.
Windows Azure Storage provides access to the customer content through a RESTful API over HTTP(S). To fetch a document, we compose a URL for it and download it. The service does not support IP address restrictions on the content, so we could not lock the content down to allow access only from the datacenter. Instead, it allows the application administrator to attach one or more access policies to the document container. Because the URL contains the policy ID, access to all the files within a container or folder can be granted or revoked as needed; only people with the correct URL can access the content. When a new customer signs up, or if there is a security breach, a new policy can be generated.
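Composing the download URL is then just string assembly. A minimal sketch (the account, container, and blob names are hypothetical; the policy token is treated as an opaque query string generated by the Windows Azure application, which the on-premises side only appends, never constructs):

```python
def make_blob_url(account, container, blob_name, policy_token, use_https=True):
    """Compose the download URL for a document in Windows Azure Storage.

    policy_token is the shared-access query string tied to the container's
    access policy; rotating the policy invalidates every URL built with it.
    """
    scheme = "https" if use_https else "http"
    host = "%s.blob.core.windows.net" % account
    return "%s://%s/%s/%s?%s" % (scheme, host, container, blob_name, policy_token)

# Example (all values hypothetical):
url = make_blob_url("contosostore", "customer42", "timesheet.docx",
                    "si=client-policy-1&sig=...")
```

Because the policy travels in the query string, revoking access for a whole container is a matter of regenerating the policy on the Windows Azure side; no per-file work is needed.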
The downside of this access policy model, from the on-premises perspective, is that when you introduce a new policy in the Windows Azure application you need to coordinate it with the on-premises application so that it formulates the access URL correctly. Additionally, there is likely a race condition to account for if a policy is invalidated midway through the indexing process. Resolving these edge conditions was beyond the scope of our two-week POC.
Document processing in FAST ESP consists of processing the data with several mini applications called stages, with each responsible for its own data transformations. All stages in FAST ESP can be configured by the system administrator to change the order of execution along with the various parameters used by the stage as defined by its XML configuration.
In order to fetch the Windows Azure Storage documents, we created a new stage to pull the content from Windows Azure Storage. Figure 2 shows the configuration XML file and its rendered screen for our stage.
Figure 2: AzureFetch stage configuration XML and page
The administrator's ability to edit the account, hostname, and policy type on this screen came in handy when we migrated the POC code from the development environment into the POC demo environments.
In our POC, we used a single policy for all the clients, but in production the policy will be specific to each client, so we also allowed the policy to be overridden from the relational database. As mentioned previously, there were quite a few policy edge cases to consider that were beyond the scope of our two-week POC.
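The override rule itself is simple precedence logic; a sketch of the resolution order we used (function and field names are illustrative, not the stage's actual identifiers):

```python
def resolve_policy(admin_policy, doc_policy):
    """Pick the access policy for a document.

    A per-document override from the relational database wins over the
    administrator-configured default on the stage; if neither is set,
    None is returned and the fetch is skipped.
    """
    if doc_policy:
        return doc_policy
    return admin_policy or None
```

In production, where each client has its own policy, every document row would carry its client's policy and the administrator default would serve only as a fallback.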
The Custom Stage
The stage could have been written in .NET, but the general rule of thumb for FAST ESP customizations is to take working code and customize it based on user needs, so I modified an out-of-the-box stage as the starting point for my custom stage. The stage is a Python script that implements two functions: a configuration callback and Process(). Figure 3 shows the first of these, which loads the customization parameters into memory so they are available during document processing.
Figure 3: AzureFetch.py stage code (part 1)
The second section of the file contains the Process() method, which is called for each document that runs through the pipeline. In Process(), we first determine whether the document has a field called getPath. If it does not, we do nothing further in the stage and return ProcessorStatus.OK_NoChange, which instructs FAST ESP to continue with the next stage. Next, we check whether no policy is defined by the administrator, a policy has been configured, or an override policy exists in the relational database. In each case, we construct the appropriate URL to fetch content from Windows Azure Storage and pull the content into memory. This code is very compact, which is typical of FAST ESP document stages. Because this was a POC, we did minimal data integrity checking.
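Stripped of the ESP boilerplate, the control flow of Process() looks roughly like this. The sketch is self-contained rather than a copy of the stage: ProcessorStatus and the document object are mimicked with minimal stand-ins, and the fetch callable is injectable so it can run without a network; only getPath and OK_NoChange come from the text above, the other names are hypothetical:

```python
import urllib.request

class ProcessorStatus:
    """Stand-in for the status constants the real ESP processor API provides."""
    OK = 0
    OK_NoChange = 1

class AzureFetchSketch:
    def __init__(self, hostname, default_policy, fetch=None):
        # Values the administrator configures on the stage (see Figure 2);
        # fetch is injectable so the sketch can be exercised without a network.
        self.hostname = hostname
        self.default_policy = default_policy
        self.fetch = fetch or (lambda url: urllib.request.urlopen(url).read())

    def Process(self, document):
        # Only documents stored in Windows Azure Storage carry a getPath
        # field; everything else passes through the stage unchanged.
        path = document.get("getPath")
        if not path:
            return ProcessorStatus.OK_NoChange

        # An override policy from the relational database wins over the
        # administrator-configured default.
        policy = document.get("policy") or self.default_policy
        if not policy:
            return ProcessorStatus.OK_NoChange

        url = "https://%s/%s?%s" % (self.hostname, path, policy)
        document["data"] = self.fetch(url)  # pull the content into memory
        return ProcessorStatus.OK
```

The real stage adds logging and error handling around the fetch; as noted, data integrity checking was kept minimal for the POC.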
For more details on document processing, refer to the FAST ESP Document Processor Integration Guide that comes with an installation of FAST ESP 5.3.
We were not sure how much latency the URL fetch would introduce into the pipeline, so we wrote log entries to disk, capturing both the time to connect to the Windows Azure Storage service and the time to download the actual content. You can see from the commented-out code how we tested the different scenarios. We tested download times for both HTTP and HTTPS transactions by changing the hostname in the stage configuration. To compare with local file access, we timed how long it took to fetch the content when it was stored only on local disk. We concluded that, on average, it took almost 0.4 seconds to establish an HTTPS connection to Windows Azure Storage. This is an acceptable latency because we can compensate for the delay by increasing the number of document processing pipelines within ESP 5.3. We estimated that we needed no more than four pipelines to process the documents at a rate sufficient to satisfy the customer's SLAs.
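The instrumentation amounted to wrapping the connect and download calls with timers and writing the results to a log. A minimal version of that pattern (the host and path would be placeholders for the real storage endpoint; this is a sketch of the measurement, not the stage's actual code):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result

def fetch_with_timings(host, path, use_https=True):
    """Fetch one document, timing connection setup and download separately,
    as we did when comparing HTTP, HTTPS, and local-disk access."""
    import http.client
    conn_cls = http.client.HTTPSConnection if use_https else http.client.HTTPConnection
    conn = conn_cls(host)
    connect_s, _ = timed(conn.connect)      # TLS/TCP handshake cost
    def download():
        conn.request("GET", path)
        return conn.getresponse().read()
    download_s, body = timed(download)      # transfer cost
    conn.close()
    return connect_s, download_s, body
```

Separating the two measurements is what showed that the ~0.4 second cost was dominated by connection establishment rather than the transfer itself.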
Figure 4: AzureFetch.py stage code (Part 2)
The last piece of the puzzle was to configure FAST ESP over the Windows Azure Service Bus. FAST ESP listens on port 15100 for query requests. Because the customer did not want to open up that port for all internet traffic, we opted to use Service Bus to get requests from Windows Azure. This connection used Windows Azure Connect and is covered in great detail by Ann Skobodzinski in her article Using Windows Azure Connect to Integrate On-Premises Web Services.
Overall, the POC was successful and we were able to show end-to-end integration. The access latency for the Windows Azure Storage service averaged almost 0.6 seconds per document fetch. This can be addressed either by increasing the number of documents we process in parallel in the pipeline or by placing a proxy server in the middle and letting it keep an established connection to the Windows Azure Storage service. Both mechanisms will resolve the issue but have their own pros and cons; investigating them was beyond the scope of the POC. By the end of the two weeks, we were able to show that it is fairly easy to integrate Windows Azure with FAST ESP running on-premises.
Field Note: Using Windows Azure Connect to Integrate On-Premises Web Services (http://www.microsoft.com/windowsazure/learn/real-world-guidance/field-notes/integrating-onpremises-using-connect/)
Field Note: Integrating On-Premises Web Services with Windows Azure Service Bus and Port Bridge (http://www.microsoft.com/windowsazure/learn/real-world-guidance/field-notes/integrating-with-service-bus-and-port-bridge/)