Is the data on your Web site being ripped off without your knowledge? As the popularity of the Web explodes, more and more companies are publishing huge volumes of data over the Internet, including data they may have spent millions of dollars to collect. Once the data is published on a Web site, it's easy enough for a competitor to write a program that scrapes it for their own personal or financial gain. Detecting this sort of activity and investigating its origin could save your company lost revenue and protect your investment.
This article describes such an application. It is written as an ISAPI filter that analyzes the regularity of visits from an IP address to determine if the traffic pattern matches that of real users or of a software device known as a spider. After detecting such an automated process, the application can take one of two countermeasures: redirect the visitor to a specified Web page, or allow the scraping and log the visitor's activity for evidence.
The solution I describe in this article was written to be generic enough to handle Web sites hosted on any Web server running Microsoft® Internet Information Services (IIS) 4.0 or 5.0, and it requires minimal configuration for proper operation. I used Microsoft Visual C++® 6.0, Windows® 2000 Professional Server, and IIS 5.0 peer Web services to develop my solution. This article assumes you have more than a passing acquaintance with the Microsoft Internet API known as ISAPI.
Statistics 101
I'm not a statistician, but I thought it might be fun to use a little regression analysis on the time intervals between hits to identify visiting trends for a Web site.
The result of this analysis is the linear correlation coefficient. Given an array of XY coordinates, each X representing the next Web visit and each Y representing the time since the last visit, the function in Figure 1 will return a value called the Pearson Product-Moment Correlation Coefficient (PR), a measure of how closely the points fit a straight line, that lies between -1 and 1. If the value of PR is 1, then the data has complete positive correlation (that is, the data points lie on a perfectly straight line with positive slope, with X and Y increasing together). The value of 1 is independent of the magnitude of the slope; every sloped straight line represents perfect correlation. If the data points lie on a perfectly straight line with negative slope (Y decreasing as X increases), then PR has the value of -1, or complete negative correlation (see Figure 1). (Of course, this statistical test is kind of overkill because it is not a multifactorial analysis. Only one factor, the time between visits, is plotted. But this simple code should suffice as an example upon which you can build your own application.)
If PR is near zero, this indicates that the variables X and Y are uncorrelated. As I will demonstrate later, you can take the elapsed times between hits and place them on the Y scale. Then, as long as the X scale is spaced evenly, you can determine if the elapsed times between hits fall on a straight line. The closer the absolute value of PR is to 1 (indicating high regularity), the greater the probability that someone is scraping your Web site. The further |PR| is from 1 (randomly timed visits), the more likely it is that legitimate users are visiting the Web site.
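Since Figure 1 is not reproduced in this text, here is a minimal sketch of how a correlation function of this kind could be written. The name and argument order follow the PearsonsR call that appears later in the article; the body, the handling of a constant series, and the use of double accumulators are assumptions for illustration, not the original listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch of a Pearson product-moment correlation function.
// Name and parameter order mirror the PearsonsR call in the article;
// the implementation itself is an illustrative assumption.
float PearsonsR(const float* x, const float* y, std::size_t n)
{
    assert(n >= 2);  // correlation is undefined for fewer than two points

    // First pass: means of both series.
    double sumX = 0.0, sumY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sumX += x[i];
        sumY += y[i];
    }
    const double meanX = sumX / n;
    const double meanY = sumY / n;

    // Second pass: covariance and variance sums about the means.
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }

    // A constant series has no defined correlation; this sketch reports 0.
    if (sxx == 0.0 || syy == 0.0) {
        return 0.0f;
    }
    return static_cast<float>(sxy / std::sqrt(sxx * syy));
}
```

Points lying exactly on a rising line give a result of 1 regardless of how steep the line is, and points on a falling line give -1, which is the property the detection scheme relies on.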
Developing the ISAPI Filter
To develop my ISAPI filter, I started Visual C++ 6.0 and selected the new ISAPI Extension Wizard option from the new projects wizard screen, as shown in Figure 2.
Figure 2 New Project
I named the project SiteSentry, which is how I will refer to it for the remainder of this article. After naming the project and clicking the OK button, two more wizard dialogs appear. The first of these two dialogs, shown in Figure 3, gives you the option of creating either an ISAPI extension or filter object. Since I need to examine certain notifications before IIS processes the request, I want a filter object.
Figure 3 ISAPI Extension Wizard
I selected the checkbox to generate a filter object and deselected Generate a Server Extension object. I chose to use MFC as a shared library, but either MFC linking option will work in this case. After clicking the Next button, the dialog shown in Figure 4 will be displayed. I selected high priority and nonsecured port for my example.
Figure 4 Options
SiteSentry is interested in notifications of only two events: when the client first appeared on the IIS Server, and when the client broke the connection with the IIS Server. This defines a window of time when the user was actually on the Web site and could have been scraping data. By selecting the "URL mapping requests" notification, you can determine when the client requests a resource from a Web site. This translates the URL into a physical path and file name on the IIS server. You can use this notification to detect when the user first makes a resource request and when the user makes all subsequent requests.
By selecting the "End of connection" notification you can determine when the client left the Web site and you can clean up any memory associated with the connection. After clicking the Finish button, the project will be created with all the code for a simple ISAPI filter.
When writing ISAPI filters, select only the notifications in which you are interested. Selecting too many will degrade the performance of all IIS services. Also make sure the notifications you are going to handle are fast and efficient so you don't degrade response times on your Web site.
How SiteSentry Works
Next, let's look at how SiteSentry works. The operation of SiteSentry is much like a court proceeding: there is an arrest, judgment, and penalty phase for all visitors who are illegally accessing guarded resources on the IIS Server. Of course, the visitor is assumed innocent until proven guilty by the statistical function, PearsonsR, shown earlier in Figure 1.
Located at the top of BCSiteSentry.cpp are two MFC CMap dictionary collections, gHitItemsCache and gGuardedItemsCache.
They contain collections of CHitItem and CGuardedItem classes, which are defined in the Cache.h header file.
A guarded resource is defined as either a file name or directory, as shown in Figure 5. If both a file and the directory in which it resides are specified as guarded resources, the directory guard takes precedence over the file name guard.
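The precedence rule can be sketched as a two-step lookup. This is only an illustration under stated assumptions: the GuardedItem struct, the FindGuard function, and the use of std::map in place of the MFC CMap are all hypothetical, and the sketch checks only the immediate parent directory of the requested file.

```cpp
#include <map>
#include <string>

// Hypothetical, reduced stand-in for CGuardedItem.
struct GuardedItem {
    std::string redirectUrl;  // illustrative field only
};

// Returns the guard that applies to a physical path, or nullptr if the
// path is unguarded. A guard on the containing directory is consulted
// first, so it wins over a guard on the file itself.
const GuardedItem* FindGuard(const std::map<std::string, GuardedItem>& guards,
                             const std::string& physicalPath)
{
    const std::string::size_type slash = physicalPath.find_last_of("\\/");
    if (slash != std::string::npos) {
        // Only the immediate parent directory is checked in this sketch.
        auto dir = guards.find(physicalPath.substr(0, slash));
        if (dir != guards.end()) {
            return &dir->second;
        }
    }
    auto file = guards.find(physicalPath);
    return file != guards.end() ? &file->second : nullptr;
}
```

Checking the directory before the file keeps the precedence rule in one place, so a file-level guard is used only when no directory guard covers it.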
For each guarded resource, there will be one instance of a CGuardedItem class in the gGuardedItemsCache collection, each with a different countermeasure when visitors are judged guilty of scraping. For each unique visitor on the Web site, there will be one instance of a CHitItem class in the gHitItemsCache collection.
The method GetFilterVersion is called only once during the startup phase of SiteSentry by the IIS Server. This is where I initialize structures and start the wtJudgment thread (explained later) that will run for the duration of SiteSentry.
The functions SysLoadOperationParameters and SysLoadGuardedItemsCache are called to load the information from the BCSS.CFG in Figure 6 and BCSS.DAT, which you already saw in Figure 5.
After SiteSentry loads and initializes, most of the work takes place in the OnUrlMap method. This method is called each time the visitor maps a logical URL to a physical path and file name on the Web server.
One of the first functions called in the OnUrlMap method, SysIsQualifiedResource(pMapInfo), examines the pMapInfo structure to determine if the specified physical path and file are contained in the gGuardedItemsCache global guarded cache collection.
The pMapInfo parameter is a pointer to a structure of type HTTP_FILTER_URL_MAP, which contains information IIS will use to map the URL to a physical path and file.
HTTP_FILTER_URL_MAP has the following layout:
typedef struct _HTTP_FILTER_URL_MAP {
    const CHAR* pszURL;
    CHAR* pszPhysicalPath;
    DWORD cbPathBuff;
} HTTP_FILTER_URL_MAP, *PHTTP_FILTER_URL_MAP;
Contained in SysIsQualifiedResource is an array, szExtInclusion, of well-known file extensions for files that can return HTML to the visitor. You can add to this list if you know of others that can return HTML.
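An extension inclusion test of the kind described for szExtInclusion might look like the sketch below. The contents of the real list are not shown in the article, so the three entries here, along with the names kExtInclusion and HasQualifiedExtension, are assumptions chosen for illustration.

```cpp
#include <cctype>
#include <string>

// Illustrative inclusion list; the entries are assumed, not taken from
// the article's szExtInclusion array.
static const char* const kExtInclusion[] = { ".htm", ".html", ".asp" };

// Returns true when the path ends in one of the listed extensions,
// compared without regard to case.
bool HasQualifiedExtension(const std::string& path)
{
    const std::string::size_type dot = path.find_last_of('.');
    if (dot == std::string::npos) {
        return false;
    }
    // A dot that belongs to a directory name is not a file extension.
    if (path.find_first_of("\\/", dot) != std::string::npos) {
        return false;
    }

    std::string ext = path.substr(dot);
    for (char& c : ext) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    for (const char* candidate : kExtInclusion) {
        if (ext == candidate) {
            return true;
        }
    }
    return false;
}
```

Filtering on extension first means that requests for images and other static files never reach the hit cache, which keeps the timing data focused on pages that can return HTML.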
If you look at SysIsQualifiedResource, you will notice the following statement:
if(NULL != pMapInfo->pszURL && 0 != stricmp(pMapInfo->pszURL,"/"))
When IIS translates the URL into the physical path, it will first try mapping to the directory where the resource is located. It does this to see if the user account under which the Web site is running has access to the directory. SiteSentry will ignore this test and only test for the actual resource request. Note that the OnUrlMap method is being called twice for each request. IIS makes one request for the directory where the Web page lives to see if there are sufficient rights to access the Web page and a second request for the actual Web page.
After testing the resource request and determining whether it is a guarded resource, the next action you should take is to allocate the HITPARAMS structure and store a pointer to it in the pFilterContext member. The pFilterContext member is a generic VOID pointer which you can use to hold your own connection-specific information. This information will be deleted later on in the OnEndOfNetSession method.
HITPARAMS takes the following form:
typedef struct {
    long lTime;
    CHAR szPhysicalPath[_MAX_PATH];
    CHAR szRemoteAddr[MAX_REMOTE_ADDR];
    CHAR szRemoteHost[MAX_REMOTE_HOST];
    CHAR szRemoteUser[MAX_REMOTE_USER];
} HITPARAMS, *LPHITPARAMS;
After the structure is allocated and a pointer stored, the SysLoadClientHitStructure function is called and passed pCtxt and pMapInfo as parameters. This loads the HITPARAMS structure with information about the visitor. The pCtxt->GetServerVariable function retrieves the information from the IIS server.
The GetTickCount API supplies the time reference. This value is saved in the lpHitParams->lTime member as early as possible in the OnUrlMap method so that the time results are not skewed by function call latencies.
LPHITPARAMS lpHitParams = (LPHITPARAMS)pCtxt->m_pFC->pFilterContext;
lpHitParams->lTime = (long)GetTickCount();
I do not save the time as HH:MM:SS since I only need a reference in time so that elapsed time between hits can be determined. The GetTickCount API return value is the number of milliseconds that have elapsed since the system was started and is perfect for this example. It will wrap around to zero every 49.7 days if the system runs continuously, but this should not be a factor in SiteSentry.
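If the wraparound ever did matter, one common way to make the elapsed-time calculation safe is to keep the readings as unsigned 32-bit values, which is the type GetTickCount returns. The sketch below uses a portable std::uint32_t stand-in and a hypothetical helper named ElapsedTicks; it is not part of SiteSentry.

```cpp
#include <cstdint>

// Elapsed milliseconds between two GetTickCount-style readings.
// Unsigned 32-bit arithmetic is modulo 2^32, so the difference remains
// correct even if the counter wrapped to zero once between the readings.
std::uint32_t ElapsedTicks(std::uint32_t earlier, std::uint32_t later)
{
    return static_cast<std::uint32_t>(later - earlier);
}
```

For example, a reading of 0xFFFFFF00 followed by 0x00000100 yields 0x200 (512 ms), whereas the same subtraction on signed long values cast from the tick count could give a misleading result around the wrap point.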
Now I start the wtArrester thread. I use a worker thread to efficiently integrate the hit into the gHitItemsCache collection. I do not want visitors waiting on functions to return; I want worker threads waiting on functions to return. Therefore, I take the information harvested earlier and start the wtArrester thread, passing in a pointer to the pCtxt->m_pFC->pFilterContext structure for integration into the gHitItemsCache.
The job of the wtArrester thread is first to see if the visitor is already in the cache by using the SysGetHitKey function. If not found in the gHitItemsCache collection, I add the visitor as a new visitor using the gHitItemsCache.SetAt(sHitKey,pHitItem) method of the CMap collection.
If the visitor is found, I perform several more tests to determine if I can reuse the entry using the pHitItem->CanRecycle method. When the client has accumulated enough hits to fill up their statistical sampling periods, a judgment is rendered, and if found not guilty, SiteSentry will reset all variables associated with the client and reuse the cache entry, as opposed to deleting and reallocating it. The two parameters, named gOP.iMaxStatisticalPeriods and gOP.lGuardedItemsCacheTimeoutSECS, are used to determine if the cache entry has reached the end of its statistical time period or the cache entry has reached the end of the time-to-live period, described in seconds, as defined in the configuration file BCSS.CFG. The time-to-live period is the number of seconds SiteSentry collects statistical data on an individual client. At the end of the time-to-live period, if there has been little activity and not enough statistical data collected to render a judgment, the cache entry is deleted.
In either case, the time period is incremented and recorded with a call to the pHitItem->IncrementPeriod method. The array that holds the time periods is CHitItem::m_lLinearTimePeriod, which has a dimension of MAX_ALLOWED_STATISTICAL_PERIODS. The actual number of time periods tested for is controlled by gOP.iMaxStatisticalPeriods. Each new visitor starts out with one time period, and with each new hit to the same guarded resource, I increment and record the time into the next array position. I do this until I have reached the end of statistical time as set by gOP.iMaxStatisticalPeriods.
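The bookkeeping described above can be sketched with a small stand-in for CHitItem. Everything here is an assumption made for illustration: the HitItem struct, the RecordHit function, the use of std::map instead of CMap, and the value chosen for the array dimension. Recycling and time-to-live handling are left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for MAX_ALLOWED_STATISTICAL_PERIODS; the value is assumed.
const int kMaxAllowedPeriods = 32;

// Hypothetical, simplified stand-in for CHitItem.
struct HitItem {
    long periods[kMaxAllowedPeriods] = {};  // recorded hit times
    int  count = 0;                         // slots filled so far

    // Records a hit time in the next slot. Returns false once the
    // configured number of periods has already been collected.
    bool IncrementPeriod(long tick, int maxPeriods) {
        assert(maxPeriods <= kMaxAllowedPeriods);
        if (count >= maxPeriods) {
            return false;
        }
        periods[count++] = tick;
        return true;
    }
};

// Adds the visitor on first sight, then records the hit against the entry.
// Returns true when the entry is full and ready for a judgment.
bool RecordHit(std::map<std::string, HitItem>& cache,
               const std::string& hitKey, long tick, int maxPeriods)
{
    HitItem& item = cache[hitKey];  // operator[] creates a new visitor entry
    item.IncrementPeriod(tick, maxPeriods);
    return item.count >= maxPeriods;
}
```

Keeping the array at a fixed maximum size while a separate setting controls how many slots are actually used mirrors the split between MAX_ALLOWED_STATISTICAL_PERIODS and gOP.iMaxStatisticalPeriods.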
The wtArrester will die a natural death after it has finished adding the hit to the gHitItemsCache collection so I do not store any handle information about the thread.
The last check I make in the OnUrlMap method is a test for guilt with the following statement:
if(NULL != pCtxt->m_pFC->pFilterContext && SysIsClientGuilty(pCtxt))
I am interested only in visitors who have been rendered guilty by the wtJudgment worker thread (more about this later). The return value of OnUrlMap will be either the return value of the SysEnterPenaltyPhase function or SF_STATUS_REQ_NEXT_NOTIFICATION.
SysEnterPenaltyPhase will return either SF_STATUS_REQ_NEXT_NOTIFICATION, which indicates that I am observing and logging all hits to the Web site to serve as evidence, or SF_STATUS_REQ_FINISHED_KEEP_CONN, which indicates that I have redirected the visitor to CGuard::m_sRedirectResource, the URL specified in the BCSS.DAT resource guard definition.
Notice that the SysEnterPenaltyPhase function contains the following property test:
if(pGuardedItem->m_bAnnunciate && !pHitItem->m_bAnnunciated)
The function SysAnnunciate allows you to define a COM object with the name and CoClass of SSAnnunciate.Action that exposes one method.
This is your chance to hook into SiteSentry and perform other actions that you want to perform every time a guilty visitor is discovered. SysAnnunciate will try to instantiate the COM object and pass information to it about the visitor for notification by e-mail or pager, or to perform any other actions you decide. The value of pGuardedItem->m_bAnnunciate is defined in the BCSS.DAT guarded items configuration file. The member pHitItem->m_bAnnunciated is set in the SysAnnunciate function, so I only call the function once for each guilty visitor.
Included with this article is a Visual Basic® project named SSAnnunciate that demonstrates how to write a COM object that works with the SysAnnunciate function. The Visual Basic project installs in the SSAnnunciate directory under SiteSentry.
The role of the wtJudgment worker thread is twofold. Every five seconds, wtJudgment iterates the entire gHitItemsCache, looking for items that can be safely eliminated from the collection with a call to pHitItem->CanDiscard, and for items that have collected enough statistical periods to need a judgment rendered.
Earlier in the article, I mentioned that I preloaded the X and Y scale to always have a positive slope so that I'm only dealing with one comparison. The first step in converting the scales is to find the largest number contained in the pHitItem->m_lLinearTimePeriod statistical period array.
I take this number and build the Y scale with the following line:
y[i] = (float)y[i] + (iHigh * i);
After the Y scale is built, I build the X scale with the following lines:
fX = iHigh / gOP.iMaxStatisticalPeriods;
x[i] = (float)fX * i;
This creates a perfectly spaced X and Y scale in which the correlation between X and Y produces a positive slope at a 45 degree angle (a slope of 1), with the difference between X periods in the Y direction representing the timing data collected in the OnUrlMap method.
Now all the values are in place to actually perform the analysis with the following line:
float fR = PearsonsR(x,y,gOP.iMaxStatisticalPeriods -2);
After the call to PearsonsR, the variable fR contains a number between 0 and 1. Typical values in my tests are within the range of 0.990000 to 1.000000, so the difference between guilt and innocence is a very small number indeed. The "comfort value" (gOP.fComfortValue) is a small user-defined number that sets how far below 1 fR may fall and still indicate scraping. To render a verdict, I subtract the value contained in gOP.fComfortValue from 1 and test whether fR is equal to or greater than the result. If it is, the client is considered guilty; otherwise, the client is considered innocent. I then call either the member method pHitItem->SetGuilty or pHitItem->SetInnocent, depending on the outcome of the comparison. I have found that setting the gOP.fComfortValue member to 0.000100 and the gOP.iMaxStatisticalPeriods member to 10 is a good starting point to use for tests.
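The scale-building and verdict steps can be pulled together in one small sketch. It follows the lines quoted from the article but makes several assumptions: the function names Correlation and IsGuilty are hypothetical, std::vector replaces the member arrays, every collected interval is used (the article passes gOP.iMaxStatisticalPeriods - 2 as the count), and a compact correlation helper is included only so the block compiles on its own.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compact Pearson correlation, included only to keep the sketch standalone.
static double Correlation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return (sxx == 0.0 || syy == 0.0) ? 0.0 : sxy / std::sqrt(sxx * syy);
}

// Returns true (guilty) when the intervals between hits are regular enough.
// `intervals` holds the elapsed time between consecutive hits and
// `comfort` plays the role of gOP.fComfortValue.
bool IsGuilty(const std::vector<double>& intervals, double comfort)
{
    assert(intervals.size() >= 2);
    const std::size_t n = intervals.size();

    // Largest interval, used to offset each point onto a rising line.
    double high = 0.0;
    for (double t : intervals) {
        if (t > high) high = t;
    }

    const double stepX = high / n;
    std::vector<double> x(n), y(n);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = stepX * i;                 // evenly spaced X scale
        y[i] = intervals[i] + high * i;   // interval stacked on a rising Y scale
    }

    // Guilty when the correlation is within `comfort` of a perfect 1.
    return Correlation(x, y) >= 1.0 - comfort;
}
```

With a comfort value of 0.0001, five identical 1000 ms intervals are judged guilty, while a mix of short and long intervals such as 100, 5000, 300, 4200, 50, and 3900 ms falls below the threshold and is judged innocent.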
The wtJudgment worker thread only renders the verdict. The actual penalty phase will be entered the next time the visitor enters the OnUrlMap method and the test is made with the SysIsClientGuilty function call.
Once a visitor is judged guilty, the log will contain entries as displayed in Figure 7 and Figure 8, depending on the penalty phase.
One of the following two lines appears in the log when a visitor first visits the Web site. If the visitor is new, the CACHE ALLOC: line will be written. If the visitor was previously judged innocent and the cache entry was reused, the CACHE RECYCLE: line will be written:
CACHE ALLOC: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
at 01/17/2001 8:48:26
CACHE RECYCLE: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
at 01/17/2001 8:48:46
When visitors reach the end of their statistical time as determined by the member gOP.iMaxStatisticalPeriods, the wtJudgment worker thread writes one of the following lines to the log indicating the verdict rendered.
CACHE INNOCENT: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
Confidence Level: 0.991809 Comfort Value: 0.000100 Periods: 10
at 01/17/2001 8:48:45
CACHE GUILTY: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
Confidence Level: 1.000000 Comfort Value: 0.000100 Periods: 10
at 01/17/2001 8:48:45
The lines include the number of statistical time periods, the comfort value used, and the return value from the PearsonsR function.
If the visitor was judged guilty and the pGuardedItem->m_iPenalty member is set to CGuardedItem::Observe, then after every five hits the following line will be written to the log.
OBSERVING: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
at 01/17/2001 8:48:46 with 5 hit(s)
If the visitor was judged guilty and the pGuardedItem->m_iPenalty member is set to CGuardedItem::Redirect, the following line will be written to the log each time the guilty visitor enters the OnUrlMap member.
REDIRECTED: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML> to
LOCALHOST/SCRAPINGPOLICY.HTML at 01/17/2001 8:53:48
When the visitor has reached the time-to-live period as determined by the gOP.lGuardedItemsCacheTimeoutSECS member, the following line will be written to the log indicating that the removal from the gHitItemsCache collection has been performed:
CACHE ELIM: <Anonymous # 127.0.0.1 # C:\MYWEBSITE\DEFAULT.HTML>
at 01/17/2001 8:53:48
Last, but not least, the OnEndOfNetSession method is called when the visitor's session is ending. At this point I delete the HITPARAMS structure that was allocated during the first call the visitor made to the OnUrlMap method.
Using the entire IP address is troublesome when identifying return visitors to a Web site. It would make more sense to support the TCP/IP Class C address range to identify return visitors as it only uses the first three octets of the IP address. The last octet (fourth number in the IP address) may be different on each visit even though it is the same visitor. This would probably be better supported for the guarded resource as an option that a Webmaster could determine at configuration time.
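One simple way to build such a key is to drop the last octet from the dotted address before it is used to look up the visitor. The helper below is a sketch only; the name ClassCKey is hypothetical, and it assumes the address arrives as a dotted IPv4 string such as the REMOTE_ADDR value.

```cpp
#include <string>

// Returns the first three octets of a dotted IPv4 address, for example
// "192.168.1.57" becomes "192.168.1". Input without a dot is returned
// unchanged so that unexpected values still produce a usable key.
std::string ClassCKey(const std::string& remoteAddr)
{
    const std::string::size_type lastDot = remoteAddr.find_last_of('.');
    return lastDot == std::string::npos ? remoteAddr
                                        : remoteAddr.substr(0, lastDot);
}
```

The trade-off is that unrelated visitors who share the same three leading octets are counted as one visitor, which is one reason to make this an option per guarded resource rather than the default.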
Conclusion
This article only scratches the surface of what's possible when you write ISAPI applications. Microsoft has followed a long tradition of opening up the core services to the developer and making it possible for the creative programmer to accomplish goals he never imagined possible.
In fact, to demonstrate the power of the ISAPI programming model, Microsoft implements Proxy Server, SSL, and FrontPage® as ISAPI filters and ASP as an ISAPI extension. Other common uses of ISAPI filters include custom authentication routines, logging, and encryption techniques. Hopefully, my SiteSentry solution demonstrates the potential of using ISAPI filters while providing a useful, extensible tool that tracks the traffic patterns of your Web site.