Getting all the Entropy We Can Eat!
This next routine,
getEntropy, is the main routine for determining which classification is the most important for sales growth. We will turn to communication research and rip off one of their discoveries. As mentioned, entropy is a precise measurement of information.
In a nutshell, entropy is a measure of uncertainty in an observation. The entropy explains the amount of uncertainty about a particular outcome.
This is used a lot in determining information that comes over noisy data lines. We, however, have hijacked it for our marketing purposes.
We will be using logarithms, which happen to be a built-in function in VB 6.0. If you are a bit rusty on them, it simply means that logx = y where (2 to the power y) = x, in this case log has base 2. In English, this means that the log to base 2 of any number is any bits that it would take to represent that many numbers. For example, log2(16) = 4. This tells us that it takes 4 bits to uniquely represent 16 different values. So we will explain this in the code below. But hey, don't worry. We will just use the VB built-in log function to do the dirty work for us.
Try It Out - Determining the Entropy of a Classification
1. Please add another function to the
frmID3 form. Name the function
getEntropy and enter the code as shown. This routine will determine the entropy for the classification and return the entropy to the calling routine,
Public Function getEntropy(ByVal sField As String, itotalRecords As Integer) As Single '-------------------------------------------------------------- '-- First, determine how many unique values of this field are '-- in the table. Store them and determine the entropy for each '-------------------------------------------------------------- Dim sSql As String Dim adoHoldGroup As ADODB.Recordset Dim totalUp As Integer Dim totalDown As Integer Dim totalSamples As Integer Dim fractionalEntropy As Single Dim entropy As Single Dim searchValue As String fractionalEntropy = 0 Set adoHoldGroup = New ADODB.Recordset adoHoldGroup.CursorLocation = adUseClient '-- grab the info here-- sSql = "SELECT * FROM ID3 ORDER BY " & sField adoHoldGroup.Open sSql, adoConnection If (adoHoldGroup.BOF) And (adoHoldGroup.EOF) Then MsgBox "No records in the ID3 Table" Exit Function End If totalSamples = adoHoldGroup.RecordCount adoHoldGroup.MoveFirst searchValue = adoHoldGroup.Fields.Item(sField) With adoHoldGroup While Not .EOF totalUp = 0 totalDown = 0 '--------------------------------------------------- '-- Loop through all of the records with like values '--------------------------------------------------- Do While (searchValue = .Fields.Item(sField)) If !SalesUp = True Then totalUp = totalUp + 1 Else totalDown = totalDown + 1 End If .MoveNext If (.EOF) Then Exit Do Loop '--------------------------------------------------- '-- Now determine the entropy for that group -- '--------------------------------------------------- If (totalUp = 0) Then fractionalEntropy = -1# ElseIf (totalDown = 0) Then fractionalEntropy = -1# Else fractionalEntropy = _ -(totalUp / totalSamples) * Log(totalUp / totalSamples) - _ (totalDown / totalSamples) * Log(totalDown / totalSamples) End If fractionalEntropy = (fractionalEntropy / itotalRecords) entropy = entropy + fractionalEntropy If (.EOF) Then getEntropy = entropy + fractionalEntropy Else searchValue = adoHoldGroup.Fields.Item(sField) End If Wend End With End Function
How It Works
We want all of the records sorted by the particular field we will be examining. So if we are determining the entropy on the
CountryLanguage classification, for example, all of the records that have
English in the
CountryLanguage field will be lumped together.
So what we want to do is loop through the recordset and determine the entropy on each of the records that have the same value in the CountryLanguage field. When the value of the field changes to, say,
German, we will stop and calculate the value for
English because we know we have looked at each of those. So by returning the records sorted by the classification field, we can be sure that all like records are together. An easy way to do this is to take the first value passed into our
sField. We can dynamically build our SQL statement instructing the recordset to be returned ordered by that field. So if we were determining the
CountryLanguage, all entries for
English would be together, etc.
'-- grab the info here-- sSql = "SELECT * FROM ID3 ORDER BY " & sField adoHoldGroup.Open sSql, adoConnection
We start by opening the recordset
adoHoldGroup, which gives us all of the records in the
ID3 table. Now that the recordset is opened, we to know which classification we will be working with. After all, it might be
Country. Well, we passed in the value as a parameter to the function,
sField. Let's say that
Country. So we use that variable and pass it to the recordset's
Fields.Item() as a parameter. This will return the current value of that field (
Country) in the current record. That value, say
Spain, is assigned to the variable
searchValue. Please notice that this is another example of a generalized routine. We don't hard code in a value, but permit a parameter passed into the routine,
sField, to determine what we will be searching on. This allows us to use the same routine to examine each of the three classifications.
adoHoldGroup.MoveFirst searchValue = adoHoldGroup.Fields.Item(sField)
As mentioned, we want to handle each category. Let's assume that we passed in
CountryLanguage as the classification to be evaluated. If the
CountryLanguage of the current record is
English is assigned to
searchValue. As long as the value of the current record's
CountryLanguage) is equal to
English, we loop through the recordset and increment the
totalDown fields. We are simply incrementing the
totalDown fields if that particular record shows that sales were either up or down. So if we had 4 records from four countries that speak English, and three of the countries experienced sales growth while the fourth did not,
totalUp would be 3 and
totalDown would be 1. As long as the
English) is equal to the field being examined (
Do loop looks at each record and increments the
totalDown. In addition, since all of the records with like languages are grouped together, we know we will get them all. Now say the
MoveNext goes to the next record and the
Fields.Item(CountryLanguage) is now
German. The code exits the
Do loop and continues.
While Not .EOF totalUp = 0 totalDown = 0 '--------------------------------------------------- '-- Loop through all of the records with like values '--------------------------------------------------- Do While (searchValue = .Fields.Item(sField)) If !SalesUp = True Then totalUp = totalUp + 1 Else totalDown = totalDown + 1 End If .MoveNext If (.EOF) Then Exit Do Loop
The Scary Part
Essentially, we want to see how many records for countries that speak English had sales that were up and how many were down. So we are going to calculate something called
What we want to look at here is the
fractionalEntropy when both the
totalDown are not 0. If one of these were indeed 0, we would get a divide by zero error. So right away this tells us that we want to have enough samples so that typically both the
totalDown will have a number greater than zero. We calculate the
fractionalEntropy because this is currently being calculated for a single, say language. However, we still need to calculate the number for Spanish, French, German, etc. So we call this
fractionalEntropy because it is a piece of the total.
If (totalUp = 0) Then fractionalEntropy = -1# ElseIf (totalDown = 0) Then fractionalEntropy = -1# Else fractionalEntropy = _ -(totalUp / totalSamples) * Log(totalUp / totalSamples) - _ (totalDown / totalSamples) * Log(totalDown / totalSamples) End If
What we want to do is look at the frequency of an "up" value out of the total values. Let's say that there were 10 countries with the language of English. If 5 were up, then the probability of the language classification having a positive sales growth is 5 out of 10, or 5/10. Likewise, the probability of the language classification having a negative sales growth is also 5/10. In general terms, we can look at the formula like so:
Outcome = - Probability(up) log Probability(up) - Probability(down)
In our example we would then have
Outcome = -5/10 log (5/10) - 5/10 log (5/10)
Outcome = 1 (i.e. no change. We can't discern anything here)
The outcome represents the uncertainty of sales going up, based on the information we placed in the
ID3 table. This is NOT meant to be a math book, but only to add an element of interest in data mining and show a simple technique on how we can add another dimension to our knowledge of data mining.
fractionalEntropy = (fractionalEntropy / itotalRecords) entropy = entropy + fractionalEntropy If (.EOF) Then getEntropy = entropy + fractionalEntropy Else searchValue = adoHoldGroup.Fields.Item(sField) End If
We then divide the
fractionalEntropy by the number of records in the recordset and assign this to
fractionalEntropy. Now that we have calculated the
fractionalEntropy for say, the English language, we add that to the variable
entropy which represents the entire classification. So we would loop through to the next record in the
ID3 table and determine the
fractionalEntropy for all countries that speak Spanish. When finished, this
fractionalEntropy factor for Spanish would be added to the
entropy variable. So
entropy is cumulative. Each of the
fractionalEntropy factors for each language is just added to the variable
entropy, which is the sum of each of the similar records.
If we have reached the end of the recordset, we assign the result to the name of the function,
getEntropy, and return that value for the recordset based on the criteria we just analyzed. Otherwise, we update the value of
searchValue with the current record. The code then jumps to the top of the loop, initializes the variables, and looks at the next group. Notice how we update the
fractionalEntropy with the results from each grouping of say,
The final value is then returned to the calling routine,
determineEntropy, and placed in the array for sorting and display. Let's now take a look at how to sort our array to ensure the lowest entropy value is stored in the first position.
Try It Out - Sorting Our Multi-Dimensional Array
1. Please add a final sub to our
frmID3 form called
qsort. We will call this to sort the array,
Position, so the least entropy factor is in the lowest position. Essentially, this will sort our array in ascending order.
Public Sub qsort(ByRef myArray() As Variant, ByVal iLowBound As Integer, ByVal iHighBound As Integer) Dim intX As Integer Dim intY As Integer Dim intMiddle As Integer Dim sHoldString As String Dim varMidBound As Variant Dim varTmp As Variant If iHighBound > iLowBound Then intMiddle = ((iLowBound + iHighBound) \ 2) varMidBound = myArray(intMiddle, 0) intX = iLowBound intY = iHighBound Do While intX <= intY '-- if a value lower in the array is > than one -- '-- higher in the array, swap them now -- If myArray(intX, 0) >= varMidBound _ And myArray(intY, 0) <= varMidBound Then varTmp = myArray(intX, 0) sHoldString = myArray(intX, 1) myArray(intX, 0) = myArray(intY, 0) myArray(intX, 1) = myArray(intY, 1) myArray(intY, 0) = varTmp myArray(intY, 1) = sHoldString intX = intX + 1 intY = intY - 1 Else If myArray(intX, 0) < varMidBound Then intX = intX + 1 End If If myArray(intY, 0) > varMidBound Then intY = intY - 1 End If End If Loop Call qsort(myArray(), iLowBound, intY) Call qsort(myArray(), intX, iHighBound) End If End Sub
How It Works
We call this sub from the
determineEntropy routine. Notice that we are passing in the array
ByRef, or by reference. This is so that
qsort actually has the memory location of the array. Passing in the array by reference permits the
qsort routine to work on the actual array – not a copy of the array in memory. Passing in by reference is the default behavior of VB. It can make the changes for us here. A quick call to
determineEntropy will sort the array. We can then print the array from
determineEntropy and it will now be sorted for us.
qsort routine works on the divide-and-conquer principle. It splits the array into two groups and then sorts both the top and the bottom. It does this by calling itself to sort each group. When a routine calls itself, that process is called recursion.
If iHighBound > iLowBound Then intMiddle = ((iLowBound + iHighBound) \ 2) varMidBound = myArray(intMiddle, 0) intX = iLowBound intY = iHighBound
If the upper bound of the array is greater than the lower bound, we want to grab the middle. When we call
iHighBound = 2 and
iLowBound = 0. In the next line we determine the middle of the array – which will be 1. Recall that the entropy value is in position 0 of the array and the classification is in position 1. So
varMidBound is assigned the value from
myArray(1, 0) which is the entropy value of the classification.
Now we loop as long as the lower bound
<= the upper bound variables. If the entropy value stored in element
intX (which will be 0 the first time though) is
>= the entropy value stored in the middle value
And the entropy value stored in the
intY element (which will be 2) is
<= the middle entropy value, we have some switching to do.
Do While intX <= intY '-- if a value lower in the array is > than one -- '-- higher in the array, swap them now -- If myArray(intX, 0) >= varMidBound _ And myArray(intY, 0) <= varMidBound Then
So now we just swap what's in position
intX with what is in position
intY. Then we move the lower and upper boundaries towards each other and do the same test.
varTmp = myArray(intX, 0) sHoldString = myArray(intX, 1) yArray(intX, 0) = myArray(intY, 0) myArray(intX, 1) = myArray(intY, 1) myArray(intY, 0) = varTmp myArray(intY, 1) = sHoldString intX = intX + 1 intY = intY - 1
intX > intY, the code moves down and calls itself – recursion. The routine is called on the lower part of the array to sort it, then the upper part. This is the divide-and-conquer principle.
Call qsort(myArray(), iLowBound, intY) Call qsort(myArray(), intX, iHighBound)
Now don't worry if this part looks a bit cryptic. Years are spent in some computer science courses discussing the various sorts and the math behind them. Luckily we can learn from the masters. It's unfortunate that VB does not have any built-in sorting capabilities like
qsort. They are built into other languages, such as C.
To sort strings, instead of numbers, some programmers add a hidden list box to the form. The Sorted property is set to True. They just use the AddItem method to add strings in any order. When all of the strings are added, they start at position 0 of the list box and retrieve each of the items. They will of course now be in sorted order. The list box is not visible so the user never sees it, but this is a quick and dirty way to get some quick string sorting done. Unfortunately this will not work for numbers. A 2 would come before a –3, for example. Not exactly precise for our needs!
OK, finished! Press F5 to run your program. Don't forget to add a reference to the Microsoft ActiveX Data Objects 2.0 Library if you haven't done so already. Try the analyzing the various products and see how the results stack up. You are now ready to turn this beauty over to the product manager.
One word on the results. Since we only have a single country for each entry, this usually has the lowest entropy. So we can assume that the country is important. What really makes sense to look at is what is in the second position – after country. In this example for Hokkien Fried Mee, our program says that this is really on the decline in not a single country – but Europe in general! Now this is great information. The product manager might be wondering why sales are down in Germany, without realizing that sales across Europe for this product are in a nose-dive!
I think you can see the value of data mining and hopefully there is enough information provided for you to apply this technique to your real world problems. We re-aggregated the data into a single table. We then tried to find some logical classifications and applied the entropy approach to determine which one of them provided the least uncertainty about sales growth. We tried to illustrate a non-traditional use of database information.