Getting all the Entropy We Can Eat!

# Getting all the Entropy We Can Eat!

Visual Studio 6.0

This next routine, `getEntropy`, is the main routine for determining which classification is the most important for sales growth. We will turn to communication research and rip off one of their discoveries. As mentioned, entropy is a precise measurement of information.

In a nutshell, entropy is a measure of uncertainty in an observation. The entropy explains the amount of uncertainty about a particular outcome.

This is used a lot in determining information that comes over noisy data lines. We, however, have hijacked it for our marketing purposes.

We will be using logarithms, which happen to be a built-in function in VB 6.0. If you are a bit rusty on them, it simply means that logx = y where (2 to the power y) = x, in this case log has base 2. In English, this means that the log to base 2 of any number is any bits that it would take to represent that many numbers. For example, log2(16) = 4. This tells us that it takes 4 bits to uniquely represent 16 different values. So we will explain this in the code below. But hey, don't worry. We will just use the VB built-in log function to do the dirty work for us.

#### Try It Out - Determining the Entropy of a Classification

1.  Please add another function to the `frmID3` form. Name the function `getEntropy` and enter the code as shown. This routine will determine the entropy for the classification and return the entropy to the calling routine, `determineEntropy`.

``````Public Function getEntropy(ByVal sField As String, itotalRecords As Integer) As Single

'--------------------------------------------------------------
'-- First, determine how many unique values of this field are
'-- in the table. Store them and determine the entropy for each
'--------------------------------------------------------------

Dim sSql As String
Dim totalUp As Integer
Dim totalDown As Integer
Dim totalSamples As Integer
Dim fractionalEntropy As Single
Dim entropy As Single
Dim searchValue As String

fractionalEntropy = 0

'-- grab the info here--
sSql = "SELECT * FROM ID3 ORDER BY " & sField
MsgBox "No records in the ID3 Table"
Exit Function
End If

While Not .EOF
totalUp = 0
totalDown = 0
'---------------------------------------------------
'-- Loop through all of the records with like values
'---------------------------------------------------
Do While (searchValue = .Fields.Item(sField))
If !SalesUp = True Then
totalUp = totalUp + 1
Else
totalDown = totalDown + 1
End If
.MoveNext
If (.EOF) Then Exit Do
Loop
'---------------------------------------------------
'-- Now determine the entropy for that group --
'---------------------------------------------------
If (totalUp = 0) Then
fractionalEntropy = -1#
ElseIf (totalDown = 0) Then
fractionalEntropy = -1#
Else
fractionalEntropy = _
-(totalUp / totalSamples) * Log(totalUp / totalSamples) - _
(totalDown / totalSamples) * Log(totalDown / totalSamples)
End If
fractionalEntropy = (fractionalEntropy / itotalRecords)
entropy = entropy + fractionalEntropy
If (.EOF) Then
getEntropy = entropy + fractionalEntropy
Else
End If
Wend
End With

End Function``````

### How It Works

We want all of the records sorted by the particular field we will be examining. So if we are determining the entropy on the `CountryLanguage` classification, for example, all of the records that have `English` in the `CountryLanguage` field will be lumped together.

So what we want to do is loop through the recordset and determine the entropy on each of the records that have the same value in the CountryLanguage field. When the value of the field changes to, say, `German`, we will stop and calculate the value for `English` because we know we have looked at each of those. So by returning the records sorted by the classification field, we can be sure that all like records are together. An easy way to do this is to take the first value passed into our `getEntropy` function, `sField`. We can dynamically build our SQL statement instructing the recordset to be returned ordered by that field. So if we were determining the `CountryLanguage`, all entries for `English` would be together, etc.

``````'-- grab the info here--
sSql = "SELECT * FROM ID3 ORDER BY " & sField

We start by opening the recordset `adoHoldGroup`, which gives us all of the records in the `ID3` table. Now that the recordset is opened, we to know which classification we will be working with. After all, it might be `CountryRegion`, `CountryLanguage` or `Country`. Well, we passed in the value as a parameter to the function, `sField`. Let's say that `sField` contains `Country`. So we use that variable and pass it to the recordset's `Fields.Item()` as a parameter. This will return the current value of that field (`Country`) in the current record. That value, say `Spain`, is assigned to the variable `searchValue`. Please notice that this is another example of a generalized routine. We don't hard code in a value, but permit a parameter passed into the routine, `sField`, to determine what we will be searching on. This allows us to use the same routine to examine each of the three classifications.

``````adoHoldGroup.MoveFirst

As mentioned, we want to handle each category. Let's assume that we passed in `CountryLanguage` as the classification to be evaluated. If the `CountryLanguage` of the current record is `English`, then `English` is assigned to `searchValue`. As long as the value of the current record's `sField` (`CountryLanguage`) is equal to `English`, we loop through the recordset and increment the `totalUp` and `totalDown` fields. We are simply incrementing the `totalUp` or `totalDown` fields if that particular record shows that sales were either up or down. So if we had 4 records from four countries that speak English, and three of the countries experienced sales growth while the fourth did not, `totalUp` would be 3 and `totalDown` would be 1. As long as the `searchValue` (say `English`) is equal to the field being examined (`CountryLanguage`), the `Do` loop looks at each record and increments the `totalUp` or `totalDown`. In addition, since all of the records with like languages are grouped together, we know we will get them all. Now say the `MoveNext` goes to the next record and the `Fields.Item(CountryLanguage)` is now `German`. The code exits the `Do` loop and continues.

``````While Not .EOF
totalUp = 0
totalDown = 0
'---------------------------------------------------
'-- Loop through all of the records with like values
'---------------------------------------------------
Do While (searchValue = .Fields.Item(sField))
If !SalesUp = True Then
totalUp = totalUp + 1
Else
totalDown = totalDown + 1
End If
.MoveNext
If (.EOF) Then Exit Do
Loop``````

## The Scary Part

Essentially, we want to see how many records for countries that speak English had sales that were up and how many were down. So we are going to calculate something called `fractionalEntropy`.

What we want to look at here is the `fractionalEntropy` when both the `totalUp` and `totalDown` are not 0. If one of these were indeed 0, we would get a divide by zero error. So right away this tells us that we want to have enough samples so that typically both the `totalUp` and `totalDown` will have a number greater than zero. We calculate the `fractionalEntropy` because this is currently being calculated for a single, say language. However, we still need to calculate the number for Spanish, French, German, etc. So we call this `fractionalEntropy` because it is a piece of the total.

``````If (totalUp = 0) Then
fractionalEntropy = -1#
ElseIf (totalDown = 0) Then
fractionalEntropy = -1#
Else
fractionalEntropy = _
-(totalUp / totalSamples) * Log(totalUp / totalSamples) - _
(totalDown / totalSamples) * Log(totalDown / totalSamples)
End If``````

What we want to do is look at the frequency of an "up" value out of the total values. Let's say that there were 10 countries with the language of English. If 5 were up, then the probability of the language classification having a positive sales growth is 5 out of 10, or 5/10. Likewise, the probability of the language classification having a negative sales growth is also 5/10. In general terms, we can look at the formula like so:

`Outcome = - Probability(up) log Probability(up) - Probability(down) `

`                   log Probability(down)`

In our example we would then have

`Outcome =  -5/10 log (5/10) - 5/10 log (5/10)`

`Outcome = 1 `(i.e. no change. We can't discern anything here)

The outcome represents the uncertainty of sales going up, based on the information we placed in the `ID3` table. This is NOT meant to be a math book, but only to add an element of interest in data mining and show a simple technique on how we can add another dimension to our knowledge of data mining.

``````       fractionalEntropy = (fractionalEntropy / itotalRecords)
entropy = entropy + fractionalEntropy
If (.EOF) Then
getEntropy = entropy + fractionalEntropy
Else
End If``````

We then divide the `fractionalEntropy` by the number of records in the recordset and assign this to `fractionalEntropy`. Now that we have calculated the `fractionalEntropy` for say, the English language, we add that to the variable `entropy` which represents the entire classification. So we would loop through to the next record in the `ID3` table and determine the `fractionalEntropy` for all countries that speak Spanish. When finished, this `fractionalEntropy` factor for Spanish would be added to the `entropy` variable. So `entropy` is cumulative. Each of the `fractionalEntropy` factors for each language is just added to the variable `entropy`, which is the sum of each of the similar records.

If we have reached the end of the recordset, we assign the result to the name of the function, `getEntropy`, and return that value for the recordset based on the criteria we just analyzed. Otherwise, we update the value of `searchValue` with the current record. The code then jumps to the top of the loop, initializes the variables, and looks at the next group. Notice how we update the `fractionalEntropy` with the results from each grouping of say, `CountryLanguage`.

The final value is then returned to the calling routine, `determineEntropy`, and placed in the array for sorting and display. Let's now take a look at how to sort our array to ensure the lowest entropy value is stored in the first position.

#### Try It Out - Sorting Our Multi-Dimensional Array

1.  Please add a final sub to our `frmID3` form called `qsort`. We will call this to sort the array, `Position`, so the least entropy factor is in the lowest position. Essentially, this will sort our array in ascending order.

``````Public Sub qsort(ByRef myArray() As Variant, ByVal iLowBound As Integer, ByVal iHighBound As Integer)

Dim intX As Integer
Dim intY As Integer
Dim intMiddle As Integer
Dim sHoldString As String
Dim varMidBound As Variant
Dim varTmp As Variant

If iHighBound > iLowBound Then
intMiddle = ((iLowBound + iHighBound) \ 2)
varMidBound = myArray(intMiddle, 0)

intX = iLowBound
intY = iHighBound

Do While intX <= intY
'-- if a value lower in the array is > than one --
'-- higher in the array, swap them now          --
If myArray(intX, 0) >= varMidBound _
And myArray(intY, 0) <= varMidBound Then
varTmp = myArray(intX, 0)
sHoldString = myArray(intX, 1)
myArray(intX, 0) = myArray(intY, 0)
myArray(intX, 1) = myArray(intY, 1)
myArray(intY, 0) = varTmp
myArray(intY, 1) = sHoldString
intX = intX + 1
intY = intY - 1
Else
If myArray(intX, 0) < varMidBound Then
intX = intX + 1
End If
If myArray(intY, 0) > varMidBound Then
intY = intY - 1
End If
End If
Loop
Call qsort(myArray(), iLowBound, intY)
Call qsort(myArray(), intX, iHighBound)

End If
End Sub``````

### How It Works

We call this sub from the `determineEntropy` routine. Notice that we are passing in the array `ByRef`, or by reference. This is so that `qsort` actually has the memory location of the array. Passing in the array by reference permits the `qsort` routine to work on the actual array – not a copy of the array in memory. Passing in by reference is the default behavior of VB. It can make the changes for us here. A quick call to `qsort` from `determineEntropy` will sort the array. We can then print the array from `determineEntropy` and it will now be sorted for us.

The `qsort` routine works on the divide-and-conquer principle. It splits the array into two groups and then sorts both the top and the bottom. It does this by calling itself to sort each group. When a routine calls itself, that process is called recursion.

``````If iHighBound > iLowBound Then
intMiddle = ((iLowBound + iHighBound) \ 2)
varMidBound = myArray(intMiddle, 0)

intX = iLowBound
intY = iHighBound``````

If the upper bound of the array is greater than the lower bound, we want to grab the middle. When we call `qsort` from `determineEntropy`, `iHighBound = 2` and `iLowBound = 0`. In the next line we determine the middle of the array – which will be 1. Recall that the entropy value is in position 0 of the array and the classification is in position 1. So `varMidBound` is assigned the value from `myArray(1, 0)` which is the entropy value of the classification.

Now we loop as long as the lower bound `<=` the upper bound variables. If the entropy value stored in element `intX` (which will be 0 the first time though) is `>=` the entropy value stored in the middle value `And` the entropy value stored in the `intY` element (which will be 2) is `<=` the middle entropy value, we have some switching to do.

``````Do While intX <= intY
'-- if a value lower in the array is > than one --
'-- higher in the array, swap them now          --
If myArray(intX, 0) >= varMidBound _
And myArray(intY, 0) <= varMidBound Then``````

So now we just swap what's in position `intX` with what is in position `intY`. Then we move the lower and upper boundaries towards each other and do the same test.

``````             varTmp = myArray(intX, 0)
sHoldString = myArray(intX, 1)
yArray(intX, 0) = myArray(intY, 0)
myArray(intX, 1) = myArray(intY, 1)
myArray(intY, 0) = varTmp
myArray(intY, 1) = sHoldString
intX = intX + 1
intY = intY - 1``````

When `intX > intY`, the code moves down and calls itself – recursion. The routine is called on the lower part of the array to sort it, then the upper part. This is the divide-and-conquer principle.

``````   Call qsort(myArray(), iLowBound, intY)
Call qsort(myArray(), intX, iHighBound)``````

Now don't worry if this part looks a bit cryptic. Years are spent in some computer science courses discussing the various sorts and the math behind them. Luckily we can learn from the masters. It's unfortunate that VB does not have any built-in sorting capabilities like `qsort`. They are built into other languages, such as C.

To sort strings, instead of numbers, some programmers add a hidden list box to the form. The Sorted property is set to True. They just use the AddItem method to add strings in any order. When all of the strings are added, they start at position 0 of the list box and retrieve each of the items. They will of course now be in sorted order. The list box is not visible so the user never sees it, but this is a quick and dirty way to get some quick string sorting done. Unfortunately this will not work for numbers. A 2 would come before a –3, for example. Not exactly precise for our needs!

OK, finished! Press F5 to run your program. Don't forget to add a reference to the Microsoft ActiveX Data Objects 2.0 Library if you haven't done so already. Try the analyzing the various products and see how the results stack up. You are now ready to turn this beauty over to the product manager.

One word on the results. Since we only have a single country for each entry, this usually has the lowest entropy. So we can assume that the country is important. What really makes sense to look at is what is in the second position – after country. In this example for Hokkien Fried Mee, our program says that this is really on the decline in not a single country – but Europe in general! Now this is great information. The product manager might be wondering why sales are down in Germany, without realizing that sales across Europe for this product are in a nose-dive!

I think you can see the value of data mining and hopefully there is enough information provided for you to apply this technique to your real world problems. We re-aggregated the data into a single table. We then tried to find some logical classifications and applied the entropy approach to determine which one of them provided the least uncertainty about sales growth. We tried to illustrate a non-traditional use of database information.

Show: