Evaluate Recommender
Updated: June 27, 2017
Evaluates the accuracy of recommender model predictions
Category: Machine Learning / Evaluate
You can use the Evaluate Recommender module to measure the accuracy of predictions made by a recommendation model. There are four different kinds of predictions that you can evaluate:
Ratings predicted for a given user and item
Items recommended for a given user
A list of users found to be related to a given user
A list of items found to be related to a given item
Each of these four prediction kinds returns a scored dataset with a different column format, containing either user-item-rating triples, users and their recommended items, users and their related users, or items and their related items.
The Evaluate Recommender module deduces the kind of prediction from the column format of the scored dataset and applies the appropriate performance metrics; the metrics that are reported therefore depend on the kind of prediction that was made.
Note that the Score Matchbox Recommender module produces scored datasets that can be understood by Evaluate Recommender.
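The dispatch on column format can be pictured with a small sketch. The following Python snippet is only an illustration of the idea; the column names (User, Item, Rating, and so on) and the detection rules shown here are assumptions for the example, not the module's published logic.

```python
# Illustrative sketch only: infer the kind of prediction from the column
# layout of a scored dataset. Column names are assumptions; the module's
# actual detection rules are not spelled out in this article.

def infer_prediction_kind(columns):
    names = [c.lower() for c in columns]
    if len(names) == 3 and names[2] == "rating":
        return "rating prediction"    # user-item-rating triples
    if names[0] == "user" and all("item" in c for c in names[1:]):
        return "item recommendation"  # a user and the items recommended for them
    if all("user" in c for c in names):
        return "related users"        # a user and their related users
    return "related items"            # an item and its related items

print(infer_prediction_kind(["User", "Item", "Rating"]))    # rating prediction
print(infer_prediction_kind(["User", "Item 1", "Item 2"]))  # item recommendation
```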
For an end-to-end walkthrough of building a recommendation system, see this tutorial from the .NET development team, which includes sample code and a discussion of how to call Azure Machine Learning from an application: Building recommendation engine for .NET applications using Azure Machine Learning
The Evaluate Recommender module compares the predictions output by the recommendation model with the corresponding "ground truth" data, and from this determines the accuracy of predictions. To do this, it requires two inputs:
Test dataset. The test dataset must contain the "ground truth" data in the form of user-item-rating triples.
Use the Split Data module in Recommender Split mode to produce a training dataset and a test dataset from an existing dataset of user-item-rating triples. (A rough sketch of this kind of split appears after the description of the two inputs.)
Scored dataset. The second input, the scored dataset, contains the predictions that were generated by the recommendation model.
The columns included in this second dataset depend on the kind of prediction that was made. For example, the scored dataset might contain user-item-rating triples, a list of users and their recommended items, a list of users and their related users, or a list of items and their related items.
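As a rough, illustrative stand-in for the Recommender Split mode mentioned above, the following sketch holds out a fraction of each user's ratings as a test set. It captures only the general idea; the real Split Data module offers additional options (such as cold-start handling) that are not modeled here.

```python
import random
from collections import defaultdict

def recommender_style_split(triples, test_fraction=0.25, seed=42):
    """triples: iterable of (user, item, rating). Returns (train, test) lists.
    Holds out roughly test_fraction of each user's ratings for the test set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, rating in triples:
        by_user[user].append((user, item, rating))
    train, test = [], []
    for rows in by_user.values():
        rng.shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        test.extend(rows[:n_test])    # held-out "ground truth" ratings
        train.extend(rows[n_test:])   # ratings available for model training
    return train, test
```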
The performance metrics that are created depend on the type and order of the columns in the scored dataset. For details, see these sections:
How to Evaluate Predicted Ratings
How to Evaluate Item Recommendations
How to Evaluate Predictions of Related Users
How to Evaluate Predictions of Related Items
How to Evaluate Predicted Ratings
When evaluating rating predictions, the scored dataset (the second input to Evaluate Recommender) contains user-item-rating triples.
The first column of the dataset contains user identifiers.
The second column contains the item identifiers.
The third column contains the corresponding user-item ratings.
In rating prediction, the parameter settings of Evaluate Recommender have no effect on evaluation.
For evaluation to succeed, the column names must match those produced by the Score Matchbox Recommender module for this kind of prediction.
Evaluate Recommender compares the ratings in the ground truth dataset to the predicted ratings of the scored dataset, and computes the mean absolute error (MAE) and the root mean squared error (RMSE).
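As a concrete illustration, the following Python sketch computes MAE and RMSE over the user-item pairs that appear in both datasets. The triple format is as described above; the function name and the pair-matching logic are illustrative assumptions rather than the module's internal implementation.

```python
import math

def rating_prediction_metrics(ground_truth, predictions):
    """Both inputs are iterables of (user, item, rating) triples.
    Returns (MAE, RMSE) over the user-item pairs present in both datasets."""
    truth = {(u, i): r for u, i, r in ground_truth}
    errors = [abs(truth[(u, i)] - r) for u, i, r in predictions if (u, i) in truth]
    if not errors:
        raise ValueError("no user-item pairs in common")
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Two predicted ratings compared against their ground-truth values.
print(rating_prediction_metrics([("u1", "a", 4), ("u1", "b", 2)],
                                [("u1", "a", 3.5), ("u1", "b", 2.5)]))
# -> (0.5, 0.5)
```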
How to Evaluate Item Recommendations
When evaluating item recommendation, the scored dataset that you provide as the second input to Evaluate Recommender should contain the recommended items for each user.
The first column of the dataset must contain the user identifiers.
All subsequent columns contain the identifiers of the recommended items, ordered by how relevant each item is to the user, with the most relevant item first.
In item recommendation, the parameter settings of Evaluate Recommender have no effect on evaluation.
For Evaluate Recommender to work, the column names must match those produced by the Score Matchbox Recommender module for this kind of prediction.
Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG) and returns it in the output dataset.
Because it is impossible to know the actual "ground truth" for the recommended items, Evaluate Recommender uses the user-item ratings in the test dataset as gains in the computation of the NDCG. For evaluation to work, the recommender scoring module must produce recommendations only for items that have ground truth ratings in the test dataset.
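A minimal sketch of this per-user computation follows, assuming standard logarithmic (log2) discounting and normalization against the ideal ordering; the exact discount used by the module is not spelled out in this article. The value reported by Evaluate Recommender is the average of this per-user NDCG over all users in the scored dataset.

```python
import math

def ndcg_for_user(recommended_items, test_ratings):
    """recommended_items: item ids in ranked order (most relevant first).
    test_ratings: dict mapping item -> ground-truth rating for this user.
    Gains are the test-set ratings; log2 discounting is an assumption."""
    gains = [test_ratings[item] for item in recommended_items]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(g / math.log2(rank + 2)
                for rank, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

# Three recommended items for one user, scored against that user's test ratings.
print(ndcg_for_user(["a", "b", "c"], {"a": 5, "b": 3, "c": 4}))
```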
How to Evaluate Predictions of Related Users
When evaluating the prediction of related users, the scored dataset (second input to Evaluate Recommender) must contain the related users for each user of interest.
The first column contains the identifiers of the users of interest.
All subsequent columns contain the identifiers of the predicted related users, ordered by how related they are to the user of interest (most related user first).
You can influence evaluation by setting the minimum number of items that a user of interest and each related user must have rated in common.
For Evaluate Recommender to work, the column names must match those produced by the Score Matchbox Recommender module for this kind of prediction.
Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG), based on Manhattan (L1 Sim NDCG) and Euclidean (L2 Sim NDCG) distances, and returns both values in the output dataset.
Since there is no actual ground truth for the related users, Evaluate Recommender uses the following procedure to compute the average NDCGs. For each user of interest in the scored dataset:
Find all items in the test dataset which have been rated by both the user of interest and the related user under consideration.
Create two vectors from the ratings of these items, one for the user of interest, and one for the related user under consideration.
Compute the gain as the similarity of the resulting two rating vectors, in terms of their Manhattan (L1) or Euclidean (L2) distance.
Compute the L1 Sim NDCG and the L2 Sim NDCG, using the gains of all related users.
The reported NDCGs are averaged over all users of interest in the scored dataset.
In other words, the gain is computed as the similarity (from the normalized Manhattan or Euclidean distance) between a user of interest (the entry in the first column of the scored dataset) and a given related user (the entry in the n-th column of the scored dataset). The gain of this user pair is computed using all items that both users have rated in the original data (the test set). The NDCG is then computed by aggregating the individual gains for a single user of interest and all of their related users, using logarithmic discounting. That is, one NDCG value is computed for each user of interest (each row in the scored dataset). The number that is finally reported is the arithmetic average over all users of interest in the scored dataset (that is, over its rows).
Hence, for evaluation to work, the recommender scoring module must predict only related users who have rated items with ground truth ratings in the test dataset.
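The following sketch illustrates the per-pair gain and the per-row NDCG described above. The transform from distance to similarity (here 1 / (1 + distance)) and the log2 discounting are illustrative assumptions; the article only states that the gain is a similarity derived from the Manhattan (L1) or Euclidean (L2) distance of the two rating vectors. Pairs with fewer common items than the configured minimum would be excluded before this step, and the final L1 Sim NDCG and L2 Sim NDCG values are the averages of the per-row NDCG over all users of interest.

```python
import math

def similarity_gain(ratings_a, ratings_b, common_items, metric="L1"):
    """Gain for one (user of interest, related user) pair, computed from the
    items both users rated in the test dataset.
    ratings_a, ratings_b: dicts mapping item -> rating for each user."""
    diffs = [ratings_a[i] - ratings_b[i] for i in common_items]
    if metric == "L1":
        distance = sum(abs(d) for d in diffs)            # Manhattan distance
    else:
        distance = math.sqrt(sum(d * d for d in diffs))  # Euclidean distance
    return 1.0 / (1.0 + distance)  # assumed distance-to-similarity transform

def sim_ndcg(related_gains):
    """NDCG for one user of interest, given the gains of the related users
    in the order in which they were predicted (most related first)."""
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(related_gains))
    ideal = sum(g / math.log2(rank + 2)
                for rank, g in enumerate(sorted(related_gains, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

# One user of interest with two predicted related users; all rated items "a" and "b".
u = {"a": 5, "b": 3}
print(sim_ndcg([similarity_gain(u, {"a": 4, "b": 3}, ["a", "b"]),
                similarity_gain(u, {"a": 1, "b": 5}, ["a", "b"])]))
# -> 1.0 (the predicted order already matches the gain order)
```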
How to Evaluate Predictions of Related Items
When evaluating the prediction of related items, the scored dataset (second input to Evaluate Recommender) must contain the related items for each item of interest.
The first column contains the identifiers of the items of interest.
All subsequent columns contain the identifiers of the predicted related items, ordered by how related they are to the item of interest (most related item first).
You can influence evaluation by setting the minimum number of users that must have rated both the item of interest and a related item.
For Evaluate Recommender to work, the column names must match those produced by the Score Matchbox Recommender module for this kind of prediction.
Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG) based on Manhattan (L1 Sim NDCG) and Euclidean (L2 Sim NDCG) distances and returns both values in the output dataset.
Since there is no actual ground truth for the related items, Evaluate Recommender computes the average NDCGs as follows.
For each item of interest in the scored dataset:
Find all users in the test dataset who have rated both the item of interest and the related item under consideration.
Create two vectors from the ratings of these users, one for the item of interest and one for the related item under consideration.
Compute the gain as the similarity of the resulting two rating vectors in terms of their Manhattan (L1) or Euclidean (L2) distance.
Compute the L1 Sim NDCG and the L2 Sim NDCG using the gains of all related items.
The reported NDCGs are averaged over all items of interest in the scored dataset.
In other words, the gain is computed as the similarity (from the normalized Manhattan or Euclidean distance) between an item of interest (the entry in the first column of the scored dataset) and a given related item (the entry in the n-th column of the scored dataset). The gain of this item pair is computed using all users who have rated both of these items in the original data (the test set). The NDCG is then computed by aggregating the individual gains for a single item of interest and all of its related items, using logarithmic discounting. That is, one NDCG value is computed for each item of interest (each row in the scored dataset). The number that is finally reported is the arithmetic average over all items of interest in the scored dataset (that is, over its rows).
Hence, for evaluation to work, the recommender scoring module must predict only related items that have ground truth ratings in the test dataset.
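The related-items computation mirrors the related-users one, with the roles of users and items swapped. The sketch below builds the two rating vectors for one item pair directly from test-set triples; as before, the 1 / (1 + distance) transform is an illustrative assumption rather than the module's documented formula.

```python
import math

def item_pair_gains(test_triples, item_of_interest, related_item):
    """Return the (L1, L2) similarity gains for one item pair, computed from
    the users who rated both items in the test dataset."""
    ratings_by_item = {}
    for user, item, rating in test_triples:
        ratings_by_item.setdefault(item, {})[user] = rating
    a = ratings_by_item.get(item_of_interest, {})
    b = ratings_by_item.get(related_item, {})
    common_users = sorted(set(a) & set(b))   # users who rated both items
    diffs = [a[u] - b[u] for u in common_users]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    return 1.0 / (1.0 + l1), 1.0 / (1.0 + l2)

triples = [("u1", "x", 5), ("u1", "y", 4), ("u2", "x", 2), ("u2", "y", 1)]
print(item_pair_gains(triples, "x", "y"))   # L1 distance 2, L2 distance sqrt(2) -> (1/3, ~0.414)
```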
For examples of how recommendation models are used in Azure Machine Learning, see these sample experiments in the Model Gallery:
The Movie recommender sample demonstrates how to train, evaluate, and score using a recommendation model.
See this blog for a detailed write-up of how to build a movie recommendation model: Building recommendation engine for .NET applications using Azure Machine Learning
Expected inputs
| Name | Type | Description |
|---|---|---|
| Test dataset | Data Table | Test dataset |
| Scored dataset | Data Table | Scored dataset |
Module parameters
| Name | Range | Type | Default | Description |
|---|---|---|---|---|
| Minimum number of items that the query user and the related user must have rated in common | >=1 | Integer | 2 | Specify the minimum number of items that must have been rated by both the query user and the related user. This parameter is optional. |
| Minimum number of users that the query item and the related item must have been rated by in common | >=1 | Integer | 2 | Specify the minimum number of users that must have rated both the query item and the related item. This parameter is optional. |
Outputs
| Name | Type | Description |
|---|---|---|
| Metric | Data Table | A table of evaluation metrics |
Exceptions
For a list of all module errors, see Error Code.
| Exception | Description |
|---|---|
| Error 0022 | Exception occurs if the number of selected columns in the input dataset does not equal the expected number. |
| Error 0003 | Exception occurs if one or more inputs are null or empty. |
| Error 0017 | Exception occurs if one or more specified columns have a type that is unsupported by the current module. |
| Error 0034 | Exception occurs if more than one rating exists for a given user-item pair. |
| Error 0018 | Exception occurs if the input dataset is not valid. |
| Error 0002 | Exception occurs if one or more parameters could not be parsed, or could not be converted from the specified type to the type required by the target method. |