Split Data using Recommender Split
Updated: June 2, 2017
This topic describes how to use the Recommender Split option in the Split Data module of Azure Machine Learning. Dividing datasets used for training and testing, either randomly or by some criteria, is an important task in many machine learning workflows. However, data used to train recommendation systems has special requirements that make dividing the data more complex.
The Recommender split option makes this process easier by asking for the type of recommendation model you are working with: for example, are you recommending items, suggesting a rating, or finding related users? It then divides the dataset by criteria you specify, such as how to handle cold users or cold items.
When you split a dataset, the module returns two datasets: one intended for training, and the other for testing or model evaluation. If the input dataset contains any extra data per instance (such as ratings), it is preserved in the output.
For general information about data partitioning for machine learning experiments, see Split Data and Partition and Split.
The Recommender Split option is provided specifically for data used to train recommendation systems. Before you use this option, make sure that your data is in a compatible format: user-item pairs, or user-item-rating triples. For detailed information about the supported data formats, see Train Matchbox Recommender.
Add the Split Data module to your experiment, and connect the dataset you want to split as its input.
For Splitting mode, select Recommender split.
The following options are unique to recommender data, and control how values are divided among training and test sets, or among training and scoring sets. In all cases, you specify a fraction, expressed as a number between 0 and 1.
Fraction of training only users: Specify the fraction of users that should be assigned only to the training dataset. Rows for these users are never used to test the model.
Fraction of test user ratings for training: For each user who is not assigned entirely to the training set or the test set, specify the fraction of that user's ratings to use for training. The remaining ratings for the user go to the test set.
Fraction of cold users: Cold users are users that the system has not previously encountered. Typically, because the system has no information on these users, they are valuable for training, but predictions might be less accurate.
Fraction of cold items: Cold items are items that the system has not previously encountered. Because the system has no information about these items, they are valuable for training, but predictions might be less accurate.
Fraction of ignored users: This option allows the recommender splitter to ignore some users, which lets you train the model on a subset of data. This might be useful for performance reasons. You specify the fraction of users that should be ignored.
Fraction of ignored items: The recommender splitter can ignore some items and train the model on a subset of data. This might be useful for performance reasons. You specify the fraction of items to ignore.
Remove occasionally produced cold items: This option is typically used when the fractions of cold users and cold items are set to zero. It ensures that all entities in the test set are also included in the training set. An item is said to be "occasionally cold" if it is covered only by the test set and it wasn't explicitly chosen as cold. Such items can be produced by steps (4) and (6) in the algorithm described in the How Recommender Data is Split section.
Random seed for recommender: Specify a seed value if you want to split the data the same way every time. Otherwise, by default the input data is randomly split.
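The options above can be collected and range-checked before a split is run. The following Python sketch is illustrative only: the class name, field names, and default values are hypothetical and are not part of the Split Data module. It checks only the rule stated above, that each fraction is a number between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names that mirror the option labels described above.
FRACTION_FIELDS = (
    "training_only_users",
    "test_user_ratings_for_training",
    "cold_users",
    "cold_items",
    "ignored_users",
    "ignored_items",
)


@dataclass
class RecommenderSplitOptions:
    """Illustrative container for the recommender split options."""
    training_only_users: float = 0.0
    test_user_ratings_for_training: float = 0.0
    cold_users: float = 0.0
    cold_items: float = 0.0
    ignored_users: float = 0.0
    ignored_items: float = 0.0
    remove_occasionally_cold_items: bool = False
    random_seed: Optional[int] = None  # None means a different split each run

    def validate(self) -> None:
        """Raise ValueError if any fraction is outside the range 0 to 1."""
        for name in FRACTION_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


# Example values chosen for illustration only.
options = RecommenderSplitOptions(
    training_only_users=0.5,
    test_user_ratings_for_training=0.25,
    random_seed=42,
)
options.validate()
```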
For examples of how to divide a set of ratings and features used for training or testing a recommendation model, we recommend that you review the walkthrough provided with this sample experiment in the Model Gallery: Movie Recommendation
This module requires a dataset that contains at least two rows as input.
If you specify a number as a percentage, or if you use a string that contains the "%" character, the value is interpreted as a percentage.
All percentage values must be within the range (0, 100), not including the values 0 and 100.
If you specify a number or percentage that is a floating point number less than one, and you do not use the percent symbol (%), the number is interpreted as a proportional value.
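The interpretation rules above can be sketched in Python as follows. The function name interpret_size is hypothetical, and the sketch covers only the two cases described here: a string that contains "%", and a floating point number less than one. Any other input is rejected because this topic does not describe how it is handled.

```python
def interpret_size(value):
    """Return the requested size as a fraction between 0 and 1 (exclusive).

    Illustrative only; not the module's actual parsing code.
    """
    if isinstance(value, str) and "%" in value:
        # A string with a percent sign is read as a percentage.
        percent = float(value.replace("%", "").strip())
        if not 0 < percent < 100:
            raise ValueError("percentage must be within (0, 100), exclusive")
        return percent / 100.0

    number = float(value)
    if 0 < number < 1:
        # A floating point number below one is read as a proportion.
        return number

    # Assumption: inputs outside the two documented cases are rejected.
    raise ValueError(f"unsupported size value for this sketch: {value!r}")
```

For example, interpret_size("25%") and interpret_size(0.25) both describe one quarter of the data.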
Input Data Requirements
The recommender splitter works under the assumption the dataset consists only of user-item pairs or user-item-rating triples. Therefore, the Split Data module cannot work on datasets that have more than three columns, to avoid confusion with feature-type data. If your dataset contains too many columns, you might get this error:
Error 0022: Number of selected columns in input dataset does not equal to x
As a workaround, you can use Select Columns in Dataset to remove some columns. You can always add the columns back later, by using the Add Columns module.
Alternatively, if your dataset has many features that you want to use in the model, divide the dataset using a different option, and train the model using Train Model rather than Train Matchbox Recommender.
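Outside the studio, the same workaround amounts to projecting each row onto the user, item, and optional rating columns before splitting. The sketch below uses plain Python dictionaries; the function names and the column names in the example are hypothetical, and the code does not call any Azure Machine Learning API.

```python
MAX_RECOMMENDER_COLUMNS = 3  # user, item, and an optional rating


def check_column_count(rows):
    """Raise ValueError if any row has fewer than two or more than three columns."""
    for row in rows:
        if not 2 <= len(row) <= MAX_RECOMMENDER_COLUMNS:
            raise ValueError(
                f"expected 2 or 3 columns per row, found {len(row)}: {sorted(row)}"
            )


def select_recommender_columns(rows, user_col, item_col, rating_col=None):
    """Keep only the user, item, and optional rating columns of each row."""
    keep = [user_col, item_col]
    if rating_col is not None:
        keep.append(rating_col)
    return [{name: row[name] for name in keep} for row in rows]


# Example data with two extra feature columns (illustrative values).
rows = [
    {"user": "u1", "item": "m1", "rating": 4, "age": 31, "genre": "drama"},
    {"user": "u2", "item": "m1", "rating": 5, "age": 27, "genre": "drama"},
]
triples = select_recommender_columns(rows, "user", "item", "rating")
check_column_count(triples)  # passes once the extra columns are removed
```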
How Recommender Data is Split
The following algorithm is used when splitting data into training and test sets for use with a recommendation model:
1. The requested fraction of ignored items is removed with all associated observations.
2. The requested fraction of cold items is moved to the test set with all associated observations.
3. The requested fraction of ignored users that remain after the first two steps is removed with all associated observations.
4. The requested fraction of cold users that remain after the first two steps is moved to the test set with all associated observations.
5. The requested fraction of training-only users that remain after the first two steps is moved to the training set with all associated observations.
6. For each user that remains after all the previous steps, the requested fraction of test user ratings for training is moved to the training set, and the remainder is moved to the test set. At least one observation is always moved to the training set for each user.
7. If requested, instances that are associated with the occasionally produced cold items can be removed from the test set.
An item is said to be "occasionally cold" if it is covered only by the test set, and it wasn't explicitly chosen as cold. Such items can be produced by steps (4) and (6).
This option is intended for cases where the requested fractions of cold users and cold items are set to zero. Removing occasionally cold items then ensures that all entities in the test set are also included in the training set.
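The steps described in this section can be sketched in Python as follows. This is a minimal illustration, not the module's implementation. The function and parameter names are hypothetical. The sketch also makes several assumptions that this topic does not specify: counts are rounded with Python's round, the cold item fraction is applied to the items that remain after ignored items are removed, the user fractions are applied to the number of users that remain after the item steps, and user and item identifiers are sortable. Each observation is a tuple whose first two fields are the user and the item.

```python
import random
from collections import defaultdict


def _take(rng, pool, count):
    """Randomly remove up to `count` members from the set `pool` and return them."""
    chosen = set(rng.sample(sorted(pool), min(count, len(pool))))
    pool.difference_update(chosen)
    return chosen


def recommender_split(observations, *, ignored_items=0.0, cold_items=0.0,
                      ignored_users=0.0, cold_users=0.0,
                      training_only_users=0.0, test_ratings_for_training=0.0,
                      remove_occasionally_cold=False, seed=None):
    """Split (user, item, ...) tuples into (train, test) lists."""
    rng = random.Random(seed)
    remaining = list(observations)
    train, test = [], []

    # Step 1: remove ignored items with all their observations.
    items = {obs[1] for obs in remaining}
    dropped = _take(rng, items, round(ignored_items * len(items)))
    remaining = [o for o in remaining if o[1] not in dropped]

    # Step 2: move cold items to the test set.
    chosen_cold = _take(rng, items, round(cold_items * len(items)))
    test += [o for o in remaining if o[1] in chosen_cold]
    remaining = [o for o in remaining if o[1] not in chosen_cold]

    # Steps 3 to 5 draw from the users that remain after the first two steps.
    users = {obs[0] for obs in remaining}
    base = len(users)

    # Step 3: remove ignored users.
    dropped = _take(rng, users, round(ignored_users * base))
    remaining = [o for o in remaining if o[0] not in dropped]

    # Step 4: move cold users to the test set.
    cold = _take(rng, users, round(cold_users * base))
    test += [o for o in remaining if o[0] in cold]
    remaining = [o for o in remaining if o[0] not in cold]

    # Step 5: move training-only users to the training set.
    train_only = _take(rng, users, round(training_only_users * base))
    train += [o for o in remaining if o[0] in train_only]
    remaining = [o for o in remaining if o[0] not in train_only]

    # Step 6: divide each remaining user's observations, keeping at
    # least one observation per user in the training set.
    by_user = defaultdict(list)
    for obs in remaining:
        by_user[obs[0]].append(obs)
    for user in sorted(by_user):
        rows = by_user[user]
        rng.shuffle(rows)
        n_train = max(1, round(test_ratings_for_training * len(rows)))
        train += rows[:n_train]
        test += rows[n_train:]

    # Step 7: optionally drop test rows for occasionally cold items, that is,
    # items absent from training that were not explicitly chosen as cold.
    if remove_occasionally_cold:
        train_items = {o[1] for o in train}
        test = [o for o in test if o[1] in train_items or o[1] in chosen_cold]

    return train, test


# Example: 10 users who each rated the same 5 items (illustrative data).
data = [(f"u{u}", f"i{i}", (u + i) % 5 + 1) for u in range(10) for i in range(5)]
train, test = recommender_split(
    data, training_only_users=0.5, test_ratings_for_training=0.25, seed=0)
print(len(train), len(test))
```

With these settings, five users are assigned entirely to the training set, and each of the other five users contributes one observation to training and four to testing.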
Split Data using Regular Expression
Split Data using Split Rows
Split Data using Relative Expression
A-Z Module List