SMOTE
Updated: May 31, 2016
Increases the number of low-incidence examples in a dataset using synthetic minority oversampling
Category: Data Transformation / Manipulation
You can use the SMOTE module to apply the Synthetic Minority Oversampling Technique to an input dataset. This is a statistical technique for increasing the number of cases in your dataset in a balanced way.
You use SMOTE in datasets that are imbalanced. Typically this means that the class you want to analyze is under-represented. There are many reasons for this: the category you are targeting might be very rare in the population, or the data might simply be difficult to collect. Regardless, SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
The module returns a dataset that contains the original samples, plus an additional number of synthetic minority samples, depending on the percentage you specify.
SMOTE works by generating new instances from existing cases that you supply as input. The new instances are created by taking samples of the feature space for each target class and its nearest neighbors, and generating new examples that combine features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general.
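The idea of combining a target case with one of its neighbors can be sketched in a few lines of Python. This is a minimal illustration of the interpolation step as it is usually described for SMOTE, not the module's actual implementation. The function names `interpolate` and `synthesize` are hypothetical, and the neighbors are assumed to have been found already.

```python
import random

def interpolate(case, neighbor, rng):
    """Return a synthetic point on the line segment between case and neighbor."""
    gap = rng.random()  # one random fraction in [0, 1), shared by all features
    return [c + gap * (n - c) for c, n in zip(case, neighbor)]

def synthesize(case, neighbors, count, rng):
    """Create `count` synthetic cases, each built from a randomly chosen neighbor."""
    return [interpolate(case, rng.choice(neighbors), rng) for _ in range(count)]

rng = random.Random(0)                     # fixed seed for repeatable output
case = [1.0, 10.0]                         # one minority case (two numeric features)
neighbors = [[2.0, 12.0], [0.0, 8.0]]      # its nearest minority neighbors (assumed given)
new_cases = synthesize(case, neighbors, 3, rng)
print(new_cases)
```

Each synthetic case lies between the target case and one neighbor, so every feature value stays within the range spanned by the two original cases.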
SMOTE takes the entire dataset as an input and increases the number of minority cases only. For example, if you have an imbalanced dataset with 1% of Class A (the minority class) and 99% of Class B, entering 100 doubles the number of Class A cases and entering 200 triples it, while the number of Class B cases stays the same.
We recommend that you try using SMOTE with a small dataset to see how it works.
The following example uses the Blood Donation dataset available in Azure Machine Learning Studio.
If you click Visualize on the dataset’s output, you can see that, of the 748 rows or cases in the dataset, there are 570 cases (76%) of Class 0, and 178 cases (24%) of Class 1. Although this isn’t terribly imbalanced, Class 1 represents the people who donated blood, and thus these rows contain the features you want to model. To increase the number of cases, you can set the value of SMOTE percentage in multiples of 100 as follows:
| | Class 0 | Class 1 | Total |
|---|---|---|---|
| Original dataset (equivalent to SMOTE percentage = 0) | 570 (76%) | 178 (24%) | 748 |
| SMOTE percentage = 100 | 570 (62%) | 356 (38%) | 926 |
| SMOTE percentage = 200 | 570 (52%) | 534 (48%) | 1104 |
| SMOTE percentage = 300 | 570 (44%) | 712 (56%) | 1282 |
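The counts in the table follow from simple arithmetic, which the short sketch below reproduces. The helper name `smote_counts` is hypothetical, and the integer division is an assumption that only matters for percentages that are not multiples of 100.

```python
def smote_counts(majority, minority, smote_percentage):
    """Return (majority, minority, total) case counts after oversampling."""
    new_minority = minority * smote_percentage // 100  # synthetic cases added
    total_minority = minority + new_minority           # originals are kept
    return majority, total_minority, majority + total_minority

# Blood Donation dataset: 570 cases of Class 0, 178 cases of Class 1
for pct in (0, 100, 200, 300):
    maj, mino, total = smote_counts(570, 178, pct)
    print(f"SMOTE percentage = {pct}: "
          f"{maj} ({maj / total:.0%}), {mino} ({mino / total:.0%}), total {total}")
```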
Warning |
|---|
Increasing the number of cases using SMOTE is not guaranteed to produce more accurate models. You should try experimenting with different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases influences your model. |
Determine which columns in the dataset to use as input.
Creation of new cases using SMOTE is based on all the columns that you provide as inputs. You cannot select specific columns, or exclude columns. Therefore, if you want to use a limited feature space for building the new cases, you should use Select Columns in Dataset to select those columns before using SMOTE.
Ensure that the column containing the label, or target class, is marked as such.
If there is no label column, use the Edit Metadata module to select the column that contains the class labels, and select Label from the Fields dropdown list.
SMOTE automatically identifies the minority class in the label column, and then gets all examples for the minority class.
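As a rough illustration of this step, the following sketch picks the least frequent value in a label column and collects the matching rows. It is only an approximation of the behavior described here: the function name `minority_rows`, the column name `label`, and the sample rows are all hypothetical, and how the module handles ties or more than two classes is not covered.

```python
from collections import Counter

def minority_rows(rows, label_key):
    """Return the least frequent label and all rows that carry it."""
    counts = Counter(row[label_key] for row in rows)
    minority_label = min(counts, key=counts.get)
    return minority_label, [row for row in rows if row[label_key] == minority_label]

# Made-up rows for illustration only
rows = [
    {"recency": 2, "frequency": 50, "label": 1},
    {"recency": 4, "frequency": 4, "label": 0},
    {"recency": 1, "frequency": 16, "label": 1},
    {"recency": 14, "frequency": 2, "label": 0},
    {"recency": 23, "frequency": 1, "label": 0},
]
label, found = minority_rows(rows, "label")
print(label, len(found))
```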
In the SMOTE percentage option, type a whole number, in multiples of 100, that indicates how many synthetic minority cases to add, expressed as a percentage of the original number of minority cases. For example:
You type 0 (%). The SMOTE module returns exactly the same dataset that you provided as input, adding no new minority cases.
In this dataset, the class proportion has not changed.
You type 100 (%). The SMOTE module generates new minority cases, adding the same number of minority cases that were in the original dataset.
Because SMOTE does not increase the number of majority cases, the proportion of cases of each class has now changed.
You type 200 (%). The module adds twice as many synthetic minority cases as there were minority cases in the original dataset, so the output contains three times the original number of minority cases.
This does not mean that the proportion of minority cases triples. The number of majority cases stays the same, so the resulting proportion depends on the original size of both classes.
Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses when building new cases. A nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features.
By increasing the number of nearest neighbors, you get features from more cases. By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.
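The effect of this option can be seen with a small neighbor search. The sketch below assumes plain, unweighted Euclidean distance over numeric features, which may differ from the distance measure the module uses. The function name `k_nearest` and the sample points are hypothetical.

```python
import math

def k_nearest(target, cases, k):
    """Return the k cases closest to target (excluding target itself), nearest first."""
    others = [c for c in cases if c is not target]
    return sorted(others, key=lambda c: math.dist(target, c))[:k]

# Made-up minority cases with two numeric features
minority = [[1.0, 1.0], [1.1, 0.9], [1.5, 1.4], [4.0, 4.2], [5.0, 5.1]]
target = minority[0]

print(k_nearest(target, minority, 1))  # only the most similar case
print(k_nearest(target, minority, 3))  # also draws on more distant cases
```

With a single neighbor, new cases stay close to the target case. With three neighbors, the candidate set already includes a case from a distant part of the feature space.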
Run the experiment.
The output of the module is a dataset containing the original rows plus some number of added rows with minority cases.
SMOTE never changes the number of majority cases.
Name | Type | Description |
|---|---|---|
Samples | Data Table | A dataset of samples |
Name | Range | Type | Default | Description |
|---|---|---|---|---|
SMOTE percentage | >=0 | Integer | 100 | Amount of oversampling in multiples of 100. |
Number of nearest neighbors | >=1 | Integer | 1 | The number of nearest neighbors from which to draw features for new cases |
Random seed | Any | Integer | 0 | Seed for the random number generator |
Name | Type | Description |
|---|---|---|
Table | Data Table | A Data Table containing the original samples plus an additional number of synthetic minority class samples. The number of new samples is (smotePercent/100)*T, where T is the number of minority class samples. |
