Join Data
Updated: October 16, 2015
Joins two datasets
Category: Data Transformation / Manipulation
You can use the Join Data module to perform a join operation on two tables. The module supports the following join types:
Inner Join. An inner join is the typical join operation. It returns the combined rows only when the values of the key columns match.
Left Outer Join. A left outer join returns joined rows for all rows from the left table. When a row in the left table has no matching rows in the right table, the returned row contains missing values for all columns that come from the right table unless you specify a replacement value for missing values.
Full Outer Join. A full outer join returns all rows from the left table (table1) and from the right table (table2).
For each of the rows in the left table that have no matching rows in the right table, the join results include a row containing missing values from the right table.
For each of the rows in the right table that have no matching rows in the left table, the join results include a row containing missing values for all columns from the left table.
Left Semi-Join. A left semi-join returns only the values from the left table when the values of the key columns match.
To perform a join on two datasets, they must be related by a single key column. Composite keys are not supported.
You can see examples of how this module is used by exploring these sample experiments in the Model Gallery:
In the Breast cancer detection sample, Join Data is used to combine the positive training cases with the negative training cases after the proportion of cases has been adjusted.
In the Flight delay prediction sample, Join Data is used to bring together useful features from external datasets.
In the Movie recommendation sample, the datasets are joined so that we can present the recommended movie titles rather than a movie ID.
In the Prediction of student performance sample, Join Data is used to bring in new features.
For additional examples of Join Data, see these sample experiments:
The combined dataset cannot have two columns with the same name. Therefore, if the left and right datasets have any duplicate column names, a numeric suffix is appended to the column names of the right dataset to make them unique. For example, if both datasets had a column named Month, the column from the left dataset would remain as is, and the column from the right dataset would be renamed Month (1).
The algorithm that is used for comparison of key values is hash-forced.
Each column of the joined dataset preserves a categorical type, if the corresponding column of the input dataset is categorical.
In left outer joins, if there are any missing values, a categorical level is created in the left dataset for missing values. This is true even if there are no missing values in the joined (right) dataset.
Name | Type | Description |
|---|---|---|
Dataset1 | First dataset to join | |
Dataset2 | Second dataset to join |
Name | Range | Type | Default | Description |
|---|---|---|---|---|
Join key columns for L | Any | ColumnSelection | Select the join key columns for the first dataset. | |
Join key columns for R | Any | ColumnSelection | Select the join key columns for the second dataset. | |
Match case | Any | Boolean | True | Indicate whether a case-sensitive comparison is allowed in key columns. |
Join type | List | Type | Inner join | Choose a join type. |
Keep right key columns in joined table | Any | Boolean | True | Indicate whether to keep key columns from the second dataset in the joined dataset. |
Name | Type | Description |
|---|---|---|
Results dataset | Result of join operation |
For a list of all exceptions, see Machine Learning Module Error Codes.
Exception | Description |
|---|---|
An exception occurs if one or more specified columns of the dataset couldn't be found. | |
An exception occurs if one or more inputs are null or empty. | |
An exception occurs if the parameter is greater than or equal to the specified value. | |
An exception occurs if the input datasets that are passed to the module should have compatible column types, but they do not. | |
An exception occurs if one or more specified columns have types that are unsupported by the current module. | |
An exception occurs if the number of columns in some of the datasets that are passed to the module is too small. | |
An exception occurs when the column set contains duplicate column names and it is not allowed. | |
An exception occurs if the argument for the passed column set does not apply to any dataset columns. | |
An exception occurs when two objects have to be of the same size, but they are not. |