
Data Required for Time Series Models
When you prepare data for use in training any data mining model, make sure that you understand the requirements for the particular model and how the data is used.
Each forecasting model must contain a case series, which is the column that specifies the time slices or other series over which change occurs. For example, the data in the previous diagram shows the series for historical and forecasted bicycle sales over a period of several months. For this model, each region is a series, and the date column contains the time series, which is also the case series. In other models, the case series can be a text field or some identifier such as a customer ID or transaction ID. However, a time series model must always use a date, time, or some other unique numeric value for its case series.
The requirements for a time series model are as follows:
-
A single key time column Each model must contain one numeric or date column that is used as the case series, which defines the time slices that the model will use. The data type for the key time column can be either a datetime data type or a numeric data type. However, the column must contain continuous values, and the values must be unique for each series. The case series for a time series model cannot be stored in two columns, such as a Year column and a Month column.
-
A predictable column Each model must contain at least one predictable column around which the algorithm will build the time series model. The data type of the predictable column must have continuous values. For example, you can predict how numeric attributes, such as income, sales, or temperature, change over time. However, you cannot use a column that contains discrete values, such as purchasing status or level of education, as the predictable column.
-
An optional series key column Each model can have an additional key column that contains unique values that identify a series. The optional series key column must contain unique values. For example, a single model can contain sales for many product models, as long as there is only one record for each product name for every time slice.
You can define input data for the Microsoft Time Series model in several different ways. However, because the format of the input cases affects the definition of the mining model, you must consider your business needs and prepare your data accordingly. The following two examples illustrate how the input data affects the model. In both examples, the completed mining model contains patterns for four distinct series:
-
Sales for Product A
-
Sales for Product B
-
Volume for Product A
-
Volume for Product B
In both examples, you can predict new future sales and volume for each product. You cannot predict new values for product or for time.
Example 1: Two Predictable Columns
This example uses the following table of input cases:
|
TimeID
|
Product
|
Sales
|
Volume
|
|---|
|
1/2001
|
A
|
1000
|
600
|
|
2/2001
|
A
|
1100
|
500
|
|
1/2001
|
B
|
500
|
900
|
|
2/2001
|
B
|
300
|
890
|
The TimeID column in the table contains a time identifier, and has two entries for each day. The TimeID column becomes the case series. Therefore, you would designate this column as the key time column for the time series model.
The Product column defines a product in the database. This column contains the product series. Therefore, you would designate this column as a second key for the time series model.
The Sales column describes the gross profits of the specified product for one day, and the Volume column describes the quantity of the specified product that remains in the warehouse. These two columns contain the data that is used to train the model.
Example 2: Four Predictable Columns
Although this example uses basically the same input data as the first example, the input data is structured differently, as shown in the following table:
|
TimeID
|
A_Sales
|
A_Volume
|
B_Sales
|
B_Volume
|
|---|
|
1/2001
|
1000
|
600
|
500
|
900
|
|
2/2001
|
1100
|
500
|
300
|
890
|
In this table, the TimeID column still contains the case series for the time series model, which you designate as the key time column. However, the previous Sales and Volume columns are now split into two columns and each of those columns are preceded by the product name. As a result, only a single entry exists for each day in the TimeID column. This creates a time series model that would contain four predictable columns: A_Sales, A_Volume, B_Sales, and B_Volume.
Furthermore, because you have separated the products into different columns, you do not have to specify an additional series key column. All the columns in the model are either a case series column or a predictable column.