AutoML
Introduction to machine learning
One definition of machine learning is:
"Machine learning (ML) is the study of computer algorithms that improve automatically through experience." (source: Wikipedia)
In practice, this means that humans must provide the goal of machine learning to the machine. How to reach that goal is then learned automatically, derived from past experience in the form of existing data.
This definition and explanation provide some guidance on when to use machine learning:
- The goal is clear
- It's not clear how to get there, though
- Plenty of suitable historical data is available for training.
In this case, machine learning could be a good solution.
Applications built with the Thinkwise Platform often leverage machine learning techniques for calculations that are difficult or impossible to express with regular code. This is possible because historical data is usually available for training a model. Some examples include:
- Price predictions, for example house prices
- Project cost and effort predictions
- Risk assessments based on many variables, for example insurance risks
- Assigning tickets automatically to the right department (no more need to open, assess and assign a ticket manually)
- Profit predictions
- Assigning quality labels to red wine, based on various characteristics.
Automated machine learning
One of the main goals of the Thinkwise Platform is to enable developers to focus on the functional aspects of software development instead of the technical aspects.
However, machine learning typically combines software engineering with mathematics and statistics to create algorithms that can learn from experience. In the field, data scientists use domain knowledge to apply these algorithms and create machine learning models. The broad skill set needed to develop these solutions usually does not match the skill set of a typical developer using the Thinkwise Platform.
To allow developers to leverage machine learning with minimal technical hurdles, Automated Machine Learning (AutoML) is available in the Software Factory. This is a solution designed to automate the process of machine learning. The available prediction models can be trained, deployed and applied in a production environment for two types of problems:
- Classification problems focus on assigning a type, group or other predefined classification based on the provided input. Often, this is a domain with elements, such as a risk level or priority.
- Regression problems focus on determining a numerical value based on the provided input. Think of numerical values such as total profit or response time.
- More types of problems will be resolvable with AutoML in future releases, such as forecasting.
The Thinkwise Platform will automatically choose the right type of model for your problem.
Prerequisites
- To enable AutoML, ensure Indicium is configured to support Automated Machine Learning. More information here.
- Historical production data is ideal for model training. The best training is achieved with an extensive historical data set, containing relevant data. Although AutoML features a filter for irrelevant data, the training process will be faster if irrelevant data is filtered out beforehand (for example, primary keys and identities), especially in large datasets. If a subset of the records in the table is required, a view can be used to limit the training set.
Training a model
Set-up
An AutoML model is based on a specific table, view or materialized query table (mqt). One of the columns can be used as the prediction column for the AutoML model. Other columns of this table can be used as the input that determines the outcome.
The trained AutoML model will be usable for any scenario that provides the required input. Predictions can also be made for different tables or ad-hoc input. The table used to create the AutoML model is merely used for the training data.
Menu Integration & AI > AutoML > tab Form
- Select a Table. The data from this table will be used to train the model.
- Select an AutoML model.
Target and predictors
Menu Integration & AI > AutoML > tab Target / Predictors
- To successfully set up an AutoML configuration, select one target and one or more predictors.
- At any point in time during set-up, use the Load training data task to retrieve the content of the table which will be used to train the model.
Every potential target and predictor has a type-classification that determines how the AutoML engine interprets the values of this column. This is based on how the column is modeled:
- Binary data has only two options, for example checkboxes or combos with two options.
- Nominal data is simply a set of labels. Every value for this column can occur in one or more records. There is no ordering in the values and the value itself cannot be used in calculations.
- Ordinal data is an ordered set of labels. Columns using domains that sort on the order number of the elements are considered ordinal.
- Quantitative data is numerical data that can be used in calculations. Not all numerical data is quantitative. Domains with elements and foreign keys are not quantitative.
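The classification rules above can be sketched as a small decision function. This is only an illustration of the rules as described, not the Software Factory's implementation: the `Column` class and all of its fields are hypothetical, and treating foreign keys and free-text columns as nominal is an assumption (the text only states that foreign keys are not quantitative).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Hypothetical description of a modeled column (not a Thinkwise API)."""
    name: str
    is_numeric: bool = False
    is_foreign_key: bool = False
    is_checkbox: bool = False
    # Number of domain elements, or None when the domain has no elements.
    element_count: Optional[int] = None
    # True when the domain elements sort on their order number.
    sorts_on_order_no: bool = False


def data_scale(col: Column) -> str:
    """Return 'binary', 'nominal', 'ordinal' or 'quantitative' for a column."""
    if col.is_checkbox or col.element_count == 2:
        return "binary"
    if col.element_count is not None:
        # Domains with elements are labels, ordered or not.
        return "ordinal" if col.sorts_on_order_no else "nominal"
    if col.is_foreign_key:
        # Assumption: a foreign key is a label, even when it is a number.
        return "nominal"
    if col.is_numeric:
        return "quantitative"
    # Assumption: anything else (for example free text) is a set of labels.
    return "nominal"
```

For example, `data_scale(Column("customer_id", is_numeric=True, is_foreign_key=True))` yields `"nominal"`, because a numerical key cannot be used meaningfully in calculations.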
The trained AutoML model will never be aware of nominal and ordinal values outside of the training set. For this reason, it doesn't make much sense to have free-text fields or non-base data look-ups used as predictors.
Re-training is required:
- when updating the elements of a target domain (without retraining, prediction is not possible)
- when removing elements from the target domain (without retraining, predictions will still be made on the removed elements and errors will occur because these invalid predictions cannot be saved).
- if columns have been renamed or removed since finishing the training
- if new data has been added that deviates from the training data, for example, if new categories have been added to nominal or ordinal predictor variables.
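One way to notice that re-training may be needed is to compare the labels in new data with the labels that were present in the training set. The sketch below is a hypothetical helper, not part of the platform; it assumes both data sets are available as lists of dictionaries, and the sample labels in the usage note are invented.

```python
def unseen_labels(training_rows, new_rows, columns):
    """Return {column: sorted labels} for labels that occur in new_rows
    but never occurred in training_rows. Empty values are ignored."""
    result = {}
    for column in columns:
        known = {row[column] for row in training_rows if row.get(column) is not None}
        seen_now = {row[column] for row in new_rows if row.get(column) is not None}
        missing = sorted(seen_now - known)
        if missing:
            result[column] = missing
    return result
```

If the training rows contain the basement quality labels `good` and `fair` and a new row contains `excellent`, the function reports `{"basement_quality": ["excellent"]}`, which signals that the model has never seen this category.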
The AutoML configuration screen
Queueing and training
Once one or more predictors and the target have been chosen and training data has been loaded, training can be queued.
Menu Integration & AI > AutoML
- Press the Queue AutoML model for training task to queue training.
Once the configuration has changed status to Queued for training, no further modifications can be made.
Reviewing training results
Menu Integration & AI > AutoML > tab Result models
- Open the Result models tab to monitor the various types of models that will be included in the training process. It can take up to 30 seconds before the models are shown here.
The various types of models will automatically be queued for training. Once training starts, the AutoML configuration will change status to Training. Training can take anywhere from a few minutes to many hours, depending on the size of the dataset: mainly the number of columns and, to a lesser extent, the number of rows.
Types of models
When a type of model is done training, performance metrics will be shown to indicate the quality of the trained model based on samples of the training data. With the default sorting of this tab page, the best performing model is usually shown on top after training has finished.
- If the overall quality is good, it is possible to compare the models and select the one that best fits your needs. Different models can, for example, differ on accuracy or precision. In that case it depends on which metric is more important for the goal you wish to achieve with machine learning.
- If all models have a low accuracy score, the overall quality is considered low. In that case, check the quality of your data set.
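Choosing between result models by the metric that matters most can be illustrated with a few lines of Python. The model names and metric values below are invented for illustration, and the dictionary layout is an assumption; the actual metrics are shown on the Result models tab.

```python
def best_model(results, metric):
    """Return the name of the result model with the highest value for `metric`."""
    scored = [r for r in results if r.get(metric) is not None]
    if not scored:
        raise ValueError(f"no model reports metric {metric!r}")
    return max(scored, key=lambda r: r[metric])["name"]


# Invented example values, not real training output.
results = [
    {"name": "model_a", "accuracy": 0.91, "precision": 0.84},
    {"name": "model_b", "accuracy": 0.89, "precision": 0.93},
]
```

Here `best_model(results, "accuracy")` selects `model_a`, while `best_model(results, "precision")` selects `model_b`, so the preferred model depends on the goal.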
When all types of models have been trained, the AutoML configuration will change status from Training to Training finished.
Only one model is trained at a time. When multiple models or branches are training AutoML models, it might take longer to see results.
Activating a trained model
Menu Integration & AI > AutoML > tab Result models
- Select one of the trained model types as the active model for this AutoML configuration. Use the corresponding tasks at the top of the tab page to select the desired model.
The selected model can be changed afterwards. Process actions can now use this AutoML configuration and will use the selected model for prediction execution.
The AutoML configuration is now Ready for use.
Running a prediction
The easiest way to perform predictions is to queue them in the database and use a scheduled system flow to pick up the queued predictions one by one.
Currently, the Run AutoML model process action can only be used in scheduled process flows. Once process flows are available in Universal, this process action can also be used in non-scheduled process flows performed by a user using Universal.
Step 1: Create a scheduled process flow with only the Run AutoML model connector. Select the table the AutoML model has been trained on and select the AutoML model. Have the Start point to the process action, and have the process action point to itself on success and to Stop on failure.
A simple flow using a process action to run an AutoML model prediction
Add a schedule to the process flow to have this process flow periodically run the AutoML model. Set the schedule to default if activation by an IAM administrator is not required.
Once an AutoML configuration is in use by a process action, the status will change from Ready for use to Active.
Step 2: Create process flow variables for the predictors, the target and the status code. Also create one or more process flow variables to store the ID of the queued item to predict.
Map the variables for the predictors, target and status codes to the process action. It is important to use the same data set-up as the training data: the same columns with the same name.
If columns have been renamed or removed since finishing the training, retrain the model.
Predictor mapping
Step 3: Create the process logic. Mark the process action to use process logic and create a template.
The template should consist of two parts. The first part is to process the results of the previous execution. The second part is used to load a new item to predict from the queue.
The first part should contain the following statements:
- When the status code is -2, the AutoML service is not running. Inform the user accordingly.
- Save the result of the last item
The second part should contain the following statements:
- Load the id and the predictors for the next item from the queue
- Decide whether the process flow should continue
The template will look something like this:
-- Check the result of the last execution.
if @status_code = -2
begin
    -- The AutoML service is not running. Clear the queue and inform the user.
    update prediction_queue
    set sale_price = -1,
        failed = 1
    where sale_price is null;
end
else if @status_code = 0
begin
    -- Store the result of the previous prediction
    update prediction_queue
    set sale_price = coalesce(@sale_price, -1)
    where id = @id;
end;

-- Load primary key and the predictors for the next item to predict from the queue
select top 1
    @id = p.id,
    @above_grade_living_area = h.above_grade_living_area,
    @alley = h.alley,
    @basement_condition = h.basement_condition,
    @basement_exposure = h.basement_exposure,
    @basement_quality = h.basement_quality,
    ...
from prediction_queue p
join house h
  on h.id = p.id
 and p.sale_price is null;

-- Start prediction if a new item was loaded from the queue. Stop the process flow if not.
if @@ROWCOUNT = 0
begin
    set @automl_run_model_house_training_data_stop = 10;
    set @automl_run_model_house_training_data_automl_run_model_house_training_data = null;
end
else
begin
    set @automl_run_model_house_training_data_stop = null;
    set @automl_run_model_house_training_data_automl_run_model_house_training_data = 10;
end;
In this example, the AutoML process action is executed with empty predictors on the first run. The returned status code will be -3. This can be ignored.
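To make the control flow of the template easier to follow, the loop can be simulated outside the platform. The Python sketch below is only an illustration under assumptions: `run_queue` and the `predict` callback are hypothetical stand-ins for the scheduled process flow and the Run AutoML model action, and the queue is a plain list of dictionaries instead of the `prediction_queue` table.

```python
def run_queue(queue, predict):
    """Simulate the flow: store the previous result, load the next item, repeat.

    `predict(predictors)` is a hypothetical stand-in for the process action and
    must return a (status_code, predicted_value) tuple. Returns the number of
    predictions that were started.
    """
    # First run: the action ran with empty predictors, so the status code is -3.
    status_code, current_id, prediction = -3, None, None
    runs = 0
    while True:
        # Part 1: process the result of the previous execution.
        if status_code == -2:
            # Service not running: clear the queue by marking open items as failed.
            for item in queue:
                if item["sale_price"] is None:
                    item["sale_price"], item["failed"] = -1, True
        elif status_code == 0:
            for item in queue:
                if item["id"] == current_id:
                    item["sale_price"] = prediction if prediction is not None else -1
        # Part 2: load the next item from the queue, or stop when it is empty.
        pending = next((i for i in queue if i["sale_price"] is None), None)
        if pending is None:
            return runs
        current_id = pending["id"]
        status_code, prediction = predict(pending["predictors"])
        runs += 1
```

Like the template, this sketch only handles the status codes 0 and -2. An item that keeps returning another status code stays in the queue and would be picked up again, so a real implementation may want to mark such items as failed as well.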
Step 4: Synchronize the model to IAM to activate the scheduled process flow.
AutoML connectors
In applications built with the Thinkwise Platform, you can use machine learning techniques for calculations that are difficult to write out manually. This is possible because historical data is usually available for training a model. Examples are price, project cost, effort, and profit predictions, risk assessments, automated ticket assignment, or assigning quality labels. For more information, see the Machine learning manual or contact your Thinkwise representative.
Run AutoML model
Starting point Universal GUI | Starting point Win/Web GUI | Starting point system flow (Indicium) | System flow action |
---|---|---|---|
- | - | + | + |
A trained AutoML model can be used in process flows to perform predictions. Before creating this process action, one of the trained AutoML models must be activated. More information about training AutoML models can be found here.
At the moment, this process action is exclusive to scheduled process flows. In the future, this process action will become available to user-initiated flows.
Input options | |
---|---|
[PREDICTOR] | The value of a predictor, used as input for the trained AutoML model |
Output options | |
---|---|
[TARGET] | The predicted value of the target |
Status code | The status code of the executed action: 0 - Successful; -1 - Unsuccessful (unknown); -2 - Unsuccessful (the AutoML service is not running); -3 - Unsuccessful (all predictors are empty); -4 - Unsuccessful (the trained model could not be found); -100 - Unsuccessful (Indicium is not used by the client executing the process flow) |
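When the status code is logged or shown outside the process flow, a simple lookup keeps the messages consistent. The mapping below copies the documented codes; the names `RUN_AUTOML_STATUS` and `describe_status` are hypothetical.

```python
# Status codes of the Run AutoML model action, as listed in the table above.
RUN_AUTOML_STATUS = {
    0: "Successful",
    -1: "Unsuccessful (unknown)",
    -2: "Unsuccessful (the AutoML service is not running)",
    -3: "Unsuccessful (all predictors are empty)",
    -4: "Unsuccessful (the trained model could not be found)",
    -100: "Unsuccessful (Indicium is not used by the client executing the process flow)",
}


def describe_status(code):
    """Return the documented description, or a fallback for undocumented codes."""
    return RUN_AUTOML_STATUS.get(code, f"Unknown status code {code}")
```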
Train classification/regression model
Starting point Universal GUI | Starting point Win/Web GUI | Starting point system flow (Indicium) | System flow action |
---|---|---|---|
- | - | + | + |
AutoML training options
Starting point Universal GUI | Starting point Win/Web GUI | Starting point system flow (Indicium) | System flow action |
---|---|---|---|
- | - | + | + |
Time-series forecasting connector
You can use the time-series forecasting connector in a system flow to run a time-series forecast.
Input parameters
The connector has the following input parameters:
- Input set - This required set must be provided with tabular data in JSON format: an array with an object per row, with at minimum an Index column and a Measurement column. Any additional columns will be interpreted as predictors.
  Example:
  [
    {
      "invoice_date": "1986-12-13",
      "total_amount": 515,
      "is_xmas": false,
      "no_of_machines_active": 15
    },
    {
      "invoice_date": "1986-12-14",
      "total_amount": 514,
      "is_xmas": false,
      "no_of_machines_active": 15
    }
  ]
- Index column - This required input parameter points to a property in the Input and Output set, indicating the index of the time series. The index may only be a date in ISO 8601 format or an integer number.
  Example: invoice_date
- Measurement column - This required input parameter points to a property in the input set, and indicates the value to be forecast. The measurement may only be a number.
  Example: total_amount
- Horizon - This is an alternative to Prediction set. It does not work with predictors.
  This optional value sets the number of future samples to predict. If combined with a date-valued index column, the date indices associated with the forecast depend on the sizes of the intervals between consecutive indices in the Input set. Note that the quality of your results depends on the size and completeness of your dataset.
- Prediction set - This is an alternative to Horizon.
  The prediction set is an optional JSON array with values for the Index column containing the desired indices to be forecast. If the input set contains any predictors, those must be present here as well.
  Example:
  [
    {
      "invoice_date": "1986-12-15",
      "is_xmas": false,
      "no_of_machines_active": 15
    },
    {
      "invoice_date": "1986-12-16",
      "is_xmas": false,
      "no_of_machines_active": 14
    }
  ]
- Confidence levels - This optional parameter represents the levels of confidence, in integer percentages [X, Y, Z], for which to return a confidence interval. A confidence interval contains the points above and below the predicted value for which there is an X% (or Y%, Z%) probability of the actual value lying between those points, assuming the model is correct. The confidence levels must be expressed as one of the following options:
  - Integer value(s) between 1 and 99
  - A comma-separated set
  - A JSON array
  - null
  The lower a provided confidence percentage is, the tighter the band will be around the predicted value.
  If left empty (null), this value falls back to 50, 90.
  Examples:
  - 50
  - 80
  - 50, 80
  - [75, 80]
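The accepted notations for the confidence levels can be normalized before calling the connector. The function below is a sketch of such a client-side check, not the forecasting service's own parser; the name `parse_confidence_levels` is hypothetical, and returning 50, 90 for an empty value is an assumption based on the fallback described above.

```python
import json

DEFAULT_LEVELS = [50, 90]  # documented fallback


def parse_confidence_levels(value):
    """Normalize an integer, a comma-separated string, a JSON array string,
    a list, or None into a list of integer percentages between 1 and 99."""
    if value is None:
        return list(DEFAULT_LEVELS)
    if isinstance(value, int) and not isinstance(value, bool):
        levels = [value]
    elif isinstance(value, str):
        text = value.strip()
        if text.startswith("["):
            levels = json.loads(text)  # JSON array notation, e.g. "[75, 80]"
        else:
            levels = [int(part) for part in text.split(",")]  # e.g. "50, 80"
    else:
        levels = list(value)
    for level in levels:
        if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 99:
            raise ValueError(f"confidence level must be an integer between 1 and 99: {level!r}")
    return levels
```

For example, `parse_confidence_levels("50, 80")` returns `[50, 80]`, and a value such as `"100"` raises a `ValueError` before the connector is called.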
While the forecasting service accepts date strings as index, this is only cosmetic. The algorithms under the hood expect equally spaced data. This means you cannot get back forecasts for arbitrary dates. The intervals between indices within the input and prediction set as well as the interval between the last input and first prediction must be equal. If all else fails, try doing the conversion to and from integers yourself.
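Because the intervals must be equal, it can help to verify the spacing of the input set before running a forecast, and to derive the prediction indices from that same interval. The sketch below assumes a date-valued index and measures the interval in days; both function names are hypothetical and this is not the service's own validation.

```python
from datetime import date


def index_step(input_set, index_column):
    """Return the constant interval between consecutive indices in days.
    Raises ValueError when the rows are not equally spaced."""
    days = [date.fromisoformat(row[index_column]).toordinal() for row in input_set]
    steps = {later - earlier for earlier, later in zip(days, days[1:])}
    if len(steps) != 1:
        raise ValueError(f"index is not equally spaced, found intervals: {sorted(steps)}")
    return steps.pop()


def next_indices(input_set, index_column, count):
    """Return `count` ISO 8601 dates that continue the input set's interval."""
    step = index_step(input_set, index_column)
    last = date.fromisoformat(input_set[-1][index_column]).toordinal()
    return [date.fromordinal(last + step * n).isoformat() for n in range(1, count + 1)]
```

With the two rows of the Input set example, `next_indices(rows, "invoice_date", 2)` returns `["1986-12-15", "1986-12-16"]`, matching the indices in the Prediction set example. Note that monthly data is not equally spaced when measured in days, which is a case where converting the index to integers yourself is the safer route.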
Output parameters
The connector has the following output parameters:
- Status code - The status code can have the following values:
  - 0 - Successful
  - -1 - Unsuccessful: unknown
  - -2 - Unsuccessful: the forecasting service is not running
  - -3 - Unsuccessful
  - -100 - Unsuccessful: Indicium is not used by the client executing the process flow
- Output set - A JSON array identical to the Prediction set, but with the value column added and filled with a predicted value. It also contains values for the confidence(s), if available.
  Example:
  [
    {
      "invoice_date": "1986-12-15",
      "total_amount": 514,
      "confidence":
      {
        "50":
        {
          "upper": 520,
          "lower": 510
        }
      }
    },
    {
      "invoice_date": "1986-12-16",
      "total_amount": 514,
      "confidence":
      {
        "50":
        {
          "upper": 521,
          "lower": 510
        }
      }
    }
  ]
- Fitted set - A JSON array identical to the Input set, but with the value column filled with a fitted value. This data might help visualize how well the algorithm has understood the input set.
  Example:
  [
    {
      "invoice_date": "1986-12-13",
      "total_amount": 515
    },
    {
      "invoice_date": "1986-12-14",
      "total_amount": 515
    }
  ]
- Warnings - An array of warnings that might have affected execution.
  Example:
  ["Default value 'ordinal' was applied as measurement data scale"]
- Errors - A JSON array of errors that might have affected execution. It is only available if the process reported the status code -1.
  Example:
  ["Confidence interval may not exceed 99%"]
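A consumer of the Output set typically needs the forecast together with its confidence band. The sketch below flattens an Output set shaped like the example above into tuples; the function name `confidence_bands` is hypothetical, and the structure of the `confidence` property is taken from that example only.

```python
import json


def confidence_bands(output_json, index_column, measurement_column, level):
    """Return (index, forecast, lower, upper) tuples for one confidence level.
    Lower and upper are None when the level is not present for a row."""
    bands = []
    for row in json.loads(output_json):
        band = row.get("confidence", {}).get(str(level))
        bands.append((
            row[index_column],
            row[measurement_column],
            band["lower"] if band else None,
            band["upper"] if band else None,
        ))
    return bands
```

For the first row of the example, calling the function with level 50 yields `("1986-12-15", 514, 510, 520)`.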