Introduction to machine learning
One definition of machine learning is:
“Machine learning (ML) is the study of computer algorithms that improve automatically through experience.” (source: Wikipedia)
In practice, this means that humans must provide the goal of machine learning to the machine. How to reach that goal is then learned automatically: it is derived from past experience in the form of existing data.
This definition and explanation provide some guidance on when to use machine learning:
- The goal is clear
- It's not clear how to get there, though
- Plenty of suitable historical data is available for training.
In this case, machine learning could be a good solution.
Also in applications built with the Thinkwise Platform, machine learning techniques are often leveraged for calculations that are difficult or impossible to express with regular code. This is possible because historical data are usually available for training a model. Some examples include:
- Price predictions, e.g., house prices
- Project cost and effort predictions
- Risk assessments based on many variables, e.g., insurance risks
- Assigning tickets automatically to the right department (no more need to open, assess, and assign a ticket manually)
- Profit predictions
- Assigning quality labels to red wine, based on various characteristics.
Automated machine learning
One of the main goals of the Thinkwise Platform is to enable developers to focus on the functional aspects of software development instead of the technical aspects.
However, machine learning typically combines software engineering with mathematics and statistics to create algorithms that can learn from experience. In practice, data scientists use domain knowledge to apply these algorithms and create machine learning models. The broad skill set needed to develop these solutions usually does not match the skill set of a typical developer using the Thinkwise Platform.
To allow developers to leverage machine learning with minimal technical hurdles, Automated Machine Learning (AutoML) is available in the Software Factory. This is a solution designed to automate the process of machine learning. The available prediction models can be trained, deployed and applied in a production environment for two types of problems:
- Classification problems focus on assigning a type, group or other predefined classification based on the provided input. Often, this is a domain with elements, such as a risk level or priority.
- Regression problems focus on determining a numerical value based on the provided input. Think of numerical values such as total profit or response time.
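The difference between the two problem types comes down to the kind of value in the target column. The following is a minimal Python sketch of that idea. The helper `problem_type` is hypothetical: it is not part of the Thinkwise Platform and does not reflect how the platform chooses a model internally. It also ignores the case where the labels of a domain are stored as numeric codes, which would still be a classification problem.

```python
from numbers import Number


def problem_type(target_values):
    """Guess the problem type from example target values (illustrative only)."""
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("at least one non-empty target value is required")
    # bool is a subclass of int in Python, so exclude it explicitly:
    # a yes/no target is a label, not a quantity.
    numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "regression" if numeric else "classification"


print(problem_type(["low", "high", "medium"]))   # risk levels
print(problem_type([125000.0, 310500.0, None]))  # sale prices
```

A target holding risk levels is treated as classification, while a target holding sale prices is treated as regression.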
More types of problems will be resolvable with AutoML in future releases, such as forecasting.
The Thinkwise Platform will automatically choose the right type of model for your problem.
- To enable AutoML, ensure Indicium is configured to support Automated Machine Learning. More information here.
- Historical production data is ideal for model training. The best training is achieved with an extensive historical data set that contains relevant data. Although AutoML features a filter for irrelevant data, the training process will be faster if irrelevant data (e.g., primary keys and identities) is filtered out beforehand, especially in large datasets. If only a subset of the records in the table is required, a view can be used to limit the training set.
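To illustrate what filtering beforehand amounts to, here is a small Python sketch that drops an identifier column and keeps only a subset of the records, much like a view would. The function `training_subset`, the `houses` data, and the column names are all made up for this example. In the Software Factory the same effect would be achieved with a view rather than with Python.

```python
def training_subset(rows, exclude_columns, keep_row=lambda row: True):
    """Return only the rows and columns that are relevant for training."""
    excluded = set(exclude_columns)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
        if keep_row(row)
    ]


# Hypothetical source table: house_id is an identity column without
# predictive value, and unsold houses have no sale price to learn from.
houses = [
    {"house_id": 1, "living_area": 120, "sold": True, "sale_price": 250000},
    {"house_id": 2, "living_area": 95, "sold": False, "sale_price": None},
    {"house_id": 3, "living_area": 150, "sold": True, "sale_price": 340000},
]

training_rows = training_subset(
    houses, ["house_id"], keep_row=lambda row: row["sold"]
)
print(training_rows)
```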
Training a model
An AutoML model is based on a specific table, view or materialized query table (mqt). One of the columns can be used as the prediction column for the AutoML model. Other columns of this table can be used as the input that determines the outcome.
The trained AutoML model will be usable for any scenario that provides the required input. Predictions can also be made for different tables or ad-hoc input. The table used to create the AutoML model is merely used for the training data.
Menu Machine learning > AutoML > tab Form
- Select a Table. The data from this table will be used to train the model.
- Select an AutoML model.
Target and predictors
Menu Machine learning > AutoML > tab Target / Predictors
- To successfully set up an AutoML configuration, select one target and one or more predictors.
- At any point during setup, use the Load training data task to retrieve the content of the table that will be used to train the model.
Every potential target and predictor has a type-classification that determines how the AutoML engine interprets the values of this column. This is based on how the column is modeled:
- Binary data has only two options, for example, checkboxes or combos with two options.
- Nominal data is simply a set of labels. Every value for this column can occur in one or more records. There is no ordering in the values and the value itself cannot be used in calculations.
- Ordinal data is an ordered set of labels. Columns using domains that sort on the order number of the elements are considered ordinal.
- Quantitative data is numerical data that can be used in calculations. Not all numerical data is quantitative. Domains with elements and foreign keys are not quantitative.
The trained AutoML model will never be aware of nominal and ordinal values outside of the training set. For this reason, it doesn't make much sense to have free-text fields or non-base data look-ups used as predictors.
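The four type classifications can be summarized as a small decision function. The Python sketch below is one reading of the rules listed above, not the platform's actual implementation: the `Column` class and all of its fields are hypothetical stand-ins for how a column is modeled, and treating foreign keys and free text as nominal is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Column:
    """Hypothetical description of a modeled column (not a platform type)."""
    name: str
    numeric: bool = False
    checkbox: bool = False
    foreign_key: bool = False
    elements: Optional[Sequence[str]] = None  # domain elements, if any
    sorted_on_order_no: bool = False          # elements sorted on order number


def type_classification(column: Column) -> str:
    has_elements = column.elements is not None
    if column.checkbox or (has_elements and len(column.elements) == 2):
        return "binary"
    if has_elements:
        return "ordinal" if column.sorted_on_order_no else "nominal"
    if column.foreign_key:
        # Assumption: look-up values are labels, even when the key is a number.
        return "nominal"
    if column.numeric:
        return "quantitative"
    # Assumption: anything else (e.g. free text) is a set of unordered labels.
    return "nominal"


print(type_classification(Column("has_garage", checkbox=True)))
print(type_classification(Column("priority", elements=["low", "medium", "high"],
                                 sorted_on_order_no=True)))
print(type_classification(Column("living_area", numeric=True)))
print(type_classification(Column("city_id", numeric=True, foreign_key=True)))
```

The last call shows why not all numerical data is quantitative: a numeric foreign key identifies a record and cannot meaningfully be used in calculations.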
Re-training is required:
- when updating the elements of a target domain (without retraining, prediction is not possible)
- when removing elements from the target domain (without retraining, predictions will still be made on the removed elements and errors will occur because these invalid predictions cannot be saved).
- if columns have been renamed or removed since finishing the training
- if new data has been added that deviates from the training data, e.g., if new categories have been added to nominal or ordinal predictor variables.
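The retraining conditions above can be expressed as a simple comparison between the labels known at training time and the labels in use now. The Python sketch below is illustrative only. The function `retraining_reasons` and its inputs (dictionaries that map a column name to its set of labels) are assumptions made for this example, and the platform does not necessarily perform such a check for you.

```python
def retraining_reasons(trained, current, target):
    """Compare label sets per column and list reasons to retrain.

    trained and current map a column name to the set of labels known
    at training time and now, respectively. target is the target column.
    """
    reasons = []
    for column in sorted(trained):
        if column not in current:
            reasons.append(f"column renamed or removed: {column}")
            continue
        unseen = sorted(current[column] - trained[column])
        if unseen:
            reasons.append(f"new labels in {column}: {', '.join(unseen)}")
        removed = sorted(trained[column] - current[column])
        if column == target and removed:
            reasons.append(
                f"labels removed from target {column}: {', '.join(removed)}"
            )
    return reasons


trained = {"alley": {"gravel", "paved"}, "quality": {"A", "B", "C"}}
current = {"alley": {"gravel", "paved", "none"}, "quality": {"A", "B"}}

for reason in retraining_reasons(trained, current, target="quality"):
    print(reason)
```

An empty result means that none of the conditions sketched here apply.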
The AutoML configuration screen
Queueing and training
Once one or more predictors and the target have been chosen and training data has been loaded, training can be queued.
Menu Machine learning > AutoML
- Press the Queue AutoML model for training task to queue training.
Once the configuration has changed status to Queued for training, no further modifications can be done.
Reviewing training results
Menu Machine learning > AutoML > tab Result models
- Open the Result models tab to monitor the various types of models that will be included in the training process. It can take up to 30 seconds before the models are shown here.
The various types of models will automatically be queued for training. Once training starts, the AutoML configuration will change status to Training. Training can take anywhere from a few minutes to many hours, depending on the size of the dataset: mainly the number of columns and, to a lesser extent, the number of rows.
Types of models
When a type of model is done training, performance metrics are shown that indicate the quality of the trained model, based on samples of the training data. By default, this tab page is sorted so that the best-performing model is usually shown at the top once training has finished.
- If the overall quality is good, it is possible to compare the models and select the one that best fits your needs. Different models can, for example, differ on accuracy or precision. In that case it depends on which metric is more important for the goal you wish to achieve with machine learning.
- If all models have a low accuracy score, the overall quality is low. In that case, check the quality of your data set.
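To see why two models can score the same on one metric and differently on another, consider the standard definitions of accuracy (the share of all predictions that are correct) and precision (the share of predictions for one class that are correct). The Python sketch below uses made-up risk levels and two hypothetical models. It does not show how the Software Factory computes its metrics.

```python
def accuracy(actual, predicted):
    """Share of all predictions that match the actual value."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision(actual, predicted, positive):
    """Share of the predictions for `positive` that are actually `positive`."""
    hits = [a for a, p in zip(actual, predicted) if p == positive]
    if not hits:
        return 0.0  # the model never predicted this class
    return sum(a == positive for a in hits) / len(hits)


actual  = ["high", "low", "low", "low", "high", "low"]
model_a = ["high", "high", "low", "low", "high", "low"]  # one false alarm
model_b = ["high", "low", "low", "low", "low", "low"]    # one missed risk

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name, accuracy(actual, predicted), precision(actual, predicted, "high"))
```

Both models make one mistake out of six predictions, so their accuracy is equal. Model B never raises a false alarm for "high", so its precision for that class is higher, but it misses one real high-risk case. Which trade-off is preferable depends on the goal.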
When all types of models have been trained, the AutoML configuration will change status from Training to Training finished.
Only one model is trained at a time. When multiple projects or branches are training AutoML models, it might take longer to see results.
Activating a trained model
Menu Machine learning > AutoML > tab Result models
- Select one of the trained model types as the active model for this AutoML configuration. Use the corresponding tasks at the top of the tab page to select the desired model.
The selected model can be changed afterwards. Process actions can now use this AutoML configuration and will use the selected model for prediction execution.
The AutoML configuration is now Ready for use.
Running a prediction
The easiest way to perform predictions is to queue them in the database and use a scheduled system flow to pick up the queued predictions one by one.
The Run AutoML model process action can only be used in scheduled process flows. Once process flows are available in Universal, this process action can also be used in non-scheduled process flows performed by a user using Universal.
Step 1: Create a scheduled process flow with only the Run AutoML model connector. Select the table the AutoML model has been trained on and select the AutoML model. Have Start point to the process action, and have the process action point to itself on success and to Stop on failure.
A simple flow using a process action to run an AutoML model prediction
Add a schedule to the process flow to have this process flow periodically run the AutoML model. Set the schedule to default if activation by an IAM administrator is not required.
Once an AutoML configuration is in use by a process action, the status will change from Ready for use to Active.
Step 2: Create process flow variables for the predictors, the target and the status code. Also create one or more process flow variables to store the ID of the queued item to predict.
Map the variables for the predictors, target and status code to the process action. It is important to use the same data setup as the training data: the same columns with the same names.
If columns have been renamed or removed since finishing the training, retrain the model.
Step 3: Create the process logic. Mark the process action to use process logic and create a template.
The template should consist of two parts. The first part is to process the results of the previous execution. The second part is used to load a new item to predict from the queue.
The first part should contain the following statements:
- When the status code is -2, the AutoML service is not running. Inform the user accordingly.
- Save the result of the last item
The second part should contain the following statements:
- Load the id and the predictors for the next item from the queue
- Decide whether the process flow should continue
The template will look something like this:
    -- Check the result of the last execution.
    if @status_code = -2
    begin
        -- The AutoML service is not running. Clear the queue and inform the user.
        update prediction_queue
        set sale_price = -1,
            failed = 1
        where sale_price is null;
    end
    else if @status_code = 0
    begin
        -- Store the result of the previous prediction
        update prediction_queue
        set sale_price = coalesce(@sale_price, -1)
        where id = @id;
    end;

    -- Load primary key and the predictors for the next item to predict from the queue
    select top 1
        @id = p.id,
        @above_grade_living_area = h.above_grade_living_area,
        @alley = h.alley,
        @basement_condition = h.basement_condition,
        @basement_exposure = h.basement_exposure,
        @basement_quality = h.basement_quality,
        ...
    from prediction_queue p
    join house h
      on h.id = p.id
     and p.sale_price is null;

    -- Start prediction if a new item was loaded from the queue. Stop the process flow if not.
    if @@ROWCOUNT = 0
    begin
        set @automl_run_model_house_training_data_stop = 10;
        set @automl_run_model_house_training_data_automl_run_model_house_training_data = null;
    end
    else
    begin
        set @automl_run_model_house_training_data_stop = null;
        set @automl_run_model_house_training_data_automl_run_model_house_training_data = 10;
    end;
In this example, the AutoML process action is executed with empty predictors on the first run. The returned status code will be -3. This can be ignored.
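The control flow of this template, including the first run with empty predictors, can be easier to follow as a plain simulation. The Python sketch below mimics the loop between the process action and the process logic. Everything in it is a stand-in: `run_flow`, the in-memory `queue`, and the `predict` callback (which plays the role of the AutoML service) are hypothetical, and stopping on any other status code is an assumption rather than documented behavior.

```python
SERVICE_DOWN, OK, EMPTY_INPUT = -2, 0, -3  # status codes used in this example


def run_flow(queue, predict):
    """Simulate the scheduled flow until the queue is empty.

    queue maps an item id to {"predictors": dict, "sale_price": None or number}.
    predict(predictors) returns (status_code, sale_price).
    """
    current_id, predictors = None, None  # nothing is loaded on the first run
    while True:
        # Process action: run the model with the current variables.
        if predictors is None:
            status, price = EMPTY_INPUT, None
        else:
            status, price = predict(predictors)

        # Process logic, part 1: handle the result of the last execution.
        if status == SERVICE_DOWN:
            for item in queue.values():
                if item["sale_price"] is None:
                    item["sale_price"], item["failed"] = -1, True
        elif status == OK:
            queue[current_id]["sale_price"] = price if price is not None else -1
        elif status != EMPTY_INPUT:
            return  # assumption: treat any other status as a failure and stop

        # Process logic, part 2: load the next item, or stop the flow.
        pending = [i for i, item in queue.items() if item["sale_price"] is None]
        if not pending:
            return
        current_id = pending[0]
        predictors = queue[current_id]["predictors"]


queue = {
    1: {"predictors": {"living_area": 120}, "sale_price": None},
    2: {"predictors": {"living_area": 95}, "sale_price": None},
}
run_flow(queue, predict=lambda p: (OK, p["living_area"] * 2000))
print({item_id: item["sale_price"] for item_id, item in queue.items()})
```

Because the first pass only loads an item, the number of process action executions is one more than the number of queued items.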
Step 4: Synchronize the model to IAM to activate the scheduled process flow.