The config.yaml file plays a central role in the workflow of the Core Engine, as it holds all the configuration settings for a specific pipeline run.
Structurally, it has 8 main keys, namely version, split, preprocessing, features, labels, evaluator, trainer and timeseries. While timeseries is an optional key, the rest are all mandatory and need to be defined. The rest of this section focuses on the functionality and the internal structure of each main key.
Version of the config file
As the name suggests, the main key version indicates the version of the YAML configuration file. In the current build, the running version of the configuration YAML file is:
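A minimal sketch; the exact version number is not fixed here, so the value below is a placeholder for whichever version your Core Engine build expects:

```yaml
version: 1  # placeholder value; use the version matching your build
```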
Splitting your dataset
The main key split is used to configure the process of splitting the dataset into a training and an eval dataset. Through this block, the Core Engine allows its users to define more than just a random split. Overall, there are 6 optional keys, as displayed in the table below:
| Key | Description |
| --- | --- |
| categorize_by | string which specifies the name of a selected categorical column |
| categories | dictionary which has the keys train and eval. Under each key, there must be a list of strings defining which category belongs to which split. Cannot be used together with category_ratio. |
| category_ratio | dictionary which has the keys train and eval. Under each key, there is a float value determining the split ratio within the categories. Cannot be used together with categories. |
| index_by | string which specifies the name of a selected index column |
| index_ratio | dictionary which has the keys train and eval. Under each key, there is a float value determining the ratio of the split (based on a sorted index column if index_by is defined). |
| where | list of strings which defines additional conditions while querying the datasource |
You can use the categorical column to group the data before the split or even to specify which category goes into which split. As for the index column, you can use it to sort your data before the split.
Let's work on the case where your dataset consists of data gathered from 5 different assets in the field over 10 days, and your data is timestamped.
If the id of the asset which recorded the data and its timestamp do not play a role, you can simply use:
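A minimal sketch, assuming an empty split block falls back to the defaults:

```yaml
split: {}
```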
This will by default split the dataset into a training (70%) and an eval (30%) dataset in a random manner. Alternatively, you can define the ratio yourself:
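A sketch using index_ratio on its own; without index_by, the split remains random, and the 80/20 ratio is illustrative:

```yaml
split:
  index_ratio:
    train: 0.8
    eval: 0.2
```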
If you would like to use the last 2 days within your dataset as the eval dataset, you can sort your data before the split happens by utilizing the key index_by:
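A sketch assuming the timestamp column is simply called timestamp; with 10 days of data, an 80/20 split over the sorted index leaves the last 2 days in the eval dataset:

```yaml
split:
  index_by: timestamp   # hypothetical name of the timestamp column
  index_ratio:
    train: 0.8
    eval: 0.2
```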
Even further, maybe the days of the experiment for each asset do not align and you want the last two days of each respective asset in the eval dataset. You can achieve that by using the key categorize_by, which will divide your dataset into categories before the sorting and splitting happens.
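The same sketch extended with a hypothetical categorical column asset_id:

```yaml
split:
  categorize_by: asset_id   # hypothetical name of the categorical column
  index_by: timestamp
  index_ratio:
    train: 0.8
    eval: 0.2
```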
If you want some of the categories solely in the eval dataset, you can use either the key categories or the key category_ratio:
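Two sketches, one per key; the asset ids and the ratio are illustrative:

```yaml
# Explicitly assigning categories to the splits:
split:
  categorize_by: asset_id
  categories:
    train: [asset_1, asset_2, asset_3]
    eval: [asset_4, asset_5]
```

```yaml
# Or determining the split by a ratio over the categories:
split:
  categorize_by: asset_id
  category_ratio:
    train: 0.6
    eval: 0.4
```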
The categorization happens prior to the sorting and splitting. This means that, in the last two examples above, data points with certain categories will first be put in the eval dataset, and then the remaining categories will still go through the sorting and splitting process.
Finally, you can also use the key where to define a list of conditions over your data points. For instance, if you want to work on the data from the year 2018, you can use:
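A sketch; the exact condition syntax depends on your datasource, so the strings below are assumptions:

```yaml
split:
  index_ratio:
    train: 0.7
    eval: 0.3
  where:
    - "timestamp >= '2018-01-01'"
    - "timestamp < '2019-01-01'"
```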
Pre-processing your data
Following the splitting process, the next point to cover is the configuration of the pre-processing steps in your pipeline. In the Core Engine, pre-processing is carried out on a feature level, which means that wherever a feature is specified (for instance, when defining which features will be used for the training), the pipeline needs to be able to infer which pre-processing steps it should apply to that feature.
There are two possible ways of achieving this goal. The first approach is to explicitly configure each one of the pre-processing steps for every given feature; the correct way of using this approach will be covered in the following sections. The second approach is to define a set of default behaviours based on each possible data type. However, in order to understand how to define the defaults, one must first understand the possible pre-processing steps within the context of a pipeline.
In the Core Engine, pre-processing is grouped into 4 different categories:
filling: used to fill the missing datapoints in the dataset. Possible methods include:
- forward: fill the missing values using the last available value based on an index (only applicable in the timeseries setting)
- backwards: fill the missing values using the next available value based on an index (only applicable in the timeseries setting)
- min: fill the missing values with the minimum value in the data column
- max: fill the missing values with the maximum value in the data column
- mean: fill the missing values with the mean value in the data column
- custom: fill the missing values with a custom value in the data column, requires parameters 'custom_value'
resampling: used to define sequence-level resampling operations (only applied in the timeseries setting)
- mean: aggregate values based on their mean value
- mode: aggregate values based on their mode value
- median: aggregate values based on their median value
- threshold: aggregate values based on a condition, requires parameters 'cond', 'c_value', 'threshold', 'set_value'
label_tuning: used to specify a set of pre-processing steps which are just dedicated for labels
- leadup: marks the data points within a specified period of time prior to an event as events, requires parameters 'event_value', 'duration'
- followup: marks the data points within a specified period of time after an event as events, requires parameters 'event_value', 'duration'
- shift: shifts the labels by a given number of steps, requires parameters 'shift_steps', 'fill_value'
- map: maps the values within a column into new values, requires parameters 'mapper'
- no_tuning: no label tuning is applied
transform: used to define data transformation steps
- scale_by_min_max: scale the values within a given range, requires parameters 'min', 'max'
- scale_to_0_1: scale the values between 0 and 1
- scale_to_z_score: standardization with a mean of 0 and variance of 1
- tfidf: term frequency–inverse document frequency, requires parameters 'vocab_size'
- compute_and_apply_vocabulary: create a vocabulary based on the values and apply it to the data column
- ngrams: requires parameters 'ngram_range', 'separator'
- hash_strings: hash strings into buckets, requires parameters 'hash_buckets'
- bucketize: requires parameters 'num_buckets'
- no_transform: no transformations are applied
The pre-processing steps are applied in the order defined above. For instance, on a sequential dataset, the data points will first be filled, then resampled; afterwards the labels will be tuned and, ultimately, the transform operations will be applied.
Now that we have covered each possible pre-processing step within a pipeline, we can start building the configuration block for the defaults under the main key preprocessing.
On its first layer, this block needs to cover the 4 main data types, namely integer, float, string and boolean. For each data type, a dictionary needs to be defined, which then specifies a default behaviour for all 4 pre-processing steps using the keys filling, resampling, transform and label_tuning.
For instance, assume that you have a dataset which has several data columns containing integer values and you want to fill in the missing values with the value 42. In that case, the relevant section of the preprocessing block should look like this:
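A minimal sketch; the method/parameters nesting is an assumption about the block layout:

```yaml
preprocessing:
  integer:
    filling:
      method: custom
      parameters:
        custom_value: 42
```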
However, this example only covers the data type integer and the pre-processing step filling. A complete and proper example for a preprocessing block looks like this:
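A sketch covering all 4 data types and all 4 steps; the chosen default methods are illustrative, not prescriptive:

```yaml
preprocessing:
  integer:
    filling: {method: max, parameters: {}}
    resampling: {method: mean, parameters: {}}
    label_tuning: {method: no_tuning, parameters: {}}
    transform: {method: scale_to_z_score, parameters: {}}
  float:
    filling: {method: mean, parameters: {}}
    resampling: {method: mean, parameters: {}}
    label_tuning: {method: no_tuning, parameters: {}}
    transform: {method: scale_to_0_1, parameters: {}}
  string:
    filling: {method: custom, parameters: {custom_value: ''}}
    resampling: {method: mode, parameters: {}}
    label_tuning: {method: no_tuning, parameters: {}}
    transform: {method: compute_and_apply_vocabulary, parameters: {}}
  boolean:
    filling: {method: custom, parameters: {custom_value: false}}
    resampling: {method: mode, parameters: {}}
    label_tuning: {method: no_tuning, parameters: {}}
    transform: {method: no_transform, parameters: {}}
```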
The functionality of the filling paradigm depends on whether the pipeline is working on sequential data. For instance, in the case of max being selected as the filling method with a non-sequential dataset, the pipeline will use all the datapoints in the data column to infer the max value, whereas with sequential datasets, this value will be inferred within sequences.
How to deal with time-series data is covered further down in this chapter.
In order to have a successful training combined with a comprehensive analysis of your model, it is critical to assign the right columns to the right tasks. In this section, we will cover how to select data columns for training, evaluation and labeling respectively.
The main key features is used to define which set of features will be used during the training, and it allows the users to modify the pre-processing steps such as transform (and possibly resampling in the case of sequential data) for each one of these features.
Structurally, each key under features represents a selected feature. Under each key, the user has the chance to determine which method to use for each pre-processing step. If a method is not explicitly defined, the behaviour will be inferred from the main key preprocessing based on the data type.
One of the simplest scenarios is where the method for each pre-processing step of each feature can be inferred from preprocessing. In this case, one only needs to specify the names of the data columns as keys, and the values are just empty dictionaries:
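A minimal sketch with hypothetical column names:

```yaml
features:
  feature_1: {}
  feature_2: {}
  feature_3: {}
```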
Moving on to a slightly more complex configuration, the following example shows how to overwrite the default behaviour.
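A sketch of such a configuration; the nesting mirrors the preprocessing defaults:

```yaml
features:
  feature_1:
    transform:
      method: scale_by_min_max
      parameters:
        min: -1
        max: 1
  feature_2:
    filling:
      method: mean
      parameters: {}
```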
In this scenario, the configuration specifically states that the transform operation on feature_1 will scale the values between -1 and 1 with the help of the method scale_by_min_max, and that the filling operation on feature_2 will apply the method mean to its datapoints. The rest of the configuration will be derived from the main key preprocessing.
As the name suggests, the main key labels is used to determine which data column will be used as the label during the training. The inner structure of this block is quite similar to the block features, where the keys denote the selected data columns and the values include the pre-processing configuration.
However, on top of the pre-processing keys such as resampling, 3 more keys play a role in the definition of a label:
- label_tuning: represents a set of possible transformations which are specifically dedicated to labels. Since it is also defined in the defaults, it is an optional key. In the absence of this key, the label tuning method for the selected label will be derived from the main key preprocessing.
- loss: holds a string value which defines the loss function for the selected label to be used during the training. Possible selections include 'mse', 'categorical_crossentropy' and 'binary_crossentropy'. It is important to note that this key is required whenever a label is defined.
- metrics: holds a list of string values and it encompasses the selection of metrics to be used during the evaluation. It is also a required key, which means that it needs to hold at least an empty list if there are no selected metrics.
The Core Engine only supports config files with a single label for now.
For this example, imagine a classical classification scenario with a twist. Let's assume that you want your model to classify an asset by its brand, 'brandA' or 'brandB'. However, the information about the brand of an asset is embedded within a column called 'modelname', which includes three models from each brand, namely 'brandA_model1', 'brandA_model2', 'brandA_model3', 'brandB_model1', 'brandB_model2' and 'brandB_model3'. In such cases, you can use the label tuning method called map and define your own mapping:
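A sketch, assuming the 'mapper' parameter takes a dictionary from old to new values; the loss and metrics choices are illustrative:

```yaml
labels:
  modelname:
    label_tuning:
      method: map
      parameters:
        mapper:
          brandA_model1: brandA
          brandA_model2: brandA
          brandA_model3: brandA
          brandB_model1: brandB
          brandB_model2: brandB
          brandB_model3: brandB
    loss: binary_crossentropy
    metrics: [accuracy]
```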
Another good example for this block revolves around a sequential task. Imagine that you have a label column called 'eventlabel', which has the value '1' when a critical event occurs and '0' on idle, and you would like to detect the critical events in your system based on your sensors. However, you do not just want to detect an event as it happens, but also to predict it 3 hours prior to the failure. One of the possible ways of labeling your data for such a task is to introduce a lead-up for your labels. In this example, the method leadup is used to mark not just the events but also any datapoints which occurred at most 3 hours prior to any event:
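A sketch, assuming the 'duration' parameter is given in seconds (3 hours = 10800); the loss and metrics choices are illustrative:

```yaml
labels:
  eventlabel:
    label_tuning:
      method: leadup
      parameters:
        event_value: 1
        duration: 10800   # assumed to be in seconds; 3 hours
    loss: binary_crossentropy
    metrics: [accuracy]
```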
Finally, the main key evaluator determines which data columns will be used in the evaluation of the trained model. Structurally, it shares the same structure as the block features.
For the evaluation, your datapoints will be sliced on the columns that are specified in this block. For this reason, it is highly recommended that you select categorical columns in this block.
For example, if you would like to evaluate your trained model based on a categorical column such as 'country_of_origin', you can simply use:
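A minimal sketch, assuming the same empty-dictionary convention as the features block:

```yaml
evaluator:
  country_of_origin: {}
```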
Training your model
When it comes to any ML workflow, one of the most critical steps is to design the model architecture. Similar to the previous steps, the Core Engine handles this through the configuration file, with a main key called trainer. Structurally, it includes the following keys:
architecture: string value which determines the architecture of the model, possible values include 'feedforward' for feedforward networks, 'sequence' for LSTM networks and 'sequence_ae' for sequence-to-sequence autoencoders
type: string value which defines the type of the problem at hand, possible selections include 'regression', 'classification', 'autoencoder'
train_batch_size: the batch size during the training
eval_batch_size: the batch size during the eval steps
train_steps: the number of batches which should be processed through the training
save_checkpoints_steps: the number of training batches after which a checkpoint is saved, which also indicates the frequency of the validation steps
optimizer: the name of the selected optimizer for the training
last_activation: the name of the activation function used in the last layer of the model
num_output_units: the number of output units in the last layer
sequence_length: the length of the sequence in one data point (provided only on a sequential problem setting)
layers: list of dictionaries, which hold the layer configurations
Amongst the keys listed above, the key layers is one of the most important ones. As mentioned, it holds a list of dictionaries, where each dictionary represents a layer within the model definition. The type of the layer is specified under the key type and the number of units for this layer is specified under the key units. As of now, the Core Engine supports 4 different types of layers:
- 'dense': a basic fully-connected dense layer, additional params: 'activation'
- 'lstm': a regular LSTM layer, additional params: 'activation', 'return_sequences'
- 'dropout': a dropout layer, additional params: 'rate'
- 'latent': a latent representation dense layer specifically used in autoencoders
We are constantly working to add more options to the trainer. If your model cannot be configured via our YAML config, please write to us at email@example.com and we will try to include support in our next release.
The first example of the trainer block showcases how to build a feed-forward network for a simple regression task. The model has 4 hidden dense layers and, due to the nature of our problem, an output layer with 1 output unit which has a 'sigmoid' activation function:
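A sketch; the batch sizes, step counts, optimizer and layer sizes are illustrative assumptions, while the architecture, type and output layer follow the description above:

```yaml
trainer:
  architecture: feedforward
  type: regression
  train_batch_size: 32        # illustrative
  eval_batch_size: 32         # illustrative
  train_steps: 10000          # illustrative
  save_checkpoints_steps: 1000
  optimizer: adam             # illustrative
  last_activation: sigmoid
  num_output_units: 1
  layers:
    - {type: dense, units: 128, activation: relu}
    - {type: dense, units: 64, activation: relu}
    - {type: dense, units: 32, activation: relu}
    - {type: dense, units: 16, activation: relu}
```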
In the second example, we cover a model for a multi-class classification problem. The model features 2 hidden LSTM layers and 1 hidden dense layer. In the output layer, it uses 'softmax' for the activation and has 3 output units:
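A sketch under the same assumptions as above; the sequence_length and hidden layer sizes are illustrative:

```yaml
trainer:
  architecture: sequence
  type: classification
  train_batch_size: 32        # illustrative
  eval_batch_size: 32         # illustrative
  train_steps: 20000          # illustrative
  save_checkpoints_steps: 2000
  optimizer: adam             # illustrative
  last_activation: softmax
  num_output_units: 3
  sequence_length: 120        # illustrative; see the timeseries section
  layers:
    - {type: lstm, units: 64, return_sequences: true}
    - {type: lstm, units: 32, return_sequences: false}
    - {type: dense, units: 16, activation: relu}
```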
Working with timeseries data
With the Core Engine, it is also possible to work on timeseries datasets. However, due to the nature of sequential tasks, a few additional parameters need to be defined to enable this functionality.
Structurally, the timeseries block holds 4 mandatory keys and 1 optional key:

| Key | Type | Mandatory |
| --- | --- | --- |
| resampling_rate_in_secs | int or float | True |
| trip_gap_threshold_in_secs | int or float | True |
| process_sequence_w_timestamp | string | True |
| process_sequence_w_category | string | False |
| sequence_shift | int | True |

- resampling_rate_in_secs: defines the resampling rate in seconds; it will be used at the corresponding pre-processing step
- trip_gap_threshold_in_secs: defines a maximum threshold in seconds in order to split the dataset into trips. Sequential transformations will occur once the data is split into trips based on this value.
- process_sequence_w_timestamp: specifies which data column holds the timestamp.
- process_sequence_w_category: an optional key which, if provided, will be used to split the data into categories before the sequential processes
- sequence_shift: defines the shift (in datapoints) while extracting sequences from the dataset
Imagine the scenario where you have an asset in the field which is transmitting sensory data and is only active during a certain period of time every day. However, the different sensors have different transmission frequencies. In order to be able to feed your model with consistent (and equidistant) datapoints, you have to resample your dataset. In that case, you can build the timeseries block as follows:
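A sketch matching the description below; the column name 'timestamp_column' comes from the original example:

```yaml
timeseries:
  resampling_rate_in_secs: 30
  trip_gap_threshold_in_secs: 3600
  process_sequence_w_timestamp: timestamp_column
  sequence_shift: 1
```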
This will instruct the Core Engine to split your data into trips which are at least 1 hour away from each other and then resample those trips with a rate of 30 seconds using the 'timestamp_column' as the time index. Moreover, while extracting sequences after the resampling, it will only shift by 1 data point before extracting the next sequence.
You can even build on top of this example by using not just one asset but a fleet of assets. In this scenario, multiple assets might be functional simultaneously, which means that, during the resampling process, values from one asset might influence the values from another asset. That's exactly where the key process_sequence_w_category comes into play. It is used to split the data into categories before the sequential transformations, so that the integrity of the data remains intact within categories for tasks such as the one explained above. The resulting block is as follows:
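The same sketch extended with a hypothetical category column:

```yaml
timeseries:
  resampling_rate_in_secs: 30
  trip_gap_threshold_in_secs: 3600
  process_sequence_w_timestamp: timestamp_column
  process_sequence_w_category: asset_id   # hypothetical name of the category column
  sequence_shift: 1
```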