Pipelines - The Config YAML

The config.yaml file plays one of the most significant roles within the workflow of the Core Engine, because it holds all the configuration settings for a specific pipeline run.

Structurally, it has 8 main keys, namely version, timeseries, split, features, labels, trainer, evaluator and preprocessing. While timeseries is an optional key, the rest are mandatory and need to be defined. The remainder of this section focuses on the functionality and the internal structure of each main key.
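
For orientation, a config.yaml therefore has roughly the following top-level shape. This is an abridged, illustrative sketch only; the actual contents of each block are covered in the sections below.

version: 1        # version of the config file
timeseries: {}    # optional; only needed when working with sequential data
split: {}         # how to split the datasource into a train and an eval dataset
preprocessing: {} # default pre-processing behaviour per data type
features: {}      # data columns used as features during the training
labels: {}        # data column used as the label
evaluator: {}     # data columns used for slicing during the evaluation
trainer: {}       # model architecture and training settings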

Version of the config file

Main Key: version

As the name suggests, the main key version indicates the version of the YAML configuration file. In the current build, the running version of the configuration YAML file is 1.
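
In practice, this corresponds to a single top-level entry:

version: 1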

Splitting your dataset

Main Key: split

The key split is used to configure the process of splitting the dataset into a training and an eval dataset. Through this block, the Core Engine allows its users to define more than just a random split. Overall, there are 6 optional keys, as displayed in the table below:

| Parameters     | dtype  | required |
|----------------|--------|----------|
| categorize_by  | string | False    |
| categories     | dict   | False    |
| category_ratio | dict   | False    |
| index_by       | string | False    |
| index_ratio    | dict   | False    |
| where          | list   | False    |

  • categorize_by: string which specifies the name of a selected categorical column
  • categories: dictionary which has the keys train and eval. Under each key, there must be a list of strings defining which category belongs to which split. Cannot be used together with category_ratio
  • category_ratio: dictionary which has the keys train and eval. Under each key, there is a float value determining the split ratio of the categories. Cannot be used together with categories
  • index_by: string which specifies the name of a selected index column
  • index_ratio: dictionary which has the keys train and eval. Under each key, there is a float value determining the ratio of the split (based on the sorted index column if index_by is specified)
  • where: list of strings which defines additional conditions used while querying the datasource

You can use the categorical column to group the data before the split or even to specify which category goes to which split. As for the indexed column, you can use it to sort your data before the split.

Examples

Let's work on the case where your dataset consists of data gathered from 5 different assets in the field for 10 days and your data is timestamped.

If the id of the asset which recorded the data and its timestamp do not play a role, you can use either:

split: {}

which will by default split the dataset into a training (70%) and an eval (30%) dataset in a random manner, or you can define the ratio yourself:

split:
  index_ratio:
    train: 0.8
    eval: 0.2

If you would like the last 2 days of your dataset to end up in the eval dataset, you can sort your data before the split happens by utilizing the index_by key.

split:
  index_by: 'timestamp'
  index_ratio:
    train: 0.8
    eval: 0.2

Even further, maybe the days of the experiment for each asset do not align and you want the last two days of each respective asset in the eval dataset. You can achieve that by using the categorize_by key, which will divide your dataset into categories before the sorting and splitting happens.

split:
  categorize_by: 'asset_id'
  index_by: 'timestamp'
  index_ratio:
    train: 0.8
    eval: 0.2

If you want some of the categories solely in the eval dataset, you can use either the categories or category_ratio key.

With the categories key:

split:
  categorize_by: 'asset_id'
  categories:
    train:
      - "asset_1"
      - "asset_2"
      - "asset_3"
      - "asset_4"
    eval:
      - "asset_5"
  index_by: 'timestamp'
  index_ratio:
    train: 0.8
    eval: 0.2

With the category_ratio key:

split:
  categorize_by: 'asset_id'
  category_ratio:
    train: 0.8
    eval: 0.2
  index_by: 'timestamp'
  index_ratio:
    train: 0.8
    eval: 0.2
important

The categorization happens prior to the sorting and splitting. This means that, in the last two examples above, data points belonging to the categories assigned to the eval dataset are put there first, and the remaining categories then go through the sorting and splitting process.

Finally, you can also use the where key to define a list of conditions over your data points. For instance, if you only want to work on the data from the year 2018, you can use:

split:
  index_ratio:
    train: 0.8
    eval: 0.2
  where:
    - "timestamp >= '2018-01-01T00:00:00'"
    - "timestamp <= '2018-12-31T23:59:59'"

Pre-processing your data

Following the splitting process, the next point to cover is the configuration of the pre-processing steps in your pipeline. In the Core Engine, pre-processing is carried out on a feature level, which implies that in every instance where a feature is specified (for instance, when one defines which features will be used for the training), the pipeline needs to be able to infer what kind of pre-processing step it should apply to the specified feature.

There are two possible ways of achieving this goal. The first approach is to explicitly configure each one of the pre-processing steps for every given feature; the correct way of using this approach will be covered in the following sections. The second approach is to define a set of default behaviours based on each possible data type. However, in order to understand how to define the defaults, one must first understand the possible pre-processing steps within the context of a pipeline.

In the Core Engine, pre-processing falls under 4 different categories:

  1. filling: used to fill the missing datapoints in the dataset. Possible methods include:
    • forward: fill the missing values using the last available value based on an index (only applicable in the timeseries setting)
    • backwards: fill the missing values using the next available value based on an index (only applicable in the timeseries setting)
    • min: fill the missing values with the minimum value in the data column
    • max: fill the missing values with the maximum value in the data column
    • mean: fill the missing values with the mean value in the data column
    • custom: fill the missing values with a custom value, requires parameter 'custom_value'
  2. resampling: used to define sequence-level resampling operations (only applied in the timeseries setting)
    • mean: aggregate values based on their mean value
    • mode: aggregate values based on their mode value
    • median: aggregate values based on their median value
    • threshold: aggregate values based on a condition, requires parameters 'cond', 'c_value', 'threshold', 'set_value'
  3. label_tuning: used to specify a set of pre-processing steps which are dedicated solely to labels
    • leadup: marks the data points within a specified period of time prior to an event as events, requires parameters 'event_value', 'duration'
    • followup: marks the data points within a specified period of time after an event as events, requires parameters 'event_value', 'duration'
    • shift: shifts the labels by the given number of steps, requires parameters 'shift_steps', 'fill_value'
    • map: maps the values within a column onto new values, requires parameter 'mapper'
    • no_tuning: no label tuning is applied
  4. transform: used to define data transformation steps
    • scale_by_min_max: scale the values within a given range, requires parameters 'min', 'max'
    • scale_to_0_1: scale the values between 0 and 1
    • scale_to_z_score: standardization to a mean of 0 and variance of 1
    • tfidf: term frequency–inverse document frequency, requires parameter 'vocab_size'
    • compute_and_apply_vocabulary: create a vocabulary based on the values and apply it to the data column
    • ngrams: compute n-grams over string values, requires parameters 'ngram_range', 'separator'
    • hash_strings: hash strings into buckets, requires parameter 'hash_buckets'
    • bucketize: bucketize the values into a number of buckets, requires parameter 'num_buckets'
    • no_transform: no transformations are applied
caution

The pre-processing steps are applied in the same order as defined above. For instance, on a sequential dataset, the data points will first be filled and then resampled; afterwards the labels are tuned and, ultimately, the transform operations are applied.

Main Key: preprocessing

Now that we have covered each possible pre-processing step within a pipeline, you can start building the configuration block for the defaults under the main key preprocessing.

| Parameters | dtype | required |
|------------|-------|----------|
| integer    | dict  | True     |
| float      | dict  | True     |
| boolean    | dict  | True     |
| string     | dict  | True     |

On its first layer, this block needs to cover the 4 main data types, namely string, integer, float and boolean. For each data type, a dictionary needs to be defined, which then specifies a default behaviour for all 4 pre-processing steps, filling, transform, resampling and label_tuning, using the keys method and parameters.

Example

For instance, assume that you have a dataset which has several data columns containing integer values and you want to fill in the missing data values with the value 42. In that case, the corresponding section of the preprocessing block should look like this:

preprocessing:
  integer:
    filling:
      method: "custom"
      parameters: {custom_value: 42}

However, this example only covers the data type integer and the pre-processing step filling. A complete and proper example of a preprocessing block looks like this:

preprocessing:
  integer:
    filling:
      method: "max"
      parameters: {}
    resampling:
      method: "max"
      parameters: {}
    transform:
      method: "scale_to_z_score"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  float:
    filling:
      method: "mean"
      parameters: {}
    resampling:
      method: "mean"
      parameters: {}
    transform:
      method: "scale_to_z_score"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  boolean:
    filling:
      method: "max"
      parameters: {}
    resampling:
      method: "threshold"
      parameters: {cond: "greater", c_value: 0, threshold: 0, set_value: 1}
    transform:
      method: "no_transform"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  string:
    filling:
      method: "custom"
      parameters: {custom_value: ''}
    resampling:
      method: "mode"
      parameters: {}
    transform:
      method: "compute_and_apply_vocabulary"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
important

The functionality of the filling paradigm depends on whether the pipeline is working on sequential data. For instance, in the case of max being selected as the filling method with a non-sequential dataset, the pipeline will use all the datapoints in the data column to infer the max value, whereas with sequential datasets, this value will be inferred within each sequence.

How to deal with time-series data is covered further down in this chapter.

Feature Selection

In order to have a successful training combined with a comprehensive analysis of your model, it is critical to assign the right columns to the right tasks. In this section, we will cover how to select data columns for training, evaluation and labeling respectively.

Main Key: features

The main key features is used to define which set of features will be used during the training, and it allows the user to modify the pre-processing steps filling, transform (and possibly resampling in the case of sequential data) for each one of these features.

| Parameters    | dtype | required |
|---------------|-------|----------|
| data_column_A | dict  | False    |
| data_column_B | dict  | False    |
| ...           | ...   | ...      |
| data_column_N | dict  | False    |

Structurally, each key under features represents a selected feature. Under each key, the user has the chance to determine which method to use for each pre-processing step. If it is not explicitly defined, the behaviour will be inferred from the preprocessing block based on the data type.

Examples

One of the simplest scenarios is where the method for each pre-processing step of each feature can be inferred from the preprocessing block. In this case, one only needs to specify the names of the data columns as keys, and the values are just empty dictionaries:

features:
  feature_1: {}
  feature_2: {}
  feature_3: {}
  feature_4: {}
  feature_5: {}

Moving on to a slightly more complex configuration, the following example shows how to overwrite the default behaviour. In this scenario, the configuration specifically states that the transform operation on feature_1 will scale the values between -1 and 1 with the help of the method scale_by_min_max, and the filling operation on feature_2 will apply the method mean to its datapoints. The rest of the configuration will be derived from the preprocessing block.

features:
  feature_1:
    transform:
      method: "scale_by_min_max"
      parameters: {min: -1, max: 1}
  feature_2:
    filling:
      method: "mean"
      parameters: {}
  feature_3: {}
  feature_4: {}
  feature_5: {}

Main Key: labels

As the name suggests, the main key labels is used to determine which data column will be used as the label during the training. The inner structure of this block is quite similar to the features block, where the keys denote the selected data columns and the values hold the pre-processing configuration.

| Parameters   | dtype | required |
|--------------|-------|----------|
| label_column | dict  | False    |

However, on top of filling, transform and resampling, 3 more keys play a role in the definition of a label.

  • label_tuning represents a set of possible transformations which are specifically dedicated to labels. Since it is also defined in the defaults, it is an optional key. In the absence of this key, the label tuning method for the selected label will be derived from the preprocessing key.
  • loss holds a string value which defines the loss function to be used for the selected label during the training. Possible selections include 'mse', 'categorical_crossentropy' and 'binary_crossentropy'. It is important to note that this key is required whenever a label is defined.
  • metrics holds a list of string values and it encompasses the selection of metrics to be used during the evaluation. It is also a required key, which means that it needs to hold at least an empty list if there are no selected metrics.
caution

The Core Engine only supports config files with a single label for now.

Examples

For this example, imagine a classical classification scenario with a twist. Let's assume that you want your model to classify an asset by its brand, 'brandA' or 'brandB'. However, the information about the brand of an asset is embedded within a column called 'model_name', which includes three models from each brand, namely 'brandA_model1', 'brandA_model2', 'brandA_model3', 'brandB_model1', 'brandB_model2' and 'brandB_model3'. In such cases, you can use the label tuning method called map and define your own mapping.

labels:
  model_name:
    label_tuning:
      method: 'map'
      parameters:
        mapper:
          brandA_model1: "A"
          brandA_model2: "A"
          brandA_model3: "A"
          brandB_model1: "B"
          brandB_model2: "B"
          brandB_model3: "B"
    loss: 'binary_crossentropy'
    metrics: ['accuracy']

Another good example for this block revolves around a sequential task. Imagine that you have a label column called 'event_label' which has the value '1' when a critical event occurs and '0' when idle, and you would like to detect the critical events in your system based on your sensors. However, you do not just want to detect an event as it happens, you also want to predict it 3 hours prior to the failure. One possible way of labeling your data for such a task is to introduce a lead-up for your labels. In this example, the method leadup is used to mark not just the events but also any datapoints which occurred at most 3 hours prior to an event.

labels:
  event_label:
    label_tuning:
      method: 'leadup'
      parameters:
        event_value: 1
        duration: 10800 # 3 hours in secs
    loss: 'binary_crossentropy'
    metrics: []

Main Key: evaluator

Finally, the main key evaluator determines which data columns will be used in the evaluation of the trained model. Structurally, it is identical to the features block.

| Parameters     | dtype | required |
|----------------|-------|----------|
| slice_column_1 | dict  | False    |
| slice_column_2 | dict  | False    |
| ...            | ...   | ...      |
| slice_column_N | dict  | False    |

important

For the evaluation, your datapoints will be sliced on the columns that are specified in this block. For this reason, it is highly recommended that you select categorical columns in this block.

Examples

For example, if you would like to evaluate your trained model based on a categorical column such as 'country_of_origin', you can simply use:

evaluator:
  country_of_origin: {}

Training your model

When it comes to any ML workflow, one of the most critical steps is to design the model architecture. Similar to the previous steps, the Core Engine handles this in the configuration file through the main key trainer.

Main Key: trainer

Structurally, it includes the following keys:

| Parameters             | dtype   | required |
|------------------------|---------|----------|
| architecture           | string  | True     |
| type                   | string  | True     |
| sequence_length        | integer | False    |
| num_output_units       | integer | True     |
| train_batch_size       | integer | True     |
| eval_batch_size        | integer | True     |
| train_steps            | integer | True     |
| save_checkpoints_steps | integer | True     |
| optimizer              | string  | True     |
| last_activation        | string  | True     |
| layers                 | list    | True     |

  • architecture: string value which determines the architecture of the model; possible values include 'feedforward' for feedforward networks, 'sequence' for LSTM networks and 'sequence_ae' for sequence-to-sequence autoencoders
  • type: string value which defines the type of the problem at hand; possible selections include 'regression', 'classification' and 'autoencoder'
  • train_batch_size: the batch size during the training
  • eval_batch_size: the batch size during the eval steps
  • train_steps: the number of batches which should be processed during the training
  • save_checkpoints_steps: the number of training batches which will indicate the frequency of the validation steps
  • optimizer: the name of the selected optimizer for the training
  • last_activation: the activation function of the last layer in the model
  • num_output_units: the number of output units in the last layer
  • sequence_length: the length of the sequence in one data point (provided only in a sequential problem setting)
  • layers: list of dictionaries which hold the layer configurations

Amongst the keys listed above, the key layers is one of the most important ones. As mentioned, it holds a list of dictionaries, where each dictionary represents a layer within the model definition. The type of the layer is specified under the key type and the number of units for this layer is specified under the key units. The Core Engine currently supports 4 different types of layers (a short illustrative snippet follows the list below):

  1. 'dense': a basic fully-connected dense layer, additional params: 'activation'
  2. 'lstm': a regular LSTM layer, additional params: 'activation', 'return_sequences'
  3. 'dropout': a dropout layer, additional params: 'rate'
  4. 'latent': a latent representation dense layer specifically used in autoencoders
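
To illustrate the additional parameters, a purely hypothetical layers list combining these layer types could look like the snippet below. The activation value 'relu' and the dropout rate of 0.2 are placeholder choices for illustration, not Core Engine defaults.

layers:
  - {type: dense, units: 128, activation: relu}  # dense layer with an explicit activation
  - {type: dropout, rate: 0.2}                   # dropout layer with its 'rate' parameter
  - {type: dense, units: 64}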

We are constantly working to add more options to the trainer. If your model cannot be configured via our YAML config, please write to us at support@maiot.io and we will try to include support for our next release.

Examples

The first example of the trainer block showcases how to build a feed-forward network for a simple regression task. The model has 4 hidden dense layers and, due to the nature of the problem, an output layer with 1 output unit and a 'sigmoid' activation function.

trainer:
  type: regression
  architecture: feedforward
  train_steps: 2000
  save_checkpoints_steps: 200
  train_batch_size: 64
  eval_batch_size: 64
  layers:
    - {type: dense, units: 128}
    - {type: dense, units: 128}
    - {type: dense, units: 128}
    - {type: dense, units: 64}
  last_activation: sigmoid
  num_output_units: 1
  optimizer: adam

The second example covers a multi-class classification model. The model features 2 hidden LSTM layers and 1 hidden dense layer. As for the output layer, it uses 'softmax' as the activation and has 3 output units.

trainer:
  type: classification
  architecture: lstm
  sequence_length: 6
  train_steps: 13500
  save_checkpoints_steps: 1000
  train_batch_size: 64
  eval_batch_size: 64
  layers:
    - {type: lstm, units: 128, return_sequences: true}
    - {type: lstm, units: 128, return_sequences: false}
    - {type: dense, units: 64}
  last_activation: softmax
  num_output_units: 3
  optimizer: adam

Working with timeseries data

With the Core Engine, it is also possible to work on timeseries datasets. However, due to the nature of sequential tasks, a few additional parameters need to be defined to enable this functionality.

Main Key: timeseries (optional)

Structurally, the timeseries block holds 4 mandatory keys and 1 optional key:

| Parameters                   | dtype        | required |
|------------------------------|--------------|----------|
| resampling_rate_in_secs      | int or float | True     |
| trip_gap_threshold_in_secs   | int or float | True     |
| process_sequence_w_timestamp | string       | True     |
| process_sequence_w_category  | string       | False    |
| sequence_shift               | int          | True     |

  • resampling_rate_in_secs defines the resampling rate in seconds; it will be used in the corresponding pre-processing step
  • trip_gap_threshold_in_secs defines a maximum threshold in seconds used to split the dataset into trips. Sequential transformations will occur once the data is split into trips based on this value.
  • process_sequence_w_timestamp specifies which data column holds the timestamp
  • process_sequence_w_category is an optional value which, if provided, will be used to split the data into categories before the sequential processes
  • sequence_shift defines the shift (in datapoints) while extracting sequences from the dataset

Examples

Imagine a scenario where you have an asset in the field which is transmitting sensory data and is only active during a certain period of time every day. However, the different sensors have different transmission frequencies. In order to be able to feed your model with consistent (and equidistant) datapoints, you have to resample your dataset. In that case, you can build the timeseries block as follows:

timeseries:
  resampling_rate_in_secs: 30
  trip_gap_threshold_in_secs: 1800
  process_sequence_w_timestamp: 'timestamp_column'
  sequence_shift: 1

This will instruct the Core Engine to split your data into trips which are at least 30 minutes (1800 seconds) apart from each other and then resample those trips with a rate of 30 seconds, using the 'timestamp_column' as the time index. Moreover, while extracting sequences after the resampling, it will shift by only 1 data point before extracting the next sequence.

You can build on top of this example by using not just one asset but a fleet of assets. In this scenario, multiple assets might be active simultaneously, which means that, during the resampling process, values from one asset might influence the values from another asset. That is exactly where the key process_sequence_w_category comes into play. It is used to split the data into categories before the sequential transformations, so that the integrity of the data remains intact within each category for tasks such as the one explained above. The resulting block is as follows:

timeseries:
  resampling_rate_in_secs: 30
  trip_gap_threshold_in_secs: 1800
  process_sequence_w_timestamp: 'timestamp_column'
  process_sequence_w_category: 'asset_id'
  sequence_shift: 1