Pipelines - Creating a config

As mentioned in the previous chapters, the pipelines in the Core Engine are defined by a configuration file. The Command Line Interface of the Core Engine provides its users with two possible ways of generating such a file.

tip

Before using either approach, make sure that you have already selected a workspace and a datasource. You can check this via:

cengine workspace list
cengine datasource list

For both approaches, the general form of the main command is:

cengine pipeline configure [--input_path] [--output_path] SUBCOMMAND1 [OPTIONS1] SUBCOMMAND2 [OPTIONS2] ..
| type | name | dtype | description | required |
| --- | --- | --- | --- | --- |
| option | --input_path | string | Path to an existing config file for a warm start | False |
| option | --output_path | string | Path to save the final configuration file to | True |

Besides the options above, there is a set of sub-commands available for configure, and each sub-command represents a block within the config file, namely sequence, labels, features, evaluator, split, trainer and preprocessing. The selection of your approach depends on whether you use these sub-commands or not.

caution

Regardless of which approach you choose, the CLI will generate a configuration YAML file in the currently running version.

If you would like to submit a configuration file with a version different from the running version, please make sure that it complies with the standards of the selected version.

First approach: Using the questionnaire

Using cengine pipeline configure without any of the sub-commands will trigger a questionnaire, where the user needs to respond to a set of prompts regarding the data and the model. Since this option also provides comprehensive explanations, it is especially recommended for new users.

cengine pipeline configure [--output_path]
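
For example, a concrete invocation could look as follows (the output path is illustrative):

cengine pipeline configure --output_path my_pipeline_config.yaml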

Second approach: Using the sub-commands

cengine pipeline configure [--input_path] [--output_path] \
[split] \
[--ratio] [--sort_by] [--category_column] [--category_ratio] [--train_category] [--eval_category] \
[sequence] \
[--timestamp_column] [--resampling_rate] [--trip_gap] [--sequence_shift] [--category_column] \
[trainer] \
[--trainer_type] [--trainer_architecture] [--sequence_length] [--num_output_units] \
[--train_batch_size] [--eval_batch_size] [--train_steps] [--save_checkpoints_steps] \
[--optimizer] [--last_activation] \
[labels] \
[--label] \
[features] \
[--feature] [--blacklist] [--remove_labels] [--regex] \
[evaluator] \
[--slice] \
[preprocessing] \
[--defaults_path]

For the sub-command split:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --ratio | float | defines the ratio of the train dataset, needs to be between 0 and 1 | True |
| --sort_by | str | if specified, the data will be sorted by this column before the split | False |
| --category_column | str | if specified, the data will be categorized by this column before the split | False |
| --category_ratio | float | if specified, the ratio of the categories which will be put in the training dataset | False |
| --train_category | str | name of a category which will be put in the training dataset, can be used multiple times | False |
| --eval_category | str | name of a category which will be put in the eval dataset, can be used multiple times | False |
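
For instance, an 80/20 split which sorts the data by a column beforehand could be configured as follows (the output path and column name are illustrative):

cengine pipeline configure --output_path config.yaml \
    split --ratio 0.8 --sort_by timestamp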

For the sub-command sequence:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --timestamp_column | str | the name of the data column which holds the timestamp | True |
| --resampling_rate | int | the resampling rate in seconds | True |
| --trip_gap | int | the gap threshold between different trips in the data | True |
| --sequence_shift | int | shift in the time steps while extracting sequences from the dataset | True |
| --category_column | str | if specified, the data will be categorized by this column before the sequential processes | False |
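
As an example, a sequence block resampling the data to 60-second intervals could be configured like this (all column names and values are illustrative):

cengine pipeline configure --output_path config.yaml \
    sequence --timestamp_column timestamp --resampling_rate 60 --trip_gap 600 --sequence_shift 1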

For the sub-command trainer:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --trainer_type | str | defines the type of the problem at hand, possible selections include 'regression', 'classification', 'autoencoder' | True |
| --trainer_architecture | str | determines the architecture of the model, possible values include 'feedforward', 'sequence', 'sequence_ae' | True |
| --sequence_length | int | the length of the sequence in one data point (provided only in a sequential problem setting) | False |
| --num_output_units | int | the number of output units in the last layer, default=1 | False |
| --train_batch_size | int | the batch size during the training, default=32 | False |
| --eval_batch_size | int | the batch size during the eval steps, default=32 | False |
| --train_steps | int | the number of batches which should be processed through the training, default=5000 | False |
| --save_checkpoints_steps | int | the number of training batches between checkpoints, which indicates the frequency of the validation steps, default=200 | False |
| --optimizer | str | the name of the selected optimizer for the training, default='adam' | False |
| --last_activation | str | the type of the output layer in the model, default='sigmoid' | False |
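
For instance, a feedforward regression model could be configured as follows (all values are illustrative):

cengine pipeline configure --output_path config.yaml \
    trainer --trainer_type regression --trainer_architecture feedforward \
    --train_batch_size 64 --train_steps 10000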

For the sub-command labels:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --label | tuple(str, str) | used to define pairs of (name_of_the_label_column, loss_function), can be used multiple times | False |
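
For example, a label column could be paired with a loss function like this (the column and loss names are illustrative, and the pair is assumed to be passed as two space-separated values):

cengine pipeline configure --output_path config.yaml \
    labels --label target mean_squared_error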

For the sub-command features:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --feature | str | name of the feature to be included in the training, can be used multiple times | False |
| --blacklist | str | name of the feature to be excluded from the training, can be used multiple times | False |
| --remove_labels | flag | if given, automatically removes all the labels from the training feature set | False |
| --regex | flag | if given, feature and blacklist will be used as regex patterns while including and excluding | False |
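
For instance, two feature columns (with illustrative names) could be selected while excluding the labels from the feature set:

cengine pipeline configure --output_path config.yaml \
    features --feature speed --feature acceleration --remove_labels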

For the sub-command evaluator:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --slice | str | name of the data column to slice the data for the evaluation, can be used multiple times | False |
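
For example, the evaluation could be sliced by a categorical column (the column name is illustrative):

cengine pipeline configure --output_path config.yaml \
    evaluator --slice vehicle_id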

For the sub-command preprocessing:

| name | dtype | description | required |
| --- | --- | --- | --- |
| --defaults_path | str | path to a .yaml file which would be used to define the defaults | False |
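
For instance, a local defaults file (the path is illustrative) could be provided as follows:

cengine pipeline configure --output_path config.yaml \
    preprocessing --defaults_path my_defaults.yaml
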
  • While you are working with the sub-commands, you can additionally use the --input_path option to load an existing config file for a warm start.

  • It is possible for the user to use only a subset of the sub-commands defined above. For instance, using this approach, the user can pull a pipeline, modify a specific section and register a new pipeline with the modified config (see the example after this list).

  • However, it must be noted that, in such cases, the CLI will only work on the selected blocks. If you are working on a brand new configuration, it is recommended to use all the sub-commands which are responsible for creating the mandatory main keys in the config.
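
For example, the following call would load an existing config, overwrite only the trainer block and save the result to a new file (all paths and values are illustrative):

cengine pipeline configure --input_path old_config.yaml --output_path new_config.yaml \
    trainer --trainer_type regression --trainer_architecture feedforward --train_steps 10000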

tip

Even though both methods are designed to be comprehensive assistants for generating configuration files, they do not cover every possible option. If you would like to define a configuration which is not possible through the CLI, we recommend modifying the file after generating at least a template. However, make sure your changes do not go against the rules of the configuration. If you would like to get more information or examples about the specific blocks, please refer to here.