Overview

All the configuration settings available under one roof!

In the Core Engine, the most important aspect of creating a pipeline is defining a set of configuration settings, which ultimately covers the entire flow of a machine learning workflow.

Structurally, there are eight different configuration blocks, namely version, split, features, labels, evaluator, trainer, preprocessing, and timeseries. While timeseries is an optional block, the rest are all mandatory and need to be defined. The table below summarizes each block, and a rough skeleton sketch follows it.

Block           Description                                                                       Required
version         the version of the configuration structure                                        True
split           configuration block for splitting the data source into different splits           True
features        configuration block for feature selection                                         True
labels          configuration block for label selection                                           True
evaluator       configuration block for the selection of evaluation features                      True
trainer         configuration block for the training                                              True
preprocessing   configuration block to define the default preprocessing behavior                  True
timeseries      configuration block to define additional parameters for sequential datasources    False
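
As a rough illustration only (not generated by the SDK), the sketch below lays out this overall structure as a Python dict and dumps it to YAML. The empty dicts are placeholders for each block's actual settings, and only version: 1 is taken from the documentation further below.

import yaml  # PyYAML, assumed to be available

# Hypothetical skeleton of a pipeline configuration with the eight top-level
# blocks from the table above. The sub-keys of each block are omitted here.
config_skeleton = {
    "version": 1,
    "split": {},
    "features": {},
    "labels": {},
    "evaluator": {},
    "trainer": {},
    "preprocessing": {},
    # optional: only needed for sequential datasources
    # "timeseries": {},
}

# Dump the skeleton to YAML to see the overall shape of the configuration file.
print(yaml.safe_dump(config_skeleton, sort_keys=False))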

You can build this configuration through different methods. As of now, you can either use the Python SDK to create it or construct a YAML file via the CLI and build it step by step.

Start off with a template

When you start working on your pipeline, you can create a template pipeline configuration which you can use as a starting point and adjust further to your needs.

While creating a template, you also have the option of selecting a datasource commit, which ensures that each feature column in that datasource commit is pre-selected for the feature selection.

Python SDK

In order to create an instance of a pipeline configuration, you can simply import the class PipelineConfig from cengine and use it to generate an empty configuration.

from cengine import PipelineConfig
c = PipelineConfig()

If you want to base your configuration on a datasource commit, you can use the method from_datasource, which is defined as a secondary constructor under PipelineConfig.

class PipelineConfig:
    @classmethod
    def from_datasource(cls,
                        client,
                        datasource_id: str,
                        commit_id: str = None):
        ...

When you use the method from_datasource, you need to provide a datasource_id. The commit_id, on the other hand, is optional; if it is not provided, the Core Engine will proceed with the latest commit of the selected datasource.

from cengine import Client
from cengine import PipelineConfig

# Create a client with a proper username and password
client = Client(username='zip', password='zip')

# Get a datasource (optionally with a specific commit)
datasource = ...
datasource_commit = ...

# Create a template from the datasource_id only
# c = PipelineConfig.from_datasource(
#     client=client,
#     datasource_id=datasource.id)

# Create a template from a datasource_id and a corresponding commit id
c = PipelineConfig.from_datasource(
    client=client,
    datasource_id=datasource.id,
    commit_id=datasource_commit.id)
print(c)
CLI
cengine pipeline template [OPTIONS]

When creating a template for the configuration through the CLI, you have the option to define an output path. If not defined, it will create it as template.config.yaml in the working directory.

Moreover, the Core Engine will look for a datasource commit to use when creating the template. This datasource commit can either be specified by selecting a datasource commit through the CLI or by defining it explicitly via the source_id. When using the source_id, make sure that you use one of the following formats (a short parsing sketch follows the list):

  • datasource_id: The Core Engine will use the latest commit of the datasource

  • datasource_id:commit_id: The Core Engine will use the specified commit
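
As a quick, hypothetical illustration (this helper is not part of the CLI or SDK), a source_id string in either of these formats can be split into its datasource and commit parts as follows:

# Hypothetical helper, for illustration only: split a source_id of the form
# "datasource_id" or "datasource_id:commit_id" into its two parts.
def split_source_id(source_id: str):
    datasource_id, sep, commit_id = source_id.partition(":")
    # An empty commit part means "use the latest commit of the datasource".
    return datasource_id, (commit_id if sep else None)

# Example usage with made-up identifiers:
print(split_source_id("my-datasource"))            # ('my-datasource', None)
print(split_source_id("my-datasource:my-commit"))  # ('my-datasource', 'my-commit')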

You can also decouple this process from datasources entirely by using the --no_datasource flag. Furthermore, if you want a template without additional documentation, you can use the --no_docs flag.

OPTIONS              TYPE      DESCRIPTION
-o, --output_path    path      output path to save a template YAML file, defaults to the working directory
-s, --source_id      string    the selected datasource
--no_docs            flag      save file without additional documentation
--no_datasource      flag      save template without connecting to the datasource
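
For example, a call that writes the template to a custom path for a specific datasource commit might look like the following (the identifiers are placeholders):

cengine pipeline template -o my_pipeline.config.yaml -s <datasource_id>:<commit_id>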

CE Dashboard

Currently, this feature is unavailable.

We are working hard to create a dashboard for the Core Engine. Please see our roadmap for an indication of when this feature will be released.

Main block: version

Now that you have a starting point, the first thing to know about the configuration is the main block version. As the name suggests, it indicates the version of the configuration structure. In the current build, the running version of the configuration YAML file is 1.

Python SDK
from cengine import PipelineConfig
# Create an instance of the configuration and display its version
p = PipelineConfig()
print(p.version)
YAML
version: 1