In the Core Engine, the most important part of creating a pipeline is defining its configuration settings, which together cover the entire flow of a machine learning workflow.
Structurally, there are eight different configuration blocks, namely:
- the version of the configuration structure
- configuration block for splitting the data source into different splits
- configuration block for feature selection
- configuration block for label selection
- configuration block for the selection of evaluation features
- configuration block for the training
- configuration block to define the default preprocessing
- configuration block to define additional parameters
The timeseries block is optional; all the others are mandatory and need to be defined.
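The split between mandatory and optional blocks can be illustrated with a small check on a plain dictionary. This is only a sketch and not part of the Core Engine: apart from version and timeseries, which are named in the text, the block names below are placeholders assumed for illustration, and missing_blocks is a hypothetical helper.

```python
# Placeholder block names: only "version" and "timeseries" appear in the text,
# the remaining names are assumptions made for this illustration.
MANDATORY_BLOCKS = (
    "version",
    "split",
    "features",
    "labels",
    "evaluator",
    "trainer",
    "preprocessing",
)
OPTIONAL_BLOCKS = ("timeseries",)


def missing_blocks(config: dict) -> list:
    """Return the mandatory blocks that are absent from a configuration."""
    return [name for name in MANDATORY_BLOCKS if name not in config]


# A configuration with every mandatory block present (contents left empty).
config = {name: {} for name in MANDATORY_BLOCKS}

# The optional timeseries block is absent, yet nothing is reported as missing.
print(missing_blocks(config))
```

A check like this can be run before submitting a configuration, so that an incomplete one is caught early.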
You can build this configuration in different ways. As of now, you can either create it with the Python SDK or construct a YAML file through the CLI and build it step by step.
When you start working on your pipeline, you can create a template pipeline configuration which you can use as a starting point and adjust further to your needs.
To create an instance of a pipeline configuration, you can import the class PipelineConfig from cengine and use it to generate an empty configuration.

    from cengine import PipelineConfig

    c = PipelineConfig()

If you want to base your configuration on a datasource commit, you can use the method from_datasource, which is defined as a secondary constructor under PipelineConfig.

    class PipelineConfig:
        @classmethod
        def from_datasource(cls,
                            client,
                            datasource_id: str,
                            commit_id: str = None):
            ...

When you use the method from_datasource, you need to provide a datasource_id. The input commit_id, on the other hand, is optional; if it is not provided, the Core Engine proceeds with the latest commit of the selected datasource.

    from cengine import Client
    from cengine import PipelineConfig

    # Create a client with a proper username and password
    client = Client(username='zip', password='zip')

    # Get a datasource (optionally with a specific commit)
    datasource = ...
    datasource_commit = ...

    # Create a template from the datasource_id only
    # c = PipelineConfig.from_datasource(
    #     client=client,
    #     datasource_id=datasource.id)

    # Create a template from a datasource_id and a corresponding commit id
    c = PipelineConfig.from_datasource(client=client,
                                       datasource_id=datasource.id,
                                       commit_id=datasource_commit.id)
    print(c)

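The behaviour described for from_datasource, a secondary constructor that falls back to the latest commit when no commit_id is given, can be sketched without the Core Engine. Everything here is a hypothetical stand-in: TemplateConfig is not the real PipelineConfig, and the commits dictionary (datasource id mapped to commit ids ordered from oldest to newest) replaces the lookup that the real client would perform.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TemplateConfig:
    """Hypothetical stand-in for a pipeline configuration."""
    datasource_id: Optional[str] = None
    commit_id: Optional[str] = None

    @classmethod
    def from_datasource(cls,
                        commits: Dict[str, List[str]],
                        datasource_id: str,
                        commit_id: Optional[str] = None) -> "TemplateConfig":
        """Secondary constructor: build a config from a datasource commit."""
        available = commits[datasource_id]
        if commit_id is None:
            # No commit given: fall back to the latest commit.
            commit_id = available[-1]
        elif commit_id not in available:
            raise ValueError(f"unknown commit: {commit_id}")
        return cls(datasource_id=datasource_id, commit_id=commit_id)


# Assumed example data: one datasource with two commits, oldest first.
commits = {"ds1": ["c1", "c2"]}

latest = TemplateConfig.from_datasource(commits, "ds1")
pinned = TemplateConfig.from_datasource(commits, "ds1", commit_id="c1")
print(latest.commit_id, pinned.commit_id)
```

Pinning the commit id makes a template reproducible, while omitting it always follows the newest state of the datasource.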
cengine pipeline template [OPTIONS]
When creating a template for the configuration through the CLI, you have the option to define an output path. If it is not defined, the template is created as template.config.yaml in the working directory.
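The default described above can be expressed as a small path helper. This is a sketch of the stated rule only, not the CLI's actual code; resolve_output_path is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

DEFAULT_TEMPLATE_NAME = "template.config.yaml"


def resolve_output_path(output_path: Optional[str] = None) -> Path:
    """Return the given path, or the default file in the working directory."""
    if output_path is None:
        return Path.cwd() / DEFAULT_TEMPLATE_NAME
    return Path(output_path)


print(resolve_output_path())                     # default in the working directory
print(resolve_output_path("configs/first.yaml"))  # explicit output path
```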
Moreover, the Core Engine will look for a datasource commit to use when creating the template. This datasource commit can be chosen either by selecting it through the CLI or by defining it explicitly through the source_id. When using the source_id, make sure that you use one of the following formats:
- datasource_id: the Core Engine will use the latest commit of the datasource
- datasource_id:commit_id: the Core Engine will use the specified commit
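The two accepted source_id formats can be told apart with a simple split on the colon. The helper below is a hypothetical sketch of that rule and makes no claim about how the CLI parses the value internally.

```python
from typing import Optional, Tuple


def parse_source_id(source_id: str) -> Tuple[str, Optional[str]]:
    """Split a source_id into (datasource_id, commit_id or None)."""
    datasource_id, separator, commit_id = source_id.partition(":")
    if not datasource_id or (separator and not commit_id) or ":" in commit_id:
        raise ValueError(
            "expected 'datasource_id' or 'datasource_id:commit_id', "
            f"got {source_id!r}"
        )
    # An absent commit id means "use the latest commit of the datasource".
    return datasource_id, (commit_id or None)


print(parse_source_id("my_datasource"))
print(parse_source_id("my_datasource:my_commit"))
```

Returning None for the commit keeps the "latest commit" case explicit, so the caller decides how to resolve it.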
You can also decouple this process from datasources by using the flag no_datasource. Furthermore, if you want a template without additional documentation, a separate flag is available for that as well.
The command accepts the following options:
- output path to save a template YAML file, defaults to the working directory
- source_id: the selected datasource
- save file without additional documentation
- no_datasource: save template without connecting to the datasource
Creating a template through a dashboard is currently unavailable.
We are working hard to create a dashboard for the Core Engine. Please see our roadmap for an indication of when this feature will be released.
Now that you have a starting point, the first thing you should know about the contents of the configuration is the main block version. As the name suggests, it indicates the version of the configuration structure. The version used by the current build can be read from a configuration instance:

    from cengine import PipelineConfig

    # Create an instance of the configuration and display its version
    p = PipelineConfig()
    print(p.version)