Splitting

Smart splitting into training, evaluation and test datasets...

In the Core Engine configuration, the split block configures how the dataset is split into multiple smaller datasets for different purposes, such as training and evaluation. Through this block, the Core Engine lets you define more than just a random split. There are four optional keys to customize your split, displayed in the table below:

Attribute     Description                                                                                            Required
categorize    the options for a categorical split                                                                    False
index         the options for an index-based split                                                                   False
ratio         a common percentage-based splitting approach for both categorical and index-based splits (if given)    False
where         additional conditions while querying the data source                                                   False
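
Before going into the details of each key, here is a minimal sketch of how the four keys can be combined in the Python SDK. The column names asset_id and timestamp and the where condition are purely illustrative:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()
p.split.categorize.by = 'asset_id'    # group by a categorical column (illustrative name)
p.split.index.by = 'timestamp'        # sort by an index column (illustrative name)
p.split.ratio = {'train': 0.8, 'eval': 0.2}             # percentage-based split
p.split.where = ["timestamp >= '2018-01-01T00:00:00'"]  # filter the data source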

Both categorize (for categorical grouping and splitting) and index support further optional parameters to enable more complex splits. The available options for categorize are:

  • by: A single string containing the column name in your dataset that should be used for categorical splitting.

  • categories: A dictionary with the chosen ML workloads as keys, e.g. train and eval, or a list. If a dict, each key maps to a list of strings defining which categories belong to which split. If a list, it should contain the categories of interest in the categorical column (given under by) as strings.

  • ratio: A dictionary with the chosen ML workloads as keys, e.g. train and eval. Each key maps to a float value determining the split ratio within the categories. Cannot be used while categories is present.

For the index, the two available optional keys are:

  • by: A single string containing the column name in your dataset that should be used for index-based splitting.

  • ratio: A dictionary with the chosen ML workloads as keys, e.g. train and eval. Each key maps to a float value determining the split ratio within the index.

You can use the categorical column to group the data before the split or even to specify which category goes into which split. As for the indexed column, you can use it to sort your data before the split.

Caveats and restrictions

Every split configuration specified by the user is verified internally at pipeline construction time. To ensure a meaningful result, a few restrictions apply to the specifications above.

  • The ML workloads currently supported (specified by the ratio and categories dictionary keys) are train + eval, train + eval + test, and a special nosplit option designed for running batch inference jobs (see the first sketch after this list). These keys have to be consistent throughout the entire split configuration; mixing keys in different places of the config is not allowed.

  • As the ratio dictionaries specified under the ratio keys define splits based on percentages, the float values in these dictionaries have to sum up to 1.

  • A top-level ratio dict may not be given together with a ratio dict in the index option. If you want to specify a ratio for an index-based split, please do so under the index key.

  • If a list of categories is given under categories instead of a dict, a corresponding ratio has to be specified to partition the categories into different datasets, either at the top level or under the categorize key (see the second sketch after this list).
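
For instance, a batch inference pipeline would use the nosplit workload. The following is a minimal sketch of that case; it assumes that nosplit is declared as the sole key of the ratio dict, which is an assumption rather than confirmed syntax:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()
# Assumption: declaring 'nosplit' as the only workload key routes the
# entire dataset into a single dataset for batch inference.
p.split.ratio = {'nosplit': 1.0}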

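To illustrate the last restriction, here is a minimal sketch that passes categories as a list and partitions the selected categories with a top-level ratio; the asset IDs are illustrative:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()
p.split.categorize.by = 'asset_id'
# A list instead of a dict: only the listed categories are of interest,
# and the top-level ratio partitions them across the workloads.
p.split.categorize.categories = ['asset_1', 'asset_2', 'asset_3']
p.split.ratio = {'train': 0.8, 'eval': 0.2}
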
Examples

Let's work on a case where your dataset consists of data gathered from 5 different assets in the field over 10 days, and your data is timestamped.

If the ID of the asset which recorded the data and its timestamp do not play a role, you can use ratio to split the dataset into a training (80%) and an evaluation (20%) dataset in a random manner:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.ratio = {'train':0.8, 'eval': 0.2}
YAML
split:
  ratio:
    train: 0.8
    eval: 0.2

If you would like the last 2 days of your dataset to end up in the eval dataset, you can sort your data before the split happens by utilizing index:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.index.by = 'timestamp'
p.split.index.ratio = {'train':0.8, 'eval': 0.2}
YAML
split:
  index:
    by: 'timestamp'
    ratio:
      train: 0.8
      eval: 0.2

Going even further, maybe the days of the experiment do not align across assets and you want the last two days of each respective asset in the eval dataset. You can achieve that by using categorize, which will divide your dataset into categories before the sorting and splitting happens:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.categorize.by = 'asset_id'
p.split.index.by = 'timestamp'
p.split.index.ratio = {'train':0.8, 'eval': 0.2}
YAML
split:
  categorize:
    by: 'asset_id'
  index:
    by: 'timestamp'
    ratio:
      train: 0.8
      eval: 0.2

If you want some of the categories solely in the eval dataset, you can use either the categories or ratio key. Using categories:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.categorize.by = 'asset_id'
p.split.categorize.categories = {'train': ['asset_1',
                                           'asset_2',
                                           'asset_3',
                                           'asset_4'],
                                 'eval': ['asset_5']}
p.split.index.by = 'timestamp'
p.split.index.ratio = {'train':0.8, 'eval': 0.2}
YAML
split:
  categorize:
    by: 'asset_id'
    categories:
      train:
        - 'asset_1'
        - 'asset_2'
        - 'asset_3'
        - 'asset_4'
      eval:
        - 'asset_5'
  index:
    by: 'timestamp'
    ratio:
      train: 0.8
      eval: 0.2

On the other hand, using ratio within categorize:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.categorize.by = 'asset_id'
p.split.categorize.ratio = {'train':0.8, 'eval': 0.2}
p.split.index.by = 'timestamp'
p.split.index.ratio = {'train':0.8, 'eval': 0.2}
YAML
split:
  categorize:
    by: 'asset_id'
    ratio:
      train: 0.8
      eval: 0.2
  index:
    by: 'timestamp'
    ratio:
      train: 0.8
      eval: 0.2

The categorization happens prior to the sorting and splitting. This means that, in the last two examples above, data points belonging to certain categories are first placed in the eval dataset, and the remaining categories then still go through the sorting and splitting process.

Finally, you can also use the where key to define a list of conditions over your data points. For instance, if you only want to work on the data from the year 2018, you can use:

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.index.by = 'timestamp'
p.split.index.ratio = {'train':0.8, 'eval': 0.2}
p.split.where = ["timestamp >= '2018-01-01T00:00:00'",
                 "timestamp <= '2018-12-31T23:59:59'"]
YAML
split:
  index:
    by: 'timestamp'
    ratio:
      train: 0.8
      eval: 0.2
  where:
    - "timestamp >= '2018-01-01T00:00:00'"
    - "timestamp <= '2018-12-31T23:59:59'"