Hybrid split

Full-fledged category and index splitting.

If both the categorical and the index-based split appeal to you, you can combine them into a hybrid split of your data.

Python SDK
from cengine import PipelineConfig
from cengine.utils.split_utils import hybrid_split
# First possible way
p1 = PipelineConfig()
p1.split.categorize.by = 'category_name'
p1.split.categorize.categories = {
    'train': ['category_1', 'category_2', 'category_3'],
    'eval': ['category_4', 'category_5'],
}
p1.split.index.by = 'index_col'
p1.split.index.ratio = {'train': 0.7, 'eval': 0.3}
YAML
split:
  categorize:
    by: 'category_name'
    categories:
      train:
        - 'category_1'
        - 'category_2'
        - 'category_3'
      eval:
        - 'category_4'
        - 'category_5'
  index:
    by: 'index_col'
    ratio:
      train: 0.7
      eval: 0.3

The above config is a minimal example of a hybrid split. It is applied in two sequential steps:

  1. First, the whole dataset is split categorically based on the configuration given under the categorize key. For more information on the categorical split, see the previous section on categorical splitting.

  2. Subsequently, on the training dataset generated by this split, an index-based split is run with the parameters specified under index in the config.

In the above config, the training set, which consists of categories 1 to 3 after Step 1, is split again, and 30% of its data points are moved to the evaluation set. This lets you run a basic index-based split while still pre-assigning whole categories to specific datasets such as the eval or test set.
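
The two steps can be sketched in plain Python. This is only an illustration, not the cengine implementation: the function hybrid_split_sketch and its parameters are hypothetical, it handles the two-way train/eval case only, and it assumes that the index-based step orders the training rows by the index column and keeps the first 70% of them in train.

```python
def hybrid_split_sketch(rows, category_col, categories, index_col, ratio):
    """Hypothetical illustration of a two-way (train/eval) hybrid split."""
    # Step 1: categorical split. Each row goes to the split that lists its
    # category; rows with an unlisted category are skipped in this sketch.
    splits = {name: [] for name in categories}
    for row in rows:
        for name, members in categories.items():
            if row[category_col] in members:
                splits[name].append(row)
                break

    # Step 2: index-based split, run on the training set only.
    # Assumption: rows are ordered by the index column and the first
    # ratio['train'] share stays in train; the rest moves to eval.
    ordered = sorted(splits["train"], key=lambda r: r[index_col])
    cut = round(len(ordered) * ratio["train"])
    splits["train"] = ordered[:cut]
    splits["eval"] = splits["eval"] + ordered[cut:]
    return splits


# 20 sample rows spread evenly over category_1 .. category_5.
rows = [
    {"category_name": f"category_{(i % 5) + 1}", "index_col": i}
    for i in range(20)
]
splits = hybrid_split_sketch(
    rows,
    category_col="category_name",
    categories={
        "train": ["category_1", "category_2", "category_3"],
        "eval": ["category_4", "category_5"],
    },
    index_col="index_col",
    ratio={"train": 0.7, "eval": 0.3},
)
print(len(splits["train"]), len(splits["eval"]))
```

In this sample, Step 1 assigns 12 rows to train and 8 to eval. Step 2 then moves part of the training rows to eval, while every row from categories 4 and 5 stays in eval throughout.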

When using a hybrid split, make sure to specify consistent options for both the categorize and index keys. For example, specifying a split into a train and eval set in the categorical step and a split into train, eval and test in the index-based step is not allowed.
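
A quick sanity check before submitting a config might look like the sketch below. The helper check_consistent_splits is hypothetical and not part of cengine, and it assumes that "consistent" means both steps name exactly the same set of splits.

```python
def check_consistent_splits(categories, ratio):
    """Hypothetical helper: raise if the two steps name different splits."""
    category_splits = set(categories)
    index_splits = set(ratio)
    if category_splits != index_splits:
        raise ValueError(
            "Inconsistent hybrid split: categorize defines "
            f"{sorted(category_splits)}, index defines {sorted(index_splits)}"
        )


categories = {
    "train": ["category_1", "category_2", "category_3"],
    "eval": ["category_4", "category_5"],
}

# Consistent: both steps use train and eval.
check_consistent_splits(categories, {"train": 0.7, "eval": 0.3})

# Inconsistent: the index step adds a test split that categorize lacks.
try:
    check_consistent_splits(categories, {"train": 0.6, "eval": 0.2, "test": 0.2})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```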