Categorical split

Easy data grouping based on a categorical attribute.

If your dataset contains a categorical attribute (a variable that can take finitely many, distinct values), you can split your data for your machine learning tasks based on the categories in that column. There are two main ways of doing this in the Core Engine.

The categorical attribute specified above needs to be a column in your dataset of type STRING or INTEGER.

Method 1: Direct specification by a dictionary

The first method of categorical splitting is to specify a dictionary under the categories key of the categorize entry in the split config. Assuming a training and evaluation workload (specified by the train and eval keys), an example could be:

Python SDK
from cengine import PipelineConfig
from cengine import category_split

# First possible way
p1 = PipelineConfig()
p1.split.categorize.by = 'category_name'
p1.split.categorize.categories = {
    'train': ['category_1', 'category_2', 'category_3'],
    'eval': ['category_4', 'category_5']}

# Second possible way
p2 = PipelineConfig()
p2.split = category_split(by='category_name',
                          categories={'train': ['category_1',
                                                'category_2',
                                                'category_3'],
                                      'eval': ['category_4',
                                               'category_5']})
YAML
split:
  categorize:
    by: 'category_name'
    categories:
      train:
        - 'category_1'
        - 'category_2'
        - 'category_3'
      eval:
        - 'category_4'
        - 'category_5'

This configuration ensures that all data points whose value in the category_name column is one of categories 1 to 3 are put into the training dataset, whereas data points of categories 4 and 5 are assigned exclusively to the evaluation dataset. This method requires a more detailed specification from the user and may result in a longer configuration file, but it gives the most freedom and detail in specifying the split.
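To make the assignment rule concrete, here is a minimal sketch in plain Python of how a dictionary-based split could route rows. It is not the Core Engine implementation: the function split_by_category_dict, the row format (a list of dicts) and the choice to skip rows with unlisted categories are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_category_dict(rows, by, categories):
    """Assign each row to the split whose category list contains row[by].

    Hypothetical helper, not part of cengine. Rows whose category is not
    listed under any split are skipped (an assumption of this sketch).
    """
    # Invert {'train': [...], 'eval': [...]} into {category: split_name}.
    category_to_split = {}
    for split_name, names in categories.items():
        for name in names:
            if name in category_to_split:
                raise ValueError(
                    f"category {name!r} is assigned to more than one split")
            category_to_split[name] = split_name

    result = defaultdict(list)
    for row in rows:
        split_name = category_to_split.get(row[by])
        if split_name is not None:
            result[split_name].append(row)
    return dict(result)


rows = [
    {"category_name": "category_1", "value": 10},
    {"category_name": "category_4", "value": 20},
    {"category_name": "category_2", "value": 30},
]
splits = split_by_category_dict(
    rows,
    by="category_name",
    categories={
        "train": ["category_1", "category_2", "category_3"],
        "eval": ["category_4", "category_5"],
    },
)
```

With these sample rows, the two rows of category_1 and category_2 end up under 'train' and the row of category_4 under 'eval'.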

Method 2: Ratio-based splitting of a category list

If assigning whole categories to different datasets is still important, but the extremely detailed designation from a dictionary-based approach is overkill, then a ratio-based categorical split can be chosen. An example config for this based on the same category names as above could be the following:

Python SDK
from cengine import PipelineConfig
from cengine import category_split

# First possible way
p1 = PipelineConfig()
p1.split.categorize.by = 'category_name'
p1.split.categorize.categories = ['category_1',
                                  'category_2',
                                  'category_3',
                                  'category_4',
                                  'category_5']
p1.split.categorize.ratio = {'train': 0.8,
                             'eval': 0.2}

# Second possible way
# Categories are passed as a plain list here; the ratio decides the split.
p2 = PipelineConfig()
p2.split = category_split(by='category_name',
                          ratio={'train': 0.8,
                                 'eval': 0.2},
                          categories=['category_1',
                                      'category_2',
                                      'category_3',
                                      'category_4',
                                      'category_5'])
YAML
split:
  categorize:
    by: 'category_name'
    categories:
      - 'category_1'
      - 'category_2'
      - 'category_3'
      - 'category_4'
      - 'category_5'
    ratio:
      train: 0.8
      eval: 0.2

Here, the categories of interest are specified as a list under the categories key. In contrast to before, the split information now comes from the ratio key: in this example, 80% of the specified categories go into the training dataset and 20% into the evaluation dataset. When the corresponding pipeline is executed, the category list is divided according to the percentages in the ratio dict.
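The division of the category list can be sketched as follows. This is an illustration under stated assumptions, not the Core Engine's actual logic: the function partition_categories is hypothetical, and it assumes that the list order is preserved, that chunks are consecutive, and that boundaries are rounded to the nearest whole category. How the Core Engine orders categories or rounds boundaries is not specified here.

```python
def partition_categories(categories, ratio):
    """Cut a category list into consecutive chunks, one per split.

    Hypothetical helper, not part of cengine. Assumes list order is kept,
    boundaries are rounded to whole categories, and ratios sum to 1.
    """
    if abs(sum(ratio.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    result = {}
    start = 0
    cumulative = 0.0
    for split_name, fraction in ratio.items():
        cumulative += fraction
        # Index of the first category that belongs to the next split.
        end = round(cumulative * len(categories))
        result[split_name] = categories[start:end]
        start = end
    return result


category_list = ["category_1", "category_2", "category_3",
                 "category_4", "category_5"]
assignment = partition_categories(category_list,
                                  {"train": 0.8, "eval": 0.2})
```

Under these assumptions, five categories with an 80/20 ratio give four categories to 'train' and one to 'eval'. The resulting dictionary has the same shape as the one written by hand in Method 1.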

When opting for a list-based categorical split, you must supply a ratio dict; otherwise, the split attempt will result in an error.
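If you want to catch this mistake before submitting a pipeline, a small pre-flight check along the following lines may help. It is only a sketch: validate_categorize_config is a hypothetical helper, it works on a plain dict shaped like the categorize block of the YAML config, and the exception types are assumptions rather than the errors the Core Engine itself raises.

```python
def validate_categorize_config(config):
    """Return which split method a categorize config describes.

    Hypothetical helper, not part of cengine. `config` is a plain dict
    with the keys 'by', 'categories' and optionally 'ratio'.
    """
    if "by" not in config:
        raise ValueError("missing 'by': the categorical column name")

    categories = config.get("categories")
    if isinstance(categories, dict):
        # Method 1: each split lists its own categories explicitly.
        return "dictionary"
    if isinstance(categories, list):
        # Method 2: a flat list is only meaningful together with a ratio.
        if not config.get("ratio"):
            raise ValueError("a list of categories requires a ratio dict")
        return "ratio"
    raise TypeError("'categories' must be a dict or a list")


method = validate_categorize_config({
    "by": "category_name",
    "categories": ["category_1", "category_2"],
    "ratio": {"train": 0.5, "eval": 0.5},
})
```

Calling the same function with a category list and no ratio key raises a ValueError in this sketch.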