Single split

Everything stays together.

Splitting your data is an important step in many machine learning scenarios. Sometimes, though, you might prefer to leave your data as it is, for example when running inference to gain insights. For situations like these, the Core Engine lets you skip the split and pass your data, unsplit, further down your pipeline.

Depending on how much control you need over pre-selecting the data for your job, there are two main ways of signalling a "no-split".

Method 1: Supply an empty split configuration

The most straightforward way of skipping the split is to specify an empty split config key, just like this:

Python SDK
from cengine import PipelineConfig
from cengine.utils.split_utils import no_split
# First possible way
p1 = PipelineConfig()
p1.split = {}
# Second possible way
p2 = PipelineConfig()
p2.split = no_split()
YAML
split: {}

This will result in your data being passed on to other downstream pipeline components without splitting.
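
To make the effect concrete, here is a minimal plain-Python sketch of what an empty split config amounts to. It is not the Core Engine's implementation: `apply_split`, the row format, and the use of "nosplit" as the group label are assumptions made purely for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_split(rows: List[Row], split_config: Dict[str, Any]) -> Dict[str, List[Row]]:
    """Hypothetical helper: route rows into named groups.

    An empty config means "no split": every row stays in one group.
    """
    if not split_config:
        # Nothing to split on, so all rows are passed on together.
        return {"nosplit": list(rows)}
    # Other split types are out of scope for this sketch.
    raise NotImplementedError("only the empty-config case is sketched here")


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
result = apply_split(rows, {})
```

In this sketch, `result` holds a single group containing all three rows, which mirrors the idea that everything stays together.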

Method 2: Categorical pre-selection

This method is preferable when you have your data available but are only interested in the data points that share specific values of an attribute. For more information on splitting based on a categorical attribute, please refer to the section on categorical splitting.

Python SDK
from cengine import PipelineConfig
p = PipelineConfig()
p.split.categorize.by = 'category_name'
p.split.categorize.categories = {'nosplit': ['category_1', 'category_2']}
YAML
split:
  categorize:
    by: 'category_name'
    categories:
      nosplit:
        - 'category_1'
        - 'category_2'

The example above is a minimal single-split config with a categorical selection. Specifying your config with the nosplit key as seen here selects only the data points whose category_name column has the value category_1 or category_2.
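
The selection itself can be pictured with a short plain-Python sketch. This is an illustration under stated assumptions, not the Core Engine's own code: `select_categories` and the list-of-dicts row format are hypothetical, and only the `by` and `categories` values are taken from the config above.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def select_categories(
    rows: List[Row], by: str, categories: Dict[str, List[str]]
) -> Dict[str, List[Row]]:
    """Hypothetical helper: keep only rows whose `by` value is listed.

    Each key in `categories` names a group; rows with unlisted values are dropped.
    """
    selected: Dict[str, List[Row]] = {}
    for group_name, wanted in categories.items():
        wanted_set = set(wanted)
        selected[group_name] = [row for row in rows if row.get(by) in wanted_set]
    return selected


rows = [
    {"id": 1, "category_name": "category_1"},
    {"id": 2, "category_name": "category_3"},
    {"id": 3, "category_name": "category_2"},
]
result = select_categories(
    rows,
    by="category_name",
    categories={"nosplit": ["category_1", "category_2"]},
)
selected_ids = [row["id"] for row in result["nosplit"]]
```

Here the row with `category_3` is filtered out, while the rows for `category_1` and `category_2` stay together in the single `nosplit` group.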