Vertical split

Index-based data splits, the old-fashioned way.

If you do not need to isolate specific categories into particular datasets, an index-based "vertical" split is the most intuitive option. You split your data by setting a ratio key in your configuration file. A small working example of such a config is shown below.

Python SDK
from cengine import PipelineConfig
from cengine.utils.split_utils import index_split
# First possible way: set the fields on the config object directly
p1 = PipelineConfig()
p1.split.index.by = 'index_col'
p1.split.index.ratio = {'train': 0.7, 'eval': 0.3}
# Second possible way: build the split with the index_split helper
p2 = PipelineConfig()
p2.split = index_split(by='index_col', ratio={'train': 0.7, 'eval': 0.3})
YAML
split:
  index:
    by: 'index_col'
    ratio:
      train: 0.7
      eval: 0.3

If no categorical split is specified, as we assume here (for a "hybrid" split that combines categorical and index-based splitting, see the Hybrid Split documentation), this config is functionally equivalent to the following:

Python SDK
from cengine import PipelineConfig
from cengine.utils.split_utils import index_split
# The ratio is set directly on the split instead of under split.index
p1 = PipelineConfig()
p1.split.index.by = 'index_col'
p1.split.ratio = {'train': 0.7, 'eval': 0.3}
YAML
split:
  index:
    by: 'index_col'
  ratio:
    train: 0.7
    eval: 0.3
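
To make the equivalence of the two configs concrete, here is a minimal sketch using plain Python dicts. The helper `effective_index_ratio` is hypothetical and not part of cengine. It assumes that a ratio under `split.index` takes precedence and that the ratio under `split` is used otherwise; the actual resolution logic in Core Engine may differ.

```python
from typing import Dict, Optional


def effective_index_ratio(config: dict) -> Optional[Dict[str, float]]:
    """Hypothetical helper: return the ratio that applies to the index split."""
    split = config.get("split", {})
    index_cfg = split.get("index", {})
    # Assumption: a ratio nested under `index` wins; otherwise fall back
    # to the ratio defined directly under `split`.
    if "ratio" in index_cfg:
        return index_cfg["ratio"]
    return split.get("ratio")


# Ratio nested under split.index (first config).
nested = {
    "split": {"index": {"by": "index_col", "ratio": {"train": 0.7, "eval": 0.3}}}
}
# Ratio set directly under split (second config).
top_level = {
    "split": {"index": {"by": "index_col"}, "ratio": {"train": 0.7, "eval": 0.3}}
}

same = effective_index_ratio(nested) == effective_index_ratio(top_level)
print(same)
```

Under these assumptions both layouts resolve to the same train/eval ratio, which is what "functionally equivalent" means here when no categorical split is involved.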

This is the most basic way to split a dataset. Here, the data is indexed by a column named index_col. With this configuration, 70% of all available data points go into the training dataset and the remaining 30% go into the evaluation dataset. You can adjust the size of the resulting datasets directly by changing the values in the ratio dict.
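
The following is a minimal sketch of what such a split could look like in plain Python. It is not the cengine implementation. The function `index_split_sketch` is hypothetical, and it assumes that rows are ordered by the index column and then cut into consecutive slices; whether Core Engine orders the data this way is not stated here.

```python
from typing import Dict, List, Sequence


def index_split_sketch(
    rows: Sequence[dict], by: str, ratio: Dict[str, float]
) -> Dict[str, List[dict]]:
    """Cut rows, ordered by the `by` column, into consecutive slices sized by `ratio`."""
    total = sum(ratio.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"ratios must sum to 1.0, got {total}")

    # Assumption: data points are ordered by the index column before cutting.
    ordered = sorted(rows, key=lambda row: row[by])
    n = len(ordered)

    result: Dict[str, List[dict]] = {}
    names = list(ratio)
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += ratio[name]
        # The last split takes the remainder so no row is lost to rounding.
        end = n if i == len(names) - 1 else round(cumulative * n)
        result[name] = ordered[start:end]
        start = end
    return result


# Ten toy data points indexed by `index_col`.
rows = [{"index_col": i, "value": i * i} for i in range(10)]
parts = index_split_sketch(
    rows, by="index_col", ratio={"train": 0.7, "eval": 0.3}
)
print(len(parts["train"]), len(parts["eval"]))
```

With ten rows and the 0.7 / 0.3 ratio from the config, seven rows end up in train and three in eval. Changing the ratio values shifts the cut point accordingly.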