Preprocessing

In the Core Engine, the preprocessing methods are grouped under four main paradigms that you can apply to your data:

  1. filling: used to fill the missing datapoints in the dataset. Built-in methods include:

    • forward (timeseries specific): fill the missing values using the last available value

    • backwards (timeseries specific): fill the missing values using the next available value

    • min: fill the missing values with the minimum value in the data column

    • max: fill the missing values with the maximum value in the data column

    • mean: fill the missing values with the mean value in the data column

    • custom: fill the missing values in the data column with a custom value

  2. resampling (timeseries specific): used to define sequence-level resampling operations. Built-in methods include:

    • mean: aggregate values based on their mean value

    • mode: aggregate values based on their mode value

    • median: aggregate values based on their median value

    • threshold: aggregate values based on a condition

  3. label_tuning (timeseries specific): used to specify a set of preprocessing steps dedicated to sequence-level labels. Built-in methods include:

    • leadup: marks the data points within a specified period of time prior to an event as events

    • followup: marks the data points within a specified period of time after an event as events

    • shift: shifts the labels in time by given steps

    • map: maps the values within a column into new values

    • no_tuning: no label tuning is applied

  4. transform: used to define data transformation steps. Built-in methods include:

Moreover, preprocessing is carried out at the feature level: whenever a feature is selected for use in the pipeline, the pipeline must be able to infer which preprocessing steps to apply to that feature.

There are two ways to achieve this goal. The first approach is to explicitly configure every preprocessing step for each feature; the second is to define a set of default behaviors for each possible data type. The remainder of this chapter focuses on the second approach.
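To make the relationship between the two approaches concrete, here is a minimal sketch in plain Python. It is not the Core Engine implementation: the helper resolve_steps and the overrides argument are hypothetical names introduced only for illustration, and the defaults dictionary simply mirrors the shape of a preprocessing block. The sketch assumes that an explicitly configured step takes precedence and that anything left unconfigured falls back to the default for the feature's data type.

```python
STEPS = ("filling", "resampling", "label_tuning", "transform")


def resolve_steps(dtype, defaults, overrides=None):
    """Pick one method per preprocessing step for a single feature.

    Hypothetical helper: explicit per-feature overrides win, and any
    step that is not overridden uses the default for the data type.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(STEPS)
    if unknown:
        raise ValueError(f"unknown preprocessing steps: {sorted(unknown)}")
    return {step: overrides.get(step, defaults[dtype][step]) for step in STEPS}


# Defaults per data type, shaped like a preprocessing block (integer only here).
defaults = {
    "integer": {
        "filling": {"method": "max", "parameters": {}},
        "resampling": {"method": "mean", "parameters": {}},
        "label_tuning": {"method": "no_tuning", "parameters": {}},
        "transform": {"method": "scale_to_z_score", "parameters": {}},
    }
}

# One feature overrides only its filling step; the rest comes from the defaults.
age_steps = resolve_steps(
    "integer",
    defaults,
    {"filling": {"method": "custom", "parameters": {"custom_value": 42}}},
)
print(age_steps["filling"]["method"])    # custom
print(age_steps["transform"]["method"])  # scale_to_z_score
```

With defaults in place, only the features that deviate from the usual behavior of their data type need explicit configuration.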

Note that the preprocessing steps are applied in the order listed above. For instance, on a sequential dataset, the data points are first filled, then resampled. Afterwards the labels are tuned and, finally, the transform operations are applied.
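The order of operations can be illustrated with a small, self-contained sketch. The functions below are simplified stand-ins written for this example, not the built-in Core Engine methods: windows are counted in data points rather than in units of time, the labels are resampled with max (an assumption made so that an event survives aggregation), and the final transform step is left out.

```python
def forward_fill(values):
    """Replace each None with the last seen value (leading Nones stay None)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def resample(values, window, aggregate):
    """Aggregate consecutive windows of `window` points with `aggregate`."""
    return [aggregate(values[i:i + window]) for i in range(0, len(values), window)]


def mean(chunk):
    return sum(chunk) / len(chunk)


def leadup(labels, steps):
    """Mark the `steps` points before each event (label 1) as events too."""
    tuned = list(labels)
    for i, label in enumerate(labels):
        if label == 1:
            for j in range(max(0, i - steps), i):
                tuned[j] = 1
    return tuned


feature = [1.0, None, 3.0, None, 5.0, 7.0]
labels = [0, 0, 0, 0, 0, 1]

filled = forward_fill(feature)                 # 1. filling
resampled = resample(filled, 2, mean)          # 2. resampling
resampled_labels = resample(labels, 2, max)    #    (labels aggregated with max)
tuned = leadup(resampled_labels, steps=1)      # 3. label_tuning
# 4. transform operations would be applied last (omitted in this sketch)

print(filled)     # [1.0, 1.0, 3.0, 3.0, 5.0, 7.0]
print(resampled)  # [1.0, 3.0, 6.0]
print(tuned)      # [0, 1, 1]
```

Filling before resampling matters here: the mean aggregation would fail on windows that still contain missing values.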

Main block: preprocessing

The main block preprocessing is used in the configuration to define a set of default preprocessing steps for four main data types, namely integer, float, boolean and string. For each data type, a default behavior is defined covering all four preprocessing steps: filling, resampling, label_tuning and transform.

Attributes    Description                                         Required
integer       default preprocessing methods for integer values    True
float         default preprocessing methods for float values      True
boolean       default preprocessing methods for boolean values    True
string        default preprocessing methods for string values     True
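Since every data type is required and each one is expected to cover all four steps, a configuration can be checked for gaps before it is used. The following is only a sketch under that assumption: missing_entries is a hypothetical helper, not part of the cengine package, and it treats the preprocessing block as a plain nested dictionary.

```python
DATA_TYPES = ("integer", "float", "boolean", "string")
STEPS = ("filling", "resampling", "label_tuning", "transform")


def missing_entries(preprocessing):
    """List every required 'type' or 'type.step' key absent from the block."""
    missing = []
    for dtype in DATA_TYPES:
        if dtype not in preprocessing:
            missing.append(dtype)
            continue
        for step in STEPS:
            if step not in preprocessing[dtype]:
                missing.append(f"{dtype}.{step}")
    return missing


# A block that only configures the filling step for integers.
partial = {
    "integer": {
        "filling": {"method": "custom", "parameters": {"custom_value": 42}},
    }
}
print(missing_entries(partial))
# ['integer.resampling', 'integer.label_tuning', 'integer.transform',
#  'float', 'boolean', 'string']
```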

Example

For instance, assume that your dataset has several data columns containing integer values and you want to fill in the missing values with the value 42. In that case, the relevant section of the preprocessing block should look like this:

Python SDK
from cengine import Method
from cengine import PipelineConfig

p = PipelineConfig()
p.preprocessing.integer.filling = Method(method='custom',
                                         parameters={'custom_value': 42})
YAML
preprocessing:
  integer:
    filling:
      method: "custom"
      parameters: {custom_value: 42}

However, this example covers only the integer data type and the filling step. A complete preprocessing block looks like this:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()
# Instances of PipelineConfig come with a pre-defined set of defaults
print(p.preprocessing)
# {'boolean':
#      {'filling': [{'method': 'max', 'parameters': {}}],
#       'label_tuning': [{'method': 'no_tuning', 'parameters': {}}],
#       'resampling': [{'method': 'max', 'parameters': {}}],
#       'transform': [{'method': 'no_transform', 'parameters': {}}]},
#  'float':
#      {'filling': [{'method': 'max', 'parameters': {}}],
#       'label_tuning': [{'method': 'no_tuning', 'parameters': {}}],
#       'resampling': [{'method': 'mean', 'parameters': {}}],
#       'transform': [{'method': 'scale_to_z_score', 'parameters': {}}]},
#  'integer':
#      {'filling': [{'method': 'max', 'parameters': {}}],
#       'label_tuning': [{'method': 'no_tuning', 'parameters': {}}],
#       'resampling': [{'method': 'mean', 'parameters': {}}],
#       'transform': [{'method': 'scale_to_z_score', 'parameters': {}}]},
#  'string':
#      {'filling': [{'method': 'custom',
#                    'parameters': {'custom_value': ''}}],
#       'label_tuning': [{'method': 'no_tuning', 'parameters': {}}],
#       'resampling': [{'method': 'mode', 'parameters': {}}],
#       'transform': [{'method': 'compute_and_apply_vocabulary',
#                      'parameters': {}}]}}
YAML
preprocessing:
  integer:
    filling:
      method: "max"
      parameters: {}
    resampling:
      method: "max"
      parameters: {}
    transform:
      method: "scale_to_z_score"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  float:
    filling:
      method: "mean"
      parameters: {}
    resampling:
      method: "mean"
      parameters: {}
    transform:
      method: "scale_to_z_score"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  boolean:
    filling:
      method: "max"
      parameters: {}
    resampling:
      method: "threshold"
      parameters: {cond: "greater",
                   c_value: 0,
                   threshold: 0,
                   set_value: 1}
    transform:
      method: "no_transform"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}
  string:
    filling:
      method: "custom"
      parameters: {custom_value: ''}
    resampling:
      method: "mode"
      parameters: {}
    transform:
      method: "compute_and_apply_vocabulary"
      parameters: {}
    label_tuning:
      method: "no_tuning"
      parameters: {}

The behavior of the filling paradigm depends on whether the pipeline is working on sequential data. For instance, when max is selected as the filling method on a non-sequential dataset, the pipeline uses all the datapoints in the data column to infer the max value, whereas on sequential datasets this value is inferred within each sequence.
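The difference is easiest to see on a toy column. The sketch below is an illustration in plain Python, not the Core Engine code: fill_max and fill_max_per_sequence are hypothetical helpers, missing values are represented as None, and the split into sequences is given up front.

```python
def fill_max(values):
    """Fill None entries with the maximum of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to infer the maximum from
    top = max(observed)
    return [top if v is None else v for v in values]


def fill_max_per_sequence(sequences):
    """Apply fill_max to each sequence independently."""
    return [fill_max(sequence) for sequence in sequences]


# Non-sequential: the maximum is inferred from the whole column.
column = [1, None, 9, 2, None, 4]
print(fill_max(column))  # [1, 9, 9, 2, 9, 4]

# Sequential: the same points split into two sequences, each with its own maximum.
sequences = [[1, None, 9], [2, None, 4]]
print(fill_max_per_sequence(sequences))  # [[1, 9, 9], [2, 4, 4]]
```

The second missing value is filled with 9 in the non-sequential case but with 4 in the sequential case, because only the values of its own sequence are considered.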

How to deal with time-series data is covered further down in this chapter.