Built-in functions

Sometimes there is no need to reinvent the wheel

Filling

forward

Fill the missing values using the last available value based on an index (only applicable in the timeseries setting).

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='forward', parameters={})
YAML
feature_a:
  filling:
    method: 'forward'
    parameters: {}

backwards

Fill the missing values using the next available value based on an index (only applicable in the timeseries setting).

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='backwards', parameters={})
YAML
feature_a:
  filling:
    method: 'backwards'
    parameters: {}
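
To illustrate what the forward and backwards methods do to a series, here is a minimal plain-Python sketch. It is not the cengine implementation: the helper names forward_fill and backward_fill are hypothetical, and missing values are assumed to be represented as None in a list that is already ordered by its index.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each None with the last non-missing value seen before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value has been seen
    return filled


def backward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each None with the next non-missing value after it."""
    # Filling backwards is filling forwards on the reversed series.
    return forward_fill(values[::-1])[::-1]


series = [None, 1.0, None, None, 4.0, None]
print(forward_fill(series))   # [None, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward_fill(series))  # [1.0, 1.0, 4.0, 4.0, 4.0, None]
```

Note that a leading gap survives a forward fill and a trailing gap survives a backward fill, since there is no earlier or later value to copy.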

min

Fill the missing values with the minimum value of the selected feature.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='min', parameters={})
YAML
feature_a:
  filling:
    method: 'min'
    parameters: {}

In a pipeline which operates on a sequential dataset, this function utilizes the minimum value within a trip instead of a global minimum. You can find a more in-depth guide on how to use sequential datasets here.

max

Fill the missing values with the maximum value of the selected feature.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='max', parameters={})
YAML
feature_a:
  filling:
    method: 'max'
    parameters: {}

In a pipeline which operates on a sequential dataset, this function uses the maximum value within a trip instead of a global maximum. You can find a more in-depth guide on how to use sequential datasets here.

mean

Fill the missing values with the mean value of the selected feature.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='mean', parameters={})
YAML
feature_a:
  filling:
    method: 'mean'
    parameters: {}

In a pipeline which operates on a sequential dataset, this function uses the mean value within a trip instead of a global mean. You can find a more in-depth guide on how to use sequential datasets here.
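
The difference between a global statistic and a per-trip statistic can be sketched in plain Python. This is only an illustration under stated assumptions, not the cengine implementation: the helper fill_within_trips and the trip_ids list are hypothetical, and missing values are represented as None.

```python
from collections import defaultdict
from statistics import mean


def fill_within_trips(trip_ids, values, agg=mean):
    """Fill None entries with an aggregate computed separately for each trip."""
    observed = defaultdict(list)
    for trip, v in zip(trip_ids, values):
        if v is not None:
            observed[trip].append(v)
    fill = {trip: agg(vs) for trip, vs in observed.items()}
    # A trip without any observed value has nothing to fill with and stays None.
    return [fill.get(trip) if v is None else v
            for trip, v in zip(trip_ids, values)]


trip_ids = ['a', 'a', 'a', 'b', 'b']
values = [1.0, None, 3.0, 10.0, None]

filled_mean = fill_within_trips(trip_ids, values)       # [1.0, 2.0, 3.0, 10.0, 10.0]
filled_min = fill_within_trips(trip_ids, values, min)   # [1.0, 1.0, 3.0, 10.0, 10.0]
filled_max = fill_within_trips(trip_ids, values, max)   # [1.0, 3.0, 3.0, 10.0, 10.0]
```

With a global mean, both gaps would have been filled with the same value; grouping by trip keeps the fill value local to each sequence.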

custom

Fill the missing values of the selected feature with a custom value. Requires the following parameter:

  • custom_value: Selected value for the filling process.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].filling = Method(method='custom',
                                         parameters={'custom_value': 3})
YAML
feature_a:
  filling:
    method: 'custom'
    parameters:
      custom_value: 3

Resampling

mean

Aggregate values within a time interval based on the mean value.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].resampling = Method(method='mean', parameters={})
YAML
feature_a:
  resampling:
    method: 'mean'
    parameters: {}

mode

Aggregate values within a time interval based on the mode.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].resampling = Method(method='mode', parameters={})
YAML
feature_a:
  resampling:
    method: 'mode'
    parameters: {}

median

Aggregate values within a time interval based on the median value.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].resampling = Method(method='median', parameters={})
YAML
feature_a:
  resampling:
    method: 'median'
    parameters: {}
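
The mean, mode and median resampling methods above all follow the same pattern: group the data points that fall into the same time interval, then reduce each group to one value. The sketch below shows that pattern in plain Python. It is an illustration only, not the cengine implementation; the resample helper, the integer timestamps in seconds, and the interval argument are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median, mode


def resample(timestamps, values, interval, agg):
    """Group values into fixed-width time intervals and aggregate each group."""
    buckets = defaultdict(list)
    for t, v in zip(timestamps, values):
        # Each bucket is keyed by the start of the interval it covers.
        buckets[t // interval * interval].append(v)
    return {start: agg(vs) for start, vs in sorted(buckets.items())}


timestamps = [0, 2, 4, 10, 11, 13]           # seconds
values = [1.0, 1.0, 4.0, 3.0, 6.0, 6.0]

by_mean = resample(timestamps, values, 10, mean)      # {0: 2.0, 10: 5.0}
by_mode = resample(timestamps, values, 10, mode)      # {0: 1.0, 10: 6.0}
by_median = resample(timestamps, values, 10, median)  # {0: 1.0, 10: 6.0}
```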

threshold

Aggregate values based on a condition. Requires the following parameters:

  • cond: A string, representing the selected condition. Choose between the options greater, greater_or_equal, equal, less_or_equal, less, includes.

  • c_value: The conditional value.

  • threshold: Limit on the number of occurrences of the condition being True.

  • set_value: Value to use if the threshold has been surpassed.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].resampling = Method(
    method='threshold',
    parameters={'cond': 'greater',
                'c_value': 42,
                'threshold': 5,
                'set_value': 1})
YAML
feature_a:
  resampling:
    method: 'threshold'
    parameters:
      cond: 'greater'
      c_value: 42
      threshold: 5
      set_value: 1
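
One possible reading of the four parameters is sketched below in plain Python: count how often the condition holds within an interval, and emit set_value once that count exceeds threshold. This is an interpretation, not the cengine implementation. In particular, the default argument (the value returned when the threshold is not surpassed), the strict "greater than" comparison against threshold, and the helper name threshold_aggregate are assumptions.

```python
import operator

# Map the documented condition names onto comparison functions.
CONDITIONS = {
    'greater': operator.gt,
    'greater_or_equal': operator.ge,
    'equal': operator.eq,
    'less_or_equal': operator.le,
    'less': operator.lt,
    'includes': operator.contains,  # contains(v, c_value) means "c_value in v"
}


def threshold_aggregate(values, cond, c_value, threshold, set_value, default=0):
    """Return set_value if the condition holds more than `threshold` times."""
    compare = CONDITIONS[cond]
    hits = sum(1 for v in values if compare(v, c_value))
    return set_value if hits > threshold else default


busy = [50, 43, 44, 60, 45, 99, 10]   # six values are greater than 42
quiet = [50, 43, 10]                  # two values are greater than 42

busy_result = threshold_aggregate(busy, 'greater', 42, 5, 1)    # 1
quiet_result = threshold_aggregate(quiet, 'greater', 42, 5, 1)  # 0
```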

Label Tuning

leadup

Marks the data points within a specified period of time prior to an event as events as well. Requires the following parameters:

  • event_value: The value which specifies the occurrence of the event.

  • duration: The number of seconds to cover prior to the event.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.labels['label_a'].label_tuning = Method(method='leadup',
                                          parameters={'event_value': 1,
                                                      'duration': 30})
YAML
label_a:
  label_tuning:
    method: 'leadup'
    parameters: {'event_value': 1, 'duration': 30}

followup

Marks the data points within a specified period of time after an event as events as well. Requires the following parameters:

  • event_value: The value which specifies the occurrence of the event.

  • duration: The number of seconds to cover after the event.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.labels['label_a'].label_tuning = Method(method='followup',
                                          parameters={'event_value': 1,
                                                      'duration': 30})
YAML
label_a:
  label_tuning:
    method: 'followup'
    parameters: {'event_value': 1, 'duration': 30}
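
The effect of leadup and followup on a label column can be sketched in plain Python. This is an illustration under stated assumptions, not the cengine implementation: the helpers take timestamps in seconds alongside the labels, and the window is assumed to include its boundary.

```python
def leadup(timestamps, labels, event_value, duration):
    """Mark points up to `duration` seconds before an event as events."""
    event_times = [t for t, y in zip(timestamps, labels) if y == event_value]
    return [event_value if any(0 <= e - t <= duration for e in event_times) else y
            for t, y in zip(timestamps, labels)]


def followup(timestamps, labels, event_value, duration):
    """Mark points up to `duration` seconds after an event as events."""
    event_times = [t for t, y in zip(timestamps, labels) if y == event_value]
    return [event_value if any(0 <= t - e <= duration for e in event_times) else y
            for t, y in zip(timestamps, labels)]


timestamps = [0, 10, 20, 30, 40, 50, 60]   # seconds
labels = [0, 0, 0, 0, 1, 0, 0]             # a single event at t = 40

before = leadup(timestamps, labels, event_value=1, duration=20)
after = followup(timestamps, labels, event_value=1, duration=20)
print(before)  # [0, 0, 1, 1, 1, 0, 0]
print(after)   # [0, 0, 0, 0, 1, 1, 1]
```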

shift

Shifts the labels by a given number of steps. Requires the following parameters:

  • shift_steps: The number of steps to shift.

  • fill_value: A value to fill the empty values with.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.labels['label_a'].label_tuning = Method(method='shift',
                                          parameters={'shift_steps': 5,
                                                      'fill_value': 0})
YAML
label_a:
  label_tuning:
    method: 'shift'
    parameters: {'shift_steps': 5, 'fill_value': 0}
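
A minimal plain-Python sketch of what shifting a label column looks like is given below. It is not the cengine implementation. The direction convention is an assumption: here a positive shift_steps moves labels towards later positions, and a negative value moves them towards earlier positions.

```python
def shift(labels, shift_steps, fill_value):
    """Shift labels by `shift_steps` positions, padding gaps with `fill_value`."""
    n = len(labels)
    if shift_steps >= 0:
        k = min(shift_steps, n)
        # The first k positions have no source label and receive fill_value.
        return [fill_value] * k + labels[:n - k]
    k = min(-shift_steps, n)
    # The last k positions have no source label and receive fill_value.
    return labels[k:] + [fill_value] * k


labels = [1, 2, 3, 4, 5]
print(shift(labels, 2, 0))    # [0, 0, 1, 2, 3]
print(shift(labels, -2, 0))   # [3, 4, 5, 0, 0]
```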

map

Maps the values within a column to new values. Requires the following parameter:

  • mapper: A dictionary used to map different values.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.labels['label_a'].label_tuning = Method(
    method='map',
    parameters={'mapper': {'brandA_model1': 'A',
                           'brandA_model2': 'A',
                           'brandA_model3': 'A',
                           'brandB_model1': 'B',
                           'brandB_model2': 'B',
                           'brandB_model3': 'B'}})
YAML
label_a:
  label_tuning:
    method: 'map'
    parameters:
      mapper:
        brandA_model1: 'A'
        brandA_model2: 'A'
        brandA_model3: 'A'
        brandB_model1: 'B'
        brandB_model2: 'B'
        brandB_model3: 'B'

no_tuning

No label tuning is applied.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.labels['label_a'].label_tuning = Method(method='no_tuning', parameters={})
YAML
label_a:
  label_tuning:
    method: 'no_tuning'
    parameters: {}

Transform

scale_by_min_max

Scale the values to a given range. Requires the following parameters:

  • min: The minimum value of the scaled feature.

  • max: The maximum value of the scaled feature.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='scale_by_min_max',
                                           parameters={'min': -1,
                                                       'max': 1})
YAML
feature_a:
  transform:
    method: 'scale_by_min_max'
    parameters:
      min: -1
      max: 1
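
The arithmetic behind min-max scaling can be sketched in plain Python. This is an illustration, not the cengine implementation: the helper below uses the hypothetical argument names out_min and out_max for the documented min and max parameters, and mapping a constant column to the middle of the range is an assumption.

```python
def scale_by_min_max(values, out_min, out_max):
    """Linearly rescale values so the smallest maps to out_min, the largest to out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to the middle of the range.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) / (hi - lo) * (out_max - out_min) for v in values]


values = [0.0, 5.0, 10.0]
print(scale_by_min_max(values, -1, 1))  # [-1.0, 0.0, 1.0]
print(scale_by_min_max(values, 0, 1))   # [0.0, 0.5, 1.0]
```

The second call uses the range 0 to 1, which is the special case that scale_to_0_1 describes.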

scale_to_0_1

Scale the values between 0 and 1.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='scale_to_0_1', parameters={})
YAML
feature_a:
  transform:
    method: 'scale_to_0_1'
    parameters: {}

scale_to_z_score

Standardize the values to a mean of 0 and a variance of 1.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='scale_to_z_score',
                                           parameters={})
YAML
feature_a:
  transform:
    method: 'scale_to_z_score'
    parameters: {}
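
As a worked example of standardization, the plain-Python sketch below subtracts the mean and divides by the standard deviation. It is not the cengine implementation. Using the population standard deviation and returning zeros for a constant column are assumptions made for the illustration.

```python
from statistics import mean, pstdev


def scale_to_z_score(values):
    """Standardize values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Assumption: a constant column carries no spread and maps to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
z_scores = scale_to_z_score(values)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```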

tfidf

Compute the term frequency–inverse document frequency (TF-IDF) of the values. Requires the following parameter:

  • vocab_size: The size of the vocabulary to use.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='tfidf',
                                           parameters={'vocab_size': 100})
YAML
feature_a:
  transform:
    method: 'tfidf'
    parameters:
      vocab_size: 100

compute_and_apply_vocabulary

Create a vocabulary based on the values and apply it to the data column.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(
    method='compute_and_apply_vocabulary',
    parameters={})
YAML
feature_a:
  transform:
    method: 'compute_and_apply_vocabulary'
    parameters: {}

ngrams

Computes n-grams for a given input. Requires the following parameters:

  • ngram_range: The range of n-gram sizes to return.

  • separator: A string value which would be put between tokens when constructing n-grams.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='ngrams',
                                           parameters={'ngram_range': ,
                                                       'separator': ' '})
YAML
feature_a:
  transform:
    method: 'ngrams'
    parameters:
      ngram_range:
      separator: ' '
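
To make the two parameters concrete, here is a minimal plain-Python sketch of n-gram construction. It is not the cengine implementation. The sketch assumes that ngram_range is an inclusive (min_n, max_n) pair and that the input is already tokenized; the range (1, 2) used below is chosen purely for illustration.

```python
def ngrams(tokens, ngram_range, separator):
    """Build all n-grams whose size lies in the inclusive range (min_n, max_n)."""
    min_n, max_n = ngram_range
    return [separator.join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]


tokens = ['the', 'quick', 'fox']
print(ngrams(tokens, (1, 2), ' '))
# ['the', 'quick', 'fox', 'the quick', 'quick fox']
```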

hash_strings

Hash strings into buckets. Requires the following parameter:

  • hash_buckets: Number of hash buckets.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='hash_strings',
                                           parameters={'hash_buckets': })
YAML
feature_a:
  transform:
    method: 'hash_strings'
    parameters:
      hash_buckets:
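
The idea of hashing strings into a fixed number of buckets can be sketched in plain Python. This is an illustration only: the hash function that cengine uses is not stated here, so the sketch substitutes MD5 from the standard library, and the bucket count of 10 is an arbitrary example value.

```python
import hashlib


def hash_string(value, hash_buckets):
    """Map a string to a stable bucket index in the range [0, hash_buckets)."""
    # MD5 is used only because it is deterministic across runs, unlike
    # Python's built-in hash() for strings; it is an assumption of this sketch.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_buckets


brands = ['brandA_model1', 'brandA_model2', 'brandB_model1']
bucket_ids = [hash_string(b, 10) for b in brands]
# Each id lies between 0 and 9, and the same string always gets the same id.
```

Distinct strings can land in the same bucket, so the number of buckets trades memory against collisions.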

bucketize

Map the inputs into buckets. Requires the following parameter:

  • num_buckets: The number of buckets to be used.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='bucketize',
                                           parameters={'num_buckets': 3})
YAML
feature_a:
  transform:
    method: 'bucketize'
    parameters:
      num_buckets: 3
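
One common way to bucketize a numeric feature is to place the bucket boundaries at quantiles, so that each bucket receives roughly the same number of values. The plain-Python sketch below shows that approach. Whether cengine uses quantile boundaries is an assumption, and the helper is not its implementation.

```python
from bisect import bisect_right
from statistics import quantiles


def bucketize(values, num_buckets):
    """Assign each value a bucket index using quantile-based boundaries."""
    # quantiles() returns the num_buckets - 1 cut points between the buckets.
    boundaries = quantiles(values, n=num_buckets)
    return [bisect_right(boundaries, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
buckets = bucketize(values, 3)
print(buckets)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```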

no_transform

No transformations are applied.

Python SDK
from cengine import PipelineConfig
from cengine import Method
p = PipelineConfig()
p.features['feature_a'].transform = Method(method='no_transform', parameters={})
YAML
feature_a:
  transform:
    method: 'no_transform'
    parameters: {}