Feature selection

Successful training goes hand in hand with a comprehensive analysis of your model, and both depend on assigning the right data columns to the right tasks. In this section, we cover how to select data columns for training, labeling and evaluation respectively.

Main block: features

The main block features defines which set of features will be used during training. It also allows you to modify the preprocessing steps filling and transform (and, in the case of sequential data, resampling) for each of these features.

Structurally, each key under features represents a selected feature. Under each key, you can specify which method to use for each preprocessing step. If a step is not explicitly defined, its behavior is inferred from the preprocessing block based on the data type of the feature.

Examples

In the simplest scenario, the method for every preprocessing step of every feature can be inferred from the preprocessing block. In this case, you only need to specify the names of the data columns as keys and leave the values as empty dictionaries:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()

# Defining features
p.features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']

YAML
features:
  feature_1: {}
  feature_2: {}
  feature_3: {}
  feature_4: {}
  feature_5: {}

Moving on to a slightly more complex configuration, the following example shows how to overwrite the default behavior. In this scenario, the configuration explicitly states that the filling operation on feature_1 will apply the method mean to its datapoints, and that the transform operation on feature_2 will scale its values between -1 and 1 with the method scale_by_min_max. The rest of the configuration will be derived from the preprocessing block.

Python SDK
from cengine import Method
from cengine import PipelineConfig

p = PipelineConfig()

# Defining features
p.features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']

# Overwriting default functions
p.features['feature_1'].filling = Method(method='mean')
p.features['feature_2'].transform = Method(method='scale_by_min_max',
                                           parameters={'min': -1, 'max': 1})

YAML
features:
  feature_1:
    filling:
      method: 'mean'
      parameters: {}
  feature_2:
    transform:
      method: 'scale_by_min_max'
      parameters: {min: -1, max: 1}
  feature_3: {}
  feature_4: {}
  feature_5: {}

Main block: labels

As the name suggests, the main block labels determines which data column will be used as the label during training. Its inner structure is quite similar to the features block: the keys denote the selected data columns and the values hold the preprocessing configuration.

Examples

For this example, imagine a classical classification scenario with a twist. Let's assume that you want your model to classify an asset by its brand, brandA or brandB. However, the brand information is embedded within a column called label_1, which holds three models from each brand, namely brandA_model1, brandA_model2, brandA_model3, brandB_model1, brandB_model2 and brandB_model3. In such cases, you can use the label tuning method map and define your own mapping.

Python SDK
from cengine import Method
from cengine import PipelineConfig

p = PipelineConfig()

# Defining labels
p.labels = ['label_1']

# Overwriting default functions
p.labels['label_1'].label_tuning = Method(method='map',
                                          parameters={'mapper': {
                                              'brandA_model1': 'A',
                                              'brandA_model2': 'A',
                                              'brandA_model3': 'A',
                                              'brandB_model1': 'B',
                                              'brandB_model2': 'B',
                                              'brandB_model3': 'B'}})

YAML
labels:
  label_1:
    label_tuning:
      method: 'map'
      parameters:
        mapper:
          brandA_model1: 'A'
          brandA_model2: 'A'
          brandA_model3: 'A'
          brandB_model1: 'B'
          brandB_model2: 'B'
          brandB_model3: 'B'
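
To build some intuition for what the mapper achieves, the following is a minimal, standalone sketch of how such a dictionary collapses the six model names into the two brand classes. It only illustrates the mapping itself and is not the library's internal implementation; the sample values are hypothetical.

# Illustrative only: a 'mapper' dictionary collapsing model names into brand classes
mapper = {
    'brandA_model1': 'A', 'brandA_model2': 'A', 'brandA_model3': 'A',
    'brandB_model1': 'B', 'brandB_model2': 'B', 'brandB_model3': 'B',
}

# Hypothetical sample of values from the label_1 column
raw_labels = ['brandA_model2', 'brandB_model1', 'brandA_model3']
tuned_labels = [mapper[value] for value in raw_labels]
print(tuned_labels)  # ['A', 'B', 'A']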

Another good example for this block revolves around a sequential task. Imagine that you have a label column called event_label which holds the value 1 when a critical event occurs and 0 when the system is idle, and you would like to detect critical events based on your sensor data. However, you do not just want to detect an event as it happens, you also want to predict it 3 hours before it occurs. One possible way of labeling your data for such a task is to introduce a lead-up to your labels. In this example, the method leadup is used to mark not just the events themselves but also any datapoints which occurred up to 3 hours prior to an event.

Python SDK
from cengine import Method
from cengine import PipelineConfig

p = PipelineConfig()

# Defining labels
p.labels = ['event_label']

# Overwriting default functions
p.labels['event_label'].label_tuning = Method(method='leadup',
                                              parameters={'event_value': 1,
                                                          'duration': 10800})

YAML
labels:
  event_label:
    label_tuning:
      method: 'leadup'
      parameters:
        event_value: 1
        duration: 10800 # 3 hours in secs
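
For intuition, the sketch below shows the effect such a lead-up has on a label series. It assumes evenly spaced hourly datapoints with timestamps in seconds and is purely illustrative; it is not the library's implementation of leadup.

# Illustrative only: mark every datapoint within `duration` seconds before an event
def apply_leadup(labels, timestamps, event_value=1, duration=10800):
    """labels: list of 0/1 values; timestamps: seconds, aligned with labels."""
    event_times = [t for t, y in zip(timestamps, labels) if y == event_value]
    return [
        event_value if any(0 <= e - t <= duration for e in event_times) else y
        for t, y in zip(timestamps, labels)
    ]

# Hypothetical hourly datapoints (3600 s apart) with one event at the last step
timestamps = [0, 3600, 7200, 10800, 14400, 18000]
labels = [0, 0, 0, 0, 0, 1]
print(apply_leadup(labels, timestamps))  # [0, 0, 1, 1, 1, 1]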

Main block: evaluator

Finally, the main block evaluator determines which data columns will be used in the evaluation of the trained model. It shares the same structure as the features block.

Parameters

Parameter         dtype    required
slice_column_1    dict     False
slice_column_2    dict     False
...               ...      ...
slice_column_N    dict     False

During evaluation, your datapoints will be sliced on the columns specified in this block. For this reason, it is highly recommended that you select categorical columns here.

Examples

For example, if you would like to evaluate your trained model based on a categorical column such as price_category, you can simply use:

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()

# Defining evaluator columns
p.evaluator = ['price_category']

YAML
evaluator:
  price_category: {}
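
If you would like to slice the evaluation along more than one categorical column, you can list several columns in the same way. The snippet below is a sketch of that configuration; region is a hypothetical column name used purely for illustration.

Python SDK
from cengine import PipelineConfig

p = PipelineConfig()

# Slicing the evaluation on two categorical columns
# ('region' is a hypothetical column name)
p.evaluator = ['price_category', 'region']

YAML
evaluator:
  price_category: {}
  region: {}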