Caching

Caching is an important mechanism in the Core Engine. Whenever you execute a pipeline in the same workspace, the Core Engine breaks your pipeline into individual pieces, called components. These components correspond, roughly but not exactly, to the main keys in the configuration YAML file: timeseries, split, features, labels, trainer, evaluator and preprocessing. The outputs of these components are stored as they are computed. Then, whenever another pipeline is run with a configuration similar to a previously run pipeline, the Core Engine simply uses the previously computed outputs to warm start the new run, rather than recomputing them.

This not only makes each subsequent run potentially much faster, but also saves on compute cost.
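Conceptually, you can think of this as a content-addressed cache keyed on each component's configuration. Here is a minimal, purely illustrative Python sketch of that idea, assuming a simple in-memory key-value store; the names cache_key and run_component are our own and are not part of the Core Engine API:

import hashlib
import json

_cache = {}  # cache key -> previously computed component output

def cache_key(component_name, component_config):
    # Derive a deterministic key from a component's configuration.
    payload = json.dumps({component_name: component_config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_component(name, config, compute):
    # Reuse the stored output if this exact configuration ran before;
    # otherwise compute it once and store the result (warm start).
    key = cache_key(name, config)
    if key not in _cache:
        _cache[key] = compute(config)
    return _cache[key]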

caution

Caching only works across pipelines in the same workspace. Please make sure to put all related pipelines in the same workspace to leverage the advantages of caching.

Example

Let's take a simple configuration file, namely our Hello World example, to illustrate the power of caching.

The example below uses the same datasource as outlined in the Quick Start, i.e., the census data of adult income levels.

Let's say the config file looks like this for our first pipeline, called Pipeline 1.

split:
  index_ratio: {train: 0.7, eval: 0.3}
  ...
features:
  occupation: {}
  race: {}
  age: {}
  ...
evaluator:
  native_country: {}
  ...
labels:
  income_bracket:
    loss: binary_crossentropy
    metrics: [accuracy]
trainer:
  architecture: feedforward
  layers:
    - {type: dense, units: 128}
    - {type: dense, units: 128}
    - {type: dense, units: 64}
  last_activation: sigmoid
  train_batch_size: 16
  train_steps: 2500
  optimizer: adam
...

Then we can run this pipeline and check its status:

cengine pipeline run <pipeline_1_id>
cengine pipeline status --pipeline_id <pipeline_1_id>

And you should see an output like this:

ID | Name | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
--------------------+-----------------------+-------------------+--------------+--------------------+---------------------+------------------+---------------
<pipeline_1_id> | Pipeline 1 | Succeeded | 100% | 0.012 | 0.2167 | 0.2287 | 0:14:21.187081

Now, you create a Pipeline 2, where the only difference is the train_batch_size, changed from 16 to 32, i.e.:

trainer:
  ...
  train_batch_size: 32
  ...
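Before running it, it is worth seeing why this one-line change is cheap. With component-level caching, every component whose configuration is unchanged can be served from the cache, and only the modified trainer (plus anything downstream of it, like the evaluator) has to be recomputed. The dictionaries below are illustrative stand-ins, not the engine's actual internal representation:

# Stand-ins for the two pipeline configurations; only trainer differs.
pipeline_1 = {
    "split":   {"index_ratio": {"train": 0.7, "eval": 0.3}},
    "trainer": {"train_batch_size": 16, "train_steps": 2500},
}
pipeline_2 = {
    "split":   {"index_ratio": {"train": 0.7, "eval": 0.3}},
    "trainer": {"train_batch_size": 32, "train_steps": 2500},
}

for name, config in pipeline_1.items():
    status = "cache hit" if config == pipeline_2[name] else "recompute"
    print(f"{name}: {status}")
# split: cache hit
# trainer: recompute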

And run it again:

cengine pipeline run <pipeline_2_id>
cengine pipeline status --pipeline_id <pipeline_2_id>

And now you can see the output:

ID | Name | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
-------------------+--------------+------------------+--------------+--------------------+---------------------+------------------+----------------
<pipeline_1_id> | Pipeline 1 | Succeeded | 100% | 0.012 | 0.2167 | 0.2287 | 0:14:21.187081
<pipeline_2_id> | Pipeline 2 | Succeeded | 100% | 0.0027 | 0.2167 | 0.2193 | 0:09:36.930483

The columns to watch out for are Compute Cost (€) and Execution Time. As can be seen, we were able to reduce the compute cost of the pipeline by 77.5% and speed it up by ~33%.
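If you want to verify those numbers yourself, the arithmetic follows directly from the two status rows above:

# Savings computed from the two status rows above.
cost_1, cost_2 = 0.012, 0.0027   # Compute Cost (€)
time_1 = 14 * 60 + 21            # 0:14:21 as seconds
time_2 = 9 * 60 + 37             # 0:09:36.93, rounded up to seconds

print(f"compute cost reduced by {(cost_1 - cost_2) / cost_1:.1%}")    # 77.5%
print(f"execution time reduced by {(time_1 - time_2) / time_1:.1%}")  # 33.0%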

Credit where credit is due

Our caching is powered by the wonderful ML Metadata store from the TensorFlow Extended (TFX) project. TFX is an awesome free, open-source tool by Google, and we use it extensively under the hood of our own Core Engine!