Caching

Caching is an important mechanism in the Core Engine. Whenever you execute a pipeline in a workspace, the Core Engine breaks your pipeline into individual pieces, called components. These components correspond, roughly but not exactly, to the main keys in the configuration YAML file and the Python SDK PipelineConfig object: timeseries, split, features, labels, trainer, evaluator and preprocessing. The outputs of these components are stored as they are computed. Whenever another pipeline is run whose configuration overlaps with that of a previously run pipeline, the Core Engine reuses the previously computed outputs of the unchanged components to warm-start the pipeline, rather than recomputing them.

This not only makes each subsequent run potentially much faster, but also saves on compute cost.
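To make the idea concrete, here is a minimal, purely illustrative Python sketch of configuration-based caching. It is not the Core Engine implementation; the function names and the in-memory dictionary are stand-ins for the real components and the workspace-level metadata store.

import hashlib
import json

# In-memory stand-in for the workspace-level metadata store.
_cache = {}

def _cache_key(component_name, component_config):
    # Build a deterministic key from the component name and its configuration.
    payload = json.dumps(
        {"component": component_name, "config": component_config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def run_component(component_name, component_config):
    # Reuse a stored output if this exact component configuration ran before.
    key = _cache_key(component_name, component_config)
    if key in _cache:
        print(component_name + ": cache hit, reusing stored output")
        return _cache[key]
    print(component_name + ": cache miss, computing output")
    output = "output-of-" + component_name  # placeholder for the real artifact
    _cache[key] = output
    return output

# Pipeline 1: every component is computed.
run_component("split", {"index_ratio": {"train": 0.7, "eval": 0.3}})
run_component("trainer", {"train_steps": 2000})

# Pipeline 2: only the trainer configuration changed, so split is reused.
run_component("split", {"index_ratio": {"train": 0.7, "eval": 0.3}})
run_component("trainer", {"train_steps": 2500})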

Caching only works across pipelines in the same workspace. Please make sure to put all related pipelines in the same workspace to take advantage of caching.

Example

Let's take a simple configuration file, namely our Quickstart example, to illustrate the power of caching. The example below uses the same datasource as outlined in the Quickstart, i.e., the census dataset of adult income levels.

Let's say the config file looks like this for our first pipeline, called Pipeline 1.

Python SDK
# coming soon
YAML
split:
  index_ratio: {train: 0.7, eval: 0.3}
...
features:
  occupation: {}
  race: {}
  age: {}
  ...
evaluator:
  native_country: {}
  ...
labels:
  income_bracket: {}
trainer:
  fn: generic@latest
  params:
    loss: binary_crossentropy
    metrics:
      - accuracy
    last_activation: sigmoid
    n_output_units: 1
    input_units: 12
    train_steps: 2000
    eval_steps: 1500
...

Then we can run this pipeline and check its status:

Python SDK
# coming soon
CLI
cengine pipeline train PIPELINE_1_ID
cengine pipeline status --pipeline_id PIPELINE_1_ID

And you should see an output like this:

Python SDK
# coming soon
CLI
ID              | Name       | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
----------------+------------+-----------------+------------+------------------+-------------------+----------------+---------------
<pipeline_1_id> | Pipeline 1 | Succeeded       | 100%       | 0.012            | 0.2167            | 0.2287         | 0:14:21.187081

Now you create a Pipeline 2, where the only difference is that train_steps is increased from 2000 to 2500:

Python SDK
# coming soon
YAML
trainer:
  ...
  train_steps: 2500
  ...

And run it again:

Python SDK
# coming soon
CLI
cengine pipeline run PIPELINE_2_ID
cengine pipeline status --pipeline_id PIPELINE_2_ID

And now you can see the output:

Python SDK
# coming soon
CLI
ID              | Name       | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
----------------+------------+------------------+-------------------+----------------+---------------
<pipeline_1_id> | Pipeline 1 | 0.012            | 0.2167            | 0.2287         | 0:14:21.187081
<pipeline_2_id> | Pipeline 2 | 0.0027           | 0.2167            | 0.2193         | 0:09:36.930483

The columns to watch out for are Compute Cost (€) and Execution Time. As can be seen, caching reduced the compute cost of the pipeline by 77.5% and sped it up by ~33%.
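If you want to verify those numbers yourself, here is a quick back-of-the-envelope calculation using the values from the status tables above:

# Values taken from the status tables above.
compute_cost_1, compute_cost_2 = 0.012, 0.0027
seconds_1 = 14 * 60 + 21.187081  # 0:14:21.187081
seconds_2 = 9 * 60 + 36.930483   # 0:09:36.930483

print("Compute cost reduction: {:.1%}".format(1 - compute_cost_2 / compute_cost_1))  # 77.5%
print("Execution time reduction: {:.1%}".format(1 - seconds_2 / seconds_1))          # 33.0%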

Credit where credit is due

Our caching is powered by the wonderful ML Metadata store from the TensorFlow Extended (TFX) project. TFX is an awesome free and open-source tool by Google, and we use it extensively under the hood of our own Core Engine!
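If you are curious what ML Metadata looks like on its own, here is a small standalone sketch that records an artifact in a local MLMD store. It is independent of the Core Engine; the artifact type, property and URI are made up purely for illustration.

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Use a local SQLite file as the metadata backend.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register an artifact type (name and property are illustrative only).
artifact_type = metadata_store_pb2.ArtifactType()
artifact_type.name = "PreprocessedDataset"
artifact_type.properties["split"] = metadata_store_pb2.STRING
type_id = store.put_artifact_type(artifact_type)

# Record one output artifact of that type (the URI is a hypothetical path).
artifact = metadata_store_pb2.Artifact()
artifact.type_id = type_id
artifact.uri = "gs://my-bucket/pipeline_1/preprocessing/"
artifact.properties["split"].string_value = "train"
[artifact_id] = store.put_artifacts([artifact])
print("Stored artifact {} in the metadata store".format(artifact_id))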