Caching is an important mechanism in the Core Engine. Whenever you execute a pipeline in a workspace, the Core Engine breaks your preprocessing into individual pieces, called components. These components correspond, roughly but not exactly, to the main keys in the configuration YAML file, e.g., preprocessing. The outputs of these components are stored as they are computed. Then, whenever another pipeline is run whose configuration is similar to a previously run pipeline, the Core Engine simply reuses the previously computed outputs to warm-start the pipeline, rather than recomputing them. This not only makes each subsequent run potentially much faster, but also saves on compute cost.
Caching only works across pipelines in the same workspace. Please make sure to put all related pipelines in the same workspace to leverage the advantages of caching.
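Conceptually, the mechanism works like a content-addressed cache: each component's output is stored under a key derived from its configuration, so rerunning a component with an identical configuration becomes a cache hit. Here is a minimal illustrative sketch of that idea, assuming a toy in-memory store; the function names and config structure are our own for illustration, not the Core Engine's actual API:

```python
import hashlib
import json

# Simulated store of previously computed component outputs,
# keyed by a hash of each component's configuration.
cache = {}

def config_key(component_config):
    """Derive a stable cache key from a component's configuration."""
    canonical = json.dumps(component_config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def run_component(name, component_config, compute):
    """Reuse a stored output if this exact configuration ran before."""
    key = config_key(component_config)
    if key in cache:
        print(f"{name}: cache hit, reusing stored output")
        return cache[key]
    print(f"{name}: cache miss, computing")
    output = compute(component_config)
    cache[key] = output
    return output

# First pipeline: the component output is computed and stored.
prep = run_component("preprocessing", {"train_batch_size": 16},
                     lambda cfg: "prep-output")

# Second pipeline with the same component config: warm start from the cache.
prep_again = run_component("preprocessing", {"train_batch_size": 16},
                           lambda cfg: "prep-output")
```

A changed configuration produces a different key, so only the affected components are recomputed.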
Let's take a simple configuration file, i.e., our Hello World example, to illustrate the power of caching.
Say the config file looks like this for our first pipeline, called Pipeline 1.
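As an illustrative sketch only (the key names other than `train_batch_size` are placeholders, not the Core Engine's actual schema), such a config might look like:

```yaml
# Illustrative sketch -- key names besides train_batch_size are placeholders
preprocessing:
  # ... feature and label preprocessing steps ...
trainer:
  train_batch_size: 16
```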
Then we can run this pipeline and see its status:
And you should see an output like this:
Now, you create a Pipeline 2, where the only thing that's different is the
train_batch_size, changed from 16 to 32, i.e.:
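The changed fragment would look like this (again, the surrounding key name is a placeholder; only `train_batch_size` is taken from the text above):

```yaml
trainer:
  train_batch_size: 32   # changed from 16; everything else stays the same
```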
And run it again:
And now you can see the output:
The columns to watch out for are
Compute Cost (€) and
Execution Time. As can be seen, we were able to reduce the cost of the pipeline by
77.5% and speed it up by ~33%.
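To make the arithmetic concrete, such improvement figures are computed as the relative reduction from the first run to the second. The euro and minute values below are hypothetical, chosen only to reproduce the stated percentages:

```python
def improvement(before, after):
    """Percentage improvement going from `before` to `after`."""
    return (before - after) / before * 100

# Hypothetical values for illustration only.
cost_saving = improvement(0.40, 0.09)   # compute cost in EUR
time_saving = improvement(60.0, 40.0)   # execution time in minutes

print(f"cost improved by {cost_saving:.1f}%")   # 77.5%
print(f"time improved by {time_saving:.1f}%")   # ~33.3%
```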
Credit where credit is due
Our caching is powered by the wonderful ML Metadata store from the TensorFlow Extended (TFX) project. TFX is an awesome free and open-source tool by Google, and we use it intensively under the hood of our own Core Engine!