Quick Start

Overview

So you've signed up and installed, and now you probably just want to start creating pipelines - so let's get to it!

For our quickstart, we chose a public BigQuery table with census data on adult income levels. Here's a snapshot of the data:

age | workclass | functional_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income_bracket
-------+--------------+---------------------+-------------+-----------------+--------------------+-------------------+----------------+--------------------+--------+----------------+----------------+------------------+------------------+------------------
39 | Private | 297847 | 9th | 5 | Married-civ-spouse | Other-service | Wife | Black | Female | 3411 | 0 | 34 | United-States | <=50K
72 | Private | 74141 | 9th | 5 | Married-civ-spouse | Exec-managerial | Wife | Asian-Pac-Islander | Female | 0 | 0 | 48 | United-States | >50K
45 | Private | 178215 | 9th | 5 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | United-States | >50K

The right-most column, 'income_bracket', is our label, i.e., what we want to predict based on all the other columns.

We'll aim for a validation accuracy of around 90%, quickly run multiple experiments on various pipelines, and see how it's possible to save up to 80% in pre-processing costs using caching.

Login

Once you've successfully installed the cengine CLI, you can login with your credentials using:

cengine auth login

Set your workspace

The following command will list your workspaces.

cengine workspace list

Expected output:

Selection | ID | Name
-------------+----------------+---------------------
| <workspace_id> | Default Workspace

Oh look at that! You already have a pre-made workspace ready for you! Go ahead and make that workspace active.

cengine workspace set <workspace_id>

Great, now all subsequent actions will run in the context of this workspace.

A closer look at the datasource

A pipeline must have a datasource to consume from, of course. You can list your datasources with:

cengine datasource list

Expected output:

Selection | ID | Name | Rows | Cols | Size (MB)
-------------+-----------------+----------------------+-----------+--------+-------------
| <datasource_id> | Census Income Data | 32561 | 15 | 5

As expected, the census data datasource has been added to your account by default, just like the workspace and pipeline. The dataset has 32561 rows and 15 cols, and weighs in at around 5 MB. A relatively tiny dataset, but consider it our illustrative MNIST.

We can set this to be the active datasource as follows:

cengine datasource set <datasource_id>

We can also peek at the datasource using:

cengine datasource peek <datasource_id>

This will print out 10 randomly sampled data points from the BigQuery table!

Run your first pipeline

Now you are ready to run your first pipeline.

The following command will list your registered pipelines.

cengine pipeline list

Expected output:

ID | Name
----------------+---------------
<pipeline_id> | Hello World

Looks like someone (cough) created your first pipeline already. Go ahead and run it!

cengine pipeline run <pipeline_id>

You should see a success message with your chosen configuration. Based on the size of the connected datasource, the Core Engine will select 1 worker with 1 CPU per worker for this pipeline. It will provision these resources in the cloud, connect automatically to the datasource, and create a machine learning pipeline to train a model.

Check pipeline status

You can always check your pipeline's status with:

cengine pipeline status --pipeline_id <pipeline_id>

You should see something like this:

ID | Name | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
---------------+-----------------------+-------------------+--------------+--------------------+---------------------+------------------+------------------
<pipeline_id> | Hello World | Running | 86% | 0 | 0 | 0 | 0:14:21.187081

You'll notice that while the pipeline is Running, all the costs are set to 0. That's because only fully completed, successful pipelines are charged to your billing.

tip

If you don't want to run the command again and again, you can set a watch on it.

watch cengine pipeline status --pipeline_id <pipeline_id>

macOS users can install watch with brew install watch

Once completion hits 100%, we'll be able to see the results. However, as this might take 13 minutes or so, we can use the time to inspect the config file that generated this pipeline.

Check your config

To understand our pipeline in more detail, run:

cengine pipeline pull <pipeline_id> --output_path /some/empty/dir/out.yaml

This will pull the config file of the pipeline we just executed into a local file, which you can view in your favorite text editor. It is a simple YAML file, so it is easy to peruse in most editors.

This config file requires a bit more work to understand, and more details can be found here. However, you can see the relevant bits here:

The train-eval split. You can see that we took a 70-30 split.

split:
...
index_ratio: {train: 0.7, eval: 0.3}
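Conceptually, an index-based split like this just shuffles the rows and cuts them at the 70% mark. A minimal Python sketch of the idea (illustrative only, not the Core Engine's actual implementation):

```python
import random

def train_eval_split(rows, train_ratio=0.7, seed=42):
    """Shuffle rows, then split them by index ratio, e.g. 70% train / 30% eval."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(32561))  # one entry per row of the census datasource
train, eval_ = train_eval_split(rows)
print(len(train), len(eval_))  # 22792 9769
```

With 32561 rows, a 70-30 split leaves 22792 rows for training and 9769 for evaluation.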

The label, as mentioned, is the column income_bracket, with the corresponding loss and metrics we want to track.

labels:
income_bracket:
loss: binary_crossentropy
metrics: [accuracy]
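binary_crossentropy is the standard loss for a two-class label like income_bracket. To show what is actually being minimized, here is a hand-rolled version (for illustration; in practice the framework computes this for you):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood for binary labels (1 = '>50K', 0 = '<=50K')."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a loss near 0:
print(round(binary_crossentropy([1, 0, 1], [0.99, 0.01, 0.95]), 4))  # 0.0238
```

The loss shrinks toward 0 as predicted probabilities move toward the true labels, and blows up for confident wrong predictions.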

The model definition itself is also easy to understand. We're using a 3-layer feed-forward network here, with a batch size of 16 and training for around 2500 steps (one step is one pass of a batch through the network).

trainer:
architecture: feedforward
layers:
- {type: dense, units: 128}
- {type: dense, units: 128}
- {type: dense, units: 64}
last_activation: sigmoid
train_batch_size: 16
train_steps: 2500
optimizer: adam
...
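The architecture above is a plain feed-forward pass: three dense hidden layers (128, 128, 64 units) and a sigmoid output for the binary label. A NumPy sketch of the forward computation, with random weights and ReLU hidden activations assumed (the config above does not show the hidden activation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units):
    """One fully connected layer with randomly initialized weights."""
    w = rng.normal(scale=0.05, size=(x.shape[-1], units))
    b = np.zeros(units)
    return x @ w + b

def forward(x):
    """128 -> 128 -> 64 hidden layers, sigmoid output, as in the trainer config."""
    h = np.maximum(dense(x, 128), 0)   # ReLU assumed for hidden layers
    h = np.maximum(dense(h, 128), 0)
    h = np.maximum(dense(h, 64), 0)
    logits = dense(h, 1)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid, per last_activation

batch = rng.normal(size=(16, 14))  # train_batch_size=16, 14 feature columns
probs = forward(batch)
print(probs.shape)  # (16, 1)
```

As for the step count: 2500 steps at a batch size of 16 means the trainer sees 40,000 examples, i.e. roughly 1.75 passes over the ~22,800-row train split.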

See the results

important

In order to get the below evaluation visualizations to work, you must also run:

jupyter nbextension install --py --symlink tensorflow_model_analysis
jupyter nbextension enable --py tensorflow_model_analysis

You might be required to run these as sudo if you're not working in a virtualenv.

Once your pipeline reaches 100% completion (this should take approximately 13 minutes), you can see the cost breakdown, in addition to how long the pipeline took.

You can also see the results of the model training by running:

cengine pipeline evaluate <pipeline_id>

Woah, did your browser just open up to a Jupyter notebook? Trippy. If you run the pre-made notebook's blocks, you'll see a few handy plugins that showcase the results.

You will be able to see TensorBoard logs and TensorFlow Model Analysis (TFMA) results. Both of these combined give a great overview of how we did!

tfma_analysis

We are able to see not only the overall accuracy (~78% if all went well), but also the accuracy across a slicing_column of our choice. A slicing_column is simply a feature in your data across which you would like to see your metric (e.g. accuracy) in the final evaluation. In the above example, by slicing our metric by marital_status, we are able to identify that we performed worst on married people with children! We should definitely try to improve our model for that segment of our data.
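Conceptually, a slicing metric just groups the evaluation examples by a feature and computes the metric per group. A toy sketch with hypothetical data (not the TFMA implementation):

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """Compute per-slice accuracy; each example carries its features,
    true label and predicted label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        key = ex[slice_key]
        totals[key] += 1
        hits[key] += int(ex["label"] == ex["prediction"])
    return {k: hits[k] / totals[k] for k in totals}

examples = [  # hypothetical evaluation rows
    {"marital_status": "Married-civ-spouse", "label": 1, "prediction": 0},
    {"marital_status": "Married-civ-spouse", "label": 1, "prediction": 1},
    {"marital_status": "Never-married", "label": 0, "prediction": 0},
]
print(accuracy_by_slice(examples, "marital_status"))
# {'Married-civ-spouse': 0.5, 'Never-married': 1.0}
```

A slice with markedly lower accuracy than the overall number is exactly the kind of segment worth investigating.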

tip

You can edit the last line of the TFMA notebook block to include your slicing metrics:

tfma.view.render_slicing_metrics(evaluation, slicing_column='marital_status')

There's a lot more to see here. Click here for more information regarding the evaluation results.

Congratulations! You have successfully run your first (we're claiming it) production-ready machine learning pipeline.

Iterate

OK, so we got one result, but we shouldn't stop there, should we? Let's create more pipelines and iterate!

You can always edit your config yaml file directly or through the CLI to create a new pipeline. For this example, we can edit the config file that we pulled previously. Open up your favorite text editor, and edit the trainer like so (only changes shown):

trainer:
...
layers:
- {type: dense, units: 32}
- {type: dense, units: 32}
- {type: dense, units: 16}
...

As you can see, we reduced the neurons in each layer because we want to check if a smaller network might perform better. Feel free to make any change you want here, and make sure to follow the trainer documentation for more details!

Now you can register your pipeline by running:

cengine pipeline push /path/to/modified_out.yaml "Hello World Smaller Network"

Where modified_out.yaml is the edited config YAML file.

This will push another pipeline to your workspace with the name Hello World Smaller Network. The pipeline will be automatically assigned a new <pipeline_id>. At this point, an ops configuration (workers, cpus_per_worker) will be assigned to this pipeline automatically by the Core Engine. However, you can always update this as follows:

cengine pipeline update <pipeline_id> --workers <NUM_WORKERS> --cpus_per_worker <CPUS_PER_WORKER>

You can now run this pipeline again as follows:

cengine pipeline run <pipeline_id>

Depending on what you change in the config YAML, you will notice a marked speed improvement in subsequent pipelines in the same workspace. This is because the Core Engine heavily utilizes previously computed, cached intermediate results to ensure that you do not repeat the same processing steps. For example, if you only change something under the trainer key in the YAML, only the training part will run again. This way, you can quickly iterate over many different machine learning models and arrive at the best one.
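One way to picture this caching behaviour: each step's output is keyed by a hash of its configuration, so a step whose config has not changed is served from cache instead of recomputed. A simplified sketch of the idea (not the Core Engine's actual mechanism):

```python
import hashlib
import json

cache = {}

def run_step(name, config, compute):
    """Return a cached result when this step's exact config was seen before."""
    key = hashlib.sha256(
        json.dumps({"step": name, "config": config}, sort_keys=True).encode()
    ).hexdigest()
    if key not in cache:
        cache[key] = compute(config)  # only computed on a cache miss
    return cache[key]

# First pipeline: both steps compute.
run_step("preprocess", {"split": 0.7}, lambda c: "preprocessed-data")
run_step("trainer", {"units": [128, 128, 64]}, lambda c: "model-v1")

# Second pipeline: preprocessing is a cache hit; only the changed trainer reruns.
run_step("preprocess", {"split": 0.7}, lambda c: "preprocessed-data")
run_step("trainer", {"units": [32, 32, 16]}, lambda c: "model-v2")
print(len(cache))  # 3 distinct step configurations were actually computed
```

Four step invocations, but only three computations: the unchanged preprocessing step ran once and was reused.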

You will also notice that the Core Engine will not charge you for cached processing steps, as indicated by the reduced cost of subsequent pipelines.

ID | Name | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
---------------+----------------------------------+-------------------+--------------+--------------------+---------------------+------------------+------------------
<pipeline_id> | Hello World | Succeeded | 100% | 0.012 | 0.2167 | 0.2287 | 0:14:21.187081
<pipeline_id> | Hello World Smaller Network | Succeeded | 100% | 0.0027 | 0.2167 | 0.2193 | 0:09:36.930483
<pipeline_id> | Hello World Increased Batch Size | Succeeded | 100% | 0.0027 | 0.2167 | 0.2193 | 0:10:03.570141

Therefore, we went from a compute cost of 0.012 to 0.0027, a reduction of roughly 78%.
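The saving quoted above is simply the relative drop in compute cost between the uncached and cached runs:

```python
full_run = 0.012    # compute cost (EUR) of the first, uncached pipeline
cached_run = 0.0027 # compute cost (EUR) with pre-processing served from cache

saving = 1 - cached_run / full_run
print(f"{saving:.1%}")  # 77.5%
```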

This difference might seem small now, but boy does it scale! With the Core Engine you don't have to worry about repeating the same steps over and over. You just use the resources you need to compute precisely what you want to. Hyper-parameter tuning, running fast iterations and model comparison is quick, cheap and easy.

Check the statistics

If you want to dive into the statistics of the datasource to get more of a feel for the results, you can simply run:

cengine pipeline statistics <pipeline_id>

Wait a few seconds and your preferred browser will open and display the statistics of this datasource. You will notice that you can view both train and eval splits. Statistics are tied to pipelines because each pipeline has exactly one datasource.

Statistics

For more information about the statistics view, please click here.

Download the model

Ok, so now you're happy with the results, and you want that trained model? Easy! Go ahead and execute:

cengine pipeline model <pipeline_id> --output_path /path/to/download/model/

This will download the model to the specified directory as a TensorFlow SavedModel. For more information about the model format, please click here.

Conclusion

That was easy! Now you're ready to start making your own pipelines. To start, find out how to create a workspace, or add your own datasource, and finally make your own pipeline.