For our quickstart, we chose a public BigQuery table with census data of adult income levels. Here's a snapshot of the data:
The rightmost column, 'income_bracket', is our label, i.e., what we want to predict based on all the other columns.
We'll aim for a validation accuracy of around 90%, quickly run multiple experiments on various pipelines, and see how it's possible to save up to 80% of pre-processing costs using caching.
Once you've successfully installed the cengine CLI, you can log in with your credentials using:
Set your workspace
The following command will list your workspaces.
Oh look at that! You already have a pre-made workspace ready for you! Go ahead and make that workspace active.
Great, now all subsequent actions will run in the context of this workspace.
A closer look at the datasource
A pipeline needs a datasource to consume, of course. You can see the datasource by using:
As expected, the census data datasource has been added to your account by default, just like the workspace and pipeline. The dataset has 32561 rows and 15 columns, and weighs in at around 5 MB. A relatively tiny dataset, but consider it our illustrative MNIST.
We can set this to be the active datasource as follows:
We can also peek at the datasource using:
This will print out 10 randomly sampled data points from the BigQuery table!
Run your first pipeline
Now you are ready to run your first pipeline.
The following command will list your registered pipelines.
Looks like someone (cough) created your first pipeline already. Go ahead and run it!
You should see a success message with your chosen configuration. The Core Engine will select 1 worker with 1 CPU per worker for this pipeline, based on the size of the connected datasource. It will provision these resources in the cloud, connect automatically to the datasource, and create a machine learning pipeline to train a model.
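In config terms, that automatic selection amounts to an ops section along these lines. A hedged sketch only: the key names `workers` and `cpus_per_worker` are mentioned later in this guide, but the exact layout of the section is an assumption.

```yaml
# Illustrative sketch -- chosen automatically by the Core Engine
# based on the size of the connected datasource.
ops:
  workers: 1
  cpus_per_worker: 1
```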
Check pipeline status
You can always check your pipeline's status with:
You should see something like this:
You'll notice that while the pipeline is running, all the costs will be set to 0. That's because only fully completed and successful pipelines are charged to your billing.
If you don't want to run the command again and again, you can set a watch on it.
macOS users can install watch with
brew install watch
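As a sketch, a watch loop could look like the following. Note that `cengine pipeline status` here is a placeholder for the status command shown above, not a verified subcommand name; substitute whatever this guide's status command actually is.

```shell
# Re-run the status command every 30 seconds; press Ctrl-C to stop.
# "cengine pipeline status" is a placeholder for the real status command.
watch -n 30 cengine pipeline status

# Without watch installed, a plain shell loop does the same job:
while true; do cengine pipeline status; sleep 30; done
```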
As completion hits 100%, we'll be able to see the results. However, as this might take 13 minutes or so, we can use the time to inspect the config file that generated this pipeline.
Check your config
To understand our pipeline in more detail, run:
This will pull the config file of the pipeline we just executed into a local file, which you can view in your favorite text editor. It is a simple YAML file, so most editors make it easy to peruse.
This config file requires a bit more work to understand, and more details can be found here. However, you can see the relevant bits here:
The train-eval split. You can see that we took a 70-30 split.
The label, as mentioned, is the column income_bracket, with the corresponding loss and metrics we want to track.
The model definition itself is also easy to understand. We're using a 3-layer feed-forward network here, with the batch size set in the config, training for around 2500 steps (one step is one pass of a batch through the network).
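Pulling those bits together, the relevant part of the config might look roughly like this. This is a hedged sketch: the key names and layer sizes are assumptions for illustration, not the Core Engine's verbatim schema, so follow what your pulled file actually contains.

```yaml
# Illustrative sketch only -- key names and layer sizes are assumed.
split:
  train: 0.7
  eval: 0.3
labels:
  - income_bracket
trainer:
  architecture: feedforward
  layers:          # 3-layer feed-forward network
    - units: 64    # placeholder sizes
    - units: 32
    - units: 16
  train_steps: 2500  # one step = one batch through the network
```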
See the results
In order to get the evaluation visualizations below to work, you must also run:
You might be required to run this as sudo if you're not working in a virtualenv.
Once your pipeline reaches 100% completion (it should take approximately 13 minutes), you can see the cost breakdown, in addition to how long the pipeline took.
You can also see the results of the model training by running:
Whoa, did your browser just open up to a Jupyter notebook? Trippy. If you run that pre-made notebook's blocks, you'll see a few handy plugins that showcase the results.
We can see not only the overall accuracy (~78% if all went well), but also the accuracy across a splitting_column of our choice. A splitting_column is simply a feature in your data across which you would like to see your metric (e.g. accuracy) in the final evaluation. In the above example, by splitting our metric by marital_status, we can identify that we did worst with married people with children! We should definitely try to improve our model for that segment of our data.
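If you want such a slice computed as part of the evaluation, the config typically names the column somewhere in an evaluator section. A hedged sketch, with the key names assumed rather than taken from the real schema:

```yaml
# Sketch only -- the actual key name may differ in your config.
evaluator:
  splitting_columns:
    - marital_status
```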
You can edit the last line of the TFMA notebook block to include your slicing metrics:
There's a lot more to see here. Click here for more information regarding the evaluation results.
Congratulations! You have successfully run your first (we're claiming it) production-ready machine learning pipeline.
OK, so we got one result, but we shouldn't stop there, should we? Let's create more pipelines and iterate!
You can always edit your config YAML file directly or through the CLI to create a new pipeline. For this example, we can edit the config file we pulled previously. Open your favorite text editor and edit the trainer like so (only changes shown):
As you can see, we reduced the neurons in each layer because we want to check whether a smaller network might perform better. Feel free to make any change you want here, and make sure to follow the trainer documentation for more details!
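For instance, the edited trainer section could look something like this. The layer sizes and key names are illustrative placeholders; keep whichever keys your pulled config actually uses.

```yaml
# Only the changed section shown -- sizes and keys are placeholders.
trainer:
  layers:
    - units: 32   # reduced from the original sizes
    - units: 16
    - units: 8
```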
Now you can register your pipeline by running:
Where modified_out.yaml is the edited config YAML file.
This will push another pipeline to your workspace with the name Hello World Smaller Network. The pipeline will automatically be assigned a new <pipeline_id>. At this point, an ops configuration (workers, cpus_per_worker) will be assigned to the pipeline automatically by the Core Engine. However, you can always update this as follows:
You can now run this pipeline again as follows:
Depending on what you change in the config YAML, you will notice a marked speed improvement in subsequent pipelines in the same workspace. This is because the Core Engine heavily utilizes previously computed, cached intermediate results to ensure that you do not repeat the same processing steps. For example, if you only change something under the trainer key in the YAML, only the training part will run again. This way, you can quickly iterate over many different machine learning models and arrive at the best one.
You will also notice that the Core Engine will not charge you for cached processing steps, as indicated by the reduced cost of subsequent pipelines.
Therefore, we went from a compute cost of 0.012 to 0.0027, an improvement of roughly 80%.
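As a quick sanity check on that number, the relative saving works out like this:

```shell
# Relative cost reduction: (old - new) / old
awk 'BEGIN { printf "%.1f%% saved\n", (0.012 - 0.0027) / 0.012 * 100 }'
# prints "77.5% saved" -- roughly the 80% quoted above
```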
This difference might seem small now, but boy does it scale! With the Core Engine you don't have to worry about repeating the same steps over and over. You just use the resources you need to compute precisely what you want to. Hyper-parameter tuning, fast iteration, and model comparison are quick, cheap, and easy.
Check the statistics
If you want to dive into the statistics of the datasource to get more of a feel for the results, you can simply run:
Wait a few seconds and your preferred browser will open and display the statistics of this datasource.
You will notice that you can view the statistics for both the train and eval splits of your data.
Statistics are tied to pipelines because each pipeline has exactly one datasource.
For more information about the statistics view, please click here.
Download the model
Ok, so now you're happy with the results, and you want that trained model? Easy! Go ahead and execute: