This is adapted from the post "Deep Learning on 33,000,000 data points using a few lines of YAML", which originally appeared on the maiot blog.
In this tutorial, we'll go through the step-by-step process of building a simple feedforward regression model trained on a public BigQuery datasource.
The dataset is a public Google BigQuery table, namely the New York Citi Bike dataset, which contains 33 million data points describing bike-sharing trips in New York City. Here is a sketch of what the datasource looks like (only relevant columns shown):
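(Column names below are from the public table bigquery-public-data.new_york_citibike.citibike_trips; the types are my best recollection of its schema.)

```
tripduration        INTEGER   trip length in seconds
start_station_id    INTEGER
start_station_name  STRING
end_station_id      INTEGER
usertype            STRING
gender              STRING
birth_year          INTEGER   <-- the column we want to predict
```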
Our goal is to see whether we can infer the birth_year of the rider, given all the rest of the data in this table. The hypothesis is that the trip metadata carries enough signal for the model to accurately predict the age of the person taking the trip.
Add the datasource
As the BigQuery table is public, it can be added with a single CLI command.
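A sketch of that command; the sub-command shape and flag names here are assumptions, so check cengine datasource create --help for the real interface:

```bash
# Point the Core Engine at the public Citi Bike table (flags illustrative).
cengine datasource create bq \
  --name nyc_citibike \
  --project bigquery-public-data \
  --dataset new_york_citibike \
  --table citibike_trips
```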
After that, we can take a peek at the datasource and see its details.
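Something along these lines (again, the exact sub-commands are an assumption on my part):

```bash
cengine datasource list                  # grab the datasource ID
cengine datasource peek <datasource-id>  # inspect the datasource (illustrative sub-command)
```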
As can be seen, the data contains 33,319,019 rows with 15 columns.
Configuring the pipeline
There are three ways to configure the pipeline:
- Use cengine pipeline configure to create the YAML through a questionnaire-like setting.
- Create the YAML manually.
- Start with the questionnaire and then make adjustments to the auto-created YAML file.
For this tutorial, we will go through manually creating the YAML file step by step.
Step 1: Select Features
To select the columns to use for training, we start with the features key of the config YAML.
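A minimal sketch of that key, assuming it takes a plain list of column names (see the config YAML docs for the authoritative schema):

```yaml
features:
  - start_station_id
  - end_station_id
  - tripduration   # illustrative pick from the table
  - usertype       # illustrative pick
  - gender         # illustrative pick
```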
For this example, we are treating the start_station_id and end_station_id as integer IDs denoting different stations. This is not ideal; a better way would be to utilize an Embedding layer. However, it should be good enough to get to a baseline.
Step 2: Add our Label
To configure the label column, we add a labels section.
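A sketch, with the loss and metric taken from the text below; the exact nesting is an assumption:

```yaml
labels:
  birth_year:
    loss: mse        # mean squared error
    metrics: [mae]   # mean absolute error
```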
We use birth_year as the label, with an associated mse (mean squared error) loss. The metric we will be tracking is mae (mean absolute error).
Step 3: Split the data
The Core Engine lets you split up the data in a variety of ways. For now, we split into train and eval (support for more splits is on its way!).
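A sketch of the split block; the grouped 90-10 behavior is described right below, but the key names here are guesses:

```yaml
split:
  categorize:
    by: start_station_name   # group rows by start station first
  ratio:
    train: 0.9                # then split each group 90-10
    eval: 0.1
```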
The Core Engine allows categorizing data before splitting it.
For our case, we want all start stations to be equally represented to avoid any biases, so we group by start_station_name and divide each group in a 90-10 split. For you SQL folk, this is similar to doing a GROUP BY and then taking a partition over an index. This way, our training and eval data will both contain trips from all the start stations.
Step 4: The Trainer (Model definition)
The trainer section would look something like this.
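A sketch: the two dense layers, the batch size, and the step count come from the text below; the key names, layer widths, and optimizer are illustrative:

```yaml
trainer:
  architecture: feedforward
  layers:
    - type: dense
      units: 64     # illustrative width
    - type: dense
      units: 32     # illustrative width
  optimizer: adam   # illustrative choice
  batch_size: 256
  train_steps: 230000
```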
We define 2 dense layers, set the optimizer, and a few more nuts and bolts. The interesting bit about this trainer is train_steps in relation to batch_size: one step is one whole batch passing through the network, so with a 33-million-datapoint dataset, 230,000 steps at a batch size of 256 comes to about 59 million samples seen, i.e. roughly 2 epochs of the data.
Step 5: Set the Evaluation (Slicing Metrics)
One last thing we might want to do is add some evaluator slices. We may not just want to look at the overall metrics of the model (i.e. the overall mae), but also at the mae across the values of individual columns. We define three columns that are potentially interesting to slice the metrics across.
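A sketch of the evaluator block; birth_year is the slice used later in this post, while the other two are placeholders for whichever columns you find interesting:

```yaml
evaluator:
  slices:
    - birth_year          # sliced in TFMA further down
    - gender              # placeholder
    - start_station_name  # placeholder
```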
The full config YAML
There are some things that we have intentionally skipped in the config for the sake of brevity. For reference,
you can find the pipeline configuration ready to download here.
For further clarity, please also refer to the config YAML docs. Most notably, the preprocessing key is perhaps the most important one to look at, as it defines the pre-processing steps we took to normalize the data.
Run the pipeline
First, we register the pipeline.
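A sketch, assuming a push-style sub-command that takes the config file and a name (unverified):

```bash
cengine pipeline push nyc_citibike.yaml nyc_citibike_experiment
```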
Here, nyc_citibike_experiment is the name of the pipeline, and it is arbitrary.
The Core Engine will check the active datasource and propose an ops configuration that it deems suitable for the size of the job we're about to run. For this experiment, the Core Engine registered the pipeline with 4 workers at 96 cpus_per_worker. We can always change this if we want.
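Hypothetically, via something like the following; I have not verified this sub-command or its flags, and re-running the cengine pipeline configure questionnaire is the documented path:

```bash
# Hypothetical: adjust the ops config of a registered pipeline.
cengine pipeline update <pipeline-id> --workers 4 --cpus_per_worker 96
```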
And then we run it.
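Presumably with the pipeline ID returned at registration time:

```bash
cengine pipeline run <pipeline-id>
```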
Enter Y at the safety prompt that appears, and let it run!
The platform will provision the resources in the cloud, connect automatically to the datasource, and create a machine learning pipeline to train the model. All preprocessing steps of the pipeline will be distributed across the workers and CPUs. The training will happen on a Tesla K80 GPU (distributed training coming soon!).
Evaluate the results
While it's running, the status of a pipeline can be checked at any time.
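For instance (sub-command assumed):

```bash
cengine pipeline status <pipeline-id>
```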
Once the pipeline hits the 100% completion mark, we can see the compute (preprocessing + evaluation) cost and the training cost it incurred. This will depend on your ops configuration: generally, more compute makes things faster but also more expensive.
For evaluating the results, there is a dedicated evaluate command.
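Again as a sketch:

```bash
# Opens the evaluation workspace for the finished pipeline.
cengine pipeline evaluate <pipeline-id>
```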
This opens up a pre-configured Jupyter notebook where one can view Tensorboard logs, along with the excellent Tensorflow Model Analysis (TFMA) plugin. Both of these will show different things about the pipeline.
Tensorboard will show the model graph, the train and eval losses, and so on.
TFMA will show the metrics defined in the YAML and add the ability to slice the metric across the columns
defined in the
evaluator key. For example, we can slice across the label, birth_year, to see performance for every birth year in our data.
Now that we have a baseline model, it's very simple to iterate on different sorts of models quickly. The Core Engine has stored all intermediate states of the pipeline (i.e. the preprocessed data) in an efficient, compressed binary format, so subsequent pipeline runs will warm-start straight at the training stage, as long as everything before it stays the same.