Building a classifier on 33M samples

This is adapted from the post "Deep Learning on 33,000,000 data points using a few lines of YAML", which originally appeared on the maiot blog.

In this tutorial, we'll go through the step-by-step process of building a simple feedforward classifier trained on a public BigQuery datasource.



The dataset is a public Google BigQuery table, namely the New York Citi Bike dataset, which contains 33 million data points holding information about various bike sharing trips in New York City. Here is a snippet of what the datasource looks like (only relevant columns shown):

birth_year | gender | end_station_id | start_station_id | tripduration | usertype
1977 | Female | 103 | 100 | 1012 | Subscriber
1991 | Male | 1089 | 23 | 530 | Customer
... etc. etc. 33 million more times

Our goal is to try to see if we can infer the birth_year of the person, given all the rest of the data in this table. The hypothesis is that because the data contains metadata of the trip, the model can accurately predict the age of the person doing the trip.

Add the datasource

As the BigQuery table is public, it can be added by running:

cengine datasource create bq --name citibike_trips \
--project "bigquery-public-data" \
--dataset new_york \
--table citibike_trips \
--table_type public

After that we can run

cengine datasource list

And see the following details:

Selection | ID | Name | Rows | Cols | Size (MB)
* | 16 | citibike_trips | 33319019 | 15 | 4689

As can be seen, the data contains 33,319,019 rows with 15 columns.

Configuring the pipeline

There are three options to configure the pipeline:

  • Use cengine pipeline configure to create the YAML through a questionnaire-like setting.
  • Create the YAML manually.
  • Start with the questionnaire and then make adjustments to the auto-created YAML file.

For this tutorial, we will go through manually creating the YAML file step by step.

Step 1: Select Features

To select the columns to use for training, we start with the features key:

end_station_id: {}
gender: {}
start_station_id: {}
tripduration: {}
usertype: {}

For this example, we are treating the start_station_id and end_station_id as integer IDs denoting different stations. This is not ideal, and a better way would be to utilize an Embedding layer. However, to get to a baseline, it should prove sufficient.

Step 2: Add our Label

To configure the label column, we add:

loss: mse
metrics: [mae]

We define birth_year as the label, with an associated mse (mean_squared_error) loss. The metric we will be tracking is mae (mean absolute error).
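To make the difference between the loss and the tracked metric concrete, here is a minimal sketch in plain Python of how mse and mae are computed. The birth years and predictions are made up for illustration, and the function names are our own, not part of the Core Engine.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; large errors are penalized more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; reads directly as 'years off on average'."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, purely for illustration
y_true = [1977, 1991, 1985, 1969]
y_pred = [1980, 1989, 1985, 1979]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"mse={mse}, mae={mae}")
```

The single prediction that is ten years off dominates the mse, while the mae stays in the same unit as the label (years), which is why it is convenient as a tracking metric.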

Step 3: Split the data

The Core Engine lets you split up the data in a variety of ways into train and eval (support for more splits is on its way!):

categorize_by: start_station_name
index_ratio: {train: 0.9, eval: 0.1}

The Core Engine allows categorizing data before splitting it. For our case, we want all start stations to be equally represented to avoid any biases. So we grouped by the start_station_name and divided each possible group in a 90-10 split. For you SQL folk, this is similar to doing a GROUP BY and then taking a partition over an index. This way our training and test data will have data with all the stations.
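The idea of categorizing first and then splitting each group by an index ratio can be sketched in a few lines of plain Python. This is only an illustration of the concept under our own assumptions (the function name, the toy rows and the station names are hypothetical); it is not how the Core Engine implements the split internally.

```python
from collections import defaultdict


def categorize_and_split(rows, categorize_by, train_ratio=0.9):
    """Group rows by a column, then split every group at the same index ratio."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[categorize_by]].append(row)

    train, eval_ = [], []
    for members in groups.values():
        cut = round(len(members) * train_ratio)  # index where this group is cut
        train.extend(members[:cut])
        eval_.extend(members[cut:])
    return train, eval_


# Toy data: two hypothetical stations with ten trips each
rows = [
    {"start_station_name": name, "tripduration": 300 + i}
    for name in ("Station A", "Station B")
    for i in range(10)
]

train, eval_ = categorize_and_split(rows, "start_station_name")
print(len(train), len(eval_))
```

Because the cut is made inside every group, each station contributes to both the train and the eval split, which is the property described above.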

Step 4: The Trainer (Model definition)

The trainer would look like:

- {type: dense, units: 64} # a dense layer with 64 units
- {type: dense, units: 32} # a dense layer with 32 units
architecture: feedforward # can be feedforward or sequential
last_activation: linear # last layer: we can take relu, but linear should also be fine
num_output_units: 1 # How many units in the last layer? We choose 1 because we want to regress one number (i.e. birth_year)
optimizer: adam # optimizer for loss function
save_checkpoints_steps: 15000 # how many steps before we do a checkpoint evaluation for our Tensorboard logs
eval_batch_size: 256 # batch size for the evaluation that happens at every checkpoint
train_batch_size: 256 # batch size for training
train_steps: 230000 # two epochs
type: regression # choose from [regression, classification, autoencoder]

We define 2 dense layers, set the optimizer and a few more nuts and bolts. The interesting bit about this trainer is the train_steps and train_batch_size. One step is one whole batch passing through the network, so with a 33 million datapoint dataset, 230,000 steps with a batch size of 256 would be roughly 2 epochs of the data.
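The "roughly 2 epochs" figure can be checked with some quick arithmetic. The sketch below uses the row count reported by cengine datasource list and the 90-10 split from Step 3; the variable names are our own.

```python
total_rows = 33_319_019   # rows reported for the datasource
train_ratio = 0.9         # share of the data used for training (Step 3)
train_batch_size = 256
train_steps = 230_000

# Every step consumes one batch, so this is the number of samples seen in training
samples_seen = train_steps * train_batch_size

epochs_over_full_data = samples_seen / total_rows
epochs_over_train_split = samples_seen / (total_rows * train_ratio)

print(f"samples seen:            {samples_seen:,}")
print(f"epochs over all rows:    {epochs_over_full_data:.2f}")
print(f"epochs over train split: {epochs_over_train_split:.2f}")
```

Measured against the 90% training split, the run comes out at just under two passes over the data; measured against all rows it is somewhat less.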

Step 5: Set the Evaluation (Splitting Metrics)

One last thing we might want to do is to add some evaluator slices. We may not want to look only at the overall metrics (i.e. overall mae) of the model, but also at the mae across the values of a categorical column.

birth_year: {} # I'd like to see how I did across each year
gender: {} # I'd like to see if the model biases because of gender
start_station_name: {} # I'd like to see how I did across each station

We defined three columns which are potentially interesting to see the slicing metrics across.
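Conceptually, a slicing metric is just the same metric computed separately for every value of a column. The following plain-Python sketch shows the idea for mae sliced by gender. The rows, the predicted field and the function name are hypothetical, and the actual evaluation in the pipeline is handled by TFMA rather than by code like this.

```python
from collections import defaultdict


def sliced_mae(rows, slice_by, label="birth_year", prediction="predicted"):
    """Compute the mean absolute error separately for each value of `slice_by`."""
    errors = defaultdict(list)
    for row in rows:
        errors[row[slice_by]].append(abs(row[label] - row[prediction]))
    return {key: sum(values) / len(values) for key, values in errors.items()}


# Hypothetical evaluation rows, purely for illustration
rows = [
    {"gender": "Female", "birth_year": 1977, "predicted": 1980},
    {"gender": "Female", "birth_year": 1990, "predicted": 1989},
    {"gender": "Male", "birth_year": 1991, "predicted": 1985},
    {"gender": "Male", "birth_year": 1969, "predicted": 1971},
]

per_gender = sliced_mae(rows, "gender")
print(per_gender)
```

A large gap between slices, as in this toy data, is the kind of signal the evaluator slices are meant to surface.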

The full config YAML

There are some things that we have intentionally skipped in the config for the sake of brevity. For reference, you can find the pipeline configuration ready to download here. For further clarity, please also refer to the config yaml docs. Most notably, the preprocessing key is perhaps important to look at as it defines the pre-processing steps that we took to normalize the data.

Run the pipeline

Register the pipeline as follows:

cengine pipeline push my_config.yaml nyc_citibike_experiment

nyc_citibike_experiment is the name of the pipeline, and is arbitrary.

The Core Engine will check the active datasource, and give an ops configuration that it deems suitable for the size of the job we're about to run. For this experiment, the Core Engine registered the pipeline with 4 workers at 96 cpus_per_worker. We can always change this by running:

cengine pipeline update <pipeline_id> --workers <num_workers> --num_cpus_per_worker <num_cpus_per_worker>

And to run:

cengine pipeline run <pipeline_id>

Enter Y for the safety prompt that appears, and let it run!

The platform will provision the resources in the cloud, connect automatically to the datasource, and create a machine learning pipeline to train the model. All preprocessing steps of the pipeline will be distributed across the workers and cpus. The training will happen on a Tesla K80 (distributed training coming soon!).

Evaluate the results

While running, the status of a pipeline can be checked with:

cengine pipeline status --pipeline_id <pipeline_id>

Sample output:

ID | Name | Pipeline Status | Completion | Compute Cost (€) | Training Cost (€) | Total Cost (€) | Execution Time
1 | nyc_citibike_experiment | Running | 13% | 0 | 0 | 0 | 0:14:21.187081

Once the pipeline hits the 100% completion mark, we can see the compute (preprocessing + evaluation) cost and the training cost it incurred. These depend on your ops configuration. Generally, more compute makes things faster but also more expensive.

For evaluating the results, execute:

cengine pipeline evaluate <pipeline_id>

This opens up a pre-configured Jupyter notebook where one can view TensorBoard logs, along with the excellent TensorFlow Model Analysis (TFMA) plugin. Both of these will show different things about the pipeline.

TensorBoard will show the model graph, the train and eval loss, and so on.


TFMA will show the metrics defined in the YAML and add the ability to slice the metric across the columns defined in the evaluator key. For example, we can slice across the label, birth_year, to see performance over every birth year in our data.


Wrap up

Now that we have the baseline model, it's very simple to iterate on different sorts of models very quickly. The Core Engine has stored all intermediate states of the pipeline (i.e. the preprocessed data) in an efficient and compressed binary format. Subsequent pipeline runs will warm-start the pipeline straight to the training part, given that everything else stays the same.

If you have questions regarding this tutorial, or would like to connect, please make sure to reach out to the maiot team on Twitter, LinkedIn or just chat with us on our Discord server.