Datasources

A datasource is the beginning of any pipeline. It is where the raw data is stored. Currently, we support BigQuery tables only, but more datasource formats are coming soon.

Basics

In order to see the available datasources within your organization:

cengine datasource list

In order to create a new datasource:

cengine datasource create bq
argument          type   description
name              str    the name of the datasource
project           str    project name of the BQ table
dataset           str    dataset name of the BQ table
table             str    table name of the BQ table
table_type        str    choose from public or private
service_account   str    path to the service account JSON file for access to a private BQ table

All of the above are required, except service_account, which is only required if table_type == private.

In order to set a datasource active:

cengine datasource set <datasource_id>
argument          type   description
<datasource_id>   int    the id of the selected datasource

In order to peek at a datasource:

The peek function randomly samples a number of rows from the datasource, giving you a quick first look at the data.

cengine datasource peek <datasource_id> --sample_size <SAMPLE_SIZE>
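Conceptually, peeking is just uniform random sampling over the table's rows. A minimal Python sketch of that idea (the function name and signature are illustrative, not part of the cengine CLI or its backend):

```python
import random

def peek(rows, sample_size, seed=None):
    """Return up to sample_size randomly chosen rows.

    Illustrates the idea behind `cengine datasource peek`:
    a uniform random sample to get a basic look at the data.
    """
    rng = random.Random(seed)  # seed only to make the sketch reproducible
    if sample_size >= len(rows):
        return list(rows)
    return rng.sample(rows, sample_size)
```

Because the sample is random rather than the first N rows, it is less likely to be skewed by how the table happens to be ordered.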

Service Account

If you want to add a private BigQuery table as a datasource, you need to provide the Core Engine with a service account that has the BigQuery Data Viewer role. We will use this service account to create a copy of the BigQuery table in our own cloud, and use that copy as the datasource from then on. You can think of it as creating a snapshot of your BigQuery table.

In order to create such a service account, it's easiest to use the gcloud CLI.

First create a service account as follows:

gcloud iam service-accounts create [SA-NAME] \
--description "[SA-DESCRIPTION]" \
--display-name "[SA-DISPLAY-NAME]"

Then add the BigQuery Data Viewer role to the service account.

gcloud projects add-iam-policy-binding [YOUR-PROJECT-ID] \
--member serviceAccount:[SA-NAME]@[YOUR-PROJECT-ID].iam.gserviceaccount.com \
--role roles/bigquery.dataViewer

Finally, create the service account JSON key file. The file will be stored at ~/key.json if you're on a Linux-based system. You can now use this file when you create a datasource from your private BigQuery table.

gcloud iam service-accounts keys create ~/key.json \
--iam-account [SA-NAME]@[YOUR-PROJECT-ID].iam.gserviceaccount.com
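Before passing the key file to cengine, you can sanity-check it locally. A small sketch, assuming only the standard fields that every Google Cloud service-account key file contains (the helper name is illustrative, not part of the cengine CLI):

```python
import json

# Standard fields present in every Google Cloud service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_service_account(path):
    """Return a list of problems found in a service-account key file.

    An empty list means the file at least looks like a valid key;
    it does not verify that the account has the Data Viewer role.
    """
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not read key file: {exc}"]
    problems = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if data.get("type") != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems
```

Catching a malformed or truncated key file here is quicker than debugging a failed datasource creation later.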

Supported data types

BigQuery Data Type   Supported
INTEGER              YES
FLOAT                YES
STRING               YES
TIMESTAMP            YES
BOOLEAN              YES
RECORD               NO
DATETIME             NO
ARRAY                NO
STRUCT               NO
BYTES                NO
GEOGRAPHY            NO

We are working hard to bring more supported data types to the Core Engine. Please give us feedback at support@maiot.io so that we can prioritize the most important ones faster!
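Given the table above, it is easy to pre-check a table schema for columns that would not be supported. A minimal sketch (the function name and the `{column: type}` mapping shape are illustrative assumptions, not a cengine API):

```python
# Supported BigQuery column types, per the table above.
SUPPORTED_TYPES = {"INTEGER", "FLOAT", "STRING", "TIMESTAMP", "BOOLEAN"}

def unsupported_columns(schema):
    """Given a {column_name: bigquery_type} mapping, return the
    columns whose type the Core Engine does not yet support."""
    return {name: t for name, t in schema.items()
            if t.upper() not in SUPPORTED_TYPES}
```

Running such a check before creating the datasource tells you which columns to drop or cast first.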

Limitations

For now, only BigQuery is supported as a datasource. We are actively looking for feedback on other formats. Please let us know at support@maiot.io which format you would like supported next!

The BigQuery table must also conform to the following restrictions:

  • Max number of rows: 20,000,000
  • Max number of columns: 1000
  • Location: Private datasets must be in location EU
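The restrictions above can be checked locally before you create the datasource. A small sketch (constants and the function name are illustrative, taken only from the limits listed above):

```python
# Restrictions on BigQuery tables, per the list above.
MAX_ROWS = 20_000_000
MAX_COLUMNS = 1_000
REQUIRED_LOCATION = "EU"  # applies to private datasets only

def check_limits(num_rows, num_columns, location, private=True):
    """Return a list of restriction violations for a candidate table."""
    violations = []
    if num_rows > MAX_ROWS:
        violations.append(f"too many rows: {num_rows} > {MAX_ROWS}")
    if num_columns > MAX_COLUMNS:
        violations.append(f"too many columns: {num_columns} > {MAX_COLUMNS}")
    if private and location.upper() != REQUIRED_LOCATION:
        violations.append(f"private dataset must be in EU, got {location!r}")
    return violations
```

An empty result means the table fits within the current limits; each string in a non-empty result names the restriction that was exceeded.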

If your dataset is located outside of the EU, please follow this handy Google guide to copy it from any location to the EU. We apologize for this workaround and are working hard to bring all locations to the Core Engine!