i wanted to try out automatic loading of CSV data into BigQuery, specifically using a Cloud Function that runs automatically whenever a new CSV file is uploaded to a Google Cloud Storage bucket.
it worked like a champ. here’s what i did to PoC:
- generate a CSV file with 1000 lines of dummy data via https://www.mockaroo.com/ with their default schema, so it looks like:

$ head -3 testdata.csv
id,first_name,last_name,email,gender,ip_address
1,Andres,Berthot,aberthot0@pbs.org,Male,104.204.241.0
2,Iris,Zwicker,izwicker1@icq.com,Female,61.87.224.4
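(aside: if you'd rather not use mockaroo, a few lines of python can crank out similar dummy data — all the names and values below are made up for illustration:)

import csv
import random

first_names = ['Andres', 'Iris', 'Aime']
last_names = ['Berthot', 'Zwicker', 'Gladdis']
genders = ['Male', 'Female']

# write 1000 rows with the same six columns as the mockaroo default schema
with open('testdata.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address'])
    for i in range(1, 1001):
        fn, ln = random.choice(first_names), random.choice(last_names)
        w.writerow([
            i, fn, ln,
            '{}{}{}@example.com'.format(fn[0].lower(), ln.lower(), i),
            random.choice(genders),
            '.'.join(str(random.randint(1, 254)) for _ in range(4)),
        ])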
- make a test Cloud Storage bucket:

$ gsutil mb gs://csvtestbucket
- install the google-cloud-bigquery python module:

$ pip3 install google-cloud-bigquery --upgrade
- make a BigQuery dataset:

$ bq mk --dataset rickts-dev-project:csvtestdataset
- make a BigQuery table in that dataset, with a schema matching the CSV columns:

$ bq mk -t csvtestdataset.csvtable \
    id:INTEGER,first_name:STRING,last_name:STRING,email:STRING,gender:STRING,ip_address:STRING
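(fwiw, those two bq steps can also be done from python with the google-cloud-bigquery module we just installed — a rough sketch, assuming the same project/dataset/table names:)

from google.cloud import bigquery

client = bigquery.Client(project='rickts-dev-project')

# create the dataset
dataset_ref = client.dataset('csvtestdataset')
client.create_dataset(bigquery.Dataset(dataset_ref))

# create the table with a schema matching the CSV columns
schema = [
    bigquery.SchemaField('id', 'INTEGER'),
    bigquery.SchemaField('first_name', 'STRING'),
    bigquery.SchemaField('last_name', 'STRING'),
    bigquery.SchemaField('email', 'STRING'),
    bigquery.SchemaField('gender', 'STRING'),
    bigquery.SchemaField('ip_address', 'STRING'),
]
client.create_table(bigquery.Table(dataset_ref.table('csvtable'), schema=schema))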
- create an env.yaml file with the environment variables the function will need:

BUCKET: csvtestbucket
DATASET: csvtestdataset
TABLE: csvtable
VERSION: v14
- create a requirements.txt so the BigQuery module gets installed along with the function:

google-cloud
google-cloud-bigquery
- create a .gcloudignore so the CSV and YAML files don't get uploaded when the function is deployed:

*csv
*yaml
at this point the working directory looks like:

$ ls
env.yaml  main.py  requirements.txt  testdata.csv
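main.py in that listing is the function itself. rather than paste the whole thing, here's a minimal sketch of what it needs to do — read the env vars from env.yaml, load the newly-uploaded CSV straight from the bucket into the table, and print the log lines you'll see below (treat it as a sketch, not gospel):

import os
from google.cloud import bigquery

def csv_loader(data, context):
    # triggered by google.storage.object.finalize; 'data' describes the new object
    client = bigquery.Client()
    dataset_ref = client.dataset(os.environ['DATASET'])

    # the destination table already exists with the right schema,
    # so the load job only needs to know it's CSV with a header row
    job_config = bigquery.LoadJobConfig()
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV

    # load the new file straight from GCS into the table
    uri = 'gs://{}/{}'.format(os.environ['BUCKET'], data['name'])
    load_job = client.load_table_from_uri(
        uri, dataset_ref.table(os.environ['TABLE']), job_config=job_config)

    print('Starting job {}'.format(load_job.job_id))
    print('Function=csv_loader, Version={}'.format(os.environ['VERSION']))
    print('File: {}'.format(data['name']))

    load_job.result()  # wait for the load to complete
    print('Job finished.')

    table = client.get_table(dataset_ref.table(os.environ['TABLE']))
    print('Loaded {} rows.'.format(table.num_rows))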
- deploy the cloud function:

$ gcloud beta functions deploy csv_loader \
    --runtime=python37 \
    --trigger-resource=gs://csvtestbucket \
    --trigger-event=google.storage.object.finalize \
    --entry-point=csv_loader \
    --env-vars-file=env.yaml
time to test! copy the CSV into the bucket, which should trigger the function:

$ gsutil cp testdata.csv gs://csvtestbucket/
Copying file://testdata.csv [Content-Type=text/csv]...
- [1 files][ 60.4 KiB/ 60.4 KiB]
Operation completed over 1 objects/60.4 KiB.
$ gcloud functions logs read
[ ... snipped for brevity ... ]
D  csv_loader  274732139359754  2018-10-22 20:48:27.852  Function execution started
I  csv_loader  274732139359754  2018-10-22 20:48:28.492  Starting job 9ca2f39c-539f-454d-aa8e-3299bc9f7287
I  csv_loader  274732139359754  2018-10-22 20:48:28.492  Function=csv_loader, Version=v14
I  csv_loader  274732139359754  2018-10-22 20:48:28.492  File: testdata2.csv
I  csv_loader  274732139359754  2018-10-22 20:48:31.022  Job finished.
I  csv_loader  274732139359754  2018-10-22 20:48:31.136  Loaded 1000 rows.
D  csv_loader  274732139359754  2018-10-22 20:48:31.139  Function execution took 3288 ms, finished with status: 'ok'
looks like the function ran as expected!
$ bq show csvtestdataset.csvtable
Table rickts-dev-project:csvtestdataset.csvtable

   Last modified            Schema            Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ------------------------- ------------ ------------- ------------ ------------------- --------
  22 Oct 13:48:29   |- id: integer            1000         70950
                    |- first_name: string
                    |- last_name: string
                    |- email: string
                    |- gender: string
                    |- ip_address: string
great! there are now 1000 rows. looking good.
as a final sanity check, let's compare the first 3 rows of the CSV:

$ egrep '^[1,2,3],' testdata.csv
1,Andres,Berthot,aberthot0@pbs.org,Male,104.204.241.0
2,Iris,Zwicker,izwicker1@icq.com,Female,61.87.224.4
3,Aime,Gladdis,agladdis2@hugedomains.com,Female,29.55.250.191
with the first 3 rows of the BigQuery table:
$ bq query 'select * from csvtestdataset.csvtable where id IN (1,2,3)'
Waiting on bqjob_r6a3239576845ac4d_000001669d987208_1 ... (0s) Current status: DONE
+----+------------+-----------+---------------------------+--------+---------------+
| id | first_name | last_name | email                     | gender | ip_address    |
+----+------------+-----------+---------------------------+--------+---------------+
|  1 | Andres     | Berthot   | aberthot0@pbs.org         | Male   | 104.204.241.0 |
|  2 | Iris       | Zwicker   | izwicker1@icq.com         | Female | 61.87.224.4   |
|  3 | Aime       | Gladdis   | agladdis2@hugedomains.com | Female | 29.55.250.191 |
+----+------------+-----------+---------------------------+--------+---------------+
and whaddyaknow, they match! w00t!
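(and if you'd rather do that spot-check from python, a sketch like this works too:)

from google.cloud import bigquery

client = bigquery.Client(project='rickts-dev-project')
sql = 'SELECT * FROM csvtestdataset.csvtable WHERE id IN (1, 2, 3) ORDER BY id'
# iterating the query job waits for it to finish and fetches the rows
for row in client.query(sql):
    print(row.id, row.first_name, row.last_name, row.email, row.gender, row.ip_address)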
proof of concept: complete!
conclusion: cloud functions are pretty great.