i wanted to try out automatic loading of CSV data into BigQuery, specifically using a Cloud Function that runs automatically whenever a new CSV file is uploaded into a Google Cloud Storage bucket.
it worked like a champ. here's what i did to PoC it:
- generate a CSV file with 1000 lines of dummy data via https://www.mockaroo.com/ with their default schema so it looks like:
$ head -3 testdata.csv
id,first_name,last_name,email,gender,ip_address
1,Andres,Berthot,aberthot0@pbs.org,Male,104.204.241.0
2,Iris,Zwicker,izwicker1@icq.com,Female,61.87.224.4
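(don't want to use mockaroo? a quick python sketch that writes 1000 rows of similarly-shaped dummy data will do; the name/email/IP values below are made up, not mockaroo's:)

import csv
import random

# tiny sample pools, just to get plausible-looking rows
FIRST_NAMES = ['Andres', 'Iris', 'Aime', 'Pat', 'Sam']
LAST_NAMES = ['Berthot', 'Zwicker', 'Gladdis', 'Jones', 'Lee']

with open('testdata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address'])
    for i in range(1, 1001):
        first = random.choice(FIRST_NAMES)
        last = random.choice(LAST_NAMES)
        writer.writerow([
            i, first, last,
            '{}.{}@example.com'.format(first.lower(), last.lower()),
            random.choice(['Male', 'Female']),
            '.'.join(str(random.randint(1, 254)) for _ in range(4)),
        ])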
- make a Cloud Storage bucket to receive the CSV uploads:

$ gsutil mb gs://csvtestbucket

- install the BigQuery python client library:

$ pip3 install google-cloud-bigquery --upgrade

- make a BigQuery dataset:

$ bq mk --dataset rickts-dev-project:csvtestdataset

- make a BigQuery table in that dataset, with a schema matching the CSV:

$ bq mk -t csvtestdataset.csvtable \
id:INTEGER,first_name:STRING,last_name:STRING,email:STRING,gender:STRING,ip_address:STRING
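(fwiw, the same dataset and table can also be created from python with the client library installed above; a quick sketch, assuming default credentials and project:)

from google.cloud import bigquery

client = bigquery.Client()

# create the dataset
dataset_ref = client.dataset('csvtestdataset')
client.create_dataset(bigquery.Dataset(dataset_ref))

# create the table, with a schema matching the CSV columns
schema = [
    bigquery.SchemaField('id', 'INTEGER'),
    bigquery.SchemaField('first_name', 'STRING'),
    bigquery.SchemaField('last_name', 'STRING'),
    bigquery.SchemaField('email', 'STRING'),
    bigquery.SchemaField('gender', 'STRING'),
    bigquery.SchemaField('ip_address', 'STRING'),
]
client.create_table(bigquery.Table(dataset_ref.table('csvtable'), schema=schema))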
- create an env.yaml file holding the environment variables the function will read at runtime:

BUCKET: csvtestbucket
DATASET: csvtestdataset
TABLE: csvtable
VERSION: v14

- create a requirements.txt listing the python modules the function depends on:

google-cloud
google-cloud-bigquery

- create a .gcloudignore file so the CSV and yaml files aren't uploaded when the function is deployed:

*csv
*yaml
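- write the function code in main.py. the function receives the GCS event payload, reads the env vars defined above, and loads the newly-uploaded file straight into the BigQuery table. here's a minimal sketch of that approach (simplified for illustration; the full main.py is in the gist linked in the comments at the bottom):

import os

from google.cloud import bigquery


def csv_loader(data, context):
    # triggered by google.storage.object.finalize, so data['name']
    # is the object that was just uploaded to the bucket
    client = bigquery.Client()
    dataset_ref = client.dataset(os.environ['DATASET'])

    # tell BigQuery how to parse the CSV (skip the header row)
    job_config = bigquery.LoadJobConfig()
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV

    # load the uploaded object straight from GCS into the table
    uri = 'gs://{}/{}'.format(os.environ['BUCKET'], data['name'])
    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table(os.environ['TABLE']),
        job_config=job_config)
    print('Starting job {}'.format(load_job.job_id))
    print('Function=csv_loader, Version={}'.format(os.environ['VERSION']))
    print('File: {}'.format(data['name']))

    load_job.result()  # wait for the load job to complete
    print('Job finished.')

    # report how many rows the table now holds
    destination = client.get_table(dataset_ref.table(os.environ['TABLE']))
    print('Loaded {} rows.'.format(destination.num_rows))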
- check that everything is in place:

$ ls
env.yaml  main.py  requirements.txt  testdata.csv
- deploy the function, triggering on object.finalize events in the bucket:

$ gcloud beta functions deploy csv_loader \
  --runtime=python37 \
  --trigger-resource=gs://csvtestbucket \
  --trigger-event=google.storage.object.finalize \
  --entry-point=csv_loader \
  --env-vars-file=env.yaml
- upload the test CSV into the bucket to trigger the function:

$ gsutil cp testdata.csv gs://csvtestbucket/
Copying file://testdata.csv [Content-Type=text/csv]...
- [1 files][ 60.4 KiB/ 60.4 KiB]
Operation completed over 1 objects/60.4 KiB.
- read the function logs to see what happened:

$ gcloud functions logs read
[ ... snipped for brevity ... ]
D      csv_loader  274732139359754  2018-10-22 20:48:27.852  Function execution started
I      csv_loader  274732139359754  2018-10-22 20:48:28.492  Starting job 9ca2f39c-539f-454d-aa8e-3299bc9f7287
I      csv_loader  274732139359754  2018-10-22 20:48:28.492  Function=csv_loader, Version=v14
I      csv_loader  274732139359754  2018-10-22 20:48:28.492  File: testdata2.csv
I      csv_loader  274732139359754  2018-10-22 20:48:31.022  Job finished.
I      csv_loader  274732139359754  2018-10-22 20:48:31.136  Loaded 1000 rows.
D      csv_loader  274732139359754  2018-10-22 20:48:31.139  Function execution took 3288 ms, finished with status: 'ok'
looks like the function ran as expected!
$ bq show csvtestdataset.csvtable
Table rickts-dev-project:csvtestdataset.csvtable

   Last modified            Schema            Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ----------------------- ------------ ------------- ------------ ------------------- --------
  22 Oct 13:48:29   |- id: integer          1000         70950
                    |- first_name: string
                    |- last_name: string
                    |- email: string
                    |- gender: string
                    |- ip_address: string
great! there are now 1000 rows. looking good.
finally, let's compare the first 3 lines of the CSV:

$ egrep '^[1,2,3],' testdata.csv
1,Andres,Berthot,aberthot0@pbs.org,Male,104.204.241.0
2,Iris,Zwicker,izwicker1@icq.com,Female,61.87.224.4
3,Aime,Gladdis,agladdis2@hugedomains.com,Female,29.55.250.191

with the first 3 rows of the bigquery table:
$ bq query 'select * from csvtestdataset.csvtable \
where id IN (1,2,3)'
Waiting on bqjob_r6a3239576845ac4d_000001669d987208_1 ... (0s) Current status: DONE
+----+------------+-----------+---------------------------+--------+---------------+
| id | first_name | last_name | email                     | gender | ip_address    |
+----+------------+-----------+---------------------------+--------+---------------+
|  1 | Andres     | Berthot   | aberthot0@pbs.org         | Male   | 104.204.241.0 |
|  2 | Iris       | Zwicker   | izwicker1@icq.com         | Female | 61.87.224.4   |
|  3 | Aime       | Gladdis   | agladdis2@hugedomains.com | Female | 29.55.250.191 |
+----+------------+-----------+---------------------------+--------+---------------+
and whaddyaknow, they match! w00t!
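(fwiw, the same sanity check can be scripted with the BigQuery python client instead of the bq CLI; a quick sketch, assuming default credentials and project:)

from google.cloud import bigquery

client = bigquery.Client()
sql = 'SELECT * FROM `csvtestdataset.csvtable` WHERE id IN (1, 2, 3)'

# run the query and print the matching rows
for row in client.query(sql).result():
    print(row.id, row.first_name, row.last_name, row.email)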
proof of concept: complete!
conclusion: cloud functions are pretty great.
comment: what are the contents of your main.py?
reply: main.py is shown on the page! do you not see it? if not, you should be able to see it here: https://gist.github.com/rickt/faa1fbfd99f2aa83dedb4a6d6e32771f#file-main-py