PoC: stream nginx access logs into BigQuery

let's say you have some servers in a cluster serving the vhost foo.com, and you want to put the access logs from all the webservers for that vhost into BigQuery, whether to perform analyses or just to have all the access logs in one place.

in addition to the raw weblog data, you also want to keep track of which webserver served each hit, and what the vhost (Host header) was.

so, foreach() server, we will install fluentd, configure it to tail the nginx access log, and have it upload everything to BigQuery for us.

it worked like a champ. here’s what i did for the PoC:

  1. Install fluentd
    $ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent3.sh | sh
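    a quick optional check to confirm the install worked is to see that the td-agent service got registered:
    $ sudo systemctl status td-agent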
    
  2. Create a BigQuery dataset
     $ bq mk --dataset rickts-dev-project:nginxweblogs
    Dataset 'rickts-dev-project:nginxweblogs' successfully created.
  3. Create a JSON schema (save it as schema.json; it's referenced in the next step) to handle the weblogs + server hostname + vhost name
    [
      {
        "name": "agent",
        "type": "STRING"
      },
      {
        "name": "code",
        "type": "STRING"
      },
      {
        "name": "host",
        "type": "STRING"
      },
      {
        "name": "method",
        "type": "STRING"
      },
      {
        "name": "path",
        "type": "STRING"
      },
      {
        "name": "referer",
        "type": "STRING"
      },
      {
        "name": "size",
        "type": "INTEGER"
      },
      {
        "name": "user",
        "type": "STRING"
      },
      {
        "name": "time",
        "type": "INTEGER"
      },
      {
        "name": "hostname",
        "type": "STRING"
      },
      {
        "name": "vhost",
        "type": "STRING"
      }
    ]
  4. Create a table in the BigQuery dataset to store the weblog data
    $ bq mk -t nginxweblogs.nginxweblogtable schema.json
    Table 'rickts-dev-project:nginxweblogs.nginxweblogtable' successfully created.
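    to double-check that the table picked up the schema from step 3, bq show will print it back:
    $ bq show nginxweblogs.nginxweblogtable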
  5. Install the fluentd Google BigQuery plugin
    $ sudo /usr/sbin/td-agent-gem install fluent-plugin-bigquery --no-ri --no-rdoc -V
  6. Configure fluentd to tail the nginx access log for this vhost and upload everything to BigQuery (while also adding the server hostname and vhost name) by creating an /etc/td-agent/td-agent.conf similar to this: https://gist.github.com/rickt/641e086d37ff7453b7ea202dc4266aa5 (unfortunately WordPress won’t render it properly, sorry)

    You’ll note we use the record_transformer fluentd filter plugin to enrich each access log entry with the webserver hostname and virtualhost name before injection into BigQuery; a minimal sketch of such a config follows.
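
    since the gist won’t render here, below is a minimal sketch of what such a td-agent.conf could look like. this is an assumption-laden sketch, not the gist itself: it assumes the streaming-insert output of fluent-plugin-bigquery (@type bigquery_insert in plugin v2.x), auth via a service account JSON key at a placeholder path, and foo.com as the vhost name.

    # tail the nginx access log, parsing each line with the built-in nginx parser
    <source>
      @type tail
      path /var/log/nginx/access.log
      pos_file /var/log/td-agent/nginx-access.log.pos
      tag nginx.access
      <parse>
        @type nginx
      </parse>
    </source>

    # enrich each record with this server's hostname and the vhost name
    <filter nginx.access>
      @type record_transformer
      <record>
        hostname "#{Socket.gethostname}"
        vhost foo.com
      </record>
    </filter>

    # stream rows into the BigQuery table created in step 4
    # (the json_key path is a placeholder; fetch_schema pulls the schema from the table)
    <match nginx.access>
      @type bigquery_insert
      auth_method json_key
      json_key /etc/td-agent/bq-service-account.json
      project rickts-dev-project
      dataset nginxweblogs
      table nginxweblogtable
      fetch_schema true
    </match>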

  7. After making sure that the user fluentd runs as (td-agent by default) has read access to your nginx access logs (one way to grant that access is shown below), start (or restart) fluentd
     $ sudo systemctl start td-agent.service
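    on Ubuntu, nginx’s logs under /var/log/nginx are typically owned root:adm and not world-readable, so one way to grant read access (an assumption; check your own distro’s log permissions) is to add td-agent to the adm group and restart:
     $ sudo usermod -aG adm td-agent
     $ sudo systemctl restart td-agent.service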
  8. Now make a call to your vhost (in my case, localhost)
     $ hostname
    hqvm
    $ curl http://localhost/index.html?text=helloworld
    you sent: "helloworld"
  9. Query BigQuery to look for that specific hit using the bq command-line tool
     $ bq query 'SELECT * FROM nginxweblogs.nginxweblogtable WHERE path = "/index.html?text=helloworld"'
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
    |    agent    | code | host | method |            path             | referer | size | user | time | hostname |          vhost           |
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
    | curl/7.47.0 | 200  | ::1  | GET    | /index.html?text=helloworld | -       |   14 | -    | NULL | hqvm     | rickts-dev-box.fix8r.com |
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
  10. Congratulations, you have just set up your web access logs to stream into a BigQuery table!

proof of concept: complete!

conclusion: pushing your web access logs into BigQuery is extremely easy, not to mention a smart thing to do.

the benefits multiply as your server and vhost count grows. try consolidating, compressing and analyzing months of logs from N servers in-house and you’ll see the benefits of BigQuery right away.
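for example, once every server streams into the same table, a per-server, per-vhost traffic summary (a hypothetical query, using the table from this PoC) is one line away:

$ bq query 'SELECT vhost, hostname, COUNT(*) AS hits FROM nginxweblogs.nginxweblogtable GROUP BY vhost, hostname ORDER BY hits DESC'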

enjoy!