PoC: stream nginx access logs into BigQuery

let's say you have some servers in a cluster serving vhost foo.com and you want to put all the access logs from all the webservers for that vhost into BigQuery so you can perform analyses, or you just want all the access logs in one place.

in addition to having the raw weblog data, you also want to keep track of which webserver the hits were served by, and what the vhost (Host header) was.

so, foreach() server, we will install fluentd and configure it to tail the nginx access log and upload everything to BigQuery for us.

it worked like a champ. here’s what i did to PoC:

  1. Install fluentd
    $ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent3.sh | sh
    
  2. Create a BigQuery dataset
    $ bq mk --dataset rickts-dev-project:nginxweblogs
    Dataset 'rickts-dev-project:nginxweblogs' successfully created.
  3. Create a JSON schema to handle the weblogs + server hostname + vhost name
    [
      {
        "name": "agent",
        "type": "STRING"
      },
      {
        "name": "code",
        "type": "STRING"
      },
      {
        "name": "host",
        "type": "STRING"
      },
      {
        "name": "method",
        "type": "STRING"
      },
      {
        "name": "path",
        "type": "STRING"
      },
      {
        "name": "referer",
        "type": "STRING"
      },
      {
        "name": "size",
        "type": "INTEGER"
      },
      {
        "name": "user",
        "type": "STRING"
      },
      {
        "name": "time",
        "type": "INTEGER"
      },
      {
        "name": "hostname",
        "type": "STRING"
      },
      {
        "name": "vhost",
        "type": "STRING"
      }
    ]
  4. Create a table in the BigQuery dataset to store the weblog data
    $ bq mk -t nginxweblogs.nginxweblogtable schema.json
    Table 'rickts-dev-project:nginxweblogs.nginxweblogtable' successfully created.
  5. Install the fluentd Google BigQuery plugin
    $ sudo /usr/sbin/td-agent-gem install fluent-plugin-bigquery --no-ri --no-rdoc -V
  6. Configure fluentd to read the nginx access log for this vhost and upload to BigQuery (while also adding the server hostname and vhost name) by creating an /etc/td-agent/td-agent.conf similar to this: https://gist.github.com/rickt/641e086d37ff7453b7ea202dc4266aa5 (unfortunately WordPress won’t render it properly, sorry)

    You’ll note we are using the record_transformer fluentd filter plugin to add the webserver hostname and virtualhost name to each access log entry before injection into BigQuery.
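
    since the gist doesn’t render here, this is a rough sketch of what such a td-agent.conf could look like. it is NOT the gist itself: the log path, pos_file, tag, key file and schema path are all assumptions, and the output section assumes fluent-plugin-bigquery 2.x (1.x used "@type bigquery" with "method insert" instead of "@type bigquery_insert"). the apache2 parser is also a guess, picked because its field names line up with the schema in step 3.

```conf
# sketch only -- every path, the tag and the key file below are assumptions
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/td-agent/nginx-access.log.pos
  tag nginx.access
  <parse>
    # assumption: nginx is logging in the default "combined" format
    @type apache2
  </parse>
</source>

# add the webserver hostname and the vhost name to every record
<filter nginx.access>
  @type record_transformer
  <record>
    hostname "#{Socket.gethostname}"
    vhost foo.com
  </record>
</filter>

# stream the records into the table created in step 4
<match nginx.access>
  @type bigquery_insert
  auth_method json_key
  json_key /etc/td-agent/bq-service-account.json
  project rickts-dev-project
  dataset nginxweblogs
  table nginxweblogtable
  schema_path /etc/td-agent/schema.json
</match>
```

    check the README of the plugin version you actually installed before copying any of the parameter names.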

  7. After making sure that the user fluentd runs as (td-agent by default) has read access to your nginx access logs, start (or restart) fluentd
    $ sudo systemctl start td-agent.service
  8. Now make a call to your vhost (in my case, localhost)
    $ hostname
    hqvm
    $ curl 'http://localhost/index.html?text=helloworld'
    you sent: "helloworld"
  9. Query BigQuery to look for that specific hit using the bq command line tool
    $ bq query 'SELECT * FROM nginxweblogs.nginxweblogtable WHERE path = "/index.html?text=helloworld"'
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
    |    agent    | code | host | method |            path             | referer | size | user | time | hostname |          vhost           |
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
    | curl/7.47.0 | 200  | ::1  | GET    | /index.html?text=helloworld | -       |   14 | -    | NULL | hqvm     | rickts-dev-box.fix8r.com |
    +-------------+------+------+--------+-----------------------------+---------+------+------+------+----------+--------------------------+
  10. Congratulations, you have just set up your web access logs to inject into a BigQuery table!

proof of concept: complete!

conclusion: pushing your web access logs into BigQuery is extremely easy, not to mention a smart thing to do.

the benefits grow as your server + vhost count increases. try consolidating, compressing and analyzing months of logs from N+ servers in-house and you’ll see the benefits of BigQuery right away.

enjoy!

HOW-TO: use nginx maps & rewrites to redirect mobile users from product-specific pages on your desktop site to the same product-specific pages on your mobile site when desktop & mobile product IDs & URL schemas are completely different

[i’d not seen this well-documented on the interwebs so here it is for posterity’s sake]

problem: you have a “desktop” and a “mobile” site. they’re completely separate infra: you have two separate docroots, two separate vhosts. both sites use numeric IDs to target specific “product pages”, but because marketing departments exist and your mobile site was recently Angular-ized, your desktop and mobile sites have different URL standards and different product IDs. to make your day even better, you’ve just been told that mobile users who hit desktop-site product URLs need to get redirected to the matching product URL on the mobile site.

the relevant URL schemas:

desktop site:

http://www.foo.com/productpages/NNNNNN_NNN/index.html

(where N is 0-9)

mobile site:

http://m.foo.com/app/product/?p=NNNNN

(where N is 0-9)

you’re choking right now because those ID differences are just obnoxious and there’s no obvious relationship between the 6_3 digit IDs (desktop) and 5 digit IDs (mobile), but it’s going to be okay because your DB guys can give you a table dump mapping desktop -> mobile product IDs.

ok. so let’s talk specifics. your site has a product, a very thin and barely-purchased pamphlet titled “Great British Sportscars of the 1980s”. the pamphlet’s URLs:

desktop site:

/productpages/672019_029/index.html

mobile site:

/app/product/?p=40083

ok no problem right? useragent inspection in nginx is a cinch, as are nginx rewrites. you also know about nginx maps and how you can very easily use them in an

"$OLDURL $NEWURL;"

way. job done, let's go home early, right?

well, all of this is true, but nginx (of course) doesn’t make it terribly obvious how you might combine useragent-based rewrites AND transforming specific IDs from one schema to another. in general, we want to rewrite that mobile user to a URL that nginx has in a map, and then that subsequent map match will send the user on to the correct mobile URL (with the correct mobile product ID) on your mobile site.

the first thing to do is to get the desktop and mobile product IDs in a map so that nginx can do the desktop -> mobile ID transform for you. how you massage or get your data in this form is entirely up to your infrastructure and imagination, but you want to end up with a file like this:

/_mobile/productpages/672019_029/ http://m.foo.com/app/product/?p=40083;
/_mobile/productpages/562174_334/ http://m.foo.com/app/product/?p=10834;
/_mobile/productpages/455383_931/ http://m.foo.com/app/product/?p=90211;
/_mobile/productpages/369410_365/ http://m.foo.com/app/product/?p=16388;
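
one way to get there, as a minimal sketch: assuming the table dump arrives as a CSV with one desktop_id,mobile_id pair per line (the productids.csv file name and the csv_to_map function are made up for illustration), awk can turn it into map lines and quietly skip anything that doesn’t look like a 6_3 digit / 5 digit pair.

```shell
#!/bin/sh
# ASSUMPTION: input is a hypothetical CSV dump, one "desktop_id,mobile_id" pair per line,
# e.g. 672019_029,40083

# csv_to_map: read desktop_id,mobile_id pairs on stdin, write nginx map lines to stdout
csv_to_map() {
    awk -F, '
        # only accept well-formed rows: 6_3 digit desktop ID, 5 digit mobile ID
        $1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9]$/ &&
        $2 ~ /^[0-9][0-9][0-9][0-9][0-9]$/ {
            printf "/_mobile/productpages/%s/ http://m.foo.com/app/product/?p=%s;\n", $1, $2
        }
    '
}

# usage (file names are assumptions):
#   csv_to_map < productids.csv > /etc/nginx/vhostconf.d/www.foo.com/productids.map
```

rows that fail the pattern check are dropped rather than written out, so a stray header line or blank line in the dump won’t end up as a broken map entry.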

the format (basically) is: a slightly different version of your desktop site IDs + URI on the left, mobile site IDs + full URL on the right. so, assuming your txt file is

/etc/nginx/vhostconf.d/www.foo.com/productids.map

, at the http { } level of your nginx config (the map directive is only valid there, not inside the www.foo.com server { } block), define the map as per:

map $uri $productid_map {
    include /etc/nginx/vhostconf.d/www.foo.com/productids.map;
}

this tells nginx to set up a map, using the contents of your .map file for data. depending on the size of your map(s) you may have to increase the amount of memory that nginx allocates for map hash tables (check your map_hash_max_size and map_hash_bucket_size directive values).
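
to make the placement concrete, here is a sketch of where the map and the hash tuning directives sit relative to the vhost. the map directive is only accepted at http { } level, and the two sizes shown are placeholder values (assumptions), not recommendations; tune them to the size of your own map.

```nginx
http {
    # placeholder values -- raise them only if nginx complains at startup
    # that it could not build the map hash
    map_hash_max_size    4096;   # nginx default is 2048
    map_hash_bucket_size 128;    # nginx default is 32, 64 or 128 depending on the CPU

    # no "default" entry, so $productid_map is empty when $uri is not in the file
    map $uri $productid_map {
        include /etc/nginx/vhostconf.d/www.foo.com/productids.map;
    }

    server {
        server_name www.foo.com;
        # ... rest of the vhost, including the rewrite logic ...
    }
}
```

run nginx -t after changing the map or the sizes; it will tell you if the hash still can’t be built.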

ok, so the map is set up, now we create a rewrite rule to use it. first we’ll create a variable and set up a basic mobile useragent regex match:

set $typeof_request N;
if ($http_user_agent ~ "(iPod|iPad|iPhone|BlackBerry|Android|HTC|Motorola)") {
     set $typeof_request "MOBILE";
}

if the useragent of the request matches our basic check above, the content of the $typeof_request variable is set to MOBILE. now we need to check the URI of the request to see if it matches our desktop product ID URL schema. if it does, we’ll append “_DESKREQ” to $typeof_request.

if ($request_uri ~* ^/productpages/(\d\d\d\d\d\d_\d\d\d)/.*$) {
 set $original_productid $1;
 set $typeof_request "${typeof_request}_DESKREQ";
}

the idea here is that if we get a mobile request for a desktop URI, we want the variable $typeof_request to have the value “MOBILE_DESKREQ”. why? so we can rewrite that mobile request:

if ($typeof_request = MOBILE_DESKREQ) {
 rewrite ^ $scheme://$host/_mobile/productpages/${original_productid}/;
 break;
}

this rewrite will only occur if the request came from a mobile user and the request was for a specific syntax of URL. let's do an example. assume that someone on a mobile device requests

http://www.foo.com/productpages/672019_029/index.html

first, the $typeof_request variable would be set to MOBILE because of the $http_user_agent check matching their mobile browser useragent string. second, since the URL of the request matches our desktop product ID URL schema, _DESKREQ would be appended to $typeof_request making its value MOBILE_DESKREQ.

and so when the final $typeof_request check is done, $typeof_request is indeed set to MOBILE_DESKREQ and the request is redirected (the rewrite replacement starts with $scheme, so nginx sends the client a temporary redirect) to

http://www.foo.com/_mobile/productpages/672019_029/.

at this point you’re laughing because you already have an nginx map configured to look for strings like

/_mobile/productpages/672019_029/

for the express purpose of easily “mapping” them to strings like

http://m.foo.com/app/product/?p=40083.
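
if you want to sanity-check a map file before touching nginx, the same decision chain can be roughly modelled in shell. this is only an illustration of the logic described above, not something nginx runs: resolve_redirect and its arguments are hypothetical names, and it reads the map file directly instead of going through the intermediate /_mobile/ redirect.

```shell
#!/bin/sh
# ASSUMPTION: a rough shell model of the nginx logic, for illustration only

# resolve_redirect USERAGENT URI MAPFILE
# prints the mobile URL the request would end up at, or nothing if no redirect applies
resolve_redirect() {
    ua=$1 uri=$2 mapfile=$3

    # step 1: same basic (case-sensitive) mobile useragent check as the nginx "if"
    printf '%s\n' "$ua" | grep -Eq 'iPod|iPad|iPhone|BlackBerry|Android|HTC|Motorola' || return 0

    # step 2: pull the 6_3 digit desktop product ID out of the URI
    id=$(printf '%s\n' "$uri" | sed -n 's|^/productpages/\([0-9]\{6\}_[0-9]\{3\}\)/.*$|\1|p')
    [ -n "$id" ] || return 0

    # step 3: look up the "hidden" /_mobile/ URI in the map file (column 1 -> column 2)
    awk -v key="/_mobile/productpages/$id/" '$1 == key { sub(/;$/, "", $2); print $2 }' "$mapfile"
}

# usage (the map path is the one from this post):
#   resolve_redirect "Mozilla/5.0 (iPhone)" /productpages/672019_029/index.html \
#       /etc/nginx/vhostconf.d/www.foo.com/productids.map
```

an empty result for a mobile useragent on a valid product URI means that desktop ID is missing from the map.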

all the pieces we need are in place, now we just ask nginx to 301 redirect to the appropriate mobile URL if the requested URI matches any of the slightly modified desktop URIs in our map:

if ($productid_map) {
 return 301 $productid_map;
}

that’s it! a quick overview of the process:

  • define a map that “maps” desktop site product ID scheme URIs (prefixed with /_mobile/) to full mobile site product ID URLs
  • check useragent for match against mobile useragents
  • check URI for match against desktop product ID URL schema
  • if both useragent and URL match, rewrite URL to “hidden” URL prefixed with /_mobile/
  • nginx map check 301 redirects /_mobile/ prefixed URLs to appropriate URL on mobile site
  • profit
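
for reference, here are the pieces assembled in the order nginx evaluates them. treat it as a sketch: it assumes the $productid_map map is already defined at http { } level, and the listen / server_name lines are illustrative rather than taken from a real config.

```nginx
server {
    listen 80;
    server_name www.foo.com;

    # 1. flag mobile useragents
    set $typeof_request N;
    if ($http_user_agent ~ "(iPod|iPad|iPhone|BlackBerry|Android|HTC|Motorola)") {
        set $typeof_request "MOBILE";
    }

    # 2. flag requests for desktop product pages, remembering the 6_3 digit ID
    if ($request_uri ~* ^/productpages/(\d\d\d\d\d\d_\d\d\d)/.*$) {
        set $original_productid $1;
        set $typeof_request "${typeof_request}_DESKREQ";
    }

    # 3. mobile user + desktop product page: send them to the "hidden" /_mobile/ URI
    if ($typeof_request = MOBILE_DESKREQ) {
        rewrite ^ $scheme://$host/_mobile/productpages/${original_productid}/;
        break;
    }

    # 4. /_mobile/ URIs that are in the map get a 301 to the mobile site
    if ($productid_map) {
        return 301 $productid_map;
    }

    # ... rest of the vhost (root, locations, etc.) ...
}
```

desktop useragents never get past step 3, and a /_mobile/ URI that isn’t in the map leaves $productid_map empty, so step 4 does nothing and the request is handled like any other.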

any questions, feel free to say hi on twitter, or drop me a line.

-RMT