Migrating to a new Puppet certification authority

At Songkick all our servers are managed using Puppet, an open source configuration management tool. We use it in client-server mode and recently had the need to replace the certification authority certificates on all our nodes. I couldn’t find much information on how to do this without logging onto every machine, so I’ve documented my method.

What is this Puppet CA anyway?

If you’re using puppet in its typical client-server (agent-master) setup, then when the puppet master is first started it will create a certification authority (CA) that every client connecting to it must both trust and be trusted by. This usually happens transparently, so people often aren’t aware that this certification authority exists.

The CA establishes trust between the agents and the master: it stops an attacker from setting up a malicious puppet master and telling puppet agents to do his or her bidding, and it stops malicious clients from seeing configuration data intended for other clients. Agents should only connect to masters that have certificates signed by their CA, and masters should only send configuration information to clients that have certificates signed by the same CA.

There’s a more comprehensive explanation of Puppet SSL written by Brice Figureau which goes into far more detail than we have space for. The main thing to understand is that the CA is an important part of maintaining security and that you can only have one CA across a set of machines that access the same secured resources.

Why would I want to migrate to a new CA?

  • Your current CA certificate is about to expire. By default, CA certificates have a validity period of 5 years, so fairly early adopters of puppet will need to replace them.
  • You’ve had multiple CAs in use and need to consolidate on one.
  • You believe that your certificates and private keys are in the hands of people who could cause mischief with them.
  • You have fallen foul of bugs relating to the fact that you use a CA created in an older version of puppet.

It was in fact the second of these reasons that applied to Songkick; we’d previously been using multiple puppet masters, each with their own CA. We wanted to start using exported resources, stored in the same PuppetDB instance for all nodes. This meant that each master needed to be trusted by the same CA that signed the PuppetDB instance’s certificate; hence we needed to consolidate on one CA.

How do I do it?

Set up NEW puppet master(s)

Set up at least one new puppet master server, with a new CA certificate.

If you have a lot of existing hosts managed by puppet, then it’s worth considering enabling the autosign option, even if only temporarily, as you’ll have a number of certificate requests to approve manually otherwise.
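
For example, blanket autosigning can be switched on in the new master’s puppet.conf while the migration runs; this is only a sketch, and assumes you’re comfortable with the security trade-off on your network and will turn it off again once the migration is complete:

[master]
autosign = true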

Configure agents to connect to the new master(s)

We’re assuming here that you’re managing the puppet agent configuration through puppet, and that changes to the puppet configuration cause an automatic restart of the puppet agent.

Change the configuration of your puppet agents, to connect to the new master(s) and use a different ssldir:

[main]
server = <new server hostname> 
ssldir = /var/lib/puppet/ssl2

Be careful not to apply this change to your newly created puppet master.

Your clients should reconfigure themselves, restart, and on startup connect to your new puppet master, forgetting their old SSL configuration, including the CA certificates.

If you have autodiscovery records for puppet in DNS, e.g. an A record for ‘puppet’ or the SRV records, then you should leave them in place for now. Agents that have not been migrated to the new CA may still need them.

It is a good idea to test this on a handful of nodes and check that it works in a completely automated fashion before applying to every node.
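
For example, after letting the change roll out to one test node, you can confirm that it re-registered against the new CA; the exact invocations vary a little between Puppet versions, so treat this as a sketch:

shell> puppet agent --test        # on the test node: trigger a run against the new master
shell> puppet cert list --all     # on the new master: signed certificates are prefixed with '+'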

Tidying up (part 1)

Once every node has checked in with the new master and been issued with a new certificate, it’s time to start tidying up. It’s a good idea to revert to using the default ssldir, so that when agents bootstrap themselves with the default config, they do not then switch to the new ssldir and forget their old certificates. That would cause the master to refuse to talk to them, as it looks like a spoofing attempt.

On each client, we mirror the new ssldir to the old one:

file { '/var/lib/puppet/ssl':
  source  => 'file:///var/lib/puppet/ssl2',
  recurse => true,
  purge   => true,
  force   => true,
}

Be careful not to apply this change to your newly created puppet master.

Tidying up (part 2)

Once that’s shipped everywhere, we remove the ssldir setting from the agent configuration, falling back on the default ssldir, and remove the above resource definition that copies the ssldir.
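
After that change, an agent’s puppet.conf is back to something like the block we started with, minus the ssldir override:

[main]
server = <new server hostname>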

Tidying up (part 3)

You can now update your autodiscovery DNS entries to point to the new servers, and remove the autosign configuration if desired.

Finally, we ship a change to the clients that removes the temporary /var/lib/puppet/ssl2 directory.
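
That final change can itself be a puppet resource; a minimal sketch, using the same path as above, might be:

file { '/var/lib/puppet/ssl2':
  ensure => absent,
  force  => true,
}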

And that’s it, everything has been migrated to the new CA, with no need to do anything outside of puppet.

Testing your database backups: the test environment database refresh pattern

When did you last try restoring your database backups? A month ago, a week ago? A year ago? Never? When was the last time you refreshed the data in your test environments? When I joined Songkick, one of the first things I asked was when we last tested a restore of our database backups. The answer, pleasingly, was at 03:00 UK time that morning and not coincidentally, that’s when we last refreshed the data in our test environments.

Here’s how we get the warm and fuzzy feeling of knowing that our backups contain data that can be restored and makes sense.

  1. Every morning, our database servers run their scheduled backups, copying the resulting images to a backup server in the data centre.
  2. Overnight those backups get copied to the office, giving us an offsite copy.
  3. In the small hours, when most of us are asleep, each of the database servers in our staging environment retrieves the backups, erases its local data files and then restores the production backups over the top of them.
  4. We perform sanitisation on the data, to make it suitable for use in a testing environment.
  5. And finally, and most importantly, we use the databases in our testing.

By doing this, we identified one case where our backups seemed to work and produced plausible-looking images, but MySQL failed to apply InnoDB log records during recovery. It was inconvenient to discover this problem in our staging environment, but far less inconvenient than discovering it when we needed the backups to put our production system back into operation.

Here are some practical tips based on our experience implementing and managing this system at Songkick:

Back all databases up at the same time

If your system is composed of services backed by independent databases on different machines, it’s possible that there’s some implicit consistency between them. For example, a common situation at Songkick is to have an accounts service responsible for storing user accounts and another service that stores data keyed against those users; there’s an expectation that those databases have some degree of consistency.

If you back them up at different times, you’ll find inconsistencies: a service might hold a reference to a user that doesn’t yet exist. If the ID of the user is exposed to other services and that ID can be reused, you may find that newly created users in your test environment have existing data associated with them, which can cause significant problems in testing.

It’s worth noting that, in the case of a production restore, these issues would need to be diagnosed and solved in the heat of the moment. By finding them in your test environment, you’re giving yourself the space to solve them earlier, under less pressure.

Design the backups to be regularly exercised

Some types of backups are more amenable to being restored regularly in test environments. For example, our initial MongoDB backups were snapshots of the MongoDB database path. These proved difficult to restore, because they included the local database, which contains information on replica set membership. This meant that on startup, our staging MongoDB server would forget its existing replica set membership and try to talk to the production servers instead.

We switched to using mongodump to take a logical export of the database, simply so that we could restore it on the primary member of our existing staging replica set and update the entire replica set.
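
As a rough sketch, the dump-and-restore pair looks something like the following; the hostnames and backup path are illustrative, not our real ones:

shell> mongodump --host db-production --out /backups/mongo/latest
shell> mongorestore --host staging-primary --drop /backups/mongo/latest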

Sanitisation tips

After we’ve restored the databases, there are certain things we do to make them safe and usable in our testing environments.

  • Remove or obfuscate email addresses. We’re not fond of accidentally emailing people with test events we’ve created in staging, so we change people’s email addresses to be unusable, so that can’t happen (a sketch of this follows the list). We leave email addresses alone if they belong to people who work at Songkick, so we can test email features by emailing ourselves.
  • Remove or obfuscate payment tokens. If it’s uncool to accidentally email people, accidentally charging them is positively hostile. Anything that’s used for payment needs to be removed.
  • Fix or replace information about the environment. It’s best to avoid keeping references to your technical environment in the same database as your application data, but sometimes that’s tricky to work around. For example, our MogileFS installation needs to be kept in sync with the production one, to avoid problems with missing media. This means that we need to manually update the database to substitute the hostnames of the MogileFS servers.
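
As an example of the first point, the email obfuscation can be a single statement run right after the restore; the database, table and column names here are illustrative rather than our actual schema:

shell> mysql accounts -e "
         UPDATE users
         SET    email = CONCAT('user-', id, '@example.com')
         WHERE  email NOT LIKE '%@songkick.com'"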

Write code that can withstand the database going away

Unless you’ve put some work in, almost no database driver will gracefully handle the disappearance of a database server and then its re-appearance some time later. If the restore in your test environment is the first time you’ve tried this, you may find that you need to manually restart services, even after the database re-appears on the network.

The solution will vary depending on the database client being used, but often it’s a case of catching an exception, or changing some options when you establish the connection.

By making your applications reconnect to the database with no manual input, you are again fixing a problem that will eventually occur in production – a much more stressful time for it to be diagnosed and fixed.
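
The shape of the fix is roughly the same everywhere: treat a lost connection as a retryable condition rather than a fatal one. Here’s a minimal sketch of the idea in Python, using the pymysql client purely for illustration; the connection parameters are made up:

import time

import pymysql


def connect_with_retry(retries=60, delay=5):
    # Keep retrying until the database server comes back, rather than dying
    # on the first failed connection attempt.
    for attempt in range(retries):
        try:
            return pymysql.connect(host='db-staging', user='app',
                                   password='secret', database='songkick')
        except pymysql.err.OperationalError:
            if attempt == retries - 1:
                raise  # still down after ~5 minutes; give up
            time.sleep(delay)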

Summary

Testing your database backups by restoring them automatically and regularly in your test environments is a great way to battle-harden your backups and applications and to make sure that your test environment looks like the real production environment.


If you’ve liked what you’ve read, why not head over to our jobs page? We’re looking for a Systems Engineer to add more touches like these to our infrastructure.

A little bit of Pig

In our Science team at Songkick I’ve been working with Apache Pig to generate lots of interesting metrics for our business intelligence. We use Amazon’s Elastic MapReduce and Pig to avoid running complex, long-running and intensive queries on our live db; we can run them on Amazon in a timely fashion instead. So let’s dive into Pig and how we use it at Songkick.com.

Pig (what’s with all these silly names)

The Apache Pig project is a data flow language designed for analysing large datasets. It provides a high-level platform for creating MapReduce programs used with Hadoop. It’s a little bit like SQL, but Pig programs are, by their structure, suitable for parallelization, which is why they are great at handling very large data sets.

Data (Pig food)

Let’s start by uploading some huge and interesting data about Songkick’s artists onto S3. We start by dumping a table from MySQL (along with a lot of other tables) and then query that data with Pig on Hadoop. While we could extract all the artist data by querying the live table, it’s actually faster to use mysqldump and dump the table as a TSV file.

For example, it took 35 minutes to dump our artist table with the SQL query ‘select * from artists’, but only 10 minutes to dump the entire table with mysqldump.

We format the table dump as a TSV file which we push to S3, as that makes it super easy to use Amazon’s Elastic MapReduce with Pig.

shell> mysqldump --user=joe --password \
                 --fields-optionally-enclosed-by='"' \
                 --fields-terminated-by='\t' \
                 --tab /tmp/path_to_dump/ \
                 songkick artist_trackings

Unfortunately this has to be run on the db machine, since mysqldump needs access to the file system to save the data. If this is a problem for you, there is a Ruby script for dumping tables to TSV: http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb

Launching (Pig catapult)

We will be using Amazon’s Elastic MapReduce to run our Pig scripts. We can start our job in interactive Pig mode, which allows us to ssh to the box and run the Pig script line by line.
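
As a sketch, launching an interactive Pig job flow and connecting to it with the elastic-mapreduce tool looks roughly like this; the instance count and name are arbitrary, so check the tool’s --help output for the exact flags in your version:

shell> elastic-mapreduce --create --alive --name "pig interactive" \
                         --num-instances 4 --pig-interactive
shell> elastic-mapreduce --jobflow j-XXXXXXXXXXXX --ssh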

Examples (Dancing Pigs)

An important thing to note when running Pig scripts interactively is that they defer execution until they have to expose a result. This means you get nice schema checks and validations, helping to ensure your Pig script is valid, without actually executing it over your large dataset.

We are going to calculate the average number of users tracking an artist, counting only users who have logged in within the last 30 days.

The Pig script, with example data shown after each step:

-- Define some useful dates we will use later
%default TODAYS_DATE `date  +%Y/%m/%d`
%default 30_DAYS_AGO `date -d "$TODAYS_DATE - 30 day" +%Y-%m-%d`

-- Pig is smart enough when given a folder to go and find files, decompress them if necessary and load them.
-- Note we have to specify the schema as Pig does not know this from our TSV file.
trackings = LOAD 's3://songkick/db/trackings/$TODAYS_DATE/' AS (id:int, artist_id:int, user_id:int);
users = LOAD 's3://songkick/db/users/$TODAYS_DATE/' AS (id:int, username:chararray, last_logged_in_at:chararray);

This gives the following values for trackings and users:
trackings
<1, 1, 1>
<2, 1, 2>

users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>
<3,'tycho', '11/06/2010'>

-- Filter users to only those who have logged in within the last 30 days
-- Pig does not understand dates, so just treat them as strings
active_users = FILTER users BY last_logged_in_at >= '$30_DAYS_AGO';

active_users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>

active_users_and_trackings = JOIN active_users BY id, trackings BY user_id;

-- group all the users tracking an artist so we can count them.
active_users_and_trackings_grouped = GROUP active_users_and_trackings BY trackings::artist_id;

<1, {<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}>

trackings_per_artist = FOREACH active_users_and_trackings_grouped GENERATE group, COUNT(active_users_and_trackings) AS number_of_trackings;

<1, 2>

-- group all the counts so we can calculate the average
all_trackings_per_artist = GROUP trackings_per_artist ALL;

<'all', {<1, 2>}>

-- Calculate the average
average_artist_trackings_per_active_user = FOREACH all_trackings_per_artist
  GENERATE '$TODAYS_DATE' AS dt, AVG(trackings_per_artist.number_of_trackings);

<'11/06/2012', 2.0>

-- Now we have done the work, store the result in S3.
STORE average_artist_trackings_per_active_user INTO
  's3://songkick/stats/average_artist_trackings_per_active_user/$TODAYS_DATE';

Debugging Pigs (Pig autopsy)

In an interactive pig session there are two useful commands for debugging:

  • DESCRIBE to see the schema.
  • ILLUSTRATE to see the schema with sample data:
DESCRIBE users;
users: {id:int, username:chararray, created_at:chararray, trackings:int}
ILLUSTRATE users;
----------------------------------------------------------------------
| users   | id: int | username:chararray | created_at | trackings:int |
----------------------------------------------------------------------
|         | 18      | Joe                | 10/10/13   | 1000          |
|         | 20      | Elise              | 10/10/14   | 2300          |
----------------------------------------------------------------------

Automating Elastic MapReduce (Pig robots)

Once you are happy with your script you’ll want to automate all of this. I currently do this by having a cron task which at regular intervals uses the elastic-mapreduce-ruby lib to fire up an Elastic MapReduce job and run the Pig script.

It’s important to note that I store the Pig scripts on S3, so it’s easy for elastic-mapreduce to find them.
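
The call that the cron job ultimately makes is roughly of this shape; the log URI is made up, and the script path is the one used further below:

shell> elastic-mapreduce --create --name "average artist trackings" \
                         --log-uri s3://songkick/emr-logs/ \
                         --pig-script \
                         --args s3://songkick/lib/stats/pig/average_artist_trackings_per_active_user.pig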

Follow the instructions to install elastic-mapreduce-ruby: https://github.com/tc/elastic-mapreduce-ruby

To avoid having to call elastic-mapreduce with hundreds of arguments, a colleague has written a little Python wrapper to make it quick and easy to use: https://gist.github.com/2911006

You’ll need to configure where your elastic-mapreduce tool is installed AND where you want Elastic MapReduce to log to on S3 (this means you can debug your Elastic MapReduce job if things go wrong!).

Now all we need to do is pass the script the path to the pig script on S3.

./emrjob s3://songkick/lib/stats/pig/average_artist_trackings_per_active_user.pig

Testing with PigUnit (Simulating Pigs)

Pig scripts can still take a long time to run, even with all that Hadoop magic. Thankfully there is a testing framework, PigUnit.

http://pig.apache.org/docs/r0.8.1/pigunit.html#Overview

Unfortunately this is where you have to step into writing Java. So I skipped it. Sshhh.

References

  1. Apache Pig official site: http://pig.apache.org
  2. Nearest Neighbours with Apache Pig and JRuby: http://thedatachef.blogspot.co.uk/2011/10/nearest-neighbors-with-apache-pig-and.html
  3. Helpers for messing with Elastic MapReduce in Ruby: https://github.com/tc/elastic-mapreduce-ruby
  4. mysqltsvdump: http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb