
Thursday, November 05, 2009

Automated deployments with Puppet and Fabric

I've been looking into various configuration management/automated deployment tools lately. At OpenX we used slack, but I wanted something with a bit more functionality than that (although I'm not badmouthing slack by any means -- it can definitely be bent to your will to do pretty much whatever you need in terms of automating your deployments).

From what I see, there are 2 types of configuration management tools:
  1. The first type I call 'pull', which means that the servers pull their configurations and their marching orders in terms of applying those configurations from a centralized location -- both slack and Puppet are in this category. I think this is great for initial configuration of a server. As I described in another post, you can have a server bootstrap itself by installing Puppet (or slack) and then 'call home' to the central Puppet master (or slack repository) and get all the information it needs to configure itself
  2. The second type I call 'push', which means that you send configurations and commands to a list of servers from a centralized location -- Fabric is in this category. I think this is a more appropriate mode for application-specific deployments, where you might want to deploy first to a subset of servers, then push it to all servers.
So, as a rule of thumb, I think it makes sense to use a tool like Puppet for the initial configuration of the OS and of the packages required by your application (things like MySQL, Apache, Tomcat, Tornado, Nginx, or whatever your application relies on). When it comes time to deploy your application, I think a tool like Fabric is more appropriate, since it gives you more immediate and finer-grained control over what you want to do.

I also like the categorization of these tools done by the people at ControlTier. Check out their blog post on Achieving Fully Automated Provisioning (which also links to a white paper PDF) for a nice diagram of the hierarchy of deployment tools:
  • at the bottom you have tools that install or launch the initial OS on physical servers (via Kickstart/Jumpstart/Cobbler) or on virtual machines/cloud instances (via various vendor tools, or by rolling your own)
  • in the middle you have what they call 'system configuration' tools, such as Puppet/Chef/SmartFrog/cfengine/bcfg2
  • at the top you have what they call 'application service deployment' tools, such as Fabric/Capistrano/Func -- and of course their own ControlTier tool
In a comment on one of my posts,  Damon Edwards from ControlTier calls Fabric a "command dispatching tool", as opposed to Puppet, which he calls a "configuration management tool". I think this relates to the 2 types of tools I described above, where you 'push' or 'dispatch' commands with Fabric, and you 'pull' configurations and actions with Puppet.

Before I go on, let me just say that in my evaluation of different deployment tools, I quickly eliminated the ones that use XML as their configuration language. In my experience, many tools that aim to be language-neutral end up using XML as their configuration language, and then they try to bend XML into a 'real' programming language, thus ending up reinventing the wheel badly. I'd rather use a language I like (Python in my case) as the glue around the various tools in my toolchain. Your mileage may vary of course.

OK, enough theory, let's see some practical examples of Puppet and Fabric in action. While Fabric is very easy to install and has a minimal learning curve, I can't say the same about Puppet. It takes a while to get your brain wrapped around it, and there isn't a lot of great documentation online, so for this reason I warmly recommend that you go buy the book.

Puppet examples

The way I organize things in Puppet is by creating a module for each major package I need to configure. On my puppetmaster server, under /etc/puppet/modules, I have directories such as apache2, mysqlserver, nginx, scribe, tomcat, tornado. Under each such directory I have 2 directories, one called files and one called manifests. I keep files and directories that I need downloaded to the puppet clients under files, and I create manifests (series of actions to be taken on the puppet clients) under manifests. I usually have a single manifest file called init.pp.

Here's an example of the init.pp manifest file for my tornado module:

class tornado {
 $tornado = "tornado-0.2"
 $url = "http://mydomain.com/download"

 $tornado_root_dir = "/opt/tornado"
 $tornado_log_dir = "/opt/tornado/logs"
 $tornado_src_dir = "/opt/tornado/$tornado"

 Exec {
  logoutput => on_failure,
  path => ["/bin", "/sbin", "/usr/bin", "/usr/sbin", "/usr/local/bin",  "/usr/local/sbin"]
 }

 file { 
  "$tornado_root_dir":
  ensure => directory,
  recurse => true,
  source =>  "puppet:///tornado/bin";
 }

 file { 
  "$tornado_log_dir":
  ensure => directory,
 }

 package {
  ["curl", "libcurl3", "libcurl3-gnutls", "python-setuptools", "python-pycurl", "python-simplejson", "python-memcache", "python-mysqldb", "python-imaging"]:
  ensure => installed;
 }

 define install_pkg ($pkgname, $extra_easy_install_args = "", $module_to_test_import) {
  exec {
   "InstallPkg_$pkgname":
   command => "easy_install-2.6 $extra_easy_install_args $pkgname",
   unless => "python2.6 -c 'import $module_to_test_import'",
   require => Package["python-setuptools"];
  }
 }

 install_pkg {
  "virtualenv":
  pkgname => "virtualenv",
  module_to_test_import => "virtualenv";

  "boto":
  pkgname => "boto",
  module_to_test_import => "boto";

  "grizzled":
  pkgname => "grizzled",
  module_to_test_import => "grizzled.os";
 }

 $oracle_root_dir = "/opt/oracle"
 
 case $architecture {
  i386, i686: { 
   $oracle_instant_client_pkg = "instantclient_11_2-linux-i386"
   $oracle_instant_client_dir = "instantclient_11_2"
  }
  x86_64: { 
   $oracle_instant_client_pkg = "instantclient_11_1-linux-x86_64"
   $oracle_instant_client_dir = "instantclient_11_1"
  }
 }

 package {
  ["libaio-dev", "gcc"]:
  ensure => installed;
 }

 file { 
  "$oracle_root_dir":
  ensure => directory;
 }

 exec {
  "InstallOracleInstantclient":
  command => "(cd $oracle_root_dir; wget $url/$oracle_instant_client_pkg.tar.gz; tar xvfz $oracle_instant_client_pkg.tar.gz; rm $oracle_instant_client_pkg.tar.gz; 
cd $oracle_instant_client_dir; ln -s libclntsh.so.11.1 libclntsh.so); echo $oracle_root_dir/$oracle_instant_client_dir > /etc/ld.so.conf.d/oracleinstantclient.conf; ldconfig",
  creates => "$oracle_root_dir/$oracle_instant_client_dir",
  require => File[$oracle_root_dir];
 }

 $cx_oracle = "cx_Oracle-5.0.2"
 exec {
  "InstallCxOracle":
  command => "(cd $oracle_root_dir; wget $url/$cx_oracle.tar.gz; tar xvfz $cx_oracle.tar.gz; rm $cx_oracle.tar.gz; cd $oracle_root_dir/$cx_oracle; export ORACLE_HO
ME=$oracle_root_dir/$oracle_instant_client_dir; python2.6 setup.py install)",
  unless => "python2.6 -c 'import cx_Oracle'",
  require => [Package["libaio-dev"], Package["gcc"], Exec["InstallOracleInstantclient"]];
 }

 exec {
  "InstallTornado":
  command => "(cd $tornado_root_dir; wget $url/$tornado.tar.gz; tar xvfz $tornado.tar.gz; rm $tornado.tar.gz; cd $tornado; python2.6 setup.py install)",
  creates => $tornado_src_dir,
  unless => "python2.6 -c 'import tornado.web'",
  require => [File[$tornado_root_dir], Package["python-pycurl"], Package["python-simplejson"], Package["python-memcache"], Package["python-mysqldb"]];
 }
}

I'll go through this file from the top down. At the very top I declare some variables that are referenced throughout the file. In particular, $url points to the location where I keep large files that I need every puppet client to download. I could have kept the files inside the tornado module's files directory, and they would have been served by the puppetmaster process, but I preferred to use Apache for better performance and scalability. Note that I do this only for relatively large files such as tar.gz archives.

The Exec stanza (note upper case E) defines certain parameters that will be common to all 'exec' actions that follow. In my case, I specify that I only want to log failures, and I also specify the path for the binaries called in the various 'exec' actions -- this is so I don't have to specify that path each and every time I call 'exec' (alternatively, you can specify the full path to each binary that you call).

The next 2 stanzas define files and directories that I want created on the puppet client nodes. Both 'exec' and 'file' are what are called 'types' in Puppet lingo. I first specify that I want the directory /opt/tornado created on each node, and by setting 'recurse=>true' I'm saying that the contents of that directory should be taken from a source, which in my case is "puppet:///tornado/bin". This translates to a directory called bin which I created under /etc/puppet/modules/tornado/files. The contents of that directory will be copied over via the puppet internal communication protocol to the destination /opt/tornado by each Puppet client node.

The 'package' type that follows specifies the list of packages I want installed on the client nodes. Note that I don't need to specify how I want those packages installed, only what I want installed. Puppet's language is mostly declarative -- you tell Puppet what you want done, and it does it for you, using OS-specific commands that can vary from one client node to another. It so happens in my case that I know my client nodes all run Ubuntu, so I did specify Ubuntu/Debian-specific package names.

Next in my manifest file is a function definition. You can have these definitions inline, or in a separate manifest file. In my case, I declare a function called 'install_pkg' which takes 3 arguments: the package name, any extra arguments to be passed to the installer, and a module name to test the installation with. The function runs the easy_install command via the 'exec' type, but only if the specified module wasn't already installed on the system.

A parenthesis: the Puppet docs don't recommend the overuse of the 'exec' type, because it strays away from the declarative nature of the Puppet language. With exec, you specifically tell the remote node how to run a specific command, not merely what to do. I find myself using exec very heavily though. It probably means that I don't grok Puppet fully yet, but it also means that Puppet doesn't have enough native types yet that can hide OS-specific commands.

One important thing to keep in mind is that for every exec action that you write, you need to specify a condition which becomes true after the successful completion of the action. Otherwise exec will be called each and every time the manifest is inspected by the puppet nodes. Examples of such conditions:
  • 'creates' -- specifies a file or directory that gets created by the exec action; if the file or directory is already there, exec won't be called
  • 'unless' -- specifies a condition that, if true, results in exec not being called. In my case, this condition is the import of a given Python module, but it can be any shell command that returns 0
Another thing to note in the exec action is the 'require' parameter. You'll find yourself using 'require' over and over again. It is a critical component of Puppet manifests, and it is so important because it allows you to order the actions in the manifest. Without it, actions would be executed in an order you don't control, which is most likely something you don't want. In my function definition, I require the existence of the package python-setuptools, and I do it because I need the easy_install command to be present on the remote node.

After defining the function 'install_pkg', I call it 3 times, with various parameters, thus installing 3 Python packages -- virtualenv, boto and grizzled. Note that the syntax for calling a function is funky; it's one of the many things I don't necessarily like about Puppet, but it's an evil you learn to deal with.

Next up in my manifest file is a case statement based on the $architecture variable. Puppet makes several such variables available to your manifests, based on facts gathered from the remote nodes via Facter (which comes with Puppet).

Moving along, we have a package definition, a file definition -- both should be familiar by now -- followed by 3 exec actions:
  • InstallOracleInstantclient performs the download and unpacking of this package, followed by some ldconfig incantations to actually make it work
  • InstallCxOracle downloads and installs the cx_Oracle Python package (not a trivial feat at all in and of itself); note that for this action, the require parameter contains Package["libaio-dev"], Package["gcc"], Exec["InstallOracleInstantclient"] -- so we're saying that these 2 packages, and the Instantclient Oracle libraries need to be installed before attempting to even install cx_Oracle
  • InstallTornado -- pretty self-explanatory, with the observation that the require parameter again points to a directory and several packages that need to be on the remote node before the installation of Tornado is attempted
Whew. Nobody said Puppet is easy. But let me tell you, when you get everything working smoothly (after much pulling of hair), it's a great feeling to let a node 'phone home' to the puppetmaster server and configure itself unattended in a matter of minutes. It's worth the effort and the pain.

One more thing here: once you have a module with manifests and files defined properly, you need to define the set of nodes that this module will apply to. The way I do it is to have the following files on the puppet master, in /etc/puppet/manifests:

1) A file called modules.pp which imports the modules I have defined, for example:
import "common" 
import "tornado"
('common' can be a module where you specify actions that are common across all types of nodes)

2) A file called nodetemplates.pp which contains definitions for 'node templates', i.e. classes of nodes that have the same composition in terms of modules they import and actions they perform. For example:
node basenode {
    include common
}

node default inherits basenode {
}

node webserver inherits basenode {
    include scribe
    include apache2
    $required_apache2_modules = ["rewrite", "proxy", "proxy_http", "proxy_balancer", "deflate", "headers", "expires"]
    apache2::module {
        $required_apache2_modules:
        ensure => 'present',
    }
    include tomcat
    include tornado
}

Here I defined 3 types of nodes: basenode (which includes the 'common' module), default (which applies to any machine not associated with a specific node definition) and webserver (which includes modules such as apache2, tomcat, tornado, and also requires that certain apache modules be enabled).

3) A file called nodes.pp which maps actual machine names of the Puppet clients to node template definitions. For example:
node "web1.mydomain.com" inherits webserver {}
4) A file called site.pp which ties together all these other files. It contains:
import "modules"
import "nodetemplates"
import "nodes" 

Much more documentation on node definition and node inheritance can be found on the Puppet wiki, especially in the Language Tutorial.

Fabric examples

In comparison with Puppet, Fabric is a breeze. I wanted to live on the cutting edge, so I installed the latest version (alpha, pre-1.0) from github via:

git clone git://github.com/bitprophet/fabric.git

I also easy_install'ed paramiko, which at this time brings down paramiko-1.7.6 (the Fabric documentation warns against using 1.7.5, but I assume 1.7.6 is OK).

Then I proceeded to create a so-called 'fabfile', which is a Python module containing fabric-specific functions. Here is a fragment of a file I called fab_nginx.py:

from __future__ import with_statement
import os
from fabric.api import *
from fabric.contrib.files import comment, sed

# Globals

env.user = 'myuser'
env.password = 'mypass'
env.nginx_conf_dir = '/usr/local/nginx/conf'
env.nginx_conf_file = '%(nginx_conf_dir)s/nginx.conf' % env

# Environments


def prod():
    """Nginx production environment."""
    env.hosts = ['nginx1', 'nginx2']

def test():
    """Nginx test environment."""
    env.hosts = ['nginx3']

# Tasks

def disable_server_in_lb(hostname):
    require('hosts', provided_by=[prod, test])
    comment(env.nginx_conf_file, "server %s" % hostname, use_sudo=True)
    restart_nginx()

def enable_server_in_lb(hostname):
    require('hosts', provided_by=[prod, test])
    sed(env.nginx_conf_file, "#server %s" % hostname, "server %s" % hostname, use_sudo=True)
    restart_nginx()

def restart_nginx():
    require('hosts', provided_by=[prod, test])
    sudo('/etc/init.d/nginx restart')
    is_nginx_running()

def is_nginx_running(warn_only=False):
    with settings(warn_only=warn_only):
        output = run('ps -def|grep nginx|grep -v grep')
        if warn_only:
            print 'output:', output
            print 'failed:', output.failed
            print 'return_code:', output.return_code

Note that in its 0.9 and later versions, Fabric uses the 'env' environment dictionary for configuration purposes (it used to be called 'config' pre-0.9).

My file starts by defining or assigning global env configuration variables, for example env.user and env.password (which are special pre-defined variables that I assign to, and which are used by Fabric when connecting to remote hosts via the ssh functionality provided by paramiko). I also define my own variables, for example env.nginx_conf_dir and env.nginx_conf_file. This makes it easy to pass the env dictionary as a whole when I need to format a string. Here's an example from another fab file:

cmd = 'mv -f %(crt_egg)s %(backup_dir)s' % env

I then have 2 function definitions in my fab file: one called prod, which sets env.hosts to a list of production nginx servers, and one called test, which does the same but sets env.hosts to test nginx servers.

Next I have the actions or tasks that I want performed on the remote hosts. Note the require function (similar in a way to the 'require' parameter used in Puppet manifests), which says that the function will only be executed if the given variable in the env dictionary has been assigned to (in my case, the variable is hosts, and I require that its value has been provided by either the prod or the test function). This is a useful mechanism to ensure that certain things have been defined before attempting to run commands on the remote servers.

The first task is called disable_server_in_lb. It takes a host name as a parameter, which is the server that I want disabled in the nginx configuration file. I use the handy 'comment' function available in fabric.contrib.files to comment out the lines that contain 'server HOSTNAME' in the nginx configuration. The comment function can be invoked with sudo rights on the remote host by passing use_sudo=True.

The task also calls another function defined in my fab file, restart_nginx. This task simply calls '/etc/init.d/nginx restart' on the remote host, then verifies that nginx is running by calling is_nginx_running.

By default, when running a command on the remote host, if the command returns a non-zero code, it is considered to have failed by Fabric, and execution stops. In most cases, this is exactly what you want. In case you just want to run a command to get the output, and you don't care if it fails, you can set warn_only=True before running the command. I show an example of this in the is_nginx_running function.

The other main task in my fabfile is enable_server_in_lb. Here I use another handy function offered by Fabric -- the sed function. I substitute '#server HOSTNAME' with 'server HOSTNAME' in the nginx configuration file, then I restart nginx.

So now that we have the fabfile, how do we actually perform the tasks we defined? Let's assume we have a server called 'web1.mydomain.com' that we want disabled in nginx. We want to test our task first in a test environment, so we would call:
fab -f fab_nginx.py test disable_server_in_lb:web1.mydomain.com
(note the syntax for passing parameters to a function/task)

By specifying test on the command line before specifying the task, I ensure that Fabric first calls the function named 'test' in the fabfile, which sets the hosts to the test nginx servers.

Once I'm satisfied that this works well in the test environment, I call:

fab -f fab_nginx.py prod disable_server_in_lb:web1.mydomain.com

For a real deployment procedure, let's say for deploying tornado-based servers that are behind one or more nginx load balancers, I would do something like this:

fab -f fab_nginx.py prod disable_server_in_lb:web1.mydomain.com
fab -f fab_tornado.py prod deploy
fab -f fab_nginx.py prod enable_server_in_lb:web1.mydomain.com

This will deploy my new application code to web1.mydomain.com. Of course I can script this and call the above sequence for all my production servers. I assume here that I have another fabfile called fab_tornado.py with a task defined in it that does the actual deployment of the application code (most likely by downloading and easy_install'ing an egg).
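
Since the same three commands repeat for every web server, the scripting I mention can be a small wrapper that shells out to fab. Here's a minimal sketch, assuming the fabfiles from this post and a made-up host list; in practice you'd probably also restrict the deploy step to the host just taken out of rotation:

#!/usr/bin/env python
"""Run the disable/deploy/enable sequence for each production web server."""
import subprocess
import sys

# hypothetical list of production web servers behind the nginx load balancers
WEB_SERVERS = ['web1.mydomain.com', 'web2.mydomain.com']

def fab(fabfile, *tasks):
    """Invoke fab on the given fabfile, targeting the prod environment."""
    cmd = ['fab', '-f', fabfile, 'prod'] + list(tasks)
    print 'Running:', ' '.join(cmd)
    if subprocess.call(cmd) != 0:
        sys.exit('Command failed: %s' % ' '.join(cmd))

for host in WEB_SERVERS:
    fab('fab_nginx.py', 'disable_server_in_lb:%s' % host)   # take host out of the LB
    fab('fab_tornado.py', 'deploy')                          # deploy the new egg
    fab('fab_nginx.py', 'enable_server_in_lb:%s' % host)     # put host back in the LB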

That's it for today. It's been more like a whirlwind through two types of automated deployment tools -- Puppet/pull and Fabric/push. I didn't do justice to either of these tools in terms of their full capabilities, but I hope this will still be useful for some people as a starting point into their own explorations.

Tuesday, October 13, 2009

Thierry Carrez on running your own Ubuntu Enterprise Cloud

Thierry Carrez, who works in the Ubuntu Server team, has a great series of blog posts on how to run your own Ubuntu Enterprise Cloud. I haven't had a chance to  try this yet, but it's high on my TODO list. Thierry uses the Ubuntu Enterprise Cloud product (which has been part of Ubuntu server starting with 9.04) together with Eucalyptus. Here are the links to Thierry's posts:

Friday, October 09, 2009

Compiling, installing and test-running Scribe

I went to the Hadoop World conference last week and one thing I took away was how Facebook and other companies handle the problem of scalable logging within their infrastructure. The solution found by Facebook was to write their own logging server software called Scribe (more details on the FB blog).

Scribe is mentioned in one of the best presentations I attended at the conference -- 'Hadoop and Hive Development at Facebook' by Dhruba Borthakur and Zheng Shao. If you look at page 4, you'll see the enormity of the situation they're facing: 4 TB of compressed data (mostly logs) handled every day, and 135 TB of compressed data scanned every day. All this goes through Scribe, so that gives me a warm fuzzy feeling that it's indeed scalable and robust. For more details on Scribe, see the wiki page of the project. It's my intention here to detail the steps needed for compiling and installing it, since I found that to be a non-trivial process to say the least. I'm glad Facebook open-sourced Scribe, but its packaging could have been a bit more straightforward. Anyway, here's what I did to get it to run. I followed roughly the same steps on Ubuntu and on Gentoo.

1) Install pre-requisite packages

On Ubuntu, I had to install the following packages via apt-get: g++, make, build-essential, flex, bison, libtool, mono-gmcs, libevent-dev.

2) Install the boost libraries

Very important: scribe needs boost 1.36 or newer, so make sure you don't have older boost libraries already installed. If you install libboost-* in Ubuntu, it tries to bring down 1.34 or 1.35, which will NOT work with scribe. If you have libboost-* already installed, you need to uninstall them. Now. Trust me, I spent several hours pulling my hair on this one.

- download the latest boost source code from SourceForge (I got boost 1.40 from here)

- untar it, then cd into the boost directory and run:

$ ./bootstrap.sh
$ ./bjam
$ sudo ./bjam install

3) Install thrift and fb303

- get thrift source code with git, compile and install:

$ git clone git://git.thrift-rpc.org/thrift.git
$ cd thrift
$ ./bootstrap.sh
$ ./configure
$ make
$ sudo make install

- compile and install the Facebook fb303 library:

$ cd contrib/fb303
$ ./bootstrap.sh
$ make
$ sudo make install

- install the Python modules for thrift and fb303:

$ cd TOP_THRIFT_DIRECTORY
$ cd lib/py
$ sudo python setup.py install
$ cd TOP_THRIFT_DIRECTORY
$ cd contrib/fb303/py
$ sudo python setup.py install

To check that the python modules have been installed properly, run:

$ python -c 'import thrift' ; python -c 'import fb303'

4) Install Scribe

- download latest source code from SourceForge (I got it from here)

- untar, then run:

$ cd scribe
$ ./bootstrap.sh
$ make
$ sudo make install
$ sudo ldconfig (this is necessary so that the boost shared libraries are loaded)

- install Python modules for scribe:

$ cd lib/py
$ sudo python setup.py install

- to test that scribed (the scribe server process) was installed correctly, just run 'scribed' at a command line; you shouldn't get any errors
- to test that the scribe Python module was installed correctly, run
$ python -c 'import scribe'

5) Initial Scribe configuration

- create configuration directory -- in my case I created /etc/scribe
- copy one of the example config files from TOP_SCRIBE_DIRECTORY/examples/example*conf to /etc/scribe/scribe.conf -- a good one to start with is example1.conf
- edit /etc/scribe/scribe.conf and replace file_path (which points to /tmp) with a location more suitable for your system
- you may also want to replace max_size, which dictates how big the local files can be before they're rotated (by default it's 1 MB, which is too small -- I set it to 100 MB)
- run scribed either with nohup or in a screen session (it doesn't seem to have a daemon mode):

$ scribed -c /etc/scribe/scribe.conf

6) Test run

To test Scribe, you can install it on a remote machine, configure scribed on that machine to use a configuration file similar to examples/example2client.conf, then change remote_host in the config file to point to the central scribe server configured in step 5.

Once scribed is configured and running on the remote machine, you can test it with a nice utility written by Silas Sewell, called scribe_pipe. For example, you can pipe an Apache log file from the remote machine to the central scribe server by running:

cat apache_access_log | ./scribe_pipe apache.access

On the scribe server, you should see at this point a directory called apache.access under the main file_path directory, and files called apache.access_00000, apache.access_00001 etc (in chunks of max_size bytes).

I'll post separately about actually using Scribe in production. I hope this post will at least get you started on using Scribe and save you some headaches during its installation process.

Tuesday, October 06, 2009

Brandon Burton on 'Automation is the cloud'

Great post from Brandon Burton, my ex-colleague at RIS/Reliam, on why automation is the foundation of cloud computing. Brandon discusses automation at various levels, starting with virtualization and networking, then moving up the layers and covering OS, configuration management and application deployment. Highly recommended.

Thursday, September 24, 2009

Pybots success stories and a call for help

Any report of the death of the Pybots project is an exaggeration. But not by much. First, some history.

Some history

The idea behind the Pybots project is to allow people to run automated tests for their Python projects, while using Python binaries built from the very latest source code from the Python subversion repository.

The idea originated from Glyph, of Twisted fame. He sent out a message to the python-dev mailing list in which he said:
"I would like to propose, although I certainly don't have time to implement, a program by which Python-using projects could contribute buildslaves which would run their projects' tests with the latest Python trunk. This would provide two useful incentives: Python code would gain a reputation as generally well-tested (since there is a direct incentive to write tests for your project: get notified when core python changes might break it), and the core developers would have instant feedback when a "small" change breaks more code than it was expected to."

This was back in July 2006. I volunteered to maintain a buildbot master (running on a server belonging to the PSF) and also to rally a community of people interested in running this type of tests. The hard part was (and still is) to find people willing to donate client machines to act as build slaves for a particular project, and even more so people willing to keep up with the status of their build slaves. The danger here, as in any continuous integration system, is that once the status turns to red and doesn't go back to green, people start to ignore the failed steps. Even if those steps exhibit new and interesting failures, it's too late at this point (this is related to the broken windows theory).

The project started fairly strong and gained some momentum, but then slowly ran out of steam. It was a combination of me not having the time to do the rallying, and of people not being interested in participating in the project anymore. At the height of its momentum, in early 2007, the Pybots farm consisted of 11 buildslaves running automated tests for more than 20 Python projects, including Twisted, Django, SQLAlchemy, MySQLdb, Bazaar, nose, twill, Storm, Trac, CherryPy, Genshi, Roundup. Pretty much a who's who of the Python project world.

Early success stories

Here are some examples of bugs discovered by the buildslaves in the Python farm:
  • new keywords 'as' and 'with' in Python 2.6 causing problems for projects that had variables with those names
  • Python install step failing even though all unit tests were passing (this underscores the importance of functional testing)
  • platform-specific issues -- for example Bazaar issues on Windows due to TCP client behavior, Twisted issues on Red Hat 9 due to multicast behavior, Python core issues on OS X due to string formatting errors
(for a more thorough overview of the Pybots project, including lessons learned, see also my PyCon07 presentation)

Recent signs of life and more success stories

In the last month or so there has been a flurry of activity related to the Pybots farm. It all started with an upgrade of the buildbot version on the machine hosting the Pybots buildmaster. This broke the master's configuration file, so the Pybots status page went completely dark.

As a result, Steve Holden posted a plea for help, which was answered by a few people who showed interest in adding build slaves to the project. In parallel, Jean-Paul Calderone jumped in to help on the buildmaster side and he managed to fix the buildbot upgrade issue (thanks, JP!). David Stanek also expressed interest in taking a more active role on the buildmaster side.

Jean-Paul also sent more success stories to the Pybots mailing list. Here they are, verbatim, with his permission:

"The skip story:

The Twisted pybots slave started skipping every Twisted test one day. I noticed and filed http://twistedmatrix.com/trac/ticket/3703 (which goes into a bit of detail about why this happened). This happened to come up during the PyCon language summit, so there was some real-time discussion about it, resulting in a Python bug being filed, http://bugs.python.org/issue5571. Then, as that ticket shows, Benjamin Peterson was nice enough to fix the incompatibility.

The array/buffer story:

The Twisted pybots slave started to fail some Twisted tests one day. ;) The tests in question were actually calling into some PyCrypto code, so this failure wasn't in Twisted directly. PyCrypto loads some bytes into an array.array and then tries to hash them (for some part of its random pool API). I filed http://bugs.python.org/issue6071 on which someone explained that hashlib switched over to the new buffer API, lost support for hashing anything that only provides the old buffer API, and that array.array still only supports the old buffer API. This one hasn't been fixed yet, but it sounds like Gregory Smith plans to fix it before 2.7 is released.

There are other success stories too, incompatible changes that are more like bugs on the Twisted side than on the Python side (assuming one is generous and believes that incompatible changes in Python can actually be Twisted bugs ;). Things like typos that didn't result in syntax errors in an older version of Python but became syntax errors in newer versions (in particular, a variable was defined as 0x+80000000 instead of 0x80000000 - the former actually being valid syntax in 2.5 but became illegal in 2.6)."


My hope is that stories like these will convince more people about the usefulness of running tests for their projects against 'live' changes in the Python trunk (or other Python branches). I am not aware of any other testing project that accomplishes this for other programming languages.

In particular, if there is enough interest, we can also configure the Pybots master to trigger test runs for your project of choice using Py3k binaries! Think how cool you'll appear to your grandchildren!

How you can help

If you want to be involved in the Pybots project, please subscribe to the Pybots mailing list and show your interest by sending a message to the list. Here are some resources to get you started:

Jeff Roberts on a scalable DNS scheme for EC2

My ex-colleague from OpenX, Jeff Roberts, has another great blog post on 'A Scalable DNS Scheme for Amazon's EC2 Cloud'. If you need to deploy an internal DNS infrastructure in EC2, you have to read this post. It's based on battle-tested experience.

Monday, September 14, 2009

A/B testing and online experimentation at Microsoft

Via Greg Linden, I found a great presentation from Ronny Kohavi on "Online experimentation at Microsoft". All kinds of juicy nuggets of information on how to conduct meaningful A/B testing and other types of controlled online experiments.

One of my favorite slides is 'Key Lessons', from which I quote:

  • "Avoid the temptation to try and build optimal features through extensive planning without early testing of ideas"
  • "Experiment often"
  • "Try radical ideas. You may be surprised"

The entire presentation is highly recommended. You can tell that this wisdom was earned in the school of hard knocks, which is the best school there is in my experience, at least for software engineering.

Wednesday, September 02, 2009

Bootstrapping EC2 images as Puppet clients

I've been looking at Puppet lately as an alternative to slack for automated deployment and configuration management. I can't say I love it, but I think it's good enough that it warrants banging your head against the wall repeatedly until you learn how to use it. I do wish it was written in Python, but hey, you do what you need to do. I did look at Fabric, and I might still use it for 'push'-type deployments, but it has nowhere near the features that Puppet has (and its development and maintenance just changed hands, which makes it too cutting edge for me at this point.)

But this is not a post about Puppet -- although I promise I'll blog about that too. This is a post on how to get to the point of using Puppet in an EC2 environment, by automatically configuring EC2 instances as Puppet clients once they're launched.

While the mechanism I'll describe can be achieved by other means, I chose to use the Ubuntu EC2 AMIs provided by alestic. As a parenthesis, if you're thinking about using Ubuntu in EC2, do yourself a favor and read Eric Hammond's blog (which can be found at alestic.com). He has a huge number of amazingly detailed posts related to this topic, and they're all worth your while to read.

Unsurprisingly, I chose a mechanism provided by the alestic AMIs to bootstrap my EC2 instances -- specifically, passing user-data scripts that will be automatically run on the first boot of the instance. You can obviously also bake this into your own custom AMI, but the alestic AMIs already have this hook baked in, which I LIKE (picture Borat's voice). What's more, Eric kindly provides another way to easily run custom scripts within the main user-data script -- I'm referring to his runurl script, detailed in this blog post. Basically you point runurl at a URL that contains the location of another script that you wrote, and runurl will download and run that script. You can also pass parameters to runurl, which will in turn be passed to your script.

Enough verbiage, let's see some examples.

Here is my user-data file, whose file name I am passing along as a parameter when launching my EC2 instances:


#!/bin/bash -ex

cat <<EOL > /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.1.1.1 puppetmaster

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
EOL

wget -qO/usr/bin/runurl run.alestic.com/runurl
chmod 755 /usr/bin/runurl
runurl ec2web.mycompany.com/upgrade/apt
runurl ec2web.mycompany.com/customize/ssh
runurl ec2web.mycompany.com/customize/vim
runurl ec2web.mycompany.com/install/puppet


The first thing I do in this script is to add an entry to the /etc/hosts file pointing at the IP address of my puppetmaster server. You can obviously do this with an internal DNS server too, but I've chosen not to maintain my own internal DNS servers in EC2 for now.

My script then retrieves the runurl utility from alestic.com, puts it in /usr/bin and chmod's it to 755. Then the script uses runurl and points it at various other scripts I wrote, all hosted on an internal web server.

For example, the contents of upgrade/apt are:


#!/bin/bash
apt-get update
apt-get -y upgrade
apt-get -y autoremove


For ssh customizations, my script downloads a specific .ssh/authorized_keys file, so I can ssh to the new instance using certain ssh keys.

To install and customize vim, I have customize/vim:


#!/bin/bash
apt-get -y install vim
wget -qO/root/.vimrc http://ec2web.mycompany.com/configs/os/.vimrc
echo 'alias vi=vim' >> /root/.bashrc


...where .vimrc is a customized file that I keep under the document root of the same web server where I keep my scripts.

Finally, install/puppet looks like this:


#!/bin/bash
apt-get -y install puppet
wget -qO/etc/puppet/puppetd.conf http://ec2web.mycompany.com/configs/puppet/puppetd.conf
/etc/init.d/puppet restart


Here I am installing puppet via apt-get, then I'm downloading a custom puppetd.conf configuration, which points at puppetmaster as its server name (instead of the default, which is puppet). Finally, I restart puppet so that the new configuration takes effect.

Note that I want to keep these scripts to the bare minimum that allows me to:

1) ssh into the instance in case anything goes wrong
2) install and configure puppet so the instance can talk to the puppetmaster

The actual package and application installations and customizations on my newly launched image will be done through puppet, by associating the instance hostname with a node that is defined on the puppetmaster; I am also adding more entries to /etc/hosts as needed using puppet-specific mechanisms such as the 'host' type (as promised, blog post on this forthcoming...)

Note that you need to make sure you have good security for the web server instance which is serving your scripts to runurl; Eric Hammond talks about using S3 for that, but it's too complicated IMO (you need to sign URLs and expire them, etc.) In my case, I preferred to use an internal Apache instance with basic HTTP authentication, and to only allow traffic on port 80 from certain security groups within EC2 (my Apache server doubles as the puppetmaster BTW).
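
As a side note, here is roughly how a user-data file like the one above gets passed in at launch time using boto (a sketch only -- the AMI ID, key pair, security group and file name below are placeholders; the EC2 API command-line tools can do the same thing via their user-data file option):

import boto

# placeholders -- substitute your own alestic AMI ID, key pair and security group
AMI_ID = 'ami-xxxxxxxx'

conn = boto.connect_ec2()   # picks up AWS credentials from the environment
user_data = open('my-user-data-script.sh').read()

reservation = conn.run_instances(AMI_ID,
                                 key_name='my-keypair',
                                 instance_type='m1.small',
                                 security_groups=['puppet-clients'],
                                 user_data=user_data)
print 'Launched:', reservation.instances[0].id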

Is your hosting provider Reliam?

Some exciting news from RIS Technology, the hosting company I used to work for. They changed their name to Reliam, which stands for Reliable Internet Application Management. And I think it's an appropriate name, because RIS has always been much more involved into the application stack than your typical hosting provider. When I was there, we rolled out Django applications, Tomcat instances, MySQL, PostgreSQL and Oracle installations, and we maintained them 24x7, which required a deep understanding of the applications. We also provided the glue that tied all the various layers together, from deployment to monitoring.

Since I left, RIS/Reliam has invested heavily in a virtual infrastructure that can be combined where it makes sense with physical dedicated servers. The DB layer is usually dedicated, since the closer you are to bare metal, the better off you are in terms of database access. But the application layer can easily be virtualized and scaled on demand. So you get the scaling benefit of cloud computing, and the performance benefit of dedicated servers.

Here are some stats for the infrastructure that RIS/Reliam used for supporting traffic during the recent Miss Universe event (they host missuniverse.com and missusa.com):

* 135 virtual servers running the Web application
* 9 virtual servers running mysql-proxy
* 1 master DB server and 5 read-only slave DB servers running MySQL
* 301.52 Mbps bandwidth
* 33,750 concurrent users
* over 150K concurrent sessions per second

An interesting note is that they used round robin DNS to load balance between the mysql proxies and had all proxies configured to use the master and all five slaves. They managed to get mysql-proxy 0.7.2 running with this patch.

So...what's the point of this note? It's a shout-out to my friends at RIS/Reliam, and a warm recommendation for them in case you need a hosting provider with strong technical capabilities that cover cloud/hybrid computing, system architecture design, application deployment and deep application monitoring and graphing.

Tuesday, August 25, 2009

New presentation and new job

I gave a presentation last night on 'Agile and Automated Testing Techniques and Tools' to the Pasadena Java User Group. It was a version of the talk I gave earlier this year to the XP/Agile SoCal User Group. This time I posted the slides to Slideshare.

If you look attentively at the first slide, you'll notice I have a new job at Evite, as a Sr. Systems Architect. I started a couple of weeks ago, and it's been great. Expect more blog posts on automated deployments to the cloud (using Ubuntu images from Alestic), on continuous integration/build/release management processes, on Hadoop, and on any interesting stuff that comes my way.

Tuesday, July 28, 2009

noSQL databases? map-reduce? Erlang? it's all in this cartoon

Hilarious cartoon (not sure why it's titled 'Fault Tolerance' though) seen on the High Scalability blog. Captures very well the spirit and hype of our times in the IT world.

Monday, July 27, 2009

Python well represented in NASA's Nebula cloud

I found out today from the cloud-computing mailing list about NASA's Nebula project. Here's what the 'About' page of the project's web site says:

"NEBULA is a Cloud Computing environment developed at NASA Ames Research Center, integrating a set of open-source components into a seamless, self-service platform. It provides high-capacity computing, storage and network connectivity, and uses a virtualized, scalable approach to achieve cost and energy efficiencies."

The Services page has some nice architectural diagrams. I wasn't surprised to see that their VM environment is managed via Eucalyptus. I also shouldn't have been surprised by the large number of Python modules and applications they're using, especially on the client side. Pretty much all the frontend applications are Python bindings for the various backend technologies they're using (such as LUSTRE, RabbitMQ, Subversion). Of course Trac is there too.

But the most interesting thing for Python fans will be undoubtedly their selection for the Web application framework. Maybe again unsurprisingly, they chose...Django:

"After an extensive trade study, the NEBULA team selected Django, a python-based web application framework, as the first and primary application environment for the Cloud. NEBULA users have access to an extensive collection of open-source django "apps", providing features ranging from simple blogs, wikis, and discussion forums, to more advanced collaboration suites, image processing, and more."

Other interesting tidbits from those diagrams:

  • deployments are automated with Fabric
  • distributed automated testing is done with Selenium Grid
  • continuous integration is done with CruiseControl
  • for the database backend, they use a MySQL cluster with DRBD
  • for the file system they use LUSTRE
  • queuing is done with RabbitMQ (which is written in Erlang)
  • search and indexing is done with SOLR
All in all, an interesting mix of technologies. Besides Python, Java and Erlang are well represented, as expected. Not a bad model to follow if you want to build your own private cloud environment.

Sunday, July 26, 2009

How to roll your own Amazon EC2 image

Jeff Roberts, the vim-fu guru, does it again with a great post on "Bundling versioned AMIs rapidly in Amazon's EC2". It's a step-by-step guide on how to roll your own AMI, bundle it and upload it to S3, while keeping it versioned at the same time. Highly recommended.

Tuesday, July 21, 2009

Automated testing of production deployments

When you work as a systems engineer at a company that has a large scale system infrastructure, sooner or later you realize that you need to automate pretty much everything you do. You can't afford not to, if you want to keep up with the ever-present demands of scaling up and down the infrastructure.

The main promise of cloud computing -- infinite elastic scaling based on demand -- is real, but you can only achieve it if you automate your deployments. It's fairly safe to say that most teams that are involved in such infrastructures have achieved high levels of automation. Some fearless teams practice continuous deployment, others do frequent dark launches. All these practices are great, but my thesis is that in order to achieve fearlessness you need automated tests of your production deployments.

Note the word 'production' -- I believe it is necessary to go one step beyond running automated tests in an isolated staging environment (although that is a very good thing to do, especially if staging mirrors production at a smaller scale). That next step is to run your test harness in production, every time you deploy. And deployment, at a fast moving Web company these days, can happen multiple times a day. Trust me, with no automated tests in place, you'll never get rid of that nagging feeling in the pit of your stomach that you might have broken things horribly, in production.

So how do you go about writing automated tests for your deployments? I wrote a while ago about automating and testing your system setup checklists. Even testing small things such as 'is httpd/mysqld/postfix set up to run at boot time' will go a long way in achieving peace of mind.

Assuming you have a list of things to test (it can be just a couple of critical things for starters), how and when do you run the tests? Again, you can do the simplest thing that works -- a shell script that iterates through your production servers and runs the test scripts remotely on the servers via ssh (a minimal Python sketch of such a loop follows the list below). Some things I test this way these days are:

* do local MySQL databases on servers in a particular cluster contain the same data in certain tables? (this shows me that things are in sync across servers)
* is MySQL replication working as expected across the cluster of read-only slaves?
* are periodic operations happening as expected (here I can do a simple tail of a log file to figure it out)
* are certain PHP modules correctly installed?
* is Apache serving a number of requests per second that is not too high, but not too low either (where high and low are highly dependent on your traffic and application obviously)
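
Here is a minimal Python sketch of that ssh loop; the host names and the individual checks below are made up to mirror the list above (in practice you'd group checks per server role), and key-based ssh access is assumed:

#!/usr/bin/env python
"""Run a set of check commands on every production server via ssh."""
import subprocess

# hypothetical host names and checks -- adapt these to your own environment
SERVERS = ['web1.mydomain.com', 'web2.mydomain.com', 'db1.mydomain.com']
CHECKS = [
    'mysql -BN -e "CHECKSUM TABLE mydb.mytable"',                     # data in sync across servers?
    'mysql -e "SHOW SLAVE STATUS\G" | grep "Slave_IO_Running: Yes"',  # replication running?
    'php -m | grep -i memcache',                                      # PHP module installed?
]

failures = 0
for server in SERVERS:
    for check in CHECKS:
        proc = subprocess.Popen(['ssh', server, check],
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        status = 'OK'
        if proc.returncode != 0:
            status = 'FAIL'
            failures += 1
        print '%-25s %-4s %s' % (server, status, check)
        if out.strip():
            print '    ' + out.strip()

print '%d check(s) failed' % failures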

I run these tests (and many others) each time I push a change to production. No matter how small the change may seem, it can have unanticipated side effects. I found that having tests that probe the system from as many angles as possible is the most effective approach -- the angles in my case being Apache, MySQL, PHP, memcached for example. I also found that this type of testing (push-based if you want) is very good at showing discrepancies between servers. If you see a server being out of whack this way, then you know you need to attempt to fix it, or even terminate it and deploy a new one.

Another approach in your automated testing strategy is to run your test harness periodically (via cron for example) and also to write the harness in a proper language (Python comes to mind), integrated into a test framework. You can have the results of the tests emailed to you in case of failure. The advantage of this approach is that you can have things run automatically without your intervention (in the first approach, you still have to remember to run the test suite!).
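
A skeleton for that second approach might look like the following sketch; the host names are made up, the checks reuse the ssh idea from the previous snippet, and emailing failures is left to cron's MAILTO (or a small wrapper around the test run):

#!/usr/bin/env python
"""Production checks meant to run periodically from cron."""
import subprocess
import unittest

def remote(host, command):
    """Run a command on a remote host via ssh; return (exit_code, output)."""
    proc = subprocess.Popen(['ssh', host, command],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out

class ProductionChecks(unittest.TestCase):

    def test_apache_running_on_web_servers(self):
        for host in ['web1.mydomain.com', 'web2.mydomain.com']:   # made-up hosts
            code, _ = remote(host, 'pgrep apache2')
            self.assertEqual(code, 0, 'apache2 not running on %s' % host)

    def test_mysql_replication_on_slaves(self):
        for host in ['db2.mydomain.com']:                         # made-up slave
            code, out = remote(host, 'mysql -e "SHOW SLAVE STATUS\G"')
            self.assertEqual(code, 0, 'could not query mysql on %s' % host)
            self.assertTrue('Slave_IO_Running: Yes' in out,
                            'replication broken on %s' % host)

if __name__ == '__main__':
    unittest.main()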

The ultimate in terms of automated testing is to integrate it with your monitoring infrastructure. If you use Nagios for example, you can easily write plugins that essentially probe for the same things that your tests probe for. The advantage of this approach is that the tests will run every time Nagios runs, and you can set up alerts easily. One disadvantage is that it can slow down your monitoring, depending on the number of tests you need to run on each server. Monitoring typically happens very often (every 5 minutes is a common practice), so it may be overkill to run all the tests every 5 minutes. Of course, this should be configurable in your monitoring tool, so you can have a separate class of checks that only happen every N hours for example.
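
For reference, a Nagios plugin for one such probe is just a script that prints a one-line status and exits with the standard plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Here's a minimal sketch for the Apache requests-per-second check; the thresholds are made up and it assumes mod_status is enabled on localhost:

#!/usr/bin/env python
"""check_apache_reqs -- Nagios plugin: alert if Apache requests/sec look abnormal."""
import sys
import urllib

# made-up thresholds -- tune to your own traffic and application
LOW, HIGH = 5.0, 500.0

try:
    status = urllib.urlopen('http://localhost/server-status?auto').read()
    reqs_per_sec = float([line.split()[1] for line in status.splitlines()
                          if line.startswith('ReqPerSec')][0])
except Exception, e:
    print 'APACHE UNKNOWN - cannot read server-status: %s' % e
    sys.exit(3)

if reqs_per_sec < LOW or reqs_per_sec > HIGH:
    print 'APACHE WARNING - %.2f requests/sec outside [%.1f, %.1f]' % (reqs_per_sec, LOW, HIGH)
    sys.exit(1)

print 'APACHE OK - %.2f requests/sec' % reqs_per_sec
sys.exit(0)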

In any case, let me assure you that even if you take the first approach I mentioned (ssh into all servers and run commands remotely that way), you'll reap the rewards very fast. In fact, you'll like it so much that you'll want to keep adding more tests, so you can achieve more inner peace. It's a sure way to becoming test infected, but also to achieve deployment nirvana.

Friday, July 17, 2009

Managing multiple MySQL instances with MySQL Sandbox

MySQL doesn't support multi-master replication, i.e. you can't have one MySQL instance acting as a replication slave to more than one master. There are times when you need this functionality, for example for disaster recovery purposes, where you have a machine with tons of CPU, RAM and disk running several MySQL instances, each being a replication slave to a different MySQL master.

One tool I've used for easy management of multiple MySQL instances on the same box is MySQL Sandbox. It's nothing fancy -- a Perl module which offers a collection of scripts -- but it does make your life much easier.

To install MySQL Sandbox, download it from its Launchpad page, then run 'perl Makefile.PL; make; make install'. You also need to download a MySQL binary tarball which will serve as a common base used by your MySQL instances.

Here's an example of a script I wrote which creates a new MySQL Sandbox instance under a common directory (/var/mysql_slaves in my case). The script takes 2 arguments: a database name, and the name of the MySQL master from which that database is replicated. The script automatically increments the port number that the new sandbox instance will listen on, then creates the instance via a call like this:


/usr/bin/make_sandbox /usr/local/src/mysql-5.1.32-linux-x86_64-glibc23.tar.gz \
    --upper_directory=/var/mysql_slaves \
    --sandbox_directory=$SLAVEDB_NAME --sandbox_port=$LAST_PORT_NUMBER \
    --db_user=$SLAVEDB_NAME --db_password=PASSWORD \
    --no_confirm

As a result, there will be a new directory called $SLAVEDB_NAME under /var/mysql_slaves, which serves as the sandbox for the newly created MySQL instance. The script also adds some lines related to replication to the new MySQL instance configuration file (which is /var/mysql_slaves/my.sandbox.cnf).

To start the instance, run
/var/mysql_slaves/$SLAVEDB_NAME/start

To stop the instance, run
/var/mysql_slaves/$SLAVEDB_NAME/stop

To go to a MySQL prompt for this instance, run
/var/mysql_slaves/$SLAVEDB_NAME/use

At this point, you still don't have a functioning slave. You need to load the data from the master. One way to do this is to run mysqldump on the master with options such as ' --single-transaction --master-data=1'. This will include the master information (binlog name and position) in the DB dump.

The next step is to transfer the DB dump over to the box running MySQL Sandbox, and load it into the MySQL instance. I use a script similar to this.

You should now have a MySQL instance that acts as a replication slave to a specific master server. Repeat this process to set up other sandboxed MySQL instances that are slaves to other masters.

Note that MySQL Sandbox already includes some replication-related utilities (which I haven't used) and also an admin-type tool called sbtool. The documentation is pretty good.

Tuesday, July 14, 2009

Kent Langley's '10 rules for launching a web site'

The advice in this blog post by Kent Langley resonates with my experiences launching Web infrastructures of all types, large and small. Deploy early and often, automate your deployments, use version control, create checklists, have a rollback plan -- these are all very sensible things to do.

I would add one more very important thing that seems to be missing from the list: have an extensive suite of automated tests to check that your deployment steps did the right thing. Many people just stop at the automation step, and don't go beyond that to the testing step. It will come back to haunt them in the long run. But this is fodder for another blog post, which will be coming real soon now ;-)

Monday, July 13, 2009

Greatest invention since sliced bread: vimdiff

If you work at the ssh command prompt all day long (like I do), and if you need to compare text files and merge differences (like I do), then make sure you check out vimdiff (thanks to Jeff Roberts for bringing it to my attention).

If you run 'vimdiff file1 file2', the tool will split your screen vertically, with file1 displayed in a vim session on the left and file2 on the right. The differences between the 2 files will be highlighted. To jump from difference to difference, use ]c (forward) and [c (backward). When the cursor is on a difference block, use :diffget or do to merge the difference from the other file into the file where the cursor is; use :diffput or dp to merge the other way. To jump from one file's window to the other, use Ctrl-w-w. Google vimdiff for other tips and tricks. Definitely a good tool to have in your arsenal.

If you have the luxury of a graphical environment, I also recommend meld (thanks to Chris Nutting for the tip).

Recommended blog: Elastician

Elastician is the blog of Mitch Garnaat, the author of the amazingly useful boto Python library -- a collection of modules for managing AWS resources (EC2, S3, SQS,  SimpleDB and more recently CloudWatch).

Mitch has a great picture on what he calls the 'Cloud Computing Hierarchy of Needs' (in a reference to Maslow's self-actualization hierarchy). Very insightful.

Friday, July 10, 2009

Python mock testing techniques and tools

This is an article I wrote for Python Magazine as part of the 'Pragmatic Testers' column. Titus and I have taken turns writing the column, although we haven't produced as many articles as we would have liked.

Here is the content of my article, which appeared in the February 2009 issue of PyMag:

Mock testing is a controversial topic in the area of unit testing. Some people swear by it, others swear at it. As always, the truth is somewhere in the middle.

Let's get some terminology clarified: when people say they use mock objects in their testing, in most cases they actually mean stubs, not mocks. The difference is expanded upon with his usual brilliance by Martin Fowler in his article "Mocks aren't stubs".

In his revised version of the article, Fowler uses the terminology from Gerard Meszaros's 'xUnit Test Patterns' book. In this nomenclature, both stubs and mocks are special cases of 'test doubles', which are 'pretend' objects used in place of real objects during testing.  Here is Meszaros's definition of a test double:


Sometimes it is just plain hard to test the system under test (SUT) because it depends on other components that cannot be used in the test environment. This could be because they aren't available, they will not return the results needed for the test or because executing them would have undesirable side effects. In other cases, our test strategy requires us to have more control or visibility of the internal behavior of the SUT.

When we are writing a test in which we cannot (or chose not to) use a real depended-on component (DOC), we can replace it with a Test Double. The Test Double doesn't have to behave exactly like the real DOC; it merely has to provide the same API as the real one so that the SUT thinks it is the real one! 

These 'other components' that cannot be used in a test environment, or can only be used with a high setup cost, are usually external resources such as database servers, Web servers, XML-RPC servers. Many of these resources may not be under your control, or may return data that often contains some randomness which makes it hard or impossible for your unit tests to assert things about it.

So what is the difference between stubs and mocks? Stubs are used to return canned data to your SUT, so that you can make some assertions on how your code reacts to that data. This eliminates randomness from the equation, at least in the test environment. Mocks, on the other hand, are used to specify expectations on the behavior of the object called by your SUT. You indicate your expectations by specifying that certain methods of the mock object need to be called by the SUT in a certain order and with certain arguments.

Fowler draws a further distinction between stubs and mocks by saying that stubs are used for “state verification”, while mocks are used for “behavior verification”. When we use state verification, we assert things about the state of the SUT after the stub returned the canned data back to the SUT. We don't care how the stub obtained that data, we just care about the final result (the data itself) and about how our SUT processed that data. When we use behavior verification, not only do we care about the data, but we also make sure that the SUT made the correct calls, in the correct order, and with the correct parameters, to the object representing the external resource.

If readers are still following along after all this theory, I'm fairly sure they have at least two questions:

1) When exactly do I use mock testing in my overall testing strategy?
2) If I do use mock testing, should I use mocks or stubs?

I already mentioned one scenario when you might want to use mock testing: when your SUT needs to interact with external resources which are either not under your control, or which return data with enough randomness to make it hard for your SUT to assert anything meaningful about it (for example external weather servers, or data that is timestamped). Another area where mock testing helps is in simulating error conditions which are not always under your control, and which are usually hard to reproduce. In this case, you can mock the external resource, simulate any errors or exceptions you want, and see how your program reacts to them in your unit tests (for example, you can simulate various HTTP error codes, or database connection errors).

Now for the second question, should you use mocks or stubs? In my experience, stubs that return canned data are sufficient for simulating the external resources and error conditions I mentioned. However, if you want to make sure that your application interacts correctly with these resources, for example that all the correct connection/disconnection calls are made to a database, then I recommend using mocks. One caveat of using mocks: by specifying expectations on the behavior of the object you're mocking and on the interaction of your SUT with that object, you couple your unit tests fairly tightly to the implementation of that object. With stubs, you only care about the external interface of the object you're mocking, not about the internal implementation of that object.

Enough theory, let's see some practical examples. I will discuss some unit tests I wrote for an application that interacts with an external resource, in my case a SnapLogic server. I don't have the space to go into detail about SnapLogic, but it is a Python-based Open Source data integration framework. It allows you to unify the access to the data needed by your application through a single API. Behind the scenes, SnapLogic talks to database servers, CSV files, and other data sources, then presents the data to your application via a simple unified API. The main advantage is that your application doesn't need to know the particular APIs for accessing the various external data sources.

In my case, SnapLogic talks to a MySQL database and presents the data returned by a SELECT SQL query to my application as a list of rows, where each row is itself a list. My application doesn't know that the data comes from MySQL, it just retrieves the data from the SnapLogic server via the SnapLogic API. I encapsulated the code that interacts with the SnapLogic server in its own class, which I called SnapLogicManager. My main SUT is passed a SnapLogicManager object in its __init__ method, then calls its methods to retrieve the data from the SnapLogic server.

I think you know where this is going – SnapLogic is an external resource as far as my SUT is concerned. It is expensive to set up and tear down, and it could return data with enough randomness so I wouldn't be able to make meaningful assertions about it. It would also be hard to simulate errors using the real SnapLogic server. All this indicates that the SnapLogicManager object is ripe for mocking.

My application code makes just one call to the SnapLogicManager object, to retrieve the dataset it needs to process:


rows = self.snaplogic_manager.get_attrib_values()


Then the application processes the rows (list of lists) and instantiates various data structures based on the values in the rows. For the purpose of this article, I'll keep it simple and say that each row has an attribute name (element #0), an attribute value (element #1) and an attribute target (element #2). For example, an attribute could have the name “DocumentRoot”, the value “/var/www/mydocroot” and the target “apache”. The application expects that certain attributes are there with the correct target. If they're not, it raises an exception.

How do we test that the application correctly instantiates the data structure, and correctly reacts to the presence or absence of certain attributes? You guessed it, we use a mock SnapLogicManager object, and we return canned data to our application.

I will show here how to achieve this using two different Python mock testing frameworks: Mox, written by Google engineers, and Mock, written by Michael Foord.

Mox is based on the Java EasyMock framework, and it does have a Java-esque feel to it, down to the CamelCase naming convention. Mock feels more 'pythonic' – more intuitive and with cleaner APIs. The two frameworks also differ in the way they set up and verify the mock objects: Mox uses a record/replay/verify pattern, whereas Mock uses an action/assert pattern. I will go into these differences by showing actual code below.

Here is a unit test that uses Mox:


    def test_get_attrib_value_with_expected_target(self):

        # We return a SnapLogic dataset which contains attributes with correct targets
        canned_snaplogic_rows = [
            [u'DocumentRoot', u'/var/www/mydocroot', u'apache'],
            [u'dbname', u'some_dbname', u'database'],
            [u'dbuser', u'SOME_DBUSER', u'database'],
        ]

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = mox.MockObject(SnapLogicManager)

        # Return the canned list of rows when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndReturn(canned_snaplogic_rows)

        # Put all mocks created by mox into replay mode
        mox.Replay(mock_snaplogic_manager)

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()

        # Verify all mocks were used as expected
        mox.Verify(mock_snaplogic_manager)

        # We test that attributes with correct targets are retrieved correctly
        assert '/var/www/mydocroot' == myapp.get_attrib_value_with_expected_target("DocumentRoot", "apache")
        assert 'some_dbname' == myapp.get_attrib_value_with_expected_target("dbname", "database")
        assert 'SOME_DBUSER' == myapp.get_attrib_value_with_expected_target("dbuser", "database")


Some explanations are in order. With the Mox framework, when you instantiate a MockObject, it is in 'record' mode, which means it's waiting for you to specify expectations on its behavior. You specify these expectations by telling the mock object what to return when called with a certain method. In my example, I tell the mock object that I want the list of canned rows to be returned when I call its 'get_attrib_values' method: mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndReturn(canned_snaplogic_rows)

I only have one method that I am recording the expectations for in my example, but you could have several. When you are done recording, you need to put the mock object in 'replay' mode by calling mox.Replay(mock_snaplogic_manager). This means the mock object is now ready to be called by your application, and to verify that the expectations are being met.

Then you call your application code, in my example by passing the mock object in the constructor of MyApp: myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager). My test then calls myapp.get_attr_values_from_snaplogic(), which in turn interacts with the mock object by calling its get_attrib_values() method.

At this point, you need to verify that the expectations you set happened correctly. You do this by calling the Verify method of the mock object: mox.Verify(mock_snaplogic_manager).

If any of the methods you recorded were not called, or were called in the wrong order, or with the wrong parameters, you would get an exception at this point and your unit tests would fail.

Finally, you also assert various things about your application, just as you would in any regular unit test. In my case, I assert that the get_attrib_value_with_expected_target method of MyApp correctly retrieves the value of an attribute.

This seems like a lot of work if all you need to do is to return canned data to your application. Enter the other framework I mentioned, Mock, which lets you specify canned return values very easily, and also allows you to assert certain things about the way the mock objects were called without the rigorous record/replay/verify pattern.

Here's how I rewrote my unit test using Mock:


    def test_get_attrib_value_with_expected_target(self):
        # We return a SnapLogic dataset which contains attributes with correct targets
        canned_snaplogic_rows = [
            [u'DocumentRoot', u'/var/www/mydocroot', u'apache'],
            [u'dbname', u'some_dbname', u'database'],
            [u'dbuser', u'SOME_DBUSER', u'database'],
        ]

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = Mock()

        # Return the canned list of rows when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values.return_value = canned_snaplogic_rows

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()

        # Verify that mocks were used as expected
        mock_snaplogic_manager.get_attrib_values.assert_called_with(self.appname, self.hostname)

        # We test that attributes with correct targets are retrieved correctly
        assert '/var/www/mydocroot' == myapp.get_attrib_value_with_expected_target("DocumentRoot", "apache")
        assert 'some_dbname' == myapp.get_attrib_value_with_expected_target("dbname", "database")
        assert 'SOME_DBUSER' == myapp.get_attrib_value_with_expected_target("dbuser", "database")


As you can see, Mock allows you to specify the return value for a given method of the mock object, in my case for the 'get_attrib_values' method. Mock also allows you to verify that the method has been called with the correct arguments. I do that by calling assert_called_with on the mock object. If you just want to verify that the method has been called at all, with no regard to the arguments, you can use assert_called.
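For example, a minimal check along those lines, reusing the mock object from the test above, might look like this (the exact helpers available depend on the version of the Mock library you have installed):

# The 'called' attribute is a boolean flag that tells you whether the
# method was invoked at all, regardless of arguments
assert mock_snaplogic_manager.get_attrib_values.called

# More recent versions of the library also keep track of how many times
# the method was invoked
assert mock_snaplogic_manager.get_attrib_values.call_count == 1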

There are many other things you can do with both Mox and Mock. Space doesn't permit me to go into many more details here, but I strongly encourage you to read the documentation and try things out on your own.

Another technique I want to show is how to simulate exceptions using the Mox framework. In my unit tests, I wanted to verify that my application reacts correctly to exceptions thrown by the SnapLogicManager class. Those exceptions are thrown, for example, when the SnapLogic server is not running. Here is the unit test I wrote:


    def test_get_attr_values_from_snaplogic_when_errors(self):
        # We simulate a SnapLogicManagerError and verify that it is caught properly

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = mox.MockObject(SnapLogicManager)

        # Simulate a SnapLogicManagerError when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndRaise(SnapLogicManagerError('Boom!'))

        # Put all mocks created by mox into replay mode
        mox.Replay(mock_snaplogic_manager)

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()


        # Verify all mocks were used as expected
        mox.Verify(mock_snaplogic_manager)

        # Verify that MyApp caught and logged the exception
        line = get_last_line_from_log(self.logfile)
        assert re.search('myapp - CRITICAL - get_attr_values_from_snaplogic --> SnapLogicManagerError: \'Boom!\'', line)


I used the following Mox API for simulating an exception: mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndRaise(SnapLogicManagerError('Boom!')).


To verify that my application reacted correctly to the exception, I checked the application log file, and I made sure that the last line logged contained the correct exception type and value.


Space does not permit me to show a Python-specific mock testing technique which, for lack of a better name, I call 'namespace overriding' (it is really monkey patching applied for testing purposes, so maybe we can call it monkey testing?). I refer the reader to my blog post on 'Mock testing examples and resources' and will just quickly describe the technique here. Imagine that one of the methods of your application calls urllib.urlretrieve in order to download a file from an external Web server. Did I say external Web server, as in 'external resource not under your control'? I did, so you know that mock testing will help. My blog post shows how you can write a mocked_urlretrieve function, and override the name urllib.urlretrieve in your unit tests with your mocked version mocked_urlretrieve. Simple and elegant. The blog post also shows how you can return various canned values from the mocked version of urlretrieve, based on different input values.
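To give you a flavor of the technique, here is a minimal sketch of the idea -- the file names and URL below are hypothetical, and the full details are in the blog post mentioned above:

import urllib

def mocked_urlretrieve(url, filename=None, *args, **kwargs):
    """Pretend we downloaded 'url' and return the path of a canned local file."""
    return ('/tmp/canned_response.xml', None)

# Override the real function with the mocked version for the duration of the test
original_urlretrieve = urllib.urlretrieve
urllib.urlretrieve = mocked_urlretrieve
try:
    # the code under test calls urllib.urlretrieve and gets the canned file back
    filename, headers = urllib.urlretrieve('http://www.example.com/some/file')
    assert filename == '/tmp/canned_response.xml'
finally:
    # restore the original binding (typically done in tearDown)
    urllib.urlretrieve = original_urlretrieve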

I started this article by saying that mock testing is a controversial topic in the area of unit testing. Many people feel that you should not use mock testing because you are not testing your application in the presence of the real objects on which it depends, so if the code for these objects changes, you run the risk of having your unit tests pass even though the application will break when it interacts with the real objects. This is a valid objection, and I don't recommend you go overboard with mocking every single interaction in your application. Instead, limit your mock testing, as I said in this article, to resources whose behavior and returned data are hard to control.

Another important note: whatever your unit testing strategy is, whether you use mock testing techniques or not, do not forget that you also need functional tests and integration tests for your application. Integration tests especially need to exercise all the resources that your application interacts with. For more information on the different types of testing you need to consider, please see my blog posts 'Should acceptance tests be included in the continuous build process?' and 'On the importance of functional testing'.

Thursday, July 09, 2009

Setting system-wide environment variables on RedHat-based machines

I keep forgetting this, so I'm committing it to long-term memory. If you have a RedHat-based operating system (RH, CentOS etc) and you need to set certain environment variables so they're available to every user, one good place to do it is to drop a script ending in .sh in /etc/profile.d. Then export your desired environment variables there.

Here's an example from a CentOS machine I have:

# cd /etc/profile.d
# cat java.sh 
export JAVA_HOME=/usr/java/default

Note that you can do whatever else you need in these scripts -- for example you can set up aliases etc. Every script in /etc/profile.d which ends in .sh gets sourced by /etc/profile.

Monday, July 06, 2009

Resource monitoring and graphing with Cacti in EC2

My colleague Jeff Roberts just posted a blog entry on 'Using Cacti to monitor a large scale infrastructure in Amazon’s EC2'. Highly recommended for people interested in monitoring and graphing their system resources across hundreds of nodes in Amazon EC2.

Thursday, July 02, 2009

Dark launching and other lessons from Facebook on massive deployments

I came across this note from the Engineering team at Facebook which talks about how they managed to smoothly launch their recent 'pick a username' feature. The title of the note is, appropriately enough, 'Hammering usernames' -- this is of course because they were expecting their infrastructure to be hammered.

In the note I saw for the first time a name for a strategy that teams I've been involved with have applied before: 'dark launching'. Essentially, dark launching is releasing a new feature to a subset of your users, mostly with no UI changes, but otherwise exercising all the parts of your infrastructure involved in serving that feature. A good strategy to apply when you're dealing with massive, large-scale deployments, and when you want to see how your infrastructure behaves in conditions that are as close to production as possible. Because remember, there's nothing like production! Your careful load/stress testing exercises in a lab environment ain't gonna cut it.
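To make the idea more concrete, here is a minimal sketch of how a dark-launched code path could be gated in Python -- all the names here are hypothetical, the point being that the bucketing is deterministic per user and the UI stays unchanged:

import hashlib

def in_dark_launch_bucket(user_id, feature='usernames', percentage=5):
    """Deterministically place a small percentage of users in the dark launch bucket."""
    digest = hashlib.md5('%s:%s' % (feature, user_id)).hexdigest()
    return int(digest, 16) % 100 < percentage

# In the request handler: exercise the new backend code path for a subset
# of users, but discard the results so nothing changes in the UI
# if in_dark_launch_bucket(user.id):
#     prepare_username_suggestions(user)   # hypothetical backend call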

The note from Facebook has all sorts of other nuggets of wisdom related to massive infrastructure deployments. I recommend subscribing to the RSS feed for 'Engineering @ Facebook's Notes'.

While googling for 'dark launching', I also came across this very good post by Dare Obasanjo. Recommended reading.

Tuesday, June 30, 2009

A redbot from mnot

mnot == Mark Nottingham (RESTful Web Services guru and blogger)

redbot == Resource Expert Droid == a testing tool written in Python that "checks HTTP resources to see how they use HTTP, makes suggestions, and finds common protocol mistakes"

Use this tool if you want to see how well your Web application does in terms of HTTP connections, content negotiation, caching and validation. Very useful both for traditional Web sites and for Web services.

Monday, June 29, 2009

Presentations from AWS Start-Up Event

OpenX was invited to present as part of the AWS Start-Up Tour 2009, which stopped in Los Angeles on June 15th. John Linden, our CTO, gave a presentation on the ways we use EC2 here at OpenX. He got a lot of questions, which means people were awake and alert ;-)

Until John's slides are posted to Slideshare, here are 2 other presentations, one from eHarmony on the way they use Elastic MapReduce, and one from Amazon's own Jinesh Varia on 'Architecting for the AWS cloud'. Interesting stuff. Check out this Slideshare page for more links to other AWS-related presentations.

Thursday, May 21, 2009

The Second Law of Automated Testing

I was invited by Paul Moore and Paul Hodgetts to give a presentation at the Agile/XPSoCal monthly evening meeting, which happened last night in Irvine, at the Capital Group offices. The topic of my presentation was 'How to Get to "Done" - Agile and Automated Testing Techniques and Tools'. I think it went pretty well, there were 30+ people in attendance and I got a lot of questions at the end, which is always a good sign. Here are my slides in PDF format. I presented a lot of tools as live demos outside of the slides, but I hope that the points I made in the slides will still be useful to some people.

In particular, I want to present here what I claim to be...

The Second Law of Automated Testing

"If you ship versioned software, you need automated tests."

At the talk last night I was waiting to be asked about the first law of automated testing, but nobody ventured to ask that question ;-) (for the record, my answer would have been 'you need to buy me a beer to find that out').

But I strongly believe that if you have software that SHIPS and that is VERSIONED, then you need automated tests for it. Why? Because how would you know otherwise that version 1.4 didn't break things horribly compared to version 1.3? You either employ an army of testers to manually test each and every 1.3 feature that is present in 1.4, or you use a strong suite of automated regression tests that cover all major features of 1.3 and that show you right away if any were broken in 1.4. Your choice.

Notice that I also qualify the software as 'software that ships'. This implies that you hopefully use sound software engineering processes and techniques to build that software. I am not referring to toy projects, or 1-page Web sites for temporary events, or even academic projects that are never shipped widely to users. All these can probably survive with no automated tests.

If you think you have some software that ships and is versioned, but you found that you're doing very well with no automated tests, I'd like to hear about it, so please leave a comment.

Friday, May 15, 2009

MySQL fault-tolerance and disaster recovery techniques

Any non-trivial MySQL installation needs to be protected against failures, and especially so in a 'cloud' environment, where failure should be expected. I've had bad experiences with MySQL clustering (I tenderly refer to it as MySQL clusterf**k), so I'm going to talk about MySQL replication in this post.

The most common fault-tolerance scenario in a MySQL environment is to have a master database server and a pool of load-balanced slave database servers. Hopefully your application is configurable so it can write to the master DB and read from the slave DB pool. If it is not, you can still use this technique (with some limitations) by going through MySQL Proxy, as detailed in another blog post of mine.
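As a rough illustration, the read/write split can be as simple as something like this in your application's database access layer -- a minimal sketch with hypothetical host names and credentials, assuming the MySQLdb module:

import random
import MySQLdb

MASTER = dict(host='db-master', user='app', passwd='secret', db='mydb')
SLAVES = [
    dict(host='db-slave1', user='app', passwd='secret', db='mydb'),
    dict(host='db-slave2', user='app', passwd='secret', db='mydb'),
]

def get_connection(for_write=False):
    """Send writes to the master and spread reads across the slave pool."""
    params = MASTER if for_write else random.choice(SLAVES)
    return MySQLdb.connect(**params)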

There is plenty of documentation available on setting up MySQL replication. I will jot down here some notes on things I find myself doing over and over again, in a condensed format that hopefully will benefit others too.

Step 0 is to enable binary logging on the master database. That's all you need to do for a MySQL DB server to be able to function as a master. To achieve this, you can add lines like these in /etc/my.cnf and restart mysqld:
 
server-id = 1
log-bin = /var/lib/mysql/mysql-bin

One other option you might want to set up is the binlog format. For recent MySQL versions, the default is STATEMENT. For some types of updates to the master, I found it is better to specify ROW as the binlog format (for an explanation of the differences between the 2 types, and for more info than you ever wanted about binary logging, see the official documentation):

binlog_format = ROW

You also need to create a MySQL user on the master DB and grant it REPLICATION SLAVE rights. You can use a statement like this:

GRANT REPLICATION SLAVE ON *.* TO 'replicant'@'IP_of_slave_DB' IDENTIFIED BY 'somepassword';


Setting up a MySQL slave when you can lock tables on the master

This is the recommended way of setting up a MySQL slave DB machine. It requires locking the tables for writes on the master DB, which is something you may or may not afford to do. Here are the steps you need to go through:

1) Connect to the master DB server and issue this command:

FLUSH TABLES WITH READ LOCK;

2) Note the binlog file name and position on the master by running this command:

SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000004 | 87547369 |              |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.01 sec)


3) Leave the current mysql session open so that the tables are still locked on the master, and in a different session take a database dump of the mysql database and of the application database on the master. You can use a command line such as:


mysqldump -u root -p$MY_ROOT_PW --databases mysql \
--lock-all-tables | /bin/gzip > mysql.sql.gz

mysqldump -u root -p$MY_ROOT_PW --databases $MYDB \
--lock-all-tables | /bin/gzip > $MYDB.sql.gz

4) Once the dump is done (a process which on a very large database can take hours), go ahead and unlock the tables in the first MySQL session:

UNLOCK TABLES;

5) Now you're ready to set up a MySQL slave database. It's a good idea to set up binary logging on all your slaves, so that if your master DB fails, any slave can be promoted to a master. If you do turn binary logging on, do NOT also enable log-slave-updates (because if you do, and if you promote a slave to a master, then the other slaves might receive some updates twice -- complete explanation available here).

The DB machine you want to set up as a slave should have lines similar to these in its /etc/my.cnf file (server-id needs to be different from the master ID and any other slave IDs that talk to the same master):

server-id = 2
log-bin = /var/lib/mysql/mysql-bin
binlog_format = ROW

6) On the machine you want to set up as a slave, load the mysql dumps of the mysql DB and of your application database (the ones you took in step 3). Note that you may need to create the application database before you can load the application DB dump into it.

7) On the slave, fire up a mysql prompt and use the 'CHANGE MASTER TO' command to specify the master DB, the binlog file and the binlog position (you need to use the values from step 2):

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
MASTER_HOST='master_database_server_name',
MASTER_USER='replicant',
MASTER_PASSWORD='somepassword',
MASTER_LOG_FILE='mysql-bin.000004',
MASTER_LOG_POS=87547369;
START SLAVE;

8) Run the 'SHOW SLAVE STATUS \G' command on the newly created slave DB and make sure that the values for both Slave_IO_Running and Slave_SQL_Running show as YES, and that Seconds_Behind_Master is 0 (it can take a while initially for this value to converge to 0, but it should do so). Here is an example of the output of this command:

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: my_master_host
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 157767054
Relay_Log_File: crt-relay-bin.000012
Relay_Log_Pos: 112340434
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: MYDB.tmp\_%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 112340289
Relay_Log_Space: 112340630
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File: /etc/pki/tls/cert.pem
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)

Note that I am explicitly excluding from replication tables that start with tmp, which in my case are temporary tables created by certain operations on the master DB which are not needed on the slaves. To do this, I added this line to /etc/my.cnf on the slaves (all replication filtering is done at the slave level):
replicate-wild-ignore-table = MYDB.tmp\_%
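Once you have more than a couple of slaves, it pays to script this check instead of eyeballing the output. Here is a minimal sketch -- hypothetical host name and credentials, assuming the MySQLdb module -- that flags a slave whose replication threads have stopped or which is lagging behind the master:

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='db-slave1', user='monitor', passwd='secret',
                       cursorclass=MySQLdb.cursors.DictCursor)
cursor = conn.cursor()
cursor.execute('SHOW SLAVE STATUS')
status = cursor.fetchone()

if status is None:
    print 'This server is not configured as a slave'
elif status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
    print 'Replication threads are not running: %s' % status['Last_Error']
elif status['Seconds_Behind_Master'] > 60:
    print 'Slave is lagging %s seconds behind the master' % status['Seconds_Behind_Master']
else:
    print 'Replication looks healthy'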

Promoting a slave database to master

Let's say disaster strikes and your master DB goes down. At this point, if you have replication set up as above, you can easily promote one of the slave DB machines to master, and reconfigure the other slaves to replicate from the newly promoted machine. The official documentation for this scenario is here and it's very good. Let's say you have master M01 and slaves S01, S02 and S03. Master M01 dies. You want to promote slave S01 to master, and set up S02 and S03 to replicate from S01.

On S01, run these commands at the MySQL prompt:

STOP SLAVE;
RESET MASTER;
CHANGE MASTER TO MASTER_HOST='';

On S02 and S03, run these commands at the MySQL prompt:

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='S01';
START SLAVE;

Now if you run 'SHOW SLAVE STATUS\G' on the slaves, you should see no errors, and you should also see the master DB hostname shown as 'S01' instead of 'M01'.

While we're on the subject of switching the master DB, it can happen that the slave DBs will get some updates from the newly promoted master that conflict with their current view of the database. For example, they can receive from the master a duplicate insert, or a delete on a row that doesn't exist in their database. In these cases, to bring the slave back to a sane state, you can issue commands like the following, where N is 1 or 2 (see full explanation here):

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N;
START SLAVE;

You can try running the skip command repeatedly until the slave goes back to a successful replication state.

Setting up a slave database from a hot backup of the master

Let's say you have your master database up and running, and you want to set up a new slave without locking the tables for writes on the master. In this case, you can use a product such as InnoDB Hot Backup, which is very much worth its $500/year/host price. What's more, they provide a 30-day free evaluation binary tied to the host name of your DB machine, which is nice if you need something in a critical situation, or if you want to test it before committing to pay.

Here's a procedure for setting up a new slave DB from a hot backup on the master. The InnoDB Hot Backup documentation is very good, and what follows is a subset I used from that documentation.

1) On the master, create two mini configuration files which are tiny subsets of my.cnf. Call one for example my.cnf.source and the other one my.cnf.destination. The source file needs to contain lines similar to these referring to the location of your live MySQL installation:

# cat /etc/my.cnf.source
[mysqld]
datadir = /var/lib/mysql/
innodb_data_home_dir = /var/lib/mysql/
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /var/lib/mysql/
set-variable = innodb_log_files_in_group=2
set-variable = innodb_log_file_size=512M

The destination file needs to contain similar lines, but pointing to a directory where the backup files will be created (that directory needs to be empty). For example:

# cat /etc/my.cnf.destination
[mysqld]
datadir = /var/hot-backups
innodb_data_home_dir = /var/hot-backups
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /var/hot-backups
set-variable = innodb_log_files_in_group=2
set-variable = innodb_log_file_size=512M

2) On the master, run the ibbackup binary and point it to the 2 configuration files:

# /path/to/ibbackup /etc/my.cnf.source /etc/my.cnf.destination 
 
This step can be quite lengthy, depending on the size of your database, but note that you don't need to lock any tables on the master during this time. Upon the completion of this step, you should see an InnoDB data file (its name is the one you specified in the innodb_data_file_path variable in the config files), and an InnoDB transaction log called ibbackup_logfile. Note that this is not identical to the InnoDB logs on the master. To create those logs, you need to go to the next step.

3) On the master, apply the transaction logs created by the hot backup process by running this command:

# /path/to/ibbackup --apply-log /etc/my.cnf.destination

When this is done (again it can take a while), you should see N log files called ib_logfile0, ib_logfile1, ..., ib_logfile(N-1) in the destination directory -- where N is the value of the variable innodb_log_files_in_group that you set in the configuration file.

4) On the master, create a tar.gz archive of all directories in the MySQL datadir which contain MyISAM tables or the .frm files of InnoDB tables (the main one being of course the mysql directory, containing the MyISAM tables for the mysql database -- assuming of course you've kept the default of MyISAM for the mysql DB).

5) Now you're ready to transfer the data file created in step 2, the log files created in step 3, and the archives created in step 4 to a new machine running MySQL, which you intend to set up as a slave DB. Simply scp the files over. On the target machine, stop mysql, move /var/lib/mysql (or wherever your datadir is) to /var/lib/mysql.bak, create a brand new /var/lib/mysql directory and drop all the files you transferred into that directory (un-tar-ing the tar.gz files appropriately). Also run 'chown -R mysql.mysql /var/lib/mysql'. Finally, make sure the my.cnf file on the slave has binary logging enabled (in case you ever need to promote this slave to a master).

6) Restart the mysqld process on the target machine, and make note of the binlog file and position, which are captured in the mysql log file. You should see a line similar to this:

InnoDB: Last MySQL binlog file position 0 6199825, file name /var/lib/mysql/mysql-bin.000008

Now go to the mysql prompt on the target machine and run:

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
MASTER_HOST='master_database_server_name',
MASTER_USER='replicant',
MASTER_PASSWORD='somepassword',
MASTER_LOG_FILE='mysql-bin.000008',
MASTER_LOG_POS=6199825;
START SLAVE;

At this point, 'SHOW SLAVE STATUS\G' should show no errors, and the new slave should be replicating correctly from the master DB server. It may take a while for the slave to catch up, depending on when you took the hot backup on the master.

Before I finish this post, one word of advice when it comes to mounting EBS volumes in EC2: do not mount /var by itself on an EBS, because if for some reason the EBS becomes unavailable or fails, you won't be able to ssh back into your instance. Why is that? Because sshd (at least in CentOS) needs /var/empty to be available for privilege separation purposes.

If you want to take advantage of an EBS on an EC2 instance functioning as a MySQL database server, it's better to either mount /var/lib/mysql on an EBS, or specify a non-default data directory for MySQL, which you then mount from an EBS.

UPDATE: EC2 backup strategies

An anonymous comment reminded me that I need to also discuss backups. Doh. In an EC2 environment, it's very easy to back up a whole EBS volume by means of a snapshot.

Of course, if you do a snapshot with no other backups, the database files will be 'live', but I managed in one case to
1) detach an EBS containing /var/lib/mysql from an instance that was failing, and
2) attach the EBS to another instance and mount it in /var/lib/mysql

I then restarted mysqld on the new instance and everything worked as expected. This is NOT the recommended strategy however. What is recommended is to do a database dump (either a hot backup if you can afford it, or a simple mysqldump) to an EBS, and snapshot the EBS periodically.

Alternatively, you can use various S3 utilities to capture the backups directly to S3. The EBS snapshot solution is better IMO because you can quickly recreate an EBS volume from a snapshot, then mount it to either the original instance, or to a new instance.
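Since boto came up earlier in this blog, here is a minimal sketch of how the periodic snapshot could be scripted -- the volume ID below is hypothetical, and the AWS credentials are assumed to be available to boto via environment variables or its configuration file:

import boto

# Connect to EC2 using the credentials boto picks up from the environment
conn = boto.connect_ec2()

# EBS volume holding the MySQL dumps (hypothetical ID)
volume_id = 'vol-12345678'

snapshot = conn.create_snapshot(volume_id, 'nightly MySQL backup snapshot')
print 'Created snapshot %s (status: %s)' % (snapshot.id, snapshot.status)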

However, EBS volumes DO sometimes fail, so another thing to think about is to run your EC2 instances (especially your slave DBs) in different availability zones. We had an issue with 2 of our database servers failing at the same time in zone US-East-1a due to EBS issues, and the thing that saved us is that we had slaves in other availability zones that weren't affected.

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...