Suicide Workers

by Akash Agrawal on May 4, 2011


At SlideShare, we process many thousands of documents a day. The documents come in all types, from PDF and PPT to Keynote files. The pace of uploads is heavy, and varies dramatically over time.

The documents have to be converted from their native formats into a format that can be shared on the web. This is hard work: it can take a beefy server over a minute working at full capacity just to convert one document. And converting one video can take several minutes! So we have lots of servers for doing this.

We host our document conversion infrastructure on the cloud (Amazon EC2), so we knew we could quickly create new servers when needed (via an API call), and just as quickly remove servers when they were no longer needed. So we started exploring how to match the supply of our servers with the demands of our users in real time.

Our first effort at dealing with this was to use Amazon EC2 to add resources during times of peak load. We identified the business hours in US time zones as being peak time. So we wrote a script that created more servers at the beginning of the New York work day, and tore them down at the end of the Los Angeles work day.
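In cron terms, that first approach is just a pair of fixed entries, something like this (the times and script names below are only for illustration, assuming the cron host runs on UTC):

#Spin up extra conversion workers around the start of the New York work day
0 13 * * 1-5 /usr/local/bin/spawn_peak_workers
#Tear them back down around the end of the Los Angeles work day (early the next morning in UTC)
0 2 * * 2-6 /usr/local/bin/teardown_peak_workers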

This didn’t work as well as we had hoped. We were able to handle the peak traffic better than before, but the system still got overwhelmed, both during peak and non-peak times. And fixing this by adding more servers that were only occasionally being used seemed like a wasteful solution to the problem. Instead of adjusting the number of servers twice a day according to a fixed plan written ages ago, we needed to adjust them constantly in response to how much load we were getting right now.

Adding new servers when the system was under load was straightforward: since we dispatch work to our cluster using a queue (specifically SQS), a cron job that tracks the size of our job queue can easily create new servers when it notices the queue starting to get too long.

But how to tear down the servers when they are no longer needed? You don’t want to suddenly delete a server that’s in the middle of doing a job. And you don’t want to spawn servers and then lose track of them, or you’ll quickly start wasting money. As we thought about it, we realized that the surplus server itself was in the best position to know when it was safe to shut down, and to make sure the shutdown actually happened.

We call these servers “suicide workers” for this reason: they work for a couple of hours, clean up their worksite, and then carefully remove themselves from existence. This gives us the ability to throw extra capacity at a problem very aggressively, knowing that the capacity will bleed off in a couple hours automatically. We don’t even have to keep track of these servers very carefully, because we know they are transient and will shut themselves down in a couple hours.

We chose two hours as an interval because it takes about 15 minutes to spin up an EC2 instance. So if you use a cloud server for 1 hour, you end up paying for 60 minutes of compute and only getting 45! This is not a good deal (25% waste). Paying for 120 minutes and getting 105 minutes (12.5% waste) is much better. For machines that only live a few hours, you also want to make sure that you shut them down right at the end of a billing hour, or you’ll end up paying for an extra hour of compute time that you didn’t actually use. We experimented with having suicide workers live for 3 and 4 hours (rather than 2): it ended up being more expensive without improving our average wait time.
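To make that concrete, here is a rough sketch of what the shutdown side could look like on the worker itself (this is illustrative, not the exact script we run). The flag files, the scratch directory, and the thresholds are placeholders for however your job consumer coordinates with the rest of the box; the EC2 pieces assume the worker has the EC2 API tools and credentials installed, just like the launch script further down.

#!/bin/bash

#Hypothetical suicide logic, run on the adhoc worker itself (a sketch, not our
#exact production script). Assumes the job consumer stops pulling new work
#when /var/run/stop_taking_jobs exists, and touches /var/run/job_in_progress
#while it is converting a document.

LIFETIME_MINUTES=120   #roughly two billing hours
INSTANCE_ID=`curl -s http://169.254.169.254/latest/meta-data/instance-id`

while true
do
  #Uptime is a close enough approximation of minutes since launch
  UPTIME_MINUTES=`awk '{print int($1/60)}' /proc/uptime`
  MINUTES_INTO_HOUR=$(( UPTIME_MINUTES % 60 ))

  #Only die after we have lived our full lifetime, and only in the last few
  #minutes of a billing hour, so we use up the time we have already paid for
  if [ $UPTIME_MINUTES -ge $(( LIFETIME_MINUTES - 10 )) -a $MINUTES_INTO_HOUR -ge 50 ]
  then
    touch /var/run/stop_taking_jobs        #stop accepting new jobs
    while [ -f /var/run/job_in_progress ]  #let the current job finish
    do
      sleep 30
    done
    rm -rf /tmp/conversion_scratch         #clean up the worksite (placeholder path)
    ec2-terminate-instances $INSTANCE_ID   #and remove ourselves from existence
    exit 0
  fi
  sleep 60
done

The important part is the order of operations: stop taking new work first, finish what is in flight, clean up, and only then terminate.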

Of course, there are lots of other ways to scale cloud servers up and down in response to demand. The most obvious is the Auto Scaling feature provided by AWS. However, Auto Scaling works best for scaling servers when their internal metrics (CPU load, etc.) hit a certain threshold. We needed to allocate capacity when the system as a whole had hit its capacity limit, not just when a given machine was working hard. The length of the job queue is the best proxy for this.

So if you’re doing batch processing using cloud servers, I recommend you give the suicide workers approach a try! Below are some code snippets that should get you started. And if you enjoy thinking about this kind of stuff, check out our jobs page … we’re hiring. ;->

Scripts for spawning new adhoc instances:

We run the following script, which monitors the SQS queue length and spawns new EC2 instances when the queue stays too long.

#!/bin/bash

#Monitors the SQS queue and launches a new worker when the backlog stays high
echo $$ > /var/run/pid_of_this_script

QUEUE='MY-QUEUE'        #SQS queue for this specific factory
ADHOC_ROLE='AdhocRole'  #puppet class to apply to new workers (passed to launch_adhoc_instance below)

while true
do
 QUEUE_LEN_INIT=`ruby SQSlen.rb $QUEUE | tail -1`
 sleep 5m
 QUEUE_LEN_FINAL=`ruby SQSlen.rb $QUEUE | tail -1`
 #Only launch if the queue has stayed above 50 messages for the whole 5-minute window
 if [ $QUEUE_LEN_INIT -gt 50 -a $QUEUE_LEN_FINAL -gt 50 ]
 then
     launch_adhoc_instance $ADHOC_ROLE
 fi
done

The following Ruby code returns the SQS queue length:

#!/usr/bin/ruby

#File: SQSlen.rb
#Prints the number of messages currently in the given SQS queue.
#SQSlen.yml is expected to look like:
#  SQS:
#    access-key: YOUR_ACCESS_KEY
#    secret-key: YOUR_SECRET_KEY

require 'rubygems'
require 'right_aws'
require 'yaml'
require 'logger'

def get_queue_len(queue_name)
  RightAws::SqsGen2.new($conf['SQS']['access-key'],
                        $conf['SQS']['secret-key'],
                        {:logger => $logger}).queue(queue_name).size
rescue Exception => e
  puts "SQS Error: #{e.message}"
  return 0
end

if ARGV.length != 1
  puts "Usage: ruby SQSlen.rb <queue-name>"
  exit 1
end

queue_name  = ARGV[0]
config_path = "SQSlen.yml"
$conf       = YAML.load(File.open(config_path))
$logger     = Logger.new(STDERR)  #log to stderr so stdout is just the queue length

len = get_queue_len(queue_name)
puts len
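To check a queue by hand, run it the same way the monitoring loop does:

ruby SQSlen.rb MY-QUEUE

The queue depth comes out on the last line of output, which is why the script above pipes it through tail -1.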

The following script creates a new worker instance:

#!/bin/bash

#File: launch_adhoc_instance

if [ $# -ne 1 ]
then
     echo "Usage: launch_adhoc_instance AdhocRole"
     exit 1
fi

#These config files define the variables used below: AMI, NUM_INSTANCES, TYPE,
#KEY_PAIR, SECURITY_GROUP, DB_FILE, PUPPET_MASTER and the SSH/SCP commands
source configs/global_config
source configs/adhoc_config

#Spawn a regular instance
ec2-run-instances $AMI -n $NUM_INSTANCES -t $TYPE -k $KEY_PAIR --group $SECURITY_GROUP > /tmp/ec2_instance_request

INSTANCE_ID=`cat /tmp/ec2_instance_request | tail -1 | cut -f2`
STATUS=`cat /tmp/ec2_instance_request | tail -1 | cut -f6`
rm -f /tmp/ec2_instance_request

REQUIRED_STATUS="running"
#Wait for EC2 to report the instance as running
while [ "$STATUS" != "$REQUIRED_STATUS" ]
do
 sleep 60
 STATUS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f6`
done

sleep 60

#Instance is now active. Capture the data associated with it: instance id, external and internal DNS.
INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `

#Record keeping: remember which adhoc instances we have launched
echo "`date`: $INSTANCE_ID: $INSTANCE_EXTERNAL_DNS: $INSTANCE_INTERNAL_HOSTNAME: PDF2SWFADHOC" >> $DB_FILE

#If we can't get the internal hostname within the next minute or so, give up
COUNT=0
while [ -z "$INSTANCE_INTERNAL_HOSTNAME" ]
do
 COUNT=`expr $COUNT + 1 `
 sleep 10
 INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `
 INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
 if [ $COUNT -ge 6 ]
 then
     exit 1
 fi
done

#Configure puppetmaster to associate relevant class with node
$SCP root@$PUPPET_MASTER:/etc/puppet/manifests/adhoc_nodes.pp /tmp/adhoc_nodes.pp
grep $INSTANCE_INTERNAL_HOSTNAME /tmp/adhoc_nodes.pp

if [ $? -eq 0 ]
then
 sed "/$INSTANCE_INTERNAL_HOSTNAME/d" /tmp/adhoc_nodes.pp > /tmp/newnodes.pp
 mv /tmp/newnodes.pp /tmp/adhoc_nodes.pp
fi

echo "node $INSTANCE_INTERNAL_HOSTNAME { include $1 }" >> /tmp/adhoc_nodes.pp
$SCP /tmp/adhoc_nodes.pp root@$PUPPET_MASTER:/tmp
$SSH root@$PUPPET_MASTER "mv /tmp/adhoc_nodes.pp /etc/puppet/manifests/adhoc_nodes.pp"

rm -f /tmp/adhoc_nodes.pp

#Run puppet twice so the new node picks up its full configuration
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"
sleep 60
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"

Akash Agrawal
Senior Software Engineer
