Suicide Workers

At SlideShare, we process many thousands of documents a day. The documents come in all types, from PDF and PPT to Keynote files. The pace of uploads is heavy, and varies dramatically over time.

The documents have to be converted from their native formats into a format that is shareable on the web. This is hard work: it can take a beefy server over a minute working at full capacity just to convert one document. And converting one video can take several minutes! So we have lots of servers for doing this.

We host our document conversion infrastructure on the cloud (Amazon EC2), so we knew we could quickly create new servers when needed (via an API call), and just as quickly remove servers when they were no longer needed. So we started exploring how to match the supply of our servers with the demands of our users in real time.

Our first effort at dealing with this was to use Amazon EC2 to add resources during times of peak load. We identified the business hours in US time zones as being peak time. So we wrote a script that created more servers at the beginning of the New York work day, and tore them down at the end of the Los Angeles work day.

This didn’t work as well as we had hoped. We were able to handle the peak traffic better than before, but the system still got overwhelmed, both during peak and non-peak times. And fixing this by adding more servers that were only occasionally being used seemed like a wasteful solution to the problem. Instead of adjusting the number of servers twice a day according to a fixed plan written ages ago, we needed to adjust them constantly in response to how much load we were getting right now.

Adding new servers when the system was under load was straightforward: since we dispatch work to our cluster using a queue (specifically SQS), a cron job that tracks the size of our job queue can easily create new servers when it notices the queue starting to get too long.

But how to tear down the servers when they are no longer needed? You don’t want to suddenly delete a server that’s in the middle of doing a job. And you don’t want to spawn servers and then lose track of them, or you’ll quickly start wasting money. As we thought about it, we realized that the surplus server itself was in the best position to know when it was a safe time to shut itself down and make sure that it got done.

We call these servers “suicide workers” for this reason: they work for a couple of hours, clean up their worksite, and then carefully remove themselves from existence. This gives us the ability to throw extra capacity at a problem very aggressively, knowing that the capacity will bleed off in a couple hours automatically. We don’t even have to keep track of these servers very carefully, because we know they are transient and will shut themselves down in a couple hours.

We chose two hours as the interval because it takes about 15 minutes to spin up an EC2 instance. So if you use a cloud server for 1 hour, you end up paying for 60 minutes of compute and only getting 45! This is not a good deal (25% waste). Paying for 120 minutes and getting 105 minutes (12.5% waste) is much better. For machines that live only a few hours, you also want to shut them down right at the end of a billed hour, or you'll pay for an extra hour of compute that you didn't actually use. We experimented with having suicide workers live for 3 and 4 hours (rather than 2): it ended up being more expensive without improving our average wait time.
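The snippets later in this post cover spawning; the worker-side timer itself isn't shown. Here is a minimal sketch of what the shutdown decision might look like, assuming the worker checks its uptime between jobs and was launched so that an OS shutdown terminates the instance (the function name, thresholds, and cron wiring below are our assumptions, not the exact production code):

```shell
#Hypothetical sketch of the worker-side check.
#Returns 0 (terminate) only after the two-hour lifespan AND within the
#last five minutes of a billed hour, so no paid time is left on the table.
should_terminate() {
 uptime_secs=$1
 lifetime_secs=`expr 2 \* 60 \* 60`
 secs_into_hour=`expr $uptime_secs % 3600`
 if [ $uptime_secs -ge $lifetime_secs -a $secs_into_hour -ge 3300 ]
 then
  return 0
 fi
 return 1
}

#A cron job on the worker might use it like this, once the current job is done:
#if should_terminate `cut -f1 -d'.' /proc/uptime`
#then
# shutdown -h now   #instance launched with instance-initiated-shutdown-behavior=terminate
#fi
```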

Of course, there are lots of other ways to scale cloud servers up and down in response to demand. The most obvious is the Auto Scaling feature provided by AWS. However, Auto Scaling works best for scaling servers when their internal metrics (CPU load, etc.) hit a certain point. We needed to allocate capacity when the system as a whole had hit its limit, not just when a given machine was working hard. The length of the job queue is the best proxy for this.

So if you’re doing batch processing using cloud servers, I recommend you give the suicide workers approach a try! Below are some code snippets that should get you started. And if you enjoy thinking about this kind of stuff, check out our jobs page … we’re hiring. ;->

Scripts for spawning new ad hoc instances:

We run the following script, which monitors the SQS queue length and spawns new EC2 instances when the queue stays long.


echo $$ > /var/run/pid_of_this_script

QUEUE='MY-QUEUE'  #SQS for specific factory

while [ 1 ]
do
 QUEUE_LEN_INIT=`ruby SQSlen.rb $QUEUE | tail -1`
 sleep 5m
 QUEUE_LEN_FINAL=`ruby SQSlen.rb $QUEUE | tail -1`
 #Queue has stayed long for five minutes: spawn a new worker
 if [ $QUEUE_LEN_INIT -gt 50  -a  $QUEUE_LEN_FINAL -gt 50  ]
 then
  ./launch_adhoc_instance AdhocRole
 fi
done

The following Ruby code gets the SQS queue length:


#File: SQSlen.rb

require 'rubygems'
require 'right_aws'
require 'yaml'
require 'logger'

$logger = Logger.new(STDERR)

def get_queue_len(queue_name)
  return RightAws::SqsGen2.new($conf['SQS']['access-key'], $conf['SQS']['secret-key'], {:logger => $logger}).queue(queue_name).size
rescue Exception => e
  puts("SQS Error")
  return 0
end

if ARGV.length != 1
  puts "Please enter arguments correctly: \"ruby SQSlen.rb <queue-name>\""
  exit 1
end

queue_name = ARGV[0]
config_path = "SQSlen.yml"
$conf = YAML.load_file(config_path)

len = get_queue_len(queue_name)
puts len
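SQSlen.rb reads its credentials from SQSlen.yml. Judging from the keys the script looks up, the file would look something like this (the values are placeholders, not real credentials):

```yaml
#File: SQSlen.yml (illustrative; values are placeholders)
SQS:
  access-key: YOUR_AWS_ACCESS_KEY
  secret-key: YOUR_AWS_SECRET_KEY
```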

The following script creates a new worker instance:


#File: launch_adhoc_instance

if [ $# -ne 1 ]
then
     echo "Usage: launch_adhoc_instance AdhocRole"
     exit 1
fi

source configs/global_config
source configs/adhoc_config

#Spawn up a regular instance
ec2-run-instances $AMI -n $NUM_INSTANCES -t $TYPE -k $KEY_PAIR --group $SECURITY_GROUP > /tmp/ec2_instance_request

INSTANCE_ID=`cat /tmp/ec2_instance_request | tail -1 | cut -f2`
STATUS=`cat /tmp/ec2_instance_request | tail -1 | cut -f6`
rm -f /tmp/ec2_instance_request

#Wait for the instance to reach the "running" state
while [ "$STATUS" != "running" ]
do
 sleep 60
 STATUS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f6`
done

sleep 60

#Instance is now active. Capture data associated with instance like instance-id, external and internal dns.
INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `

#We need to do the record keeping

#If we are not able to get the internal hostname within the next 1 minute for some reason then quit
COUNT=0
while [ -z "$INSTANCE_INTERNAL_HOSTNAME" ]
do
 COUNT=`expr $COUNT + 1 `
 sleep 10
 INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `
 INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
 if [ $COUNT -ge 6 ]
 then
     exit 1
 fi
done

#Configure puppetmaster to associate relevant class with node
$SCP root@$PUPPET_MASTER:/etc/puppet/manifests/adhoc_nodes.pp /tmp/adhoc_nodes.pp
grep $INSTANCE_INTERNAL_HOSTNAME /tmp/adhoc_nodes.pp

#If the node is already listed, drop the stale entry first
if [ $? -eq 0 ]
then
 sed "/$INSTANCE_INTERNAL_HOSTNAME/d" /tmp/adhoc_nodes.pp > /tmp/newnodes.pp
 mv /tmp/newnodes.pp /tmp/adhoc_nodes.pp
fi

echo "node $INSTANCE_INTERNAL_HOSTNAME { include $1 }" >> /tmp/adhoc_nodes.pp
$SCP /tmp/adhoc_nodes.pp root@$PUPPET_MASTER:/tmp
$SSH root@$PUPPET_MASTER "mv /tmp/adhoc_nodes.pp /etc/puppet/manifests/adhoc_nodes.pp"

rm -f /tmp/adhoc_nodes.pp

#Do a run of puppet twice
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"
sleep 60
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"
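The launch script sources configs/adhoc_config for its EC2 parameters. A file along these lines would satisfy it (all values are illustrative placeholders, not the actual production settings):

```shell
#File: configs/adhoc_config (illustrative placeholders)
AMI=ami-xxxxxxxx             #AMI baked with the conversion software
NUM_INSTANCES=1
TYPE=c1.xlarge               #a beefy instance type for conversion work
KEY_PAIR=adhoc-keypair
SECURITY_GROUP=adhoc-workers
```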

Akash Agrawal
Senior Software Engineer

5 Responses to “Suicide Workers”

  1. Bobbie Marquis

    Don’t like the name “Suicide Workers”

  2. Asesoria Seo

    Really a very difficult job. You must have some monster servers to accomplish the task of processing thousands of documents a day …

    Congratulations, and keep it up

  3. Brian Poissant

    Quick question on the design as I too am tackling exactly what to do with “surplus” machines in a service.
    1) In your article you mention that the machines are in the best position to know when they are not needed. What metric(s) are you looking at on that machine to determine this?
    2) Since it does take minutes to spin up new instances, instead of having “suicide” machines that terminate, would it not make more sense to place some of these in a “self-cleaned and stopped” state, so that they could be spun up quicker? Not sure of the financials behind the idea, but I would think always keeping one or two on-demand instances in a stopped state, in addition to using spot instances, might be a better performer.
    3) Another technique I’m working on is using spot instances for the “stable long term” loads, and reserving the on-demand instances for the “quick cache, immediate on” load. If an on-demand instance is seen to be operating at a certain pace for a certain period of time, it will launch a spot instance to transfer its work to, and then stop/terminate depending not only on the current load but also on the type of load.
    Your thoughts?