6-foot clownfish swimming through the SlideShare SF office

So we got one of those awesome ThinkGeek toys a couple weeks ago … a huge remote-control fish. Amazingly, it propels itself by swimming … it doesn’t have a propeller or anything. As a result it looks remarkably realistic and is a lot of fun to drive. Here’s a video of it in action.

Of course we broke it within a couple of hours. But the kind folks at ThinkGeek promise that spare parts should arrive any day now, so the fun will continue. We’re thinking of getting a great white shark to go along with it, or possibly some remote-control piranhas!

The buddy system: an alternative to pair programming

Pair programming has always rubbed me the wrong way. I understand why some people like it. It’s great to have collective code ownership. It’s great to have conversations about what the code should look like before the code is actually written. And it’s great to have a collaborative work environment where people are always learning from each other.

But a LOT of people are turned off by pair programming. It’s too easy for the person who isn’t typing to just zone out. Some (many?) developers just don’t enjoy pair programming as a practice. It generates a LOT of noise (imagine 5 pairs of people talking at once). And no great internet company that I’m aware of seems to do it heavily. In fact, the biggest advocates of pair programming are often consulting shops (who charge by the programmer-hour and therefore have an obvious conflict of interest).

Trying to figure out how to get some of the benefits of pair programming without the drawbacks, we stumbled onto the “buddy system” at SlideShare. The rule is pretty simple: if you’re going to be doing something dangerous and complicated (like swimming or writing production Ruby code) you probably shouldn’t be doing it alone.

So developers at SlideShare just tend to work together on stuff. They don’t work on the same code … there’s always a bunch of different files that need to be edited or created to implement a new feature. But they’ll work on the new feature or bug together, usually in ad-hoc teams of two or three.

Some of the benefits of this approach are:
* Architecture and overall code structure always has a consensus of at least two behind it. If there’s a disagreement, it will be audible and the rest of the team can get involved as needed until consensus has been achieved. This dramatically reduces rework caused by one developer making an architectural decision that the rest of the team doesn’t agree with.
* There’s always someone available who can code-review your code and already understands the context of your code. This is crucial, because we do code-reviews before every checkin. And an uninformed code review doesn’t have value (it’s likely to be a “lgtm”, or “looks good to me”).
* There are always at least two people who understand a given section of the code base.
* The work feels a lot less lonely. There’s someone else deeply involved in the same problem that you are facing. It’s easy to learn from them because you’re working together.
* You still get lots of “me” time, with just you and the compiler. Lots of engineers got into programming because they enjoy quietly writing code, and there’s no need to take that away from people as long as enough collaboration is also happening.
* The collective nature of the work makes it more likely that peer pressure will keep developers from cutting corners on unit tests or other good practices that your team has adopted.

Just to be fair, there are a couple of disadvantages:
* Not all problems are big enough for two people to work on. Simple bug fixes, for example, should just be grabbed and worked on.
* Unlike pair programming, some code will typically be written before a second person looks at it. So you miss out on getting feedback as early as possible (when the code is easiest to change and no one has an emotional investment in it yet).

Overall, though, we’ve found the buddy system to be a remarkably fun and productive way of working.

Does this style of working sound fun to you? We’re always looking for great engineers, and can train you in Ruby if you’re already comfortable programming in other interpreted languages and at home on Linux. Check out our jobs page for more info.

Using sounds for ambient alerting in a web startup

Over the last few weeks, Sylvain Kalache built a sound alerting system for the SlideShare office. What it does is make various noises in the SlideShare office when a new subscription user signs up, cancels, or renews. It also makes sounds when a build fails, when a deployment to production starts or successfully completes, and when the site goes down. Sounds are a really pleasant and natural way to signal attention-worthy events!

The system was almost embarrassingly easy to make (zero coding required):
1) First we configured the systems we wanted alerts from (our subscription billing vendor, our deployment system, our continuous integration system, and our uptime monitoring system) to send emails to a particular Gmail account when events that we want the team to know about happen.
2) Then we hooked up an old Windows desktop to a Cambridge SoundWorks speaker system, and configured Outlook to make various sounds when it receives emails that have particular strings in the subject. Basically, we’re using email like a lowest-common-denominator enterprise message bus.

Here’s a screenshot of what the Outlook config looks like:
[Screenshot: Outlook configuration settings]

And here’s the computer and sound system.
[Photo: the computer and sound system]
(Historical note: this is the same computer that I refer to in this 2005 blog post, “Laundry Room PC”.)
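
If you don’t have a spare Windows box handy, the same trick is only a few lines of Ruby. Here’s a minimal sketch that polls the alert inbox over IMAP and plays a sound when a subject matches a rule (the account, the rules, and the sound files below are made up for illustration):

require 'net/imap'

# Map subject patterns to sound files (hypothetical rules and paths)
RULES = {
  /renewal/i      => 'sounds/gong.wav',
  /build failed/i => 'sounds/sad_trombone.wav',
  /site down/i    => 'sounds/air_raid_siren.wav'
}

imap = Net::IMAP.new('imap.gmail.com', 993, true)
imap.login('alerts@example.com', ENV['ALERTS_PASSWORD'])
imap.select('INBOX')

loop do
  imap.search(['UNSEEN']).each do |id|
    subject = imap.fetch(id, 'ENVELOPE')[0].attr['ENVELOPE'].subject.to_s
    RULES.each do |pattern, sound|
      # afplay on a Mac; swap in aplay or mpg123 on Linux
      system('afplay', sound) if subject =~ pattern
    end
    imap.store(id, '+FLAGS', [:Seen])
  end
  sleep 30
end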

Having the sounds go off in the office is fun, but it’s also really practical. We all know whenever the build fails or when there’s a deployment in progress. And if we get a rush of subscriptions or cancellations we immediately investigate to figure out what we’re doing right or wrong. Plus, it’s great for team morale to hear the steady flow of renewals coming in throughout the day.

Here are some of the sounds we’re using, and the events they represent:
* Subscription renewal: hit the sales gong
* Subscription cancelled: sad trombone
* New subscription: slot machine jackpot
* Deploying: sound the trumpets
* Website down: air-raid siren

Final note: if you like making machines make noises, and you like to program, then check out our jobs page. We’re hiring!

SlideShare Ditches Flash for HTML5

Watch our HTML5 gallery here.

SlideShare today announced the biggest change since we started. We are now rendering presentations and documents using HTML5 instead of Flash. This is a milestone. Five years ago, it was impossible to build something like SlideShare or YouTube without Flash. But the web has finally caught up.

This was the biggest engineering project in SlideShare’s history. Much of the SlideShare engineering team has been working on it around the clock for the last six months. As we have learnt over the past five years, people are picky about how their presentations look. Getting the fonts and the text placement to look exactly right across all supported browsers was a real engineering challenge. So we’re happy to finally be able to see this on SlideShare.net.

Ditching Flash for HTML5 feels like the right choice for us for a number of engineering reasons.

  1. The exact same HTML5 documents work on the iPhone / iPad, Android phones/tablets, and modern desktop browsers. This is great from an operations perspective. This saves us from extra storage costs, and maximizes the cache hit ratio on our CDN (since a desktop request fills the cache for a mobile request, and vice versa). It’s also great from a software engineering perspective, because we can put all our energy into supporting one format and making it really great.
  2. Documents load 30% faster and are 40% smaller. ‘Nuff said on that front, faster is ALWAYS better.
  3. The documents are semantic and accessible. Google can parse and index the documents, and so can any other bot, scraper, spider, or screen-reader. This means that you can write code that does interesting things with the text on SlideShare pages. You can even copy and paste text from a SlideShare document, something that was always a pain with Flash.

What were the most challenging parts of this project? Glad you asked.

Font Conversion

Font handling was the biggest challenge. We had to build support for rendering arbitrary fonts in the browser, even fonts that are not available on the client. If you invent a new font and upload a PDF that uses it, it should still render perfectly on SlideShare. Whoa!
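
We obviously can’t walk through the whole conversion pipeline here, but the general idea is easy to sketch: once the fonts embedded in a PDF have been extracted and converted into web formats, each document can ship with its own @font-face rules so the browser can render text in fonts it has never seen before. A toy illustration (the font name, URL, and WOFF conversion step are made up, not our actual code):

# Emit an @font-face rule for a font extracted from an uploaded document.
# The caller is assumed to have already converted the font to WOFF and
# pushed it to a CDN (both hypothetical steps in this sketch).
def font_face_css(font_family, woff_url)
  <<-CSS
@font-face {
  font-family: '#{font_family}';
  src: url('#{woff_url}') format('woff');
}
  CSS
end

puts font_face_css('MyInventedFont', 'http://cdn.example.com/doc-123/MyInventedFont.woff')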

Text Placement

Placing the text is very tricky due to differences between browsers, differences between fonts (ligature handling), and several other complexities. To illustrate: the PDF coordinate system starts at the bottom left, while HTML starts at the top left. PDFs use points; in HTML you get your choice of unit, but no two browsers agree on how precise any particular unit is! The largest problem we faced with placement was normalization. We spent a lot of time finding that magic combination of ems, percentages, and zoom which gives us correct placement across the web.
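
As a toy illustration of the coordinate flip and unit normalization (not our actual conversion code): a position given in PDF points, measured from the bottom-left of the page, can be turned into percentage-based CSS offsets measured from the top-left, so the layout scales with the player size.

# Convert a PDF point coordinate (origin at bottom-left) into CSS
# percentage offsets (origin at top-left).
def pdf_to_css_position(x_pt, y_pt, page_width_pt, page_height_pt)
  left = x_pt.to_f / page_width_pt * 100.0
  top  = (page_height_pt - y_pt).to_f / page_height_pt * 100.0  # flip the y-axis
  { :left => "#{left.round(3)}%", :top => "#{top.round(3)}%" }
end

# e.g. a text run at (72, 720) on a US-Letter page (612 x 792 points):
pdf_to_css_position(72, 720, 612, 792)
# => {:left=>"11.765%", :top=>"9.091%"}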

Error Handling

We also built a system to find out when there is variance between an image of the HTML output and an image generated directly from the document. If there’s more than a certain amount of variance, we consider that an error and we won’t serve that page as HTML5. Instead we’ll serve a PNG image of the page when it is requested. There was some hard-core computer vision involved in the error-handling system. The way we look at it, we want to serve HTML5, but not at the expense of a document that looks bad and disappoints the author.
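
As a rough sketch of what the variance check looks like (the real system involves considerably more computer vision than this), RMagick’s built-in image comparison gets you most of the way; the threshold below is a made-up tuning constant:

require 'rubygems'
require 'RMagick'

THRESHOLD = 0.01  # maximum acceptable normalized mean-squared error (hypothetical value)

def html5_render_acceptable?(html_png_path, reference_png_path)
  rendered  = Magick::Image.read(html_png_path).first
  reference = Magick::Image.read(reference_png_path).first
  rendered.resize_to_fit!(reference.columns, reference.rows)
  _diff_img, distortion = rendered.compare_channel(reference, Magick::MeanSquaredErrorMetric)
  distortion < THRESHOLD  # if false, fall back to serving the PNG for this page
end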

Cloud Computing

Our conversion stack runs on Amazon EC2 and is configured and managed by Puppet. We’ve been using EC2 for our conversion stack for years, so we’re old hands at that stuff. For this new system, we started out with a number of different types of servers (a font extractor, a font generator, etc). What we found out is that the coordination time between different machines (using Amazon SQS) and the I/O time (using S3) were a huge bottleneck. So our architecture for this new system is more reminiscent of the Netflix “Rambo” architecture. Each box is a self-contained system that can do the entire job of conversion, with no help from anyone.

As we speak, an army of hundreds of Amazon EC2 instances is crunching away at converting the *millions* and *millions* of presentations and documents that have been uploaded to SlideShare over the last 5 years to HTML5. New documents will automatically be converted to HTML5 from now on. We hope to have the transition complete by the end of the year (maybe sooner, but no promises!). At that point all SlideShare content will be served as HTML5.

Next Steps

This is a work in progress … we are betting the company on HTML5, and are going to continue to invest in the HTML5 conversion stack and JavaScript player technologies that we’re releasing today. Some of the next things on our plate include:

  1. Handling z-indexes (objects occluding other objects) better
  2. Continued development on our font extraction technology
  3. Adding some features that we just weren’t able to port to our HTML5 player in time for this launch, like embedded video and synchronized audio

Obligatory recruiting pitch

If you’re a developer and like working on this kind of stuff, SlideShare wants to talk to you! Check out our jobs page for details.

Applying for a SlideShare internship? Make sure to read this first hand account…

Saket Choudhary is an engineering student from the Indian Institute of Technology, Mumbai who interned at the SlideShare Delhi office this summer (May 15 – July 15). He worked on an important internal project, which has since been rolled out to production.

A couple of weeks back, Saket sent us this deck, Sliding Summer on Rails@SlideShare, that he created to summarize his internship.

@saket … thanks for making a SlideShare deck to summarize your experience! This is very creative. And we miss your geeky ebullience in our corridors.

DevOps at SlideShare: Talk given at DevOpsDays Bangalore 2011

We’ve adopted DevOps as a part of our culture at SlideShare. We believe it was essential for us to become an agile & lean organization. DevOps has helped us in many ways, especially in our goal to do multiple deployments a day on production.

At the recently concluded DevOpsDays conference in Bangalore, we presented our experiences and achievements as we embraced DevOps at SlideShare. Here are the slides of the talk we gave at the conference:

The conference itself was a good learning experience. This was the first (of hopefully many) DevOpsDays conferences in India, held at the ThoughtWorks Bangalore office. There were solid techies speaking on subjects like Puppet, parallelizing tests, RoR deployment patterns, DTrace, etc. Also, there were 5-minute ignite talks, Open Space discussions and good socializing to go along. Overall, a good effort put together by @AjeyGore & his team.

Just in case you like to stay on the cutting edge and are excited by DevOps, you should know that we’re hiring.

Kapil Mohan & Mayank Joshi

The importance of silly projects

In engineering, we love to hack. It’s not just something we do because we get paid … it’s something we do for fun (that’s what’s kinda awesome about software engineering as a job). Obviously we’re usually working on the core product, building new features, fixing bugs, and refactoring code to make it harder/faster/better/stronger. But sometimes, towards the end of the day, we write stuff that is just for fun.

For example, a couple of weeks ago, Eugene noticed that we waste a lot of time arguing about where to go for lunch. So he wrote a “lunchbot” that hangs out on our IRC channel. The lunchbot knows the restaurants in the area and picks a random one whenever we ask it. This is a dorky but extremely efficient way to decide where to eat … no one can argue with the verdict the lunchbot gives.

Last week, Sylvain noticed that there was no way to know whether code was currently being deployed to the server. So he hooked up some (huge) speakers to a spare desktop and wrote a script that plays a bell sound every time the deploy script is run. That way, two people won’t try to deploy at the same time (we’re not sure what would happen in that case, but our theory is that it wouldn’t be good). We’re thinking of upgrading this script to tell us when someone subscribes to SlideShare, renews, or cancels … this would give everyone on the team a visceral feeling of the rhythms of our business (sort of a “sales gong for freemium”). When do people upgrade? When do they downgrade? (Full disclosure: the idea came from the movie “Middle Men”, which was otherwise mostly terrible.)

Fun projects like this are why a lot of us started programming in the first place, but with deadlines and customers and all the pressures of writing production code, it’s really easy to get all serious and forget about the sheer joy of programming. As a programmer, you can literally write software that changes your workplace for the better! So the next time you need a break, look around and see if there’s something about your workplace that could be improved with a 20-line Ruby program. If this sounds like fun to you, you should know that we’re hiring.

Suicide Workers

At SlideShare, we process many thousands of documents a day. The documents come in all types, from PDF and PPT to Keynote files. The pace of uploads is heavy, and varies dramatically over time.

The documents have to be processed from their native formats into a format that is shareable on the web. This is hard work: it can take a beefy server over a minute working at full capacity just to convert one document. And converting one video can take several minutes! So we have lots of servers for doing this.

We host our document conversion infrastructure on the cloud (Amazon EC2) so we knew we could quickly create new servers when needed (via an API call), and just as quickly remove servers when they were no longer needed.  So we started exploring how to match the supply of our servers with the demands of our users in realtime.

Our first effort at dealing with this was to use Amazon EC2 to add resources during times of peak load. We identified the business hours in US time zones as being peak time. So we wrote a script that created more servers at the beginning of the New York work day, and tore them down at the end of the Los Angeles work day.

This didn’t work as well as we had hoped. We were able to handle the peak traffic better than before, but the system still got overwhelmed, both during peak and non-peak times. And fixing this by adding more servers that were only occasionally being used seemed like a wasteful solution to the problem. Instead of adjusting the number of servers twice a day according to a fixed plan written ages ago, we needed to adjust them constantly in response to how much load we were getting right now.

Adding new servers when the system was under load was straightforward: since we dispatch work to our cluster using a queue (specifically SQS), a cron job that tracks the size of our job queue can easily create new servers when it notices the queue starting to get too long.

But how to tear down the servers when they are no longer needed? You don’t want to suddenly delete a server that’s in the middle of doing a job. And you don’t want to spawn servers and then lose track of them, or you’ll quickly start wasting money. As we thought about it, we realized that the surplus server itself was in the best position to know when it was safe to shut itself down, and to make sure that it actually happened.

We call these servers “suicide workers” for this reason: they work for a couple of hours, clean up their worksite, and then carefully remove themselves from existence. This gives us the ability to throw extra capacity at a problem very aggressively, knowing that the capacity will bleed off in a couple hours automatically. We don’t even have to keep track of these servers very carefully, because we know they are transient and will shut themselves down in a couple hours.

We chose two hours as an interval because it takes about 15 minutes to spin up an EC2 instance. So if you use a cloud server for 1 hour, you end up paying for 60 minutes of compute and only getting 45! This is not a good deal (25% waste). Paying for 120 minutes and getting 105 minutes (12.5% waste) is much better. For machines that live a short number of hours, you want to make sure that you close them down right at the end of the hour or you’ll end up paying for an extra hour of compute time that you didn’t actually use. We experimented with having suicide workers live for 3 and 4 hours (rather than 2): it ended up being more expensive without improving our average wait time.
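
To make the idea concrete, here’s a minimal sketch of the self-termination loop that runs on each suicide worker. The queue and conversion helpers are placeholders, and the instance is assumed to be launched with an instance-initiated shutdown behavior of “terminate”:

LIFETIME        = 2 * 60 * 60  # live for roughly two billing hours
SHUTDOWN_WINDOW = 5 * 60       # start looking for an exit in the last few minutes

started = Time.now

loop do
  job = next_job_from_queue     # hypothetical helper: pops one message off SQS
  process(job) if job           # hypothetical helper: converts one document

  # Once we're near the end of our allotted hours and the queue is quiet,
  # clean up and power off so we don't pay for another full hour.
  if Time.now - started >= LIFETIME - SHUTDOWN_WINDOW && job.nil?
    cleanup_worksite            # hypothetical helper: delete temp files, flush logs
    system('shutdown -h now')   # EC2 terminates the instance on shutdown
    break
  end
end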

Of course, there are lots of other ways to scale cloud servers up and down in response to demand. The most obvious is the Auto Scaling feature provided by AWS. However, Auto Scaling works best when scaling is driven by a server’s internal metrics (CPU load, etc). We needed to allocate capacity when the system has hit its capacity limit, not just when a given machine is working hard. The length of the job queue is the best proxy for this.

So if you’re doing batch processing using cloud servers, I recommend you give the suicide workers approach a try! Below are some code snippets that should get you started. And if you enjoy thinking about this kind of stuff, check out our jobs page … we’re hiring. ;->

Scripts for spawning new adhoc instances:

We run the following script, which monitors the SQS queue length and spawns new EC2 instances when the queue stays long:

#!/bin/bash

echo $$ > /var/run/pid_of_this_script

QUEUE='MY-QUEUE'            #SQS queue for a specific conversion factory
ADHOC_ROLE='AdhocRole'      #Puppet class passed to launch_adhoc_instance (placeholder name)

while true
do
 QUEUE_LEN_INIT=`ruby SQSlen.rb $QUEUE | tail -1`
 sleep 5m
 QUEUE_LEN_FINAL=`ruby SQSlen.rb $QUEUE | tail -1`
 #Only spawn a new worker if the queue has stayed long for the whole 5 minutes
 if [ $QUEUE_LEN_INIT -gt 50 -a $QUEUE_LEN_FINAL -gt 50 ]
 then
     launch_adhoc_instance $ADHOC_ROLE
 fi
done

The following Ruby code gets the length of an SQS queue:

#!/usr/bin/ruby

#File: SQSlen.rb
#Prints the length of the given SQS queue.

require 'rubygems'
require 'right_aws'
require 'yaml'
require 'logger'

def get_queue_len(queue_name)
  sqs = RightAws::SqsGen2.new($conf['SQS']['access-key'],
                              $conf['SQS']['secret-key'],
                              {:logger => $logger})
  sqs.queue(queue_name).size
rescue Exception => e
  puts "SQS Error: #{e.message}"
  0
end

if ARGV.length != 1
  puts 'Usage: ruby SQSlen.rb <queue-name>'
  exit 1
end

queue_name  = ARGV[0]
config_path = 'SQSlen.yml'
$conf       = YAML.load(File.open(config_path))
$logger     = Logger.new(STDOUT)

len = get_queue_len(queue_name)
puts len

The following script creates a new worker instance:

#!/bin/bash

#File: launch_adhoc_instance

if [ $# -ne 1 ]
then
     echo "Usage: launch_adhoc_instance AdhocRole"
     exit 1
fi

source configs/global_config
source configs/adhoc_config

#Spawn up a regular instance
ec2-run-instances $AMI -n $NUM_INSTANCES -t $TYPE -k $KEY_PAIR --group $SECURITY_GROUP > /tmp/ec2_instance_request

INSTANCE_ID=`cat /tmp/ec2_instance_request | tail -1 | cut -f2`
STATUS=`cat /tmp/ec2_instance_request | tail -1 | cut -f6`
rm -f /tmp/ec2_instance_request

REQUIRED_STATUS="running"
while [ $STATUS != $REQUIRED_STATUS ]
do
 sleep 60
 STATUS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f6`
done

sleep 60

#Instance is now active. Capture data associated with instance like instance-id, external and internal dns.
INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `

#We need to do the record keeping
echo "`date`: $INSTANCE_ID: $INSTANCE_EXTERNAL_DNS: $INSTANCE_INTERNAL_HOSTNAME: PDF2SWFADHOC" >> $DB_FILE

#If we are not able to get the internal hostname within the next minute for some reason, then quit
COUNT=0
while [ -z "$INSTANCE_INTERNAL_HOSTNAME" ]
do
 COUNT=`expr $COUNT + 1`
 sleep 10
 INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.' `
 INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
 if [ $COUNT -ge 6 ]
 then
     exit 1
 fi
done

#Configure puppetmaster to associate relevant class with node
$SCP root@$PUPPET_MASTER:/etc/puppet/manifests/adhoc_nodes.pp /tmp/adhoc_nodes.pp
grep $INSTANCE_INTERNAL_HOSTNAME /tmp/adhoc_nodes.pp

if [ $? -eq 0 ]
then
 sed "/$INSTANCE_INTERNAL_HOSTNAME/d" /tmp/adhoc_nodes.pp > /tmp/newnodes.pp
 mv /tmp/newnodes.pp /tmp/adhoc_nodes.pp
fi

echo "node $INSTANCE_INTERNAL_HOSTNAME { include $1 }" >> /tmp/adhoc_nodes.pp
$SCP /tmp/adhoc_nodes.pp root@$PUPPET_MASTER:/tmp
$SSH root@$PUPPET_MASTER "mv /tmp/adhoc_nodes.pp /etc/puppet/manifests/adhoc_nodes.pp"

rm -f /tmp/adhoc_nodes.pp

#Do a run of puppet twice
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"
sleep 60
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --test"

Akash Agrawal
Senior Software Engineer

How EC2 gave us a 130x throughput increase in generating millions of images

Images are a key part of the zillions of websites out there, and SlideShare is no different.  A few weeks back, we made some optimizations to load images below the fold lazily.  In this post, we discuss some more optimizations we have done to user profile pictures.

Across SlideShare, there are primarily 3 sizes of profile images used – 100×100, 50×50 and 32×32. Because the site evolved over a period of time, we did not have all 3 sizes for every user’s picture. Running a query through our db revealed that 2 million profile images had only a single size – 100×100. In places where we needed a smaller size, we were letting the browser do the resizing.

At this stage, we were faced with the mammoth task of generating 3 different sizes for each of these 2 million images. For each image, we had to make an HTTP call to Amazon’s S3 storage service, read the image into memory, use RMagick to generate 3 variants and upload each of these variants back to S3. We put together a small Ruby script to loop over the 2 million images and do all these tasks.

A preliminary test on my dev machine showed us that it would take 27 days to run through all the images if only 1 machine and 1 process were used! The throughput we were achieving was roughly 1 image per second.

27 days was too long and we *had* to bring the time down to a few hours. Amazon EC2 instances to the rescue. We tested with a single EC2 machine, 100,000 images and 4 processes running simultaneously. We were able to run through the 100,000 images in under three and a half hours – a throughput of around 7 images per second, which is not bad at all! But even at that rate, the total time would only go down from 27 days to about 4 days. Not good enough.

We then created an AMI (Amazon Machine Image) from the first instance and launched 19 more replicas, multiplying the processing power at hand by 20. The numbers looked good now – a combined throughput of 140 images per second! With each EC2 instance taking care of 100,000 images, the entire exercise was complete in a matter of 4-5 hours!

Number of machines    Number of processes    Images handled per second    Total time taken
1 dev machine         1                      1                            27 days
1 EC2 instance        4                      7                            4 days
20 EC2 instances      4 per instance         140                          5 hours

The core part of the code was this:

# Generates the 3 image sizes and uploads them back to s3
# @param [String] login login of user for whom pics are to be generated
# @param [Object] img The RMagick::Image object that has to be resized
def generate_and_upload(login, img)
  SIZES.each { |key, size|
    suffix = "."+ img.format.downcase
    headers = {"Content-Type" => "image/#{img.format.downcase}"}
    filename = "profile-photo-#{login}-#{size}x#{size}"
    writeTo = $ss_convert_store + filename + suffix

    #generate
    newfile = img.resize(size, size)
    newfile.write(writeTo)

    #upload
    $awsHelper.put_file_with_key(writeTo, filename)

    #cleanup
    FileUtils.rm(writeTo)
    $progress.write(size.to_s + " ")
  }
end
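
For context, the driver around this method looked roughly like the sketch below. The bucket name, key format, and list of logins are placeholders, and the real script used our internal AWS helpers:

require 'rubygems'
require 'right_aws'
require 'RMagick'

s3     = RightAws::S3Interface.new(ENV['AWS_ACCESS_KEY'], ENV['AWS_SECRET_KEY'])
BUCKET = 'profile-photos'  # hypothetical bucket name

logins_to_process.each do |login|  # placeholder: however you enumerate the 2 million users
  begin
    key  = "profile-photo-#{login}-100x100.jpg"  # the one size we already had
    blob = s3.get_object(BUCKET, key)            # HTTP call to S3
    img  = Magick::Image.from_blob(blob).first   # read the image into memory
    generate_and_upload(login, img)              # the method shown above
  rescue => e
    puts "FAILED #{login}: #{e.message}"
  end
end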

At the end of the exercise, we reached a state where all users on SlideShare had these 3 variants of images, and a couple of days later, our codebase was updated to make use of these new images, instead of the original 100×100 variant. This is a nice-to-have performance win, and combined with lazy loading, our slide-view page is now even faster. Also, a few tweaks were made to the image uploading mechanism to give more control to the users. We now generate all required sizes when a new profile image is uploaded.

Overall, the project was an exciting one, with work ranging from frontend jQuery plugins to EC2-S3 interactions and Ruby scripts running at web scale.

If attacking challenging problems like these is your cup of tea, you might consider joining us!

Prafulla Kiran P
Software Engineer,
SlideShare