Hadoop and Near Real Time Analytics at SlideShare

by Nikhil Prabhakar on April 15, 2014

in Architecture, Big Data / Analytics, Databases, Operations, QA

The upgrade process and what we learnt along the way

 

At SlideShare-LinkedIn, we're driven by the motto of "Members First" and constantly look for ways to make the product more valuable for our community. Last year, we launched a new version of analytics [1] and migrated all data points based on HAProxy logs to a Hadoop/HBase infrastructure. We're happy to announce that on April 11 we achieved another major milestone: we launched near real time analytics. Data points are now updated after a lag of just 30-90 minutes instead of the former lag of 24-36 hours!

We're writing this post to share our journey so far with you, starting from a 3-node POC to the current 31-node highly available Hadoop/HBase cluster with no SPOF for masters, supporting near real time analytics [see the section on "HA Setup for Hadoop/HBase Masters" for details].

We started with an early version of the CDH4 distribution of Hadoop [2], and after multiple upgrades that brought a lot of useful features, we're now running CDH 4.3.0 [3] in production (we are really excited about our next major upgrade, to CDH5).

Need for Hadoop

Previously, analytics were backed by MySQL and a supporting Ruby batch processing infrastructure.

When we started thinking about a new backend system for building analytics from scratch, we had the following requirements in mind:
1. Better latency: The old analytics was slow and sometimes not the most user-friendly.
2. Horizontal scalability: In terms of the number of users and the amount of data it can process.
3. Supporting other products in future: We use MongoDB for some of our products, and we could eventually move some of them, in particular Newsfeed, to HBase.
4. Ad hoc analysis: We wanted to increase Hadoop usage across the company to get better insights from different datasets and logs. Search and newsletter activity could benefit from ad hoc analysis.
5. Near real time analytics: Our goal was to make analytics near real time.
6. More analytics features: Introducing additional features that have been part of the product roadmap but haven't been implemented yet because of the limitations of the old analytics infrastructure.

Selection of Technologies

While we were fairly certain about selecting Hadoop [3] for processing our ever-increasing datasets, including the HAProxy logs, there were other technologies we had to select to make our Hadoop ecosystem complete.

1. HBase (Database):

HBase [4] leverages Hadoop HDFS for its storage and integrates well with the Hadoop ecosystem. Its architecture, designed for bulk writes and low-latency random reads, made it a clear-cut choice for the DB layer.

2. Pig (MR jobs):

For writing MR jobs we wanted to use a high-level language with good performance and optimization. Pig [5] comes with good support for HBase storage and load functions. Moreover, you can extend it by writing your own UDFs [6]. It's also easy for non-Hadoop and non-technical folks to use.

 3. Oozie (Workflow Scheduler):

First, we started with a bunch of cron jobs, and when it became unmanageable to run interdependent MR/Pig jobs, we switched to Oozie [7]. It was a steep learning curve, but we have found it to be a good investment.
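Oozie coordinators can also be submitted programmatically through the Oozie Java client API (the "Oozie API" usage mentioned under the Java section below). A minimal sketch, with the server URL, HDFS path, and property values as placeholders rather than our actual setup:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class SubmitHourlyCoordinator {
  public static void main(String[] args) throws Exception {
    // Placeholder Oozie server URL.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    Properties conf = oozie.createConfiguration();
    // Placeholder HDFS path to the coordinator app (coordinator.xml plus its workflow).
    conf.setProperty(OozieClient.COORDINATOR_APP_PATH, "hdfs:///apps/analytics/hourly-coordinator");
    conf.setProperty("nameNode", "hdfs://nameservice1");   // placeholder HA nameservice
    conf.setProperty("jobTracker", "jobtracker-ha:8021");  // placeholder
    conf.setProperty("start", "2014-04-11T00:05Z");        // hourly runs kick off at 5 past the hour
    conf.setProperty("end", "2015-04-11T00:05Z");

    String jobId = oozie.run(conf);                        // submit and start the coordinator
    System.out.println("Submitted coordinator " + jobId);
  }
}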

 4. Java (MR jobs/Oozie API/Pig UDFs):

It's hard to avoid Java when you're working in a Hadoop ecosystem. Initially, we were skeptical about using Java, but our stage 1 jobs, which directly process the raw logs, are written in Java because we can use LoadIncrementalHFiles [8] to bulk-write HFiles [9] and skip the write path completely. This reduces the load on the cluster and decreases the time required to complete stage 1 of our data processing pipeline.
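Below is a condensed sketch of that stage 1 flow: the job writes HFiles with HFileOutputFormat and then hands them to LoadIncrementalHFiles, so the normal write path (WAL and memstore) is skipped. The mapper logic, table name, column family, and paths are simplified placeholders, not our production code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Stage1BulkLoad {

  // Parses one raw HAProxy line into a KeyValue; the parsing and row-key layout are placeholders.
  public static class LogLineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(" ");            // simplified parsing
      byte[] rowKey = Bytes.toBytes(fields[0]);                 // e.g. a slideshow/hour composite key
      KeyValue kv = new KeyValue(rowKey, Bytes.toBytes("v"),    // hypothetical column family "v"
          Bytes.toBytes("views"), Bytes.toBytes(1L));
      ctx.write(new ImmutableBytesWritable(rowKey), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "haproxy-stage1-bulkload");
    job.setJarByClass(Stage1BulkLoad.class);
    job.setMapperClass(LogLineMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));       // raw hourly log directory
    Path hfileDir = new Path(args[1]);                          // staging directory for the HFiles
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Inspects the table's region boundaries and wires in a TotalOrderPartitioner and reducer
    // so that each reducer writes HFiles covering exactly one region.
    HTable table = new HTable(conf, "hourly_views");            // hypothetical table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the generated HFiles into the regions, bypassing WAL and memstore entirely.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}

In production the mapper does the real HAProxy parsing and aggregation; the point of the sketch is only the HFileOutputFormat plus LoadIncrementalHFiles handoff.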

We also use Java for writing Pig UDFs.
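The UDFs themselves are not reproduced here, but the general shape of a Java Pig UDF is an EvalFunc; the one below is a hypothetical example that extracts the referer domain from a URL field.

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: returns the host part of a referer URL, or null if it cannot be parsed.
public class RefererDomain extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;                      // Pig treats null as "no value"
    }
    try {
      return new URI(input.get(0).toString()).getHost();   // e.g. "www.google.com"
    } catch (URISyntaxException e) {
      return null;                      // malformed referer; skip rather than fail the task
    }
  }
}

From a Pig script, such a function is used after REGISTERing the jar that contains it, e.g. GENERATE RefererDomain(referer).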

 5. TorqueBox/JRuby (REST API/SOA):

When we launched our new analytics, we were using the Thrift client to access the HBase API. Eventually, as the aggregation logic became complex, we decided to write a REST API for analytics data points using TorqueBox [10]. TorqueBox is a high-performance [11] application server that runs on JRuby [12] on top of JBoss. With JRuby, we now have the freedom to use the HBase Java API directly. In the future, we'll also have the option to move to a more performant HBase client if need be.

 6. Phoenix (SQL for HBase):

Phoenix [13] is an excellent project that allows you to write SQL queries over HBase tables and get answers with decent latency. We use Phoenix to generate on-the-fly reports for a user or presentation over a custom date range. Phoenix converts your SQL query into scans and uses filters and coprocessors to do the magic!
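Phoenix ships a JDBC driver, so with the Phoenix client jar on the classpath an on-the-fly report is just a parameterized SQL query from Java. A minimal sketch, assuming a hypothetical DAILY_VIEWS table and placeholder ZooKeeper hosts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixReport {
  public static void main(String[] args) throws Exception {
    // "zk1,zk2,zk3" stands in for the ZooKeeper quorum of the cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3")) {
      PreparedStatement ps = conn.prepareStatement(
          "SELECT view_date, SUM(views) FROM DAILY_VIEWS "            // hypothetical table
          + "WHERE slideshow_id = ? AND view_date BETWEEN ? AND ? "
          + "GROUP BY view_date ORDER BY view_date");
      ps.setLong(1, 123456L);                                         // placeholder slideshow id
      ps.setDate(2, java.sql.Date.valueOf("2014-03-04"));             // custom date range
      ps.setDate(3, java.sql.Date.valueOf("2014-04-03"));
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getDate(1) + " -> " + rs.getLong(2));
        }
      }
    }
  }
}

Under the hood Phoenix turns a query like this into HBase range scans, pushing the filtering and aggregation down via filters and coprocessors.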

Data Collection and Processing Pipeline


Fig 1: Hadoop Analytics Data Collection

As shown in Fig 1, we use Fluentd [14], a log collection and aggregation tool, to collect the HAProxy logs from different sources; the logs are buffered into 64MB chunks and pushed into the HDFS sink through the webHDFS interface that Hadoop offers. A new file is created every hour to allow for hourly batch processing.

 


Fig 2: Data Processing Pipeline – High Level

At a high level, the whole analytics data processing pipeline is divided into two independent, parallel components that run as 2 separate Oozie coordinators. While the daily data processing pipeline runs once a day, hourly data processing runs every hour. Data processed by these 2 pipelines is then combined and aggregated by an API layer known as TreeBeard. The REST API exposed by TreeBeard serves most of the data points required by analytics.


Fig 3: Daily Data Processing Pipeline

  • Daily workflow starts at 5 minutes past midnight every day
  • All data points are computed at the user and slideshow level
  • Daily workflow generally completes in 4-5 hours
  • For top and total tables, we store data for 3 consecutive days


Fig 4: Hourly Data Processing Pipeline

  • Hourly workflow runs 24 times a day and starts 5 minutes after the preceding hour ends
  • Hourly workflow completes in 10-20 minutes

Overview of Hadoop Infra


 Fig 5: Overview of Hadoop/HBase Cluster at SlideShare

  • We have a highly available setup for masters, which means there is no SPOF
  • Datanode, Tasktracker, and Regionserver run on the same node
  • We currently have 6 master nodes and 25 slave nodes
  • Oozie runs on one of the namenodes
  • A ZooKeeper [15] ensemble is hosted on multiple master nodes

HA Setup for Hadoop/HBase Masters


Fig 6: SlideShare Hadoop/HBase HA setup for masters

Since we rely on Hadoop/HBase for storing analytics data, it's quite critical to the infrastructure, and hence we have a complete HA setup [16]:

1. Active-standby namenode setup: In the event of namenode failure, a hot standby takes over as active.

  • JournalNodes: Daemons that synchronize edits across namenodes so that the standby namenode is always up to date, ensuring a hot standby state.
  • HDFS ZKFC: Daemons that track namenode status and trigger a failover in the event of a namenode failure. This relies on the ZooKeeper cluster.
  • ZooKeeper Quorum: A cluster of hosts that synchronize information and store the namenode state.

2. Active-standby JobTracker setup: In the event of a JobTracker failure, the standby server takes over using the stored information about running jobs.

  • JobTracker HA Daemons: JobTracker hosts run the JobTracker HA daemon.
  • JobTracker ZKFC: Daemon that triggers a failover in the event of a JobTracker failure.
  • HDFS JobTracker Information: JobTracker state is stored in HDFS so that it can be used in the event of a failure.

3. Active-backup HBase Masters setup: Two HBase masters are running, one as the active master and the other as a backup master.

Configuration/Optimization Work

The Hadoop ecosystem is highly configurable [17], which is good but can be very frustrating [18] at times in a fast-moving ecosystem. There is a sea of configuration parameters, and understanding them is a big challenge. Moreover, the right values for many of these parameters depend on your use case, and a lot of them have to be tried in production to see what works best for you.

Strategy that we follow:

1. Understand the config parameter
2. Look out for recommended values on web and books
3. Discuss with colleagues whether it’s good for our use case
4. Change one parameter at a time and monitor its effect
5. Negative impact – revert
6. Positive impact – joy!

Hadoop:

1. dfs.blocksize: Increased HDFS block size to 128 MB [19] [23]
2. mapred.child.java.opts: (Default: -Xmx200m, Using: -Xmx800m) [19] [23]
We override the global value with 1 GB in most Pig scripts.
3. io.sort.mb: The total amount of buffer memory to use while sorting files, in megabytes.
(Default: 100, Using: 200) [19] We override it to 512 MB in many Pig scripts because of a significant number of spilled records. [23]
4. mapred.output.compression.type: If the job outputs are to be written as SequenceFiles, this parameter specifies compression type to use. (Default: RECORD, Using: BLOCK) [20] [21] [23]
5. mapred.output.compression.codec and mapred.map.output.compression.codec: (Default: org.apache.hadoop.io.compress.DefaultCodec, Using: org.apache.hadoop.io.compress.SnappyCodec) [20] [21] [23]
6. mapred.compress.map.output: (Default: false, Using: true) [21] [23]
7. mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum: Depend on RAM, CPU, other services running on the node, and the types of jobs typically run on the cluster. We're using 7 and 4 respectively. [22] [23]
8. io.file.buffer.size: Determines how much data should be buffered during read/write operations.
(Default: 4096, Using: 131072) [23]
9. mapred.jobtracker.taskScheduler: We're in the process of fine-tuning the FairScheduler config parameters. It'll become very important once we start supporting a number of products, ad hoc analysis, and data processing pipelines.
Default: org.apache.hadoop.mapred.JobQueueTaskScheduler
Using: org.apache.hadoop.mapred.FairScheduler
10. io.sort.factor: The number of streams to merge at once while sorting files. [23]
Default: 10
Using: 50
11. mapred.reduce.tasks: We override this at the job level (in Pig scripts). [23]
Default: 1
12. tasktracker.http.threads: The number of worker threads for the http server. This is used for map output fetching. [23]
Default: 40
Using: 80
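Most of the values above are cluster-wide defaults; the per-job overrides mentioned for items 2, 3, and 11 are done with SET statements in the Pig scripts, and a job written directly in Java can apply the same MR1 parameter names on its Configuration. An illustrative sketch (class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class Stage1JobConfig {
  public static Job configuredJob() throws Exception {
    Configuration conf = new Configuration();
    // Per-job overrides of the cluster-wide defaults listed above (MR1 parameter names).
    conf.set("mapred.child.java.opts", "-Xmx1024m");        // more heap for heavy parse/sort tasks
    conf.setInt("io.sort.mb", 512);                         // bigger sort buffer to cut spilled records
    conf.setBoolean("mapred.compress.map.output", true);    // compress intermediate map output
    conf.set("mapred.map.output.compression.codec", SnappyCodec.class.getName());
    conf.set("mapred.output.compression.type", "BLOCK");    // for SequenceFile outputs
    return new Job(conf, "haproxy-stage1");                 // placeholder job name
  }
}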

HBase: [37]

1. HBASE_HEAPSIZE: The maximum amount of heap to use, in MB. (Default: 1000, Using: 10000) [22] [24]
2. JVM GC Tuning: Our JVM GC parameters are based on the guidelines mentioned in [25]
3. hbase.regionserver.global.memstore.upperLimit and hbase.regionserver.global.memstore.lowerLimit: Using Default
4. hbase.regionserver.handler.count: Default: 10, Using: 20
5. Bulk Loading: We use the bulk load API [8] of HBase to process HAProxy logs in the first stage of the data processing pipeline. Bulk load bypasses the write path, is very fast, and decreases the load on the cluster. [26] [27]
6. Snappy Compression: We have enabled Snappy compression for all tables. It saves us about 5-10x in storage space.
7. Future work/work in progress (mostly related to read performance): [28]

  • Changing Prefix compression to FAST_DIFF
  • Reducing HBase block size
  • More block cache
  • Short circuit local reads
  • Row level bloom filters

Pig: [29] [30]

1. hbase.client.scanner.caching: HBase param set from the Pig script. The number of rows fetched when calling next on a scanner if it is not served from (local, client) memory. Higher caching values enable faster scanners but eat up more memory.
Using: 10000.
2. hbase.client.write.buffer: HBase param (in bytes) set from the Pig script. A bigger buffer takes more memory on both the client and the server side, since the server instantiates the passed write buffer to process it, but a larger buffer reduces the number of RPCs made.
Default: 2097152
Using: 10000000
3. default_parallel: Sets the number of reducers at the script level. Using: 20/30
4. pig.cachedbag.memusage: Sets the amount of memory allocated to bags.
Default: 0.2 (20% of available memory)
Using: 0.7 for some Pig scripts
5. pig.exec.mapPartAgg: Improves the speed of group-by operations.
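The two HBase client parameters above have direct equivalents when talking to HBase through the Java client API rather than through Pig's storage functions; a minimal sketch, with the table name as a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ClientTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    Scan scan = new Scan();
    scan.setCaching(10000);           // same idea as hbase.client.scanner.caching above
    scan.setCacheBlocks(false);       // typical for large scans that feed batch jobs

    HTable table = new HTable(conf, "daily_views");    // placeholder table name
    table.setAutoFlush(false);                         // batch puts on the client side
    table.setWriteBufferSize(10000000L);               // same idea as hbase.client.write.buffer above

    ResultScanner scanner = table.getScanner(scan);    // this scanner now fetches 10000 rows per RPC
    scanner.close();
    table.close();
  }
}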

Migrating Old Analytics to Hadoop/HBase

Since this was a one-off exercise, instead of building an ETL pipeline, we manually dumped all the MySQL tables of interest into HDFS and wrote some Pig scripts to transform the data and store it into HBase. In some cases, there were one-to-many relationships between the MySQL and HBase tables; in those cases, we wrote some extra tests to make sure that there were no data discrepancies between the sources.

And then there was the question of dealing with data points which were not available in the old system, and some which we started collecting only recently, like outbound clicks. Since back-processing old logs was not an option, we maintain a list of the "inception" dates of all data points. Additionally, we've written the web app in such a way that the numbers gray out and warn you in case the requested date range is not available for a particular data point.

TreeBeard – The Aggregation Layer

At the aggregation layer, we calculate a dynamic range based on the hourly and daily workflow update tables. These two tables tell us when the hourly and daily processing pipelines last completed.


Fig 7: TreeBeard API – Near Realtime Aggregation Layer [Daily Views]

The range will have some dates for which we can't get precomputed, aggregated daily views from the daily processing pipeline. For these dates we use the hourly tables, which are updated every hour: we query the hourly table and sum the views on the fly at query time.

Explanation with an Example:

We’ll explain the aggregation logic with an example.

Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We'll also have 2 hours of hourly views available.

The range for showing 1 month of daily views will be Mar 4 to Apr 3. For Mar 4 to Apr 1, we'll use precomputed daily views from the daily processing pipeline. For Apr 2 and Apr 3, we'll sum the hourly views.
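A minimal sketch of that merge in plain Java, operating on in-memory maps; the actual TreeBeard layer does the same thing on top of values read from the daily and hourly HBase tables:

import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

public class DailyViewsMerge {
  // Daily views for [from, to]: precomputed daily counts where the daily pipeline has
  // finished, hourly sums for the trailing days it has not covered yet.
  static Map<LocalDate, Long> dailyViews(LocalDate from, LocalDate to,
                                         LocalDate lastDailyProcessed,
                                         Map<LocalDate, Long> precomputedDaily,
                                         Map<LocalDate, long[]> hourlyViews) {
    Map<LocalDate, Long> result = new TreeMap<>();
    for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
      if (!d.isAfter(lastDailyProcessed)) {
        result.put(d, precomputedDaily.getOrDefault(d, 0L));      // e.g. Mar 4 .. Apr 1
      } else {
        long sum = 0;                                             // e.g. Apr 2 and Apr 3
        for (long v : hourlyViews.getOrDefault(d, new long[0])) {
          sum += v;
        }
        result.put(d, sum);
      }
    }
    return result;
  }
}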


Fig 8: TreeBeard API – Near Realtime Aggregation Layer [Total Views]

The new dynamic range will overlap with the date range for which daily processing of total views is complete. To compute the new total views, we add views for the days for which daily processing is not yet complete and deduct views from the start of the range covered by the precomputed total views obtained from the daily processing pipeline.

Explanation with an Example:

We’ll explain the aggregation logic with an example.

Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We'll also have 2 hours of hourly views available.

The range for showing 1 month of total views will be Mar 4 to Apr 3. The most recent precomputed total views from the daily processing pipeline are for the range Mar 2 to Apr 1.

Total views (Mar 4 to Apr 3) = Total views from daily data processing pipeline (Mar 2 to Apr 1) + Total views for Apr 2 and Apr 3 obtained by summing hourly views – Total views for Mar 2 and Mar 3 obtained by summing daily views


Fig 9: TreeBeard API – Near Realtime Aggregation Layer [Top Views]

The new dynamic range will overlap with the date range for which daily processing of top views is complete. For computing the new top views, we take the practical approach of getting the top 1000 views from the daily processing pipeline and adding the delta hits to each of these entries. Then we sort in memory and return the top 5/10/20 views.

Explanation with an Example:

We’ll explain the aggregation logic with an example.

Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We'll also have 2 hours of hourly views available.

The range for showing 1 month of top views will be Mar 4 to Apr 3. The most recent precomputed and sorted top views from the daily processing pipeline are for the range Mar 2 to Apr 1.

Top views for Mar 4 to Apr 3 are calculated as follows:

Step 1: Take the top 1000 views for Mar 2 to Apr 1

Step 2: To each of the sorted hit counts, add the views for Apr 2 and Apr 3 obtained by summing hourly views

Step 3: From each of the sorted hit counts, deduct the views for Mar 2 and Mar 3 obtained by summing daily views

Step 4: Sort in memory

Step 5: Return the top 5/10/20
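A minimal sketch of those five steps in plain Java (the data structures and names are illustrative; in production this happens inside the TreeBeard layer over data read from HBase):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopViewsMerge {
  // One entry of the precomputed top-1000 list, e.g. a slideshow id with its hit count.
  static class Entry {
    final String key;
    final long views;
    Entry(String key, long views) { this.key = key; this.views = views; }
  }

  // top1000     : precomputed, sorted top views for the stale range (Mar 2 to Apr 1)
  // deltaToAdd  : per-key views for the uncovered tail (Apr 2 + Apr 3, summed from hourly tables)
  // deltaToDrop : per-key views for the leading days to deduct (Mar 2 + Mar 3, summed from daily tables)
  static List<Entry> topN(List<Entry> top1000,
                          Map<String, Long> deltaToAdd,
                          Map<String, Long> deltaToDrop,
                          int n) {
    List<Entry> adjusted = new ArrayList<>();
    for (Entry e : top1000) {
      long views = e.views
          + deltaToAdd.getOrDefault(e.key, 0L)      // step 2: add the Apr 2-3 hourly sums
          - deltaToDrop.getOrDefault(e.key, 0L);    // step 3: deduct the Mar 2-3 daily sums
      adjusted.add(new Entry(e.key, views));
    }
    adjusted.sort(Comparator.comparingLong((Entry e) -> e.views).reversed());  // step 4
    return adjusted.subList(0, Math.min(n, adjusted.size()));                  // step 5
  }
}

Note that an item outside the precomputed top 1000 cannot enter the list until the next daily run, which is the constraint called out later under Limitations.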

Testing (Analytics Data Points)

1. Reconciliation tests verifying the integrity between various pages

 


 Fig 10: Summary Tab

  • Checking that each of the top 5 content items has Views <= Total views
  • Checking that each of the top 5 countries has Views <= Total views


 Fig 11: Views Tab

  • Total views = SlideShare + Embeds
  • Each of top (5/10/20) <= Total views
  • Sum of top content views <=  Total views


 Fig 12: Sources Tab

  • Total views = Sum of all sources
  • Total views count >= Each of source count
  • Each of top sources (5/10/20) <= Total views
  • Sum of top sources <= Total views


 Fig 13: Geo Tab

  • Each of top country (5/10/20) <= Total views
  • Sum of top countries <= Total views

2. Data consistency amongst various TreeBeard API calls (realtime = true)

  • Data Type: Views, Traffic Sources, Country, Referer Views
  • Data For: User, Slideshow
  • Trend: hourly, daily, total, top
  • Realtime: true, false
  • Range: 1w, 1m, 3m, 6m, all
  • Additional Params: source, date, limit, country, user_id, referer, referer_source

Use Cases

  • User and slideshow level
  • All 5 types of uploads (presentations, documents, infographics, videos, and slidecasts)
  • Low, mid and high usage users
  • 1w, 1m, 3m, 6m and all time ranges
  • Hours when daily processing finishes

Test Cases

  • Total views = Total views (embed) + Total views (onsite)
  • Total views = Total traffic sources
  • Total views >= Total country views
  • Total views = Sum daily views
  • Total views >= Sum top content
  • Total views >= Sum top traffic sources
  • Total country views = Sum daily country views
  • Total traffic source = Sum daily traffic source
  • Total referer views = Sum daily referer views
  • Top traffic sources <= Total referer views
  • Daily views >= Sum hourly views (0-5% error)
  • Daily country views >= Sum hourly country views (0-5% error)
  • Daily traffic source >= Sum hourly traffic source (0-5% error)
  • Total views (Slideshow) = Views on Top content
  • Limits are working (max being 1000)
  • New user/uploads/country processed in hourly data processing pipeline

What We Learnt

  • HBase schema design is critical to latency as well as to how much data is scanned while running MR jobs.
  • The number of regions maps to the number of tasks created while running MR jobs over HBase tables. Because of oversight and/or poor understanding, at one point we were stuck with 24+ hours of processing for the daily workflow. [38] [39] Pre-split regions if required (see the sketch after this list). HBase 0.94 improved the default region split policy and is much better. [31] [32]
  • We had written a UDF that added up the views for a given range. It increased the memory requirements of the tasks, and because of that regionservers used to die regularly with OOM errors. Finally, we replaced the custom UDF with the built-in SUM and GROUP BY operations, which need much less memory, at the cost of increased processing time because of spilled records.
  • Always change one configuration parameter at a time, otherwise you may not be able to measure its impact.
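A minimal sketch of pre-splitting a table at creation time through the HBase Java admin API; the table name, column family, and split points below are placeholders, and the right split keys depend entirely on your row-key design:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("daily_views");   // placeholder table name
    desc.addFamily(new HColumnDescriptor("v"));                    // placeholder column family

    // 16 regions up front, split on the first byte of a hashed row-key prefix.
    byte[][] splits = new byte[15][];
    for (int i = 1; i <= 15; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%02x", i * 16));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}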

Limitations

  • HBase schema design is very tightly coupled with query patterns. A slight change in query requirements can force you to create a new table (and potentially reprocess a lot of data).
  • Pig script debugging is hard.
  • Top lists can incorporate new entries only in daily processing, not in hourly.
  • Hourly data stored only for past 10 days.
  • Top views can show a maximum of 1000 entries.
  • Latest views limited to 1000 entries.
  • Some time is spent in recovering from/repairing failed HBase components [33]
  • Currently we have one monolithic cluster, which supports multiple use cases. [34]

Looking Forward

  • Support more products on the cluster
  • Reduce the entry barrier for non-Hadoop and non-technical folks; spend time on setting up and increasing the adoption of Hue [35]
  • CDH5 upgrade [36]

Blog Post Contributors and Team Members

  • Akshit Khurana
  • Amit Sawhney
  • Anil Siwach
  • Ankita Gupta
  • Bubby Rayber
  • Diksha Kuhar
  • Hari Prasanna
  • Jasmeet Singh
  • Nikhil Chandna
  • Nikhil Prabhakar
  • Prashant Verma

References

1. PRO Analytics Are Now Faster, Simpler, Better

2. CDH: 100% Open Source Distribution including Apache Hadoop

3. CDH4.3.0 Documentation

4. Apache HBase Project

5. Apache Pig Project

6. Chapter 10. Writing Evaluation and Filter Functions [Programming Pig by Alan Gates]

7. Apache Oozie Workflow Scheduler for Hadoop

8. Tool to load the output of HFileOutputFormat into an existing table [API]

9. Apache HBase I/O – HFile

10. TorqueBox Project

11. TorqueBox 2.x Performance Benchmarks

12. JRuby: The Ruby Programming Language on the JVM

13. Apache Phoenix: “We put the SQL back in NoSQL”

14. Fluentd: Open Source Log Management

15. Apache ZooKeeper

16. CDH4 High Availability Guide

17. Configuring/Optimizing Hadoop ecosystem

18. https://twitter.com/_nipra/status/364662881708544000

19. Tuning DSE Hadoop MapReduce

20. Using Snappy for MapReduce Compression

21. Chapter 7: Tuning a Hadoop Cluster for Best Performance
(Using compression for input and output)
[Hadoop Operations and Cluster Management Cookbook]

22. 8.10. Recommended Memory Configurations for the MapReduce Service

23. Chapter 5. Installation and Configuration
(MapReduce – Optimization and Tuning / HDFS – Optimization and Tuning)
[Hadoop Operations]

24. HBase Administration, Performance Tuning

25. HBase JVM Tuning

26. How-to: Use HBase Bulk Loading, and Why

27. 9.8. Bulk Loading

28. Apache HBase at Pinterest

29. Performance and Efficiency [Pig]

30. Chapter 8. Making Pig Fly

31. HBASE-4365 – Add a decent heuristic for region size

32. RegionSplitPolicy [API]

33. HBase: How to get MTTR below 1 minute

34. Hadoop Hardware @Twitter: Size does matter! [Video] [Slideshow]

35. Hue

36. CDH 5 Release Notes

37. HBase Performance Tuning

38. 7.1. Map-Task Splitting (The Apache HBase™ Reference Guide – Latest)

39. HBase as MapReduce job data source and sink


Do you enjoy solving complex problems in Big Data Analytics? SlideShare engineers are constantly working on such challenging projects. If work like this excites you, this is the place to be. We are hiring.

 


10 ways we made SlideShare faster

by Apoorvi Kapoor on October 18, 2013

in HTML5, JavaScript, Performance

At SlideShare, we love metrics. Be it weekly traffic data, number of uploads, number of PRO sign-ups, or page load time; we track each number very closely and strive to improve the numbers every day. For the second quarter of 2013, SlideShare's CTO Jonathan Boutelle set the target of reducing our page load time to 5 seconds; at the time it was hovering at an average of 7.5 seconds.

This was not an easy task. We were already following the web performance practices listed by YSlow and had made every extra effort to achieve good grades on WebPagetest. When it comes to page load time, every millisecond of reduction counts. Read on to find out how we got close to our target.

1. Load all third-party libraries after window.load - To start with, all the JavaScript required for social sharing widgets, e.g. the Facebook Like button, should be loaded asynchronously. As a second step, move the script tags from the HTML page to a JavaScript (JS) file and bind the asynchronous loading of the external JS libraries to the window.load event. For example, to load the FB Like button, place the following code in the JS file included in the head:
$(window).load(function() {
  (function(d) {
    var js, id = 'facebook-jssdk',
        ref = d.getElementsByTagName('script')[0];
    if (d.getElementById(id)) { return; }
    js = d.createElement('script'); js.id = id; js.async = true;
    js.src = "//connect.facebook.net/en_US/all.js";
    ref.parentNode.insertBefore(js, ref);
  }(document));
});

And this is how our New Relic graph looked after we made this change.


2. Asynchronously load jQuery - Another useful trick is to load jQuery asynchronously so that other components on the page are not blocked. We used the jQl plugin, which can be inlined in the HTML page. You don't have to worry about missing document.ready() calls, since jQl automatically catches all jQuery().ready() calls and executes them as soon as jQuery is loaded and the DOM is ready.

3. Combine and compress - Combining JS files helps eliminate the DNS lookup and TCP connection overhead that goes into every request. This is very evident from YSlow's request count, but an interesting tweak to this is conditional compiling. For example, let's say your page requires jquery, file1.js, logged_in.js, and logged_out.js.

Conditional compiling should give you combined_logged_in.js [jquery + file1.js + logged_in.js] and combined_logged_out.js [jquery + file1.js + logged_out.js].

For compressing JS, Closure Compiler is known to give the best results.

4. Smarter JS packaging - SlideShare has different functionality for logged-in and logged-out users. We load only the base files for logged-out users and additional JS files when the user logs in. Similarly, the HTML5 player loads different combined JS files depending on the content type, say presentations or documents. There is also a possibility that some plugins you added initially are not being used anymore. Removing them is the first step towards minimizing the size of the assets that block page load.

5. Defer loading of JS - The main component of SlideShare, the HTML5 player, needs to be loaded as fast as possible. For this we depend on the document.ready event. Therefore, the JS required for above-the-fold functionality is minimal and is loaded quite early. The other JS required for additional user functionality is loaded after window.load.

6. Lazy load HTML and JS - This is a commonly used technique to show user-generated content on the page, like comments, only when the user interacts. We removed the inline HTML required for user actions like the embed code and retrieve it on user click. This reduces the page size that the browser needs to parse and speeds up the load time.

7. Use of cache busters - All assets like images, sprites, and JavaScript files should have a cache buster appended to the end of the URL so that the browser can cache the assets and make a new request only when the cache buster has changed.

8. Remove HTTPS calls from an HTTP page - This can help you chop off a few milliseconds spent in SSL negotiation.

9. DNS prefetching - SlideShare uses multiple CDNs to fetch various assets. We figured out that we can reduce DNS lookup time by pre-resolving DNS in the browser, for example with <link rel="dns-prefetch"> hints.

10. Redesigning the HTML5 player - Apart from these, we also made certain improvements in our HTML5 player. The core of the SlideShare HTML5 player is written in JavaScript using jQuery. The HTML5 player is a complex system which supports displaying presentations, documents, slidecasts, and now infographics too! To make a dent in the page load time by a good margin, we realized that we needed to redesign the HTML5 player:

  • We started with the idea that the player should be able to display content even if JavaScript is disabled in the browser, and that JS should only be used to bring the player to life.

  • All the markup was generated on the client side using JavaScript, which led to a lot of DOM manipulation. This was resolved by moving the markup creation logic to the backend.

  • The layout of slides and other styling logic was rewritten so that CSS rules can be used effectively instead of relying on JS.

  • To make the user experience delightful, we included the first slide in the markup itself so that it is rendered by the browser's parser instead of waiting on JS.

So what was the result of the above experiments? Our New Relic graphs tell the entire story. A whopping drop from 5.7 to 5.03 seconds!

[New Relic page load time graphs: before and after the changes]

We used WebPagetest extensively to measure the performance of the changes we were making. Here is how we moved the needle for the document complete event.

[WebPagetest document complete results: before and after the changes]

None of the above would have been possible without the help of Jean Benois and Jeba Emmanuel.

If you find these tips useful or have any more ideas to make SlideShare faster, please give a shout-out in the comments below. So, what did you do to make your website faster?

Thanks

Apoorvi Kapoor


Brilliant Hacks at SlideShare Hackday 2013

by Prashant Verma on October 18, 2013

in Announcement, Hackdays, Hacks

Folks,

The 2013 SlideShare Hackday was organized on 3rd and 4th October. It was 30 hours of pure hackery and fun. Brilliant hacks, great participation, and good food made this the best SlideShare hackday ever! We managed to capture some of the moments here for you.

What cooked? Over 28 teams registered and 25 of them successfully completed their hack prototypes. There were many brilliant hacks, such as:

  1. Minority report ARDrone Quadcopter with Leapmotion Control that flies and hounds the developer who breaks the build by Jeba Singh Emmanuel
  2. HTML5 Presentations – Upload native HTML5 presentations to SlideShare (yay!) by Neha, Akshit, Nikhil
  3. Picturesque – Quickly create a SlideShare photodeck from your Facebook photo album by Hitesh, Apoorvi, Himanshu
  4. Pseudo upload – Upload to SlideShare directly from your Gmail attachments by Jasmeet, Nitish, Gaurav
  5. Slideshare for Weddings – Quickly create a visual wedding card and upload to SlideShare by Jai, Shishir, Vishu
  6. #Comment – Capture discussions on a deck from Twitter and bring it to SlideShare as comments on the deck by Pranav, Tushar, Arpit
  7. Let’s create content – Content creation platform on slideshare by Arpit D, Dheeraj, Saptarshi
  8. Speaking Decks – Listen to the slide transcripts by Troy, Arun, Bubby
  9. Conversion dev setup on vagrant by Chris
  10. SlideCaster – Slidecast and Zipcast at the same time by CaseyA, Omar, Mark
  11. Recommend content based on historic user-slideshow interaction by Kunal Mittal, Karan Jindal, Anupam Jain, Hari
  12. SoS – Chrome Widget for creating Content by Anil, Atul, Prashant
  13. Mobile reader by Andri, Jesse, Ben Silver, Ellis
  14. Slideshow as Video on facebook by Nishant, Jack
  15. SlideShare Collections by Varun , Vivek
  16. Content Organizer by Archana, Deepti, Moumita
  17. Photodeck – turn your album into slidedeck by Amit Banerjee
  18. Improved Transcript by Akash
  19. Infographic résumé by Simla Ceyhan, Sylvain Kalache, Yifu Diao
  20. Share with a click – share open positions on SS careers page to LI/Facebook by Priyanka R
  21. Rietveld on CentOS by Toby
  22. Think Mobile – Preview how your SlideShare deck appears on mobile devices by Ajay
  23. Convert blogs into SS decks by Subhendra Basu & Rashi

Who won? Each hack explored a different idea from a unique perspective. The most well-executed idea won. Here are the results –

Delhi Office:
1st Prize: Slideshare for Weddings by Jai, Shishir, Vishu
2nd Prize: HTML5 Presentation Support by Akshit, Neha, Nikhil

SF Office:
1st Prize: SlideCaster by Casey A, Omar & Mark.
2nd Prize: Infographic résumé by Simla, Sylvain & Yifu

Worthy mentions

  • Let’s create content (Arpit D, Dheeraj, Saptarshi)
  • Think Mobile (Ajay)
  • #Comment (Arpit B, Pranav, Tushar)

Does this sound exciting to you? If yes, then you're like us. Why not consider applying for a job at SlideShare and become a part of this fun?


Fluentd at Slideshare

by Sylvain Kalache on June 28, 2013

in Uncategorized

SlideShare will be hosting the next Fluentd meetup on July 8th at our San Francisco office.

Fluentd is an open-source program that we have been using for the past year and a half; it helps us with log management, carrying logs from point A to point B in a fast and reliable way.

Among a lot of good things about Fluentd, here are the three we like the most:

  • Everything that goes in and out of Fluentd is JSON.
  • Fluentd is written in about 2,000 lines of open source Ruby code. When you have an issue, you can just read/patch the code.
  • There is already a huge plugin library (about 150 plugins) that allows you to import, filter, and export your data into a variety of systems.

If you want to know more about how we are using Fluentd, please have a look at one of our projects in which we are using it, and join us on July 8th!


What's Cooking in Our Labs: SlideShare HandsFree!

by kapil on March 20, 2013

in Announcement, Hackdays, Hacks, User Experience

[Reposted from http://blog.slideshare.net/2013/03/19/whats-cooking-in-our-labs-slideshare-handsfree/]

Clicking through presentations can be cumbersome, especially when you're talking through them with a live audience at hand. You have to find the right key on your keyboard, or move your mouse to the correct button, often halting the flow of your speech. What if you could just flick your finger in the air, indicating movement to the next page?

Our engineers are on it. We figured if you could play motion-sensing tennis on the Nintendo Wii, couldn't you at least flip through SlideShare presentations with the wave of a hand? Here's a preview of what we're working on:

And here's what the engineer himself, Shirsendu Karmakar, had to say about developing it (yes, he's pretty cool!):
If you have used/seen Flutter, you wish you could use it on websites too. A few days back, I saw an interesting Chrome Experiment. My initial reaction: SlideShare "Minority Report" style! I started working on something similar for SlideShare. After an hour or so, with JavaScript as my weapon and some simple techniques and approximations, SlideShare presentations were gesture-ready. It took me around 30 lines of code to make SlideShare work via my gestures.
What's happening behind the scenes:

  • webRTC has made it possible to access the web camera directly from the browser. No Flash required!
  • An image is snapped at regular intervals.
  • An HTML5 canvas is used to draw the current image.
  • The movement delta (the difference between the last image and the current image) is calculated.
  • Depending on the value of the delta, we try to detect which movement was made. Currently only four basic movements are supported: left, right, top, bottom.
  • Each of the directions is mapped to one of the SlideShare player's API functions. Whenever a movement is detected successfully, the player executes the mapped action.

What are other features you'd like to see us develop in SlideShare Labs?


Introducing SlideShare API Explorer

by jai on November 19, 2012

in Announcement, API/Mashups

We know that at times it becomes difficult for you to keep up with our API documentation, which results in failed attempts at testing out an endpoint, and in turn results in a loss of productivity. To solve this issue, we created SlideShare API Explorer, which helps you get started with our API and makes it super easy to test out an API endpoint.

Just follow the simple steps below:

1. Apply for an API key (requires a user account). You'll get the API credentials at your registered email address.
2. Go to apiexplorer.slideshare.net and fill in the credentials which you received in the email.

Fill in your API Credentials

3. Now select an endpoint to test, fill in the test parameters, and click on 'try it'. It will query our endpoint and show the generated query string, response headers, and response body.

We are constantly working on making the API development workflow a seamless experience for you, and we would love to hear your feedback/suggestions on this. Feel free to share your comments below.


DevelopHer Hackday Delhi

June 22, 2012

Announcing DevelopHer Hackday Delhi at SlideShare's New Delhi office. First of its kind event! Same dates as the DevelopHer Hackday in the Bay Area organized by LinkedIn. Come on, women geeks and hackers… this is your stage! Form a team, code all night, create something awesome, and present it to the judges to win prizes. Dave […]

Read more →

SlideShare is looking for Rockstar Designers in New Delhi

February 20, 2012

[Reposted from the SlideShare Blog http://blog.slideshare.net/2012/02/17/designer-dream-job-in-delhi/] Are you passionate about digital design? Do you dream of working at one of the world's fastest growing startups? If you are an experienced web or interactive designer, come work with SlideShare in our New Delhi office. You'll be collaborating closely with our software developers, product managers, and analysts […]

Read more →

6 foot clownfish swimming through the slideshare sf office

January 25, 2012

So we got one of those awesome thinkgeek toys a couple weeks ago … a huge remote-control fish. Amazingly, it propels itself by swimming … it doesn't have a propeller or anything. As a result it looks amazingly realistic and is a lot of fun to drive. Here's a video of it in action. Floating […]

Read more →

The buddy system: an alternative to pair programming

November 28, 2011

Pair programming has always rubbed me the wrong way. I understand why some people like it. It's great to have collective code ownership. It's great to have conversations about what the code should look like before the code is actually written. And it's great to have a collaborative work environment where people are always learning […]

Read more →