SlideShare’s first experience with iOS and Swift

With little experience developing on iOS, SlideShare’s mobile team started working on our first iOS app in May. We had a small team of engineers, working on a not-yet-released version of iOS, using a brand-new programming language (Swift) — needless to say it was an exciting ride!

Fast forward four months and we’ve just released our iOS 8 app, which allows users to view and save presentations, and discover content tailored to them, on a beautiful interface. Here is how we did it.

We followed a process similar to the one we used with our Android app. Our new iOS lead, Jeba Emmanuel, began with an exploration phase whose goal was to create an initial throw-away prototype of a single vertical slice of functionality. This exercise let him familiarize himself with the platform and the language while also prototyping the app and validating early implementation ideas.

Until that point, Objective-C and iOS 7 were pretty much a given for our project, but timing worked in our favor: the announcement of iOS 8 and Swift came just as we completed the early prototype and were starting to work on the full app.

SlideShare is a Ruby shop, and to us the comparison between Swift and Objective-C definitely favored the former. That, combined with how quickly users have adopted new iOS versions in the past (see the iOS 7 case here), made the decision to build an iOS 8-only app in Swift a clear one.

Any time you’re an early adopter on a new platform, you’ll find a healthy mix of elements you love and quirks to understand. For us, the balance was definitely positive and we are very satisfied with the decision of going with the iOS 8 / Swift combination.

Below are a couple of minor issues we worked through initially:

  • Beta Version Bugs: As early adopters, we were working on beta versions of iOS (OS), Swift (language) and Xcode (tool), so we expected minor issues – that’s just par for the course with any brand-new platform. But kudos to Apple – we never found ourselves blocked. In fact, there were always workarounds and the frequency of releases for patched versions of iOS and Xcode was very useful. Some of the bugs were:
    • Access qualifiers not initially available in Swift
    • Compilation and build time errors not always accurate or descriptive enough
    • Limited documentation for language and IDE
    • Occasionally, we hit problems that nobody else had run into yet, so at that particular moment Stack Overflow wasn’t an option for us.
  • Interoperability with Objective-C: This sometimes forced us to write less expressive code. That said, interoperability with Objective-C is in the list of things that made our experience great, as you’ll soon see (What can I say? We are complicated!)

Here’s what worked great for us:

  • Support for Different Screen Sizes: We were super excited that Apple released new devices with different screen sizes, and we made full use of these screens with Launch Images/XIBs, Size Classes, Trait Collections and Auto Layout. Instead of just blowing up the view and stretching the layout, we customized our layouts to give users more information on a bigger screen. The image above shows the layouts for our profile page on an iPhone 6 before (left) and after (right) adding the Launch XIBs. Only a minor code change was necessary to calculate the thumbnails’ sizes appropriately.
  • Support for Different Orientation Layouts: Apple’s size classes, trait collections and Auto Layout made laying out adaptive interfaces really easy. The image above shows the SlideShare player in different orientations. The action bar is laid out in two completely different ways with no layout-handling code: in portrait, the toolbar is pinned to the bottom of the screen using Auto Layout constraints associated with the Compact-Regular size class, whereas the constraints used for landscape pin it to the right of the screen. This is the new way to handle device rotations; the old way (which requires code) is deprecated in iOS 8. For more information on Adaptive UI, check out Apple’s reference here.
  • New iOS 8 UIKit Components: With iOS 8, we can now use UISplitViewController on the iPhone (before, it was only available on the iPad). The image above shows how we used this view controller to implement the SlideShare player slide index. Showing and hiding the index based on orientation was also simple using the different modes of UISplitViewController and size classes. This significantly reduced the amount and complexity of our code.
  • Swiftness With Swift: The combination of language syntax and functional programming features available in Swift makes it a much more expressive language than Objective-C, allowing us to write fewer lines of more readable code, thanks to features like the following:
    • Closures: Very clean syntax in Swift
    • Tuples: Functions can return multiple values directly, with no need for side effects on objects passed by reference or for data structures created only to serve as return values
    • Property observers: You can piggyback some work onto setters, in a very clean and readable way
    • Generics: A great language feature that helped us to prevent redundancy and repetition in our code

    While Objective-C interoperability made us write more verbose code, it allowed us to use existing libraries in a way similar to what you see with modern JVM-based languages that can use existing Java libraries. The ability to use existing open source and in-house Objective-C libraries was key for us to be early adopters of Swift.

    Lastly, these Swift features allowed us to write a very stable app:

    • Constants and Optionals
    • Type Safety
    • Automatic break in switch
    • Required braces for all if statements
    • Enums
    • Computed Properties

    Even for first-time users like us, the development was, in a word, “Swift” – and our users are pleased with the results.

Many thanks to Jeba Singh Emmanuel, Ellis Weng, Kyle Sherman and Alex Corre whose input was very important for the writing of this post.

Skynet Project – monitor, scale and auto-heal a system in the Cloud

Skynet is a set of tools designed to monitor, scale and maintain a system in the Cloud. Put more simply, it’s a system that is aware of what’s happening on every single machine, and therefore of how the cluster is doing as a whole.

Skynet architecture

Background:

Our document conversion infrastructure is running in EC2. Pay-as-you-go is great for us, as we can scale depending on the number of documents our users are uploading to SlideShare.

We are firm believers in automation, so we decided to make the scaling process automated. The initial attempt was written in Bash, which was good enough while we were small. However, our cluster has grown by an order of magnitude. That’s why Casey Brown and I decided to build Skynet.

What and Why:

Skynet consists of:
- Collectors (Ruby)
- Message bus (Fluentd)
- Data store (MongoDB)
- API (Ruby)
- Controller (Ruby)
- Actions/scenarios (YAML)

The data collection happens via two kinds of collectors that we wrote: a library to gather application logs, and a daemon present on each machine to collect system metrics. The data is sent via Fluentd to multiple datastores in a reliable, fast and flexible fashion. We built these data collection tools ourselves because we wanted to be free to record whatever we want, in the programming language we like (Ruby).
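The application-side collector is tiny. As a rough illustration, here is what posting an event from Ruby through the fluent-logger gem looks like; the tag and record fields are made up for the example and are not our actual schema.

require 'socket'
require 'fluent-logger'

# Point the logger at the local Fluentd agent (default forward port).
Fluent::Logger::FluentLogger.open(nil, host: 'localhost', port: 24224)

# Each record is a plain Ruby hash; Fluentd carries it around as JSON.
Fluent::Logger.post('skynet.conversion',
                    host: Socket.gethostname,
                    doc_id: 12345,
                    status: 'converted',
                    duration_ms: 842)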

We are using MongoDB, which we liked when starting the project because we were unclear about how the data would look, and MongoDB gave us the flexibility we needed. In front of it we have a REST API that allows anyone to consume the data easily without learning MongoDB-specific queries. It also gives us the possibility to change the datastore technology without disturbing data consumers (graph dashboards, analytics reports, the Skynet controller…).

The scaling part is handled by the controller. Based on simple signals, like the number of documents waiting to be converted, the load on the machines and the number of active connections on the web servers, it is easy to decide whether more capacity is needed.

Auto-Healing

Let’s discuss the neat part: auto-healing. We realized that the majority of the on-call pages we get require a set of repetitive actions, which took us away from our precious foosball time. To solve that issue, we gave the Skynet controller a set of actions it can perform, from which we can build scenarios (both actions and scenarios are organized in YAML files). Let’s take an example where Skynet detects that a machine is not processing documents:

  • It first gets the status of the application process and finds out that it’s not running
  • It attempts to restart the process; the restart fails
  • It checks whether the PID file is present; it is, so it deletes the PID file
  • It tries another restart. It works!

The scenario I just described is a classic one that any Ops person has performed hundreds of times in their career. Scenarios are actually possibility trees: depending on the output of an action, the controller picks the next action to perform. Additionally, scenarios can mix in other scenarios.
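Our scenarios live in YAML, but the possibility-tree idea can be sketched in a few lines of Ruby. The action names and outcome labels below are hypothetical, not the real ones.

# Hypothetical encoding of the "machine not processing documents" scenario.
# Each node names an action; branches are keyed by the action's outcome.
SCENARIO = {
  action: :check_process_status,
  on_not_running: {
    action: :restart_process,
    on_failure: {
      action: :check_pid_file,
      on_present: {
        action: :delete_pid_file,
        on_ok: { action: :restart_process }
      }
    }
  }
}

# The controller walks the tree: run the action, look at its outcome,
# then follow the matching branch (stop when there is none).
def run(node, executor)
  return if node.nil?
  outcome = executor.call(node[:action])   # e.g. :not_running, :failure, :present, :ok
  run(node[:"on_#{outcome}"], executor)
end

# Usage sketch: run(SCENARIO, ->(action) { perform(action) })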

The decision engine, which is the controller, gives us the ability to make smarter decisions than if every server decided locally. Let’s say that a condition shows up on every server at the same time: the controller can decide to apply a scenario on a small part of the cluster, analyse the output and carry on or stop depending on how it went.

Finally, we want to make Skynet able to learn. When it cannot solve a situation by applying a known scenario, it will attempt to execute a series of authorized actions and record whether they worked. The next time the controller faces a similar issue, it will try the scenario that previously succeeded, and scenarios that don’t work will eventually be discarded.

Skynet is not open source yet but we are working on it. If you want to contribute to it now, we are hiring.

Hadoop and Near Real Time Analytics at SlideShare

The upgrade process and what we learnt along the way

 

At SlideShare-LinkedIn, we’re driven by the motto of “Members First” and constantly look for ways to make the product more valuable for our community. Last year, we launched a new version of analytics [1] and migrated all data points based on HAProxy logs to our Hadoop/HBase infrastructure. We’re happy to announce that on April 11 we reached another major milestone: we launched near real time analytics. Data points are now updated after a lag of just 30-90 minutes instead of the former 24-36 hours!

We’re writing this post to share our journey so far with you, starting from a 3-node proof of concept and ending at our current 31-node, highly available Hadoop/HBase cluster with no SPOF for masters, supporting near real time analytics [see the section on “HA Setup for Hadoop/HBase Masters” for details].

We started with an early version of the CDH4 distribution for Hadoop [2] and, after multiple upgrades that brought a lot of useful features, we’re now running CDH 4.3.0 [3] in production (we are really excited about our next major upgrade, to CDH5).

Need for Hadoop

Previously, analytics were backed by MySQL and a supporting Ruby batch processing infrastructure.

When we started thinking about new backend system for building analytics from scratch, we had the following requirements in mind:
1. Better latency: The old analytics was slow and sometimes not the most user-friendly.
2. Horizontal scalability: In terms of the number of users and the amount of data it can process.
3. Supporting other products in future: We use MongoDB for some of our products and we could eventually move some of them, in particular Newsfeed, to HBase.
4. Adhoc analysis: We wanted to increase Hadoop usage across the company to get better insights from different datasets and logs. Search and newsletter activity could benefit from adhoc analysis.
5. Near real time analytics: Our goal was to make analytics near real time.
6. More analytics features: Introducing additional features that have been on the product roadmap but haven’t been implemented yet because of the limitations of the old analytics infrastructure.

Selection of Technologies

While we were fairly certain about selection of Hadoop [3] for processing the ever-increasing datasets including HAProxy logs, there were other technologies we had to select to make our Hadoop ecosystem complete.

1. HBase (Database):

HBase [4] leverages Hadoop HDFS for its storage and integrates well with the Hadoop ecosystem. Its architecture, designed for bulk writes and low-latency random reads, made it a clear-cut choice for the DB layer.

2. Pig (MR jobs):

For writing MR jobs we wanted to use a high-level language with good performance and optimization. Pig [5] comes with good support for HBase storage and load functions. Moreover, you can extend it by writing your own UDFs [6]. It’s also very easy to use for people outside the Hadoop or engineering teams.

 3. Oozie (Workflow Scheduler):

First, we started with a bunch of cron jobs, and when it became unmanageable to run interdependent MR/Pig jobs, we switched to Oozie [7]. It was a steep learning curve, but we have found it to be a good investment.

 4. Java (MR jobs/Oozie API/Pig UDFs):

It’s hard to avoid Java when you’re working in a Hadoop ecosystem. Initially we were skeptical about using Java, but our stage 1 jobs, which directly process the raw logs, are written in Java because we can use LoadIncrementalHFiles [8] to bulk write HFiles [9] and skip the write path completely. This reduces the load on the cluster and decreases the time required to complete stage 1 of our data processing pipeline.

We also use Java for writing Pig UDFs.

 5. TorqueBox/JRuby (REST API/SOA):

When we launched our new analytics, we were using the Thrift client to access the HBase API. Eventually, as the aggregation logic became complex, we decided to write a REST API for the analytics data points using TorqueBox [10]. TorqueBox is a high performance [11] application server written in JRuby [12] on top of JBoss. With JRuby, we have the freedom to use the HBase Java API directly. In the future, we’ll also have the option to move to a more performant HBase client if need be.
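For example, a read through the HBase Java API from JRuby looks roughly like this. This is only a sketch: the table, column family and row key are invented, and the HBase client jars need to be on the classpath.

require 'java'   # JRuby script; HBase 0.94 client jars must be on the classpath

java_import 'org.apache.hadoop.hbase.HBaseConfiguration'
java_import 'org.apache.hadoop.hbase.client.HTable'
java_import 'org.apache.hadoop.hbase.client.Get'
java_import 'org.apache.hadoop.hbase.util.Bytes'

conf  = HBaseConfiguration.create
table = HTable.new(conf, 'daily_views')                  # hypothetical table name

get    = Get.new(Bytes.to_bytes('user123|2014-04-01'))   # hypothetical row key
result = table.get(get)
views  = Bytes.to_int(result.get_value(Bytes.to_bytes('v'), Bytes.to_bytes('total')))

puts "views: #{views}"
table.close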

 6. Phoenix (SQL for HBase):

Phoenix [13] is an excellent project that allows you to write SQL queries over HBase tables and get answers with decent latency. We use Phoenix to generate on-the-fly reports for a user or presentation over a custom date range. Phoenix converts your SQL query to scans and uses filters and coprocessors to do the magic!

Data Collection and Processing Pipeline


Fig 1: Hadoop Analytics Data Collection

As you can see above, we use Fluentd [14], a log collection and aggregation tool, to collect the HAProxy logs from different sources. The logs are buffered into 64MB chunks and pushed to the HDFS sink through the webHDFS interface that Hadoop offers. A new file is created every hour to allow for hourly batch processing.

 


Fig 2: Data Processing Pipeline – High Level

At a high level, the whole analytics data processing pipeline is divided into two independent components that run in parallel as 2 Oozie coordinators. The daily data processing pipeline runs once a day, while the hourly pipeline runs every hour. Data processed by these 2 pipelines is then combined and aggregated by an API layer known as TreeBeard. The REST API exposed by TreeBeard provides most of the data points required by analytics.


Fig 3: Daily Data Processing Pipeline

- Daily workflow starts at 5 minutes past midnight every day
- All data points are computed at user and slideshow level
- Daily workflow completes generally in 4-5 hours
- For top and total tables we store data for 3 consecutive days


Fig 4: Hourly Data Processing Pipeline

- Hourly workflow runs 24 times a day and starts 5 minutes after the preceding hour
- Hourly workflow completes in 10-20 mins.

Overview of Hadoop Infra


 Fig 5: Overview of Hadoop/HBase Cluster at SlideShare

- We have a highly available setup for masters which means that there is no SPOF
- Datanode, Tasktracker and Regionserver run on the same node
- We currently have 6 master nodes and 25 slave nodes
- Oozie is running on one of the namenodes
- ZooKeeper [15] ensemble is hosted on multiple master nodes

HA Setup for Hadoop/HBase Masters


Fig 6: SlideShare Hadoop/HBase HA setup for masters

Since we rely on Hadoop/HBase for storing analytics data, it’s quite critical to the infrastructure and hence we have a complete HA setup [16]:

1. Active-standby namenode setup: In the event of namenode failure, a hot standby takes over as active.

- JournalNodes: daemons that synchronize edits across namenodes so that the standby namenode is always up to date, ensuring a hot standby.
- HDFS ZKFC: daemons that track namenode status and trigger a failover in the event of a namenode failure. This relies on the ZooKeeper cluster.
- ZooKeeper Quorum: a cluster of hosts that synchronize information and store the namenode state.

2. Active-standby JobTracker setup: In the event of a JobTracker failure, the standby server takes over and recovers the information about running jobs.

- JobTracker HA Daemons: JobTracker hosts run the JobTracker HA daemon.
- JobTracker ZKFC: daemon that triggers a failover in the event of a JobTracker failure.
- HDFS JobTracker Information: JobTracker state is stored in HDFS so that it can be used in the event of a failure.

3. Active-backup HBase Masters setup: Two HBase masters run, one as the active master and the other as a backup.

Configuration/Optimization Work

The Hadoop ecosystem is highly configurable [17], which is good but can be very frustrating [18] at times in a fast-moving ecosystem. There is a sea of configuration parameters, and understanding them is a big challenge. Moreover, the right values for many of these parameters depend on your use case, and a lot of them have to be tried in production to see what works best for you.

Strategy that we follow:

1. Understand the config parameter
2. Look out for recommended values on the web and in books
3. Discuss with colleagues whether it’s good for our use case
4. Change one parameter at a time and monitor its effect
5. Negative impact – revert
6. Positive impact – joy!

Hadoop:

1. dfs.blocksize: Increased HDFS block size to 128 MB [19] [23]
2. mapred.child.java.opts: (Default: -Xmx200m, Using: -Xmx800m) [19] [23]
We override global value with 1 GB in most Pig scripts.
3. io.sort.mb: The total amount of buffer memory to use while sorting files, in megabytes.
(Default: 100, Using: 200) [19] We override it with 512MB in many Pig scripts because of a significant amount of spilled records. [23]
4. mapred.output.compression.type: If the job outputs are to be written as SequenceFiles, this parameter specifies compression type to use. (Default: RECORD, Using: BLOCK) [20] [21] [23]
5. mapred.output.compression.codec and mapred.map.output.compression.codec: (Default: org.apache.hadoop.io.compress.DefaultCodec, Using: org.apache.hadoop.io.compress.SnappyCodec) [20] [21] [23]
6. mapred.compress.map.output: (Default: false, Using: true) [21] [23]
7. mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum: Depends on RAM, CPU, other services running on node, types of typical jobs being run on the cluster. We’re using 7 and 4 respectively. [22] [23]
8. io.file.buffer.size: Determines how much data should be buffered during read/write operations.
(Default: 4096, Using: 131072) [23]
9. mapred.jobtracker.taskScheduler: We’re in the process of fine tuning FairScheduler config parameters. It’ll become very important once we start supporting a number of products,
adhoc analysis, and data processing pipelines.
Default: org.apache.hadoop.mapred.JobQueueTaskScheduler
Using: org.apache.hadoop.mapred.FairScheduler
10. io.sort.factor: The number of streams to merge at once while sorting files. [23]
Default: 10
Using: 50
11. mapred.reduce.tasks: We override at Job level (Pig scripts). [23]
Default: 1
12. tasktracker.http.threads: The number of worker threads for the http server. This is used for map output fetching. [23]
Default: 40
Using: 80

HBase: [37]

1. HBASE_HEAPSIZE: The maximum amount of heap to use, in MB. (Default: 1000, Using: 10000) [22] [24]
2. JVM GC Tuning: Our JVM GC parameters are based on the guidelines mentioned in [25]
3. hbase.regionserver.global.memstore.upperLimit and hbase.regionserver.global.memstore.lowerLimit: Using Default
4. hbase.regionserver.handler.count: Default: 10, Using: 20
5. Bulk Loading: We use bulk load API [8] of HBase to process HAProxy logs in the first stage of data processing pipeline. Bulk load bypasses write path, is very fast and decreases the load on cluster. [26] [27]
6. Snappy Compression: We have enabled snappy compression for all tables. It saved us about 5-10x storage space.
7. Future work/Work in Progress: [28]
(Mostly related with read performance)

- Changing Prefix compression to FAST_DIFF
- Reducing HBase block size
- More block cache
- Short circuit local reads
- Row level bloom filters

Pig: [29] [30]

1. hbase.client.scanner.caching: HBase param set from Pig script. Number of rows that will be fetched when calling next on a scanner if it is not served from (local, client) memory. Higher caching values will enable faster scanners but will eat up more memory.
Using: 10000.
2. hbase.client.write.buffer: HBase param in bytes set from Pig script. A bigger buffer takes more memory — on the client and server side since server instantiates the passed write buffer to process it — but a larger buffer size reduces the number of RPCs made.
Default: 2097152
Using: 10000000
3. default_parallel: Sets the number of reducers at the script level. Using: 20/30
4. pig.cachedbag.memusage: Sets the amount of memory allocated to bags.
Default: 0.2 (20% of available memory)
Using: 0.7 for some Pig scripts
5. pig.exec.mapPartAgg: Improves the speed of group-by operations.

Migrating Old Analytics to Hadoop/HBase

Since this was a one-off exercise, instead of building an ETL pipeline we manually dumped all the MySQL tables of interest into HDFS and wrote some Pig scripts to transform the data and store it in HBase. In some cases there were one-to-many relationships between the MySQL and HBase tables; in those cases, we wrote some extra tests to make sure that there were no data discrepancies between the sources.

And then there was the question of dealing with data points that were not available in the old system, and some that we only started collecting recently, like outbound clicks. Since back-processing old logs was not an option, we maintain a list of the “inception” date of every data point. Additionally, we’ve written the web app so that the numbers gray out and warn you if the requested date range is not available for a particular data point.
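A sketch of that check (the inception dates and datapoint names here are invented, not our real values):

require 'date'

# Hypothetical inception dates for datapoints we only started collecting later.
INCEPTION_DATES   = { 'outbound_clicks' => Date.new(2013, 10, 1) }
DEFAULT_INCEPTION = Date.new(2006, 1, 1)   # placeholder for "available since the beginning"

# The web app grays out a datapoint when the requested range starts before its inception.
def available?(datapoint, from_date)
  from_date >= INCEPTION_DATES.fetch(datapoint, DEFAULT_INCEPTION)
end

available?('outbound_clicks', Date.new(2013, 1, 1))   # => false, so gray out and warn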

TreeBeard – The Aggregation Layer

At the aggregation layer, we calculate a dynamic range based on the hourly and daily workflow update tables. These two tables tell us when the hourly and daily processing pipelines last completed.


Fig 7: TreeBeard API – Near Realtime Aggregation Layer [Daily Views]

The range will include some dates for which precomputed, aggregated daily views from the daily processing pipeline are not available yet. For these dates we use the hourly tables, which are updated every hour: we query the hourly table and sum the views on the fly at query time.

Explanation with an Example:


Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We also have 2 hours of hourly views available.

The range for showing 1 month of daily views will be Mar 4 to Apr 3. For Mar 4 to Apr 1, we use the precomputed daily views from the daily processing pipeline. For Apr 2 and Apr 3, we sum the hourly views.
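In code, the split looks roughly like this. It is only a sketch: the two lambdas stand in for lookups against the daily and hourly HBase tables.

require 'date'

# Stand-ins for the daily and hourly tables (hypothetical values).
precomputed_daily_views = ->(date) { 100 }       # one precomputed number per day
hourly_views            = ->(date) { [5, 7] }    # hourly rows available so far for a day

last_daily_date = Date.new(2014, 4, 1)           # daily pipeline complete up to Apr 1
range           = Date.new(2014, 3, 4)..Date.new(2014, 4, 3)

daily_part, hourly_part = range.partition { |d| d <= last_daily_date }

views  = daily_part.sum  { |d| precomputed_daily_views.call(d) }  # Mar 4 - Apr 1
views += hourly_part.sum { |d| hourly_views.call(d).sum }         # Apr 2 - Apr 3, summed on the fly
puts views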


Fig 8: TreeBeard API – Near Realtime Aggregation Layer [Total Views]

The new dynamic range overlaps with the date range for which daily processing of total views is complete. To compute the new total views, we add the views for the days not yet covered by daily processing and deduct the views from the start of the range covered by the precomputed total views obtained from the daily processing pipeline.

Explanation with an Example:


Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We also have 2 hours of hourly views available.

The range for showing 1 month of total views will be Mar 4 to Apr 3. The most recent precomputed total views from the daily processing pipeline are for the range Mar 2 to Apr 1.

Total views (Mar 4 to Apr 3) = Total views from daily data processing pipeline (Mar 2 to Apr 1) + Total views for Apr 2 and Apr 3 obtained by summing hourly views – Total views for Mar 2 and Mar 3 obtained by summing daily views


Fig 9: TreeBeard API – Near Realtime Aggregation Layer [Top Views]

The new dynamic range overlaps with the date range for which daily processing of top views is complete. To compute the new top views, we take a practical approach: we get the top 1000 views from the daily processing pipeline and add the delta hits to each of them. Then we sort in memory and return the top 5/10/20.

Explanation with an Example:


Consider a server time of Apr 3, 2014, 03:00 hrs. At this point, daily processing for Apr 1 is complete and daily processing for Apr 2 is underway, having started at Apr 3, 00:05 hrs. We also have 2 hours of hourly views available.

The range for showing 1 month of top views will be Mar 4 to Apr 3. The most recent precomputed and sorted top views from the daily processing pipeline are for the range Mar 2 to Apr 1.

Top views for Mar 4 to Apr 3 are calculated as follows:

Step 1: Get the top 1000 views for Mar 2 to Apr 1

Step 2: Add the total views for Apr 2 and Apr 3, obtained by summing hourly views, to each of the sorted hit counts

Step 3: Deduct the total views for Mar 2 and Mar 3, obtained by summing daily views, from each of the sorted hit counts

Step 4: Sort in memory

Step 5: Return the top 5/10/20
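The same five steps, sketched in Ruby (the rows and counts below are made up):

# Step 1: precomputed, sorted top views for Mar 2 - Apr 1 (only two rows shown).
top_1000 = [
  { slideshow_id: 1, views: 5000 },
  { slideshow_id: 2, views: 4200 }
]

hourly_delta      = { 1 => 120, 2 => 80 }   # Apr 2 - Apr 3, summed from hourly rows
early_daily_views = { 1 => 300, 2 => 50 }   # Mar 2 - Mar 3, summed from daily rows

# Steps 2 and 3: add the recent delta, deduct the days that fell out of the range.
adjusted = top_1000.map do |row|
  id = row[:slideshow_id]
  { slideshow_id: id,
    views: row[:views] + hourly_delta.fetch(id, 0) - early_daily_views.fetch(id, 0) }
end

# Steps 4 and 5: sort in memory and return the top N.
top_10 = adjusted.sort_by { |row| -row[:views] }.first(10)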

Testing (Analytics Data Points)

1. Reconciliation tests verifying the integrity between various pages

 


 Fig 10: Summary Tab

- Check that each of the top 5 content views <= total views
- Check that each of the top 5 country views <= total views


 Fig 11: Views Tab

- Total views = SlideShare + Embeds
- Each of top (5/10/20) <= Total views
- Sum of top content views <=  Total views


 Fig 12: Sources Tab

- Total views = Sum of all sources
- Total views count >= Each of source count
- Each of top sources (5/10/20) <= Total views
- Sum of top sources <= Total views


 Fig 13: Geo Tab

- Each of top country (5/10/20) <= Total views
- Sum of top countries <= Total views

2. Data consistency amongst various Treebeard API calls (realtime = true)

The calls vary along the following parameters and values:

- Data Type: Views, Traffic Sources, Country, Referer Views
- Data For: User, Slideshow
- Trend: hourly, daily, total, top
- Realtime: true, false
- Range: 1w, 1m, 3m, 6m, all
- Additional Params: source, date, limit, country, user_id, referer, referer_source

Use Cases

- User and slideshow level
- All 5 types of uploads (presentations, documents, infographics, videos, and slidecasts)
- Low, mid and high usage users
- 1w, 1m, 3m, 6m and all time ranges
- Hours when daily processing finishes

Test Cases

- Total views = Total views (embed) + Total views (onsite)
- Total views = Total traffic sources
- Total views >= Total country views
- Total views = Sum daily views
- Total views >= Sum top content
- Total views >= Sum top traffic sources
- Total country views = Sum daily country views
- Total traffic source = Sum daily traffic source
- Total referer views = Sum daily referer views
- Top traffic sources <= Total referer views
- Daily views >= Sum hourly views (0-5% error)
- Daily country views >= Sum hourly country views (0-5% error)
- Daily traffic source >= Sum hourly traffic source (0-5% error)
- Total views (Slideshow) = Views on Top content
- Limits are working (max being 1000)
- New user/uploads/country processed in hourly data processing pipeline
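A couple of the checks above, written as a small test against a canned response (the field names are hypothetical, not the actual TreeBeard schema):

require 'minitest/autorun'

# A canned TreeBeard-style response for one slideshow.
RESPONSE = {
  'total'   => 1200,
  'onsite'  => 900,
  'embeds'  => 300,
  'sources' => { 'slideshare' => 900, 'linkedin' => 200, 'other' => 100 }
}

class ReconciliationTest < Minitest::Test
  def test_total_views_equal_onsite_plus_embeds
    assert_equal RESPONSE['total'], RESPONSE['onsite'] + RESPONSE['embeds']
  end

  def test_total_views_equal_sum_of_traffic_sources
    assert_equal RESPONSE['total'], RESPONSE['sources'].values.sum
  end
end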

What We Learnt

- HBase schema design is critical for latency as well as for how much data is scanned while running MR jobs.

- The number of regions maps to the number of tasks created while running MR jobs over HBase tables. Because of an oversight and a poor understanding of this, at one point we were stuck with 24+ hours of processing for the daily workflow. [38] [39] Pre-split regions if required. HBase 0.94 improved the default region split policy and is much better. [31] [32]

- We had written a UDF that added up the views for a given range. It increased the memory requirements of the task, and because of that regionservers used to die regularly with OOM errors. Finally, we replaced the custom UDF with the built-in SUM and GROUP BY operations, which need much less memory at the cost of increased processing time because of spilled records.

- Always change one configuration parameter at a time, otherwise you may not be able to measure its impact.

Limitations

- HBase schema design is very tightly coupled with query patterns. A slight change in query requirements can force you to create a new table (and potentially reprocess a lot of data).

- Pig script debugging is hard.

- Top lists can incorporate new entries only in daily processing, not in hourly.

- Hourly data is stored only for the past 10 days.

- Top views can show a maximum of 1000 entries.

- Latest views limited to 1000 entries.

- Some time is spent in recovering from/repairing failed HBase components [33]

- Currently we have one monolithic cluster, which is supporting multiple use cases. [34]

Looking Forward

- Support more products on the cluster

- Reduce the entry barrier for non-Hadoop and non-tech guys. Spend time on setting up and increasing the adoption of Hue [35]

- CDH5 upgrade [36]

Blog Post Contributors and Team Members

- Akshit Khurana
- Amit Sawhney
- Anil Siwach
- Ankita Gupta
- Bubby Rayber
- Diksha Kuhar
- Hari Prasanna
- Jasmeet Singh
- Nikhil Chandna
- Nikhil Prabhakar
- Prashant Verma

References

1. PRO Analytics Are Now Faster, Simpler, Better

2. CDH: 100% Open Source Distribution including Apache Hadoop

3. CDH4.3.0 Documentation

4. Apache HBase Project

- Apache HBase [hbase-0.94.6-cdh4.3.0]

- HBase 0.94.6-cdh4.3.0 API

- Apache HBase For Architects

- Transcript of HBase for Architects Presentation

- Tutorial: HBase (Theory and Practice of a Distributed Data Store) [pdf]

- Apache HBase Write Path

- HBase Architecture 101 – Storage

- HBase Architecture 101 – Write-ahead-Log

- HBase and HDFS: Understanding FileSystem Usage in HBase

5. Apache Pig Project

- Pig Documentation [pig-0.11.0-cdh4.3.0]

- Pig 0.11.0-cdh4.3.0 API

- Introduction To Apache Pig

- Pig Mix – PigMix is a set of queries used to test Pig performance from release to release

- HBaseStorage – HBase implementation of LoadFunc and StoreFunc

6. Chapter 10. Writing Evaluation and Filter Functions [Programming Pig by Alan Gates]

7. Apache Oozie Workflow Scheduler for Hadoop

- Oozie, Workflow Engine for Apache Hadoop [oozie-3.3.2-cdh4.3.0]

- Apache Oozie Client 3.3.2-cdh4.3.0 API

- How-To: Schedule Recurring Hadoop Jobs with Apache Oozie

8. Tool to load the output of HFileOutputFormat into an existing table [API]

9. Apache HBase I/O – HFile

- File format for hbase. A file of sorted key/value pairs. [API]

- HBase I/O: HFile

10. TorqueBox Project

11. TorqueBox 2.x Performance Benchmarks

12. JRuby: The Ruby Programming Language on the JVM

- Calling Java from JRuby

13. Apache Phoenix: “We put the SQL back in NoSQL”

- Phoenix in 15 minutes or less

14. Fluentd: Open Source Log Management

- HDFS (WebHDFS) Output Plugin

15. Apache ZooKeeper

16. CDH4 High Availability Guide

17. Configuring/Optimizing Hadoop ecosystem

- Hadoop

  1. core-default.xml
  2. hdfs-default.xml
  3. mapred-default.xml
  4. Deprecated Properties
  5. Ports Used by Components of CDH4

- HBase

  1. 2.3. Configuration Files [The Apache HBase™ Reference Guide - Latest]
  2. hbase-default.xml [0.94.6]
  3. HConstants.java [0.94.6]
  4. Constant Field Values [hbase-0.94.6-cdh4.3.0]
  5. What are some tips for configuring HBase?
  6. Guide to Using Apache HBase Ports
  7. HBaseConfTool

- Pig

  1. Pig Properties
  2. Pig Cookbook [Oozie]

- Oozie

  1. How-to: Use the ShareLib in Apache Oozie

18. https://twitter.com/_nipra/status/364662881708544000

19. Tuning DSE Hadoop MapReduce

20. Using Snappy for MapReduce Compression

21. Chapter 7: Tuning a Hadoop Cluster for Best Performance
(Using compression for input and output)
[Hadoop Operations and Cluster Management Cookbook]

22. 8.10. Recommended Memory Configurations for the MapReduce Service

23. Chapter 5. Installation and Configuration
(MapReduce – Optimization and Tuning / HDFS – Optimization and Tuning)
[Hadoop Operations]

24. HBase Administration, Performance Tuning

25. HBase JVM Tuning

- Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1

- Avoiding Full GCs in HBase with MemStore-Local Allocation Buffers: Part 2

- Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 3

- 12.3.1. The Garbage Collector and Apache HBase
- Chapter 11. Performance Tuning / Garbage Collection Tuning (HBase – The Definitive Guide)

26. How-to: Use HBase Bulk Loading, and Why

27. 9.8. Bulk Loading

28. Apache HBase at Pinterest

29. Performance and Efficiency [Pig]

30. Chapter 8. Making Pig Fly

31. HBASE-4365 – Add a decent heuristic for region size

32. RegionSplitPolicy [API]

- ConstantSizeRegionSplitPolicy [Default split policy before 0.94.0]

- IncreasingToUpperBoundRegionSplitPolicy [Default split policy since 0.94.0]

33. HBase: How to get MTTR below 1 minute

34. Hadoop Hardware @Twitter: Size does matter! [Video] [Slideshow]

35. Hue

36. CDH 5 Release Notes

37. HBase Performance Tuning

- Chapter 12. Apache HBase Performance Tuning (The Apache HBase™ Reference Guide – Latest)

- Chapter 11. Performance Tuning (HBase – The Definitive Guide)

38. 7.1. Map-Task Splitting (The Apache HBase™ Reference Guide – Latest)

39. HBase as MapReduce job data source and sink

Other Resources

- Hadoop: The Definitive Guide, 3rd Edition

- HBase: The Definitive Guide

- Hadoop Operations

- Programming Pig

- HBase in Action

- Hadoop Operations and Cluster Management Cookbook

- Optimizing Hadoop for MapReduce

- HBase Administration Cookbook

- Cloudera Developer Blog · HBase Posts

- HBaseCon Videos

- Cloudera Blog Archives

- From the Dev Team [Hortonworks Blog]

- Hadoop Weekly

- Deploying with JRuby: Deliver Scalable Web Apps using the JVM

- Using JRuby: Bringing Ruby to Java

- TorqueBox 3.0.0 Documentation

- JRuby Wiki

- hbase-jruby: a simple JRuby binding for HBase

- Treebeard

- … and some fun Harlem Shake at SlideShare Delhi

 

Do you enjoy solving complex problems in Big Data Analytics? SlideShare engineers are constantly working on such challenging projects. If work like this excites you, this is the place to be. We are hiring.

 

10 ways we made SlideShare faster

At SlideShare, we love metrics. Be it weekly traffic data, number of uploads, number of PRO sign-ups or page load time, we track each number very closely and strive to improve it every day. For the second quarter of 2013, SlideShare’s CTO Jonathan Boutelle set the target of reducing our page load time to 5 seconds; at the time it was hovering at an average of 7.5 seconds.

This was not an easy task. We were already following the web performance practices listed by YSlow and had made every extra effort to achieve good grades on WebPagetest. When it comes to page load time, every millisecond of reduction counts. Read on to find out how we got close to our target.

1. Load all third party libraries after window.load - To start with, all the JavaScript required for social sharing widgets, e.g. the Facebook Like button, should be loaded asynchronously. As a second step, move the script tags from the HTML page to a JavaScript (JS) file and bind the asynchronous loading of the external JS libraries to the window.load event. For example, to load the FB Like button, place the following code in the JS file included in the head:
$(window).load(function(){
  (function(d){
    var js, id = 'facebook-jssdk',
        ref = d.getElementsByTagName('script')[0];
    if (d.getElementById(id)) { return; }            // SDK already injected
    js = d.createElement('script'); js.id = id; js.async = true;
    js.src = "//connect.facebook.net/en_US/all.js";  // load the FB SDK asynchronously
    ref.parentNode.insertBefore(js, ref);
  }(document));
});

And this is how our New Relic graph looked after we made this change.

ThirdPartWidgets

2. Asynchronously Load jQuery - Another useful trick is to load jQuery asynchronously so that other components on the page are not blocked. We used the jQl plugin, which can be inlined in the HTML page. You don’t have to worry about missing document.ready() calls, since jQl automatically catches all jQuery().ready() calls and executes them as soon as jQuery is loaded and the DOM is ready.

3. Combine and Compress - Combining JS files helps eliminate the extra requests, each of which carries its own DNS lookup and TCP connection overhead. This is very evident from YSlow, but an interesting tweak to this is conditional compiling. For example, let’s say your page requires jquery, file1.js, logged_in.js and logged_out.js.

Conditional compiling should give you combined_logged_in.js [jquery + file1.js + logged_in.js] and combined_logged_out.js [jquery + file1.js + logged_out.js]; a small build sketch follows below.

For compressing JS, Closure Compiler is known to give the best results.
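The conditional build itself can be a small script that writes one combined file per page state and then hands each output to the minifier. Here is the idea as a Ruby sketch (file names mirror the example above; this is not our actual build code):

# Build one combined file per page state ("conditional compiling").
# A real setup would then run a minifier such as Closure Compiler on each output.
BUNDLES = {
  'combined_logged_in.js'  => %w[jquery.js file1.js logged_in.js],
  'combined_logged_out.js' => %w[jquery.js file1.js logged_out.js]
}

BUNDLES.each do |output, sources|
  File.write(output, sources.map { |src| File.read(src) }.join("\n"))
end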

4. Smarter JS packaging - SlideShare has different functionality for logged-in and logged-out users. We load only the base files for logged-out users and additional JS files when the user logs in. Similarly, the HTML5 player loads different combined JS files depending on the content type, say presentations or documents. There is also a possibility that some plugins you added initially are not being used anymore; removing them is the first step towards minimizing the size of assets that block page load.

5. Defer loading of JS - The main component of SlideShare, the HTML5 player, needs to be loaded as fast as possible. For this we depend on the document.ready event. Therefore, the JS required for above-the-fold functionality is minimal and is loaded quite early. The other JS, required for additional user functionality, is loaded after window.load.

6. Lazy Load HTML and JS - This is a commonly used technique to show user-generated content on the page, like comments, only when the user interacts. We removed the inline HTML required for user actions like the embed code and now retrieve it on user click. This reduces the page size that the browser needs to parse and speeds up the load time.

7. Use of Cache Busters - All assets like images, sprites and JavaScript files should have a cache buster appended to the end of the URL, so that the browser can cache the assets and make a new request only when the cache buster has changed.
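The idea fits in a few lines. Here is a hypothetical Ruby helper (not our production code) that derives the cache buster from the file’s contents, so the URL only changes when the asset does:

require 'digest'

# Append a content-based cache buster to an asset URL. When the file changes
# the digest changes, forcing a fresh fetch; otherwise the browser keeps
# serving its long-lived cached copy.
def busted_asset_url(path, root: 'public')
  digest = Digest::MD5.file(File.join(root, path)).hexdigest[0, 8]
  "/#{path}?v=#{digest}"
end

# busted_asset_url('assets/player.js')  # => "/assets/player.js?v=3f2a9c1b" (digest will vary)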

8. Remove HTTPS calls from an HTTP page - This can help you chop off a few milliseconds spent in SSL negotiation.

9. DNS prefetching - SlideShare uses multiple CDNs to fetch various assets. We figured out that we can reduce DNS lookup time by pre-resolving DNS in the browser.

10. Redesigning the HTML5 player - Apart from these, we also made certain improvements to our HTML5 player. The core of the SlideShare HTML5 player is written in JavaScript using jQuery. The HTML5 player is a complex system which supports displaying presentations, documents, slidecasts and now infographics too! To make a dent in the page load time by a good margin, we realized that we needed to redesign the HTML5 player.

  • We started with the idea that the player should be able to display content even if JavaScript is disabled in the browser and JS should only be used to make the player alive.

  • All the markup was generated on the client side using javascript which led to a lot of DOM manipulation. This was resolved by moving markup creation logic to backend.

  • The layout of slides and other styling logic was rewritten so that CSS rules can be used effectively instead of relying on JS

  • To make the user experience delightful, we included the first slide in the markup itself so that it is triggered by the browser parser instead of waiting on JS.

So what was the result of the above experiments? Our New Relic graphs tell the entire story: a whopping drop from 5.7 to 5.03 seconds!

Before

newrelic1

After

last 3 hours

We used WebPagetest extensively to measure the performance of the changes we were making. Here is how we moved the needle for the document complete event.

Before

webpagetest1

After

webpagetest3

None of the above would have been possible without the help of Jean Benois and Jeba Emmanuel.

If you find these tips useful or have more ideas to make SlideShare faster, please give a shout out in the comments below. So, what did you do to make your website faster?

Thanks

Apoorvi Kapoor

Brilliant Hacks at SlideShare Hackday 2013

Folks,

The 2013 SlideShare Hackday was held on 3rd and 4th October. It was 30 hours of pure hackery and fun. Brilliant hacks, great participation and good food made this the best SlideShare hackday ever! We managed to capture some of the moments here for you:

What cooked? Over 28 teams registered and 25 of them successfully completed their hack prototypes. There were many brilliant hacks, such as:

  1. Minority report ARDrone Quadcopter with Leapmotion Control that flies and hounds the developer who breaks the build by Jeba Singh Emmanuel
  2. HTML5 Presentations – Upload native HTML5 presentations to SlideShare (yay!) by Neha, Akshit, Nikhil
  3. Picturesque – Quickly create a SlideShare photodeck from your Facebook photo album by Hitesh, Apoorvi, Himanshu
  4. Pseudo upload – Upload to SlideShare directly from your Gmail attachments by Jasmeet, Nitish, Gaurav
  5. Slideshare for Weddings – Quickly create a visual wedding card and upload to SlideShare by Jai, Shishir, Vishu
  6. #Comment – Capture discussions on a deck from Twitter and bring it to SlideShare as comments on the deck by Pranav, Tushar, Arpit
  7. Let’s create content – Content creation platform on slideshare by Arpit D, Dheeraj, Saptarshi
  8. Speaking Decks – Listen to the slide transcripts, by Troy, Arun, Bubby
  9. Conversion dev setup on vagrant by Chris
  10. SlideCaster – Slidecast and Zipcast at the same time by CaseyA, Omar, Mark
  11. Recommend content based on historic user-slideshow interaction by Kunal Mittal, Karan Jindal, Anupam Jain, Hari
  12. SoS – Chrome Widget for creating Content by Anil, Atul, Prashant
  13. Mobile reader by Andri, Jesse, Ben Silver, Ellis
  14. Slideshow as Video on facebook by Nishant, Jack
  15. SlideShare Collections by Varun , Vivek
  16. Content Organizer by Archana, Deepti, Moumita
  17. Photodeck – turn your album into slidedeck by Amit Banerjee
  18. Improved Transcript by Akash
  19. Infographic résumé by Simla Ceyhan, Sylvain Kalache, Yifu Diao
  20. Share with a click – share open positions on SS careers page to LI/Facebook by Priyanka R
  21. Rietveld on CentOS by Toby
  22. Think Mobile – Preview how your SlideShare deck appears on mobile devices by Ajay
  23. Convert blogs into SS decks by Subhendra Basu & Rashi

Who won? Each hack explored a different idea from a unique perspective. The most well-executed idea won. Here are the results –

Delhi Office:
1st Prize: Slideshare for Weddings by Jai, Shishir, Vishu
2nd Prize. HTML5 Presentation Support by Akshit, Neha, Nikhil

SF Office:
1st Prize: SlideCaster by Casey A, Omar & Mark.
2nd Prize: Infographic résumé by Simla, Sylvain & Yifu

Worthy mentions

  • Let’s create content (Arpit D, Dheeraj, Saptarshi)
  • Think Mobile (Ajay)
  • #Comment (Arpit B, Pranav, Tushar)

Does this sound exciting to you? If yes, then you’re like us. Why not consider applying for a job at SlideShare and become a part of this fun?

Fluentd at SlideShare

SlideShare will be hosting the next Fluentd meetup on July 8th at our San Francisco office.

Fluentd is an open-source program that we have been using for the past year and a half. It helps us with log management, carrying logs from point A to point B in a fast and reliable way.

Among a lot of good things about Fluentd, here are the three we like the most:

  • Everything that goes in and out of Fluentd is JSON.
  • Fluentd is written in about 2,000 lines of open source Ruby code. When you have an issue, you can just read or patch the code.
  • There is already a huge plugin library (about 150 plugins) that allows you to import, filter and export your data into a variety of systems.

If you want to know more about how we are using Fluentd, please have a look at one of our projects in which we are using it, and join us on July 8th!

What’s Cooking in Our Labs: SlideShare HandsFree!

[Reposted from http://blog.slideshare.net/2013/03/19/whats-cooking-in-our-labs-slideshare-handsfree/]

Clicking through presentations can be cumbersome, especially when you’re talking through them with a live audience at hand. You have to find the right key on your keyboard, or move your mouse to the correct button, often halting the flow of your speech. What if you could just flick your finger in the air, indicating movement to the next page?

Our engineers are on it. We figured if you could play motion-sensing tennis on the Nintendo Wii, couldn’t you at least flip through SlideShare presentations with the wave of a hand? Here’s a preview of what we’re working on:

And here’s what the engineer himself, Shirsendu Karmakar, had to say about developing it (yes, he’s pretty cool!):
If you have used or seen Flutter, you wish you could use it on websites too. A few days back, I saw an interesting Chrome Experiment. My initial reaction: SlideShare “Minority Report” style! I started working on something similar for SlideShare. After an hour or so, with JavaScript as my weapon and some simple techniques and approximations, SlideShare presentations were gesture-ready. It took me around 30 lines of code to make SlideShare work via my gestures.
What’s happening behind the scenes:
  • webRTC has made it possible to access the web camera directly from the browser. No Flash required!
  • An image is snapped at regular intervals.
  • HTML5 canvas is used to draw the current image.
  • The movement delta (the difference between the last image and the current image) is calculated.
  • Depending on the value of the delta, we try to detect which movement was made. Currently only four basic movements are supported: left, right, top, bottom.
  • Each direction is mapped to one of the SlideShare player API functions. Whenever a movement is detected successfully, the player executes the mapped action.

What are other features you’d like to see us develop in SlideShare Labs?

Introducing SlideShare API Explorer

We know that at times it becomes difficult to keep up with our API documentation, which results in failed attempts at testing out an endpoint and, in turn, in lost productivity. To solve this, we created the SlideShare API Explorer, which helps you get started with our API and makes it super easy to test out an API endpoint.

Just follow the simple steps below:

1. Apply for an API key (requires a user account). You’ll get the API credentials at your registered email address.
2. Go to apiexplorer.slideshare.net and fill in the credentials you received in the email.

Fill in your API Credentials

3. Now select an endpoint to test, fill in the test parameters and click on ‘try it’. It will query our endpoint and show the generated query string, response headers and response body.
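If you prefer to script your calls, here is a rough Ruby sketch of the kind of query string the Explorer generates for you. It assumes the API’s api_key/ts/hash validation parameters (hash being the SHA1 of your shared secret concatenated with the timestamp) and uses get_slideshow as the example endpoint; refer to the documentation for the authoritative details.

require 'digest/sha1'
require 'net/http'
require 'uri'

api_key = 'YOUR_API_KEY'        # placeholders: use the credentials from your email
secret  = 'YOUR_SHARED_SECRET'
ts      = Time.now.to_i

params = {
  slideshow_id: 1234,           # hypothetical slideshow id
  api_key:      api_key,
  ts:           ts,
  hash:         Digest::SHA1.hexdigest(secret + ts.to_s)
}

uri       = URI('https://www.slideshare.net/api/2/get_slideshow')
uri.query = URI.encode_www_form(params)

puts uri                  # the generated query string
puts Net::HTTP.get(uri)   # response body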

We are constantly working on making the API development workflow a seamless experience for you, and we would love to hear your feedback and suggestions. Feel free to share your comments below.

DevelopHer Hackday Delhi

Announcing DevelopHer Hackday Delhi at SlideShare’s New Delhi office. The first event of its kind, it runs on the same dates as the DevelopHer Hackday in the Bay Area organized by LinkedIn. Come on, women geeks and hackers, this is your stage! Form a team, code all night, create something awesome, and present it to the judges to win prizes.

Dave McClure (Silicon Valley guru, investor and founder of 500 Startups) and Rashmi Sinha will be judging the event. Participants in the winning team get an Apple MacBook Air each. Participants in the team winning the second prize get Apple iPads.

Are you ready? If you have not registered yet, hurry and register at http://hackday.linkedin.com/developher/delhi for the Hackday on Saturday, June 30th and Sunday, July 1st.

DevelopHer is being organised by LinkedIn at its Mountain View office. DevelopHer Hackday Delhi is a parallel event being organised by SlideShare (which is now a part of LinkedIn) at our New Delhi office.

SlideShare is looking for Rockstar Designers in New Delhi

[Reposted from the SlideShare Blog http://blog.slideshare.net/2012/02/17/designer-dream-job-in-delhi/]

Are you passionate about digital design? Do you dream of working at one of the world’s fastest growing startups? If you are an experienced web or interactive designer, come work with SlideShare in our New Delhi office. You’ll be collaborating closely with our software developers, product managers, and analysts to build products that reach millions of users.

Here’s what we’re looking for
- 1 to 5 years of experience in a similar role at a software, Internet or Web design company
- Strong information and interaction design skills
- Proficiency in common design tools like Photoshop, Illustrator, Balsamiq
- Understanding of interface and interaction design principles as they relate to web sites, tablets, and handheld devices
- Proven skills with XHTML (handcoding) and CSS
- Excellent understanding of Web 2.0 design patterns
- Excellent collaboration, communication & writing skills
- Use of quantitative and qualitative feedback to make a design better. At SlideShare we use a variety of methods: A/B testing, viral loop tracking, and user testing with tools like Google Analytics, Kissmetrics, CrazyEgg, Mixpanel and UserTesting
- College degree in Web Design, Graphic Design, Interaction Design, HCI or Software Engineering

Show us what you’ve got
Be prepared to show us examples of:
- Your ability to conceptualize low-fidelity wireframes & mockups (using PowerPoint/Balsamiq/Photoshop), convert these into high-fidelity prototypes and then handcode them into HTML/CSS.
- Your strong visual design abilities: web pages, microsites, marketing collateral, logos, whatever you have designed.
- Wireframes, mockups and final designs that reflect rigorous attention to visual, interaction & usability details.

How to apply
Send your resume, with work samples and/or a link to your online portfolio, to jobs@slideshare.com.