Platform Status -- Performance, Performance, Performance

Hi folks,

As many of you know, we've been working around the clock to find and fix performance and stability issues on the platform.  A big thanks to everyone for your patience and understanding.

I'm happy to say that we've identified the root cause of our stability problems: notifications in the platform.  With notifications off for the past 36 hours, we have not seen any of the erratic behavior of previous nights.  We have code fixes that we will deploy on Sunday to re-enable notifications for most applications.

So what happened?

The following histogram of HTTP responses tells the tale:

[Figure: response_hits_perc.dyn.png]
Red is 500 errors, light blue is timeouts, and dark blue is good responses.

Starting April 15th, we began noticing poor performance overnight.  You've seen it: timeouts, 500 errors, gateway timeouts, etc.  We started our normal site stability processes to target the problem: stack traces were generated, memory dumps analyzed, and so on.  This technique had served us well in diagnosing performance problems generating FOAF data for users with thousands of friends.  At the same time, we added 50% more servers to the pool.

In our analysis we quickly focused our attention on concurrency problems in some portions of our code.  We found that Java itself has some serious contention for global Character Set data and Crypto providers, so we spent a lot of time working around these problems.  However, we'd fix one concurrency problem and quickly hit another.  This went on for days: we'd think we had the problem solved, then it would come back with a vengeance.

This past Friday, the 25th, we were dealing with the same issues.  We noticed that the poor performance coincided with notification database alerts.  At 11:00 we turned off notifications, and the entire system was suddenly stable.

With this breathing room, we've now had time to finish a project to improve notification scalability.  This code will go out on Sunday, at which point most applications will have notifications again.


Why Did This Happen?

We underestimated the number of notifications sent and the popularity of their use on the site.  At first glance, this just meant that posting and browsing notifications was slow.  We didn't expect that other requests would suffer collateral damage.

Changes Going Forward...

Today we made the following Notifications changes.  Our goal is to get Notifications on for all applications while maintaining overall site stability:

  • Accepting Notifications asynchronously
  • Applying a Notification retention policy to remove the oldest Notifications from the system (14-day retention)
  • Adding extra Notification capacity
  • Only allowing Notification REST calls with a token generated in the preceding 4 hours
  • Adding privacy controls to ensure that the notification is legitimate
  • Holding conversations with the partners that use the Notifications feature the most
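
For readers who want a concrete picture of the first, second, and fourth changes above, here is a minimal Python sketch. It is not our production code: the class and function names, the in-memory list standing in for the notification database, and the queue size are all assumptions made for illustration. Only the 14-day retention window and the 4-hour token window come from the list above.

```python
import queue
import threading
from datetime import timedelta

RETENTION = timedelta(days=14)      # retention window from the list above
TOKEN_MAX_AGE = timedelta(hours=4)  # token window from the list above


def token_is_fresh(issued_at, now):
    """Accept only tokens issued within the preceding 4 hours."""
    return timedelta(0) <= now - issued_at <= TOKEN_MAX_AGE


def prune_expired(notifications, now):
    """Keep only notifications that are still inside the retention window."""
    return [n for n in notifications if now - n["created_at"] <= RETENTION]


class AsyncNotificationInbox:
    """Hypothetical front end: enqueue on the request path, store on a worker."""

    def __init__(self, max_pending=1000):
        # Bounded queue, so a notification backlog cannot grow without limit.
        self._pending = queue.Queue(maxsize=max_pending)
        self.stored = []  # stand-in for the notification database
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def accept(self, notification):
        """Return immediately; False means the queue is full."""
        try:
            self._pending.put_nowait(notification)
            return True
        except queue.Full:
            return False

    def _drain(self):
        # Runs on the worker thread, off the request path.
        while True:
            item = self._pending.get()
            self.stored.append(item)  # stand-in for the slow database write
            self._pending.task_done()

    def wait_until_idle(self):
        """Block until every accepted notification has been stored."""
        self._pending.join()
```

The point of the bounded queue is that a slow notification store only delays or rejects notifications, rather than tying up the threads that serve every other request.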

Both Operations and Engineering have been happy to get a few good nights' sleep since Friday.  We look forward to working on new features instead of fighting fires.

One final note on platform status: we know that stats were unavailable in the dev console this weekend, and we expect to have them back online tomorrow.

1 Comment

I'm getting this 95% of the time with our app. We're not using any notifications, but does this still affect us?
