Platform Status -- Performance, Performance, Performance
Hi folks,
As many of you know, we've been working around the clock to target and fix performance and stability issues on the platform. A big thanks to everyone for your patience and understanding.
I'm happy to say that we've identified the root cause of our stability problems: notifications in the platform. With notifications off for the past 36 hours, we have not seen any of the erratic behavior of previous nights. We have code fixes that we will deploy on Sunday to re-enable notifications for most applications.
So what happened?
The following histogram of HTTP responses tells the tale:

On April 15th we started noticing poor performance overnight. You've seen it: timeouts, 500 errors, gateway timeouts, etc. We started our normal site stability processes to target the problem: stack traces were generated, memory dumps were analyzed, and so on. This technique had served us well in diagnosing performance problems when generating FOAF data for users with thousands of friends. At the same time, we added 50% more servers to the pool.
In the histogram above, red is 500 errors, light blue is timeouts, and dark blue is good responses.
In our analysis we quickly focused our attention on concurrency problems in some portions of our code. We found that Java itself has some serious contention for global Character Set data and Crypto providers, so we spent a lot of time working around these problems. However, we'd fix one concurrency problem and quickly hit another one. This went on for days: we'd think we had the problem solved, then it would come back with a vengeance.
This past Friday, the 25th, we were dealing with the same issues. We noticed that the poor performance coincided with notification database alerts. At 11:00 we turned off notifications and the entire system was suddenly stable.
With this breathing room we've now had time to finish a project to improve notifications scalability. This code will go out on Sunday, restoring notifications functionality for most applications.
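For readers wondering what a Character Set workaround can look like, here is a minimal sketch. It is not the fix we shipped; the class and method names are made up for illustration. It assumes the contention comes from resolving a charset by name on every call, which on older JVMs went through a shared, synchronized lookup. Holding a single Charset instance avoids repeating that lookup.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not code from the hi5 platform. */
class CharsetLookup {
    // Resolved once; every later call reuses this instance.
    private static final Charset UTF8 = StandardCharsets.UTF_8;

    /** Resolves the charset by name on each call (the pattern assumed to contend). */
    static byte[] encodeByName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Encodes with the cached Charset, skipping the by-name lookup. */
    static byte[] encodeCached(String s) {
        return s.getBytes(UTF8);
    }
}
```

Both methods produce the same bytes; the only difference is whether the charset is looked up by name each time.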
Why Did This Happen?
We underestimated the number of notifications sent and how popular they would be on the site. At first glance this just meant that posting and browsing notifications were slow. We didn't expect that other requests would suffer collateral damage.
Changes Going Forward...
Today we made the following Notifications changes. Our goal is to get Notifications on for all applications while maintaining overall site stability:
- Accepting Notifications Asynchronously
- Applying Notification Retention policy to remove the oldest Notifications from the system (14 day retention)
- Adding Extra Notification Capacity
- Only allowing Notification REST calls with a token generated in the preceding 4 hours.
- Additional privacy controls to ensure that the notification is legitimate.
- Conversations with our partners who are using the Notifications feature the most.
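To make a few of the changes above concrete, here is a rough sketch of asynchronous acceptance together with the 4 hour token window and the 14 day retention check. It is not our production code. The class name, method names, and the bounded in-memory queue are assumptions chosen for illustration; only the two time limits come from the list above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch; names and structure are illustrative assumptions. */
class NotificationIntake {
    // Limits taken from the list above.
    static final Duration MAX_TOKEN_AGE = Duration.ofHours(4);
    static final Duration RETENTION = Duration.ofDays(14);

    // Bounded queue: when full, new notifications are rejected rather than
    // blocking the request thread.
    private final BlockingQueue<String> pending;

    NotificationIntake(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** True if the token was issued at most four hours before now. */
    static boolean isTokenFresh(Instant issuedAt, Instant now) {
        return !issuedAt.isAfter(now)
                && Duration.between(issuedAt, now).compareTo(MAX_TOKEN_AGE) <= 0;
    }

    /** True if a stored notification is older than the retention window. */
    static boolean isExpired(Instant createdAt, Instant now) {
        return Duration.between(createdAt, now).compareTo(RETENTION) > 0;
    }

    /** Runs on the request thread: validate, enqueue, and return immediately. */
    boolean accept(String notification, Instant tokenIssuedAt, Instant now) {
        if (!isTokenFresh(tokenIssuedAt, now)) {
            return false;
        }
        return pending.offer(notification); // offer() never blocks
    }

    /** Runs on a background worker: take up to maxBatch queued items for storage. */
    List<String> drain(int maxBatch) {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch, maxBatch);
        return batch;
    }
}
```

The point of the split is that accept() does no database work, so a slow notification store cannot tie up the threads that serve other requests.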
Both Operations and Engineering have been happy to get a few good nights' sleep since Friday. We look forward to working on new features instead of fighting fires.
One final note on platform status -- we know that stats were unavailable in the dev console this weekend and expect to have them back online tomorrow.
1 Comment
I'm getting this 95% of the time with our app. We're not using any notifications, but does this still affect us?
© 2008 hi5Networks