As many of you know, we've been working around the clock to find and fix performance and stability issues on the platform. A big thanks to everyone for your patience and understanding.
I'm happy to say that we've identified the root cause of our stability problems: notifications in the platform. With notifications off for the past 36 hours, we have not seen any of the erratic behavior of previous nights. We have code fixes that we will deploy on Sunday to re-enable notifications for most applications.
So what happened?
The following histogram of HTTP responses tells the tale:

In our analysis, we quickly focused our attention on concurrency problems in some portions of our code. We found that Java itself has some serious contention for global Character Set data and Crypto providers, so we spent a lot of time working around these problems. However, we'd fix one concurrency problem and quickly hit another. This went on for days: we'd think we had the problem solved, then it would come back with a vengeance.
This past Friday, the 25th, we were dealing with the same issues. We noticed that the poor performance coincided with notification database alerts. At 11:00 we turned off notifications, and the entire system was suddenly stable.
With this breathing room, we've had time to finish a project to improve notification scalability. This code will go out on Sunday, at which point most applications will have notifications again.
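For readers who hit the same kind of contention, here is a minimal sketch of one common way to sidestep repeated character set and crypto provider lookups in Java. It is not the code we deployed. The class name `ContentionWorkarounds`, the choice of UTF-8 and SHA-256, and the use of modern Java APIs are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: resolves shared JVM lookups once instead of on every request.
final class ContentionWorkarounds {
    // Hold a Charset instance so encoding never goes through a by-name lookup.
    private static final Charset UTF8 = StandardCharsets.UTF_8;

    // One MessageDigest per thread: the provider lookup inside getInstance
    // happens once per thread rather than once per call.
    private static final ThreadLocal<MessageDigest> SHA256 =
        ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance("SHA-256");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        });

    static byte[] encode(String s) {
        // getBytes(Charset) avoids the name-based overload getBytes(String).
        return s.getBytes(UTF8);
    }

    static byte[] sha256(String s) {
        MessageDigest md = SHA256.get();
        md.reset(); // clear any state left by an earlier, interrupted use
        return md.digest(encode(s));
    }
}
```

The idea is simply to pay for each global lookup once and reuse the result, so request threads stop queuing on the same lock.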
Why Did This Happen?
We underestimated the number of notifications sent and how popular they would be on the site. At first glance, this just meant that posting and browsing notifications was slow. We didn't expect that other requests would suffer collateral damage.
Changes Going Forward...
Today we made the following Notifications changes. Our goal is to get Notifications on for all applications while maintaining overall site stability:
- Accepting Notifications Asynchronously
- Applying Notification Retention policy to remove the oldest Notifications from the system (14 day retention)
- Adding Extra Notification Capacity
- Only allowing Notification REST calls with a token generated in the preceding 4 hours
- Additional privacy controls to ensure that the notification is legitimate
- Conversations with our partners that are using the Notifications Feature the most
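
To make the asynchronous acceptance, 14 day retention, and 4 hour token rules above concrete, here is a rough sketch of how such checks might look. It is an illustration under stated assumptions, not our production code: the class `NotificationPolicy`, its method names, the bounded in-memory queue, and the queue capacity are all hypothetical. Only the 4 hour and 14 day limits come from the list above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the notification rules; names are illustrative only.
final class NotificationPolicy {
    static final Duration MAX_TOKEN_AGE = Duration.ofHours(4); // from the list above
    static final Duration RETENTION = Duration.ofDays(14);     // from the list above

    // Bounded queue (assumed design): callers never wait on the notification store.
    private final BlockingQueue<String> pending;

    NotificationPolicy(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // A token is fresh if it was issued no later than now and at most 4 hours ago.
    static boolean tokenIsFresh(Instant issuedAt, Instant now) {
        return !issuedAt.isAfter(now)
            && Duration.between(issuedAt, now).compareTo(MAX_TOKEN_AGE) <= 0;
    }

    // A notification is eligible for removal once it is older than 14 days.
    static boolean isExpired(Instant createdAt, Instant now) {
        return Duration.between(createdAt, now).compareTo(RETENTION) > 0;
    }

    // Accept asynchronously: validate, enqueue, and return immediately.
    // Returns false if the token is stale or the queue is full.
    boolean accept(String payload, Instant tokenIssuedAt, Instant now) {
        if (!tokenIsFresh(tokenIssuedAt, now)) {
            return false;
        }
        return pending.offer(payload); // non-blocking; a separate worker would drain the queue
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because `accept` uses a non-blocking `offer`, a slow or overloaded notification store causes notifications to be refused instead of tying up the threads that serve other requests.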
Both Operations and Engineering have been happy to get a few good nights' sleep since Friday. We look forward to working on new features instead of fighting fires.
One final note on platform status -- we know that stats were unavailable in the dev console this weekend and expect to have them back online tomorrow.
I'm getting this 95% of the time with our app. We're not using any notifications, but does this still affect us?