What's Wrong with Facebook

Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we've had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.


The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
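
As a rough illustration of that check-and-replace behavior (a minimal sketch, not the actual system; `CountingStore`, `get_config`, `is_valid`, and the `max_conn` key are all hypothetical names), the following shows why the approach costs one query when only the cache is bad, but one query per request when the persistent value is the bad one:

```python
from typing import Callable, Dict, Tuple


class CountingStore:
    """Hypothetical stand-in for the persistent store; counts queries."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = data
        self.queries = 0

    def read(self, key: str) -> str:
        self.queries += 1
        return self.data[key]


def get_config(
    key: str,
    cache: Dict[str, str],
    store: CountingStore,
    is_valid: Callable[[str], bool],
) -> str:
    """Return a config value, refreshing the cache if the entry looks invalid."""
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value
    # Missing or invalid entry: drop it and reload from the persistent store.
    cache.pop(key, None)
    fresh = store.read(key)
    cache[key] = fresh
    return fresh


def demo() -> Tuple[int, int]:
    is_valid = lambda v: v.isdigit()  # assumed validity rule for this sketch

    # Case 1: transient cache corruption, persistent value is good.
    store = CountingStore({"max_conn": "100"})
    cache = {"max_conn": "garbage"}
    for _ in range(5):
        get_config("max_conn", cache, store, is_valid)
    transient_queries = store.queries

    # Case 2: the persistent value itself is invalid, so no refresh ever helps.
    store = CountingStore({"max_conn": "oops"})
    cache = {}
    for _ in range(5):
        get_config("max_conn", cache, store, is_valid)
    persistent_queries = store.queries

    return transient_queries, persistent_queries
```

In the first case the cache heals after a single query; in the second every request falls through to the store.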

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second.

To make matters worse, every time a client got an error while attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
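
The difference between treating a query error as "invalid value" and treating it as "unknown" can be sketched as follows. This is an illustrative assumption about the shape of the logic, not the real client code; `StoreUnavailable`, `repair_naive`, and `repair_safer` are hypothetical names:

```python
from typing import Callable, Dict, Optional


class StoreUnavailable(Exception):
    """Hypothetical error raised when the database cannot serve a query."""


def repair_naive(
    key: str, cache: Dict[str, str], query: Callable[[str], str]
) -> Optional[str]:
    """Treats a failed query like an invalid value and deletes the cache key."""
    try:
        value = query(key)
    except StoreUnavailable:
        cache.pop(key, None)  # the next request for this key must query again
        return None
    cache[key] = value
    return value


def repair_safer(
    key: str, cache: Dict[str, str], query: Callable[[str], str]
) -> Optional[str]:
    """Keeps the existing cache entry when the store cannot answer."""
    try:
        value = query(key)
    except StoreUnavailable:
        return cache.get(key)  # serve the stale entry, add no extra load
    cache[key] = value
    return value


def overloaded_query(key: str) -> str:
    """Simulates a database that is too busy to respond."""
    raise StoreUnavailable(key)
```

With `overloaded_query`, `repair_naive` empties the cache, so each failure guarantees another query, while `repair_safer` leaves the cached entry in place.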

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
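
How traffic was actually readmitted is not described here. One common way to let people back in gradually is a percentage-based gate, sketched below under that assumption; `admitted` and the bucket scheme are hypothetical:

```python
import hashlib


def admitted(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the currently admitted percentage.

    Each user hashes to a stable bucket in [0, 100), so raising `percent`
    only ever adds users and never removes ones already admitted.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

An operator would raise `percent` in steps (for example 1, 5, 25, 100) while watching database load, and lower it again if the cluster starts to struggle.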

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
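
The specific design patterns are not named here. One widely used pattern for breaking this kind of feedback loop is a circuit breaker, which stops sending queries to a struggling backend for a cooldown period. The sketch below is a generic illustration under that assumption, not a description of any Facebook system; the class and parameter names are hypothetical:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing backend until a cooldown has elapsed."""

    def __init__(
        self,
        max_failures: int = 3,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock  # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be sent to the backend right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let trial requests through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """A request succeeded: close the breaker and reset the count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """A request failed: open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A client would call `allow()` before each query, skip the query and keep its cached value when it returns False, and report the outcome with `record_success()` or `record_failure()`. Failures then reduce load on the backend instead of amplifying it.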

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.