Facebook, You're Doing It Wrong

Early today Facebook was down or unreachable for most of you for approximately 2.5 hours. This is the worst outage we've had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

What's Wrong With Facebook

The key flaw that made this outage so severe was the unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
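
To make the failure mode concrete, here is a minimal Python sketch of a check-and-repair read path. It is an illustration under stated assumptions, not the actual system: `CountingStore`, `is_valid`, `get_config`, the `max_conn` key, and the "positive integer" validity rule are all hypothetical names chosen for the example.

```python
class CountingStore:
    """Stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, data):
        self.data = dict(data)
        self.queries = 0

    def read(self, key):
        self.queries += 1
        return self.data[key]


def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


def get_config(key, cache, store):
    """Return the cached value, repairing it from the store if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        value = store.read(key)  # each repair costs one store query
        cache[key] = value
    return value


# Case 1: transient cache problem. One repair, then the cache serves every read.
store = CountingStore({"max_conn": 10})
cache = {"max_conn": None}
for _ in range(1000):
    get_config("max_conn", cache, store)
good_store_queries = store.queries

# Case 2: the persistent copy itself is invalid. The repair never "sticks",
# so every single read turns into a store query.
bad_store = CountingStore({"max_conn": -1})
bad_cache = {}
for _ in range(1000):
    get_config("max_conn", bad_cache, bad_store)
bad_store_queries = bad_store.queries
```

In this toy setup the first case costs a single store query for 1000 reads, while the second costs 1000: the cache stops shielding the store at all.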

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by many thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
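
The loop can be reproduced with a very coarse simulation. The sketch below is a toy model, not a description of the real infrastructure: it assumes a single shared cache key (`cfg`), clients that all read the cache at the same moment in each time step, and a database that serves at most `capacity` queries per step and returns errors for the rest. All names and numbers are made up for illustration.

```python
def simulate(num_clients, capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick.

    The persistent copy is assumed to be already fixed (valid), so a
    successful query always repopulates the cache with a good value.
    """
    cache = {}  # shared cache, starts empty
    history = []
    for _ in range(ticks):
        if "cfg" in cache:
            history.append(0)  # every client hits the cache
            continue
        queries = num_clients  # every client misses and queries the database
        served = min(queries, capacity)
        failed = queries - served
        if served:
            cache["cfg"] = "valid"  # successful clients repopulate the key
        if failed and delete_on_error:
            cache.pop("cfg", None)  # an error is treated as an invalid value
        history.append(queries)
    return history


# Errors delete the key: the overload sustains itself indefinitely.
looping = simulate(num_clients=1000, capacity=100, ticks=5, delete_on_error=True)

# Same load, but errors leave the cache alone: one spike, then recovery.
settled = simulate(num_clients=1000, capacity=100, ticks=5, delete_on_error=False)

# Traffic cut below database capacity: no errors, so the loop is broken.
drained = simulate(num_clients=50, capacity=100, ticks=5, delete_on_error=True)
```

Under these assumptions `looping` stays at 1000 queries in every tick even though the underlying value is valid, while `settled` and `drained` drop to zero after the first tick. The last case shows why reducing traffic below what the databases can serve is enough to end the cycle.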

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
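
The post does not say which design will replace the current one. Purely as an illustration, one common pattern for damping this kind of loop is to keep serving the last known good value and to rate-limit repair attempts, so that neither an invalid value nor a database error can trigger unbounded queries. The `GuardedConfig` class, its parameters, and the validity rule below are hypothetical.

```python
import time


class GuardedConfig:
    """Config reader that rate-limits repairs and keeps the last good value."""

    def __init__(self, fetch, is_valid, retry_interval=5.0, clock=time.monotonic):
        self._fetch = fetch                    # callable that queries the store
        self._is_valid = is_valid
        self._retry_interval = retry_interval  # minimum seconds between repairs
        self._clock = clock
        self._cached = None
        self._last_good = None
        self._next_attempt = float("-inf")
        self.fetches = 0                       # store queries issued so far

    def set_cached(self, value):
        # Lets the demo simulate the cached copy being overwritten.
        self._cached = value

    def get(self):
        if self._is_valid(self._cached):
            self._last_good = self._cached
            return self._cached
        now = self._clock()
        if now >= self._next_attempt:
            self._next_attempt = now + self._retry_interval
            self.fetches += 1
            try:
                value = self._fetch()
            except Exception:
                value = None  # an error is not a reason to purge anything
            if self._is_valid(value):
                self._cached = self._last_good = value
                return value
        return self._last_good  # fall back instead of retrying immediately


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
store = {"value": 10}
cfg = GuardedConfig(
    fetch=lambda: store["value"],
    is_valid=lambda v: isinstance(v, int) and v > 0,
    retry_interval=5.0,
    clock=lambda: now[0],
)
first = cfg.get()      # one fetch populates the cache
store["value"] = -1    # a bad change lands in the persistent copy
cfg.set_cached(-1)     # and the cached copy is now invalid as well
now[0] = 6.0
results = [cfg.get() for _ in range(1000)]
```

In this sketch the 1000 reads after the bad change cause only one extra store query, and every caller keeps receiving the last good value. The trade-off is staleness: clients may run on an old value until a valid one appears in the store.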

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.